L2/03-310

Proposal for making Property(Value)Aliases.txt actually machine readable

Asmus Freytag
2003-09-09

Overview
--------

This document reviews PropertyAliases.txt and
PropertyValueAliases.txt from the perspective of
their usefulness when updating to a new version
of the UCD, followed by a five-step proposal to set
things right.

The problem
-----------

I've come across what I consider a pretty serious
deficiency in the way we've designed PropertyAliases.txt
and PropertyValueAliases.txt.

The primary purpose of these files is to provide a
list of standardized labels for citing properties
in various contexts, be it regular expressions,
user interfaces, or documentation.

The PropertyAliases.txt file gives entries like these:

...
jt        ; Joining_Type
lb        ; Line_Break
NFC_QC    ; NFC_Quick_Check

...

with a short name on the left and a long name on the right.

The PropertyValueAliases.txt file gives long and short
names for each of the possible values of a property,
indexed by the short name of the property in the first
column:

...

jt ; C         ; Join_Causing
jt ; D         ; Dual_Joining
jt ; L         ; Left_Joining
jt ; R         ; Right_Joining
jt ; T         ; Transparent
jt ; U         ; Non_Joining

lb ; AI        ; Ambiguous
lb ; AL        ; Alphabetic
lb ; B2        ; Break_Both
lb ; BA        ; Break_After

...
qc ; M         ; Maybe
qc ; N         ; No
qc ; Y         ; Yes

...


The data is presented in a format that on the surface
appears machine readable, but which is organized in a
way that makes it pretty useless for automatic updates
when the UCD gets rev'd.
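
To illustrate the "appears machine readable" part, here is a
minimal sketch in Python of reading the semicolon-delimited
lines quoted above. The function name parse_alias_lines is a
hypothetical helper, not part of any UCD tooling, and the sketch
assumes that '#' starts a comment.

```python
def parse_alias_lines(text):
    """Split semicolon-delimited alias lines into lists of trimmed fields."""
    records = []
    for raw in text.splitlines():
        # Assumption: everything after '#' is a comment and carries no data.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        records.append([field.strip() for field in line.split(";")])
    return records


# Lines copied from the PropertyValueAliases.txt excerpt above.
sample = """\
jt ; C         ; Join_Causing
jt ; D         ; Dual_Joining
lb ; BA        ; Break_After
qc ; M         ; Maybe
"""

records = parse_alias_lines(sample)
```

Splitting the lines is the easy part. Nothing in the resulting
records says where the values for 'jt' or 'lb' live in the data
files, or what 'qc' refers to.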

What's needed is to be able to link the information on
location of the data (file and field), with the set of
legal values and the name of the property.

For example, if I want to let a user select all characters
that have linebreak property BA (Break_After), I need the
following pieces:

1) list of property names
2) link to a short name for each
3) link to a location for each (data file and field)
4) list of possible values for each property
5) link to the name used in the file for each value

With a list of all property names, I can populate a
listbox in a dialog to let the user select the property.

With the short name, I can access the list of values
in the value aliases (or alternatively expose that to
the user in regular expression syntax, etc.)

With a pointer to a location I can parse the data
file, once the user has selected the name for the
actual value from a list of possible values.
[or use it as an index into a pre-parsed table]

While the list of values could expose nice names,
the program needs to have a link to the actual
expression of that value in the data file.

This example is based on the kind of property lookup
as done in unibook (http://www.unicode.org/unibook/)
but it works just the same for maintaining a regular
expression parser or any other form of user interaction
or API call that's based on the long or short names of
properties and property values.
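
The chain of five pieces can be sketched as follows. This is an
illustration only: the three tables are filled in by hand, the
function name select_code_points is hypothetical, and the
location entry (file name and field number) is exactly the piece
that the alias files do not provide today. The sample data lines
are merely in the style of LineBreak.txt.

```python
# Pieces 1 and 2: long property names linked to short names.
PROPERTY_ALIASES = {"Joining_Type": "jt", "Line_Break": "lb"}

# Pieces 4 and 5: long value names linked to the label used in the data file.
VALUE_ALIASES = {"lb": {"Alphabetic": "AL", "Break_After": "BA"}}

# Piece 3: location of the data (file, field). Hand-maintained assumption.
LOCATIONS = {"lb": ("LineBreak.txt", 1)}


def select_code_points(long_prop, long_value, data_lines):
    """Return the code point field of every line carrying the chosen value."""
    short_prop = PROPERTY_ALIASES[long_prop]
    file_value = VALUE_ALIASES[short_prop][long_value]
    _file_name, field = LOCATIONS[short_prop]  # the caller supplies the lines
    hits = []
    for raw in data_lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields[field] == file_value:
            hits.append(fields[0])
    return hits


# Illustrative lines in the style of LineBreak.txt.
sample_data = [
    "0020;SP # SPACE",
    "002D;HY # HYPHEN-MINUS",
    "2010;BA # HYPHEN",
    "2013;BA # EN DASH",
]

selected = select_code_points("Line_Break", "Break_After", sample_data)
```

Every lookup in the function depends on one of the three tables.
If any of them has to be typed in by hand, the whole chain has to
be revisited with each new version of the UCD.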

In all cases like this, it's possible to assemble
the required information from the data in the database
in a semi-manual way. However, with each addition
to the UCD, the result gets out of date and must be
manually updated. Why is the organization of the UCD
not itself machine readable?

Here are the specific problems that I encounter:

1) The description of which data values are where is
not machine readable at all - it's a table in UCD.html.
The table is a valuable resource for people trying
to read the UCD manually, but it could be exposed in
machine readable form.

2) The PropertyAliases.txt file is not machine readable.
It relies on comments (which a parser would ignore) to
organize the properties into different types (which are
then treated differently, although not consistently so,
in PropertyValueAliases.txt).

3) PropertyValueAliases.txt contains many exceptions that
make it hard to work with for automated purposes:

a) it uses a 'pseudo-property' called 'qc' for all
quick check properties. This pseudo property is not
defined anywhere and the only explanation is in the
comment. (My parser promptly barfed on it as an
'illegal property short-name').

b) boolean properties are not listed, nor are properties
such as 'character name' or 'lower case mappings' that
have open-ended values.

c) there is an inconsistent use of 'n/a' in the file.
Most often, n/a means that there is no 'short-name'
defined for a value. Less often it means that the
value itself is an n/a value. However, for some properties
that is indicated with a value alias of "none".

d) the canonical combining class property has an extra
field, making the number of fields variable (it could
have been made empty in the rest of the file)

4) Some data files contain properties for which no alias
is listed. Other data files use an alias for a property
that does not match any of the aliases given in the
aliases file.

5) The comments suggest that in the future, additional
aliases can be introduced for a property or a property
value. For property values that have no short name,
it's impossible to add an alias:

Example:
old:
blk ; n/a ; Arabic
blk ; n/a ; Greek
blk ; n/a ; Latin


new:
blk ; n/a ; Alias
blk ; n/a ; Arabic
blk ; n/a ; Greek


In this example it's not possible to know whether
'Alias' is a new value or an alias of another one.
However, 'inventing' up to 100 new, fake short block
names to add an occasional alias is not helpful
either.

Where short names do exist, an alias could be defined,
but then only for *either* the short or the long
names, not for both:

lb ; AL        ; Alphabetic
lb ; B2        ; Break_Both
lb ; B2        ; Break_Before_And_After
lb ; BA        ; Break_After

but:

lb ; AL        ; Alphabetic
lb ; BB        ; Break_Both
lb ; B2        ; Break_Both
lb ; B2        ; Break_Before_And_After
lb ; BA        ; Break_After

would be much harder to deal with, since
one would need to thread all aliases together.

Furthermore, if one of the aliases is preferred,
there's no clean way, other than order, to indicate
that, and since the files are alphabetized, order
is not available either.
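
Several of these problems can be detected mechanically. The
following sketch assumes the records have already been split
into fields; the function name audit and the set of known short
names are hypothetical, and the four-field ccc line is only
meant to illustrate problem 3d.

```python
from collections import defaultdict


def audit(records, known_props):
    """Report unknown short names, odd field counts, and threaded aliases."""
    # Problem 3a: short names that are not defined as properties.
    unknown = sorted({r[0] for r in records if r[0] not in known_props})
    # Problem 3d: lines that do not have exactly three fields.
    odd_width = [r for r in records if len(r) != 3]
    # Problem 5: long names can only be threaded together through a shared
    # short name; entries that all say "n/a" collapse into one bucket.
    threaded = defaultdict(list)
    for r in records:
        if len(r) == 3:
            threaded[(r[0], r[1])].append(r[2])
    return unknown, odd_width, dict(threaded)


records = [
    ["lb", "B2", "Break_Both"],
    ["lb", "B2", "Break_Before_And_After"],
    ["blk", "n/a", "Alias"],
    ["blk", "n/a", "Arabic"],
    ["qc", "M", "Maybe"],
    ["ccc", "0", "NR", "Not_Reordered"],
]

unknown, odd_width, threaded = audit(records, {"lb", "blk", "ccc", "jt"})
```

The two long names under ('lb', 'B2') can be recognized as
aliases of each other. The two under ('blk', 'n/a') cannot:
the grouping looks the same whether 'Alias' is a new block or
another name for an existing one.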

The Proposal
------------

I propose the following steps to turn the collection
of aliases into a parsable index of property names
and value names:

1) make a machine readable copy of the table in
UCD.html, linking the short property name to the
name of the data file and a field number (or tag
value in case of Unihan).

Once the rest of the information is in place, it
should be possible to use a script to check this
new locator table by scanning the data files for
expected property values.

2) use a machine readable non-comment convention
to organize PropertyAliases.txt to indicate which
properties have value aliases, which do not (by their
nature) and which have predictable (true/false)
values that need no alias.

The easiest would be a third column that indicates
the type of property. This information is useful
in other contexts.

3) In the PropertyValueAliases.txt file,
remove the pseudo property, put the combining
class into a field that's consistently present
across the file; add an index field for aliases.
(the ccc and index field could reside in a
single column of numerical values). Use the
value n/a consistently only for cases where
there is no defined alias (short or long) for
a property value, and use the word 'none' if
needed both for long and short alias for
the 'does not have this property'-value.

4) Make sure that all value and property
labels from all the data files are actually
represented in PropertyAliases.txt and
PropertyValueAliases.txt.

5) Where multiple aliases exist, add a
way to indicate the alias that is actually
used in the current version of the data file.
Other than this indication, which is needed
to link to the information in item 1,
all aliases can be defined as equivalent.
In other words, I recommend against identifying
a 'preferred' alias, since I suspect that
changing that status could create problems
for implementations.

Given the enormous difficulties in actually
using these files as machine readable data
as they are exposed today, I have little
hesitation in suggesting a reorganization of
the data files.
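
As a rough illustration of steps 1 and 2, the sketch below reads
a three-column alias file and a separate locator table and joins
them into a single index. Both layouts are hypothetical: the
type labels (enumerated, binary, open), the locator format, and
the function names are assumptions made for this example, not a
proposed syntax, and the file names and field numbers are
illustrative.

```python
# Hypothetical layout for step 2: short name ; long name ; type
ALIASES = """\
jt     ; Joining_Type    ; enumerated
lb     ; Line_Break      ; enumerated
Alpha  ; Alphabetic      ; binary
na     ; Name            ; open
"""

# Hypothetical layout for step 1: short name ; data file ; field number
LOCATOR = """\
jt ; ArabicShaping.txt ; 2
lb ; LineBreak.txt     ; 1
"""


def parse(text):
    """Split non-empty lines into lists of trimmed fields."""
    return [[f.strip() for f in line.split(";")]
            for line in text.splitlines() if line.strip()]


def build_index(alias_text, locator_text):
    """Join property aliases with their locations, keyed by short name."""
    locations = {short: (name, int(field))
                 for short, name, field in parse(locator_text)}
    index = {}
    for short, long_name, kind in parse(alias_text):
        index[short] = {
            "long": long_name,
            "type": kind,
            # Only enumerated properties need entries in the value alias file.
            "needs_value_aliases": kind == "enumerated",
            # None marks a property the locator table does not cover yet.
            "location": locations.get(short),
        }
    return index


index = build_index(ALIASES, LOCATOR)
```

With an index like this, a script could flag every property
whose location is missing, and could check the locator table
itself by scanning the named field for the expected values.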


The Conclusion
--------------

With these five steps, the relevant information
in the two files and in UCD.html would become
usable in many contexts, allowing implementors
to automate key steps in their migration to
a new revision of the database, or to create
utilities that are self-updating.

The fact that some property data files contain
information that is complex and needs special
support, such as the data in SpecialCasing.txt,
does not invalidate this proposal. The vast
majority of individual properties are either
boolean or simple enumerations. These are the
types of properties that are most likely to
be added between versions or that see their
range of values change.

The Credits
-----------

Ken Whistler found the fatal flaw in the current
non-scheme to do multiple aliases.