Parsing the UCD

L2/11-358

Re:	Parsing the UCD
To:	UTC
From:	Mark Davis
Date:	2011-10-05
Live doc:	http://goo.gl/9MjQJ

I wrote my original UCD parser some 15 years ago, and with various additions over the years, it was getting pretty crufty and hard to maintain. I decided to see what it would be like to rewrite it…

Alas, it is not that easy to parse the UCD.

While doing this, I found some small problems in properties that I’d recommend we fix now, and some longer-term ideas for adding information that would make the UCD easier to parse. The latter are not fully fleshed out, so I’m looking for feedback at this point. I’m filing the shorter-term issues as separate feedback, so here are the longer-term issues (for post 6.1):

A. Full Machine Readable Parsing

Here are the thoughts I had about cleaning up the data to support fully machine-readable parsing. The following references some rough-draft data files, at the following locations. (The data, while in rough format, was enough to read and validate the UCD.)

Topics:

We need @missing values for every property so that we know what the values are if they do not occur. Some are missing; if there is no other file to put them in, they should be in PropertyValueAliases.

Note that it would be easier to parse if all were uniformly in PropertyValueAliases, instead of being split into different files.
For an example of how this might look, see ExtraPropertyValueAliases.txt
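As a rough illustration of why uniform @missing lines help, here is a minimal Python sketch of a reader for them. It assumes the comment syntax that existing UCD files already use: a three-field form such as `# @missing: 0000..10FFFF; Age; Unassigned` in PropertyValueAliases.txt, and a two-field form without the property name in single-property files. The names `MissingDefault` and `parse_missing` are illustrative only.

```python
import re
from typing import NamedTuple, Optional


class MissingDefault(NamedTuple):
    start: int            # first code point of the range
    end: int              # last code point of the range (inclusive)
    prop: Optional[str]   # None when the file itself implies the property
    value: str            # default value for code points not listed


# Matches "# @missing: XXXX..YYYY; <rest>" and captures the range and the rest.
_MISSING = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6})\s*;(.*)$"
)


def parse_missing(line: str) -> Optional[MissingDefault]:
    """Return the default described by an @missing line, or None otherwise."""
    m = _MISSING.match(line.strip())
    if m is None:
        return None
    start, end = int(m.group(1), 16), int(m.group(2), 16)
    fields = [f.strip() for f in m.group(3).split(";")]
    if len(fields) == 1:        # "range; value"
        return MissingDefault(start, end, None, fields[0])
    if len(fields) == 2:        # "range; property; value"
        return MissingDefault(start, end, fields[0], fields[1])
    return None                 # unexpected shape: leave it to the caller
```

With every default in one file and one form, a parser would only need the three-field branch.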

We don’t clearly distinguish when a missing field value means “no value”, and when it means “empty string” and when it means the “missing value”. For example, in NFKC_CF, an empty value means the empty string, but in the Emoji file or some UnicodeData fields it means “no value” or the “default value”. We should add data to clarify that.

For an example of how this might look, see ExtraPropertyValueAliases.txt
(look for @empty)
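To make the three readings concrete, here is a small Python sketch of how a parser might resolve an empty field once such data exists. The table `EMPTY_POLICY`, the enum `EmptyMeans`, and the function `resolve_field` are hypothetical, and the two entries are illustrative assumptions rather than the contents of any draft file.

```python
from enum import Enum
from typing import Optional


class EmptyMeans(Enum):
    EMPTY_STRING = "empty string"   # the value really is ""
    NO_VALUE = "no value"           # the property has no value here
    DEFAULT = "default value"       # fall back to the property's default


# Hypothetical per-property policy; a machine-readable file would supply this.
EMPTY_POLICY = {
    "NFKC_CF": EmptyMeans.EMPTY_STRING,
    "Simple_Uppercase_Mapping": EmptyMeans.DEFAULT,
}


def resolve_field(prop: str, raw: str, code_point: int) -> Optional[str]:
    """Interpret a raw field, applying the empty-field policy for `prop`."""
    if raw != "":
        return raw
    policy = EMPTY_POLICY.get(prop, EmptyMeans.NO_VALUE)
    if policy is EmptyMeans.EMPTY_STRING:
        return ""
    if policy is EmptyMeans.DEFAULT:
        # Assumption for this sketch: the default is the code point itself,
        # as it is for the simple case mappings.
        return f"{code_point:04X}"
    return None
```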

We have regex patterns, but there are problems.

The ones listed in UAX #44 are out-of-date (listed in another document)
None of them are in machine-readable files.
For an example of how this might look, see UcdPropertyRegex.txt

We have quite a number of multivalued properties. We should supply a machine-readable data file that indicates which properties are:

single-valued
multi-valued
extensible: “currently” single-valued, but could potentially change (many Unihan, acc. to John/Richard).
ordered values, where order is significant

That is, in some cases a multivalued property has a set value (the values could be in any order), and in others a list value (where the order is significant, as in kMandarin).

For an example of how this might look, see UcdPropertyRegex.txt
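The set/list distinction matters to a parser because it decides which container to build and whether two values compare equal. A minimal Python sketch follows, assuming space-separated values and a hypothetical classification table `KINDS` (the "extensible" category is left out for brevity):

```python
from enum import Enum


class ValueKind(Enum):
    SINGLE = "single-valued"
    SET = "multi-valued, order not significant"
    LIST = "multi-valued, order significant"


# Hypothetical classification; the proposed data file would supply this.
KINDS = {
    "General_Category": ValueKind.SINGLE,
    "Script_Extensions": ValueKind.SET,
    "kMandarin": ValueKind.LIST,
}


def parse_value(prop: str, raw: str):
    """Return a str, frozenset, or tuple depending on the property's kind."""
    kind = KINDS[prop]
    if kind is ValueKind.SINGLE:
        return raw.strip()
    parts = raw.split()          # assumption: values are space-separated
    if kind is ValueKind.SET:
        return frozenset(parts)  # order is discarded
    return tuple(parts)          # order is preserved
```

With this, `parse_value("Script_Extensions", "Beng Deva")` equals `parse_value("Script_Extensions", "Deva Beng")`, while two kMandarin values with the same readings in a different order stay distinct.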

In the Unihan regexes, we should use the same mechanism for regex readability that we use in UAX #44. For example, we could make the following replacement, for a much more readable result.

U\+2?[0-9A-F]{4}(<k[A-Za-z0-9]+(:[TBZ]+)?(,k[A-Za-z0-9]+(:[TBZ]+)?)*)?
by:
$ucodepoint(<$kSemantic(,$kSemantic)*)?
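The two forms above are equivalent once the variables are expanded, which a few lines of Python can check. In this sketch the definitions of `$ucodepoint` and `$kSemantic` are inferred by comparing the two patterns; the table `VARIABLES` and the function `expand` are illustrative names, not part of any existing file.

```python
import re

# Variable definitions inferred from the long-form pattern above.
VARIABLES = {
    "ucodepoint": r"U\+2?[0-9A-F]{4}",
    "kSemantic": r"k[A-Za-z0-9]+(:[TBZ]+)?",
}

READABLE = r"$ucodepoint(<$kSemantic(,$kSemantic)*)?"

_VAR = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def expand(pattern: str, variables: dict) -> str:
    """Replace each $name with its definition; raises KeyError if undefined."""
    # A function replacement is used so backslashes in the definitions
    # are inserted literally rather than treated as escapes.
    return _VAR.sub(lambda m: variables[m.group(1)], pattern)


expanded = expand(READABLE, VARIABLES)
```

Here `expanded` comes out identical to the original long-form pattern, so a file could carry only the readable form plus the variable table.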

We should extend PropertyAliases.txt to include all of the provisional properties as well. See ExtraPropertyAliases.txt

Parsing UcdFiles is not trivial, because the format varies widely. We should supply more precise information for parsing.

For an example of how this might look, see UcdExtraParseInfo.txt

In PropertyValueAliases, all but ccc have the same field order. Not sure how to do this, but it would be less ugly to parse if it had the same format!
Note that some of these data are not formal properties in the UCD, such as CJK_Radical. They would be cleaner if we made them at least provisional properties.
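The ccc irregularity shows up as a special case in any reader of PropertyValueAliases. The Python sketch below assumes the current layout, where ccc lines carry the numeric value before the short and long names (for example `ccc;   0; NR ; Not_Reordered`) while other lines start with the short name directly. `ValueAlias` and `parse_value_alias` are illustrative names.

```python
from typing import NamedTuple, Optional, Tuple


class ValueAlias(NamedTuple):
    prop: str                  # property short name, e.g. "gc"
    short: str                 # short value name
    long: str                  # long value name
    extra: Tuple[str, ...]     # any further aliases on the line
    numeric: Optional[int]     # only present for ccc


def parse_value_alias(line: str) -> Optional[ValueAlias]:
    """Parse one data line; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    prop, rest = fields[0], fields[1:]
    numeric = None
    if prop == "ccc":
        # The one special case: a numeric field precedes the names.
        numeric = int(rest[0])
        rest = rest[1:]
    if len(rest) < 2:
        raise ValueError(f"too few fields: {line!r}")
    return ValueAlias(prop, rest[0], rest[1], tuple(rest[2:]), numeric)
```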

B. Additional Extracted Data Files

Regarding #6, an alternative would be to have more extracted files covering the data that are ugly to parse. These include:

UnicodeData; Decomposition_Mapping (The Decomposition_Type already is extracted)
UnicodeData; Simple_Uppercase_Mapping
UnicodeData; Simple_Lowercase_Mapping
UnicodeData; Simple_Titlecase_Mapping
CJKRadicals
CompositionExclusions
CaseFolding ; Simple_Case_Folding
CaseFolding ; Case_Folding
SpecialCasing ; Lowercase_Mapping
SpecialCasing ; Titlecase_Mapping
SpecialCasing ; Uppercase_Mapping
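As an illustration of what one such extraction could look like, the Python sketch below pulls Simple_Uppercase_Mapping out of UnicodeData.txt lines. It relies on UnicodeData.txt having 15 semicolon-separated fields with the simple uppercase mapping at index 12 (counting from 0). The output layout borrows the per-property line format suggested in section C and is an assumption, as is the function name.

```python
from typing import Iterable, Iterator

FIELD_COUNT = 15            # fields per UnicodeData.txt line
SIMPLE_UPPERCASE = 12       # 0-based index of Simple_Uppercase_Mapping


def extract_simple_uppercase(lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'code ; property ; value' line per explicit mapping."""
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) != FIELD_COUNT:
            continue                    # skip blank or malformed lines
        mapping = fields[SIMPLE_UPPERCASE]
        if mapping:                     # empty field: no explicit mapping
            yield f"{fields[0]} ; Simple_Uppercase_Mapping ; {mapping}"


sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]
extracted = list(extract_simple_uppercase(sample))
```

Only the second sample line has an explicit uppercase mapping, so `extracted` holds a single line for U+0061.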

C. General Property File Format

In an ideal world, our data files would all be uniform. Every property would be in its own file, with the file name being the correct long property name, and each line in the file would tell which code points had which property value, something like the following. This makes it as easy as possible for simple parsers.

A_Property_Name.txt

# HEADER

...

1234..5678 ; A_Property_Name ; A_Property_Value # comments

…

# EOF

While it is far too late to change what we have, any movement towards that for future properties and derived property files would make people’s lives easier. So I suggest that we have the policy of producing future files in this format.
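To show how little code such a uniform format would need, here is a minimal Python sketch of a reader for the line layout shown above. It assumes that `#` always starts a comment and that `;` never appears inside a value; the function name `parse_property_file` is illustrative.

```python
from typing import Iterable, Iterator, Tuple


def parse_property_file(
    lines: Iterable[str], expected_prop: str
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, value) for each data line of a one-property file."""
    for line in lines:
        data = line.split("#", 1)[0].strip()    # drop comments
        if not data:
            continue                            # header, blank, or EOF line
        fields = [f.strip() for f in data.split(";")]
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields: {line!r}")
        code_range, prop, value = fields
        if prop != expected_prop:
            raise ValueError(f"unexpected property {prop!r}")
        start_hex, _, end_hex = code_range.partition("..")
        start = int(start_hex, 16)
        end = int(end_hex, 16) if end_hex else start   # single code point
        yield start, end, value


sample = [
    "# HEADER",
    "",
    "1234..5678 ; A_Property_Name ; A_Property_Value # comments",
    "# EOF",
]
rows = list(parse_property_file(sample, "A_Property_Name"))
```

The same function would serve every property file, with no per-file special cases.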