L2/11-358

Re: Parsing the UCD
To: UTC
From: Mark Davis
Date: 2011-10-05
Live doc: http://goo.gl/9MjQJ

I wrote my original UCD parser some 15 years ago, and with various additions over the years, it was getting pretty crufty and hard to maintain. I decided to see what it would be like to rewrite it…

Alas, it is not that easy to parse the UCD.

While doing this, I found some small problems in properties that I’d recommend we fix now, and some longer-term ideas for adding information that would make the UCD easier to parse. Those ideas are not fully fleshed out, so I’m looking for feedback at this point. I’m filing the shorter-term issues as separate feedback, so here are the longer-term issues (for post-6.1):

A. Full Machine Readable Parsing

Here are the thoughts I had about cleaning up the data to support fully machine-readable parsing. The following references some rough-draft data files, at the following locations. (The data, while in rough format, was enough to read and validate the UCD.)

  1. ExtraPropertyAliases.txt
  2. ExtraPropertyValueAliases.txt
  3. UcdPropertyRegex.txt
  4. UcdExtraParseInfo.txt

Topics:

  1. We need @missing values for every property, so that we know what value a code point has when it is not listed explicitly. Some are missing; if there is no other file to put them in, they should be in PropertyValueAliases.
  1. Note that it would be easier to parse if all were uniformly in PropertyValueAliases, instead of being split into different files.
  2. For an example of how this might look, see ExtraPropertyValueAliases.txt
  1. We don’t clearly distinguish when a missing field value means “no value”, when it means “empty string”, and when it means the “missing value”. For example, in NFKC_CF, an empty value means the empty string, but in the Emoji file or some UnicodeData fields it means “no value” or the “default value”. We should add data to clarify that.
  1. For an example of how this might look, see ExtraPropertyValueAliases.txt
  2. (look for @empty)
  1. We have regex patterns, but there are problems.
  1. The ones listed in UAX #44 are out-of-date (listed in another document)
  2. None of them are in machine-readable files.
  3. For an example of how this might look, see UcdPropertyRegex.txt
  1. We have quite a number of multivalued properties. We should supply a machine-readable data file that indicates which properties are:
  1. single-valued
  2. multi-valued
  3. extensible: “currently” single-valued, but could potentially change (many Unihan, acc. to John/Richard).
  4. ordered values
  1. where order is significant. That is, in some cases a multivalued property has a set value (the values could be in any order), and in others a list value (where the order is significant, as in kMandarin).
  1. For an example of how this might look, see UcdPropertyRegex.txt
  1. In Unihan regex, we should use the same mechanism for readability of regex that we do in UAX #44. For example, we could replace the following, for a much more readable result.
  1. U\+2?[0-9A-F]{4}(<k[A-Za-z0-9]+(:[TBZ]+)?(,k[A-Za-z0-9]+(:[TBZ]+)?)*)?
  2. by:
  3. $ucodepoint(<$kSemantic(,$kSemantic)*)?
  1. We should extend PropertyAliases.txt to include all of the provisional properties as well. See ExtraPropertyAliases.txt
  2. Parsing UCD files is not trivial, because the format varies widely. We should supply more precise information for parsing.
  1. For an example of how this might look, see UcdExtraParseInfo.txt
  1. In PropertyValueAliases, all but ccc have the same field order. Not sure how to do this, but it would be less ugly to parse if it had the same format!
  2. Note that some of these data are not formal properties in the UCD, such as CJK_Radical. They would be cleaner if we made them at least provisional properties.
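
As a rough illustration of the variable mechanism suggested above for the Unihan regex, the sketch below expands $name references by plain text substitution. The variable names come from the example in the list; the definitions assigned to them, the VARIABLES table, and the expand function are assumptions made for illustration, not the contents of UcdPropertyRegex.txt or of UAX #44.

```python
import re

# Hypothetical variable table. The names are taken from the memo's example;
# the definitions are assumed so that expansion reproduces the long regex.
VARIABLES = {
    "ucodepoint": r"U\+2?[0-9A-F]{4}",
    "kSemantic": r"k[A-Za-z0-9]+(:[TBZ]+)?",
}

# A variable reference is "$" followed by an identifier.
_VAR_REF = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def expand(pattern, variables=VARIABLES):
    """Replace every $name in `pattern` with its definition (textual substitution).

    Raises KeyError if a referenced variable is not defined.
    """
    return _VAR_REF.sub(lambda m: variables[m.group(1)], pattern)


readable = "$ucodepoint(<$kSemantic(,$kSemantic)*)?"
expanded = expand(readable)

# The expanded form can then be compiled and used like any other regex.
matcher = re.compile(expanded)
print(expanded)
print(bool(matcher.fullmatch("U+4E00<kMatthews:T,kMeyerWempe")))
```

Plain substitution is enough here because each definition is used as a single unit. If a definition contained a top-level alternation, wrapping each substituted value in a non-capturing group would be the safer choice.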

B. Additional Extracted Data Files

Regarding #6, an alternative would be to have more extracted files covering the data that are ugly to parse. These include:

  1. UnicodeData; Decomposition_Mapping (The Decomposition_Type already is extracted)
  2. UnicodeData; Simple_Uppercase_Mapping
  3. UnicodeData; Simple_Lowercase_Mapping
  4. UnicodeData; Simple_Titlecase_Mapping
  5. CJKRadicals
  6. CompositionExclusions
  7. CaseFolding ; Simple_Case_Folding
  8. CaseFolding ; Case_Folding
  9. SpecialCasing ; Lowercase_Mapping
  10. SpecialCasing ; Titlecase_Mapping
  11. SpecialCasing ; Uppercase_Mapping
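
To give a feel for what producing such an extracted file involves, here is a minimal sketch that pulls the simple case mappings out of UnicodeData-style lines. The field positions (12, 13, 14, counting from 0) follow the UnicodeData.txt layout. The function name, the SAMPLE lines, and the output layout (borrowed from the format in section C) are assumptions for illustration only.

```python
# 0-based positions of the simple case mapping fields in UnicodeData.txt.
SIMPLE_CASE_FIELDS = {
    "Simple_Uppercase_Mapping": 12,
    "Simple_Lowercase_Mapping": 13,
    "Simple_Titlecase_Mapping": 14,
}


def extract_property(lines, prop):
    """Yield 'code ; property ; value' for each line with an explicit value."""
    index = SIMPLE_CASE_FIELDS[prop]
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        value = fields[index].strip()
        # An empty field is skipped: no explicit mapping is recorded, and the
        # consumer has to know what the default is.
        if value:
            yield f"{fields[0]} ; {prop} ; {value}"


# Two lines in UnicodeData.txt layout, used as sample input.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]

for prop in SIMPLE_CASE_FIELDS:
    for row in extract_property(SAMPLE, prop):
        print(row)
```

The skipped empty fields are exactly the case discussed under section A: the file itself does not say whether an empty field means "no value" or "the default value".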

C. General Property File Format

In an ideal world, our data files would all be uniform. Every property would be in its own file, with the file name being the correct long property name, and each line in the file would tell which code points had which property value, something like the following. This makes it as easy as possible for simple parsers.

A_Property_Name.txt

# HEADER

...

1234..5678 ; A_Property_Name ; A_Property_Value # comments

# EOF
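
A simple parser for a file in this shape could be as small as the following sketch. It assumes that "#" always starts a comment, that values never contain ";" or "#", and that a single code point may appear in place of a range; none of these details are settled by the example above, and the function name is made up for illustration.

```python
def parse_property_file(lines):
    """Yield (first, last, property, value) tuples from lines in the uniform format."""
    for raw in lines:
        # Drop comments and surrounding whitespace; skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        range_part, prop, value = (field.strip() for field in line.split(";"))
        # "1234..5678" is a range; a lone "1234" is treated as a one-element range.
        start, _, end = range_part.partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        yield first, last, prop, value


sample = [
    "# HEADER",
    "",
    "1234..5678 ; A_Property_Name ; A_Property_Value # comments",
    "0041 ; A_Property_Name ; Another_Value",
    "# EOF",
]

for first, last, prop, value in parse_property_file(sample):
    print(f"{first:04X}..{last:04X} {prop}={value}")
```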

While it is far too late to change what we have, any movement towards that for future properties and derived property files would make people’s lives easier. So I suggest that we have the policy of producing future files in this format.