L2/16-344

Source: Jonathan Lettvin

Date: 2016-11-06

Subject: Suggestions for UCD in version 10.0

I write code that reads authoritative language specifications and generates authoritative grammars.
The file UnicodeData.txt is an excellent resource, but
I was unable to find its column names in a "normative" or "authoritative" resource.

The old file UnicodeData-3.0.0.html gave more hints than the latest 9.0.0 dataset.
UnicodeData-3.0.0.html names the columns, but with inconsistencies that hinder lexing, parsing, and usage.
TR44 also names them, but in HTML without a specific class, id, or other machine-readable hint.

I recommend several changes to be incorporated into 10.0.0 to make column names normative.
My personal preference is for these names to be friendly to tokenization,
so that code using the column names maps clearly to the official names.

One choice I made is to use one style of name casing for all column names.
Another choice is to use underscore in place of space characters.
Another choice is to put a numeric component at the end rather than the beginning of a column name.
These are typical programmer conventions, with some variation, and should be universally acceptable.
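
The three choices above can be sketched as a small normalization routine.
This is an illustration only, not part of the proposal: the function name tokenizable_name is hypothetical,
and treating a period like a space (so that "1.0" becomes "1_0") is an assumption inferred from "Unicode_1_0_Name".

```python
import re


def tokenizable_name(label):
    """Turn a descriptive column label into an identifier-style name."""
    # Assumption: spaces and periods both separate tokens ("1.0" -> "1", "0").
    tokens = [t for t in re.split(r"[ .]+", label.strip()) if t]
    # Move a leading all-digit token to the end of the name.
    if len(tokens) > 1 and tokens[0].isdigit():
        tokens = tokens[1:] + tokens[:1]
    # Capitalize the first letter of each token and join with underscores.
    return "_".join(t[:1].upper() + t[1:] for t in tokens)


names = [tokenizable_name(label) for label in (
    "Decimal digit value",
    "Unicode 1.0 Name",
    "10646 comment field",
)]
```

A rename such as "Code Value" to "Codepoint" is a deliberate choice of wording and cannot be derived mechanically this way.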

Recommended contents of a new resource file named UnicodeDataColumnNames.txt:

Codepoint
Character_Name
General_Category
Canonical_Combining_Classes
Bidirectional_Category
Character_Decomposition_Mapping
Decimal_Digit_Value
Digit_Value
Numeric_Value
Mirrored
Unicode_1_0_Name
Comment_Field_10646
Uppercase_Mapping
Lowercase_Mapping
Titlecase_Mapping
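
As a minimal sketch of how a consumer might use such a list, the following splits one UnicodeData.txt record into a mapping keyed by the proposed names.
The names COLUMN_NAMES and parse_record are hypothetical; here the list is written inline, whereas a real consumer would presumably read it from the proposed file.

```python
# Proposed column names, in field order (index 0 through 14).
COLUMN_NAMES = [
    "Codepoint",
    "Character_Name",
    "General_Category",
    "Canonical_Combining_Classes",
    "Bidirectional_Category",
    "Character_Decomposition_Mapping",
    "Decimal_Digit_Value",
    "Digit_Value",
    "Numeric_Value",
    "Mirrored",
    "Unicode_1_0_Name",
    "Comment_Field_10646",
    "Uppercase_Mapping",
    "Lowercase_Mapping",
    "Titlecase_Mapping",
]


def parse_record(line):
    """Split one semicolon-separated record into a dict keyed by column name."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(COLUMN_NAMES):
        raise ValueError(
            "expected %d fields, got %d" % (len(COLUMN_NAMES), len(fields))
        )
    return dict(zip(COLUMN_NAMES, fields))


# The UnicodeData.txt record for U+0041.
sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
record = parse_record(sample)
```

With names like these, code can refer to record["Lowercase_Mapping"] instead of a bare index such as fields[13].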

Unnecessary:
It may be desirable to make this a semicolon-separated file, one line per column, with additional fields.
One field could give a clearer text description.
Another field could give a resource file name for further research into that column.
Another field could identify the previous version of the file where the field was different.

Necessary:
These are the changes to the current naming convention with deference to tokenizing:
  1. Column 1 (index 0) Unicode 3.0 "Code Value" is renamed to the unambiguous "Codepoint"
  2. Column 2 (index 1) Unicode 3.0 "Character name" is renamed to "Character_Name"
  3. Column 7 (index 6) Unicode 3.0 "Decimal digit value" is renamed to "Decimal_Digit_Value"
  4. Column 8 (index 7) Unicode 3.0 "Digit value" is renamed to "Digit_Value"
  5. Column 9 (index 8) Unicode 3.0 "Numeric value" is renamed to "Numeric_Value"
  6. Column 11 (index 10) Unicode 3.0 "Unicode 1.0 Name" is renamed to "Unicode_1_0_Name"
  7. Column 12 (index 11) Unicode 3.0 "10646 comment field" is renamed to "Comment_Field_10646"
These changes could also be folded into TR44 in the following style:

<a name="Codepoint" class="UnicodeDataColumn" id="Column1">Codepoint</a>

The column names of other resource files may benefit from the same class/id attributes in the HTML of TR44.
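
To illustrate why the class and id attributes help, here is a rough sketch of extracting column names from HTML marked up in the style shown above, using Python's standard html.parser module.
The class ColumnNameParser is hypothetical, and the second anchor in the sample is an assumed extension of the same style rather than existing TR44 markup.

```python
from html.parser import HTMLParser


class ColumnNameParser(HTMLParser):
    """Collect (id, text) pairs from <a class="UnicodeDataColumn"> elements."""

    def __init__(self):
        super().__init__()
        self.columns = []        # (id, name) pairs in document order
        self._inside = False     # True while inside a matching <a> element
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "UnicodeDataColumn":
            self._inside = True
            self._current_id = attributes.get("id")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            name = "".join(self._text).strip()
            self.columns.append((self._current_id, name))
            self._inside = False


# Sample markup in the proposed style (the second anchor is assumed).
sample_html = (
    '<p><a name="Codepoint" class="UnicodeDataColumn" id="Column1">'
    'Codepoint</a></p>'
    '<p><a name="Character_Name" class="UnicodeDataColumn" id="Column2">'
    'Character_Name</a></p>'
)

parser = ColumnNameParser()
parser.feed(sample_html)
parser.close()
columns = parser.columns
```

Without a class or id to key on, the same extraction would have to depend on the surrounding page layout, which is what makes the current HTML fragile to ingest.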

If making a new resource file is undesirable, the HTML change is sufficient, but
changing the HTML may break the custom ingest routines of current Unicode-compliant developers.
My strong recommendation is to make a new normative resource file UnicodeDataColumnNames.txt.