L2/06-026

Subject: Documenting missing values in data files
Source: Mark Davis
Date: 2006/01/26

===============

Currently, the UCD data files don't provide explicit values for many
(actually, usually *most*) code points. Instead, they document in comments
what values the missing code points are to have. In some cases this is
simple: all code points not explicitly mentioned share a single value. In
other cases, such as for Arabic Shaping, different values are possible:

> # Note: Code points that are not explicitly listed in this file are
> # either of joining type T or U:
> #
> # - Those that not explicitly listed that are of General Category Mn, Me, or Cf
> #   have joining type T.
> # - All others not explicitly listed have type U.
> #
> # For an explicit listing of characters of joining type T, see
> # the derived property file DerivedJoiningType.txt.
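
As a rough illustration of what that comment asks of a reader, the rule
could be sketched in Python as below. The function name and the idea of
passing the General Category in as a string are my own assumptions for the
sketch; the point is only that the default depends on a property that is
not available in the Arabic Shaping file itself.

```python
# General Category values whose unlisted code points default to joining
# type T, as stated in the quoted comment. Everything else defaults to U.
TRANSPARENT_CATEGORIES = {"Mn", "Me", "Cf"}


def default_joining_type(general_category):
    """Return the joining type for a code point NOT listed in the file.

    `general_category` is the two-letter General Category value, which
    has to be looked up in some other data source (hypothetical input).
    """
    return "T" if general_category in TRANSPARENT_CATEGORIES else "U"
```

For example, default_joining_type("Mn") gives "T" and
default_joining_type("Lo") gives "U".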

Unfortunately, this means that a mechanical parser can't actually assign  
the right values to code points strictly based on the data file.  In the  
future, perhaps, this may be addressed by an XML format. But even once we  
have such a format, the current data files will have to coexist for some  
time.

To address this, I have the following proposal (refined after comments
from Asmus and Markus).

For each data file, allow the addition of specially formatted comment
lines that specify values for missing code points. Current parsers can
ignore these lines as ordinary comments; updated parsers can read them.
The format is:

# @missing: 0000..10FFFF; XX

where everything after "# @missing: " would be a valid line in that data file.
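
A minimal sketch of how an updated parser might recognize such a line is
given below, in Python. The function names are hypothetical, and the
sketch assumes the simplest file layout: a code point or range, a
semicolon, and a single value field, optionally followed by a trailing
comment. Files with more fields would need a fuller parser.

```python
MISSING_PREFIX = "# @missing: "


def parse_data_line(text):
    """Parse 'XXXX; value' or 'XXXX..YYYY; value'.

    Returns (start, end, value), or None for blank or comment-only text.
    Assumes exactly one value field (an assumption of this sketch).
    """
    text = text.split("#", 1)[0].strip()  # drop any trailing comment
    if not text:
        return None
    fields = [field.strip() for field in text.split(";")]
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return start, end, fields[1]


def parse_missing_line(line):
    """Return (start, end, value) for an @missing line, else None."""
    if not line.startswith(MISSING_PREFIX):
        return None
    # Everything after the prefix is parsed exactly like a data line.
    return parse_data_line(line[len(MISSING_PREFIX):])
```

An ordinary comment yields None, so a parser that does not know about
@missing lines and one that does will agree on every other line.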

There can be any number of these lines. They must all occur before any
real data line in the file. A value that one of these lines assigns to a
code point is overridden by any subsequent line that covers the same code
point, whether that is another @missing line or a real data line. Thus
you can have:

# @missing: 0100..0200; XX
...
0123; YY
0159; ZZ
01A8..01B7; WW
...

instead of having to break the @missing range into a whole slew of small ranges.
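
The override rule can be sketched as a single pass over the file in which
later lines simply overwrite earlier ones. The Python below is only an
illustration under stated assumptions: the names are hypothetical, each
line carries a single value field, and a plain dict is used for clarity
(a real implementation covering 0000..10FFFF would more likely use a
range table).

```python
MISSING_PREFIX = "# @missing: "


def _parse_entry(text):
    """Parse 'XXXX[..YYYY]; value' into (start, end, value), or None."""
    text = text.split("#", 1)[0].strip()  # drop any trailing comment
    if not text:
        return None
    fields = [field.strip() for field in text.split(";")]
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    return start, (int(last, 16) if last else start), fields[1]


def load_values(lines):
    """Map each covered code point to its value; later lines win."""
    values = {}
    seen_data = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(MISSING_PREFIX):
            # The proposal requires @missing lines to precede all data.
            if seen_data:
                raise ValueError("@missing line after a data line")
            entry = _parse_entry(line[len(MISSING_PREFIX):])
        else:
            entry = _parse_entry(line)
            seen_data = seen_data or entry is not None
        if entry is None:
            continue
        start, end, value = entry
        for code_point in range(start, end + 1):
            values[code_point] = value  # overwrite any earlier value
    return values
```

Fed the four lines shown above (without the "..." placeholders), this
assigns YY to 0123, ZZ to 0159, WW to 01A8..01B7, and XX to every other
code point in 0100..0200.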

This allows *all* of the values in the file to be machine readable.

Note: unfortunately the UnicodeData file itself cannot contain comments,
for reasons best not explored. There are two options: either leave this
file as it is, or add an extra file ("UnicodeDataX.txt"??) that has the
comments that would be there if we could add them.