L2/06-026

Subject: Documenting missing values in data files
Source: Mark Davis
Date: 2006/01/26

===============

Currently, the UCD data files don't provide explicit values for many (actually, usually *most*) code points. Instead, they document what values the missing code points are to have. In some cases this is simple: all code points not explicitly mentioned have only one value. In other cases, such as for Arabic Shaping, different values are possible:

> # Note: Code points that are not explicitly listed in this file are
> # either of joining type T or U:
> #
> # - Those that not explicitly listed that are of General Category Mn, Me, or Cf
> #   have joining type T.
> # - All others not explicitly listed have type U.
> #
> # For an explicit listing of characters of joining type T, see
> # the derived property file DerivedJoiningType.txt.

Unfortunately, this means that a mechanical parser can't actually assign the right values to code points strictly based on the data file. In the future, perhaps, this may be addressed by an XML format. But even once we have such a format, the current data files will have to coexist for some time.

To address this, I have the following proposal (refined after comments from Asmus, Markus).

For each data file, allow the addition of comment lines with a special format that specify values for missing code points. These lines can be ignored by current parsers, or read by updated parsers. The format is:

    # @missing: 0000..10FFFF; XX

where everything after "# @missing: " would be a valid line in that data file. There can be any number of these lines. They must all occur before any real data line in the file. Values in any range in these lines are overridden by any subsequent line, either an @missing line or a real data line. Thus you can have:

    # @missing: 0100..0200; XX
    ...
    0123; YY
    0159; ZZ
    01A8..01B7; WW
    ...

instead of having to break the @missing range into a whole slew of small ranges.
This allows *all* of the values in the file to be machine readable.

Note: unfortunately the UnicodeData file itself cannot contain comments, for reasons best not explored. There are two options: either leave this file as it is, or add an extra file ("UnicodeDataX.txt"??) containing the comments that would be there if we could add them.
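
As an illustration of the override rule for @missing lines, here is a minimal sketch in Python of how an updated parser might read them. It is not part of the proposal: the function names (parse_range, parse_lines, lookup) are hypothetical, and it assumes a simple data file in which each line has the form "range; value", so only the first field after the code point range is kept. Later entries win over earlier ones, which is how a real data line overrides an enclosing @missing range.

```python
MISSING_PREFIX = "# @missing:"


def parse_range(field):
    """Parse 'XXXX' or 'XXXX..YYYY' (hex) into an inclusive (start, end) pair."""
    start, _, end = field.strip().partition("..")
    return int(start, 16), int(end or start, 16)


def parse_lines(lines):
    """Return (start, end, value) entries in file order.

    Assumption: every data line is 'range; value'. Extra fields are ignored.
    """
    entries = []
    seen_data = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(MISSING_PREFIX):
            # The proposal requires @missing lines to precede all real data.
            if seen_data:
                raise ValueError("@missing line after a data line: " + line)
            body = line[len(MISSING_PREFIX):]
        else:
            body = line
        # Drop any trailing comment; skip blank and comment-only lines.
        body = body.split("#", 1)[0].strip()
        if not body:
            continue
        if not line.startswith(MISSING_PREFIX):
            seen_data = True
        fields = [f.strip() for f in body.split(";")]
        start, end = parse_range(fields[0])
        entries.append((start, end, fields[1]))
    return entries


def lookup(entries, code_point, default=None):
    """Return the value for code_point; the last matching entry wins."""
    value = default
    for start, end, entry_value in entries:
        if start <= code_point <= end:
            value = entry_value
    return value


# Sample based on the example in the proposal (placeholder values XX, YY, ...).
SAMPLE = [
    "# @missing: 0100..0200; XX",
    "# an ordinary comment, ignored",
    "0123; YY",
    "0159; ZZ",
    "01A8..01B7; WW",
]
```

With this sketch, lookup(parse_lines(SAMPLE), 0x0123) picks up the explicit data line, while a code point such as 0x0124 falls back to the @missing range. A linear scan keeps the override rule obvious; a real parser would more likely fill a range map or trie.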