L2/02-127

To: UTC
Re: Machine-Readable StandardizedVariants.txt
From: Mark Davis
Date: 2002-03-28

I tried to run a mechanical test of the variation selectors to check for consistency after we got an error report, and found that (lo and behold) we do not currently provide a machine-readable data file that can be used to test programmatically whether a sequence <X VSn> is a defined sequence (for a given version of Unicode)! While it is too late for 3.2, if we are to be really serious about fighting misuse of VS, we have to supply that file and encourage people to offer an API to detect defined sequences, just as they offer APIs for isAssigned(codepoint) right now.

I first suggested that the file can simply be of the format:

2229; FE00; with serifs
222A; FE00; with serifs
....
1820; 180B; second form
....
188A; 180B; second form
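
As a rough illustration of the kind of API such a file would enable, here is a minimal Python sketch. The function names (load_variants, is_standardized_variant) and the inline sample are hypothetical; the sample contains only the lines shown above, not the full data.

```python
# Hypothetical sample in the simple three-field format shown above.
SAMPLE = """\
2229; FE00; with serifs
222A; FE00; with serifs
1820; 180B; second form
188A; 180B; second form
"""


def load_variants(text):
    """Return a dict mapping (base, selector) code point pairs to descriptions."""
    table = {}
    for line in text.splitlines():
        # Drop any trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        base, selector, description = [f.strip() for f in line.split(";", 2)]
        table[(int(base, 16), int(selector, 16))] = description
    return table


def is_standardized_variant(table, base, selector):
    """True if <base, selector> is listed as a defined sequence in the table."""
    return (base, selector) in table


VARIANTS = load_variants(SAMPLE)
```

A caller would then test a pair of code points with, say, is_standardized_variant(VARIANTS, 0x2229, 0xFE00), in the same way one calls isAssigned(codepoint) today.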

Asmus suggested the following revision, which I agree with:

There would be three requirements (in my mind) that we would need to place on such a file.

(A) The strings in the file must match the description strings in the HTML file. As conceived, this limits our ability to write descriptions (which may be fine, but was not anticipated). Under no circumstances should any divergence between the files be allowed. This leads to:

(B) The file would be used to generate StandardizedVariants.html automatically. It helps that the names of the GIFs are all derived from their code points. The boilerplate and HTML instructions can come from a practically trivial Perl script, so this is not a disabling requirement. This leads to:

(C) The file would need to carry all the information. For the Mongolian characters, some positional forms do not admit variation; variation for these forms is *undefined*. Therefore the file needs to specify the positional forms to which each variation applies.

Combining these, the file needs to contain data of this form:

2229; FE00; any; with serifs # INTERSECTION
222A; FE00; any; with serifs # UNION
....
1820; 180B; isolate; second form # MONGOLIAN LETTER A
1820; 180B; medial;  second form # MONGOLIAN LETTER A
1820; 180B; final;   second form # MONGOLIAN LETTER A
 ...
188A; 180B; initial; second form # MONGOLIAN LETTER ALI GALI NGA
188A; 180B; medial;  second form # MONGOLIAN LETTER ALI GALI NGA

(The comment is optional if we want to write a more complex Perl script to extract the base names on the fly, but we may want to keep the comments for readability.)
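
To make the four-field format concrete, here is a minimal Python sketch of a reader for it. The Variant record, the function names, and the inline sample are assumptions for illustration only; elided lines ("....") are simply skipped, not filled in.

```python
from collections import namedtuple

# Hypothetical record type for one line of the proposed file.
Variant = namedtuple("Variant", "base selector context description name")

# Sample lines copied from the proposed format; "...." marks elided data.
SAMPLE = """\
2229; FE00; any; with serifs # INTERSECTION
....
1820; 180B; isolate; second form # MONGOLIAN LETTER A
1820; 180B; medial;  second form # MONGOLIAN LETTER A
1820; 180B; final;   second form # MONGOLIAN LETTER A
188A; 180B; initial; second form # MONGOLIAN LETTER ALI GALI NGA
188A; 180B; medial;  second form # MONGOLIAN LETTER ALI GALI NGA
"""


def parse_line(line):
    """Parse one data line; return None for blank or elided lines."""
    data, _, comment = line.partition("#")
    fields = [f.strip() for f in data.split(";")]
    if len(fields) != 4:
        return None
    base, selector, context, description = fields
    # The comment (character name) is optional, so it may be None.
    return Variant(int(base, 16), int(selector, 16), context,
                   description, comment.strip() or None)


def parse_variants(text):
    """Return the list of Variant records found in the text."""
    return [v for v in map(parse_line, text.splitlines()) if v is not None]


def contexts_for(records, base, selector):
    """List the positional contexts given for <base, selector>, in file order."""
    return [v.context for v in records if (v.base, v.selector) == (base, selector)]


RECORDS = parse_variants(SAMPLE)
```

Because the name field is allowed to be missing, a tool with access to the full UCD could ignore it and look the name up instead.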

The HTML for each row of the tables is fairly straightforward, even for the last row, for example, which collapses two lines of the data file into one row of the table:

<tr>
      <td align="center"><img border="0" alt="[]" src="images/U188A.gif"></td>
      <td>188A, 180B</td>
      <td>initial,<br>medial</td>
      <td align="center"><img border="0" alt="[]" src="images/U188A180Binit.gif"><br>
        <img border="0" alt="[]" src="images/U188A180Bmedi.gif"></td>
      <td>MONGOLIAN LETTER ALI GALI NGA second form</td>
</tr>
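
A minimal Python sketch of how such a row might be generated from the collapsed data lines follows. The function names are hypothetical, and the image-suffix rule (first four letters of the positional context, no suffix for "any") is a guess based only on the "init" and "medi" file names in the example above.

```python
from html import escape


def image_suffix(context):
    # Assumption: the suffix is the first four letters of the positional
    # context ("init", "medi"); "any" is assumed to take no suffix.
    return "" if context == "any" else context[:4]


def table_row(base, selector, contexts, name, description):
    """Build one HTML table row for all contexts of a <base, selector> pair."""
    code = f"{base:04X}{selector:04X}"
    images = "<br>\n        ".join(
        f'<img border="0" alt="[]" src="images/U{code}{image_suffix(c)}.gif">'
        for c in contexts
    )
    return "\n".join([
        "<tr>",
        f'      <td align="center"><img border="0" alt="[]" src="images/U{base:04X}.gif"></td>',
        f"      <td>{base:04X}, {selector:04X}</td>",
        f"      <td>{',<br>'.join(contexts)}</td>",
        f'      <td align="center">{images}</td>',
        f"      <td>{escape(name)} {escape(description)}</td>",
        "</tr>",
    ])


ROW = table_row(0x188A, 0x180B, ["initial", "medial"],
                "MONGOLIAN LETTER ALI GALI NGA", "second form")
print(ROW)
```

The two data lines for 188A collapse into one row because the caller passes both contexts in a single list; grouping the lines by (base, selector) before calling table_row is left to the surrounding tool.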

I would add only two items to this:

  1. I generate all the derived data files anyway; I can add a module to generate the HTML without changing our process to add a Perl script. It would not need to depend on the comment, since I have the full UCD accessible in the tool.
     
  2. Asmus raises an interesting point with "variation for these forms is *undefined*". I disagree with this. I don't think it is an acceptable burden to say that a variation sequence is only defined in context; that one must precede or follow the variation sequence with the right characters, otherwise the text is illegitimate. Users will have keys that generate a sequence; saying that sometimes those are wrong is very clumsy; and if I had a "Show Invalid" menu item, I would not expect it to flag those cases.

    How I interpreted the Mongolian lines in StandardizedVariants.html is that there is a visible difference in the shapes of a character only in the circumstances mentioned, but that the variation sequence is always valid. This needs some further clarification.
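
    To illustrate the distinction, here is a minimal Python sketch of that interpretation, in which validity is independent of context and the positional information only says where the shape visibly differs. The table and function names are hypothetical, and the entries are taken only from the sample lines quoted above.

```python
# Hypothetical table: (base, selector) -> positional contexts in which the
# variation sequence has a visibly distinct shape (from the sample lines only).
DISTINCT_FORMS = {
    (0x1820, 0x180B): {"isolate", "medial", "final"},
    (0x188A, 0x180B): {"initial", "medial"},
}


def is_valid_sequence(base, selector):
    """Validity under this interpretation: it does not depend on context."""
    return (base, selector) in DISTINCT_FORMS


def has_distinct_shape(base, selector, context):
    """True only where the table lists a distinct shape for this context."""
    return context in DISTINCT_FORMS.get((base, selector), set())
```

    Under this reading, <1820, 180B> in initial position is still valid (is_valid_sequence returns True) even though has_distinct_shape returns False, so a "Show Invalid" command would not flag it.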