L2/00-375

Kenneth Whistler on 10/09/2000 04:14:23 PM
To: "Multiple Recipients of Unicore" unicore@unicode.org
Subject: Re: Agenda Item: Script Names

Mark,

> At the next UTC, I will propose that we make draft TR #24 into a TR, and
> put the data file into the next version of the UCD. (See
> http://www.unicode.org/unicode/reports/tr24/). If you would have any
> problems with that, please send your feedback now so that we have a chance
> to discuss it before the meeting.

Grump, grump.

Well, in this case, silence is in danger of being taken as assent, so since I don't exactly see a long line at the microphone, I guess I'm nominated.

Executive summary: This DUTR is not ready for prime time yet, and I do not think it should be progressed to a TR.

A simple problem to deal with: The Status block in the 1.1 version on the web now is incorrect. The title shows this to be a DUTR, but the status block has not been updated from the PDUTR. I know, picky, picky, but we have processes and status blocks on these things for a reason.

More substantive issues follow.

This DUTR is basically just a one-page introduction for the data file Scripts.txt, which is intended eventually to be a part of the Unicode Character Database. A more rigorous introduction is actually needed, since it is inevitable that:

  a) the data file *is* going to be part of the UCD, presumably as soon as possible, and
  b) the DUTR is of a nature that guarantees that eventually it will be turned into a UAX -- i.e., part of the standard itself.

This is not just informative material, although it is getting its camel's nose under the tent that way, but stuff that is on an inexorable path to becoming part of the Unicode Standard itself. Hence, I want questions answered *before* we approve this thing, rather than scramble around trying to sort things out *after* approval.

The most glaring problem is that the list of script names is not present in the DUTR itself. It just points to Scripts.txt as if that is good enough.
But it clearly is *not* good enough. Why? Because Scripts.txt is presenting a new *partition* of Unicode properties. Whenever a partition is created, it raises issues of how that partition is to be extended. But in this case, it is inevitable that new values are going to be added to this partition in the near future. The handling of that extensibility needs to be explicitly stated in the document. This is a different kind of partition from the General Category partition, which we have explicitly stated is *never* again to be given additional values.

Since this is on track to become a UAX -- part of the standard itself -- it is appropriate to list *all* script values in the text itself (along with the non-script values, i.e. "Common" and "Unknown"). Whenever the standard is updated with the encoding of new scripts, this UTR (= UAX) will also need to be reviewed and revised.

I suspect Mark will counter that that is the reason for *not* listing the scripts in the UTR text itself -- it makes it easier to just update the data file without having to revise the UTR. But I find that unacceptable in the case of an introduction of a partition property to the standard. Just as the General Category and Bidi Category values are explicitly listed in the standard, and any change to them would be cause for explicit revision, so should any other partition category. Effectively, anyone's implementation of the Script partition is going to be affected by any extension of values, and this is worth making explicit note of in the general text describing the property. To do otherwise would be a disservice to those intending to use the standard.

The DUTR says that "it is expected than [sic] the Scripts.txt file will eventually be part of the Unicode Character Database." This is way too vague to allow in as part of an approved UTR. Under what circumstances will Scripts.txt become part of the UCD? Upon approval of the UTR? As part of Unicode 3.1 or before as another update revision (3.0.2?)
If as part of Unicode 3.1, then the data file *and* the list of scripts will need updating first. And will this be part of updating the status of the UTR to a UAX? The UTR text should be explicit about it being a track to a UAX in that case.

Assignments of "Common" and "Unknown" are problematical. The text implies that such assignments are made in Scripts.txt, but of course they are not, since no such entries occur in the data file. The language needs to be tightened up considerably here, if Common and Unknown assignments are not actually made in the data file.

The note about Draft ISO 15924 says "Once this standard is final, these codes can be used to represent the script names." They cannot be if they don't match correctly. The intent of the Scripts.txt script name assignments and of the ISO 15924 script code assignments is not entirely parallel, and there is no guarantee that they will match exactly. For example, which of the following three ISO 15924 script codes is supposed to match the script name "Georgian" in Scripts.txt:

  Geoa  Georgian (Asomtavruli)
  Geon  Georgian (Nuskhuri)
  Geor  Georgian (Mkhedruli)

The problem is that they *all* match parts of "Georgian" as defined in Scripts.txt. Geoa applies to 10A0..10C5, and Geon and Geor *both* apply to 10D0..10F6.

Since ISO 15924 is in DIS (if it hasn't already been approved and published -- the date on it is 2000), I want the crossmatching to be done *explicitly* in the text of UTR #24, with annotation of any problems in matching -- not just some vague handwaving in a note.

Now on to problems in the data file itself.

The "Telugu" script is misspelled "Telegu" in the Scripts.txt file.

The first really serious problem is the restriction of script assignments only to characters of General Category *Letter*. This seriously misrepresents the script identity of many characters. It omits all the formally combining vowel letters of all the Indic scripts, as well as the vowellings of Semitic scripts.
It also relegates to "Common" all of the script-specific digits -- which also cannot be considered either self-evident or correct. I think it would be a serious misrepresentation of the meaning of a Script property to publish such a data file. At the very least, all of the vowels and digits need to be given appropriate assignments, and we should do a serious pass on the script-specific punctuation as well -- even if we decide to leave that as Common in the end.

I cannot imagine what the argument could be for omitting the script designation of the digits. But for the combining marks, I suppose it could be argued that all "Common" script combining marks inherit their property from the baseform they apply to. However, unlike the case of the Bidi property partition, I do not think this is an appropriate stance to take for a Script property partition. On a character-by-character basis this just results in incorrect assignments. Instead, if you have some implementation that is taking shortcuts based on the General Category assignments of letters, you could add an implementation note that would indicate that separate script values don't need to be stored in a data file for implementing regexp, or whatever, because of the inheritance associated with NSM's. But even that is problematical, because of the possibility of citation of combining marks in isolation. Inheriting their script property from a SPACE, for example, gives the wrong answer.

The net is: I want the script assignments *correctly* assigned in Scripts.txt, with any implementation simplification opportunities called out in an implementation notes section of UTR #24. I don't want UTR #24 to be just another ad hoc assignment of properties related to scripts which is as seriously flawed as the Blocks.txt assignment that it is trying to supersede.

Greek/Coptic problem. Scripts.txt assigns all the Coptic letters the script value "Greek".
But their names are "COPTIC XYZ", they are part of a "Greek and Coptic" block, and we have an emerging consensus that unifying the rest of Coptic with Greek letters was a mistake and will need to be rectified in the near future. As such, it seems ill-advised to go out with a data file assigning "Greek" as the script for explicitly Coptic letters.

Other edge case assignment problems. The Kanbun characters 3190..319F are omitted from the script assignment, presumably because they have General Category So or No. However, clearly 3192..319F at least are just small Han ideographs in special usage, and ipso facto should be considered Han script.

If the contents of Scripts.txt are teleologically determined by some sense of making regular expression syntax come out right, rather than on the merits of the character identities per se, then that needs to be made *much* more explicit, with big warning signs everywhere. Otherwise, when the Unicode Consortium comes out with a UAX called "Script Names" and publishes a data file called "Scripts.txt", people are quite obviously going to assume we are making normative assignments of script properties for characters -- not that we are just producing a handy implementation dodge for regular expressions.

The Roman numerals are self-evidently part of the Latin script. Their omission in Scripts.txt is once again a result of the L(x) constraint applied regardless of its sensibility or lack of same in particular instances.

Styled letter bullets, squared Katakana words, and the like also arguably have script-specific properties, although of a different nature than those characters which are part of the orthography per se. I can understand omission of these from a formal listing of script properties -- but their omission should be mentioned and justified.

The Kangxi radicals and CJK radicals are clearly in the Han script. And the Yi radicals are clearly in the Yi script.
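
The effect of a Letter-only restriction on the characters mentioned above can be checked mechanically. What follows is only a minimal sketch, assuming Python's standard unicodedata module, which reports the General Category from whatever UCD version ships with the interpreter (not necessarily the data under discussion here). The names SAMPLES, passes_letter_filter, and excluded_by_letter_filter are hypothetical, chosen for illustration, and the sketch does not read Scripts.txt at all.

```python
import unicodedata

# Hypothetical sample: one character from each group called out in the text.
SAMPLES = [
    "\u093E",  # an Indic combining vowel sign (Devanagari)
    "\u05B0",  # a Semitic vowel point (Hebrew)
    "\u0966",  # a script-specific digit (Devanagari)
    "\u2160",  # a Roman numeral
    "\u3192",  # a Kanbun annotation mark
    "\u2F00",  # a Kangxi radical
    "\uA490",  # a Yi radical
]


def passes_letter_filter(ch):
    """Return True if the General Category of ch is a Letter category (L*)."""
    return unicodedata.category(ch).startswith("L")


def excluded_by_letter_filter(chars):
    """Map each non-Letter character to its General Category."""
    return {
        ch: unicodedata.category(ch)
        for ch in chars
        if not passes_letter_filter(ch)
    }


if __name__ == "__main__":
    # List every sample that a Letter-only rule would leave without a script.
    for ch, gc in excluded_by_letter_filter(SAMPLES).items():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {gc}")
```

Every sample falls outside the Letter categories, so a rule keyed only on L(x) would leave all of them without a script value, whatever their evident script identity.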
Until I see all these issues addressed, I will be opposed to advancing the status of this DUTR any further.

--Ken