L2/00-375

Kenneth Whistler <kenw@sybase.com> on 10/09/2000 04:14:23 PM
To:   "Multiple Recipients of Unicore" <unicore@unicode.org>

Subject:  Re: Agenda Item: Script Names


Mark,

> At the next UTC, I will propose that we make draft TR #24 into a TR, and
> put the data file into the next version of the UCD. (See
> http://www.unicode.org/unicode/reports/tr24/). If you would have any
> problems with that, please send your feedback now so that we have a chance
> to discuss it before the meeting.

Grump, grump.

Well, in this case, silence is in danger of being taken as assent, so
since I don't exactly see a long line at the microphone, I guess I'm
nominated.

Executive summary: This DUTR is not ready for prime time yet, and I do
not think it should be progressed to a TR.

A simple problem to deal with:

The Status block in the 1.1 version on the web now is incorrect. The
title shows this to be a DUTR, but the status block has not been updated
from the PDUTR. I know, picky, picky, but we have processes and status
blocks on these things for a reason.

More substantive issues follow.

This DUTR is basically just a one-page introduction for the data file
Scripts.txt, which is intended eventually to be a part of the Unicode
Character Database. A more rigorous introduction is actually needed, since
it is inevitable that: a) the data file *is* going to be part of the UCD,
presumably as soon as possible, and b) the DUTR is of a nature that
guarantees that eventually it will be turned into a UAX -- i.e., part of
the standard itself. This is not just informative material, although it
is getting its camel's nose under the tent that way, but stuff that is
on an inexorable path to becoming part of the Unicode Standard itself.
Hence, I want questions answered *before* we approve this thing, rather
than scramble around trying to sort things out *after* approval.

The most glaring problem is that the list of script names is not
present in the DUTR itself. It just points to Scripts.txt as if that
is good enough. But it clearly is *not* good enough. Why? Because Scripts.txt
is presenting a new *partition* of Unicode properties. Whenever a partition
is created, it raises issues of how that partition is to be extended.
But in this case, it is inevitable that new values are going to be added
to this partition in the near future. The handling of that extensibility
needs to be explicitly stated in the document. This is a different kind
of partition from the General Category partition, which we have explicitly
stated is *never* again to be given additional values.

Since this is on track to become a UAX -- part of the standard itself -- it
is appropriate to list *all* script values in the text itself (along with
the non-script values, i.e. "Common" and "Unknown"). Whenever the standard
is updated with the encoding of new scripts, this UTR (= UAX) will also
need to be reviewed and revised. I suspect Mark will counter that that is
the reason for *not* listing the scripts in the UTR text itself -- it makes
it easier to just update the data file without having to revise the UTR.
But I find that unacceptable in the case of an introduction of a partition
property to the standard. Just as the General Category and Bidi Category
values are explicitly listed in the standard, and any change to them would
be cause for explicit revision, so should any other partition category.
Effectively, anyone's implementation of the Script partition is going to
be affected by any extension of values, and this is worth making explicit
note of in the general text describing the property. To do otherwise would
be a disservice to those intending to use the standard.

The DUTR says that "it is expected than [sic] the Scripts.txt file will
eventually be part of the Unicode Character Database." This is way too
vague to allow in as part of an approved UTR. Under what circumstances
will Scripts.txt become part of the UCD? Upon approval of the UTR? As
part of Unicode 3.1, or before that as another update revision (3.0.2)? If as
part of Unicode 3.1, then the data file *and* the list of scripts will
need updating first. And will this be part of updating the status of the
UTR to a UAX? The UTR text should be explicit about it being a track to a
UAX in that case.

Assignments of "Common" and "Unknown" are problematical. The text implies
that such assignments are made in Scripts.txt, but of course they are
not, since no such entries occur in the data file. The language needs to
be tightened up considerably here, if Common and Unknown assignments are
not actually made in the data file.
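
To illustrate why the wording matters, here is a minimal sketch of what an
implementation sees when it looks up a character with no entry. The
"first..last ; name" line layout, the two-line excerpt, and the function
names are assumptions made for the sketch, not quotations from Scripts.txt.

```python
# Hypothetical excerpt; the "first..last ; name" layout is an assumption
# made for this sketch, not a quotation from Scripts.txt.
EXCERPT = """\
0041..005A ; Latin
10A0..10C5 ; Georgian
"""

def parse_ranges(text):
    """Parse 'first..last ; name' lines into (first, last, name) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        span, name = (part.strip() for part in line.split(";", 1))
        first, _, last = span.partition("..")
        # A single code point has no "..", so reuse first as last.
        ranges.append((int(first, 16), int(last or first, 16), name))
    return ranges

def explicit_script(ranges, code_point):
    """Return the script listed for code_point, or None if no entry covers it."""
    for first, last, name in ranges:
        if first <= code_point <= last:
            return name
    return None

RANGES = parse_ranges(EXCERPT)
print(explicit_script(RANGES, 0x0041))  # Latin
print(explicit_script(RANGES, 0x0020))  # None
```

A miss returns nothing at all; whether it should be read as "Common" or as
"Unknown" is something only the text of the UTR can settle.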

The note about Draft ISO 15924 says "Once this standard is final, these codes
can be used to represent the script names." They cannot be if they don't
match correctly. The intent of the Scripts.txt script name assignments
and of the ISO 15924 script code assignments is not entirely parallel,
and there is no guarantee that they will match exactly. For example,
which of the following three ISO 15924 script codes is supposed to
match the script name "Georgian" in Scripts.txt:

Geoa Georgian (Asomtavruli)
Geon Georgian (Nuskhuri)
Geor Georgian (Mkhedruli)

The problem is that they *all* match parts of "Georgian" as defined in
Scripts.txt. Geoa applies to 10A0..10C5, and Geon and Geor *both* apply
to 10D0..10F6.
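
A minimal sketch of the mismatch, using only the three codes and the two
ranges cited above. The table and function names are hypothetical; this is
not an official mapping from either standard.

```python
# Hypothetical table built only from the ranges cited in this message;
# it is not an official ISO 15924 or Scripts.txt mapping.
ISO_15924_RANGES = {
    "Geoa": (0x10A0, 0x10C5),  # Georgian (Asomtavruli)
    "Geon": (0x10D0, 0x10F6),  # Georgian (Nuskhuri)
    "Geor": (0x10D0, 0x10F6),  # Georgian (Mkhedruli)
}

def iso_codes_for(code_point):
    """Return every ISO 15924 code in the table whose range covers code_point."""
    return sorted(
        code for code, (first, last) in ISO_15924_RANGES.items()
        if first <= code_point <= last
    )

print(iso_codes_for(0x10A0))  # ['Geoa']
print(iso_codes_for(0x10D0))  # ['Geon', 'Geor']
```

A single "Georgian" value therefore cannot be replaced by one ISO 15924 code
without losing information, and for 10D0..10F6 the code cannot even be
chosen from the code point alone.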

Since ISO 15924 is in DIS (if it hasn't already been approved and published --
the date on it is 2000), I want the crossmatching to be done *explicitly*
in the text of UTR #24, with annotation of any problems in matching -- not
just some vague handwaving in a note.


Now on to problems in the data file itself.

The "Telugu" script is misspelled "Telegu" in the Scripts.txt file.

The first really serious problem is the restriction of script assignments
only to characters of General Category *Letter*. This seriously misrepresents
the script identity of many characters. It omits all the formally combining
vowel letters of all the Indic scripts, as well as the vowellings of Semitic
scripts. It also relegates to "Common" all of the script-specific digits --
which also cannot be considered either self-evident or correct. I think it would
be a serious misrepresentation of the meaning of a Script property to publish
such a data file. At the very least, all of the vowels and digits need to be
given appropriate assignments, and we should do a serious pass on the
script-specific punctuation as well -- even if we decide to leave that
as Common in the end.
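
A short sketch of what a Letter-only filter drops. The sample characters are
my own picks for illustration, and the filter is my reading of the
restriction, not the actual generator of the data file. Note also that
Python's unicodedata module reflects a later version of the UCD than the one
under discussion here.

```python
import unicodedata

# Sample characters chosen for illustration: a Devanagari consonant, a
# Devanagari vowel sign, a Devanagari digit, and a Hebrew point.
SAMPLES = [0x0915, 0x093E, 0x0966, 0x05B0]

def is_letter(code_point):
    """True if the General Category starts with 'L' (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(chr(code_point)).startswith("L")

for cp in SAMPLES:
    ch = chr(cp)
    # Print code point, General Category, character name, and filter result.
    print(f"U+{cp:04X} {unicodedata.category(ch)} {unicodedata.name(ch)} "
          f"letter={is_letter(cp)}")
```

Only the consonant passes the filter, although the names of the other three
say plainly which script they belong to.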

I cannot imagine what the argument could be for omitting the script designation
of the digits. But for the combining marks, I suppose it could be argued that
all "Common" script combining marks inherit their property from the baseform
they apply to. However, unlike the case of the Bidi property partition, I do
not think this is an appropriate stance to take for a Script property partition.
On a character-by-character basis this just results in incorrect assignments.
Instead, if you have some implementation that is taking shortcuts based on
the General Category assignments of letters, you could add an implementation
note that would indicate that separate script values don't need to be stored
in a data file for implementing regexp, or whatever, because of the inheritance
associated with NSM's. But even that is problematical, because of the
possibility of citation of combining marks in isolation. Inheriting their
script property from a SPACE, for example, gives the wrong answer.
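
The isolation case can be sketched as follows. The script table is a
hypothetical, hand-made stand-in, and the walk-back rule is only one
plausible way an implementation might do the inheritance.

```python
import unicodedata

# Hypothetical, hand-made script table for this demonstration only.
SCRIPT_OF_LETTER = {"a": "Latin", "\u05D1": "Hebrew"}

def is_mark(ch):
    """True if the General Category is Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")

def inherited_script(text, index):
    """Resolve a script by walking back from a combining mark to its base."""
    i = index
    while i > 0 and is_mark(text[i]):
        i -= 1
    # Anything not in the table falls through to "Common".
    return SCRIPT_OF_LETTER.get(text[i], "Common")

sheva = "\u05B0"  # HEBREW POINT SHEVA, a nonspacing mark
print(inherited_script("\u05D1" + sheva, 1))  # Hebrew
print(inherited_script(" " + sheva, 1))       # Common
```

The same Hebrew point comes out as Hebrew after BET and as Common after a
SPACE, which is exactly the wrong answer for a citation in isolation.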

The net is: I want the script assignments *correctly* assigned in Scripts.txt,
with any implementation simplification opportunities called out in an
implementation notes section of UTR #24. I don't want UTR #24 to be just another
ad hoc assignment of properties related to scripts which is as seriously
flawed as the Blocks.txt assignment that it is trying to supersede.

Greek/Coptic problem. Scripts.txt assigns all the Coptic letters the
script value "Greek". But their names are "COPTIC XYZ", they are part
of a "Greek and Coptic" block, and we have an emerging consensus that
unifying the rest of Coptic with Greek letters was a mistake and will
need to be rectified in the near future. As such, it seems ill-advised
to go out with a data file assigning "Greek" as the script for explicitly
Coptic letters.

Other edge case assignment problems.

The Kanbun characters 3190..319F are omitted from the script
assignment, presumably because they have General Category So or No.
However, clearly 3192..319F at least are just small Han ideographs
in special usage, and ipso facto should be considered Han script.
If the contents of Scripts.txt are teleologically determined by
some sense of making regular expression syntax come out right, rather
than on the merits of the character identities per se, then that needs
to be made *much* more explicit, with big warning signs everywhere, since
it is quite obvious that otherwise, when the Unicode Consortium comes
out with a UAX called "Script Names" and publishes a data file called
"Scripts.txt", people are going to assume we are making normative
assignments of script properties for characters -- not that we are
just producing a handy implementation dodge for regular expressions.

The Roman numerals are self-evidently part of the Latin script. Their
omission in Scripts.txt is once again a result of the L(x) constraint
applied regardless of its sensibility or lack of same in particular
instances.

Styled letter bullets, squared Katakana words, and the like also
arguably have script-specific properties, although of a different
nature than those characters which are part of the orthography per se.
I can understand omission of these from a formal listing of script
properties -- but their omission should be mentioned and justified.

The Kangxi radicals and CJK radicals are clearly in the Han script.
And the Yi radicals are clearly in the Yi script.

Until I see all these issues addressed, I will be opposed to advancing
the status of this DUTR any further.

--Ken

