Re: Level of Unicode support required for various languages

From: John H. Jenkins (jenkins@apple.com)
Date: Tue Oct 30 2007 - 16:37:58 CST

    I really don't want to continue this discussion because I don't think
    it's productive at this point and, frankly, my temper is fraying, but
    I'd like to make a couple of final points.

    The IRG's embedding Latin in IDSs (and yes, they do use that term) is
    wrong, not so much because it violates the formal grammar but because
    it really isn't serving the purpose the IRG intends it to serve. The
    whole reason the IRG adopted IDSs in its work was to provide a quick
    first-order way of doing unifications. Their use of Latin text is,
    basically, an admission that a particular character cannot be broken
    down into encoded parts, in which case the IDS doesn't serve any
    genuine purpose.

    The IDCs were added to Unicode because they were added to 10646, and
    they were added to 10646 ultimately because the PRC wanted them. They
    were added without sufficient attention to the technical
    ramifications of using them, which left the UTC scrambling to make
    some sort of sense of how to actually make them work. Part of that
    was restricting their scope. It turns out that the original
    restrictions were too great, and so additional uses were added.

    One of the main technical problems the IDCs presented was that there
    was no limit to the complexity of the characters potentially formed,
    making it difficult to produce systems which could even parse an IDS
    and determine where it ends. Ultimately, however, the real problem is
    the enormous difficulty of defining normalization forms and
    equivalence.
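
    To make the parsing point concrete, here is a minimal Python sketch
    of a well-formedness check. It assumes only the twelve IDCs at
    U+2FF0..U+2FFB, with U+2FF2 and U+2FF3 taking three operands and the
    rest two; it is an illustration of the grammar, not anyone's actual
    tooling, and note that nothing in it bounds the nesting depth.

        # Arity table for the twelve IDCs: the two "three-part" operators
        # take three operands, the rest take two.
        IDC_ARITY = {chr(cp): (3 if cp in (0x2FF2, 0x2FF3) else 2)
                     for cp in range(0x2FF0, 0x2FFC)}

        def parse_ids(text, pos=0):
            """Return the index just past one complete IDS starting at pos,
            raising ValueError if the sequence is malformed or truncated."""
            if pos >= len(text):
                raise ValueError("ran off the end of the sequence")
            ch = text[pos]
            if ch in IDC_ARITY:
                end = pos + 1
                for _ in range(IDC_ARITY[ch]):   # recurse once per operand
                    end = parse_ids(text, end)
                return end
            # Any non-IDC character is treated as a terminal component here;
            # a real checker would restrict this to ideographs, radicals, etc.
            return pos + 1

        def is_well_formed_ids(text):
            try:
                return parse_ids(text) == len(text)
            except ValueError:
                return False

        # e.g. is_well_formed_ids("\u2FF0\u5973\u5B50") checks ⿰女子 -> True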

    For example, a normalization algorithm would first have to parse an
    IDS (or whatever) for validity and then make sure that all the
    pieces in it are "spelled" properly, that is, normalize each of the
    substrings. This would likely involve a huge list of known potential
    expansions for various forms.
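
    Sketched in the same spirit, such a normalizer would need a large,
    hand-maintained table of known expansions so that an encoded
    ideograph and its spelled-out IDS compare equal; the single entry
    below is only a sample to show the mechanism, not real property
    data.

        # Hypothetical expansion table; a real one would be enormous.
        EXPANSIONS = {
            "\u597D": "\u2FF0\u5973\u5B50",   # 好 -> ⿰女子 (sample entry)
        }

        def expand(text):
            """Recursively replace every character with a known expansion."""
            out = []
            for ch in text:
                out.append(expand(EXPANSIONS[ch]) if ch in EXPANSIONS else ch)
            return "".join(out)

        def equivalent(a, b):
            # Treat two sequences as equivalent if they expand to the same
            # fully decomposed form.
            return expand(a) == expand(b)

        # equivalent("\u597D", "\u2FF0\u5973\u5B50") -> True under this table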

    These problems are IMHO inherent to any scheme which attempts to
    provide a compositional model for encoding Han. (The IDCs and IDSs
    have the further known limitation of being inadequate to provide
    acceptable rendering.) This is a conclusion I come to most
    reluctantly, since I authored (years before the IDCs were added to the
    standard) a paper urging the IRG to adopt a compositional model and
    did a fair amount of leg-work on it.

    A compositional model for Han is *very* attractive given that it
    reflects the way the script works and the way that (most) new
    characters are coined. Unfortunately, the practical problems
    involved in getting that to work are much greater than they
    initially appear to be.

    Beyond the technical problems are the political problems of getting
    such a scheme adopted in WG2 without the approval of the PRC, and
    the PRC has shown itself enormously reluctant to move away from the
    approach of separately encoding each ideograph. If nothing else, the
    PRC (and other governmental bodies in the Far East) want to
    discourage people from coining new ideographs because of the
    headaches that creates.

    After all, the current set of encodable ideographs is largely the
    fault of that very same thing -- village chiefs making up a new
    ideograph for their town's name, or proud parents making up a new
    ideograph for their kid's name, or quirky authors deliberately (or
    accidentally) creating something new on the fly, or somebody creating
    a new taboo form for someone important. Leaving this set so fully
    open is a detriment to communication, not an aid, because there's no
    authoritative way to provide data on a character other than how to
    draw it. What does it mean? How is it pronounced? Who knows? It
    turns the Han script into an infinitely large set of dingbats.

    The biggest single gain in terms of the effort involved in encoding
    ideographs would derive from shifting to variation sequences for
    variants rather than attempting to encode them all separately. The
    second biggest gain would derive from insisting on stricter standards
    for data *about* an ideograph, such as its definition, pronunciation,
    and provenance.
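
    For what it's worth, a variation sequence is nothing more than a
    base ideograph followed by a variation selector; the pairing in the
    sketch below is illustrative only, since registering actual
    sequences is the business of the Ideographic Variation Database
    (UTS #37).

        # A variant is addressed as <base, selector>, not as a new code point.
        base = "\u845B"          # 葛, chosen only as an example base character
        vs17 = "\U000E0100"      # VARIATION SELECTOR-17
        sequence = base + vs17

        print([f"U+{ord(c):04X}" for c in sequence])   # ['U+845B', 'U+E0100']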

    I'm ccing the Unicode list, even though your last message was sent
    directly to me, because I'm not actually quoting anything in that
    message.

    =====
    John H. Jenkins
    jenkins@apple.com


