Re: Digraphs as Distinct Logical Units

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Aug 08 2002 - 16:58:01 EDT


Roozbeh,

> On Thu, 8 Aug 2002, Kenneth Whistler wrote:
>
> > That is true, but for those ligatures as well, the compatibility
> > decomposition is not actually useful in implementation.
>
> They help implementations in sorting, searching, comparing,

Compatibility decompositions are part of the input into defining
default collation weights for Unicode characters -- but only *part*
of the input. Expecting the compatibility decompositions to
serve this purpose effectively is overvaluing what they can actually
do. The issues of sorting, searching, comparing should come out
of the UCA (and tailorings thereof) -- not be directly based on
compatibility decompositions in UnicodeData.txt.

> providing
> backup rendering when they lack the glyph,

This seems unlikely to be particularly helpful in this *particular*
case.

> reading a text stream aloud,
> and things like that,

And this requires much more than just some raw access to an
NFKD normalization of the text stream to make any sense, for
any real application.

> a compatibility decomposition helps make life
> easier. Is there anything special about this character which makes it
> different from JALLAJALALOUHOU, for example?

Yes. JALLAJALALOUHOU was encoded (without a decomposition specified)
in 1993. A compatibility decomposition was specified in 1996, in
an effort then to make Unicode 2.0 consistent in its treatment of
Arabic ligatures. Implementation practice since then has suggested
that compatibility decompositions for these Arabic word ligatures
used symbolically are not much help -- and if anything just provoke
edge case failures for implementations. (U+FDFA has to be in
everybody's test cases for normalization, since it has a ridiculous
18-character decomposition.)
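
To make the edge case concrete, here is a minimal sketch using Python's
standard unicodedata module. It expands the two word ligatures named above
under NFKD and reports how many code points each becomes. The helper name
compat_decomposition is hypothetical, and the results depend on the Unicode
data shipped with the Python build in use.

```python
import unicodedata


def compat_decomposition(ch: str) -> str:
    """Return the NFKD (compatibility-decomposed) form of one character.

    Hypothetical helper for illustration; it only wraps the standard
    library's unicodedata.normalize().
    """
    return unicodedata.normalize("NFKD", ch)


# The two Arabic word ligatures discussed in this thread.
SALLALLAHOU = "\uFDFA"      # the one with the 18-character decomposition
JALLAJALALOUHOU = "\uFDFB"

for ch in (SALLALLAHOU, JALLAJALALOUHOU):
    expanded = compat_decomposition(ch)
    # One input code point turns into len(expanded) code points, so any
    # buffer sized on a "small growth" assumption is at risk here.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"1 code point -> {len(expanded)} under NFKD")

# Canonical normalization leaves the ligature alone; only the
# compatibility forms (NFKD, NFKC) expand it.
assert unicodedata.normalize("NFC", SALLALLAHOU) == SALLALLAHOU
```

A one-to-many expansion of this size is the reason the character belongs in
normalization test suites: code that assumes a small fixed growth factor
tends to fail on it.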

The BISMALLAH ARRAHMAN ARRAHIM will be encoded in 2003 (I predict), and
since implementation practice suggests that having a compatibility
decomposition for these things is generally more trouble than it is
worth, given the usage of the characters, the UTC is unlikely
to specify a decomposition.

> We need consistency, at the
> minimum.
>
> > Yes, it is for transcoding -- with a Pakistani standard for Urdu.
>
> The same Pakistani standard has a character for "No Vowel". Things that
> appear after each letter meaning that they don't carry any vowels! They
> have this to help them do *binary* sorting, so they won't need to worry
> about the vowels being in the second level of importance. Would you encode
> this in Unicode?

Nope. The UTC wouldn't do it, nor have the Pakistani delegates
working with the UTC and WG2 asked for it.

>
> Have you read the UZT? It simply proves why you need a cutoff date. To
> stop people from encoding things that would make Unicode a mess.

Not everything that gets into a national standard gets into Unicode.

Characters like the BISMALLAH ARRAHMAN ARRAHIM that meet obvious
local requirements and make implementation sense are acceptable
to the UTC. (And the fact that they might have been part of a
national standard created for whatever reason -- including political --
is beside the point.) Characters that make no architectural sense
for Unicode, on the other hand, simply won't get in; having them in
a national standard at this point doesn't grease the skids.

> UZT
> doesn't follow ISO 8859 or Unicode principles. It is the output of a
> committee trying to implement democracy rather than technical excellence.
> What other kind of committee would have imposed such a "No Vowel"
> character to make all various aspects of text processing harder and make
> hackish sorting easier?
>
> UZT is there also to make a point: that Urdu computing is different (from
> whatever you are thinking about)! National pride, I'll call it.

Which doesn't change the fact that Pakistan has brought forward
some characters whose justification seems sufficient for inclusion
in Unicode.

>
> > Important examples: GBK (later GB18030) in China, and JIS X 0213 in
> > Japan. You'll find many characters in Unicode that got there for
> > compatibility with those two (recent) standards.
>
> Hear this all you people out there with some power in your national
> standards institute? This is the green light! You can thicken your walls
> for some more years. Japan and China have been doing this for years, and
> Pakistan and Cambodia are also in the ballots. Why not your country? You
> only need to insist a little.
>
> Sorry to be so rude. I didn't want to say these things this way, but the
> words just came out. I'm fighting with this inside my own country, and
> I've succeeded in stopping them from submitting a single insane proposal to UTC
> (they wanted to ask UTC to disunify all Persian letters from the Arabic
> ones, for example, just because of national pride). It hurts when UTC
> surrenders so easily.

You are right to push hard against nonsensical approaches, such as
you mention here for Persian.

But I think you may be overestimating how much caving in is going on here.
The UTC is still pushing back on another proposal to disunify Urdu
digits, for example -- those did *not* get accepted by WG2, nor do I
expect they will pass muster in future UTC meetings.

Doug Ewell already mentioned some of the nonsensical characters that
were rejected out of proposals based on the DPRK standard -- as another
example.

The fact that the UTC and WG2 are responsive to well-argued proposals
from various constituent communities -- sometimes based on new
standards, and sometimes based on whole-cloth new material with
no standards justification at all -- is a necessary part of their
responsibility as maintainers of character encoding standards
intended to have universal applicability. That doesn't mean that
there is a green light for any kind of nonsense created by any
kind of bozo to get encoded as characters. Ask some of the
proposers just how much effort (extended over how long a sustained
period) has to be devoted to actually getting characters added
to the standards.

--Ken


