Re: Regulating PUA.

From: Philippe Verdy
Date: Wed Jan 24 2007 - 04:44:33 CST

    From: <>
    > Unicode has consistently rejected using this approach of putting two
    > Chinese characters together to make a new one, and insists each new
    > CJKV character must be encoded, even though this would cut down the
    > number of code points required dramatically. Most Chinese characters
    > are in fact made in this way (over 80% if one allows combinations
    > of combinations).

    I must acknowledge that this design choice, where the character model was tweaked horribly to match the desires of existing and past vendors, is somewhat flawed. It also makes it difficult to understand the position of the UTC and ISO WG2 regarding other scripts (Hebrew, Indic scripts) that are far more complicated to implement, and disadvantaged, because a much stricter character model was chosen for them instead.

    Some choices like this in the character model (Thai visual ordering, Hangul syllables...) at UTC (and at ISO WG2) are clearly inconsistent. They were made only to support legacy applications without any adaptation, clearly against the encoding policy, and are now perceived as severely limiting or devastating for the evolution of the standard. This is now a severe problem for rare scripts that are still not encoded, and that will be difficult to get widely supported in implementations.

    This is something that will, some day, block further evolution and put an end to the standard. It exposes a whole industry to the risk of a future major switch to a new standard, with unavoidable incompatibilities and high migration costs.

    Regarding Han, the current desire to keep ideographs encoded only at the level of the glyph square will not be maintainable (consistency problems have already occurred, with multiple encodings of the same square), simply because the composition of these ideograph squares was not documented.

    It was said that ideographs do not compose easily into squares. This may be true for some well-known blocks, but I do not think it is really the rule, so these exceptions could have been handled like ligatures. If Han had been encoded consistently, the decomposed model based on radicals would have been favored.

    In the same spirit, it would have been enough to encode Hangul with just the base jamos (as they are learnt at school), using a single syllable break character only where needed to make the distinction between final and leading consonants, plus reasonable default rules for the position of these syllable breaks. The whole Hangul script would have been encodable like a regular alphabet, something that was forgotten but that it really IS. Unicode and ISO have unnecessarily complicated what was really a very simple script, and have spent more than eleven thousand positions in the BMP just for precomposed Hangul syllables, instead of documenting a basic composition model which, for Hangul, is in fact very simple and extremely regular.
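
To illustrate how regular that composition is, here is a minimal Python sketch of the arithmetic the Unicode Standard itself documents for mapping conjoining jamos to precomposed syllables. It is not the syllable-break scheme proposed above (that scheme was never specified), and the function names compose_syllable and decompose_syllable are only illustrative.

```python
# Constants from the Unicode Standard's Hangul syllable composition arithmetic.
S_BASE = 0xAC00   # first precomposed syllable (U+AC00)
L_BASE = 0x1100   # first leading consonant jamo
V_BASE = 0x1161   # first vowel jamo
T_BASE = 0x11A7   # one before the first trailing consonant jamo
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"


def compose_syllable(lead, vowel, trail=None):
    """Map conjoining jamos (lead, vowel, optional trail) to one syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = 0 if trail is None else ord(trail) - T_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("not a modern leading consonant / vowel jamo")
    if trail is not None and not (1 <= t_index < T_COUNT):
        raise ValueError("not a modern trailing consonant jamo")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


def decompose_syllable(syllable):
    """Inverse mapping: one precomposed syllable back to its jamos."""
    s_index = ord(syllable) - S_BASE
    if not (0 <= s_index < L_COUNT * V_COUNT * T_COUNT):
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(s_index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    jamos = [chr(L_BASE + l_index), chr(V_BASE + v_index)]
    if t_index:
        jamos.append(chr(T_BASE + t_index))
    return "".join(jamos)


# The whole precomposed block is just the product of the three jamo counts.
TOTAL_SYLLABLES = L_COUNT * V_COUNT * T_COUNT  # 19 * 21 * 28
```

For example, compose_syllable("\u1112", "\u1161", "\u11AB") gives U+D55C, and decompose_syllable("\uD55C") returns the same three jamos. Everything in the precomposed block is derivable from 19 + 21 + 27 jamos by this one formula.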

