Re: combining marks vs IDS (was: Why people still want to encode precomposed letters)

Date: Thu Nov 20 2008 - 06:34:09 CST


    Quoting "Julian Bradfield":

    > One thing I don't really understand is the basis for the difference of
    > approach between alphabetic(-ish) and Han.
    > The UTC has said, no more precomposed characters.
    > On the other hand, the IRG is still encoding more and more obscure
    > hanzi, although surely the vast majority of them are describable using
    > ideographic description sequences, mostly in a canonical way. (And for
    > those characters with two equally obvious decompositions, I'm sure one
    > could impose a reasonable canonicalization criterion to choose one.)

    IDS are by definition not combining characters; treating them as
    combining would make them effectively stateful, which is a route
    Unicode does not wish to follow. Therefore, in Unicode terms, CJK
    characters cannot be decomposed using

    Actually, it would be wrong to say that the newer characters are all
    more and more obscure. They are characters not found in the larger
    dictionaries. These include place names, surnames, and characters
    used in various dialects. Characters are processed by the IRG on a
    first-come, first-served basis; therefore "really obscure" characters
    submitted in the 1990s are already encoded, whereas some useful
    "everyday" characters have yet to be encoded. Since Extension B at the
    turn of the century, the average time to process a proposal of new
    CJK characters has effectively become 12 years.

    Some additions to CJK characters are effectively like adding a new
    script to Unicode. For example, take an area that I know a little
    about: the CJK characters used by the Zhuang. Zhuang, the mother
    tongue of over 10 million people, a people group ranked
    fifty-something by size in the world, is traditionally written using
    CJK ideographs, yet these characters have still to be systematically
    encoded. When they are eventually added as CJK ideographs, each name
    will just be CJK ideograph U+XXXXX, the significance hidden by the
    naming convention.

    Nevertheless, this does make CJK ideographs an open-ended set. If one
    limits the number of components to, say, 300, and four components
    make a character, there are then 300 x 300 x 300 x 300 =
    8,100,000,000 possible characters, 0.001% of which is 81,000
    characters, close to the current Unicode count. This illustrates that
    the combinations of components people actually use are rather
    limited. This raises the question of whether allowing unlimited
    combinations is an appropriate model.
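
    The arithmetic in the paragraph above can be rechecked with a short
    sketch. The 300 components, the four slots per character, and the
    81,000 figure are the rough assumptions used in this message, not
    exact Unicode data, and the variable names are purely illustrative.

```python
# Back-of-the-envelope count under the message's own assumptions:
# 300 components, exactly four component slots per character.
COMPONENTS = 300
SLOTS = 4
ENCODED_ESTIMATE = 81_000  # rough figure from the message, not an exact count

# Every slot can hold any component, so the slots multiply.
possible = COMPONENTS ** SLOTS

# Share of the combinatorial space that the estimate would occupy, in percent.
share_percent = ENCODED_ESTIMATE / possible * 100

print(f"{possible:,} possible combinations")
print(f"{ENCODED_ESTIMATE:,} characters is {share_percent:.3f}% of them")
```

    Raising either assumed number makes the used share smaller still, so
    the point does not depend on the exact values chosen.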

    Yes, if CJK ideographs had been encoded as composites, it would have
    made the encoding process much easier, but everything else would have
    been more work. CJK are in some respects the exception that proves
    the rule.

    > Why are IDSes seen as a stop-gap measure until the described hanzi is
    > separately encoded, whereas combining diacritics are seen as the
    > definitive way to do things?

    Please note that, as stated above, IDS do not combine; they are
    legacy rather than stop-gap.
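
    One way to see the difference is a small sketch using Python's
    standard unicodedata module. The sample strings are my own choice
    for illustration: an IDS describing U+597D, and the letter e followed
    by a combining acute accent. The helper name summarize is
    hypothetical.

```python
import unicodedata

# Illustrative samples (chosen for this sketch, not taken from the thread).
IDS = "\u2FF0\u5973\u5B50"  # IDC LEFT TO RIGHT + two ideographs, describing U+597D
DIACRITIC = "e\u0301"       # e + COMBINING ACUTE ACCENT


def summarize(text):
    """Return (code point, general category, combining class) for each character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.combining(c))
        for c in text
    ]


# Normalization composes the diacritic sequence but leaves the IDS alone.
ids_nfc = unicodedata.normalize("NFC", IDS)
diacritic_nfc = unicodedata.normalize("NFC", DIACRITIC)

print(summarize(IDS))
print(summarize(DIACRITIC))
print(len(IDS), len(ids_nfc))              # same length before and after
print(len(DIACRITIC), len(diacritic_nfc))  # shorter after composition
```

    The description characters are ordinary symbols with combining class
    0, so an IDS stays a sequence of separate characters and is never
    equivalent to the encoded ideograph it describes.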

    John Knightley

