Re: Level of Unicode support required for various languages

From: James Kass (thunder-bird@earthlink.net)
Date: Fri Oct 26 2007 - 00:20:39 CDT


     
    John H. Jenkins wrote,

    > There is actually considerable room for improvement.

    There is always room for improvement in any system.
     
    > First of all, the experience of Extension C showed that there was a
    > serious QA problem in the IRG. The amount of effort involved in
    > identifying unifiable pairs entirely by hand left the whole process
    > error-prone. This has largely been corrected with Extension D work.

    To save those unfamiliar with the abbreviation the trouble of looking
    it up, "QA" means quality assurance in this case. During the review
    period and prior to formal encoding of Extension C, some problems with
    Ext. C were brought to the attention of IRG. IRG responded admirably,
    resulting in better QC (quality control) for future work.

    This goes to show that public review periods are essential; they might
    even be considered part of the QA/QC process.

    > Secondly, the whole issue of "distinct ideographs" is getting nastier
    > and nastier as the IRG has to deal with increasingly rare characters
    > of uncertain provenance and meaning. So long as the IRG continues to
    > treat each "distinct" ideograph as something that needs independent
    > encoding, this is going to be a problem that plagues us.

    As you may know, I've been studying and trying to get a solid
    understanding of CJK unification. Something I'm having trouble
    grasping is why identical/otherwise-unifiable pairs are considered
    non-unifiable if they come from two different sources with two
    apparently different meanings. After all, in UNIHAN.TXT there
    are many single characters with more than one definition, just
    as there are many English words with more than one meaning.

    (Examples exist, like U+3ADA (㫚) and U+66F6 (曶).)
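
    For anyone who wants to look at that pair on their own machine, here
    is a minimal sketch in Python using only the standard unicodedata
    module. The helper names (describe, stays_distinct) are made up for
    illustration. It only shows that the two characters are separately
    encoded and that normalization does not fold them together; it says
    nothing about why IRG kept them apart.

```python
import unicodedata

# The two code points cited above as an example pair.
PAIR = (0x3ADA, 0x66F6)


def describe(cp: int) -> str:
    """Return 'U+XXXX <char> <character name>' for one code point."""
    ch = chr(cp)
    return f"U+{cp:04X} {ch} {unicodedata.name(ch)}"


def stays_distinct(a: int, b: int) -> bool:
    """True if the two code points still differ after NFKC normalization."""
    norm_a = unicodedata.normalize("NFKC", chr(a))
    norm_b = unicodedata.normalize("NFKC", chr(b))
    return norm_a != norm_b


if __name__ == "__main__":
    for cp in PAIR:
        print(describe(cp))
    print("distinct after NFKC:", stays_distinct(*PAIR))
```

    The definitions themselves are not in unicodedata; for those one
    still has to consult the kDefinition entries in UNIHAN.TXT.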

    So, if a rare character has uncertain provenance and meaning, but
    it is unifiable, shouldn't it just be unified? And, if that character
    is not unifiable, but it exists in texts (however obscure) that
    someone may wish to reproduce electronically (for posterity,
    perhaps), shouldn't it be encoded?
     
    > If, for example, we'd had the concept of variant selectors an
    > established part of the standard during the Extension B work, the IRG
    > could have saved literally thousands of code points which are now
    > dedicated to obscure variants found in the Hanyu Da Zidian. If we
    > abandon the idea that every distinct ideograph requires separate
    > encoding, we could speed up the whole process, improve the quality of
    > work, and -- most important -- make implementation much simpler.

    We seem to have drifted off-topic for this thread. I thought
    about changing the thread title to "CJK unification and variation
    selectors", but that might get me started on VS characters again.

    Is it really possible to speed up the process of encoding an
    open-ended set?

    Best regards,

    James Kass


