RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: vunzndi@vfemail.net
Date: Fri Nov 02 2007 - 22:30:23 CST

    Quoting mpsuzuki@hiroshima-u.ac.jp:

    > Hi,
    >
    > It may be too late to join the discussion about component-based
    > encoding for CJKV ideographs, which stopped a week ago, but similar
    > comments promoting component encoding as a good alternative for
    > supporting the huge CJKV character collection may be posted in future.
    > I think there are two typical problems in component-based encoding
    > for CJKV ideographs, but unfortunately I have never seen a
    > proposal with precautions against them. If anybody knows of one,
    > please let me know.
    >
    > 1. Information interchange of "unified" ideographs.
    > --------------------------------------------------
    > For some ideographs, an IDS is too "descriptive" to identify
    > an ideograph whose shape varies under ISO/IEC 10646 Annex S.
    > Unicode Standard 5.0 p. 429-430 explains that multiple IDSs
    > can describe one ideograph and that there is no algorithm
    > to check the equivalence of the characters described by two IDSs.
    > I think one of the important policies in Unicode is that multiple
    > expressions for a single character are not a good idea. Thus, using
    > a code point is better for unambiguous information
    > interchange.
    >
    > For example, when the PRC, Taiwanese, Japanese, Korean and Vietnamese
    > instances in the ISO/IEC 10646 five-column charts of the following
    > characters are expressed by IDS, the expressions won't be the same:
    > U+518E, U+5203, U+5205, U+5544, U+559A, U+55AD, U+55B6, U+55BA, U+55C2,
    > U+5605, U+5629, U+5668, U+569D, U+56B3, U+570A, U+5832, U+5835,
    > U+5840, U+58B7, etc etc.
    >
    Point taken; however, the unambiguous cases are far more numerous.

    > If IDS is expected to be useful for information interchange,
    > these ideographs should not be over-decomposed. In
    > Kawabata-san's database, these characters have multiple IDS
    > expressions, one for each of the instances in the ISO/IEC 10646
    > five-column charts. As long as there is no standard for evaluating
    > the equality of these multiple IDS expressions, these characters
    > should not be over-decomposed. But the instances in ISO/IEC 10646
    > are not a complete collection of unifiable ideographs. So, again, it is
    > difficult to list all the characters whose IDS decomposition should
    > be restricted. I guess Kawabata-san wants people to learn the UCS
    > unification rules and refrain from over-differentiation of "new"
    > ideographs (e.g. "this character is not coded yet, I want to
    > display this character, I cannot find existing fonts").
    > But I doubt that the educational approach can block such
    > requests.
    >

    Mr Kawabata's work has a particular purpose; not all of his approaches
    are equally applicable to this thread.

    In practice some sort of registry of IDS would be a good idea. This
    would help developers and font makers. A registry could, among other
    things, note 'unsafe' IDS and 'safe' IDS. In fact, if one only uses
    precomposed glyphs, then this is in fact one's safe list.
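
    A minimal sketch of what such a registry might look like. Everything
    here is an assumption made for illustration: the IdsRegistry class, its
    method names, and the rule that a character with exactly one recorded
    IDS counts as 'safe' while one with several counts as 'unsafe'. The
    two entries for U+6E56 show one character described by two different
    sequences, which plain string comparison cannot recognise as equal.

```python
from collections import defaultdict

# Hypothetical status labels; the message only speaks of 'safe' and 'unsafe' IDS.
SAFE, UNSAFE, UNREGISTERED = "safe", "unsafe", "unregistered"


class IdsRegistry:
    """Toy registry mapping a character to every IDS recorded for it."""

    def __init__(self):
        self._ids_by_char = defaultdict(set)

    def register(self, char, ids):
        """Record one IDS string as a description of `char`."""
        self._ids_by_char[char].add(ids)

    def status(self, char):
        # Assumption: one recorded IDS means 'safe', several mean 'unsafe'.
        known = self._ids_by_char.get(char, set())
        if not known:
            return UNREGISTERED
        return SAFE if len(known) == 1 else UNSAFE


registry = IdsRegistry()
registry.register("\u53eb", "\u2ff0\u53e3\u4e29")        # U+53EB = left-right: mouth + U+4E29
registry.register("\u6e56", "\u2ff0\u6c35\u80e1")        # U+6E56 = left-right: water + U+80E1
registry.register("\u6e56", "\u2ff2\u6c35\u53e4\u6708")  # U+6E56 = left-middle-right: water + U+53E4 + U+6708

for char in ("\u53eb", "\u6e56", "\u4ed6"):
    print(f"U+{ord(char):04X}", registry.status(char))
```

    A real registry would need a policy for deciding which of several
    sequences is preferred; this sketch only flags that a choice exists.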

    Even limited use of a compositional model would save a large number of
    code points. Take for example the mouth radical, basically a small box
    shape, placed on the left-hand side of a character. Over 900 characters
    in Extension B are a combination of a mouth radical on the left and an
    encoded character on the right.

    I have before me a set of about 5000 unencoded characters, of which
    242 (about 5%) are a left-hand mouth radical plus an encoded character
    on the right.

    Extension B similarly has over 300 characters that are U+4EBB 亻, the
    person radical, on the left plus an encoded character on the right. The
    unencoded set above has 94 such characters.
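
    The kind of count given above could be reproduced roughly as follows,
    assuming one has a table mapping each character to a single IDS. The
    function name and the six-entry SAMPLE table are hypothetical and only
    illustrate the test: the sequence must be U+2FF0 (left-to-right),
    then the radical, then exactly one further character that is not
    itself a description character.

```python
LEFT_RIGHT = "\u2ff0"  # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
MOUTH = "\u53e3"       # mouth radical
PERSON = "\u4ebb"      # person radical, U+4EBB


def is_idc(ch):
    """True if ch lies in the Ideographic Description Characters block."""
    return "\u2ff0" <= ch <= "\u2fff"


def count_left_radical(ids_table, radical):
    """Count entries of the form: LEFT_RIGHT + radical + one plain character."""
    count = 0
    for ids in ids_table.values():
        # Python strings are sequences of code points, so len() is safe
        # even for components outside the BMP.
        if (len(ids) == 3 and ids[0] == LEFT_RIGHT
                and ids[1] == radical and not is_idc(ids[2])):
            count += 1
    return count


# Illustrative sample only; a real run would load a full IDS table.
SAMPLE = {
    "\u53eb": LEFT_RIGHT + MOUTH + "\u4e29",   # U+53EB
    "\u5403": LEFT_RIGHT + MOUTH + "\u4e5e",   # U+5403
    "\u4ed6": LEFT_RIGHT + PERSON + "\u4e5f",  # U+4ED6
    "\u4eec": LEFT_RIGHT + PERSON + "\u95e8",  # U+4EEC
    "\u6e56": LEFT_RIGHT + "\u6c35\u80e1",     # U+6E56, water radical on the left
    "\u54c1": "\u2ff1" + MOUTH + "\u5405",     # U+54C1, above-below, not left-right
}

mouth_count = count_left_radical(SAMPLE, MOUTH)
person_count = count_left_radical(SAMPLE, PERSON)
print(mouth_count, person_count, f"{mouth_count / len(SAMPLE):.0%}")
```

    The same ratio applied to the figures above gives 242 / 5000 = 4.84%,
    which is the "about 5%" quoted.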



