RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

Date: Fri Nov 02 2007 - 22:30:23 CST


    > Hi,
    > It may be too late to join the discussion about component based
    > encoding for CJKV ideographs, which stopped a week ago, but similar
    > comments promoting component encoding as a good alternative for
    > supporting the huge CJKV character collection may be posted in future.
    > I think there are 2 typical problems in component based encoding
    > for CJKV ideographs but, unfortunately, I've never seen a
    > proposal with precautions against them. If anybody knows of one,
    > please let me know.
    > 1. Information interchange of "unified" ideographs.
    > --------------------------------------------------
    > For some ideographs, IDS is too "descriptive" to identify
    > an ideograph whose shape varies under ISO/IEC 10646 Annex S.
    > The Unicode Standard 5.0, pp. 429-430, explains that multiple IDSs
    > can describe one ideograph and that there is no algorithm
    > to check the equivalence of the characters described by 2 IDSs.
    > I think one of the important policies in Unicode is that multiple
    > expressions for a single character are not a good idea. Thus, using
    > a code point is better for information interchange without
    > ambiguity.
    > For example, when the PRC, Taiwanese, Japanese, Korean and Vietnamese
    > instances in the five columns of ISO/IEC 10646 for the following
    > characters are expressed by IDS, the expressions won't be the same:
    > U+518E, U+5203, U+5205, U+5544, U+559A, U+55AD, U+55B6, U+55BA, U+55C2,
    > U+5605, U+5629, U+5668, U+569D, U+56B3, U+570A, U+5832, U+5835,
    > U+5840, U+58B7, etc etc.
    Point taken; however, the unambiguous cases are far more numerous.
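
    To make the quoted concern concrete, here is a minimal Python sketch of
    how one ideograph can have several IDS spellings that do not compare
    equal as strings. The character used (U+6E56 湖) and its three spellings
    are my own illustration, not taken from the list of code points quoted
    above or from any registry, and the helper name is hypothetical.

```python
import unicodedata

# Three plausible IDS spellings of U+6E56 湖 (illustrative assumption only).
IDS_A = "\u2FF0\u6C35\u80E1"              # ⿰氵胡
IDS_B = "\u2FF0\u6C35\u2FF0\u53E4\u6708"  # ⿰氵⿰古月
IDS_C = "\u2FF2\u6C35\u53E4\u6708"        # ⿲氵古月


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two IDS strings after Unicode NFC normalization.

    Normalization does not rewrite IDS structure, so different spellings
    of the same ideograph still compare as different strings.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    for x, y in [(IDS_A, IDS_B), (IDS_A, IDS_C), (IDS_B, IDS_C)]:
        print(x, y, same_after_nfc(x, y))
```

    Any equivalence check would need a decomposition table and agreed rules
    on top of this, which is exactly what is not standardized.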

    > If IDS is expected to be useful for information interchange,
    > these ideographs should not be over-decomposed. In
    > Kawabata-san's database, these characters have multiple IDS
    > expressions for the individual instances in ISO/IEC 10646's
    > five columns. As long as there's no standard to evaluate the equality
    > of these multiple IDS expressions, these characters should not
    > be over-decomposed. But the instances in ISO/IEC 10646 are not
    > a complete collection of unifiable ideographs. So, again, it's
    > difficult to list all characters whose IDS decomposition should
    > be restricted. I guess Kawabata-san wants people to learn the UCS
    > unification rules and refrain from over-differentiation of "new"
    > ideographs (e.g. "this character is not coded yet, I want to
    > display this character, I cannot find existing fonts").
    > But I doubt that the educational approach can block such
    > requests.

    Mr Kawabata's work has a particular purpose; not all of his approaches
    are equally applicable to this thread.

    In practice some sort of registry of IDS would be a good idea. This
    would help developers and font makers. A registry could, among other
    things, note 'unsafe' IDS and 'safe' IDS. In fact, if one uses only
    precomposed glyphs, then that set is one's safe list.
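
    As a rough sketch of what such a registry might look like in code,
    assuming nothing more than a lookup from an IDS string to a 'safe' or
    'unsafe' label: the class name, method names, and the sample entry
    below are all hypothetical, and the label given to the sample entry is
    for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class IdsRegistry:
    """Hypothetical registry mapping an IDS string to a status and a note."""

    entries: dict = field(default_factory=dict)

    def register(self, ids: str, status: str, note: str = "") -> None:
        # Only the two labels mentioned in the message are accepted.
        if status not in ("safe", "unsafe"):
            raise ValueError("status must be 'safe' or 'unsafe'")
        self.entries[ids] = (status, note)

    def status(self, ids: str) -> str:
        # Sequences nobody has registered are reported as such,
        # rather than being guessed safe or unsafe.
        entry = self.entries.get(ids)
        return entry[0] if entry else "unregistered"


if __name__ == "__main__":
    registry = IdsRegistry()
    # Illustrative entry only; the label is not a real assessment.
    registry.register("\u2FF0\u53E3\u5408", "safe", "left-right, two parts")
    print(registry.status("\u2FF0\u53E3\u5408"))
    print(registry.status("\u2FF0\u53E3\u53E3"))
```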

    Even limited use of a compositional model would save a large number of
    code points. Take for example the mouth radical, basically a small box
    shape, placed on the left-hand side of a character. Over 900 characters
    in Extension B are a combination of a mouth radical on the left and an
    encoded character on the right.

    I have before me a set of about 5000 unencoded characters, of which
    242 (about 5%) are a left-hand mouth radical plus an encoded character
    on the right.

    Ext B similarly has over 300 characters that consist of U+4EBB 亻, the
    person radical, on the left plus an encoded character on the right. The
    unencoded set above has 94.
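
    The limited compositional model described here, a left-hand radical
    plus one encoded character, can be sketched in a few lines of Python.
    The function names are hypothetical, U+53E3 口 is my assumption for the
    code point of the mouth radical, and the total of 5000 is the
    approximate figure given above, so the percentages are approximate too.

```python
IDC_LEFT_RIGHT = "\u2FF0"  # ⿰ IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
MOUTH = "\u53E3"           # 口 (assumed code point for the mouth radical)
PERSON = "\u4EBB"          # 亻 (U+4EBB, as cited in the message)


def left_right_ids(left: str, right: str) -> str:
    """Build the IDS for 'left component + right component'."""
    return IDC_LEFT_RIGHT + left + right


def split_simple_left_right(ids: str):
    """Return (left, right) for a three-code-point ⿰ sequence, else None."""
    if len(ids) == 3 and ids[0] == IDC_LEFT_RIGHT:
        return ids[1], ids[2]
    return None


def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, to two decimals."""
    return round(100 * count / total, 2)


if __name__ == "__main__":
    print(left_right_ids(MOUTH, "\u5408"))  # mouth + 合
    print(share(242, 5000))                 # mouth-radical cases
    print(share(94, 5000))                  # person-radical cases
```

    Only the simplest two-part sequences are handled; nested descriptions
    would need a real parser.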

    This archive was generated by hypermail 2.1.5 : Fri Nov 02 2007 - 22:32:42 CST