RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: vunzndi@vfemail.net
Date: Fri Nov 02 2007 - 22:30:23 CST

Next message: Bala: "RE: Re: Tamil Sri / Shri"

Previous message: James Kass: "Re: Tamil Sri / Shri"
In reply to: mpsuzuki@hiroshima-u.ac.jp: "RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"

> Hi,
>
> It may be too late to join the discussion about component-based
> encoding for CJKV ideographs, which stopped a week ago, but similar
> comments promoting component encoding as a good alternative for
> supporting the huge CJKV character collection may be posted in future.
> I think there are 2 typical problems in component-based encoding
> for CJKV ideographs but, unfortunately, I've never seen a
> proposal with precautions against them. If anybody knows of one,
> please let me know.
>
> 1. information interchange of "unified" ideograph.
> --------------------------------------------------
> For some ideographs, IDS is too "descriptive" to identify
> an ideograph whose shape varies under ISO/IEC 10646 Annex S.
> Unicode Standard 5.0 p. 429-430 explains that multiple IDSs
> can describe one ideograph and that there is no algorithm
> to check the equivalence of the characters described by 2 IDSs.
> I think one of the important policies in Unicode is that multiple
> expressions for a single character are not a good idea. Thus, using
> a code point is better for information interchange without
> ambiguity.
>
> For example, when the PRC, Taiwanese, Japanese, Korean and Vietnamese
> instances in the ISO/IEC 10646 five-column charts of the following
> characters are expressed by IDS, the expressions won't be the same:
> U+518E, U+5203, U+5205, U+5544, U+559A, U+55AD, U+55B6, U+55BA, U+55C2,
> U+5605, U+5629, U+5668, U+569D, U+56B3, U+570A, U+5832, U+5835,
> U+5840, U+58B7, etc.
>
Point taken; however, the unambiguous cases are far more numerous.
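
To make the "multiple IDSs for one ideograph" point concrete, here is a
minimal sketch. The example character (U+8A9E) and the helper name are
my own choices for illustration and do not come from the thread. It
only shows that two valid descriptions of one character differ as
strings and that Unicode normalization does not reconcile them; it does
not model the Annex S glyph variation discussed above.

```python
import unicodedata

# Two IDS strings that both describe U+8A9E (illustrative choice).
IDS_SHALLOW = "\u2FF0\u8A00\u543E"           # left-right: U+8A00 + U+543E
IDS_DEEP = "\u2FF0\u8A00\u2FF1\u4E94\u53E3"  # U+543E expanded to above-below: U+4E94 + U+53E3


def same_after_normalization(a: str, b: str) -> bool:
    """Compare two strings under NFC (hypothetical helper).

    IDS sequences have no canonical equivalence, so NFC leaves them as they are.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_after_normalization(IDS_SHALLOW, IDS_DEEP))  # False
```

Any equivalence check therefore has to come from outside the encoding,
for example from an agreed decomposition table.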

> If IDS is expected to be useful for information interchange,
> these ideographs should not be over-decomposed. In
> Kawabata-san's database, these characters have multiple IDS
> expressions for the individual instances in the ISO/IEC 10646
> five-column charts. As long as there is no standard for evaluating
> the equality of these multiple IDS expressions, these characters
> should not be over-decomposed. But the instances in ISO/IEC 10646 are
> not a complete collection of unifiable ideographs. So, again, it is
> difficult to list all the characters whose IDS decomposition should
> be restricted. I guess Kawabata-san wants people to learn the UCS
> unification rules and refrain from over-differentiating "new"
> ideographs (e.g. "this character is not coded yet, I want to
> display this character, I cannot find existing fonts").
> But I doubt that an educational approach can block such
> requests.
>

Mr Kawabata's work has a particular purpose; not all of his approaches
are equally applicable to this thread.

In practice, some sort of registry of IDS would be a good idea. This
would help developers and font makers. A registry could, among other
things, note 'unsafe' IDS and 'safe' IDS. In fact, if one uses only
precomposed glyphs, then that set is one's safe list.
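
As a rough sketch of what such a list of 'safe' and 'unsafe' IDS might
look like in code, assuming nothing more than a lookup table keyed by
the IDS string: the class name, its methods and both entries below are
hypothetical, and the safe/unsafe labels are arbitrary examples, not
real classifications.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class IdsRegistry:
    """Toy registry mapping an IDS string to a (status, note) pair."""
    entries: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def register(self, ids: str, safe: bool, note: str = "") -> None:
        self.entries[ids] = ("safe" if safe else "unsafe", note)

    def status(self, ids: str) -> str:
        # Unregistered sequences are reported as such rather than guessed at.
        return self.entries.get(ids, ("unregistered", ""))[0]


registry = IdsRegistry()
# Hypothetical entries, for illustration only.
registry.register("\u2FF0\u8A00\u543E", safe=True, note="hypothetical entry")
registry.register("\u2FF0\u8A00\u2FF1\u4E94\u53E3", safe=False,
                  note="hypothetical: over-decomposed")

print(registry.status("\u2FF0\u8A00\u543E"))  # safe
```

A real registry would also need an agreed authority and a versioning
policy, which this sketch leaves out.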

Even limited use of a compositional model would save a large number of
code points. Take for example the mouth radical, basically a small box
shape, placed on the left-hand side of a character. Over 900 characters
in Extension B are a combination of a mouth radical on the left and an
encoded character on the right.

I have before me a set of about 5000 unencoded characters, of which
242 (about 5%) consist of a mouth radical on the left and an encoded
character on the right.

Extension B similarly has over 300 characters that consist of
U+4EBB 亻, the person radical, on the left plus an encoded character
on the right. The unencoded set above has 94 such characters.
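
The left-plus-right pattern and the proportions above can be sketched
as follows. The function names are hypothetical, the right-hand
component U+4E94 is an arbitrary example, and the total of 5000 is the
approximate figure quoted above, so the percentages are approximate
too.

```python
IDC_LEFT_RIGHT = "\u2FF0"  # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
MOUTH = "\u53E3"           # mouth radical
PERSON = "\u4EBB"          # person radical, left-hand form


def left_right_ids(left: str, right: str) -> str:
    """Build a two-component left-to-right IDS (hypothetical helper)."""
    return IDC_LEFT_RIGHT + left + right


def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`."""
    return 100.0 * count / total


# Arbitrary right-hand component, for illustration only.
print(left_right_ids(MOUTH, "\u4E94"))
print(round(share(242, 5000), 2))  # 4.84 (mouth radical on the left)
print(round(share(94, 5000), 2))   # 1.88 (person radical on the left)
```

Taken together, the two patterns account for 336 of the roughly 5000
characters, a little under 7%.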



This archive was generated by hypermail 2.1.5 : Fri Nov 02 2007 - 22:32:42 CST