Re: Unified CJK characters in Unicode

From: jenkins (jenkins@apple.com)
Date: Thu Jun 05 1997 - 17:09:22 EDT


On 6/5/97 10:36 AM Unicode Discussion (unicode@unicode.org) wrote:

>Can someone with an understanding of Chinese sorting please
>provide information about the significance of Unified Han code points with
>respect to sorting rules in Asian countries? My understanding is that many
>unified code points actually have various sounds depending on their use
>within Asian text and also on the language the glyph is used for.
>

Basically, it is always wrong to sort Unihan by code point alone.

There are a number of different sorting techniques in common use in East
Asia, most notably phonetic and radical/stroke count. However,
characters have multiple pronunciations even within a single language,
and the radical/stroke counting varies from country to country, so
neither of these was sufficiently standard to base Unihan on. What was
agreed upon was the four-dictionary algorithm described in the book
(The Unicode Standard), which is universal, rote and mechanical -- but
not something appropriate for use in most real-life situations.
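
To make the point concrete, here is a rough Python sketch comparing code
point order with two pronunciation-based orders. The two reading tables are
hand-picked assumptions for illustration (one toneless reading per
character, six characters only); they are not real collation data, and a
real system would need a full tailored collation.

```python
# Illustrative only: tiny hand-picked tables, one reading per character.
chars = ["水", "中", "火", "人", "山", "大"]

# Hypothetical lookup tables (toneless romanizations, a single reading each).
MANDARIN = {"中": "zhong", "人": "ren", "大": "da",
            "山": "shan", "水": "shui", "火": "huo"}
JAPANESE_ON = {"中": "chuu", "人": "jin", "大": "dai",
               "山": "san", "水": "sui", "火": "ka"}

# Python compares strings by code point, so a plain sort is code point order.
by_code_point = sorted(chars)
# Sorting by a reading requires an explicit key for every character.
by_mandarin = sorted(chars, key=MANDARIN.__getitem__)
by_japanese = sorted(chars, key=JAPANESE_ON.__getitem__)

print("code point:", "".join(by_code_point))
print("Mandarin:  ", "".join(by_mandarin))
print("Japanese:  ", "".join(by_japanese))
```

All three orders come out different for the same six characters, and none
of the pronunciation orders can be derived from the code points themselves.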

>My current understanding suggests that the implications for sorting text in
>the various Unicode code pages are significant and could be a good
>reason for language tagging. I believe that there may be implications for
>text parsing tools which may wish to identify key words and phrases from
>non-spacing languages which make use of unified code points.
>

Language tagging alone doesn't solve the sorting problem. Even in
English, how "St." sorts depends on whether it's "Street" or "Saint," and
"Mc-" can sort in various ways. As for Japanese, knowing that it's
Japanese (as opposed to Chinese) doesn't help an awful lot for sorting
ideographs, which can have numerous wildly-different pronunciations.
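
One common workaround, sketched below in Python, is to store an explicit
sort key next to each displayed string, since neither the text nor a
language tag says which reading is meant. The Entry class and its "reading"
field are hypothetical names made up for this example, and the sample data
is invented.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    display: str  # the text as written
    reading: str  # hypothetical field: the intended expansion or pronunciation

entries = [
    Entry("Stone", "stone"),
    Entry("St. John", "saint john"),  # "St." meant as "Saint" here
    Entry("Smith", "smith"),
    Entry("Sanders", "sanders"),
]

# Sorting on the written form treats "St." as the letters S, t, period.
naive = sorted(entries, key=lambda e: e.display)
# Sorting on the supplied reading files "St. John" under "Saint".
by_reading = sorted(entries, key=lambda e: e.reading)

print([e.display for e in naive])
print([e.display for e in by_reading])
```

The same approach applies to Japanese ideographs: the reading has to be
supplied by whoever entered the data, because it cannot be recovered from
the characters alone.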

=====
John H. Jenkins
jenkins@apple.com
tseng@blueneptune.com
http://www.blueneptune.com/~tseng


