Re: Unified CJK characters in Unicode

From: jenkins (jenkins@apple.com)
Date: Thu Jun 05 1997 - 17:09:22 EDT

Next message: Tom Stern: "RE: Romanian characters Erare umanum est, perseverare diabolicum"
Previous message: Martin J. Duerst: "Feel, not sell!"
Maybe in reply to: Neil Walker: "Unified CJK characters in Unicode"

On 6/5/97 10:36 AM Unicode Discussion (unicode@unicode.org) wrote:

>Can someone with an understanding of Chinese sorting please
>provide information about the significance of Unified Han code points with
>respect to sorting rules in Asian countries. My understanding is that many
>unified code points actually have various sounds depending on their use
>within Asian text and also on the language the glyph is used for.
>

Basically, it is always wrong to sort Unihan by code point alone.

There are a number of different sorting techniques in common use in East
Asia, most notably phonetic and radical/stroke count. However,
characters have multiple pronunciations even within a single language,
and the radical/stroke counting varies from country to country, so
neither of these was sufficiently standard to base Unihan on. What was
agreed upon was the four-dictionary algorithm described in the book,
which is universal, rote and mechanical -- but not something appropriate
for use in most real-life situations.
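
To make the difference concrete, here is a minimal Python sketch comparing
code point order with a phonetic order. The readings table is hand-filled
for illustration and assumes a single toneless Mandarin reading per
character, which is itself a simplification; the function names are
hypothetical, not from any library.

```python
# Illustrative table: ONE common Mandarin reading per character (toneless
# pinyin). Real characters may have several readings; this is an assumption.
MANDARIN_READING = {
    "一": "yi",
    "三": "san",
    "中": "zhong",
    "人": "ren",
    "大": "da",
    "山": "shan",
}


def code_point_order(chars):
    """Sort by raw Unicode code point (the approach warned against above)."""
    return sorted(chars, key=ord)


def phonetic_order(chars, readings):
    """Sort by reading, using the code point only to break ties."""
    return sorted(chars, key=lambda ch: (readings[ch], ord(ch)))


chars = list(MANDARIN_READING)
by_code_point = code_point_order(chars)
by_reading = phonetic_order(chars, MANDARIN_READING)

print("code point:", "".join(by_code_point))
print("phonetic:  ", "".join(by_reading))
```

A Japanese or Cantonese readings table would give yet another order for
the same six characters, and a radical/stroke table a different one again.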

>My current understanding suggests that the implications for sorting text in
>the various Unicode code pages are significant and could be a good
>reason for language tagging. I believe that there may be implications for
>text parsing tools which may wish to identify key words and phrases from
>non spacing languages which make use of unified code points.
>

Language tagging alone doesn't solve the sorting problem. Even in
English, how "St." sorts depends on whether it's "Street" or "Saint," and
"Mc-" can sort in various ways. As for Japanese, knowing that it's
Japanese (as opposed to Chinese) doesn't help an awful lot for sorting
ideographs, which can have numerous wildly different pronunciations.
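
The "St." case can be sketched the same way. The following Python
fragment assumes each entry carries an optional, hand-supplied "sort as"
string (a hypothetical field, in the spirit of a library catalogue's
filing form); a language tag by itself cannot supply that information.

```python
# Each entry: (display form, optional hand-supplied filing form).
# The "sort as" values are assumptions made for this illustration.
ENTRIES = [
    ("Salem", None),
    ("St. Helena", "Saint Helena"),
    ("Stanford", None),
    ("Sydney", None),
]


def naive_key(entry):
    """Compare the display string as written."""
    display, _ = entry
    return display.casefold()


def filing_key(entry):
    """Prefer the explicit filing form when one was supplied."""
    display, sort_as = entry
    return (sort_as or display).casefold()


naive = [display for display, _ in sorted(ENTRIES, key=naive_key)]
filed = [display for display, _ in sorted(ENTRIES, key=filing_key)]

print("naive:", naive)
print("filed:", filed)
```

The ideograph case is the same problem on a larger scale: the reading
needed for the sort key has to come from somewhere other than the
character codes and the language tag.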

=====
John H. Jenkins
jenkins@apple.com
tseng@blueneptune.com
http://www.blueneptune.com/~tseng


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT