Re: Unified CJK characters in Unicode

From: jenkins (
Date: Thu Jun 05 1997 - 17:09:22 EDT

On 6/5/97 10:36 AM Unicode Discussion ( wrote:

>Can someone with an understanding of Chinese sorting please
>provide information about the significance of Unified Han code points with
>respect to sorting rules in Asian countries? My understanding is that many
>unified code points actually have various sounds depending on their use
>within Asian text and also on the language the glyph is used for.

Basically, it is always wrong to sort Unihan by code point alone.

There are a number of different sorting techniques in common use in East
Asia, most notably phonetic and radical/stroke count. However,
characters have multiple pronunciations even within a single language,
and the radical/stroke counting varies from country to country, so
neither of these was sufficiently standard to base Unihan on. What was
agreed upon was the four-dictionary algorithm described in the book,
which is universal, rote and mechanical -- but not something appropriate
for use in most real-life situations.
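
To make the difference concrete, here is a minimal Python sketch comparing
code point order with a radical/stroke order. The table is hand-made and
purely illustrative: the (radical number, residual stroke count) values are
assumptions that I believe match the usual dictionary assignments, and a real
implementation would load them from a dictionary or a character database and
pick the convention of the target country. The names RADICAL_STROKE,
by_code_point and by_radical_stroke are hypothetical.

```python
# Illustrative table: character -> (radical number, residual stroke count).
# Values are assumptions for this sketch, not authoritative data.
RADICAL_STROKE = {
    "\u4e00": (1, 0),  # 一
    "\u4e01": (1, 1),  # 丁
    "\u3400": (1, 4),  # 㐀 (encoded below the main ideograph block)
    "\u4eba": (9, 0),  # 人
}


def by_code_point(chars):
    """Sort by raw code point value only."""
    return sorted(chars, key=ord)


def by_radical_stroke(chars):
    """Sort by radical first, then by residual stroke count."""
    return sorted(chars, key=lambda c: RADICAL_STROKE[c])


sample = ["\u4eba", "\u3400", "\u4e01", "\u4e00"]
code_point_order = by_code_point(sample)
radical_stroke_order = by_radical_stroke(sample)

# Print the code points so the result is readable on any terminal.
print([f"U+{ord(c):04X}" for c in code_point_order])
print([f"U+{ord(c):04X}" for c in radical_stroke_order])
```

The two orders disagree as soon as a character lives outside the main block:
the code point sort puts U+3400 first, while the radical/stroke sort places it
after the other radical-1 characters. A phonetic sort would need yet another
table, keyed by reading rather than by character shape.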

>My current understanding suggests that the implications for sorting text in
>the various Unicode code pages are significant and could be a good
>reason for language tagging. I believe that there may be implications for
>text parsing tools which may wish to identify key words and phrases from
>non-spacing languages which make use of unified code points.

Language tagging alone doesn't solve the sorting problem. Even in
English, how "St." sorts depends on whether it's "Street" or "Saint," and
"Mc-" can sort in various ways. As for Japanese, knowing that it's
Japanese (as opposed to Chinese) doesn't help an awful lot for sorting
ideographs, which can have numerous wildly different pronunciations.
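
The "St." case can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the expansion table (which entry means
"Saint") is supplied by hand, because neither the text nor a language tag
carries that information. The names sort_key, names and as_saint are
hypothetical.

```python
def sort_key(entry, expansions):
    """Return the key used for ordering: the spelled-out form if one is
    known for this entry, otherwise the entry itself, case-folded."""
    return expansions.get(entry, entry).casefold()


names = ["St. John", "Sanders", "Stone", "Smith"]

# Assumption for this sketch: someone has decided that this "St." is "Saint".
as_saint = {"St. John": "Saint John"}

# Sorting on the text as written, with no extra knowledge.
literal_order = sorted(names, key=lambda n: sort_key(n, {}))

# Sorting with the abbreviation resolved to its spoken form.
saint_order = sorted(names, key=lambda n: sort_key(n, as_saint))

print(literal_order)
print(saint_order)
```

The same string lands in a different position depending on information that
has to come from outside the text. Ideographs with several readings behave the
same way: the sort key depends on which reading applies in a given word.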

John H. Jenkins
