Re: Unified CJK characters in Unicode

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Thu Jun 05 1997 - 16:05:49 EDT


On Thu, 5 Jun 1997, Unicode Discussion wrote:

> Can someone with an understanding of Chinese sorting please
> provide information about the significance of Unified Han code points with
> respect to sorting rules in Asian countries. My understanding is that many
> unified code points actually have various sounds depending on their use
> within Asian text and also on the language the glyph is used for.

Yes indeed. In Chinese, most characters have only one pronunciation
(which however varies for each sublanguage or dialect, such as Mandarin
or Cantonese); some have more. In Japanese, most have more than one,
some up to 20. In Korean, most have one, some up to four, and the
readings also differ regionally.

> My current understanding suggests that the implications for sorting text in
> the various Unicode code pages are significant and could be a good
> reason for language tagging. I believe that there may be implications for
> text parsing tools which may wish to identify key words and phrases from
> non-spacing languages which make use of unified code points.

Sorting, for all kinds of languages, has to follow the expectations
of the reader, not the language of the items themselves. Imagine a list
of names ordered by language. Some friend of yours may have a
German-sounding name, but speak French and Italian equally well. Where
do you search for him?
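
To make the point concrete, here is a minimal sketch in Python. The
READINGS table and the sort_for_reader helper are made up for this
illustration (plain romanizations without tone marks, one reading per
character); a real system would use proper collation data. The same
three characters come out in a different order depending on who is
reading the list, and no per-item language tag is involved.

```python
# Illustrative readings only: a tiny hand-made table, not real collation data.
# Each reader (not each item) selects which column of readings is used.
READINGS = {
    "mandarin": {"林": "lin", "山": "shan", "田": "tian"},
    "japanese": {"林": "hayashi", "山": "yama", "田": "ta"},
}


def sort_for_reader(items, reader):
    """Sort items by the reading the given reader would expect."""
    table = READINGS[reader]
    return sorted(items, key=lambda ch: table[ch])


names = ["山", "田", "林"]
by_mandarin = sort_for_reader(names, "mandarin")
by_japanese = sort_for_reader(names, "japanese")
print(by_mandarin)
print(by_japanese)
```

The sort key is chosen once per reader, which is why tagging the
individual items with a language would not change the result.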

For multilingual searching of ideographic text, shape-based searching
seems the only widely useful solution. There might be a few exceptions,
e.g. a few names of Chinese people whose Japanese pronunciation
is established well enough. There are dozens of shape-based search
techniques, and computers do not restrict us to only one or two of
them, as printed dictionaries do.
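
As a rough sketch of two such shape-based orderings, the Python below
sorts the same characters by (radical number, residual strokes) and by
total stroke count. The SHAPE table and the two key functions are
hypothetical names for this example; the four entries were filled in by
hand, and using the code point as a tie-breaker is my own assumption,
not an established rule.

```python
# Hand-filled shape data: (Kangxi radical number, residual strokes, total strokes).
# A real implementation would load this from a character database instead.
SHAPE = {
    "山": (46, 0, 3),
    "川": (47, 0, 3),
    "林": (75, 4, 8),
    "田": (102, 0, 5),
}


def radical_stroke_key(ch):
    """Order by radical, then residual strokes; code point breaks ties."""
    radical, residual, _total = SHAPE[ch]
    return (radical, residual, ord(ch))


def total_stroke_key(ch):
    """Order by total stroke count; code point breaks ties."""
    return (SHAPE[ch][2], ord(ch))


chars = ["田", "林", "川", "山"]
by_radical = sorted(chars, key=radical_stroke_key)
by_strokes = sorted(chars, key=total_stroke_key)
print(by_radical)
print(by_strokes)
```

Both orderings depend only on the shape of the character, so they work
the same way for Chinese, Japanese and Korean readers, and a program
can offer several of them side by side.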

Language tags are useful if you want to search for a word in a particular
language, but not for sorting.

Regards, Martin.


