Re: Web Form: Other Question: CJK

From: John Jenkins (jenkins@apple.com)
Date: Sat Oct 11 2003 - 18:19:04 CST


On Oct 10, 2003, at 2:48 PM, Magda Danish (Unicode) wrote:

>> My problem is to recognize from the 32-bit value of a Unicode
>> character whether it is a Chinese, Korean, or Japanese character.
>> How can I do this?
>>

It's basically impossible and largely meaningless. It's the equivalent
of asking if "a" is an English letter or a French one. There are
*some* characters where one can guess based on the source information
in Unihan.txt that it's traditional Chinese, simplified Chinese,
Japanese, Korean, or Vietnamese, but there are too many exceptions to
make this really reliable. (For example, one particularly nasty
obscenity in Cantonese would probably have never been encoded for
Cantonese, but has made it in for the sake of Korean, where one hopes
it isn't nearly as obscene.)
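
To make the kind of guess described above concrete, here is a minimal
sketch in Python that collects the IRG source fields recorded for each
character. It assumes the tab-separated layout of Unihan.txt (code
point, field name, value) and the five field names shown; the sample
rows and the helper name sources_by_char are made up for illustration
and are not copied from the real file.

```python
from collections import defaultdict

# Field names assumed for this sketch, mapped to a short source label.
SOURCE_FIELDS = {
    "kIRG_GSource": "G",  # China
    "kIRG_TSource": "T",  # Taiwan
    "kIRG_JSource": "J",  # Japan
    "kIRG_KSource": "K",  # Korea
    "kIRG_VSource": "V",  # Vietnam
}

# Illustrative rows only: the values are placeholders, not real Unihan data.
SAMPLE = (
    "# comment line\n"
    "U+4E00\tkIRG_GSource\tG0-0000\n"
    "U+4E00\tkIRG_TSource\tT1-0000\n"
    "U+4E00\tkIRG_JSource\tJ0-0000\n"
    "U+4E00\tkIRG_KSource\tK0-0000\n"
    "U+3400\tkIRG_KSource\tK0-0001\n"
)


def sources_by_char(lines):
    """Map each character to the set of source labels recorded for it."""
    result = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a (code point, field, value) row
        code, field, _value = parts
        if field in SOURCE_FIELDS:
            result[chr(int(code[2:], 16))].add(SOURCE_FIELDS[field])
    return dict(result)


sources = sources_by_char(SAMPLE.splitlines())
```

A character listed under several sources says nothing about the
language of the text it appears in, and a character listed under only
one source is at best a weak hint, for the reasons given above.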

The phonetic data in Unihan.txt should not be used for this purpose. A
blank in the phonetic data means that nobody's supplied a reading, not
that a reading doesn't exist. Because updating the Unihan database is
an ongoing process, these fields will be increasingly filled out as
time goes on, but they should never be taken as absolutely complete.
In particular, there are obscure characters where it is known that
there *is* a reading, but since the character does not occur in
standard dictionaries, we are unable to supply it (e.g., U+40DF in
Cantonese).
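
In code, the point is that a missing entry has to be treated as
"unknown" and not as "none". A small sketch in Python, assuming a
hypothetical in-memory table READINGS standing in for the phonetic
data (the table and the function names are invented for illustration):

```python
from typing import Optional

# Hypothetical stand-in for the Cantonese readings in the database.
READINGS = {0x4E00: "jat1"}


def cantonese_reading(code_point: int) -> Optional[str]:
    """Return the recorded reading, or None if nobody has supplied one.

    None means "not on record", never "this character has no reading".
    """
    return READINGS.get(code_point)


def describe(code_point: int) -> str:
    reading = cantonese_reading(code_point)
    if reading is None:
        return f"U+{code_point:04X}: no reading on record (one may still exist)"
    return f"U+{code_point:04X}: {reading}"
```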

A better solution is to look at the text as a whole: if there's a fair
amount of kana, it's probably Japanese, and if there's a fair amount of
hangul, it's probably Korean.
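
A rough sketch of that whole-text heuristic in Python follows. The
block ranges, the 10% threshold, and the function name guess_language
are assumptions chosen for illustration; text containing only Han
characters is deliberately left undetermined.

```python
# Code point ranges assumed for this sketch (not an exhaustive list).
KANA = [(0x3040, 0x309F), (0x30A0, 0x30FF)]       # hiragana, katakana
HANGUL = [(0x1100, 0x11FF), (0x3130, 0x318F), (0xAC00, 0xD7A3)]
HAN = [(0x3400, 0x4DBF), (0x4E00, 0x9FFF)]


def in_ranges(code_point, ranges):
    return any(lo <= code_point <= hi for lo, hi in ranges)


def guess_language(text, threshold=0.1):
    """Guess from the script mix of the whole text, not from single characters."""
    counts = {"kana": 0, "hangul": 0, "han": 0}
    for ch in text:
        cp = ord(ch)
        if in_ranges(cp, KANA):
            counts["kana"] += 1
        elif in_ranges(cp, HANGUL):
            counts["hangul"] += 1
        elif in_ranges(cp, HAN):
            counts["han"] += 1
    total = sum(counts.values())
    if total == 0:
        return "unknown"          # no CJK characters at all
    if counts["kana"] / total >= threshold:
        return "Japanese"         # a fair amount of kana
    if counts["hangul"] / total >= threshold:
        return "Korean"           # a fair amount of hangul
    return "undetermined"         # Han only: cannot tell from characters alone
```

The result is only a guess: a Japanese or Korean passage written
entirely in Han characters comes back as "undetermined".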

The only proper mechanism, as for determining whether "chat" is
spelled correctly in English or French, is to use a higher-level
protocol.

========
John H. Jenkins
jenkins@apple.com
jhjenkins@mac.com
http://homepage.mac.com/jhjenkins/


