RE: FAQ question...

From: Marco.Cimarosti@icl.com
Date: Tue May 09 2000 - 04:45:45 EDT


Michael Friedman (1st question):
> Q: How can I normalize Unicode text so that equivalent
> Simplified and Traditional Chinese characters compare equal?

As Mark has already pointed out, there is no straightforward mapping between
simplified and traditional Chinese. In particular, a single simplified
ideograph often corresponds to two or more traditional characters. So,
converting from one writing system to the other requires a lot of contextual
information, which can only be obtained with "knowledge" of Chinese grammar
and vocabulary, and some form of "understanding" of the text.

Moreover, simplified and traditional characters come together with other
differences in spelling and terminology. For example, the two writing
traditions use different characters to transliterate foreign names (e.g.,
"Italy" is "Yidali" in both locales, but the 3 ideographs, yi, da, li, used
in Beijing are not the same 3 used in Taipei). As another example, the terms
for new technological gadgets (e.g. "computer") are not the same in Taiwan
and in the PRC.

In short, automatically converting from "simplified" to "traditional"
Chinese is closer to an NLP task such as machine translation than to a
trivial character-set conversion.

However, if I understand correctly, Michael is asking for a much simpler
thing: a "loose" searching function (à la "case insensitive" or "accent
insensitive").

What you need, in this case, is just a list of characters that *might* be
variants of any given character -- either because they are
simplified/traditional pairs or for other reasons (e.g., Japanese simplified
forms, Korean clones, spelling variants, etc.).

In this case, the "normalization" is done only on *temporary* copies of the
search string and of the sought text, so there is no question of losing
content. There also is little need for accuracy: the worst thing that can
happen is that you have some false matches.
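
A minimal sketch of such a loose search, in Python. The table VARIANT_FOLD
and the function names fold_variants and loose_find are invented for this
illustration; the two entries are real simplified/traditional pairs, but a
usable table would have to be built from a data source and would be much
larger.

```python
# Hypothetical folding table: each variant maps to one chosen representative.
# Only two sample pairs are listed here; a real table would be generated.
VARIANT_FOLD = {
    "\u570B": "\u56FD",  # traditional "country" -> simplified form
    "\u9F8D": "\u9F99",  # traditional "dragon"  -> simplified form
}


def fold_variants(text: str) -> str:
    """Return a temporary copy of text with known variants folded together."""
    return "".join(VARIANT_FOLD.get(ch, ch) for ch in text)


def loose_find(haystack: str, needle: str) -> int:
    """Variant-insensitive find: index of the first match, or -1.

    Both arguments are folded on the fly; the originals are never modified.
    Folding here is one character to one character, so the returned index
    is also valid in the original haystack.
    """
    return fold_variants(haystack).find(fold_variants(needle))


if __name__ == "__main__":
    text = "\u4E2D\u570B\u9F8D"      # written with traditional characters
    query = "\u56FD"                  # typed with a simplified character
    print(text.find(query))           # exact search does not match
    print(loose_find(text, query))    # loose search does
```

Folding everything to a single representative is the crudest possible
choice: when one character has several variants it will produce some false
matches, which is acceptable for this purpose.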

An "official" source of information for building such a variants list is in
the Unihan database (ftp://ftp.unicode.org/Public/UNIDATA/Unihan.txt -- see
the comment at the beginning of the file for details on format).

The relevant fields are:

- kSemanticVariant: The Unicode value for a semantic variant for this
character. A semantic variant is an x- or y-variant with similar or
identical meaning which can generally be used in place of the indicated
character.

- kSimplifiedVariant: The Unicode value for a (Chinese) simplified variant
for this character.

- kSpecializedSemanticVariant: The Unicode value for a specialized semantic
variant for this character. A specialized semantic variant is an x- or
y-variant with similar or identical meaning only in certain contexts
(such as accountants' numerals).

- kTraditionalVariant: The Unicode value for a (Chinese) traditional
variant for this character.
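
As a rough illustration, the four fields above could be collected into a
variants table along these lines. This is only a sketch: it assumes that
each data line has three tab-separated columns (code point as U+XXXX, field
name, value) and that the value holds one or more U+XXXX code points. The
comment at the beginning of the file remains the authority on the actual
format. The name parse_variants is made up for the example.

```python
import re
from collections import defaultdict

# The four variant fields discussed above.
VARIANT_FIELDS = {
    "kSemanticVariant",
    "kSimplifiedVariant",
    "kSpecializedSemanticVariant",
    "kTraditionalVariant",
}

# Matches a code point written as U+XXXX (4 to 6 hex digits).
CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def parse_variants(lines):
    """Map each character to the set of characters listed as its variants.

    Assumed line layout (an assumption, not checked against the file):
        U+570B<TAB>kSimplifiedVariant<TAB>U+56FD
    """
    variants = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] not in VARIANT_FIELDS:
            continue  # not one of the fields we care about
        source = CODE_POINT.match(parts[0])
        if source is None:
            continue
        char = chr(int(source.group(1), 16))
        for match in CODE_POINT.finditer(parts[2]):
            variants[char].add(chr(int(match.group(1), 16)))
    return variants


if __name__ == "__main__":
    sample = [
        "# sample lines in the assumed format",
        "U+570B\tkSimplifiedVariant\tU+56FD",
        "U+56FD\tkTraditionalVariant\tU+570B",
    ]
    for char, others in parse_variants(sample).items():
        print(f"U+{ord(char):04X}", sorted(f"U+{ord(o):04X}" for o in others))
```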

Among unofficial sources, the best I've seen is a file by Koichi Yasuoka
(ftp://ginkaku.kudpc.kyoto-u.ac.jp/CJKtable/UniVariants.Z found in
http://www.kudpc.kyoto-u.ac.jp/~yasuoka/CJK.html). This does not seem to be
a simple derivation from Unihan.txt: I think it is rather the result of
independent research by Koichi.

Other variants are directly included in the main Unicode database
(ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt), in the form of
canonical or compatibility mappings. Both kinds of mapping should be applied
*before* your "ideograph variant folding" transformation, by converting the
text to Normalization Form *KD*, as explained in UTR #15
(http://www.unicode.org/unicode/reports/tr15/tr15-18.html).
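
For illustration only, this is what that preliminary step looks like with
Python's standard unicodedata module. The function name prepare_for_folding
is invented for the example, and the results depend on the version of the
Unicode data that ships with the Python interpreter.

```python
import unicodedata


def prepare_for_folding(text: str) -> str:
    """Apply Normalization Form KD to a temporary copy of the text.

    NFKD applies both the canonical and the compatibility mappings from
    UnicodeData.txt, so it should run before any variant folding.
    """
    return unicodedata.normalize("NFKD", text)


if __name__ == "__main__":
    # U+F900 is a CJK compatibility ideograph with a canonical mapping
    # to the unified ideograph U+8C48.
    print(f"U+{ord(prepare_for_folding(chr(0xF900))):04X}")
    # U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A, a compatibility
    # variant of the plain letter A.
    print(prepare_for_folding(chr(0xFF21)))
```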

Michael Friedman (2nd question):
> How can I normalize the text so that equivalent Simplified
> and Traditional Chinese characters do not compare equal?

I don't understand this other question. Simplified and traditional Chinese
characters *are* different in Unicode, i.e. they have different code points.
So you don't need to do anything: a plain binary comparison will *not* treat
simplified and traditional characters as equal.
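
A tiny example of that, using one simplified/traditional pair chosen just
for illustration (the function name same_binary is made up):

```python
def same_binary(a: str, b: str) -> bool:
    """Plain code point by code point comparison, with no folding at all."""
    return a == b


if __name__ == "__main__":
    simplified = "\u56FD"   # simplified form of "country"
    traditional = "\u570B"  # traditional form of "country"
    print(same_binary(simplified, traditional))  # False
    print(same_binary(simplified, simplified))   # True
```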

_ Marco


