Re: Reading Chinese Characters from a browser

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 09 2003 - 22:01:57 EDT

    Philippe Verdy responded to a question by SRIDHARAN Aravind:

    > > How can I differentiate whether a given character in chinese is
    > > simplified or traditional?
    >
    > Normally you can't with Unicode/ISO10646:
    > They are unified now by the UniHan working group, to be used
    > for Traditional or Simplified Chinese, or Japanese, or traditional
    > Korean and Vietnamese, and other minority languages written with
    > this ideographic script.

    Correcting some misstatements here...

    Actually, in most instances in Unicode you *can* differentiate
    whether a given Chinese character is simplified or traditional,
    precisely because the two related forms are *NOT* unified in
    Unicode. Thus, to pick an example which hasn't already been
    rendered hackneyed by discussion:

    U+9BE8 jing1 'whale' (traditional character form)
    U+9CB8 jing1 'whale' (simplified character form)

    So in Unicode you can differentiate the two *by code point*.
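
    As a minimal illustration of that point, the Python sketch below
    compares the two code points listed above. The variable names are
    only for illustration; unicodedata is the standard-library module.

```python
import unicodedata

traditional = "\u9BE8"  # jing1 'whale', traditional character form
simplified = "\u9CB8"   # jing1 'whale', simplified character form

# The two forms are separate code points, so a plain comparison
# is enough to tell them apart.
print(traditional == simplified)

for ch in (traditional, simplified):
    # Show the code point in U+XXXX notation with its character name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```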

    Of course, coming up with the exact list of code points is
    non-trivial, but as Philippe pointed out, you can get a
    lot of information here by examining Unihan.txt. In particular,
    the kTraditional and kSimplified fields give mappings back
    and forth between such pairs. (The problem is, however, messy
    around the edges because of "traditional simplified" forms,
    1-to-n mappings, distinct national simplifications, and
    similar problems.)
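
    A rough sketch of how such mappings might be read is given below.
    It assumes the tab-separated "U+XXXX<TAB>field<TAB>value" layout
    of the Unihan data and the field names kSimplifiedVariant and
    kTraditionalVariant; check both against the copy of the file you
    actually have. The two SAMPLE lines are built from the 'whale'
    pair above purely for illustration, and load_variants and
    classify are hypothetical helper names.

```python
from collections import defaultdict

# Illustrative lines only, assumed to follow the Unihan layout.
SAMPLE = (
    "U+9BE8\tkSimplifiedVariant\tU+9CB8\n"
    "U+9CB8\tkTraditionalVariant\tU+9BE8\n"
)

def load_variants(text):
    """Return two maps: char -> simplified variants, char -> traditional variants."""
    has_simplified = defaultdict(set)
    has_traditional = defaultdict(set)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        char = chr(int(code[2:], 16))
        # A value may list several code points (1-to-n mappings);
        # drop any "<source" suffix that may follow a code point.
        variants = {chr(int(tok.split("<")[0][2:], 16)) for tok in value.split()}
        variants.discard(char)  # a character mapped to itself says nothing
        if not variants:
            continue
        if field == "kSimplifiedVariant":
            has_simplified[char] |= variants
        elif field == "kTraditionalVariant":
            has_traditional[char] |= variants
    return has_simplified, has_traditional

def classify(char, has_simplified, has_traditional):
    """Label one character; the messy edge cases come out as 'ambiguous'."""
    s = char in has_simplified    # it has a simplified counterpart
    t = char in has_traditional   # it has a traditional counterpart
    if s and t:
        return "ambiguous"
    if s:
        return "traditional"
    if t:
        return "simplified"
    return "unmarked"  # shared by both orthographies, or not covered

simp_map, trad_map = load_variants(SAMPLE)
for ch in ("\u9BE8", "\u9CB8", "\u4E00"):
    print(f"U+{ord(ch):04X}", classify(ch, simp_map, trad_map))
```

    Characters that carry both kinds of mapping are reported as
    "ambiguous" rather than forced into one bucket, which is one way
    of keeping the edge cases visible instead of hiding them.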

    I think what Philippe was trying to convey is that if text
    is identified as being encoded using Unicode, you cannot
    use that fact alone to determine whether the text is
    "traditional" or "simplified" in orthography, since Unicode
    includes both forms and encompasses text in either
    orthography (or even mix-and-match text that would use
    both orthographies together, e.g. to contrast the two usages).

    This differs from the situation for some traditional East
    Asian character sets. For example, identification of
    charset = cp936 would indicate that text is "simplified",
    since that character encoding does not include many
    traditional forms, whereas charset = cp950 would indicate
    that text is "traditional", since that character
    encoding does not include many simplified forms.
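
    The repertoire difference can be probed by trying to encode each
    form of 'whale' in a legacy codec, as in the sketch below. Two
    assumptions to note: can_encode is a hypothetical helper, and the
    sketch uses Python's "gb2312" codec for the simplified side,
    because Python's "cp936" codec is an alias for GBK, a superset
    that as far as I know can also encode many traditional forms.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# U+9BE8 is the traditional form, U+9CB8 the simplified form.
for ch in ("\u9BE8", "\u9CB8"):
    results = {codec: can_encode(ch, codec) for codec in ("gb2312", "cp950")}
    print(f"U+{ord(ch):04X}", results)
```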

    Incidentally, the "UniHan working group" is a misnomer. The
    correct term is Ideographic Rapporteur Group (IRG), the
    group which does unifications of candidate CJK ideographs
    on behalf of WG2 (for ISO/IEC 10646).

    --Ken


