RE: traditional vs simplified chinese

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Feb 13 2003 - 09:18:51 EST

    Paul Hastings wrote:
    > i suppose this is a really simple minded question but is
    > there any way of telling if an incoming chunk of text
    > (say from a browser form) is traditional or simplified
    > chinese?

    Please note that the classification you want is not always meaningful.
    E.g., what if the incoming text is in Spanish? Would you classify it as
    traditional or simplified Chinese?

    Anyway. You can obtain the base data for each Chinese character from the
    file http://www.unicode.org/Public/UNIDATA/Unihan.txt, by checking the
    existence of fields <kSimplifiedVariant> and <kTraditionalVariant>.
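
    A minimal sketch of that lookup in Python. It assumes each data line has
    the tab-separated form "U+XXXX", field name, value, and that the field
    names appear without the angle brackets used above. The function name
    load_variant_fields and the sample lines are hypothetical, written only
    for illustration:

```python
VARIANT_FIELDS = ("kSimplifiedVariant", "kTraditionalVariant")


def load_variant_fields(lines):
    """Map every listed character to the set of variant fields it carries.

    Characters that are listed but have neither variant field map to an
    empty set, so "listed" and "not listed" stay distinguishable.
    """
    table = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        # Assumed layout: code point, field name, value.
        if len(parts) < 3 or not parts[0].startswith("U+"):
            continue
        ch = chr(int(parts[0][2:], 16))
        fields = table.setdefault(ch, set())
        if parts[1] in VARIANT_FIELDS:
            fields.add(parts[1])
    return table


# Illustrative sample in the assumed format (not a copy of the real file).
sample = [
    "# comment line",
    "U+56FD\tkTraditionalVariant\tU+570B",
    "U+570B\tkSimplifiedVariant\tU+56FD",
    "U+4E00\tkDefinition\tone",
]
table = load_variant_fields(sample)
```

    For the real file, the same function could be fed an open file object
    instead of the sample list.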

    Any Unicode character falls into one of these four categories:

            0) All characters not listed in Unihan.txt (i.e., non-Chinese
    characters) are *neither* "Traditional" nor "Simplified";

            1) All characters having <kSimplifiedVariant> but *no*
    <kTraditionalVariant> are "Traditional";

            2) All characters having <kTraditionalVariant> but *no*
    <kSimplifiedVariant> are "Simplified";

            3) All other characters listed in Unihan.txt are *both*
    "Traditional" and "Simplified".

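    The four character-level categories can be sketched in Python as
    follows. This assumes a lookup table that maps each character listed in
    Unihan.txt to the set of variant field names it carries; the table
    contents, the name char_category and the numeric return values are
    illustrative choices, not part of any library:

```python
def char_category(ch, variant_fields):
    """Return the category number (0 to 3) of a single character.

    variant_fields: dict mapping each listed character to the set of
    variant field names found for it (an empty set if it has none).
    """
    fields = variant_fields.get(ch)
    if fields is None:
        return 0  # not listed at all: neither
    has_simplified = "kSimplifiedVariant" in fields
    has_traditional = "kTraditionalVariant" in fields
    if has_simplified and not has_traditional:
        return 1  # has a simplified form, so it is "Traditional"
    if has_traditional and not has_simplified:
        return 2  # has a traditional form, so it is "Simplified"
    return 3  # listed, with both fields or with neither: both


# Hypothetical lookup table with three listed characters.
variant_fields = {
    "\u570b": {"kSimplifiedVariant"},
    "\u56fd": {"kTraditionalVariant"},
    "\u4e00": set(),
}
```
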
    From these character-level categories, you can assign a category to the
    input stream:

            If at least one character has category 1 AND at least one character
    has category 2, then:

                    stream is both "Traditional" and "Simplified" (category 3);

            Else, if at least one character has category 1, then:

                    stream is "Traditional" (category 1);

            Else, if at least one character has category 2, then:

                    stream is "Simplified" (category 2);

            Else, if at least one character has category 3:

                    stream is both "Traditional" and "Simplified" (category 3
    again);

            Else (all characters have category 0):

                    stream is neither "Traditional" nor "Simplified" (category
    0);

            End.
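
    The decision chain above can be sketched in Python like this. It assumes
    the per-character categories (0 to 3) have already been computed by some
    other means; the function name stream_category is hypothetical:

```python
def stream_category(char_categories):
    """Combine per-character categories (0 to 3) into one stream category."""
    seen = set(char_categories)
    if 1 in seen and 2 in seen:
        return 3  # mixed: both "Traditional" and "Simplified"
    if 1 in seen:
        return 1  # "Traditional"
    if 2 in seen:
        return 2  # "Simplified"
    if 3 in seen:
        return 3  # only shared characters: both
    return 0  # no Chinese characters at all: neither
```

    An empty stream falls through to category 0, the same as a stream with
    no Chinese characters.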

    Anyway, I don't see how this information could be of any use for any
    purpose...

    _ Marco


