Re: Unicode Normalization on MS-Windows

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Apr 27 2003 - 14:16:58 EDT

    Jane Liu <xjliu_ca at yahoo dot com> wrote:

    > I am using IBM ICU V1.8 for some testing on Windows 2000 and XP. I
    > found that when I process some CJK characters, ICU by default will
    > normalize it. For example, U+FA19(神) will be replaced by U+795E
    > (神). However, if I save those two characters into a file on Windows
    > 2000 and XP by using Notepad and select "Unicode" as the encoding, I
    > don't see Notepad do any such normalization/replacement. Also, on
    > the Windows file system, I can use those two characters in a
    > file/folder name, and no normalization seems to be done by the OS
    > either ...

    At first I was going to reply, somewhat smugly, that U+FA19 was in the
    CJK *Compatibility* Ideographs block, and of course operating systems
    and other processes are not required (or even encouraged) to substitute
    compatibility-equivalent characters automatically.

    But upon checking the UCD, I found that U+FA19 and U+795E are in fact
    *canonical* equivalents, not compatibility equivalents, despite the name
    of the block.
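
    The distinction can be checked directly against the UCD. The sketch
    below assumes Python's standard unicodedata module, which reflects
    whatever UCD version ships with the interpreter rather than Unicode
    3.0; the helper name decomposition_kind is made up for illustration.
    A decomposition mapping with no "<tag>" prefix is canonical, and one
    with a prefix such as "<compat>" is a compatibility mapping.

```python
import unicodedata

compat_han = "\uFA19"   # CJK COMPATIBILITY IDEOGRAPH-FA19
unified_han = "\u795E"  # CJK UNIFIED IDEOGRAPH-795E


def decomposition_kind(ch):
    """Classify a character's UCD decomposition mapping (hypothetical helper)."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as "<compat>"; canonical ones do not.
    return "compatibility" if mapping.startswith("<") else "canonical"


kind_fa19 = decomposition_kind(compat_han)
kind_fi = decomposition_kind("\uFB01")  # LATIN SMALL LIGATURE FI, for contrast

# A canonical normalization form (NFC here) replaces U+FA19 with U+795E.
nfc_fa19 = unicodedata.normalize("NFC", compat_han)

print(kind_fa19, kind_fi, nfc_fa19 == unified_han)
```

    The ligature is included only as a contrast case with a genuine
    compatibility mapping.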

    U+FA19 falls into the category of "ideographs from various regional and
    industry standards [that] were encoded in this block, primarily to
    achieve round-trip conversion compatibility" (TUS 3.0, p. 267). In the
    code charts, U+FA19 is listed as one of "The IBM 32 compatibility
    additions" (p. 803). So the intent in encoding these characters seems
    to have been to support round-trip conversion with an existing standard,
    and it occurs to me that for that reason, an operating system might need
    to maintain the distinction between the two.

    Conformance requirement C9 says, "A process shall not assume that the
    interpretations of two canonical-equivalent character sequences are
    distinct" (p. 38). To me, the word "sequences" is a hint that the UTC
    may have been thinking more of combining sequences (like "a" plus
    diaeresis) than Han equivalents. The text immediately following C9
    says, "There are practical circumstances under which implementations may
    reasonably distinguish them." One could easily conclude that an
    operating system's need to preserve round-trip capability is one of
    these circumstances.
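
    To make the two kinds of canonical equivalence concrete, here is a
    minimal sketch, again assuming Python's unicodedata module. The
    function canonically_equivalent is an illustrative name, not part of
    any library; it compares NFD forms, which is one common way to test
    canonical equivalence. A process that compares raw code points treats
    the sequences as distinct, while one that compares normalized forms
    does not.

```python
import unicodedata

precomposed = "\u00E4"  # LATIN SMALL LETTER A WITH DIAERESIS
combining = "a\u0308"   # "a" followed by COMBINING DIAERESIS


def canonically_equivalent(s, t):
    """Compare two strings by their NFD forms (illustrative helper)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# Raw comparison looks only at code points, so the two spellings differ.
raw_equal = precomposed == combining

# The combining-sequence case that the word "sequences" in C9 suggests.
canon_equal = canonically_equivalent(precomposed, combining)

# The Han singleton case from this thread behaves the same way.
han_equal = canonically_equivalent("\uFA19", "\u795E")

print(raw_equal, canon_equal, han_equal)
```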

    So:

    > Can anyone please shed some light on:
    >
    > 1. Why Windows doesn't do normalization,

    Because it isn't required to, and there may be a compelling reason in
    this case not to.

    > and is there any way to ask Windows to do it?

    No.

    > 2. If Windows never does normalization, how should I balance this in my
    > Windows-based application, since I am using ICU? I don't think
    > simply turning off the normalization process in ICU would be a good
    > idea. However, if I keep using ICU to normalize everything in
    > my application, then I will possibly run into some trouble when
    > dealing with the Windows system ...

    ICU performs this normalization or not depending on how it is
    configured, while Windows does not perform it at all. If you need the
    results from the two systems to match, then it seems logical to turn
    off normalization in ICU.
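
    The mismatch can be sketched as follows. This is only an illustration
    under stated assumptions: stored_names and exists are hypothetical
    stand-ins for a file system that keeps names exactly as given and
    compares them without normalizing, which is the behavior reported
    above for Windows. It does not call any real Windows or ICU API, and
    Python's unicodedata module stands in for the normalizing side.

```python
import unicodedata

# Hypothetical stand-in for a file system that stores names as given.
stored_names = {"\uFA19.txt"}


def exists(name):
    """Look up a name by exact code points, with no normalization."""
    return name in stored_names


original = "\uFA19.txt"
# What an application would hold after normalizing the name to NFC.
normalized = unicodedata.normalize("NFC", original)

found_original = exists(original)      # the name as it was stored
found_normalized = exists(normalized)  # the name after normalization

print(found_original, found_normalized)
```

    A name that has passed through normalization no longer matches the
    stored one, which is why the two systems need to agree on whether
    normalization is applied.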

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


