Re: Unicode Normalization on MS-Windows

From: Jane Liu (
Date: Mon Apr 28 2003 - 11:27:36 EDT

    Thanks for sharing some background. Yes, the character U+FA19
    originally came from the JIS standard (level 3, IBM extensions and NEC
    selected extensions), and it has been assigned two code points in the
    native encoding: 0xFB7E (IBM portion) and 0xEE62 (NEC portion) ...
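
    The double assignment can be checked with a small Python sketch. It
    assumes that Python's built-in "cp932" codec mirrors the Windows code
    page 932 table for these two byte pairs; the variable names are only
    illustrative.

```python
# Decode the two native byte pairs mentioned above.
# Assumption: Python's "cp932" codec matches Windows code page 932 here.
ibm_bytes = bytes([0xFB, 0x7E])   # IBM extension area
nec_bytes = bytes([0xEE, 0x62])   # NEC-selected IBM extension area

ibm_char = ibm_bytes.decode("cp932")
nec_char = nec_bytes.decode("cp932")

# Show the Unicode code point each byte pair maps to.
print(f"IBM portion -> U+{ord(ibm_char):04X}")
print(f"NEC portion -> U+{ord(nec_char):04X}")
print("same character:", ibm_char == nec_char)
```

    Two native code points can share one Unicode code point, so only one
    of them can be reproduced when converting back from Unicode.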

    Correct me if I'm wrong, but it seems to me that, not only in this
    case but in general, neither Microsoft Windows nor the popular UNIX
    systems (AIX, Solaris, HP-UX) currently supply explicit support for
    Unicode normalization at the encoding/conversion level. I suspect
    this also applies to all major databases. The bottom line would be
    "WYSIWYG = What You See Is What You Get", right?

    If that's true, can we conclude that, in order to maintain
    transparency and round-trip safety between the application and the
    OS, the application should not use normalization?
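
    To make the round-trip concern concrete, here is a minimal sketch
    using Python's standard unicodedata module instead of ICU. The file
    name is hypothetical; the point is only that a name normalized by the
    application no longer matches the name the OS stored.

```python
import unicodedata

# Hypothetical file name as the OS stored it (no normalization applied).
stored_name = "\uFA19.txt"

# What an application would hold after normalizing the same name to NFC.
normalized_name = unicodedata.normalize("NFC", stored_name)

print("names still match:", normalized_name == stored_name)
# UTF-16LE is what Notepad's "Unicode" encoding writes; the bytes differ.
print(stored_name.encode("utf-16-le").hex())
print(normalized_name.encode("utf-16-le").hex())
```

    Once the application has normalized the name, a byte-for-byte lookup
    against the unnormalized original can fail.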

    Also, it would be nice to give the application user the flexibility
    to turn the normalization process on or off. However, this may sound
    useless, since the majority of those systems don't even care.
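
    Such a switch could look roughly like the sketch below. The function
    prepare_text and its parameters are hypothetical names, and Python's
    unicodedata stands in for whatever normalization service the
    application really uses.

```python
import unicodedata


def prepare_text(text: str, normalize: bool = True, form: str = "NFC") -> str:
    """Return text normalized to `form`, or unchanged when normalize is False.

    Hypothetical helper: the flag stands for a user-visible On/Off setting.
    """
    if not normalize:
        return text  # pass through untouched to preserve round-tripping
    return unicodedata.normalize(form, text)


sample = "\uFA19"
print(f"U+{ord(prepare_text(sample, normalize=False)):04X}")  # normalization off
print(f"U+{ord(prepare_text(sample, normalize=True)):04X}")   # normalization on
```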


    --- Doug Ewell <> wrote:
    > Jane Liu <xjliu_ca at yahoo dot com> wrote:
    > > I am using IBM ICU V1.8 for some testing on Windows 2000 and XP. I
    > > found that when I process some CJK characters, ICU by default will
    > > normalize them. For example, U+FA19 (?) will be replaced by U+795E
    > > (?). However, if I save those two characters into a file on Windows
    > > 2000 and XP by using Notepad and select "Unicode" as the encoding,
    > > Notepad does not seem to do any such normalization/replacement.
    > > Also, on the Windows file system, I can use those two characters in
    > > a file/folder name, and no normalization seems to be done by the OS
    > > either ...
    >
    > At first I was going to reply, somewhat smugly, that U+FA19 was in the
    > CJK *Compatibility* Ideographs block, and of course operating systems
    > and other processes are not required (or even encouraged) to substitute
    > compatibility-equivalent characters automatically.
    >
    > But upon checking the UCD, I found that U+FA19 and U+795E are in fact
    > *canonical* equivalents, not compatibility equivalents, despite the name
    > of the block.
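
    The same UCD check can be sketched with Python's standard unicodedata
    module, assuming its bundled database agrees with the UCD on this
    mapping. U+FB01 is included only as a contrasting compatibility case,
    and is_canonical is a hypothetical helper.

```python
import unicodedata

compat_ideograph = "\uFA19"   # CJK COMPATIBILITY IDEOGRAPH-FA19
ligature = "\uFB01"           # LATIN SMALL LIGATURE FI, used only for contrast

# The decomposition field carries a <tag> prefix for compatibility
# mappings and no tag for canonical mappings.
ideograph_mapping = unicodedata.decomposition(compat_ideograph)
ligature_mapping = unicodedata.decomposition(ligature)


def is_canonical(mapping: str) -> bool:
    """True when a non-empty decomposition field has no <tag> prefix."""
    return bool(mapping) and not mapping.startswith("<")


print(repr(ideograph_mapping), is_canonical(ideograph_mapping))
print(repr(ligature_mapping), is_canonical(ligature_mapping))
```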
    >
    > U+FA19 falls into the category of "ideographs from various regional and
    > industry standards [that] were encoded in this block, primarily to
    > achieve round-trip conversion compatibility" (TUS 3.0, p. 267). In the
    > code charts, U+FA19 is listed as one of "The IBM 32 compatibility
    > additions" (p. 803). So the intent in encoding these characters seems
    > to have been to support round-trip conversion with an existing standard,
    > and it occurs to me that for that reason, an operating system might need
    > to maintain the distinction between the two.
    >
    > Conformance requirement C9 says, "A process shall not assume that the
    > interpretations of two canonical-equivalent character sequences are
    > distinct" (p. 38). To me, the word "sequences" is a hint that the UTC
    > may have been thinking more of combining sequences (like "a" plus
    > diaeresis) than Han equivalents. The text immediately following C9
    > says, "There are practical circumstances under which implementations may
    > reasonably distinguish them." One could easily conclude that an
    > operating system's need to preserve round-trip capability is one of
    > these circumstances.
    >
    > So:
    >
    > > Can anyone please shed some light on:
    > >
    > > 1. Why Windows doesn't do normalization,
    >
    > Because it isn't required to, and there may be a compelling reason in
    > this case not to.
    >
    > > and are there any ways to ask Windows to do it?
    >
    > No.
    >
    > > 2. If Windows never does normalization, how should I balance this in
    > > my Windows-based application, since I am using ICU? I don't think
    > > simply turning off the normalization process in ICU would be a good
    > > idea; however, if I keep using ICU to normalize everything in my
    > > application, then I will possibly run into some trouble when
    > > dealing with the Windows system ...
    >
    > If you are dealing with one system (ICU) that does or does not perform
    > this normalization, depending on user preference, and another system
    > (Windows) that does not, and you need to have the results from the two
    > systems match, then it seems logical to turn off normalization in ICU.
    >
    > -Doug Ewell
    > Fullerton, California

    This archive was generated by hypermail 2.1.5 : Mon Apr 28 2003 - 12:10:31 EDT