Re: Unicode Normalization on MS-Windows

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Apr 28 2003 - 15:31:59 EDT

    Jane Liu wrote:

    > Thanks for sharing some background. Yes, the character "U+FA19" was
    > originally from the JIS standard, level-3, IBM extensions and NEC
    > selected extensions and it has been assigned two code points: 0xFB7E
    > (IBM portion) and 0xEE62 (NEC portion) in the native encoding ...

    i.e., in Code Page 932.

    But in EUC-JP (proper), there is *no* encoding for this.

    In the EUC-JP vendor extensions to interoperate with Code Page 932,
    there is *one* encoding: 0xFB 0x7E.

    In the IBM host encoding Code Page 930, there is *one*
    encoding: (0x0E) 0x5F 0xD5 (0x0F).

    And in GB 18030, U+FA19 converts to: 0x84 0x30 0x9B 0x39.

    So even interoperating between systems without normalizing, you
    have to be concerned about the retention or substitution of
    this character. Most other systems cannot roundtrip the two
    encodings of the character in Code Page 932. And if you have
    a Windows system hooked up to an EUC-JP back end, U+FA19 might
    or might not survive, depending on the level of that system
    and the support or non-support for EUC-JP extensions.
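
    The roundtrip problem described above can be sketched in a few lines
    of Python. This is only an illustration, assuming Python's built-in
    "cp932" and "euc_jp" codecs follow the vendor mapping tables; the
    helper names roundtrips and can_encode are made up for this example.
    Both byte sequences quoted in the thread decode to one code point, so
    at most one of them can come back out on re-encoding.

```python
# Sketch only: relies on Python's built-in "cp932" codec tables.
IBM_BYTES = b"\xfb\x7e"  # IBM extension slot quoted in the thread
NEC_BYTES = b"\xee\x62"  # NEC-selected slot quoted in the thread


def roundtrips(raw: bytes, codec: str = "cp932") -> bool:
    """True if decoding and then re-encoding returns the original bytes."""
    return raw.decode(codec).encode(codec) == raw


def can_encode(text: str, codec: str) -> bool:
    """True if the codec has a mapping for every character in text."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for label, raw in (("IBM", IBM_BYTES), ("NEC", NEC_BYTES)):
        ch = raw.decode("cp932")
        print(label, raw.hex(), f"U+{ord(ch):04X}", "roundtrips:", roundtrips(raw))
    # Whether a given EUC-JP codec accepts the character depends on
    # which vendor extensions that codec implements.
    print("euc_jp can encode U+FA19:", can_encode("\ufa19", "euc_jp"))
```

    Which of the two slots the encoder picks is up to the codec; the
    point is only that two inputs collapse to one output before any
    normalization is involved.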

    >
    > Correct me if I'm wrong, it seems to me, not only for this case,
    > actually in general, neither Microsoft Windows nor those popular UNIX
    > systems (AIX, Solaris, HP-UX) currently supply the explicit support
    > of Unicode normalization at the encoding/conversion level. I suspect
    > this would also apply to all major databases.

    You have to be cautious here, too. Major databases may or may
    not normalize Unicode data when their internal storage is
    in Unicode. This may or may not be a user-definable setting.
    The choice to normalize (for canonical equivalences) in databases
    is often made for performance reasons, because optimizing comparisons
    across table joins for non-normalized data can be very messy and
    can kill database performance on queries.
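
    A toy illustration of why a store might normalize on the way in,
    using Python's unicodedata module. The tables, names, and the join
    helper are hypothetical and stand in for whatever a real database
    does internally; no particular product is being described.

```python
import unicodedata


def nfc(text: str) -> str:
    """Canonical composition (NFC) of a string."""
    return unicodedata.normalize("NFC", text)


def join(customers: dict, orders: list) -> list:
    """Inner join on the name key using plain code-point equality."""
    return [(customers[name], item) for name, item in orders if name in customers]


# The same name, once decomposed (e + U+0301) and once precomposed (U+00E9).
raw_customers = {"Ame\u0301lie": 1}
raw_orders = [("Am\u00e9lie", "book")]

# Without normalization the binary comparison misses the match.
unnormalized = join(raw_customers, raw_orders)

# Normalizing once at storage time keeps the join a cheap binary comparison.
stored_customers = {nfc(name): cid for name, cid in raw_customers.items()}
stored_orders = [(nfc(name), item) for name, item in raw_orders]
normalized = join(stored_customers, stored_orders)

if __name__ == "__main__":
    print("raw join:       ", unnormalized)
    print("normalized join:", normalized)
```

    The alternative, comparing canonically equivalent keys at query
    time, would mean normalizing inside every comparison of the join.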

    > The bottom line would
    > be "WYSIWYG = What You See Is What You Get", Right?

    Nope. The point of most canonical equivalences is that a
    typical end-user cannot tell the difference in what they get.
    Canonical equivalences usually refer to two different sequences
    for representing the "same" thing (where, in a few cases, as
    for CJK compatibility characters, the *sequence* may consist
    only of a single character).

    You have picked out a particularly problematical subset of
    the CJK compatibility characters, however. U+FA19 is
    a user-visible and distinguishable variant of U+795E. From
    the point of view of Han unification, the two forms are simply
    variants of the *same* unified character. Only a requirement for roundtrip
    convertibility with Code Page 932 resulted in separate encoding
    for U+FA19. But in this case, as for others of the IBM 32, the
    separate encoding of the variant was to make it visible and
    usable by end users, distinct from U+795E. Some of the
    discussion underway about finding ways to declare tailorings
    of normalization is aimed precisely at retaining these
    variant distinctions for some CJK compatibility characters.
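
    The loss of the variant can be observed directly with Python's
    unicodedata module. A minimal sketch, assuming the Unicode Character
    Database bundled with the interpreter; the constant names are
    illustrative only.

```python
import unicodedata

COMPAT = "\ufa19"   # CJK compatibility ideograph discussed in the thread
UNIFIED = "\u795e"  # the unified ideograph it is canonically equivalent to

# The decomposition field holds a single code point with no
# compatibility tag, i.e. a canonical "singleton" mapping.
mapping = unicodedata.decomposition(COMPAT)

# Every normalization form replaces the compatibility character.
forms = {
    form: unicodedata.normalize(form, COMPAT)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

if __name__ == "__main__":
    print("decomposition:", mapping)
    for form, result in forms.items():
        print(form, f"U+{ord(result):04X}", result == UNIFIED)
```

    Because the mapping is canonical rather than compatibility, even NFC
    and NFD discard the distinction; there is no standard form that
    keeps U+FA19 intact.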

    >
    > If that's true, can we conclude that in order to maintain the
    > transparency and round-trip safety between application and OS, the
    > application should not use normalization?

    Yes, but...

    The problem often is that there is no clear boundary to "the
    application". Applications these days are often distributed,
    and parts may operate on different platforms. It may not be
    easy to define what parts are or are not using normalization
    of Unicode data.

    The conformance requirement for the Unicode Standard is that
    one process cannot *demand* that another process maintain a
    distinction between canonically equivalent sequences. So even
    if your application doesn't normalize, if you interact with
    any other application (and the OS platforms and databases
    also constitute complex applications, in their own ways), you
    cannot guarantee that they will not normalize.

    The defensive way to program is to write one's own application
    in such a way that it does not maintain distinctions between
    canonically equivalent sequences, or when it does, it does
    not break if interoperating with some other process that
    does not maintain such distinctions.
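
    One way to read that advice in code, sketched in Python. The
    function names are invented for this example, and NFC is an
    arbitrary choice of form; any single form applied consistently
    would serve for comparison.

```python
import unicodedata


def canonical_key(text: str) -> str:
    """Form used for every comparison, lookup, and hash inside the app."""
    return unicodedata.normalize("NFC", text)


def canonically_equal(a: str, b: str) -> bool:
    """Equality that ignores differences between canonical equivalents."""
    return canonical_key(a) == canonical_key(b)


def replaced_by_normalization(text: str) -> list:
    """Characters that NFC replaces on their own (for example, singletons).

    If the application cares about these, it cannot rely on another
    process handing them back unchanged.
    """
    return [ch for ch in text if unicodedata.normalize("NFC", ch) != ch]


if __name__ == "__main__":
    print(canonically_equal("e\u0301", "\u00e9"))
    print(canonically_equal("\ufa19", "\u795e"))
    at_risk = replaced_by_normalization("\u795e\ufa19")
    print([f"U+{ord(ch):04X}" for ch in at_risk])
```

    The first helper covers the "does not maintain distinctions" case.
    The last one is for the other case: it flags the data whose
    distinction would need to be carried some other way if it matters.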

    --Ken

    >
    > Also, it would be nice to give the application user the flexibility
    > to turn the normalization process on or off. However, this may sound
    > useless, since the majority of those systems don't even care.
    >
    > Jane
    >



    This archive was generated by hypermail 2.1.5 : Mon Apr 28 2003 - 16:35:22 EDT