Re: Unicode Normalization on MS-Windows

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Apr 28 2003 - 15:31:59 EDT

Next message: Peter_Constable@sil.org: "RE: Private Use Area"

Previous message: John Cowan: "Re: Title Case (Was: [OT] multilingual support in MS products"
Maybe in reply to: Jane Liu: "Unicode Normalization on MS-Windows"
Next in thread: Doug Ewell: "Re: Private Use Area"

Jane Liu wrote:

> Thanks for sharing some backgrounds. Yes, the character "U+FA19" was
> originally from the JIS standard, level-3, IBM extensions and NEC
> selected extensions and it has been assigned two code points: 0xFB7E
> (IBM portion) and 0xEE62 (NEC portion) in the native encoding ...

i.e., in Code Page 932.

But in EUC-JP (proper), there is *no* encoding for this.

In the EUC-JP vendor extensions to interoperate with Code Page 932,
there is *one* encoding: 0xFB 0x7E.

In the IBM host encoding Code Page 930, there is *one*
encoding: (0x0E) 0x5F 0xD5 (0x0F).

And in GB 18030, U+FA19 converts to: 0x84 0x30 0x9B 0x39.

So even interoperating between systems without normalizing, you
have to be concerned about the retention or substitution of
this character. Most other systems cannot roundtrip the two
encodings of the character in Code Page 932. And if you have
a Windows system hooked up to an EUC-JP back end, U+FA19 might
or might not survive, depending on the level of that system
and the support or non-support for EUC-JP extensions.
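
A quick way to probe this kind of survival question is a round-trip
test per encoding. The following is only a sketch: it uses Python's
bundled codec tables as stand-ins for the platform converters, which
is an assumption, since a given OS or database converter may map
these characters differently. The helper name survives_round_trip
is hypothetical.

```python
def survives_round_trip(text: str, encoding: str) -> bool:
    """Return True if text encodes to `encoding` and decodes back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # The target encoding has no representation for some character.
        return False


ch = "\uFA19"  # CJK compatibility ideograph discussed in this thread
for enc in ("cp932", "euc_jp", "gb18030", "utf-8"):
    # Results depend on Python's codec tables, not on any vendor converter.
    print(enc, survives_round_trip(ch, enc))
```

A False result here means only that Python's table for that encoding
cannot carry the character; the actual back end still has to be
checked separately.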

>
> Correct me if I'm wrong, it seems to me, not only for this case,
> actually in general, neither Microsoft Windows nor those popular UNIX
> systems (AIX, Solaris, HP-UX) currently supply the explicit support
> of Unicode normalization at the encoding/conversion level. I suspect
> this would also apply to all major databases.

You have to be cautious here, too. Major databases may or may
not normalize Unicode data when their internal storage is
in Unicode. This may or may not be a user-definable setting.
The choice to normalize (for canonical equivalences) in databases
is often made for performance reasons, because optimizing comparisons
across table joins for non-normalized data can be very messy and
can kill database performance on queries.
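
To illustrate why non-normalized keys make joins messy, here is a
minimal sketch in Python rather than in any particular database. The
table contents and names (customers, orders, nfc) are made up for
illustration; a real database would do the equivalent work inside
its storage or collation layer.

```python
import unicodedata


def nfc(s: str) -> str:
    """Normalize to NFC so canonically equivalent keys compare equal."""
    return unicodedata.normalize("NFC", s)


# Hypothetical data: the same name stored two canonically equivalent ways.
customers = {"Zo\u00EB": 1}            # precomposed e-with-diaeresis
orders = [("Zoe\u0308", "book")]       # e + combining diaeresis

# A join on the raw code point sequences misses the match.
raw_matches = [(customers[name], item)
               for name, item in orders if name in customers]

# Normalizing once, at storage time, lets the join use plain equality.
index = {nfc(name): cid for name, cid in customers.items()}
normalized_matches = [(index[nfc(name)], item)
                      for name, item in orders if nfc(name) in index]

print(raw_matches)
print(normalized_matches)
```

Normalizing once on the way in keeps every later comparison a simple
binary one, which is the performance argument made above.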

> The bottom line would
> be "WYSIWYG = What You See Is What You Get", Right?

Nope. The point of most canonical equivalences is that a
typical end-user cannot tell the difference in what they get.
Canonical equivalences usually refer to two different sequences
for representing the "same" thing (where, in a few cases, as
for CJK compatibility characters, the *sequence* may consist
only of a single character).
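
As a small sketch of what this means in practice, Python's
unicodedata module (an assumption about tooling; any normalization
library would serve) shows both the usual multi-character case and
the single-character CJK compatibility case:

```python
import unicodedata

# Usual case: two different sequences for the "same" thing.
composed = "\u00E9"        # e with acute, one code point
decomposed = "e\u0301"     # e followed by combining acute accent
print(composed == decomposed)                                # raw comparison
print(unicodedata.normalize("NFC", decomposed) == composed)  # after NFC

# CJK compatibility case: the "sequence" is a single character that
# has a singleton canonical decomposition to the unified ideograph.
compat = "\uFA19"
unified = "\u795E"
print(unicodedata.normalize("NFD", compat) == unified)
print(unicodedata.normalize("NFC", compat) == unified)
```

Note that neither NFC nor NFD leaves the compatibility ideograph in
place, so any normalizing process along the way replaces it with the
unified character.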

You have picked out a particularly problematical subset of
the CJK compatibility characters, however. U+FA19 consists of
a user-visible and distinguishable variant of U+795E. From
the point of view of Han unification, the two forms are simply
variants of the *same* unified character. Only a requirement for roundtrip
convertibility with Code Page 932 resulted in separate encoding
for U+FA19. But in this case, as for others of the IBM 32, the
separate encoding of the variant was to make it visible and
usable by end users, distinct from U+795E. Some of the
discussion underway about finding ways to declare tailorings
of normalization is aimed precisely at enabling retention of these
variant distinctions for some CJK compatibility characters.

>
> If that's true, can we conclude that in order to maintain the
> transparency and round-trip safety between application and OS, the
> application should not use normalization?

Yes, but...

The problem often is that there is no clear boundary to "the
application". Applications these days are often distributed,
and parts may operate on different platforms. It may not be
easy to define what parts are or are not using normalization
of Unicode data.

The conformance requirement for the Unicode Standard is that
one process cannot *demand* that another process maintain a
distinction between canonically equivalent sequences. So even
if your application doesn't normalize, if you interact with
any other application (and the OS platforms and databases
also constitute complex applications, in their own ways), you
cannot guarantee that they will not normalize.

The defensive way to program is to write one's own application
in such a way that it does not maintain distinctions between
canonically equivalent sequences, or when it does, it does
not break if interoperating with some other process that
does not maintain such distinctions.
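
One way to sketch that defensive stance, assuming Python and its
unicodedata module; the function names canonically_equal and
check_echo are hypothetical and only illustrate the idea:

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring canonical-equivalence differences."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def check_echo(sent: str, received: str) -> str:
    """Classify what another process did to a string that was handed to it."""
    if received == sent:
        return "identical"     # code points preserved exactly
    if canonically_equal(sent, received):
        return "equivalent"    # peer normalized; this must not break us
    return "changed"           # a real difference worth reporting


# A peer that normalizes will hand back U+795E for U+FA19.
print(check_echo("\uFA19", "\u795E"))
```

An application written this way treats "equivalent" as success and
reserves its error handling for "changed".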

--Ken

>
> Also, it would be nice to give the flexibility of allowing the
> application user to switch the normalization process On/Off;
> however, this may sound useless since the majority of those systems
> don't even care.
>
> Jane
>


This archive was generated by hypermail 2.1.5 : Mon Apr 28 2003 - 16:35:22 EDT