Re: Unicode Incompatibilities on Windows 95/NT

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 05 1998 - 14:32:24 EST


Kazama-san reported:

>
> Recently, many Japanese programmers reported incompatibilities of
> Unicode used by Microsoft Windows 95/NT.
>
> As a result of my tests, I found that Microsoft uses its own encoding
> conversion scheme.

I'll let a Microsoft person speak to the detailed reasons for these
particular differences in mappings (which I believe have to do with various
legacy codepage mapping issues), but this is old news.

The mapping tables posted on www.unicode.org and available with
the standard have all of these differences down in machine-readable
form.

The table JIS0208.TXT (dated March 8, 1994) shows the preferred
mapping of JIS X 0208 characters to Unicode (in the absence of
legacy codepage issues).

The table CP932.TXT (dated April 14, 1996, provided by Microsoft)
shows the actual mapping of Microsoft Windows Code Page 932 to
Unicode.

>
> For example, "WAVE DASH" of JIS X 0208 is converted to "WAVE DASH"
> (U+301C) of Unicode ordinarily (Ex. JIS X 0221 = ISO/IEC 10646). But
> Windows 95/NT converts it to "FULLWIDTH TILDE" (U+FF5E).
>
> And "MINUS SIGN" of JIS X 0208 is converted to "MINUS SIGN" (U+2212)
> ordinarily. But Windows 95/NT converts it to "FULLWIDTH HYPHEN-MINUS"
> (U+FF0D).

These differences are easily derivable by comparing the two data
tables cited above.
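
As a rough sketch of that comparison, the following Python parses short
inline excerpts in the style of the two tables and lists the rows that
disagree. The column layouts are assumptions (JIS0208.TXT: Shift-JIS code,
JIS code, Unicode; CP932.TXT: code page code, Unicode), the three rows
shown are only the ones discussed in this thread plus one matching entry,
and the function names are made up for illustration.

```python
# Inline excerpts standing in for the real files. Column layout is an
# assumption: JIS0208.TXT = Shift-JIS, JIS X 0208, Unicode, comment;
# CP932.TXT = code page 932, Unicode, comment.
JIS0208_EXCERPT = """\
0x8140 0x2121 0x3000 # IDEOGRAPHIC SPACE
0x8160 0x2141 0x301C # WAVE DASH
0x817C 0x215D 0x2212 # MINUS SIGN
"""

CP932_EXCERPT = """\
0x8140 0x3000 # IDEOGRAPHIC SPACE
0x8160 0xFF5E # FULLWIDTH TILDE
0x817C 0xFF0D # FULLWIDTH HYPHEN-MINUS
"""


def parse_table(text, key_col, value_col):
    """Return {legacy code: Unicode code point} from a mapping excerpt."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        fields = line.split()
        table[int(fields[key_col], 16)] = int(fields[value_col], 16)
    return table


def diff_tables(a, b):
    """Return {code: (a's mapping, b's mapping)} where the two disagree."""
    return {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}


jis = parse_table(JIS0208_EXCERPT, key_col=0, value_col=2)
cp932 = parse_table(CP932_EXCERPT, key_col=0, value_col=1)
differences = diff_tables(jis, cp932)

for code, (u_jis, u_ms) in differences.items():
    print(f"0x{code:04X}: JIS0208.TXT U+{u_jis:04X}, CP932.TXT U+{u_ms:04X}")
```

Run against the full files instead of the excerpts, the same loop should
report every Shift-JIS code on which the two tables differ.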

>
> These differences in encoding conversion produce incompatibilities
> between different Unicode-based systems (Ex. Windows and Java).
>
> Microsoft may want to use the "Halfwidth and Fullwidth Forms" area. But
> Windows 95/NT are rich text systems, and fonts with appropriate glyph
> sizes and widths can easily be designed for them.

This comment ignores the fact that many of these JIS <==> DBCS
Asian vendor code page mapping issues already existed as legacy
issues for DOS and Windows-based code pages.

>
> Why does Microsoft use non-standard encoding conversions although they
> produce incompatibilities? Are these bugs?

No, I don't believe so--although I expect Microsoft will provide
more details.

Conversion between legacy character sets through Unicode must
always be done with care for the particular mismatches and
non-one-to-one conversions involved. This is especially
true of the large Asian character sets, which already had large
numbers of round-trip conversion problems. An obvious and well-known
example is the dual interpretation of 0x5C as either YEN SIGN or
REVERSE SOLIDUS ("backslash") in Japanese implementations, but
there are other problems involving tilde, macron, not sign, and
various halfwidth and fullwidth symbols.
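
To make the round-trip problem concrete, here is a small sketch that
checks a pair of conversion tables for Unicode characters that do not
survive Unicode -> legacy -> Unicode. The two tables are hypothetical toy
data chosen to mimic the 0x5C and 0x7E ambiguities; they are not taken
from any published mapping file.

```python
# Hypothetical toy tables for illustration only.
# Legacy byte -> Unicode: this converter reads 0x5C as REVERSE SOLIDUS
# and 0x7E as TILDE.
DECODE = {0x5C: 0x005C, 0x7E: 0x007E}

# Unicode -> legacy byte: both interpretations of each byte are accepted
# on the way in, so the mapping is many-to-one.
ENCODE = {
    0x005C: 0x5C,  # REVERSE SOLIDUS
    0x00A5: 0x5C,  # YEN SIGN
    0x007E: 0x7E,  # TILDE
    0x203E: 0x7E,  # OVERLINE
}


def round_trip_failures(encode, decode):
    """List (original, legacy byte, result) for code points that change."""
    failures = []
    for cp in sorted(encode):
        legacy = encode[cp]
        back = decode[legacy]
        if back != cp:
            failures.append((cp, legacy, back))
    return failures


failures = round_trip_failures(ENCODE, DECODE)
for cp, legacy, back in failures:
    print(f"U+{cp:04X} -> 0x{legacy:02X} -> U+{back:04X}")
```

Any many-to-one entry on the encoding side necessarily loses information,
so a check of this kind is worth running before trusting a conversion
path for round-trip use.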

--Ken Whistler, Technical Director, Unicode, Inc.

>
> Kazuhiro Kazama (kazama@ingrid.org) Ingrid Project
>


