RE: Unicode Incompatibilities on Windows 95/NT

From: Lori Brownell (loribr@microsoft.com)
Date: Tue Jan 06 1998 - 21:25:05 EST


You are comparing two different conversions: the one described in the table
named ShiftJIS.txt, provided by Glenn Adams, and the Microsoft code page 932
conversion described in the CP932.txt file. The CP932.txt file describes the
actual encoding and translation that occurs on Windows NT and Windows 9x.

All non-Unicode encodings require special handling when doing translation
and comparison between them. This is not particular to Japan. The same
issues would apply if you try to compare ISO 8859 to the European Windows
code pages. There are differences, although they are very close. Code page
932 is the Microsoft Japanese Windows encoding. It is not something that we
can simply redefine because of compatibility issues with existing data and
applications.

If you have Windows data, you can and should use the Windows conversion APIs
or tables to convert that data to Unicode or UTF-8 directly. Once in
Unicode there should be no problems.
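
As a minimal sketch of that advice, the Java snippet below decodes Windows
data with the code page 932 table and re-encodes it as UTF-8, then converts
it back with the same table. It assumes a JDK that ships the extended charset
registered as "windows-31j" (the JDK's name for Microsoft code page 932); the
class and method names are hypothetical, chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Cp932ToUtf8 {
    // Assumption: "windows-31j" is available in this JDK as the CP932 table.
    static final Charset CP932 = Charset.forName("windows-31j");

    /** Decodes CP932 bytes with the Microsoft table, then encodes them as UTF-8. */
    static byte[] toUtf8(byte[] cp932Bytes) {
        String text = new String(cp932Bytes, CP932);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Converts UTF-8 bytes back to CP932 using the same table. */
    static byte[] fromUtf8(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(CP932);
    }

    public static void main(String[] args) {
        // 0x81 0x60 is one of the code points discussed in this thread.
        byte[] original = { (byte) 0x81, (byte) 0x60 };
        byte[] roundTripped = fromUtf8(toUtf8(original));
        System.out.println("round trip preserved bytes: "
                + Arrays.equals(original, roundTripped));
    }
}
```

As long as the same table is used in both directions, the data should survive
the round trip unchanged; mismatches arise only when a different Shift-JIS
table is used on one side.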

Lori

> -----Original Message-----
> From: kazama@ingrid.org [SMTP:kazama@ingrid.org]
> Sent: Monday, January 05, 1998 7:56 PM
> To: Multiple Recipients of
> Subject: Re: Unicode Incompatibilities on Windows 95/NT
>
> Thank you, Kenneth.
>
> From: kenw@sybase.com (Kenneth Whistler)
> Subject: Re: Unicode Incompatibilities on Windows 95/NT
> Date: Mon, 5 Jan 1998 11:32:24 -0800
> > The table CP932.TXT (dated April 14, 1996, provided by Microsoft)
> > shows the actual mapping of Microsoft Windows Code Page 932 to
> > Unicode.
>
> I had mistakenly assumed that Microsoft's Cp 932 is upward-compatible
> with the Japanese Shift-JIS encoding.
>
> > This comment ignores the fact that many of these JIS <==> DBCS
> > Asian vendor code page mapping issues already existed as legacy
> > issues for DOS and Windows-based code pages.
>
> In fact, many vendors use their own Shift-JIS variants. But they
> only add characters in the areas that JIS left undefined.
>
> In Cp 932, by contrast, the character mapping is changed in the area
> that JIS defines.
>
> > Conversion between legacy character sets through Unicode must
> > always be done with care for the particular problems of mismatches
> > and/or non-one-to-one conversions required. This is especially
>
> There is a difference between the YEN SIGN problem and the Cp 932 problem.
> The YEN SIGN problem is Japanese-specific, but not
> vendor-specific. The Cp 932 problem is vendor-specific.
>
> And the Cp 932 problem produces incompatible UTF-8
> representations. I think this is a serious problem because UTF-8
> will be used in many web standards.
>
> For example, many Japanese users store their Japanese texts in the
> Shift-JIS encoding. If we use only characters in JIS X 0201, 0208 and
> 0212, there is no difference.
>
> But files converted to UTF-8 on Windows 95 are "not" equal to those
> converted on Java.
>
> And I think this isn't a collation issue. To illustrate, here is a
> sample Java program. This sample program treats "WAVE DASH" and
> "FULLWIDTH TILDE" as different characters in a Japanese locale.
>
> How do you think we Japanese should handle this problem?
>
> I think that the best solution is to support the JIS X 0221 standard
> mapping in Cp 932. Is there a good workaround?
>
> Kazuhiro Kazama (kazama@ingrid.org) Ingrid Project
> ----
> import java.text.*;
> import java.util.*;
>
> class MSCollation {
>     public static void main(String[] args) {
>         // Compare at primary strength in a Japanese locale.
>         Collator c = Collator.getInstance(Locale.JAPANESE);
>         c.setStrength(Collator.PRIMARY);
>
>         // U+301C WAVE DASH vs. U+FF5E FULLWIDTH TILDE
>         if (c.equals("\u301c", "\uff5e"))
>             System.out.println("\"WAVE DASH\" is equal to \"FULLWIDTH TILDE\".");
>         else
>             System.out.println("\"WAVE DASH\" is not equal to \"FULLWIDTH TILDE\".");
>
>         // U+2212 MINUS SIGN vs. U+FF0D FULLWIDTH HYPHEN-MINUS
>         if (c.equals("\u2212", "\uff0d"))
>             System.out.println("\"MINUS SIGN\" is equal to \"FULLWIDTH HYPHEN-MINUS\".");
>         else
>             System.out.println("\"MINUS SIGN\" is not equal to \"FULLWIDTH HYPHEN-MINUS\".");
>     }
> }
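
The UTF-8 mismatch described in the quoted message can be reproduced with a
short sketch that decodes the same two bytes with two different tables. It
assumes a JDK whose "Shift_JIS" charset follows the JIS-based mapping and
whose "windows-31j" charset follows the Microsoft code page 932 mapping; the
class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SjisVariantDiff {
    /** Decodes the bytes with the named charset and returns the UTF-8 form as hex. */
    static String utf8Hex(byte[] sjisBytes, String charsetName) {
        String text = new String(sjisBytes, Charset.forName(charsetName));
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The same input bytes, interpreted through two different tables.
        byte[] tildeLike = { (byte) 0x81, (byte) 0x60 };
        System.out.println("Shift_JIS   -> " + utf8Hex(tildeLike, "Shift_JIS"));
        System.out.println("windows-31j -> " + utf8Hex(tildeLike, "windows-31j"));
    }
}
```

Under those assumptions the first line should show the UTF-8 bytes of U+301C
WAVE DASH (E3809C) and the second those of U+FF5E FULLWIDTH TILDE (EFBD9E), so
a byte-for-byte comparison of the two UTF-8 files fails even though both came
from identical source bytes.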
