Re: Shift-JIS/Unicode mapping in JAVA

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 29 2003 - 11:11:05 EDT


    Don't use "Windows-31J": it is an encoding name alias that Microsoft does not use for its codepage 932, so it could cause problems with other compliant JVMs.

    It is better to use "CP932", which seems to be the canonical name used by Sun in its reference implementation, or "windows-932", which is documented in Microsoft's codepage documentation and is accepted by IBM's JVM and by other Java runtime libraries.
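
    Since the set of accepted names varies between JVMs, it may be safer to probe at runtime than to hard-code one alias. The following is only a minimal sketch using the standard java.nio.charset API; the class name CharsetProbe, the helper firstSupported and the candidate list are illustrative assumptions, not something prescribed by Sun, IBM or Microsoft.

```java
import java.nio.charset.Charset;

// Hypothetical helper: probes which encoding names the running JVM accepts.
class CharsetProbe {

    /** Returns the first candidate name this JVM supports, or null if none is. */
    static String firstSupported(String... candidates) {
        for (String name : candidates) {
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: skip it.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Candidate names taken from this discussion; the order is an assumption.
        String[] candidates = {"CP932", "windows-932", "Windows-31J", "Shift_JIS", "MS932"};
        for (String name : candidates) {
            if (Charset.isSupported(name)) {
                Charset cs = Charset.forName(name);
                // cs.name() is the canonical name chosen by this particular JVM.
                System.out.println(name + " -> " + cs.name() + " " + cs.aliases());
            } else {
                System.out.println(name + " -> not supported");
            }
        }
        System.out.println("First usable name: " + firstSupported(candidates));
    }
}
```

    Running main on each target JVM shows which of the names it resolves and what it reports as the canonical name, which may differ from one vendor or version to another.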

    There's an interesting comparison of encoding alias names in IBM's ICU reference docs, and even a runtime ICU table used to disambiguate alias names according to their usage context.
    Look at the "icu/source/data/mappings/convrtrs.txt" file in ICU's online CVS repository. It lists many aliases with their preferred usage.
    A small part of this file contains:

    # CJK encodings

    ibm-942_P12A-1999 { UTR22* } # ibm-942_P120 is a rarely used alternate mapping (sjis78 is already old)
                            ibm-942 { IBM* }
                            ibm-932 { IBM }
                            cp932
                            shift_jis78
                            sjis78
                            ibm-942_VSUB_VPUA
                            ibm-932_VSUB_VPUA
                            # Is this "JIS_C6226-1978"?

    ibm-943_P14A-1999 { UTR22* }
                            ibm-943 # Leave untagged because this isn't the default
                            Shift_JIS { IANA* MIME* WINDOWS JAVA }
                            MS_Kanji { IANA WINDOWS JAVA }
                            csShiftJIS { IANA WINDOWS JAVA }
                            windows-31j { IANA JAVA } # A further extension of Shift_JIS to include NEC special characters (Row 13)
                            csWindows31J { IANA WINDOWS JAVA } # A further extension of Shift_JIS to include NEC special characters (Row 13)
                            x-sjis { WINDOWS JAVA }
                            x-ms-cp932 { WINDOWS }
                            cp932 { WINDOWS }
                            windows-932 { WINDOWS* }
                            cp943c { JAVA* } # This is slightly different, but the backslash mapping is the same.
                            ms932
                            pck # Probably SOLARIS
                            sjis # This might be for ibm-1351
                            ibm-943_VSUB_VPUA
                            # cp943 # This isn't Windows, and no one else uses it.
                            # IANA says that Windows-31J is an extension to csshiftjis ibm-932

    (...)
    ibm-33722_P12A-1999 { UTR22* }
                            ibm-33722 # Leave untagged because this isn't the default
                            ibm-5050 # Leave untagged because this isn't the default, and yes this alias is correct
                            EUC-JP { IANA MIME* WINDOWS JAVA* }
                            Extended_UNIX_Code_Packed_Format_for_Japanese { IANA* WINDOWS JAVA }
                            csEUCPkdFmtJapanese { IANA WINDOWS JAVA }
                            X-EUC-JP { WINDOWS JAVA } # Japan EUC. x-euc-jp is a MIME name
                            eucjis { JAVA }
                            windows-51932 { WINDOWS* }
                            ibm-33722_VPUA
                            IBM-eucJP
    (...)
    # These were removed due to age, and they are rarely used.
    #(...)
    #ibm-942_P120-1999 { UTR22* }
    # #ibm-942 { IBM* }
    # ibm-942_VASCII_VSUB_VPUA
    # #ibm-932 { IBM }
    # ibm-932_VASCII_VSUB_VPUA # Old s_jis

    The preferred aliases for Java are marked with { JAVA* }, and other possible aliases for Java are marked with { JAVA } without the asterisk.
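
    To make the tag convention concrete, here is a small sketch that reads one alias line of the shape shown in the excerpt and reports which standards tag it and whether each tag is the preferred one. It only handles the "alias { TAG TAG* } # comment" form quoted above; the full convrtrs.txt syntax may be richer, and the class name AliasLine and its fields are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for a single alias line such as:
//   Shift_JIS { IANA* MIME* WINDOWS JAVA }
class AliasLine {
    final String alias;
    /** Maps each tag (e.g. "JAVA") to true when it carried the '*' preferred marker. */
    final Map<String, Boolean> tags;

    AliasLine(String alias, Map<String, Boolean> tags) {
        this.alias = alias;
        this.tags = tags;
    }

    /** Parses one line; returns null for blank or comment-only lines. */
    static AliasLine parse(String line) {
        int hash = line.indexOf('#');
        String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (body.isEmpty()) {
            return null;
        }
        Map<String, Boolean> tags = new LinkedHashMap<>();
        String alias = body;
        int open = body.indexOf('{');
        if (open >= 0) {
            alias = body.substring(0, open).trim();
            int close = body.indexOf('}', open);
            String inside = body.substring(open + 1, close >= 0 ? close : body.length()).trim();
            if (!inside.isEmpty()) {
                for (String tag : inside.split("\\s+")) {
                    boolean preferred = tag.endsWith("*");
                    String tagName = preferred ? tag.substring(0, tag.length() - 1) : tag;
                    tags.put(tagName, preferred);
                }
            }
        }
        return new AliasLine(alias, tags);
    }

    public static void main(String[] args) {
        AliasLine parsed = parse("Shift_JIS { IANA* MIME* WINDOWS JAVA }");
        System.out.println(parsed.alias + " " + parsed.tags);
    }
}
```

    With this representation, a tag that maps to true is the preferred name for that standard, a tag that maps to false is merely recognized, and an absent tag means the alias is not listed for that standard at all.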

    So "Shift_JIS" is the preferred alias for IANA and MIME, and a non-preferred but recognized alias for WINDOWS and JAVA.
    "x-ms-cp932" and "cp932" are used by Windows as aliases for Shift_JIS, but they do not designate the same encoding as the one used in Windows.

    "windows-51932" (the preferred name in Windows) is the character set which Java and MIME preferably designate as "EUC-JP" (this alias is also recognized but not preferred by IANA and Windows).

    So in Java, I would recommend using "Shift_JIS" as the base standard, and "EUC-JP" for the extension used in Windows codepage 932 in Windows 2000/XP (the 932 codepage is a placeholder in Windows, whose mapping to an effective encoding depends on the OS version, in a way similar to the "ANSI" and "OEM" codepages, which vary across systems).
    The newest 932 codepage is in fact codepage 51932, preferably named "EUC-JP" in Java.

    The oldest one is "Shift_JIS", which was mapped to codepage 932, but its usage is not recommended in newer versions of Windows because it conflicts: some documents created on Windows 95/98/ME or NT4 do not show the same characters on Windows 2000 and XP! Microsoft changed the mapping but kept the same codepage number, thinking it would be easier for users to migrate systems that use "932" in their batch files!
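
    One way to check whether two encoding names really behave alike in a given JVM is to decode the same bytes under both and compare the resulting code points. The sketch below does that with the standard String and Charset classes. The class name DecodeDiff is hypothetical, and the sample bytes 0x81 0x60 are only an assumption about a position where Shift_JIS variants are often said to disagree; substitute bytes from your own data.

```java
import java.nio.charset.Charset;

// Hypothetical helper: compares how two charsets decode the same bytes.
class DecodeDiff {

    static String decode(byte[] bytes, String charsetName) {
        // Malformed or unmappable input is replaced rather than rejected.
        return new String(bytes, Charset.forName(charsetName));
    }

    static boolean sameDecoding(byte[] bytes, String first, String second) {
        return decode(bytes, first).equals(decode(bytes, second));
    }

    /** Formats a string as space-separated U+XXXX code points. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0x81, (byte) 0x60}; // assumed sample, see note above
        String first = "Shift_JIS";
        String second = "windows-31j";
        if (Charset.isSupported(first) && Charset.isSupported(second)) {
            System.out.println(first + ": " + codePoints(decode(sample, first)));
            System.out.println(second + ": " + codePoints(decode(sample, second)));
            System.out.println("same decoding: " + sameDecoding(sample, first, second));
        } else {
            System.out.println("One of the charsets is not available in this JVM");
        }
    }
}
```

    The same comparison can be run over a whole file to find the positions where two mappings disagree before settling on a name for interchange.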

    Note the conflict above with the alias name "cp932": its use as an alias for "shift_jis78" exists only in ICU, not in IANA, Java, MIME or Windows. On Windows it now designates "Shift_JIS" (internally codepage 943 in Windows).

    "Windows-31J" also contains proprietary NEC extensions to Shift_JIS, but it is not strictly the encoding used in Windows codepage 51932. It is used only on NEC systems (including its OEM version of Windows, which handles it internally as codepage number 31) and is not recommended for data and application interchange or portability.

    -- Philippe.


