L2/01-192

Problems on Interoperability between Unicode and CJK Local Encodings

This page introduces problems in conversion between Unicode and CJK local encodings, mainly those affecting non-letter symbols.

EUC-JP round-trip compatibility

This is the easiest problem. By that I mean it is easy to see that a problem exists, not that it is easy to solve.

In the CJK world, a CES (Character Encoding Scheme) and a CCS (Coded Character Set) are distinct concepts: one CES may contain multiple CCSs. For example, EUC-JP is a CES that includes the CCSs ASCII and JIS X 0208 (and optionally JIS X 0201 Kana and JIS X 0212).

The Unicode Consortium supplies a conversion table from JIS X 0208 to Unicode (http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT). This table (version 0.9, 1994-03-08) maps 0x2140 in JIS X 0208 to U+005C (REVERSE SOLIDUS). This is fine when JIS X 0208 is used on its own, but in EUC-JP, where JIS X 0208 is combined with ASCII, it causes a code point conflict: ASCII 0x5C also maps to U+005C, so the two characters can no longer be told apart after conversion and the round trip is lost.

Implementing EUC-JP with JIS X 0212 introduces one more conflict: 0x2237 in JIS X 0212 is mapped to U+007E by http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0212.TXT, which collides with ASCII 0x7E.
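The conflict can be detected mechanically. The following Python fragment is a minimal sketch, assuming that each CCS of a CES is given as a dictionary from local code to UCS code point. The names EUC_JP_TABLES and find_collisions are made up for this illustration, and the dictionaries hold only the entries discussed above, not the full tables.

```python
# Tiny excerpts of the tables discussed above (illustration only).
EUC_JP_TABLES = {
    "ASCII": {0x5C: 0x005C, 0x7E: 0x007E},   # ASCII maps to itself
    "JIS X 0208": {0x2140: 0x005C},           # as in JIS0208.TXT version 0.9
}

def find_collisions(tables):
    """Return {ucs: [(ccs_name, local_code), ...]} for every UCS code point
    that is the target of more than one local character."""
    seen = {}
    for name, table in tables.items():
        for local, ucs in table.items():
            seen.setdefault(ucs, []).append((name, local))
    return {ucs: sources for ucs, sources in seen.items() if len(sources) > 1}

for ucs, sources in find_collisions(EUC_JP_TABLES).items():
    names = ", ".join(f"{name} 0x{local:X}" for name, local in sources)
    print(f"U+{ucs:04X} is the target of: {names}")
```

Any code point reported here cannot be converted back to a unique EUC-JP character, which is exactly the round-trip problem.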

Conversion tables differ between vendors

Many CESs (Character Encoding Schemes) share a common CCS (Coded Character Set). For example, EUC-JP, Shift_JIS, and CP932 all include JIS X 0208 as a CCS.

For these CESs, a character from the same CCS should be mapped to the same UCS character. However, this does not hold for dozens of characters.

The following table lists characters for which the same character in JIS X 0208 (or another CCS) is mapped to different code points by different conversion tables.

---------------------------------------------------------------------------------------------
ORIGINAL                      Converted** to U+????/EastAsianWidth
CCS     Shift_JIS* EUC-JP*    0208    SJIS    CP932   APPLE   0221A   0221B   JAVAA   JAVAB
---------------------------------------------------------------------------------------------
[ASCII]
0x5C    ----       0x5C       ----    ----    ----    ----    ----    005C/Na ----    005C/Na
0x7E    ----       0x7E       ----    ----    ----    ----    ----    007E/Na ----    007E/Na
[JISX0201 Roman]
0x5C    0x5C       ----       ----    00A5/Na 005C/Na 00A5/Na 00A5/Na ----    005C/Na 00A5/Na
0x7E    0x7E       ----       ----    203E/N  007E/Na 007E/Na 203E/N  ----    007E/Na 203E/N
[JISX0208]
0x2131  0x81 0x50  0xA1 0xB1  FFE3/F  FFE3/F  FFE3/F  FFE3/F  FFE3/F  203E/N  FFE3/F  FFE3/F
0x213D  0x81 0x5C  0xA1 0xBD  2015/A  2015/A  2015/A  2014/A  2014/A  2014/A  2015/A  2015/A
0x2140  0x81 0x5F  0xA1 0xC0  005C/Na 005C/Na FF3C/F  FF3C/F  005C/Na FF3C/F  FF3C/F  FF3C/F
0x2141  0x81 0x60  0xA1 0xC1  301C/W  301C/W  FF5E/F  301C/W  301C/W  301C/W  301C/W  301C/W
0x2142  0x81 0x61  0xA1 0xC2  2016/A  2016/A  2225/A  2016/A  2016/A  2016/A  2016/A  2016/A
0x215D  0x81 0x7C  0xA1 0xDD  2212/N  2212/N  FF0D/F  2212/N  2212/N  2212/N  2212/N  2212/N
0x216F  0x81 0x8F  0xA1 0xEF  FFE5/F  FFE5/F  FFE5/F  FFE5/F  FFE5/F  00A5/Na FFE5/F  FFE5/F
0x2171  0x81 0x91  0xA1 0xF1  00A2/Na 00A2/Na FFE0/F  00A2/Na 00A2/Na 00A2/Na 00A2/Na 00A2/Na
0x2172  0x81 0x92  0xA1 0xF2  00A3/Na 00A3/Na FFE1/F  00A3/Na 00A3/Na 00A3/Na 00A3/Na 00A3/Na
0x224C  0x81 0xCA  0xA2 0xCC  00AC/Na 00AC/Na FFE2/F  00AC/Na 00AC/Na 00AC/Na 00AC/Na 00AC/Na
[JISX0212]
0x2237  ----       0x8F,A2,B7 ----    ----    ----    ----    007E/Na FF5E/F  ----    ----
---------------------------------------------------------------------------------------------

Note 1 This table covers Japanese encodings only.

Note 2 This table does not contain vendors' extended characters (characters that are invalid in standard EUC-JP and Shift_JIS).

Note * Converted from ASCII, JISX0201 Roman, and JISX0208 algorithmically. The algorithm for EUC-JP is described in http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT. The algorithm to convert from JIS X 0208 to Shift_JIS is:

out1 = ((in1 - 1) >> 1) + ((in1 <= 0x5e) ? 0x71 : 0xb1);

out2 = in2 + ((in1 & 1) ? ((in2 < 0x60) ? 0x1f : 0x20) : 0x7e);

where in1 and in2 are the 1st and 2nd bytes of JIS X 0208 respectively, and out1 and out2 are the 1st and 2nd bytes of Shift_JIS. The Shift_JIS value is used as the source code for the "SJIS", "CP932", "Win98", and "Apple" conversions, because all of these (other than Shift_JIS itself) are supersets of Shift_JIS.
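As a cross-check, the same arithmetic can be written in Python and compared with the Shift_JIS and EUC-JP columns of the table. This is only a sketch: the function names are chosen for this illustration, the conditional is parenthesized explicitly, and the EUC-JP function assumes the usual rule of setting the high bit of each JIS X 0208 byte.

```python
def jisx0208_to_sjis(in1, in2):
    """Convert the two 7-bit bytes of a JIS X 0208 code to Shift_JIS bytes."""
    out1 = ((in1 - 1) >> 1) + (0x71 if in1 <= 0x5E else 0xB1)
    if in1 & 1:
        # Odd row: the trail byte skips 0x7F, hence the two offsets.
        out2 = in2 + (0x1F if in2 < 0x60 else 0x20)
    else:
        # Even row: the trail byte lands in the upper half.
        out2 = in2 + 0x7E
    return out1, out2

def jisx0208_to_eucjp(in1, in2):
    """Convert the two 7-bit bytes of a JIS X 0208 code to EUC-JP bytes."""
    return in1 | 0x80, in2 | 0x80

for code in (0x2131, 0x2140, 0x216F, 0x224C):
    in1, in2 = code >> 8, code & 0xFF
    s1, s2 = jisx0208_to_sjis(in1, in2)
    e1, e2 = jisx0208_to_eucjp(in1, in2)
    print(f"0x{code:04X}  0x{s1:02X} 0x{s2:02X}  0x{e1:02X} 0x{e2:02X}")
```

The four sample codes are rows of the table above, so the printed bytes can be compared with it directly.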

Note **

0208 = http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT (Version 0.9, 1994-03-08)
SJIS = http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT (Version 0.9, 1994-03-08)
CP932 = http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT (Version 2.01, 1998-04-15)
APPLE = http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT (1999-09-22)
0221A = JIS X 0221 annex 3 (JIS X 0201), from http://www.ingrid.org/java/i18n/unicode.html (downloaded 2001-04-13). JIS X 0221 is a Japanese national standard corresponding to ISO 10646.
0221B = JIS X 0221 annex 3 (ISO/IEC 646-IRV), from http://www.ingrid.org/java/i18n/unicode.html.
JAVAA = Java (SJIS & EUCJIS), from http://www.ingrid.org/java/i18n/unicode.html.
JAVAB = Java (JIS), from http://www.ingrid.org/java/i18n/unicode.html.

Thus, the same character in Japanese encodings is mapped to different Unicode characters depending on the conversion table. In particular, CP932, which has relatively many differences, is called Shift_JIS in Microsoft operating systems and is very widely used. This will cause serious problems as Unicode becomes more popular in Japan.
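The divergence is easy to observe with an existing converter. The following Python sketch decodes the same byte pairs with Python's bundled "shift_jis" and "cp932" codecs. These codecs use Python's own tables, which are not necessarily identical to SHIFTJIS.TXT or CP932.TXT, so this only illustrates the kind of difference shown in the table. The helper name decode_both is made up for this example.

```python
def decode_both(sjis_bytes):
    """Decode the same bytes with the 'shift_jis' and 'cp932' codecs."""
    return sjis_bytes.decode("shift_jis"), sjis_bytes.decode("cp932")

# Byte pairs taken from the Shift_JIS column of the table above.
SAMPLES = (b"\x81\x60", b"\x81\x61", b"\x81\x7c",
           b"\x81\x91", b"\x81\x92", b"\x81\xca")

for raw in SAMPLES:
    as_sjis, as_cp932 = decode_both(raw)
    verdict = "same" if as_sjis == as_cp932 else "differs"
    print(f"0x{raw.hex().upper()}: shift_jis U+{ord(as_sjis):04X}, "
          f"cp932 U+{ord(as_cp932):04X} ({verdict})")
```

Text decoded with one codec and re-encoded with the other may therefore fail or change, even though both are commonly called "Shift_JIS".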

Width problems

Computers have been used for many years in the CJK world, just as in Europe and America. Ideographs have occupied two columns in terminal-based software and hardware ever since CJK users began to process ideographs on computers. Thus, there are singlewidth or narrow ("Hankaku" in Japanese) characters and doublewidth or wide ("Zenkaku" in Japanese) characters. Although no official standard specifies the width of characters (at least in Japan), the concept of width is a very strong de-facto standard in the CJK world.

In CJK local encodings, it is very easy to tell whether a character is singlewidth or doublewidth: characters from ISO 646 (ASCII, JIS X 0201 Roman, and so on) and JIS X 0201 Kana are singlewidth, and all others are doublewidth. CJK users have relied on this de-facto standard for decades, and in my opinion this shows that it has no fatal problems. Thus, Unicode and its conversion tables are responsible for the problem I am going to explain below.

The Unicode Consortium supplies Unicode Standard Annex #11, East Asian Width (UAX#11, formerly UTR#11), in order to keep compatibility with the CJK de-facto standard. It classifies UCS characters into six categories: "N", "A", "H", "W", "F", and "Na".

To keep compatibility with the CJK de-facto standard, characters from ISO 646 (ASCII, JIS X 0201 Roman, and so on) and JIS X 0201 Kana must be "Na" or "H", and all other characters in CJK encodings must be "W", "F", or "A". In addition, any appearance of "N" for a character in a CJK encoding should be regarded as a bug of UAX#11.

I checked this with a script and found the following problems in EastAsianWidth.txt:

FILE JIS0208.TXT------

0x2140  U+005C  Na  # REVERSE SOLIDUS

0x215D  U+2212  N  # MINUS SIGN

0x2171  U+00A2  Na  # CENT SIGN

0x2172  U+00A3  Na  # POUND SIGN

0x224C  U+00AC  Na  # NOT SIGN

FILE JIS0212.TXT------

0x2234  U+00AF  Na  # MACRON

0x2237  U+007E  Na  # TILDE

0x2238  U+0384  N  # GREEK TONOS

0x2239  U+0385  N  # GREEK DIALYTIKA TONOS

0x2243  U+00A6  Na  # BROKEN BAR

0x226D  U+00A9  N  # COPYRIGHT SIGN

0x226E  U+00AE  N  # REGISTERED SIGN

0x2271  U+2116  N  # NUMERO SIGN

0x2661  U+0386  N  # GREEK CAPITAL LETTER ALPHA WITH TONOS

0x2662  U+0388  N  # GREEK CAPITAL LETTER EPSILON WITH TONOS

0x2663  U+0389  N  # GREEK CAPITAL LETTER ETA WITH TONOS

0x2664  U+038A  N  # GREEK CAPITAL LETTER IOTA WITH TONOS

0x2665  U+03AA  N  # GREEK CAPITAL LETTER IOTA WITH DIALYTIKA

0x2667  U+038C  N  # GREEK CAPITAL LETTER OMICRON WITH TONOS

0x2669  U+038E  N  # GREEK CAPITAL LETTER UPSILON WITH TONOS

0x266A  U+03AB  N  # GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA

0x266C  U+038F  N  # GREEK CAPITAL LETTER OMEGA WITH TONOS

0x2671  U+03AC  N  # GREEK SMALL LETTER ALPHA WITH TONOS

0x2672  U+03AD  N  # GREEK SMALL LETTER EPSILON WITH TONOS

0x2673  U+03AE  N  # GREEK SMALL LETTER ETA WITH TONOS

0x2674  U+03AF  N  # GREEK SMALL LETTER IOTA WITH TONOS

0x2675  U+03CA  N  # GREEK SMALL LETTER IOTA WITH DIALYTIKA

0x2676  U+0390  N  # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS

0x2677  U+03CC  N  # GREEK SMALL LETTER OMICRON WITH TONOS

0x2678  U+03C2  N  # GREEK SMALL LETTER FINAL SIGMA

0x2679  U+03CD  N  # GREEK SMALL LETTER UPSILON WITH TONOS

0x267A  U+03CB  N  # GREEK SMALL LETTER UPSILON WITH DIALYTIKA

0x267B  U+03B0  N  # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS

0x267C  U+03CE  N  # GREEK SMALL LETTER OMEGA WITH TONOS

0x2742  U+0402  N  # CYRILLIC CAPITAL LETTER DJE

0x2743  U+0403  N  # CYRILLIC CAPITAL LETTER GJE

0x2744  U+0404  N  # CYRILLIC CAPITAL LETTER UKRAINIAN IE

0x2745  U+0405  N  # CYRILLIC CAPITAL LETTER DZE

0x2746  U+0406  N  # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I

0x2747  U+0407  N  # CYRILLIC CAPITAL LETTER YI

0x2748  U+0408  N  # CYRILLIC CAPITAL LETTER JE

0x2749  U+0409  N  # CYRILLIC CAPITAL LETTER LJE

0x274A  U+040A  N  # CYRILLIC CAPITAL LETTER NJE

0x274B  U+040B  N  # CYRILLIC CAPITAL LETTER TSHE

0x274C  U+040C  N  # CYRILLIC CAPITAL LETTER KJE

0x274D  U+040E  N  # CYRILLIC CAPITAL LETTER SHORT U

0x274E  U+040F  N  # CYRILLIC CAPITAL LETTER DZHE

0x2772  U+0452  N  # CYRILLIC SMALL LETTER DJE

0x2773  U+0453  N  # CYRILLIC SMALL LETTER GJE

0x2774  U+0454  N  # CYRILLIC SMALL LETTER UKRAINIAN IE

0x2775  U+0455  N  # CYRILLIC SMALL LETTER DZE

0x2776  U+0456  N  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I

0x2777  U+0457  N  # CYRILLIC SMALL LETTER YI

0x2778  U+0458  N  # CYRILLIC SMALL LETTER JE

0x2779  U+0459  N  # CYRILLIC SMALL LETTER LJE

0x277A  U+045A  N  # CYRILLIC SMALL LETTER NJE

0x277B  U+045B  N  # CYRILLIC SMALL LETTER TSHE

0x277C  U+045C  N  # CYRILLIC SMALL LETTER KJE

0x277D  U+045E  N  # CYRILLIC SMALL LETTER SHORT U

0x277E  U+045F  N  # CYRILLIC SMALL LETTER DZHE

0x2922  U+0110  N  # LATIN CAPITAL LETTER D WITH STROKE

0x294B  U+014B  N  # LATIN SMALL LETTER ENG

0x2A21  U+00C1  N  # LATIN CAPITAL LETTER A WITH ACUTE

0x2A22  U+00C0  N  # LATIN CAPITAL LETTER A WITH GRAVE

0x2A23  U+00C4  N  # LATIN CAPITAL LETTER A WITH DIAERESIS

0x2A24  U+00C2  N  # LATIN CAPITAL LETTER A WITH CIRCUMFLEX

0x2A25  U+0102  N  # LATIN CAPITAL LETTER A WITH BREVE

0x2A26  U+01CD  N  # LATIN CAPITAL LETTER A WITH CARON

0x2A27  U+0100  N  # LATIN CAPITAL LETTER A WITH MACRON

0x2A28  U+0104  N  # LATIN CAPITAL LETTER A WITH OGONEK

0x2A29  U+00C5  N  # LATIN CAPITAL LETTER A WITH RING ABOVE

0x2A2A  U+00C3  N  # LATIN CAPITAL LETTER A WITH TILDE

0x2A2B  U+0106  N  # LATIN CAPITAL LETTER C WITH ACUTE

0x2A2C  U+0108  N  # LATIN CAPITAL LETTER C WITH CIRCUMFLEX

0x2A2D  U+010C  N  # LATIN CAPITAL LETTER C WITH CARON

0x2A2E  U+00C7  N  # LATIN CAPITAL LETTER C WITH CEDILLA

0x2A2F  U+010A  N  # LATIN CAPITAL LETTER C WITH DOT ABOVE

0x2A30  U+010E  N  # LATIN CAPITAL LETTER D WITH CARON

0x2A31  U+00C9  N  # LATIN CAPITAL LETTER E WITH ACUTE

0x2A32  U+00C8  N  # LATIN CAPITAL LETTER E WITH GRAVE

0x2A33  U+00CB  N  # LATIN CAPITAL LETTER E WITH DIAERESIS

0x2A34  U+00CA  N  # LATIN CAPITAL LETTER E WITH CIRCUMFLEX

0x2A35  U+011A  N  # LATIN CAPITAL LETTER E WITH CARON

0x2A36  U+0116  N  # LATIN CAPITAL LETTER E WITH DOT ABOVE

0x2A37  U+0112  N  # LATIN CAPITAL LETTER E WITH MACRON

0x2A38  U+0118  N  # LATIN CAPITAL LETTER E WITH OGONEK

0x2A3A  U+011C  N  # LATIN CAPITAL LETTER G WITH CIRCUMFLEX

0x2A3B  U+011E  N  # LATIN CAPITAL LETTER G WITH BREVE

0x2A3C  U+0122  N  # LATIN CAPITAL LETTER G WITH CEDILLA

0x2A3D  U+0120  N  # LATIN CAPITAL LETTER G WITH DOT ABOVE

0x2A3E  U+0124  N  # LATIN CAPITAL LETTER H WITH CIRCUMFLEX

0x2A3F  U+00CD  N  # LATIN CAPITAL LETTER I WITH ACUTE

0x2A40  U+00CC  N  # LATIN CAPITAL LETTER I WITH GRAVE

0x2A41  U+00CF  N  # LATIN CAPITAL LETTER I WITH DIAERESIS

0x2A42  U+00CE  N  # LATIN CAPITAL LETTER I WITH CIRCUMFLEX

0x2A43  U+01CF  N  # LATIN CAPITAL LETTER I WITH CARON

0x2A44  U+0130  N  # LATIN CAPITAL LETTER I WITH DOT ABOVE

0x2A45  U+012A  N  # LATIN CAPITAL LETTER I WITH MACRON

0x2A46  U+012E  N  # LATIN CAPITAL LETTER I WITH OGONEK

0x2A47  U+0128  N  # LATIN CAPITAL LETTER I WITH TILDE

0x2A48  U+0134  N  # LATIN CAPITAL LETTER J WITH CIRCUMFLEX

0x2A49  U+0136  N  # LATIN CAPITAL LETTER K WITH CEDILLA

0x2A4A  U+0139  N  # LATIN CAPITAL LETTER L WITH ACUTE

0x2A4B  U+013D  N  # LATIN CAPITAL LETTER L WITH CARON

0x2A4C  U+013B  N  # LATIN CAPITAL LETTER L WITH CEDILLA

0x2A4D  U+0143  N  # LATIN CAPITAL LETTER N WITH ACUTE

0x2A4E  U+0147  N  # LATIN CAPITAL LETTER N WITH CARON

0x2A4F  U+0145  N  # LATIN CAPITAL LETTER N WITH CEDILLA

0x2A50  U+00D1  N  # LATIN CAPITAL LETTER N WITH TILDE

0x2A51  U+00D3  N  # LATIN CAPITAL LETTER O WITH ACUTE

0x2A52  U+00D2  N  # LATIN CAPITAL LETTER O WITH GRAVE

0x2A53  U+00D6  N  # LATIN CAPITAL LETTER O WITH DIAERESIS

0x2A54  U+00D4  N  # LATIN CAPITAL LETTER O WITH CIRCUMFLEX

0x2A55  U+01D1  N  # LATIN CAPITAL LETTER O WITH CARON

0x2A56  U+0150  N  # LATIN CAPITAL LETTER O WITH DOUBLE ACUTE

0x2A57  U+014C  N  # LATIN CAPITAL LETTER O WITH MACRON

0x2A58  U+00D5  N  # LATIN CAPITAL LETTER O WITH TILDE

0x2A59  U+0154  N  # LATIN CAPITAL LETTER R WITH ACUTE

0x2A5A  U+0158  N  # LATIN CAPITAL LETTER R WITH CARON

0x2A5B  U+0156  N  # LATIN CAPITAL LETTER R WITH CEDILLA

0x2A5C  U+015A  N  # LATIN CAPITAL LETTER S WITH ACUTE

0x2A5D  U+015C  N  # LATIN CAPITAL LETTER S WITH CIRCUMFLEX

0x2A5E  U+0160  N  # LATIN CAPITAL LETTER S WITH CARON

0x2A5F  U+015E  N  # LATIN CAPITAL LETTER S WITH CEDILLA

0x2A60  U+0164  N  # LATIN CAPITAL LETTER T WITH CARON

0x2A61  U+0162  N  # LATIN CAPITAL LETTER T WITH CEDILLA

0x2A62  U+00DA  N  # LATIN CAPITAL LETTER U WITH ACUTE

0x2A63  U+00D9  N  # LATIN CAPITAL LETTER U WITH GRAVE

0x2A64  U+00DC  N  # LATIN CAPITAL LETTER U WITH DIAERESIS

0x2A65  U+00DB  N  # LATIN CAPITAL LETTER U WITH CIRCUMFLEX

0x2A66  U+016C  N  # LATIN CAPITAL LETTER U WITH BREVE

0x2A67  U+01D3  N  # LATIN CAPITAL LETTER U WITH CARON

0x2A68  U+0170  N  # LATIN CAPITAL LETTER U WITH DOUBLE ACUTE

0x2A69  U+016A  N  # LATIN CAPITAL LETTER U WITH MACRON

0x2A6A  U+0172  N  # LATIN CAPITAL LETTER U WITH OGONEK

0x2A6B  U+016E  N  # LATIN CAPITAL LETTER U WITH RING ABOVE

0x2A6C  U+0168  N  # LATIN CAPITAL LETTER U WITH TILDE

0x2A6D  U+01D7  N  # LATIN CAPITAL LETTER U WITH DIAERESIS AND ACUTE

0x2A6E  U+01DB  N  # LATIN CAPITAL LETTER U WITH DIAERESIS AND GRAVE

0x2A6F  U+01D9  N  # LATIN CAPITAL LETTER U WITH DIAERESIS AND CARON

0x2A70  U+01D5  N  # LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON

0x2A71  U+0174  N  # LATIN CAPITAL LETTER W WITH CIRCUMFLEX

0x2A72  U+00DD  N  # LATIN CAPITAL LETTER Y WITH ACUTE

0x2A73  U+0178  N  # LATIN CAPITAL LETTER Y WITH DIAERESIS

0x2A74  U+0176  N  # LATIN CAPITAL LETTER Y WITH CIRCUMFLEX

0x2A75  U+0179  N  # LATIN CAPITAL LETTER Z WITH ACUTE

0x2A76  U+017D  N  # LATIN CAPITAL LETTER Z WITH CARON

0x2A77  U+017B  N  # LATIN CAPITAL LETTER Z WITH DOT ABOVE

0x2B23  U+00E4  N  # LATIN SMALL LETTER A WITH DIAERESIS

0x2B24  U+00E2  N  # LATIN SMALL LETTER A WITH CIRCUMFLEX

0x2B25  U+0103  N  # LATIN SMALL LETTER A WITH BREVE

0x2B28  U+0105  N  # LATIN SMALL LETTER A WITH OGONEK

0x2B29  U+00E5  N  # LATIN SMALL LETTER A WITH RING ABOVE

0x2B2A  U+00E3  N  # LATIN SMALL LETTER A WITH TILDE

0x2B2B  U+0107  N  # LATIN SMALL LETTER C WITH ACUTE

0x2B2C  U+0109  N  # LATIN SMALL LETTER C WITH CIRCUMFLEX

0x2B2D  U+010D  N  # LATIN SMALL LETTER C WITH CARON

0x2B2E  U+00E7  N  # LATIN SMALL LETTER C WITH CEDILLA

0x2B2F  U+010B  N  # LATIN SMALL LETTER C WITH DOT ABOVE

0x2B30  U+010F  N  # LATIN SMALL LETTER D WITH CARON

0x2B33  U+00EB  N  # LATIN SMALL LETTER E WITH DIAERESIS

0x2B36  U+0117  N  # LATIN SMALL LETTER E WITH DOT ABOVE

0x2B38  U+0119  N  # LATIN SMALL LETTER E WITH OGONEK

0x2B39  U+01F5  N  # LATIN SMALL LETTER G WITH ACUTE

0x2B3A  U+011D  N  # LATIN SMALL LETTER G WITH CIRCUMFLEX

0x2B3B  U+011F  N  # LATIN SMALL LETTER G WITH BREVE

0x2B3D  U+0121  N  # LATIN SMALL LETTER G WITH DOT ABOVE

0x2B3E  U+0125  N  # LATIN SMALL LETTER H WITH CIRCUMFLEX

0x2B41  U+00EF  N  # LATIN SMALL LETTER I WITH DIAERESIS

0x2B42  U+00EE  N  # LATIN SMALL LETTER I WITH CIRCUMFLEX

0x2B46  U+012F  N  # LATIN SMALL LETTER I WITH OGONEK

0x2B47  U+0129  N  # LATIN SMALL LETTER I WITH TILDE

0x2B48  U+0135  N  # LATIN SMALL LETTER J WITH CIRCUMFLEX

0x2B49  U+0137  N  # LATIN SMALL LETTER K WITH CEDILLA

0x2B4A  U+013A  N  # LATIN SMALL LETTER L WITH ACUTE

0x2B4B  U+013E  N  # LATIN SMALL LETTER L WITH CARON

0x2B4C  U+013C  N  # LATIN SMALL LETTER L WITH CEDILLA

0x2B4F  U+0146  N  # LATIN SMALL LETTER N WITH CEDILLA

0x2B50  U+00F1  N  # LATIN SMALL LETTER N WITH TILDE

0x2B53  U+00F6  N  # LATIN SMALL LETTER O WITH DIAERESIS

0x2B54  U+00F4  N  # LATIN SMALL LETTER O WITH CIRCUMFLEX

0x2B56  U+0151  N  # LATIN SMALL LETTER O WITH DOUBLE ACUTE

0x2B58  U+00F5  N  # LATIN SMALL LETTER O WITH TILDE

0x2B59  U+0155  N  # LATIN SMALL LETTER R WITH ACUTE

0x2B5A  U+0159  N  # LATIN SMALL LETTER R WITH CARON

0x2B5B  U+0157  N  # LATIN SMALL LETTER R WITH CEDILLA

0x2B5C  U+015B  N  # LATIN SMALL LETTER S WITH ACUTE

0x2B5D  U+015D  N  # LATIN SMALL LETTER S WITH CIRCUMFLEX

0x2B5E  U+0161  N  # LATIN SMALL LETTER S WITH CARON

0x2B5F  U+015F  N  # LATIN SMALL LETTER S WITH CEDILLA

0x2B60  U+0165  N  # LATIN SMALL LETTER T WITH CARON

0x2B61  U+0163  N  # LATIN SMALL LETTER T WITH CEDILLA

0x2B65  U+00FB  N  # LATIN SMALL LETTER U WITH CIRCUMFLEX

0x2B66  U+016D  N  # LATIN SMALL LETTER U WITH BREVE

0x2B68  U+0171  N  # LATIN SMALL LETTER U WITH DOUBLE ACUTE

0x2B6A  U+0173  N  # LATIN SMALL LETTER U WITH OGONEK

0x2B6B  U+016F  N  # LATIN SMALL LETTER U WITH RING ABOVE

0x2B6C  U+0169  N  # LATIN SMALL LETTER U WITH TILDE

0x2B71  U+0175  N  # LATIN SMALL LETTER W WITH CIRCUMFLEX

0x2B72  U+00FD  N  # LATIN SMALL LETTER Y WITH ACUTE

0x2B73  U+00FF  N  # LATIN SMALL LETTER Y WITH DIAERESIS

0x2B74  U+0177  N  # LATIN SMALL LETTER Y WITH CIRCUMFLEX

0x2B75  U+017A  N  # LATIN SMALL LETTER Z WITH ACUTE

0x2B76  U+017E  N  # LATIN SMALL LETTER Z WITH CARON

0x2B77  U+017C  N  # LATIN SMALL LETTER Z WITH DOT ABOVE

FILE SHIFTJIS.TXT------

0x7E  U+203E  N  # OVERLINE

0x815F  U+005C  Na  # REVERSE SOLIDUS

0x817C  U+2212  N  # MINUS SIGN

0x8191  U+00A2  Na  # CENT SIGN

0x8192  U+00A3  Na  # POUND SIGN

0x81CA  U+00AC  Na  # NOT SIGN

FILE CP932.TXT------

0x8782  U+2116  N  #NUMERO SIGN

0xFA59  U+2116  N  #NUMERO SIGN

FILE JAPANESE.TXT------

FILE GB2312.TXT------

0x216D  U+2116  N  # NUMERO SIGN

FILE CHINSIMP.TXT------

FILE BIG5.TXT------

0xA145  U+2022  N  # BULLET

0xA14E  U+FF64  H  # HALFWIDTH IDEOGRAPHIC COMMA

0xA1C2  U+203E  N  # OVERLINE

0xA1F2  U+2641  N  # EARTH

0xA244  U+00A5  Na  # YEN SIGN

0xA246  U+00A2  Na  # CENT SIGN

0xA247  U+00A3  Na  # POUND SIGN

FILE CHINTRAD.TXT------

FILE KSX1001.TXT------

FILE KOREAN.TXT------

The script is as follows:

#!/usr/bin/perl
# Report mapping-table entries whose East Asian Width class disagrees with
# the de-facto width of the local code (one byte: single, otherwise double).

# Read the width class of every code point that is listed individually.
open(FILE, "EastAsianWidth.txt") || die "Cannot open width file.";
while($a = <FILE>) {
        if ($a !~ /^([0-9A-F]+);([A-Za-z]+)/) {next;}
        $width{$1} = $2;
}
close(FILE);

sub checkfile($$$$) {
        my($file, $localcolumn, $ucscolumn, $commentcolumn)=@_;
        open(FILE, $file) || die "Cannot open $file";
        print "FILE $file------\n";
        while($a = <FILE>) {
            if ($a =~ /^\#/) {next;}
            chomp($a);
            @list = split(/\t/, $a);
            $loc = $list[$localcolumn];
            $ucs = $list[$ucscolumn];
            # The fields are strings such as "0x005C"; hex() yields their value.
            # Skip control characters (and entries without a mapping).
            $u = hex($ucs);
            if ($u < 0x20 || ($u >= 0x7f && $u <= 0x9f)) {next;}
            $ucs =~ s/0x//;
            $width = $width{$ucs};
            $com = $list[$commentcolumn];
            if (hex($loc) < 0x100 &&
               ($width eq "W" || $width eq "F" || $width eq "A" || $width eq "N")) {
               # Single-byte code mapped to a character that is not narrow.
               print "$loc  U+$ucs  $width  $com\n";
            } elsif (hex($loc) > 0x100 &&
               ($width eq "N" || $width eq "H" || $width eq "Na")) {
               # Multibyte code mapped to a character that is not wide.
               print "$loc  U+$ucs  $width  $com\n";
            }
        }
        close(FILE);
}

# Arguments: file name, then the columns of the local code, UCS, and comment.
&checkfile("JIS0208.TXT", 1, 2, 3);
&checkfile("JIS0212.TXT", 0, 1, 2);
&checkfile("SHIFTJIS.TXT", 0, 1, 2);
&checkfile("CP932.TXT", 0, 1, 2);
&checkfile("JAPANESE.TXT", 0, 1, 2);
&checkfile("GB2312.TXT", 0, 1, 2);
&checkfile("CHINSIMP.TXT", 0, 1, 2);
&checkfile("BIG5.TXT", 0, 1, 2);
&checkfile("CHINTRAD.TXT", 0, 1, 2);
&checkfile("KSX1001.TXT", 0, 1, 2);
&checkfile("KOREAN.TXT", 0, 1, 2);

Note that this research is limited: only conversion tables from the Unicode Consortium were examined.

This result can be regarded as a bug either of UAX#11 or of the conversion tables. In some cases, the problem can be fixed by modifying UAX#11 alone, as follows:

U+2212 MINUS SIGN (−) "N" -> "A" (0x215D in JIS0208.TXT)
U+00A2 CENT SIGN (¢) "Na" -> "A" (0x2171 in JIS0208.TXT)
U+00A3 POUND SIGN (£) "Na" -> "A" (0x2172 in JIS0208.TXT)
U+00AC NOT SIGN (¬) "Na" -> "A" (0x224C in JIS0208.TXT)
U+00AF MACRON (¯) "Na" -> "A" (0x2234 in JIS0212.TXT)
U+0384 GREEK TONOS (΄) "N" -> "A" (0x2238 in JIS0212.TXT)
U+0385 GREEK DIALYTIKA TONOS (΅) "N" -> "A" (0x2239 in JIS0212.TXT)
U+00A6 BROKEN BAR (¦) "Na" -> "A" (0x2243 in JIS0212.TXT)
U+00AE REGISTERED SIGN (®) "N" -> "A" (0x226E in JIS0212.TXT)
U+2116 NUMERO SIGN (№) "N" -> "A" (0x8782 in CP932.TXT)
U+2022 BULLET (•) "N" -> "A" (0xA145 in BIG5.TXT)
U+203E OVERLINE (‾) "N" -> "A" (0xA1C2 in BIG5.TXT)
U+2641 EARTH (♁) "N" -> "A" (0xA1F2 in BIG5.TXT)

However, for U+005C REVERSE SOLIDUS (\), UAX#11 cannot be modified to satisfy all the encodings I tested, because some tables (such as JIS0208.TXT and SHIFTJIS.TXT) need U+005C to be doublewidth while others (such as CP932.TXT) need it to be singlewidth. As a standard, Unicode could classify U+005C as "A". However, some software will treat "A" characters as doublewidth when compatibility is needed. Thus, the only solution is to modify the conversion tables. I think the most moderate solution is to classify U+005C as "Na" and to modify JIS0208.TXT, SHIFTJIS.TXT, and JIS X 0221 so that they convert JIS X 0208 0x2140 to U+FF3C (＼).

Other similar problematic characters are:

U+203E OVERLINE (‾) singlewidth in SHIFTJIS.TXT, doublewidth in JIS X 0221.
U+00A5 YEN SIGN (¥) singlewidth in SHIFTJIS.TXT, doublewidth in BIG5.TXT.
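Characters of this kind can be found by collecting, for each UCS code point, the widths that the different tables require. The following Python sketch does this for the three characters discussed above. The entries are copied from the table and the script output in this document, the names REQUIREMENTS and conflicting_code_points are made up for this illustration, and treating every local code below 0x100 as singlewidth is the same simplification that the script uses.

```python
# (table, local code, UCS code point), taken from the discussion above.
REQUIREMENTS = [
    ("JIS0208.TXT",  0x2140, 0x005C),
    ("SHIFTJIS.TXT", 0x815F, 0x005C),
    ("CP932.TXT",    0x5C,   0x005C),
    ("SHIFTJIS.TXT", 0x7E,   0x203E),
    ("JIS X 0221",   0x2131, 0x203E),
    ("SHIFTJIS.TXT", 0x5C,   0x00A5),
    ("BIG5.TXT",     0xA244, 0x00A5),
]

def conflicting_code_points(requirements):
    """Return the UCS code points required to be both single and double width."""
    needed = {}
    for _table, local, ucs in requirements:
        width = "single" if local < 0x100 else "double"
        needed.setdefault(ucs, set()).add(width)
    return sorted(ucs for ucs, widths in needed.items() if len(widths) == 2)

for ucs in conflicting_code_points(REQUIREMENTS):
    print(f"U+{ucs:04X} cannot be given a single width class")
```

For any code point reported here, no change to UAX#11 alone can help; at least one conversion table has to change.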

I think the line

0xA14E  U+FF64  H  # HALFWIDTH IDEOGRAPHIC COMMA

in BIG5.TXT is a bug. 0xA14E must be doublewidth while U+FF64 must be "H". Thus, the conversion table BIG5.TXT must be modified.

JIS X 0213

The Unicode Consortium has not yet released a conversion table for JIS X 0213. Since this new Japanese national standard includes many non-letter symbols, new examples of these problems will appear.

I expect that many more non-letter symbols will need to be classified as "A" in UAX#11.

ASCII and JIS X 0201 Roman

When converting EUC-JP and Shift_JIS, the handling of 0x5c and 0x7e can be a problem. Since both encodings have a long history and Japanese users have a lot of experience in handling them, I will introduce that practice here.

The solution is very simple: just regard YEN SIGN and REVERSE SOLIDUS as different glyphs of the same character. Then the distinction between ASCII and JIS X 0201 Roman can be neglected.

Thus, when a Japanese person says "Shift_JIS", it almost always means "CP932". (Most Japanese people know nothing about encodings; a certain number of them [Windows and Macintosh users] know the word "Shift_JIS" as the only usable encoding.)

Please don't blame Japanese people who are not aware of the distinction between Shift_JIS and CP932. The only difference between Shift_JIS and CP932 used to be that CP932 has extension characters. It is the introduction of Unicode, and conversion to and from it, that brought a confusing incompatibility of non-letter symbols between Shift_JIS and CP932.

The reason I wrote that "Shift_JIS" almost always means "CP932" is the following. For example, DOS/Windows programmers write YEN SIGN + "n" to mean a newline (in Shift_JIS or, strictly speaking, CP932). DOS/Windows also uses YEN SIGN (0x5c) as the directory name separator. This is why Microsoft cannot convert 0x5c in CP932 to any character other than U+005C.

Not only Windows users but also UNIX users have regarded 0x5c in Shift_JIS as an ambiguous character that is both YEN SIGN and REVERSE SOLIDUS. For example, popular Japanese encoding converters such as nkf and qkc do not distinguish between ASCII and JIS X 0201 Roman. When I use TeraTerm, a telnet/ssh client for Windows, and see a YEN SIGN, I often read it as a REVERSE SOLIDUS according to the context. (When the writer is Japanese, it means YEN SIGN in most cases. When the writer is not Japanese, it always means REVERSE SOLIDUS.)

Thus, I don't complain if 0x5c in Shift_JIS is mapped to U+005C. Rather, distinguishing them (i.e., being strict about the official standards) might confuse many Japanese people.
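For software that receives text already converted with a strict table, the practice described in this section can be approximated by folding the JIS X 0201 Roman variants onto ASCII. The following Python fragment is only a sketch of that idea. The names JIS_ROMAN_FOLD and fold_jis_roman are made up for this illustration, and folding OVERLINE onto TILDE for 0x7e is an assumption made by analogy with 0x5c.

```python
# Characters that JIS X 0201 Roman has in place of ASCII 0x5C and 0x7E.
# Including OVERLINE is an assumption of this sketch.
JIS_ROMAN_FOLD = {
    0x00A5: 0x005C,  # YEN SIGN -> REVERSE SOLIDUS
    0x203E: 0x007E,  # OVERLINE -> TILDE
}

def fold_jis_roman(text):
    """Return text with the JIS X 0201 Roman variants replaced by ASCII."""
    return text.translate(JIS_ROMAN_FOLD)

# A Windows path written with YEN SIGN as the directory separator.
print(fold_jis_roman("C:\u00a5temp\u00a5readme.txt"))
```

Such folding loses the distinction on purpose, so it suits contexts like path names and escape sequences rather than text where a currency sign is meant.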

Tomohiro KUBOTA <kubota@debian.org>