L2/00-403

From: Kenneth Whistler [kenw@sybase.com]
Sent: Monday, June 19, 2000 7:42 PM
Subject: GB 18030-2000 mappings, anyone?

Unicadetti,

Is anyone working on, or does anyone have in hand, a mapping table yet for the new Chinese standard, GB 18030-2000? If so, this is something that would be quite useful to post up to the Unicode website for mapping tables, and information that needs to be added to the Unihan.txt mother-of-all-Han-databases.

Incidentally, for those of you who have not yet seen GB 18030-2000, here is the strange and wondrous story of what we will have to be supporting.

--Ken

Strange and Wondrous Story of GB 18030-2000

GB 18030-2000 starts from GB 2312-1980 and fills the tables out until they are most replete. Essentially everything in ISO/IEC 10646-1:2000 that could fit was crammed into all available space, and then an extension mechanism was created to spill out all other code points (note: not characters per se) into a 4 byte form. (Ack!)

The single-byte portion contains ASCII at 0x00..0x7F, and adds the EURO SIGN at 0x80. (!!)

The double-byte portion grandfathers in GB 2312-1980 code points at A1A1..A9FE (symbols, Latin, Greek, Cyrillic, Hiragana, Katakana) and B0A1..F7FE (Han). But then the double-byte portion defined by the larger space 8140..FEFE is crammed solid with everything else that would fit. In more detail:

A1A1..A1FE .. A9A1..A9FE   GB2312-1980 symbols and alphabets, with a few
                           symbol additions
AAA1..AAFE .. AFA1..AFFE   PUA #1, mapped to U+E000..U+E233
B0A1..B0FE .. F7A1..F7FE   GB2312-1980 Han characters
F8A1..F8FE .. FEA1..FEFE   PUA #2, mapped to U+E234..U+E4C5

8140..81FE .. A040..A0FE   Han characters not in GB2312-1980:
                           U+4E02..U+72DB
A140..A1A0 .. A740..A7A0   PUA #3, mapped to U+E4C6..U+E765
A840..A8A0 .. A940..A9A0   More symbols, espec. CJK symbols
AA40..AAA0 .. FE40..FEA0   Han characters not in GB2312-1980:
                           U+72DC..U+9FA5; U+F92C..U+FA29 (only the
                           unified Han characters in that range); plus a
                           collection of radicals and characters listed
                           as U+E815..U+E864 (which will need mappings to
                           their real points in the CJK radicals and/or
                           Vertical Extension A).

Then there is the 4 byte extension mechanism:

First two byte ranges:   8130..8139
                         8230..8239
                         8330..8339
                         8430..8432
Third byte ranges:       81..FE
Fourth byte ranges:      30..39

So, for example:

0x81 0x31 0xE9 0x32 = U+0531 ARMENIAN CAPITAL LETTER AYB

*All* of the remaining code points in UCS-2 are simply decanted into this 4-byte extension, starting with:

0x81 0x30 0x81 0x30 = U+0081

and extending to:

0x84 0x32 0xEB 0x36 = U+FFFF

That is *ALL* code points, including unassigned code points, PUA code points not mapped into the double-byte space above, and surrogate code points. (urk!)

This implies that there will be an *eight*-byte form for referring to characters off the BMP, since you would have to use a pair of surrogates to get at them, by this design:

0x83 0x36 0xC7 0x39 = U+D800
0x83 0x37 0xB0 0x33 = U+DC00

so, presumably:

0x83 0x36 0xC7 0x39 0x83 0x37 0xB0 0x33 = U-00010000

(Does anyone really need this? A new escape mechanism for a legacy character set to refer to new characters in a new standard by transcoding the escape mechanism for *that* standard. ?!?)

There is a serious problem for mapping that I'm not sure how China is going to fix. The late-breaking characters they were concerned about, which made it into Unicode 3.0 but which they felt they had to have in the DBCS part of GB 18030-2000, have Unicode PUA code points -- and those code points are, of course, not included in the 4-byte extensions (since obviously the 4-byte extension table was generated algorithmically from China's database of all code points 0000..FFFF not already accounted for in the DBCS encoding). BUT, most (or all?) of those are actually encoded in Unicode now.
To wit:

Unicode          Character                          GB 18030     Mapped to
U+1E3F           m-acute                            A8BC         U+E7C7
U+01F9           n-grave                            A8BF         U+E7C8
U+303E           IDEOGRAPHIC VARIATION INDICATOR    A989         U+E7E7
U+2FF0..U+2FFB   IDS's                              A98A..A995   U+E7E8..U+E7F3
(various)        radicals & Han characters          FE50..FEA0   U+E815..U+E864

So this is going to create a mapping problem. As it stands now, GB 18030-2000 claims that U+303E is 0x81 0x39 0xA6 0x34, but it also claims that the IDEOGRAPHIC VARIATION INDICATOR is 0xA9 0x89 and is mapped to U+E7E7. And so on for this entire set of last-minute additions. How you gonna do that?

My first reaction to this is that the 4-byte form, while well-intentioned as an escape mechanism to allow a GB 18030 implementation to refer to *any* Unicode character, is going to:

a) be hard to implement, and
b) cause mapping problems because of standard assignments that caught GB 18030 in transition.

Frankly, I think it would be best for any implementations to just completely ignore the 4-byte form, correct the mappings of the small set of PUA-mapped characters noted above, and let China come around to correcting their standard in due time.

The assignment of mappings for the PUA space in Unicode is also *very* strange. What I can see so far is:

U+E000..U+E233   mapped to PUA #1 in GB 18030
U+E234..U+E4C5   mapped to PUA #2 in GB 18030
U+E4C6..U+E765   mapped to PUA #3 in GB 18030
U+E766..U+E7C6   mapped to fill holes in the symbols and alphabets
                 block in GB 18030 (A1A1..A9FE)
U+E7C7..U+E7C8   m-acute and n-grave (see above)
U+E7C9..U+E7E6   mapped to fill holes in the symbols and alphabets block
U+E7E7..U+E7F3   ideographic variation indicator and IDS's (see above)
U+E7F4..U+E80F   mapped to fill holes in the symbols and alphabets block
U+E810..U+E814   ??? [I cannot find these.]
U+E815..U+E864   set of radicals & Han characters not in Unicode 2.0
U+E865..U+F8FF   mapped to 4-byte forms: 83 38 98 37 .. 84 31 C7 37
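
[Editorial sketch, not part of the standard or of any existing implementation.] The 4-byte forms quoted in this message can be read as digits of a mixed-radix number: 10 values for the second and fourth bytes (30..39) and 126 values for the third byte (81..FE). The Python below is a minimal sketch under that assumption; the names linear_index, from_linear and combine_surrogates are hypothetical. It only computes positions within the 4-byte space and does not convert to Unicode code points, since that would also need the list of code points already placed in the single- and double-byte portions.

```python
def linear_index(b1, b2, b3, b4):
    """Position of a 4-byte form, counting from 81 30 81 30 as 0.

    Byte ranges are the ones listed in the message:
    first two bytes 8130..8432, third byte 81..FE, fourth byte 30..39.
    """
    in_range = (0x81 <= b1 <= 0x84 and 0x30 <= b2 <= 0x39
                and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39)
    if not in_range or (b1 == 0x84 and b2 > 0x32):
        raise ValueError("not a 4-byte form in the ranges described")
    lead = (b1 - 0x81) * 10 + (b2 - 0x30)      # first two bytes
    return (lead * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def from_linear(n):
    """Inverse of linear_index: position -> (b1, b2, b3, b4)."""
    n, d4 = divmod(n, 10)
    n, d3 = divmod(n, 126)
    d1, d2 = divmod(n, 10)
    return (0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4)


def combine_surrogates(high, low):
    """Standard UTF-16 arithmetic for a surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two surrogate examples in the message are exactly 0x400 positions apart,
# as expected if the surrogate range is laid out contiguously.
d800 = linear_index(0x83, 0x36, 0xC7, 0x39)
dc00 = linear_index(0x83, 0x37, 0xB0, 0x33)
print(dc00 - d800 == 0xDC00 - 0xD800)

# Likewise the span 83 38 98 37 .. 84 31 C7 37 has the same length
# as U+E865..U+F8FF.
span = linear_index(0x84, 0x31, 0xC7, 0x37) - linear_index(0x83, 0x38, 0x98, 0x37)
print(span == 0xF8FF - 0xE865)

# The eight-byte form would decode to two surrogates, which then combine:
print(hex(combine_surrogates(0xD800, 0xDC00)))
```

Under these assumptions the figures for U+D800, U+DC00 and U+E865..U+F8FF are mutually consistent. Whether the same holds for every other 4-byte form quoted in the message has not been checked here.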