L2/00-403

From: Kenneth Whistler [kenw@sybase.com]
Sent: Monday, June 19, 2000 7:42 PM

Subject: GB 18030-2000 mappings, anyone?

Unicadetti,

Is anyone working on, or does anyone have in hand, a mapping table yet
for the new Chinese standard, GB 18030-2000? If so, it would be quite
useful to post it to the mapping tables area of the Unicode website,
and the information also needs to be added to the Unihan.txt
mother-of-all-Han-databases.

Incidentally, for those of you who have not yet seen GB 18030-2000,
here is the strange and wondrous story of what we will have to
be supporting.

--Ken

Strange and Wondrous Story of GB 18030-2000

GB 18030-2000 starts from GB 2312-1980 and fills the tables out
until they are most replete. Essentially everything in ISO/IEC 10646-1:2000
that could fit was crammed into all available space, and then an extension
mechanism was created to spill out all other code points (note: not characters
per se) into a 4-byte form. (Ack!)

The single-byte portion contains ASCII at 0x00..0x7F, and adds the
EURO SIGN at 0x80. (!!)

The double-byte portion grandfathers in GB 2312-1980 code points at
A1A1..A9FE (symbols, Latin, Greek, Cyrillic, Hiragana, Katakana) and
B0A1..F7FE (Han).

But then the double-byte portion defined by the larger space 8140..FEFE
is crammed solid with everything else that would fit.

In more detail:

A1A1..A1FE
..
A9A1..A9FE    GB2312-1980 symbols and alphabets, with a few symbol additions

AAA1..AAFE
..
AFA1..AFFE    PUA #1, mapped to U+E000..U+E233

B0A1..B0FE
..
F7A1..F7FE    GB2312-1980 Han characters

F8A1..F8FE
..
FEA1..FEFE    PUA #2, mapped to U+E234..U+E4C5

8140..81FE
..
A040..A0FE    Han characters not in GB2312-1980: U+4E02..U+72DB

A140..A1A0
..
A740..A7A0    PUA #3, mapped to U+E4C6..U+E765

A840..A8A0
..
A940..A9A0    More symbols, especially CJK symbols

AA40..AAA0
..
FE40..FEA0    Han characters not in GB2312-1980: U+72DC..U+9FA5; U+F92C..U+FA29
                (only the unified Han characters in that range); plus a collection
                of radicals and characters listed as U+E815..U+E864 (which will
                need mappings to their real points in the CJK radicals and/or
                Vertical Extension A).
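
The three PUA blocks above can be sanity-checked by counting cells. A minimal sketch in Python, assuming that trail byte 0x7F is skipped in the 40..A0 rows (the list above does not say so; the names here are made up for illustration):

```python
# Trail-byte layouts within the double-byte space 8140..FEFE.
HIGH_TRAIL = [(0xA1, 0xFE)]               # 94 cells per lead byte
LOW_TRAIL = [(0x40, 0x7E), (0x80, 0xA0)]  # 96 cells, ASSUMING 0x7F is skipped

def cell_count(lead_lo, lead_hi, trail_ranges):
    """Number of double-byte cells for lead bytes lead_lo..lead_hi (inclusive)."""
    per_lead = sum(hi - lo + 1 for lo, hi in trail_ranges)
    return (lead_hi - lead_lo + 1) * per_lead

# (name, cells in the GB block, first and last Unicode PUA code point)
PUA_BLOCKS = [
    ("PUA #1", cell_count(0xAA, 0xAF, HIGH_TRAIL), 0xE000, 0xE233),
    ("PUA #2", cell_count(0xF8, 0xFE, HIGH_TRAIL), 0xE234, 0xE4C5),
    ("PUA #3", cell_count(0xA1, 0xA7, LOW_TRAIL), 0xE4C6, 0xE765),
]

for name, cells, first, last in PUA_BLOCKS:
    print(f"{name}: {cells} cells, {last - first + 1} code points")
# PUA #1: 564 cells, 564 code points
# PUA #2: 658 cells, 658 code points
# PUA #3: 672 cells, 672 code points
```

Under that assumption each block has exactly as many cells as the Unicode PUA range it is mapped to.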

Then there is the 4-byte extension mechanism:

Ranges for the first two bytes:

8130..8139
8230..8239
8330..8339
8430..8432

Range for the third byte: 81..FE
Range for the fourth byte: 30..39

So, for example: 0x81 0x30 0xE9 0x32 = U+0531 ARMENIAN CAPITAL LETTER AYB
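
To make the byte ranges concrete, here is a rough Python sketch of a well-formedness check for the 4-byte form, using only the ranges listed above. The function and table names are hypothetical, not taken from the standard:

```python
# Allowed second-byte range for each first byte, as listed above.
SECOND_BYTE_RANGE = {
    0x81: (0x30, 0x39),
    0x82: (0x30, 0x39),
    0x83: (0x30, 0x39),
    0x84: (0x30, 0x32),
}

def is_four_byte_form(seq):
    """True if seq is four bytes that fall inside the ranges listed above."""
    if len(seq) != 4:
        return False
    b1, b2, b3, b4 = seq
    second = SECOND_BYTE_RANGE.get(b1)
    if second is None:
        return False
    return (second[0] <= b2 <= second[1]
            and 0x81 <= b3 <= 0xFE
            and 0x30 <= b4 <= 0x39)

def capacity():
    """Total number of 4-byte sequences these ranges allow."""
    pairs = sum(hi - lo + 1 for lo, hi in SECOND_BYTE_RANGE.values())
    return pairs * (0xFE - 0x81 + 1) * (0x39 - 0x30 + 1)

print(is_four_byte_form(bytes.fromhex("81308130")))  # True
print(is_four_byte_form(bytes.fromhex("84338130")))  # False: 8433 is out of range
print(capacity())                                    # 41580
```

Note that the second byte (30..39) lies below the double-byte trail range, which starts at 40, so a scanner can tell the two forms apart after reading two bytes.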

*All* of the remaining code points in UCS-2 are simply decanted into these
4-byte extensions, starting with:

0x81 0x30 0x81 0x30 = U+0081

and extending to:

0x84 0x32 0xEB 0x36 = U+FFFF

That is *ALL* code points, including unassigned code points, PUA code points
not mapped into the double-byte space above, and surrogate code points. (urk!)
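
If the 4-byte sequences are assigned in plain lexicographic order (an assumption on my part, though the endpoints above fit it), each sequence corresponds to a position in a mixed-radix count. A small Python sketch, with hypothetical function names:

```python
def linear_index(seq):
    """Position of a 4-byte sequence, ASSUMING lexicographic ordering.

    The radices are 10 (second byte), 126 (third byte), 10 (fourth byte).
    """
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

def from_index(index):
    """Inverse of linear_index: rebuild the 4-byte sequence."""
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])

first = bytes.fromhex("81308130")  # given above as U+0081
last = bytes.fromhex("8432EB36")   # given above as U+FFFF

print(linear_index(first))         # 0
print(linear_index(last))          # 41386
print(from_index(41386).hex())     # 8432eb36
```

The step from a position to a Unicode code point still needs the table of code points already covered by the single- and double-byte forms, which this sketch does not attempt to reproduce.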

This implies that there will be an *eight*-byte form for referring to
characters off the BMP, since you would have to use a pair of surrogates
to get at them, by this design:

0x83 0x36 0xC7 0x39 = U+D800
0x83 0x37 0xB0 0x33 = U+DC00

so, presumably:

0x83 0x36 0xC7 0x39 0x83 0x37 0xB0 0x33 = U-00010000
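
As a cross-check of the two surrogate values, here is a Python sketch that again assumes the 4-byte sequences run in lexicographic order (function names are illustrative only). The distance between the two sequences should equal U+DC00 minus U+D800, and the pair combines by the usual surrogate arithmetic:

```python
def linear_index(seq):
    """Position of a 4-byte sequence, ASSUMING lexicographic ordering."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

def combine_surrogates(high, low):
    """Standard surrogate-pair arithmetic: two 16-bit units -> one scalar value."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

HIGH_GB = bytes.fromhex("8336C739")  # given above as U+D800
LOW_GB = bytes.fromhex("8337B033")   # given above as U+DC00

gap = linear_index(LOW_GB) - linear_index(HIGH_GB)
print(gap == 0xDC00 - 0xD800)                    # True: 1024 positions apart
print(hex(combine_surrogates(0xD800, 0xDC00)))   # 0x10000
print(len(HIGH_GB + LOW_GB))                     # 8 bytes for one character
```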

(Does anyone really need this? A new escape mechanism for a legacy
character set to refer to new characters in a new standard by transcoding
the escape mechanism for *that* standard. ?!?)

There is a serious problem for mapping that I'm not sure how China is
going to fix. The late-breaking characters they were concerned about that
made it into Unicode 3.0, but that they felt they had to have in the DBCS
part of GB 18030-2000 have Unicode PUA code points -- and those code points
are, of course, not included in the 4-byte extensions (since obviously the
4-byte extension table was generated algorithmically from China's database
of all code points 0000..FFFF not already accounted for in the DBCS encoding).

BUT, most (or all?) of those are actually encoded in Unicode now. To wit:

U+1E3F m-acute                          A8BC        U+E7C7
U+01F9 n-grave                          A8BF        U+E7C8
U+303E IDEOGRAPHIC VARIATION INDICATOR  A989        U+E7E7
U+2FF0..U+2FFB IDS's                    A98A..A995  U+E7E8..U+E7F3
(various)    radicals & Han characters  FE50..FEA0  U+E815..U+E864

So this is going to create a mapping problem. As it stands now, GB 18030-2000
claims that U+303E is 0x81 0x39 0xA6 0x34, but it also claims that
the IDEOGRAPHIC VARIATION INDICATOR is 0xA9 0x89 and is mapped to U+E7E7.
And so on for this entire set of last-minute additions. How you gonna do that?
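
The conflict can be spelled out with the U+303E case alone. A toy Python sketch follows; the "corrected" table is a hypothetical fix, not anything the standard specifies:

```python
from collections import defaultdict

DBCS = bytes.fromhex("A989")           # double-byte form of the indicator
FOUR_BYTE = bytes.fromhex("8139A634")  # 4-byte form claimed for U+303E

# The two claims as published: the double-byte form goes to the PUA.
as_published = {FOUR_BYTE: 0x303E, DBCS: 0xE7E7}

# Hypothetical correction: point the double-byte form at the real code point.
corrected = {FOUR_BYTE: 0x303E, DBCS: 0x303E}

def inverse(table):
    """Unicode code point -> list of GB byte sequences that decode to it."""
    result = defaultdict(list)
    for gb_bytes, code_point in table.items():
        result[code_point].append(gb_bytes)
    return dict(result)

# As published, the character at A989 never arrives at U+303E.
print(hex(as_published[DBCS]))                             # 0xe7e7
# Once corrected, U+303E has two encodings, so one must be chosen.
print([gb.hex() for gb in inverse(corrected)[0x303E]])     # ['8139a634', 'a989']
```

Either way an implementation has to pick: leave the character stranded in the PUA, or accept two GB sequences for one Unicode character and decide which one to emit.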

My first reaction to this is that the 4-byte form, while well-intentioned,
to provide an escape mechanism to allow a GB 18030 implementation to refer
to *any* Unicode character, is going to both: a) be hard to implement and
b) cause mapping problems because of standard assignments that caught
GB 18030 in transition. Frankly, I think it would be best for any implementations
to just completely ignore the 4-byte form, correct the mappings of the
small set of PUA-mapped characters noted above, and let China come around
to correcting their standard in due time.

The assignment of mappings for the PUA space in Unicode is also *very* strange.
What I can see so far is:

U+E000..U+E233    mapped to PUA #1 in GB 18030
U+E234..U+E4C5    mapped to PUA #2 in GB 18030
U+E4C6..U+E765    mapped to PUA #3 in GB 18030
U+E766..U+E7C6    mapped to fill holes in the symbols and alphabets block
                    in GB 18030 (A1A1..A9FE)
U+E7C7..U+E7C8    m-acute and n-grave (see above)
U+E7C9..U+E7E6    mapped to fill holes in the symbols and alphabets block
U+E7E7..U+E7F3    IDEOGRAPHIC VARIATION INDICATOR and IDS's (see above)
U+E7F4..U+E80F    mapped to fill holes in the symbols and alphabets block
U+E810..U+E814    ??? [I cannot find these.]
U+E815..U+E864    set of radicals & Han characters not in Unicode 2.0
U+E865..U+F8FF    mapped to 4-byte forms: 83 38 98 37 .. 84 31 C7 37
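
The list above can be checked mechanically. A short Python sketch (names are illustrative) confirms that the ranges tile U+E000..U+F8FF with no gaps or overlaps. It also confirms that the 4-byte span quoted for U+E865..U+F8FF has the right length, assuming the 4-byte sequences run in lexicographic order:

```python
# The PUA ranges listed above, in order.
PUA_RANGES = [
    (0xE000, 0xE233), (0xE234, 0xE4C5), (0xE4C6, 0xE765),
    (0xE766, 0xE7C6), (0xE7C7, 0xE7C8), (0xE7C9, 0xE7E6),
    (0xE7E7, 0xE7F3), (0xE7F4, 0xE80F), (0xE810, 0xE814),
    (0xE815, 0xE864), (0xE865, 0xF8FF),
]

def is_partition(ranges, first, last):
    """True if the ranges cover first..last exactly once, in order."""
    expected = first
    for lo, hi in ranges:
        if lo != expected or hi < lo:
            return False
        expected = hi + 1
    return expected == last + 1

def linear_index(seq):
    """Position of a 4-byte sequence, ASSUMING lexicographic ordering."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

span = (linear_index(bytes.fromhex("8431C737"))
        - linear_index(bytes.fromhex("83389837")) + 1)

print(is_partition(PUA_RANGES, 0xE000, 0xF8FF))  # True
print(span, 0xF8FF - 0xE865 + 1)                 # 4251 4251
```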