Disclaimer: this is only my interpretation of GB 18030. Use at your own
risk.
GB 18030 can be three different things, depending on how you interpret it:
1. it is a coded character set, defined by the glyph pictures in the
published standard. That collection does not include characters
for the so-called minority scripts (e.g. Mongolian).
2. it is a coded character set made up of collection 1 plus the minority scripts (you
see that by reading the - a?- document that describes the
3. it is roughly a UTF, the most notable deviations being that it can
represent a bit more than 0x0 - 0x10ffff and allows the surrogate
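To put a rough number on "a bit more" under interpretation 3, here is a minimal sketch that counts the code positions in the one-, two- and four-byte forms and compares the total with the 0x110000 Unicode code points. The byte ranges and the starting point of the linear four-byte mapping (U+10000 at 90 30 81 30) are my assumptions about the encoding; they are not stated in this message. The names are made up for the illustration.

```python
# Assumed byte ranges for each form (not taken from the message above).
ONE_BYTE = range(0x00, 0x80)                                  # 128 values
LEAD = range(0x81, 0xFF)                                      # 126 values
TRAIL2 = list(range(0x40, 0x7F)) + list(range(0x80, 0xFF))    # 190 values
DIGIT = range(0x30, 0x3A)                                     # 10 values

one = len(ONE_BYTE)
two = len(LEAD) * len(TRAIL2)
four = (len(LEAD) * len(DIGIT)) ** 2      # lead, digit, lead, digit

GB_POSITIONS = one + two + four           # 128 + 23940 + 1587600
UNICODE_CODE_POINTS = 0x10FFFF + 1


def supplementary_to_four_bytes(cp):
    """Four-byte form of a supplementary code point, assuming the mapping
    is linear and that U+10000 sits at 90 30 81 30."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    index = cp - 0x10000
    index, b4 = divmod(index, 10)     # last byte: 0x30-0x39
    index, b3 = divmod(index, 126)    # third byte: 0x81-0xFE
    b1, b2 = divmod(index, 10)        # second byte: 0x30-0x39
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


if __name__ == "__main__":
    print(GB_POSITIONS, UNICODE_CODE_POINTS)
    print(supplementary_to_four_bytes(0x20087).hex())
```

Under these assumptions the encoding space has room to spare beyond 0x10ffff, which is the sense in which it behaves like a slightly oversized UTF.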
Under interpretations 1 and 2, you also get a mapping between those
collections and Unicode. Except for 25 characters, they are all mapped
to non-PUA BMP scalar values. The remaining 25 are mapped to PUA BMP
scalar values. Some of those 25 characters are believed to be in the
Unicode repertoire (e.g. GB+FE51 is mapped to U+E816, and is believed to
be U+20087).
The collection/encoding-form duality is, in my opinion, the most painful
aspect. In particular, it makes publishing a new mapping (e.g. to a
different version of Unicode, as HKSCS did to take into account
newly encoded Unicode characters) very problematic.
By the way, here are a couple of things that may be of interest. HK+
means HKSCS code point; GB+ means GB 18030 code point:
1. PUA confusion:
HK+9571 maps to U+2721B under the 3.2 mapping (and is an ideograph)
HK+9571 maps to U+E78D under the 3.0 mapping
GB+A6D9 maps to U+E78D.
GB+A6D9 is definitely not an ideograph.
2. PUA differentiation:
HK+8BFA maps to U+20087 under the 3.2 mapping
HK+8BFA maps to U+F572 under the 3.0 mapping
GB+FE51 maps to U+E816
GB+FE51 is believed to be U+20087
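The two cases above can be checked mechanically. The following is only a sketch: the table holds just the six mappings quoted in this message (not the real mapping files), and the names MAPPINGS, is_bmp_pua and pua_collisions are made up for the illustration. It groups the source code points by the BMP PUA scalar value they map to and reports any scalar value claimed by more than one source.

```python
from collections import defaultdict

# (source code point, mapping label, Unicode scalar value),
# copied from the two cases listed above; nothing else is assumed.
MAPPINGS = [
    ("HK+9571", "3.2", 0x2721B),
    ("HK+9571", "3.0", 0xE78D),
    ("GB+A6D9", "GB", 0xE78D),
    ("HK+8BFA", "3.2", 0x20087),
    ("HK+8BFA", "3.0", 0xF572),
    ("GB+FE51", "GB", 0xE816),
]


def is_bmp_pua(scalar):
    """True if the scalar value lies in the BMP Private Use Area."""
    return 0xE000 <= scalar <= 0xF8FF


def pua_collisions(mappings):
    """Return {PUA scalar: sorted source code points} for every PUA scalar
    that is the target of more than one distinct source code point."""
    by_scalar = defaultdict(set)
    for source, _label, scalar in mappings:
        if is_bmp_pua(scalar):
            by_scalar[scalar].add(source)
    return {s: sorted(srcs) for s, srcs in by_scalar.items() if len(srcs) > 1}


if __name__ == "__main__":
    for scalar, sources in pua_collisions(MAPPINGS).items():
        print(f"U+{scalar:04X} is shared by {', '.join(sources)}")
```

On this small table only case 1 shows up as a collision (U+E78D). Case 2 is the opposite problem: the same character ends up on two different PUA values (U+F572 and U+E816), which a check like this cannot detect without knowing the intended non-PUA character.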
This archive was generated by hypermail 2.1.2 : Fri Jul 19 2002 - 13:30:52 EDT