Re: is GB18030 a combination of CJK and CJK extension?

From: Eric Muller
Date: Fri Jul 19 2002 - 15:29:34 EDT

Disclaimer: this is only my interpretation of GB 18030. Use at your own risk.

GB 18030 can be three different things, depending on how you interpret it:

   1. it is a coded character set, defined by the glyph pictures in the
      published standard. That collection does not include characters
      for the so-called minority scripts (e.g. Mongolian).
   2. it is a coded character set made of interpretation 1 plus the
      minority scripts (you see that by reading the (a?) document that
      describes the certification testing).
   3. it is roughly a UTF, the most notable deviations being that it can
      represent a bit more than 0x0 - 0x10ffff and allows the surrogate
      code points.
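
To make interpretation 3 concrete, here is a rough sketch in Python of the
four-byte form treated as a UTF. It relies on the byte ranges of four-byte
sequences (0x81..0xFE, 0x30..0x39, 0x81..0xFE, 0x30..0x39) and on the linear
assignment that puts U+10000 at 0x90308130. The function and constant names
are my own, for illustration only, and the sketch covers only the
supplementary planes, not the BMP ranges, which need a table.

```python
# Four-byte GB 18030 sequences have the shape
#   b1 in 0x81..0xFE, b2 in 0x30..0x39, b3 in 0x81..0xFE, b4 in 0x30..0x39
FOUR_BYTE_SPACE = 126 * 10 * 126 * 10  # number of four-byte sequences
UNICODE_SPACE = 0x10FFFF + 1           # number of Unicode code points

# Linear index of 0x90308130, the sequence assigned to U+10000.
SUPPLEMENTARY_BASE = (0x90 - 0x81) * 10 * 126 * 10


def encode_supplementary(cp: int) -> bytes:
    """Encode a supplementary-plane code point as four GB 18030 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point in U+10000..U+10FFFF")
    index = SUPPLEMENTARY_BASE + (cp - 0x10000)
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


def decode_supplementary(seq: bytes) -> int:
    """Inverse of encode_supplementary; rejects anything outside its range."""
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte sequence")
    index = (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)
    cp = 0x10000 + index - SUPPLEMENTARY_BASE
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("sequence does not map to a supplementary code point")
    return cp


if __name__ == "__main__":
    print(encode_supplementary(0x10000).hex())  # first supplementary code point
    print(encode_supplementary(0x10FFFF).hex())  # last Unicode code point
    print(FOUR_BYTE_SPACE - UNICODE_SPACE)       # four-byte sequences to spare
```

The four-byte space alone holds 1,587,600 sequences against 1,114,112
Unicode code points, which is where the "a bit more than 0x0 - 0x10ffff"
comes from.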

Under interpretations 1 and 2, you also get a mapping between those
collections and Unicode. Except for 25 characters, they are all mapped
to non-PUA BMP scalar values. The remaining 25 are mapped to PUA BMP
scalar values. Some of those 25 characters are believed to be in the
Unicode repertoire (e.g. GB+FE51 is mapped to U+E816, and is believed to
be U+20087).

The collection/encoding-form duality is in my opinion the most painful
aspect. In particular, it makes the publication of a new mapping (e.g.
to a different version of Unicode, as HKSCS did to take into account
newly encoded Unicode characters) very problematic.

By the way, here are a couple of things that may be of interest. HK+
means HKSCS code point; GB+ means GB 18030 code point:

   1. PUA confusion:
      HK+9571 maps to U+2721B under the 3.2 mapping (and is an ideograph)
      HK+9571 maps to U+E78D under the 3.0 mapping
      GB+A6D9 maps to U+E78D.
      GB+A6D9 is definitely not an ideograph.
   2. PUA differentiation:
      HK+8BFA maps to U+20087 under the 3.2 mapping
      HK+8BFA maps to U+F572 under the 3.0 mapping
      GB+FE51 maps to U+E816
      GB+FE51 is believed to be U+20087
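
The two cases above can be checked mechanically. The following is only a
sketch in Python: the table holds just the six mappings quoted in this
message, and the names (MAPPINGS, is_bmp_pua, pua_collisions) are made up for
the illustration. It groups source code points that land on the same BMP PUA
scalar value.

```python
# Mappings quoted in this message: (source code point, mapping) -> scalar value.
MAPPINGS = {
    ("HK+9571", "HKSCS, Unicode 3.2"): 0x2721B,
    ("HK+9571", "HKSCS, Unicode 3.0"): 0xE78D,
    ("GB+A6D9", "GB 18030"): 0xE78D,
    ("HK+8BFA", "HKSCS, Unicode 3.2"): 0x20087,
    ("HK+8BFA", "HKSCS, Unicode 3.0"): 0xF572,
    ("GB+FE51", "GB 18030"): 0xE816,
}


def is_bmp_pua(cp: int) -> bool:
    """True for scalar values in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= cp <= 0xF8FF


def pua_collisions(mappings: dict) -> dict:
    """Return PUA scalar values that more than one source code point maps to."""
    by_target = {}
    for (source, _mapping), cp in mappings.items():
        if is_bmp_pua(cp):
            by_target.setdefault(cp, set()).add(source)
    return {cp: sorted(sources)
            for cp, sources in by_target.items()
            if len(sources) > 1}


if __name__ == "__main__":
    for cp, sources in pua_collisions(MAPPINGS).items():
        print(f"U+{cp:04X} is shared by {', '.join(sources)}")
```

With this table, only U+E78D is reported (case 1). Case 2 is the opposite
problem and does not show up as a collision: one character ends up on two
different PUA values, U+F572 and U+E816.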

