L2/06-394

Update on GB 18030-2005
Ken Lunde, Adobe Systems


1) Only twenty-four characters continue to be mapped to PUA code
points, the same PUA code points they had in the original (2000)
standard. All twenty-four now have valid non-PUA code points, either
in Plane 2 (six cases) or in the BMP thanks to Unicode 4.1 (eighteen
cases). The mappings are as follows:

  GB 18030  PUA       Unicode 4.1 & Plane 2
  0xA6D9 -> U+E78D -> U+FE10
  0xA6DA -> U+E78E -> U+FE12
  0xA6DB -> U+E78F -> U+FE11
  0xA6DC -> U+E790 -> U+FE13
  0xA6DD -> U+E791 -> U+FE14
  0xA6DE -> U+E792 -> U+FE15
  0xA6DF -> U+E793 -> U+FE16
  0xA6EC -> U+E794 -> U+FE17
  0xA6ED -> U+E795 -> U+FE18
  0xA6F3 -> U+E796 -> U+FE19
  0xFE51 -> U+E816 -> U+20087
  0xFE52 -> U+E817 -> U+20089
  0xFE53 -> U+E818 -> U+200CC
  0xFE59 -> U+E81E -> U+9FB4
  0xFE61 -> U+E826 -> U+9FB5
  0xFE66 -> U+E82B -> U+9FB6
  0xFE67 -> U+E82C -> U+9FB7
  0xFE6C -> U+E831 -> U+215D7
  0xFE6D -> U+E832 -> U+9FB8
  0xFE76 -> U+E83B -> U+2298F
  0xFE7E -> U+E843 -> U+9FB9
  0xFE90 -> U+E854 -> U+9FBA
  0xFE91 -> U+E855 -> U+241FE
  0xFEA0 -> U+E864 -> U+9FBB
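
For implementers, the table above can be applied as a fix-up step on
text that was decoded with the PUA mappings. The following is only a
minimal Python sketch under that assumption; the names PUA_TO_STANDARD
and replace_pua are illustrative and do not come from the standard or
from any library.

```python
# PUA code point -> non-PUA code point, transcribed from the table above.
PUA_TO_STANDARD = {
    0xE78D: 0xFE10, 0xE78E: 0xFE12, 0xE78F: 0xFE11, 0xE790: 0xFE13,
    0xE791: 0xFE14, 0xE792: 0xFE15, 0xE793: 0xFE16, 0xE794: 0xFE17,
    0xE795: 0xFE18, 0xE796: 0xFE19,
    0xE816: 0x20087, 0xE817: 0x20089, 0xE818: 0x200CC,
    0xE81E: 0x9FB4, 0xE826: 0x9FB5, 0xE82B: 0x9FB6, 0xE82C: 0x9FB7,
    0xE831: 0x215D7, 0xE832: 0x9FB8, 0xE83B: 0x2298F, 0xE843: 0x9FB9,
    0xE854: 0x9FBA, 0xE855: 0x241FE, 0xE864: 0x9FBB,
}


def replace_pua(text: str) -> str:
    """Replace the 24 PUA code points with their non-PUA equivalents."""
    # str.translate accepts a dict that maps code points to code points.
    return text.translate(PUA_TO_STANDARD)
```

Characters outside the table pass through unchanged, so the function
can be applied to whole strings.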

2) Given GB 18030's tendency to include all CJK Unified Ideographs
(with the exception of the CJK Compatibility Ideographs blocks), the
characters in the range U+9FA6 through U+9FBB (22 characters) should
be added. Eight of them (U+9FB4 through U+9FBB) are already in
GB 18030 (see #1 above), which leaves fourteen new characters. I have
recommended this to CESI.
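
The arithmetic behind these counts can be checked with a short Python
sketch. The variable names are illustrative; the two ranges are taken
from this item and from the table in #1.

```python
# Range recommended for addition, and the part of it that GB 18030
# already covers through the mappings listed in item 1.
recommended = set(range(0x9FA6, 0x9FBB + 1))
already_mapped = set(range(0x9FB4, 0x9FBB + 1))

new_characters = sorted(recommended - already_mapped)
print(len(recommended), len(already_mapped), len(new_characters))  # 22 8 14
print(f"U+{new_characters[0]:04X}..U+{new_characters[-1]:04X}")    # U+9FA6..U+9FB3
```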

3) Extension B (42,711 characters) is printed in the standard on pp 240-443.

4) The mapping for "m" with the acute diacritic has been changed as follows:

  0xA8BC -> U+E7C7 (PUA; GB 18030-2000) -> U+1E3F (GB 18030-2005)

5) The six regional (aka minority) scripts are Korean, Mongolian, Tai
Le, Tibetan, Uyghur, and Yi. None of them makes use of PUA code
points. The number of characters with printed glyphs for each script
is as follows:

  Korean = 3,376 Hangul plus 69 Jamo plus 51 Compatibility Jamo
  Mongolian = 149
  Tai Le = 35
  Tibetan = 193
  Uyghur = 49 plus 155 Presentation Forms
  Yi = 1,215 (glyphs for U+A4A2, U+A4A3, U+A4B4, U+A4C1, and U+A4C5 are missing)

For Korean, the GB 12052-89 standard includes three levels of Hangul,
with 2,068, 1,356, and 1,779 characters, respectively. No single level
or combination of levels adds up to the 3,376 figure, though I strongly
suspect that it is the first two levels (3,424 characters in total)
with some tweaking.
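
The comparison can be reproduced with a few lines of Python. This is
only a sketch of the arithmetic; it says nothing about which Hangul
were actually selected, and the names used are illustrative.

```python
from itertools import combinations

levels = {1: 2068, 2: 1356, 3: 1779}  # GB 12052-89 Hangul levels
printed = 3376                         # Hangul glyphs printed in GB 18030-2005

# Total for every single level and every combination of levels.
totals = {
    combo: sum(levels[k] for k in combo)
    for size in (1, 2, 3)
    for combo in combinations(levels, size)
}

# Combination whose total is closest to the printed figure.
closest = min(totals, key=lambda combo: abs(totals[combo] - printed))
print(closest, totals[closest], totals[closest] - printed)  # (1, 2) 3424 48
```

No total equals 3,376; the first two levels come closest, at 48
characters over.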

6) The following four code points had their prototypical glyphs
silently changed (corrected):

  CID=23137 0x8230AD37 (U+3665) = GB 18030 error (13th stroke is missing)
  CID=25539 0x8232A139 (U+3FD0) = GB 18030 error (10th stroke is missing)
  CID=28741 0x8234E631 (U+4C6A) = GB 18030 error (left-side radical should be traditional)
  CID=28882 0x8234F432 (U+4CFD) = GB 18030 error (right-side radical should be traditional)

Font developers need to know this. I first detected these glyph errors
about a year ago, and reported them to CESI.
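
To confirm which characters are affected, the four-byte sequences
listed above can be decoded and compared with the code points given.
The following Python sketch assumes the gb18030 codec that ships with
Python; it checks the encoding only and cannot say anything about the
glyph shapes in a particular font.

```python
# Four-byte GB 18030 sequences and the code points listed in item 6.
corrected = {
    "8230AD37": 0x3665,
    "8232A139": 0x3FD0,
    "8234E631": 0x4C6A,
    "8234F432": 0x4CFD,
}

for hex_bytes, code_point in corrected.items():
    decoded = bytes.fromhex(hex_bytes).decode("gb18030")
    status = "ok" if decoded == chr(code_point) else "MISMATCH"
    print(f"0x{hex_bytes} -> U+{ord(decoded):04X} ({status})")
```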

7) All of the glyphs printed in the 2000 edition are considered
mandatory. This amounts to approximately 29,000 glyphs: effectively
CJK Unified Ideographs, CJK Unified Ideographs Extension A, and some
additional characters. The glyphs for the six regional scripts, along
with those for CJK Unified Ideographs Extension B, are not mandatory,
but are instead recommended.

8) The GB 18030-2005 standard was established on May 1, 2006, and
first printed in August of 2006. It is just over 500 pages in length.
