L2/06-394

Update on GB 18030:2005
Ken Lunde, Adobe Systems

1) There are only twenty-four characters that continue to be mapped to PUA code points. These were mapped to the same PUA code points in the original (2000) standard. All twenty-four of these characters have valid non-PUA code points, either in Plane 2 (six cases) or thanks to Unicode 4.1 (eighteen cases). The following are the mappings:

   GB 18030    PUA         Unicode 4.1 & Plane 2
   0xA6D9   -> U+E78D   -> U+FE10
   0xA6DA   -> U+E78E   -> U+FE12
   0xA6DB   -> U+E78F   -> U+FE11
   0xA6DC   -> U+E790   -> U+FE13
   0xA6DD   -> U+E791   -> U+FE14
   0xA6DE   -> U+E792   -> U+FE15
   0xA6DF   -> U+E793   -> U+FE16
   0xA6EC   -> U+E794   -> U+FE17
   0xA6ED   -> U+E795   -> U+FE18
   0xA6F3   -> U+E796   -> U+FE19
   0xFE51   -> U+E816   -> U+20087
   0xFE52   -> U+E817   -> U+20089
   0xFE53   -> U+E818   -> U+200CC
   0xFE59   -> U+E81E   -> U+9FB4
   0xFE61   -> U+E826   -> U+9FB5
   0xFE66   -> U+E82B   -> U+9FB6
   0xFE67   -> U+E82C   -> U+9FB7
   0xFE6C   -> U+E831   -> U+215D7
   0xFE6D   -> U+E832   -> U+9FB8
   0xFE76   -> U+E83B   -> U+2298F
   0xFE7E   -> U+E843   -> U+9FB9
   0xFE90   -> U+E854   -> U+9FBA
   0xFE91   -> U+E855   -> U+241FE
   0xFEA0   -> U+E864   -> U+9FBB

2) Given GB 18030's tendency to include all CJK Unified Ideographs (with the exception of the CJK Compatibility Ideographs blocks), the characters in the range U+9FA6 through U+9FBB (22 characters) should be added. Eight are already in GB 18030 (see #1 above), meaning fourteen new characters. I have recommended this to CESI.

3) Extension B (42,711 characters) is printed in the standard on pp 240-443.

4) The mapping for "m" with the acute diacritic has been changed as follows:

   0xA8BC -> U+E7C7 (PUA; GB 18030-2000)
          -> U+1E3F (GB 18030-2005)

5) The six regional (aka, minority) scripts are Korean, Mongolian, Tai Le, Tibetan, Uyghur, and Yi. None of them make use of PUA code points.
The following are the numbers of characters for each script that have glyphs printed:

   Korean    = 3,376 Hangul plus 69 Jamo plus 51 Compatibility Jamo
   Mongolian = 149
   Tai Le    = 35
   Tibetan   = 193
   Uyghur    = 49 plus 155 Presentation Forms
   Yi        = 1,215 (glyphs for U+A4A2, U+A4A3, U+A4B4, U+A4C1, and U+A4C5 are missing)

For Korean, the GB 12052-89 standard includes three levels of Hangul, with 2,068, 1,356, and 1,779 characters, respectively. No combination of these levels adds up to the 3,376 figure, though I strongly suspect that it is the first two levels (3,424 characters in total) with some tweaking.

6) The following four code points had their prototypical glyphs silently changed (corrected):

   CID=23137 0x8230AD37 (U+3665) = GB 18030 error (13th stroke is missing)
   CID=25539 0x8232A139 (U+3FD0) = GB 18030 error (10th stroke is missing)
   CID=28741 0x8234E631 (U+4C6A) = GB 18030 error (left-side radical should be traditional)
   CID=28882 0x8234F432 (U+4CFD) = GB 18030 error (right-side radical should be traditional)

Font developers need to know this. I first detected these glyph errors about a year ago, and reported them to CESI.

7) All of the glyphs printed in the 2000 edition are considered mandatory. This amounts to approximately 29,000 glyphs, and is effectively CJK Unified Ideographs, CJK Unified Ideographs Extension A, and some additional characters. The glyphs for the six regional scripts, along with those for CJK Unified Ideographs Extension B, are not mandatory, but are instead recommended.

8) The GB 18030-2005 standard was established on May 1, 2006, and first printed in August of 2006. It is just over 500 pages in length.
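As an illustration of items 1 and 2, the following is a minimal Python sketch that records the twenty-four mappings from item 1 as data, rewrites PUA code points in a string to their non-PUA counterparts, and rechecks the counts given above (six Plane 2 cases, eighteen Unicode 4.1 cases, and fourteen new characters in U+9FA6 through U+9FBB). The names PUA_MAPPINGS, PUA_TO_STANDARD, and normalize_pua are hypothetical and chosen only for this example; nothing here is part of GB 18030 or of any existing converter.

```python
# GB 18030 two-byte code -> (PUA code point, non-PUA code point), copied from item 1.
PUA_MAPPINGS = {
    0xA6D9: (0xE78D, 0xFE10), 0xA6DA: (0xE78E, 0xFE12),
    0xA6DB: (0xE78F, 0xFE11), 0xA6DC: (0xE790, 0xFE13),
    0xA6DD: (0xE791, 0xFE14), 0xA6DE: (0xE792, 0xFE15),
    0xA6DF: (0xE793, 0xFE16), 0xA6EC: (0xE794, 0xFE17),
    0xA6ED: (0xE795, 0xFE18), 0xA6F3: (0xE796, 0xFE19),
    0xFE51: (0xE816, 0x20087), 0xFE52: (0xE817, 0x20089),
    0xFE53: (0xE818, 0x200CC), 0xFE59: (0xE81E, 0x9FB4),
    0xFE61: (0xE826, 0x9FB5), 0xFE66: (0xE82B, 0x9FB6),
    0xFE67: (0xE82C, 0x9FB7), 0xFE6C: (0xE831, 0x215D7),
    0xFE6D: (0xE832, 0x9FB8), 0xFE76: (0xE83B, 0x2298F),
    0xFE7E: (0xE843, 0x9FB9), 0xFE90: (0xE854, 0x9FBA),
    0xFE91: (0xE855, 0x241FE), 0xFEA0: (0xE864, 0x9FBB),
}

# Table for str.translate: PUA code point -> non-PUA code point.
PUA_TO_STANDARD = {pua: std for pua, std in PUA_MAPPINGS.values()}


def normalize_pua(text: str) -> str:
    """Replace the 24 PUA code points from item 1 with their non-PUA equivalents."""
    return text.translate(PUA_TO_STANDARD)


# Item 1: split the non-PUA targets into Plane 2 and BMP (Unicode 4.1) cases.
plane2_targets = [std for _, std in PUA_MAPPINGS.values() if std >= 0x20000]
bmp_targets = [std for _, std in PUA_MAPPINGS.values() if std < 0x10000]

# Item 2: characters in U+9FA6..U+9FBB, minus those already reachable via item 1.
cjk_range = range(0x9FA6, 0x9FBB + 1)
already_in_gb18030 = set(bmp_targets) & set(cjk_range)
new_characters = [cp for cp in cjk_range if cp not in already_in_gb18030]

if __name__ == "__main__":
    print(len(PUA_MAPPINGS), len(plane2_targets), len(bmp_targets))
    print(len(cjk_range), len(already_in_gb18030), len(new_characters))
    print(", ".join(f"U+{cp:04X}" for cp in new_characters))
```

A font or conversion tool could apply something like normalize_pua to text decoded under the 2000 mappings, on the assumption that moving to the non-PUA code points is desired; whether that is appropriate depends on which edition of the standard the data is meant to conform to.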