L2/01-216

Unicode FAQ on GB 18030

What is GB 18030?

GB 18030 is a new Chinese codepage standard that extends GB 2312-1980 and GBK (which itself is an extension of GB 2312-1980).

What is new in GB 18030?

It is a multi-byte encoding using 1-byte, 2-byte, and 4-byte codes. The 1-byte and 2-byte codes have the same assignments as in GBK; the 4-byte codes are new.
There are about 1.6 million valid byte sequences.
The lead byte alone does not show whether a multi-byte sequence is 2 or 4 bytes long; the second byte must be examined as well.
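
The points above can be illustrated with a short Python sketch. The byte ranges used here are not stated in this FAQ; they are the commonly documented GB 18030 layout (single bytes 00-7F, lead bytes 81-FE, 2-byte trail bytes 40-7E and 80-FE, 4-byte sequences with digits 30-39 in the second and fourth positions) and should be checked against the standard itself. The function and constant names are hypothetical.

```python
def gb18030_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2, or 4 for the sequence starting at data[pos].

    Raises ValueError for a truncated or ill-formed sequence.
    Byte ranges are an assumption (see the text above).
    """
    lead = data[pos]
    if lead <= 0x7F:
        return 1                      # single-byte code
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("invalid lead byte")
    if pos + 1 >= len(data):
        raise ValueError("truncated sequence")

    # The lead byte is the same for 2-byte and 4-byte codes,
    # so the decision is made on the second byte.
    second = data[pos + 1]
    if 0x30 <= second <= 0x39:
        if pos + 3 >= len(data):
            raise ValueError("truncated sequence")
        third, fourth = data[pos + 2], data[pos + 3]
        if 0x81 <= third <= 0xFE and 0x30 <= fourth <= 0x39:
            return 4
        raise ValueError("invalid 4-byte sequence")
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2
    raise ValueError("invalid second byte")


# Counting the valid byte sequences under the same assumed ranges.
ONE_BYTE = 0x80                                # 00..7F
LEAD_BYTES = 0xFE - 0x81 + 1                   # 126
TWO_BYTE_TRAILS = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)  # 63 + 127 = 190
TWO_BYTE = LEAD_BYTES * TWO_BYTE_TRAILS        # 23,940
FOUR_BYTE = (LEAD_BYTES * 10) ** 2             # 1,587,600
TOTAL = ONE_BYTE + TWO_BYTE + FOUR_BYTE        # 1,611,668

if __name__ == "__main__":
    print(gb18030_sequence_length(b"\xd6\xd0"))          # 2-byte code
    print(gb18030_sequence_length(b"\x81\x30\x81\x30"))  # 4-byte code
    print(TOTAL)
```

Under these assumptions the total comes to 1,611,668, which matches the figure of about 1.6 million given above.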

Why is GB 18030 important?

The Chinese Government has mandated that all applications released on or after 2001-Sep-01 support GB 18030.

How does GB 18030 relate to Unicode?

The specification defines most character assignments by referring directly to a mapping between GB 18030 codes and Unicode. Some characters that GBK mapped to the PUA (Private Use Area) under Unicode 2.1 are now assigned in Unicode 3.0, and their GB 18030 mappings use only the Unicode 3.0 code points.

In addition, GB 18030 defines roundtrip mappings for all 1.1 million Unicode code points, including unassigned code points and noncharacters but excluding single (unpaired) surrogates. This makes GB 18030 functionally very similar to a UTF (Unicode Transformation Format).
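
The roundtrip behavior can be tried out with the gb18030 codec that ships with Python, as in the sketch below. The arithmetic for supplementary code points (U+10000 and above) is not described in this FAQ; it is the commonly documented linear mapping that starts at the 4-byte code 90 30 81 30 and should be treated as an assumption to verify against the standard. The function name is hypothetical.

```python
def supplementary_to_gb18030(code_point: int) -> bytes:
    """Map a code point in U+10000..U+10FFFF to its 4-byte GB 18030 code.

    Assumes the linear layout that begins at 90 30 81 30 for U+10000.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    # Positions 2 and 4 take 10 values each, position 3 takes 126.
    offset, b4 = divmod(offset, 10)
    offset, b3 = divmod(offset, 126)
    b1, b2 = divmod(offset, 10)
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


if __name__ == "__main__":
    # Roundtrip a few code points through Python's built-in codec.
    for ch in ["A", "\u4e2d", "\U00010000", "\U0010ffff"]:
        encoded = ch.encode("gb18030")
        assert encoded.decode("gb18030") == ch
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")

    # The arithmetic mapping agrees with the codec for supplementary code points.
    for cp in (0x10000, 0x1F600, 0x10FFFF):
        assert supplementary_to_gb18030(cp) == chr(cp).encode("gb18030")
```

Because the supplementary range is computed rather than looked up, only the BMP portion of the mapping needs a table.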

Where can I get a Unicode mapping table for GB 18030?

A Unicode mapping table for GB 18030 in XML format is available from the ICU website (.xml and .zip) and from Mark Davis' website (.zip only).

Where can I get more information on GB 18030?

An article with more details and with implementation suggestions is available on the developerWorks site.