Re: GB2312 to Unicode map

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 15 2006 - 17:22:36 CST

    Isn't this mapping documented by the Chinese standards body itself, which created GB2312 and later superseded it with GB18030? In any case, Microsoft documents tables for its Chinese code page, an extension of GB2312 that was later incorporated into GB18030, which is itself a superset of the GB2312 mapping.

    Correct me if I'm wrong, but GB2312 is obsolete because its legacy implementations diverged, something that GB18030 fixed. (I don't know, however, whether the GB18030 mapping agrees without exception with Microsoft's code page extension; I am not speaking here of the MS code page versions, which were revised several times across Windows versions.) Still, I think there exists a common superset of all claimed GB2312 implementations that contains only characters found in at least one of these legacy implementations, and that is a strict subset of the GB18030 standard.
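    One quick way to probe the subset claim is to decode the same bytes with several GB-family converters and compare the results. The sketch below uses the `gb2312`, `gbk` and `gb18030` codecs that ship with CPython; these are only one implementation's tables, not the normative ones, and the helper names `decode_everywhere` and `differing_cells` are made up for this illustration.

```python
# Compare how CPython's built-in GB-family codecs decode the same bytes.
# These codecs are one implementation's view, not the normative tables.

CODECS = ("gb2312", "gbk", "gb18030")


def decode_everywhere(raw):
    """Decode raw bytes with each GB-family codec and return the results."""
    return {name: raw.decode(name) for name in CODECS}


def differing_cells(rows=range(0xA1, 0xAA)):
    """List two-byte codes in the given lead-byte rows (default: the
    non-ideograph rows) that the gb2312 and gbk codecs decode differently."""
    diffs = []
    for hi in rows:
        for lo in range(0xA1, 0xFF):
            raw = bytes([hi, lo])
            try:
                as_gb2312 = raw.decode("gb2312")
            except UnicodeDecodeError:
                continue  # not assigned in this GB2312 table
            as_gbk = raw.decode("gbk", errors="replace")
            if as_gb2312 != as_gbk:
                diffs.append((raw.hex(), as_gb2312, as_gbk))
    return diffs


raw = "中文".encode("gb2312")   # two hanzi from the GB2312 repertoire
results = decode_everywhere(raw)
print(raw.hex(), results)

# Any entries printed here are places where the two tables disagree.
for code, a, b in differing_cells():
    print(code, f"U+{ord(a):04X}", f"U+{ord(b):04X}")
```

    Whatever the second loop prints is specific to the converter being tested; running the same comparison against Microsoft's or ICU's tables would be the way to find out how far the implementations really diverge.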

    However, I know that some characters were included in GB18030 before they were part of Unicode/ISO/IEC 10646, so they had no mapping to Unicode other than to PUA code points. When those characters were later encoded in ISO/IEC 10646 and Unicode, their mappings were changed from the PUA to the newly assigned code points.

    But note that the reverse mapping from Unicode to GB18030 also exists, and it is now described as part of the GB18030 standard, so that all new Unicode/ISO/IEC 10646 code points will have a predefined mapping to GB18030. This means that the legacy GB18030 positions whose characters were later assigned Unicode/ISO/IEC 10646 code points will have two possible mappings when converted back to Unicode (the PUA code point and the new assignment). If the Chinese standards body has agreed not to allocate any new GB18030 position to characters not defined in ISO/IEC 10646, then the mapping is effectively fixed now and algorithmic by ranges (with the exception of the legacy GB mappings, which still require a conversion table to Unicode).
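    To make "algorithmic by ranges" concrete, here is a minimal sketch for the simplest range, the supplementary planes (U+10000 to U+10FFFF), which as far as I know map linearly onto four-byte GB18030 sequences starting at 0x90 0x30 0x81 0x30. The function names are hypothetical, and the BMP portion, which needs a range table plus the legacy two-byte table, is deliberately left out.

```python
# Sketch: linear mapping between supplementary-plane code points and
# four-byte GB18030 sequences. BMP ranges and legacy tables are not covered.

def linear_index(b1, b2, b3, b4):
    """Linear position of a four-byte sequence, counting from 81 30 81 30.
    Bytes 1 and 3 run over 0x81..0xFE (126 values); 2 and 4 over 0x30..0x39."""
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def four_bytes(index):
    """Inverse of linear_index: turn a linear position back into four bytes."""
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([d1 + 0x81, d2 + 0x30, d3 + 0x81, d4 + 0x30])


# Assumption: U+10000 sits at 90 30 81 30, so that is the base of the range.
SUPPLEMENTARY_BASE = linear_index(0x90, 0x30, 0x81, 0x30)


def encode_supplementary(code_point):
    """Map a code point in U+10000..U+10FFFF to its four-byte sequence."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points are handled")
    return four_bytes(code_point - 0x10000 + SUPPLEMENTARY_BASE)


def decode_supplementary(raw):
    """Map a four-byte sequence in the supplementary range to a code point."""
    code_point = linear_index(*raw) - SUPPLEMENTARY_BASE + 0x10000
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("sequence is outside the supplementary range")
    return code_point


for cp in (0x10000, 0x20000, 0x10FFFF):
    raw = encode_supplementary(cp)
    # Cross-check against CPython's own gb18030 codec.
    print(f"U+{cp:X}", raw.hex(), raw == chr(cp).encode("gb18030"))
```

    No table is involved here, only arithmetic, which is what makes it possible to guarantee a mapping for code points that have not been assigned yet.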

    I would be curious to hear comments from those who maintain the conversion mappings at Microsoft and at IBM (in ICU), because I don't know whether the P.R. China conversion rules have since changed to add more exceptions.

      ----- Original Message -----
      From: John H. Jenkins
      To: unicode@unicode.org
      Sent: Friday, December 15, 2006 12:54 AM
      Subject: Re: GB2312 to Unicode map

      That only covers the ideographs. For the non-ideographs, you have to use a different mapping table.

      On Dec 14, 2006, at 2:37 PM, Andrew Miller wrote:

        What about the kGB0 tag in Unihan.txt? It contains 6763 mappings in the version 5.0 file.

        Andrew Miller
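
    For anyone who wants to try the kGB0 route mentioned above, a rough sketch follows. It assumes the usual tab-separated Unihan layout (code point, tag, value) and that the kGB0 value is a four-digit row/cell number, so adding 0xA0 to each half gives the EUC-CN bytes. The function name `parse_kgb0` and the two sample lines are illustrative only; a real run would read the lines from Unihan.txt, and, as noted in the thread, this covers the ideographs only.

```python
# Sketch: build a GB2312 (EUC-CN bytes) -> Unicode table from kGB0 entries.
# Assumes lines of the form "U+4E00<TAB>kGB0<TAB>5027".

def parse_kgb0(lines):
    """Return {two-byte EUC-CN sequence: character} for every kGB0 line."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 3 or fields[1] != "kGB0":
            continue  # some other Unihan tag
        code_point = int(fields[0][2:], 16)      # drop the "U+" prefix
        row, cell = divmod(int(fields[2]), 100)  # e.g. 5027 -> row 50, cell 27
        table[bytes([row + 0xA0, cell + 0xA0])] = chr(code_point)
    return table


# Illustrative sample lines, not a full extract of the file.
SAMPLE_LINES = [
    "# sample kGB0 entries",
    "U+4E00\tkGB0\t5027",
    "U+4E2D\tkGB0\t5448",
    "U+4E00\tkTotalStrokes\t1",
]

table = parse_kgb0(SAMPLE_LINES)
for raw, char in sorted(table.items()):
    print(raw.hex(), f"U+{ord(char):04X}", char)
```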



    This archive was generated by hypermail 2.1.5 : Fri Dec 15 2006 - 17:25:05 CST