Re: GB2312 to Unicode map

From: vunzndi@vfemail.net
Date: Sat Dec 16 2006 - 18:16:50 CST


    Unfortunately, people now often say "GB2312" when they mean GB18030.
    This is not good practice, but it seems to stem from the fact that
    GB18030 maintains the legacy encodings of GB2312 for backward
    compatibility.
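
    The backward compatibility can be checked with a minimal sketch like
    the one below. It assumes Python's built-in "gb2312" and "gb18030"
    codecs and a few sample ideographs picked only for illustration; the
    helper names are hypothetical. A few non-ideographic symbols are known
    to be mapped differently between tables, so this is not a proof for
    the whole repertoire.

```python
# Sample ideographs from the GB2312 repertoire (chosen for illustration only).
samples = ["中", "文", "一"]


def same_bytes(ch: str) -> bool:
    """True if the character gets identical bytes in GB2312 and GB18030."""
    return ch.encode("gb2312") == ch.encode("gb18030")


def in_gb2312(ch: str) -> bool:
    """True if Python's gb2312 codec can encode the character at all."""
    try:
        ch.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False


for ch in samples:
    # Print the character, its two-byte GB2312 form, and the comparison.
    print(ch, ch.encode("gb2312").hex(), same_bytes(ch))

# A supplementary-plane ideograph is outside GB2312 but encodable in GB18030.
extension_b = "\U00020000"
print(in_gb2312(extension_b), len(extension_b.encode("gb18030")))
```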

    As far as I know there have not been any further additions to GB18030
    outside of ISO/IEC 10646, and there are no plans to make any. In fact,
    the People's Republic of China sends by far the most representatives
    to the IRG.

    GB18030, with about 1.6 million code points, does however have over 50%
    more code points in it than the current Unicode standard.
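
    The 1.6 million figure can be reproduced from the byte ranges of the
    encoding. The sketch below assumes the usual description of GB18030
    (one byte 00-7F; two bytes with lead 81-FE and trail 40-7E or 80-FE;
    four bytes with 81-FE, 30-39, 81-FE, 30-39). How this compares with
    Unicode depends on whether one counts the whole Unicode codespace or
    only assigned characters, so the ratio printed here is against the
    codespace only.

```python
# Count the positions available in each GB18030 byte-length class.
single_byte = 0x7F - 0x00 + 1                        # 00..7F
lead = 0xFE - 0x81 + 1                               # 81..FE
trail = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)        # 40..7E and 80..FE
two_byte = lead * trail
digit = 0x39 - 0x30 + 1                              # 30..39
four_byte = lead * digit * lead * digit

gb18030_total = single_byte + two_byte + four_byte

# The Unicode codespace runs from U+0000 to U+10FFFF.
unicode_codespace = 0x10FFFF + 1

print("GB18030 positions:", gb18030_total)
print("Unicode codespace:", unicode_codespace)
print("ratio:", round(gb18030_total / unicode_codespace, 3))
```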

    Also, ISO/IEC 10646 has yet to include all of the characters used on ID
    cards as part of people's names or addresses.
    http://www.cse.cuhk.edu.hk/~irg/irg/irg27/IRGN1262_SubmittedToD_China.pdf

    John Knightley

    Quoting Philippe Verdy <verdy_p@wanadoo.fr>:

    > Isn't this mapping documented by the Chinese standard body itself,
    > that created the GB2312 standard, and later upgraded it to GB18030?
    > Anyway, there are tables documented by Microsoft (for its Chinese
    > codepage extension of the GB2312 standard, an extension which was
    > included into GB18030 which is a superset of the GB2312 mapping).
    >
    > Correct me if I'm wrong, but GB2312 is obsolete, due to some
    > different legacy implementations, something that GB18030 fixed
    > (however I don't know if the GB18030 mapping really has no exception
    > with Microsoft's codepage extension; I am not speaking here
    > about the MS codepage versions which were upgraded several times
    > depending on Windows versions). But I think that there exists a
    > common superset of all claimed GB2312 implementations, that contains
    > only characters found in some of these legacy implementations, but
    > which is a pure subset of the GB18030 standard.
    >
    > However, I know that there are some characters that were integrated
    > into GB18030 before they were part of Unicode/ISO10646, and so they
    > had no corresponding mapping to Unicode other than to PUAs. When
    > those characters were assigned into ISO/IEC 10646 and Unicode, these
    > mappings have changed from PUAs to the new assigned codepoints.
    >
    > But note that the reverse mapping from Unicode to GB18030 also
    > exists, and it is described now as part of the GB18030 standard so
    > that all new Unicode/ISO/IEC 10646 codepoints will have a predefined
    > mapping to GB18030. This means that the newly assigned
    > Unicode/ISO/IEC 10646 codepoints for characters that were
    > previously assigned to legacy GB18030 positions will have two
    > possible mappings when reversed back to Unicode (the PUA and the new
    > assignments). If the Chinese standard has agreed not to allocate
    > any new position in GB18030 to characters not defined in ISO/IEC
    > 10646, then the mapping is effectively fixed now and algorithmic by
    > intervals (with the exception of the legacy GB mappings that still
    > require a conversion table to Unicode).
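
    The "algorithmic by intervals" part can be illustrated for the
    supplementary planes, where the four-byte form is a plain linear
    offset. This is only a sketch: it assumes the commonly documented
    anchor that U+10000 corresponds to bytes 90 30 81 30, and the function
    name is hypothetical. It does not cover the BMP ranges, which need
    range tables plus the legacy two-byte table.

```python
# Linear index of the four-byte sequence 90 30 81 30, assumed to map to U+10000:
# (((0x90 - 0x81) * 10 + 0) * 126 + 0) * 10 + 0 = 189000
BASE_LINEAR = 189000


def supplementary_to_gb18030(cp: int) -> bytes:
    """Encode a supplementary-plane code point as four GB18030 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are handled in this sketch")
    n = cp - 0x10000 + BASE_LINEAR
    n, b4 = divmod(n, 10)    # fourth byte: 30..39
    n, b3 = divmod(n, 126)   # third byte: 81..FE
    b1, b2 = divmod(n, 10)   # second byte: 30..39, first byte: 81..FE
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


# Compare against Python's built-in gb18030 codec for a few code points.
for cp in (0x10000, 0x1F600, 0x20000, 0x10FFFF):
    expected = chr(cp).encode("gb18030")
    print(f"U+{cp:04X}", supplementary_to_gb18030(cp).hex(), expected.hex())
```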
    >
    > I would be curious to hear comments from those that maintain the
    > conversion mappings at Microsoft, and at IBM (in ICU), because I
    > don't know if the P.R. Chinese conversion rules have changed with
    > more exceptions.
    >
    > ----- Original Message -----
    > From: John H. Jenkins
    > To: unicode@unicode.org
    > Sent: Friday, December 15, 2006 12:54 AM
    > Subject: Re: GB2312 to Unicode map
    >
    >
    > That only covers the ideographs. For the non-ideographs, you have
    > to use a different mapping table.
    >
    >
    > On Dec 14, 2006, at 2:37 PM, Andrew Miller wrote:
    >
    >
    > What about the kGB0 tag in Unihan.txt? It contains 6763
    > mappings in the version 5.0 file
    >
    > Andrew Miller
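
    For the kGB0 suggestion above, a rough sketch of turning one Unihan
    entry into GB2312 bytes follows. It assumes kGB0 values are four
    decimal digits giving row and cell, that the EUC-CN byte form adds
    0xA0 to each, and that the file is tab-separated; the sample line is
    written out by hand rather than read from Unihan.txt, and the function
    name is hypothetical. As noted in the thread, this covers only the
    ideographs.

```python
def kgb0_to_euc_cn(kuten: str) -> bytes:
    """Convert a four-digit row/cell value to the two-byte EUC-CN form."""
    if len(kuten) != 4 or not kuten.isdigit():
        raise ValueError(f"unexpected kGB0 value: {kuten!r}")
    row, cell = int(kuten[:2]), int(kuten[2:])
    return bytes([0xA0 + row, 0xA0 + cell])


# Hand-written sample in the assumed "U+XXXX<TAB>kGB0<TAB>value" layout.
line = "U+4E00\tkGB0\t5027"
cp_field, tag, value = line.split("\t")

ch = chr(int(cp_field[2:], 16))   # the Unicode character for this entry
gb_bytes = kgb0_to_euc_cn(value)  # its GB2312 (EUC-CN) byte sequence

# Cross-check the result with Python's built-in gb2312 codec.
print(tag, ch, gb_bytes.hex(), gb_bytes.decode("gb2312") == ch)
```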




    This archive was generated by hypermail 2.1.5 : Sat Dec 16 2006 - 18:20:59 CST