Re: Manchu/Mongolian in Unicode

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Oct 15 2002 - 12:49:08 EDT

    Andrew C. West wrote:

    > On Tue, 15 Oct 2002, "Stefan Persson" wrote:
    >
    >>That font also includes some characters mapped to the PUA: A € sign, and
    >>several 漢 character, many of which look like radicals. Why? Is that
    >>something that's also required by that law?
    >
    > It's my experience that many fonts include gunk in the Private Use Area. A quick check of some of
    > the CJK glyphs in the PUA of SimSun-18030 shows that they are not unique, but are also mapped to
    > codepoints in the CJK Radical Supplement and CJK-A blocks for example.

    I may be able to shed some light on this.

    GB 18030 is really an extension not only of GB 2312, but also of GBK.
    GBK contained all ideographs from Unicode 2.0, plus of course many other characters.

    GB 18030 is based on Unicode 3.0. Between 2.0 and 3.0 some characters were added to Unicode that GBK had mapped to the Unicode Private Use Area. GB 18030 maps those characters to their Unicode 3.0 code points instead of PUA ones, and the PUA ones now map instead to linearly enumerated 4-byte sequences.
    About 80 such characters are affected, among them the Euro sign and the Ideographic Description Characters. (They are listed in Appendix E of the GB 18030 standard.)

    I assume that the font shows glyphs for those 80 or so characters at both the old GBK/Unicode PUA position and the new GB 18030/Unicode 3.0 code point.

    See http://oss.software.ibm.com/icu/docs/papers/gb18030.html
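
    As a rough illustration of the remapping described above, here is a minimal sketch that relies on Python's built-in "gb18030" codec. The choice of U+E76C as the old PUA position of the Euro sign is my assumption from memory, not something taken from the standard's text, so treat it as a placeholder for "one of the affected PUA code points".

```python
# Sketch only: shows how the Euro sign and a PUA code point encode in GB 18030.
# Assumption: U+E76C is the PUA code point that GBK-era tables used for the
# Euro sign; swap in another affected code point if that is wrong.
euro = "\u20ac"
old_pua = "\ue76c"

for label, ch in (("Euro sign", euro), ("old PUA position", old_pua)):
    encoded = ch.encode("gb18030")
    print(f"{label} U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")
    # Both code points remain encodable, so the round trip is lossless.
    assert encoded.decode("gb18030") == ch
```

    The point is only that both code points stay representable: the real Unicode character takes over the short byte sequence, and the PUA code point is moved elsewhere rather than dropped.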

    > I believe that it is intended to maintain a one-to-one correspondence between the GB18030 standard
    > and Unicode, and so there should be no need for any supplementary glyphs in the PUA.
    >
    > The new PRC law is, as you hint, overly restrictive and prescriptive, and is, I think, a serious
    > setback for popularisation of Unicode on the Web. The intent is that GB18030 should replace GB2312

    ... and GBK ...

    > and Big5, and so that instead of the current mishmash of GB2312 (SC) and Big5 (TC) websites, in the
    > future Traditional and Simplified Chinese sites (at least those hosted in China) will use the same
    > GB18030 encoding.

    I am not sure about this. The GB 18030 mandate requires software to _support_ the new encoding, but I believe it does not require anyone to _use_ it.
    Most implementations have a converter to/from Unicode, and GB 18030 works quite well for that because it is defined _in terms of_ Unicode.
    As such, it actually boosts the spread of Unicode-based software. The drawback is of course that a GB 18030 converter requires special code on top of a large mapping table.
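
    To give an idea of what the "special code" looks like, here is a minimal sketch of the purely algorithmic part, covering only the supplementary code points U+10000..U+10FFFF, which GB 18030 enumerates linearly starting at the four-byte sequence 90 30 81 30. The function names are hypothetical, and a real converter additionally needs the mapping table and the BMP ranges, which are not shown.

```python
# Sketch only: linear four-byte GB 18030 sequences for supplementary code points.
# Four-byte layout: b1 in 81..FE, b2 in 30..39, b3 in 81..FE, b4 in 30..39.
# All function names here are hypothetical, chosen for illustration.

def linear_index(b1, b2, b3, b4):
    """Position of a four-byte sequence, counting from 81 30 81 30 as 0."""
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

def from_linear(index):
    """Inverse of linear_index: turn a position back into four bytes."""
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([b1 + 0x81, b2 + 0x30, b3 + 0x81, b4 + 0x30])

# U+10000 is assigned to 90 30 81 30; everything above follows linearly.
SUPPLEMENTARY_BASE = linear_index(0x90, 0x30, 0x81, 0x30)

def encode_supplementary(code_point):
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    return from_linear(SUPPLEMENTARY_BASE + (code_point - 0x10000))

def decode_supplementary(data):
    if len(data) != 4:
        raise ValueError("expected a four-byte sequence")
    code_point = 0x10000 + linear_index(*data) - SUPPLEMENTARY_BASE
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("sequence is outside the supplementary range")
    return code_point

if __name__ == "__main__":
    for cp in (0x10000, 0x20000, 0x10FFFF):
        print(f"U+{cp:04X} -> {encode_supplementary(cp).hex(' ')}")
```

    The BMP part works the same way in principle, except that the linear ranges have gaps wherever a code point already has a one- or two-byte mapping in the table, which is why the table and the arithmetic have to be combined.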

    > Where does this leave websites written in Unicode Chinese ? Out in the cold !
    >
    > At present web pages written in Unicode Chinese (some of mine for example) are not being indexed by
    > Google, and are ignored by both Yahoo China (SC) and Chinese Yahoo (TC). The situation will
    > certainly not be improved by the replacement of GB2312 and Big5 with GB18030.

    There is no reason for that. You should contact Google to get that fixed.

    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

