GB18030 mapping (was Re: ISO 10646 compliance and EU law)

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Sat Jan 08 2005 - 08:22:35 CST


    On Thu, 6 Jan 2005 17:36:01 (CST), Mark Davis wrote:
    >
    > I agree with Ken's statement, but would qualify one bit.
    >
    >> to about March 31, 2005 will contain the mappings:
    >>
    >> FE90 <--> U+E854
    >> 82359133 <--> U+9FBA
    >>
    >> After that time, they will contain the mappings:
    >>
    >> ???? <--> U+E854
    >> FE90 <--> U+9FBA
    >> 82359133 <--> ???? (probably U+FFFD)
    >
    >
    >The http://www.unicode.org/reports/tr22/ recommends mapping tables of the
    >following form to handle that situation, by changing the old cases into
    >one-way mappings. This provides a more graceful transition.
    >
    >
    > FE90 <-- U+E854
    > FE90 <--> U+9FBA
    > 82359133 --> U+9FBA
    >

    I'm sorry, but I just can't agree with this analysis.

    At present a GB18030-Unicode mapping table includes the entries:

    GB FE90 <--> U+E854
    GB 82359133 <--> U+9FBA
    GB 8338E335 <--> U+F300

    A pan-GB18030 font will map:
            FE90/U+E854 to a CJK ideograph glyph
            82359133/U+9FBA to the notdef glyph
            8338E335/U+F300 to the notdef glyph

    At some point in the future, the CJK ideograph currently represented at
    FE90/U+E854 may be encoded at 82359133/U+9FBA, and 8338E335/U+F300 may be
    defined as the precomposed Tibetan syllable I. If this happens, the
    GB18030-Unicode mapping table will still be:

    GB FE90 <--> U+E854
    GB 82359133 <--> U+9FBA
    GB 8338E335 <--> U+F300

    However, now a pan-GB18030 font should map:
            FE90/U+E854 to the notdef glyph
            82359133/U+9FBA to a CJK ideograph glyph
            8338E335/U+F300 to a glyph corresponding to <U+0F68 U+0F72>

    As far as I understand things, the mappings between GB18030 and Unicode won't
    change; what may change is the character that any particular GB18030 code
    point represents.
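
    To make the distinction concrete, here is a rough Python sketch. The table
    and glyph names (GB_TO_UNICODE, FONT_GLYPHS, glyph_for) and the version
    labels "A" and "B" are purely illustrative assumptions of mine, not part of
    any standard or font format. The mapping table is the same for both
    versions; only the font's glyph coverage differs.

```python
# Illustrative sketch only: all names here are hypothetical.

# GB18030 byte sequence -> Unicode code point. Identical for both versions.
GB_TO_UNICODE = {
    bytes.fromhex("FE90"): 0xE854,
    bytes.fromhex("82359133"): 0x9FBA,
    bytes.fromhex("8338E335"): 0xF300,
}

NOTDEF = ".notdef"

# Assumed glyph coverage of a pan-GB18030 font for each repertoire version.
FONT_GLYPHS = {
    "A": {0xE854: "cjk_ideograph"},
    "B": {0x9FBA: "cjk_ideograph", 0xF300: "tibetan_syllable_i"},
}


def glyph_for(gb_bytes, version):
    """Look up the glyph a font for the given version shows for a GB code."""
    code_point = GB_TO_UNICODE[gb_bytes]  # version-independent step
    return FONT_GLYPHS[version].get(code_point, NOTDEF)  # version-dependent step
```

    The converter never needs to know which version it is dealing with; only
    the font (or whatever else interprets the code points) does.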

    There will, however, be a mapping between different implicit versions of GB18030
    when such changes in the GB18030 repertoire take place, so that, for example,
    GB18030 version A FE90 = GB18030 version B 82359133. The mapping "FE90 <-->
    U+9FBA" given by Ken and Mark is making an implicit conversion from GB18030
    version A to GB18030 version B (i.e. FE90 --> 82359133 --> U+9FBA), which I do
    not believe is appropriate in most circumstances.
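
    A minimal sketch of what I mean, again in Python with hypothetical names
    (A_TO_B, decode, decode_version_a_as_b): the cross-version conversion is a
    separate, explicit step that a caller opts into, and is not folded into
    the GB18030-Unicode table itself.

```python
# Illustrative sketch only: all names here are hypothetical.

# GB18030 byte sequence -> Unicode code point. Unchanged between versions.
GB_TO_UNICODE = {
    bytes.fromhex("FE90"): 0xE854,
    bytes.fromhex("82359133"): 0x9FBA,
    bytes.fromhex("8338E335"): 0xF300,
}

# Assumed cross-version table: where a version A code lives in version B.
A_TO_B = {
    bytes.fromhex("FE90"): bytes.fromhex("82359133"),
}


def decode(gb_bytes):
    """Plain GB18030 -> Unicode mapping, with no version conversion."""
    return GB_TO_UNICODE[gb_bytes]


def decode_version_a_as_b(gb_bytes):
    """Explicitly reinterpret version A data as version B, then map."""
    return GB_TO_UNICODE[A_TO_B.get(gb_bytes, gb_bytes)]
```

    With this split, decode() gives FE90 --> U+E854 for any data, and
    FE90 --> U+9FBA only results when the caller deliberately asks for the
    version A to version B conversion.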

    Also, I think it would not be correct to state that 8338E335 should map to
    <U+0F68 U+0F72> just because 8338E335 represents a precomposed Tibetan character
    equivalent to <U+0F68 U+0F72>. I would say that the relationship between
    8338E335 and <U+0F68 U+0F72> is more like a normalization mapping; that is to
    say, 8338E335 maps to U+F300 for all versions of GB18030, but for "version B" of
    GB18030 U+F300 may optionally be "normalized" to <U+0F68 U+0F72>.
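
    Sketched in Python (the names VERSION_B_NORMALIZATION and decode, and the
    normalize flag, are my own hypothetical choices), the optional step would
    sit after the mapping and leave the mapping itself alone:

```python
# Illustrative sketch only: all names here are hypothetical.

# GB18030 byte sequence -> Unicode character. Same for all versions.
GB_TO_UNICODE = {
    bytes.fromhex("8338E335"): "\uf300",
}

# Assumed "version B" normalization: PUA character -> encoded sequence.
VERSION_B_NORMALIZATION = {
    "\uf300": "\u0f68\u0f72",
}


def decode(gb_bytes, normalize=False):
    """Map a GB code to Unicode, optionally applying the normalization."""
    text = GB_TO_UNICODE[gb_bytes]
    if normalize:
        # Optional post-processing; the table above is left untouched.
        text = VERSION_B_NORMALIZATION.get(text, text)
    return text
```

    By default 8338E335 still decodes to U+F300; <U+0F68 U+0F72> appears only
    when the normalization is requested.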

    Andrew


