GB18030 mapping (was Re: ISO 10646 compliance and EU law)

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Fri Jan 07 2005 - 05:06:46 CST

    On Thu, 6 Jan 2005 12:08:55 -0800 (PST), Kenneth Whistler wrote:
    >
    > Example 2:
    >
    > China decides to add Tibetan BrdaRten syllables to GB 18030
    > and map them to PUA characters in 10646.
    >
    > Well, guess what -- *all* PUA code points in 10646 already
    > have defined mappings to GB 18030. That means that the addition
    > of the Tibetan BrdaRten syllables and definition of mappings
    > will *change* those mappings, and will require changes to the
    > mappings tables. The only way to avoid that would be for
    > any GB 18030 additions to be defined at specific code points
    > currently labelled as empty in GB 18030 but mapped to 10646
    > PUA code points.

    Which is what they are doing.

    > For instance:
    >
    > TIBETAN CHARACTER KA U ==> AAA1 <--> U+E000
    >
    > That wouldn't change the code point mapping,

    Which was precisely my point: the code point mapping is fixed and stable. I
    think some confusion arises because you are talking about *character
    mapping*, whilst I (and Philippe, I think) am talking about *code point
    mapping*, and of course the two mappings are not at all the same thing.

    > but... to actually
    > support the standardization of such a set of syllables in
    > GB 18030, the vendor mapping tables will have to introduce,
    > instead, the one-to-many mappings to actually interpret the
    > Tibetan syllables as what they are, instead of PUA code points,
    > so you would end up with the following entry in the mapping
    > tables:
    >
    > AAA1 <--> <U+0F40, U+0F74>
    >

    Well, it all depends. A text editor might import a GB18030 document with
    BrdaRten SetA characters and, using the code point mapping tables, convert
    AAA1 etc. to U+E000 etc. The user then selects a BrdaRten font that maps
    precomposed BrdaRten glyphs to U+E000 etc., and everything is displayed
    correctly. This kind of support does not need any modifications to the
    mapping tables, as the mapping of U+E000 to <U+0F40, U+0F74> is irrelevant ...
    PUA characters are just PUA characters, and if you have the right font these
    PUA characters will be rendered as precomposed BrdaRten glyphs.

    Of course if you then want to treat these PUA characters as real Unicode Tibetan
    you need to know the character mapping, but from my perspective character
    mapping is something that is optionally applied on top of the code point
    mapping. For example, using my BabelPad application you can open a
    GB18030-encoded document, and it will convert all GB18030 code points to their
    corresponding Unicode code points. If the document contains BrdaRten SetA
    characters, they will be converted to the PUA code points U+F300..U+F595, and
    if you have a suitable BrdaRten font the Tibetan text will be displayed
    correctly. If you then want to convert these PUA code points to their
    corresponding Tibetan block
    character sequences, you can select the required text and choose "Convert to
    Unicode", and the application applies a separate *character mapping table* to
    convert the text. If next year the Chinese amend the SetA repertoire (as is
    quite possible), then I'll have to modify my BrdaRten character mapping table,
    but I expect my GB18030 code point mapping routine to remain the same.
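
    The two layers described above can be sketched roughly as follows, in
    Python. This is only an illustration: the single table entries are the
    hypothetical AAA1 / U+E000 / <U+0F40, U+0F74> example from the quoted text,
    and the names CODE_POINT_MAP, CHARACTER_MAP, decode_code_points and
    convert_to_unicode are invented here. It is not BabelPad's actual code.

```python
# Layer 1: code point mapping (GB18030 two-byte code -> Unicode code point).
# Assumed entry, taken from the hypothetical example in the quoted text.
# This table is expected to stay fixed.
CODE_POINT_MAP = {0xAAA1: 0xE000}

# Layer 2: character mapping (PUA code point -> Tibetan block sequence).
# Assumed entry from the same example. This is the table that would need
# amending if the BrdaRten repertoire changed.
CHARACTER_MAP = {0xE000: "\u0F40\u0F74"}


def decode_code_points(codes):
    """Convert GB18030 codes to Unicode using only the code point mapping."""
    return "".join(chr(CODE_POINT_MAP[code]) for code in codes)


def convert_to_unicode(text, table=CHARACTER_MAP):
    """Optionally replace PUA characters with their Tibetan sequences.

    Characters without an entry in the table are passed through unchanged.
    """
    return "".join(table.get(ord(ch), ch) for ch in text)


pua_text = decode_code_points([0xAAA1])      # PUA text, displayable with a PUA font
tibetan_text = convert_to_unicode(pua_text)  # optional second step
```

    Amending CHARACTER_MAP changes only what the second step produces; the
    output of decode_code_points is unaffected.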

    Andrew


