Re: GB18030 mapping

From: Christopher Fynn (
Date: Sat Jan 08 2005 - 11:26:02 CST

    Andrew C. West wrote:

    > Personally, I think that that is precisely not what a GB18030-supporting Unicode
    > application would want to do. The whole point of China defining a large set of
    > precomposed Tibetan characters is to enable the display of Tibetan text using
    > simple font and rendering technology (i.e. not resorting to OpenType etc.). Any
    > font created to display BrdaRten characters would have precomposed Tibetan
    > glyphs mapped to the PUA (F300..F8FF for Set A, somewhere in Plane 16 for Set B
    > which I think is not yet fully defined) and basic Tibetan glyphs mapped to the
    > 0F00..0FFF (excluding the vowels and subjoined consonants which are not used in
    > the BrdaRten model). If an application opens a GB18030 document containing
    > BrdaRten text and then automatically converts it to decomposed Tibetan, then
    > the document will be unreadable to the user with only a BrdaRten font. Therefore
    > the BrdaRten text must be kept as PUA characters in order to be displayed with a
    > BrdaRten font, and you would only want to convert them to decomposed Tibetan if
    > the user specifically requests it.

    However, fonts can be built which support *both*: you can make a font
    with all the pre-composed glyphs mapped to the PUA and GB18030 code
    points *and* lookup tables that can map sequences of Unicode characters
    to the precomposed Tibetan glyphs. Since font developers naturally want
    their fonts to work on the widest range of systems possible it seems
    likely that some developers of Tibetan fonts will do this.

    The trouble is that such fonts allow you to create documents with a
    "mixed encoding", which is very messy.
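
To make "mixed encoding" concrete, here is a rough sketch of how one might flag a string that combines precomposed PUA characters with decomposed Unicode Tibetan. The Set A range (U+F300..U+F8FF) comes from Andrew's message above. Treating the vowel signs U+0F71..U+0F7D and U+0F80..U+0F81 and the subjoined consonants U+0F90..U+0FBC as the markers of decomposed text is my own assumption, as are the function names. Set B is left out because its range is not yet defined.

```python
# Set A range as quoted in this thread: precomposed glyphs at U+F300..U+F8FF.
PUA_SET_A = range(0xF300, 0xF8FF + 1)

# ASSUMPTION: these are the code points taken to signal decomposed Tibetan,
# since the BrdaRten model does not use vowel signs or subjoined consonants.
VOWEL_SIGNS = (range(0x0F71, 0x0F7D + 1), range(0x0F80, 0x0F81 + 1))
SUBJOINED = range(0x0F90, 0x0FBC + 1)


def is_decomposed_mark(cp):
    """True if cp is a Tibetan vowel sign or subjoined consonant."""
    return cp in SUBJOINED or any(cp in r for r in VOWEL_SIGNS)


def classify(text):
    """Count precomposed (PUA Set A) characters and decomposed marks."""
    counts = {"precomposed": 0, "decomposed_marks": 0}
    for ch in text:
        cp = ord(ch)
        if cp in PUA_SET_A:
            counts["precomposed"] += 1
        elif is_decomposed_mark(cp):
            counts["decomposed_marks"] += 1
    return counts


def is_mixed(text):
    """True if the text uses both encoding models at once."""
    counts = classify(text)
    return counts["precomposed"] > 0 and counts["decomposed_marks"] > 0
```

Basic Tibetan letters and punctuation in 0F00..0FFF are deliberately not counted, since both models use them. A check like this only detects the mixture; it does not say which of the two forms the user intended.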

    > As you say, for operations such as collation and comparison you would need to
    > convert "Unicode Tibetan" and "BrdaRten Tibetan" to a common encoding, but that
    > is probably not something that most BrdaRten users will want to do. As to the
    > problems of "mixed encoding", it would be up to the end user to ensure that he
    > uses an input method to write Tibetan that generates BrdaRten characters and not
    > decomposed Tibetan. Anyway, the BrdaRten "standard" explicitly allows for mixed
    > encoding, specifying two levels of support : Level 1 - supporting precomposed
    > Tibetan only; and Level 2 - supporting precomposed Tibetan and decomposed
    > Tibetan.

    Most of the time the end user will only care about what he/she sees on
    the screen and what comes out of the printer. It's when they try to use
    data like this in applications that only support GB18030 or apply
    particular properties to certain PUA characters (we already know that
    there are many applications which do this) or when they try to
    search/replace text and so on that the problems begin - and they
    probably won't know why.
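
A small sketch of why search breaks, and of the "convert to a common encoding" step Andrew mentions for comparison. The mapping table here is entirely hypothetical: the two PUA code points and the stacks assigned to them are invented placeholders, not entries from the BrdaRten tables, and the function names are mine. The point is only that a plain substring search cannot match a precomposed character against the equivalent decomposed sequence unless both sides are first brought to one form.

```python
# HYPOTHETICAL mapping from precomposed PUA characters to decomposed Unicode
# Tibetan. These entries are invented for illustration only; a real converter
# would load the full published BrdaRten mapping instead.
HYPOTHETICAL_SET_A = {
    "\uF300": "\u0F40\u0FB1",  # placeholder: KA + SUBJOINED YA
    "\uF301": "\u0F42\u0FB2",  # placeholder: GA + SUBJOINED RA
}


def to_decomposed(text, table=HYPOTHETICAL_SET_A):
    """Replace each mapped PUA character with its decomposed sequence."""
    return "".join(table.get(ch, ch) for ch in text)


def contains(haystack, needle, table=HYPOTHETICAL_SET_A):
    """Substring test after bringing both strings to decomposed form."""
    return to_decomposed(needle, table) in to_decomposed(haystack, table)


def same_text(a, b, table=HYPOTHETICAL_SET_A):
    """Equality test after bringing both strings to decomposed form."""
    return to_decomposed(a, table) == to_decomposed(b, table)


document = "\uF300\u0F0B"   # precomposed character followed by a tsheg
query = "\u0F40\u0FB1"      # the same stack typed as decomposed Unicode

naive_hit = query in document          # raw code point comparison
common_hit = contains(document, query)  # comparison in a common encoding
```

With the placeholder table, `naive_hit` is false and `common_hit` is true. Converting in this direction is lossless only if every PUA character has exactly one decomposed equivalent, which this sketch simply assumes.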

    IMO this is also a mess for application developers who support Unicode
    but also need to support GB18030 for the Chinese market.

    > It is also worthwhile pointing out that a lot of education about Unicode Tibetan
    > and OpenType technology is taking place both within Tibet and China and at
    > places such as the University of Virginia which has many visiting scholars from
    > Tibet. And as Chris has pointed out elsewhere, a recent study by Chinese
    > academics has confirmed the feasibility of the Unicode Tibetan encoding model in
    > conjunction with OpenType font technology (something that we knew all along, but
    > it is good to see the Chinese beginning to realise that OpenType is not
    > something to be scared of). My feeling is that with the current proliferation of
    > working Tibetan OpenType fonts Tibetan users in China will soon move away from
    > the precomposed Tibetan model, and BrdaRten will be effectively dead before it
    > has been fully defined.

    > Andrew

    Yes, before Xmas I was at the University of Virginia showing three
    Tibetans visiting from China how to convert their fonts to OpenType :-).
    One good thing, Tibetans seem to be very interested in being able to use
    cursive Tibetan script fonts. Since Unicode and OpenType make it
    possible to display the contextual glyph shapes required to render
    cursive Tibetan properly, while a pre-composed encoding with 1-to-1
    character to glyph mapping doesn't handle this - it was pretty easy to
    come up with a compelling demonstration. It may be the fact that they
    can render cursive Tibetan properly that convinces Tibetan users to use
    "pure" Unicode rather than GB18030 or some kind of hybrid.

    - Chris

    This archive was generated by hypermail 2.1.5 : Sat Jan 08 2005 - 11:29:52 CST