Re: GB18030 mapping

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Sat Jan 08 2005 - 05:13:52 CST

    On Fri, 07 Jan 2005 16:01:56 +0000, Christopher Fynn wrote:
    >
    > Andrew C. West wrote:
    >
    > > Of course if you then want to treat these PUA characters as real
    > > Unicode Tibetan you need to know the character mapping, but from my
    > > perspective character mapping is something that is optionally applied
    > > on top of the code point mapping.
    >
    > As soon as you want to edit the text in a Unicode based application
    > you'd probably need to convert (or "character map") the BrdaRten PUA
    > characters to "real Unicode" [or you might end up with the horrors of a
    > kind of mixed encoding]. Comparing text from "real Unicode" with
    > precomposed Tibetan (PUA or GB18030), and collation would be difficult
    > without conversion as well.

    Personally, I think that that is precisely not what a GB18030-supporting Unicode
    application would want to do. The whole point of China defining a large set of
    precomposed Tibetan characters is to enable the display of Tibetan text using
    simple font and rendering technology (i.e. not resorting to OpenType etc.). Any
    font created to display BrdaRten characters would have precomposed Tibetan
    glyphs mapped to the PUA (F300..F8FF for Set A, somewhere in Plane 16 for Set B
    which I think is not yet fully defined) and basic Tibetan glyphs mapped to the
    0F00..0FFF (excluding the vowels and subjoined consonants which are not used in
    the BrdaRten model). If an application opens a GB18030 document containing
    BrdaRten text and then automatically converts it to decomposed Tibetan, then
    the document will be unreadable to the user with only a BrdaRten font. Therefore
    the BrdaRten text must be kept as PUA characters in order to be displayed with a
    BrdaRten font, and you would only want to convert them to decomposed Tibetan if
    the user specifically requests it.
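
    The code point layout described above can be sketched in Python. This is
    only an illustration, not part of any BrdaRten specification: the Set A
    range (F300..F8FF) and the Tibetan block (0F00..0FFF) come from the text
    above, while the function names and the exact ranges taken to mean
    "vowels and subjoined consonants" (0F71..0F7D and 0F90..0FBC) are
    assumptions made for the example. Set B is left out because its Plane 16
    allocation is not given here.

```python
from collections import Counter

# Ranges quoted in the message above.
BRDARTEN_SET_A = range(0xF300, 0xF8FF + 1)  # PUA, precomposed Tibetan (Set A)
TIBETAN_BLOCK = range(0x0F00, 0x0FFF + 1)   # Unicode Tibetan block

# Assumption for this sketch: "vowels and subjoined consonants" is read as
# the dependent vowel signs and the subjoined consonant letters.
VOWEL_SIGNS = range(0x0F71, 0x0F7D + 1)
SUBJOINED = range(0x0F90, 0x0FBC + 1)


def classify(ch):
    """Return a coarse category for one character."""
    cp = ord(ch)
    if cp in BRDARTEN_SET_A:
        return "brdarten_pua"
    if cp in VOWEL_SIGNS or cp in SUBJOINED:
        return "tibetan_combining"
    if cp in TIBETAN_BLOCK:
        return "tibetan_basic"
    return "other"


def summarize(text):
    """Count how many characters of each category the text contains."""
    return Counter(classify(ch) for ch in text)


def needs_brdarten_font(text):
    """True if any character lies in the Set A PUA range."""
    return any(classify(ch) == "brdarten_pua" for ch in text)
```

    An application could call needs_brdarten_font() when opening a document
    and, if it returns True, leave the PUA characters untouched and offer
    conversion to decomposed Tibetan only as an explicit user action.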

    As you say, for operations such as collation and comparison you would need to
    convert "Unicode Tibetan" and "BrdaRten Tibetan" to a common encoding, but that
    is probably not something that most BrdaRten users will want to do. As to the
    problems of "mixed encoding", it would be up to the end user to ensure that he
    uses an input method to write Tibetan that generates BrdaRten characters and not
    decomposed Tibetan. Anyway, the BrdaRten "standard" explicitly allows for mixed
    encoding, specifying two levels of support: Level 1 - supporting precomposed
    Tibetan only; and Level 2 - supporting precomposed Tibetan and decomposed
    Tibetan.
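
    As a rough sketch of what comparison across the two encodings might look
    like, the Python below converts BrdaRten Set A characters to decomposed
    Tibetan through a caller-supplied table before comparing, and reports
    which of the two support levels a text would need. The function names
    are invented for the example, and the single mapping entry is a
    hypothetical placeholder, NOT a real BrdaRten assignment; the real table
    would have to come from the BrdaRten definition itself. The combining
    ranges used to detect decomposed Tibetan are likewise an assumption.

```python
SET_A_FIRST, SET_A_LAST = 0xF300, 0xF8FF  # Set A PUA range from the message

# Assumption: decomposed Tibetan is detected by the presence of dependent
# vowel signs (0F71..0F7D) or subjoined consonants (0F90..0FBC).
COMBINING_RANGES = ((0x0F71, 0x0F7D), (0x0F90, 0x0FBC))


def is_set_a(ch):
    return SET_A_FIRST <= ord(ch) <= SET_A_LAST


def is_combining(ch):
    return any(lo <= ord(ch) <= hi for lo, hi in COMBINING_RANGES)


def required_level(text):
    """Return 2 if the text contains decomposed Tibetan marks, else 1."""
    return 2 if any(is_combining(ch) for ch in text) else 1


def to_decomposed(text, mapping):
    """Replace each Set A character using the supplied mapping table."""
    out = []
    for ch in text:
        if is_set_a(ch):
            if ch not in mapping:
                raise KeyError(f"no mapping for U+{ord(ch):04X}")
            out.append(mapping[ch])
        else:
            out.append(ch)
    return "".join(out)


def same_text(a, b, mapping):
    """Compare two strings after converting both to decomposed Tibetan."""
    return to_decomposed(a, mapping) == to_decomposed(b, mapping)


# HYPOTHETICAL placeholder entry, for illustration only.
HYPOTHETICAL_MAPPING = {"\uf300": "\u0f40\u0fb1"}
```

    Raising an error on an unmapped PUA character, rather than passing it
    through, is a deliberate choice in this sketch: a silent pass-through
    would make two different precomposed characters compare as unequal to
    their decomposed forms without any indication of why.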

    It is also worthwhile pointing out that a lot of education about Unicode Tibetan
    and OpenType technology is taking place both within Tibet and China and at
    places such as the University of Virginia which has many visiting scholars from
    Tibet. And as Chris has pointed out elsewhere, a recent study by Chinese
    academics has confirmed the feasibility of the Unicode Tibetan encoding model in
    conjunction with OpenType font technology (something that we knew all along, but
    it is good to see the Chinese beginning to realise that OpenType is not
    something to be scared of). My feeling is that, with the current proliferation of
    working Tibetan OpenType fonts, Tibetan users in China will soon move away from
    the precomposed Tibetan model, and BrdaRten will be effectively dead before it
    has been fully defined.

    Andrew


