RE: Precomposed Tibetan

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 17 2002 - 16:53:43 EST

    Marco commented:

    > Another key point, IMHO, is verifying the following claim contained in the
    > proposal document:
    >
    > "Tibetan BrdaRten characters are structure-stable characters widely
    > used in education, publication, classics documentation including Tibetan
    > medicine. The electronic data containing BrdaRten characters are
    > estimated beyond billions. Once the Tibetan BrdaRten characters are encoded
      ^^^^^^^^^^^^^^^^^^^^^^^^^
    > in BMP, many current systems supporting ISO/IEC10646 will enable Tibetan
    > processing without major modification. Therefore, the international standard
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
    > Tibetan BrdaRten characters will speed up the standardization and
    > digitalization of Tibetan information, keep the consistency of
    > implementation level of Tibetan and other scripts, develop the Tibetan
    > culture and make the Tibetan culture resources shared by the world." [BTW,
    > billions of what!?]

    The Chinese delegation at the WG2 meeting agreed with a restatement of
    this as "gigabytes of data". Exactly what kind of data, they did not say,
    but in principle that could consist of a few medium-size databases. It
    almost certainly does not consist of billions of *documents*.

    > I'd propose the following:
    >
    > 1. Find all the available technical details about this BrdaRten
    > encoding.

    One additional detail for people. The BrdaRten stacks are currently
    implemented, in the Founders System software in Tibet, as an extension
    to GB 2312.

    > 2. Come up with a precise machine-readable mapping file from
    > BrdaRten encoding to *decomposed* Unicode Tibetan, possibly accompanied by a
    > sample conversion application.
    > Reasons: (a) to make it easy to migrate BrdaRten legacy data to
    > Unicode; (b) to easily update existing BrdaRten applications to export
    > Unicode text; (c) to easily retrofit new Unicode applications to import
    > BrdaRten text.
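
A minimal sketch of what such a mapping file and sample converter might look like, in Python. The two legacy byte pairs below are placeholders chosen for illustration, not real BrdaRten code points (those are not given in this thread); the function name `decode_legacy` is likewise hypothetical. Only the Unicode side (decomposed Tibetan base letter plus subjoined letters) is real.

```python
# Hypothetical mapping table: 2-byte legacy code -> *decomposed* Unicode Tibetan.
# The byte values are placeholders, NOT actual BrdaRten assignments.
LEGACY_TO_UNICODE = {
    b"\xaa\xa1": "\u0F62\u0F90",        # RA + SUBJOINED KA
    b"\xaa\xa2": "\u0F66\u0F92\u0FB2",  # SA + SUBJOINED GA + SUBJOINED RA
}


def decode_legacy(data: bytes, fallback: str = "gb2312") -> str:
    """Decode GB 2312 text that also contains the (hypothetical) extension codes."""
    out = []
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if pair in LEGACY_TO_UNICODE:
            # Precomposed stack: emit the decomposed Unicode sequence.
            out.append(LEGACY_TO_UNICODE[pair])
            i += 2
        elif data[i] < 0x80:
            # Single-byte ASCII.
            out.append(chr(data[i]))
            i += 1
        else:
            # Ordinary 2-byte GB 2312 character; raises on invalid input.
            out.append(pair.decode(fallback))
            i += 2
    return "".join(out)
```

Inverting the table would give the export direction mentioned in (b), assuming the mapping is one-to-one.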

    See the key words "without major modification" above. If the BrdaRten
    stacks were encoded in Unicode, they would automatically become part
    of GB 18030 (because of the UTF-like nature of that strange standard).
    However, the catch is that the actual code points for Unicode/10646 are
    not predictable or controllable by the Chinese NB. That means that the
    final code points in GB 18030 are also not predictable -- and almost
    certainly are not the same as those used by the current GB 2312 extension
    in Tibet. And *that* means that the current "characters ... estimated
    beyond billions" will have to be migrated to a new encoding, anyway,
    once the systems are updated to GB 18030. That is the reason for the
    quibble word "major" in the phrase above. All the data will be reencoded,
    but the transition GB 2312 + Tibetan extension ==> GB 18030 containing
    Tibetan extension is viewed as "just a mapping" and not a major system
    modification.

    The alternative (and even scarier) prospect is that the existing GB 2312
    Tibetan extension code points would be forced as is into a new version
    of GB 18030, invalidating the mapping for the existing code points,
    and creating a completely new version of GB 18030 that would have to
    be supported as a different "code page" from the existing GB 18030. This
    would start us down the road to an indefinite number of distinct GB 18030
    mappings, probably not properly labeled in interchange, with large numbers
    of predictable interoperability problems (likely to dwarf the JIS
    yen sign/backslash kinds of problems). The reason this prospect is even
    thinkable is that any existing implementation of the BrdaRten stacks
    in a GB 2312 extension would surely be using 2-byte character encodings,
    and a transition to 4-byte GB 18030 character encodings would likely
    disrupt the existing implementations significantly.
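
The 2-byte versus 4-byte difference can be checked with Python's standard codecs. This is only an illustration using currently encoded characters; the helper name `encoded_width` is made up for the example, and no BrdaRten code points are involved.

```python
def encoded_width(ch: str, codec: str) -> int:
    """Return how many bytes one character occupies in the given codec."""
    return len(ch.encode(codec))


han = "\u4E2D"         # a Han character present in GB 2312
tibetan_ka = "\u0F40"  # TIBETAN LETTER KA, not in GB 2312

print(encoded_width(han, "gb2312"))          # 2
print(encoded_width(han, "gb18030"))         # 2
print(encoded_width(tibetan_ka, "gb18030"))  # 4
```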

    The question for Unicoders is whether introduction of significant
    normalization problems into Tibetan (for everyone) is a worthwhile tradeoff
    for this claimed legacy ease of transition for one system, when it is
    clear that all existing legacy data using these precomposed stacks is
    going to have to be either reencoded anyway or surrounded by migration
    filters for new systems.
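
A rough sketch of the kind of normalization mismatch at issue, using Python's `unicodedata`. The first comparison uses an existing Tibetan character that does have a canonical decomposition. The second uses a private-use code point (U+E000) purely as a stand-in for a hypothetical newly encoded stack, under the assumption that such a character would carry no canonical decomposition; how any real proposal would handle this is not stated here.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if two strings are identical after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Existing case: U+0F43 TIBETAN LETTER GHA decomposes to U+0F42 U+0FB7,
# so both spellings compare equal after normalization.
gha_precomposed = "\u0F43"
gha_decomposed = "\u0F42\u0FB7"

# Assumed case: U+E000 stands in for a new precomposed stack (hypothetical),
# with no decomposition tying it to RA + SUBJOINED KA.
stack_precomposed = "\uE000"
stack_decomposed = "\u0F62\u0F90"

print(canonically_equivalent(gha_precomposed, gha_decomposed))      # True
print(canonically_equivalent(stack_precomposed, stack_decomposed))  # False
```

Under that assumption, searching or comparing text would treat the two spellings of the same stack as different strings.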

    --Ken


