Re: Major Defect in Combining Classes of Tibetan Vowels: Illustration

From: Christopher John Fynn (cfynn@gmx.net)
Date: Wed Jun 25 2003 - 21:39:16 EDT

    Difficulties due to the present combining class values attached
    to these characters most frequently occur with
    abbreviations/contractions and/or with cursive scripts. With
    abbreviations it is common to have two or more vowels on a
    consonant stack. In cursive or semi-cursive forms of Tibetan
    script the subjoined vowels U+0F71, U+0F74 and U+0F75 form
    ligatures with the consonant(s) in the stack, while above
    headline vowel(s) such as U+0F72, U+0F7A and U+0F7C sometimes
    form a ligature with the following consonant or punctuation
    mark.

    In Dzongkha (Bhutanese) abbreviated spellings are often the
    usual way of writing words, and a semi-cursive form of Tibetan
    script (Joyig) is standard - so the problem occurs frequently.
    I have a 225-page dictionary, and several other lists, of common
    abbreviations which are full of examples where this problem
    occurs.

    I've attached a couple of real and fairly simple examples.

    Example 1
    ========
    Following normal orthographic rules the characters to produce
    Example1_gTuig.jpg would be entered as:

    U+0F42 U+0F4F U+0F74 U+0F72 U+0F42

    If the characters remain in that order there is no problem:

    the first U+0F42 is straightforward; the isolated character is
    displayed as a simple glyph "uni0F42",
    the sequence U+0F4F U+0F74 is replaced by a ligature
    "uni0F4F0F74", and
    U+0F72 U+0F42 is replaced by a ligature "uni0F720F42".

    Now if the text goes through a "normalisation" process the same
    text ends up reordered as:
    U+0F42 U+0F4F U+0F72 U+0F74 U+0F42
    because the combining class value of U+0F72 is less than that of
    U+0F74.

    To render this there is no change for the first character, but I
    now need a lookup to render the whole sequence
    U+0F4F U+0F72 U+0F74 U+0F42 with the two glyphs
    "uni0F4F0F74 uni0F720F42".
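
    The reordering described in Example 1 can be reproduced with any
    conformant normaliser. What follows is only a minimal sketch using
    Python's standard unicodedata module; the helper name code_points
    and the variable names are illustrative and not part of any font
    or rendering tool.

```python
import unicodedata

# Example 1 as typed: U+0F42, U+0F4F, vowel U+0F74 (below),
# vowel U+0F72 (above), U+0F42.
typed = "\u0F42\u0F4F\u0F74\u0F72\u0F42"


def code_points(text):
    """Format a string as space-separated U+XXXX labels."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Canonical combining classes as recorded in the character database
# that ships with the Python interpreter.
ccc_i = unicodedata.combining("\u0F72")
ccc_u = unicodedata.combining("\u0F74")

# Canonical decomposition applies canonical reordering to the marks.
normalized = unicodedata.normalize("NFD", typed)

print("ccc(U+0F72) =", ccc_i, " ccc(U+0F74) =", ccc_u)
print("typed:     ", code_points(typed))
print("normalized:", code_points(normalized))
```

    Because the class of U+0F72 is lower than that of U+0F74, the two
    vowel signs change places and the ligature pairs are no longer
    adjacent.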

    Example 2
    ========
    Following normal orthographic rules the characters to produce
    Example2_gTuop.jpg would be entered as:

    U+0F42 U+0F4F U+0F74 U+0F7C U+0F54

    If the characters remain in that order there is no problem -

    the first U+0F42 is as in the first example
    the sequence U+0F4F U+0F74 is replaced by a ligature
    "uni0F4F0F74"
    U+0F7C U+0F54 is replaced by a ligature "uni0F7C0F54"

    However, since the combining class value of U+0F7C is less than
    that of U+0F74, after a "normalisation" process the same text
    ends up reordered as:
    U+0F42 U+0F4F U+0F7C U+0F74 U+0F54

    and the whole sequence
    U+0F4F U+0F7C U+0F74 U+0F54 needs to be replaced with the
    two glyphs "uni0F4F0F74 uni0F7C0F54".
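
    To make the extra lookup concrete, here is a toy model of ligature
    substitution applied to Example 2. It is a sketch under stated
    assumptions, not how any real shaping engine or OpenType table
    works: the tables PAIR_LOOKUPS and REORDERED_LOOKUPS and the
    function shape are hypothetical, and only the glyph names are
    taken from the examples.

```python
# Hypothetical lookups keyed by code point sequence. The two pair
# lookups are all that is needed for text in the order it was typed.
PAIR_LOOKUPS = {
    (0x0F4F, 0x0F74): ["uni0F4F0F74"],
    (0x0F7C, 0x0F54): ["uni0F7C0F54"],
}

# Additional "many to many" lookup needed only for normalised text.
REORDERED_LOOKUPS = {
    (0x0F4F, 0x0F7C, 0x0F74, 0x0F54): ["uni0F4F0F74", "uni0F7C0F54"],
}


def shape(code_points, lookups):
    """Greedy longest-match substitution. Code points that match no
    lookup fall back to a default glyph named uniXXXX."""
    longest = max(len(key) for key in lookups)
    glyphs = []
    i = 0
    while i < len(code_points):
        for size in range(min(longest, len(code_points) - i), 0, -1):
            key = tuple(code_points[i:i + size])
            if key in lookups:
                glyphs.extend(lookups[key])
                i += size
                break
        else:
            glyphs.append(f"uni{code_points[i]:04X}")
            i += 1
    return glyphs


typed = [0x0F42, 0x0F4F, 0x0F74, 0x0F7C, 0x0F54]
normalized = [0x0F42, 0x0F4F, 0x0F7C, 0x0F74, 0x0F54]

print(shape(typed, PAIR_LOOKUPS))
print(shape(normalized, PAIR_LOOKUPS))
print(shape(normalized, {**PAIR_LOOKUPS, **REORDERED_LOOKUPS}))
```

    With only the pair lookups, the normalised sequence falls back to
    five separate default glyphs; the intended ligatures come back
    only once the four-character lookup is added.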

    Example 3 - (Example3_aMi_aiM.jpg)
    ==============================

    This is taken from an entirely different source, the "TibetBT"
    font which was specially created for a project in Sichuan
    digitising the Tibetan bstan-'gyur (a vast canonical collection
    of texts in over 200 large volumes originally translated
    from Sanskrit into Tibetan). The glyph set of the font is the
    same as the set of Tibetan stacks found in that collection.
    All stacks including any combining vowels are implemented as
    precomposed ligatures. This font can be downloaded from
    (though it is wrapped-up in a Windows "setup.exe" file).

    Here we have two stacks which one would naturally enter as
    U+0F68 U+0F7E U+0F72 and U+0F68 U+0F72 U+0F7E respectively. No
    problem so long as the characters remain in that order. However,
    since U+0F72 has a combining class value greater than that of
    U+0F7E, in a process of "normalisation" U+0F72 would always
    float to the end and both strings would end up as U+0F68 U+0F7E
    U+0F72 and be indistinguishable.

    If there were only a few and fixed number of cases like the
    first two examples it would not be *much* of a problem to add
    the extra lookups - even though my font would need both "many to
    one" and "many to many" lookups to handle it. But there are
    *numerous* cases I already know of and there is no fixed and
    final list of such abbreviations. So I should really build the
    tables in my font to be able to handle almost any possibility.
    If the combining classes of vowels & marks were based on the
    expected order, where subjoined vowels are always written before
    any above headline vowels, this would be reasonably
    straightforward to do. But given the order in which they may now
    wind up after normalisation, it requires adding a huge number of
    complex lookups to the tables in my font. Once I've done this it
    is going to be very difficult to test all the permutations.
    Because of the number of additional lookups I need it is also
    likely there will be a hefty performance hit - especially on
    reflowing large documents. Unfortunately the third example
    can't simply be fixed by font lookups since two distinct
    combinations wind up being identical and hence would have to be
    rendered identically.
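
    The difference that writing-order classes would make can be
    sketched with a hand-written version of canonical reordering. The
    values in ACTUAL_CLASS are the currently assigned combining
    classes of these vowel signs; the values in HYPOTHETICAL_CLASS are
    invented purely for illustration (any values that sort subjoined
    vowels before above headline vowels would do) and are not a
    proposal for specific numbers. The function name
    canonical_reorder is likewise only illustrative.

```python
# Currently assigned canonical combining classes.
ACTUAL_CLASS = {
    0x0F71: 129,  # subjoined vowel sign AA
    0x0F72: 130,  # vowel sign I (above)
    0x0F74: 132,  # vowel sign U (below)
    0x0F7A: 130,  # vowel sign E (above)
    0x0F7C: 130,  # vowel sign O (above)
}

# HYPOTHETICAL values, invented for this sketch only: subjoined
# vowels sort before above headline vowels, as they are written.
HYPOTHETICAL_CLASS = {
    0x0F71: 129,
    0x0F74: 130,
    0x0F72: 132,
    0x0F7A: 132,
    0x0F7C: 132,
}


def canonical_reorder(code_points, classes):
    """Stable-sort each run of characters whose class is non-zero;
    characters missing from the table count as class 0."""
    result = []
    run = []
    for cp in code_points:
        if classes.get(cp, 0) == 0:
            result.extend(sorted(run, key=lambda c: classes[c]))
            run = []
            result.append(cp)
        else:
            run.append(cp)
    result.extend(sorted(run, key=lambda c: classes[c]))
    return result


example1 = [0x0F42, 0x0F4F, 0x0F74, 0x0F72, 0x0F42]
example2 = [0x0F42, 0x0F4F, 0x0F74, 0x0F7C, 0x0F54]

for typed in (example1, example2):
    print(canonical_reorder(typed, ACTUAL_CLASS) == typed,
          canonical_reorder(typed, HYPOTHETICAL_CLASS) == typed)
```

    Under the writing-order values both examples come through
    unchanged, so the simple pair lookups would keep working on
    normalised text.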

    If I wrote a piece of software where values I'd assigned caused
    problems and inefficiencies like this, I'd count it as a major
    fault or bug and hurry to fix it by assigning the correct
    values. I know the Tibetan characters were discussed in great
    detail by a number of "experts" at the time they were encoded -
    however there was little or no substantial discussion amongst
    these experts about the canonical combining class values
    assigned to the characters by the UTC. If the combining
    classes of Tibetan dependent vowels had been based on the order
    in which these characters are normally written or typed there
    would not be this problem in processing them.

    I believe that correcting the canonical combining class values
    of these characters is the best solution. Leaving things as
    they are is going to cause a lot of extra work for implementors
    and inefficiencies in implementations. There is no work-around
    for the problem illustrated by Example 3. Someone suggested
    encoding an otherwise identical set of characters with the
    correct combining class values and deprecating the existing
    ones, but this is not a real solution, only a kludge. And how
    could encoding otherwise identical characters in ISO/IEC 10646
    be justified, since that standard does not specify canonical
    combining class values of characters?

    - Chris

    Christopher Fynn
    4 Chester Court
    84 Salusbury Road
    London NW6 6PA



    Example2_gTuop.jpg
    Example1_gTuig.jpg
    Example3_aMi_aiM.jpg
