Re: Major Defect in Combining Classes of Tibetan Vowels

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 24 2003 - 19:11:31 EDT

    Chris Fynn wrote:

    > In Unicode's UnicodeData.txt (
    > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt )
    > 0F7E has a Canonical Combining Class Value (CCCV) of 0;
    > 0F71 a CCCV of 129;
    > 0F72 0F7A 0F7B 0F7C 0F7D and 0F80 a CCCV of 130;
    > 0F74 a CCCV of 132;
    > and 0F82 and 0F83 have a CCCV of 230.
    >
    > By normal Tibetan & Dzongkha spelling, writing, and input rules
    > Tibetan script stacks should be entered and written: 1 headline
    > consonant (0F40-0F6A), any subjoined consonant(s) (0F90-
    > 0FBC), achung (0F71), shabkyu (0F74), any above headline
    > vowel(s) (0F72 0F7A 0F7B 0F7C 0F7D and 0F80) ; any ngaro (0F7E,
    > 0F82 and 0F83)
    >
    > So following normal Tibetan & Dzongkha input and spelling rules
    > the relative ordering of these characters should be:
    > A. 0F71
    > B. 0F74
    > C. 0F72 0F7A 0F7B 0F7C 0F7D and 0F80
    > D. 0F7E, 0F82 and 0F83
    >
    > The fact that, in a process of "canonical decomposition" or
    > "normalisation", these combining characters can get reordered
    > in a bizarre order relative to each other

    Actually, looking at this data, while I can see that the
    combining classes are assigned less than optimally, I don't
    see that this makes any practical problem for Tibetan data.
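
    The combining classes quoted above can be checked directly. The
    following is a minimal sketch, assuming Python's standard
    unicodedata module; the MARKS grouping and the helper name
    combining_classes are illustrative and not part of any standard.

```python
import unicodedata

# Code points quoted in the message, grouped by their role in the stack.
# The role names are illustrative labels only.
MARKS = {
    "achung": [0x0F71],
    "shabkyu": [0x0F74],
    "vowels_above": [0x0F72, 0x0F7A, 0x0F7B, 0x0F7C, 0x0F7D, 0x0F80],
    "ngaro": [0x0F7E, 0x0F82, 0x0F83],
}

def combining_classes(marks=MARKS):
    """Return {role: {code point: canonical combining class}}."""
    return {
        role: {cp: unicodedata.combining(chr(cp)) for cp in cps}
        for role, cps in marks.items()
    }

if __name__ == "__main__":
    for role, classes in combining_classes().items():
        print(role, {f"{cp:04X}": ccc for cp, ccc in classes.items()})
```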

    You are saying, in effect, that the stack structure has
    the following position classes (treating the consonant stack
    itself as the more tightly bound unit that I will just
    symbolize as CS):

       CS - achung - shabkyu - vowelsabove - ngaro
       
    And since shabkyu has cc=132 whereas the vowelsabove have
    cc=130, they would reorder out of expected order if
    normalized. However, for most text the shabkyu (u-below)
    would be in complementary distribution with the vowels
    above, so the effective positional classes are:

                     { vowelsabove }
       CS - achung - { shabkyu } - ngaro
       
    And in this case, the relative combining class of the vowels
    doesn't really matter, since we wouldn't be seeing both
    present to reorder around each other.
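
    For the minority case where both do occur, the reordering can be
    seen with a short sketch, again assuming Python's unicodedata
    module. 0F40 stands in for the consonant stack CS and 0F72 for one
    of the vowels above; the helper name nfd_codepoints is
    hypothetical.

```python
import unicodedata

KA = "\u0F40"        # a headline consonant standing in for CS
SHABKYU = "\u0F74"   # cc=132
VOWEL_I = "\u0F72"   # one of the vowels above, cc=130

def nfd_codepoints(text):
    """Normalize to NFD and return the code points as hex strings."""
    return [f"{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]

if __name__ == "__main__":
    # Traditional spelling order: shabkyu first, then the vowel above.
    print(nfd_codepoints(KA + SHABKYU + VOWEL_I))
    # A stack with only one of the two marks is left untouched.
    print(nfd_codepoints(KA + SHABKYU))
```

    With both marks present, the cc=130 vowel sorts ahead of the
    cc=132 shabkyu; with only one of them there is nothing to reorder.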

    I'm guessing that you are claiming there are instances where
    the shabkyu does co-occur with other vowels above as well.
    Wouldn't those, if they do occur, represent a distinctly
    minority case in terms of the overall processing? The short
    summaries of Tibetan writing that I've seen don't even mention
    it as a possibility, since even the few diphthongs in -u
    are written with a separate stack <0F60, 0F74> to the
    right of the main stack.

    > causes difficulties
    > with culturally correct collation (where 0F7E, 0F82 and 0F83
    > should have an equal value) - and especially it necessitates
    > making lookups in smart fonts far more complex and inefficient
    > than they should have to be.

    And I'm not seeing the problem here, either. Since the
    combining class of 0F7E is 0, and not some other random
    value, it isn't going to reorder around the other vowel
    marks. If it is entered in the traditional spelling order you
    have indicated, then it is going to stay in that position;
    normalization won't move it. And since the equivalent
    0F82 and 0F83 sift to the end of the syllable, with their
    high combining class, they'll end up in the same position
    as the 0F7E ngaro if normalized.
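
    A small sketch of that behavior, assuming Python's unicodedata
    module; the constant names are illustrative. A class-0 character
    such as 0F7E is never moved by canonical reordering, while the
    class-230 0F82 sorts after a class-130 vowel.

```python
import unicodedata

def nfd(text):
    """Shorthand for canonical decomposition with reordering."""
    return unicodedata.normalize("NFD", text)

KA = "\u0F40"          # stands in for the consonant stack
VOWEL_I = "\u0F72"     # a vowel above, cc=130
NGARO_0F7E = "\u0F7E"  # cc=0
NGARO_0F82 = "\u0F82"  # cc=230

# 0F7E entered in the traditional position stays where it was typed.
traditional = KA + VOWEL_I + NGARO_0F7E
stays_put = nfd(traditional) == traditional

# 0F82 entered before the vowel is moved to the end of the stack.
moved = nfd(KA + NGARO_0F82 + VOWEL_I)

if __name__ == "__main__":
    print(stays_put, [f"{ord(c):04X}" for c in moved])
```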

    The only problem you'd have is with Tibetan data where a
    0F7E ngaro is entered in other than the optimal spelling
    order you indicated. Such a sequence won't compare equal
    unless you add a spelling equivalence rule on top of the
    canonical equivalence. But there are a number of such edge
    cases for Brahmic scripts -- not just Tibetan.
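
    One way such a spelling equivalence rule might look is sketched
    below, assuming Python's unicodedata module. The function name
    spelling_fold and the rule itself (move any 0F7E past the
    combining marks that follow it) are assumptions for illustration,
    not anything defined by the standard.

```python
import unicodedata

NGARO = "\u0F7E"

def spelling_fold(text):
    """NFD, then move each 0F7E to the end of its run of combining marks.

    Hypothetical spelling rule layered on top of canonical equivalence.
    """
    out = []
    pending = 0  # 0F7E characters waiting to be emitted
    for ch in unicodedata.normalize("NFD", text):
        if ch == NGARO:
            pending += 1
        elif unicodedata.combining(ch) != 0:
            out.append(ch)            # keep marks, hold the ngaro back
        else:
            out.extend(NGARO * pending)  # next base: flush held ngaro
            pending = 0
            out.append(ch)
    out.extend(NGARO * pending)
    # Marks that 0F7E used to separate are now adjacent, so reorder again.
    return unicodedata.normalize("NFD", "".join(out))

KA, VOWEL_I = "\u0F40", "\u0F72"
optimal = KA + VOWEL_I + NGARO
other = KA + NGARO + VOWEL_I
```

    Under plain NFD the two spellings stay distinct; after the fold
    they compare equal.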

    Culturally correct collation is first a matter of giving
    the three ngaro characters equivalent weights. Beyond that,
    as you indicated, the weighting of the syllables (or stacks)
    is complicated, and isn't going to be affected by 0F7E
    having combining class 0 in any case.
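
    The first step can be sketched as a simple fold, assuming Python's
    unicodedata module. The name ngaro_key is hypothetical, and mapping
    0F82 and 0F83 onto 0F7E is only a stand-in for assigning equal
    weights in a real collation table; it does not attempt the syllable
    weighting.

```python
import unicodedata

# Treat the three ngaro characters as one for comparison purposes
# (an illustrative fold, not a full collation tailoring).
NGARO_EQUIV = {0x0F82: 0x0F7E, 0x0F83: 0x0F7E}

def ngaro_key(text):
    """Normalize, then fold 0F82 and 0F83 onto 0F7E."""
    return unicodedata.normalize("NFD", text).translate(NGARO_EQUIV)

KA, VOWEL_I = "\u0F40", "\u0F72"
keys = {ngaro_key(KA + VOWEL_I + n) for n in ("\u0F7E", "\u0F82", "\u0F83")}
```

    All three spellings collapse to a single key when the ngaro is in
    the traditional final position.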

    >
    > (In Tibetan script fonts 0F71 and 0F74 are often ligated with
    > preceding consonant (+ subjoined consonants) combined as a
    > single glyph whereas above headline vowels are almost always
    > treated as non spacing combining marks.)

    Yes, but the only point where this would be a problem would
    be for stacks with a shabkyu (u vowel) *and* another vowel.
    And even for such cases, wouldn't this be handled effectively
    by 6 triples in the ligature tables which would identify
    any shabkyu moved after one of the other 6 vowels?
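
    One possible reading of those six cases is sketched below,
    assuming Python's unicodedata module: for a given stack, the six
    normalized sequences in which the shabkyu has been moved after a
    vowel above. The helper name reordered_triples is hypothetical,
    and how a font's ligature table would actually encode these is not
    shown.

```python
import unicodedata

VOWELS_ABOVE = ["\u0F72", "\u0F7A", "\u0F7B", "\u0F7C", "\u0F7D", "\u0F80"]
SHABKYU = "\u0F74"

def reordered_triples(stack):
    """Return the six NFD forms of <stack, shabkyu, vowel above>."""
    return [
        unicodedata.normalize("NFD", stack + SHABKYU + vowel)
        for vowel in VOWELS_ABOVE
    ]

if __name__ == "__main__":
    for seq in reordered_triples("\u0F40"):
        print(" ".join(f"{ord(c):04X}" for c in seq))
```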

    >
    > Currently there seems to be no easy or standardized work around
    > for these problems and the standard seems to say that the
    > relative values of assigned Canonical Combining Class Values
    > cannot be changed.

    They cannot.

    > Any suggestions as to how to create a standardized work around
    > for these incorrect values?

    I guess I'm not getting it. I don't see the need for a
    "standardized" work around, here.

    --Ken

    >
    > - Chris


