RE: Order of Infrequent Combining Marks in Thai

From: Peter Constable (petercon@microsoft.com)
Date: Mon May 21 2007 - 16:13:47 CDT


    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Richard Wordingham

    > Who or what chooses which is the correct order for combining marks in
    > strings in the Thai script when some of them belong to the 'inherited'
    > class? Is the order established? ...

    > The Thai and Latin sequencing principles, plus the fact
    > that there is a functional unit <U+0E0A, U+0359> representing the 'zh'
    > sound, argue for <U+0E0A, U+0359, U+0E31>, but the overstrict Uniscribe
    > implementation on Windows XP seems to argue for <U+0E0A, U+0E31, U+0359>.

    Here's a longish response...

    First, I would not take the over-strict implementation on Windows XP as precedent setting. There are mono-script-Thai clusters for Pali that it can't handle, for instance. The Thai engine in Vista and Office 2007 has been relaxed (though we didn't anticipate use of these "common" marks).

    The biggest problem for displaying this kind of cluster is the lack of awareness of a need to support marks other than Thai-specific marks. I doubt very many Thai fonts support the combining asterisk below, for instance. I don't think we allow for these kinds of combinations in Vista Uniscribe -- and there are some good reasons why this wouldn't be straightforward, as I'll explain.

    A complicating factor in this is that canonical combining classes for Thai marks were assigned on one basis, but classes for "common" marks, which historically have been used primarily with Latin, were assigned on a different basis, and the two were not designed anticipating usage of the two together. I'll come back to this.

    In loosening up the Thai-Lao engine for Vista, I tried to work out what made sense given the way canonical combining classes are assigned for Thai and Lao. The aim was something that would limit the cases of multiple canonically non-equivalent sequences with visually identical appearance while allowing greater freedom in Thai/Lao mark combinations. I was also thinking about some decisions UTC had made, or that I thought it would be making, on mark ordering in these cases. (There were some decisions for Indic scripts; I thought some decisions were going to be made for some scripts including at least Thai and Lao, but I can't find any record of such decisions by UTC.) I concluded that we could expect all below marks to precede all above marks.

    That would fit the sequence order you thought would make sense. Take note, though: you were determining that on a *functional* basis, but in general Unicode doesn't assign positioning classes for "common" marks based on function. Also, the Thai below marks are in non-zero classes, and so sequences involving Thai below marks and "common" below marks will fold under normalization. For example, the following sequences are in canonical-equivalence classes as indicated (the NFC representation is marked #):

    <U+0E0A, U+0359, U+0E39> = #<U+0E0A, U+0E39, U+0359>

    <U+0E0A, U+0359, U+0E3A> = #<U+0E0A, U+0E3A, U+0359>

    <U+0E0A, U+0359, U+0E3A, U+0E39>
    = <U+0E0A, U+0359, U+0E39, U+0E3A>
    = <U+0E0A, U+0E3A, U+0359, U+0E39>
    = #<U+0E0A, U+0E3A, U+0E39, U+0359>
    = <U+0E0A, U+0E39, U+0359, U+0E3A>
    = <U+0E0A, U+0E39, U+0E3A, U+0359>
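
    A minimal sketch for checking equivalences like these, assuming Python's standard unicodedata module and whatever Unicode data ships with the interpreter. The names BASE, MARKS, fmt and nfc_forms are illustrative helpers, not anything from Uniscribe or the standard:

```python
import unicodedata
from itertools import permutations

BASE = "\u0E0A"                          # THAI CHARACTER CHO CHANG
MARKS = ["\u0359", "\u0E39", "\u0E3A"]   # asterisk below, sara uu, phinthu

def fmt(text):
    """Render a string in the <U+XXXX, ...> notation used in this message."""
    return "<" + ", ".join(f"U+{ord(ch):04X}" for ch in text) + ">"

def nfc_forms(base, marks):
    """Return the set of NFC forms reached from every ordering of the marks."""
    return {
        unicodedata.normalize("NFC", base + "".join(order))
        for order in permutations(marks)
    }

# Canonical combining class of each mark, as reported by the local Unicode data.
for mark in MARKS:
    print(fmt(mark), "ccc =", unicodedata.combining(mark))

# All orderings that share an NFC form are canonically equivalent.
for form in nfc_forms(BASE, MARKS):
    print("#" + fmt(form))
```

    Because all three marks are in distinct non-zero classes, every ordering collapses to a single NFC form, with the marks sorted by ascending combining class.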

    And this leads to a thorny open issue: if these are canonically equivalent, hence should display the same, how should the Thai fixed-position-class marks and the "common" marks interact typographically? There simply are no historical conventions that establish an answer to this question.

    (Even for the fixed-position classes 9 and 103 used for Thai marks alone, I don't know of historical conventions establishing the typographic interaction of marks. Our decision in Vista was to have phintuu position below the below-base vowels /u/ and /uu/, but I don't know of any basis to declare this correct or incorrect.)

    Above marks add their own complications. Right now, Thai above marks are in two classes: 0 and 107. Because marks in the same class don't re-order in normalization, and class-0 marks never re-order, effectively this makes all the Thai above marks behave as though they're in one class. Different combinations are therefore canonically non-equivalent and can be distinguished visually: the marks simply stack outward in order. But as soon as you throw class 230 marks into the mix, things suddenly get complicated: class 107 and class 230 marks *will* re-order, and so differently-ordered sequences of marks from these classes will fold under normalization, hence should not be visually distinct. For example:

    <U+0E0A, U+0300, U+0E48>
    = #<U+0E0A, U+0E48, U+0300>
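
    The re-ordering behaviour described here can be sketched as follows. This is a simplified illustration of canonical ordering for text that has no canonical decompositions (true of the Thai characters used in these examples), not a full normalizer; canonical_reorder is a hypothetical helper name, and the combining classes come from Python's unicodedata module:

```python
import unicodedata

def canonical_reorder(text):
    """Stable-sort each run of non-zero-class marks by combining class.

    Class-0 characters act as barriers: they never move, and marks
    never move across them. Marks in the same class keep their order
    because Python's sorted() is stable.
    """
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Class 230 (U+0300) followed by class 107 (U+0E48): the two swap.
mixed = "\u0E0A\u0300\u0E48"
print([f"U+{ord(c):04X}" for c in canonical_reorder(mixed)])

# Class 107 (U+0E48) followed by class 0 (U+0E35): nothing moves.
thai_only = "\u0E0A\u0E48\u0E35"
print([f"U+{ord(c):04X}" for c in canonical_reorder(thai_only)])
```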

    So, how should these marks interact typographically? And what happens if we throw in a class-0 Thai mark?

    #<U+0E0A, U+0E35, U+0E48, U+0300>
    = <U+0E0A, U+0E35, U+0300, U+0E48>

    is distinct from

    <U+0E0A, U+0300, U+0E48, U+0E35>
    = #<U+0E0A, U+0E48, U+0300, U+0E35>

    is distinct from

    <U+0E0A, U+0300, U+0E35, U+0E48>

    is distinct from

    <U+0E0A, U+0E48, U+0E35, U+0300>

    (Note that there are six possible visual typographic arrangements of the three marks, but due to canonical combining class assignments only a four-way distinction is possible under normalization. For two pairs, there is no basis to say which visual arrangement from the pair is the one that should be displayed.)
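
    A minimal sketch for enumerating these groupings, again assuming Python's standard unicodedata module. The helper names fmt and equivalence_classes are illustrative only:

```python
import unicodedata
from collections import defaultdict
from itertools import permutations

BASE = "\u0E0A"                          # THAI CHARACTER CHO CHANG
MARKS = ("\u0E35", "\u0E48", "\u0300")   # sara ii (class 0), mai ek (107), grave (230)

def fmt(text):
    """Render a string in the <U+XXXX, ...> notation used in this message."""
    return "<" + ", ".join(f"U+{ord(ch):04X}" for ch in text) + ">"

def equivalence_classes(base, marks):
    """Group every ordering of the marks by its NFC form."""
    groups = defaultdict(list)
    for order in permutations(marks):
        sequence = base + "".join(order)
        groups[unicodedata.normalize("NFC", sequence)].append(sequence)
    return dict(groups)

classes = equivalence_classes(BASE, MARKS)
for nfc, members in classes.items():
    print("#" + fmt(nfc))
    for member in members:
        if member != nfc:
            print("  = " + fmt(member))
print(len(classes), "canonically distinct groups from",
      sum(len(m) for m in classes.values()), "orderings")
```

    The two-member groups are the pairs for which normalization gives no basis to choose between visual arrangements; the single-member groups are the orderings in which the class-0 mark sits between the other two and blocks any re-ordering.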

    In order to support "common" marks in Thai clusters, we must have *some* answer for what happens in these mixed cases as we *must* display arbitrary combinations of Thai and "common" marks in *some* way.

    Btw, I'd be interested in scanned samples of publications in which the kinds of scenarios you're raising are attested.

    Peter



    This archive was generated by hypermail 2.1.5 : Mon May 21 2007 - 16:16:20 CDT