Re: New property for reordrant dependent vowels reordering?

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Tue Sep 06 2005 - 16:00:23 CDT

    Kent Karlsson wrote:

    >> Our difference here largely results from fundamentally different
    > ...
    >> is to insert a conjoining code between the codepoints for the
    >> consonant of this cluster.
    >
    > Whatever one thinks of the 'virama model', that is the model
    > standardised. I don't see any way of changing that model now.
    > It is much much too late for that.

    But I am not proposing a change to the encoding! The only implementational
    issues are:

    1) Should vowels be re-ordered across a virama + ZWNJ boundary?

    2) Does the automatic insertion of a visible virama because the font can't
    cope with the cluster create such a boundary?

    The answer to (1) is not a simple yes; I think it should be a simple 'no' -
    and I have a work-around below for those who would object to the answer
    'yes' to the second question.

    The encoding of the 'virama model' generally works:

    virama + ZWNJ => end of syllable, visible virama
    vowel => end of syllable
    virama + non-letter => end of syllable
    virama + ZWJ => script-specific effect. For Devanagari, it marks the end of
    a column.
    other virama => conjoin.
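
To make the list above concrete, here is a minimal sketch of it as a classifier. It is only an illustration under stated assumptions: the function name `classify` is invented here, "vowel" is narrowed to the Devanagari dependent vowel signs U+093E..U+094C, and "letter" is approximated by a Unicode general category starting with "L". None of this is taken from the Unicode Standard or from any rendering engine.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def is_dependent_vowel(ch):
    # Assumption: only the main block of Devanagari vowel signs is covered.
    return "\u093E" <= ch <= "\u094C"


def classify(text, i):
    """Classify text[i] according to the five rules listed above."""
    ch = text[i]
    if is_dependent_vowel(ch):
        return "end of syllable"
    if ch != VIRAMA:
        return "not a boundary candidate"
    nxt = text[i + 1] if i + 1 < len(text) else ""
    if nxt == ZWNJ:
        return "end of syllable, visible virama"
    if nxt == ZWJ:
        return "script-specific effect"
    # Assumption: "letter" is approximated by general category L*.
    if nxt and unicodedata.category(nxt).startswith("L"):
        return "conjoin"
    # End of text or any non-letter after the virama.
    return "end of syllable"


TTA, TTHA = "\u091F", "\u0920"
print(classify(TTA + VIRAMA + ZWNJ + TTHA, 1))
print(classify(TTA + VIRAMA + TTHA, 1))
```

The two calls at the end cover the case under discussion (virama + ZWNJ) and the plain conjoining case.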

    I don't think my conception is inconsistent with the 'virama model', but I
    do think it is a better way of looking at what is going on.

    While it would have been nice to have had a specific code for a visible
    virama, it is indeed too late.

    > Thai and Lao are also encoded differently, such that there is no
    > reordering problem for display, but there is one for collation instead.
    > The latter is even ambiguous. This has been solved by doing a simplified,
    > rather than semantically correct, reordering to logical order (now via
    > collation clusters).

    The default collation order for Thai in the Unicode Collation Algorithm
    agrees with the order in Thai dictionaries. Their sort order is explained
    in Campbell & Shaweevongs, 'The Fundamentals of the Thai Language (Fifth
    Edition)', Appendix 8, 'How to Use a Thai Dictionary'. I had a long debate
    with a Thai lady well versed in formal Thai grammar on the subject of
    ordering, and I could always find counterexamples to her explanations on the
    sorting of words that contradicted Campbell & Shaweevongs. She did give one
    rule though that is not in C&S or the UCA - when spelt the same, phonetic
    CCV precedes phonetic CVC. It works with แหน in the 'New Standard
    Thai-English Dictionary' - but the words are the other way round in the
    Dictionary of the Royal (Thai) Institute (Ratchabandit). A case of TiT I
    suppose.

    I do not believe there is a 'logical order' for Thai, at least not unless
    you add (in some fashion or other) placeholders for preposed vowels. As an
    indication, consider มนโฑ 'Montho', i.e. 'Mandora', and แมโคร 'macro'.
    *แมคโร would be pronounced quite differently to the word for 'macro'.
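
As a rough illustration of the 'simplified' reordering mentioned in the quoted text, the sketch below swaps each preposed Thai vowel (U+0E40..U+0E44) with the single character that follows it. The function name `simple_reorder` is invented here, and the one-character swap is an assumption about what "simplified" means; it is not a transcription of the UCA tables.

```python
# Preposed Thai vowels: SARA E, SARA AE, SARA O, SARA AI MAIMUAN, SARA AI MAIMALAI
PREPOSED = set("\u0E40\u0E41\u0E42\u0E43\u0E44")


def simple_reorder(word):
    """Swap every preposed vowel with the one character after it."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREPOSED:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return "".join(chars)


macro = "\u0E41\u0E21\u0E42\u0E04\u0E23"   # the word for 'macro' above
montho = "\u0E21\u0E19\u0E42\u0E11"        # 'Montho' above
print([f"U+{ord(c):04X}" for c in simple_reorder(macro)])
print([f"U+{ord(c):04X}" for c in simple_reorder(montho)])
```

The swap is purely mechanical: it always moves the vowel past exactly one character and takes no account of which consonants the vowel is actually pronounced after, which is why it cannot be called a semantically correct logical order.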

    >> > You really need a character based criterion, which is font
    >> > independent.

    >> Therefore you encode the form that is desired in an ideal world, and
    >> ignore the effects of the font. The visible viramas are the ones that
    >> are visible in the desired form - as simple as that!

    > Hmm. Would this "desired ideal" be language independent
    > (though still script dependent)?

    If virama + ZWNJ is as much of a break as I think it is, then the desired
    form defined by a 'well-formed' sequence of codepoints is as well defined as
    for the Latin script. (I don't have a definition of 'well-formed'.)

    On forcing an author's (or typographer's?) preferred form for <TTA, TTHA, I>
    when the font leaves no alternative but use of a virama:

    >> > I'm not happy to leave this to be entirely platform/font dependent.
    >>
    >> Uniscribe interprets the code sequences as I would expect them to be
    >> interpreted. I see no font dependency in these sequences.
    >
    > That is what one "platform", in one particular version, does. (Not any of
    > the versions I've got...) Not sure it is THE one behaviour to be
    > standardised.

    Can the experts please tell us whether the following sequences have a
    definite meaning in the Devanagari script, and if so, what is the meaning?

    <TTA, DEPENDENT I, VIRAMA, ZWNJ, TTHA>
    <TTA, VIRAMA, ZWNJ, TTHA, DEPENDENT I>

    (I will try looking for information in the Indic list's February posts -
    thanks for the pointer, Antoine.)

    > And Peter mentioned font dependence (for a future version), that I think
    > is inappropriate for this.

    I can now see how the rendering of <TTA, VIRAMA, TTHA, DEPENDENT I> in the
    absence of a conjunct form TTA.TTHA and the absence of a half form for TTA
    can be made font-specific. Simply complete the set of half forms by
    defining the missing half forms to be isolated form plus virama! I see
    nothing in the Unicode Standard that prohibits this.
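
The idea of completing the set of half forms can be sketched as a toy model. Everything in it is hypothetical: the `font` dictionary, the glyph names, and the functions `half_form` and `shape_cluster` are invented for illustration and do not correspond to OpenType, Uniscribe, or any real shaping engine.

```python
def half_form(font, cons):
    """Return the glyphs used as the half form of a consonant.

    If the font has no real half form, fall back to the isolated
    form followed by a visible virama, as suggested above.
    """
    if cons in font["half_forms"]:
        return [cons + ".half"]
    return [cons, "virama"]


def shape_cluster(font, c1, c2, pre_vowel):
    """Lay out <c1, VIRAMA, c2, pre_vowel> as a list of glyph names."""
    if (c1, c2) in font["conjuncts"]:
        body = [c1 + "_" + c2]               # a single conjunct glyph
    else:
        body = half_form(font, c1) + [c2]    # half form (real or fallback) + full form
    # The preposed vowel goes before the whole cluster in every case.
    return [pre_vowel] + body


# A font with neither the TTA.TTHA conjunct nor a half form for TTA.
poor_font = {"half_forms": set(), "conjuncts": set()}
print(shape_cluster(poor_font, "tta", "ttha", "i_matra"))
```

In this model the fallback glyphs simply stand in for the missing half form, so the visible virama does not create a new boundary and the vowel is placed the same way whatever the font provides.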

    Richard.


