Re: New property for reordrant dependent vowels reordering?

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon Sep 05 2005 - 07:58:55 CDT


    Kent Karlsson wrote:

    > Richard Wordingham wrote:

    >> The primitive method of forming conjuncts is
    >> just to stack the consonants vertically,

    > That's not a conjunct, that's a stack ;-) We are obviously using
    > different terminology here. When I wrote "conjunct" read "conjunct
    > form" (and look that up in the TUS4 glossary).

    Indeed. By conjunct I think I mean what TUS4 calls 'conjunct consonants',
    but its definition seems to be wrong:

    1) TUS4 says they consist of one or more dead consonants followed by a live
    consonant. That implies Sanskrit _u:rk_ ऊर्क् U+090A, U+0930, U+094D,
    U+0915, U+094D (stem _u:rj_ ऊर्ज् is listed in Monier-Williams at
    http://www.ibiblio.org/sripedia/ebooks/mw/0200/mw__0254.html ) is not
    written with conjunct consonants!

    2) In Section 9.6 'Tamil', 'Ligatures', it says 'Vowel re-ordering occurs
    around conjunct consonants.' This is only true if (visible) <consonant
    pulli consonant> is not counted as a series of conjunct consonants.
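
    Point 1 can be checked mechanically. The following is a rough sketch in
    Python that lists the code points quoted for _u:rk_ and applies the TUS4
    wording ('one or more dead consonants followed by a live consonant')
    literally. The helper names and the consonant range U+0915..U+0939 are
    illustrative assumptions, not anything defined by TUS4.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def is_consonant(ch):
    # Assumption: treat U+0915 (KA) .. U+0939 (HA) as the consonants.
    return "\u0915" <= ch <= "\u0939"

def consonant_states(text):
    """Label each consonant 'dead' if a virama follows it directly, else 'live'."""
    states = []
    for i, ch in enumerate(text):
        if is_consonant(ch):
            dead = text[i + 1:i + 2] == VIRAMA
            states.append("dead" if dead else "live")
    return states

def fits_tus4_wording(text):
    """True if the consonants are one or more dead ones followed by a live one."""
    states = consonant_states(text)
    return len(states) >= 2 and states[-1] == "live" and "live" not in states[:-1]

urk = "\u090A\u0930\u094D\u0915\u094D"  # UU, RA, VIRAMA, KA, VIRAMA as quoted
rka = "\u0930\u094D\u0915"              # RA, VIRAMA, KA for comparison

for ch in urk:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(fits_tus4_wording(urk), fits_tus4_wording(rka))
```

    Under these assumptions the last line prints False True: the cluster in
    _u:rk_ ends in a dead consonant, so the literal wording excludes it.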

    On the definition of orthographic syllable:

    >> To me, a more natural
    >> formulation is:

    >> <consonant, {combining marks, at least one of which is a conjoiner}>*
    >> <consonant, {maybe combining marks or visible virama, no conjoiner}>

    > Whether a virama is visible or not (absorbed into a half form or a
    > conjunct) is in general font dependent; the above is not a good
    > criterion for orthographic syllables.

    Our difference here largely results from fundamentally different
    conceptions. I see the basic elements of an Indic script being CW units,
    where W is an explicit vowel, a visible virama, or the implicit vowel. The
    visible virama seems to be a late addition to the system. The vowel (W)
    side can be extended by anusvara, additional vowels etc, but that is not the
    cause of our differences. The consonant side can be extended to a consonant
    cluster. These CW units, possibly extended in these ways, are what I
    understand by the term 'orthographic syllable'.

    Now in my conception this cluster does *not* contain C+virama elements.
    This is an important difference. The primitive way of writing this cluster
    is as a stack - the virama is a late addition to the system.

    Now, when we encode text as a sequence of codepoints, we choose not to
    encode the implicit vowel with a codepoint. This leaves us with the problem
    of distinguishing consonants in a cluster from Ca syllables. The solution
    is to insert a conjoining code between the codepoints for the consonants of
    this cluster.

    Now in two scripts where the virama is a marginal element of the script,
    Khmer and Tibetan, we mark this conjoining using a special codepoint (coeng
    in Khmer) or by modifying the codepoint value for the consonant. I can only
    think of one case where a script actually has a physical mark for this
    conjoining - the obsolete yamakkan of Thai. What we usually do is to use
    the codepoint for virama - this is paralleled by the modern use of
    phinthu in Pali written in the Thai script instead of yamakkan. Encoding
    thus does not lose information because virama followed by a letter or mark
    is a conjoining symbol, other viramas are visible viramas, in particular
    viramas followed by ZWNJ or whitespace. (Tamil can usually omit the ZWNJ
    for reasons given below.) The meaning of virama followed by ZWJ depends on
    the script - I will just discuss the Devanagari case.
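
    The rule in the paragraph above can be written down as a small
    classifier. This is only a sketch in Python under stated assumptions:
    the set of virama characters is limited to Devanagari virama and Tamil
    pulli, 'letter or mark' is approximated by Unicode general categories
    L* and M*, and the function name and labels are invented for
    illustration.

```python
import unicodedata

ZWJ, ZWNJ = "\u200D", "\u200C"
# Assumption: only these two virama characters are considered here.
VIRAMAS = {"\u094D", "\u0BCD"}  # Devanagari virama, Tamil pulli

def classify_viramas(text):
    """Return (index, kind) for each virama in the encoded text.

    kind is 'conjoining' when a letter or mark follows, 'script-dependent'
    when ZWJ follows, and 'visible' otherwise (ZWNJ, whitespace, end of
    text, or anything else).
    """
    result = []
    for i, ch in enumerate(text):
        if ch not in VIRAMAS:
            continue
        nxt = text[i + 1:i + 2]  # empty string at end of text
        if nxt == ZWJ:
            kind = "script-dependent"
        elif nxt and unicodedata.category(nxt)[0] in ("L", "M"):
            kind = "conjoining"
        else:
            kind = "visible"
        result.append((i, kind))
    return result

DA, DHA = "\u0926", "\u0927"
print(classify_viramas(DA + "\u094D" + DHA))         # virama before a letter
print(classify_viramas(DA + "\u094D" + ZWNJ + DHA))  # virama before ZWNJ
print(classify_viramas(DA + "\u094D" + " " + DHA))   # virama before a space
```

    Note that the classification uses only the stored code points; it says
    nothing about what a given font can actually display.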

    What follows may be regarded as a 'myth'. I believe it is in essentials
    true history, but I further believe it does not matter for practical
    purposes whether it is true or not.

    Now a stack can be an unwieldy item, especially for printing. It also
    wastes a lot of paper / palm leaf / etc because the line separation is
    determined by the longest stack. One way of reducing the problem is to
    condense the letters in the stack, as is done in various ways. Also,
    whereas in Pali most clusters are either geminates or legal word-initial
    clusters, this is not so in many modern languages.

    A way of either eliminating the latter problem (if it is felt as a problem),
    or of reducing stack sizes, is to split the stack. The part that is split
    off may become C(CC)+visible virama, or the symbols in the stack may be
    modified in some way, as for example the Devanagari half-forms, to show that
    one does not have an independent unit. In the former case, what might have
    been one orthographic syllable is now two. In the latter case, the stack
    now occupies two or more physical columns. In the encoding of Devanagari
    we mark the division into columns by adding ZWJ after the virama code.

    Here endeth the myth.

    In Devanagari, there is a general licence to split stacks that are too
    awkward. From what can be achieved, the order of preference is single
    column, multiple columns (use of half-forms for non-final columns), multiple
    orthographic syllables. In Devanagari, a column formally consists of a
    single consonant or a conjunct form. Under this licence, therefore, more
    orthographic syllables than were desired can appear, and a virama that in
    the encoding of the desired form was merely a consonant-conjoining code may
    surface as a visible virama.

    There are no half-forms in Tamil, and the consonant element of an
    orthographic syllable must be a consonant or conjunct form. As most pairs
    of Tamil consonants do not form conjunct forms, an encoding of C1 virama C2
    vowel will usually be split into two orthographic syllables, resulting
    visually in <C1 pulli C2+vowel>.
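
    The Tamil case above can be sketched as follows, in Python. The
    conjunct repertoire here is a hypothetical one-entry table (KA + SSA);
    a real font defines its own, and the function name is invented for
    illustration.

```python
PULLI = "\u0BCD"  # TAMIL SIGN VIRAMA

# Assumption: an illustrative repertoire of conjunct forms, not a real font's.
TAMIL_CONJUNCTS = {("\u0B95", "\u0BB7")}  # KA + SSA

def orthographic_syllables(c1, c2, vowel, conjuncts=TAMIL_CONJUNCTS):
    """Split encoded <c1, pulli, c2, vowel> into orthographic syllables.

    If (c1, c2) has a conjunct form, the whole sequence stays one syllable.
    Otherwise it splits into <c1, visible pulli> and <c2, vowel>.
    """
    if (c1, c2) in conjuncts:
        return [c1 + PULLI + c2 + vowel]
    return [c1 + PULLI, c2 + vowel]

TA, KA, SSA = "\u0BA4", "\u0B95", "\u0BB7"
E_SIGN = "\u0BC6"  # TAMIL VOWEL SIGN E, written to the left of its consonant

print(orthographic_syllables(TA, TA, E_SIGN))   # no conjunct form: two units
print(orthographic_syllables(KA, SSA, E_SIGN))  # conjunct form: one unit
```

    In the first case the left-side vowel sign belongs to the second unit
    only, which is why it appears after the pulli rather than before C1.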

    > You really need a character based criterion, which is font independent.

    Therefore you encode the form that is desired in an ideal world, and ignore
    the effects of the font. The visible viramas are the ones that are visible
    in the desired form - as simple as that!

    >> These are not the two Eric Muller spoke of. We are talking of three
    >> conventions where half-forms are not available.
    >
    > Again, this is in general font dependent.
    >
    >> In Devanagari visual order they are:
    >>
    >> 1) <i da virama dha>
    >> 2) <da virama i dha>
    >> 3) <i d.dha>
    >>
    >> Peter is referring to all three; Eric Muller to forms (2) and (3).
    >
    > There is a standard way of distinguishing (1) and (3), by the use of
    > ZW(N)J just after the virama character; the default (no ZW(N)J present) is
    > font dependent between (1) and (3).

    Thus the distinction is between 'don't produce (3)' and 'produce (3) if you
    can'.

    > There is no standard way of getting (2).

    Which is a shame, as it is the form recommended by the 'standardizing
    authorities':

    '... They even recommended that an i-sign in a syllable such as ddhi should
    fall _between_ the two components of the conjunct, giving <da virama i ddha>
    instead of the well-established <i ddha>! ' - Eric Muller.

    >> > These must *reliably* be distinguished in the underlying text.
    >> > It must NOT be font dependent (for properly constructed fonts).
    >>
    >> This would be unreasonable if you are referring to (2) v. (3). You
    >> would be requiring that for each *language* all Devanagari fonts have
    >> the same language-dependent repertoire of conjuncts.
    >
    > Eh, no. I don't think I have said anything requiring that. See above.

    If by 'underlying text' you mean stored encoding, the statement seems
    vacuous unless you mean it should dictate whether form (3) is used or not.
    If you mean something like printed text, you don't know whether use of a
    visible virama is intentional or a consequence of the font used. I suppose
    handwriting may be unambiguous.

    >> With Uniscribe and Mangal 1.20, that currently yields <i tta virama
    >> ttha>. In Windows Vista, this is to be overridable, I presume by
    >> feature selection.
    >
    > We really need a character based standard way of selecting between
    > these. Leaving it entirely implementation and font dependent will
    > result in apparent spelling changes between different platforms/fonts.
    > As these are, to the eye, spelling changes, there really needs to be a
    > character based difference, and a standardised one.

    <Snip>
    >> I'm happier with the current Uniscribe schemes:
    >>
    >> <TTA, I, VIRAMA, ZWNJ, TTHA> yields vowel on the left - टि्‍ठ.
    >> <TTA, VIRAMA, ZWNJ, TTHA, I> yields vowel in the middle - ट्‍ठि.
    >
    > I'm not happy to leave this to be entirely platform/font dependent.

    Uniscribe interprets the code sequences as I would expect them to be
    interpreted. I see no font dependency in these sequences.
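
    For concreteness, the two sequences quoted above can be built from
    their code points. This is a sketch in Python; the variable and
    function names are illustrative, and it shows only the stored
    encoding, not how any particular renderer displays it.

```python
import unicodedata

TTA, TTHA = "\u091F", "\u0920"  # DEVANAGARI LETTER TTA, TTHA
I_SIGN = "\u093F"               # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094D"               # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"                 # ZERO WIDTH NON-JOINER

# <TTA, I, VIRAMA, ZWNJ, TTHA>: described above as vowel on the left
vowel_left = TTA + I_SIGN + VIRAMA + ZWNJ + TTHA
# <TTA, VIRAMA, ZWNJ, TTHA, I>: described above as vowel in the middle
vowel_middle = TTA + VIRAMA + ZWNJ + TTHA + I_SIGN

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

for label, seq in (("left", vowel_left), ("middle", vowel_middle)):
    print(label, code_points(seq))
    print("  ", [unicodedata.name(ch) for ch in seq])
```

    The two strings contain the same characters in a different order, so
    the choice between them is recorded in the text itself.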

    Richard.


