From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon Sep 05 2005 - 07:58:55 CDT
Kent Karlsson wrote:
> Richard Wordingham wrote:
>> The primitive method of forming conjuncts is
>> just to stack the consonants vertically,
> That's not a conjunct, that's a stack ;-) We are obviously using
> different terminology here. When I wrote "conjunct" read "conjunct
> form" (and look that up in the TUS4 glossary).
Indeed. By conjunct I think I mean what TUS4 calls 'conjunct consonants',
but its definition seems to be wrong:
1) TUS4 says they consist of one or more dead consonants followed by a live
consonant. That implies Sanskrit _u:rk_ ऊर्क् U+090A, U+0930, U+094D,
U+0915, U+094D (stem _u:rj_ ऊर्ज् is listed in Monier-Williams at
http://www.ibiblio.org/sripedia/ebooks/mw/0200/mw__0254.html ) is not
written with conjunct consonants!
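As a rough check of point (1), the sketch below applies the TUS4 wording literally to the code points quoted above. The helper names and the restriction to the basic consonant range KA..HA are my own assumptions for illustration, not anything defined by TUS4.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def is_consonant(ch):
    # Assumption: only the basic Devanagari consonants KA..HA (U+0915..U+0939).
    return "\u0915" <= ch <= "\u0939"

def split_dead_live(text):
    """Return (dead, live): a consonant is 'dead' if a virama follows it."""
    dead, live = [], []
    for i, ch in enumerate(text):
        if not is_consonant(ch):
            continue
        if i + 1 < len(text) and text[i + 1] == VIRAMA:
            dead.append(ch)
        else:
            live.append(ch)
    return dead, live

# u:rk as quoted above: U+090A, U+0930, U+094D, U+0915, U+094D
urk = "\u090A\u0930\u094D\u0915\u094D"
dead, live = split_dead_live(urk)
# Two dead consonants (RA, KA) and no live consonant, so the literal
# definition "dead consonants followed by a live consonant" is not met.
```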
2) In Section 9.6 'Tamil', 'Ligatures', it says 'Vowel re-ordering occurs
around conjunct consonants.' This is only true if (visible) <consonant
pulli consonant> is not counted as a series of conjunct consonants.
On the definition of orthographic syllable:
>> To me, a more natural
>> formulation is:
>> <consonant, {combining marks, at least one of which is a conjoiner}>*
>> <consonant, {maybe combining marks or visible virama, no conjoiner}>
> Whether a virama is visible or not (absorbed into a half form or a
> conjunct) is in general font dependent, the above is not a good
> criterion for orthographic syllables.
Our difference here largely results from fundamentally different
conceptions. I see the basic elements of an Indic script being CW units,
where W is an explicit vowel, a visible virama, or the implicit vowel. The
visible virama seems to be a late addition to the system. The vowel (W)
side can be extended by anusvara, additional vowels etc, but that is not the
cause of our differences. The consonant side can be extended to a consonant
cluster. These CW units, possibly extended in these ways, are what I
understand by the term 'orthographic syllable'.
Now in my conception this cluster does *not* contain C+virama elements.
This is an important difference. The primitive way of writing this cluster
is as a stack - the virama is a late addition to the system.
Now, when we encode text as a sequence of codepoints, we choose not to
encode the implicit vowel with a codepoint. This leaves us with the problem
of distinguishing consonants in a cluster from Ca syllables. The solution
is to insert a conjoining code between the codepoints for the consonants of
the cluster.
Now in two scripts where the virama is a marginal element of the script,
Khmer and Tibetan, we mark this conjoining either by a special codepoint
(coeng in Khmer) or by using a different codepoint for the consonant (the
subjoined letters of Tibetan). I can only think of one case where a script
actually has a physical mark for this conjoining - the obsolete yamakkan of
Thai. What we usually do is to use the codepoint for virama - this is
parallelled by the modern use of phinthu instead of yamakkan in Pali written
in the Thai script. Encoding thus does not lose information: a virama
followed by a letter or mark is a conjoining symbol, and all other viramas
are visible viramas, in particular viramas followed by ZWNJ or whitespace.
(Tamil can usually omit the ZWNJ
for reasons given below.) The meaning of virama followed by ZWJ depends on
the script - I will just discuss the Devanagari case.
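The rule just stated can be sketched as a small classifier. This is only an illustration for Devanagari: the function name and the three labels are hypothetical, and "letter or mark" is approximated by the Unicode general categories L* and M*.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

def classify_viramas(text):
    """Return (index, kind) for each virama, judged only by what follows it."""
    result = []
    for i, ch in enumerate(text):
        if ch != VIRAMA:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt == ZWJ:
            kind = "script-dependent"
        elif nxt and unicodedata.category(nxt)[0] in "LM":
            # Followed by a letter or mark: a conjoining code.
            kind = "conjoining"
        else:
            # Followed by ZWNJ, whitespace, anything else, or end of text.
            kind = "visible"
        result.append((i, kind))
    return result
```

For example, `classify_viramas("\u0926\u094D\u0927")` (da, virama, dha) reports one conjoining virama, while inserting ZWNJ after the virama turns it into a visible one.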
What follows may be regarded as a 'myth'. I believe it is in essentials
true history, but I further believe it does not matter for practical matters
whether it is true or not.
Now a stack can be an unwieldy item, especially for printing. It also
wastes a lot of paper / palm leaf / etc because the line separation is
determined by the longest stack. One way of reducing the problem is to
condense the letters in the stack, as is done in various ways. Also,
whereas in Pali most clusters are either geminates or legal word-initial
clusters, this is not so in many modern languages.
A way of either eliminating the latter problem (if it is felt as a problem),
or of reducing stack sizes, is to split the stack. The part that is split
off may become C(CC)+visible virama, or the symbols in the stack may be
modified in some way, as for example the Devanagari half-forms, to show that
one does not have an independent unit. In the former case, what might have
been one orthographic syllable is now two. In the latter case, the stack
now occupies two or more physical columns. In the encoding of Devanagari
we mark the division into columns by adding ZWJ after the virama code.
Here endeth the myth.
In Devanagari, there is a general licence to split stacks that are too
awkward. Among the renderings that can be achieved, the order of preference
is single column, multiple columns (use of half-forms for non-final columns),
multiple orthographic syllables. In Devanagari, a column formally consists of a
single consonant or a conjunct form. Under this licence, therefore, more
orthographic syllables than were desired can appear, and a virama that in
the encoding of the desired form was merely a consonant-conjoining code may
surface as a visible virama.
There are no half-forms in Tamil, and the consonant element of an
orthographic syllable must be a consonant or conjunct form. As most pairs
of Tamil consonants do not form conjunct forms, an encoding of C1 virama C2
vowel will usually be split into two orthographic syllables, resulting
visually in <C1 pulli C2+vowel>.
> You really need a character based criterion, which is font independent.
Therefore you encode the form that is desired in an ideal world, and ignore
the effects of the font. The visible viramas are the ones that are visible
in the desired form - as simple as that!
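To make "character based" concrete, here is a minimal sketch of splitting a single encoded Devanagari word into orthographic syllables of the desired form, looking only at the code points. The function name is hypothetical, "letter" is approximated by general category L*, and treating virama + ZWJ as keeping the cluster together is my reading of the Devanagari case, not a standard algorithm.

```python
import unicodedata

VIRAMA, ZWNJ, ZWJ = "\u094D", "\u200C", "\u200D"

def orthographic_syllables(word):
    """Split one word at each letter that is not conjoined to what precedes it."""
    syllables = []
    current = ""
    for ch in word:
        is_letter = unicodedata.category(ch).startswith("L")
        # A virama directly before a letter (optionally with ZWJ) conjoins;
        # virama + ZWNJ does not, so the virama is a visible one.
        joined = current.endswith(VIRAMA) or current.endswith(VIRAMA + ZWJ)
        if is_letter and current and not joined:
            syllables.append(current)
            current = ""
        current += ch
    if current:
        syllables.append(current)
    return syllables
```

With this sketch, <da virama dha i> stays one unit, whereas <da virama ZWNJ dha i> splits into <da virama ZWNJ> and <dha i>, whatever the font later does with either.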
>> These are not the two Eric Muller spoke of. We are talking of three
>> conventions where half-forms are not available. In
>
> Again, this is in general font dependent.
>> Devanagari visual order
>> they are:
>>
>> 1) <i da virama dha>
>> 2) <da virama i dha>
>> 3) <i d.dha>
>>
>> Peter is referring to all three; Eric Muller to forms (2) and (3).
>
> There is a standard way of distinguishing (1) and (3), by the use of
> ZW(N)J just after the virama character; the default (no ZW(N)J present) is
> font dependent between (1) and (3).
Thus the distinction is between 'don't produce (3)' and 'produce (3) if you
can'.
> There is no standard way of getting (2).
Which is a shame, as it is the form recommended by the 'standardizing
authorities':
'... They even recommended that an i-sign in a syllable such as ddhi should
fall _between_ the two components of the conjunct, giving <da virama i ddha>
instead of the well-established <i ddha>! ' - Eric Muller.
>> > These must be *reliably* be distinguished in the underlying text.
>> > It must NOT be font dependent (for properly constructed fonts).
>>
>> This would be unreasonable if you are referring to (2) v. (3). You would be
>> requiring that for each *language* all Devanagari fonts have the same
>> language-dependent repertoire of conjuncts.
>
> Eh, no. I don't think I have said anything requiring that. See above.
If by 'underlying text' you mean stored encoding, the statement seems
vacuous unless you mean it should dictate whether form (3) is used or not.
If you mean something like printed text, you don't know whether use of a
visible virama is intentional or a consequence of the font used. I suppose
handwriting may be unambiguous.
>> With Uniscribe and Mangal 1.20, that currently yields <i tta virama ttha>.
>> In Windows Vista, this is to be overridable, I presume by feature
>> selection.
>
> We really need a character based standard way of selecting between
> these. Leaving it entirely implementation and font dependent will
> result in apparent spell changes between different platforms/fonts.
> As these are, to the eye, spell changes, there really need to be a
> character based difference, and a standardised one.
<Snip>
>> I'm happier with the current Uniscribe schemes:
>>
>> <TTA, I, VIRAMA, ZWNJ, TTHA> yields vowel on the left - टि्ठ.
>> <TTA, VIRAMA, ZWNJ, TTHA, I> yields vowel in the middle - ट्ठि.
>
> I'm not happy to leave this to be entirely platform/font dependent.
Uniscribe interprets the code sequences as I would expect them to be
interpreted. I see no font dependency in these sequences.
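For reference, the two sequences can be written out as code points; the stored text already differs before any font is involved. The variable and function names below are just labels chosen for this illustration.

```python
import unicodedata

TTA, TTHA = "\u091F", "\u0920"      # DEVANAGARI LETTER TTA, TTHA
VOWEL_SIGN_I = "\u093F"             # DEVANAGARI VOWEL SIGN I
VIRAMA, ZWNJ = "\u094D", "\u200C"   # virama, ZERO WIDTH NON-JOINER

# <TTA, I, VIRAMA, ZWNJ, TTHA>: vowel on the left
vowel_left = TTA + VOWEL_SIGN_I + VIRAMA + ZWNJ + TTHA
# <TTA, VIRAMA, ZWNJ, TTHA, I>: vowel in the middle
vowel_middle = TTA + VIRAMA + ZWNJ + TTHA + VOWEL_SIGN_I

def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]

for label, seq in (("left", vowel_left), ("middle", vowel_middle)):
    print(label, code_points(seq), [unicodedata.name(ch) for ch in seq])
```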
Richard.