From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Tue Sep 06 2005 - 16:00:23 CDT
Kent Karlsson wrote:
>> Our difference here largely results from fundamentally different
> ...
>> is to insert a conjoining code between the codepoints for the
>> consonant of this cluster.
>
> Whatever one thinks of the 'virama model', that is the model
> standardised. I don't see any way of changing that model now.
> It is much much too late for that.
But I am not proposing a change to the encoding! The only implementational
issues are:
1) Should vowels be re-ordered across a virama + ZWNJ boundary?
2) Does the automatic insertion of a visible virama because the font can't
cope with the cluster create such a boundary?
The answer to (1) is not a simple yes; I think it should be a simple 'no' -
and I have a work-around below for those who would object to the answer
'yes' to the second question.
The encoding of the 'virama model' generally works:
virama + ZWNJ       => end of syllable, visible virama
vowel               => end of syllable
virama + non-letter => end of syllable
virama + ZWJ        => script-specific effect. For Devanagari, it marks
                       the end of a column.
other virama        => conjoin.
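As a rough illustration only, the virama rows of the list above can be written as a small classifier. This is a minimal sketch in Python, not a description of any shaping engine: the function name after_virama is made up for the example, 'letter' is assumed to mean Unicode general category L*, and the 'vowel => end of syllable' row is left out because it does not involve a virama.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def after_virama(text, i):
    """Classify the virama at text[i] by the character that follows it.

    Hypothetical helper: the labels are the ones used in the list above.
    """
    if text[i] != VIRAMA:
        raise ValueError("text[i] is not a Devanagari virama")
    nxt = text[i + 1] if i + 1 < len(text) else ""
    # ZWNJ and ZWJ are format characters (Cf), so test them before
    # the general 'non-letter' case.
    if nxt == ZWNJ:
        return "end of syllable, visible virama"
    if nxt == ZWJ:
        return "script-specific effect"
    # Assumption: 'letter' means general category L* (Lo for Devanagari).
    if nxt == "" or not unicodedata.category(nxt).startswith("L"):
        return "end of syllable"
    return "conjoin"


TTA, TTHA = "\u091F", "\u0920"
print(after_virama(TTA + VIRAMA + ZWNJ + TTHA, 1))
print(after_virama(TTA + VIRAMA + TTHA, 1))
```

The order of the tests matters: a joiner is itself a non-letter, so checking the 'non-letter' row first would hide the ZWNJ and ZWJ rows.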
I don't think my conception is inconsistent with the 'virama model', but I
do think it is a better way of looking at what is going on.
While it would have been nice to have had a specific code for a visible
virama, it is indeed too late.
> Thai and Lao are also encoded differently, such that there is no
> reordering problem for display, but there is one for collation instead.
> The latter is even ambiguous. This has been solved by doing a simplified,
> rather than semantically correct, reordering to logical order (now via
> collation clusters).
The default collation order for Thai in the Unicode Collation Algorithm
agrees with the order in Thai dictionaries. Their sort order is explained
in Campbell & Shaweevongs, 'The Fundamentals of the Thai Language (Fifth
Edition)', Appendix 8, 'How to Use a Thai Dictionary'. I had a long debate
with a Thai lady well versed in formal Thai grammar on the subject of
ordering, and whenever her explanations of word sorting contradicted
Campbell & Shaweevongs, I could find counterexamples. She did give one
rule, though, that is not in C&S or the UCA: when spelt the same, phonetic
CCV precedes phonetic CVC. It works with แหน in the 'New Standard
Thai-English Dictionary' - but the words are the other way round in the
Dictionary of the Royal (Thai) Institute (Ratchabandit). A case of TiT I
suppose.
I do not believe there is a 'logical order' for Thai, at least not unless
you add (in some fashion or other) placeholders for preposed vowels. As an
indication, consider มนโฑ 'Montho', i.e. 'Mandora', and แมโคร 'macro'.
*แมคโร would be pronounced quite differently to the word for 'macro'.
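To make the two examples concrete, here is a minimal sketch of a 'simplified' reordering, assuming that it means swapping each preposed vowel (U+0E40..U+0E44) with the single character after it when that character is in the consonant range. The names is_thai_consonant and simplified_reorder are invented for the illustration, and the range U+0E01..U+0E2E also takes in RU and LU.

```python
# Thai preposed vowels: SARA E, SARA AE, SARA O, SARA AI MAIMUAN, SARA AI MAIMALAI
PREPOSED = set("\u0E40\u0E41\u0E42\u0E43\u0E44")


def is_thai_consonant(ch):
    # Assumption: treat U+0E01..U+0E2E (KO KAI .. HO NOKHUK) as consonants.
    return "\u0E01" <= ch <= "\u0E2E"


def simplified_reorder(word):
    """Swap each preposed vowel with the one consonant that follows it."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREPOSED and is_thai_consonant(chars[i + 1]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return "".join(chars)


for w in ("\u0E21\u0E19\u0E42\u0E11",          # Montho
          "\u0E41\u0E21\u0E42\u0E04\u0E23"):   # macro
    print(" ".join(f"U+{ord(c):04X}" for c in simplified_reorder(w)))
```

For 'Montho' the SARA O ends up after THO NANGMONTHO alone, which suits a vowel belonging to a single consonant. For 'macro' it ends up between KHO KHWAI and RO RUA, inside what is pronounced as the cluster 'khr'; a purely mechanical swap has no way to tell the two cases apart from the spelling.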
>> > You really need a character based criterion, which is font
>> > independent.
>> Therefore you encode the form that is desired in an ideal world, and
>> ignore the effects of the font. The visible viramas are the ones that
>> are visible in the desired form - as simple as that!
> Hmm. Would this "desired ideal" be language independent
> (though still script dependent)?
If virama + ZWNJ is as much of a break as I think it is, then the desired
form defined by a 'well-formed' sequence of codepoints is as well defined as
for the Latin script. (I don't have a definition of 'well-formed'.)
On forcing an author's (or typographer's?) preferred form for <TTA, TTHA, I>
when the font leaves no alternative but use of a virama:
>> > I'm not happy to leave this to be entirely platform/font dependent.
>>
>> Uniscribe interprets the code sequences as I would expect them to be
>> interpreted. I see no font dependency in these sequences.
>
> That is what one "platform", in one particular version, does. (Not any of
> the versions I've got...) Not sure it is THE one behaviour to be
> standardised.
Can the experts please tell us whether the following sequences have a
definite meaning in the Devanagari script, and if so, what is the meaning?
<TTA, DEPENDENT I, VIRAMA, ZWNJ, TTHA>
<TTA, VIRAMA, ZWNJ, TTHA, DEPENDENT I>
(I will try looking for information in the Indic list's February posts -
thanks for the pointer, Antoine.)
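For anyone who wants to test the two sequences in a renderer, they can be spelt out as code points as follows. This is only a sketch, assuming that DEPENDENT I means U+093F DEVANAGARI VOWEL SIGN I; the helper name describe is made up for the example.

```python
import unicodedata

TTA = "\u091F"     # DEVANAGARI LETTER TTA
TTHA = "\u0920"    # DEVANAGARI LETTER TTHA
SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I (assumed meaning of DEPENDENT I)
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

seq_a = TTA + SIGN_I + VIRAMA + ZWNJ + TTHA
seq_b = TTA + VIRAMA + ZWNJ + TTHA + SIGN_I


def describe(s):
    """List each code point of s with its Unicode character name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s]


for label, seq in (("first", seq_a), ("second", seq_b)):
    print(label)
    for line in describe(seq):
        print("  " + line)
```

The two strings contain the same five characters and differ only in where the vowel sign sits relative to the virama + ZWNJ pair, which is exactly the point in question.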
> And Peter mentioned font dependence (for a future version), that I think
> is inappropriate for this.
I can now see how the rendering of <TTA, VIRAMA, TTHA, DEPENDENT I> in the
absence of a conjunct form TTA.TTHA and the absence of a half form for TTA
can be made font-specific. Simply complete the set of half forms by
defining the missing half forms to be isolated form plus virama! I see
nothing in the Unicode Standard that prohibits this.
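The idea can be modelled in a few lines. This is a toy sketch, not how any real font or shaping engine is implemented: the glyph names ("tta", "tta.half", "virama", "i_sign") and both function names are hypothetical, conjunct ligatures are ignored altogether, and the placement of the vowel sign before the whole run is an assumption about how a cluster built from half forms would be treated.

```python
def half_form_glyphs(consonant, font_half_forms):
    """Return the glyph run for a consonant in half-form position.

    font_half_forms maps consonant glyph names to the half forms the
    font really has. A missing entry is 'completed' as isolated form
    plus virama. All glyph names here are hypothetical.
    """
    if consonant in font_half_forms:
        return [font_half_forms[consonant]]
    return [consonant, "virama"]


def shape_cluster(consonants, vowel_sign, font_half_forms):
    """Toy shaping of <C1, VIRAMA, ..., Cn, vowel sign>, no conjuncts."""
    glyphs = []
    for c in consonants[:-1]:
        glyphs += half_form_glyphs(c, font_half_forms)
    glyphs.append(consonants[-1])
    # Assumption: a prebase vowel sign precedes the whole cluster.
    if vowel_sign:
        return [vowel_sign] + glyphs
    return glyphs


# Font with no half form for TTA versus a font that supplies one.
print(shape_cluster(["tta", "ttha"], "i_sign", {}))
print(shape_cluster(["tta", "ttha"], "i_sign", {"tta": "tta.half"}))
```

In this model the choice between a visible virama and a true half form is made entirely by the font's table, and the code that builds the cluster never needs to know which one it got.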
Richard.