Use of Invisible Characters and Inscrutable Sequencing

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Thu Feb 02 2006 - 20:29:25 CST

Next message: Rick McGowan: "Announcement: CLDR 1.4 Data Submission Period Now Starting"

Previous message: Rick McGowan: "UTS #37: Ideographic Variation Database is now available"

I am not sure where this post belongs. It is prompted by a concern that
belongs on the SEasia list (the Lanna script), and it might belong on the
more general Indic list (I don't understand why scripts of Further India
don't belong on the Indic list - isn't that the Brahmi family list?), but my
hope is to get an understanding of some general principles.

What are the principles determining whether distinctions that do not appear
on paper should be made in the Unicode encoding of text?

I can see a few principles, but I am not sure I completely grasp the
rationale:

1) Separation of scripts - if it is decreed that two scripts are separate,
then only occasionally should words written in one include the characters of
another - for example Latin, Cyrillic and Greek 'o' are encoded separately.
However:

a) Accents are shared between scripts - to some extent. Am I allowed to put
a combining circumflex on a Thai consonant? (Ulterior motive in this
question: handling mixed Old Tai Lue and New Tai Lue, an issue I suspect the
UTC would like to keep quiet.)

b) There is a tolerated mix of consonants from one Semitic script and vowel
marks from another - I forget which the two scripts are.

c) Punctuation is shared, though there is pressure to disunify inherited
punctuation such as danda and double danda.

2) Subscript Khmer DA (U+178A) and TA (U+178F) are indistinguishable -
perhaps with good reason, for TA originally represented both sounds. I
presume the argument here is phonetic sorting by dictionaries. (Does anyone
know what is happening in unschooled practice? To me the subscript looks
like TA and not at all like DA.)

3) Font dependent features - CGJ, ZWJ and ZWNJ are all used to control
features that may come and go with font and rich text controls.

4) CGJ is used to disrupt sequences that might otherwise be treated as a
unit in sorting. (This may not be an entirely honest summary of its
function.)

5) One of the arguments against countenancing the Tamil Unicode New Encoding
is that text that is a mixture of the two would arise, and that there would
then be no easy way to process it as more than printing instructions.
Similarly, Old Tai Lue KA plus subscript VA would be indistinguishable from
New Tai Lue KVA, so the two must be kept well apart. (The requirement that a
sequence of codepoints in a normalised form remain in that normalised form
however the standard changes raises its ugly head again.) This is the
principle of being able to see what you have, compromised by features 1 to 4
above and also the fact that in some Indian scripts (notably Devanagari) you
can't easily be sure whether you don't have a full conjunct because of a
font limitation or because conjunct formation has been inhibited in the
text!
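
To make points 1 and 4 concrete, here is a small Python sketch using only the
standard unicodedata module. The toy collation function and its single "ch"
contraction are hypothetical stand-ins of my own, not the Unicode Collation
Algorithm; they only illustrate how a CGJ placed between two letters can stop
them being matched as one sorting unit.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Point 1: the three look-alike letters are distinct code points,
# and compatibility normalisation does not fold them together.
look_alikes = ["\u006f", "\u043e", "\u03bf"]  # Latin o, Cyrillic o, Greek omicron
names = [unicodedata.name(ch) for ch in look_alikes]
folded = {unicodedata.normalize("NFKC", ch) for ch in look_alikes}

# Point 1a: at the code point level nothing prevents a combining circumflex
# from following a Thai consonant; whether a font renders it is another matter.
thai_with_circumflex = "\u0e01\u0302"  # KO KAI + COMBINING CIRCUMFLEX ACCENT
unchanged = unicodedata.normalize("NFC", thai_with_circumflex) == thai_with_circumflex

# Point 4: toy sort units with one hypothetical contraction, "ch".
CONTRACTIONS = {"ch"}

def toy_collation_elements(text):
    """Split text into sort units; a contraction only matches adjacent letters."""
    elements = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:
            elements.append(pair)
            i += 2
        elif text[i] == CGJ:
            i += 1  # ignored as a unit, but it has already broken the match
        else:
            elements.append(text[i])
            i += 1
    return elements

print(names)
print(len(folded))
print(unchanged)
print(toy_collation_elements("chata"))
print(toy_collation_elements("c" + CGJ + "hata"))
```

The two strings passed to toy_collation_elements look the same on paper,
but only the first is treated as containing the "ch" unit.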

My question may appeal to those who cherish the notion of Cleanicode, for it
relates to the question of how one could have decreed 'logical' order for
Thai. In one of the languages written in the Lanna script (I'm not sure
which one), there is what may be considered a written sequence AE+S+W (its
2-D nature is irrelevant). This may be pronounced two different ways -
S+AE+W and S+W+AE - yielding two different words which I am told sort
differently. This is seen as a valid argument for encoding them
differently, in the two different phonetic orders. Is this a valid
argument, or is it outweighed by the fact that someone transcribing the text
might not know which word is meant? (There are far more ambiguous character
combinations, but that I think is an issue for the SEasia list.)
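
The ambiguity can be sketched in a few lines of Python. Everything here is an
assumption made for illustration: Thai letters stand in for the Lanna AE, S
and W, the input is a single syllable, and visual_order is a hypothetical
rendering rule that simply moves the preposed vowel to the front.

```python
SARA_AE = "\u0e41"  # THAI CHARACTER SARA AE, a preposed vowel (stand-in for AE)
SO_SUA = "\u0e2a"   # THAI CHARACTER SO SUA (stand-in for S)
WO_WAEN = "\u0e27"  # THAI CHARACTER WO WAEN (stand-in for W)

def visual_order(phonetic):
    """Hypothetical single-syllable rule: the preposed vowel is written first."""
    vowels = [ch for ch in phonetic if ch == SARA_AE]
    others = [ch for ch in phonetic if ch != SARA_AE]
    return "".join(vowels + others)

word_s_ae_w = SO_SUA + SARA_AE + WO_WAEN  # phonetic order S+AE+W
word_s_w_ae = SO_SUA + WO_WAEN + SARA_AE  # phonetic order S+W+AE

# Both phonetic encodings collapse to the same written sequence AE+S+W ...
same_on_paper = visual_order(word_s_ae_w) == visual_order(word_s_w_ae)
# ... yet they are different strings, so a plain code point sort separates them.
distinct_in_text = word_s_ae_w != word_s_w_ae
sorted_words = sorted([word_s_ae_w, word_s_w_ae])

print(same_on_paper, distinct_in_text)
print([[f"U+{ord(ch):04X}" for ch in word] for word in sorted_words])
```

A transcriber who sees only the written form has to pick one of the two
strings, which is exactly the difficulty raised above. The code point sort is
not a linguistic sort; it merely shows that the two encodings are separable.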

However, there appears to be a second way of distinguishing the two words -
a special mark (mai sam) may be added to the syllable to show that the
vowel follows the second consonant. (It is not used consistently - my
textbook says S+Y+AA+M 'Thailand' has it, but it's missing on the trilingual
inscription marking the northernmost point in Thailand!) Would it therefore
be valid to decide that the correct way for the script to distinguish the
two in the absence of this mark was to add CGJ, say to have AE+S+CGJ+W to
indicate the pronunciation S+AE+W, i.e. that S+W is not a cluster? There
are a lot of practical issues to thrash out at SEasia - the script seems
fiendishly complicated if one wants to have phonetic order - one not only
has opaque Tibetan-style C1.V1.C1.V2 -> C1.V1.V2 type contractions but also
C1.V1.C2.V1 -> C1.C2.V1 contractions indistinguishable from C1.V1.C2!
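
For what it is worth, the CGJ idea can be checked mechanically. The sketch
below again uses Thai letters as stand-ins for the Lanna characters (an
assumption, as is the choice of CGJ as the marker) and relies only on
Python's standard unicodedata module. It shows that the marked and unmarked
sequences are different strings, that the CGJ is left in place by all four
normalisation forms, and that removing it gives back the unmarked sequence.

```python
import unicodedata

CGJ = "\u034f"      # COMBINING GRAPHEME JOINER
SARA_AE = "\u0e41"  # THAI CHARACTER SARA AE (stand-in for AE)
SO_SUA = "\u0e2a"   # THAI CHARACTER SO SUA (stand-in for S)
WO_WAEN = "\u0e27"  # THAI CHARACTER WO WAEN (stand-in for W)

clustered = SARA_AE + SO_SUA + WO_WAEN          # AE+S+W, read as S+W+AE
unclustered = SARA_AE + SO_SUA + CGJ + WO_WAEN  # AE+S+CGJ+W, read as S+AE+W

def survives_normalisation(text):
    """True if none of the four normalisation forms alters the text."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, text) == text for form in forms)

def strip_cgj(text):
    """Drop the invisible marker, leaving what appears on paper."""
    return text.replace(CGJ, "")

print(clustered != unclustered)
print(survives_normalisation(unclustered))
print(strip_cgj(unclustered) == clustered)
```

Whether a rendering engine or a collation tailoring would actually honour
the distinction is a separate question that this sketch does not address.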

I know S+W and S+Y are not strictly parallel given the phonology of the
languages, and I don't know whether mai sam is ever used with the S+W+AE
word. To bridge the gap, I can offer S+W+A+R 'heaven', which takes mai sam
but has a superscript vowel rather than a preposed vowel. Interestingly,
Thai grammar allows 'clusters' to contain inherent vowels.

I believe these are generally relevant questions of principle, rather than
the arcane details of a very complex script, and am therefore raising them
on the general list.

Richard.


This archive was generated by hypermail 2.1.5 : Thu Feb 02 2006 - 20:34:13 CST