Use of Invisible Characters and Inscrutable Sequencing

From: Richard Wordingham
Date: Thu Feb 02 2006 - 20:29:25 CST

    I am not sure where this post belongs. It's prompted by a concern that
    belongs to the SEasia list (the Lanna script); it might belong on the more
    general Indic list (I don't understand why scripts of Further India don't
    belong on the Indic list - isn't that the Brahmi family list?), but my hope
    is to get an understanding of some general principles.

    What are the principles determining whether distinctions that do not appear
    on paper should be made in the Unicode encoding of text?

    I can see a few principles, but I am not sure I completely grasp the

    1) Separation of scripts - if it is decreed that two scripts are separate,
    then only occasionally should words written in one include the characters of
    another - for example Latin, Cyrillic and Greek 'o' are encoded separately.

    a) Accents are shared between scripts - to some extent. Am I allowed to put
    a combining circumflex on a Thai consonant? (Ulterior motive in this
    question: handling mixed Old Tai Lue and New Tai Lue, an issue I suspect the
    UTC would like to keep quiet.)

    b) There is a tolerated mix of consonants from one Semitic script and vowel
    marks from another - I forget which the two scripts are.

    c) Punctuation is shared, though there is pressure to disunify inherited
    punctuation such as danda and double danda.

    2) Subscript Khmer DA (U+178A) and TA (U+178F) are indistinguishable -
    perhaps with good reason, for TA originally represented both sounds. I
    presume the argument here is phonetic sorting by dictionaries. (Does anyone
    know what is happening in unschooled practice? To me the subscript looks
    like TA and not at all like DA.)

    3) Font dependent features - CGJ, ZWJ and ZWNJ are all used to control
    features that may come and go with font and rich text controls.

    4) CGJ is used to disrupt sequences that might otherwise be treated as a
    unit in sorting. (This may not be an entirely honest summary of its

    5) One of the arguments against countenancing the Tamil Unicode New Encoding
    is that text that is a mixture of the two would arise, and that there would
    then be no easy way to process it as more than printing instructions.
    Similarly, Old Tai Lue KA plus subscript VA would be indistinguishable from
    New Tai Lue KVA, so the two must be kept well apart. (The requirement that a
    sequence of codepoints in a normalised form remain in that normalised form
    however the standard changes raises its ugly head again.) This is the
    principle of being able to see what you have, compromised by features 1 to 4
    above and also the fact that in some Indian scripts (notably Devanagari) you
    can't easily be sure whether you don't have a full conjunct because of a
    font limitation or because conjunct formation has been inhibited in the
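
    Principles 1 and 4 above can be illustrated with a minimal sketch. It
    uses only Python's standard unicodedata module. The contraction table
    and the function names are hypothetical, chosen for illustration, and
    the toy collator is not the Unicode Collation Algorithm; it only shows
    how an invisible CGJ can stop two letters being treated as one unit.

```python
import unicodedata

# Principle 1: look-alike letters from three separately encoded scripts.
LOOKALIKE_O = ["\u006F", "\u043E", "\u03BF"]


def names(chars):
    """Return the Unicode character name of each code point."""
    return [unicodedata.name(c) for c in chars]


# Principle 4: a toy collator. CONTRACTIONS is a hypothetical tailoring in
# which "ch" sorts as a single unit, as in some Latin-script alphabets.
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER
CONTRACTIONS = {"ch"}


def collation_units(text):
    """Split text into sorting units; CGJ breaks a contraction, then is ignored."""
    units = []
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            i += 1  # invisible and ignored, but it has kept its neighbours apart
        elif text[i:i + 2] in CONTRACTIONS:
            units.append(text[i:i + 2])
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


print(names(LOOKALIKE_O))
print(collation_units("chata"))
print(collation_units("c" + CGJ + "hata"))
```

    The two input strings in the last two calls would normally print
    identically, yet the collator sees different units, which is exactly
    the kind of distinction that does not appear on paper.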

    My question may appeal to those who cherish the notion of Cleanicode, for it
    relates to the question of how one could have decreed 'logical' order for
    Thai. In one of the languages written in the Lanna script (I'm not sure
    which one), there is what may be considered a written sequence AE+S+W (its
    2-D nature is irrelevant). This may be pronounced two different ways -
    S+AE+W and S+W+AE - yielding two different words which I am told sort
    differently. This is seen as a valid argument for encoding them
    differently, in the two different phonetic orders. Is this a valid
    argument, or is it outweighed by the fact that someone transcribing the text
    might not know which word is actually meant? (There are far more
    ambiguous character combinations, but that I think is an issue for the
    SEasia list.)
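
    The ambiguity can be modelled in a few lines. This is a toy sketch
    under stated assumptions: the letters are symbolic tokens rather than
    real code points, only a single syllable is handled, and the names
    PREPOSED_VOWELS and written_order are invented for the illustration.

```python
# Symbolic tokens stand in for the letters; no real code points are assumed.
PREPOSED_VOWELS = {"AE"}


def written_order(phonetic):
    """Return the order on paper: preposed vowels first, then the rest."""
    vowels = [u for u in phonetic if u in PREPOSED_VOWELS]
    others = [u for u in phonetic if u not in PREPOSED_VOWELS]
    return vowels + others


word_a = ["S", "AE", "W"]  # pronounced S+AE+W
word_b = ["S", "W", "AE"]  # pronounced S+W+AE

# Both phonetic encodings collapse to the same written sequence.
print(written_order(word_a))
print(written_order(word_b))

# Encoded in phonetic order, the two words are distinct and sort apart.
print(sorted([word_b, word_a]))
```

    Going from phonetic order to paper is a function; going back is not,
    which is the transcriber's difficulty.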

    However, there appears to be a second way of distinguishing the two words -
    a special mark (mai sam) may be added to the syllable to show that the
    vowel follows the second consonant. (It is not used consistently - my
    textbook says S+Y+AA+M 'Thailand' has it, but it's missing on the trilingual
    inscription marking the northernmost point in Thailand!) Would it therefore
    be valid to decide that the correct way for the script to distinguish the
    two in the absence of this mark was to add CGJ, say to have AE+S+CGJ+W to
    indicate the pronunciation S+AE+W, i.e. that S+W is not a cluster? There
    are a lot of practical issues to thrash out at SEasia - the script seems
    fiendishly complicated if one wants to have phonetic order - one not only
    has opaque Tibetan-style C1.V1.C1.V2 -> C1.V1.V2 type contractions but also
    C1.V1.C2.V1 -> C1.C2.V1 contractions indistinguishable from C1.V1.C2!
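
    The CGJ idea can be sketched as a reading rule over the written
    sequence. Again this is only a toy model with symbolic tokens: the
    function name phonetic_order is hypothetical, a syllable is assumed to
    be one preposed vowel followed by two consonants, and none of the
    contractions mentioned above are handled.

```python
def phonetic_order(written):
    """Recover phonetic order from [vowel, C1, optional "CGJ", C2].

    Without CGJ the consonants are read as a cluster and the vowel follows
    both. With CGJ the cluster is broken and the vowel follows C1.
    """
    vowel, rest = written[0], written[1:]
    if "CGJ" in rest:
        i = rest.index("CGJ")
        return rest[:i] + [vowel] + rest[i + 1:]
    return rest + [vowel]


print(phonetic_order(["AE", "S", "W"]))         # cluster reading
print(phonetic_order(["AE", "S", "CGJ", "W"]))  # non-cluster reading
```

    Under this rule the text stays in written order, and the invisible
    character carries the only record of which word was meant.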

    I know S+W and S+Y are not strictly parallel given the phonology of the
    languages, and I don't know whether mai sam is ever used with the S+W+AE
    word, but I can offer S+W+A+R 'heaven' with mai sam, but a superscript
    vowel rather than a preposed vowel, to bridge the gap. Interestingly, Thai
    grammar allows 'clusters' to contain inherent vowels.

    I believe these are generally relevant questions of principle, rather than
    the arcane details of a very complex script, and am therefore raising them
    on the general list.

