Re: ZWJ, ZWNJ and VS in Latin and other Greek-derived scripts

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Thu Feb 01 2007 - 22:54:57 CST


    ----- Original Message -----
    From: "Ruszlan Gaszanov" <ruszlan@ather.net>
    To: <unicode@unicode.org>
    Sent: Sunday, January 28, 2007 6:05 PM
    Subject: RE: ZWJ, ZWNJ and VS in Latin and other Greek-derived scripts

    > "Ligation required" and "ligation prohibited" are orthographical concepts
    > and must be encoded in plain text.
    >
    > "Optional ligation", on the other hand, is a stylistic concept. Basically,
    > it means that the specific orthography, as a rule, allows ligation of some
    > character combinations in certain writing/typesetting styles. Therefore,
    > "optional ligation" should be handled by higher level protocols based on
    > language tagging and rich text styles applied to the text - not encoded at
    > plain text level.
    >
    > Considering the above, ZWNJ should only be encoded in plain text for
    > prohibiting ligation in exceptional cases for orthographies that allow
    > "optional ligation". ZWJ, on the other hand, should be used for encoding
    > orthographically significant ligatures, not stylistic ligation.
    >
    > Modern English typesetting practice is somewhat of an anomaly in this
    > respect, since ligation is allowed only in certain words. However, those
    > are in fact unaltered French or Latin words written in their native
    > orthography and should be properly tagged as "French" or "Latin", rather
    > than "English". So, if you apply a style that uses optional ligation for
    > those languages, ligation would occur in such words, but not in the rest
    > of text tagged as proper "Modern English" (since all modern English
    > orthography variations do not allow ligation as a rule).

    Let me see if I understand you on this. In HTML, in the middle of English
    text I should write 'haemocoels' as '<span lang="la-GB">haemocoels</span>'
    because ligation is optional and despite the fact that neither plural
    'haemocoels' nor singular 'haemocoel' is a Latin word-form - and the word
    was coined in modern times from Greek roots, but when giving the Latin
    etymon of English 'aerial' I should cite
    '<span lang="la-GB">a&#8204;erius</span>' to inhibit ligation because 'ae'
    is not a diphthong in this word. Is this your view?
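
    The two markup cases above can be sketched in Python. This is only an
    illustration of the mechanics, not a recommendation: the helper names
    inhibit_ligature and lang_span are hypothetical, and the choice to write
    ZWNJ (U+200C, decimal 8204) as a numeric character reference is an
    assumption made so that the character stays visible in the HTML source.

```python
from html import escape

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, decimal code point 8204


def inhibit_ligature(word, index):
    """Return word with ZWNJ inserted before word[index] (hypothetical helper)."""
    return word[:index] + ZWNJ + word[index:]


def lang_span(text, lang):
    """Wrap text in an HTML span carrying a language tag (hypothetical helper).

    Markup characters are escaped first; ZWNJ is then written as the
    numeric character reference &#8204; so it is visible in the source.
    """
    body = escape(text).replace(ZWNJ, "&#8204;")
    return '<span lang="{}">{}</span>'.format(escape(lang, quote=True), body)


if __name__ == "__main__":
    # Optional ligation left to the style applied to the tagged language.
    print(lang_span("haemocoels", "la-GB"))
    # Ligation inhibited between 'a' and 'e' in the Latin etymon.
    print(lang_span(inhibit_ligature("aerius", 1), "la-GB"))
```

    Whether a renderer actually ligates or not still depends on the font and
    on the style applied to the tagged language; the sketch only produces the
    markup under discussion.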

    Aside: Dare I ask which variants of Latin used in England eschew ligation?
    I don't recall ligation in the Latin textbooks we used at school.

    I think the proper abstract linguistically-based mark-up would be to mark
    words like 'haemocoel' as Latinate - this would cover old styles in which
    native and Latinate words were printed in different fonts, or, going further
    back, handwritten in different styles. I'm not sure how one would do this
    in a general rather than ad-hoc fashion. (One could use 'class' and a
    stylesheet in HTML to select an appropriate font, but the names of the
    classes would be idiosyncratic.)

    Perhaps I am wrong to try and separate 'spelling' and writing style.
    There's a Northern Thai school of spelling that chooses the symbol for /a:/
    in part on an etymological basis (Pali v. native), but the current plan is
    for the vowel form to be specified in the encoding, as different schools use
    different rules for choosing the vowel form - and even then writers are not
    self-consistent. Even Pali (or at least, words regularly derived from Pali)
    has some interesting stylistic variation in writing which will be reflected
    in the encoding but would not normally be represented in a Pali
    transcription - treating the variation as rendering rules would be quite
    complex. There are four different ways of writing the last two syllables of
    Pali _desana:_ and _sa:sana:_! Three of the ways may be seen as using the
    same abbreviation technique - merging the last two aksharas. The simplest
    to express in transliteration is _sa:ssna:_ for _sa:sana:_, and the others
    may be thought of as *_sa:ss'na:_ (_dess'na:_ is attested) and _sa:s'na:_.
    What I've written with an apostrophe is actually a repetition mark - one
    could view it as duplicating the implicit vowel so that only one is killed
    by the explicit vowel (a:). However, it's not so easy to argue that these
    are just different rendering conventions, say of "s'n" as the reference
    contraction.

    Another case would be the homorganic nasals in Indic scripts, at least in
    the rules for writing Sanskrit - anusvara or full consonant? Some might say
    it's a case of sa-IN v. sa-GB/sa-DE. One could certainly argue that using a
    font change to switch from one form to the other was not compliant with
    Unicode.

    Richard.


