RE: ZWJ, ZWNJ and VS in Latin and other Greek-derived scripts

From: Ruszlan Gaszanov (ruszlan@ather.net)
Date: Sun Jan 28 2007 - 12:05:10 CST

    "Ligation required" and "ligation prohibited" are orthographical concepts and must be encoded in plain text.

    "Optional ligation", on the other hand, is a stylistic concept. Basically, it means that a given orthography, as a rule, allows ligation of some character combinations in certain writing or typesetting styles. Therefore, "optional ligation" should be handled by higher-level protocols, based on language tagging and the rich-text styles applied to the text, and not encoded at the plain-text level.

    Considering the above, ZWNJ should only be encoded in plain text for prohibiting ligation in exceptional cases for orthographies that allow "optional ligation". ZWJ, on the other hand, should be used for encoding orthographically significant ligatures, not stylistic ligation.
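
    As a rough illustration of what "encoded in plain text" means here, the two joiners are ordinary code points (U+200C for ZWNJ, U+200D for ZWJ) that sit between the letters. The Python sketch below uses hypothetical helper names, and the German word "Auflage" (where an "fl" ligature across the morpheme boundary is commonly avoided) is an example chosen for illustration only.

```python
# Minimal sketch: the helper names are illustrative, not a standard API.
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER: ligation prohibited at this point
ZWJ = "\u200D"   # ZERO WIDTH JOINER: ligation requested at this point


def prohibit_ligature(word: str, index: int) -> str:
    """Insert ZWNJ before word[index], so the two neighbours must not ligate."""
    return word[:index] + ZWNJ + word[index:]


def require_ligature(word: str, index: int) -> str:
    """Insert ZWJ before word[index], marking an orthographically significant ligature."""
    return word[:index] + ZWJ + word[index:]


def strip_joiners(text: str) -> str:
    """Remove both joiners, e.g. before comparing or searching text."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")


# "Auf" + "lage": mark the boundary between "f" and "l" as non-ligating.
marked = prohibit_ligature("Auflage", 3)
```

    The marked string is one code point longer than the original, and strip_joiners(marked) gives back the unmarked word, which is what a search or comparison would normally want.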

    Modern English typesetting practice is somewhat of an anomaly in this respect, since ligation is allowed only in certain words. However, those are in fact unaltered French or Latin words written in their native orthography and should be properly tagged as "French" or "Latin", rather than "English". So, if you apply a style that uses optional ligation for those languages, ligation would occur in such words, but not in the rest of the text tagged as proper "Modern English" (since no modern English orthography variation allows ligation as a rule).
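
    One way to picture this tagging approach is as a list of language-tagged runs plus a style that names the languages for which optional ligation is switched on. The Python sketch below is only that, a picture: the Run class, the short tags "en" and "la", and the ligating_langs parameter are all hypothetical, not taken from any real rich-text format.

```python
# Minimal sketch under stated assumptions: Run and runs_to_ligate are
# illustrative names, not part of any existing rich-text protocol.
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    lang: str  # language tag applied by the higher-level protocol


def runs_to_ligate(runs, ligating_langs):
    """Return the text of every run whose language allows optional ligation."""
    return [run.text for run in runs if run.lang in ligating_langs]


# An English sentence containing one word tagged as Latin.
sentence = [Run("Julius ", "en"), Run("Caesar", "la"), Run(" does not.", "en")]

# A style that applies optional ligation to Latin and French only.
ligated = runs_to_ligate(sentence, {"la", "fr"})
```

    With this style only the run tagged as Latin is a candidate for ligation; the English runs around it, including "does", are left alone without any ZWNJ in the plain text.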

    The relationship between German orthography variations and typesetting practices causes confusion because it is not generally well understood. The language usually known as German, but more properly called High German (Hochdeutsch), can be written in four different orthography variations:

    1. "Long-S Orthography" was widely used for writing High German in Blackletter style until the 1940s. It has the character repertoire of basic Latin, with the addition of ſ ("long s"), ä (a-umlaut), ö (o-umlaut) and ü (u-umlaut). Although this orthography variation is generally used with Blackletter style, it may also be used with Antiqua style.

    2. "Es-Zet Orthography" is commonly used in Germany and Austria for writing High German in Antiqua style. It differs from "Long-S Orthography" in that ſ ("long s") is replaced with s ("round s") in all instances, except for the ſs ("long s" + "round s") and ſz ("long s" + "z") combinations, which are replaced with the ß ("es-zet") character.

    3. "Swiss Orthography" is commonly used in Switzerland for writing High German in Antiqua style. It differs from "Es-Zet Orthography" in that ß ("es-zet") is replaced with "ss". In "Swiss Orthography", uppercase "umlauts" (Ä, Ö, Ü) are usually replaced with "Ae", "Oe" and "Ue".

    4. "Simplified Orthography" evolved for use in telegraphy, and later on computers, that support only basic Latin letters. It differs from "Swiss Orthography" in that lowercase "umlauts" are also replaced with "ae", "oe" and "ue".

    Ideally, all High German texts should be encoded with "Long-S Orthography", because they can then be easily presented in the "Es-Zet", "Swiss" or "Simplified" orthography variations by simply rendering some characters or character combinations as one or more glyphs of other characters. However, most texts in High German are encoded in modern practice using either "Es-Zet Orthography" or "Swiss Orthography". This sometimes creates a problem, because Blackletter style is only considered proper with "Long-S" orthography, so complex conversion (involving dictionary lookups and human intervention) is needed for presenting most contemporary digitized texts in Blackletter style.
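
    The easy direction can be sketched as a chain of plain substitutions, following only the rules listed in points 2 to 4 above. The Python function names below are made up for illustration, and the sketch works on characters rather than glyphs; a real renderer would make the same replacements at display time. The reverse direction is not attempted, since it needs the dictionary lookups mentioned above.

```python
# Minimal sketch: only the substitutions named in the text are applied.
LONG_S = "\u017F"  # ſ, LATIN SMALL LETTER LONG S


def long_s_to_es_zet(text: str) -> str:
    """ſs and ſz become ß; every remaining ſ becomes round s."""
    text = text.replace(LONG_S + "s", "ß").replace(LONG_S + "z", "ß")
    return text.replace(LONG_S, "s")


def es_zet_to_swiss(text: str) -> str:
    """ß becomes ss; uppercase umlauts become Ae, Oe, Ue."""
    text = text.replace("ß", "ss")
    for umlaut, digraph in (("Ä", "Ae"), ("Ö", "Oe"), ("Ü", "Ue")):
        text = text.replace(umlaut, digraph)
    return text


def swiss_to_simplified(text: str) -> str:
    """Lowercase umlauts become ae, oe, ue."""
    for umlaut, digraph in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        text = text.replace(umlaut, digraph)
    return text


def long_s_to_simplified(text: str) -> str:
    """Apply all three steps in order."""
    return swiss_to_simplified(es_zet_to_swiss(long_s_to_es_zet(text)))
```

    For example, long_s_to_es_zet applied to "Grüſze" gives "Grüße", es_zet_to_swiss then gives "Grüsse", and swiss_to_simplified gives "Gruesse". Each step discards information (which s was long, which "ss" was ß), which is why going back is the hard part.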

    This, however, has essentially nothing to do with the ligation problem, except that ZWNJ should be used in "Long-S Orthography" to prohibit ligation in certain (exceptional) cases. But then again, this method should be used for any orthography where non-ligation in certain typographic styles is exceptional behavior.

    Ruszlán

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Richard Wordingham
    Sent: Saturday, January 27, 2007 1:00 PM
    To: unicode@unicode.org
    Subject: Re: ZWJ, ZWNJ and VS in Latin and other Greek-derived scripts

    John H. Jenkins wrote on Saturday, January 27, 2007 5:43 AM

    > There is general agreement that having ZWNJ mark where ligation must not
    > take place is a reasonable suggestion. There is also general agreement
    > that having ZWJ mark exceptional places where ligation is obligatory is
    > reasonable.
    >
    > The major push-back is from people who object to the idea that ZWJ should
    > be used to mark where ligation would be desirable.

    So how would they handle the English words 'brae' and 'does' (ligature
    prohibited) and 'Ca<ZWJ>esar' and 'co<ZWJ>elom' (ligature optional)? Should the
    first two be spelt with ZWNJ? Partly the problem here is that the English
    writing system is a mixture of systems, but that goes for many
    long-established writing systems. I'm not at all sure how these rules can
    sensibly be applied for all languages using the Latin script - isn't there a
    principle that plain text should be renderable, without using language
    identification, without looking misspelt?

    Richard.



    This archive was generated by hypermail 2.1.5 : Sun Jan 28 2007 - 12:10:49 CST