Re: combining characters using ZWJ

From: Eric Muller
Date: Sat Jan 28 2006 - 13:41:50 CST

    In the context of Unicode, it is important to distinguish ligatures
    which have only a graphic motivation from ligatures which have a
    "semantic impact".

    The common "ff" ligature for example is all about solving a graphic
    design problem, namely when the shape of a single "f" is such that
    putting two in a row is ugly. In some font designs, two single "f" in a
    row are not a problem at all, and such fonts do not need an "ff"
    ligature at all.

    The "œ" ligature, on the other hand, has a "semantic impact". In the
    French orthography I learned at school, some words need to be spelled
    with œ (cœur, bœuf) and other words with oe (coexister). Coeur, boeuf,
    cœxister would all be considered mistakes (I don't know of a minimal
    pair, i.e. of two words that differ exactly by œ vs. oe). Therefore,
    pretty much all fonts need to have an "œ" ligature, regardless of
    whether "oe" is graphically problematic or not. [I qualified "French
    orthography" by "[that] I learned at school" because orthographies do
    change, either de jure or de facto, and we certainly see tremendous
    changes with instant messaging and Internet games.]

    This leads to the following rule of thumb in Unicode: ligatures of the
    first kind are not inherent to the text being written, and therefore do
    not need their own code points; ligatures of the second kind are
    inherent and need their own code points. In fact, we do have U+0153 œ
    characters, without decompositions (canonical or other). U+FB00 ff LATIN
    SMALL LIGATURE FF is justified not by its "semantic impact" but by
    compatibility with legacy character standards and it does have a
    compatibility decomposition; for the purpose of this discussion, this
    character and its friends can be ignored.
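
    The difference between the two kinds of code points can be checked
    against the Unicode Character Database. What follows is a minimal
    sketch, assuming Python and its standard unicodedata module (neither is
    part of this discussion); the helper name describe is made up for the
    illustration.

```python
import unicodedata

OE = "\u0153"   # LATIN SMALL LIGATURE OE, a "semantic" ligature
FF = "\ufb00"   # LATIN SMALL LIGATURE FF, kept for legacy compatibility


def describe(ch: str) -> tuple[str, str, str]:
    """Return (name, decomposition mapping, NFKC form) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.decomposition(ch),      # empty string when there is none
        unicodedata.normalize("NFKC", ch),  # applies compatibility mappings
    )


oe_info = describe(OE)
ff_info = describe(FF)
print(oe_info)
print(ff_info)
```

    The decomposition field comes back empty for U+0153, while U+FB00
    carries a <compat> mapping to two plain "f" letters, so compatibility
    normalization folds it away and leaves "œ" untouched.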

    Back to your question, if you want æ for the second reason, then you
    really want to use U+00E6 æ LATIN SMALL LETTER AE. If on the other hand
    you want a ligature of a and e for graphic reasons (and in the
    orthography you use, that does not interfere with an æ ligature of the
    semantic kind), then you really want to use U+0061 a LATIN SMALL LETTER
    A, U+0065 e LATIN SMALL LETTER E, and the best you can do to
    encourage the rendering system to use a ligature is to insert ZWJ
    between "a" and "e"; and you can discourage the formation of a ligature
    by inserting ZWNJ. However, that does not guarantee the result: a
    rendering system is free to ignore your request (it's even free to
    ignore it on even pages and satisfy it on odd pages - as far as Unicode
    is concerned, of course).
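
    At the level of the encoded text, the three options look as follows.
    This is only a sketch in Python; the function names request_ligature,
    forbid_ligature and strip_joiners are hypothetical, and nothing here
    says anything about what a given rendering system will actually draw.

```python
import unicodedata

ZWJ = "\u200d"    # ZERO WIDTH JOINER: a request for a ligated rendering
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER: a request for no ligature


def request_ligature(left: str, right: str) -> str:
    """Encode two letters with a hint that a graphic ligature is wanted."""
    return left + ZWJ + right


def forbid_ligature(left: str, right: str) -> str:
    """Encode two letters with a hint that no ligature is wanted."""
    return left + ZWNJ + right


def strip_joiners(text: str) -> str:
    """Drop the rendering hints, leaving the underlying letters."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


graphic_ae = request_ligature("a", "e")    # a, ZWJ, e
separate_ae = forbid_ligature("a", "e")    # a, ZWNJ, e
semantic_ae = "\u00e6"                     # LATIN SMALL LETTER AE

# The hinted sequence is still "a" followed by "e"; normalization does not
# turn it into the single letter U+00E6.
print(unicodedata.normalize("NFC", graphic_ae) == semantic_ae)
print(strip_joiners(graphic_ae) == strip_joiners(separate_ae) == "ae")
```

    The joiners only annotate the sequence "ae"; the text never becomes
    U+00E6, which is the point of keeping the two cases apart.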

    Incidentally, a rendering system is the combination of a layout engine
    and one or more fonts. Both participate in the result so it's often not
    possible to say whether a font will or will not produce a ligature
    outside the context of a given layout engine, hence my previous message.

    > So, if I understand you correctly, ligatures are full blown
    > characters, and that they cannot be created using the individual
    > characters they represent in any way.

    It entirely depends on the kind of ligature we are talking about. Your
    statement is essentially true for the "semantic" ligatures, and the
    opposite statement is essentially true for the "graphic" ligatures.

    For completeness, I should add that there are edge cases where a
    ligature which is normally graphic only may have a semantic impact. For
    example, there is often an "fi" graphic ligature, because the top of "f"
    often collides with the dot of the "i", and the typical solution
    involves dropping the dot. But in orthographies which distinguish dotted
    i from dotless i (e.g. Turkish), such a ligature is not acceptable and
    font designers really need to find another way to solve the graphic
    problem (maybe put more space between f and dotted i, or find another
    modification than dropping the dot).

    And while we are there, the use of ZWJ and ZWNJ in the context of the
    Latin script is different from their use in Arabic or the Brahmi-derived
    scripts.

    > I also found that every script has a different 'combining mark' to
    > combine characters. For example, U+09CD is the combining mark used for
    > the Bengali script, and U+094D is the combining mark used for the
    > Hindi script. If that's the case, then what is the use of ZWJ?

    First, you are right that U+094D ◌् DEVANAGARI SIGN VIRAMA and the other
    virama characters are formally combining marks.

    Second, the virama in the Indic scripts serves a very different purpose
    than the joiners (ZWJ and ZWNJ) in Latin. A स्त (sta) conjunct is much
    more like an "œ" ligature than it is like an "fi" ligature: "सत" (sata)
    and "स्त" (sta) are simply not interchangeable, you need to use the
    appropriate one.
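
    To make the comparison concrete, here is a small sketch (Python and
    its unicodedata module are my assumption, and code_points is a
    hypothetical helper) that lists the code points of the two spellings
    and the properties of the virama.

```python
import unicodedata

SA = "\u0938"       # DEVANAGARI LETTER SA
TA = "\u0924"       # DEVANAGARI LETTER TA
VIRAMA = "\u094d"   # DEVANAGARI SIGN VIRAMA

sata = SA + TA            # two full syllables: sa + ta
sta = SA + VIRAMA + TA    # the conjunct: s + ta


def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(sata))
print(code_points(sta))

# The virama is a nonspacing combining mark (General_Category Mn) with a
# non-zero canonical combining class.
print(unicodedata.name(VIRAMA))
print(unicodedata.category(VIRAMA), unicodedata.combining(VIRAMA))
```

    The two words are different code point sequences and stay different
    under normalization, just as "cœur" and "coeur" do.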

    For Latin, we have a small number of pairs that form semantic ligatures,
    and it is therefore reasonable to encode a separate character for each
    pair as needed.

    Devanagari on the other hand has a large number of conjuncts (including
    some formed of three or four characters), so it was deemed preferable to
    have a constructive mechanism to represent conjuncts, namely to link the
    letters entering in a conjunct by the VIRAMA coded character. That way,
    there is no need to rework the standard every time somebody exhibits a
    new, up-to-now not encoded conjunct. [This is a bit of an historical
    revision: for one thing, Unicode followed the lead of ISCII; and I
    strongly suspect that having a small character set was a constraint for
    ISCII. But you get the point, I can pretty much guarantee that without
    legacy, Unicode would have selected a constructive approach anyway.]
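
    The constructive approach is simple enough to sketch in a few lines.
    The function name conjunct is hypothetical and the code is only an
    illustration of the encoding model, not of any particular
    implementation; whether a font has a glyph for the result is a
    separate matter.

```python
VIRAMA = "\u094d"   # DEVANAGARI SIGN VIRAMA


def conjunct(*consonants: str) -> str:
    """Link consonant letters with VIRAMA to encode a conjunct."""
    if not consonants:
        raise ValueError("at least one consonant is required")
    return VIRAMA.join(consonants)


SA, TA, RA = "\u0938", "\u0924", "\u0930"

sta = conjunct(SA, TA)        # s + ta
stra = conjunct(SA, TA, RA)   # s + t + ra, a three-letter conjunct

print([f"U+{ord(ch):04X}" for ch in stra])
```

    Any number of letters can be linked this way, so a conjunct nobody
    has exhibited yet needs no new code point.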

    You could wonder what we would have done in Latin had the set of
    semantic ligatures been large or not bounded. A very viable approach would
    have been to not encode U+0153 œ LATIN SMALL LIGATURE OE and U+00E6 æ
    LATIN SMALL LETTER AE and friends, to encode LATIN SIGN VIRAMA instead,
    and to represent "œ" by <U+006F o LATIN SMALL LETTER O, LATIN SIGN
    VIRAMA, U+0065 e LATIN SMALL LETTER E>.

    As to whether we need a single VIRAMA character for all the scripts or
    one per script, it's six one way and half a dozen the other (although I
    am sure we will see answers from vehement proponents of each approach).

    Finally, the joiners are used in Devanagari for a function that is
    much like their use in Latin: to encourage the rendering system to
    select one form or another for a conjunct, when those forms are
    "semantically" equivalent (full conjunct vs. half-form + full-form vs.
    full-form + halant + full-form).
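
    The sequences that request each of the three forms can be sketched as
    below. The placement of the joiner after the virama follows the
    convention described in the Devanagari section of the Unicode
    Standard, not anything stated in this message, and the function names
    are hypothetical. As in Latin, these are requests that a rendering
    system may or may not honor.

```python
VIRAMA = "\u094d"   # DEVANAGARI SIGN VIRAMA (halant)
ZWJ = "\u200d"      # ZERO WIDTH JOINER
ZWNJ = "\u200c"     # ZERO WIDTH NON-JOINER


def full_conjunct(c1: str, c2: str) -> str:
    """Default sequence: the rendering system picks the conjunct form."""
    return c1 + VIRAMA + c2


def half_form(c1: str, c2: str) -> str:
    """Request half-form + full-form by placing ZWJ after the virama."""
    return c1 + VIRAMA + ZWJ + c2


def explicit_halant(c1: str, c2: str) -> str:
    """Request full-form + visible halant + full-form by placing ZWNJ."""
    return c1 + VIRAMA + ZWNJ + c2


def without_joiners(text: str) -> str:
    """Drop the joiners to recover the underlying spelling."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


KA, SSA = "\u0915", "\u0937"   # DEVANAGARI LETTER KA, DEVANAGARI LETTER SSA

variants = [f(KA, SSA) for f in (full_conjunct, half_form, explicit_halant)]
for v in variants:
    print([f"U+{ord(ch):04X}" for ch in v])
```

    All three sequences reduce to the same letters once the joiners are
    removed, which is what "semantically" equivalent means here.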

