Re: combining characters using ZWJ

From: Sandeep Srivastava
Date: Mon Jan 30 2006 - 03:14:28 CST

    Thanks Eric for the wonderful and detailed explanation. I think you have
    answered more than I had asked. That explanation was very useful.


    On 1/29/06, Eric Muller wrote:
    > In the context of Unicode, it is important to distinguish ligatures
    > which have only a graphic motivation from ligatures which have a
    > "semantic impact".
    > The common "ff" ligature for example is all about solving a graphic
    > design problem, namely when the shape of a single "f" is such that
    > putting two in a row is ugly. In some font designs, two single "f"s in a
    > row are not a problem at all, and such fonts do not need an "ff"
    > ligature at all.
    > The "œ" ligature, on the other hand, has a "semantic impact". In the
    > French orthography I learned at school, some words need to be spelled
    > with œ (cœur, bœuf) and other words with oe (coexister). Coeur, boeuf,
    > cœxister would all be considered mistakes (I don't know of a minimal
    > pair, i.e. of two words that differ exactly by œ vs. oe). Therefore,
    > pretty much all fonts need to have an "œ" ligature, regardless of
    > whether "oe" is graphically problematic or not. [I qualified "French
    > orthography" by "[that] I learned at school" because orthographies do
    > change, either de jure or de facto, and we certainly see tremendous
    > changes with instant messaging and Internet games.]
    > This leads to the following rule of thumb in Unicode: ligatures of the
    > first kind are not inherent to the text being written, and therefore do
    > not need their own code points; ligatures of the second kind are
    > inherent and need their own code points. In fact, we do have U+0153 œ
    > characters, without decompositions (canonical or other). U+FB00 ff LATIN
    > SMALL LIGATURE FF is justified not by its "semantic impact" but by
    > compatibility with legacy character standards and it does have a
    > compatibility decomposition; for the purpose of this discussion, this
    > character and its friends can be ignored.
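
The distinction drawn above can be checked against the Unicode Character Database. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `describe` is made up for this illustration and is not from the original message.

```python
import unicodedata


def describe(ch):
    """Return (character name, decomposition mapping) for one character."""
    return unicodedata.name(ch), unicodedata.decomposition(ch)


# The "semantic" ligatures (œ, æ) have an empty decomposition mapping;
# the legacy ff ligature carries a <compat> mapping to two plain letters.
for ch in ("\u0153", "\u00e6", "\ufb00"):
    name, decomp = describe(ch)
    print(f"U+{ord(ch):04X} {name}: decomposition = {decomp!r}")

# Compatibility normalization (NFKC) replaces U+FB00 by "ff" but leaves
# a word spelled with œ unchanged.
folded_ff = unicodedata.normalize("NFKC", "\ufb00")
folded_coeur = unicodedata.normalize("NFKC", "c\u0153ur")
print(folded_ff == "ff", folded_coeur == "c\u0153ur")
```

So a process that applies compatibility normalization treats U+FB00 as a presentation variant of "ff", while œ and æ survive as letters in their own right.
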
    > Back to your question, if you want æ for the second reason, then you
    > really want to use U+00E6 æ LATIN SMALL LETTER AE. If on the other hand
    > you want a ligature of a and e for graphic reasons (and in the
    > orthography you use, that does not interfere with an æ ligature of the
    > semantic kind), then you really want to use U+0061 a LATIN SMALL LETTER
    > A, U+0065 e LATIN SMALL LETTER E, and the best you can do to
    > encourage the rendering system to use a ligature is to insert ZWJ
    > between "a" and "e"; and you can discourage the formation of a ligature
    > by inserting ZWNJ. However, that does not guarantee the result: a
    > rendering system is free to ignore your request (it's even free to
    > ignore it on even pages and satisfy it on odd pages - as far as Unicode
    > is concerned, of course).
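
To make the three encodings concrete, here is a small Python sketch. The helper names `code_points` and `strip_joiners` are invented for this example. It only shows what is stored in the text; whether a ligature actually appears is, as the message says, up to the rendering system.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER: asks for a ligature
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER: asks for no ligature


def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]


def strip_joiners(text):
    """Remove ZWJ and ZWNJ, leaving only the underlying letters."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


semantic = "\u00e6"            # the letter æ itself
encouraged = "a" + ZWJ + "e"   # a and e, ligature requested
discouraged = "a" + ZWNJ + "e" # a and e, ligature discouraged

for label, text in (("semantic", semantic),
                    ("encouraged", encouraged),
                    ("discouraged", discouraged)):
    print(label, code_points(text))

# The joiners are invisible format characters (general category Cf);
# without them the text is plain "ae", which is still not the letter æ.
print(unicodedata.category(ZWJ), strip_joiners(encouraged) == "ae")
```

Note that no normalization form turns `a` + ZWJ + `e` into U+00E6: the joiner is a rendering hint, not a way of spelling the letter æ.
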
    > Incidentally, a rendering system is the combination of a layout engine
    > and one or more fonts. Both participate in the result so it's often not
    > possible to say that a font will or will not produce a ligature outside
    > the context of a given layout engine, hence my previous message.
    > > So, if I understand you correctly, ligatures are full blown
    > > characters, and that they cannot be created using the individual
    > > characters they represent in any way.
    > It entirely depends on the kind of ligature we are talking about. Your
    > statement is essentially true for the "semantic" ligatures, and the
    > opposite statement is essentially true for the "graphic" ligatures.
    > For completeness, I should add that there are edge cases where a
    > ligature which is normally graphic only may have a semantic impact. For
    > example, there is often an "fi" graphic ligature, because the top of "f"
    > often collides with the dot of the "i", and the typical solution
    > involves dropping the dot. But in orthographies which distinguish dotted
    > i from dotless i (e.g. Turkish), such a ligature is not acceptable and
    > font designers really need to find another way to solve the graphic
    > problem (maybe put more space between f and dotted i, or find another
    > modification than dropping the dot).
    > And while we are there, the use of ZWJ and ZWNJ in the context of the
    > Latin script is different from their use in Arabic or the Brahmi-derived
    > scripts.
    > > I also found that every script has a different 'combining mark' to
    > > combine characters. For example, U+09CD is the combining mark used for
    > > the Bengali script, and U+094D is the combining mark used for the
    > > Hindi script. If that's the case, then what is the use of ZWJ?
    > First, you are right that U+094D ◌् DEVANAGARI SIGN VIRAMA and the other
    > virama characters are formally combining marks.
    > Second, the virama in the Indic scripts serves a very different purpose
    > than the joiners (ZWJ and ZWNJ) in Latin. A स्त (sta) conjunct is much
    > more like an "œ" ligature than it is like an "fi" ligature: "सत" (sata)
    > and "स्त" (sta) are simply not interchangeable, you need to use the
    > appropriate one.
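
The properties mentioned above can be inspected directly. This is a minimal sketch using Python's standard `unicodedata` module; the variable names are chosen for this example only.

```python
import unicodedata

VIRAMA_DEVANAGARI = "\u094d"
VIRAMA_BENGALI = "\u09cd"

# Both viramas are nonspacing combining marks (general category Mn)
# with canonical combining class 9.
for v in (VIRAMA_DEVANAGARI, VIRAMA_BENGALI):
    print(f"U+{ord(v):04X}", unicodedata.name(v),
          unicodedata.category(v), unicodedata.combining(v))

SA, TA = "\u0938", "\u0924"
sata = SA + TA                      # सत, two full consonants
sta = SA + VIRAMA_DEVANAGARI + TA   # स्त, conjunct

# The two spellings are different texts, not rendering variants of
# each other: they stay distinct under every normalization form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, sata) ==
          unicodedata.normalize(form, sta))
```
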
    > For Latin, we have a small number of pairs that form semantic ligatures,
    > and it is therefore reasonable to encode a separate character for each
    > pair as needed.
    > Devanagari on the other hand has a large number of conjuncts (including
    > some formed of three or four characters), so it was deemed preferable to
    > have a constructive mechanism to represent conjuncts, namely to link the
    > letters entering in a conjunct by the VIRAMA coded character. That way,
    > there is no need to rework the standard every time somebody exhibits a
    > new, up-to-now not encoded conjunct. [This is a bit of an historical
    > revision: for one thing, Unicode followed the lead of ISCII; and I
    > strongly suspect that having a small character set was a constraint for
    > ISCII. But you get the point, I can pretty much guarantee that without
    > legacy, Unicode would have selected a constructive approach anyway.]
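
The constructive mechanism described above amounts to joining consonants with the VIRAMA character. A rough Python sketch, in which the function name `conjunct` is hypothetical and no validation of the input letters is attempted:

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA


def conjunct(*consonants):
    """Encode a conjunct by linking the given consonants with VIRAMA."""
    return VIRAMA.join(consonants)


SA, TA, RA = "\u0938", "\u0924", "\u0930"

sta = conjunct(SA, TA)       # स्त: SA, VIRAMA, TA
stra = conjunct(SA, TA, RA)  # स्त्र: SA, VIRAMA, TA, VIRAMA, RA

for text in (sta, stra):
    print(text, [f"U+{ord(c):04X}" for c in text])
```

Any new conjunct is just another sequence of existing characters, which is why the standard does not need a new code point for it.
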
    > You could wonder what we would have done in Latin had the set of
    > semantic ligatures been large or not bounded. A very viable approach would
    > have been to not encode U+0153 œ LATIN SMALL LIGATURE OE and U+00E6 æ
    > LATIN SMALL LETTER AE and friends, to encode LATIN SIGN VIRAMA instead,
    > and to represent "œ" by <U+006F o LATIN SMALL LETTER O, LATIN SIGN
    > VIRAMA, U+0065 e LATIN SMALL LETTER E>.
    > As to whether we need a single VIRAMA character for all the scripts or
    > one per script, it's six one way and half a dozen the other (although I
    > am sure we will see answers from vehement proponents of each approach).
    > Finally, the joiners are used in Devanagari for a function that is
    > broadly similar to their use in Latin: to encourage the rendering
    > system to select one form or another for a conjunct, when those forms
    > are "semantically" equivalent (full conjunct vs. half-form +
    > full-form vs. full-form + halant + full-form).
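
The three presentation requests can be written out as code point sequences. This Python sketch follows the conventions in the Devanagari section of the Unicode Standard (virama + ZWJ requests a half-form, virama + ZWNJ requests a visible halant); the variable names are illustrative, and a given font and layout engine may not honor the request.

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER
KA, SSA = "\u0915", "\u0937"

full_conjunct = KA + VIRAMA + SSA          # renderer's default, often क्ष
half_form = KA + VIRAMA + ZWJ + SSA        # request half-form KA + full SSA
explicit_halant = KA + VIRAMA + ZWNJ + SSA # request KA with visible halant


def without_joiners(text):
    """Drop ZWJ/ZWNJ to recover the underlying letter sequence."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


for text in (full_conjunct, half_form, explicit_halant):
    print([f"U+{ord(c):04X}" for c in text])

# All three spell the same letters; only the requested shape differs.
print(without_joiners(half_form) == full_conjunct,
      without_joiners(explicit_halant) == full_conjunct)
```
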
    > Eric.
