From: Sandeep Srivastava (firstname.lastname@example.org)
Date: Mon Jan 30 2006 - 03:14:28 CST
Thanks Eric for the wonderful and detailed explanation. I think you have
answered more than I had asked. That explanation was very useful.
On 1/29/06, Eric Muller <email@example.com> wrote:
> In the context of Unicode, it is important to distinguish ligatures
> which have only a graphic motivation from ligatures which have a
> "semantic impact".
> The common "ff" ligature for example is all about solving a graphic
> design problem, namely when the shape of a single "f" is such that
> putting two in a row is ugly. In some font designs, two single "f" in a
> row are not a problem at all, and such fonts do not need an "ff"
> ligature at all.
> The "œ" ligature, on the other hand, has a "semantic impact". In the
> French orthography I learned at school, some words need to be spelled
> with œ (cœur, bœuf) and other words with oe (coexister). Coeur, boeuf,
> cœxister would all be considered mistakes (I don't know of a minimal
> pair, i.e. of two words that differ exactly by œ vs. oe). Therefore,
> pretty much all fonts need to have an "œ" ligature, regardless of
> whether "oe" is graphically problematic or not. [I qualified "French
> orthography" by "[that] I learned at school" because orthographies do
> change, either de jure or de facto, and we certainly see tremendous
> changes with instant messaging and Internet games.]
> This leads to the following rule of thumb in Unicode: ligatures of the
> first kind are not inherent to the text being written, and therefore do
> not need their own code points; ligatures of the second kind are
> inherent and need their own code points. In fact, we do have U+0153 œ
> LATIN SMALL LIGATURE OE, U+00E6 æ LATIN SMALL LETTER AE as regular
> characters, without decompositions (canonical or other). U+FB00 ﬀ LATIN
> SMALL LIGATURE FF is justified not by its "semantic impact" but by
> compatibility with legacy character standards and it does have a
> compatibility decomposition; for the purpose of this discussion, this
> character and its friends can be ignored.
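[Editorial sketch, not part of Eric's message.] The decomposition claims
above can be checked with Python's standard unicodedata module. The
helper name describe is hypothetical and exists only for this
illustration; the results depend on the Unicode data shipped with the
Python build in use.

```python
import unicodedata

def describe(ch):
    """Return (character name, decomposition mapping) for one character.

    Hypothetical helper for this note; an empty decomposition string
    means the character has no canonical or compatibility decomposition.
    """
    return unicodedata.name(ch), unicodedata.decomposition(ch)

OE = "\u0153"  # LATIN SMALL LIGATURE OE
AE = "\u00e6"  # LATIN SMALL LETTER AE
FF = "\ufb00"  # LATIN SMALL LIGATURE FF

for ch in (OE, AE, FF):
    name, decomp = describe(ch)
    print(f"U+{ord(ch):04X} {name}: decomposition={decomp!r}")

# NFKC applies compatibility decompositions, so the legacy "ff"
# ligature folds to two plain letters while "œ" is left untouched.
print(unicodedata.normalize("NFKC", FF) == "ff")
print(unicodedata.normalize("NFKC", OE) == OE)
```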
> Back to your question, if you want æ for the second reason, then you
> really want to use U+00E6 æ LATIN SMALL LETTER AE. If on the other hand
> you want a ligature of a and e for graphic reasons (and in the
> orthography you use, that does not interfere with an æ ligature of the
> semantic kind), then you really want to use U+0061 a LATIN SMALL LETTER
> A, U+0065 e LATIN SMALL LETTER E, and the best you can do to
> encourage the rendering system to use a ligature is to insert ZWJ
> between "a" and "e"; and you can discourage the formation of a ligature
> by inserting ZWNJ. However, that does not guarantee the result: a
> rendering system is free to ignore your request (it's even free to
> ignore it on even pages and satisfy it on odd pages - as far as Unicode
> is concerned, of course).
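[Editorial sketch, not part of Eric's message.] At the text level, the
two requests described above amount to placing U+200D or U+200C between
the letters. The function names below are hypothetical; the code only
builds the strings and, as the message says, cannot guarantee what any
rendering system will do with them.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def request_ligature(left, right):
    """Ask (but not force) the renderer to ligate left and right."""
    return left + ZWJ + right

def discourage_ligature(left, right):
    """Ask (but not force) the renderer to keep left and right apart."""
    return left + ZWNJ + right

def strip_joiners(text):
    """Recover the underlying letters by removing both joiners."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

joined = request_ligature("a", "e")
split = discourage_ligature("a", "e")

print([f"U+{ord(c):04X}" for c in joined])
print([f"U+{ord(c):04X}" for c in split])

# Either way the text is still the two letters "a" and "e";
# it never becomes the separate character U+00E6.
print(strip_joiners(joined) == "ae", joined == "\u00e6")
```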
> Incidentally, a rendering system is the combination of a layout engine
> and one or more fonts. Both participate in the result so it's often not
> possible to say that a font will or will not produce a ligature outside
> the context of a given layout engine, hence my previous message.
> > So, if I understand you correctly, ligatures are full blown
> > characters, and that they cannot be created using the individual
> > characters they represent in any way.
> It entirely depends on the kind of ligature we are talking about. Your
> statement is essentially true for the "semantic" ligatures, and the
> opposite statement is essentially true for the "graphic" ligatures.
> For completeness, I should add that there are edge cases where a
> ligature which is normally graphic only may have a semantic impact. For
> example, there is often an "fi" graphic ligature, because the top of "f"
> often collides with the dot of the "i", and the typical solution
> involves dropping the dot. But in orthographies which distinguish dotted
> i from dotless i (e.g. Turkish), such a ligature is not acceptable and
> font designers really need to find another way to solve the graphic
> problem (maybe put more space between f and dotted i, or find another
> modification than dropping the dot).
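[Editorial sketch, not part of Eric's message.] The Turkish case can be
made concrete by looking at the code points involved. Dotted and dotless
i are separate characters, and the legacy "fi" ligature character
decomposes to f plus the dotted i, so a glyph that drops the dot would
misrepresent it. This uses only the standard unicodedata module.

```python
import unicodedata

DOTTED_I = "i"           # U+0069 LATIN SMALL LETTER I
DOTLESS_I = "\u0131"     # LATIN SMALL LETTER DOTLESS I
FI_LIGATURE = "\ufb01"   # LATIN SMALL LIGATURE FI

for ch in (DOTTED_I, DOTLESS_I, FI_LIGATURE):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The two letters are distinct characters, so a rendering that makes
# them look alike loses information in Turkish text.
print(DOTTED_I == DOTLESS_I)

# The compatibility decomposition of the ligature names the dotted i.
print(unicodedata.decomposition(FI_LIGATURE))
print(unicodedata.normalize("NFKC", FI_LIGATURE) == "f" + DOTTED_I)
```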
> And while we are there, the use of ZWJ and ZWNJ in the context of the
> Latin script is different from their use in Arabic or the Brahmi-derived
> scripts.
> > I also found that every script has a different 'combining mark' to
> > combine characters. For example, U+09CD is the combining mark used for
> > the Bengali script, and U+094D is the combining mark used for the
> > Hindi script. If that's the case, then what is the use of ZWJ?
> First, you are right that U+094D ◌् DEVANAGARI SIGN VIRAMA and the other
> virama characters are formally combining marks.
> Second, the virama in the Indic scripts serves a very different purpose
> than the joiners (ZWJ and ZWNJ) in Latin. A स्त (sta) conjunct is much
> more like an "œ" ligature than it is like an "fi" ligature: "सत" (sata)
> and "स्त" (sta) are simply not interchangeable, you need to use the
> appropriate one.
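[Editorial sketch, not part of Eric's message.] The difference between
"sata" and "sta" is visible in the code point sequences: the conjunct is
spelled with the virama between the two consonants. The snippet also
checks that the Devanagari and Bengali viramas are combining marks, using
the standard unicodedata module.

```python
import unicodedata

SA = "\u0938"              # DEVANAGARI LETTER SA
TA = "\u0924"              # DEVANAGARI LETTER TA
VIRAMA = "\u094d"          # DEVANAGARI SIGN VIRAMA
BENGALI_VIRAMA = "\u09cd"  # BENGALI SIGN VIRAMA

sata = SA + TA             # two full letters
sta = SA + VIRAMA + TA     # conjunct, built constructively

print([f"U+{ord(c):04X}" for c in sata])
print([f"U+{ord(c):04X}" for c in sta])
print(sata == sta)

# Both viramas are nonspacing marks (general category Mn) with a
# non-zero canonical combining class.
for mark in (VIRAMA, BENGALI_VIRAMA):
    print(unicodedata.name(mark),
          unicodedata.category(mark),
          unicodedata.combining(mark))
```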
> For Latin, we have a small number of pairs that form semantic ligatures,
> and it is therefore reasonable to encode a separate character for each
> pair as needed.
> Devanagari on the other hand has a large number of conjuncts (including
> some formed of three or four characters), so it was deemed preferable to
> have a constructive mechanism to represent conjuncts, namely to link the
> letters entering in a conjunct by the VIRAMA coded character. That way,
> there is no need to rework the standard every time somebody exhibits a
> new, not yet encoded conjunct. [This is a bit of a historical
> revision: for one thing, Unicode followed the lead of ISCII; and I
> strongly suspect that having a small character set was a constraint for
> ISCII. But you get the point, I can pretty much guarantee that without
> legacy, Unicode would have selected a constructive approach anyway.]
> You could wonder what we would have done in Latin had the set of
> semantic ligatures been large or unbounded. A very viable approach would
> have been to not encode U+0153 œ LATIN SMALL LIGATURE OE and U+00E6 æ
> LATIN SMALL LETTER AE and friends, to encode LATIN SIGN VIRAMA instead,
> and to represent "œ" by <U+006F o LATIN SMALL LETTER O, LATIN SIGN
> VIRAMA, U+0065 e LATIN SMALL LETTER E>.
> As to whether we need a single VIRAMA character for all the scripts or
> one per script, it's six one way and half a dozen the other (although I
> am sure we will see answers from vehement proponents of each approach).
> Finally, the joiners are used in Devanagari for a function that is
> almost always similar to their use in Latin. It is to encourage the
> rendering system to select one form or another for a conjunct, when
> those forms are "semantically" equivalent (full conjunct vs. half-form +
> full-form vs. full-form + halant + full-form).
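[Editorial sketch, not part of Eric's message.] The three presentation
requests listed above can be written as code point sequences, here for
the ka + ssa conjunct. The dictionary keys and the helper name
without_joiners are hypothetical labels for this illustration, and a
rendering system remains free to ignore the joiners.

```python
KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER

forms = {
    # No joiner: the renderer picks, typically the full conjunct.
    "full conjunct": KA + VIRAMA + SSA,
    # ZWJ after the virama requests half-form + full-form.
    "half-form + full-form": KA + VIRAMA + ZWJ + SSA,
    # ZWNJ after the virama requests a visible halant.
    "full-form + halant + full-form": KA + VIRAMA + ZWNJ + SSA,
}

def without_joiners(text):
    """Drop ZWJ and ZWNJ to expose the underlying spelling."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

for label, seq in forms.items():
    print(label, [f"U+{ord(c):04X}" for c in seq])

# All three share the same spelling once the joiners are removed,
# which is the sense in which the forms are "semantically" equivalent.
print(len({without_joiners(seq) for seq in forms.values()}) == 1)
```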