Unicode Frequently Asked Questions

Ligatures, Digraphs and Presentation Forms

 

Q: What's the difference between a "Digraph" and a "Ligature"?

A: Digraphs and ligatures are both made by combining two glyphs. In a digraph, the glyphs remain separate but are placed close together. In a ligature, the glyphs are fused into a single glyph. [JC]

Q: I can't find the digraph "IE" in Unicode. Where do I look?

A: Look at "Where is my character?"

Q. Why are "my" presentation forms NOT included in Unicode?

A. The Unicode Standard encodes characters; it is the function of rendering systems to select presentation forms as needed to render those characters. There is therefore no need to encode presentation forms. [EM]

Q. Is it necessary to use the presentation forms that are defined in Unicode?

A. No, it is not necessary to use those presentation forms. Those forms were selected and identified in the early days of developing Unicode when sophisticated rendering engines were not prevalent. A selected subset of the presentation forms was included to provide users with a simple method to generate them. [MK]

Q. Can one use the presentation forms in a data file?

A. Doing so is strongly discouraged, because it does not guarantee data integrity or interoperability. In the particular case of Arabic, data files should include only the characters in the Arabic block, U+0600 to U+06FF. [MK]
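As a rough illustration of the interoperability concern, the following Python sketch uses the standard unicodedata module to spot characters whose compatibility decomposition is tagged as a positional form, and folds them back to nominal characters with NFKC. The helper names are hypothetical, and NFKC also changes other compatibility characters, so this is a sketch under those assumptions rather than a recommended cleanup procedure.

```python
import unicodedata

# Decomposition tags that mark positional (contextual) presentation forms.
POSITIONAL_TAGS = ("<isolated>", "<initial>", "<medial>", "<final>")

def presentation_forms(text):
    """Return the characters in text that are positional presentation forms."""
    return [ch for ch in text
            if unicodedata.decomposition(ch).startswith(POSITIONAL_TAGS)]

def to_nominal(text):
    """Fold compatibility characters, including presentation forms, via NFKC."""
    return unicodedata.normalize("NFKC", text)

# U+FEDF is ARABIC LETTER LAM INITIAL FORM; U+FE8E is ARABIC LETTER ALEF FINAL FORM.
stored = "\uFEDF\uFE8E"
print([f"U+{ord(ch):04X}" for ch in presentation_forms(stored)])
print([f"U+{ord(ch):04X}" for ch in to_nominal(stored)])
```

A search for the nominal letters U+0644 U+0627 would not match the stored string until it has been folded, which is the kind of mismatch the answer above warns about.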

Q: My language needs the digraph "xy". That digraph is distinctly different from "x" + "y" and is treated as a unit in my language. Given all the available space in the Latin Extended area, why not just encode that digraph as another character? If my "xy" digraph were represented by a unique symbol, it would certainly be included with all the other letters from my language.

[Editor's note: "xy" is being used as a stand-in for particular digraphs from particular languages; this question, or something very similar to it, has been asked recently, for instance, about "ch" in Slovak, about "ng" in Tagalog, and about "ie" in Maltese.]

A: If the digraph "xy" were some strange symbol, then it would indeed be new; there would not be any existing way to represent it using already encoded Unicode characters. But it is not a strange symbol - it is just the digraph "xy", and there is already a way to represent it in Unicode: <U+0078, U+0079>.
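For concreteness, a short Python sketch listing the code points of the stand-in digraph; the variable names are illustrative only.

```python
# "xy" is the stand-in digraph used in the question above.
digraph = "xy"

# Format each character as a U+XXXX code point label.
code_points = [f"U+{ord(ch):04X}" for ch in digraph]
print(code_points)  # ['U+0078', 'U+0079']
```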

While it may seem that there is a lot of available space for Latin letters in the Unicode Standard, and the upper- and lowercase versions of the digraph "xy" only constitute a couple of characters, in reality what's at stake here are not just a couple of characters, but hundreds. This is a recurrent pattern. There is a steady flow of requests for Latin digraphs or precomposed base form + diacritic letters for various languages. The reason is always some variant of "In my language 'xy' is a unit and not a sequence; it has its own behavior, and so should be encoded separately."

It isn't just a matter of the standards overhead faced by the Unicode Technical Committee in dealing with all these encoding proposals for letters that can already be represented by sequences of existing encoded characters. There are deeper issues pertaining to the implications for existing implementations and existing data. If a new digraph "xy" is added, that implies the addition of a new compatibility decomposition of "xy" to "x" "y" in the data tables. And that means people will have to revise their software to handle it. Then there is the fact that people will not represent data consistently; some will use the new digraph character and some will not - you can count on that. Existing data will not magically update itself to make use of the new digraph. Because of these considerations and others, there will be situations in which it will be necessary to represent data using the decomposed form anyway - as for example when passing around normalized data on the Internet. So the addition of a digraph character has a fairly substantial (and costly) set of consequences, in return for a minimal set of benefits. The net effect is generally negative, rather than positive. Multiply that by hundreds for all of the other digraphs and precomposed diacritic-marked letters that exist for other languages (and with perhaps a couple of thousand languages in the world currently written with the Latin script, there are lots), and you can see why the Unicode Technical Committee does not favor heading down this path.
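The inconsistency described above can already be observed with a digraph character that Unicode does contain for compatibility reasons. The following Python sketch, using the standard unicodedata module, compares U+01C6 with the equivalent two-letter sequence; the helper name same_text is hypothetical, and NFKC is chosen here only as one possible normalization for the comparison.

```python
import unicodedata

precomposed = "\u01C6"   # LATIN SMALL LETTER DZ WITH CARON, a single code point
sequence = "d\u017E"     # "d" followed by U+017E LATIN SMALL LETTER Z WITH CARON

# Compatibility decomposition recorded in the Unicode Character Database.
print(unicodedata.decomposition(precomposed))

def same_text(a, b):
    """Compare two strings after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

print(precomposed == sequence)            # raw code point comparison
print(same_text(precomposed, sequence))   # comparison after normalization
```

The raw comparison treats the two spellings as different text, so every program that searches, sorts, or compares such data has to normalize first. Each newly encoded digraph would add another case of this kind.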

At this point, the UTC has a default position: no new characters for digraphs or precomposed diacritic letters should be accepted for encoding as individual characters. If a convincing enough case can be presented, there may always be exceptions to that default position. To be convincing, the line of reasoning would have to be along the lines of: There are demonstrable processing issues in the writing system for this language that cannot adequately be dealt with using the existing encoded characters, but which could be resolved by the addition of this new character ("xy", or whatever). But the arguments have to be very convincing, and other approaches to dealing with the perceived problem have to be explored and shown to be inadequate. For example, citation of a different sorting order for "xy" in a language is not very convincing, because well-known collation techniques are used to handle sorting of digraphic sequences in various languages; for sorting, the alternative approaches available for using weights for sequences of letters are preferable to having a separately encoded digraph, because those approaches are more general and extensible. [PC] & [KW]
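To make the sorting point concrete, here is a toy Python sketch of assigning a weight to a sequence of letters. It assumes a simplified lowercase alphabet in which "ch" sorts as a unit after "h", as in the Slovak case mentioned earlier; the names ALPHABET, WEIGHT and sort_key are hypothetical. It is not an implementation of any standard collation algorithm, only an illustration that a digraph can be ordered as a unit without being a single encoded character.

```python
import string

# Toy alphabet: basic Latin letters with the digraph "ch" ordered after "h".
ALPHABET = list(string.ascii_lowercase)
ALPHABET.insert(ALPHABET.index("h") + 1, "ch")
WEIGHT = {unit: i for i, unit in enumerate(ALPHABET)}

def sort_key(word):
    """Map a word to a list of weights, treating "ch" as one sorting unit."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(WEIGHT["ch"])   # the two-letter sequence gets one weight
            i += 2
        else:
            # Letters outside the toy alphabet sort after everything else.
            key.append(WEIGHT.get(word[i], len(WEIGHT)))
            i += 1
    return key

words = ["chata", "cesta", "idea", "hora"]
print(sorted(words))                # plain code point order
print(sorted(words, key=sort_key))  # "ch" ordered as a unit after "h"
```

With the plain sort, "chata" lands between "cesta" and "hora"; with the weighted key it moves after "hora". The same table-driven approach extends to other digraphs by adding entries, which is why it is more general than encoding each digraph separately.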

Q: I have here a bunch of manuscripts which use the "hr" ligature (for example) extensively. I see you have encoded ligatures for "fi", "fl", and even "st", but not "hr". Can I get "hr" encoded as a ligature too?

A: The existing ligatures are encoded basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded under any circumstances.

Ligaturing is a behavior encoded in fonts: if a modern font is asked to display "h" followed by "r", and the font has an "hr" ligature in it, it can display the ligature. Some fonts have no ligatures, some (especially for non-Latin scripts) have hundreds. It does not make sense to assign Unicode code points to all these font-specific possibilities. [JC]
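Because the encoded ligatures are compatibility characters, text that contains them can be folded back to ordinary letter sequences. The following Python sketch does this with the standard unicodedata module; the function name fold_ligatures is hypothetical, and NFKC folds other compatibility characters as well, so it is shown only as an illustration.

```python
import unicodedata

def fold_ligatures(text):
    """Replace compatibility characters, including Latin ligatures, via NFKC."""
    return unicodedata.normalize("NFKC", text)

# "final flow" stored with U+FB01 (fi ligature) and U+FB02 (fl ligature).
legacy = "\uFB01nal \uFB02ow"

print(fold_ligatures(legacy))
print("final" in legacy, "final" in fold_ligatures(legacy))
```

A plain search for "final" misses the legacy string until the ligature characters are folded. Storing the letter sequence and leaving ligation to the font avoids that problem.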

Q: What about the "ct" ligature? Is there a character for that in Unicode?

A: No, the "ct" ligature is another example of a ligature of Latin letters commonly seen in older type styles. As in the case of the "hr" ligature, display of a ligature is a matter for font design, and does not require separate encoding of a character for the ligature. One simply represents the character sequence <c, t> in Unicode and depends on font design and font attribute controls to determine whether the result is ligated in display (or in printing). The same situation applies for ligatures involving long s and many others found in Latin typefaces.

Remember that the Unicode Standard is a character encoding standard, and is not intended to standardize ligatures or other presentation forms, or any other aspects of the details of font and glyph design. The ligatures which you can find in the Unicode Standard are compatibility encodings only—and are not meant to set a precedent requiring the encoding of all ligatures as characters. [KW]