Ligatures, Digraphs, Presentation Forms vs. Plain Text
Q: What's the difference between a “Ligature” and a “Digraph”?
A: Digraphs and ligatures are both made by combining two glyphs. In a digraph, the glyphs remain separate but
are placed close together. In a ligature, the glyphs are fused into a single glyph. [JC]
Q: I have here a bunch of manuscripts which use the “hr” ligature (for example) extensively.
I see you have encoded ligatures for “fi”, “fl”, and even “st”, but not “hr”. Can I get “hr” encoded as a ligature too?
A: The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets.
Their use is discouraged. No more will be encoded under any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the
font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, some (especially for non-Latin
scripts) have hundreds. It does not make sense to assign Unicode code points to all these font-specific possibilities. [JC]
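The compatibility status of the existing ligatures shows up in their decomposition mappings. The following is a minimal sketch, assuming Python's standard unicodedata module; the helper name expand_ligatures is illustrative only. Note that NFKC also folds many other compatibility characters, so it is not a ligature-only operation.

```python
import unicodedata

# Compatibility ligatures from the Alphabetic Presentation Forms block.
LIGATURES = ["\ufb01", "\ufb02", "\ufb06"]  # fi, fl, st


def expand_ligatures(text: str) -> str:
    """Replace compatibility characters, including ligatures, with plain sequences."""
    # NFKC applies compatibility decompositions, then recomposes canonically.
    return unicodedata.normalize("NFKC", text)


for lig in LIGATURES:
    print(f"U+{ord(lig):04X} {unicodedata.name(lig)} -> {expand_ligatures(lig)!r}")
```

Text stored as the plain sequence <f, i> can still be displayed with a ligature if the font provides one.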
Q: What about the “ct” ligature? Is there a character for that in Unicode?
A: No. The “ct” ligature is another example of a ligature of Latin letters commonly seen in older type styles.
As in the case of the “hr” ligature, display of a ligature is a matter for font design, and does not require separate
encoding of a character for the ligature. One simply represents the character sequence <c, t> in Unicode and depends
on font design and font attribute controls to determine whether the result is ligated in display (or in printing).
The same situation applies for ligatures involving long s and many others found in Latin typefaces.
Remember that the Unicode Standard is a character encoding standard, and is not intended to standardize
ligatures or other presentation forms, or any other aspects of the details of font and glyph design. The ligatures which you
can find in the Unicode Standard are compatibility encodings only—and are not meant to set a precedent requiring
the encoding of all ligatures as characters.
Q: I can't find the digraph “IE” in Unicode. Where do I look?
A: Look at “Where is my character?”
Q: My language needs the digraph “xy”. That digraph is distinctly different from “x” + “y”
and is treated as a unit in my language. How should it be represented in Unicode?
[Editor's note: “xy” is being used as a stand-in for particular digraphs from particular languages; this question, or something very similar to it, has been asked recently, for instance, about “ch” in Slovak, about “ng” in Tagalog, about “ie” in Maltese, and about “aa” in Danish.]
A: A digraph, for example “xy”, looks just like two ordinary letters in a row (in this example “x” and “y”), and there is already a way to represent it in Unicode: <U+0078, U+0079>. If instead, the digraph “xy” were represented by some strange symbol, then it would indeed be new; there would not be any existing way to represent it using already encoded Unicode characters. But it is not a strange symbol—it is just the digraph “xy”. [PC] & [AF]
Q: What speaks against encoding a distinct character? It would make it easier for software to recognize the digraph, and there would seem to be enough space in the Unicode Standard.
A: While it may seem that there is a lot of available space in the Unicode Standard, there are a number of issues. First, while the upper- and lowercase versions of a single digraph like “xy” only constitute a couple of characters, there are many languages in which digraphs may be treated specially. Second, each addition to the standard requires updates to the data tables and to all implementations and fonts that support the digraph. Third, there is the problem that people will not represent data consistently; some will use the new digraph character and some will not—you can count on that. Fourth, existing data will not magically update itself to make use of the new digraph.
Because of these considerations and others, there will be situations in which it will be necessary to represent data using the decomposed form anyway—as for example when passing around normalized data on the Internet.
In summary, the addition of a new digraph character has a fairly substantial (and costly) set of consequences, in return for a minimal set of benefits. Because of that, the UTC has taken the position that no new digraphs should be encoded, and that their special support should be handled by having implementations recognize the character sequence and treat it like a digraph. [PC] & [AF]
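Recognizing the character sequence in an implementation can be quite simple. The sketch below is only an illustration, not a recommended algorithm: the function name split_letters and the choice of “ch” as the digraph are assumptions made for the example.

```python
def split_letters(word, digraphs=("ch",)):
    """Split a word into letter units, treating listed digraphs as single units."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        # Compare case-insensitively so "ch", "Ch" and "CH" are all recognized.
        if len(pair) == 2 and pair.casefold() in digraphs:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


print(split_letters("chlieb"))
print(split_letters("Chata"))
```

The stored text remains the ordinary sequence of letters; only the processing step treats the pair as a unit.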
Q: How can I implement a different sorting order for a digraph “xy” in my language when I don't have a separate character code?
A: Several well-known collation techniques are used to handle sorting of digraph sequences in various languages, for example assigning weights to particular sequences of letters. These techniques are preferable to having a separately encoded digraph, because they are more general and extensible. [PC]
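As a rough illustration of weighting a letter sequence, the sketch below sorts with a simplified alphabet in which the digraph “ch” is placed after “h”. The alphabet, the word list, and the name sort_key are assumptions for the example; production software would normally rely on a collation library with language-specific tailorings rather than hand-written keys.

```python
import string

# Simplified alphabet: basic Latin letters with the digraph "ch" placed after "h".
ALPHABET = []
for letter in string.ascii_lowercase:
    ALPHABET.append(letter)
    if letter == "h":
        ALPHABET.append("ch")
WEIGHT = {unit: i for i, unit in enumerate(ALPHABET)}


def sort_key(word):
    """Build a list of weights, giving the sequence <c, h> a single weight."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in WEIGHT:
            key.append(WEIGHT[pair])
            i += 2
        else:
            # Characters outside the simplified alphabet sort last.
            key.append(WEIGHT.get(word[i], len(WEIGHT)))
            i += 1
    return key


words = ["chata", "cena", "hora", "dom", "ihla"]
print(sorted(words))                # code point order
print(sorted(words, key=sort_key))  # digraph-aware order
```

With the digraph-aware key, “chata” sorts between “hora” and “ihla” instead of between “cena” and “dom”.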
Q: How can I distinguish a true digraph from an accidental combination of the same letters?
A: If the same letter pair can sometimes be a digraph, and sometimes be just a pair of letters, then you can insert U+034F COMBINING GRAPHEME JOINER between the two letters to mark the pair as not being a digraph. See: What is the function of U+034F COMBINING GRAPHEME JOINER? [AF]
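The effect on a digraph-aware process can be sketched as follows. This is a toy tokenizer under stated assumptions: “xy” is the stand-in digraph from the questions above, and the name split_units is hypothetical. It only shows that a sequence interrupted by U+034F no longer matches the digraph.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def split_units(word, digraph="xy"):
    """Split into letter units; a CGJ between the letters blocks the digraph."""
    units = []
    i = 0
    while i < len(word):
        if word[i] == CGJ:
            # CGJ has no visible glyph; it only separates, so it is skipped.
            i += 1
        elif word[i:i + 2].casefold() == digraph:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


print(split_units("axy"))              # digraph recognized
print(split_units("ax" + CGJ + "y"))   # two separate letters
```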
Q: How can I get Unicode implementations to recognize the digraph more generally?
A: The Unicode CLDR project provides mechanisms that many software packages use to support the requirements of different languages. If the digraph sorts differently than the two separate characters, then it can be added to a collation table for the language. If the digraph needs to be listed separately, such as in an index, then it can be added to the exemplar characters. To request such a change, first check CLDR to see whether it has already been done, and file a change request if it has not.
Q: What are presentation forms?
A: Presentation forms are ligatures or glyph variants that are normally not encoded but are forms that show up during presentation of text, normally selected automatically by the layout software. Typical examples are the positional forms for Arabic letters. These don't need to be encoded, because the layout software determines the correct form from context.
For historical reasons, a substantial number of presentation forms was encoded in Unicode as compatibility characters, because legacy software or data included them. [AF]
Q: Why are “my” presentation forms NOT included in Unicode?
A: The Unicode Standard encodes characters and it is the function of rendering systems to select presentation forms as
needed to render those characters. Thus there is no need to encode presentation forms. [EM]
Q: Is it necessary to use the presentation forms that are defined in Unicode?
A: No, it is not necessary to use those presentation forms. Those forms were selected and identified in the
early days of developing Unicode when sophisticated rendering engines were not prevalent. A selected subset of the
presentation forms was included to provide users with a simple method to generate them. [MK]
Q: Can one use the presentation forms in a data file?
A: Doing so is not recommended, because presentation forms do not guarantee data integrity and
interoperability. In the particular case of Arabic, data files should include only the characters in the Arabic block,
U+0600 to U+06FF. [MK]
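A data file can be checked for Arabic presentation forms with a simple range test. The sketch below is a rough check under stated assumptions: the ranges approximate the Arabic Presentation Forms-A and Forms-B blocks (stopping at U+FEFC so that U+FEFF is not flagged), and the helper names are illustrative. NFKC is used here to map the forms back to nominal letters, but it also changes other compatibility characters, so it may not suit every data set.

```python
import unicodedata

# Approximate ranges for Arabic Presentation Forms-A and Forms-B.
PRESENTATION_RANGES = [(0xFB50, 0xFDFF), (0xFE70, 0xFEFC)]


def find_presentation_forms(text):
    """Return (index, character) pairs for characters in the presentation-form ranges."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if any(lo <= ord(ch) <= hi for lo, hi in PRESENTATION_RANGES)
    ]


def to_nominal(text):
    """Map presentation forms to their nominal characters via NFKC."""
    return unicodedata.normalize("NFKC", text)


# BEH INITIAL FORM, ALEF FINAL FORM, BEH ISOLATED FORM
legacy = "\ufe91\ufe8e\ufe8f"
print(len(find_presentation_forms(legacy)))
print([f"U+{ord(c):04X}" for c in to_nominal(legacy)])
```

After conversion the text contains only characters from the Arabic block, and the rendering system selects the positional forms.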
Q: What distinguishes presentation forms from other glyph variants encoded as compatibility characters?
A: Many characters with compatibility mappings are needed to correctly represent phonetic or mathematical notation. While presentation mechanisms, like styled text, could achieve the same visual representation, they cannot be automatically selected by the layout engine, but must be specified explicitly by the user. By using encoded characters rather than style markup, important semantic content for these notations will be preserved even if the text is converted to plain text. [AF]
Q: Why does Unicode contain whole alphabets of “italic” or “bold” characters in Plane 1?
A: Unicode does not contain general-purpose italic or bold alphabets. Encoding them would have provided too much flexibility, and would have tempted people to use such characters to create “poor man's markup” schemes rather than using proper markup such as SGML/HTML/XML. The mathematical letters and digits in Plane 1 are meant to be used only in mathematics, where the distinction between a plain and a bold letter is fundamentally semantic rather than stylistic. [JC]
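The character properties reflect this mathematical purpose. As a small illustration, assuming Python's standard unicodedata module (the helper name plain_fallback is hypothetical), a mathematical bold letter carries a font-variant compatibility mapping to the ordinary letter:

```python
import unicodedata

bold_a = "\U0001D400"


def plain_fallback(text):
    """Fold compatibility variants (including math alphanumerics) to plain letters."""
    return unicodedata.normalize("NFKC", text)


print(unicodedata.name(bold_a))           # MATHEMATICAL BOLD CAPITAL A
print(unicodedata.decomposition(bold_a))  # <font> 0041
print(plain_fallback(bold_a))             # A
```

Folding in this way discards the distinction that the mathematical notation relies on, so it is suitable for purposes such as search, not for storing formulas.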
Q: Wouldn't it have made more sense to simply have introduced a few new combining characters in Plane 0, such as: “make bold”, “make italic”, “make script”, “make fraktur”, “make double-struck”, “make sans serif”, “make monospace” and “make tag”?
A: This would have achieved the same effect (and with the same space requirements too, at least for things like “bold uppercase A” in UTF-16). One could also have made other characters bold, or created combinations of the attributes not currently represented.
However, it would have provided too much flexibility at the character encoding level and would have duplicated, and therefore conflicted with, some of the features present in proper markup languages such as SGML/HTML/XML. [JC] & [AF]
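The remark about space requirements can be checked with a little arithmetic. In the sketch below, no “make bold” combining character exists, so U+0301 COMBINING ACUTE ACCENT is used purely as a stand-in for any hypothetical combining character in Plane 0; the function name is also illustrative.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


math_bold_a = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A (Plane 1, surrogate pair)
stand_in = "A\u0301"        # base letter + stand-in for a hypothetical Plane 0 combining mark

print(utf16_code_units(math_bold_a))
print(utf16_code_units(stand_in))
```

Both cases take two code units, which is the equivalence the answer refers to.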
Q: Why doesn't Unicode have a full set of superscripts and subscripts?
A: The superscripted and subscripted characters encoded in Unicode are either compatibility characters encoded for roundtrip conversion of data from legacy standards, or are actually modifier letters used with particular meanings in technical transcriptional systems such as IPA and UPA. Those characters are not intended for general superscripting or subscripting of arbitrary text strings—for such textual effects, you should use text styles or markup in rich text, instead.
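The two kinds of characters can be told apart by their properties. A brief sketch, assuming Python's standard unicodedata module; the function name describe is illustrative:

```python
import unicodedata


def describe(ch):
    """Return the name, general category, and decomposition mapping of a character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )


for ch in ["\u00b2", "\u2082", "\u02b0"]:
    print(f"U+{ord(ch):04X}", describe(ch))
```

U+00B2 and U+2082 are numeric compatibility characters (category No), while U+02B0 is a modifier letter (category Lm) used in phonetic transcription.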
Q: What is the difference between “rich text” and “plain text”?
A: Rich text is text with all its formatting information: typeface, point size, weight,
kerning, and so on. Plain text is the underlying content stream to which formatting is applied.
One key distinction between the two is that rich text breaks the text up into runs and
applies uniform formatting to each run. As such, rich text is inherently stateful. Plain text is not stateful.
It should be possible to lose the first half of a block of plain text without any impact on rendering.
Unicode, by design, only deals with plain text. It doesn't provide a generalized solution to rich text issues. [JJ]
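The statefulness of rich text can be illustrated with a toy markup in which an asterisk toggles bold. The markup convention and the function render_states are invented for this sketch and do not correspond to any real format.

```python
def render_states(marked_up):
    """Return (character, is_bold) pairs for a toy markup where '*' toggles bold."""
    bold = False
    out = []
    for ch in marked_up:
        if ch == "*":
            bold = not bold  # formatting state carried forward through the text
        else:
            out.append((ch, bold))
    return out


text = "plain *bold text* plain"
tail = text[len(text) // 2:]  # lose the first half of the text

print(render_states(text)[-1])  # last character in the full text
print(render_states(tail)[-1])  # same character after losing the first half
```

Because the opening toggle is lost, the remaining characters are rendered with the wrong formatting. The same truncation applied to plain text leaves the surviving characters unchanged.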
Q: I'm reading a book which uses italic text to mean something distinct from roman text.
Doesn't that mean that italics should be encoded in Unicode?
A: No. It's common for specific formatting to be used to convey some of the semantic content—the meaning—of a text.
Unicode is not intended to reproduce the complete semantic content of all texts, but merely to provide plain text support
required by minimum legibility for all languages. [JJ]
Q: What does “minimum legibility” mean?
A: Minimum legibility refers to the minimum amount of information necessary to provide legible text for a
given language and nothing more. Minimally legible text can have a wide range of default formatting applied by the
rendering system and remain recognizably text belonging to a certain language as generally written. [JJ]
Q: I've spotted a sign which uses superscript text for a meaningful abbreviation. Doesn't
that mean that all the superscripted letters should be encoded in Unicode?
A: No. It's common for specific formatting to be used to convey some of the semantic content—the meaning—of
a text. As with italics, bold, or any other stylistic effect of this sort that conveys meaning, the appropriate mechanism to
use in such cases is style or markup in rich text.