Ligatures, Digraphs and Presentation Forms
Q: What's the difference between a "Digraph" and a "Ligature"?
A: Digraphs and ligatures are both made by combining two glyphs. In a digraph, the glyphs remain separate but are
placed close together. In a ligature, the glyphs are fused into a single glyph. [JC]
Q: I can't find the digraph "IE" in Unicode. Where do I look?
A: Look at "Where is my character?"
Q: Why are "my" presentation forms NOT included in Unicode?
A: The Unicode Standard encodes characters, and it is the function of rendering systems to select presentation forms as
needed to render those characters. Thus there is no need to encode presentation forms. [EM]
Q: Is it necessary to use the presentation forms that are defined in Unicode?
A: No, it is not necessary to use those presentation forms. Those forms were selected and identified in the
early days of developing Unicode when sophisticated rendering engines were not prevalent. A selected subset of the presentation forms was
included to provide users with a simple method to generate them. [MK]
Q: Can one use the presentation forms in a data file?
A: Doing so is strongly discouraged because it guarantees neither data integrity nor
interoperability. In the particular case of Arabic, data files should include only the characters in the Arabic block, U+0600 to
U+06FF. [MK]
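As a rough illustration of the interoperability problem, the Python sketch below (standard `unicodedata` module only) stores the same Arabic word twice: once with characters from the Arabic block and once with presentation forms. The sample word and the helper name `uses_presentation_forms` are assumptions made for this example. NFKC normalization is used here only as a convenient way to fold the presentation forms back; it also folds many unrelated compatibility characters, so it is not a general-purpose cleanup step.

```python
import unicodedata

# The word "kitab" spelled with characters from the Arabic block.
nominal = "\u0643\u062A\u0627\u0628"

# The same word spelled with presentation forms:
# KAF initial, TEH medial, ALEF final, BEH isolated.
presentation = "\uFEDB\uFE98\uFE8E\uFE8F"


def uses_presentation_forms(text: str) -> bool:
    """Hypothetical helper: True if any character falls in the
    Arabic Presentation Forms-A range or among the Forms-B letters."""
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFC
        for ch in text
    )


# The two spellings look alike when rendered but do not compare equal.
same_before = presentation == nominal

# Compatibility normalization maps each presentation form to its nominal letter.
folded = unicodedata.normalize("NFKC", presentation)
same_after = folded == nominal
```

A search for the nominal spelling would miss the second string unless every consumer of the data normalized it first, which is the integrity risk the answer describes.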
Q: My language needs the digraph "xy". That digraph is distinctly different from "x" + "y" and is
treated as a unit in my language. Given all the available space in the Latin Extended area, why not just encode that digraph as another
character? If my "xy" digraph were represented by a unique symbol, it would certainly be included with all the other letters from my language.
[Editor's note: "xy" is being used as a stand-in for particular digraphs from particular languages; this
question, or something very similar to it, has been asked recently, for instance, about "ch" in Slovak, about "ng" in Tagalog, and about "ie"
in Maltese.]
A: If the digraph "xy" were some strange symbol, then it would indeed be new; there would not be any existing
way to represent it using already encoded Unicode characters. But it is not a strange symbol - it is just the digraph "xy", and there is
already a way to represent it in Unicode: <U+0078, U+0079>.
While it may seem that there is a lot of
available space for Latin letters in the Unicode Standard, and the
upper- and lowercase versions of the digraph "xy" only constitute a
couple of characters, in reality what's at stake here are not just a
couple of characters, but hundreds. This is a recurrent pattern. There
is a steady flow of requests for Latin digraphs or precomposed base
form + diacritic letters for various languages. The reason is always
some variant of "In my language 'xy' is a unit and not a sequence; it
has its own behavior, and so should be encoded separately."
It isn't just a matter of the standards
overhead faced by the Unicode Technical Committee in dealing with all
these encoding proposals for letters that can already be represented by
existing encoded characters used in sequences. There are deeper issues
pertaining to the implications for existing implementations and
existing data. If a new digraph "xy" is added, that implies the
addition of a new compatibility decomposition from "xy" to "x" + "y" in the data
tables. And that means people will have to
revise their software to handle it. Then there is the fact that people
will not represent data consistently; some will use the new digraph
character and some will not - you can count on that. Existing data
will not magically update itself to make use of the new digraph.
Because of these considerations and others, there will be situations in
which it will be necessary to represent data using the
decomposed form anyway - as for example when passing around normalized
data on the Internet. So the addition of a digraph character has a
fairly substantial (and costly) set of consequences, in return for a
minimal set of benefits. The net of this is generally negative, rather
than positive. Multiply that by hundreds of times for all of the other
digraphs and pre-composed diacritic-marked letters that exist for other
languages (and with perhaps a couple of thousand languages in the world
currently written with the Latin script, there are lots), and
you can see why the Unicode Technical Committee does not favor heading
down this path.
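The inconsistency described above can already be observed with the few digraph characters that were encoded for compatibility. The following Python sketch uses U+0133 LATIN SMALL LIGATURE IJ as a stand-in for the hypothetical "xy" character; the sample word is an assumption chosen for illustration.

```python
import unicodedata

# The Dutch word "ijs" stored two ways.
single = "\u0133s"   # U+0133 followed by "s"
sequence = "ijs"     # plain "i" + "j" + "s"

# The character database records a compatibility decomposition for U+0133.
decomposition = unicodedata.decomposition("\u0133")

# The two spellings are different strings...
equal_raw = single == sequence

# ...and canonical normalization (NFC) does not unify them...
equal_nfc = unicodedata.normalize("NFC", single) == sequence

# ...only compatibility normalization (NFKC) does, by falling back
# to the ordinary two-letter sequence.
equal_nfkc = unicodedata.normalize("NFKC", single) == sequence
```

Any software that compares, searches, or indexes such text has to account for both spellings, and the normalized form it ends up with is the plain letter sequence anyway.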
At this point, the UTC has a default position:
no new characters for digraphs or pre-composed diacritic letters should
be accepted for encoding as individual characters. If a convincing
enough case can be presented, there may always be exceptions to that
default position. To be convincing, the line of reasoning would have to
run along these lines: there are demonstrable processing issues in the
writing system for this language that cannot adequately be dealt with
using the existing encoded characters, but which could be resolved by
the addition of this new character. ("xy", or whatever.) But the
arguments have to be very convincing, and other approaches to dealing
with the perceived problem have to be explored and to be shown
inadequate. For example, citation of a different sorting order for "xy"
in a language is not very convincing, because well-known collation
techniques are used to handle sorting of digraphic sequences in various
languages; for sorting, the alternative approaches available for using
weights for sequences of letters are preferable to having a
separately encoded digraph, because those approaches are more general
and extensible. [PC] & [KW]
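To make the sorting argument concrete, here is a toy Python sketch of a sort key that gives a weight to the sequence "ch" and places it after "h", as Slovak ordering does. The function name `digraph_sort_key`, the simplified alphabet, and the word list are assumptions for illustration; the sketch handles only lowercase ASCII words, and real systems would use a full collation implementation with language-specific tailoring rather than this hand-built table.

```python
# Simplified alphabet: the unit "ch" gets its own position after "h".
UNITS = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHTS = {unit: weight for weight, unit in enumerate(UNITS)}


def digraph_sort_key(word: str) -> tuple:
    """Hypothetical helper: map a lowercase ASCII word to a tuple of
    weights, treating the sequence "ch" as one sorting unit."""
    weights = []
    i = 0
    while i < len(word):
        if word[i:i + 2] == "ch":
            weights.append(WEIGHTS["ch"])
            i += 2
        else:
            weights.append(WEIGHTS[word[i]])
            i += 1
    return tuple(weights)


words = ["chata", "cena", "hora", "idea", "dom"]

by_code_point = sorted(words)
by_digraph_weights = sorted(words, key=digraph_sort_key)
```

With plain code point order "chata" lands between "cena" and "dom"; with the weighted key it sorts after "hora". The text itself still contains only "c" followed by "h", and the same table could be extended with further sequences without any new encoded characters.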
Q: I have here a bunch of
manuscripts which use the "hr" ligature (for example) extensively. I
see you have encoded ligatures for "fi", "fl", and even "st", but not
"hr". Can I get "hr" encoded as a ligature too?
A: The existing ligatures exist basically for
compatibility and round-tripping with non-Unicode character sets. Their
use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a
modern font is asked to display "h" followed by "r", and the font has
an "hr" ligature in it, it can display the ligature. Some fonts have no
ligatures, some (especially for non-Latin scripts) have hundreds. It
does not make sense to assign Unicode code points to all these
font-specific possibilities. [JC]
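One practical reason the existing ligature characters are discouraged can be sketched in Python: text that contains them no longer matches a plain search. The sample strings below are assumptions for illustration.

```python
import unicodedata

# "find" typed with U+FB01 LATIN SMALL LIGATURE FI instead of "f" + "i".
with_ligature = "\uFB01nd"

# A plain substring search for "fi" misses it.
found_raw = "fi" in with_ligature

# After compatibility normalization the ordinary letters are back.
found_folded = "fi" in unicodedata.normalize("NFKC", with_ligature)

# Names of three of the compatibility ligatures mentioned in the question.
names = [unicodedata.name(ch) for ch in ("\uFB01", "\uFB02", "\uFB06")]
```

Text stored as "h" followed by "r" has no such problem: it stays searchable, and a font that contains an "hr" ligature can still choose to display one.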
Q: What about the "ct" ligature? Is there a character for that in Unicode?
A: No. The "ct" ligature is another example of a ligature of Latin letters commonly seen in older type styles.
As in the case of the "hr" ligature, display of a ligature is a matter for font design, and does not require separate
encoding of a character for the ligature. One simply represents the character sequence <c, t> in Unicode and depends
on font design and font attribute controls to determine whether the result is ligated in display (or in printing).
The same situation applies for ligatures involving long s and many others found in Latin typefaces.
Remember that the Unicode Standard is a character encoding standard, and is not intended to standardize ligatures
or other presentation forms, or any other aspects of the details of font and glyph design. The ligatures which you can
find in the Unicode Standard are compatibility encodings only—and are not meant to set a precedent requiring
the encoding of all ligatures as characters. [KW]
Q: What are all those duplicated math
alphabet characters FOR? Wouldn't it
have made more sense to simply have introduced a few new combining
characters in plane 0, such as: "make bold", "make italic", "make script",
"make fraktur", "make double-struck", "make sans serif", "make monospace"
and "make tag"? This would not only have achieved the same effect (and
with the same space requirements too, at least for things like "bold
uppercase A" in UTF-16), but with much greater flexibility (in that you
could also make other characters bold too, and you could create
combinations of the attributes not currently represented).
A: It would have provided too much flexibility, and would
have tempted people to use such characters to create "poor man's markup"
schemes rather than using proper markup such as SGML/HTML/XML. The
mathematical letters and digits are meant to be used only in mathematics,
where the distinction between a plain and a bold letter is fundamentally
semantic rather than stylistic. [JC]
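For readers who want to inspect one of these characters, the Python sketch below looks at U+1D400 MATHEMATICAL BOLD CAPITAL A. It also checks the space claim made in the question: the character occupies two UTF-16 code units, the same as a base letter plus one hypothetical combining "make bold" character would.

```python
import unicodedata

bold_a = "\U0001D400"

name = unicodedata.name(bold_a)

# The character database ties it to plain "A" through a <font> decomposition.
decomposition = unicodedata.decomposition(bold_a)

# Compatibility normalization therefore discards the bold distinction.
folded = unicodedata.normalize("NFKC", bold_a)

# In UTF-16 the character is a surrogate pair, i.e. two 16-bit code units.
utf16_hex = bold_a.encode("utf-16-be").hex()
code_units = len(bold_a.encode("utf-16-be")) // 2
```

Because NFKC folds the character to a plain "A", mathematical text that relies on the bold distinction should not be passed through compatibility normalization.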
Q: Why doesn't Unicode have a full set of superscripts and subscripts?
A: Unicode includes true superscripted Latin characters for round-trip compatibility with
other standards. Unicode also includes other characters which look like and are typographically derived
from superscripted Latin or Greek characters, such as U+02B0 MODIFIER LETTER SMALL H. Despite their
appearance, these are not true superscripts and should not be used as such. The situation is the same for subscripts.
Unicode considers true superscripts and subscripts to be a matter of rich text formatting and,
as such, out of the standard's scope. [JJ]
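The distinction can be seen in the character properties. The Python sketch below compares a compatibility superscript, a compatibility subscript, and the modifier letter mentioned above, using the standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

superscript_two = "\u00B2"   # SUPERSCRIPT TWO
subscript_two = "\u2082"     # SUBSCRIPT TWO
modifier_h = "\u02B0"        # MODIFIER LETTER SMALL H

# General categories: the digits are "No" (other number),
# while the modifier letter is "Lm" (modifier letter).
categories = [
    unicodedata.category(ch)
    for ch in (superscript_two, subscript_two, modifier_h)
]

# All three carry a compatibility decomposition to an ordinary character.
decompositions = [
    unicodedata.decomposition(ch)
    for ch in (superscript_two, subscript_two, modifier_h)
]
```

The modifier letter is classified as a letter in its own right, which is consistent with the advice not to use it as a general-purpose superscript "h".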
Q: What is the difference between "rich text" and "plain text"?
A: Rich text is text with all its formatting information: typeface, point size, weight,
kerning, and so on. Plain text is the underlying content stream to which formatting is applied.
One key distinction between the two is that rich text breaks the text up into runs and
applies uniform formatting to each run. As such, rich text is inherently stateful. Plain text is not stateful.
It should be possible to lose the first half of a block of plain text without any impact on rendering.
Unicode, by design, only deals with plain text. It doesn't provide a generalized solution to rich text issues. [JJ]
Q: I'm reading a book which uses italic text to mean something distinct from roman text.
Doesn't that mean that italics should be encoded in Unicode?
A: No. It's common for specific formatting to be used to convey some of the semantic content (the meaning) of a text.
Unicode is not intended to reproduce the complete semantic content of all texts, but merely to provide plain text support
required by minimum legibility for all languages. [JJ]
Q: What does "minimum legibility" mean?
A: Minimum legibility refers to the minimum amount of information necessary to provide legible text for a
given language and nothing more. Minimally legible text can have a wide range of default formatting applied by the
rendering system and remain recognizably text belonging to a certain language as generally written. [JJ]