Ligatures, Digraphs and Presentation Forms
Q: What's the difference between
a "Digraph" and a "Ligature"?
A: Digraphs and ligatures are both made by
combining two glyphs. In a digraph, the glyphs remain separate but are
placed close together. In a ligature, the glyphs are fused into a
single glyph. [JC]
Q: I can't find the digraph "IE"
in Unicode. Where do I look?
A: Look at "Where is my character?"
Q: Why are "my" presentation
forms NOT included in Unicode?
A: The Unicode Standard encodes characters, and it
is the function of rendering systems to select presentation forms as
needed to render those characters. Thus there is no need to encode
presentation forms. [EM]
Q: Is it necessary to use the
presentation forms that are defined in Unicode?
A: No, it is not necessary to use those
presentation forms. Those forms were selected and identified in the
early days of developing Unicode when sophisticated rendering engines
were not prevalent. A selected subset of the presentation forms was
included to provide users with a simple method to generate them. [MK]
Q: Can one use the presentation
forms in a data file?
A: Doing so is strongly discouraged,
because it does not guarantee data integrity and
interoperability. In the particular case of Arabic, data files
should include only the characters in the Arabic block, U+0600 to
U+06FF. [MK]
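One way to check a data file for Arabic presentation forms, and to fold them back to characters in the Arabic block, is sketched below. It assumes Python's standard unicodedata module. The helper names are hypothetical, and NFKC normalization also folds many other compatibility characters, so treat this as an illustration rather than a recommended cleanup procedure.

```python
import unicodedata

# Compatibility decomposition tags used by the positional presentation forms.
POSITIONAL_TAGS = ("<isolated>", "<initial>", "<medial>", "<final>")

def find_presentation_forms(text):
    """Return the characters in text that decompose with a positional tag."""
    return [ch for ch in text
            if unicodedata.decomposition(ch).startswith(POSITIONAL_TAGS)]

def fold_presentation_forms(text):
    """Replace compatibility characters with their nominal equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)

sample = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
folded = fold_presentation_forms(sample)

print(find_presentation_forms(sample))        # the presentation form is flagged
print([f"U+{ord(ch):04X}" for ch in folded])  # ['U+0644', 'U+0627']
```

After folding, the text contains only LAM and ALEF from the Arabic block, and a rendering system is left to choose the ligated shape.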
Q: My language needs the digraph
"xy". That digraph is distinctly different from "x" + "y" and is
treated as a unit in my language. Given all the available space in the
Latin Extended area, why not just encode that digraph as another
character? If my "xy" digraph were represented by a unique symbol, it
would certainly be included with all the other letters from my language.
[Editor's note: "xy" is being used as a
stand-in for particular digraphs from particular languages; this
question, or something very similar to it, has been asked recently, for
instance, about "ch" in Slovak, about "ng" in Tagalog, and about "ie"
in Maltese.]
A: If the digraph "xy" were some strange symbol,
then it would indeed be new; there would not be any existing
way to represent it using already encoded Unicode characters. But it is
not a strange symbol - it is just the digraph "xy", and there is
already a way to represent it in Unicode: <U+0078, U+0079>.
While it may seem that there is a lot of
available space for Latin letters in the Unicode Standard, and the
upper- and lowercase versions of the digraph "xy" only constitute a
couple of characters, in reality what's at stake here are not just a
couple of characters, but hundreds. This is a recurrent pattern. There
is a steady flow of requests for Latin digraphs or precomposed base
form + diacritic letters for various languages. The reason is always
some variant of "In my language 'xy' is a unit and not a sequence; it
has its own behavior, and so should be encoded separately."
It isn't just a matter of the standards
overhead faced by the Unicode Technical Committee in dealing with all
these encoding proposals for letters that can already be represented by
sequences of existing encoded characters. There are deeper issues
pertaining to the implications for existing implementations and
existing data. If a new digraph "xy" is added, that implies the
addition of a new compatibility decomposition from "xy" to "x" + "y" in the data
tables. And that means people will have to
revise their software to handle it. Then there is the fact that people
will not represent data consistently; some will use the new digraph
character and some will not - you can count on that. Existing data
will not magically update itself to make use of the new digraph.
Because of these considerations and others, there will be situations in
which it will be necessary to represent data using the
decomposed form anyway - as for example when passing around normalized
data on the Internet. So the addition of a digraph character has a
fairly substantial (and costly) set of consequences, in return for a
minimal set of benefits. The net result is generally negative rather
than positive. Multiply that by the hundreds of other
digraphs and pre-composed diacritic-marked letters that exist for other
languages (and with perhaps a couple of thousand languages in the world
currently written with the Latin script, there are lots), and
you can see why the Unicode Technical Committee does not favor heading
down this path.
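The inconsistency described above can already be observed with the few digraph characters that were encoded for compatibility. The sketch below uses U+01C9 LATIN SMALL LETTER LJ as a stand-in for a hypothetical "xy" character and assumes Python's standard unicodedata module.

```python
import unicodedata

digraph_char = "\u01C9"   # LATIN SMALL LETTER LJ, a single code point
letter_sequence = "lj"    # <U+006C, U+006A>, two code points

# The two spellings look alike but do not compare equal as stored.
print(digraph_char == letter_sequence)           # False

# The data tables carry a compatibility decomposition for the digraph.
print(unicodedata.decomposition(digraph_char))   # <compat> 006C 006A

# Compatibility normalization maps the digraph back to the sequence,
# so normalized data ends up in the decomposed form anyway.
print(unicodedata.normalize("NFKC", digraph_char) == letter_sequence)  # True
```

Any software that searches, compares, or indexes such text has to account for both spellings, which is the cost the answer above refers to.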
At this point, the UTC has a default position:
no new characters for digraphs or pre-composed diacritic letters should
be accepted for encoding as individual characters. If a convincing
enough case can be presented, there may always be exceptions to that
default position. To be convincing, the reasoning would have to
run along these lines: there are demonstrable processing issues in the
writing system for this language that cannot adequately be dealt with
using the existing encoded characters, but which could be resolved by
the addition of this new character ("xy", or whatever). But the
arguments have to be very convincing, and other approaches to dealing
with the perceived problem have to be explored and shown to be
inadequate. For example, citation of a different sorting order for "xy"
in a language is not very convincing, because well-known collation
techniques are used to handle sorting of digraphic sequences in various
languages; for sorting, the alternative approaches available for using
weights for sequences of letters are preferable to having a
separately encoded digraph, because those approaches are more general
and extensible. [PC] & [KW]
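As a rough illustration of assigning a weight to a sequence of letters, the sketch below sorts words so that "ch" is treated as a single unit ordered after "h". The alphabet is a deliberately simplified, hypothetical one (plain a-z plus "ch"), not the full Slovak alphabet, and production systems would normally rely on the Unicode Collation Algorithm with a language tailoring rather than hand-written code like this.

```python
import re

# Hypothetical, simplified alphabet: a-z with the unit "ch" placed after "h".
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHT = {unit: index for index, unit in enumerate(ALPHABET)}

# Match the digraph first so that "ch" is never split into "c" + "h".
TOKEN = re.compile("ch|.", re.DOTALL)

def sort_key(word):
    """Map a word to a list of weights, one per collation unit."""
    units = TOKEN.findall(word.lower())
    # Units outside the toy alphabet sort after everything else.
    return [WEIGHT.get(unit, len(ALPHABET)) for unit in units]

words = ["cesta", "chata", "hora", "idea", "dom"]
result = sorted(words, key=sort_key)
print(result)  # ['cesta', 'dom', 'hora', 'chata', 'idea']
```

The text itself still stores "ch" as the ordinary sequence <c, h>; only the comparison key treats it as a unit, which is why no separate character is needed for sorting.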
Q: I have here a bunch of
manuscripts which use the "hr" ligature (for example) extensively. I
see you have encoded ligatures for "fi", "fl", and even "st", but not
"hr". Can I get "hr" encoded as a ligature too?
A: The existing ligatures are encoded mainly for
compatibility and round-tripping with non-Unicode character sets. Their
use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a
modern font is asked to display "h" followed by "r", and the font has
an "hr" ligature in it, it can display the ligature. Some fonts have no
ligatures, some (especially for non-Latin scripts) have hundreds. It
does not make sense to assign Unicode code points to all these
font-specific possibilities. [JC]
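For text that already contains the compatibility ligature characters, one possible cleanup is to expand them to plain letter sequences and leave ligation to the font. The sketch below assumes Python's standard unicodedata module and handles only the Latin ligatures U+FB00 to U+FB06. The function name is hypothetical.

```python
import unicodedata

def expand_latin_ligatures(text):
    """Expand the Latin compatibility ligatures U+FB00..U+FB06 to plain letters."""
    out = []
    for ch in text:
        if 0xFB00 <= ord(ch) <= 0xFB06:
            # Compatibility decomposition, e.g. U+FB01 becomes "f" + "i".
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            # Leave every other character untouched.
            out.append(ch)
    return "".join(out)

sample = "\uFB01nd \uFB02ow \uFB06op"
expanded = expand_latin_ligatures(sample)
print(expanded)  # find flow stop
```

A font with "fi" or "st" ligatures can still display the expanded text with ligated glyphs; the stored characters stay the same either way.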
Q: What about the "ct" ligature? Is there a character for that in Unicode?
A: No. The "ct" ligature is another example of a ligature of Latin letters commonly seen in older type styles.
As in the case of the "hr" ligature, display of a ligature is a matter for font design and does not require separate
encoding of a character for the ligature. One simply represents the character sequence <c, t> in Unicode and depends
on font design and font attribute controls to determine whether the result is ligated in display (or in printing).
The same situation applies for ligatures involving long s and many others found in Latin typefaces.
Remember that the Unicode Standard is a character encoding standard, and is not intended to standardize ligatures
or other presentation forms, or any other aspects of the details of font and glyph design. The ligatures which you can
find in the Unicode Standard are compatibility encodings only—and are not meant to set a precedent requiring
the encoding of all ligatures as characters. [KW]