Criteria for Encoding Symbols
The text of this document is copied from L2/99-027, which was approved as a US National Body position by
the INCITS/L2 committee in 1999. Some of these criteria have been further refined in practice
since that time, as the UTC has reviewed various proposals
for encoding new symbols. Please see the section on symbols in our Proposal Guidelines.
Title: Towards criteria for encoding symbols
Date: January 27, 1999
Symbols and plain text
The primary goal of Unicode is plain text encoding. Only a very limited
class of symbols is strictly needed in plain text, if one takes an e-mail
message as representative of plain text. A more expansive interpretation
acknowledges plain text as the backbone of more elaborate and rich
implementations. Examples are the plain text buffer of a rich document,
character codes used to access symbols in a CAD package, or the
implementation of a complex notational system such as musical notation.
In the latter cases, the class of symbols for which encoding makes sense
becomes much larger. It encompasses all symbols for which it is not enough
merely to provide an image, but whose identity must be automatically
interpretable and processable in ways similar to the processes applied to text.
Symbols can be classified in two broad categories, depending on whether a
symbol is part of a symbolic notational system or not.
Symbols that are part of a notational system
Symbols that are part of a notational system have uses and usage patterns
analogous to the notational systems used for writing. They feature a defined(1)
repertoire and established rules of processing and layout. In computers they are
treated similarly to a complex script, i.e. with their own layout engines (or
sub-engines). Core user groups have shared legacy encodings that allow at least
their data to be migrated to the new encoding.
Symbols that are not part of a notational system
There are many distinct repertoires of non-notational symbols, some with very
few members. The design and use of many of these symbols tend to be
subject to quick shifts in fashion; in many cases they straddle the realms of
the informative and the decorative. Layout is usually quite simple and directly
equivalent to an inline graphic. In computers they are treated as uncoded
entities today: they are provided as graphics or via fonts with ad-hoc
encodings, with no additional support for rendering. Because of the ad-hoc
nature of the legacy encodings for these symbols, data migration is nearly
impossible.
An important subclass of non-notational symbols consists of technical symbols
found in legacy implementations and character sets for which plain text usage is
established. Prominent examples are compatibility symbols used in character-mode
text display, e.g. terminal emulation.
Kinds of symbols that are found in the standard today
1. Part of a notational system
- mathematical operators
- electrotechnical symbols
- musical notations (accepted for Plane 1)
2. Compatibility for text mode display
- chess pieces
- forms and blocks
- control pictures
- integral pieces
3. Text ornaments
4. Traditional signs and icons
- astrological symbols
- religious symbols
5. Abbreviations or units used with text or numbers
- currency symbols
- prescription sign, etc.
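The point of encoding such symbols, rather than shipping them as images, is that each code point carries a machine-readable identity. As an illustrative aside (not part of the original 1999 text), Python's standard unicodedata module can report that identity for one symbol from several of the categories above:

```python
import unicodedata

# One symbol from several of the categories above: a mathematical
# operator, a text-mode chess piece, and a currency sign.
for ch in ["\u222B", "\u2654", "\u20AC"]:
    name = unicodedata.name(ch)
    category = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {ch} {name} (category {category})")
# U+222B ∫ INTEGRAL (category Sm)
# U+2654 ♔ WHITE CHESS KING (category So)
# U+20AC € EURO SIGN (category Sc)
```

Any conformant process can recover the same name and category, which is precisely what an ad-hoc font encoding cannot offer.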
More than before, we need to ask what the benefit of cataloging these
entities will be, and whether we have a realistic expectation that users will be
able to access them by the codes that we define. This is especially an issue for
non-notational, non-compatibility symbols. As far as I can see, the trend so far
has not been encouraging. In the last eight years we have seen enormous progress
in support of our encoding for letters and punctuation. Instead of a collection
of fonts with legacy encodings, system and font vendors now provide fonts with a
common encoding and, where scripts have similar typography, with a combined
repertoire.
The most widely available fonts for symbols, however, have not
followed that trend. Users of these symbols continue to use ad-hoc fonts in
their documents. Since one cannot easily convert existing data, I see more
resistance to changing the status quo.
In conclusion, we would need to select a set of non-notational symbols for
which the benefits of a shared encoding are so compelling that its existence
would encourage a transition.
What criteria strengthen the case for encoding?
- is typically used as part of computer applications (e.g. CAD symbols)
- has well defined user community / usage
- always occurs together with text or numbers (unit, currency, estimated symbol)
- must be searchable or indexable
- is customarily used in tabular lists as shorthand for characteristics
(e.g. check mark, maru etc.)
- is part of a notational system
- has well-defined semantics
- has semantics that lend themselves to computer processing
- completes a class of symbols already in the standard
- is letterlike
(i.e. should vary with the surrounding font style)
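Several of these criteria, notably searchability and co-occurrence with text or numbers, reduce to the observation that an encoded symbol participates in ordinary string processing. A minimal sketch (my illustration, with invented sample text, not from the original document):

```python
# Sample text mixing words, numbers, and encoded symbols:
# a euro sign used as a unit and a check mark as tabular shorthand.
text = "Budget: 100 \u20AC \u2713"

# Plain substring search and counting work with no special support.
assert "\u20AC" in text           # searchable
assert text.count("\u2713") == 1  # indexable

# A symbol stored as an image, or as an arbitrary byte in an ad-hoc
# font encoding, would be invisible to both operations.
print("search and count succeed on encoded symbols")
```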
What criteria weaken the case for encoding?
There is evidence that:
- the symbol is primarily used freestanding (traffic signs)
- the notational system is not widely used on computers (e.g. dance notation)
- the symbol is part of a set undergoing rapid changes
- the symbol is trademarked (unless requested by the owner)
(logos, Der grüne Punkt, CE symbol, UL symbol, etc.)
- the symbol is purely decorative
- it is acceptable to ignore its identity in processing
- font shifting is the preferred access and the user community is happy with
it (logos, etc.)
Or, conversely, there is not enough evidence for its usage or its user
community.
The 'symbol fallacy'
The 'symbol fallacy' is to confuse the fact that "symbols have semantic
content" with the claim that "in text, it is customary to use the symbol
directly for communication". These are two different things. An example is
traffic signs and the communication of traffic engineers about traffic signs. In
their (hand-)written communication the engineers are much more likely to use the
words "stop sign" when referring to a stop sign than to draw the image.
Mathematicians, by contrast, are more likely to draw an integral sign with its
limits and integrands than to write an equation in words.
Proper attention should be given to prioritizing the encoding of outstanding
symbol repertoires that meet the criteria for encoding. Prioritization needs to
address not only the limited code space available, particularly in the BMP, but
also the allocation of other scarce resources, such as the work load of the
standards committees.
Mathematical operators are an example of an extensive set of symbols that is
at present incomplete. The existing repertoire is so incomplete
that not only does it not meet the needs of the current user community, but even
the use of the existing partial repertoire is precluded for many users.
Therefore, completion of this repertoire has a high priority. Otherwise, for
lack of usability, alternative encodings or markup will become the method of
choice, stranding the large repertoire already encoded. In this particular
example, the work is now being undertaken, and finishing it should be given a
very high priority.
By extension, proposals that contain incomplete repertoires of a given
category of symbol should be given a very low priority until they reach a level
of completeness that makes a compelling case for a given user community.
The case has been made that either "rapid changes in the glyph
representation" or "changes in the meaning of the character" have nothing to do
with encoding (defined as a purely positional assignment), as long as the
general category of use of the symbol does not change.
The counterexample to that is the euro. There are glyph changes that cannot
be absorbed quietly, because the new glyph bears so little relation to the old
one that the change exceeds the implied range of glyphic variation.
If the same symbol (same glyph) acquires additional meaning(s), that would be
acceptable. For some symbols (those that are part of a notational scheme) this
could mean that the symbol would need to be processed differently, i.e. a change
in operational semantics, a.k.a. character properties. Such a change would
affect coding.
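As an aside, such "operational semantics" are visible today as standardized character properties. For instance (my illustration, not part of the original document), the euro sign's bidirectional class differs from that of an ordinary math symbol, which changes how a generic layout process treats it next to digits:

```python
import unicodedata

# General category and bidirectional class are examples of the
# character properties that generic processes consult.
print(unicodedata.category("\u20AC"), unicodedata.bidirectional("\u20AC"))
# Sc ET  -- currency symbol; attaches to adjacent European numbers
print(unicodedata.category("\u222B"), unicodedata.bidirectional("\u222B"))
# Sm ON  -- math symbol; directionally neutral
```

If a symbol's meaning changed in a way that altered such properties, existing implementations would be affected, which is why such changes do bear on coding.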
In either case, rapid change means by definition that the situation is not
settled, and reliable information on the range of acceptable glyphic variation
or character properties is unavailable. Therefore it is a good reason to wait
before encoding.
The fact that a symbol merely "seems to be useful or potentially useful" is
precisely not a reason to code it. Demonstrated usage, or demonstrated demand,
on the other hand, does constitute a good reason to encode the symbol. The euro
is the classic example of a novel symbol for which there is demonstrated and
widespread demand.
(1) As with all repertoires, I include a sizeable 'gray zone' in the term
'defined'.