L2/99-027
(supersedes L2/98-311)

Title: Towards criteria for encoding symbols

Source: Asmus Freytag

Date: Jan. 27, 1999 (*)

  

Symbols and plain text

The primary goal of Unicode is plain text encoding. Only a very limited class of symbols is strictly needed in plain text, if an e-mail message is taken as representative of plain text. A more expansive interpretation acknowledges plain text as the backbone of more elaborate and rich implementations: the plain text buffer underlying a rich document, the use of character codes to access symbols in a CAD package, or the implementation of a complex notational system such as musical notation.

In the latter cases, the class of symbols for which encoding makes sense becomes much larger. It encompasses all symbols for which it is not enough merely to provide an image, but whose identity must be automatically interpretable and processable in ways similar to the processes applied to text.
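To make this distinction concrete, the following sketch (in Python, using the standard unicodedata module; the language and the sample string are illustrative choices of mine, not part of this proposal) shows how generic text machinery can recover and act on the identity of an encoded symbol from its code point alone, something that is impossible when the symbol exists only as an inline graphic:

    import unicodedata

    # A fragment of plain text containing an encoded symbol, U+222B INTEGRAL.
    fragment = "Evaluate \u222b f(x) dx over [0, 1]."

    for ch in fragment:
        # Any process can name, search for, count, or index the symbol
        # without knowing anything about fonts or images.
        if unicodedata.category(ch).startswith("S"):
            print(hex(ord(ch)), unicodedata.name(ch))   # 0x222b INTEGRAL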

Classification

Symbols can be classified into two broad categories, depending on whether or not a symbol is part of a symbolic notational system.

Symbols that are part of a notational system

Symbols that are part of a notational system have uses and usage patterns analogous to those of the notational systems used for writing. They feature a defined(1) repertoire and established rules of processing and layout. In computers they are treated like a complex script, i.e. with their own layout engines (or sub-engines). Core user groups have shared legacy encodings that allow at least their data to be migrated to the new encoding.

Symbols that are not part of a notational system

There are many distinct repertoires of non-notational symbols, some comprising only a handful of symbols. The design and use of many of these symbols tend to be subject to quick shifts in fashion; in many cases they straddle the realms of the informative and the decorative. Layout is usually quite simple and directly equivalent to that of an inline graphic. In computers they are treated as uncoded entities today: they are provided as graphics or via fonts with ad-hoc encodings, with no additional support for rendering. Because of the ad-hoc nature of the legacy encodings for these symbols, data migration is near impossible.
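The migration problem can be illustrated with a deliberately hypothetical sketch (Python; the legacy font, its code assignments, and the mapping table below are invented for illustration and do not describe any real product). Data stored against an ad-hoc symbol font carries ordinary character codes whose meaning exists only in that one font, so conversion requires a per-font mapping table that in most cases was never published:

    # Invented mapping from codes in a fictitious legacy symbol font to Unicode.
    LEGACY_SYMBOL_FONT_MAP = {
        0x41: "\u260E",   # legacy code 'A' -> U+260E BLACK TELEPHONE
        0x42: "\u2702",   # legacy code 'B' -> U+2702 BLACK SCISSORS
        0x43: "\u2708",   # legacy code 'C' -> U+2708 AIRPLANE
    }

    def migrate(legacy_bytes: bytes) -> str:
        # Codes missing from the table cannot be migrated at all, which is
        # why migration of such data is described above as near impossible.
        return "".join(LEGACY_SYMBOL_FONT_MAP.get(b, "\uFFFD") for b in legacy_bytes)

    print(migrate(b"ABC"))   # -> a telephone, scissors, and airplane symbol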

Legacy symbols

An important subclass of non-notational symbols consists of the technical symbols found in legacy implementations and character sets for which plain text usage is established. Prominent examples are compatibility symbols used in character-mode text display, e.g. in terminal emulation.

Kinds of symbols that are found in the standard today

1. Part of a notational system

2. Compatibility for text mode display

3. Text ornaments

4. Traditional signs and icons

5. Abbreviations or units used with text or numbers

6. Other

 

Discussion

More than we have done before, we need to ask what the benefit of cataloging these entities will be, and whether we have a realistic expectation that users will be able to access them by the codes that we define. This is especially an issue for non-notational, non-compatibility symbols. As far as I can see, the trend so far has not been encouraging. In the last eight years we have seen enormous progress in the support of our encoding for letters and punctuation. Instead of a collection of fonts with legacy encodings, system and font vendors now provide fonts with a common encoding and, where scripts have similar typography, with combined repertoires.

The most widely available fonts for symbols, however, have not followed that trend. Users of these symbols continue to use ad-hoc fonts in their documents. Since one cannot easily convert existing data, I see more resistance to changing the status quo.

In conclusion, we would need to select a set of non-notational symbols for which the benefits of a shared encoding are so compelling that its existence would encourage a transition.

What criteria strengthen the case for encoding?

The symbol

What criteria weaken the case for encoding?

There is evidence that

Or, conversely, there is not enough evidence for its usage or its user community.

The ‘symbol fallacy’

The ‘symbol fallacy’ is to confuse the fact that "symbols have semantic content" with the claim that "in text, it is customary to use the symbol directly for communication". These are two different concepts. An example is the way traffic engineers communicate about traffic signs: in their (hand-)written communication the engineers are much more likely to use the words "stop sign" when referring to a stop sign than to draw the image. Mathematicians, by contrast, are more likely to draw an integral sign with its limits and integrand than to write the equation out in words.

Prioritization

Proper attention should be given to prioritizing the encoding of outstanding symbol repertoires that meet the criteria for encoding. Prioritization needs to address not only the limited code space available, particularly in the BMP, but also the allocation of other scarce resources such as the workload of the standards committees.

Completion

Mathematical operators are an example of an extensive set of symbols that is at present incomplete. The existing repertoire is so incomplete that it not only fails to meet the needs of the current user community, but even the use of the existing partial repertoire is precluded for many users. Therefore, completion of this repertoire has a high priority. Otherwise, for lack of usability, alternative encodings or markup will become the method of choice, stranding the large repertoire already encoded. In this particular example, the work is now being undertaken, and finishing it should be given a very high priority.

By extension, proposals that contain incomplete repertoires of a given category of symbol should be given a very low priority until they reach a level of completeness that makes a compelling case for a given user community.

Instability

The case has been made that either "rapid changes in the glyph representation" or "changes in the meaning of the character" have nothing to do with encoding (defined as a purely positional assignment), as long as the general category of use of the symbol does not change.

The counterexample to that is the Euro. There are glyph changes that cannot be absorbed quietly, since the new glyph bears so little relation to the old one that the change exceeds the implied range of glyphic variation.

If the same symbol (same glyph) acquires additional meaning(s), that would be OK. For some symbols (those that are part of a notational scheme) this could mean that the symbol would need to be processed differently, i.e. a change in operational semantics, a.k.a. character properties. Such a change would affect coding.

In either case, rapid change means by definition that the situation is not settled, and reliable information on the range of acceptable glyphic variation or on character properties is unavailable. That is a good reason to wait before encoding.
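To make the notion of operational semantics, i.e. character properties, more concrete, the following sketch (Python, standard unicodedata module; the choice of examples is mine) shows the kind of property a text process would consult, and therefore what is at stake if such a property were to change after encoding:

    import unicodedata

    # U+20AC EURO SIGN carries the property "currency symbol" (Sc), while
    # U+222B INTEGRAL carries "mathematical symbol" (Sm); a process that
    # formats amounts or lays out formulas branches on exactly this.
    for ch in ("\u20AC", "\u222B"):
        print(unicodedata.name(ch), unicodedata.category(ch))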

Perceived Usefulness

The fact that a symbol merely "seems to be useful or potentially useful" is precisely not a reason to encode it. Demonstrated usage, or demonstrated demand, on the other hand, does constitute a good reason to encode the symbol. The Euro is the classic example of a novel symbol for which there is demonstrated and strong demand.


(*) This document is a revision of an e-mail sent to UTC members on 10-6-98, incorporating some of the feedback received at that time as well as some more recent thoughts.

(1) As with all repertoires, I include a sizeable ‘gray zone’ in the term ‘defined’ here.

(2) The typical camping, boating, hiking, etc. symbols are often used in that way.