L2/99-027
(supersedes L2/98-311)

Title: Towards criteria for encoding symbols

Source: Asmus Freytag

Date: Jan. 27, 1999 (*)

  

Symbols and plain text

The primary goal of Unicode is plain text encoding. Only a very limited class of symbols is strictly needed in plain text, if an e-mail message is taken as representative of plain text. A more expansive interpretation acknowledges plain text as the backbone of more elaborate and rich implementations: the plain text buffer underlying a rich document, the use of character codes to access symbols in a CAD package, or the implementation of a complex notational system such as musical notation.

In the latter cases, the class of symbols for which encoding makes sense becomes much larger. It encompasses all symbols for which it is not enough merely to provide an image, but whose identity must be automatically interpretable and processable in ways similar to the processes applied to text.
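To make this distinction concrete, the following sketch (in Python, using the standard unicodedata module; the language and the sample string are illustrative choices of mine, not part of this proposal) shows how generic text machinery can recover and act on the identity of an encoded symbol from its code point alone, something that is impossible when the symbol exists only as an inline graphic:

    import unicodedata

    # A fragment of plain text containing an encoded symbol, U+222B INTEGRAL.
    fragment = "Evaluate \u222b f(x) dx over [0, 1]."

    for ch in fragment:
        # Any process can name, search for, count, or index the symbol
        # without knowing anything about fonts or images.
        if unicodedata.category(ch).startswith("S"):
            print(hex(ord(ch)), unicodedata.name(ch))   # 0x222b INTEGRAL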

Classification

Symbols can be classified into two broad categories, depending on whether or not a symbol is part of a symbolic notational system.

Symbols that are part of a notational system

Symbols that are part of a notational system have uses and usage patterns analogous to those of the notational systems used for writing. They feature a defined(1) repertoire and established rules of processing and layout. In computers they are treated like a complex script, i.e. with their own layout engines (or sub-engines). Core user groups have shared legacy encodings that allow at least their data to be migrated to the new encoding.

Symbols that are not part of a notational system

There are many distinct repertoires of non-notational symbols, some comprising only a handful of symbols. The design and use of many of these symbols tend to be subject to quick shifts in fashion; in many cases they straddle the realms of the informative and the decorative. Layout is usually quite simple and directly equivalent to that of an inline graphic. In computers they are treated as uncoded entities today: they are provided as graphics or via fonts with ad-hoc encodings, with no additional support for rendering. Because of the ad-hoc nature of the legacy encodings for these symbols, data migration is near impossible.
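The migration problem can be illustrated with a deliberately hypothetical sketch (Python; the legacy font, its code assignments, and the mapping table below are invented for illustration and do not describe any real product). Data stored against an ad-hoc symbol font carries ordinary character codes whose meaning exists only in that one font, so conversion requires a per-font mapping table that in most cases was never published:

    # Invented mapping from codes in a fictitious legacy symbol font to Unicode.
    LEGACY_SYMBOL_FONT_MAP = {
        0x41: "\u260E",   # legacy code 'A' -> U+260E BLACK TELEPHONE
        0x42: "\u2702",   # legacy code 'B' -> U+2702 BLACK SCISSORS
        0x43: "\u2708",   # legacy code 'C' -> U+2708 AIRPLANE
    }

    def migrate(legacy_bytes: bytes) -> str:
        # Codes missing from the table cannot be migrated at all, which is
        # why migration of such data is described above as near impossible.
        return "".join(LEGACY_SYMBOL_FONT_MAP.get(b, "\uFFFD") for b in legacy_bytes)

    print(migrate(b"ABC"))   # -> a telephone, scissors, and airplane symbol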

Legacy symbols

An important subclass of non-notational symbols consists of the technical symbols found in legacy implementations and character sets for which plain text usage is established. Prominent examples are compatibility symbols used in character-mode text display, e.g. in terminal emulation.

Kinds of symbols that are found in the standard today

1. Part of a notational system

2. Compatibility for text mode display

3. Text ornaments

4. Traditional signs and icons

5. Abbreviations or units used with text or numbers

6. Other

 

Discussion

More than we have done before, we need to ask what the benefit of cataloging these entities will be, and whether we have a realistic expectation that users will be able to access them by the codes that we define. This is especially an issue for non-notational, non-compatibility symbols. As far as I can see, the trend so far has not been encouraging. In the last eight years we have seen enormous progress in the support of our encoding for letters and punctuation. Instead of a collection of fonts with legacy encodings, system and font vendors now provide fonts with a common encoding and, where scripts have similar typography, with combined repertoires.

The most widely available fonts for symbols, however, have not followed that trend. Users of these symbols continue to use ad-hoc fonts in their documents. Since one cannot easily convert existing data, I see more resistance to changing the status quo.

In conclusion, we would need to select a set of non-notational symbols for which the benefits of a shared encoding are so compelling that its existence would encourage a transition.

What criteria strengthen the case for encoding?

The symbol

What criteria weaken the case for encoding?

There is evidence that

Or, conversely, there is not enough evidence for its usage or its user community.

The ‘symbol fallacy’

The ‘symbol fallacy’ is to confuse the fact that "symbols have semantic content" with the claim that "in text, it is customary to use the symbol directly for communication". These are two different concepts. An example is the way traffic engineers communicate about traffic signs: in their (hand-)written communication the engineers are much more likely to use the words "stop sign" when referring to a stop sign than to draw the image. Mathematicians, by contrast, are more likely to draw an integral sign with its limits and integrand than to write the equation out in words.

Prioritization

Proper attention should be given to prioritizing the encoding of outstanding symbol repertoires that meet the criteria for encoding. Prioritization needs to address not only the limited code space available, particularly in the BMP, but also the allocation of other scarce resources such as the workload of the standards committees.

Completion

Mathematical operators are an example of an extensive set of symbols that is at present incomplete. The existing repertoire is so incomplete that it not only fails to meet the needs of the current user community, but even the use of the existing partial repertoire is precluded for many users. Therefore, completion of this repertoire has a high priority. Otherwise, for lack of usability, alternative encodings or markup will become the method of choice, stranding the large repertoire already encoded. In this particular example, the work is now being undertaken, and finishing it should be given a very high priority.

By extension, proposals that contain incomplete repertoires of a given category of symbol should be given a very low priority until they reach a level of completeness that makes a compelling case for a given user community.

Instability

The case has been made that either "rapid changes in the glyph representation" or "changes in the meaning of the character" have nothing to do with encoding (defined as a purely positional assignment), as long as the general category of use of the symbol does not change.

The counterexample to that is the Euro. There are glyph changes that cannot be absorbed quietly, since the new glyph bears so little relation to the old one that the change exceeds the implied range of glyphic variation.

If the same symbol (same glyph) acquires additional meaning(s), that would be OK. For some symbols (those that are part of a notational scheme) this could mean that the symbol would need to be processed differently, i.e. a change in operational semantics, a.k.a. character properties. Such a change would affect coding.

In either case, rapid change means by definition that the situation is not settled, and reliable information on the range of acceptable glyphic variation or on character properties is unavailable. That is a good reason to wait before encoding.
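To make the notion of operational semantics, i.e. character properties, more concrete, the following sketch (Python, standard unicodedata module; the choice of examples is mine) shows the kind of property a text process would consult, and therefore what is at stake if such a property were to change after encoding:

    import unicodedata

    # U+20AC EURO SIGN carries the property "currency symbol" (Sc), while
    # U+222B INTEGRAL carries "mathematical symbol" (Sm); a process that
    # formats amounts or lays out formulas branches on exactly this.
    for ch in ("\u20AC", "\u222B"):
        print(unicodedata.name(ch), unicodedata.category(ch))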

Perceived Usefulness

The fact that a symbol merely "seems to be useful or potentially useful" is precisely not a reason to encode it. Demonstrated usage, or demonstrated demand, on the other hand, does constitute a good reason to encode the symbol. The Euro is the classic example of a novel symbol for which there is demonstrated and strong demand.


(*) This document is a revision of an e-mail sent to UTC members on 10-6-98, incorporating some of the feedback received at that time as well as some more recent thoughts.

(1) As with all repertoires, I include a sizeable ‘gray zone’ in the term ‘defined’ here.

(2) The typical camping, boating, hiking, etc. symbols are often used in that way.