Symbols Encoding Principles (Rough Draft) (Unicode Symbols)

The UTC Subcommittee on Encoding of Symbols

Principles for the encoding of symbols as assigned characters in the Unicode Standard: (See also chapter 2 section 2.2 "Unicode Design Principles" of the Unicode Standard.)

1. Already encoded: Symbols considered for encoding should already be encoded in a character set, called a source character set. Such a character set may be defined by a standards organization, a company, consortium or other organization. Such a character set should be in widespread use. Not every symbol in that character set need be in widespread use.

    The character set may consist of a set of Unicode Private Use Area (PUA) code points, or it may use a non-Unicode encoding, or both (with a mapping table).

2. Source separation rule: If a single source character set separates two characters (anywhere in the character set, including standard JIS codes), then we map them to two separate Unicode characters. (This is a hard and fast rule.)
3. Reuse: We map to existing Unicode symbols where appropriate. (Unification with existing characters.)
4. Separating generic symbols: If Unicode has a set of related symbols, but no single character in that set is as generic as the symbol in the proposed symbol sets, then we encode a new character. For example, the Emoji symbol sets do not distinguish between waxing and waning crescent moons.
5. Colors and Animation: We encode symbols as characters, abstracting away from colors and animation. We only distinguish by nominal color or animation for the source separation rule. (See naming below.)
6. Existing cross-mapping tables: Where cross-mapping tables are established among related symbol character sets, we follow the tables as much as possible and unify among the symbol character sets, but we disunify in cases where the visual images are very different and not semantically associated. For example, among Emoji symbol character sets:

    We disunified the 'M' symbol for Metro from the Metro train image. The 'M' symbol would have translation problems. (This is similar to the problems with the international currency symbol and the proposal for a "generic decimal separator".)
    On the other hand, we unified the sets of Zodiac symbols, even though the images shown by carriers vary widely. This is because they clearly belong to a cohesive set which corresponds across carriers.

7. Least-marked common symbol: For a set of symbols from related symbol character sets which each could map to an existing Unicode code point, we choose the symbol that is shared among the most carriers (according to the cross-mapping tables) and has the least-marked form.
8. Naming: Character names are typically based on the glosses of the vendor symbols or the visual appearance. We follow the conventions for existing Unicode characters where possible, in particular using "BLACK" for "filled" and "WHITE" for "hollow". We exclude nominal color and animation from proposed character names except where necessary for distinction.

    We prefer to choose symbol character names by appearance rather than semantics, because symbols tend to be used for different purposes and are selected for their desired appearance.

9. Characters, not glyphs: As usual for Unicode, we should avoid encoding glyph variations.

    For example, the ARIB standards have several Kanji at 70% of full size (ARIB row 92 cells 26..31). These should not be encoded separately.

10. Combining enclosing marks: For some symbols, it may be appropriate to encode them as sequences of an existing Unicode character with a combining enclosing mark of the right shape (circle, square, keycap, etc.). However, this cannot be done for enclosing multiple base characters, and should not be done for heavily styled characters where the enclosing mark does not express the styling well.
11. Needs discussion: Symbols that look like sequences of existing characters: When should we encode them, when should the sequences of existing characters be used directly? See discussion of ARIB symbols.
12. Needs discussion: Should we eliminate duplicates from the set, as proposed for ARIB 90/58 and 93/30, or always apply the source-separation rule? If we apply the source-separation rule, do we add canonical or compatibility decompositions between the duplicates?
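
As an illustration only, the source separation rule above can be read as a check on a draft mapping table: within any single source character set, no two source codes may share one Unicode target. The sketch below assumes a hypothetical table of (source set, source code, proposed code point) rows. The source names, codes and targets are invented for the example and are not taken from any real carrier or ARIB table.

```python
from collections import defaultdict

# Hypothetical mapping rows: (source set, source code, proposed Unicode code point).
# All names and values here are made up for illustration.
MAPPING = [
    ("carrier_a", 0x01, 0x2600),
    ("carrier_a", 0x02, 0x2601),
    ("carrier_b", 0x10, 0x2600),  # a different source set may share a target
    ("carrier_b", 0x11, 0x2600),  # violation: same source set, same target
]


def source_separation_violations(rows):
    """Return {(source, code_point): [source codes]} for every case where one
    source set maps two or more of its characters to the same code point."""
    targets = defaultdict(list)
    for source, code, code_point in rows:
        targets[(source, code_point)].append(code)
    return {key: codes for key, codes in targets.items() if len(codes) > 1}


violations = source_separation_violations(MAPPING)
for (source, code_point), codes in violations.items():
    listed = ", ".join(f"0x{c:02X}" for c in codes)
    print(f"{source}: source codes {listed} all map to U+{code_point:04X}")
```

Unification across different source sets is allowed by this check, which matches the Reuse and cross-mapping principles. Only collisions inside one source set are reported.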

Code point assignment guidelines

Fill existing blocks in the BMP, but do not create new blocks in that plane. Although these symbols are in modern use, it is felt that the few remaining spaces in the BMP should be reserved for scripts, not new symbols. New blocks are therefore allocated in Plane 1, the Supplementary Multilingual Plane (SMP), to accommodate characters that do not fit in existing BMP blocks.
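
For readers less familiar with planes, here is a minimal sketch of how a code point relates to the BMP and the SMP. The plane number is the code point divided by 0x10000, so plane 0 is the BMP and plane 1 is the SMP. The function names are hypothetical helpers. U+263F is the MERCURY character mentioned in the feedback below, and U+1F300 is simply an arbitrary Plane 1 code point chosen for illustration.

```python
def plane(code_point: int) -> int:
    """Return the Unicode plane number of a code point (0 = BMP, 1 = SMP)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane holds 0x10000 code points


def plane_label(code_point: int) -> str:
    """Return a short label for the plane; only planes 0 and 1 are named here."""
    names = {0: "BMP", 1: "SMP"}
    p = plane(code_point)
    return names.get(p, f"plane {p}")


for cp in (0x263F, 0x1F300):
    print(f"U+{cp:04X} is in the {plane_label(cp)}")
```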

Symbols mailing list feedback

Feedback received 2008-08-08..11 on the draft above

Karl Pentzlin:

At 8. "Naming", I would like a principle added like:
A symbol may only be named by its semantics if there can be assumed a consensus by the broad majority of the prospective users that the term used as name implies exactly the symbol, regardless of the geographical or cultural background of the user.
(E.g. this is true for U+263F MERCURY, but false for the name FACTORY as proposed for ARIB 9118.)

Another principle:
A naming must not conflict with existing standardized or widely accepted uses.
(E.g., a symbol [like ARIB 9120] may only be named LIGHTHOUSE if it resembles the lighthouse symbol used in nautical maps.)

Asmus Freytag:

"Symbols considered for encoding should already be encoded in a character set"

I think this is overly restrictive and at variance with the principles for encoding other characters. Nobody wants to encourage people to create character sets just so that characters can become eligible for encoding in Unicode (the same goes for private use characters).

Mark Davis: I agree that we can improve the wording. What we'd settled on in the UTC was to prioritize characters that were already encoded in widely deployed character sets (not to exclude others).

I would strongly recommend to rephrase that so that use of the character as part of a set becomes one of several criteria that support encoding. The source separation rule would then apply only if & when there is a source set.

Asmus Freytag: For use in the encoding principle, I continue to recommend the use of a list of "criteria" that, if satisfied, favor encoding. It is then an easy matter to add the prioritization explicitly. It's always beneficial to logically separate the encoding policy from the principles used to decide whether something is or isn't a potential character.

Did you look at what the WG2 principles and procedures say about encoding symbols? If not, you should explicitly take the existing principles into account.

Asmus Freytag: I also continue to suggest that any new principles should be written so as to take into account existing principles.

Comments (2)

katmomoi - 8/8/2008 2:33AM PDT

Naming as promised in 5 is missing from this document. Something like the following?

8. Naming:

Try to be as descriptive as possible. It is best to avoid functional naming for symbols, since symbols are sometimes used in an unusual function and fixing the function may be off target. For example, a RUNNER emoji is sometimes used in mobile map directions in the sense of WALKING instead -- to contrast going by train with going on foot. Using the WALKING emoji makes this a bit humorous.

markus.icu - 8/8/2008 7:28AM PDT

Naming: I lifted the text from the Emoji proposal and added a brief note about appearance, not semantics. Please see if that's ok.
