[Unicode]  Technical Reports
 

Proposed Draft Unicode® Technical Standard #52

Unicode Emoji Mechanisms

Version 1.0 (draft 3)
Editors Mark Davis (Google Inc.), Peter Edberg (Apple Inc.)
Date 2016-05-25
This Version http://www.unicode.org/reports/tr52/tr52-3.html
Previous Version http://www.unicode.org/reports/tr52/tr52-1.html
Latest Version http://www.unicode.org/reports/tr52/
Latest Proposed Update http://www.unicode.org/reports/tr52/proposed.html
Revision 3

 

Summary

This document provides provides a new way of representing customizations of Unicode emoji characters. The first specified customizations provide for flags for subdivisions of countries (such as Scotland or California), gender variants (such as female runners or males raising a hand), hair color variants (a red-haired dancer), and directional variants (pointing a hand or bicyclist to the right).

Status

Work on this document has been suspended, pending investigation into whether ZWJ sequences would provide a better solution in the near term, at least for the highest priority feature: gender.

This document is a proposed draft Unicode Technical Standard. This document may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents




1 Introduction

There are many requests for variants of Unicode emoji. This document provides a concrete proposal for a mechanism that could be used for such customization, using the Unicode TAG characters. This approach allows the creation of variant emoji images without waiting for them to be encoded, and has been designed to be extensible.

The basic idea is like that of skin tone modifiers, used when a user selects an olive-skinned emoji on the keyboard. Internally that emoji consists of a base emoji followed by a skin-tone modifier that indicates that the base emoji should be olive-skinned. The olive-skinned emoji looks like a single character to the user, and behaves as a single emoji character in display, line break, and other processing.

With this TAG mechanism, a given emoji can have a specified list of special “TAG” characters after it that indicate a particular customization: the whole sequence is called an “emoji tag sequence”. The emoji tag sequence looks like a single character to the user, and behaves as a single emoji character in display, line break, and other processing. The TAG mechanism differs from the skin-tone modifiers in being much extensible.

The gray image in the illustrations below represent a sequence of TAG characters; the exact sequence that would be used in each case is described in the Technical Specification. For example, the image on the left below would be what the user sees; the sequence on the right would be the internal representation. The TAG characters indicated by the gray “…✦” below would never be visible.

=

If the TAG sequence is invalid, or if the system doesn't support that tag sequence, the user would see an indication of that, such as:

=

The Unicode emoji customization mechanism is used to request an alternate rendering of a particular emoji character. It is only used where the base emoji character is in some way generic, and the customization would be considered a variant of that base character. For example, it could be used to indicate that the MAN emoji should be shown with red hair, but not that the MAN emoji be shown as a cup of tea.

Only those emoji tag sequences that follow the specifications in this document are valid. Like the emoji sequences for different family groups (aka emoji zwj sequences), Unicode would catalog those sequences that are supported by major vendors, but would not recommend particular combinations.

The first specified customizations include flags, gender, hair color, and direction. The hair color, gender, direction, and emoji-modifiers could be applied to the same emoji, resulting in a combinatorial explosion of possible glyphs in fonts. The flag customizations also allow for a very large number of possible glyphs. Due to memory constraints, it is thus anticipated that only certain combinations would be supported by major vendors. There is no requirement or expectation that all of the possible combinations — or even any large subset of them — be supported by vendors, until there is technology in the fonts to deal with the combinatorics.

Flags. These customizations allow for additional emoji for images of a flag associated with particular regions, such as emoji depictions of the flag of Scotland or California.

=

They are limited to Unicode subdivisions (which generally correspond to ISO 3166-2 subdivisions), and valid 3-digit UN codes. Like the Regional_Indicator characters, they do not represent a specific image. One way to think of them is that they represent “a flag” for a region, not “the flag” for a region. This mechanism cannot be used for arbitrary flags. It cannot, for example, be used for a rainbow flag, the flag of a particular football club, or a pirate flag.

Gender. These customizations allow for emoji images with different genders.

=

Unicode itself does not normally specify the gender for emoji characters: the emoji character is RUNNER, not MAN RUNNING; POLICE OFFICER not POLICEMAN. For a greater sense of realism, however, vendors typically have picked a particular gender to display, even for the neutral characters.

The Gender customization allow vendors to display male, female, and neutral versions. Neutral doesn’t mean the default (untagged) presentation, which could be any of these three; it means a specifically a gender-neutral presentation. These customizations are to mark appearance, and not gender identification.

The following is a draft list of emoji characters that would allow customizations of gender. The comment (after #) includes the version of Unicode where the character was encoded, an image, and the character name.

Review Note: the images are currently characters, so the appearance will depend on the platform people view them on. They will be supplemented by an image so that people can read this document without having emoji support.

26F9  # 5.2 (⛹) PERSON WITH BALL
1F3C3 # 6.0 (🏃) RUNNER
1F3C4 # 6.0 (🏄) SURFER
1F3CA # 6.0 (🏊) SWIMMER
1F3CB # 7.0 (🏋) WEIGHT LIFTER
1F464 # 6.0 (👤) BUST IN SILHOUETTE
1F465 # 6.0 (👥) BUSTS IN SILHOUETTE
1F46E # 6.0 (👮) POLICE OFFICER
1F46F # 6.0 (👯) WOMAN WITH BUNNY EARS
1F470 # 6.0 (👰) BRIDE WITH VEIL
1F471 # 6.0 (👱) PERSON WITH BLOND HAIR
1F472 # 6.0 (👲) MAN WITH GUA PI MAO
1F473 # 6.0 (👳) MAN WITH TURBAN
1F481 # 6.0 (💁) INFORMATION DESK PERSON
1F482 # 6.0 (💂) GUARDSMAN
1F486 # 6.0 (💆) FACE MASSAGE
1F487 # 6.0 (💇) HAIRCUT
1F574 # 7.0 (🕴) MAN IN BUSINESS SUIT LEVITATING
1F575 # 7.0 (🕵) SLEUTH OR SPY
1F5E3 # 7.0 (🗣) SPEAKING HEAD IN SILHOUETTE
1F645 # 6.0 (🙅) FACE WITH NO GOOD GESTURE
1F646 # 6.0 (🙆) FACE WITH OK GESTURE
1F647 # 6.0 (🙇) PERSON BOWING DEEPLY
1F64B # 6.0 (🙋) HAPPY PERSON RAISING ONE HAND
1F64D # 6.0 (🙍) PERSON FROWNING
1F64E # 6.0 (🙎) PERSON WITH POUTING FACE
1F6A3 # 6.0 (🚣) ROWBOAT
1F6B4 # 6.0 (🚴) BICYCLIST
1F6B5 # 6.0 (🚵) MOUNTAIN BICYCLIST
1F6B6 # 6.0 (🚶) PEDESTRIAN
1F6C0 # 6.0 (🛀) BATH
1F926 # 9.0 (🤦) FACE PALM
1F935 # 9.0 (🤵) MAN IN TUXEDO
1F937 # 9.0 (🤷) SHRUG
1F938 # 9.0 (🤸) PERSON DOING CARTWHEEL
1F93B # 9.0 (🤻) MODERN PENTATHLON
1F93C # 9.0 (🤼) WRESTLERS
1F93D # 9.0 (🤽) WATER POLO
1F93E # 9.0 (🤾) HANDBALL

This includes characters that have explicit gender in their names, where there is not a corresponding character of the other gender (MAN vs WOMAN).

Review Note: Some characters encoded to form gender pairs might not have been encoded had this mechanism been available and widely implemented earlier. However, it is too late to remove these characters.

Review Note: Possible removals from the base character list above: For the 9.0 sports figures, we could drop those for which we have consensus to only show neutral images. We need to look at whether JUGGLING might be a person or should not be. Finally, we are considering removal of the three “IN SILHOUETTE” characters (two BUSTS, one SPEAKING HEAD).

Hair Color. These customizations allow for emoji images with different hair colors.

=

There are hundreds of possible distinctions among hair color, but because emoji are presented with a “cartoon” style it suffices to have just a few broad choices; that is also important to avoid font-size problems. The supported list is BLACK, BLONDE, BROWN, RED, GRAY, with the addition of Bald (no hair). This list is based on the distinctions made in the US Online Passport form, the UN Grounds Pass application, and other forms such as driver’s licences.

The following is a draft list of emoji characters that would allow customizations of hair color:

26F9  # 5.2 (⛹) PERSON WITH BALL
1F3C3 # 6.0 (🏃) RUNNER
1F3C4 # 6.0 (🏄) SURFER
1F3CA # 6.0 (🏊) SWIMMER
1F3CB # 7.0 (🏋) WEIGHT LIFTER
1F466 # 6.0 (👦) BOY
1F467 # 6.0 (👧) GIRL
1F469 # 6.0 (👩) WOMAN
1F46E # 6.0 (👮) POLICE OFFICER
1F470 # 6.0 (👰) BRIDE WITH VEIL
1F472 # 6.0 (👲) MAN WITH GUA PI MAO
1F473 # 6.0 (👳) MAN WITH TURBAN
1F474 # 6.0 (👴) OLDER MAN
1F475 # 6.0 (👵) OLDER WOMAN
1F476 # 6.0 (👶) BABY
1F478 # 6.0 (👸) PRINCESS
1F47C # 6.0 (👼) BABY ANGEL
1F481 # 6.0 (💁) INFORMATION DESK PERSON
1F483 # 6.0 (💃) DANCER
1F486 # 6.0 (💆) FACE MASSAGE
1F487 # 6.0 (💇) HAIRCUT
1F575 # 7.0 (🕵) SLEUTH OR SPY
1F57A # 9.0 (🕺) MAN DANCING
1F645 # 6.0 (🙅) FACE WITH NO GOOD GESTURE
1F646 # 6.0 (🙆) FACE WITH OK GESTURE
1F647 # 6.0 (🙇) PERSON BOWING DEEPLY
1F64B # 6.0 (🙋) HAPPY PERSON RAISING ONE HAND
1F64D # 6.0 (🙍) PERSON FROWNING
1F64E # 6.0 (🙎) PERSON WITH POUTING FACE
1F6A3 # 6.0 (🚣) ROWBOAT
1F6B6 # 6.0 (🚶) PEDESTRIAN
1F926 # 9.0 (🤦) FACE PALM
1F930 # 9.0 (🤰) PREGNANT WOMAN
1F934 # 9.0 (🤴) PRINCE
1F935 # 9.0 (🤵) MAN IN TUXEDO
1F936 # 9.0 (🤶) MOTHER CHRISTMAS
1F937 # 9.0 (🤷) SHRUG
1F938 # 9.0 (🤸) PERSON DOING CARTWHEEL
1F93B # 9.0 (🤻) MODERN PENTATHLON
1F93C # 9.0 (🤼) WRESTLERS
1F93D # 9.0 (🤽) WATER POLO
1F93E # 9.0 (🤾) HANDBALL

Direction. These customizations allow for emoji images to have different directions.

=

Some characters are typically presented with a direction that may be semantically significant in sequences of emoji: such as pointing in the direction of an action. Unfortunately, for historical reasons the “default” direction is fairly arbitrary; you see some emoji pointing left 🎥 and others pointing right 📽.

For example, controlling the direction allows for different compositions with hands around faces:

Or, when someone is trying to say “then the princess stabbed the prince” when commenting on the Game of Thrones, it would be far more natural to see (a) below rather than (b), which could be interpreted as “the prince stabbed the princess”:

a) …

b) …

The emoji tag sequences allow for both directions to be expressed and interchanged.

The following is a draft list of emoji characters that would allow customizations of direction:

26CF  # 5.2 (⛏) PICK
2708 # 1.1 (✈) AIRPLANE
270B # 6.0 (✋) RAISED HAND
270C # 1.1 (✌) VICTORY HAND
1F3C3 # 6.0 (🏃) RUNNER
1F3CA # 6.0 (🏊) SWIMMER
1F440 # ?.0 (👀) EYES
1F44B # 6.0 (👋) WAVING HAND SIGN
1F44C # 6.0 (👌) OK HAND SIGN
1F44D # 6.0 (👍) THUMBS UP SIGN
1F44F # 6.0 (👏) CLAPPING HANDS SIGN
1F4A8 # 6.0 (💨) DASH SYMBOL
1F4AA # 6.0 (💪) FLEXED BICEPS
1F528 # 6.0 (🔨) HAMMER
1F52B # 6.0 (🔫) PISTOL
1F5E1 # 7.0 (🗡) DAGGER KNIFE
1F618 # 6.0 (😘) FACE THROWING A KISS
1F6AC # 6.0 (🚬) SMOKING SYMBOL
1F6B4 # ?.0 (🚴) BICYCLIST
1F6B5 # ?.0 (🚵) MOUNTAIN BICYCLIST
1F6B6 # 6.0 (🚶) PEDESTRIAN
1F918 # 8.0 (🤘) SIGN OF THE HORNS
1F93A # 9.0 (🤺) FENCER
1F93D # 9.0 (🤽) WATER POLO
1F93E # 9.0 (🤾) HANDBALL
1F946 # 9.0 (🥆) RIFLE

Review Note: There are many possible characters that could usefully have direction applied to them. However, we should start out with just small number of base characters, and expand as necessary, probably winnowing down the above list. For the initial list, characters that already typically point to the left are omitted.

A major difference for the Direction customization is that the images simply need to be mirrored, thus making essentially no difference in the size of the font (if the technology permits).

Private Use. There are also private-use tag sequences, which allow for experimentation and interchange in closed systems, but which are not suitable for general interchange. For more information, see the Technical Specification.

Unsupported. Where the emoji tag sequence is not supported by an implementation, the tag characters are invisible and occupy no space. If the implementation is aware of the tag mechanism, but the sequence is invalid or the implementation does not support a particular sequence, then the fallback character should be displayed with a question mark superimposed, such as the following:

=

That way, the recipient is notified that there is something odd about the character; that it isn’t exactly what the sender intended. The above appearance is not normative: the goal is have a superimposed question mark, but there could be alternate appearances so that the base character is less obscured. However, an implementation cannot show a variant that is not specified by Unicode, unless it uses a Private Use tag sequence.

Review Note: one by-product of the direction of this work is that the Emoji SC and UTC should focus primarily on “generic” emoji characters, rather than very specific versions.

Review Note: We considered various other models:

2 Conformance

Any Unicode-conformant implementation that implements emoji tag sequences must do so as described in this document:

Review Note: The following are draft clauses, and will need to be refined.

C1No emoji tag sequence can be supported unless it is valid according to the specifications in this document.

C2Any supported emoji tag sequences must be displayed with an appropriate image according to the specifications in this document.

For example, an emoji tag sequence for adding red hair cannot change RUNNER into a SWIMMER.

C3Any invalid or unsupported emoji tag sequence must be displayed with an appropriate image according to the specifications in this document.

The image should be the base character with a superimposed question mark.

3 Technical Specification

This section provides a technical description for how the valid tag sequences are specified.

3.1 Overall Syntax

The customization syntax uses the 95 invisible TAG characters:

U+E0020..U+E007F (TAG SPACE..CANCEL TAG).

These correspond to ASCII characters, and may be referenced by an abbreviated name of the form Tag-<single-ascii-character>, such as Tag-U for U+E0056 TAG LATIN CAPITAL LETTER V. In examples, where clear, they can also be represented simply by the corresponding underlined ASCII letters. The tag-term can be represented by ✦. The regex characters ?, *, + have their normal meaning.

In addition, there are the following special terms:

Notation Characters Description
tag-base [:emoji=yes:] any single emoji character (w/o Regional_Indicators)
tag-term U+E007F CANCEL TAG terminating Tag ✦
tag-keyChar Tag-A..Tag-Z Tag characters corresponding to uppercase letters: A..Z
tag-valChar U+E0020..U+E0040, U+E005B..U+E007E Tag characters that are neither tag-term nor tag-keyChar.

Review Note: the original plan was to add to UTR #51 definitions ED-10a and ED-10b below, plus new terms for ED-13 and ED-15. However, since this document is now more generally about emoji mechanisms, it may be best to move (or at least copy) other definitions here (and then add these).

ED-10a. emoji tag sequence

A sequence consisting of an emoji character followed by one or more non-terminating TAG characters, followed by a terminating TAG character.

emoji-tag-sequence        :=        tag-base-item tag-key-value-pair+ tag-term
tag-base-item             :=        tag-base | tag-base_variation_sequence
tag-key-value-pair        :=        tag-key tag-value
tag-key                   :=        tag-keyChar+
tag-value                 :=        tag-valueChar+

ED-10b. A tag-base-variation-sequence is an emoji variation sequence that starts with a tag-base.

Add emoji-tag-sequence to the following definitions in UTR51:

ED-13. emoji modifier sequence, and ED-15. emoji core sequence.

Further Constraints:

Thus—in ASCII terms—each key is a tag sequence that consist of one or A..Z characters. So A is a key, as is AB or ZZ. The value is any sequence of one or more characters from _ (Tag-space) to ~, outside of A..Z. For example, a tag-key-value-pair could be Fusca, representing the key= F and value = usca. The interpretation of each tag-key-value-pair depends on the tag-base, tag-key, and tag-value.

Example: <tag-base>ABxAy✦ is invalid, as is <tag-base>AxAy✦.

Review Note: the length (currently 16) is up for discussion, but everyone is agreed that we need a fixed limit.

Review Note: the committee considered whether to have the tag-term or not. In theory, it is not necessary for termination, because the tag sequence is always bounded by any non-TAG characters.

For font lookup, however, people didn’t want to have to backup when a non-TAG character is seen. To mark the end, it would be sufficient to forbid sequences that are prefixes of other valid sequences. The easiest way to ensure that is to always have a terminator. In some cases, it may not be necessary, but it makes validity checking much easier.

This defines the well-formed emoji tag sequences. However, the only currently valid sequences are those defined in the following sections. All others are reserved for future use. Only certain tag-keys are valid for a given tag-base, and only certain tag-values are valid for the given tag-key. The tag-value may also have internal syntax.

An emoji modifier that affects the customized emoji should follow the complete sequence representing the customized emoji, and in a sequence such as <Char ZWJ Char TAG+> the TAG sequence applies only to the second character in the sequence.

Review Note: The tag sequences that are likely to have the highest priority in initial implementations are likely to be the subdivision flags and the gender tags. Should we remove the direction and hair color tag sequences from the first version?

3.2 Flags

Syntax

Tag-Base U+1F3F3 WAVING WHITE FLAG
Tag-Key Tag-V
Tag-Value (Tag-0..Tag-9, Tag-a..Tag-z)+

Further Constraints:

  1. The Tag-Value must be a specification of either a Unicode subdivision_id or a 3-digit unicode_region_subtag, as per [CLDR]. The only valid values are those with idStatus of "regular" or "deprecated". However, the deprecated values are only included for compatibility, and should not be used. The syntax also allows the 3-digit forms for future compatibility, but none are currently valid.
  2. They are only lowercase letters and number TAGs: [0-9 a-z].
  3. Like the current regional indicators, these can request an image for whatever is currently the flag of the specified subregion. They are not intended to provide a mechanism for versioned representations of any particular flag image.

Example: <U+1F3F3>Vgbsct✦ requests a flag for the subdivision “gbsct”, which represents Scotland. Note that there is no hyphen, and it is all lowercase, unlike the format for ISO subdivisions (“GB-SCT”).

3.3 Attributes General Info

The attributes are used to request the display of an emoji character to have a particular attribute. There are currently 3 supported types of attributes:

Further Constraints:

The Tag-Base and tag-key-value-pair values are listed below. A Tag-Base character can be in any of the listed classes. Each class has a set of valid attributes. Additional classes and attributes may be added in future versions.

The choice of Tag-Value characters is somewhat arbitrary, since they are not visible to users. However, they are chosen to be mnemonic where possible, since that is useful for internal debugging (more recognizable than Tag-1, Tag-2, etc.).

Additional emoji properties are added to [emoji-data] in support of these, with a new data file emoji-tags.txt using the standard format.

3.4 Gender Attribute

Syntax

Tag-Base Gender_Base
Tag-Key Tag-G
Tag-Value Exactly one of the following tag-valueChars:
tag-valueChar Description
Tag-m Male appearance
Tag-f Female appearance
Tag-n Gender-neutral: neither male nor female appearance

Examples:

Male Runner: 🏃Gm

Female Runner: 🏃Gf

3.5 Hair Attribute

Syntax

Tag-Base Hair_Base
Tag-Key Tag-H
Tag-Value Exactly one of the following tag-valueChars:
tag-valueChar Description
Tag-k Black-haired
Tag-s Blonde (also sandy-haired)
Tag-b Brown (Brunette)
Tag-r Redhead (Ginger)
Tag-g Gray-haired
Tag-n Bald (no hair)

Example:

Red-haired Female Runner: 🏃GfHr

3.6 Direction Attribute

Syntax

Tag-Base Direction_Base
Tag-Key Tag-D
Tag-Value Exactly one of the following tag-valueChars:
tag-valueChar Description
Tag-r Point-Right
Tag-l Point-Left

The Tag directions are to have a mirrored effect in a bidi context. All emoji characters are Bidi_Class=Other_Neutral (except for the enclosed alphanumerics).

Example:

Red-haired Female Runner facing Right: 🏃DrGrHr

3.7 Private Use

Private Use tag sequences are for closed interchange within a given system. As with private use codes in general, the tag sequence may have no meaning or a different meaning outside that system, so it is not suitable for general interchange. Any key starting with Tag-Z qualifies, and any tag-value.

Syntax

Tag-Base emoji-character
Tag-Key Tag-Z tag-keyChar*
Tag-Value tag-valChar+

Example:

Private use runner with ski boots: 🏃Zskiboots

Implementations should consider the use of any of the thousands of private use Unicode characters instead. However, the advantages of Private Use emoji customizations include:

4 Implementation Notes

The hair color, gender, and emoji-modifiers could, in theory, be mixed, resulting in a combinatorial explosion of glyphs in fonts. It is anticipated that only certain combinations would be supported generally until the technology supports them without a massive size increase. There are a few possible avenues. One is the “Mr(s) Potato Head” approach, whereby glyph pieces are assembled for a particular image. For example, there might be some different color hair styles that would be appropriate for overlaying on a RUNNER emoji. These could then be used to create male and female variants.

A more general approach is to allow override of the color palettes in the images in the font. With this approach, exactly the same image of a girl can be used with many combinations of hair and skin color, for example.

If the rendering engine can mirror images in the font, then the direction can also be supported generally with little size overhead.

Acknowledgments

Mark Davis and Peter Edberg created the initial versions of this document, and maintain the text.

Thanks to Jeremy Burge, Peter Constable, Craig Cummings, Doug Ewell, Tayfun Karadeniz, Ken Lunde, Rick McGowan, Katsuhiko Momoi, Roozbeh Pournader, Markus Scherer, and Ken Whistler for feedback on and contributions to this document.

References

[CLDR] CLDR - Unicode Common Locale Data Repository
http://cldr.unicode.org/
For the latest version of the associated specification (LDML), see:

http://www.unicode.org/reports/tr35/
[emoji-charts] The illustrative charts of emoji
http://unicode.org/emoji/charts/index.html
[emoji-data] The associated data file for emoji tag sequences
For the draft version, see
http://unicode.org/reports/tr52/emoji-tags.txt
[Unicode] The Unicode Standard
For the latest version, see:
http://unicode.org/versions/latest/

Modifications

The following summarizes modifications from the previous revisions of this document.

Revision 1