Unicode Emoji QID

Background Document for PRI #408

Last updated: 2019-11-14

The QID Emoji Tag Sequences (or QID emoji, for short) have been proposed to provide a well-defined mechanism for implementations to support additional valid emoji that are not representable by Unicode characters or emoji zwj sequences. This mechanism allows for the interchange of emoji whose meaning is discoverable, and which should be correctly parsed by all conformant implementations (although only displayed by implementations that support it). The meaning of each of these valid emoji is established by reference to a Wikidata QID.

There are definite pros and cons to this mechanism. On the one hand, it opens the definition of emoji up to organizations outside of the Unicode Consortium. On the other hand, it has some of the disadvantages of "private-use" characters, and might end up unused or with fragmented support.

The consortium would appreciate feedback on (a) whether or not this mechanism should be added to UTS #51, and (b) whether any changes need to be made in the specification.

Background

The Unicode Consortium decided some years ago to take emoji proposals from the general public, and devoted resources to developing a workable process for assessing them. Petitions and/or voting were considered as a source of information about popularity. But for those to work, there would have to be a system to ensure that people can't "cheat" by voting multiple times, and a mechanism to assure that the population of voters is demographically representative of the world's population.

Some people are unsatisfied with the choices that are made, or the process behind them. Those people would like there to be a mechanism to define emoji without the Unicode Consortium being involved.

To a certain extent that is already the case. Unicode reserved 137,468 “private use characters” decades ago. These can be used by any person or organization for any purpose. Group-1 can create a font that has an emoji image for the private use character U+FABCD, and use it in a document. If Group-1 makes that font available to Individual-2, then Individual-2 can see the emoji as well. However, operating systems wouldn’t know about this meaning of U+FABCD; someone without that font would just see a box; and Group-2 could use U+FABCD for an utterly different emoji (or a non-emoji character).

For emoji, there is another mechanism available. If you look inside an emoji like woman pilot: medium skin tone, you will find that it is made up of 5 characters, glued together by a special character called a ZWJ. As long as an emoji can be represented as a reasonable sequence of other emoji, anyone can define and use it. This was done with the pirate flag before Unicode added it to the list of recommended emoji. Again, if Group-1 makes a font available for display of the emoji, then others using the font could see and use it. This is better than private use characters because operating systems know to handle these ZWJ sequences as emoji characters and, in many cases, the intent of the sequence can be discerned from the underlying characters. In addition, because each sequence is built from well-known emoji, there is always a useful fallback presentation (rather than a square missing-glyph symbol) that cannot conflict with other users' private needs. However, many emoji can’t be represented as a reasonable sequence of other emoji.

Of course, it's possible that other mechanisms, such as standards for the interchange of images, videos, or stickers, might render the need for additional or highly-specialized emoji obsolete. This has been recognized from the beginning: see Unicode Emoji: Longer Term.

Over time, there have been some proposals made for opening up the process so that the Unicode Consortium doesn’t need to be involved in the definition. For example, in 2016 there were two proposals: Unicode Specified Emoji Customizations (see under Private Use) and Coded Hashes of Arbitrary Images. But there were some unaddressed problems with those. In early 2019, there was a proposal to address those problems, which has further developed into this PRI.

The text documenting QID emoji tag sequences would be contained in a new section in an annex of UTS #51, Unicode Emoji. A draft of that text is shown below, to guide review of the concept. See also the various review notes below the draft for Section C.2, which list various issues that the UTC would appreciate feedback on.

C.2 QID Emoji Tag Sequences

The QID Emoji Tag Sequences (or QID emoji, for short) provide a well-defined mechanism for implementations to support additional valid emoji that are not representable by Unicode characters or emoji zwj sequences. This mechanism allows for the interchange of emoji whose meaning is discoverable, and which should be correctly parsed by all conformant implementations. The meaning of each of these valid emoji is established by reference to a Wikidata QID.

Communities of interest can use this mechanism to put together their own sets of emoji, independent of the Unicode Consortium. Consider the following scenario, for example:

The New Zealand Kennel Club puts together a set of emoji for dog breeds, using QID emoji. It creates a web font for those emoji, and makes it freely available. On PCs or other systems that allow downloadable fonts, users can see and use the emoji.

The Verband für das Deutsche Hundewesen (the German Kennel Club) then decides to support those emoji as well; the meaning of each one of them is already established. It creates an online tool for selecting the dog breed emoji, and also produces an app for iPhone and Android with a bundled font and emoji palette.

At some point, mobile phone vendors add the ability to allow users to add emoji to their emoji keyboard. That is, people can copy an emoji (including a QID emoji), and paste it into their favorites’ palette. A user interested in dog breeds installs the New Zealand Kennel Club font onto their phone. Later, they get a text message with the QID emoji for Shetland Sheepdog (Q39058) and add it to their keyboard. They can then pick that emoji just like any of the stock ones.

Examples of valid QID emoji that could potentially be implemented:

Tag Sequence	Sample Image	Description
Q459788✦		flag of NATO
Q4545971✦		gelatin dessert
Q14384✦		Triceratops

Full support on any particular implementation would additionally require installation of fonts (or other mechanisms for displaying images) and keyboard modules (or other mechanisms for text entry). Even without such support, the sequence is to be treated as an emoji character in all processing, but would just display a fallback image.

A valid QID emoji tag sequence must satisfy the following constraints:

The tag_base and tag_spec are limited to the following:

tag_base		U+1F194 SQUARED ID	The ID button emoji.
tag_spec	Q[0-9]+	U+E0051 TAG LATIN CAPITAL LETTER Q [U+E0030 TAG DIGIT ZERO - U+E0039 TAG DIGIT NINE]+	A sequence of TAG characters corresponding to a Q followed by a sequence of one or more digits, corresponding to a valid Wikidata QID. That QID should represent a visually depictable entity.
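The constraints above can be sketched in Python as follows. This is only an illustration, not part of the draft text: the function names are hypothetical, and it assumes the sequence is closed by U+E007F CANCEL TAG, the tag_end that UTS #51 uses for emoji tag sequences. It checks well-formedness only; whether the QID exists in Wikidata or names a visually depictable entity cannot be checked this way.

```python
import re

TAG_BASE = "\U0001F194"   # U+1F194 SQUARED ID
TAG_END = "\U000E007F"    # U+E007F CANCEL TAG (assumed tag_end, as in UTS #51)
TAG_OFFSET = 0xE0000      # a TAG character is its ASCII counterpart + 0xE0000

# tag_base, TAG LATIN CAPITAL LETTER Q, one or more TAG DIGITs, tag_end
QID_EMOJI = re.compile(
    "\U0001F194\U000E0051[\U000E0030-\U000E0039]+\U000E007F"
)


def encode_qid_emoji(qid: str) -> str:
    """Build a QID emoji tag sequence from a string such as 'Q459788'."""
    if not re.fullmatch(r"Q[0-9]+", qid):
        raise ValueError(f"not a QID: {qid!r}")
    tags = "".join(chr(TAG_OFFSET + ord(c)) for c in qid)
    return TAG_BASE + tags + TAG_END


def decode_qid_emoji(seq: str) -> str:
    """Recover the QID string from a well-formed QID emoji tag sequence."""
    if not QID_EMOJI.fullmatch(seq):
        raise ValueError("not a well-formed QID emoji tag sequence")
    # Drop tag_base and tag_end, then map TAG characters back to ASCII.
    return "".join(chr(ord(c) - TAG_OFFSET) for c in seq[1:-1])


if __name__ == "__main__":
    seq = encode_qid_emoji("Q459788")  # flag of NATO
    print([f"U+{ord(c):04X}" for c in seq])
    print(decode_qid_emoji(seq))
```

An implementation without a glyph for the sequence would still recognize it with the same pattern and show a fallback image.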

The validity and meaning of the Wikidata QID must be verifiable in Wikidata, such as the entry https://www.wikidata.org/wiki/Q459788 for the flag of NATO. (To find an appropriate Wikidata QID, it is often simplest to look for a Wikipedia article, such as https://en.wikipedia.org/wiki/Flag_of_NATO. On the Desktop version, click on the Tools > Wikidata item in the left navigation bar to get to the corresponding Wikidata QID.)

QID emoji tag sequences for flags or other symbols that represent an entity should use the QID for the flag or symbol itself if available, not the QID for the entity. For example, the QID emoji sequence for flag of NATO should use Q459788✦ (flag of NATO), not Q7184✦ (NATO).

Review Note: What if there is no separate QID for the symbol? Should the vendor add an item to Wikidata, or use the existing QID of the symbolized entity instead?

A “visually depictable entity” can be clearly represented by an emoji-style rendering that is sufficiently distinct and understandable at typical emoji sizes (such as 18-point text). The term entity includes not only concrete objects, but also actions and/or emotions. In addition:

The term understandable means most people familiar with the entity should be able to tell that the representation is intended to depict the entity, without foreknowledge. A symbol such as ♅ U+2645 Uranus would thus be excluded from a depiction of the planet Uranus (Q324), but a symbolic depiction would be appropriate when the depicted entity is itself a symbol.
Actions may be represented by capturing a person or object in the midst of that action, such as U+1F3C3 🏃 runner, or U+1F92E 🤮 vomiting face.
Emotions or mental states can be represented by a face or body exhibiting that emotion or state, such as U+1F620 😠 angry face.
The representation may use commonly understood “comic-style” visual elements, such as U+1F4AD (💭) thought bubble, motion lines (U+1F44B 👋, U+1F5E3 🗣), or other signifiers such as in U+1F634 😴.
Some current emoji were added for compatibility, and would not satisfy these criteria.

Only a subset of QIDs are associated with entities that would be appropriate for emoji. For example, risk management (Q189447) and this (Q3109046) would not be appropriate. Of those that are appropriate, Wikidata may not have associated images for the referenced entity, and such images would rarely, if ever, be appropriate for use as images for emoji. As with regular emoji, it is better to be more general: a puffin emoji should use the general QID for puffin (Q311761), not the QID for a specific species (Q26685) or breed/cultivar, even if the image happens to be of that species.

At this point, no QID emoji are in the RGI emoji tag sequence set (see ED-24). However, a QID emoji can be proposed for addition to the RGI emoji tag sequence set for a future version of Unicode Emoji. A QID emoji can also be proposed for encoding as a single Unicode emoji character in the same way. In both kinds of proposal, the proposed emoji must meet the other conditions of Submitting Emoji Proposals, especially the exclusions, such as no logos. Strong evidence must be provided for expected frequency of usage on a major platform, including comparisons of the actual frequency of use of that QID emoji to the standard reference emoji listed in Submitting Emoji Proposals.

Where ZWJ sequences are reasonable, implementations should prefer them over the QID emoji. They are shorter and occupy much less memory, and have a better fallback behavior when not supported on a target implementation.

For screen readers, it is anticipated that the normal behavior for an unknown QID tag sequence would be just to indicate that it is emoji. This is more information than would be provided for a PUA character, for example. For QID sequences that become popular, vendors could choose to provide more specific readings. Once a QID sequence is designated RGI, then Unicode would provide a short text-to-speech name (with localized versions available from CLDR) as it does for other RGI emoji.

Review Notes

We may want to have a more thorough elaboration of “visually depictable entity” in the Glossary of Unicode Terms, and point to that here. That elaboration can provide more examples, and be enhanced on a more timely basis.

The following are open issues; feedback on the pros and cons of the alternatives would be appreciated!

Issue 1: Length

The Emoji QID sequences take more memory than regular emoji. As of this writing, the maximum QID has 8 digits; such a sequence would take 44 bytes in UTF-8 or UTF-16 (11 code points of 4 bytes each), including the tag_base and tag_end.
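As a rough check on the sizes involved, the following Python sketch measures a QID emoji with a given number of digits. The helper name is hypothetical, and it assumes the layout tag_base (U+1F194) + TAG Q + TAG digits + tag_end (U+E007F CANCEL TAG). Every code point in that layout lies outside the Basic Multilingual Plane, so each one occupies 4 bytes in both UTF-8 and UTF-16.

```python
def qid_emoji_size(digits: int) -> dict:
    """Size of a QID emoji whose QID has the given number of digits.

    The digit values do not affect the size, so TAG DIGIT ZERO is used
    as a stand-in for every digit.
    """
    if digits < 1:
        raise ValueError("a QID has at least one digit")
    seq = (
        "\U0001F194"             # tag_base: SQUARED ID
        + "\U000E0051"           # TAG LATIN CAPITAL LETTER Q
        + "\U000E0030" * digits  # TAG DIGIT ZERO, repeated
        + "\U000E007F"           # tag_end: CANCEL TAG (assumed)
    )
    return {
        "code_points": len(seq),
        "utf8_bytes": len(seq.encode("utf-8")),
        "utf16_bytes": len(seq.encode("utf-16-le")),
    }


if __name__ == "__main__":
    print(qid_emoji_size(8))                    # longest QID length cited above
    print(len("\U0001F436".encode("utf-8")))    # a single-character emoji, for comparison
```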

We have 94 TAG values available, and could compress the decimal number into a base 64 value. For example, the 6 digits in 🐦Q218543✦ (White-crested tiger heron) turn into 3 values in base 64, <2F, 16, 35> (written in hexadecimal, least significant value first), and would be represented by a base emoji + 5 TAG characters instead of a base plus 8 TAG characters. In general, the sequences would take about 30-40% less memory.
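The arithmetic behind the base 64 example can be sketched as below. The function names are hypothetical, and the sketch deliberately stops at the numeric values: how each value from 0 to 63 would be assigned to a particular TAG character is not specified here, so no mapping is assumed. The TAG counts assume that the leading Q marker and the tag_end are kept in both forms.

```python
def to_base(n: int, base: int = 64) -> list:
    """Digits of n in the given base, least significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, remainder = divmod(n, base)
        digits.append(remainder)
        if n == 0:
            break
    return digits


def from_base(digits: list, base: int = 64) -> int:
    """Inverse of to_base: rebuild the integer from its digits."""
    return sum(d * base ** i for i, d in enumerate(digits))


def tag_count(value_count: int) -> int:
    """TAG characters needed: Q marker + values + tag_end (assumed layout)."""
    return 1 + value_count + 1


if __name__ == "__main__":
    values = to_base(218543)                 # QID number of Q218543
    print([f"{v:02X}" for v in values])      # hexadecimal, least significant first
    print(tag_count(len("218543")), "TAG characters in decimal")
    print(tag_count(len(values)), "TAG characters in base 64")
```

Storing the least significant value first is a choice made only to match the order in which the values are listed above; either order would round-trip.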

Or if length is not felt to be important, we could just use the decimal representation. However, note that the decimal notation is not particularly easier for users. The notation 0 used in this document is representing the Unicode character U+E0030 TAG DIGIT ZERO, which is normally invisible and doesn’t mean anything to users.

Issue 2: Tag Base

The proposal has a single tag base (U+1F194 SQUARED ID). One alternative approach is to allow the use of any existing emoji as a tag_base, such as the following:

Tag Sequence	Sample Image	Description
Q459788✦		flag of NATO
Q4545971✦		gelatin dessert
Q14384✦		Triceratops

The advantage of doing so is that the fallback would be better if the tag sequence is not supported. The implementation can display the base character with a small missing emoji overlaying it, or other similar mechanism, to provide some indication of the type of emoji that was intended.

The main problem with allowing any existing emoji as a base is that the same QID Emoji could be represented by different sequences. That sequence could be very inappropriate, such as the following:

🐦Q459788✦

Flag of NATO emoji

It would be startling, at a minimum, for someone to see the NATO flag fall back to a 🐦. So for that case, it is far better to use a 🏴 as the tag_base. But there would be no feasible way to impose constraints on tag_base to make it consistent with the QID.

Moreover, there might not be an obvious choice of tag_base character: for a falcon, for example, an implementation might choose either eagle 🦅 or the plain bird 🐦. And others might have no obvious base, such as a stroller. Different platforms could choose different bases, which is clearly not good for interoperability or consistency of fallback. On the other hand, this might sort itself out naturally, with the “first mover” effect.

Issue 3: Sequences

Currently emoji tag sequences are not full-fledged emoji, in that they can’t be part of other sequences. For example, they cannot appear in emoji ZWJ sequences, and thus could not be composed into longer emoji, such as in adding hair styles or gender. To address these issues, we could make changes to the definitions of emoji modifier base and emoji zwj element as shown below. The question is whether such changes would be warranted.

ED-12. emoji modifier base — A character whose appearance can be modified by a subsequent emoji modifier in an emoji modifier sequence.

emoji_modifier_base := \p{Emoji_Modifier_Base} | emoji_tag_sequence

ED-15a. emoji zwj element — A more limited element that can be used in an emoji ZWJ sequence, as follows:

emoji_zwj_element := emoji_character | emoji_presentation_sequence | emoji_modifier_sequence | emoji_tag_sequence

These changes would require some renumbering if we want to avoid forward references.

We could simplify the definitions yet further, and align more with UAX #31, with the following changes:

Adding emoji_tag_sequence to emoji_core_sequence, and dropping it from the definition of emoji_sequence
Replacing emoji_zwj_element by emoji_core_sequence, and dropping the definition ED-15a emoji_zwj_element

Issue 4: Registry

The Unicode Consortium plans to investigate maintaining a list of QID Emoji Tag Sequences known to be in use (but that are not RGI), so that people considering adding a QID Emoji Tag Sequence can see if someone else is already using a QID Emoji Tag Sequence for roughly the same thing. This needs investigation as to requirements, or whether it can be solved by guidance as to selecting QIDs.

Issue 5: Limiting the RGI emoji tag sequence set additions

Vendors have expressed concerns about limiting the total number of emoji added annually (see point 2 in Limitations on Emoji Encoding). The intent is that QID emoji added to the RGI emoji tag sequence set would be counted among the total number of RGI emoji sequences released per year, and subject to the same limitations.

Issue 6: Uniqueness and Stability of Emoji Representation

Background (note that the non-QID material here has been added to Emoji Encoding Principles): A given emoji may have multiple valid encoded representations. However, there is only one representation that is RGI Emoji. For example:

The flag of American Samoa (AS) has valid representations using either a pair of regional indicators or a tag sequence for “usas”. However, only the former is RGI.
Many existing RGI emoji also correspond to QIDs and would have valid representation as a QID sequence. However, that QID representation would never become RGI.
If an emoji QID sequence becomes popular, Unicode may define a different RGI representation using a character or sequence to save memory. The old QID representation could still be supported by fonts.
As an alternative to a QID sequence, vendors may use valid custom ZWJ sequences for platform-specific emoji that are not RGI. At some point in the future, these emoji could become RGI with possibly a different encoded representation; any older representation could still be supported by fonts.

Except for representations designated RGI, there is no guarantee of uniqueness or stability among possible valid representations for a given emoji.
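The American Samoa example can be made concrete with a short Python sketch that builds both valid representations. The function names are hypothetical. The code point arithmetic follows the regional indicator symbols (U+1F1E6 for A through U+1F1FF for Z) and the emoji tag sequence form used for subdivision flags (U+1F3F4 WAVING BLACK FLAG, TAG characters, U+E007F CANCEL TAG).

```python
def regional_indicator_flag(region: str) -> str:
    """Two-letter region code, such as 'AS', as a pair of regional indicators."""
    region = region.upper()
    if len(region) != 2 or not all("A" <= c <= "Z" for c in region):
        raise ValueError(f"not a two-letter region code: {region!r}")
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in region)


def subdivision_flag(code: str) -> str:
    """Subdivision code, such as 'usas', as an emoji tag sequence."""
    code = code.lower()
    if not code or not all(c.isascii() and c.isalnum() for c in code):
        raise ValueError(f"not a subdivision code: {code!r}")
    tags = "".join(chr(0xE0000 + ord(c)) for c in code)
    return "\U0001F3F4" + tags + "\U000E007F"


if __name__ == "__main__":
    rgi = regional_indicator_flag("AS")   # the RGI representation
    alt = subdivision_flag("usas")        # valid, but not RGI
    for name, seq in (("regional indicators", rgi), ("tag sequence", alt)):
        print(name, [f"U+{ord(c):04X}" for c in seq])
```

The two strings share no code points, which is why an implementation cannot treat them as the same emoji without an explicit mapping.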

Issue 7: Stability of QIDs

Can QIDs be deleted or moved on Wikidata? If so, might an old QID be repurposed with a new meaning?

From discussions with Wikidata, while there is no formal mechanism to guarantee QID stability, there are many practical considerations that make it very unlikely that a QID, once assigned, would be deleted or reassigned. We considered having a timestamp as part of the QID sequence, but that would be cumbersome and very rarely useful.

Related Q&A

Q: Will there be a way of identifying QIDs that correspond to Unicode emoji?
A: We don't anticipate having a normalization process for QID emoji. However, we can consider adding informative mappings between QIDs and emoji. This might be limited to RGI QIDs or be broader (see Issue 4).

Q: How are QIDs different from PUA characters?
A: They are similar in that an organization / company can use them without requiring any action by Unicode. They are different in that the semantics are much narrower, and the behavior (as emoji) specified, so that implementations know how to process them (such as in linebreak).

Q: Does the existence of a QID prevent encoding of the corresponding single character emoji?
A: No. Just because the dodo has a QID (https://www.wikidata.org/wiki/Q43502), that does not prevent a dodo emoji character from being added. In fact, if the QID is popular, that is good evidence of frequency of use for Submitting Emoji Proposals.

Q: Will screen readers choke on QID sequences?
A: Most screen readers won't read out a sequence of code points, since it would be meaningless to users. Certainly the hex codes are pointless (except for programmers). Any unknown QID sequence (or other emoji which the screen reader doesn't have a name for) could be read as "unknown emoji", corresponding to the "missing glyph" symbol that would be displayed when a font doesn't have a glyph for an emoji.

(It may be useful for screen readers to consider having a distinctive audible indication of an emoji. That could distinguish "she saw a poodle" from "she saw a 🐩". That would be similar to how TV captions for the deaf indicate that a phrase is being sung by using a 🎵. So "she saw a 🐩" could be read as "she saw a 🔔poodle🔔". Similarly, an unknown QID emoji could be read as "she saw a 🔔unknown🔔", where 🔔 represents a tone.)

Q: But what if a QID starts to be popular?
A: At that point, a more sophisticated screen reader could add a name for the sequence.
