Entities and Named Sequences

Q: Does Unicode define named entities, the way HTML does?

A: No. The use of "entities" with names, such as "&nbsp;", "&ntilde;", "&lt;", and so on, to represent characters or symbols is a convention used in markup languages such as HTML. Large numbers of named entities are defined in ISO standards related to SGML, and common subsets of those entities are widely used in HTML and XHTML, but their definition is unrelated to the Unicode Standard.

Q: Are named entities in HTML the same as characters?

A: No. They are an abstracted, markup representation of a character. They were invented primarily so that there was some way to represent character content in text using a character set for which there was no directly encoded character for the entity in question.

Q: Can named entities for HTML be mapped to Unicode characters?

A: Yes. Almost all widely used named entities have unambiguous mappings to Unicode characters. Indeed, this is required for the use of named entities in HTML, because one of the steps in the interpretation of HTML text content is the conversion of any named entities parsed out of the raw HTML text into their corresponding Unicode characters.
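
As a rough illustration of that conversion step, the following sketch uses Python's standard `html` module, chosen here only as a convenient example of an implementation. The sample string `raw` is made up for the demonstration.

```python
import html
from html.entities import name2codepoint

# Named entities as they might appear in raw HTML text (made-up sample).
raw = "Espa&ntilde;a&nbsp;&lt;3"

# html.unescape replaces character references with the Unicode
# characters they stand for.
decoded = html.unescape(raw)
print([f"U+{ord(ch):04X}" for ch in decoded])

# name2codepoint maps an entity name (without "&" and ";") to a code point.
for name in ("nbsp", "ntilde", "lt"):
    print(f"&{name}; -> U+{name2codepoint[name]:04X}")
```

The mapping table lives in the HTML tooling, not in the Unicode Character Database, which is consistent with the point that entity names are defined outside the Unicode Standard.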

Q: But Unicode does have named sequences. What are those?

A: Unicode named character sequences associate a name, which looks just like a Unicode character name, with a sequence of two or more Unicode characters. An example would be:

LATIN CAPITAL LETTER U WITH OGONEK AND ACUTE <0172, 0301>

where the angle brackets indicate that this is a sequence of the two encoded Unicode characters U+0172 LATIN CAPITAL LETTER U WITH OGONEK followed by U+0301 COMBINING ACUTE ACCENT. Although it has a formal name, the sequence "<0172, 0301>" is not an encoded character. The name for this sequence follows the same conventions as for regular, encoded Unicode characters.
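
A small sketch in Python can make the distinction concrete. It only inspects the two code points of the sequence with the standard `unicodedata` module; the variable name `seq` is an arbitrary choice for the example.

```python
import unicodedata

# The sequence <0172, 0301> discussed above.
seq = "\u0172\u0301"

# It consists of two encoded characters, each with its own character name.
for ch in seq:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The sequence as a whole is not a single encoded character,
# so its length in code points is 2.
print(len(seq))
```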

Q: Why bother defining named sequences? Why not just encode the character, instead?

A: When text elements that people want to treat as "a character" can already be represented in text by a sequence of characters encoded in Unicode, encoding another precomposed character for that sequence introduces multiple representations of the same thing. Such an action would undermine the stability of Unicode text normalization, which is subject to very strict stability guarantees.
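
The following sketch, again using Python's `unicodedata` module, shows what "multiple representations of the same thing" looks like for a character that was precomposed before the stability guarantees applied, and how normalization treats a sequence that has no precomposed form. The variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Two different code point sequences represent the same text element,
# so a plain comparison reports them as unequal.
print(precomposed == decomposed)

# Normalization reconciles them: both map to the same NFC string.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)

# For <0172, 0301> no precomposed character exists, so NFC leaves
# the sequence as two code points.
ogonek_acute = "\u0172\u0301"
nfc_c = unicodedata.normalize("NFC", ogonek_acute)
print(len(nfc_c))
```

If a precomposed character for the second case were encoded later, already normalized text containing the sequence would have to change its normalized form or keep a second spelling, which is the kind of instability the guarantees rule out.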

Q: So if you can't encode a precomposed character for a sequence, what's the use of defining a named sequence?

A: Sometimes other standards, specifications, or protocols need to refer very precisely to specific elements of a character repertoire. For the Unicode Standard simply to say that some of those elements can already be represented "as sequences of Unicode characters" may not be considered precise enough. ISO/IEC 10646 defines UCS Sequence Identifiers, of the form "<0172, 0301>", to specify such sequences precisely. Unicode named character sequences add character-like names to such sequences, to make their intent easier to understand and to make them easier to refer to.
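
As a minimal sketch, assuming only the shape shown above (hexadecimal code points separated by commas inside angle brackets), identifiers of this form can be produced and read back as follows. The helper names `to_usi` and `from_usi` are hypothetical, and the sketch does not attempt to implement the full formal syntax of ISO/IEC 10646.

```python
import re

# Hypothetical helpers; they follow only the "<0172, 0301>" shape.
_USI = re.compile(r"<\s*([0-9A-Fa-f]{4,6}(?:\s*,\s*[0-9A-Fa-f]{4,6})*)\s*>")

def to_usi(text: str) -> str:
    """Format a string as an identifier such as '<0172, 0301>'."""
    return "<" + ", ".join(f"{ord(ch):04X}" for ch in text) + ">"

def from_usi(usi: str) -> str:
    """Parse an identifier such as '<0172, 0301>' back into a string."""
    match = _USI.fullmatch(usi)
    if match is None:
        raise ValueError(f"not a sequence identifier: {usi!r}")
    code_points = re.split(r"\s*,\s*", match.group(1))
    return "".join(chr(int(cp, 16)) for cp in code_points)

print(to_usi("\u0172\u0301"))
print(from_usi("<0172, 0301>") == "\u0172\u0301")
```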

Q: Where is the Unicode named character sequence syntax defined?

A: In UAX #34: Unicode Named Character Sequences.

Q: Who defines Unicode named character sequences? Can I just make them up myself?

A: The Unicode Technical Committee (UTC) defines Unicode named character sequences through a multiple-step approval process. Only a few of them are actually approved, generally in response to formal requests originating in other standards organizations. They are not designed to serve as an arbitrary end-user defined extension mechanism for the standard.

Q: So where can I find the list of approved Unicode named character sequences?

A: Those are listed in the data file NamedSequences.txt in the Unicode Character Database.
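
A minimal sketch of reading that file, assuming its usual layout of "#" comment lines and data lines of the form "NAME;XXXX XXXX ...", might look like the following. The function name `parse_named_sequences` is hypothetical, and the two sample lines are a short illustrative excerpt in the file's format, not the full file.

```python
def parse_named_sequences(lines):
    """Return {name: string} from lines in 'NAME;XXXX XXXX ...' form."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, code_points = line.split(";", 1)
        table[name.strip()] = "".join(
            chr(int(cp, 16)) for cp in code_points.split()
        )
    return table

# Illustrative excerpt only; a real program would read the data file itself.
sample = """\
# Name of Sequence; Code Point Sequence
LATIN SMALL LETTER R WITH TILDE;0072 0303
TAMIL SYLLABLE SAI;0BB8 0BC8
""".splitlines()

sequences = parse_named_sequences(sample)
for name, text in sequences.items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

Some libraries expose this data directly; for example, Python's `unicodedata.lookup` is documented as accepting named sequences from version 3.3 onward.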

Q: Are Unicode named character sequences guaranteed to be stable?

A: Yes. Once a particular Unicode named character sequence has been finally approved, it will not be removed or changed. In order to allow sufficient time for review of named character sequences, a two-step process is used. First a named character sequence is provisionally approved and is listed in NamedSequencesProv.txt in the Unicode Character Database. Only later, after any feedback and any required corrections, is a named character sequence listed in NamedSequences.txt. Such entries are then stable.

Q: Does a character sequence need to have a name in order to be processed in a special way in an implementation?

A: No. Most implementations have many rules for processing characters that take certain sequences into account. Whether or not these sequences are named should not make any difference.

Q: I know of many important character sequences for my language's writing system. Should I submit a proposal to get them approved as Unicode named character sequences?

A: In general this is neither necessary nor desirable. There are thousands upon thousands of potentially significant sequences of Unicode characters for various languages and writing systems around the world. Software can (and should) handle such sequences appropriately, whether for sorting, rendering, natural language processing, or other purposes, based on their actual usage and significance, without depending on decisions by the UTC to formally recognize particular sequences as Unicode named character sequences. The primary purpose of Unicode named character sequences is to give other standards a way to reference such sequences as units. This is a much more limited task.

Q: But my language has N phonemes that are written as digraphs. Isn't it important that all those digraphs be officially recognized and get official Unicode named character sequences?

A: Unicode named character sequences are not intended for such documentation and recognition purposes, especially for sequences that are treated differently by different languages using a particular script. Named sequences are simply names for specified sequences of Unicode characters, and convey no information about whether such a sequence is a phoneme or a digraph or has any other special status for one or more writing systems for particular languages.

Q: But how can I get these sequences recognized?

A: The Unicode Common Locale Data Repository (CLDR) includes mechanisms for recognizing such sequences. If a sequence is important for sorting in a particular language, such as "ch" in Slovak, then it needs to be entered into the CLDR collation sequences. If the sequence is generally recognized as an element of the "alphabet" for a language, then it should be added to the exemplar characters.

For information on how to request such additions, see CLDR bugs.
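
To give a feel for what treating "ch" as a unit means for sorting, here is a toy sketch. It is not CLDR data or the Unicode Collation Algorithm: the alphabet is a simplified, hypothetical one (unaccented a to z, with "ch" placed after "h"), and `sort_key` is an illustrative helper.

```python
# Toy alphabet: a to z without diacritics, with "ch" placed after "h".
# This is a deliberate simplification, not CLDR's Slovak collation data.
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {unit: i for i, unit in enumerate(ALPHABET)}

def sort_key(word):
    """Split a lowercase word into units, treating 'ch' as a single unit."""
    key, i = [], 0
    while i < len(word):
        unit = "ch" if word[i:i + 2] == "ch" else word[i]
        key.append(RANK.get(unit, len(ALPHABET)))  # unknown letters sort last
        i += len(unit)
    return key

words = ["chlieb", "cena", "hora", "ihla", "dom"]

# Plain code point order puts "chlieb" among the words starting with "c".
print(sorted(words))

# With "ch" as a unit, "chlieb" sorts after "hora" and before "ihla".
print(sorted(words, key=sort_key))
```

Real implementations get this behavior from locale data rather than hard-coded tables, which is why the sequence needs to be present in the CLDR collation rules.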

Q: Are there other ways, outside of Unicode, to get language-specific sequences recognized?

A: Often a good additional route to get proper recognition for the details of the phonology and writing systems of various languages is to work in open forums such as Wikipedia, which are increasingly drawing together sophisticated and consistent documentation about languages and writing systems that can substantially aid better software implementations.