Entities and Named Sequences
Q: Does Unicode define named entities, the way HTML does?
A: No. The use of "entities" with names, such as "&lt;", and so on, to represent characters
or symbols is a convention used in markup languages such as HTML. Large numbers of named entities are defined in ISO standards
related to SGML, and common subsets of those entities are widely used in HTML and XHTML, but their definition is unrelated to
the Unicode Standard.
Q: Are named entities in HTML the same as characters?
A: No. They are an abstract markup representation of a character. They were invented primarily so that
character content could be represented in text whose character set has no directly encoded character for the
entity in question.
Q: Can named entities for HTML be mapped to Unicode characters?
A: Almost all widely used named entities have unambiguous mappings to Unicode characters. Indeed, this is required for
the use of named entities in HTML, because one of the steps in the interpretation of HTML text content is the conversion of any
named entities parsed out of the raw HTML text into their corresponding Unicode values.
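As an illustration of that conversion step, here is a minimal sketch using Python's standard library (html.entities.name2codepoint and html.unescape). The sample string and the three entity names are chosen only for demonstration; this is one library's table, not a definition from the Unicode Standard.

```python
import html
from html.entities import name2codepoint

# Entity name -> Unicode code point, taken from the standard library's HTML table.
for name in ("lt", "amp", "eacute"):
    cp = name2codepoint[name]
    print(f"&{name}; -> U+{cp:04X} {chr(cp)!r}")

# html.unescape converts named entities found in raw text to Unicode characters.
raw = "x &lt; y &amp;&amp; caf&eacute;"
decoded = html.unescape(raw)
print(decoded)
```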
Q: But Unicode does have named sequences. What are those?
A: Unicode named character sequences associate a name, which looks just like a Unicode character
name, with a sequence of two or more Unicode characters. An example would be:
<0172, 0301> LATIN CAPITAL LETTER U WITH OGONEK AND ACUTE
where the angle brackets indicate that this is a sequence of the two encoded Unicode characters U+0172 LATIN CAPITAL LETTER U WITH OGONEK followed by U+0301 COMBINING ACUTE ACCENT. Although it has a formal name, the sequence "<0172, 0301>" is not an encoded character.
The name for this sequence follows the same conventions as for regular, encoded Unicode characters.
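The following sketch shows the same example in Python. It assumes Python 3.3 or later, where unicodedata.lookup is documented to resolve named sequences as well as character names; whether a given name is found depends on the Unicode data version bundled with the interpreter.

```python
import unicodedata

# The sequence from the example above, built from its two code points.
sequence = "\u0172\u0301"

# Each element is an ordinary encoded character with its own character name.
component_names = [unicodedata.name(ch) for ch in sequence]
print(len(sequence), component_names)

# Assumption: Python 3.3+, where lookup() also accepts named sequences.
looked_up = unicodedata.lookup("LATIN CAPITAL LETTER U WITH OGONEK AND ACUTE")
print(looked_up == sequence)
```

The lookup returns a two-character string, which reflects the point above: the name refers to a sequence, not to a single encoded character.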
Q: Why bother defining named sequences? Why not just encode the character, instead?
A: When text elements that people want to treat as "a character"
can already be represented in text by a sequence of characters encoded
in Unicode, encoding another precomposed character for that sequence introduces multiple
representations of the same thing. Such an action would undermine the stability of Unicode text normalization, which is
subject to very strict stability guarantees.
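A short sketch of what "multiple representations" means in practice, using Python's unicodedata.normalize. The first pair (U+00E9 versus "e" plus U+0301) is an existing case of two equivalent spellings that normalization reconciles; the second shows that the sequence <0172, 0301> has no precomposed form, so NFC leaves it as two code points.

```python
import unicodedata

# An existing pair of canonically equivalent representations.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# The named sequence has no precomposed character, so NFC keeps two code points.
sequence = "\u0172\u0301"
nfc = unicodedata.normalize("NFC", sequence)
nfd = unicodedata.normalize("NFD", sequence)
print(same_after_nfc, len(nfc), len(nfd))
```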
Q: So if you can't encode a precomposed character for a sequence, what's the use of defining a named sequence?
A: Sometimes other standards, specifications, or protocols need to refer very precisely to specific elements of a character
repertoire. For the Unicode Standard to simply say that some of those elements can already be represented "as sequences of Unicode
characters" may not be considered precise enough. ISO/IEC 10646 defines UCS Sequence Identifiers, of the form "<0172, 0301>" to specify
such sequences precisely. Unicode named character sequences add character-like names to such sequences, to make it easier to understand
what their intent is and to make it easier to refer to them.
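For illustration, the "<0172, 0301>" notation can be produced and read back with a few lines of Python. The helper names usi and from_usi are hypothetical, and the sketch only mirrors the form shown in this FAQ (four-digit uppercase hexadecimal, comma-separated); it does not attempt to cover every detail of the ISO/IEC 10646 definition.

```python
def usi(text: str) -> str:
    """Format a string in the "<0172, 0301>" style shown above."""
    return "<" + ", ".join(f"{ord(ch):04X}" for ch in text) + ">"


def from_usi(identifier: str) -> str:
    """Turn an identifier such as "<0172, 0301>" back into a string."""
    inner = identifier.strip()[1:-1]  # drop the angle brackets
    return "".join(chr(int(part.strip(), 16)) for part in inner.split(","))


identifier = usi("\u0172\u0301")
print(identifier)
round_trip = from_usi(identifier)
```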
Q: Where is the Unicode named character sequence syntax defined?
A: In UAX #34: Unicode Named Character Sequences.
Q: Who defines Unicode named character sequences? Can I just make them up myself?
A: The UTC defines Unicode named character sequences through a multiple-step approval process. Only a few of them are actually
approved — generally in response to formal requests originating in other standards organizations. They are not designed to serve as an
arbitrary end-user defined extension mechanism for the standard.
Q: So where can I find the list of approved Unicode named character sequences?
A: Those are listed in the data file NamedSequences.txt in
the Unicode Character Database.
Q: Are Unicode named character sequences guaranteed to be stable?
A: Yes. Once a particular Unicode named character sequence has been finally approved, it will not be removed or changed. In order to allow sufficient time for review of named character sequences, a two-step process is used. First a named character sequence is provisionally approved and is listed in
NamedSequencesProv.txt in the
Unicode Character Database. Only later, after any feedback and any required corrections, is a named character sequence listed in
NamedSequences.txt. Such entries are then stable.
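A minimal sketch of reading such a data file in Python follows. It assumes the layout of one entry per line with the name, a semicolon, and space-separated hexadecimal code points, with "#" starting a comment. The SAMPLE text is a hypothetical excerpt built from the example in this FAQ, not a copy of the real file, and parse_named_sequences is an illustrative helper name.

```python
# Hypothetical excerpt in the assumed "NAME;XXXX XXXX" layout.
SAMPLE = """\
# Illustrative excerpt only
LATIN CAPITAL LETTER U WITH OGONEK AND ACUTE;0172 0301
"""


def parse_named_sequences(text: str) -> dict:
    """Map each sequence name to the string of characters it denotes."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, code_points = line.split(";", 1)
        table[name.strip()] = "".join(
            chr(int(cp, 16)) for cp in code_points.split()
        )
    return table


sequences = parse_named_sequences(SAMPLE)
print(sequences)
```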
Q: Does a character sequence need to have a name in order to be processed in a special way in an implementation?
A: No. Most implementations have many rules for processing characters that take certain sequences into account. Whether or not these sequences are named should not make any difference.
Q: I know of many important character sequences for my language's writing system. Should I submit a proposal
to get them approved as Unicode named character sequences?
A: In general this is neither necessary nor desirable. There are thousands upon thousands of potentially significant sequences
of Unicode characters for various languages and writing systems around the world. Software can (and should) handle such sequences
appropriately, whether for sorting purposes, for rendering, or for natural language processing, and so on, based on their actual usage and
significance — without depending on decisions by the UTC to formally recognize particular sequences as Unicode named character sequences.
The primary purpose of Unicode named character sequences is to give other standards a way to reference such sequences as units. This is a much more limited task.
Q: But my language has N phonemes that are written as digraphs. Isn't it important that all those
digraphs be officially recognized and get official Unicode named character sequences?
A: Unicode named character sequences are not intended for such documentation and recognition purposes, especially
for sequences that are treated differently by different languages using a particular script. Named sequences are simply names
for specified sequences of Unicode characters, and convey no information about whether such a sequence is a phoneme or a digraph
or has any other special status for one or more writing systems for particular languages.
Q: But how can I get these sequences recognized?
A: The Unicode Common Locale Data Repository (CLDR) includes mechanisms for recognizing such sequences. If a sequence is important for sorting
in a particular language, such as "ch" in Slovak, then it needs to
be entered into the CLDR
collation sequences. If the sequence is generally recognized as an
element of the "alphabet" for a language, then it should be added to the CLDR data for that language.
For information on how to request such additions, see the CLDR project.
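To make the sorting point concrete, here is a toy sketch in Python of treating "ch" as a single unit that sorts after "h", as it does in Slovak. The ALPHABET list, the sort_key helper, and the word list are all simplifications invented for this illustration: the alphabet ignores diacritics and case, and none of it is actual CLDR collation data. Real software would normally rely on a collation library driven by CLDR rather than a hand-written table.

```python
# Toy alphabet order: NOT CLDR data; diacritics and case are ignored.
ALPHABET = [
    "a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k", "l", "m",
    "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {unit: index for index, unit in enumerate(ALPHABET)}


def sort_key(word: str) -> list:
    """Split a word into sorting units, treating "ch" as a single unit."""
    key = []
    i = 0
    while i < len(word):
        if word[i:i + 2] == "ch":
            unit, i = "ch", i + 2
        else:
            unit, i = word[i], i + 1
        # Units outside the toy alphabet sort after everything else.
        key.append(RANK.get(unit, len(ALPHABET)))
    return key


words = ["chata", "cena", "hrad", "ihla", "dom"]
by_code_point = sorted(words)
tailored = sorted(words, key=sort_key)
print(by_code_point)
print(tailored)
```

With plain code point order, "chata" lands between "cena" and "dom"; with the tailored key it moves to after "hrad", because the sequence is compared as one unit.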
Q: Are there other ways, outside of Unicode, to get language-specific sequences recognized?
A: Often a good additional route to get proper recognition for the details of the phonology and writing systems
for various languages is to work in open forums such as Wikipedia, which are increasingly drawing together sophisticated
and consistent documentation about languages and writing systems that can substantially aid in better software implementations.