Entities and Named Sequences
Q: Does Unicode define named entities, the way HTML does?
A: No. The use of "entities" with names, such as "&nbsp;", "&ntilde;", "&lt;", and so on, to represent characters
or symbols is a convention used in markup languages such as HTML. Large numbers of named entities are defined in ISO standards
related to SGML, and common subsets of those entities are widely used in HTML and XHTML, but their definition is unrelated to
the Unicode Standard.
Q: Are named entities the same as characters?
A: No. A named entity is an abstract markup representation of a character. Entities were invented primarily to provide
a way to represent a character in text whose character set has no directly encoded character for it.
Q: Can named entities be mapped to Unicode characters?
A: Almost all widely used named entities have unambiguous mappings to Unicode characters. Indeed, this is required for
the use of named entities in HTML, because one of the steps in the interpretation of HTML text content is the conversion of any
named entities parsed out of the raw HTML text into their corresponding Unicode values.
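As a rough illustration of that conversion step, the sketch below uses Python's standard html module, which carries its own table of HTML named entities. The choice of Python and of this module is an assumption made for illustration; the Unicode Standard itself does not define this table.

```python
import html
from html.entities import name2codepoint

# Each named entity resolves to one specific Unicode code point.
for name in ("nbsp", "ntilde", "lt"):
    cp = name2codepoint[name]
    print(f"&{name}; -> U+{cp:04X}")

# html.unescape() performs the entity-to-character conversion on a whole string.
decoded = html.unescape("Espa&ntilde;a &lt; 3")
print(decoded)
```

The mapping table lives with the markup language tooling, which is consistent with the point above that entity definitions are separate from the character encoding.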
Q: But Unicode does have named sequences. What are those?
A: Unicode named character sequences consist of a formal association of a name, which looks just like a Unicode character
name, with a sequence of two or more Unicode characters. An example would be:
<0172, 0301> LATIN CAPITAL LETTER U WITH OGONEK AND ACUTE
U+0172 is the encoded Unicode character, LATIN CAPITAL LETTER U WITH OGONEK. U+0301 is the encoded Unicode character, COMBINING
ACUTE ACCENT. "<0172, 0301>" is not an encoded character, but just a formal syntax to indicate a sequence of two Unicode characters.
The name for this sequence follows the same conventions as for regular, encoded Unicode characters.
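The relationship between the sequence and its two encoded characters can be sketched in Python with the standard unicodedata module. This is only an illustration: it assumes a Python version (3.3 or later) whose unicodedata.lookup() is documented to accept named sequences, and it falls back to None if the name is not known to the local database.

```python
import unicodedata

sequence = "\u0172\u0301"  # the sequence <0172, 0301>

# Each element of the sequence is an encoded character with its own name.
element_names = [unicodedata.name(ch) for ch in sequence]
print(element_names)

# Assumption: lookup() resolves named sequences as well as character names.
# If this build's database lacks the name, keep going with None.
try:
    resolved = unicodedata.lookup("LATIN CAPITAL LETTER U WITH OGONEK AND ACUTE")
except KeyError:
    resolved = None

print(resolved == sequence)
```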
Q: Why bother defining named sequences? Why not just encode the character, instead?
A: When text elements that people want to treat as "a character" are already representable in Unicode text by a
sequence of already encoded characters, encoding another precomposed character for that sequence introduces multiple
representations of the same thing. Such an action would undermine the stability of Unicode text normalization, which is
subject to very strict stability guarantees.
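To see what "multiple representations of the same thing" looks like in practice, the sketch below compares a long-established precomposed character with its equivalent sequence, and then checks the sequence <0172, 0301>. It uses Python's unicodedata module as an assumed tool for illustration; the variable names are hypothetical.

```python
import unicodedata

# An existing case: "é" can be one precomposed character or a base letter
# followed by a combining mark. The two are canonically equivalent.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

different_code_points = precomposed != decomposed
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# <0172, 0301> has no precomposed form, so NFC leaves it as two code points.
u_ogonek_acute = "\u0172\u0301"
nfc_length = len(unicodedata.normalize("NFC", u_ogonek_acute))

print(different_code_points, same_after_nfc, nfc_length)
```

Normalization is what reconciles the two spellings of "é". Text containing <0172, 0301> already has a single normalized form, which is the situation the answer above aims to preserve.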
Q: So if you can't encode a precomposed character for a sequence, what's the use of defining a named sequence?
A: Sometimes other standards, specifications, or protocols need to refer very precisely to specific elements of a character
repertoire. For the Unicode Standard to simply say that some of those elements can already be represented "as sequences of Unicode
characters" may not be considered precise enough. ISO/IEC 10646 defines UCS Sequence Identifiers, of the form "<0172, 0301>" to specify
such sequences precisely. Unicode named character sequences add character-like names to such sequences, to make it easier to understand
what their intent is and to make it easier to refer to them.
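A minimal sketch of the "<0172, 0301>" notation follows, written in Python. The helper names to_usi and from_usi are hypothetical, and the code only handles the simple comma-separated hexadecimal form shown above; it is not a full implementation of the ISO/IEC 10646 syntax.

```python
def to_usi(text: str) -> str:
    """Format a string as a sequence identifier such as <0172, 0301>."""
    return "<" + ", ".join(f"{ord(ch):04X}" for ch in text) + ">"


def from_usi(usi: str) -> str:
    """Parse an identifier of the form <XXXX, XXXX, ...> back into a string."""
    inner = usi.strip()
    if not (inner.startswith("<") and inner.endswith(">")):
        raise ValueError(f"not a sequence identifier: {usi!r}")
    # int() tolerates the space that follows each comma.
    return "".join(chr(int(part, 16)) for part in inner[1:-1].split(","))


print(to_usi("\u0172\u0301"))
print(from_usi("<0172, 0301>") == "\u0172\u0301")
```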
Q: Where is the Unicode named character sequence syntax defined?
A: In UAX #34: Unicode Named Character Sequences.
Q: Who defines Unicode named character sequences? Can I just make them up myself?
A: The UTC defines Unicode named character sequences through a multiple-step approval process. Only a few of them are actually
approved — generally in response to formal requests originating in other standards organizations. They are not designed to serve as an
arbitrary end-user defined extension mechanism for the standard.
Q: So where can I find the list of approved Unicode named character sequences?
A: Those are listed in the data file NamedSequences.txt in the Unicode Character Database.
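For readers who want to load that file programmatically, here is a small Python sketch. It assumes the layout of one entry per line, with the name and the space-separated hexadecimal code points separated by a semicolon, and with "#" starting a comment. The function name parse_named_sequences is hypothetical, and the sample input reuses the single example given earlier in this FAQ rather than the real file contents.

```python
def parse_named_sequences(lines):
    """Map each sequence name to its string of characters."""
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, code_points = line.split(";", 1)
        table[name.strip()] = "".join(
            chr(int(cp, 16)) for cp in code_points.split()
        )
    return table


# Illustrative input only: one entry in the assumed file layout.
sample = [
    "# name;code point sequence",
    "LATIN CAPITAL LETTER U WITH OGONEK AND ACUTE;0172 0301",
]
sequences = parse_named_sequences(sample)
print(sequences)
```

In real use the lines would come from the downloaded data file, for example by passing an open file object to the function.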
Q: I know of many important character sequences for my language's writing system. Should I submit a proposal
to get them approved as Unicode named character sequences?
A: In general this is neither necessary nor desirable. There are thousands upon thousands of potentially significant sequences
of Unicode characters for various languages and writing systems around the world. Software can (and should) handle such sequences
appropriately, whether for sorting purposes, for rendering, or for natural language processing, and so on, based on their actual usage and
significance — without depending on decisions by the UTC to formally recognize particular sequences as Unicode named character sequences.
Unicode named character sequences are aimed at a much more limited task of referenceability between certain standards.
Q: But my language has N phonemes that are written as digraphs. Isn't it important that all those
digraphs be officially recognized and get official Unicode named character sequences?
A: Unicode named character sequences are not intended for such documentation and recognition purposes, especially
for sequences that are treated differently by different languages using a particular script. Named sequences are simply names
for specified sequences of Unicode characters, and convey no information about whether such a sequence is a phoneme or a digraph
or has any other special status for one or more writing systems for particular languages.
Q: But how can I get these sequences recognized?
A: The Unicode Common Locale Data Repository (CLDR) includes mechanisms for recognizing such sequences. If a sequence
is important for sorting in a particular language, such as "ch" in Slovak, then it needs to be entered into the CLDR
collation sequences. If the sequence is generally recognized as an element of the "alphabet" for a language, then it
should be added to the exemplar characters. For information on how to request such additions, see CLDR bugs.
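The effect of treating "ch" as a single unit for sorting can be sketched in plain Python. This is a toy model, not CLDR: the alphabet list is a simplified, hypothetical one without accented letters, and real implementations would rely on CLDR collation data through a collation library.

```python
# Simplified, hypothetical alphabet: "ch" is one element placed after "h".
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {element: i for i, element in enumerate(ALPHABET)}


def collation_elements(word):
    """Split a word into elements, treating "ch" as a single unit."""
    elements, i = [], 0
    word = word.lower()
    while i < len(word):
        if word.startswith("ch", i):
            elements.append("ch")
            i += 2
        else:
            elements.append(word[i])
            i += 1
    return elements


def sort_key(word):
    # Characters outside the toy alphabet sort after everything listed.
    return [RANK.get(e, len(ALPHABET)) for e in collation_elements(word)]


words = ["chata", "cesta", "idea", "hora"]
by_code_point = sorted(words)
with_ch_unit = sorted(words, key=sort_key)
print(by_code_point)
print(with_ch_unit)
```

With code point order, "chata" lands among the words starting with "c"; with the tailored key it moves after "hora", which is the kind of language-specific behavior that collation data expresses without any named sequence being involved.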
Q: Are there other ways, outside of Unicode, to get language-specific sequences recognized?
A: Often a good additional route to proper recognition of the details of the phonology and writing systems
of various languages is to work in open forums such as Wikipedia, which are increasingly drawing together sophisticated
and consistent documentation about languages and writing systems that can substantially aid in better software implementations.
Q&A contributed by [KW]