Stability Policy for the Unicode Standard
Unlike many other standards, the Unicode Standard is continually
expanding — new characters are added to meet a variety of uses,
ranging from technical symbols to letters for archaic languages.
Character properties are also expanded or revised to meet
implementation requirements.
In each new version of the Unicode Standard, the Unicode Consortium may
add characters or make certain changes in characters
that were encoded in a previous version of the standard. To minimize the impact on
existing implementations, however, the Consortium limits the
types of changes that can be made.
This page lists the policies of the Unicode Consortium regarding character
encoding stability. The notation Unicode X.Y+ means
"The Unicode Standard, Version X.Y and all subsequent versions".
Encoding Stability
Applicable Version: Unicode 2.0+
Once a character is encoded, it will not be moved or
removed.
This ensures that implementers can always depend on each version of the Unicode
Standard being a superset of the previous version. The Unicode Standard may deprecate
the character (that is, formally discourage its use), but it will not reallocate, remove
or reassign the character.
Note: Ordering of characters is handled via
collation, not by moving
characters to different codepoints. For more information, see
UTS #10: Unicode Collation Algorithm
and the Unicode FAQ.
Name Stability
Applicable Version: Unicode 2.0+
Once a character is encoded, its character name will not
be changed.
Together with the limitations in name syntax,
this allows implementations to create unique identifiers from
character names. The character names are used to distinguish between characters, and do not
always express the full meaning of each character. They are designed to be used
programmatically, and thus must be stable.
In some cases the original name chosen to represent the character is inaccurate in
one way or another. Any such inaccuracies are dealt with by adding annotations to the
character name list (which is printed in the Unicode Standard and provided in a
machine-readable format),
or by adding descriptive text to the standard.
Note: It is possible to produce translated names for the characters, to
make the information conveyed by the name accessible to non-English speakers.
In cases of outright errors in character names such as misspellings,
a character may be given a formal name alias.
Formal Name Alias Stability
Applicable Version: Unicode 5.0+
Formal aliases, once assigned to a
character, will not be changed or removed.
Formal aliases are defined in the file NameAliases.txt in the
Unicode Character Database,
and listed in the character
code charts.
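As an illustration of how name stability and formal aliases work together, the following minimal sketch uses Python's standard unicodedata module (an assumption of this example, not something the policy prescribes). It assumes Python 3.3 or later, where unicodedata.lookup also accepts formal aliases. U+FE18 is a character whose published name contains a misspelling; the name stays frozen, and the corrected spelling is available only as an alias.

```python
import unicodedata

# U+FE18 was published with a misspelled name; under Name Stability
# that name can never be corrected in place.
ch = "\uFE18"
original_name = unicodedata.name(ch)

# The corrected spelling is a formal alias (from NameAliases.txt).
alias = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"

# Both the frozen name and the alias resolve to the same character.
by_name = unicodedata.lookup(original_name)
by_alias = unicodedata.lookup(alias)

print(original_name)
print(by_name == by_alias == ch)
```

Because neither the name nor the alias can change, either string can be stored as a long-lived identifier for the character.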
Named Character Sequence Stability
Applicable Version: Unicode 5.0+
Named character sequences will not be changed or removed.
This stability guarantee applies both to the name of the named
character sequence and to the sequence of characters so named.
Named character sequences are defined in file NamedSequences.txt
in the Unicode Character
Database. For more information on named sequences, see UAX #34,
Named Sequences.
Note: There are also provisional named character
sequences, which are included in the Unicode Character Database but
which are not covered by this stability policy.
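A short sketch of what a named character sequence looks like in practice, again using Python's unicodedata module as an assumed tool (Python 3.3 or later, where unicodedata.lookup accepts named sequences). The sequence name used here is one of the entries in NamedSequences.txt.

```python
import unicodedata

# A named sequence maps one name to a fixed sequence of code points.
seq = unicodedata.lookup("LATIN SMALL LETTER R WITH TILDE")

# Show the code points that the name stands for.
code_points = [f"U+{ord(c):04X}" for c in seq]
print(code_points)
```

Under this policy, both the name and the code point sequence it denotes are fixed once the sequence is approved.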
Name Uniqueness
Applicable Version: Unicode 2.0+
The names of characters, formal
aliases, and named sequences are unique within a shared namespace.
The names of characters, named sequences, and formal aliases for
characters share a single namespace in which each name uniquely
identifies either a single character or a single named sequence. The
definition of uniqueness is not just a simple comparison of the
characters — instead, the loose matching rules from UCD.html in the
Unicode Character Database
are used.
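The following sketch shows one way loose matching of names could be implemented. It assumes the rule as formulated in UAX #44 (UAX44-LM2): ignore case, whitespace, underscores, and medial hyphens, except the hyphen in U+1180 HANGUL JUNGSEONG O-E. The function name loose_name_key is hypothetical, and treating a hyphen as medial when it has a letter or digit on both sides is this sketch's reading of the rule.

```python
import re


def loose_name_key(name: str) -> str:
    """Return a comparison key for a character name under loose matching."""
    upper = name.upper()
    # Special case: the hyphen in U+1180 HANGUL JUNGSEONG O-E is significant,
    # otherwise the name would collide with U+116C HANGUL JUNGSEONG OE.
    if re.sub(r"[\s_]+", " ", upper).strip() == "HANGUL JUNGSEONG O-E":
        return "HANGULJUNGSEONGO-E"
    # Drop medial hyphens: a hyphen with a letter or digit on both sides.
    no_medial = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", upper)
    # Drop whitespace and underscores.
    return re.sub(r"[\s_]+", "", no_medial)


# Two names are considered the same if their keys are equal.
print(loose_name_key("zero-width space") == loose_name_key("ZERO WIDTH SPACE"))
print(loose_name_key("TIBETAN LETTER -A") == loose_name_key("TIBETAN LETTER A"))
```

A non-medial hyphen, such as the one following the space in TIBETAN LETTER -A, is kept, so that name remains distinct from TIBETAN LETTER A.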
Normalization Stability
Applicable Version: Unicode 3.1+
If a string contains only characters from a given
version of Unicode, and it is put into a normalized form in accordance with that
version of Unicode, then the result will also be in that normalized form according to
any subsequent version of Unicode. The result will also be in that normalized form
according to any prior version of the standard that contains all of the characters in
the string (back to the first applicable version, Unicode 3.1).
In particular, once a character is encoded, its canonical combining class and
decomposition mapping will not be changed in a way that will destabilize normalization.
Thus, the following constraints will be maintained under all circumstances:
Decomposition Mapping
The decomposition mapping may not be
changed except for the correction of exceptional errors which meet
all of the following conditions (a)-(c):
- (a) There is a clear and evident error identified in the
  Unicode Character Database (such as a typographic mistake).
- (b) The error constitutes a clear violation of the
  Identity Stability policy.
- (c) The correction of such an error does not violate the
  following constraints (1)-(4):
  - (1) No character will be given a decomposition mapping when it did not
    previously have one.
  - (2) No decomposition mapping will be removed from a character.
  - (3) Decomposition mappings will not change in type (canonical to
    compatibility or vice versa).
  - (4) The number of characters in a decomposition mapping will not change.
Canonical Combining Class
Once a character is assigned, its
canonical combining class will not change.
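To make the two properties covered by these constraints concrete, the sketch below inspects them with Python's unicodedata module (an assumed tool, not part of the policy). The helper decomposition_info is a hypothetical name; it separates the decomposition type (canonical or a compatibility tag) from the mapping itself, which are the attributes that constraints (1)-(4) hold fixed.

```python
import unicodedata


def decomposition_info(ch: str):
    """Return (type, mapping) for a character's decomposition mapping.

    type is "canonical", a compatibility tag such as "<compat>", or None
    when the character has no decomposition mapping. mapping is a list of
    code points.
    """
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None, []
    parts = raw.split()
    if parts[0].startswith("<"):
        # Compatibility mappings carry a leading tag.
        return parts[0], [int(p, 16) for p in parts[1:]]
    return "canonical", [int(p, 16) for p in parts]


print(decomposition_info("\u00C1"))      # LATIN CAPITAL LETTER A WITH ACUTE
print(decomposition_info("\uFB01"))      # LATIN SMALL LIGATURE FI
print(unicodedata.combining("\u0301"))   # COMBINING ACUTE ACCENT
```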
Note: If an implementation normalizes a string that contains characters
that are not assigned in the version of Unicode that it supports, that string
might not be in normalized form according to a future version of Unicode. For
example, suppose that a Unicode 4.0 program normalizes a string that
contains new Unicode 4.1 characters. That string might not be normalized
according to Unicode 4.1.
Note: in versions prior to Unicode 4.1, there were exceptional cases
where the normalization algorithm had to be applied twice to put a string into
normalized form. See Corrigendum #5: Normalization Idempotency and
UAX #15: Normalization Forms.
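The guarantee only covers strings whose characters are all assigned in the version of Unicode that the implementation supports. A cautious implementation might therefore check for unassigned code points before treating a normalized string as stable. The following is a minimal sketch under that assumption, using Python's unicodedata module; the function names are hypothetical, and the check is made against whatever Unicode version the running Python supports (unicodedata.unidata_version).

```python
import unicodedata


def only_assigned(s: str) -> bool:
    """True if no code point in s has General_Category Cn.

    Cn covers unassigned code points (and noncharacters) in the
    Unicode version supported by this Python build.
    """
    return all(unicodedata.category(ch) != "Cn" for ch in s)


def stable_normalize(s: str, form: str = "NFC") -> str:
    """Normalize s, refusing input that falls outside the guarantee."""
    if not only_assigned(s):
        raise ValueError(
            "string contains code points unassigned in Unicode "
            + unicodedata.unidata_version
        )
    return unicodedata.normalize(form, s)


result = stable_normalize("A\u0301")
# Normalizing an already normalized string leaves it unchanged.
print(result == stable_normalize(result))
```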
Identity Stability
Applicable Version: Unicode 1.1+
Once a character is encoded, its properties may still be
changed, but not in such a way as to change the fundamental identity of the
character.
The consortium will endeavor to keep the values of the other properties as stable as
possible, but some circumstances may arise that require changing them. Particularly in
the situation where the Unicode Standard first encodes less-well documented characters
and scripts, the exact character properties and behavior initially may not be well
known. As more experience is gathered in implementing the characters, adjustments in the
properties may become necessary. Examples of such properties include, but are not
limited to, the following:
- General category
- Case mappings
- Bidi properties
- Compatibility decomposition tags (e.g. <font> vs. <compat>)
- Representative glyphs
However, character properties will not be changed in a way that would affect
character identity. For example, the representative glyph for U+0041 "A" cannot be
changed to "B"; the general category for U+0041 "A" cannot be changed to Ll
(lowercase letter); and the decomposition mapping for U+00C1 (Á) cannot be changed
to <U+0042, U+0301> (B, ´).
Property Value Stability
Applicable Version: indicated in the table below.
Values of certain properties are limited by the following constraints:
Applicable Versions | Constraint on property values
Unicode 1.1.5+ | Combining classes are limited to the values 0 to 255.
Unicode 1.1.5+ | All characters other than those of General Category M* have the combining class 0.
Unicode 2.0+ | Canonical and Compatibility mappings are always in canonical order, and the resulting recursive decomposition will also be in canonical order.
Unicode 2.0+ | Canonical mappings are always limited either to a single value or to a pair. The second character in the pair cannot itself have a canonical mapping.
Unicode 2.1.3+ | The General_Category values will not be further subdivided.
Unicode 3.0.0+ | The Bidi_Category values will not be further subdivided.
Unicode 3.1+ | The Noncharacter_Code_Point property is an immutable code point property, which means that its property values for all Unicode code points will never change.
Unicode 4.0+ | The Bidirectional Properties will be assigned so as to preserve canonical equivalence.
Unicode 4.1+ | All characters with the Lowercase property and all characters with the Uppercase property have the Alphabetic property.
Unicode 4.1+ | The Pattern_Syntax and Pattern_Whitespace properties are immutable code point properties, which means that their property values for all Unicode code points will never change.
These constraints ensure that implementers can
simplify or optimize certain aspects of their support for
character properties. Further description of these invariants is provided in
UCD.html
in the Unicode Character Database.
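Some of these invariants can be checked directly against a copy of the Unicode Character Database. The sketch below is one possible check, assuming Python's unicodedata module as the data source; the function name check_invariants is hypothetical. It covers only three of the constraints (combining class range, combining class 0 for characters outside General Category M*, and canonical mappings of length one or two), because the other properties are not exposed by that module.

```python
import sys
import unicodedata


def check_invariants():
    """Scan every code point and collect violations of three invariants."""
    violations = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        ccc = unicodedata.combining(ch)
        # Combining classes are limited to the values 0 to 255.
        if not 0 <= ccc <= 255:
            violations.append((cp, "combining class out of range"))
        # Only characters of General Category M* may have a nonzero class.
        if ccc != 0 and not unicodedata.category(ch).startswith("M"):
            violations.append((cp, "non-mark with nonzero combining class"))
        # Canonical mappings (no leading <tag>) are a single value or a pair.
        decomp = unicodedata.decomposition(ch)
        if decomp and not decomp.startswith("<"):
            if len(decomp.split()) not in (1, 2):
                violations.append((cp, "canonical mapping longer than a pair"))
    return violations


violations = check_invariants()
print(len(violations))
```

An implementation that relies on these invariants, for example by storing the combining class in a single byte, could run a check of this kind whenever it updates its character data.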
Identifier Stability
Applicable Version: Unicode 3.0+
All strings that are valid default Unicode identifiers
will continue to be valid default Unicode identifiers in all subsequent versions of
Unicode. Furthermore, default identifiers never contain characters with the Pattern_Syntax or Pattern_Whitespace properties.
If a string qualifies as an identifier under one version of Unicode, it will
qualify as an identifier under all future versions. The reverse is not true; an
identifier under Version 5.0 might not be an identifier under Version 4.0: it might contain
a character that was unassigned under Unicode 4.0, or (very rarely) a Unicode 4.0
character that wasn't an identifier character in Unicode 4.0, but became one in Unicode
5.0.
For more information, see UAX #31:
Identifier and Pattern Syntax.
Casefolding Stability
Applicable Version: Unicode 5.0+
Caseless matching of Unicode
strings used for identifiers is stable.
Casefolding stability ensures that identifiers created in
different versions of Unicode can be reliably matched in a
case-insensitive manner. For more information on identifiers
see UAX #31:
Identifier and Pattern Syntax. Identifiers commonly
exclude compatibility decomposable characters, and therefore
this policy formally applies only to strings normalized with NFKC.
The toCasefold() operation used for caseless matching is
defined by rule R4 under "Default Case Conversion" in
Section 3.13, Default Case Algorithms of the Unicode
Standard.
The formal statement of this policy is:
For each string S normalized to
NFKC and containing characters only from a given Unicode
Version, toCasefold(S) under that version is identical to
toCasefold(S) under any later version of Unicode.
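The formal statement can be read as a recipe for building a stable caseless comparison key: normalize to NFKC, then case fold. The following is a minimal sketch of that recipe in Python, assuming that str.casefold() (which applies Unicode full case folding) stands in for toCasefold and that unicodedata.normalize supplies NFKC. The function name caseless_key is hypothetical.

```python
import unicodedata


def caseless_key(identifier: str) -> str:
    """Return a key for case-insensitive comparison of identifiers."""
    # The policy applies to strings already normalized with NFKC.
    nfkc = unicodedata.normalize("NFKC", identifier)
    # Case fold the normalized string.
    return nfkc.casefold()


print(caseless_key("Stra\u00DFe") == caseless_key("STRASSE"))
print(caseless_key("\uFB01le") == caseless_key("FILE"))
```

Under this policy, a key computed this way for a string of characters assigned in a given Unicode version would be the same under any later version, so stored keys would not need to be recomputed when the underlying Unicode data is upgraded.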