L2/00-268

Draft
(for final version see L2/00-279)
Unicode policies on the stability of character encodings

In a new version of the Unicode Standard, the Unicode Consortium may add characters or make certain changes to characters that were encoded in a previous version of the standard. The Consortium does, however, limit the changes that can be made. These limits matter to implementors because they bound the kinds of changes that can appear in future versions of the standard.

  1. The Unicode Consortium will not remove or move a character, once encoded.

    This ensures that implementors can always depend on each version of the Unicode Standard being a superset of the previous version.

  2. The Unicode Consortium will not change a character name, once encoded.

    Character names serve as identifiers that distinguish one character from another; they do not always express the full meaning of the character. They are designed to be used programmatically, and thus must be stable. In some cases the name originally chosen for a character is inaccurate in one way or another. Any such problems can be dealt with by adding annotations to the character. Organizations may also wish to produce translated names for the characters, to make the information conveyed by the name accessible to non-English speakers.
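
    As an illustration of programmatic use, the following minimal sketch uses Python's standard unicodedata module to map a character to its name and back. The choice of Python and of U+00E9 as the sample character are assumptions made for this example, not part of the policy.

```python
import unicodedata

# Sample character chosen for illustration: U+00E9.
sample = "\u00e9"

# Map the character to its (stable) character name.
name = unicodedata.name(sample)

# Map the name back to the character. Because names never change once
# encoded, a name stored by older software still resolves correctly.
round_trip = unicodedata.lookup(name)

print(name)
print(round_trip == sample)
```

    Software that stores names in data files or source code relies on this round trip continuing to work in later versions.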

  3. The Unicode Consortium will not change the canonical or compatibility decompositions in a way that would affect normalization, once characters are encoded.

    This means that a string containing only characters from version X of the Unicode Standard, once put into a normalization form, will remain in that normalization form under any future version of the Unicode Standard.
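
    The guarantee can be pictured with a small check, sketched here under the assumption that Python's unicodedata module is used for normalization. The helper name is_in_form is hypothetical. A string is taken to be "in" a form when normalizing it to that form leaves it unchanged.

```python
import unicodedata

def is_in_form(form: str, text: str) -> bool:
    """Return True if normalizing `text` to `form` leaves it unchanged."""
    return unicodedata.normalize(form, text) == text

# "e" followed by U+0301 COMBINING ACUTE ACCENT (a decomposed sequence).
decomposed = "e\u0301"

# Put the string into Normalization Form C once.
composed = unicodedata.normalize("NFC", decomposed)

# The policy implies that this check keeps succeeding for `composed`
# when the same test is run against any later version of the standard.
print(is_in_form("NFC", composed))
```

    A process that stores normalized strings can therefore compare them byte for byte with strings normalized by later software, as long as only characters from the older version are involved.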

  4. The structure of certain property values in the Unicode Character Database is subject to the following invariants.

    Further description of these invariants is provided in UnicodeData.html.

    • The General Category values will not be further subdivided.
    • The Bidi Category values will not be further subdivided.
    • Combining classes are limited to the values 0 to 255.
    • All characters other than those of General Category M* have the combining class 0.
    • Canonical and Compatibility mappings are always in canonical order, and the resulting recursive decomposition will also be in canonical order.
    • Canonical mappings are always limited either to a single value, or to a pair. The second character in the pair cannot itself have a canonical mapping.

  5. The Unicode Consortium may change other properties of characters.

    The Consortium will endeavor to keep these properties as stable as possible, but circumstances may arise that require changing them. In particular, as Unicode encodes less well-documented scripts (such as those for minority languages in Thailand), the exact character properties may not be known at the time the script is encoded. Properties that may change include:

    • General Category
    • Case Mappings
    • Bidi Properties
    • The type of compatibility decomposition (e.g. <font> vs. <compat>)
    • etc.
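
    Because these properties may differ between versions, software that caches them may wish to record which version of the character data they came from. The following is a minimal sketch, assuming Python's unicodedata module; the helper name property_snapshot and the choice of fields are hypothetical.

```python
import unicodedata

def property_snapshot(ch):
    """Return a few changeable properties of `ch`, tagged with the
    version of the character database they were read from."""
    return {
        "ucd_version": unicodedata.unidata_version,
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "uppercase": ch.upper(),
        # Includes the decomposition type tag, such as <compat>, if any.
        "decomposition": unicodedata.decomposition(ch),
    }

snapshot = property_snapshot("\ufb01")
print(snapshot["ucd_version"], snapshot["general_category"])
```

    A cache built this way can be discarded and rebuilt when the recorded version no longer matches the version in use.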