Source: Mark Davis
Date: July 10, 2006
Subject: Stable Normalization Forms


It has become clear that the stability of normalization is still the subject of misunderstandings. In particular, people don't realize how they can use the stability guarantees to handle normalization across versions. The following is a strawman proposal based on an idea of Asmus's, for introducing new terms that can be useful tools for writers of other specifications. The key is that once a string has been put into a "stable normalization format" (Stable NFC, Stable NFD,...) as defined below, it will never change under normalization (to NFC, NFD,..., respectively) in any version of Unicode, past or future. (That guarantee is already provided by the existing stability policies, but with this new definition it can be stated more clearly and succinctly.)

The changes needed in UAX#15 could be rather small, just new definitions and conformance clauses, although the document would benefit from some further explanatory information, which should be supplied by the editorial committee. (The links are to working-draft versions of the document, since the final 5.0 ones are not yet posted.)

(Ideally #15 would have a more thorough reorganization to make the concepts and implications of normalization more accessible.)

http://unicode.org/draft/reports/tr15/tr15.html#Conformance, add:

UAX15-C5. A process that purports to transform text according to the Stable Normalization Process must do so in accordance with the specifications in this document.

http://unicode.org/draft/reports/tr15/tr15.html#Specification, add:

R3. The Stable Normalization Process for a given normalization form (NFD, NFC, NFKD, or NFKC) is the same as the corresponding process for generating that form, except that the process must be aborted with an error if either of two error conditions occurs. The error conditions are:
  1. The string contains any code point that is unassigned according to the version of Unicode used for the normalization process. These are code points with the property values General_Category=Unassigned and Noncharacter_Code_Point=false.
  2. The string contains any sequence of characters matching those in Table 10: Problem sequences
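
The two error conditions above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not proposed specification text: the names stable_normalize, StableNormalizationError, is_noncharacter and the problem_sequences parameter are hypothetical, the "version of Unicode used" is whatever version the Python unicodedata module was built with, and the contents of Table 10 are not reproduced here, so the caller must supply them. The sketch also assumes the Table 10 entries can be treated as literal substrings.

```python
import unicodedata


class StableNormalizationError(ValueError):
    """Hypothetical error type: raised when the process must be aborted."""


def is_noncharacter(cp: int) -> bool:
    # Noncharacter_Code_Point=true: U+FDD0..U+FDEF, plus the last two
    # code points of every plane (U+nFFFE and U+nFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def stable_normalize(form: str, text: str, problem_sequences=()) -> str:
    """Normalize `text` to `form` ("NFC", "NFD", "NFKC" or "NFKD"),
    aborting with an error on either of the two error conditions.

    `problem_sequences` stands in for Table 10; it is an assumption of
    this sketch that the entries can be matched as literal substrings.
    """
    # Error condition 1: code points unassigned in the Unicode version
    # used by this process (General_Category=Cn, excluding noncharacters).
    for ch in text:
        if unicodedata.category(ch) == "Cn" and not is_noncharacter(ord(ch)):
            raise StableNormalizationError(
                "unassigned code point U+%04X" % ord(ch))

    # Error condition 2: any caller-supplied problem sequence.
    for seq in problem_sequences:
        if seq in text:
            raise StableNormalizationError("problem sequence %r" % seq)

    # Otherwise the result is the same as ordinary normalization.
    return unicodedata.normalize(form, text)
```

For example, stable_normalize("NFC", "e\u0301") returns "\u00e9", the same result as ordinary NFC, while a string containing a code point that the library's Unicode version does not assign raises the error instead of returning a result that a later version might normalize differently.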

When generating a stable normalized form, a process normalizing according to:
  • Once a string has been put into stable normalization format (Stable NFC, Stable NFD,...), it will never change under normalization (to NFC, NFD,..., respectively) in any version of Unicode, past or future.
  • The additional tests required for producing a stable normalized form can be easily implemented with a compact lookup table. (Most libraries supplying normalization functions also supply the required property tests.)