L2/06-350

Public Review Issue #95

Specification of a Stable Normalization Process

The stability of Unicode normalization has been the subject of a number of misunderstandings. In particular, implementers are often unclear about the meaning of the stability guarantees for normalization and how they impact the handling of normalization of Unicode strings across different versions of the Unicode Standard.

This background document introduces new terms that can be useful tools for writers of other specifications. It is proposed to specify a "Stable Normalization Process". The key concept is that once a Unicode string has been successfully normalized via the Stable Normalization Process, it will never change if subsequently normalized again, in any version of Unicode, past or future. (That guarantee is already provided by the existing normalization stability policies, but with this new definition it can be stated more clearly and succinctly.)

The changes to UAX #15 to specify the Stable Normalization Process could be rather small: new definitions and conformance clauses would be added without materially affecting the definition of any existing normalization form.

It is anticipated, however, that UAX #15 will also have further explanatory information added, and that a more thorough reorganization of the text of UAX #15 will be undertaken to make the concepts and implications of Unicode normalization more accessible to implementers.

(The links below are to final proposed update versions of UAX #15, since the final approved versions are not yet posted.)

To the section:

http://www.unicode.org/reports/tr15/tr15-26.html#Conformance

add:
 

UAX15-C5. A process that purports to transform text according to the Stable Normalization Process must do so in accordance with the specifications in this document.

To the section:

http://www.unicode.org/reports/tr15/tr15-26.html#Specification

add:

 

R3. The Stable Normalization Process for a given normalization form (NFD, NFC, NFKD, or NFKC) is the same as the corresponding process for generating that form, except that the process must be aborted with an error if either of two conditions occurs:
  1. The string contains any code point that is unassigned according to the version of Unicode used for the normalization process. These are code points with the property values General_Category=Unassigned and Noncharacter_Code_Point=false.
  2. The string contains any sequence of characters matching those in Table 11: Problem sequences.
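The two abort conditions can be sketched in Python as a thin wrapper around the standard `unicodedata` module. This is only an illustration, not a conformant implementation. The names `StableNormalizationError`, `is_noncharacter`, `has_problem_sequence`, and `stable_normalize` are hypothetical. The Unicode version applied is whatever the Python build ships (`unicodedata.unidata_version`). The problem-sequence check assumes the pattern "first character, one or more characters with non-zero canonical combining class, last character", and its default sets contain only the single Bengali example given in this document, not the full table.

```python
import unicodedata


class StableNormalizationError(ValueError):
    """Hypothetical error raised when the process must abort."""


def is_noncharacter(cp: int) -> bool:
    # Noncharacter_Code_Point=true: U+FDD0..U+FDEF and U+nFFFE / U+nFFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# ASSUMPTION: illustrative sets holding only the example from this document.
# A real implementation would take the full sets from the problem-sequence table.
EXAMPLE_FIRST = frozenset({"\u09C7"})
EXAMPLE_LAST = frozenset({"\u09BE"})


def has_problem_sequence(text, first=EXAMPLE_FIRST, last=EXAMPLE_LAST):
    # ASSUMPTION: pattern is first + (one or more ccc != 0) + last.
    n = len(text)
    for i, ch in enumerate(text):
        if ch not in first:
            continue
        j = i + 1
        while j < n and unicodedata.combining(text[j]) != 0:
            j += 1
        # Require at least one intervening combining character.
        if j > i + 1 and j < n and text[j] in last:
            return True
    return False


def stable_normalize(form, text):
    # Condition 1: unassigned code points (gc=Cn) that are not noncharacters.
    for ch in text:
        if unicodedata.category(ch) == "Cn" and not is_noncharacter(ord(ch)):
            raise StableNormalizationError(f"unassigned code point U+{ord(ch):04X}")
    # Condition 2: a problem sequence is present.
    if has_problem_sequence(text):
        raise StableNormalizationError("problem sequence in input")
    # Otherwise behave exactly like the ordinary normalization form.
    return unicodedata.normalize(form, text)
```

A caller would treat `StableNormalizationError` as "this string cannot be stably normalized with this version of the data" rather than passing the unnormalized text along.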
Examples:


Version | Example | Required Behavior
Unicode 3.2 | U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL (defined in Unicode 4.0) | must abort with an error
Unicode 3.2 | U+0237 (ȷ) LATIN SMALL LETTER DOTLESS J (defined in Unicode 4.1) | must abort with an error
Unicode 3.2 | U+04CF (ӏ) CYRILLIC SMALL LETTER PALOCHKA (defined in Unicode 5.0) | must abort with an error
Unicode 4.0 | U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL (defined in Unicode 4.0) | will accept the character
Unicode 4.0 | U+0237 (ȷ) LATIN SMALL LETTER DOTLESS J (defined in Unicode 4.1) | must abort with an error
Unicode 4.0 | U+0242 (ɂ) LATIN SMALL LETTER GLOTTAL STOP (defined in Unicode 5.0) | must abort with an error
Unicode 4.1 | U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL (defined in Unicode 4.0) | will accept the character
Unicode 4.1 | U+0237 (ȷ) LATIN SMALL LETTER DOTLESS J (defined in Unicode 4.1) | will accept the character
Unicode 4.1 | U+0242 (ɂ) LATIN SMALL LETTER GLOTTAL STOP (defined in Unicode 5.0) | must abort with an error
Unicode 5.0 | U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL (defined in Unicode 4.0) | will accept the character
Unicode 5.0 | U+0237 (ȷ) LATIN SMALL LETTER DOTLESS J (defined in Unicode 4.1) | will accept the character
Unicode 5.0 | U+0242 (ɂ) LATIN SMALL LETTER GLOTTAL STOP (defined in Unicode 5.0) | will accept the character
All Versions | U+09C7 (ে) BENGALI VOWEL SIGN E + U+0300 ( ̀) COMBINING GRAVE ACCENT + U+09BE (া) BENGALI VOWEL SIGN AA | must abort with an error if it encounters the sequence (an example from Table 10)
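The version dependence of the unassigned-code-point check can be observed with Python's `unicodedata` module, which exposes both the database version shipped with the interpreter and a frozen Unicode 3.2 database (`unicodedata.ucd_3_2_0`). The sketch below is illustrative only. The helper name `is_unassigned` is hypothetical, and what the "current" column shows depends on the Unicode version of the Python build in use.

```python
import unicodedata


def is_unassigned(ch, ucd=unicodedata):
    """True if ch is General_Category=Unassigned and not a noncharacter."""
    cp = ord(ch)
    nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    return ucd.category(ch) == "Cn" and not nonchar


old = unicodedata.ucd_3_2_0  # frozen Unicode 3.2 data
print("current database version:", unicodedata.unidata_version)
for ch in "\u0234\u0237\u04CF":
    # A process based on Unicode 3.2 data must abort on code points that
    # a process based on later data may accept.
    print(f"U+{ord(ch):04X}",
          "3.2 unassigned:", is_unassigned(ch, old),
          "current unassigned:", is_unassigned(ch))
```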
 
Notes: