L2/08-301 Title: Suggested Restructuring of UAX #15 Specification for Normalization. Source: Ken Whistler Date: August 8, 2008 Introduction I have a longstanding action item (108-A29a) to "Provide recommendations for how to restructure UAX #15 for clarity." As the first step for that, I provide here some detailed suggestions on how to restructure the core of UAX #15, i.e., the formal specification of Unicode normalization itself. One of the main issues with UAX #15 is that the specification of normalization depends on canonical decomposition and compatibility decomposition and the canonical ordering algorithm (all defined in Chapter 3 of TUS), but then introduces new definitions and another algorithm to deal with NFKD and NFKC. The result is fragmented and not easy to understand (over and above just the inherent complexity of the specification itself). After considering the potential ramifications of attempting to move relevant definitions out of Chapter 3 and into UAX #15, my current recommendation is instead to move the formal specification of normalization *into* Chapter 3 -- essentially restructuring the current Section 3.11, "Canonical Ordering Behavior" to make it instead Section 3.11, "Normalization Forms", with the complete specification for normalization in one place. In the process I also suggest a considerable tightening up of the definitions involved, while not trying to destabilize usage by introducing substantially new terminology. If the UTC is interested in pursuing this general direction, Mark and I can work on fleshing out the detail of this text and its implications for the rest of the text of UAX #15, and produce the formal Proposed Update draft for that. The idea would be that UAX #15 would still have substantial content, but rather than containing the formal definitions and formal specification, it would consist of the explanation of Unicode normalization, the examples and edge cases, version and stability information, all the optimization strategies for implementation, and so on. In doing this work, we (the editors) discovered that there was also a defect in the existing formal definitions of Compatibility Decomposition (D65) and Canonical Decomposition (D68) in the text in Chapter 3. We have been assuming all along that decompositions (and full decompositions) referred to Unicode character sequences, i.e., strings. But the definitions as currently written just refer to single characters. Unless this is fixed first, the formal definition of Unicode normalization gets rather clutzy and would require introducing new terms to distinguish a canonical decomposition (of a single character) and a canonical decomposition (of a string), when current de facto usage simply treats the canonical decomposition of a single character as the limiting case of the canonical decomposition of a string. =================================================================== [[ Corrections to defects in D65 and D68 in Section 3.7 Decomposition ]] D65 Compatibility decomposition: The decomposition of a character ... --> D65 Compatibility decomposition: The decomposition of a character or character sequence ... D68 Canonical decomposition: The decomposition of a character ... --> D68 Canonical decomposition: The decomposition of a character or character sequence ... =================================================================== [[ Surgery required to Section 3.11 to repurpose it as the normalization specification for the standard.]] [[ Move the subsection "Application of Combining Marks" out of 3.11 and tack it on to the end of 3.6 "Combination", where it more properly belongs. (Its occurrence in 3.11 is just the result of the long history of morphing of all this material from the earlier versions of the text.) That just leaves the "Combining Classes" and "Canonical Ordering" subsections in 3.11, which are relevant and on topic for a normalization section. 3.11 Normalization Forms [[ Rework the intro paragraphs of 3.11 so they properly introduce normalization as an issue for *all* Unicode character sequences and determination of their equivalence. The current text on canonical ordering behavior is out of date and refers to the pre-normalization notion of equivalence of combining character sequences as the issue of concern. It dates from Unicode 2.0: "The purpose of this section is to provide unambiguous interpretation of a combining character sequence." And in fact almost the entire text of the Canonical Ordering subsection, including the exact statement of the canonical ordering algorithm, dates from 2.0, verbatim. I won't try to do that redrafting of the intro text here, but it would draw, in part on the intro from UAX #15. ]] [[ Then continue with the Combining Classes subsection, which only needs some small tweaks. ]] [[ Then replace the Canonical Order section with the specification for normalization, drafted along the following lines ]] The specification of Unicode Normalization Forms applies to all Unicode coded character sequences [D12]. For clarity of exposition in the definitions and rules specified here, the terms "character" and "character sequence" are used, but coded character sequences refer also to sequences containing noncharacters or reserved code points. Unicode Normalization Forms are specified for *all* Unicode code points, and not just for ordinary, assigned graphic characters. Starters DK+1 Starter: Any code point (assigned or not) with combining class of zero. (ccc=0) Note that ccc=0 is the default value for the Canonical_Combining_Class property, so that all reserved code points are Starters by definition. Noncharacters are also Starters by definition. All control characters, format characters, and private-use characters are also Starters. Among the graphic characters, all those with General_Category values other than gc=M are Starters. Some combining marks have ccc=0 and thus are also Starters. Combining marks with ccc other than 0 are not Starters. [[ Insert table of combining marks and Starter status ]] Description gc ccc Starter? Non-spacing Mn 0 Yes 1..255 No Spacing Mc 0 Yes 216, 226 No Enclosing Me 0 Yes The term *Starter* refers, in concept, to the starting character of a combining character sequence [D56], because all combining character sequences except defective combining character sequences [D57] commence with a ccc=0 character -- in other words, they start with a Starter. However, because the specification of Unicode Normalization Forms must apply to all possible coded character sequences, and not just to typical combining character sequences, the behavior of a code point for Unicode Normalization Forms is specified entirely in terms of its status as a Starter or a non-starter, and its Decomposition_Mapping value. Canonical Ordering Algorithm DK+2 Reorderable Pair: Two adjacent characters A and B in a coded character sequence are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0. DK+3 Canonical Ordering Algorithm: The Canonical Ordering Algorithm is specified as follows: In a decomposed character sequence D, exchange the positions of the characters in each Reorderable Pair until the sequence contains no more Reorderable Pairs. * In effect, the Canonical Ordering Algorithm is a local bubble sort that guarantees that a Canonical Decomposition or a Compatibility Decomposition will contain no subsequences in which a combining mark is *followed* directly by another combining mark that has a lower, non-zero combining class. * Canonical ordering is defined in terms of application of the Canonical Ordering Algorithm to an *entire* decomposed sequence, because there are some unusual sequences of Unicode characters for which the result of decomposing each of those characters individually will not directly result in the entire sequence being in canonical order. [Pointer to examples] In practice, most decompositions for Unicode strings are already in canonical order. Canonical Composition Algorithm DK+4 Composition Exclusion: A Canonical Decomposable Character [D69] which has the property value Composition_Exclusion=True. * The list of Composition Exclusions is provided in CompositionExclusions.txt in the Unicode Character Database. DK+5 Singleton Decomposition: A canonical decomposition mapping to a single character other than the character itself. * Example: U+2126 OHM SIGN has a singleton decomposition to U+03A9 GREEK CAPITAL LETTER OMEGA. * A character with a singleton decomposition is often referred to simply as a *singleton* for short. DK+6 Non-starter Decomposition: A canonical decomposition mapping to a sequence of more than one character, for which the first character in that sequence is not a starter. * Example: U+0344 COMBINING GREEK DIALYTIKA TONOS has a non-starter decomposition to . DK+7 Full Composition Exclusion: A Canonical Decomposable Character which has the property value Full_Composition_Exclusion=True. * Full composition exclusions consist of the entire list of composition exclusions plus all characters with singleton decompositions or with non-starter decompositions. * For convenience in implementation of Unicode normalization, the derived property, Full_Composition_Exclusion is computed, and all characters with the property value Full_Composition_Exclusion=True are listed in DerivedNormalizationProps.txt in the Unicode Character Database. DK+8 Primary Composite: A Canonical Decomposable Character [D69] which is not a Full Composition Exclusion. * For any given version of the Unicode Standard, the list of primary composites can be computed by extracting all canonical decomposable characters from UnicodeData.txt in the Unicode Character Database, adding the list of precomposed Hangul syllables [D117], and subtracting the list of full decomposition exclusions. DK+9 Blocked: Let A and C be two characters in a coded character sequence . C is blocked from A if and only if ccc(A)=0 and there exists some character B between A and C in the coded character sequence, i.e., , and either ccc(B)=0 or ccc(B) >= ccc(C). DK+10 Non-blocked Pair: A pair of characters in a coded character sequence, in which C is not blocked from A. * It is important for proper implementation of the Canonical Composition Algorithm to be aware that a Non-blocked Pair need not be contiguous. DK+11 Canonical Composition Algorithm: The Canonical Composition Algorithm is specified as follows: Starting from the second character in the coded character sequence (of a Canonical Decomposition or Compatibility Decomposition) and proceeding sequentially to the final character, perform the following steps: 1. Seek back (left) in the coded character sequence from the character C to find the last Starter L preceding C in the character sequence. 2. If there exists a Primary Composite P which is canonically equivalent to the sequence , and if C is not blocked from L, replace L by P in the sequence and delete C from the sequence. * When the algorithm completes, all Non-blocked Pairs canonically equivalent to a Primary Composite will have been systematically replaced by those Primary Composites. * Note that the replacement of the Starter L in step 2 requires continuing to check the succeeding characters until the character at that position is no longer part of any Non-blocked Pair that can be replaced by a Primary Composite. [Pointer to examples] Normalization Forms The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal *decomposition* of equivalent sequences, and two of these forms are defined by maximal *composition* of equivalent sequences. Each is then differentiated based on whether it employs a Canonical Decompositions or Compatibility Decomposition. DK+12 Normalization Form D (NFD): The Canonical Decomposition of a coded character sequence. DK+13 Normalization Form KD (NFKD): The Compatibility Decomposition of a coded character sequence. DK+14 Normalization Form C (NFC): The Canonical Composition of the Canonical Decomposition of a coded character sequence. DK+15 Normalization Form KC (NFKC): The Canonical Composition of the Compatibility Decomposition of a coded character sequence. Logically, to get the NFD or NFKD (maximally decomposed) normalization form for a Unicode string, one first computes the full decomposition of that string and then applies the Canonical Ordering Algorithm to it. Logically, to get the NFC or NFKC (maximally composed) normalization form for a Unicode string, one first computes the NFD or NFKD normalization form for that string, and then applies the Canonical Composition Algorithm to it.