PRI #146 Background Document

Source: Ken Whistler

Date: March 18, 2009

**Important Note:** This Public Review Issue is very tightly coupled
with PRI #145, the Proposed Update for UAX #15, Unicode
Normalization Forms. Please review both this background
document and the Proposed Update UAX #15 text before
providing feedback, as the proposed text changes are
closely related.

This background document is based in part on L2/08-301, a document presented to the UTC to suggest reorganization of the discussion of Unicode normalization in UAX #15 and in Chapter 3.

The present document focuses on the changes to the text of Chapter 3, "Conformance", in the standard. Correlated changes to UAX #15 are dealt with separately in the Proposed Update for UAX #15.

The main issue noted here for UAX #15 is that the specification of Unicode normalization depends on canonical decomposition, compatibility decomposition, and the canonical ordering algorithm (all currently defined in Chapter 3 of the Unicode Standard), but then introduces new formal definitions and another algorithm to define NFKD and NFKC. The result is fragmented and not easy to understand (over and above the inherent complexity of the specification of Unicode normalization itself).

The current recommendation is to move
the formal specification of normalization *into* Chapter 3 --
essentially restructuring the current Section 3.11,
"Canonical Ordering Behavior" to make it instead Section 3.11,
"Normalization Forms", with the complete specification for
Unicode normalization in one place. This also involves
a considerable tightening up of the definitions involved,
but does not destabilize current usage by introducing
substantially new terminology.

Note that this is intended as a textual reorganization for clarity of exposition. There is no change whatsoever to the actual, substantive definition of the Unicode Normalization Algorithm. This textual reorganization will have no implications whatsoever for actual implementations of the algorithm, would not change the normalized forms for any Unicode character or string, and would not violate any stability guarantees for Unicode normalization.

The outcome would be that UAX #15 would still have substantial content, but rather than containing half of the formal definitions and formal specification of Unicode normalization, it would consist of the explanation of Unicode normalization, the examples and edge cases, version and stability information, all the optimization strategies for implementation, and so on. For that, see the Proposed Update for UAX #15.

There is also a textual defect in the existing formal definitions of Compatibility Decomposition (D65) and Canonical Decomposition (D68) in the text in Chapter 3. The specification has been assuming all along that decompositions (and full decompositions) referred to Unicode character sequences, i.e., strings. But the definitions as currently written nominally just refer to single characters. Unless this is fixed first, the formal definition of Unicode normalization gets rather awkward and would require introducing new terms to distinguish a canonical decomposition (of a single character) and a canonical decomposition (of a string), when current de facto usage simply treats the canonical decomposition of a single character as the limiting case of the canonical decomposition of a string.
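The limiting-case relationship described above can be illustrated with Python's standard `unicodedata` module (an illustrative sketch, not part of the proposed Chapter 3 text). For typical strings, the canonical decomposition of the whole string is just the concatenation of the per-character canonical decompositions; as noted later in this proposal, some unusual sequences additionally require whole-string canonical reordering.

```python
import unicodedata

# Decomposing a single character is the limiting case of decomposing
# a string: for this typical string, NFD of the whole string equals
# the concatenation of NFD applied to each character.
single = unicodedata.normalize("NFD", "\u00C1")          # Á (A WITH ACUTE)
assert single == "A\u0301"                               # A + COMBINING ACUTE

s = "\u00C1\u1EA5"                                       # Á + ấ
per_char = "".join(unicodedata.normalize("NFD", ch) for ch in s)
assert unicodedata.normalize("NFD", s) == per_char
```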

The UTC is interested in any feedback on the changes proposed here for Chapter 3 to address Unicode normalization.

The detailed proposal for text changes to Chapter 3 is appended here for reference. Text enclosed in [[ double brackets ]] consists of editorial notes, intended to clarify the edits or the reasons for them. These editorial notes will not appear in the final edited text for Chapter 3. When noting locations within the text of The Unicode Standard, Version 5.0, the abbreviation "TUS 5.0" is used.

[[ Update to the conformance clauses for Unicode normalization in Section 3.2, Conformance Requirements, p. 74 of TUS 5.0. This update simply points the conformance clauses to the updated Section 3.11 specification, instead of to UAX #15. ]]

Normalization Forms

C13 A process that produces Unicode text that purports to be in a Normalization Form shall do so in accordance with the specifications in Section 3.11, "Normalization Forms."

C14 A process that tests Unicode text to determine whether it is in a Normalization Form shall do so in accordance with the specifications in Section 3.11, "Normalization Forms."

[[ Note that the corresponding existing conformance clauses in UAX #15 will point to Section 3.11 also and refer to the conformance clauses, C13 - C14 in Chapter 3. In this way a formal claim of conformance to UAX #15 for Unicode normalization will stay accurate (and need not be updated in other standards or specifications), even though the bulk of the definitions and specification will have been moved to Section 3.11 in Chapter 3. ]]

[[ Corrections to defects in D65 and D68 in Section 3.7 Decomposition, pp. 95 - 96 of TUS 5.0. ]]

D65 Compatibility decomposition: The decomposition of a character ...
→

D65 Compatibility decomposition: The decomposition of a character
or character sequence ...

D68 Canonical decomposition: The decomposition of a character ...
→

D68 Canonical decomposition: The decomposition of a character
or character sequence ...

[[ Some reordering of subsections is required for Section 3.11, to repurpose it as the normalization specification for the standard. The subsection "Application of Combining Marks" will be moved out of 3.11 and be tacked on to the end of 3.6 "Combination", where it more properly belongs. (Its occurrence in 3.11 is just the result of the long history of morphing of all this material from the earlier versions of the text.) That just leaves the "Combining Classes" and "Canonical Ordering" subsections in 3.11, which are relevant and on topic for a normalization section. Section 3.11 is then retitled: ]]

**3.11 Normalization Forms**

[[ The introductory paragraphs of 3.11 are then reworded as follows so that they properly introduce normalization as an issue for all Unicode character sequences and the determination of their equivalence. ]]

The concepts of canonical equivalent [D70] or compatibility
equivalent [D67] characters in the Unicode Standard make it
necessary to have a full, formal definition of equivalence
for Unicode strings. String equivalence of this sort is usually
referred to as *normalization*, since determining it requires
defining algorithms that convert strings into normalized
forms which can then be compared directly for identity.

This section provides the formal definitions of the four Unicode Normalization Forms. It defines the Canonical Ordering Algorithm and the Canonical Composition Algorithm which are used to convert Unicode strings to one of the Unicode Normalization Forms for comparison. It also formally defines Unicode Combining Classes -- values assigned to all Unicode characters and used by the Canonical Ordering Algorithm.

Note: In versions of the Unicode Standard up to Version 5.1.0, the Unicode Normalization Forms and the Canonical Composition Algorithm were defined in UAX #15, Unicode Normalization Forms. Those definitions have now been consolidated in this chapter, for clarity of exposition of the normative definitions and algorithms involved in Unicode normalization. However, because implementation of Unicode normalization is quite complex, implementers are still advised to fully consult UAX #15, which contains more detailed explanations, examples, and implementation strategies.

Unicode normalization should be carefully distinguished
from Unicode collation. Both processes involve comparison
of Unicode strings. However, the point of Unicode normalization
is to make a determination of canonical (or compatibility)
equivalence or non-equivalence of strings -- it does not provide
any rank-ordering information about those strings. Unicode
collation, on the other hand, is designed to provide orderable
weights or "keys" for strings; those keys
can then be used to sort strings into ordered lists. Unicode
normalization is *not* tailorable; normalization equivalence
relationships between strings are exact and unchangeable.
Unicode collation, on the other hand, is designed to be
tailorable to allow many kinds of localized and other
specialized orderings of strings. For more about Unicode
collation, see UTS #10, "Unicode Collation Algorithm".
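The distinction drawn above can be made concrete with a small Python sketch (illustrative only; actual collation requires a UCA implementation such as ICU, which is not shown here). Normalization answers only "equivalent or not"; it supplies no ordering:

```python
import unicodedata

precomposed = "\u00E9"      # é, single code point
decomposed  = "e\u0301"     # e + COMBINING ACUTE ACCENT

# Normalization: an exact, untailorable equivalence test.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)

# But it provides no rank-ordering: raw code point comparison is
# not collation ("é" sorts after "z" by code point, which is wrong
# for most locales).
assert "z" < "\u00E9"
```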

[[ The Combining Classes subsection on pp. 114-115 of TUS 5.0 stays effectively unchanged. It will only need some very small editorial tweaks when the rest of Section 3.11 is restructured. ]]

[[ Then replace the existing Canonical Ordering subsection, pp. 115 - 117 of TUS 5.0, with the full specification for normalization, drafted as follows: ]]

**Specification of Unicode Normalization Forms**

The specification of Unicode Normalization Forms applies to all
Unicode coded character sequences [D12]. For clarity of exposition in
the definitions and rules specified here, the terms "character"
and "character sequence" are used, but coded
character sequences refer also to sequences containing noncharacters
or reserved code points. Unicode Normalization Forms are specified
for *all* Unicode code points, and not just for ordinary, assigned
graphic characters.

[[ In the following text, the convention "DK+1" refers to definition numbers. When the final text is edited, "K" will be evaluated, and these will turn into specific definition numbers, for example, D107, D108, and so on. ]]

**Starters**

DK+1 Starter: Any code point (assigned or not) with combining class of zero (ccc=0).

Note that ccc=0 is the default value for the Canonical_Combining_Class property, so that all reserved code points are Starters by definition. Noncharacters are also Starters by definition. All control characters, format characters, and private-use characters are also Starters.

Among the graphic characters, all those with General_Category values other than gc=M are Starters. Some combining marks have ccc=0 and thus are also Starters. Combining marks with ccc other than 0 are not Starters. Table 3-X summarizes the relationship between types of combining marks and their status as Starters.

[[ Insert table 3-X. Caption: Combining Marks and Starter Status ]]

| Description | gc | ccc | Starter? |
|-------------|----|----------|-----|
| Non-spacing | Mn | 0 | Yes |
| Non-spacing | Mn | 1..255 | No |
| Spacing | Mc | 0 | Yes |
| Spacing | Mc | 216, 226 | No |
| Enclosing | Me | 0 | Yes |

The term *Starter* refers, in concept, to the starting character
of a combining character sequence [D56], because all combining
character sequences except defective combining character sequences
[D57] commence with a ccc=0 character -- in other words, they
start with a Starter. However, because the specification
of Unicode Normalization Forms must apply to all possible coded character
sequences, and not just to typical combining character sequences, the
behavior of a code point for Unicode Normalization Forms is specified
entirely in terms of its status as a Starter or
a non-starter, together with its Decomposition_Mapping value.
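The Starter test reduces to a single property check. A minimal sketch in Python, using the standard `unicodedata` module (whose `combining()` function returns 0 for every code point carrying the default Canonical_Combining_Class value, including noncharacters and reserved code points):

```python
import unicodedata

def is_starter(cp: str) -> bool:
    # A Starter is any code point whose Canonical_Combining_Class is 0.
    return unicodedata.combining(cp) == 0

assert is_starter("A")           # ordinary graphic character
assert is_starter("\u200D")      # format character (ZERO WIDTH JOINER)
assert is_starter("\uFFFF")      # noncharacter: default ccc=0
assert not is_starter("\u0301")  # COMBINING ACUTE ACCENT, ccc=230
assert not is_starter("\u0327")  # COMBINING CEDILLA, ccc=202
```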

**Canonical Ordering Algorithm**

DK+2 Reorderable Pair: Two adjacent characters A and B in a coded character sequence <A, B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.

DK+3 Canonical Ordering Algorithm: The Canonical Ordering Algorithm is specified as follows:

In a decomposed character sequence D, exchange the positions of the characters in each Reorderable Pair until the sequence contains no more Reorderable Pairs.

* In effect, the Canonical Ordering Algorithm is a local bubble sort that
guarantees that a Canonical Decomposition or a Compatibility
Decomposition will contain no subsequences
in which a combining mark is *followed* directly by another combining mark
that has a lower, non-zero combining class.

* Canonical ordering is defined in terms of application of the
Canonical Ordering Algorithm to an *entire* decomposed sequence, because
there are some unusual sequences of Unicode characters for which the
result of decomposing each of those characters individually will
not directly result in the entire sequence being in canonical order.
[Pointer to examples] In practice, most decompositions for Unicode strings
are already in canonical order.
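The local bubble sort described above can be sketched directly from DK+2 and DK+3 (an illustrative Python sketch, not optimized for production use):

```python
import unicodedata

def canonical_order(decomposed: str) -> str:
    # Exchange the positions of the characters in each Reorderable Pair
    # (ccc(A) > ccc(B) > 0) until no Reorderable Pairs remain.
    chars = list(decomposed)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            if a > b > 0:                      # Reorderable Pair
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)

# q + acute (ccc=230) + cedilla (ccc=202): the marks form a Reorderable
# Pair, so the cedilla moves before the acute.
assert canonical_order("q\u0301\u0327") == "q\u0327\u0301"
# Cross-check against the built-in normalizer:
assert canonical_order("q\u0301\u0327") == \
       unicodedata.normalize("NFD", "q\u0301\u0327")
```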

**Canonical Composition Algorithm**

DK+4 Composition Exclusion: A Canonical Decomposable Character [D69] which has the property value Composition_Exclusion=True.

* The list of Composition Exclusions is provided in CompositionExclusions.txt in the Unicode Character Database.

DK+5 Singleton Decomposition: A canonical decomposition mapping to a single character other than the character itself.

* Example: U+2126 OHM SIGN has a singleton decomposition to U+03A9 GREEK CAPITAL LETTER OMEGA.

* A character with a singleton decomposition is often referred
to simply as a *singleton* for short.

DK+6 Non-starter Decomposition: A canonical decomposition mapping to a sequence of more than one character, for which the first character in that sequence is not a Starter.

* Example: U+0344 COMBINING GREEK DIALYTIKA TONOS has a non-starter decomposition to <U+0308 COMBINING DIAERESIS, U+0301 COMBINING ACUTE ACCENT>.
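Both examples can be verified against the Unicode Character Database data exposed by Python's `unicodedata` module (an illustrative check, not part of the proposed text):

```python
import unicodedata

# Singleton decomposition: U+2126 OHM SIGN maps to the single,
# different character U+03A9 GREEK CAPITAL LETTER OMEGA.
assert unicodedata.decomposition("\u2126") == "03A9"

# Non-starter decomposition: U+0344 maps to a sequence whose first
# character, U+0308 COMBINING DIAERESIS (ccc=230), is not a Starter.
assert unicodedata.decomposition("\u0344") == "0308 0301"
assert unicodedata.combining("\u0308") == 230
```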

DK+7 Full Composition Exclusion: A Canonical Decomposable Character which has the property value Full_Composition_Exclusion=True.

* Full composition exclusions consist of the entire list of composition exclusions plus all characters with singleton decompositions or with non-starter decompositions.

* For convenience in implementation of Unicode normalization, the derived property, Full_Composition_Exclusion is computed, and all characters with the property value Full_Composition_Exclusion=True are listed in DerivedNormalizationProps.txt in the Unicode Character Database.

DK+8 Primary Composite: A Canonical Decomposable Character [D69] which is not a Full Composition Exclusion.

* For any given version of the Unicode Standard, the list of primary composites can be computed by extracting all canonical decomposable characters from UnicodeData.txt in the Unicode Character Database, adding the list of precomposed Hangul syllables [D117], and subtracting the list of Full Composition Exclusions.

DK+9 Blocked: Let A and C be two characters in a coded character sequence <A, ... C>. C is blocked from A if and only if ccc(A)=0 and there exists some character B between A and C in the coded character sequence, i.e., <A, ... B, ... C>, and either ccc(B)=0 or ccc(B) >= ccc(C).

* Because the Canonical Composition Algorithm operates on a string which is already in canonical order, testing whether a character is blocked requires looking only at the immediately preceding character in the string.

DK+10 Non-blocked Pair: A pair of characters <A, ... C> in a coded character sequence, in which C is not blocked from A.

* It is important for proper implementation of the Canonical Composition Algorithm to be aware that a Non-blocked Pair need not be contiguous.

DK+11 Canonical Composition Algorithm: The Canonical Composition Algorithm is specified as follows:

Starting from the second character in the coded character sequence (of a Canonical Decomposition or Compatibility Decomposition) and proceeding sequentially to the final character, perform the following steps for each character C:

1. Seek back (left) in the coded character sequence from the character C to find the last Starter L preceding C in the character sequence.

2. If there exists a Primary Composite P which is canonically equivalent to the sequence <L, C>, and if C is not blocked from L, replace L by P in the sequence and delete C from the sequence.

* When the algorithm completes, all Non-blocked Pairs canonically equivalent to a Primary Composite will have been systematically replaced by those Primary Composites.

* The replacement of the Starter L in step 2 requires continuing to check the succeeding characters until the character at that position is no longer part of any Non-blocked Pair that can be replaced by a Primary Composite. [Pointer to examples]

* The character C in step 1 is not necessarily a non-starter. It
is necessary to check *all* characters in the sequence, because
there are sequences <L, C> where *both* L and C are Starters,
yet there is a Primary Composite P which is canonically equivalent
to that sequence. [Pointer to example]
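The points above can be observed through Python's built-in NFC normalizer, which implements this algorithm (an illustrative sketch; the Hangul and diaeresis examples are chosen here for demonstration):

```python
import unicodedata

# Both L and C may be Starters: HANGUL CHOSEONG KIYEOK (U+1100) and
# HANGUL JUNGSEONG A (U+1161) both have ccc=0, yet the pair is
# canonically equivalent to the Primary Composite U+AC00.
L, V = "\u1100", "\u1161"
assert unicodedata.combining(L) == 0 and unicodedata.combining(V) == 0
assert unicodedata.normalize("NFC", L + V) == "\uAC00"

# Blocking: in <a, diaeresis, diaeresis> the second diaeresis is
# blocked from the base letter by the first (equal ccc), so only the
# first diaeresis composes.
assert unicodedata.normalize("NFC", "a\u0308\u0308") == "\u00E4\u0308"
```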

**Definition of Normalization Forms**

The Unicode Standard specifies four normalization forms. Informally,
two of these forms are defined by maximal *decomposition* of equivalent
sequences, and two of these forms are defined by maximal *composition* of
equivalent sequences. Each is then differentiated based on whether it
employs a Canonical Decomposition or a Compatibility Decomposition.

DK+12 Normalization Form D (NFD): The Canonical Decomposition of a coded character sequence.

DK+13 Normalization Form KD (NFKD): The Compatibility Decomposition of a coded character sequence.

DK+14 Normalization Form C (NFC): The Canonical Composition of the Canonical Decomposition of a coded character sequence.

DK+15 Normalization Form KC (NFKC): The Canonical Composition of the Compatibility Decomposition of a coded character sequence.

Logically, to get the NFD or NFKD (maximally decomposed) normalization form for a Unicode string, one first computes the full decomposition of that string and then applies the Canonical Ordering Algorithm to it.

Logically, to get the NFC or NFKC (maximally composed) normalization form for a Unicode string, one first computes the NFD or NFKD normalization form for that string, and then applies the Canonical Composition Algorithm to it.
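The four forms, and the decompose-then-compose logic just described, can be illustrated with one input string in Python (an illustrative sketch using the standard `unicodedata` module):

```python
import unicodedata

s = "\uFB01\u0301"   # LATIN SMALL LIGATURE FI + COMBINING ACUTE ACCENT

# NFD: the ligature has no canonical decomposition, so s is unchanged.
assert unicodedata.normalize("NFD", s) == "\uFB01\u0301"

# NFKD: the compatibility decomposition splits the ligature into f + i.
assert unicodedata.normalize("NFKD", s) == "fi\u0301"

# NFC: nothing composes with the ligature, so s is again unchanged.
assert unicodedata.normalize("NFC", s) == "\uFB01\u0301"

# NFKC: after compatibility decomposition, i + acute recomposes to í.
assert unicodedata.normalize("NFKC", s) == "f\u00ED"
```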

============================== end Section 3.11 =======================