L2/08-301
Title: Suggested Restructuring of UAX #15 Specification
for Normalization.
Source: Ken Whistler
Date: August 8, 2008
Introduction
I have a longstanding action item (108-A29a) to "Provide
recommendations for how to restructure UAX #15 for clarity."
As the first step for that, I provide here some detailed
suggestions on how to restructure the core of UAX #15,
i.e., the formal specification of Unicode normalization
itself.
One of the main issues with UAX #15 is that the specification
of normalization depends on canonical decomposition
and compatibility decomposition and the canonical ordering
algorithm (all defined in Chapter 3 of TUS), but then
introduces new definitions and another algorithm to
deal with NFKD and NFKC. The result is fragmented and
not easy to understand (over and above just the inherent
complexity of the specification itself).
After considering the potential ramifications of attempting
to move relevant definitions out of Chapter 3 and into
UAX #15, my current recommendation is instead to move
the formal specification of normalization *into* Chapter 3 --
essentially restructuring the current Section 3.11,
"Canonical Ordering Behavior" to make it instead Section 3.11,
"Normalization Forms", with the complete specification for
normalization in one place. In the process I also suggest
a considerable tightening up of the definitions involved,
while not trying to destabilize usage by introducing
substantially new terminology.
If the UTC is interested in pursuing this general direction,
Mark and I can work on fleshing out the detail of this
text and its implications for the rest of the text of UAX #15,
and produce the formal Proposed Update draft for that.
The idea would be that UAX #15 would still have substantial
content, but rather than containing the formal definitions
and formal specification, it would consist of the explanation
of Unicode normalization, the examples and edge cases,
version and stability information, all the optimization
strategies for implementation, and so on.
In doing this work, we (the editors) discovered that there
was also a defect in the existing formal definitions of
Compatibility Decomposition (D65) and Canonical Decomposition
(D68) in the text in Chapter 3. We have been assuming all
along that decompositions (and full decompositions) referred
to Unicode character sequences, i.e., strings. But the
definitions as currently written just refer to single
characters. Unless this is fixed first, the formal
definition of Unicode normalization gets rather clutzy and
would require introducing new terms to distinguish
a canonical decomposition (of a single character) and a
canonical decomposition (of a string), when current de facto
usage simply treats the canonical decomposition of a single
character as the limiting case of the canonical decomposition
of a string.
===================================================================
[[ Corrections to defects in D65 and D68 in Section 3.7
Decomposition ]]
D65 Compatibility decomposition: The decomposition of a character ...
-->
D65 Compatibility decomposition: The decomposition of a character
or character sequence ...
D68 Canonical decomposition: The decomposition of a character ...
-->
D68 Canonical decomposition: The decomposition of a character
or character sequence ...
===================================================================
[[ Surgery required to Section 3.11 to repurpose it as the
normalization specification for the standard.]]
[[ Move the subsection "Application of Combining Marks" out
of 3.11 and tack it on to the end of 3.6 "Combination", where
it more properly belongs. (Its occurrence in 3.11 is just
the result of the long history of morphing of all this material
from the earlier versions of the text.) That just leaves the
"Combining Classes" and "Canonical Ordering" subsections
in 3.11, which are relevant and on topic for a normalization
section.
3.11 Normalization Forms
[[ Rework the intro paragraphs of 3.11 so they properly
introduce normalization as an issue for *all* Unicode
character sequences and determination of their equivalence.
The current text on canonical ordering behavior is out
of date and refers to the pre-normalization notion of
equivalence of combining character sequences as the issue
of concern. It dates from Unicode 2.0: "The purpose of this
section is to provide unambiguous interpretation of a combining
character sequence." And in fact almost the entire text of
the Canonical Ordering subsection, including the exact
statement of the canonical ordering algorithm, dates
from 2.0, verbatim. I won't try to do that redrafting
of the intro text here, but it would draw, in part on
the intro from UAX #15. ]]
[[ Then continue with the Combining Classes subsection,
which only needs some small tweaks. ]]
[[ Then replace the Canonical Order section with the
specification for normalization, drafted along the following lines ]]
The specification of Unicode Normalization Forms applies to all
Unicode coded character sequences [D12]. For clarity of exposition in
the definitions and rules specified here, the terms "character"
and "character sequence" are used, but coded
character sequences refer also to sequences containing noncharacters
or reserved code points. Unicode Normalization Forms are specified
for *all* Unicode code points, and not just for ordinary, assigned
graphic characters.
Starters
DK+1 Starter: Any code point (assigned or not) with combining
class of zero. (ccc=0)
Note that ccc=0 is the default value for the Canonical_Combining_Class
property, so that all reserved code points are Starters by
definition. Noncharacters are also Starters by definition.
All control characters, format characters, and private-use
characters are also Starters.
Among the graphic characters, all those with General_Category values
other than gc=M are Starters. Some combining marks have
ccc=0 and thus are also Starters. Combining marks with
ccc other than 0 are not Starters.
[[ Insert table of combining marks and Starter status ]]
Description gc ccc Starter?
Non-spacing Mn 0 Yes
1..255 No
Spacing Mc 0 Yes
216, 226 No
Enclosing Me 0 Yes
The term *Starter* refers, in concept, to the starting character
of a combining character sequence [D56], because all combining
character sequences except defective combining character sequences
[D57] commence with a ccc=0 character -- in other words, they
start with a Starter. However, because the specification
of Unicode Normalization Forms must apply to all possible coded character
sequences, and not just to typical combining character sequences, the
behavior of a code point for Unicode Normalization Forms is specified
entirely in terms of its status as a Starter or
a non-starter, and its Decomposition_Mapping value.
Canonical Ordering Algorithm
DK+2 Reorderable Pair: Two adjacent
characters A and B in a coded character sequence
are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.
DK+3 Canonical Ordering Algorithm: The Canonical
Ordering Algorithm is specified as follows:
In a decomposed character sequence D, exchange the positions
of the characters in each Reorderable Pair until the
sequence contains no more Reorderable Pairs.
* In effect, the Canonical Ordering Algorithm is a local bubble sort that
guarantees that a Canonical Decomposition or a Compatibility
Decomposition will contain no subsequences
in which a combining mark is *followed* directly by another combining mark
that has a lower, non-zero combining class.
* Canonical ordering is defined in terms of application of the
Canonical Ordering Algorithm to an *entire* decomposed sequence, because
there are some unusual sequences of Unicode characters for which the
result of decomposing each of those characters individually will
not directly result in the entire sequence being in canonical order.
[Pointer to examples] In practice, most decompositions for Unicode strings
are already in canonical order.
Canonical Composition Algorithm
DK+4 Composition Exclusion: A Canonical Decomposable Character [D69]
which has the property value Composition_Exclusion=True.
* The list of Composition Exclusions is provided in CompositionExclusions.txt
in the Unicode Character Database.
DK+5 Singleton Decomposition: A canonical decomposition mapping to a
single character other than the character itself.
* Example: U+2126 OHM SIGN has a singleton decomposition
to U+03A9 GREEK CAPITAL LETTER OMEGA.
* A character with a singleton decomposition is often referred
to simply as a *singleton* for short.
DK+6 Non-starter Decomposition: A canonical decomposition mapping to a
sequence of more than one character, for which the first character
in that sequence is not a starter.
* Example: U+0344 COMBINING GREEK DIALYTIKA TONOS has a non-starter
decomposition to __.
DK+7 Full Composition Exclusion: A Canonical Decomposable Character
which has the property value Full_Composition_Exclusion=True.
* Full composition exclusions consist of the entire list of
composition exclusions plus all characters with singleton
decompositions or with non-starter decompositions.
* For convenience in implementation of Unicode normalization, the
derived property, Full_Composition_Exclusion is computed, and
all characters with the property value Full_Composition_Exclusion=True
are listed in DerivedNormalizationProps.txt in the Unicode
Character Database.
DK+8 Primary Composite: A Canonical Decomposable Character [D69] which
is not a Full Composition Exclusion.
* For any given version of the Unicode Standard, the list of
primary composites can be computed by extracting all canonical
decomposable characters from UnicodeData.txt in the Unicode
Character Database, adding the list of precomposed Hangul
syllables [D117], and subtracting the list of full decomposition
exclusions.
DK+9 Blocked: Let A and C be two characters
in a coded character sequence . C is blocked
from A if and only if ccc(A)=0 and there exists some character B
between A and C in the coded character sequence, i.e.,
, and either ccc(B)=0 or ccc(B) >= ccc(C).
DK+10 Non-blocked Pair: A pair of characters in a coded
character sequence, in which C is not blocked from A.
* It is important for proper implementation of the Canonical Composition
Algorithm to be aware that a Non-blocked Pair need not be contiguous.
DK+11 Canonical Composition Algorithm: The Canonical Composition Algorithm
is specified as follows:
Starting from the second character in the coded character sequence
(of a Canonical Decomposition or Compatibility Decomposition) and proceeding
sequentially to the final character, perform the following steps:
1. Seek back (left) in the coded character sequence from
the character C to find the last Starter L preceding
C in the character sequence.
2. If there exists a Primary Composite P which is canonically
equivalent to the sequence , and if C is not blocked
from L, replace L by P in the sequence and delete C from
the sequence.
* When the algorithm completes, all Non-blocked Pairs canonically
equivalent to a Primary Composite will have been systematically
replaced by those Primary Composites.
* Note that the replacement of the Starter L in step 2 requires
continuing to check the succeeding characters until the character
at that position is no longer part of any Non-blocked Pair that can
be replaced by a Primary Composite. [Pointer to examples]
Normalization Forms
The Unicode Standard specifies four normalization forms. Informally,
two of these forms are defined by maximal *decomposition* of equivalent
sequences, and two of these forms are defined by maximal *composition* of
equivalent sequences. Each is then differentiated based on whether it
employs a Canonical Decompositions or Compatibility Decomposition.
DK+12 Normalization Form D (NFD): The Canonical Decomposition of
a coded character sequence.
DK+13 Normalization Form KD (NFKD): The Compatibility Decomposition
of a coded character sequence.
DK+14 Normalization Form C (NFC): The Canonical
Composition of the Canonical Decomposition of a coded character
sequence.
DK+15 Normalization Form KC (NFKC): The Canonical
Composition of the Compatibility Decomposition of a coded character
sequence.
Logically, to get the NFD or NFKD (maximally decomposed) normalization
form for a Unicode string, one first computes the full decomposition
of that string and then applies the Canonical Ordering Algorithm to it.
Logically, to get the NFC or NFKC (maximally composed) normalization
form for a Unicode string, one first computes the NFD or NFKD normalization
form for that string, and then applies the Canonical Composition Algorithm
to it.
__