L2/08-301


Title:  Suggested Restructuring of UAX #15 Specification
        for Normalization.
       
Source: Ken Whistler

Date:   August 8, 2008


Introduction

I have a longstanding action item (108-A29a) to "Provide
recommendations for how to restructure UAX #15 for clarity."
As the first step for that, I provide here some detailed
suggestions on how to restructure the core of UAX #15,
i.e., the formal specification of Unicode normalization
itself.

One of the main issues with UAX #15 is that the specification
of normalization depends on canonical decomposition
and compatibility decomposition and the canonical ordering
algorithm (all defined in Chapter 3 of TUS), but then
introduces new definitions and another algorithm to
deal with NFKD and NFKC. The result is fragmented and
not easy to understand (over and above just the inherent
complexity of the specification itself).

After considering the potential ramifications of attempting
to move relevant definitions out of Chapter 3 and into
UAX #15, my current recommendation is instead to move
the formal specification of normalization *into* Chapter 3 --
essentially restructuring the current Section 3.11,
"Canonical Ordering Behavior" to make it instead Section 3.11,
"Normalization Forms", with the complete specification for
normalization in one place. In the process I also suggest
a considerable tightening up of the definitions involved,
while not trying to destabilize usage by introducing
substantially new terminology.

If the UTC is interested in pursuing this general direction,
Mark and I can work on fleshing out the detail of this
text and its implications for the rest of the text of UAX #15,
and produce the formal Proposed Update draft for that.

The idea would be that UAX #15 would still have substantial
content, but rather than containing the formal definitions
and formal specification, it would consist of the explanation
of Unicode normalization, the examples and edge cases,
version and stability information, all the optimization
strategies for implementation, and so on.

In doing this work, we (the editors) discovered that there
was also a defect in the existing formal definitions of
Compatibility Decomposition (D65) and Canonical Decomposition
(D68) in the text in Chapter 3. We have been assuming all
along that decompositions (and full decompositions) referred
to Unicode character sequences, i.e., strings. But the
definitions as currently written just refer to single
characters. Unless this is fixed first, the formal
definition of Unicode normalization gets rather clutzy and
would require introducing new terms to distinguish
a canonical decomposition (of a single character) and a
canonical decomposition (of a string), when current de facto
usage simply treats the canonical decomposition of a single
character as the limiting case of the canonical decomposition
of a string.

===================================================================

[[ Corrections to defects in D65 and D68 in Section 3.7 
Decomposition ]]

D65 Compatibility decomposition: The decomposition of a character ...
-->
D65 Compatibility decomposition: The decomposition of a character
    or character sequence ...
    

D68 Canonical decomposition: The decomposition of a character ...
-->
D68 Canonical decomposition: The decomposition of a character
    or character sequence ...
       
===================================================================

[[ Surgery required to Section 3.11 to repurpose it as the
normalization specification for the standard.]]

[[ Move the subsection "Application of Combining Marks" out
of 3.11 and tack it on to the end of 3.6 "Combination", where
it more properly belongs. (Its occurrence in 3.11 is just
the result of the long history of morphing of all this material
from the earlier versions of the text.) That just leaves the
"Combining Classes" and "Canonical Ordering" subsections
in 3.11, which are relevant and on topic for a normalization
section. 

3.11 Normalization Forms

[[ Rework the intro paragraphs of 3.11 so they properly
introduce normalization as an issue for *all* Unicode
character sequences and determination of their equivalence.
The current text on canonical ordering behavior is out
of date and refers to the pre-normalization notion of
equivalence of combining character sequences as the issue
of concern. It dates from Unicode 2.0: "The purpose of this
section is to provide unambiguous interpretation of a combining
character sequence." And in fact almost the entire text of
the Canonical Ordering subsection, including the exact
statement of the canonical ordering algorithm, dates
from 2.0, verbatim. I won't try to do that redrafting
of the intro text here, but it would draw, in part on
the intro from UAX #15. ]]

[[ Then continue with the Combining Classes subsection,
which only needs some small tweaks. ]]

[[ Then replace the Canonical Order section with the
specification for normalization, drafted along the following lines ]]

The specification of Unicode Normalization Forms applies to all
Unicode coded character sequences [D12]. For clarity of exposition in
the definitions and rules specified here, the terms "character"
and "character sequence" are used, but coded
character sequences refer also to sequences containing noncharacters
or reserved code points. Unicode Normalization Forms are specified
for *all* Unicode code points, and not just for ordinary, assigned
graphic characters.

Starters

DK+1  Starter: Any code point (assigned or not) with combining 
      class of zero. (ccc=0)

Note that ccc=0 is the default value for the Canonical_Combining_Class
property, so that all reserved code points are Starters by
definition. Noncharacters are also Starters by definition.
All control characters, format characters, and private-use
characters are also Starters.

Among the graphic characters, all those with General_Category values
other than gc=M are Starters. Some combining marks have
ccc=0 and thus are also Starters. Combining marks with
ccc other than 0 are not Starters.

[[ Insert table of combining marks and Starter status ]]

Description   gc     ccc       Starter?

Non-spacing   Mn     0         Yes
                     1..255    No
Spacing       Mc     0         Yes
                     216, 226  No
Enclosing     Me     0         Yes

The term *Starter* refers, in concept, to the starting character
of a combining character sequence [D56], because all combining
character sequences except defective combining character sequences
[D57] commence with a ccc=0 character -- in other words, they
start with a Starter. However, because the specification
of Unicode Normalization Forms must apply to all possible coded character 
sequences, and not just to typical combining character sequences, the 
behavior of a code point for Unicode Normalization Forms is specified 
entirely in terms of its status as a Starter or
a non-starter, and its Decomposition_Mapping value.

Canonical Ordering Algorithm
       
DK+2  Reorderable Pair: Two adjacent
      characters A and B in a coded character sequence <A, B>
      are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.
          
DK+3  Canonical Ordering Algorithm: The Canonical
      Ordering Algorithm is specified as follows:
      
      In a decomposed character sequence D, exchange the positions
      of the characters in each Reorderable Pair until the
      sequence contains no more Reorderable Pairs.
                     
  * In effect, the Canonical Ordering Algorithm is a local bubble sort that 
    guarantees that a Canonical Decomposition or a Compatibility
    Decomposition will contain no subsequences
    in which a combining mark is *followed* directly by another combining mark
    that has a lower, non-zero combining class.
    
  * Canonical ordering is defined in terms of application of the 
    Canonical Ordering Algorithm to an *entire* decomposed sequence, because 
    there are some unusual sequences of Unicode characters for which the 
    result of decomposing each of those characters individually will 
    not directly result in the entire sequence being in canonical order. 
    [Pointer to examples] In practice, most decompositions for Unicode strings
    are already in canonical order.
    
Canonical Composition Algorithm

DK+4  Composition Exclusion: A Canonical Decomposable Character [D69]
      which has the property value Composition_Exclusion=True.
      
  * The list of Composition Exclusions is provided in CompositionExclusions.txt
    in the Unicode Character Database.
    
DK+5  Singleton Decomposition: A canonical decomposition mapping to a
      single character other than the character itself.
      
  * Example: U+2126 OHM SIGN has a singleton decomposition
    to U+03A9 GREEK CAPITAL LETTER OMEGA.
    
  * A character with a singleton decomposition is often referred
    to simply as a *singleton* for short.
    
DK+6  Non-starter Decomposition: A canonical decomposition mapping to a
      sequence of more than one character, for which the first character
      in that sequence is not a starter.
      
  * Example: U+0344 COMBINING GREEK DIALYTIKA TONOS has a non-starter
    decomposition to <U+0308 COMBINING DIAERESIS, U+0301 COMBINING ACUTE ACCENT>. 
    
DK+7  Full Composition Exclusion: A Canonical Decomposable Character
      which has the property value Full_Composition_Exclusion=True.
      
  * Full composition exclusions consist of the entire list of
    composition exclusions plus all characters with singleton
    decompositions or with non-starter decompositions.
    
  * For convenience in implementation of Unicode normalization, the
    derived property, Full_Composition_Exclusion is computed, and
    all characters with the property value Full_Composition_Exclusion=True
    are listed in DerivedNormalizationProps.txt in the Unicode
    Character Database.
    
DK+8  Primary Composite: A Canonical Decomposable Character [D69] which
      is not a Full Composition Exclusion.
          
   * For any given version of the Unicode Standard, the list of
     primary composites can be computed by extracting all canonical
     decomposable characters from UnicodeData.txt in the Unicode
     Character Database, adding the list of precomposed Hangul
     syllables [D117], and subtracting the list of full decomposition
     exclusions.
     
DK+9  Blocked: Let A and C be two characters
      in a coded character sequence <A, ... C>. C is blocked
      from A if and only if ccc(A)=0 and there exists some character B
      between A and C in the coded character sequence, i.e.,
      <A, ... B, ... C>, and either ccc(B)=0 or ccc(B) >= ccc(C).
      
DK+10 Non-blocked Pair: A pair of characters <A, ... C> in a coded
      character sequence, in which C is not blocked from A.
      
   * It is important for proper implementation of the Canonical Composition
     Algorithm to be aware that a Non-blocked Pair need not be contiguous.
          
DK+11 Canonical Composition Algorithm: The Canonical Composition Algorithm 
      is specified as follows:
           
      Starting from the second character in the coded character sequence
      (of a Canonical Decomposition or Compatibility Decomposition) and proceeding
      sequentially to the final character, perform the following steps:
           
        1. Seek back (left) in the coded character sequence from
           the character C to find the last Starter L preceding
           C in the character sequence.
                
        2. If there exists a Primary Composite P which is canonically
           equivalent to the sequence <L, C>, and if C is not blocked
           from L, replace L by P in the sequence and delete C from 
           the sequence.
                
    * When the algorithm completes, all Non-blocked Pairs canonically
      equivalent to a Primary Composite will have been systematically
      replaced by those Primary Composites. 
           
    * Note that the replacement of the Starter L in step 2 requires
      continuing to check the succeeding characters until the character
      at that position is no longer part of any Non-blocked Pair that can
      be replaced by a Primary Composite. [Pointer to examples] 
           
Normalization Forms

The Unicode Standard specifies four normalization forms. Informally,
two of these forms are defined by maximal *decomposition* of equivalent
sequences, and two of these forms are defined by maximal *composition* of
equivalent sequences. Each is then differentiated based on whether it
employs a Canonical Decompositions or Compatibility Decomposition.

DK+12 Normalization Form D (NFD): The Canonical Decomposition of
      a coded character sequence.
           
DK+13 Normalization Form KD (NFKD): The Compatibility Decomposition
      of a coded character sequence.
           
DK+14 Normalization Form C (NFC): The Canonical
      Composition of the Canonical Decomposition of a coded character
      sequence.
           
DK+15 Normalization Form KC (NFKC): The Canonical
      Composition of the Compatibility Decomposition of a coded character
      sequence.
           
Logically, to get the NFD or NFKD (maximally decomposed) normalization
form for a Unicode string, one first computes the full decomposition
of that string and then applies the Canonical Ordering Algorithm to it.

Logically, to get the NFC or NFKC (maximally composed) normalization
form for a Unicode string, one first computes the NFD or NFKD normalization
form for that string, and then applies the Canonical Composition Algorithm
to it.