Unicode Technical Note #5

Canonical Equivalence in Applications

Version 1
Authors Markus W. Scherer
Date 2002-07-24
This Version http://www.unicode.org/notes/tn5/tn5-1.html
Previous Version (none)
Latest Version http://www.unicode.org/notes/tn5

Summary

This document describes methods and formats for efficient processing of text under canonical equivalence, as defined in UAX #15 Unicode Normalization Forms [UAX15].

The reader should be familiar with [UAX15] and with Unicode normalization and canonical equivalence in general.

Status

This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. For general information on Unicode Technical Notes, see http://www.unicode.org/notes.

Contents

  1. Introduction
  2. Canonical Closure
  3. FCD Condition and Test
    1. FCD Test
  4. FCC Form
  5. Enumerating Equivalent Strings
  6. Sample Implementation
  7. References
  8. Modifications

1 Introduction

As described in [UAX15], multiple Unicode strings can be canonically equivalent. Such strings are not distinguishable by a user and therefore the standard requires them to be treated the same (see Conformance Requirement C9 of Unicode 3.0 [C9]). This means that canonically equivalent strings should normally be displayed identically, sort identically, be case-mapped equivalently, and so on.

It is sometimes necessary to normalize a string before processing in order to fulfill this requirement. Such additional normalization decreases performance.

For most strings it is possible to avoid or minimize an additional normalization step by preparing the supporting data appropriately. This document describes conditions that such strings must meet, including a test algorithm, and describes how to modify the supporting data accordingly.  It also reintroduces a variation of the NFC normalization form that transforms strings into a unique, compact form that always fulfills these conditions.

While normalization is very fast, for time-critical processes even the cost of traversing the text an extra time (or of allocating extra buffers) can be too high.

This Technical Note collects concepts and definitions from earlier documents as referenced.

2 Canonical Closure

Unicode text processing is primarily done based on per-code point data. At the same time, many processes are specified for text in NFD. This simplifies the specification of a process, including the requirement to produce equivalent results for equivalent strings. Any conformant implementation must yield equivalent results even if it does not normalize its input.

For example, the Unicode Collation Algorithm [UCA] assigns collation weights to code points (and sometimes to sequences of code points). An implementation must map canonically equivalent strings to the same sequences of collation weights.

Decomposition to NFD is not necessary for many strings if an implementation has “precomposed data” for precomposed characters. That is, the implementation data maps each code point to the same sequence of collation weights as it does for the canonical decomposition of that code point. Such data is called “canonically closed”, and the process of building such data from the original table is called “canonical closure”.

For example, if a collation table maps the character A to the collation weight (A) and the combining ring to the weight (ring), then the string A+ring will be mapped to the sequence of collation weights (A)+(ring). A canonically closed collation table also directly maps the character A-ring to that same sequence of weights.

Character   Weights     Comment
A           (A)         From original table
ring        (ring)      From original table
A-ring      (A)(ring)   From canonical closure

The decomposition is effectively performed by the data lookup instead of on the text, because a precomposed character's data is the same as the data for its canonical decomposition. This avoids the normalization step.
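The closure step can be sketched in Python. This is an illustration only: the toy weight table, the weight names, and the BMP-only scan are assumptions, and unicodedata.normalize("NFD", ...) stands in for a real implementation's decomposition data.

```python
import unicodedata

# Toy weight table: maps a character to its sequence of collation weights.
# The characters and weight names are illustrative only.
base = {
    "A": ["(A)"],
    "\u030A": ["(ring)"],  # combining ring above
}

def canonical_closure(table):
    """Add entries for precomposed characters whose full canonical
    decomposition consists only of characters already in the table."""
    closed = dict(table)
    for cp in range(0x10000):  # BMP only, to keep the sketch fast
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        d = unicodedata.normalize("NFD", ch)  # full canonical decomposition
        if d != ch and all(c in table for c in d):
            # concatenate the weights of the decomposed characters
            closed[ch] = [w for c in d for w in table[c]]
    return closed

closed = canonical_closure(base)
```

With the closed table, A-ring (U+00C5) maps directly to (A)(ring) without decomposing the text first.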

3 FCD Condition and Test

An implementation can use canonically closed data without normalizing input strings only if the strings fulfill a certain condition: The decompositions of adjacent code points must not need canonical reordering. This guarantees that the concatenation of per-code point data (like collation weights) yields the same data sequence as for the NFD form of the strings.

Since all NFD strings and most NFC strings fulfill this condition, they are said to be in “Fast C or D” form, or “FCD”. Note that FCD does not describe a unique form: Unlike with regular Normalization Forms, two FCD strings can be canonically equivalent but not identical.

Example:

Text             Weights                Comment
A                (A)
ring             (ring)
cedilla          (cedilla)
A-ring           (A)(ring)              NFC, FCD
A+ring           (A)+(ring)             NFD, FCD
A+cedilla+ring   (A)+(cedilla)+(ring)   NFD, FCD
A-ring+cedilla   (A)(ring)+(cedilla)    wrong: NFC but not FCD; needs reordering before data lookup

3.1 FCD Test

Strings can be checked for FCD efficiently. The algorithm is best expressed in pseudo-code:

// get the combining class of the first code point of c's decomposition
unsigned byte getLeadCC(code_point c) {
  string d = getCanonicalDecomposition(c);
  code_point first = d.getFirstCodePoint();
  return getCombiningClass(first);
}

// get the combining class of the last code point of c's decomposition
unsigned byte getTrailCC(code_point c) {
  string d = getCanonicalDecomposition(c);
  code_point last = d.getLastCodePoint();
  return getCombiningClass(last);
}

// check if a string is in FCD
boolean checkFCD(string s) {
  code_point c; // current code point
  unsigned byte prevCC; // trail cc of the previous code point
  unsigned byte cc; // lead cc of the current code point

  prevCC = 0; // initialize
  for(each code point c in s) {
    cc = getLeadCC(c);
    if(cc != 0 && cc < prevCC) {
      // s is not in FCD:
      // the concatenation of per-code point decompositions needs reordering
      return false;
    }
    prevCC = getTrailCC(c);
  }
  return true; // s is in FCD
}

The functions getLeadCC and getTrailCC can be replaced by a single table lookup of the code point, returning a 16-bit value that contains both combining classes, which makes checkFCD very fast.
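The pseudo-code can be tried directly with Python's unicodedata module; here normalize("NFD", c) stands in for getCanonicalDecomposition, since applying NFD to a single code point yields its full canonical decomposition.

```python
import unicodedata

def lead_cc(c):
    # combining class of the first code point of c's full canonical decomposition
    d = unicodedata.normalize("NFD", c)
    return unicodedata.combining(d[0])

def trail_cc(c):
    # combining class of the last code point of c's full canonical decomposition
    d = unicodedata.normalize("NFD", c)
    return unicodedata.combining(d[-1])

def check_fcd(s):
    prev_cc = 0  # trail cc of the previous code point
    for c in s:
        cc = lead_cc(c)
        if cc != 0 and cc < prev_cc:
            # the concatenation of per-code point decompositions needs reordering
            return False
        prev_cc = trail_cc(c)
    return True
```

This reproduces the table above: check_fcd("A\u0327\u030A") (A+cedilla+ring) is True, while check_fcd("\u00C5\u0327") (A-ring+cedilla) is False.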

Mark Davis first described FCD in the ICU Collation Design Document [Collation].

4 FCC Form

As demonstrated above, not all NFC strings are also in FCD. The reason is that the composition step in the NFC algorithm sometimes combines two characters even if they are separated by one or more other characters. In the case of A+cedilla+ring the A combines with the ring even though the cedilla is between them. The NFC result, A-ring+cedilla, is not in FCD.

A modified NFC algorithm will produce unique strings that are always in FCD. The only modification needed is to remove the discontiguous composition. This is called here “FCC” for “Fast C Contiguous”. FCC is almost as compact as NFC, and most legacy codepages also convert 1:1 into FCC strings. In fact, for most strings the NFC and FCC Forms are identical, which makes the transformation between them fast (mostly only a test).

Martin Dürst proposed this form in an Internet Draft titled “Normalization of Internationalized Identifiers” [altNFC] as an alternative for the definition of NFC. The current NFC algorithm was chosen because it tends to be more compact, and because NFC strings are more “stable”: appending a combining mark causes a precomposed character to unravel in fewer cases. (In FCC, appending a cedilla to A-ring results in A+cedilla+ring while in NFC the A-ring stays intact and the result is A-ring+cedilla.)

Sample code for FCC is a simple modification of the NFC sample code (see [UAX15]). In the function internalCompose (in Normalizer.java), replace lastClass = chClass; with lastClass = 256; after a character fails to combine with the starter. This prevents a discontiguous combination across that character, because it makes one of the conditions tested for the actual combination fail: lastClass < chClass.
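For illustration, the contiguous-composition idea can be sketched in Python. This is not the Normalizer.java sample: compose_pair uses NFC on a two-character string as a stand-in for a canonical composition table lookup, and setting last_cc to 256 plays the role of the lastClass = 256 modification.

```python
import unicodedata

def compose_pair(a, b):
    # Stand-in for a canonical composition table lookup: returns the
    # composite of a+b, or None if the pair does not compose.
    s = unicodedata.normalize("NFC", a + b)
    return s if len(s) == 1 else None

def fcc(s):
    s = unicodedata.normalize("NFD", s)  # decompose and canonically order
    out = []
    starter_idx = -1  # index of the last starter in out, or -1
    last_cc = 0
    for ch in s:
        cc = unicodedata.combining(ch)
        composite = None
        if starter_idx >= 0 and (last_cc == 0 or last_cc < cc):
            composite = compose_pair(out[starter_idx], ch)
        if composite is not None:
            out[starter_idx] = composite  # contiguous combination with the starter
        else:
            if cc == 0:
                starter_idx = len(out)
            # FCC: block any later discontiguous combination across this
            # character (NFC would set last_cc = cc here)
            last_cc = cc if cc == 0 else 256
            out.append(ch)
    return "".join(out)
```

This matches the example in the text: appending a cedilla to A-ring yields A+cedilla+ring in FCC, while NFC keeps the A-ring intact.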

5 Enumerating Equivalent Strings

In order to build canonically closed data one needs to enumerate all strings that are canonically equivalent to an input string and in FCD. Each such string then gets assigned the same data as for the original string.

Mark Davis developed the following algorithm to enumerate canonically equivalent strings. For efficiency, supporting data is precomputed, but this is the logical structure.

  1. Transform the input string into its NFD form.
  2. Partition the string into segments, with each starter character in the string at the beginning of a segment.
    (A starter in this sense has combining class 0 and does not appear in non-initial position of any other character's decomposition.)
  3. For each segment enumerate canonically equivalent forms, as follows:
    1. Use the set of characters whose decomposition begins with the segment's starter.
    2. For each character in this set:
      1. Get the character's decomposition.
      2. If the decomposition contains characters that are not in the segment, then skip this character.
      3. If the decomposition contains a character that is blocked in the segment (preceded by a combining mark with the same combining class), then also skip this character.
      4. Otherwise, start building a new string with this character.
      5. Append all characters from the input segment that are not in this character's decomposition in canonical order.
      6. Add this string to the set of canonical equivalents for the current segment.
      7. Recurse: Treat all but the initial character of this new string as a segment and add to the set for the current segment all combinations of the initial character and the equivalent strings of the rest.
  4. Enumerate the combinations of all forms of all segments.
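As an illustration only (not the efficient segment-based algorithm above), the set of canonically equivalent strings of a short input can be found by brute force: every equivalent string is at most as long as the NFD form, and its characters must decompose into code points of that form, so a small search with an NFD-equality check suffices. The helper names and the BMP-only candidate scan are assumptions of this sketch; the search is exponential in general.

```python
import unicodedata
from collections import Counter

def fits(a, b):
    # True if multiset a is contained in multiset b
    return all(a[k] <= b[k] for k in a)

def equivalents(s):
    d = unicodedata.normalize("NFD", s)
    need = Counter(d)
    # candidate characters: those whose full canonical decomposition
    # uses only code points occurring in d (BMP-only scan for brevity)
    cand = []
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if fits(Counter(unicodedata.normalize("NFD", ch)), need):
            cand.append(ch)
    results = set()
    def search(t, used):
        if unicodedata.normalize("NFD", t) == d:
            results.add(t)
        if len(t) >= len(d):
            return
        for ch in cand:
            new_used = used + Counter(unicodedata.normalize("NFD", ch))
            if fits(new_used, need):  # prune: don't overshoot the NFD multiset
                search(t + ch, new_used)
    search("", Counter())
    return results
```

For A-ring+cedilla this finds the four equivalents A+cedilla+ring, A+ring+cedilla, A-ring+cedilla, and Angstrom sign+cedilla.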

6 Sample Implementation

Many of the concepts presented here are implemented in the International Components for Unicode [ICU]. The ICU normalization code supports FCD in its quickCheck and isNormalized functions as well as in the normalization transformation functions. Transforming a string “into FCD” generally only decomposes and reorders as far as is necessary for the result to meet the FCD condition.

Incremental and whole-string FCD checks are used in the collation and string search services.

The CanonicalIterator class implements the above algorithm for Enumerating Equivalent Strings. It is used in the collation code for building collation tailoring tables.

As of ICU 2.2, the CanonicalIterator is an internal API and FCC is not yet implemented in ICU.

7 References

[C9] Conformance Requirement C9 of Unicode 3.0
http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode_3_0_0
Requires that canonically equivalent text be treated equivalently. See also C10 and several other places in The Unicode Standard.
[UAX15] Unicode Standard Annex #15: Unicode Normalization Forms
http://www.unicode.org/reports/tr15/
Specifies the Normalization Forms (NFD, NFC, NFKD, NFKC) and canonical equivalence, together with the related passages of The Unicode Standard.
[UCA] Unicode Technical Standard #10: Unicode Collation Algorithm
http://www.unicode.org/reports/tr10/
[Collation] ICU Collation Design Document
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm
First description of FCD by Mark Davis.
[OptiNorm] Optimizing the Usage of Normalization
http://www.icu-project.org/docs/papers/normalization_iuc21.ppt
Also IUC 21 session C14 http://www.unicode.org/iuc/iuc21/a348.html
Presentation about FCD and other normalization optimizations.
[altNFC] Normalization of Internationalized Identifiers
Internet Draft draft-duerst-i18n-norm-00.txt, M. Dürst, University of Zurich, July 1997 (expired).
Martin Dürst proposed what is called FCC here as an alternative to the current NFC; see its section 2.1, “Normalization of Combining Sequences”.
[ICU] International Components for Unicode
http://www.icu-project.org
Open-source C/C++ and Java libraries for Unicode and I18N support.

8 Modifications

The following summarizes modifications from the previous version of this document.

2008.10.01 Updated stale links in version 1
1 Initial version