L2/01-371

Normalization Form FCD

Markus W. Scherer
2001-oct-07


Editorial Notes 2001-oct-08:


Form "Fast C or D" was developed for collation and similar string processing operations to work with most strings without normalizing them.
Algorithms like collation and string search that treat all canonically equivalent forms of a string the same can be easily implemented by first canonically decomposing (NFD) each input string. Since there will be no precomposed characters after that, the implementation of the algorithm can omit properties for many characters from its data.
However, if an implementation does include equivalent data for both precomposed and decomposed forms, then it can avoid the normalization step in most cases. It will work properly without NFD normalization if input strings are in form FCD.

Definition: A string is in FCD if the concatenation of the canonical decompositions (NFD) of each of its characters is canonically ordered.

FCD has the following properties:

Testing for FCD (pseudo-code):

boolean isFCD(String s) {
 String d; // holds decompositions
 UChar32 c; // code point value
 uint8_t prevCC=0, leadCC, trailCC; // variables for combining classes
 
 for each Unicode code point c in s {
 d=NFD(c);
 leadCC=getCombiningClass(first code point in d);
 trailCC=getCombiningClass(last code point in d);
 if(leadCC!=0 && leadCC<prevCC) {
 return false;
 }
 prevCC=trailCC;
 }
}

This test can be made extremely fast by storing the precomputed leadCC and trailCC for each Unicode code point in a table. Further optimizations include:

Form "Fast C or D" was developed and first documented by Mark E. Davis in 2000 for the collation implementation in ICU 1.8. See the ICU collation design document for the original description, and the ICU normalization implementation for a test function (unorm_checkFCD()) and a "makeFCD" function (unorm_makeFCD()).