Optimizing the Usage of Normalization
Vladimir Weinstein - IBM Corporation

Intended Audience: Software Engineers, Technical Writers, Testers

Session Level: Beginner, Intermediate, Advanced

Many processes, such as the Unicode Collation Algorithm (UCA), and standards, such as the W3C Character Model, require the use of normalization. Although several efficient implementations of the normalization algorithms exist, normalization is not free. This paper discusses how carefully preparing the supporting data and using normalization procedures wisely can substantially improve the performance of other processes, and illustrates proper usage with examples from the collation service in the ICU library. In particular, it discusses checking whether text is already normalized, incremental normalization of text, concatenation of normalized text, and the use of the FCD format.
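The idea of checking for already normalized text before normalizing can be sketched in a few lines. This is only an illustration using Python's standard `unicodedata` module (Python 3.8 or later), not the ICU collation code the paper describes; the helper name `to_nfd` is hypothetical.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return text in NFD, skipping the normalization pass when possible.

    Hypothetical helper: the quick check is cheaper than a full
    normalization, so already normalized input is returned unchanged.
    """
    if unicodedata.is_normalized("NFD", text):
        return text  # same object, no new string is built
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    plain = "normalization"
    print(to_nfd(plain) is plain)            # input was already in NFD
    print([hex(ord(c)) for c in to_nfd("\u00e9")])
```

When most input is already normalized, as is typical for ASCII-heavy text, the check is the only work done per string.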

Text is in the FCD format when canonical decomposition without any canonical reordering produces correct NFD text. Almost all text in practice is in FCD, and testing whether text is in FCD is very fast. A correctly optimized algorithm can check for FCD and skip normalization when the text is already in that format. To support FCD input, however, the data used by the algorithm must be preprocessed so that it is what is called 'canonically closed'.
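A minimal sketch of an FCD test follows, assuming the usual formulation: for each character, take the combining class of the first and last code points of its canonical decomposition, and reject the text if a nonzero leading class is lower than the trailing class of the preceding character. It uses Python's `unicodedata` module rather than ICU, and the function names are hypothetical.

```python
import unicodedata


def lead_trail_cc(ch: str) -> tuple[int, int]:
    """Combining classes of the first and last code points of ch's
    full canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", ch)
    return (unicodedata.combining(decomposed[0]),
            unicodedata.combining(decomposed[-1]))


def is_fcd(text: str) -> bool:
    """True if decomposing text would need no canonical reordering
    across character boundaries (illustrative check, not ICU's code)."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_cc(ch)
        # A nonzero lead class lower than the previous trail class
        # means the marks would have to be reordered.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


if __name__ == "__main__":
    print(is_fcd("\u00e9"))         # precomposed e-acute: FCD but not NFD
    print(is_fcd("a\u0323\u0301"))  # dot below (220) then acute (230)
    print(is_fcd("a\u0301\u0323"))  # acute (230) then dot below (220)
```

A production implementation would look the lead and trail classes up in a precomputed table instead of decomposing each character on the fly; the per-character decomposition here only keeps the sketch short.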
