Optimizing the Usage of Normalization
Vladimir Weinstein - IBM Corporation
Many processes (such as the Unicode Collation Algorithm, UCA) and standards (see the W3C Character Model) require the use of normalization. Although there are several efficient implementations of the normalization algorithms, normalization is not free. This paper discusses how carefully preparing the supporting data and using normalization procedures wisely can substantially improve the performance of other processes, and illustrates proper usage with examples from the collation service in the ICU library. In particular, it covers checking whether text is already normalized, incremental normalization of text, concatenation of normalized text, and the use of the FCD format.
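Two of the techniques named above, checking for already-normalized text and concatenating normalized text, can be sketched briefly. The sketch below uses Python's standard unicodedata module (Python 3.8 or later for is_normalized) rather than ICU. The helper names normalize_if_needed and concat_nfd are hypothetical and are not taken from the paper or from ICU. The concatenation helper is restricted to NFD, where any character with combining class 0 is a safe boundary; other forms need a stricter boundary test.

```python
import unicodedata

def normalize_if_needed(text: str, form: str = "NFC") -> str:
    """Return text in `form`, skipping normalization when it is already normalized."""
    if unicodedata.is_normalized(form, text):
        return text  # cheap check succeeded; no new string is built
    return unicodedata.normalize(form, text)

def concat_nfd(left: str, right: str) -> str:
    """Concatenate two NFD strings, re-normalizing only the region around the seam.

    Assumes both inputs are already in NFD (hypothetical helper, NFD only).
    """
    # Back up to the last starter (combining class 0) in `left`.
    i = len(left)
    while i > 0:
        i -= 1
        if unicodedata.combining(left[i]) == 0:
            break
    # Advance to the first starter in `right`.
    j = 0
    while j < len(right) and unicodedata.combining(right[j]) != 0:
        j += 1
    # Only the marks between those two starters can need reordering.
    seam = unicodedata.normalize("NFD", left[i:] + right[:j])
    return left[:i] + seam + right[j:]

if __name__ == "__main__":
    left, right = "a\u0300", "\u0323x"   # both are NFD on their own
    print(unicodedata.is_normalized("NFD", left + right))  # False
    print(concat_nfd(left, right) == unicodedata.normalize("NFD", left + right))  # True
```

The example shows why plain concatenation is not enough: each input is normalized, but the combining marks that meet at the seam are out of canonical order, so only that small region needs to be normalized again.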
Text is in the FCD format when canonical decomposition without any canonical reordering produces correct NFD text. Almost all text in practice is in FCD, and testing whether text is in FCD is very fast. A correctly optimized algorithm can check for FCD and skip normalization when the text is in that format. However, to support FCD input, the data used by the algorithm must be preprocessed so that it is 'canonically closed'.
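The FCD test described above can be sketched as follows. This is a minimal illustration using Python's unicodedata module, not ICU's implementation: it computes each character's leading and trailing canonical combining classes on the fly, whereas a production implementation would presumably read them from a precomputed table. The function names is_fcd and decompose_no_reorder are hypothetical.

```python
import unicodedata

def _lead_trail_cc(ch: str) -> tuple[int, int]:
    """Combining classes of the first and last code points of ch's canonical decomposition."""
    d = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(d[0]), unicodedata.combining(d[-1])

def is_fcd(text: str) -> bool:
    """True if decomposing each character in place would already yield NFD order."""
    prev_trail = 0
    for ch in text:
        lead, trail = _lead_trail_cc(ch)
        # A non-starter must not follow a mark with a higher combining class.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

def decompose_no_reorder(text: str) -> str:
    """Canonical decomposition applied per character, with no reordering pass."""
    return "".join(unicodedata.normalize("NFD", ch) for ch in text)

if __name__ == "__main__":
    print(is_fcd("\u00e0"))         # True: precomposed a-grave
    print(is_fcd("a\u0323\u0300"))  # True: dot below (220) before grave (230)
    print(is_fcd("\u00e0\u0323"))   # False: trailing 230 followed by 220
```

For text that passes is_fcd, decompose_no_reorder gives the same result as full NFD normalization, which is the property that lets a process such as collation skip the normalization step. Note that the first example is FCD although it is not NFD: FCD text may contain precomposed characters.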
21 February 2002, Webmaster