Unicode IUC22
Abstract

Optimizing the Usage of Normalization

Vladimir Weinstein - IBM Corporation

Intended Audience: Managers, Software Engineers, Systems Analysts, Technical Writers
Session Level: Intermediate, Advanced

Many processes (such as the Unicode Collation Algorithm, UCA) and standards (see the W3C Character Model) require normalization. Although several efficient implementations of the normalization algorithms exist, normalization is not free. This paper discusses how carefully preparing the supporting data and using normalization procedures wisely can substantially improve the performance of other processes, and illustrates proper usage with examples from the collation service in the ICU library. In particular, it discusses checking whether text is already normalized, incremental normalization of text, concatenation of normalized text, and the use of the FCD format.
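As a rough illustration of two of these ideas, the sketch below uses Python's standard `unicodedata` module rather than ICU. The helper names `normalize_if_needed` and `concat_nfd` are hypothetical and do not come from the paper. The first helper skips normalization when the text already passes a check; the second assumes both inputs are already in NFD and re-normalizes only the run of combining marks that spans the join, since that is the only place where canonical ordering can break.

```python
import unicodedata


def normalize_if_needed(text: str, form: str = "NFD") -> str:
    """Return text unchanged when it is already in the requested form."""
    # unicodedata.is_normalized requires Python 3.8 or later.
    if unicodedata.is_normalized(form, text):
        return text
    return unicodedata.normalize(form, text)


def concat_nfd(a: str, b: str) -> str:
    """Concatenate two strings assumed to be in NFD, fixing only the boundary.

    Hypothetical helper: it handles NFD only. Composed forms such as NFC
    need a wider boundary analysis because characters can also combine.
    """
    # i: index just past the last starter (combining class 0) in a.
    i = len(a)
    while i > 0 and unicodedata.combining(a[i - 1]) != 0:
        i -= 1
    # j: index of the first starter in b.
    j = 0
    while j < len(b) and unicodedata.combining(b[j]) != 0:
        j += 1
    # Only the combining marks around the join may need reordering.
    middle = unicodedata.normalize("NFD", a[i:] + b[:j])
    return a[:i] + middle + b[j:]
```

For example, `concat_nfd("q\u0307", "\u0323x")` reorders the two dots at the join, because U+0323 (combining class 220) must precede U+0307 (combining class 230) in NFD, while the rest of both strings is copied untouched.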

Text is in the FCD format when canonical decomposition without any canonical reordering produces correct NFD text. In practice, almost all text is already in FCD, and testing whether text is in FCD is very fast. A correctly optimized algorithm can check for FCD and skip normalization when the text is in that format. To support FCD input, however, the data used by the algorithm must be preprocessed so that it is 'canonically closed'.
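The FCD test can be sketched as follows, again in Python with the standard `unicodedata` module rather than ICU. The function name `is_fcd` is hypothetical. The sketch assumes the usual pairwise formulation of the check: for each character, take the combining class of the first and last code points of its canonical decomposition, and reject the text if a non-zero leading class is smaller than the trailing class of the previous character. A production implementation would read these two values from a precomputed table instead of decomposing each character on the fly.

```python
import unicodedata


def is_fcd(text: str) -> bool:
    """Return True if decomposing text needs no canonical reordering.

    Illustrative sketch only: decomposing per character is slow, and real
    implementations look up the lead and trail combining classes in a table.
    """
    prev_trail = 0
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        trail = unicodedata.combining(decomposed[-1])
        # A combining mark with a lower class than the one before it
        # would have to be moved by canonical reordering.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True
```

Note that precomposed text such as `"\u00e9"` passes this test even though it is not in NFD, which is why FCD covers so much real-world text. A sequence such as `"\u1e0b\u0323"` (d with dot above followed by a combining dot below) fails, so a caller would fall back to full normalization in that case.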



International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

5 July 2002, Webmaster