Twenty-first International Unicode Conference

Unicode Transforms in ICU

Mark Davis - IBM Centre for Java Technology SV

Intended Audience:	Software Engineers, Systems Analysts
Session Level:	Intermediate, Advanced

by Mark Davis, Alan Liu

ICU provides a set of powerful tools for transforming Unicode text. A variety of transformations are supplied: uppercase or lowercase conversions; normalizations (NFC, NFD, NFKC, NFKD); fullwidth-halfwidth conversions; hex and Unicode name conversions; many script-to-script transliterations; and others.
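To make a few of these transformation families concrete, here is a minimal sketch using Python's standard library rather than ICU itself. The variable names are illustrative. NFKC appears here only because it happens to fold fullwidth letters to ASCII; it is not the same thing as ICU's dedicated fullwidth-halfwidth transform, and script-to-script transliteration has no standard-library equivalent.

```python
import unicodedata

# Case conversions (full case mapping, so one character may become two).
upper = "straße".upper()
lower = "ÉCOLE".lower()

# Canonical normalization: NFD decomposes, NFC recomposes.
decomposed = unicodedata.normalize("NFD", "\u00e9")   # 'e' + U+0301
composed = unicodedata.normalize("NFC", "e\u0301")    # U+00E9

# Compatibility normalization: fullwidth Latin letters fold to ASCII.
folded = unicodedata.normalize("NFKC", "\uff21\uff22\uff23")

# Conversions between a character and its Unicode name.
name = unicodedata.name("\u00e9")
char = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE")

print(upper, lower, folded, name)
```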

These transformations can be chained together in arbitrary combinations. To remove accents, for example, one need only create a transformation from the string "NFD; [:Nonspacing Mark:] Remove; NFC". That string chains together three transformations: decompose, remove, and recompose. The filter [:Nonspacing Mark:] constrains the Remove step to the characters that should be affected. Additional transliterators can easily be built at runtime from a series of textual rules, with a rule syntax very much like that of regular expressions.
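The accent-removal chain can be imitated outside ICU to show what each step contributes. The sketch below uses only Python's standard library; `chain`, `remove_nonspacing_marks`, and `remove_accents` are hypothetical names, not ICU API. It assumes the usual form of this chain: decompose with NFD, drop characters whose general category is Mn (Nonspacing Mark), then recompose with NFC.

```python
import unicodedata


def nfd(text):
    """Canonically decompose, splitting base letters from their accents."""
    return unicodedata.normalize("NFD", text)


def nfc(text):
    """Canonically recompose whatever can still be composed."""
    return unicodedata.normalize("NFC", text)


def remove_nonspacing_marks(text):
    """Drop characters whose general category is Mn (Nonspacing Mark)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def chain(*steps):
    """Compose transformations so that they run left to right."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run


remove_accents = chain(nfd, remove_nonspacing_marks, nfc)

print(remove_accents("Résumé à la carte"))
```

The decomposition step matters: without it, a precomposed character such as U+00E9 contains no separate mark for the filter to match, so nothing would be removed.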

Both the filters and the rules can use UnicodeSets, which provide a very useful way to specify arbitrary combinations of Unicode characters. The sets can be built up from explicit lists, from properties (including script), and from boolean combinations of these. UnicodeSets are very compact, yet supply all the normal set operations; they are another valuable tool for processing text.
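The set-building idea can be sketched with ordinary Python sets. This is an illustration of the concept, not of ICU's UnicodeSet: `by_category` and `keep` are hypothetical helpers, a plain `frozenset` is far less compact than a range-based representation, and the standard library exposes general category but not script, so general category stands in for properties here. The scan is limited to code points below U+3000 purely to keep the example fast.

```python
import unicodedata


def by_category(prefix, limit=0x3000):
    """Characters below `limit` whose general category starts with `prefix`."""
    return frozenset(
        chr(cp)
        for cp in range(limit)
        if unicodedata.category(chr(cp)).startswith(prefix)
    )


# Sets built from properties.
letters = by_category("L")
marks = by_category("Mn")

# Sets built from explicit lists.
vowels = frozenset("aeiouAEIOU")
ascii_chars = frozenset(chr(cp) for cp in range(0x80))

# Boolean combinations: intersection, difference, union.
ascii_consonants = (letters & ascii_chars) - vowels
letters_or_marks = letters | marks


def keep(text, allowed):
    """Use a set as a filter, keeping only the characters it contains."""
    return "".join(ch for ch in text if ch in allowed)


print(keep("Hello, World!", ascii_consonants))
```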

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

30 April 2002, Webmaster