Unicode Transforms in ICU
Mark Davis - IBM Centre for Java Technology SV
by Mark Davis, Alan Liu
ICU provides a set of powerful tools for transforming Unicode text. A variety of transformations are supplied: uppercase or lowercase conversions; normalizations (NFC, NFD, NFKC, NFKD); fullwidth-halfwidth conversions; hex and Unicode name conversions; many script-to-script transliterations; and others.
These transformations can be chained together in arbitrary combinations. To remove accents, for example, one need only create a transformation from the string "NFD; [:Nonspacing Mark:] Remove; NFC". That string chains together three transformations: decompose the text, remove characters, and recompose the result. The filter [:Nonspacing Mark:] constrains the middle step so that only nonspacing marks are removed. Additional transliterators can easily be built at runtime from a series of textual rules, with a rule syntax very much like that of regular expressions.
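To make the three steps of the accent-removal chain concrete, here is a minimal sketch that mirrors them with Python's standard unicodedata module. It does not use ICU; the function name remove_accents is hypothetical, and the general category "Mn" is used as the stand-in for the [:Nonspacing Mark:] filter.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Mimic the chain "NFD; [:Nonspacing Mark:] Remove; NFC" (not ICU itself)."""
    # Step 1 (NFD): decompose each character into a base plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2 (filtered Remove): drop only characters whose category is Mn.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Step 3 (NFC): recompose whatever remains into composed form.
    return unicodedata.normalize("NFC", stripped)


print(remove_accents("Résumé naïve Ångström"))  # prints "Resume naive Angstrom"
```

In ICU itself the same effect should come from passing the ID string to a transliterator factory (for instance Transliterator.getInstance in ICU4J), so no hand-written loop is needed; the sketch only shows what each stage of the chain contributes.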
Both the filters and the rules can use UnicodeSets, which provide a very useful way to specify arbitrary combinations of Unicode characters. The sets can be built up from explicit lists, from properties (including script), and from boolean combinations of these. UnicodeSets are very compact, yet supply all the normal set operations; they are another valuable tool for processing text.
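The idea of building character sets from lists, properties, and boolean combinations can be sketched with Python's built-in sets. This is an illustration only, not the UnicodeSet API: the helper category_set is hypothetical, it covers only code points below an arbitrary limit, and it uses the general category because the standard library does not expose the script property. Plain Python sets are also far less compact than a UnicodeSet.

```python
import unicodedata


def category_set(category: str, limit: int = 0x0250) -> frozenset:
    """Hypothetical helper: characters below `limit` with this general category."""
    return frozenset(
        chr(cp) for cp in range(limit)
        if unicodedata.category(chr(cp)) == category
    )


# Sets built from properties.
uppercase = category_set("Lu")
lowercase = category_set("Ll")

# Sets built from explicit lists and ranges.
ascii_chars = frozenset(chr(cp) for cp in range(0x80))
vowels = frozenset("aeiouAEIOU")

# Boolean combinations of the sets above.
letters = uppercase | lowercase          # union
ascii_upper = uppercase & ascii_chars    # intersection
upper_consonants = ascii_upper - vowels  # difference

print("".join(sorted(upper_consonants)))  # prints "BCDFGHJKLMNPQRSTVWXYZ"
```

A UnicodeSet pattern expresses the same combinations in a single string, with properties in the [:...:] form used by the accent-removal filter, and can then serve directly as the filter of a transform.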
30 April 2002, Webmaster