Character Conversions and Mapping Tables
Markus Scherer, George Rhoten & Ram Viswanadha - ICU Team of IBM in Cupertino, CA
This talk discusses character conversions to and from Unicode, presents problems that can cause the loss of text data, and shows pragmatic ways to avoid such problems.
Text data is widely exchanged among networked systems. While modern Internet protocols and applications use Unicode more and more directly, a lot of text is still exchanged and processed in legacy encodings. Character conversion is performed whenever text is exchanged and processed in different encodings.
Character conversions can lose text data for a number of reasons. The obvious problems are an insufficient repertoire in the target encoding and the complete lack of support for an encoding. More obscure and unexpected problems include mismatches in conversion behavior and in conversion data between systems. In addition, encoding names are only loosely standardized and are interpreted inconsistently.
Members of the ICU team are working with the Unicode Technical Committee (UTC) and interested parties in the industry to collect and publish mapping data for character conversions to and from Unicode. This project includes assigning unique identifiers to encodings, collecting aliases, and comparing the Unicode mappings.
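To make the alias problem concrete, the following sketch asks Python's codec registry how it resolves several encoding labels. It shows one library's alias table, not the identifiers or aliases collected by the project described here; the labels chosen are assumptions for illustration.

```python
import codecs

# Several labels that one library treats as the same encoding.
labels = ["latin1", "ISO-8859-1", "iso8859_1", "L1"]
canonical = {label: codecs.lookup(label).name for label in labels}

# Two labels that are often used interchangeably for "Shift-JIS" but
# resolve to different codecs with different mapping tables.
sjis_name = codecs.lookup("sjis").name
ms932_name = codecs.lookup("ms932").name

# A label this library does not know at all is rejected outright.
# "x-hypothetical-charset" is a made-up name for the example.
try:
    codecs.lookup("x-hypothetical-charset")
    unknown_supported = True
except LookupError:
    unknown_supported = False

print(canonical)
print(sjis_name, ms932_name, unknown_supported)
```

Another library may resolve the same labels differently, which is why a shared registry of unique identifiers and aliases is useful.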
Using these mapping tables, ICU and other libraries can precisely duplicate the conversion results of other systems.
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
22 Jun 2001, Webmaster