Unicode Encoding Formats: A Devil In Disguise?

Debmalya Biswas - InfoSys Technologies Limited

Intended Audience: Managers, Software Engineers, Systems Analysts, Content Developers, Technical Writers
Session Level: Intermediate

Statement Of Purpose:

Unicode.org describes Unicode as follows: "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language". Unicode does what it says as long as you consider a Platform (Operating System) or a Language (Software language) as an independent entity. The moment you start implementing international character support across languages and platforms via Unicode, however, you are entering dangerous territory. The reason is the variety of Unicode encoding formats (UTF-7, UTF-8, UTF-16, UTF-32, UCS-2, UCS-4, etc.) used by different languages/platforms.

In Unicode terminology, the number assigned to each character is called a code point. Since encoding formats determine how a code point is represented at the bit level, whenever Unicode data is passed from one language/platform to another that uses a different encoding format, the situation becomes similar to the one that prevailed before Unicode came into being (the problem of different charsets being used for different language characters). The paper stresses the need for a better solution to these interoperability woes than the meager format conversion routines provided (or, one might say, not provided) by different languages/platforms.
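As a small illustration of the point above, the following Python sketch encodes a single code point (the euro sign, U+20AC, chosen arbitrarily) under several encoding formats and then decodes UTF-16 bytes with the wrong byte order. The helper name `encode_all` is hypothetical and exists only for this example; it is not taken from the paper.

```python
def encode_all(ch):
    """Return the hex byte sequence of one character under several encoding formats."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
        "utf-16-le": ch.encode("utf-16-le").hex(),
        "utf-32-be": ch.encode("utf-32-be").hex(),
    }


euro = "\u20ac"  # one code point: U+20AC
forms = encode_all(euro)
for name, value in forms.items():
    print(f"{name:>10}: {value}")

# A producer writes big-endian UTF-16; a consumer assumes little-endian.
# The bytes are identical, but the consumer sees a different character.
sent = euro.encode("utf-16-be")
received = sent.decode("utf-16-le")
print(f"sent U+{ord(euro):04X}, received U+{ord(received):04X}")
```

The same code point yields four different byte sequences, and the mismatched decode silently produces a valid but unrelated character rather than an error, which is the kind of interoperability failure described here.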

Paper Description:

This paper attempts to highlight the problems faced by software developers while implementing an I18N solution that spans multiple platforms/languages, by taking a few real-life interoperability examples (such as between Java/C++, HTML/ASP/JSP, and SQL Server/ODBC/JDBC). It also analyzes the solutions provided by some of these languages/platforms to overcome these issues.


Although the various encoding formats are necessary given the limitations and requirements of different languages/platforms, the fact remains that their presence is a serious hindrance to interoperability between languages/platforms.


A basic understanding of software internationalization and familiarity with Unicode terminology.

