Confusion In Conversion - How Does Occasional Character Corruption Happen?

Ken Glidden - Basis Technology Corporation

Intended Audience: Software Engineers, Content Developers
Session Level: Intermediate, Advanced

Unicode has solved major problems in the internationalization of software. That does not mean, however, that all problems associated with character encoding are gone. Unicode has even introduced new problems of its own.

In this session, the authors will discuss common practical problems that many developers face when building Unicode-based multi-tier systems whose subsystems run on different platforms. A typical example is a system in which Java clients on Windows talk to a C++ server on Unix in UTF-8. The session then focuses on two problems: character converter mismatches and the non-standard extension characters of Windows code pages. Both are very common in Japanese language processing, yet they are not well known. They are difficult to detect because they manifest only when particular characters out of the large Japanese character sets are used. The system appears to work correctly during system testing, and the problems are often found only after deployment.

The authors will explain how this occasional corruption of characters happens, describe a test strategy to detect these problems, and explore ways to solve them.
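One possible shape for such a test, offered only as a sketch and not as the strategy presented in the session, is to decode every double-byte sequence with each converter in the pipeline and report where the results differ. The function names `decode_or_none` and `find_mismatches` are hypothetical, and Python's `shift_jis` and `cp932` codecs are again assumed as stand-ins for the converters under test.

```python
# Sketch: exhaustively compare two converters over the double-byte
# range of Shift_JIS-family encodings and list every disagreement.

def decode_or_none(data: bytes, codec: str):
    """Decode `data` with `codec`; return None if the bytes are rejected."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


def find_mismatches(codec_a: str, codec_b: str):
    """Return (bytes, result_a, result_b) for each differing sequence."""
    lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
    trail_bytes = list(range(0x40, 0x7F)) + list(range(0x80, 0xFD))
    mismatches = []
    for lead in lead_bytes:
        for trail in trail_bytes:
            data = bytes([lead, trail])
            a = decode_or_none(data, codec_a)
            b = decode_or_none(data, codec_b)
            if a != b:  # includes "one side rejects, the other accepts"
                mismatches.append((data, a, b))
    return mismatches


if __name__ == "__main__":
    for data, a, b in find_mismatches("shift_jis", "cp932"):
        print(data.hex(), repr(a), repr(b))
```

The resulting list can serve as targeted test data: sending exactly these characters through every tier exercises the cases that randomly chosen sample text is likely to miss.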

Although the discussion is based on experience developing and deploying systems for the Japanese market, the same principles should apply to any multi-tier system whose subsystems run on different platforms and exchange data in Unicode.