Internationalization and the Case for Unicode
Q: In the past, we have just handed off our
code to a translation agency. What's wrong with that?
A: Often, companies develop a first version of a program or
system to just deal with English. When it comes time to produce a first
international version, a common tactic is to just go through all the lines
of code, and translate the literal strings.
While this may work once, it is not a pattern that you want
to follow. Not all literal strings get translated, so this process
requires human judgment, and is time-consuming. Each new version is
expensive, since people have to go through the same process of identifying
the strings that need to be changed. In addition, since there are multiple
versions of the source code, maintenance and support becomes expensive.
Moreover, there is a high risk that a translator may introduce bugs by
mistakenly modifying code.
Q: What is the IT industry's best practice for translation now?
A: The general technique used now is to internationalize
the programs. This means to prepare them so that the code never needs
modification—separate files contain the translatable information. This
involves a number of modifications to the code:
move all translatable strings into separate files called
resource files, and make the code access those strings when needed.
These resource files can be flat text files, databases, or even code
resources, but they are completely separate from the main code, and
contain nothing but the translatable data.
change variable formatting to be language-independent. This
means that dates, times, numbers, currencies, and messages all call
functions to format according to the local language and country conventions.
change sorting, searching, and other types of processing to be
language-sensitive, again by calling locale-aware functions.
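The string-externalization and formatting steps can be sketched in a few lines of Python. This is an illustrative sketch only: the RESOURCES table and the function names (t, format_date, format_number) are hypothetical, and a real program would keep each locale's data in a separate file (a gettext catalog, a .properties file, etc.) and use the operating system's or a library's locale facilities rather than hand-rolled patterns.

```python
from datetime import date

# Illustrative in-memory "resource files": one entry per locale,
# containing nothing but translatable strings and locale patterns.
RESOURCES = {
    "en-US": {"greeting": "Hello, {name}!", "date_format": "%m/%d/%Y",
              "decimal_sep": "."},
    "de-DE": {"greeting": "Hallo, {name}!", "date_format": "%d.%m.%Y",
              "decimal_sep": ","},
}

def t(locale_id, key, **args):
    """Look up a translatable string by key; the code holds no literal text."""
    return RESOURCES[locale_id][key].format(**args)

def format_date(locale_id, d):
    """Format a date using the pattern stored in the locale's resources."""
    return d.strftime(RESOURCES[locale_id]["date_format"])

def format_number(locale_id, value):
    """Format a number with the locale's decimal separator."""
    return f"{value:.2f}".replace(".", RESOURCES[locale_id]["decimal_sep"])

d = date(2024, 3, 31)
print(t("de-DE", "greeting", name="Eva"))  # Hallo, Eva!
print(format_date("en-US", d))             # 03/31/2024
print(format_date("de-DE", d))             # 31.03.2024
print(format_number("de-DE", 3.5))         # 3,50
```

The key point is that the code only ever references keys and patterns, never literal translatable text, so adding a new locale means adding a resource file, not touching the source.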
Once this process is concluded, you have an internationalized
program. To localize that program then involves no changes to the
source code. Instead, just the translatable files are typically handed off
to contractors or translation agencies to modify. The initial cost of
producing internationalized code is somewhat higher than localizing to a
single market, but you pay that cost only once. The cost of each
localization, once your code is internationalized, is a fraction of the
previous cost, and you avoid the considerable expense of maintenance and
source code control for multiple code versions.
Q: How does Unicode play into this?
A: Unicode is the new foundation for this process of
internationalization. Older codepages were difficult to use and had
inconsistent definitions for characters. Internationalizing your code
while using the same code base is complex, since you would have to support
different character sets, with different architectures, for different markets.
But modern business requirements are even more demanding; programs
have to handle characters from a wide variety of languages at the same
time: the EU alone requires several different older character sets to
cover all its languages. Mixing older character sets together is a
nightmare, since all data has to be tagged, and mixing data from different
sources is nearly impossible to do reliably.
With Unicode, a single internationalization process can
produce code that handles the requirements of all the world markets at the
same time. Since Unicode has a single definition for each character, you
don't get data corruption problems that plague mixed codeset programs.
Since it handles the characters for all the world markets in a uniform
way, it avoids the complexities of different character code architectures.
All of the modern operating systems, from PCs to mainframes, support
Unicode now or are actively developing support for it. The same is true of
databases, as well.
Q: What was wrong with using classical
character sets for application programs?
A: Different character sets have very different
architectures. In many, even simply detecting which bytes form a
character is a complex, contextually-dependent process. That means
either having multiple versions of the program code for different
markets, or making the program code much, much more complicated. Both of these choices involve
development, testing, maintenance, and support problems. These make the
non-US versions of programs more expensive, and delay their introduction,
causing significant loss of revenue.
Q: What was wrong with using classical
character sets for databases?
A: Classical character sets only handle a few languages at a
time. Mixing languages was very difficult or impossible. In today's
markets, where data is mixed from many sources all around the world, that
strategy fails badly. The code for a simple letter like "A" will vary
wildly between different sets, making searching, sorting, and other
operations very difficult. There is also the problem of tagging every
piece of textual data with a character set, and corruption problems when
mixing text from different character sets.
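The tagging and corruption problems are easy to demonstrate with Python's standard codecs. The legacy encodings chosen here (KOI8-R, CP1251, and CP866, three common Cyrillic codepages) are just examples:

```python
# The same Cyrillic letter has a different byte value in each legacy
# character set, so raw bytes are meaningless without a charset tag.
ya = "Я"
print(ya.encode("koi8_r"))   # one byte value...
print(ya.encode("cp1251"))   # ...a different one here...
print(ya.encode("cp866"))    # ...and different again.

# Decoding with the wrong tag silently corrupts the text:
garbled = ya.encode("cp1251").decode("koi8_r")
print(garbled == ya)         # False: some other character entirely

# In Unicode (serialized as UTF-8) the letter has a single definition
# and always round-trips:
print(ya.encode("utf-8").decode("utf-8") == ya)  # True
```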
Q: What is different about Unicode?
A: Unicode provides a unique encoding for every character.
Once your data is in Unicode, it can all be handled the same way: sorted,
searched, and manipulated without fear of data corruption.
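A small sketch of that uniformity, using plain Python string operations. Note that sorted() here orders by code point; linguistically correct ordering still needs a collation library, but either way nothing is corrupted:

```python
# Every character has exactly one Unicode code point, whatever its source:
print(hex(ord("A")))   # 0x41
print(hex(ord("Я")))   # 0x42f

# Mixed-script data can be searched and sorted with one set of operations.
words = ["Яблоко", "Apple", "Ähre"]
print(sorted(words))       # code-point order: ['Apple', 'Ähre', 'Яблоко']
print("бло" in "Яблоко")   # True: substring search just works
```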
Q: You talk about Unicode being the right
technical approach. But is it being accepted by the market?
A: The industry is converging on Unicode for all
internationalization. For example, Microsoft Windows is built on a base of
Unicode; AIX, Solaris, HP-UX, and Apple's macOS all offer Unicode support.
All the new web standards, such as HTML and XML, support or require Unicode.
All modern browsers have extensive support for Unicode—including Internet
Explorer, Firefox, Safari, Opera, and Chrome. All modern database software also has Unicode support.
Most significant application programs with international
versions either support Unicode or are moving towards it. For example,
Microsoft's products were rapidly adapted to use Unicode: most of
Microsoft's Office suite of applications has supported Unicode for several
versions now. This is a good illustration—Microsoft first started by
merging their East Asian (Chinese, Japanese, and Korean) plus their US
version into a single program using Unicode. They then merged in Middle
East and South Asian support, until they had a single executable that
could handle all their supported languages.
Q: What about East Asian support?
A: Unicode incorporates the characters of all the major
government standards for ideographic characters from Japan, Korea, China,
and Taiwan, and more. The Unicode Standard has over 80,000
ideographic characters. The Unicode Consortium actively works with the
IRG committee of ISO SC2/WG2 to define additional sets of ideographic
characters for inclusion in future versions.
Q: So all I need is Unicode, right?
A: Unicode is not a magic wand; it is a standard for the
storage and interchange of textual data. Somewhere there has to be code
that recognizes and provides for the conventions of different languages
and countries. These conventions can be quite complex, and require
considerable expertise to develop code for and to produce the data
formats. Changing conditions and new markets also require considerable
maintenance and development. Usually this support is provided in the
operating system, or with a set of code libraries.
Q: Unicode has all sorts of features:
combining marks, bidirectionality, input methods, surrogates, Hangul
syllables, etc. Isn't it a big burden to support?
A: Unicode by itself is not complicated to
implement—it all depends on which languages you want to support. The
character repertoire you need fundamentally determines the features you
need to have for compliance. If you just want to support Western Europe,
you don't need much implementation beyond what you already have for ASCII.
Which further characters you need to support is really
dependent on the languages you want, and what system requirements you have
(servers, for example, may not need input or display). For example, if you
need input of East Asian languages, you have to have input methods; if
you display Arabic or Hebrew characters, then you need the
bidirectional algorithm. For normal applications, of course, much of this
will be handled by the operating system for you.
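Combining marks are one such repertoire-driven feature, and Python's standard unicodedata module is enough to show both the problem and the standard fix (normalization, as defined by the Unicode Standard's normalization forms):

```python
import unicodedata

# "é" can arrive precomposed (one code point) or as "e" plus a combining
# acute accent (two code points). They render identically but compare
# unequal as raw code point sequences:
precomposed = "caf\u00e9"    # 'café', é = U+00E9
decomposed = "cafe\u0301"    # 'café', e + U+0301 COMBINING ACUTE ACCENT
print(precomposed == decomposed)           # False
print(len(precomposed), len(decomposed))   # 4 5

# Normalizing both sides (here to NFC) makes comparison reliable:
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)                  # True
```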
Q: What level of support should I look for?
A: Unicode support really divides up into two categories:
server-side support and client-side support. The requirements
for Unicode support in these two categories can be summarized as follows
(although you may only need a subset of these features for your projects):
Full server-side Unicode support
This consists of:
Storage and manipulation of Unicode strings.
Conversion facilities to a full complement of other
charsets (8859-x, JIS, EBCDIC, etc.)
A full range of formatting/parsing functionality for
numbers, currencies, date/time and messages for all locales you need.
Message cataloging (resources) for accessing translated messages.
Unicode-conformant collation, normalization, and text
boundary (grapheme, word, line-break) algorithms.
Multiple locales/resources available simultaneously in
the same application or thread.
Charset-independent locales (all Unicode characters
usable in any locale).
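As one small illustration of the conversion item in the list above, Python's built-in codecs cover the 8859-x family and many others; the Greek sample text here is arbitrary:

```python
# Legacy data (here tagged ISO 8859-7, the Greek codepage) converts
# losslessly into Unicode, and back out for legacy consumers.
legacy_bytes = "καλημέρα".encode("iso8859_7")
text = legacy_bytes.decode("iso8859_7")   # ordinary Unicode text now
print(text == "καλημέρα")                 # True

# Once in Unicode, re-encoding for another charset either succeeds or
# fails loudly; it never silently corrupts the data:
try:
    text.encode("iso8859_1")              # Latin-1 has no Greek letters
except UnicodeEncodeError as e:
    print("cannot represent:", e.reason)
```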
Full client-side support
This consists of all the same features as
server-side support, plus GUI support:
Displaying, printing and editing Unicode text.
Inputting text (e.g., with Japanese input methods).
Full incorporation of these facilities into the windowing
system and the desktop interface.
Q: Why is it that emails written in non-Latin languages sometimes display correctly, while other times they just appear as squares and question marks?
A: Given the nature of the Internet, protocols for handling email are
very old, and that means your message might be handled by software that is
decades out of date and can't deal with any form of Unicode at all, or
with any other non-ASCII character set for that matter. The content of
email can be mangled by such software.
Newer protocols have been designed to avoid this kind of problem in handling character sets, but because of the
widely distributed nature of the email infrastructure, potential points of failure of character set conversion may exist for
decades yet, until everything is using Unicode correctly.
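The question-mark symptom in particular can be reproduced in a couple of lines (squares, by contrast, usually mean the text survived intact but the display font lacks the needed glyphs). A sketch using Python's codecs:

```python
# A UTF-8 message mis-decoded by software that assumes Latin-1 turns
# into classic mojibake: decoding succeeds, but yields gibberish.
original = "Привет"
mangled = original.encode("utf-8").decode("latin-1")
print(repr(mangled))    # nothing like the original text

# Software limited to ASCII replaces every character it can't
# represent, which is where the question marks come from:
print(original.encode("ascii", errors="replace"))  # b'??????'
```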
There is no way to diagnose a particular email problem without exact details of scenarios, because so many pieces of software
and so many different protocols are involved. These kinds of problems can creep in at any of the interfaces between them—or sometimes even internally to some particular piece of software.
Simply saying "My email doesn't work for script X" is usually all that an end user knows, but that is a little like a patient
approaching a doctor saying "I have a fever." Something is clearly wrong, but its cause could be any of hundreds of things and
requires detailed diagnosis of what is happening on a case-by-case basis.
There simply is no single satisfactory answer to the "My email is broken" query, because nobody in the IT field, not even the email specialists in the IETF, has mastery of all the software that could be involved and could be going haywire in character conversion somewhere. [JJ]