Unicode Frequently Asked Questions

Internationalization and the Case for Unicode

Q: In the past, we have just handed off our code to a translation agency. What's wrong with that?

A: Often, companies develop a first version of a program or system to just deal with English. When it comes time to produce a first international version, a common tactic is to just go through all the lines of code, and translate the literal strings.

While this may work once, it is not a pattern that you want to follow. Not all literal strings should be translated, so this process requires human judgment, and is time-consuming. Each new version is expensive, since people have to go through the same process of identifying the strings that need to be changed. In addition, since there are multiple versions of the source code, maintenance and support become expensive. Moreover, there is a high risk that a translator may introduce bugs by mistakenly modifying code.

Q: What is the IT industry's best practice for translation now?

A: The general technique used now is to internationalize the programs. This means preparing them so that the code itself never needs modification: separate files contain the translatable information. This involves a number of modifications to the code:

  1. Move all translatable strings into separate files called resource files, and make the code access those strings when needed. These resource files can be flat text files, databases, or even code resources, but they are completely separate from the main code, and contain nothing but the translatable data.

  2. Change variable formatting to be language-independent. This means that dates, times, numbers, currencies, and messages all call functions to format according to local language and country requirements.

  3. Change sorting, searching, and other types of processing to be language-independent.

Once this process is complete, you have an internationalized program. Localizing that program then involves no changes to the source code; typically, just the translatable files are handed off to contractors or translation agencies to modify. The initial cost of producing internationalized code is somewhat higher than localizing to a single market, but you pay it only once. The cost of doing a localization, once your code is internationalized, is a fraction of the previous cost, and you avoid the considerable cost of maintenance and source code control for multiple code versions.
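The first step above, externalizing strings, can be sketched in a few lines. This is a minimal illustration only, with in-memory dictionaries standing in for the real resource files, and all names (RESOURCES, get_string) hypothetical:

```python
# Minimal sketch of string externalization: translations live in data
# (here, dicts standing in for resource files), and the code looks up
# strings only by symbolic key, never by hard-coded literal text.
RESOURCES = {
    "en": {"greeting": "Hello, {name}!", "farewell": "Goodbye."},
    "de": {"greeting": "Hallo, {name}!", "farewell": "Auf Wiedersehen."},
}

def get_string(locale, key, **params):
    """Fetch a translatable string by key, falling back to English."""
    template = RESOURCES.get(locale, RESOURCES["en"])[key]
    return template.format(**params)

print(get_string("en", "greeting", name="Ada"))  # Hello, Ada!
print(get_string("de", "greeting", name="Ada"))  # Hallo, Ada!
```

Translators then edit only the resource data; the lookup code never changes between language versions.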

Q: What role does Unicode play in internationalization?

A: Unicode is the new foundation for this process of internationalization. Older codepages were difficult to use and had inconsistent definitions for characters. Internationalizing your code while keeping a single code base was complex, since you had to support different character sets, with different architectures, for different markets.

But modern business requirements are even stronger; programs have to handle characters from a wide variety of languages at the same time: the EU alone requires several different older character sets to cover all its languages. Mixing older character sets together is a nightmare, since all data has to be tagged, and mixing data from different sources is nearly impossible to do reliably.

With Unicode, a single internationalization process can produce code that handles the requirements of all the world markets at the same time. Since Unicode has a single definition for each character, you don't get data corruption problems that plague mixed codeset programs. Since it handles the characters for all the world markets in a uniform way, it avoids the complexities of different character code architectures. All of the modern operating systems, from PCs to mainframes, support Unicode now or are actively developing support for it. The same is true of databases, as well.

Q: What was wrong with using classical character sets for application programs?

A: Different character sets have very different architectures. In many, even simply detecting which bytes form a character is a complex, contextually-dependent process. That means either having multiple versions of the program code for different markets, or making the program code much, much more complicated. Both of these choices involve development, testing, maintenance, and support problems. These make the non-US versions of programs more expensive, and delay their introduction, causing significant loss of revenue.

Q: What was wrong with using classical character sets for databases?

A: Classical character sets only handle a few languages at a time. Mixing languages was very difficult or impossible. In today's markets, mixing data from many sources all around the world, that strategy for products fails badly. The code for a simple letter like "A" will vary wildly between different sets, making searching, sorting, and other operations very difficult. There is also the problem of tagging every piece of textual data with a character set, and corruption problems when mixing text from different character sets.
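The variation in the code for a single letter is easy to demonstrate. A short Python sketch, using the standard codecs for ASCII and for EBCDIC code page 500:

```python
# The byte for the letter "A" differs between classical character sets:
# 0x41 in ASCII (and Latin-1), but 0xC1 in EBCDIC (code page 500).
ascii_a = "A".encode("ascii")    # b'\x41'
ebcdic_a = "A".encode("cp500")   # b'\xc1'
print(hex(ascii_a[0]), hex(ebcdic_a[0]))  # 0x41 0xc1
```

Any byte-level search or sort written against one of these sets fails silently on data from the other.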

Q: What is different about Unicode?

A: Unicode provides a unique encoding for every character. Once your data is in Unicode, it can all be handled the same way: sorted, searched, and manipulated without fear of data corruption.
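Each character's single, fixed code point and name can be inspected directly; a small Python illustration using the standard unicodedata module:

```python
import unicodedata

# Every Unicode character has exactly one code point and one name,
# the same on every platform and in every program.
for ch in "Aж":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# "A" is U+0041 on every system, unlike in the classical character sets.
```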

Q: You talk about Unicode being the right technical approach. But is it being accepted by the market?

A: The industry is converging on Unicode for all internationalization. For example, Microsoft Windows is built on a base of Unicode, and AIX, Solaris, HP-UX, and Apple's MacOS all offer Unicode support. The new web standards (HTML, XML, etc.) support or require Unicode. All modern browsers have extensive support for Unicode, including Internet Explorer, Firefox, Safari, Opera, and Chrome. All modern database software also has Unicode support.

Most significant application programs with international versions either support Unicode or are moving towards it. For example, Microsoft's products were rapidly adapted to use Unicode: most of Microsoft's Office suite of applications has supported Unicode for several versions now. This is a good illustration—Microsoft first started by merging their East Asian (Chinese, Japanese, and Korean) plus their US version into a single program using Unicode. They then merged in Middle East and South Asian support, until they had a single executable that could handle all their supported languages.

Q: What about East Asian support?

A: Unicode incorporates the characters of all the major government standards for ideographic characters from Japan, Korea, China, and Taiwan, and more. The Unicode Standard has over 80,000 ideographic characters. The Unicode Consortium actively works with the IRG committee of ISO SC2/WG2 to define additional sets of ideographic characters for inclusion in future versions.

Q: So all I need is Unicode, right?

A: Unicode is not a magic wand; it is a standard for the storage and interchange of textual data. Somewhere there has to be code that recognizes and provides for the conventions of different languages and countries. These conventions can be quite complex, and developing the code and data formats for them requires considerable expertise. Changing conditions and new markets also require considerable maintenance and development. Usually this support is provided by the operating system, or by a set of code libraries.

Q: Unicode has all sorts of features: combining marks, bidirectionality, input methods, surrogates, Hangul syllables, etc. Isn't it a big burden to support?

A: Unicode by itself is not complicated to implement; it all depends on which languages you want to support. The character repertoire you need fundamentally determines the features you need to have for compliance. If you just want to support Western Europe, you don't need much implementation beyond what you have with ASCII.

Which further characters you need to support really depends on the languages you want, and on your system requirements (servers, for example, may not need input or display). For example, if you need input for East Asian languages, you have to have input methods. If you display Arabic or Hebrew characters, then you need the bidirectional algorithm. For normal applications, of course, much of this will be handled by the operating system for you.
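One of the features mentioned above, combining marks, shows why normalization is part of this support. A short sketch using Python's standard unicodedata module:

```python
import unicodedata

# "é" can be stored two ways: as the precomposed character U+00E9, or
# as "e" followed by a combining acute accent (U+0301). Normalization
# (here, to NFC) makes the two representations compare equal.
precomposed = "\u00e9"
decomposed = "e\u0301"
print(precomposed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

An application that searches or compares text needs to normalize first, or it will miss matches between the two forms.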

Q: What level of support should I look for?

A: Unicode support really divides up into two categories: server-side support and client-side support. The requirements for Unicode support in these two categories can be summarized as follows (although you may only need a subset of these features for your projects):

Full server-side Unicode support

This consists of:
  • Storage and manipulation of Unicode strings.

  • Conversion facilities to a full complement of other charsets (8859-x, JIS, EBCDIC, etc.)

  • A full range of formatting/parsing functionality for numbers, currencies, date/time and messages for all locales you need.

  • Message cataloging (resources) for accessing translated text.

  • Unicode-conformant collation, normalization, and text boundary (grapheme, word, line-break) algorithms.

  • Multiple locales/resources available simultaneously in the same application or thread.

  • Charset-independent locales (all Unicode characters usable in any locale).
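The conversion facilities in the list above work by passing legacy data through Unicode. A minimal Python sketch, using the standard codecs for ISO 8859-1 and UTF-8:

```python
# Converting legacy data through Unicode: bytes in one charset are
# decoded to Unicode text, which can then be re-encoded for another
# system without loss (for any character both charsets can represent).
latin1_bytes = b"caf\xe9"                  # "café" in ISO 8859-1
text = latin1_bytes.decode("iso-8859-1")   # now charset-independent Unicode
utf8_bytes = text.encode("utf-8")          # b"caf\xc3\xa9"
print(text, utf8_bytes)
```

Unicode acts as the pivot, so a server needs one converter per charset rather than one per pair of charsets.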

Full client-side support

This consists of all the same features as server-side support, plus GUI support:
  • Displaying, printing and editing Unicode text.

    • This requires BIDI display if Arabic and Hebrew characters are supported.

    • This requires character shaping if scripts such as Arabic or the scripts of India are supported.

  • Inputting text (e.g. with Japanese input methods)

  • Full incorporation of these facilities into the windowing system and the desktop interface.

Q: Why is it that emails written in non-Latin languages sometimes display correctly, while other times they just appear as squares and question marks?

A: Protocols for handling email are very old, and given the distributed nature of the Internet, your email might be handled by software that is decades old and cannot deal with any form of Unicode at all, or with any other non-ASCII character set for that matter. The content of email can be mangled by such software.

Newer protocols have been designed to avoid this kind of problem in handling character sets, but because of the widely distributed nature of the email infrastructure, potential points of failure of character set conversion may exist for decades yet, until everything is using Unicode correctly.
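One of those newer mechanisms is the RFC 2047 "encoded word," which wraps non-ASCII header text in pure ASCII so that old mail software can pass it through untouched. A small sketch using Python's standard email.header module:

```python
from email.header import Header

# Non-ASCII header text is wrapped as an RFC 2047 encoded word:
# the result contains only ASCII, which even decades-old mail
# software can relay without corrupting it.
subject = Header("Grüße", "utf-8")
encoded = subject.encode()
print(encoded)  # e.g. '=?utf-8?b?...?='
```

Display problems like squares and question marks typically appear when some hop in the chain mangles text that was not protected this way, or mislabels its character set.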

There is no way to diagnose a particular email problem without exact details of scenarios, because so many pieces of software and so many different protocols are involved. These kinds of problems can creep in at any of the interfaces between them—or sometimes even internally to some particular piece of software.

Simply saying "My email doesn't work for script X" is usually all that an end user knows, but that is a little like a patient approaching a doctor saying "I have a fever." Something is clearly wrong, but its cause could be any of hundreds of things and requires detailed diagnosis of what is happening on a case-by-case basis.

There simply is no single satisfactory and satisfying answer to the "My email is broken" query, because nobody in the IT field—not even the email specialists in the IETF—has mastery of all the software that could be involved and which could be going haywire in character conversions someplace. [JJ]
