Processing Non-Unicode Markup Language Formats in
Unicode-Based Localization Environments
Daniel Brockmann - TRADOS GmbH

Intended Audience: Software Engineers, Technical Writers, Testers

Session Level: Advanced

Today's localization environments, that is, systems offering translation memory and terminology database functionality, are all based to a greater or lesser degree on the Unicode standard. They store database records in Unicode and represent all data in Unicode internally so that they can support every language available on Unicode-based platforms such as Windows XP. However, such systems do not deal only with Unicode formats such as XML. They must also still support 7-bit ASCII markup formats that are not Unicode-aware, e.g. SGML. SGML documents typically represent all special characters as public entities. For instance, umlaut characters such as "ö" are represented as "&ouml;", and publishing characters such as the "—" are represented as "&dash;".
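To make the entity convention concrete, the following minimal Python sketch lists the entity references in a 7-bit ASCII SGML fragment and looks up the Unicode character each one stands for. It is an illustration only: the function name list_entities and the sample string are hypothetical, and Python's HTML5 named-character table (html.entities.html5) is used here as a stand-in for the ISO public entity sets that a real SGML DTD would declare.

```python
import re
from html.entities import html5

# Matches a general entity reference such as &ouml; (a name followed by ';').
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def list_entities(sgml_text):
    """Map each entity name found in the text to its Unicode character.

    Names missing from the lookup table map to an empty string.
    """
    # Raises UnicodeEncodeError if the input is not pure 7-bit ASCII.
    sgml_text.encode("ascii")
    found = {}
    for match in ENTITY_REF.finditer(sgml_text):
        name = match.group(1)
        # Assumption: the HTML5 table approximates the ISO entity sets.
        found[name] = html5.get(name + ";", "")
    return found


# Hypothetical sample: two public entities and one unknown, DTD-specific one.
sample = "<p>sch&ouml;n &dash; fertig &myent;</p>"
entities = list_entities(sample)
```

Entities that the table does not know, such as the made-up &myent; above, are reported with an empty value so that they can be left untouched rather than guessed at.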

Obviously, these characters should be preserved in every target language supported by the localization environment - what comes in as "&dash;" must also come out as "&dash;". However, because they are founded on the Unicode standard, advanced systems can offer users several options for converting entities to "real" Unicode characters during translation and back to entities after translation.
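The round trip described above can be sketched in a few lines of Python. This is not the method of any particular localization product: the function names, the PROTECTED set, and the sample string are hypothetical, and html.entities.html5 again stands in for the ISO public entity sets. The sketch records which entity produced which character on import, so the same entity name can be written back on export.

```python
import re
from html.entities import html5

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# Markup-significant entities are left alone so that tags are not corrupted.
PROTECTED = {"amp", "lt", "gt", "quot", "apos"}


def entities_to_unicode(text):
    """Replace known entity references with characters; record the mapping."""
    used = {}  # character -> entity name, needed for the way back

    def repl(match):
        name = match.group(1)
        char = html5.get(name + ";")
        # Unknown, protected, or multi-character entities stay as they are.
        if name in PROTECTED or char is None or len(char) != 1:
            return match.group(0)
        used.setdefault(char, name)
        return char

    return ENTITY_REF.sub(repl, text), used


def unicode_to_entities(text, used):
    """Turn characters back into entities so the output is 7-bit ASCII again."""
    out = []
    for ch in text:
        if ch in used:
            out.append(f"&{used[ch]};")   # same entity name as in the source
        elif ord(ch) > 127:
            out.append(f"&#{ord(ch)};")   # fallback: numeric character reference
        else:
            out.append(ch)
    return "".join(out)


source = "Gr&ouml;&szlig;e &dash; sch&ouml;n"
editable, used = entities_to_unicode(source)      # text shown to the translator
restored = unicode_to_entities(editable, used)    # text written back to SGML
```

One limitation of this design: if two different entity names resolve to the same character, only the first name seen is remembered, so an exact round trip is guaranteed only when each character has a single entity name in the document.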

Using the leading localization system and a real-life customer example, this presentation discusses the advantages and disadvantages of processing public entities either as entities or as Unicode characters during translation, depending on the use case and on other parameters such as the underlying platform, codepages, and fonts.
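One of the parameters mentioned above, the codepage, can be checked mechanically. The hypothetical helper below (a sketch, not part of any localization tool) reports which characters of a converted text a given legacy codepage cannot encode; such characters are candidates for staying as entities rather than being converted. The codepage names are the codec names Python itself uses.

```python
def unencodable_chars(text, codepage):
    """Return the distinct characters of `text` that `codepage` cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# "schön" followed by a U+2010 hyphen, checked against two Windows codepages.
converted = "sch\u00f6n \u2010"
missing_1252 = unencodable_chars(converted, "cp1252")  # Western European
missing_1251 = unencodable_chars(converted, "cp1251")  # Cyrillic
```

A check like this says nothing about fonts: a character may be encodable on the platform and still lack a glyph in the font used for display.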

This presentation addresses anyone concerned with processing both non-Unicode and Unicode data in multilingual applications. Attendees should be familiar with the typical challenges of handling file formats such as SGML and XML in authoring or multilingual environments.
