Converting Non-Unicode Databases to Unicode
Martin Schmidt - SAP AG

Intended Audience: Software Engineers, Managers, System Administrators

Session Level: Beginner, Intermediate

Enterprises run large IT landscapes built on different platforms and databases. Database content is especially critical to handle when converting to Unicode, because the data often originate from different codepages.

Using several examples, this paper discusses the technical challenges faced when converting multilingual databases to Unicode and shows how they are solved within SAP software:

1. Automatic Codepage Recognition
All data in the database must be assigned a codepage before they can be converted to Unicode. If the codepage cannot be retrieved from attributes, it can be determined automatically with a statistical method that evaluates the relative frequency of characters and character sequences. The method presented here assigns codepages quickly and reliably and can be used for every language.
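The statistical idea can be illustrated with a minimal sketch. It is not the method presented in the paper: it scores single characters only (no character sequences), and the function names, the German reference text, and the candidate list are all assumptions made for illustration. The raw bytes are decoded with each candidate codepage, and the decoding whose characters best match the reference frequencies wins.

```python
import math
from collections import Counter

def build_model(sample_text):
    """Relative frequency of each character in a reference text."""
    counts = Counter(sample_text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def score(text, model, floor=1e-6):
    """Average log-probability per character; unseen characters get `floor`."""
    if not text:
        return float("-inf")
    return sum(math.log(model.get(ch, floor)) for ch in text) / len(text)

def guess_codepage(raw, candidates, model):
    """Return the candidate codepage whose decoding best fits the model."""
    best, best_score = None, float("-inf")
    for cp in candidates:
        try:
            text = raw.decode(cp)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this codepage
        s = score(text, model)
        if s > best_score:
            best, best_score = cp, s
    return best

# Hypothetical reference text for German; a real model would be built
# from a much larger corpus per language.
german_model = build_model(
    "Die Größe der Bäume und die Länge der Flüsse prüfen wir später "
    "über öffentliche Straßen."
)

raw = "Größe und Länge prüfen".encode("cp1252")
print(guess_codepage(raw, ["cp1251", "iso8859_7", "cp1252"], german_model))
```

All three candidates decode the bytes without error, so validity alone cannot decide; only the frequency score separates the plausible German reading from the Cyrillic and Greek ones.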

2. Round-Trip Compatibility
If data are converted incorrectly to Unicode, it must be possible to restore the original data. Special care has to be taken with the round-trip compatibility of all user-definable areas defined in the non-Unicode character sets. An example is given for the conversion of a database that uses the Hong Kong Supplementary Character Set.
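A round-trip check can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAP implementation: the helper names are hypothetical, Python's built-in `big5hkscs` codec stands in for the database codepage, and characters from user-defined areas are assumed to land in the Unicode Private Use Area (U+E000 to U+F8FF in the Basic Multilingual Plane), which is a common mapping convention.

```python
def round_trips(raw, codepage):
    """True if decoding to Unicode and encoding back reproduces the bytes."""
    try:
        return raw.decode(codepage).encode(codepage) == raw
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False  # the conversion is lossy or undefined for these bytes

def private_use_positions(text):
    """Indexes of characters in the BMP Private Use Area (U+E000..U+F8FF)."""
    return [i for i, ch in enumerate(text) if 0xE000 <= ord(ch) <= 0xF8FF]

# Example value: a string stored in an HKSCS-based codepage.
stored = "香港".encode("big5hkscs")
print(round_trips(stored, "big5hkscs"))
print(private_use_positions(stored.decode("big5hkscs")))
```

Values for which `round_trips` returns `False`, or which contain Private Use Area characters after conversion, are the ones that would need a closer look before the original data are discarded.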

3. Conversion of Huge Databases
A critical parameter for the conversion of a database is downtime. In addition to careful preparation, all steps of the conversion have to be highly parallelized to reduce downtime to a minimum. A general scenario is sketched for converting databases of 2 TBytes in size within 48 hours.
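The 2 TBytes in 48 hours target implies a sustained throughput that is easy to work out. The sketch below assumes decimal units (1 TByte = 10^12 bytes), that the whole window is available for conversion, and ideal scaling across workers; the per-worker rate of 2 MB/s is a purely hypothetical figure, not a measurement from the paper.

```python
import math

TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte

def required_throughput(total_bytes, window_hours):
    """Sustained bytes per second needed to finish within the window."""
    return total_bytes / (window_hours * 3600)

def workers_needed(total_bytes, window_hours, bytes_per_sec_per_worker):
    """Minimum number of equally fast parallel workers, assuming ideal scaling."""
    rate = required_throughput(total_bytes, window_hours)
    return math.ceil(rate / bytes_per_sec_per_worker)

rate = required_throughput(2 * TB, 48)
print(f"{rate / MB:.1f} MB/s")            # 11.6 MB/s
print(workers_needed(2 * TB, 48, 2 * MB))  # 6, given the hypothetical 2 MB/s per worker
```

In practice the window also has to cover preparation and verification steps, so the throughput actually required during the conversion itself would be higher than this lower bound.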