Unicode Frequently Asked Questions

Conversions / Mappings

Q: How should Unicode strings be passed as data to programs that are not 'Unicode-aware', and only support a different encoding?

A: There are 3 options:

(1) Use "escapes". For example, in XML or HTML one can write the string "α ≤ 3" using only ASCII characters as "&#x03B1; &#x2264; 3". Similarly, with Java conventions one can write this string as "\u03B1 \u2264 3". This allows round-tripping from Unicode to the legacy set and back.

(2) Transform all the data in the field to a hex form, say in UTF-16 as "03B10020226400200033" or in UTF-8 as "CEB120E289A42033". This also permits round-tripping, but takes more space and is less readable.

(3) Simply transform to the legacy encoding. This will, however, corrupt any data that cannot be expressed in the legacy encoding. For example, when Unicode is transformed to code page 1252 on Windows platforms, this string may come out as "a < 3". On other platforms, it might come out as "? ? 3" or even as "  3".
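The three options can be sketched in Python as follows. This is only an illustration under stated assumptions: `to_hex_ncr` is a hypothetical helper written for this example, and Python's built-in cp1252 codec does not perform the Windows "best fit" substitution, so it reproduces the "? ? 3" and "  3" outcomes but not "a < 3".

```python
import html

s = "\u03b1 \u2264 3"  # the sample string "α ≤ 3"


def to_hex_ncr(text: str) -> str:
    """Hypothetical helper: replace each non-ASCII character with a
    hexadecimal numeric character reference (&#xHHHH;).
    A literal '&' in the input is not escaped here; real XML/HTML
    output would need that as well."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in text)


# Option 1: escapes (ASCII-only output that can be converted back)
ncr = to_hex_ncr(s)
java_style = s.encode("ascii", errors="backslashreplace").decode("ascii")
round_trip_ncr = html.unescape(ncr)

# Option 2: hex dump of the encoded bytes (round-trips, but longer and less readable)
utf16_hex = s.encode("utf-16-be").hex().upper()
utf8_hex = s.encode("utf-8").hex().upper()
round_trip_hex = bytes.fromhex(utf8_hex).decode("utf-8")

# Option 3: direct conversion to the legacy encoding (lossy)
replaced = s.encode("cp1252", errors="replace")  # unmappable characters become '?'
dropped = s.encode("cp1252", errors="ignore")    # unmappable characters are removed

print(ncr)
print(java_style)
print(utf16_hex, utf8_hex)
print(replaced, dropped)
```

Only options 1 and 2 recover the original string; after option 3 the information about α and ≤ is gone.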

Q: Where can I find some UTF-8 sample files for testing a to_utf16 converter?

A: There are links to some interesting samples at http://www.cl.cam.ac.uk/~mgk25/unicode.html#examples. The Unicode site also contains a collection of UTF-8 pages under "What is Unicode?". Those pages contain translations of the same content into many different languages, all encoded in UTF-8, and can be used to test conversion algorithms as well as rendering or other processes. [DB]
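Alongside sample files, a few hand-checked byte sequences make a quick smoke test. The sketch below assumes a converter with the hypothetical signature `to_utf16(utf8_bytes) -> bytes` that produces big-endian UTF-16 without a byte order mark; the stand-in body simply uses Python's own codecs and should be swapped for the converter under test.

```python
def to_utf16(utf8_bytes: bytes) -> bytes:
    """Hypothetical stand-in for the converter under test.
    Assumed output: UTF-16 big-endian, no byte order mark."""
    return utf8_bytes.decode("utf-8").encode("utf-16-be")


# (UTF-8 input, expected UTF-16BE output), both written as hex
cases = [
    ("CEB120E289A42033", "03B10020226400200033"),  # "α ≤ 3"
    ("F09F9880", "D83DDE00"),                      # U+1F600, needs a surrogate pair
]

results = [to_utf16(bytes.fromhex(src)).hex().upper() for src, _ in cases]
all_match = all(got == want for got, (_, want) in zip(results, cases))

# Malformed input: C0 80 is an overlong encoding and should be rejected.
try:
    to_utf16(b"\xC0\x80")
    rejected_overlong = False
except UnicodeDecodeError:
    rejected_overlong = True

print(all_match, rejected_overlong)
```

A converter written in another language would signal malformed input in its own way, so the `UnicodeDecodeError` check applies only to this stand-in.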