Conversions / Mappings
Q: How should Unicode strings be passed as
data to programs that are not 'Unicode-aware', and only support a
legacy character set?
A: There are 3 options:
(1) Use "escapes". For example, in XML or HTML one can
write the string "α ≤ 3" using only ASCII characters as
"&#x3B1; &#x2264; 3". Similarly, with Java conventions one can write this string as
"\u03B1 \u2264 3". This allows round-tripping from Unicode to the legacy
set and back.
(2) Transform all the data in the field to a hex form, say
in UTF-16 as "03B10020226400200033" or in UTF-8 as "CEB120E289A42033".
This also permits round-tripping, but takes more space and is less
readable.
(3) Simply transform to the legacy encoding. This will,
however, cause corruption of data that cannot be expressed in the legacy
encoding. For example, when Unicode is transformed to code page 1252 on Windows
platforms, this string may come out as "a < 3". On other platforms, it
might come out as "? ? 3" or even as " 3".
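
The three options can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the helper names xml_escape and java_escape are made up for this example, and the escapers cover only non-ASCII characters (a complete version would also escape "&", "<" and "\"). Note too that Python's cp1252 codec does no "best fit" mapping, so it shows the "?" and dropped-character outcomes rather than the Windows one.

```python
import html

text = "\u03b1 \u2264 3"  # the string "α ≤ 3"

def xml_escape(s: str) -> str:
    # Option 1a: hex numeric character references for non-ASCII characters.
    return "".join(c if ord(c) < 0x80 else "&#x%X;" % ord(c) for c in s)

def java_escape(s: str) -> str:
    # Option 1b: one \uXXXX escape per UTF-16 code unit, so a
    # supplementary character becomes a surrogate pair.
    out = []
    for c in s:
        if ord(c) < 0x80:
            out.append(c)
        else:
            units = c.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)

xml_form = xml_escape(text)    # ASCII-only, reversible
java_form = java_escape(text)  # ASCII-only, reversible

# Option 2: hex dump of the encoded bytes (reversible, but bulkier).
hex_utf16 = text.encode("utf-16-be").hex().upper()
hex_utf8 = text.encode("utf-8").hex().upper()

# Option 3: direct conversion to the legacy encoding (lossy).
legacy_replaced = text.encode("cp1252", errors="replace")
legacy_dropped = text.encode("cp1252", errors="ignore")

# Options 1 and 2 round-trip; option 3 does not.
assert html.unescape(xml_form) == text
assert bytes.fromhex(hex_utf8).decode("utf-8") == text
assert legacy_replaced.decode("cp1252") != text
```

Here the escaped and hex forms can be turned back into the original string, while the cp1252 bytes cannot.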
Q: Where can I find some UTF-8 sample files
for testing a to_utf16 converter?
A: There are interesting links to some at
http://www.cl.cam.ac.uk/~mgk25/unicode.html#examples. The Unicode
site also contains a collection of UTF-8 pages at
"What is Unicode?". Those pages contain UTF-8 translations of the same content
into many different languages, and can also be used to test conversion
algorithms, as well as rendering or other processes.
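
One way to use such samples is a round-trip check, sketched below in Python. The to_utf16 function here is only a stand-in (built on Python's own codecs) for whatever converter is actually being tested, and the inline sample string takes the place of bytes read from a downloaded file; both are assumptions made for the example.

```python
def to_utf16(utf8_bytes: bytes) -> bytes:
    # Stand-in for the converter under test:
    # UTF-8 bytes in, big-endian UTF-16 bytes (no BOM) out.
    return utf8_bytes.decode("utf-8").encode("utf-16-be")

def check_sample(utf8_bytes: bytes) -> bytes:
    utf16 = to_utf16(utf8_bytes)
    # Converting back must reproduce the original bytes exactly.
    assert utf16.decode("utf-16-be").encode("utf-8") == utf8_bytes
    return utf16

# In practice the bytes would come from a sample file opened in binary
# mode; this inline sample includes U+1D11E, which lies outside the BMP
# and so exercises surrogate pairs.
sample = "\u03b1 \u2264 3 \U0001D11E".encode("utf-8")
result = check_sample(sample)
```

Including at least one supplementary character in the test data is worthwhile, since surrogate-pair handling is a common source of converter bugs.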