Conversions / Mappings
Q: How should Unicode strings be passed as
data to programs that are not 'Unicode-aware', and only support a legacy character set?
A: There are 3 options:
(1) Use "escapes". For example, in XML or HTML one can
write the string "α ≤ 3" using only ASCII characters as "&#x3B1; &#x2264; 3".
Similarly, with Java conventions one can write this string as
"\u03B1 \u2264 3". This allows round-tripping from Unicode to the legacy
set and back.
(2) Transform all the data in the field to a hex form, say
in UTF-16 as "03B10020226400200033" or in UTF-8 as "CEB120E289A42033".
This also permits round-tripping, but takes more space and is less readable.
(3) Simply transform to the legacy encoding. This will,
however, corrupt any data that cannot be expressed in the legacy
encoding. For example, when Unicode is transformed to code page 1252 on Windows
platforms, this string may come out as "a < 3". On other platforms, it
might come out as "? ? 3" or even as " 3".
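The three options can be sketched in Python as follows. This is only an illustration using Python's standard codecs and error handlers, not a recommendation of a particular tool. Note that Python's built-in handlers differ in detail from the notations above: `xmlcharrefreplace` emits decimal references rather than hex, `backslashreplace` emits lowercase hex (and is not exactly the Java convention outside the Basic Multilingual Plane), and Python does not perform the Windows-style "best fit" substitution that yields "a < 3".

```python
import html

text = "\u03b1 \u2264 3"  # the example string: alpha, less-than-or-equal, 3

# (1) Escapes: replace each non-ASCII character with an ASCII escape.
xml_escaped = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
backslash_escaped = text.encode("ascii", errors="backslashreplace").decode("ascii")

# Both forms round-trip for this string. Caveat: a literal backslash or
# ampersand already present in the data would need escaping as well.
assert html.unescape(xml_escaped) == text
assert backslash_escaped.encode("ascii").decode("unicode_escape") == text

# (2) Hex form of the encoded bytes (big-endian UTF-16 without a BOM, or UTF-8).
hex_utf16 = text.encode("utf-16-be").hex().upper()
hex_utf8 = text.encode("utf-8").hex().upper()
assert bytes.fromhex(hex_utf16).decode("utf-16-be") == text
assert bytes.fromhex(hex_utf8).decode("utf-8") == text

# (3) Direct conversion to the legacy encoding: lossy, no round trip.
replaced = text.encode("cp1252", errors="replace").decode("cp1252")
dropped = text.encode("cp1252", errors="ignore").decode("cp1252")

print(xml_escaped)        # &#945; &#8804; 3
print(backslash_escaped)  # \u03b1 \u2264 3
print(hex_utf16)          # 03B10020226400200033
print(hex_utf8)           # CEB120E289A42033
print(repr(replaced))     # '? ? 3'
print(repr(dropped))      # '  3'
```

Options (1) and (2) recover the original string exactly; with option (3) the alpha and the less-than-or-equal sign are gone for good.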
Q: Where can I find some UTF-8 sample files
for testing a to_utf16 converter?
A: There are interesting links to some at
http://www.cl.cam.ac.uk/~mgk25/unicode.html#examples. The Unicode
site also contains a collection of UTF-8 pages under "What is
Unicode?". Those pages contain UTF-8 translations of the same content
into many different languages, and can also be used to test conversion
algorithms, as well as rendering or other processes.
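Besides downloaded files, a small hand-made sample can be useful, as long as it covers 1-, 2-, 3- and 4-byte UTF-8 sequences. The following is a minimal sketch, assuming Python's built-in codecs are an acceptable reference; the helper names `expected_utf16_units` and `is_well_formed_utf8` are made up for this illustration. It produces the UTF-16 code units that a converter under test should output, plus a few malformed inputs that it should reject.

```python
import struct

def expected_utf16_units(utf8_bytes):
    """Return the UTF-16 code units a correct converter should produce."""
    text = utf8_bytes.decode("utf-8")   # strict: raises on malformed input
    raw = text.encode("utf-16-be")      # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

def is_well_formed_utf8(data):
    """True if the bytes are well-formed UTF-8 according to Python's decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+0041 (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes).
SAMPLE = "A\u00e9\u20ac\U0001f600".encode("utf-8")

# Inputs a converter should refuse: an overlong form, an encoded surrogate,
# and a truncated 3-byte sequence.
MALFORMED = [b"\xc0\xaf", b"\xed\xa0\x80", b"\xe2\x82"]

units = expected_utf16_units(SAMPLE)
print(["%04X" % u for u in units])  # ['0041', '00E9', '20AC', 'D83D', 'DE00']
print([is_well_formed_utf8(m) for m in MALFORMED])  # [False, False, False]
```

The 4-byte sequence is the important case: it must come out as a surrogate pair (here D83D DE00), which is where hand-written converters most often go wrong.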
Q: Many of the East Asian character set mapping data files on the Unicode site
are in an "OBSOLETE" directory. Why?
A: Mapping of legacy East Asian character sets is complicated, and there are
often subtle differences between implementations on different platforms.
The mapping files located in the Public data directory on the Unicode site
have some historical interest and may inform implementations of mappings,
but do not necessarily reflect current practice in all cases. Because of this
they are placed in an "OBSOLETE" directory, and their documentation includes
appropriate caveats about their use.
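As one illustration of such a difference, the same two bytes decode to different Unicode code points under two closely related Japanese encodings. The sketch below uses the `shift_jis` and `cp932` codecs bundled with CPython; other libraries and platforms may map these bytes differently again, which is exactly the problem.

```python
# The byte pair 0x81 0x60 is a single character in Shift_JIS-family encodings.
raw = b"\x81\x60"

decoded = {codec: raw.decode(codec) for codec in ("shift_jis", "cp932")}
for codec, ch in decoded.items():
    print("%s: U+%04X" % (codec, ord(ch)))

# shift_jis: U+301C  (WAVE DASH)
# cp932: U+FF5E      (FULLWIDTH TILDE, the Windows code page 932 mapping)
```

Data decoded with one table and re-encoded with the other may therefore fail to round-trip.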
For a few more up-to-date mappings for East Asian legacy character sets to Unicode
code points, see the vendor-supplied mapping tables located in:
More complete and detailed solutions for East Asian character set mappings
can be found associated with currently maintained general implementations
of character set conversion. For example, see the International Components
for Unicode documentation about character set conversions:
or the Free Software Foundation documentation about the GNU libiconv implementation: