Conversions / Mappings

Q: How should Unicode strings be passed as data to programs that are not 'Unicode-aware', and only support a different encoding?

There are three options:

(1) Use "escapes". For example, in XML or HTML one can write the string "α ≤ 3" using only ASCII characters as "&#x3B1; &#x2264; 3". Similarly, with Java conventions one can write this string as "\u03B1 \u2264 3". This allows round-tripping from Unicode to the legacy set and back.

(2) Transform all the data in the field to a hex form, say in UTF-16 as "03B10020226400200033" or in UTF-8 as "CEB120E289A42033". This also permits round-tripping, but takes more space and is less readable.

(3) Simply transform to the legacy encoding. This will, however, corrupt any data that cannot be expressed in the legacy encoding. For example, when Unicode is transformed to code page 1252 on Windows platforms, this string may come out as "a < 3". On other platforms, it might come out as "? ? 3" or even as " 3".
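The three options can be sketched in Python as follows. This is only an illustration using the standard library's codec error handlers, not a recommended implementation; the variable names are made up for the example. Note two differences from the text above: Python's "xmlcharrefreplace" handler emits decimal rather than hexadecimal character references, and Python's cp1252 codec does not perform the Windows "best fit" substitution, so option 3 only shows the "?" and dropped-character outcomes.

```python
import html

text = "\u03b1 \u2264 3"  # the string "α ≤ 3"

# Option 1: escapes. Non-ASCII characters become ASCII-only escape sequences.
# Python emits decimal character references (&#945;), which are equivalent
# to the hexadecimal form (&#x3B1;).
xml_escaped = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
java_escaped = text.encode("ascii", errors="backslashreplace").decode("ascii")

# Both escaped forms round-trip back to the original string.
# (Caveat: the backslash form is only safe if the input has no literal
# backslashes, because "backslashreplace" does not escape them.)
assert html.unescape(xml_escaped) == text
assert java_escaped.encode("ascii").decode("unicode_escape") == text

# Option 2: hex dump of the encoded bytes. Also round-trips.
utf16_hex = text.encode("utf-16-be").hex().upper()
utf8_hex = text.encode("utf-8").hex().upper()
assert bytes.fromhex(utf16_hex).decode("utf-16-be") == text
assert bytes.fromhex(utf8_hex).decode("utf-8") == text

# Option 3: convert straight to the legacy encoding. This is lossy:
# unmappable characters are replaced by "?" or silently dropped.
replaced = text.encode("cp1252", errors="replace").decode("cp1252")
dropped = text.encode("cp1252", errors="ignore").decode("cp1252")

for label, value in [
    ("XML escape", xml_escaped),
    ("Java-style escape", java_escaped),
    ("UTF-16 hex", utf16_hex),
    ("UTF-8 hex", utf8_hex),
    ("cp1252, replace", replaced),
    ("cp1252, ignore", dropped),
]:
    print(f"{label}: {value!r}")
```

Only options 1 and 2 preserve the original data; with option 3 the receiving program can no longer tell which characters were lost.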

Q: Where can I find some UTF-8 sample files for testing a to_utf16 converter?

There are interesting links to some at http://www.cl.cam.ac.uk/~mgk25/unicode.html#examples. [DB]

Q: Many of the East Asian character set mapping data files on the Unicode site are in an "OBSOLETE" directory. Why?

Mapping of legacy East Asian character sets is complicated, and there are often subtle differences between implementations on different platforms. The mapping files located in the Public data directory on the Unicode site have some historical interest and may inform implementations of mappings, but do not necessarily reflect current practice in all cases. Because of this they are placed in an "OBSOLETE" directory, and their documentation includes appropriate caveats about their use.

For a few more up-to-date mappings for East Asian legacy character sets to Unicode code points, see the vendor-supplied mapping tables located in: https://www.unicode.org/Public/MAPPINGS/VENDORS

More complete and detailed solutions for East Asian character set mappings can be found in currently maintained general-purpose implementations of character set conversion. For example, see the International Components for Unicode documentation about character set conversions: https://icu.unicode.org/charts/charset or the Free Software Foundation documentation about the GNU libiconv implementation: http://www.gnu.org/software/libiconv/
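As a small illustration of the kind of subtle difference mentioned above, the following Python sketch decodes the same two bytes with two of the standard library's Japanese codecs: "shift_jis", which follows the JIS X 0208 mapping, and "cp932", which follows Microsoft's code page 932. The choice of byte pair and the helper name codepoints are assumptions made for this example; it is one well-known divergence, not a survey of all of them.

```python
# The byte pair 0x81 0x60 is the "wave dash" position in Shift_JIS.
raw = b"\x81\x60"

as_jis = raw.decode("shift_jis")  # JIS X 0208-based mapping table
as_ms = raw.decode("cp932")       # Microsoft code page 932 mapping table


def codepoints(s):
    """Format a string as a sequence of U+XXXX code points (example helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


print("shift_jis:", codepoints(as_jis))
print("cp932:    ", codepoints(as_ms))
print("same result:", as_jis == as_ms)

# Consequence: text decoded with one table may fail to re-encode with the other.
try:
    as_ms.encode("shift_jis")
    reencodes = True
except UnicodeEncodeError:
    reencodes = False
print("cp932 result re-encodes as shift_jis:", reencodes)
```

The two codecs map the same bytes to different code points (U+301C WAVE DASH versus U+FF5E FULLWIDTH TILDE), which is why data converted on one platform can fail to convert back on another.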