Re: UTF-16 inside UTF-8

From: Markus Scherer
Date: Thu Nov 06 2003 - 15:00:00 EST


    I would like to comment on several statements that I have seen in this thread:

    - Migrating from UCS-2 to UTF-16:
       Doable, and has been done for many applications and libraries.
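
       As a rough illustration of what such a migration has to account for, here is a
       minimal Python sketch of how one supplementary code point becomes two 16-bit
       code units. The helper names to_surrogates and utf16_units are made up for this
       example; they are not taken from ICU or any other library.

```python
def to_surrogates(cp):
    """Split a supplementary code point into a (lead, trail) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                      # 20-bit offset above the BMP
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def utf16_units(text):
    """Count 16-bit code units, which is what UCS-2 era code treated as 'characters'."""
    return len(text.encode("utf-16-be")) // 2


if __name__ == "__main__":
    lead, trail = to_surrogates(0x20000)  # first CJK Extension B ideograph
    print(hex(lead), hex(trail))
    # One BMP letter plus one supplementary ideograph: 2 code points, 3 code units.
    print(len("a\U00020000"), utf16_units("a\U00020000"))
```

       Code that assumed one 16-bit unit per character mainly has to stop splitting
       strings between the two halves of such a pair.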

    - Difficult to handle UTF-16?
       Use ICU - it handles all of Unicode for collation,
       regular expressions, string casing, codepage conversion,
       and many other things.

    - Support for supplementary characters only for Chinese?
       Japan has defined JIS X 0213, which includes characters that map to
       + supplementary characters
       as well as
       + sequences of multiple BMP characters
       (ICU 2.8 will support codepage conversion involving
        multiple characters on either side)

       CJKV ideographs, used in several languages, are driving support
       for supplementary characters.
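
       Both kinds of mapping can be tried out with the euc_jisx0213 codec in the
       Python standard library. This is only a sketch, not ICU's converter; the
       roundtrip helper is a made-up name, and the two sample characters are my own
       choice of examples, assuming Python's mapping table covers them.

```python
def roundtrip(text, codec="euc_jisx0213"):
    """Encode text to the legacy codepage and decode it back."""
    encoded = text.encode(codec)
    return encoded, encoded.decode(codec)


# One JIS X 0213 character that corresponds to two BMP code points:
# HIRAGANA LETTER KA + COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK.
PAIR = "\u304b\u309a"

# One JIS X 0213 character that corresponds to a supplementary code point.
SUPPLEMENTARY = "\U0002000b"

if __name__ == "__main__":
    for sample in (PAIR, SUPPLEMENTARY):
        encoded, decoded = roundtrip(sample)
        print(len(sample), "code point(s) <->", len(encoded), "byte(s)",
              "round trip ok:", decoded == sample)
```

       A converter that works strictly one code unit at a time cannot express
       either case.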

    - Case mappings can be modified to return a 32-bit Unicode
       code point instead of a 16-bit BMP code unit?
       This works, but only for "simple" case mappings.
       Full Unicode case mappings are defined on strings, so
       single-character APIs cannot express them at all.
       Full string mappings map 1:n and are context- and language-sensitive.
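
       The difference can be seen with Python's built-in str.upper() and str.lower(),
       which apply the full, locale-independent mappings. This is a minimal sketch
       rather than ICU's API; the function names are made up for the example.

```python
def simple_supplementary_mapping():
    # A 1:1 mapping outside the BMP: DESERET CAPITAL LETTER LONG I (U+10400)
    # lowercases to U+10428. A 32-bit single-character API can return this.
    return "\U00010400".lower()


def one_to_many_mapping():
    # LATIN SMALL LETTER SHARP S uppercases to two letters, so the result
    # cannot be returned as a single code point of any width.
    return "\u00df".upper()


def context_sensitive_mapping():
    # GREEK CAPITAL LETTER SIGMA lowercases to final sigma (U+03C2) at the end
    # of a word and to U+03C3 elsewhere, so the mapping needs the whole string.
    final = "\u039f\u0394\u039f\u03a3".lower()   # sigma is word-final
    medial = "\u03a3\u0391".lower()              # sigma is word-initial
    return final[-1], medial[0]


if __name__ == "__main__":
    print(hex(ord(simple_supplementary_mapping())))
    print(one_to_many_mapping())
    print([hex(ord(c)) for c in context_sensitive_mapping()])
```

       Python's string methods take no locale argument, so language-sensitive cases
       such as the Turkish dotted and dotless i are not covered here; those need a
       locale-aware library such as ICU.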


    Opinions expressed here may not reflect my company's positions unless otherwise noted.
