Re: Need encoding conversion routines

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Mar 14 2003 - 12:29:31 EST

    Let's try this:

    ICU has C header files with macros for code point handling in UTF-8/16 strings. See the utf8.h and
    utf16.h headers (together with utf.h) in ICU's source tree at source/common/unicode/.

    http://oss.software.ibm.com/icu/download/
    http://oss.software.ibm.com/cvs/icu/icu/source/common/unicode/

    There is also a utf32.h header, but that is empty now. I redesigned the set of macros last year to
    simplify and improve them a bit.

    Specifically, see below.

    (Note that the UTF-8 macros [except for the "unsafe" ones] handle the complicated cases in functions
    that are called from inside the macros. See source/common/utf_impl.c . Safe UTF-8 handling requires
    a lot of error checks.)

    askq1 askq1 wrote:
    > I want c/c++ code that will give me UTF8 byte sequence representing a
    > given code-point, UTF16 16 bits sequence representing a given
    > code-point, UTF32 32 bits sequence representing a given code-point.
    >
    > e.g.
    >
    > UTF8_Sequence CodePointToUTF8(Unichar codePoint)

    Use U8_APPEND().
    http://oss.software.ibm.com/icu/apiref/utf8_8h.html#a12

    To read a code point from UTF-8, use U8_NEXT()
    http://oss.software.ibm.com/icu/apiref/utf8_8h.html#a10

    or U8_GET() etc.
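
    For readers who want to see what such an append has to do, here is a minimal hand-written
    sketch of the UTF-8 bit layout. It is not ICU code and not the U8_APPEND() implementation;
    the name encode_utf8 and its return convention are hypothetical, and rejecting surrogate
    code points is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU: writes the UTF-8 bytes for code
 * point c into out (room for 4 bytes) and returns how many were written.
 * Returns 0 for surrogates (0xd800..0xdfff) and values above 0x10ffff. */
static size_t encode_utf8(uint32_t c, uint8_t out[4])
{
    assert(out != NULL);
    if (c <= 0x7f) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (uint8_t)c;
        return 1;
    }
    if (c <= 0x7ff) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xc0 | (c >> 6));
        out[1] = (uint8_t)(0x80 | (c & 0x3f));
        return 2;
    }
    if (c <= 0xffff) {                     /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (c >= 0xd800 && c <= 0xdfff) {
            return 0;                      /* surrogates are not encodable */
        }
        out[0] = (uint8_t)(0xe0 | (c >> 12));
        out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
        out[2] = (uint8_t)(0x80 | (c & 0x3f));
        return 3;
    }
    if (c <= 0x10ffff) {                   /* 4 bytes: 11110xxx + three trail bytes */
        out[0] = (uint8_t)(0xf0 | (c >> 18));
        out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3f));
        out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
        out[3] = (uint8_t)(0x80 | (c & 0x3f));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

    For example, U+20AC comes out as the three bytes E2 82 AC. In real code the ICU macros
    are the better choice, since they also handle buffer capacity.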

    > UTF16_Sequence CodePointToUTF16(Unichar codePoint)

    U16_APPEND()
    http://oss.software.ibm.com/icu/apiref/utf16_8h.html#a16

    To read a code point from UTF-16, use U16_NEXT()
    http://oss.software.ibm.com/icu/apiref/utf16_8h.html#a16

    or U16_GET() etc.
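
    As an illustration of what a UTF-16 append involves, here is a minimal hand-written sketch
    of the surrogate-pair arithmetic. It is not ICU code and not the U16_APPEND()
    implementation; the name encode_utf16 and its return convention are hypothetical, and
    rejecting lone surrogate code points is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU: writes the UTF-16 code units for
 * code point c into out (room for 2 units) and returns how many were
 * written. Returns 0 for surrogates and values above 0x10ffff. */
static size_t encode_utf16(uint32_t c, uint16_t out[2])
{
    assert(out != NULL);
    if (c <= 0xffff) {
        if (c >= 0xd800 && c <= 0xdfff) {
            return 0;                      /* a surrogate is not a character */
        }
        out[0] = (uint16_t)c;              /* BMP: one unit, same value */
        return 1;
    }
    if (c <= 0x10ffff) {
        c -= 0x10000;                      /* 20 bits remain */
        out[0] = (uint16_t)(0xd800 | (c >> 10));    /* lead: top 10 bits */
        out[1] = (uint16_t)(0xdc00 | (c & 0x3ff));  /* trail: low 10 bits */
        return 2;
    }
    return 0;                              /* beyond the Unicode range */
}
```

    For example, U+1F600 comes out as the pair D83D DE00.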

    > UCS2_Sequence CodePointToUCS2(Unichar codePoint)

    For UCS-2, the best strategy (in my opinion) is to treat it exactly the same as UTF-16. Most people
    mean UTF-16 when they talk about UCS-2 or generally about "16-bit Unicode".

    If you do want to distinguish them anyway, then this is trivial:
    if(0<=codePoint && codePoint<=0xffff) {
         cast codePoint to 16-bit type and emit;
    } else {
         error;
    }

    Similarly, UTF-32 is trivial as well - it just stores each code point value in a 32-bit integer
    unit. Unicode code points are values 0..0x10ffff.
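
    A minimal sketch of that UTF-32 case, with the range check spelled out. The name
    encode_utf32 is hypothetical (it is not an ICU function), and rejecting surrogate code
    points in addition to the 0..0x10ffff range is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU: stores code point c as a single
 * 32-bit unit. Returns 1 on success, 0 if c is outside 0..0x10ffff or is
 * a surrogate (the surrogate check is this sketch's own assumption). */
static int encode_utf32(uint32_t c, uint32_t *out)
{
    assert(out != NULL);
    if (c > 0x10ffff) {
        return 0;                          /* beyond the Unicode range */
    }
    if (c >= 0xd800 && c <= 0xdfff) {
        return 0;                          /* surrogate code point */
    }
    *out = c;                              /* the unit is the code point itself */
    return 1;
}
```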

    See also http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/samples/ustring/ustring.cpp

    I hope this helps - best regards,
    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

