Re: Need encoding conversion routines

From: Markus Scherer ([email protected])
Date: Fri Mar 14 2003 - 12:29:31 EST

Next message: Marco Cimarosti: "RE: Need encoding conversion routines"

Previous message: Edward H Trager: "Re: Need encoding conversion routines"
In reply to: askq1 askq1: "Re: Need encoding conversion routines"
Next in thread: Marco Cimarosti: "RE: Need encoding conversion routines"

Let's try this:

ICU has C header files with macros for code point handling in UTF-8/16 strings. See the utf8.h and
utf16.h headers (together with utf.h) in ICU's source tree at source/common/unicode/.

http://oss.software.ibm.com/icu/download/
http://oss.software.ibm.com/cvs/icu/icu/source/common/unicode/

There is also a utf32.h header, but that is empty now. I redesigned the set of macros last year to
simplify and improve them a bit.

Specifically, see below.

(Note that the UTF-8 macros, except for the "unsafe" ones, handle the complicated cases in functions
that are called from inside the macros. See source/common/utf_impl.c. Safe UTF-8 handling requires
a lot of error checks.)

askq1 askq1 wrote:
> I want c/c++ code that will give me UTF8 byte sequence representing a
> given code-point, UTF16 16 bits sequence representing a given
> code-point, UTF32 32 bits sequence representing a given code-point.
>
> e.g.
>
> UTF8_Sequence CodePointToUTF8(Unichar codePoint)

Use U8_APPEND().
http://oss.software.ibm.com/icu/apiref/utf8_8h.html#a12
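
If you just want to see what an append operation has to do without pulling in ICU, here is a minimal standalone sketch. It is not ICU's implementation; the function name encode_utf8 and its return convention are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU: writes the UTF-8 form of cp into
 * buf (which must have room for 4 bytes) and returns the number of bytes
 * written, or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t encode_utf8(uint32_t cp, uint8_t *buf)
{
    assert(buf != NULL);
    if (cp <= 0x7f) {
        buf[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7ff) {
        buf[0] = (uint8_t)(0xc0 | (cp >> 6));
        buf[1] = (uint8_t)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp <= 0xffff) {
        if (cp >= 0xd800 && cp <= 0xdfff) {
            return 0; /* surrogate code points are not encoded in UTF-8 */
        }
        buf[0] = (uint8_t)(0xe0 | (cp >> 12));
        buf[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3f));
        buf[2] = (uint8_t)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {
        buf[0] = (uint8_t)(0xf0 | (cp >> 18));
        buf[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3f));
        buf[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3f));
        buf[3] = (uint8_t)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0; /* above U+10FFFF */
}
```

For example, U+20AC comes out as the three bytes E2 82 AC.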

To read a code point from UTF-8, use U8_NEXT()
http://oss.software.ibm.com/icu/apiref/utf8_8h.html#a10

or U8_GET() etc.
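
To give an idea of the error checks that safe UTF-8 reading involves, here is a rough standalone sketch of a decoder. Again, this is not ICU's code; decode_utf8 and its "return -1 on error" convention are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU: decodes one code point starting at
 * s[*i] and advances *i past it. Returns the code point, or -1 for an
 * ill-formed sequence (in which case *i has advanced by one byte only). */
int32_t decode_utf8(const uint8_t *s, size_t len, size_t *i)
{
    assert(s != NULL && i != NULL && *i < len);

    uint8_t lead = s[(*i)++];
    int32_t cp, min;
    size_t trail, j, k;

    if (lead < 0x80) {
        return lead;                       /* ASCII */
    } else if ((lead & 0xe0) == 0xc0) {
        cp = lead & 0x1f; trail = 1; min = 0x80;
    } else if ((lead & 0xf0) == 0xe0) {
        cp = lead & 0x0f; trail = 2; min = 0x800;
    } else if ((lead & 0xf8) == 0xf0) {
        cp = lead & 0x07; trail = 3; min = 0x10000;
    } else {
        return -1;                         /* stray trail byte or invalid lead */
    }

    j = *i;
    for (k = 0; k < trail; k++, j++) {
        if (j >= len || (s[j] & 0xc0) != 0x80) {
            return -1;                     /* truncated or missing trail byte */
        }
        cp = (cp << 6) | (s[j] & 0x3f);
    }

    /* Reject overlong forms, values above U+10FFFF, and surrogates. */
    if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
        return -1;
    }
    *i = j;
    return cp;
}
```

Even this short version needs separate checks for truncation, bad trail bytes, overlong forms, surrogates, and out-of-range values.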

> UTF16_Sequence CodePointToUTF16(Unichar codePoint)

U16_APPEND()
http://oss.software.ibm.com/icu/apiref/utf16_8h.html#a16

To read a code point from UTF-16, use U16_NEXT()
http://oss.software.ibm.com/icu/apiref/utf16_8h.html#a16

or U16_GET() etc.
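
For illustration only, the UTF-16 case can be sketched without ICU as below. The names encode_utf16 and decode_utf16 are hypothetical, and returning -1 for an unpaired surrogate is a choice made for this sketch, not a statement about how the ICU macros behave.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes cp as 1 or 2 UTF-16 code units into buf
 * (room for 2 units required). Returns the unit count, or 0 on error. */
size_t encode_utf16(uint32_t cp, uint16_t *buf)
{
    assert(buf != NULL);
    if (cp <= 0xffff) {
        if (cp >= 0xd800 && cp <= 0xdfff) {
            return 0;                      /* lone surrogate code point */
        }
        buf[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10ffff) {
        cp -= 0x10000;                     /* 20 bits remain */
        buf[0] = (uint16_t)(0xd800 | (cp >> 10));    /* lead surrogate */
        buf[1] = (uint16_t)(0xdc00 | (cp & 0x3ff));  /* trail surrogate */
        return 2;
    }
    return 0;                              /* above U+10FFFF */
}

/* Hypothetical helper: reads one code point at s[*i] and advances *i.
 * Returns -1 for an unpaired surrogate (an assumption of this sketch). */
int32_t decode_utf16(const uint16_t *s, size_t len, size_t *i)
{
    assert(s != NULL && i != NULL && *i < len);
    uint16_t unit = s[(*i)++];
    if (unit < 0xd800 || unit > 0xdfff) {
        return unit;                       /* BMP code point, single unit */
    }
    if (unit <= 0xdbff && *i < len && s[*i] >= 0xdc00 && s[*i] <= 0xdfff) {
        uint16_t trail = s[(*i)++];
        return 0x10000 + (((int32_t)(unit - 0xd800) << 10) | (trail - 0xdc00));
    }
    return -1;                             /* unpaired surrogate */
}
```

For example, U+1D11E becomes the pair D834 DD1E, and decoding that pair gives back 0x1D11E.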

> UCS2_Sequence CodePointToUCS2(Unichar codePoint)

For UCS-2, the best strategy (in my opinion) is to treat it exactly the same as UTF-16. Most people
mean UTF-16 when they talk about UCS-2 or generally about "16-bit Unicode".

If you do want to distinguish them anyway, then this is trivial:
    if(0<=codePoint && codePoint<=0xffff) {
        cast codePoint to 16-bit type and emit;
    } else {
        error;
    }

Similarly, UTF-32 is trivial as well - it just stores each code point value in a 32-bit integer
unit. Unicode code points are values 0..0x10ffff.
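
As a compilable sketch of these two trivial cases (the function names are hypothetical, and the code point is assumed to arrive as an unsigned 32-bit value, so the lower-bound check disappears):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: stores cp as a single UCS-2 unit.
 * Returns 1 on success, 0 if cp does not fit in 16 bits. */
int encode_ucs2(uint32_t cp, uint16_t *out)
{
    assert(out != NULL);
    if (cp > 0xffff) {
        return 0;          /* error: not representable in one 16-bit unit */
    }
    *out = (uint16_t)cp;
    return 1;
}

/* Hypothetical helper: stores cp as a single UTF-32 unit.
 * Returns 1 on success, 0 if cp is above U+10FFFF.
 * A stricter version might also reject the surrogates D800..DFFF. */
int encode_utf32(uint32_t cp, uint32_t *out)
{
    assert(out != NULL);
    if (cp > 0x10ffff) {
        return 0;          /* error: outside the Unicode code point range */
    }
    *out = cp;
    return 1;
}
```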

I hope this helps - best regards,
markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.

This archive was generated by hypermail 2.1.5 : Fri Mar 14 2003 - 13:06:21 EST