Re: CodePage Information

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 22 2003 - 07:15:28 EDT


    From: "Abdij Bhat" <Abdij.Bhat@kshema.com>
    Cc: <unicode@unicode.org>
    Sent: Thursday, May 22, 2003 12:39 PM
    Subject: RE: CodePage Information

    > Hi Philippe,
    > One more thing..
    > Can you point me to some UTF-8 to UTF-16 converters (decoders/encoders) and
    > vice-versa.
    >

    Look in the Unicode reference for the description of the very simple algorithm, which can be fully described in a single page of specification (if you exclude the samples).
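    For illustration, here is a minimal sketch in C of the encoding side (utf8_encode is just a name chosen for this example; it writes the UTF-8 byte sequence for one code point):

        #include <stdint.h>
        #include <stddef.h>

        /* Encode one Unicode code point as UTF-8; returns the number of
           bytes written to out (1..4), or 0 for an invalid code point. */
        size_t utf8_encode(uint32_t cp, unsigned char out[4])
        {
            if (cp <= 0x7F) {                    /* 1 byte: 0xxxxxxx */
                out[0] = (unsigned char)cp;
                return 1;
            } else if (cp <= 0x7FF) {            /* 2 bytes: 110xxxxx 10xxxxxx */
                out[0] = (unsigned char)(0xC0 | (cp >> 6));
                out[1] = (unsigned char)(0x80 | (cp & 0x3F));
                return 2;
            } else if (cp <= 0xFFFF) {           /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
                if (cp >= 0xD800 && cp <= 0xDFFF)
                    return 0;                    /* surrogates are not valid scalar values */
                out[0] = (unsigned char)(0xE0 | (cp >> 12));
                out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[2] = (unsigned char)(0x80 | (cp & 0x3F));
                return 3;
            } else if (cp <= 0x10FFFF) {         /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
                out[0] = (unsigned char)(0xF0 | (cp >> 18));
                out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
                out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[3] = (unsigned char)(0x80 | (cp & 0x3F));
                return 4;
            }
            return 0;                            /* beyond the Unicode range */
        }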

    You'll certainly find a lot of converters in various open-source software.

    IBM's open-source ICU implements it, but ICU is a rather large library which (in its compiled form) can expand to about 2 MB of code and data, and it covers many internationalization features: many encodings, tailored collation orders for many languages, resource managers, date and number formatters/parsers, and more.
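    For example, ICU's C API provides u_strFromUTF8 and u_strToUTF8 in ustring.h; a minimal sketch of the UTF-8 to UTF-16 direction (error handling abbreviated):

        #include <unicode/ustring.h>
        #include <stdio.h>

        int main(void)
        {
            UChar buf[64];                       /* UTF-16 output buffer */
            int32_t len = 0;
            UErrorCode status = U_ZERO_ERROR;

            /* "caf\xC3\xA9" is the UTF-8 encoding of "café";
               -1 means the source is NUL-terminated */
            u_strFromUTF8(buf, 64, &len, "caf\xC3\xA9", -1, &status);
            if (U_FAILURE(status))
                return 1;
            printf("decoded %d UTF-16 code units\n", (int)len);
            return 0;
        }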

    Also, if your software is written using a 16-bit wchar_t type (or with the UNICODE and _UNICODE defines in C/C++ with the Windows API), your strings are internally stored as UTF-16 code units, not as code points. Be careful: you'll find Chinese text that will probably now use Han ideographs from outside the BMP, and these are stored as surrogate pairs.
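    For reference, a high surrogate (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF) combine into a supplementary code point like this (a minimal sketch; combine_surrogates is just a name chosen for this example):

        #include <stdint.h>

        /* Combine a UTF-16 surrogate pair into the supplementary code
           point it represents; the caller must have checked the ranges. */
        uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
        {
            return 0x10000u
                 + (((uint32_t)(hi - 0xD800u) << 10)
                 |   (uint32_t)(lo - 0xDC00u));
        }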

    There are two ways to encode these surrogate pairs to bytes.

    1) The standard way is to use UTF-8, but the converter must internally detect these pairs and combine them into a single code point before generating the UTF-8 byte sequence.
    2) With the CESU-8 encoding (non-standard, but described in Unicode Technical Standard #26), you don't have to bother detecting these surrogate pairs: you can encode each UTF-16 code unit (including surrogates) separately, using the same algorithm as UTF-8 (but now restricted to encoding only 16-bit values, not the full range of Unicode code points).

    With the standard UTF-8 encoding, a UTF-16 string would encode a supplementary character as 4 bytes; with CESU-8, it would use two successive 3-byte sequences (one per surrogate, 6 bytes in total), as the sketch below illustrates.
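    A sketch of the CESU-8 side (cesu8_encode_unit is just a name for this example): each 16-bit code unit, surrogates included, is encoded with the plain UTF-8 rules for values up to 0xFFFF, so a surrogate pair yields 3 + 3 = 6 bytes:

        #include <stdint.h>
        #include <stddef.h>

        /* Encode one UTF-16 code unit as CESU-8; returns the number of
           bytes written to out (1..3). Surrogates are NOT combined. */
        size_t cesu8_encode_unit(uint16_t u, unsigned char out[3])
        {
            if (u <= 0x7F) {
                out[0] = (unsigned char)u;
                return 1;
            } else if (u <= 0x7FF) {
                out[0] = (unsigned char)(0xC0 | (u >> 6));
                out[1] = (unsigned char)(0x80 | (u & 0x3F));
                return 2;
            } else {                 /* 0x0800..0xFFFF, surrogates included */
                out[0] = (unsigned char)(0xE0 | (u >> 12));
                out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
                out[2] = (unsigned char)(0x80 | (u & 0x3F));
                return 3;
            }
        }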

    If you want to optimize storage space in your hardware device, prefer the strict UTF-8 method to the simpler CESU-8.
    Note that you can decode both CESU-8 and UTF-8 with the same decoder, if you want to preserve compatibility, by relaxing the strict decoding rules for UTF-8: accept the encoded surrogate code points that were tolerated from Unicode 1.1 to 3.0 and are now excluded by the current, stricter standard that distinguishes UTF-8 from CESU-8, and recombine each decoded surrogate pair into one code point.
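    A sketch of that recombination step, assuming a hypothetical decode_one() helper that decodes the next UTF-8 sequence (accepting 3-byte-encoded surrogates) and advances the input pointer:

        #include <stdint.h>

        uint32_t decode_one(const unsigned char **p);   /* hypothetical helper */

        /* Decode the next code point from a relaxed UTF-8/CESU-8 stream,
           recombining any surrogate pair left by CESU-8 input. */
        uint32_t decode_relaxed(const unsigned char **p)
        {
            uint32_t cp = decode_one(p);
            if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate: CESU-8 pair */
                uint32_t lo = decode_one(p);
                if (lo >= 0xDC00 && lo <= 0xDFFF)
                    return 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                /* else: ill-formed input; report an error in real code */
            }
            return cp;
        }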

    For higher compression of Unicode strings (but one that does not preserve the ASCII encoding), look at the SCSU and BOCU specifications in the Unicode Technical Reports. They are more complex and, for now, implemented in few programs.

    By contrast, UTF-8 (either the strict version, or the relaxed version that combines UTF-8 and CESU-8 into a common encoding, corresponding to older versions of Unicode where this distinction was not specified) is now very widely implemented in all OSes and many libraries, and is supported by most major applications.

    A search on Google and the Unicode web site will give you a lot of existing implementations (most often hidden in larger source trees), as this conversion code is extremely short and simple to write.


