Re: Unicode support

From: Asmus Freytag (
Date: Wed Jul 27 2005 - 12:05:58 CDT

    At 10:55 PM 7/26/2005, Tunga, Prasad wrote:
    >I have an application (written in 'C') which currently reads and
    >manipulates ASCII strings. However, I would like to convert it so that
    >it can read Unicode strings.
    >What are the basic things I should be looking at to make it compatible
    >with Unicode..?

    There is no simple answer to that. The optimal solution depends on the
    nature of your application, the nature of the data it reads, and the nature
    of the platform(s) it is supposed to be used on.

    If the data is all or predominantly in UTF-8 (for example HTML) then it may
    make sense to simply use char * and work in UTF-8. I wrote "may" because
    for that to be a reasonable strategy, two other conditions need to hold:
    The data volume must be so great and the type of 'manipulation' so limited
    that a) converting the data to any other encoding form would be
    cost-prohibitive and b) the penalty for processing multi-byte sequences is
    low. An example would be a web-log analyzer. Extracting information from a
    raw UTF-8 encoded log qualifies as 'limited manipulation', but the data
    rates are usually high, so that any time spent converting data formats is
    wasted. The worst thing in such a scenario would be going to 4-byte UTF-32,
    as that will surely blow your cache.
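A sketch of that log-analyzer case (the function name is invented for illustration): because UTF-8 never uses bytes below 0x80 inside a multi-byte sequence, a scanner can search raw char* data for ASCII delimiters without decoding anything:

```c
#include <stddef.h>

/* Hypothetical sketch: count the records in a raw UTF-8 log buffer.
 * UTF-8 guarantees that bytes below 0x80 never occur inside a
 * multi-byte sequence, so scanning for the ASCII '\n' delimiter
 * with plain char * is safe and needs no conversion. */
static size_t count_log_lines(const char *buf, size_t len)
{
    size_t lines = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            lines++;
    return lines;
}
```

The multi-byte sequences pass through untouched; only the ASCII syntax bytes are ever inspected.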

    If the platforms (or i18n library) you are using, or plan to use, are
    UTF-16 based, and communication with the platform is your primary form of
    data exchange, then working with UTF-16 is likely your best bet. Converting
    data when the interface has many entry points is challenging, while working
    around the occasional character that takes two 16-bit code units in UTF-16
    is not particularly difficult (and does not have to be expensive).
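To illustrate how cheap that workaround is (the typedef and function name here are assumptions, not a platform API): a lead surrogate falls in 0xD800..0xDBFF, and together with the trail unit that follows it encodes one supplementary code point:

```c
#include <stdint.h>

typedef uint16_t utf16_unit;   /* assumed code-unit typedef */

/* Decode one code point from UTF-16: returns the number of 16-bit
 * units consumed (1 for a BMP character, 2 for a surrogate pair). */
static int utf16_decode(const utf16_unit *s, uint32_t *cp)
{
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF) {
        /* lead surrogate: combine with the trail unit that follows */
        *cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10)
                         | (s[1] - 0xDC00));
        return 2;
    }
    *cp = s[0];                /* single-unit (BMP) character */
    return 1;
}
```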

    If your platform or i18n library supports 4-byte (i.e. UTF-32)
    characters, then, for similar reasons, you might want to use them - but
    they are a poor choice for high data rates, as (on average) only half as
    many characters will fit in your cache.

    If you are developing cross-platform or cross-compiler code, you need to pay
    attention to how you define a data type that can contain your preferred
    code unit in a portable way. For UTF-8 that is trivial, as support for an
    8-bit data type is universal. For UTF-16 or UTF-32, currently, the best
    practice is to use your own typedef, and map that to a compiler-specific
    choice of actual integer data type in a header file.
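A minimal version of such a header might look like this (the type names are invented for illustration, and the compiler-detection macros shown are common but by no means exhaustive):

```c
/* utf_types.h -- hypothetical portable code-unit typedefs */
#ifndef UTF_TYPES_H
#define UTF_TYPES_H

#if defined(_MSC_VER)
/* Microsoft compilers: short is 16 bits, int is 32 bits */
typedef unsigned short utf16_unit;
typedef unsigned int   utf32_unit;
#else
/* elsewhere, prefer the C99 exact-width types */
#include <stdint.h>
typedef uint16_t       utf16_unit;
typedef uint32_t       utf32_unit;
#endif

#endif /* UTF_TYPES_H */
```

All string-handling code then uses utf16_unit* or utf32_unit*, and only this one header changes per compiler.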

    However, the C language standard is adding support for integer types of
    guaranteed width (both 16 and 32 bits), and even a way to declare that a
    particular data type contains characters of the corresponding Unicode
    encoding form. Where vendors support this scheme, you could use it.
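A sketch, assuming a compiler that already implements that scheme (it was specified in ISO/IEC TR 19769, which adds the header <uchar.h>): the standard names then replace any private typedefs.

```c
/* Sketch, assuming a compiler implementing ISO/IEC TR 19769:
 * char16_t and char32_t hold UTF-16 and UTF-32 code units, and
 * the u"" / U"" prefixes produce literals in those encoding forms. */
#include <uchar.h>

static const char16_t hello16[] = u"hi";  /* UTF-16 string literal */
static const char32_t hello32[] = U"hi";  /* UTF-32 string literal */
```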

    Non-standard implementations that support UTF-16 as the wchar_t encoding
    are widely used, because they make life easier for people working on
    platforms where UTF-16 is natively supported.

    If you port to UTF-8, all code that does not try to interpret byte values >
    0x7F will work, but watch for anything that truncates strings or buffers at
    places other than '\n' or '\0' or at space or syntax characters from the
    ASCII range. Also watch for jumps into the middle of a string. However,
    code like:

    while (*s)
             *d++ = *s++;

    is fine and does not need to be aware of the multibyte nature of UTF-8.
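When you do have to truncate, the fix is equally small. Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx, so a truncation point can be moved back to a sequence boundary by skipping over them (a sketch; the function name is invented):

```c
#include <stddef.h>

/* Largest length <= max at which a UTF-8 string can be cut without
 * splitting a multi-byte sequence: back up past any continuation
 * bytes (bit pattern 10xxxxxx) to reach a sequence boundary. */
static size_t utf8_safe_truncate(const char *s, size_t max)
{
    size_t i = max;
    while (i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80)
        i--;
    return i;
}
```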

    If you port to UTF-32 or UTF-16 you need to make sure that you use the
    correct data type (see above). If you have used char* extensively for both
    strings and raw data buffers, you'll have your work cut out for you
    deciding which pointer needs the new data type. However, compilers can be
    of some help here. As you convert some of the interfaces, type mismatches
    should be flagged. If you can, try compiling with a C++ compiler (even
    though you are writing C code, your type checking will improve).

    Again, if all the characters that your application deals with explicitly
    are from the BMP, then a UTF-16 port needs to be aware of the single/double
    code unit nature of UTF-16 only insofar as to avoid buffer truncation and
    jumping into the middle of strings.
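The UTF-16 analogue of safe truncation is a one-line check (a sketch; the typedef and function name are assumptions): if the unit just past the cut is a trail surrogate, the cut would split a pair, so back up one unit.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint16_t utf16_unit;   /* assumed code-unit typedef */

/* Largest length <= max at which a UTF-16 buffer can be cut without
 * splitting a surrogate pair: a trail surrogate (0xDC00..0xDFFF)
 * just past the cut means a pair straddles the boundary. */
static size_t utf16_safe_truncate(const utf16_unit *s, size_t max)
{
    if (max > 0 && s[max] >= 0xDC00 && s[max] <= 0xDFFF)
        return max - 1;
    return max;
}
```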


    This archive was generated by hypermail 2.1.5 : Wed Jul 27 2005 - 12:08:50 CDT