Re: Unicode support

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed Jul 27 2005 - 12:05:58 CDT

Next message: Sinnathurai Srivas: "Re: Letters for Indic transliteration"

Previous message: Neil Harris: "Re: Unicode support"
Maybe in reply to: Tunga, Prasad: "Unicode support"

At 10:55 PM 7/26/2005, Tunga, Prasad wrote:
>I have an application (written in 'C') which currently reads and
>manipulates ASCII strings. However, I would like to convert it so that
>it can read Unicode strings.
>What are the basic things I should be looking at to make it compatible
>with Unicode..?

There is no simple answer to that. The optimal solution depends on the
nature of your application, the nature of the data it reads, and the nature
of the platform(s) it is supposed to be used on.

If the data is all or predominantly in UTF-8 (for example HTML) then it may
make sense to simply use char * and work in UTF-8. I wrote "may" because
for that to be a reasonable strategy, two other conditions need to hold:
The data volume must be so great and the type of 'manipulation' so limited
that a) converting the data to any other encoding form would be
cost-prohibitive and b) the penalty for processing multi-byte sequences is
low. An example would be a web-log analyzer. Extracting information from a
raw UTF-8 encoded log qualifies as 'limited manipulation', but the data
rates are usually high, so that any time spent converting data formats is
wasted. The worst thing in such a scenario would be going to 4-byte UTF-32,
as that will surely blow your cache.
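To illustrate the 'limited manipulation' case, here is a minimal sketch of working on raw UTF-8 without any conversion. The function names (utf8_count_code_points, log_field) and the space-delimited log layout are assumptions made up for this example. It relies on two properties of well-formed UTF-8: ASCII bytes never occur inside a multi-byte sequence, and continuation bytes always have the form 10xxxxxx.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string without decoding it.
   Every byte that is not a continuation byte (10xxxxxx) starts a new
   code point. Assumes the input is well-formed UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Find the index-th (0-based) field of a line whose fields are separated
   by an ASCII delimiter. Returns a pointer to the field and stores its
   length in bytes in *len, or returns NULL if there is no such field.
   A plain byte scan is safe because an ASCII byte can never be part of
   a multi-byte UTF-8 sequence. */
const char *log_field(const char *line, char delim, size_t index, size_t *len)
{
    const char *start = line;
    const char *p = line;
    for (;;) {
        if (*p == delim || *p == '\0') {
            if (index == 0) {
                *len = (size_t)(p - start);
                return start;
            }
            if (*p == '\0')
                return NULL;
            --index;
            start = p + 1;
        }
        ++p;
    }
}
```

Neither routine allocates or converts anything, which is the point when the data rate is high.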

If the platforms (or i18n library) you are using, or plan to use, are
UTF-16 based, and communication with the platform is your primary form of
data exchange, then working with UTF-16 is likely your best bet. Converting
data when the interface has many entry points is challenging, while working
around the occasional character that takes two 16-bit code units in UTF-16
is not particularly difficult (and does not have to be expensive).
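As a rough sketch of what "working around" a two-unit character involves, the following reads one code point from a UTF-16 buffer. The type name utf16_unit and the function utf16_next are hypothetical, and returning an unpaired surrogate unchanged is just one possible policy.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t utf16_unit;   /* hypothetical name for a UTF-16 code unit */

/* Decode the code point that starts at s[*i] and advance *i past it.
   The caller must ensure *i < len. A high surrogate followed by a low
   surrogate is combined into one supplementary code point; an unpaired
   surrogate is returned as-is (an assumption of this sketch). */
uint32_t utf16_next(const utf16_unit *s, size_t len, size_t *i)
{
    uint32_t c = s[(*i)++];
    if (c >= 0xD800 && c <= 0xDBFF && *i < len) {
        uint32_t lo = s[*i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++*i;
            c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return c;
}
```

The common BMP case costs one range check per code unit, so the extra handling does not have to be expensive.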

If your platform or i18n library supports 4-byte (i.e. UTF-32) characters,
then, for similar reasons, you might want to use them - but it's a poor
choice for high data rates, as (on average) only half as many characters
will fit in your cache as with UTF-16.

If you are developing for multiple platforms or compilers, you need to pay
attention to how you define a data type that can contain your preferred
code unit in a portable way. For UTF-8 that is trivial, as support for an
8-bit data type is universal. For UTF-16 or UTF-32, currently, the best
practice is to use your own typedef, and map that to a compiler-specific
choice of actual integer data type in a header file.
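A header along those lines might look like the sketch below. The type names utf16_t and utf32_t are hypothetical, and the selection logic only covers the common cases; a real project would extend it for the compilers it actually targets.

```c
#include <limits.h>

/* Sketch of a project header that maps code unit types to whatever
   integer type has the right width on the current compiler. */
#if USHRT_MAX == 0xFFFF
typedef unsigned short utf16_t;
#elif UINT_MAX == 0xFFFF
typedef unsigned int utf16_t;
#else
#error "no 16-bit unsigned integer type found for utf16_t"
#endif

#if UINT_MAX == 0xFFFFFFFFUL
typedef unsigned int utf32_t;
#elif ULONG_MAX == 0xFFFFFFFFUL
typedef unsigned long utf32_t;
#else
#error "no 32-bit unsigned integer type found for utf32_t"
#endif

/* Compile-time checks: the array size is negative, and compilation
   fails, if a typedef does not have the expected width. */
typedef char utf16_t_width_check[sizeof(utf16_t) * CHAR_BIT == 16 ? 1 : -1];
typedef char utf32_t_width_check[sizeof(utf32_t) * CHAR_BIT == 32 ? 1 : -1];
```

All other code then uses only utf16_t or utf32_t, so a port to a new compiler touches this one header.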

However, the C language standard is adding support for data types of
guaranteed length (both 16 and 32 bit), and even a way to declare that a
particular data type contains characters of the corresponding Unicode
encoding form. Where vendors support this scheme, you could use it.
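For readers with a compiler that provides them, this facility took the form of the char16_t and char32_t types in <uchar.h> and the u"..." and U"..." literal prefixes, standardized in C11. The sketch below assumes such a compiler; the function u16_length and the variable names are made up for illustration. Note that the standard only promises UTF-16 and UTF-32 contents when the macros __STDC_UTF_16__ and __STDC_UTF_32__ are defined.

```c
#include <stddef.h>
#include <uchar.h>   /* char16_t, char32_t (C11) */

/* Length of a zero-terminated char16_t string in code units,
   not in characters. */
size_t u16_length(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

/* String literals typed as 16-bit and 32-bit code units. */
const char16_t greeting16[] = u"caf\u00E9";
const char32_t greeting32[] = U"caf\u00E9";
```

Where these types are not available, the typedef approach remains the fallback.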

Non-standard implementations that use UTF-16 for wchar_t are widely used,
because they make life easier for people working on platforms where UTF-16
is natively supported.

If you port to UTF-8, all code that does not try to interpret byte values >
0x7F will work, but watch for anything that truncates strings or buffers at
places other than '\n' or '\0' or at space or syntax characters from the
ASCII range. Also watch for jumps into the middle of a string. However,
code like:

    while (*s)
    {
        *d++ = *s++;
    }

is fine and does not need to be aware of the multibyte nature of UTF-8.
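Truncation is the case that does need care. A minimal sketch of cutting a UTF-8 buffer at a byte limit without splitting a multi-byte sequence follows; the function name utf8_truncate_len is hypothetical and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>

/* Return the largest length <= max_bytes at which the UTF-8 string
   s[0..len) can be cut without splitting a multi-byte sequence.
   If the first dropped byte would be a continuation byte (10xxxxxx),
   back up until the cut falls in front of a lead byte. */
size_t utf8_truncate_len(const char *s, size_t len, size_t max_bytes)
{
    size_t cut;
    if (len <= max_bytes)
        return len;
    cut = max_bytes;
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        --cut;
    return cut;
}
```

The same test (is this byte a continuation byte?) is what you need after any jump into the middle of a string, to resynchronize on a character boundary.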

If you port to UTF-32 or UTF-16 you need to make sure that you use the
correct data type (see above). If you have used char* extensively for both
strings and raw data buffers, you'll have your work cut out for you
deciding which pointer needs the new data type. However, compilers can be
of some help here. As you convert some of the interfaces, type mismatches
should be flagged. If you can, try compiling with a C++ compiler (even though
you are writing C code, your type checking will improve).

Again, if all the characters that your application deals with explicitly
are from the BMP, then a UTF-16 port needs to be aware of the single/double
code unit nature of UTF-16 only insofar as to avoid buffer truncation and
jumping into the middle of strings.
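Avoiding buffer truncation in the middle of a surrogate pair can be sketched as follows. The function name utf16_truncate_len is hypothetical, and uint16_t stands in for whatever 16-bit code unit type you have chosen.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the largest length <= max_units at which the UTF-16 string
   s[0..len) can be cut without separating a high surrogate from the
   low surrogate that follows it. */
size_t utf16_truncate_len(const uint16_t *s, size_t len, size_t max_units)
{
    if (len <= max_units)
        return len;
    if (max_units > 0 &&
        s[max_units - 1] >= 0xD800 && s[max_units - 1] <= 0xDBFF &&
        s[max_units] >= 0xDC00 && s[max_units] <= 0xDFFF)
        return max_units - 1;   /* cut in front of the pair instead */
    return max_units;
}
```

For text that is entirely in the BMP the surrogate test never fires, so the check costs almost nothing.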

A./

This archive was generated by hypermail 2.1.5 : Wed Jul 27 2005 - 12:08:50 CDT