Unicode Frequently Asked Questions

Programming Issues

Q: How do I convert an existing application written in 'C' so that it can handle Unicode strings?

There is no simple answer to that. The optimal solution depends on the nature of your application, the nature of the data it reads, and the nature of the APIs you are going to use. Assuming that your application currently reads and manipulates ASCII strings, the first thing to look at is the encoding form of Unicode you are going to use. [MS]

Q: When would using UTF-8 be the right approach?

If the Unicode data your program will be handling is all or predominantly in UTF-8 (for example, HTML) then it may make sense to simply continue using char datatypes and char* pointers and to work directly in UTF-8.

However, UTF-8 is inherently somewhat more difficult to process, so two other conditions need to hold for this choice to make sense: the program should perform only limited manipulation of the text, and the data volume or throughput should be high enough that converting to another encoding form (and back) would add noticeable overhead.

A good example where this choice might be appropriate would be a web-log analyzer. The data would be mostly UTF-8 to begin with, rather than UTF-16 or UTF-32. Extracting information from the raw UTF-8-encoded log qualifies as 'limited manipulation', and the data rates are usually high, so any time spent converting data formats would be wasted.

The worst choice in such a scenario would be converting to 4-byte UTF-32, as that would greatly inflate your transitory memory requirements, causing you to exceed cache limits more often and slowing down your processing. [MS]

Q: When would using UTF-16 be the right approach?

If the APIs you are using, or plan to use, are UTF-16 based, which is the typical case, then working with UTF-16 directly is likely your best bet. Converting data for each individual call to an API is difficult and inefficient, while working around the occasional character that takes two 16-bit code units in UTF-16 is not particularly difficult (and does not have to be expensive). [MS]
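
For illustration, here is a minimal sketch of how such a two-unit character (a surrogate pair) can be decoded into a code point when one is encountered; the function name and signature are illustrative, and well-formed UTF-16 input is assumed:

#include <stddef.h>
#include <stdint.h>

/* Return the code point starting at s and report how many 16-bit code
   units it occupied (1 or 2). Assumes well-formed UTF-16 input. */
uint32_t decode_utf16(const uint16_t *s, size_t *consumed)
{
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF) {        /* high (lead) surrogate */
        *consumed = 2;
        return 0x10000u + (((uint32_t)(s[0] - 0xD800) << 10) | (s[1] - 0xDC00));
    }
    *consumed = 1;
    return s[0];                                   /* BMP character */
}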

Q: How about converting to UTF-32?

If your platform or i18n library supports UTF-32 (4-byte) characters, then, for similar reasons, you might want to use them instead of UTF-16. Generally, the simplest approach is to match your basic character and string datatype to that used by the APIs you need to call.

However, UTF-32 is often a poor choice for high performance, since (on average) only half as many characters will fit in your processor cache.

Many libraries and platforms, such as ICU and Java 5, use a hybrid approach. For strings, they use UTF-16 to reduce storage, but for single-character APIs they use code points (UTF-32 values) for API simplicity. [MS]

Q: What basic datatypes do I need to use?

If you are developing for a cross-platform or cross-compiler implementation, you need to pay attention to how you define a datatype that can contain the code units of your preferred Unicode encoding form in a portable way.

For UTF-8 the cross-platform datatype is trivial, as compiler support for an 8-bit character datatype is universal.

For UTF-16 or UTF-32, currently the best practice is to use your own typedefs for a 16-bit or 32-bit code unit datatype, and map that to a compiler-specific choice of actual integer data type in a header file.

However, the C and C++ language standards have added support for datatypes of guaranteed length (both 16-bit and 32-bit), and even for declaring that a particular datatype contains characters of the corresponding Unicode encoding form. Where vendors support this scheme, you can use it effectively.
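
As a sketch of how this can be combined with the typedef approach above (the typedef names here are illustrative, and a C11 compiler and library are assumed for the first branch):

/* Portable code unit types: prefer the standard C11 types where the
   compiler and library provide them, otherwise fall back to
   exact-width integers. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
  #include <uchar.h>               /* C11: char16_t, char32_t */
  typedef char16_t unichar16;      /* one UTF-16 code unit */
  typedef char32_t unichar32;      /* one UTF-32 code unit */
#else
  #include <stdint.h>
  typedef uint16_t unichar16;
  typedef uint32_t unichar32;
#endif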

Non-standard implementations that support UTF-16 as wchar_t are widely used, because they make life easier for people working on platforms where UTF-16 is natively supported, particularly Windows. [MS]

Q: What are the porting issues I need to watch out for with UTF-8?

If you port to UTF-8, all code that does not try to interpret byte values greater than 0x7F will work, because ASCII and UTF-8 are identical up to 0x7F. However, watch for anything that truncates strings or buffers at places other than '\n' or '\0' or at space or syntax characters from the ASCII range. Truncations based on character counting are inherently dangerous, because UTF-8 is a multi-byte encoding. Also watch out for jumps into the middle of a string.
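
One common safeguard is to move a proposed truncation point backwards until it no longer falls on a UTF-8 continuation byte (these always have the bit pattern 10xxxxxx, i.e. 0x80..0xBF). The helper below is a minimal sketch with an illustrative name; it assumes well-formed UTF-8 and that limit does not exceed the buffer length:

#include <stddef.h>

/* Shorten a proposed truncation length so it does not split a
   multi-byte UTF-8 sequence. */
size_t utf8_safe_truncate(const unsigned char *s, size_t limit)
{
    while (limit > 0 && (s[limit] & 0xC0) == 0x80)
        --limit;                   /* step back over continuation bytes */
    return limit;
}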

Many kinds of inner-loop code do not need to be aware of the multi-byte nature of UTF-8 at all; for example, a simple copy operation like:

/* copy bytes until the NUL terminator or until len is exhausted */
while (--len && (*d++ = *s++))
    ;
if (!len)
    *d = '\0';

will work correctly either for ASCII or for UTF-8. [MS]

Q: How about issues in porting to UTF-16 or UTF-32?

If you port to UTF-16 or UTF-32 you need to make sure that you use the correct datatype (see above). If you have used char* extensively for both strings and raw data buffers, you'll have your work cut out for you in deciding which pointers need to be converted to the new data type. However, compilers can be of some help here: as you convert some of the interfaces, type mismatches should be flagged. If you are using C, try compiling with a C++ compiler; even though you are writing C code, the stricter type checking will generally catch more of these mismatches.

If all the characters that your application deals with explicitly are from the BMP (and typically from just the ASCII range, U+0000..U+007F), then the semantics of your string handling may not be impacted at all. However, your code still needs to be made aware of the single/double code unit nature of UTF-16 to avoid incorrect buffer truncations or jumping into the middle of strings at incorrect locations. The same concerns apply as for multi-byte string handling, but with 16-bit code units instead of bytes.
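
For example, truncating a UTF-16 buffer only requires making sure that the cut does not separate a high surrogate (0xD800..0xDBFF) from the low surrogate that follows it; a minimal sketch, with an illustrative name:

#include <stddef.h>
#include <stdint.h>

/* Back up a proposed truncation length (in code units) if it would
   split a surrogate pair. */
size_t utf16_safe_truncate(const uint16_t *s, size_t limit)
{
    if (limit > 0 && s[limit - 1] >= 0xD800 && s[limit - 1] <= 0xDBFF)
        --limit;                   /* last kept unit is a lead surrogate */
    return limit;
}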

For UTF-32, your application's string handling can be simpler, since UTF-32 is a fixed-width encoding, with one character per 32-bit code unit. String handling logic that worked for fixed-width 8-bit ASCII can often be kept completely intact. However, the drawback is that your processing efficiency for strings may be lowered, and you may have to rethink algorithms or memory handling to compensate for that loss of efficiency. [MS]

Page edited by [AF]