Re: UTF8 vs. Unicode (UTF16) in code

Date: Fri Mar 09 2001 - 06:51:25 EST

Generally, UTF-8 is a quicker-and-dirtier method of getting Unicode
support into a legacy product. The work that goes into supporting UTF-8
in 8-bit clean code is analogous to multibyte enabling: you have to
provide functions for moving the pointer about, searching, etc.
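To make the "moving the pointer about" part concrete, here is a minimal
sketch of the kind of helpers involved. The function names are my own
invention for illustration, and the code assumes the input is already
valid, NUL-terminated UTF-8 (no validation is attempted).

```c
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int utf8_is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Advance to the start of the next character; stays put at the NUL. */
const char *utf8_next(const char *p)
{
    if (*p == '\0')
        return p;
    p++;
    while (utf8_is_cont((unsigned char)*p))
        p++;
    return p;
}

/* Step back to the start of the previous character, never before start. */
const char *utf8_prev(const char *start, const char *p)
{
    while (p > start) {
        p--;
        if (!utf8_is_cont((unsigned char)*p))
            break;
    }
    return p;
}

/* Count characters (code points) rather than bytes. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (!utf8_is_cont((unsigned char)*s))
            n++;
    return n;
}
```

Code that only copies, concatenates, or compares whole strings for
equality never needs these helpers, which is why it can stay "UTF-8
ignorant".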

This *can* be less work initially than going whole hog and doing a UTF-16
port, especially if some of your code can remain "UTF-8 ignorant", which
is often the case.

UTF-16 is a bit more work. Most cross-platform developers choose to use
their own datatype (usually unsigned short) instead of wchar_t because of
the variability of the wchar_t type on different platforms. The problem
with that is the need to write a good bit of code to replace the usual
string functions one would expect to have available (and then replace
calls to the "usual string functions" wherever they occur in code!).
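As a rough illustration of what "replacing the usual string functions"
means, here is a sketch of a private 16-bit type with a few strlen-style
replacements. The type name uchar16 and the u16_ prefix are hypothetical,
and the code assumes unsigned short is exactly 16 bits on your platforms.

```c
#include <stddef.h>

/* Hypothetical code unit type; assumes unsigned short is 16 bits wide. */
typedef unsigned short uchar16;

/* Length in 16-bit code units (not characters), like strlen. */
size_t u16_strlen(const uchar16 *s)
{
    const uchar16 *p = s;
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

/* Copy including the terminating zero unit, like strcpy. */
uchar16 *u16_strcpy(uchar16 *dst, const uchar16 *src)
{
    uchar16 *d = dst;
    while ((*d++ = *src++) != 0)
        ;
    return dst;
}

/* Compare by code unit value; returns <0, 0, or >0, like strcmp. */
int u16_strcmp(const uchar16 *a, const uchar16 *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}

/* Find the first occurrence of code unit c, like strchr. */
const uchar16 *u16_strchr(const uchar16 *s, uchar16 c)
{
    for (;; s++) {
        if (*s == c)
            return s;
        if (*s == 0)
            return NULL;
    }
}
```

Note that u16_strcmp orders strings by code unit, which differs from code
point order once surrogate pairs are involved, and is in no sense a
linguistic collation.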

Most people use a library. ICU (International Components for Unicode),
available from IBM developerWorks, is popular for this purpose. You might
want to check it out.

If you use wchar_t then the Microsoft platform is a breeze, because
wchar_t there is a 16-bit type holding UTF-16. Solaris is a bit more
work, since wchar_t isn't directly or explicitly linked to Unicode. It's
just a (32-bit) datatype. You do get the benefit of having all your usual
functions available. C++ developers can use the STL string class
(std::wstring, albeit with caveats).
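The wchar_t variability is easy to see for yourself. The sketch below
uses two hypothetical helpers of my own; the second one assumes the wide
encoding is Unicode-based, which (as noted above) is not something the
platform promises on Solaris.

```c
#include <limits.h>
#include <wchar.h>

/* Width of wchar_t in bits on this platform (16 on Windows, 32 on Solaris). */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/*
 * Number of wchar_t units needed for one code point, ASSUMING wchar_t
 * holds UTF-16 when it is 16 bits wide and a whole code point otherwise.
 */
int wchar_units_for(unsigned long cp)
{
    if (wchar_bits() >= 21)
        return 1;
    return cp > 0xFFFFUL ? 2 : 1;
}
```

The practical consequence is that wcslen() and friends count code units,
so the same text can report different lengths on the two platforms once
it contains characters outside the Basic Multilingual Plane.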

And, as you note, you have to provide interface routines to your legacy
code (actually you have that problem either way you go).

So, after all that you may be wondering why most implementers choose to
implement a UTF-16 solution! The answer is that once you're done with the
initial port you'll find your code a bit more manageable and readable
(usually, usually) and it matches up better with other Unicode
implementations out there. For "on-the-wire" communications you'll still
probably end up with a UTF-8 representation, but this is only a few lines
of C to convert to/from UTF-16.
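For what it's worth, here is a sketch of the UTF-16 to UTF-8 direction.
The function names are made up for this example, unsigned short is
assumed to be 16 bits, and replacing an unpaired surrogate with U+FFFD is
my choice of error policy, not the only reasonable one.

```c
#include <stddef.h>

/* Encode one code point as UTF-8; returns the number of bytes written. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Convert zero-terminated UTF-16 to NUL-terminated UTF-8.
 * out must have room for 3 bytes per input code unit, plus 1.
 * Returns the number of bytes written, not counting the NUL.
 */
size_t utf16_to_utf8(const unsigned short *in, unsigned char *out)
{
    size_t n = 0;
    while (*in != 0) {
        unsigned long cp = *in++;
        if (cp >= 0xD800UL && cp <= 0xDBFFUL &&
            *in >= 0xDC00 && *in <= 0xDFFF) {
            /* High surrogate followed by low surrogate: combine them. */
            cp = 0x10000UL + ((cp - 0xD800UL) << 10)
                           + ((unsigned long)*in++ - 0xDC00UL);
        } else if (cp >= 0xD800UL && cp <= 0xDFFFUL) {
            /* Unpaired surrogate: substitute U+FFFD (assumed policy). */
            cp = 0xFFFDUL;
        }
        n += utf8_encode(cp, out + n);
    }
    out[n] = '\0';
    return n;
}
```

The reverse direction is about the same size: decode each UTF-8 sequence
to a code point, then emit either one code unit or a surrogate pair.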

Good Luck,


Addison P. Phillips Globalization Architect
webMethods, Inc
Sunnyvale, CA, USA

+1 408.210.3569 (mobile) +1 408.962.5487 (ofc)
"Internationalization is not a feature. It is an architecture."

On Wed, 7 Mar 2001, Allan Chau wrote:

> We've got an English-language only product which makes use of
> single-byte character strings throughout the code. For our next
> release, we'd like to internationalize it (Unicode) & be able to store
> data in UTF8 format (a requirement for data exchange).
> We're considering between using UTF8 within the code vs. changing our
> code to use wide characters. I'm wondering what experiences others have
> had that can help with our decision. I'm thinking that using UTF8
> internally may mean less rewriting initially, but we'd have to check
> carefully for code that makes assumptions about character boundaries.
> Because of this, I think that it'd be more complicated for developers to
> have to work with UTF8 in code. Unicode (UTF16) internally would be
> easier to manage since most characters will essentially be fixed width,
> but there'd be a lot of code to rewrite. Also, I've heard of problems
> with the wide character type (wchar_t) having different definitions
> depending on platform (we're running on NT & Sun Solaris). Many of our
> product APIs would also be affected.
> Can others offer their insights, suggestions?
> Thanks,
> -allan
