Re: UTF8 vs. Unicode (UTF16) in code

From: Antoine Leca (
Date: Fri Mar 09 2001 - 07:22:19 EST

Ienup Sung wrote:
> I also implement UTF-16 and UTF-8 support in various levels and
> I find UTF-8 is more easier to handle and write software with since
> we have many MB functions, e.g., mblen() for byte length, that we can use,
> and, there is no byte ordering hassle that we need to worry about.

mblen() is not a very good option. First, because it has been superseded
by another function, mbrlen(), which is "better" in terms of usability. Second,
and much more important, as Addison pointed out, mblen() follows the
current locale, so you cannot be sure that the underlying implementation
really deals with UTF-8.
So in practice a lot of people rely on their own versions of the mbxxx
functions, and implementing those takes a substantial amount of time in a
project like the one Allan was initially talking about.
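
To give an idea of what such a home-grown replacement looks like, here is a
minimal sketch of a UTF-8 counterpart to mblen() that does not depend on the
locale. The name utf8_seq_len() and its return conventions are hypothetical,
chosen for this illustration only; it assumes sequences of at most 4 bytes
and only checks the lead byte and the shape of the continuation bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of any standard library.
 * Returns the byte length (1..4) of the UTF-8 sequence starting at s,
 * reading at most n bytes. Returns 0 when n == 0, and (size_t)-1 when
 * the bytes do not form a complete, correctly shaped sequence.
 * Simplification: some overlong forms and encoded surrogates are
 * not rejected here; a production version would need those checks.
 */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80)
        len = 1;                          /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF)
        len = 2;
    else if ((s[0] & 0xF0) == 0xE0)
        len = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)
        len = 4;
    else
        return (size_t)-1;                /* continuation or invalid lead byte */

    if (len > n)
        return (size_t)-1;                /* sequence is truncated */

    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return (size_t)-1;            /* not a continuation byte */

    return len;
}
```

Even this small function has to make decisions (error values, truncated
input, how strict to be), and a real project needs the whole family of such
routines, which is where the time goes.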

> PS. You talked about lead byte checking at DBCS (which I guess because
> you didn't have mblen() in your platform(s)??)

No, that is just because with some DBCS encodings like iso-2022-jp
or Big-5, mblen() just does not work on an arbitrary byte: you first have
to check that you are looking at a lead byte.

> and I think from now on you will have to do that with UTF-16, i.e.,
> check if the leading two byte entity in UTF-16 is U+D8xx

Yes, of course. But the cost of the check is not really the same, particularly
since U+D8xx is *not* a valid character per se. So you can deal with each
16-bit unit as a single character, with just a simple break when the value
is between D800 and DFFF (then you branch to the special surrogates
routines, in the case when you really need to deal with them; Plan-9
experience showed that these cases are uncommon, by the way).
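
The structure described above could be sketched as follows: one range test
on the common path, and a separate branch for surrogates. The function name
utf16_next() and the choice of returning U+FFFD for an unpaired surrogate are
assumptions made for this example, not something prescribed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: reads one code point from the UTF-16 code units
 * at s (n units available), stores it in *cp, and returns the number of
 * units consumed (0 when n == 0).
 */
size_t utf16_next(const uint16_t *s, size_t n, uint32_t *cp)
{
    uint16_t u;

    if (n == 0)
        return 0;

    u = s[0];
    if (u < 0xD800 || u > 0xDFFF) {
        /* Common case: the 16-bit unit is the character. */
        *cp = u;
        return 1;
    }

    /* Uncommon case: surrogate handling. */
    if (u <= 0xDBFF && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *cp = 0x10000 + (((uint32_t)(u - 0xD800) << 10)
                         | (uint32_t)(s[1] - 0xDC00));
        return 2;
    }

    /* Unpaired surrogate: assumption, report U+FFFD and skip one unit. */
    *cp = 0xFFFD;
    return 1;
}
```

A program that does not care about characters outside the BMP can stop at
the first test and treat everything else as an error or pass it through.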

With MBCS, on the other hand, a character longer than one unit is
the standard rather than the exception, so you have to test first, then deal
with it. Also, a UTF-8 character can span 4 units, which is not as easy to
deal with as it is with UTF-16 (particularly if you need to deal with
incremental input with only one putback allowed).
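
One way around the putback limit is to keep explicit state between bytes, in
the spirit of the mbstate_t argument of mbrlen(). The sketch below is only
an illustration: the utf8_state type and utf8_feed() function are
hypothetical names, and the decoder assumes sequences of at most 4 bytes and
does not reject every overlong form or encoded surrogate.

```c
#include <stdint.h>

/* Hypothetical decoder state, carried across calls. */
typedef struct {
    uint32_t cp;       /* code point bits accumulated so far */
    int      pending;  /* continuation bytes still expected */
} utf8_state;

/*
 * Feed one byte b to the decoder.
 * Returns  1 when *out holds a complete code point,
 *          0 when more bytes are needed,
 *         -1 on a malformed sequence (the state is reset and the
 *            offending byte is dropped, not re-examined).
 */
int utf8_feed(utf8_state *st, unsigned char b, uint32_t *out)
{
    if (st->pending == 0) {
        if (b < 0x80) {                       /* ASCII: done at once */
            *out = b;
            return 1;
        }
        if (b >= 0xC2 && b <= 0xDF) {
            st->cp = b & 0x1F;
            st->pending = 1;
        } else if ((b & 0xF0) == 0xE0) {
            st->cp = b & 0x0F;
            st->pending = 2;
        } else if (b >= 0xF0 && b <= 0xF4) {
            st->cp = b & 0x07;
            st->pending = 3;
        } else {
            return -1;                        /* stray continuation or bad lead */
        }
        return 0;
    }

    if ((b & 0xC0) != 0x80) {                 /* expected a continuation byte */
        st->pending = 0;
        return -1;
    }

    st->cp = (st->cp << 6) | (uint32_t)(b & 0x3F);
    if (--st->pending > 0)
        return 0;

    *out = st->cp;
    return 1;
}
```

Compare this with the UTF-16 case: there the state is at most "I have seen
a high surrogate", while here up to three pending bytes must be tracked, and
a broken sequence is only detected after bytes have already been consumed.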

