Re: UTF8 vs. Unicode (UTF16) in code

From: Antoine Leca (
Date: Fri Mar 09 2001 - 15:29:30 EST

Ienup Sung wrote:
> Well, on the contrary to what you said, it is a very good option since you
> don't have to know anything about what's inside the character bytes which
> means by using the mblen/mbrlen, you can achieve codeset independent
> programming that will support not only Unicode/UTF-8 but also any other major
> codesets in the world.

This assumes that the C standard library the end user is actually running
really does support UTF-8. My current impression is that, unfortunately,
I cannot rely on that assumption unless (as you note) I require full
Unicode conformance *on the part of the underlying platform*, which in
practice is still a heavy requirement these days, at least for me.
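
To make the concern concrete, here is a minimal sketch of how a program
could probe at run time whether the library really decodes UTF-8 under a
given locale. The function name locale_decodes_utf8 and the locale name
passed to it are my own assumptions; locale names are platform-specific,
and a successful setlocale() alone does not prove the multibyte functions
understand UTF-8, hence the extra mblen() check.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `locale_name` can be selected and its
   mblen() treats the bytes C3 A9 (U+00E9 encoded in UTF-8) as a single
   2-byte character; returns 0 otherwise. The previous LC_CTYPE locale is
   restored before returning. */
int locale_decodes_utf8(const char *locale_name)
{
    static const char e_acute[] = "\xC3\xA9";
    char saved[256];
    const char *current;
    int ok = 0;

    /* Remember the current locale so the probe has no lasting effect. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL || strlen(current) >= sizeof saved)
        return 0;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        (void)mblen(NULL, 0);           /* reset any shift state */
        ok = (mblen(e_acute, 2) == 2);  /* one character, two bytes? */
    }

    setlocale(LC_CTYPE, saved);
    return ok;
}
```

A caller might try locale_decodes_utf8("en_US.UTF-8") (an assumed name;
it may not exist on a given system) and fall back to its own UTF-8
decoder when the result is 0.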

> Also, what I meant by is mbrlen/mblen kind of interfaces; of course if
> you want to deal with stateful encoding then you obviously need to use
> mbrlen() that are rather recently added at ISO C MSE/XPG5.

Published in Spring 1995 (I speak about ISO C).

I know it was added "rather recently" in real-world
implementations, though... And I have some ideas about the
underlying reasons.

> Your argument on mblen doesn't work for BIG5 as a living proof, all
> Unix systems that have BIG5 locale work fine and perfectly with/at
> mblen/mbrlen with the BIG5 locale.

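Sorry, I wasn't clear. My point was that you cannot call mblen() with an
arbitrary pointer into the string; the result would be meaningless. You
first need to be sure that it points at a lead byte or a single-byte
character.
OTOH, with UTF-16, you get meaningful results, since a 16-bit unit tells
you by itself whether it is a whole character or the first or second half
of a surrogate pair.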
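
As an illustration of the UTF-16 side of that remark, here is a minimal
sketch. The function names are mine and purely illustrative. It relies
only on the fact that high surrogates (D800..DBFF) and low surrogates
(DC00..DFFF) occupy disjoint ranges, so an index landing in the middle of
a string can be classified, and moved back to a character start, by
looking at no more than one neighbouring unit.

```c
#include <stddef.h>
#include <stdint.h>

/* High (leading) surrogates are D800..DBFF, low (trailing) DC00..DFFF. */
static int utf16_is_high(uint16_t u) { return (u & 0xFC00u) == 0xD800u; }
static int utf16_is_low(uint16_t u)  { return (u & 0xFC00u) == 0xDC00u; }

/* Given any index i into s, return the index where the character
   containing s[i] starts: i - 1 if s[i] is the second half of a
   surrogate pair, otherwise i itself. */
size_t utf16_char_start(const uint16_t *s, size_t i)
{
    if (i > 0 && utf16_is_low(s[i]) && utf16_is_high(s[i - 1]))
        return i - 1;
    return i;
}

/* Length, in 16-bit units, of the character starting at s[i], where n is
   the total number of units in s. A valid surrogate pair counts as 2;
   anything else, including an unpaired surrogate, counts as 1. */
size_t utf16_char_len(const uint16_t *s, size_t n, size_t i)
{
    if (utf16_is_high(s[i]) && i + 1 < n && utf16_is_low(s[i + 1]))
        return 2;
    return 1;
}
```

With a DBCS such as Big-5 the same trick is not available, because a
trail byte can fall in the range also used by single-byte characters, so
one has to scan from a known character boundary instead.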

And yes, this is a very minor point.

> Therefore, I argue your argument on mblen and such not working with
> BIG5 and ISO-2022-JP not true and mis-leading.

I never said nor implied that they are not working. I cannot understand which
sentence of mine may have led to that conclusion. I just said they are not
currently *working* with UTF-8 inputs, which is quite different.
I also said that mblen on any DBCS encoding (and _this_ includes Big-5
or ISO-2022-JP) is clumsier than the equivalent on UTF-16. That
is a matter of taste I do not want to discuss any further since, as Michka
rightly points out, any discussion about religious matters is useless.

So you can consider you are right.

