Re: UTF8 vs. Unicode (UTF16) in code

From: Ienup Sung (
Date: Fri Mar 09 2001 - 14:52:38 EST

Well, contrary to what you said, it is a very good option: you don't have
to know anything about what's inside the character bytes. By using
mblen/mbrlen you can achieve codeset-independent programming that supports
not only Unicode/UTF-8 but also any other major codeset in the world.
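
As a minimal sketch of this codeset-independent style, the function below
counts the characters in a multibyte string with mbrlen() and never looks
at the byte values itself. The name count_mb_chars is hypothetical, and the
sketch assumes the caller has already selected a locale (for example with
setlocale(LC_CTYPE, "")); without that call the program runs in the "C"
locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in a NUL-terminated multibyte string, using
 * whatever codeset the current LC_CTYPE locale defines.
 * Returns (size_t)-1 if an invalid or truncated sequence is found.
 * count_mb_chars is a hypothetical helper name, not a standard function. */
size_t count_mb_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);      /* start in the initial shift state */
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;            /* invalid or incomplete character */
        if (n == 0)
            break;                        /* defensive: NUL character reached */
        s += n;                           /* skip one whole character */
        remaining -= n;
        count++;
    }
    return count;
}
```

The same loop should work unchanged in a UTF-8, BIG5, or eucJP locale,
because the length of each character comes from the locale's own mbrlen()
implementation rather than from the caller.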

Also, what I meant was the mblen/mbrlen kind of interface; of course, if
you want to deal with a stateful encoding then you obviously need to use
mbrlen(), which was added rather recently in ISO C MSE/XPG5. I have to point
out, though, that not many stateful encodings are in use as the codeset of
a locale, except probably the JIS-7 encoding. While I wouldn't deny the
possibility of ISO-2022-JP and the like being a locale's codeset, they are
used more for transmission and web-related documents, as you know.

Also, you can be absolutely sure that mblen/mbrlen and the like will work
with UTF-8 if your platform supports a Unicode/UTF-8 locale. When you say
that your platform supports Unicode/UTF-8, it means that all the Worldwide
Portability Interfaces (WPI), which include mblen/mbrlen and so on, are
guaranteed to work in the Unicode/UTF-8 locales and thus for UTF-8. I don't
believe you can give me a single example or piece of evidence of
mblen/mbrlen not working properly on any Unix system when they say they
support a Unicode/UTF-8 locale and thus

As for your argument that mblen doesn't work for BIG5: as living proof,
all Unix systems that have a BIG5 locale work perfectly well with
mblen/mbrlen in that locale. As I mentioned, and as you probably know,
ISO-2022-JP is not really supported as the codeset of any locale, and I
haven't seen anybody support it as one. But if I had to support
ja_JP.ISO-2022-JP, one of the first things I would have to do to support
the locale (and hence the codeset) is provide pluggable mblen and mbrlen
methods, so that when you are in the ja_JP.ISO-2022-JP locale, mblen and
mbrlen call the pluggable methods for that locale. Therefore, I argue that
your claim that mblen and the like do not work with BIG5 and ISO-2022-JP is
untrue and misleading.
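
To illustrate the idea of pluggable per-locale methods, here is a rough
sketch of a method table that maps a codeset name to its own length
routine. Everything in it is hypothetical: the struct layout, the names
codeset_methods and find_codeset, and the byte ranges used for Big5 and
EUC-JP, which are a commonly cited description and differ between vendor
variants. Real C libraries use their own internal mechanisms.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical signature for a pluggable length method:
 * returns the byte length of the character at s, 0 for NUL, -1 on error. */
typedef int (*mblen_method)(const unsigned char *s, size_t n);

struct codeset_methods {
    const char  *name;     /* codeset name, e.g. "BIG5" */
    mblen_method mblen;    /* method plugged in for this codeset */
};

/* Assumed Big5 ranges: lead 0x81-0xFE, trail 0x40-0x7E or 0xA1-0xFE. */
int big5_mblen(const unsigned char *s, size_t n)
{
    if (n == 0) return -1;
    if (s[0] < 0x80) return s[0] ? 1 : 0;
    if (s[0] < 0x81 || s[0] > 0xFE || n < 2) return -1;
    if ((s[1] >= 0x40 && s[1] <= 0x7E) || (s[1] >= 0xA1 && s[1] <= 0xFE))
        return 2;
    return -1;
}

/* Assumed EUC-JP rules: 0x8E + 1 byte, 0x8F + 2 bytes, 0xA1-0xFE + 1 byte. */
int eucjp_mblen(const unsigned char *s, size_t n)
{
    int len, i;

    if (n == 0) return -1;
    if (s[0] < 0x80) return s[0] ? 1 : 0;
    if (s[0] == 0x8E) len = 2;
    else if (s[0] == 0x8F) len = 3;
    else if (s[0] >= 0xA1 && s[0] <= 0xFE) len = 2;
    else return -1;
    if (n < (size_t)len) return -1;
    for (i = 1; i < len; i++)          /* every trail byte must be 0xA1-0xFE */
        if (s[i] < 0xA1 || s[i] > 0xFE) return -1;
    return len;
}

static const struct codeset_methods codeset_table[] = {
    { "BIG5",  big5_mblen  },
    { "eucJP", eucjp_mblen },
};

/* Look up the methods for a codeset; returns NULL if none is plugged in. */
const struct codeset_methods *find_codeset(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof codeset_table / sizeof codeset_table[0]; i++)
        if (strcmp(codeset_table[i].name, name) == 0)
            return &codeset_table[i];
    return NULL;
}
```

Supporting another codeset then means adding one more routine and one more
table entry; callers keep using the same interface.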

I also argue that the cost of the check isn't really different at all. For
instance, you mentioned in your email "a simple break". That "simple
break" is a check that requires at least one cmp machine instruction and
possibly an accompanying jmp instruction, and then you still have to make
sure that the two-byte entity that follows is actually valid. Pretty much
the same or a smaller amount of work needs to be done for any other kind of
encoding, including UTF-8, BIG5, Shift_JIS, Johab, Expanded Wansung, GBK,
eucJP, you name it. You may argue that UTF-8 has four different sizes, but
you don't need to do four separate comparisons all the time; a single
comparison is good enough to figure out the byte length, since the leading
byte of a UTF-8 character always tells you the number of bytes. You also
said we have to "test and deal with it" with MBCS, and I dare say that we
also have to "test and deal with it" with UTF-16, since it is also a
variable-width encoding: the number of entities that you may have to deal
with is two and four, or one and two, depending on your entity size. It is
pretty much the same for any other multibyte, variable-width
locale/codeset: what you have to deal with is one and two (for a so-called
double-byte codeset), or one and two for Occidental Unicode/UTF-8 locales
in most cases, one and three for Asian Unicode/UTF-8 locales, and so on.
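
The comparison of the two checks can be sketched side by side. The function
names utf8_seq_len and utf16_seq_len are hypothetical, and this is only an
illustration of how the leading unit alone determines the length in both
encodings; it does not validate the trailing bytes or units, which either
encoding still has to do afterwards.

```c
#include <stdint.h>

/* Number of bytes in a UTF-8 sequence, decided from the lead byte alone.
 * Returns 0 if the byte cannot start a sequence (a continuation byte,
 * or 0xC0, 0xC1, 0xF5-0xFF, which never appear in valid UTF-8). */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                    /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Number of 16-bit units in a UTF-16 sequence, decided from the first unit.
 * Returns 0 for a low surrogate, which cannot start a sequence. */
int utf16_seq_len(uint16_t unit)
{
    if (unit >= 0xD800 && unit <= 0xDBFF) return 2;   /* high surrogate */
    if (unit >= 0xDC00 && unit <= 0xDFFF) return 0;   /* stray low surrogate */
    return 1;                                         /* ordinary BMP unit */
}
```

In the common case (ASCII in UTF-8, a non-surrogate unit in UTF-16) each
function returns after the first range test, so the per-character cost of
the check is of the same order in both encodings.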

With regards,


PS. Speaking of religion by the way, I checked sacred scrolls and books
of my religion the other day and found that I have St. UTF-8, St. UTF-16, and
St. UTF-32 in my scrolls and books and you know why. Oh, in addition to
the saints, I also have UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE sisters and
brothers too.

] Date: Fri, 09 Mar 2001 04:01:41 -0800 (GMT-0800)
] From: Antoine Leca <>
] Subject: Re: UTF8 vs. Unicode (UTF16) in code
] To: Unicode List <>
] Cc: Unicode List <>
] Ienup Sung wrote:
] >
] > I also implement UTF-16 and UTF-8 support in various levels and
] > I find UTF-8 is easier to handle and write software with since
] > we have many MB functions, e.g., mblen() for byte length, that we can use,
] > and, there is no byte ordering hassle that we need to worry about.
] mblen() is not a very good option. First, because it has been superseded
] by another, mbrlen(), which is "better" in terms of usability. Then,
] much more important, and as Addison pointed out, because you cannot be
] sure that the underlying implementation of mblen() really deals with UTF-8.
] So in practice a lot of people rely on their own versions of the mbxxx
] functions, and implementing those takes a substantial amount of time in a
] project like the one Allan was initially talking about.
] > PS. You talked about lead byte checking at DBCS (which I guess because
] > you didn't have mblen() in your platform(s)??)
] No, that is just because with some DBCS encodings like iso-2022-jp
] or Big-5, mblen() just does not work on an arbitrary byte: you have to
] first check that you are considering a lead byte.
] > and I think from now on you will have to do that with UTF-16, i.e.,
] > check if the leading two byte entity in UTF-16 is U+D8xx
] <snip>
] Yes, of course. But the cost of the check is not really the same, particularly
] since U+D8xx is *not* a valid character per se. So you can deal with each
] 16-bit unit as a unique character, with just a simple break when the value
] is between D800 and DFFF (then you branch to the special surrogates
] routines, in the case when you really need to deal with them; Plan-9
] experience showed that these cases are uncommon, by the way).
] With MBCS, on the other hand, the more-than-one length of a character is
] the standard rather than the exception; so you have to first test, then deal
] with it. Also, a UTF-8 character can span 4 units, which is not as easy to
] deal with as it is with UTF-16 (particularly if you need to deal with
] incremental input with only one putback allowed).
] Antoine
