Re: UTF8 vs. Unicode (UTF16) in code

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Fri Mar 09 2001 - 15:29:30 EST

Next message: Tex Texin: "Re: Unicode market acceptance"
Previous message: Richard Cook: "Re: Unicode market acceptance"
In reply to: Ienup Sung: "Re: UTF8 vs. Unicode (UTF16) in code"
Next in thread: Peter_Constable@sil.org: "RE: UTF8 vs. Unicode (UTF16) in code"

Ienup Sung wrote:
>
> Well, on the contrary to what you said, it is a very good option since you
> don't have to know anything about what's inside the character bytes which
> means by using the mblen/mbrlen, you can achieve codeset independent
> programming that will support not only Unicode/UTF-8 but also any other major
> codesets in the world.

This assumes that the C standard library the end user is actually running
really does support UTF-8. My current impression is that, unfortunately,
I cannot rely on that assumption unless (as you note) I require full
Unicode conformance *on the part of the underlying platform*, which in
practice is still a heavy requirement these days, at least for me.
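
To make the concern concrete, here is a minimal sketch of how a program
could probe at run time whether the library it is linked against decodes
UTF-8 in a given locale. The function name libc_decodes_utf8 is mine, the
probe is only a heuristic (it checks that the three bytes of U+20AC are
seen as one character), and which locale names exist is entirely
platform-dependent.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the C library, once switched to the
   named locale, treats the three-byte UTF-8 sequence E2 82 AC (U+20AC)
   as a single character, and 0 otherwise.
   Side effect: changes the program's LC_CTYPE locale. */
int libc_decodes_utf8(const char *locale_name)
{
    static const char sample[] = "\xE2\x82\xAC";
    mbstate_t st;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;               /* locale not installed or not recognised */

    memset(&st, 0, sizeof st);  /* start in the initial conversion state */

    /* A UTF-8 aware library reports 3 bytes; single-byte and double-byte
       codesets report 1, 2, or an error instead. */
    return mbrlen(sample, 3, &st) == 3;
}
```

A caller might try libc_decodes_utf8("en_US.UTF-8") (an assumed locale
name) and fall back to its own decoder when the answer is 0.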

> Also, what I meant by is mbrlen/mblen kind of interfaces; of course if
> you want to deal with stateful encoding then you obviously need to use
> mbrlen() that are rather recently added at ISO C MSE/XPG5.

It was published in spring 1995 (I am speaking of ISO C).

I know it was added "rather recently" to real-world
implementations, though... And I have some ideas about the
underlying reasons.

> Your argument on mblen doesn't work for BIG5 as a living proof, all
> Unix systems that have BIG5 locale work fine and perfectly with/at
> mblen/mbrlen with the BIG5 locale.

Sorry, I wasn't clear. My point was that you cannot call mblen() on an
arbitrary pointer, because the result would be meaningless: you first need
to be sure it points at a lead byte or a single-byte character.
OTOH, with UTF-16, you get meaningful results, since each 16-bit unit
identifies itself as a lead surrogate, a trail surrogate, or a complete
character.
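
A small sketch of the difference, with names of my own choosing. The
UTF-16 surrogate ranges are fixed by the standard; the Big5 byte ranges
used here are the commonly cited ones (lead 0x81..0xFE, trail 0x40..0x7E
and 0xA1..0xFE) and should be read as an assumption, since vendor
variants differ.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { U16_SINGLE, U16_LEAD, U16_TRAIL } u16_kind;

/* A UTF-16 code unit can be classified in isolation. */
u16_kind u16_classify(uint16_t u)
{
    if (u >= 0xD800 && u <= 0xDBFF) return U16_LEAD;
    if (u >= 0xDC00 && u <= 0xDFFF) return U16_TRAIL;
    return U16_SINGLE;
}

/* From an arbitrary index i, find the index where the character
   containing s[i] starts: at most one step backwards. */
size_t u16_char_start(const uint16_t *s, size_t i)
{
    if (i > 0 && u16_classify(s[i]) == U16_TRAIL
              && u16_classify(s[i - 1]) == U16_LEAD)
        return i - 1;
    return i;
}

/* In Big5 (assumed ranges, see above) the lead and trail ranges overlap,
   so a byte seen in isolation cannot be classified. */
int big5_could_be_lead(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

int big5_could_be_trail(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}
```

For instance, the byte 0xA4 satisfies both Big5 predicates, so a pointer
landing on it says nothing about where the character begins, whereas
u16_classify() always gives a definite answer.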

And yes, this is a very minor point.

> Therefore, I argue your argument on mblen and such not working with
> BIG5 and ISO-2022-JP not true and mis-leading.

I never said nor implied that they are not working. I cannot understand which
sentence of mine may have led to that conclusion. I just said they are not
currently *working* with UTF-8 input, which is quite different.
I also said that mblen on any DBCS encoding (and _this_ includes Big-5
or ISO-2022-JP) is clumsier than the equivalent on UTF-16. That
is a matter of taste I do not want to discuss any further since, as Michka
rightly points out, any discussion about religious matters is useless.
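
For what it is worth, this is roughly what I mean by "clumsy": a sketch,
under the assumption that the only portable way to find a character
boundary in a multibyte string is to walk it from the start with
mbrlen(). The helper name mb_char_start is hypothetical, and the decoding
follows whatever LC_CTYPE locale is in effect.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns the offset of the first byte of the
   character that contains s[target], or (size_t)-1 if target is out of
   range or an invalid or truncated sequence is met on the way.
   The string is always scanned from offset 0. */
size_t mb_char_start(const char *s, size_t len, size_t target)
{
    mbstate_t st;
    size_t pos = 0;

    memset(&st, 0, sizeof st);      /* initial conversion state */

    while (pos < len) {
        size_t n = mbrlen(s + pos, len - pos, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;      /* invalid or incomplete sequence */
        if (n == 0)
            n = 1;                  /* null character: count one byte
                                       (a simplification) */
        if (target < pos + n)
            return pos;             /* target lies inside this character */
        pos += n;
    }
    return (size_t)-1;              /* target is past the end */
}
```

The cost is linear in the distance from the start of the string, where
the UTF-16 equivalent looks at no more than two code units.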

So you can consider you are right.

Antoine

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:20 EDT