Re: UTF8 vs. Unicode (UTF16) in code

From: Ienup Sung (ienup.sung@eng.sun.com)
Date: Fri Mar 09 2001 - 16:12:54 EST


Well, the C standard library is everywhere these days, including on non-Unix
systems, as "runtime" support. I do know from experience that up through the
early 80s no system had anything like mblen, but things changed quite a bit
many years ago. Also, I don't think any C development environment ships
these days without mblen and similar interfaces, since they are part of the
ISO/ANSI C standard, and if you want the "branding" you have to support them.
And they work with all kinds of codesets/encodings, certainly including
UTF-8, BIG5, Shift_JIS, and so on.
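
To make the codeset-independent idea concrete, here is a minimal sketch of
counting characters with mbrlen(). The helper name count_mb_chars is my own
invention for illustration, and the sketch assumes the caller has already
selected a locale (for example with setlocale(LC_ALL, "")) whose codeset
matches the data. The code never inspects the bytes itself; the library
decides where each character ends.

```c
#include <string.h>
#include <wchar.h>

/* Count the characters in a NUL-terminated multibyte string, using
 * whatever codeset the current LC_CTYPE locale defines.
 * count_mb_chars is a hypothetical helper name, not a standard function.
 * Returns (size_t)-1 if the string holds an invalid or truncated sequence. */
size_t count_mb_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* begin in the initial shift state */
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete character */
        if (n == 0)
            break;                      /* defensive: null character reached */
        s += n;                         /* skip one whole character */
        remaining -= n;
        count++;
    }
    return count;
}
```

The same function should work unchanged under a UTF-8, BIG5, or Shift_JIS
locale, provided the platform actually ships that locale.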

The MSE (Multibyte Support Extension) has been with us for quite some time,
at least in the form of proposals and review drafts, and the earliest
discussions about the functions and the MSE itself go back to before 1994.

As for the arbitrary pointer issue, the same thing applies to UTF-16 (whether
it is handled as one two-byte entity or as two single-byte entities), since
with UTF-16 in a byte stream you can no longer simply assume 2 x the number
of characters. Also, even with multibyte codesets you can backtrack a few
bytes (just as you can with UTF-16) to get to the right character boundary
if you have to deal with such a situation. Of course, this kind of thing
requires understanding each and every one of the codesets you want to
support, including UTF-16, which we don't think each and every developer
must, needs to, or wants to do.
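
As an illustration of the backtracking mentioned above, here is a minimal
sketch for two encodings only, UTF-16 and UTF-8. The function names
utf16_char_start and utf8_char_start are hypothetical, and the code assumes
well-formed input. Other multibyte codesets would each need their own
codeset-specific logic, which is exactly the knowledge burden described
above.

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-16: given an arbitrary index i into an array of 16-bit code units,
 * return the index where the character containing units[i] starts.
 * If units[i] is a low (trailing) surrogate that follows a high surrogate,
 * the character starts one unit earlier. */
size_t utf16_char_start(const uint16_t *units, size_t i)
{
    if (i > 0 &&
        units[i] >= 0xDC00 && units[i] <= 0xDFFF &&
        units[i - 1] >= 0xD800 && units[i - 1] <= 0xDBFF)
        return i - 1;
    return i;
}

/* UTF-8: continuation bytes have the bit pattern 10xxxxxx, so step back
 * over them (at most 3, the longest possible tail) to reach the lead byte. */
size_t utf8_char_start(const unsigned char *bytes, size_t i)
{
    size_t back = 0;
    while (i > 0 && back < 3 && (bytes[i] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    return i;
}
```

Note that neither function can be written without knowing the encoding's
internals, whereas a loop built on mblen/mbrlen does not need that knowledge
as long as it starts from a known character boundary.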

I have had the chance and privilege of looking at various implementations of
mblen and similar interfaces from various commercial and open source
systems, since some years ago the major Unix vendors exchanged their
implementations through the COSE/Spec1170 deal. I have to say that the
implementations I saw might not be what all the vendors have today, but even
then the mblen implementations were not clumsy at all; in most cases they
were as elegant and lean as they could be.

With regards,

Ienup

] Date: Fri, 09 Mar 2001 12:09:08 -0800 (GMT-0800)
] From: Antoine Leca <Antoine.Leca@renault.fr>
] Subject: Re: UTF8 vs. Unicode (UTF16) in code
] To: Unicode List <unicode@unicode.org>
] Cc: unicode@unicode.org
]
] Ienup Sung wrote:
] >
] > Well, on the contrary to what you said, it is a very good option since you
] > don't have to know anything about what's inside the character bytes which
] > means by using the mblen/mbrlen, you can achieve codeset independent
] > programming that will support not only Unicode/UTF-8 but also any other
] > major codesets in the world.
]
] This assumes that the C standard library the final user is really using,
] does really support UTF-8. My current impression is that, unfortunately,
] I cannot rely on that assumption, unless (as you note) I require full
] Unicode conformance *on the part of the underlying platform*. Which is,
] in practice, still a heavy requirement these days, at least for me.
]
]
] > Also, what I meant was the mbrlen/mblen kind of interfaces; of course if
] > you want to deal with a stateful encoding then you obviously need to use
] > mbrlen(), which was added rather recently in ISO C MSE/XPG5.
]
] Published in Spring 1995 (I speak about ISO C).
]
] I know it has been added "rather recently" in real world
] implementations though... And I have some ideas about the
] underlying reasons.
]
]
] > Your argument about mblen doesn't hold for BIG5: as living proof, all
] > Unix systems that have a BIG5 locale work perfectly well with
] > mblen/mbrlen in that locale.
]
] Sorry, I wasn't clear. My idea was that you cannot call mblen() with an
] arbitrary pointer, since the result would be meaningless: you first need
] to be sure it points at a lead byte or a single-byte character.
] OTOH, with UTF-16, you get meaningful results.
]
] And yes, this is a very minor point.
]
]
] > Therefore, I argue that your argument about mblen and such not working
] > with BIG5 and ISO-2022-JP is not true and is misleading.
]
] I never said nor implied that they are not working. I cannot understand
] which sentence of mine may have led to that conclusion. I just said they
] are not currently *working* with UTF-8 inputs, which is quite different.
] I also said that mblen on any DBCS encoding (and _this_ includes Big-5
] or ISO-2022-JP) is clumsier than the equivalent on UTF-16. Which
] is a matter of taste I do not want to discuss any further, since as Michka
] rightly points out, any discussion about religious matters is useless.
]
] So you can consider you are right.
]
]
] Antoine


