Re: UTF8 vs. Unicode (UTF16) in code

From: addison@inter-locale.com
Date: Fri Mar 09 2001 - 13:48:52 EST


Well....

Actually, there is a significant difference between being "UTF-8
ignorant" and "UTF-16 ignorant". A "UTF-16 ignorant" program thinks that
surrogate pairs are just two characters with undefined properties. Since
no characters are currently assigned "up there", this isn't really a big
deal. Shortly, when Unicode 3.1 is official, there will be 40K or so
characters in the supplementary planes... but they'll be relatively rare.

In most cases where one has a "character pointer", one is not performing
casing, line breaking, or other text interpretation that requires
significant awareness of the meaning of the text. Of course, it depends on
the instance and the application how true that is ;-). But in many cases
you *can* ignore the fact that a high- or low-surrogate character is
really part of something else.
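
To make that concrete, here is a rough C sketch (an illustration only; the
function names are made up for this example and come from no particular
library). It shows what a surrogate-aware routine has to do that a "UTF-16
ignorant" one skips: recognize a high/low pair and treat it as one character.
It assumes the text is held as an array of 16-bit code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if u is a high (lead) surrogate, U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* True if u is a low (trail) surrogate, U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a high/low surrogate pair into the code point it encodes. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                    + ((uint32_t)lo - 0xDC00u);
}

/* Surrogate-aware character count: a well-formed pair counts once.
   An unpaired surrogate is counted as one character, which is also
   what an "ignorant" program would do with it. */
static size_t count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (is_high_surrogate(s[i]) && i + 1 < n && is_low_surrogate(s[i + 1]))
            i++; /* skip the trail half of the pair */
        count++;
    }
    return count;
}
```

An "ignorant" program would simply report n, the number of code units. The
two answers differ only when the text actually contains a surrogate pair,
which is why the shortcut is often tolerable.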

With UTF-8, however, it is impossible to ignore the multi-byte sequences
and they can never really be treated as separate characters. So I guess
all I'm saying is that, depending on what you need to do and what level of
awareness your application needs to achieve, a pure "UCS-2 port" might be
a better choice than UTF-8, since the specific details overlooked are
of a different quality.
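
For comparison, a similar rough sketch for UTF-8 (again only an illustration,
with hypothetical function names, and assuming well-formed input). Every
non-ASCII character is a multi-byte sequence, so even a simple character count
has to look at the lead and continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte b,
   or 0 if b cannot start a sequence (continuation or invalid byte). */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;               /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2; /* 110xxxxx */
    if (b >= 0xE0 && b <= 0xEF) return 3; /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4) return 4; /* 11110xxx */
    return 0;
}

/* Count characters by counting every byte that is not a
   continuation byte (10xxxxxx). Assumes well-formed UTF-8. */
static size_t utf8_count(const char *s, size_t n)
{
    size_t count = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

Here the byte count and the character count diverge as soon as the text
contains anything outside ASCII, so code that treats one byte as one character
goes wrong on ordinary accented or CJK text, not just on rare characters.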

Best Regards,

Addison

===============================================================
Addison P. Phillips Globalization Architect
webMethods, Inc http://www.webmethods.com
Sunnyvale, CA, USA mailto:aphillips@webmethods.com

+1 408.210.3569 (mobile) +1 408.962.5487 (ofc)
===============================================================
"Internationalization is not a feature. It is an architecture."

On Thu, 8 Mar 2001, Ienup Sung wrote:

> Hello,
>
> Actually, as you also implied in your email, since a UTF-16 character can be
> one two-byte entity or two two-byte entities, there will be no significant
> difference between UTF-8 and UTF-16 in terms of how to count and/or decide
> character boundaries in a string. I.e., a UTF-8 character could be a 1, 2, 3,
> or 4 byte entity and a UTF-16 character could be a 2 or 4 byte entity since,
> starting from Unicode 3.1, we will have characters defined outside of the BMP,
> esp. in Planes 01, 02, and 0E.
>
> I do not know however whether MSFT is fully supporting UTF-8 as a multibyte
> form in their interfaces and probably that can be answered by MSFT folks.
>
> Yes, you are correct that while MSFT advocates that wchar_t == Unicode,
> all other Unix systems are conforming to ISO/ANSI C and POSIX standards that
> the wchar_t is an opaque data type. There are a couple of exceptions to
> this though; glibc of Linux assumes that the wchar_t == UCS-4 and most of the
> commercial Unix systems including Solaris guarantee that a form of Unicode
> whether that is UTF-32 or UTF-16 will be used in wchar_t of the Unicode/UTF-8
> locales.
>
> With regards,
>
> Ienup
>
>
> ] Date: Wed, 07 Mar 2001 16:21:27 -0800 (GMT-0800)
> ] From: Allan Chau <achau@rsasecurity.com>
> ] Subject: UTF8 vs. Unicode (UTF16) in code
> ] To: Unicode List <unicode@unicode.org>
> ] MIME-version: 1.0
> ]
> ] We've got an English-language only product which makes use of
> ] single-byte character strings throughout the code. For our next
> ] release, we'd like to internationalize it (Unicode) & be able to store
> ] data in UTF8 format (a requirement for data exchange).
> ]
> ] We're deciding between using UTF8 within the code vs. changing our
> ] code to use wide characters. I'm wondering what experiences others have
> ] had that can help with our decision. I'm thinking that using UTF8
> ] internally may mean less rewriting initially, but we'd have to check
> ] carefully for code that makes assumptions about character boundaries.
> ] Because of this, I think that it'd be more complicated for developers to
> ] have to work with UTF8 in code. Unicode (UTF16) internally would be
> ] easier to manage since most characters will essentially be fixed width,
> ] but there'd be a lot of code to rewrite. Also, I've heard of problems
> ] with the wide character type (wchar_t) having different definitions
> ] depending on platform (we're running on NT & Sun Solaris). Many of our
> ] product APIs would also be affected.
> ]
> ] Can others offer their insights, suggestions?
> ]
> ] Thanks,
> ] -allan
> ]
> ]
>


