Re: UTF8 vs. Unicode (UTF16) in code

From: Michael \(michka\) Kaplan (michka@trigeminal.com)
Date: Thu Mar 08 2001 - 21:13:51 EST


I think you missed Addison's point.

There is TRULY a significant difference between UTF-8 text and UTF-16 text
on so many different levels that claiming they are all in the same
"multibyte" realm (along with DBCS, etc.) is almost laughable.

I won't laugh, since I have been in MBCS muck (for lack of a better term).
Anyone who has ever had to code MBCS-aware functions, deal with DBCS lead
bytes, validate MBCS input, and all the rest, will have absolutely no
problem feeling that UTF-16 *is* an easier implementation than DBSC and also
UTF-8.

Now, if one is on a platform that tends to think well of UTF-32 and one is
going to be spending a significant amount of one's time going beyond the
BMP, then certainly UTF-32 is superior to both. But if you were forced to
choose between UTF-8 and UTF-16 purely on which will be easier to implement
a system entirely from scratch, I find it hard to believe that anyone who is
not paid by the hour would choose UTF-8.

Perhaps we should just give up on this whole notion of trying to convince
others that *our* favorite encoding is *the* favorite encoding? It seems
that the whole conversation has taken on the look of a religious debate. :-(

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

----- Original Message -----
From: "Ienup Sung" <ienup.sung@eng.sun.com>
To: "Unicode List" <unicode@unicode.org>
Sent: Thursday, March 08, 2001 5:21 PM
Subject: Re: UTF8 vs. Unicode (UTF16) in code

> I think we shouldn't advocate that since there will be only 43K
> CJK characters at the SIP, about 1.6K characters at SMP, and, 97 tag
> characters at SPP, we can ignore such the characters and the additional
planes
> of the UTF-16/32 of Unicode 3.1. Furthemore, when you're doing the first
i18n
> on the existing programs, you can do the whole thing at once with minor
> additional cost if you choose to have support for UTF-16 while you're at
it
> rather than do it only for BMP/UCS-2 now and later do one more time of
change
> even though that would be decided by each team/company who are doing
> the i18n in my opinion.
>
> And, as we all know, we can no longer claim that the UTF-16 is a fixed
> width anymore since it is variable width now as like UTF-8; we will just
> have to deal with it in my opinion.
>
> With regards,
>
> Ienup
>
>
> ] Date: Fri, 09 Mar 2001 10:48:52 -0800 (PST)
> ] From: addison@inter-locale.com
> ] Subject: Re: UTF8 vs. Unicode (UTF16) in code
> ] X-Sender: root@addisonp.inter-locale.com
> ] To: Ienup Sung <ienup.sung@eng.sun.com>
> ] Cc: Unicode List <unicode@unicode.org>
> ] MIME-version: 1.0
> ]
> ] Well....
> ]
> ] Actually, there is a significant difference between being "UTF-8
> ] ignorant" and "UTF-16 ignorant". A "UTF-16 ignorant" program thinks that
> ] surrogate pairs are just two characters with undefined properties. Since
> ] currently there are no characters "up there" this isn't a really big
> ] deal. Shortly, when Unicode 3.1 is official, there will be 40K or so
> ] characters in the supplemental planes... but they'll be relatively rare.
> ]
> ] In most cases where one has a "character pointer", one is not performing
> ] casing, line breaking, or other text interpretation that requires
> ] significant awareness of the meaning of the text. Of course, it depends
on
> ] the instance and the application how true that is ;-). But in many cases
> ] you *can* ignore the fact that a high- or low-surrogate character is
> ] really part of something else.
> ]
> ] With UTF-8, however, is is impossible to ignore the multi-byte sequences
> ] and they can never really be treated as separate characters. So I guess
> ] all I'm saying is that, depending on what you need to do and what level
of
> ] awareness your application needs to achieve, a pure "UCS-2 port" might
be
> ] a better choice than UTF-8, since the specific details overlooked are
> ] of a different quality.
> ]
> ] Best Regards,.
> ]
> ] Addison
> ]
> ] ===============================================================
> ] Addison P. Phillips Globalization Architect
> ] webMethods, Inc http://www.webmethods.com
> ] Sunnyvale, CA, USA mailto:aphillips@webmethods.com
> ]
> ] +1 408.210.3569 (mobile) +1 408.962.5487 (ofc)
> ] ===============================================================
> ] "Internationalization is not a feature. It is an architecture."
> ]
> ] On Thu, 8 Mar 2001, Ienup Sung wrote:
> ]
> ] > Hello,
> ] >
> ] > Actually, as you also implied in your email, since an UTF-16 character
can
> be
> ] > a two-byte entity or two two-byte entity, there will be no significant
> ] > difference between UTF-8 and UTF-16 in terms of how to count and/or
decide
> ] > character boundaries in a string. I.e., a UTF-8 character could be a
1, 2,
> 3,
> ] > or 4 byte entity and a UTF-16 character could be a 2 or 4 byte entity
since
> ] > starting from Unicode 3.1, we will have characters defined outside of
BMP,
> ] > esp., Plane 01, 02, and 0E.
> ] >
> ] > I do not know however whether MSFT is fully supporting UTF-8 as a
multibyte
> ] > form in their interfaces and probably that can be answered by MSFT
folks.
> ] >
> ] > Yes, you are correct that while MSFT advocates that wchar_t ==
Unicode,
> ] > all other Unix systems are conforming to ISO/ANSI C and POSIX
standards that
> ] > the wchar_t is an opaque data type. There are a couple of exceptions
to
> ] > this though; glibc of Linux assumes that the wchar_t == UCS-4 and most
of
> the
> ] > commercial Unix systems including Solaris guarantee that a form of
Unicode
> ] > whether that is UTF-32 or UTF-16 will be used in wchar_t of the
> Unicode/UTF-8
> ] > locales.
> ] >
> ] > With regards,
> ] >
> ] > Ienup
> ] >
> ] >
> ] > ] Date: Wed, 07 Mar 2001 16:21:27 -0800 (GMT-0800)
> ] > ] From: Allan Chau <achau@rsasecurity.com>
> ] > ] Subject: UTF8 vs. Unicode (UTF16) in code
> ] > ] To: Unicode List <unicode@unicode.org>
> ] > ] MIME-version: 1.0
> ] > ]
> ] > ] We've got an English-language only product which makes use of
> ] > ] single-byte character strings throughout the code. For our next
> ] > ] release, we'd like to internationalize it (Unicode) & be able to
store
> ] > ] data in UTF8 format (a requirement for data exchange).
> ] > ]
> ] > ] We're considering between using UTF8 within the code vs. changing
our
> ] > ] code to use wide characters. I'm wondering what experiences others
have
> ] > ] had that can help with our decision. I'm thinking that using UTF8
> ] > ] internally may mean less rewriting initially, but we'd have to check
> ] > ] carefully for code that make assumptions about character boundaries.
> ] > ] Because of this, I think that it'd be more complicated for
developers to
> ] > ]
> ] > ] have to work with UTF8 in code. Unicode (UTF16) internally would be
> ] > ] easier to manage since most characters will essentially be fixed
width,
> ] > ] but there'd be alot of code to rewrite. Also, I've heard of
problems
> ] > ] with the wide character type (wchar_t) having different definitions
> ] > ] depending on platform (we're running on NT & Sun Solaris). Many of
our
> ] > ] product APIs would also be affected.
> ] > ]
> ] > ] Can others offer their insights, suggestions?
> ] > ]
> ] > ] Thanks,
> ] > ] -allan
> ] > ]
> ] > ]
> ] >
> ]
>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:20 EDT