RE: UTF8 vs. Unicode (UTF16) in code

From: Ayers, Mike (Mike_Ayers@bmc.com)
Date: Thu Mar 08 2001 - 21:02:39 EST


        If you really want to finish the job, there's always UTF-32, which
should do rather nicely until we meet the space aliens with the
4,293,853,186 character alphabet!

/|/|ike

P.S. No, they're not Klingons!

> From: Ienup Sung [mailto:ienup.sung@eng.sun.com]
>
> I don't think we should advocate that, just because there will be only
> about 43K CJK characters in the SIP, about 1.6K characters in the SMP,
> and 97 tag characters in the SSP, we can ignore those characters and the
> additional planes of UTF-16/32 in Unicode 3.1. Furthermore, when you are
> doing the first i18n pass on existing programs, you can do the whole
> thing at once, at minor additional cost, by supporting UTF-16 while
> you're at it, rather than supporting only BMP/UCS-2 now and making a
> second round of changes later. In my opinion, though, that is for each
> team or company doing the i18n to decide.
>
> And, as we all know, we can no longer claim that UTF-16 is fixed width,
> since it is now variable width, like UTF-8; in my opinion, we will just
> have to deal with it.
>
> With regards,
>
> Ienup
>
>
> ] Date: Fri, 09 Mar 2001 10:48:52 -0800 (PST)
> ] From: addison@inter-locale.com
> ] Subject: Re: UTF8 vs. Unicode (UTF16) in code
> ] X-Sender: root@addisonp.inter-locale.com
> ] To: Ienup Sung <ienup.sung@eng.sun.com>
> ] Cc: Unicode List <unicode@unicode.org>
> ] MIME-version: 1.0
> ]
> ] Well....
> ]
> ] Actually, there is a significant difference between being "UTF-8
> ] ignorant" and "UTF-16 ignorant". A "UTF-16 ignorant" program thinks
> ] that surrogate pairs are just two characters with undefined properties.
> ] Since there are currently no characters "up there", this isn't really
> ] a big deal. Shortly, when Unicode 3.1 is official, there will be 40K or
> ] so characters in the supplementary planes... but they'll be relatively
> ] rare.
> ]
> ] In most cases where one has a "character pointer", one is not
> ] performing casing, line breaking, or other text interpretation that
> ] requires significant awareness of the meaning of the text. Of course,
> ] how true that is depends on the instance and the application ;-). But
> ] in many cases you *can* ignore the fact that a high- or low-surrogate
> ] character is really part of something else.
> ]
> ] With UTF-8, however, it is impossible to ignore the multi-byte
> ] sequences, and they can never really be treated as separate characters.
> ] So I guess all I'm saying is that, depending on what you need to do and
> ] what level of awareness your application needs to achieve, a pure
> ] "UCS-2 port" might be a better choice than UTF-8, since the specific
> ] details overlooked are of a different quality.
> ]
> ] Best Regards,
> ]
> ] Addison
> ]
> ] ===============================================================
> ] Addison P. Phillips Globalization Architect
> ] webMethods, Inc http://www.webmethods.com
> ] Sunnyvale, CA, USA mailto:aphillips@webmethods.com
> ]
> ] +1 408.210.3569 (mobile) +1 408.962.5487 (ofc)
> ] ===============================================================
> ] "Internationalization is not a feature. It is an architecture."

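For readers who want to see the "UTF-16 ignorant" versus "UTF-8 ignorant"
distinction from the quoted message in concrete terms, here is a minimal
sketch. Python, the helper names (utf16_units, utf8_bytes, is_surrogate)
and the sample characters are assumptions made for illustration; none of
them come from the thread. It shows that a supplementary-plane character
becomes two UTF-16 code units that both fall in the reserved surrogate
range, while in UTF-8 even a BMP character such as U+00E9 already needs
more than one byte.

```python
# Illustrative sketch only: helper names and sample characters are
# assumptions for this example, not part of the original discussion.

SAMPLES = [
    ("U+0041 (ASCII letter)", "\u0041"),
    ("U+00E9 (Latin-1 letter, BMP)", "\u00e9"),
    ("U+4E2D (CJK ideograph, BMP)", "\u4e2d"),
    ("U+20000 (CJK ideograph, plane 2)", "\U00020000"),
]


def utf16_units(ch):
    """Return the UTF-16 code units of ch as a list of 16-bit integers."""
    raw = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def utf8_bytes(ch):
    """Return the UTF-8 encoding of ch as a list of byte values."""
    return list(ch.encode("utf-8"))


def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


if __name__ == "__main__":
    for label, ch in SAMPLES:
        units = " ".join(f"{u:04X}" for u in utf16_units(ch))
        octets = " ".join(f"{b:02X}" for b in utf8_bytes(ch))
        print(f"{label}: UTF-16 [{units}]  UTF-8 [{octets}]")
```

A program that treats each 16-bit unit as a character goes wrong only on
the last sample, and even there it sees two units that cannot be mistaken
for any other character. A program that treats each byte as a character
goes wrong on every sample except the first.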

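On the point that UTF-16 is now variable width: the extra work over a
UCS-2-only port is largely a matter of pairing surrogates whenever code
units are walked. The following is a rough sketch under stated
assumptions, not a complete or validating decoder. The function name
code_points is hypothetical, and passing unpaired surrogates through
unchanged is just one possible policy.

```python
# Sketch only: code_points is a hypothetical helper, and unpaired
# surrogates are passed through unchanged rather than rejected.

def code_points(units):
    """Yield code points from a sequence of UTF-16 code units (ints)."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine a high/low pair into one supplementary code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # A BMP code unit, or a surrogate with no partner.
            yield unit
            i += 1


if __name__ == "__main__":
    text = [0x0041, 0xD840, 0xDC00, 0x4E2D]  # "A", U+20000, U+4E2D
    print("code units: ", len(text))
    print("code points:", len(list(code_points(text))))
```

A UCS-2 view of the same data would report four characters where there
are three, which is the kind of detail the quoted message argues can
often, though not always, be overlooked.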
