Re: UTF8 vs. Unicode (UTF16) in code

From: Ienup Sung (ienup.sung@eng.sun.com)
Date: Thu Mar 08 2001 - 22:01:22 EST


I have also implemented UTF-16 and UTF-8 support at various levels, and
I find UTF-8 easier to handle and write software with: we have many
multibyte (MB) functions we can use, e.g., mblen() for the byte length of
a character, and there is no byte-ordering hassle to worry about.
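
To make the mblen() point concrete, here is a minimal sketch in Python
(used only for brevity; the same logic is a few lines of C). The helper
names utf8_seq_len and count_utf8_chars are made up for illustration and
are not part of any library. The sketch derives the byte length of a
character from its lead byte, roughly what mblen() reports in a UTF-8
locale, and it works on a plain byte sequence, so no byte-order question
arises. It checks only lead and continuation byte ranges and does not
reject every ill-formed sequence (for example, overlong three-byte forms).

```python
def utf8_seq_len(lead: int) -> int:
    """Return the byte length of the UTF-8 sequence starting with `lead`.

    Returns -1 if `lead` cannot start a sequence (a continuation byte or
    a byte value that never appears in UTF-8).
    """
    if lead <= 0x7F:
        return 1              # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return -1


def count_utf8_chars(data: bytes) -> int:
    """Count characters by stepping from lead byte to lead byte."""
    count = 0
    i = 0
    while i < len(data):
        n = utf8_seq_len(data[i])
        if n < 0 or i + n > len(data):
            raise ValueError(f"invalid or truncated sequence at byte {i}")
        # Every byte after the lead byte must be a continuation byte.
        for b in data[i + 1:i + n]:
            if not 0x80 <= b <= 0xBF:
                raise ValueError(f"bad continuation byte after byte {i}")
        count += 1
        i += n
    return count
```

For example, count_utf8_chars(b"a\xe2\x82\xac") steps over a one-byte and
a three-byte sequence and counts two characters.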

Of course everyone's experience will differ, so I wouldn't say that
UTF-8 is absolutely better, or the other way around. I only tell people
that there are certain things you need to be aware of when you deal with
Unicode representation forms on multiple platforms (or on a single
platform), and UTF-16 is certainly no better (or worse) than UTF-8 in
that regard.

With regards,

Ienup

PS. You talked about lead byte checking in DBCS (which I guess was because
you didn't have mblen() on your platform(s)?), and I think from now on
you will have to do the same with UTF-16, i.e., check whether the leading
two-byte entity is a high surrogate (U+D800..U+DBFF), check "pair
integrity", and all the rest. So you (or we) are not entirely out of the
variable-width character woods unless you can jump directly to UTF-32 ;-)
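
The surrogate and "pair integrity" checks described in the PS can be
sketched as follows. This is an illustrative Python sketch, assuming the
input has already been split into 16-bit code units in native byte order;
the function names are hypothetical and not taken from any library.

```python
def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def decode_utf16_units(units):
    """Turn a list of 16-bit code units into a list of code points.

    Raises ValueError when pair integrity is broken: a high surrogate
    with no low surrogate after it, or a low surrogate on its own.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_high_surrogate(unit):
            if i + 1 >= len(units) or not is_low_surrogate(units[i + 1]):
                raise ValueError(f"unpaired high surrogate at index {i}")
            # Combine the pair into one code point beyond the BMP.
            high = unit - 0xD800
            low = units[i + 1] - 0xDC00
            code_points.append(0x10000 + (high << 10) + low)
            i += 2
        elif is_low_surrogate(unit):
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            code_points.append(unit)  # ordinary BMP character
            i += 1
    return code_points
```

With UTF-32 the loop collapses to one unit per character, which is the
shortcut out of the woods mentioned above.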

] Date: Thu, 08 Mar 2001 18:13:51 -0800
] From: "Michael (michka) Kaplan" <michka@trigeminal.com>
] Subject: Re: UTF8 vs. Unicode (UTF16) in code
] To: Ienup Sung <ienup.sung@eng.sun.com>, Unicode List <unicode@unicode.org>
] MIME-version: 1.0
] Content-transfer-encoding: 7bit
] X-OriginalArrivalTime: 09 Mar 2001 02:13:51.0970 (UTC) FILETIME=[95B85C20:01C0A83E]
]
] I think you missed Addison's point.
]
] There is TRULY a significant difference between UTF-8 text and UTF-16 text
] on so many different levels that claiming they are all in the same
] "multibyte" realm (along with DBCS, etc.) is almost laughable.
]
] I won't laugh, since I have been in MBCS muck (for lack of a better term).
] Anyone who has ever had to code MBCS-aware functions, deal with DBCS lead
] bytes, validate MBCS input, and all the rest, will have absolutely no
] problem feeling that UTF-16 *is* an easier implementation than DBCS and also
] UTF-8.
]
] Now, if one is on a platform that tends to think well of UTF-32 and one is
] going to be spending a significant amount of one's time going beyond the
] BMP, then certainly UTF-32 is superior to both. But if you were forced to
] choose between UTF-8 and UTF-16 purely on which will be easier to implement
] a system entirely from scratch, I find it hard to believe that anyone who is
] not paid by the hour would choose UTF-8.
]
] Perhaps we should just give up on this whole notion of trying to convince
] others that *our* favorite encoding is *the* favorite encoding? It seems
] that the whole conversation has taken on the look of a religious debate. :-(
]
] MichKa
]
] Michael Kaplan
] Trigeminal Software, Inc.
] http://www.trigeminal.com/
]
] ----- Original Message -----
] From: "Ienup Sung" <ienup.sung@eng.sun.com>
] To: "Unicode List" <unicode@unicode.org>
] Sent: Thursday, March 08, 2001 5:21 PM
] Subject: Re: UTF8 vs. Unicode (UTF16) in code
]
]
] > I think we shouldn't advocate that, just because there will be only 43K
] > CJK characters in the SIP, about 1.6K characters in the SMP, and 97 tag
] > characters in the SSP, we can ignore those characters and the additional
] > planes of UTF-16/32 in Unicode 3.1. Furthermore, when you're doing the
] > first i18n pass on existing programs, you can do the whole thing at once
] > with minor additional cost if you choose to support UTF-16 while you're
] > at it, rather than doing it only for BMP/UCS-2 now and making another
] > round of changes later, even though in my opinion that is for each
] > team/company doing the i18n to decide.
] >
] > And, as we all know, we can no longer claim that UTF-16 is fixed
] > width, since it is now variable width like UTF-8; we will just
] > have to deal with it, in my opinion.
] >
] > With regards,
] >
] > Ienup
] >
] >
] > ] Date: Fri, 09 Mar 2001 10:48:52 -0800 (PST)
] > ] From: addison@inter-locale.com
] > ] Subject: Re: UTF8 vs. Unicode (UTF16) in code
] > ] X-Sender: root@addisonp.inter-locale.com
] > ] To: Ienup Sung <ienup.sung@eng.sun.com>
] > ] Cc: Unicode List <unicode@unicode.org>
] > ] MIME-version: 1.0
] > ]
] > ] Well....
] > ]
] > ] Actually, there is a significant difference between being "UTF-8
] > ] ignorant" and "UTF-16 ignorant". A "UTF-16 ignorant" program thinks that
] > ] surrogate pairs are just two characters with undefined properties. Since
] > ] currently there are no characters "up there" this isn't a really big
] > ] deal. Shortly, when Unicode 3.1 is official, there will be 40K or so
] > ] characters in the supplemental planes... but they'll be relatively rare.
] > ]
] > ] In most cases where one has a "character pointer", one is not performing
] > ] casing, line breaking, or other text interpretation that requires
] > ] significant awareness of the meaning of the text. Of course, it depends
] > ] on the instance and the application how true that is ;-). But in many cases
] > ] you *can* ignore the fact that a high- or low-surrogate character is
] > ] really part of something else.
] > ]
] > ] With UTF-8, however, it is impossible to ignore the multi-byte sequences
] > ] and they can never really be treated as separate characters. So I guess
] > ] all I'm saying is that, depending on what you need to do and what level
] > ] of awareness your application needs to achieve, a pure "UCS-2 port" might
] > ] be a better choice than UTF-8, since the specific details overlooked are
] > ] of a different quality.
] > ]
] > ] Best Regards,
] > ]
] > ] Addison
] > ]
] > ] ===============================================================
] > ] Addison P. Phillips Globalization Architect
] > ] webMethods, Inc http://www.webmethods.com
] > ] Sunnyvale, CA, USA mailto:aphillips@webmethods.com
] > ]
] > ] +1 408.210.3569 (mobile) +1 408.962.5487 (ofc)
] > ] ===============================================================
] > ] "Internationalization is not a feature. It is an architecture."
] > ]
] > ] On Thu, 8 Mar 2001, Ienup Sung wrote:
] > ]
] > ] > Hello,
] > ] >
] > ] > Actually, as you also implied in your email, since a UTF-16 character
] > ] > can be one two-byte entity or two two-byte entities, there will be no
] > ] > significant difference between UTF-8 and UTF-16 in terms of how to count
] > ] > and/or decide character boundaries in a string. I.e., a UTF-8 character
] > ] > could be a 1, 2, 3, or 4 byte entity and a UTF-16 character could be a
] > ] > 2 or 4 byte entity since, starting from Unicode 3.1, we will have
] > ] > characters defined outside of the BMP, specifically in Planes 01, 02,
] > ] > and 0E.
] > ] >
] > ] > I do not know, however, whether MSFT fully supports UTF-8 as a
] > ] > multibyte form in their interfaces; that can probably be answered by
] > ] > MSFT folks.
] > ] >
] > ] > Yes, you are correct that while MSFT advocates that wchar_t == Unicode,
] > ] > all other Unix systems conform to the ISO/ANSI C and POSIX standards,
] > ] > under which wchar_t is an opaque data type. There are a couple of
] > ] > exceptions to this though: glibc on Linux assumes that wchar_t == UCS-4,
] > ] > and most of the commercial Unix systems, including Solaris, guarantee
] > ] > that a form of Unicode, whether UTF-32 or UTF-16, will be used in
] > ] > wchar_t in the Unicode/UTF-8 locales.
] > ] >
] > ] > With regards,
] > ] >
] > ] > Ienup
] > ] >
] > ] >
] > ] > ] Date: Wed, 07 Mar 2001 16:21:27 -0800 (GMT-0800)
] > ] > ] From: Allan Chau <achau@rsasecurity.com>
] > ] > ] Subject: UTF8 vs. Unicode (UTF16) in code
] > ] > ] To: Unicode List <unicode@unicode.org>
] > ] > ] MIME-version: 1.0
] > ] > ]
] > ] > ] We've got an English-language only product which makes use of
] > ] > ] single-byte character strings throughout the code. For our next
] > ] > ] release, we'd like to internationalize it (Unicode) & be able to
] > ] > ] store data in UTF8 format (a requirement for data exchange).
] > ] > ]
] > ] > ] We're deciding between using UTF8 within the code vs. changing our
] > ] > ] code to use wide characters. I'm wondering what experiences others
] > ] > ] have had that can help with our decision. I'm thinking that using UTF8
] > ] > ] internally may mean less rewriting initially, but we'd have to check
] > ] > ] carefully for code that makes assumptions about character boundaries.
] > ] > ] Because of this, I think that it'd be more complicated for developers
] > ] > ] to have to work with UTF8 in code. Unicode (UTF16) internally would be
] > ] > ] easier to manage since most characters will essentially be fixed
] > ] > ] width, but there'd be a lot of code to rewrite. Also, I've heard of
] > ] > ] problems with the wide character type (wchar_t) having different
] > ] > ] definitions depending on platform (we're running on NT & Sun Solaris).
] > ] > ] Many of our product APIs would also be affected.
] > ] > ]
] > ] > ] Can others offer their insights, suggestions?
] > ] > ]
] > ] > ] Thanks,
] > ] > ] -allan
] > ] > ]
] > ] > ]
] > ] >
] > ]
] >
]


