Re: UTF8 vs. Unicode (UTF16) in code

From: Ienup Sung (ienup.sung@eng.sun.com)
Date: Thu Mar 08 2001 - 20:47:32 EST


I don't think we should advocate ignoring those characters, and the additional
planes of UTF-16/32 in Unicode 3.1, just because there will be only 43K CJK
characters in the SIP, about 1.6K characters in the SMP, and 97 tag characters
in the SSP. Furthermore, when you're doing the first i18n pass on existing
programs, you can do the whole thing at once, at minor additional cost, by
supporting UTF-16 while you're at it, rather than supporting only BMP/UCS-2
now and making another round of changes later. In my opinion, though, that is
for each team or company doing the i18n to decide.

And, as we all know, we can no longer claim that UTF-16 is fixed width,
since it is now variable width just like UTF-8; in my opinion, we will just
have to deal with it.
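
To make the variable-width point concrete, here is a minimal sketch in C of
counting characters (code points) in both forms. The function names
utf16_count and utf8_count are illustrative only and do not come from any
library, and the UTF-8 routine assumes well-formed input. In both cases the
number of code units is not the number of characters once a character
outside the BMP appears.

```c
#include <stddef.h>
#include <stdint.h>

/* Count code points in a UTF-16 buffer of n 16-bit code units.
 * A high surrogate (D800..DBFF) immediately followed by a low surrogate
 * (DC00..DFFF) is one supplementary-plane character. An unpaired
 * surrogate is counted as one unit on its own. */
size_t utf16_count(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i++; /* skip the low half of the surrogate pair */
        }
        count++;
    }
    return count;
}

/* Count code points in a UTF-8 buffer of n bytes by counting every byte
 * that is not a continuation byte (bit pattern 10xxxxxx).
 * Assumption: the input is well-formed UTF-8. */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For example, "A" followed by U+10400 (a Plane 01 character) is three UTF-16
code units (0041 D801 DC00) or five UTF-8 bytes (41 F0 90 90 80), and both
functions count it as two characters.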

With regards,

Ienup

] Date: Fri, 09 Mar 2001 10:48:52 -0800 (PST)
] From: addison@inter-locale.com
] Subject: Re: UTF8 vs. Unicode (UTF16) in code
] X-Sender: root@addisonp.inter-locale.com
] To: Ienup Sung <ienup.sung@eng.sun.com>
] Cc: Unicode List <unicode@unicode.org>
] MIME-version: 1.0
]
] Well....
]
] Actually, there is a significant difference between being "UTF-8
] ignorant" and "UTF-16 ignorant". A "UTF-16 ignorant" program thinks that
] surrogate pairs are just two characters with undefined properties. Since
] currently there are no characters "up there" this isn't a really big
] deal. Shortly, when Unicode 3.1 is official, there will be 40K or so
] characters in the supplemental planes... but they'll be relatively rare.
]
] In most cases where one has a "character pointer", one is not performing
] casing, line breaking, or other text interpretation that requires
] significant awareness of the meaning of the text. Of course, it depends on
] the instance and the application how true that is ;-). But in many cases
] you *can* ignore the fact that a high- or low-surrogate character is
] really part of something else.
]
] With UTF-8, however, it is impossible to ignore the multi-byte sequences
] and they can never really be treated as separate characters. So I guess
] all I'm saying is that, depending on what you need to do and what level of
] awareness your application needs to achieve, a pure "UCS-2 port" might be
] a better choice than UTF-8, since the specific details overlooked are
] of a different quality.
]
] Best Regards,
]
] Addison
]
] ===============================================================
] Addison P. Phillips Globalization Architect
] webMethods, Inc http://www.webmethods.com
] Sunnyvale, CA, USA mailto:aphillips@webmethods.com
]
] +1 408.210.3569 (mobile) +1 408.962.5487 (ofc)
] ===============================================================
] "Internationalization is not a feature. It is an architecture."
]
] On Thu, 8 Mar 2001, Ienup Sung wrote:
]
] > Hello,
] >
] > Actually, as you also implied in your email, since a UTF-16 character can be
] > a two-byte entity or two two-byte entities, there will be no significant
] > difference between UTF-8 and UTF-16 in terms of how to count and/or decide
] > character boundaries in a string. I.e., a UTF-8 character could be a 1, 2, 3,
] > or 4 byte entity and a UTF-16 character could be a 2 or 4 byte entity since
] > starting from Unicode 3.1, we will have characters defined outside of BMP,
] > esp., Plane 01, 02, and 0E.
] >
] > I do not know however whether MSFT is fully supporting UTF-8 as a multibyte
] > form in their interfaces and probably that can be answered by MSFT folks.
] >
] > Yes, you are correct that while MSFT advocates that wchar_t == Unicode,
] > all other Unix systems are conforming to ISO/ANSI C and POSIX standards that
] > the wchar_t is an opaque data type. There are a couple of exceptions to
] > this though; glibc of Linux assumes that the wchar_t == UCS-4 and most of the
] > commercial Unix systems including Solaris guarantee that a form of Unicode
] > whether that is UTF-32 or UTF-16 will be used in wchar_t of the Unicode/UTF-8
] > locales.
] >
] > With regards,
] >
] > Ienup
] >
] >
] > ] Date: Wed, 07 Mar 2001 16:21:27 -0800 (GMT-0800)
] > ] From: Allan Chau <achau@rsasecurity.com>
] > ] Subject: UTF8 vs. Unicode (UTF16) in code
] > ] To: Unicode List <unicode@unicode.org>
] > ] MIME-version: 1.0
] > ]
] > ] We've got an English-language only product which makes use of
] > ] single-byte character strings throughout the code. For our next
] > ] release, we'd like to internationalize it (Unicode) & be able to store
] > ] data in UTF8 format (a requirement for data exchange).
] > ]
] > ] We're considering between using UTF8 within the code vs. changing our
] > ] code to use wide characters. I'm wondering what experiences others have
] > ] had that can help with our decision. I'm thinking that using UTF8
] > ] internally may mean less rewriting initially, but we'd have to check
] > ] carefully for code that makes assumptions about character boundaries.
] > ] Because of this, I think that it'd be more complicated for developers to
] > ] have to work with UTF8 in code. Unicode (UTF16) internally would be
] > ] easier to manage since most characters will essentially be fixed width,
] > ] but there'd be a lot of code to rewrite. Also, I've heard of problems
] > ] with the wide character type (wchar_t) having different definitions
] > ] depending on platform (we're running on NT & Sun Solaris). Many of our
] > ] product APIs would also be affected.
] > ]
] > ] Can others offer their insights, suggestions?
] > ]
] > ] Thanks,
] > ] -allan
] > ]
] > ]
] >
]


