RE: UTF8 vs. Unicode (UTF16) in code

From: Ienup Sung (ienup.sung@eng.sun.com)
Date: Thu Mar 08 2001 - 21:08:42 EST


Hmmm... As many people on this mailing list already know, the coding
space for UCS-2 is 64K, for UTF-16/UTF-32 it is 17 x 64K, and for UCS-4
it is 2G, so I think you meant 17 x 64K = 1,114,112, not 4,293,853,186?
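
As a quick check of the figures above, here is a minimal Python sketch of
the arithmetic. The constant names are only illustrative, and "2G" is
taken to mean the 31-bit UCS-4 code space.

```python
# Code space sizes for the encodings discussed above.
# The constant names are illustrative, not taken from any standard API.
UCS2_SIZE = 2 ** 16           # one plane of 64K code points
UTF16_SIZE = 17 * UCS2_SIZE   # BMP plus 16 supplementary planes
UCS4_SIZE = 2 ** 31           # assumption: "2G" means the 31-bit UCS-4 space

for name, size in [("UCS-2", UCS2_SIZE),
                   ("UTF-16/UTF-32", UTF16_SIZE),
                   ("UCS-4", UCS4_SIZE)]:
    print(f"{name}: {size:,}")

# The UTF-16 range ends at U+10FFFF, so its size is 0x10FFFF + 1.
print(UTF16_SIZE == 0x10FFFF + 1)
```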

With regards,

Ienup

] Date: Thu, 08 Mar 2001 20:02:39 -0600
] From: "Ayers, Mike" <Mike_Ayers@bmc.com>
] Subject: RE: UTF8 vs. Unicode (UTF16) in code
] To: 'Ienup Sung' <ienup.sung@eng.sun.com>, Unicode List <unicode@unicode.org>
] MIME-version: 1.0
]
]
] If you really want to finish the job, there's always UTF-32, which
] should do rather nicely until we meet the space aliens with the
] 4,293,853,186 character alphabet!
]
]
] /|/|ike
]
] P.S. No, they're not Klingons!
]
] > From: Ienup Sung [mailto:ienup.sung@eng.sun.com]
] >
] > I think we shouldn't advocate that, just because there will be only
] > 43K CJK characters in the SIP, about 1.6K characters in the SMP, and
] > 97 tag characters in the SSP, we can ignore those characters and the
] > additional planes of UTF-16/32 in Unicode 3.1. Furthermore, when you're
] > doing the first i18n on existing programs, you can do the whole thing
] > at once with minor additional cost if you choose to support UTF-16
] > while you're at it, rather than doing it only for BMP/UCS-2 now and
] > making another round of changes later, although in my opinion that is
] > for each team/company doing the i18n to decide.
] >
] > And, as we all know, we can no longer claim that UTF-16 is fixed
] > width, since it is now variable width like UTF-8; we will just
] > have to deal with it, in my opinion.
] >
] > With regards,
] >
] > Ienup
] >
] >
] > ] Date: Fri, 09 Mar 2001 10:48:52 -0800 (PST)
] > ] From: addison@inter-locale.com
] > ] Subject: Re: UTF8 vs. Unicode (UTF16) in code
] > ] X-Sender: root@addisonp.inter-locale.com
] > ] To: Ienup Sung <ienup.sung@eng.sun.com>
] > ] Cc: Unicode List <unicode@unicode.org>
] > ] MIME-version: 1.0
] > ]
] > ] Well....
] > ]
] > ] Actually, there is a significant difference between being "UTF-8
] > ] ignorant" and "UTF-16 ignorant". A "UTF-16 ignorant" program thinks
] > ] that surrogate pairs are just two characters with undefined
] > ] properties. Since currently there are no characters "up there", this
] > ] isn't a really big deal. Shortly, when Unicode 3.1 is official, there
] > ] will be 40K or so characters in the supplemental planes... but
] > ] they'll be relatively rare.
] > ]
] > ] In most cases where one has a "character pointer", one is not
] > ] performing casing, line breaking, or other text interpretation that
] > ] requires significant awareness of the meaning of the text. Of course,
] > ] it depends on the instance and the application how true that is ;-).
] > ] But in many cases you *can* ignore the fact that a high- or
] > ] low-surrogate character is really part of something else.
] > ]
] > ] With UTF-8, however, it is impossible to ignore the multi-byte
] > ] sequences, and they can never really be treated as separate
] > ] characters. So I guess all I'm saying is that, depending on what you
] > ] need to do and what level of awareness your application needs to
] > ] achieve, a pure "UCS-2 port" might be a better choice than UTF-8,
] > ] since the specific details overlooked are of a different quality.
] > ]
] > ] Best Regards,
] > ]
] > ] Addison
] > ]
] > ] ===============================================================
] > ] Addison P. Phillips Globalization Architect
] > ] webMethods, Inc http://www.webmethods.com
] > ] Sunnyvale, CA, USA mailto:aphillips@webmethods.com
] > ]
] > ] +1 408.210.3569 (mobile) +1 408.962.5487 (ofc)
] > ] ===============================================================
] > ] "Internationalization is not a feature. It is an architecture."
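
To illustrate the distinction drawn in the quoted message between being
"UTF-16 ignorant" and "UTF-8 ignorant", here is a minimal Python sketch.
The helper names (utf16_units, is_surrogate, decodes_alone) are
hypothetical, and U+20000 is used only as an example of a supplementary
plane code point. It shows that a supplementary character becomes two
UTF-16 code units, each recognizable as a surrogate, while in UTF-8 even
a BMP character such as U+00E9 takes more than one byte and no byte of a
multi-byte sequence decodes on its own.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def is_surrogate(unit):
    """True if a 16-bit code unit is a high or low surrogate."""
    return 0xD800 <= unit <= 0xDFFF


def decodes_alone(chunk):
    """True if a byte string is valid UTF-8 by itself."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Example supplementary-plane code point (an assumption for illustration).
supplementary = "\U00020000"
units = utf16_units(supplementary)
utf8_bytes = supplementary.encode("utf-8")

# A UTF-16-ignorant program sees two 16-bit "characters", both surrogates.
print([hex(u) for u in units], all(is_surrogate(u) for u in units))

# The UTF-8 form is four bytes, none of which is valid by itself.
print(utf8_bytes.hex())
print([decodes_alone(utf8_bytes[i:i + 1]) for i in range(len(utf8_bytes))])

# Even a BMP character needs multi-byte handling in UTF-8.
bmp = "\u00e9"
print(len(utf16_units(bmp)), len(bmp.encode("utf-8")))
```

The practical difference is one of frequency: code that treats each
16-bit unit as a character misbehaves only for supplementary characters,
whereas code that treats each byte as a character misbehaves for
everything outside ASCII.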



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:20 EDT