Re: Perception that Unicode is 16-bit (was: Re: Surrogate space i

From: Joel Rees (
Date: Thu Feb 22 2001 - 22:40:51 EST


Thanks for the consideration. I threw my ego away years ago.

> Joel,
> > > Note that I am just sending a response to you, not to the list.
> >
> > I wouldn't mind this being on the list. I was making bad assumptions
> > about Sun's and others' reasons for wanting to do perverse things with
> > surrogate pairs, and this clears it up. I guess you want to reduce traffic
> > on the list?
> No, not necessarily. But I prefer not to say blunt, uncomplimentary
> things about other members of the Consortium on an open, public list.
> I just said this privately to you, so that you would realize that there
> are implementation issues here that were different from what you
> seemed to be driving at.
> > Now, I'm going to have to do the math and see what happens, but if I get
> > the results it sounds like I will get, then the Java char type really was a
> > poor choice, and similar engineering decisions need to be avoided in the
> > future, even to the extent of heavy evangelizing. Internal representation
> > probably does need to be 32 bit.
> The choice of UTF-16 was made for a whole series of reasons.
> Java chose a 16-bit char type because it was practical. There
> are some implementation issues with it, because they didn't fully
> allow for what UTF-16 would imply for the APIs. Many people who
> started out with 16-bit Unicode a decade ago have the same issues today
> in adapting to Unicode 3.1.
> But it isn't that hard to fix things, while retaining 16-bit code
> units. I've been doing that just recently for the Unicode library
> that Sybase uses. Microsoft, no doubt, has similar issues, because
> they standardized on a 16-bit unichar long ago.
> And while UTF-32 has certain processing advantages in some places,
> UTF-16 works just fine for most things. I know, because I've
> implemented it for all kinds of functionality. All my tables for
> properties, normalization, collation, and such are implemented in
> UTF-16 -- they're more space efficient, among other things. And
> all my string handling is UTF-16. It is only at certain unique
> points, such as in recursive functions for doing decomposition,
> where the extra overhead for dealing with UTF-16 makes UTF-32
> attractive enough that I convert locally to UTF-32 to do
> that processing, and then convert back.
> This stuff is not rocket science, though it may seem to be sometimes.
> --Ken
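Ken's pattern above -- keep everything in 16-bit code units, and convert to a 32-bit code point only locally, at the point where the processing needs it -- can be sketched roughly like this. This is a hypothetical Java sketch of my own (the class and method names are mine, not Ken's Sybase code), using the standard `Character` surrogate helpers:

```java
// Hypothetical sketch (names mine): process a UTF-16 string by
// combining surrogate pairs into 32-bit code points only at the
// point of use, then converting back -- the "convert locally to
// UTF-32" approach described above.
public class Utf16Walk {

    // Return the code point starting at index i in a UTF-16
    // code-unit array, combining a valid surrogate pair into one
    // 32-bit value; an unpaired unit is returned as-is.
    public static int codePointAt(char[] units, int i) {
        char c = units[i];
        if (Character.isHighSurrogate(c) && i + 1 < units.length
                && Character.isLowSurrogate(units[i + 1])) {
            return Character.toCodePoint(c, units[i + 1]);
        }
        return c;
    }

    public static void main(String[] args) {
        // 'A' followed by U+10000, which is the surrogate
        // pair D800 DC00 in UTF-16.
        char[] s = "A\uD800\uDC00".toCharArray();
        int i = 0;
        while (i < s.length) {
            int cp = codePointAt(s, i);          // local 32-bit value
            char[] back = Character.toChars(cp); // back to UTF-16
            System.out.printf("U+%04X (%d unit(s))%n", cp, back.length);
            i += back.length;
        }
    }
}
```

The string itself stays UTF-16 throughout; only the loop variable widens to 32 bits, which is why the space cost of UTF-32 never shows up in the stored tables.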

If you can look past my extreme opinions preferring common standards to
universal ones, I would appreciate hearing more about how you've managed your way
around the warps in the transformations. I think the folks at Sun and Oracle
might be interested, too. Have you tried sharing some of the key elements
with them, as a sort of bribe to get them away from trying to convert
surrogate pairs directly into UTF-8?
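For what it's worth, the distinction behind that last question is between decoding the surrogate pair to one code point before emitting UTF-8 (one well-formed 4-byte sequence) and encoding each surrogate code unit separately as if it were a character (six bytes of ill-formed UTF-8). A hypothetical Java sketch of the correct path -- the names are mine, for illustration only:

```java
// Hypothetical sketch (names mine): correct UTF-8 encoding of a
// supplementary character. The surrogate pair is decoded to a
// single code point FIRST, then emitted as one 4-byte sequence;
// encoding each surrogate separately would produce 6 bytes of
// ill-formed UTF-8.
public class SupplementaryUtf8 {

    // Encode one code point in U+10000..U+10FFFF as the 4-byte
    // UTF-8 sequence 11110uuu 10uuuuuu 10uuuuuu 10uuuuuu.
    public static byte[] utf8Supplementary(int cp) {
        return new byte[] {
            (byte) (0xF0 | (cp >>> 18)),
            (byte) (0x80 | ((cp >>> 12) & 0x3F)),
            (byte) (0x80 | ((cp >>> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        // Decode the pair D800 DC00 to U+10000 first ...
        int cp = Character.toCodePoint('\uD800', '\uDC00');
        // ... then emit one 4-byte sequence (F0 90 80 80 here).
        for (byte b : utf8Supplementary(cp)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```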


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT