RE: How will software source code represent 21 bit unicode characters?

From: addison@inter-locale.com
Date: Thu Apr 26 2001 - 12:23:25 EDT


On Mon, 23 Apr 2001, Mike Brown wrote:

> A char corresponds to a Unicode value -- a UTF-16 code value, which could
> either represent a Unicode character or one half of a surrogate pair. In the
> latter case, it would take a sequence of two "char"s to make one Unicode
> character. It is my understanding that Java's character encoding/decoding
> mechanisms can handle this sort of thing already. However, this is not
> obvious when looking at any Java platform documentation.
>
Actually, Java currently doesn't handle surrogate characters as anything
other than individual code points. You can blithely use an unpaired
surrogate and Java won't complain. Similarly, there is no way to access
the Unicode Scalar Value or any of the character attributes referred to by
a (valid) surrogate pair [which shouldn't be surprising, if you consider
that current JREs reflect an older Unicode standard in which no characters
are actually assigned "out there in the ethereal planes" and thus there
is no character information _to_ access].
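To give a sense of the bookkeeping this pushes onto the application, here is a rough sketch of my own (nothing like it exists in the current JDK API) of the arithmetic you'd have to do yourself today to get from a surrogate pair back to a scalar value; the character used is an arbitrary example:

    public class SurrogateDemo {
        // Combine a high and low surrogate into a Unicode scalar value.
        // This is plain arithmetic from the Unicode standard; the current
        // JRE provides no method that does it for you.
        static int toScalarValue(char high, char low) {
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        }

        public static void main(String[] args) {
            // U+10400 expressed as the surrogate pair D801 DC00.
            char high = '\uD801';
            char low = '\uDC00';
            System.out.println(Integer.toHexString(toScalarValue(high, low)));
            // prints 10400
        }
    }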

And the Java platform documentation is quite explicit about how
Unicode encodings are handled internally: you aren't supposed to
know! Each JRE can choose its own course. In fact, from John O'Conner's
presentation at TUC last fall, I suspect that the char == int relationship
in Sun's environment will be supplanted by a 32-bit representation for
the char datatype, while the String object will remain UTF-16
internally. How well this works and what this breaks remain to be
seen. Brian Beck is supposed to make a presentation on Unicode 3.0 support
in JDK 1.4 at JavaOne this year, which should be quite interesting in this
regard.
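For what it's worth, the UTF-16 nature of String is already visible at the API level: a single supplementary character occupies two chars, and length() counts code units rather than characters. A small illustration (again my own example, with an arbitrarily chosen character):

    public class StringLengthDemo {
        public static void main(String[] args) {
            // One supplementary character (U+10400) stored as the pair D801 DC00.
            String s = "\uD801\uDC00";
            System.out.println(s.length());
            // prints 2 -- code units, not characters
            System.out.println(Integer.toHexString(s.charAt(0)));
            // prints d801, the high surrogate
        }
    }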

Best Regards,

Addison

Addison P. Phillips
Globalization Architect
webMethods, Inc.


