RE: surrogate at java's property file

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Mon Oct 01 2001 - 21:33:36 EDT


But then, it's my day to be an idiot...

Of course an int can store more than 16 bits. It's char that's defined as
0..65535 in Java. Ints will work fine in the APIs. It's the chars that are
a problem.
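
A minimal sketch of that distinction, for illustration only. The class and
method names (CharVsInt, maxChar, fitsInChar) are hypothetical and not from
the thread; the only library facts used are the Character.MIN_VALUE and
Character.MAX_VALUE constants.

```java
public class CharVsInt {
    // Largest value a Java char can hold (0xFFFF, i.e. 65535).
    static int maxChar() {
        return Character.MAX_VALUE;
    }

    // True when the value fits in a single 16-bit char.
    static boolean fitsInChar(int codePoint) {
        return codePoint >= Character.MIN_VALUE && codePoint <= Character.MAX_VALUE;
    }

    public static void main(String[] args) {
        int u10000 = 0x10000;                        // an int holds this without trouble
        System.out.println(maxChar());               // 65535
        System.out.println(fitsInChar(u10000));      // false
        System.out.println((int) (char) u10000);     // 0: the narrowing cast keeps only the low 16 bits
    }
}
```

The last line shows why the chars are the problem: forcing U+10000 into a
char silently discards the high bits.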

Must be the heat. ;-)

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone) +1 408.210.3659 (mobile)
-------------------------------------------------
Internationalization is an architecture. It is not a feature.

-----Original Message-----
From: Addison Phillips [wM] [mailto:aphillips@webmethods.com]
Sent: Monday, October 01, 2001 6:24 PM
To: Yung-Fong Tang; unicode@unicode.org
Subject: RE: surrogate at java's property file

Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's
stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these
versions has defined characters in the supplemental planes.

In Java, a java.lang.Character object is closely tied to the definition of
an "int", the 16-bit numeric type. Many classes and objects make no
distinction (or worse, conflate a character with an int---many methods are
defined to take and return ints for "Characters"). As a result, the Java
character model appears to be tied to UCS-2 (and I don't mean UTF-16). A
surrogate character *is* recognized to be a surrogate, but a high-low pair
is not recognized as representing a character, nor can you retrieve the
character properties of the matched pair.
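
A short sketch of the behaviour described above, using only per-char calls
(Character.getType(char), the Character.SURROGATE constant and
String.length()). The class and method names are hypothetical, chosen for
illustration.

```java
public class SurrogateCheck {
    // True when the general category of this single char is "surrogate".
    static boolean isSurrogate(char c) {
        return Character.getType(c) == Character.SURROGATE;
    }

    // Number of 16-bit char units in the string; this is what String.length() reports.
    static int charUnits(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        String pair = "\ud800\udc00";                      // high surrogate followed by low surrogate
        System.out.println(charUnits(pair));               // 2: two chars, not one character
        System.out.println(isSurrogate(pair.charAt(0)));   // true
        System.out.println(isSurrogate(pair.charAt(1)));   // true
    }
}
```

Each half is classified as a surrogate on its own, but nothing in these
char-based calls treats the pair as one character with its own properties.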

So to property files. The java.lang.Character sequence U+D800 U+DC00 is
represented by the sequence "\ud800\udc00". This sequence does NOT represent
U+10000. It represents TWO Characters, which happen to be surrogates that
form a valid pair. I should point out that Java is slightly clever. For
example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar
value U+10000 and encodes it as a valid four byte sequence: f0-90-80-80 (and
vice versa, of course).
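
The property-file case can be sketched as follows. This is an illustration
under assumptions, not a statement about the JDK of the time: it uses
java.nio.charset.StandardCharsets, which only exists in later JDKs (Java 7
and up), and the class and helper names (SurrogateProperty, loadValue,
utf8Hex) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class SurrogateProperty {
    // Parses the given property-file text and returns the value stored under "key".
    static String loadValue(String fileText) {
        Properties props = new Properties();
        try {
            // Properties.load(InputStream) reads the file as ISO 8859-1.
            props.load(new ByteArrayInputStream(fileText.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty("key");
    }

    // Encodes the string as UTF-8 and formats the bytes as dash-separated lowercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes put two literal six-character escapes into the file text.
        String value = loadValue("key=\\ud800\\udc00\n");
        System.out.println(value.length());     // 2
        System.out.println(utf8Hex(value));     // f0-90-80-80
    }
}
```

The loaded value is two chars long, yet the UTF-8 encoder emits the single
four-byte sequence for U+10000, as described above.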

However, it is unclear how Unicode 3.1 support is going to make it into JDK
1.4++. The APIs are going to have to change to support the supplemental
planes and the ripple effects on various APIs seem like an interesting
problem. Perhaps they'll redefine an int to be a 32-bit value and switch
Java to UTF-32 (yeah, sure.....)

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone) +1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture. It is not a feature.

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
Behalf Of Yung-Fong Tang
Sent: Monday, October 01, 2001 5:10 PM
To: unicode@unicode.org
Subject: surrogate at java's property file

Does anyone know how Java handles surrogate pairs in property files?

Java's property files use the \u encoding for non-ASCII characters, so
U+00A5 is written \u00A5. How does that encoding handle a surrogate pair?

Is U+10000 (0xd800 0xdc00) encoded as "\u10000" or as "\ud800\udc00"? (I
think it should be \u10000.) Or can they not be handled at all?
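
For the five-digit form, a small experiment is easy to run. The sketch below
assumes that java.util.Properties consumes exactly four hex digits after the
escape introducer, which is how the escape is defined; the class and method
names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Properties;

public class FiveDigitEscape {
    // Parses the given property-file text and returns the value stored under "key".
    static String loadValue(String fileText) {
        Properties props = new Properties();
        try {
            // Property files read through load(InputStream) are ISO 8859-1.
            props.load(new ByteArrayInputStream(fileText.getBytes("ISO-8859-1")));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty("key");
    }

    public static void main(String[] args) {
        // File text: key= followed by a backslash, the letter u, and the digits 10000.
        String value = loadValue("key=\\u10000\n");
        System.out.println(value.length());                          // 2
        System.out.println(Integer.toHexString(value.charAt(0)));    // 1000
        System.out.println(value.charAt(1));                         // 0
    }
}
```

Under that assumption the five-digit form does not yield U+10000: the first
four digits become U+1000 and the trailing "0" is kept as an ordinary
character.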


