RE: surrogate at java's property file

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Wed Oct 03 2001 - 19:31:02 EDT


No fair! You forgot to quote my disclaimer in the next email for my big
boo-boo regarding what an int is in Java. An int is fine, darnit! It's char
that was originally (at least externally) limited to 16 bits. Of course,
many APIs use ints, which don't present a problem. But java.lang.Character
and java.lang.String would have to change internal representation or add
methods or something to allow surrogate pairs to be evaluated.
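
As a rough illustration of why an int has room for a supplementary character while a single char does not, the surrogate arithmetic can be done by hand. This is only a sketch: the class SurrogateMath and its method names are made up for this example and are not JDK APIs.

```java
// Hypothetical helper for illustration only; not part of the JDK.
final class SurrogateMath {

    /** Combines a high/low surrogate pair into one scalar value held in an int. */
    static int toScalar(char high, char low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        // Each surrogate carries 10 bits of the 20-bit offset above 0x10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    /** Splits a supplementary scalar value into a {high, low} surrogate pair. */
    static char[] toSurrogates(int scalar) {
        if (scalar < 0x10000 || scalar > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary scalar value");
        }
        int offset = scalar - 0x10000;
        return new char[] {
            (char) (0xD800 + (offset >> 10)),
            (char) (0xDC00 + (offset & 0x3FF))
        };
    }

    public static void main(String[] args) {
        // The pair discussed in this thread.
        System.out.println(Integer.toHexString(toScalar('\uD800', '\uDC00'))); // prints 10000
    }
}
```

The result needs 21 bits, so it fits in a 32-bit int but not in a 16-bit char.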

Addison

-----Original Message-----
From: Yung-Fong Tang [mailto:ftang@netscape.com]
Sent: Wednesday, October 03, 2001 4:17 PM
To: Addison Phillips [wM]
Cc: unicode@unicode.org; bcbeck@eng.sun.com
Subject: Re: surrogate at java's property file

Brian Beck:
What do you think ?

"Addison Phillips [wM]" wrote:

> Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's
> stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these
> versions has defined characters in the supplemental planes.
>
> In Java, a java.lang.Character object is closely tied to the definition of
> an "int", the 16-bit numeric type. Many classes and objects make no
> distinction (or worse, conflate a character with an int---many methods are
> defined to take and return ints for "Characters"). As a result, the Java
> character model appears to be tied to UCS-2 (and I don't mean UTF-16). A
> surrogate character *is* recognized to be a surrogate, but a high-low pair
> is not recognized as representing a character, nor can you retrieve the
> character properties of the matched pair.
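
A small sketch of the "recognized to be a surrogate" behaviour described above, using Character.getType(char) and the Character.SURROGATE category constant. The class name SurrogateTypeCheck is hypothetical. Each half of the pair is reported as a surrogate on its own; nothing in this char-based call looks at the pair as one character.

```java
// Hypothetical demo class; only Character.getType and Character.SURROGATE are JDK names.
final class SurrogateTypeCheck {

    /** True if the single 16-bit char falls in the surrogate general category. */
    static boolean isSurrogate(char c) {
        return Character.getType(c) == Character.SURROGATE;
    }

    public static void main(String[] args) {
        System.out.println(isSurrogate('\uD800')); // high half of the pair
        System.out.println(isSurrogate('\uDC00')); // low half of the pair
        System.out.println(isSurrogate('A'));      // ordinary BMP letter
    }
}
```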
>
> So to property files. The java.lang.Character sequence U+D800 U+DC00 is
> represented by the sequence "\ud800\udc00". This sequence does NOT represent
> U+10000. It represents TWO Characters, which happen to be surrogates that
> form a valid pair. I should point out that Java is slightly clever. For
> example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar
> value U+10000 and encodes it as a valid four byte sequence: f0-90-80-80 (and
> vice versa, of course).
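
The property-file behaviour described above can be checked with a sketch along these lines. It assumes a current JDK (java.nio.charset.StandardCharsets requires Java 7 or later); the class PropertySurrogateDemo and the property name "key" are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; the property name "key" is an assumption for illustration.
final class PropertySurrogateDemo {

    /** Parses property-file text held in memory and returns the value of "key". */
    static String loadValue(String fileText) {
        Properties props = new Properties();
        try {
            // Properties.load(InputStream) reads ISO-8859-1 bytes and decodes the escapes.
            props.load(new ByteArrayInputStream(fileText.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory stream
        }
        return props.getProperty("key");
    }

    /** Formats the UTF-8 encoding of a string as dash-separated lowercase hex bytes. */
    static String utf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes as literal text for the loader to decode.
        String value = loadValue("key=\\ud800\\udc00\n");
        System.out.println(value.length()); // prints 2
        System.out.println(utf8Hex(value)); // prints f0-90-80-80
    }
}
```

The loaded value is two chars long, yet the UTF-8 encoder emits one four-byte sequence for the pair.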
>
> However, it is unclear how Unicode 3.1 support is going to make it into JDK
> 1.4++. The APIs are going to have to change to support the supplemental
> planes and the ripple effects on various APIs seem like an interesting
> problem. Perhaps they'll redefine an int to be a 32-bit value and switch
> Java to UTF-32 (yeah, sure.....)
>
> Best Regards,
>
> Addison
>
> Addison P. Phillips
> Globalization Architect / Manager, Globalization Engineering
> webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
> +1 408.962.5487 (phone) +1 408.210.3569 (mobile)
> -------------------------------------------------
> Internationalization is an architecture. It is not a feature.
>
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Yung-Fong Tang
> Sent: Monday, October 01, 2001 5:10 PM
> To: unicode@unicode.org
> Subject: surrogate at java's property file
>
> Does anyone know how Java handles surrogate pairs in property files?
>
> Java's property files use the \u encoding for non-ASCII characters, so
> U+00A5 is written \u00A5. How does it handle a surrogate pair?
>
> Is U+10000 (0xd800 0xdc00) encoded as "\u10000" or as "\ud800\udc00"? (I
> think it should be \u10000.) Or can they not be handled at all?
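
For the "\u10000" form asked about above, a quick sketch of what java.util.Properties does with it on a current JDK: the escape consumes exactly four hex digits, so the text is read as U+1000 followed by the digit "0". The class FiveDigitEscapeDemo and the property name "key" are made up for this example, and StandardCharsets assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; the property name "key" is an assumption for illustration.
final class FiveDigitEscapeDemo {

    /** Parses property-file text held in memory and returns the value of "key". */
    static String loadValue(String fileText) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileText.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory stream
        }
        return props.getProperty("key");
    }

    public static void main(String[] args) {
        // Five hex digits after the escape prefix: only the first four are consumed.
        String value = loadValue("key=\\u10000\n");
        System.out.println(value.length());                           // prints 2
        System.out.println(Integer.toHexString(value.charAt(0)));     // prints 1000
        System.out.println(value.charAt(1));                          // prints 0
    }
}
```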



This archive was generated by hypermail 2.1.2 : Wed Oct 03 2001 - 18:07:07 EDT