Re: Java 5 Strings

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jan 04 2006 - 11:09:51 CST


    ----- Original Message -----
    From: "Mike Ayers" <mayers@celequest.com>
    To: <unicode@unicode.org>
    Sent: Wednesday, January 04, 2006 12:41 AM
    Subject: Re: Java 5 Strings

    > Markus Scherer wrote:
    >> Java Strings could always *store* all Unicode characters, BMP and
    >> supplementary. Java 5 adds APIs to handle supplementary characters
    >> explicitly, and implementation code (in regular expressions, for
    >> example) to handle them rather than ignore them.
    >
    > Thanks for properly interpreting my ill-phrased question.
    >
    >> Some Java (JRE) implementation code (for example text layout, some
    >> converters) could handle supplementary characters already in earlier
    >> Java versions.
    >
    > I read this to mean that some of the functions handled supplementary
    > characters properly and others did not, correct? If so, then it sounds
    > a little risky.

    Absolutely not. The existing APIs were not changed, and Strings continue to be handled internally as vectors of 16-bit "char" values, each one in fact containing a UTF-16 code unit (a supplementary character is encoded as two code units, a surrogate pair).
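
    To make the storage model concrete, here is a minimal sketch that uses only APIs available well before Java 5. The class name SurrogateStorage and the helper codeUnits are mine, chosen for illustration only; the example character is U+1D11E, written as its surrogate pair.

```java
class SurrogateStorage {
    // U+1D11E MUSICAL SYMBOL G CLEF, stored as two UTF-16 code units.
    static final String G_CLEF = "\uD834\uDD1E";

    // Number of 16-bit "char" code units in the String (what length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(G_CLEF));                     // 2
        System.out.println(Integer.toHexString(G_CLEF.charAt(0))); // d834 (high surrogate)
        System.out.println(Integer.toHexString(G_CLEF.charAt(1))); // dd1e (low surrogate)
    }
}
```

    One character, two "char" positions: that is all the String class itself needs to know in order to store it.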

    The additional support consists of new APIs working at the code point level (code points being represented as 32-bit "int" values) instead of just the code unit level, without changing the representation of Strings. Also, the Character class (whose instances can still only store a single 16-bit "char" code unit) has been extended with static (non-instance) methods to get the properties of any valid Unicode code point.
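
    A short sketch of those static methods, assuming a Java 5 or later runtime. The wrapper class CodePointProperties and its method names are hypothetical; the Character methods it calls (isLetter(int), charCount, toChars, toLowerCase(int)) are the ones added in Java 5.

```java
class CodePointProperties {
    // U+10400 DESERET CAPITAL LETTER LONG I, a supplementary (non-BMP) letter.
    static final int DESERET_LONG_I = 0x10400;

    // Property lookup on an "int" code point rather than on a "char".
    static boolean isLetter(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // How many "char" code units are needed to store this code point (1 or 2).
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Build an ordinary String from a code point; the String still holds chars.
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(isLetter(DESERET_LONG_I));           // true
        System.out.println(charsNeeded(DESERET_LONG_I));        // 2
        System.out.println(asString(DESERET_LONG_I).length());  // 2
        System.out.println(Integer.toHexString(
                Character.toLowerCase(DESERET_LONG_I)));        // 10428
    }
}
```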

    Other support was added as String instance methods that parse a String as a vector of code points.
    The previous APIs still work properly, but they work on "char" positions and so cannot get the properties of supplementary characters (if you use them, you still get the properties of the isolated surrogates; for non-surrogate BMP characters, nothing was wrong and nothing has changed).
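
    The difference between the two views can be sketched as follows, assuming Java 5 or later. The class CodePointIteration and the helper codePoints are illustrative names of mine; codePointAt, codePointCount and Character.charCount are the Java 5 methods.

```java
class CodePointIteration {
    // 'a', then U+10400 as a surrogate pair, then 'b': 4 chars, 3 code points.
    static final String TEXT = "a\uD801\uDC00b";

    // Walk the String by code point, advancing 1 or 2 char positions each step.
    static int[] codePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(TEXT.length());                // 4 code units
        System.out.println(codePoints(TEXT).length);      // 3 code points

        // Old char-based call: sees only the isolated high surrogate.
        System.out.println(Character.isLetter(TEXT.charAt(1)));       // false
        // New int-based call: sees the whole supplementary character.
        System.out.println(Character.isLetter(TEXT.codePointAt(1)));  // true
    }
}
```

    Note that the indexes passed to codePointAt are still "char" positions; only the returned value is a code point.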

    The encoding form converters were already working properly. See the new APIs as a pure extension: nothing was risky, given that the additions did not exist before.
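
    As an illustration of the converters, here is a sketch of a UTF-8 round trip that calls only String methods predating Java 5 (getBytes(String) and the String(byte[], String) constructor). The class Utf8RoundTrip and its helpers are hypothetical names; the byte values in the comments are what a current JRE produces for U+10400.

```java
import java.io.UnsupportedEncodingException;

class Utf8RoundTrip {
    // Encode the UTF-16 content of a String to UTF-8 bytes.
    static byte[] toUtf8(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported");
        }
    }

    // Decode UTF-8 bytes back into a String.
    static String fromUtf8(byte[] bytes) {
        try {
            return new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not supported");
        }
    }

    public static void main(String[] args) {
        String deseret = "\uD801\uDC00";           // U+10400 as a surrogate pair
        byte[] utf8 = toUtf8(deseret);
        System.out.println(utf8.length);           // 4 (bytes F0 90 90 80)
        System.out.println(fromUtf8(utf8).equals(deseret)); // true
    }
}
```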

    In ICU4J there are new datatypes for handling 32-bit code units/code points and for working with Unicode strings handled internally as vectors of code points, as well as conversion functions between Java's native String class and the alternate UString class. But using them is rarely justified (converting Strings has a significant performance cost on the VM).

    I recommend that you look at the Java 5 API documentation instead of assuming there was a bug (there was none, and you could very well work in Java 1.4 and lower with any valid Unicode string containing non-BMP characters, even without the ICU4J library, provided that your code properly handled surrogate "char"s).
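
    For what "properly handling surrogate chars" could look like without any Java 5 API, here is a minimal sketch based only on the UTF-16 surrogate ranges. The class ManualSurrogates and all of its method names are mine, for illustration; this is one possible approach, not the code of any particular library.

```java
class ManualSurrogates {
    // High (leading) surrogates occupy U+D800..U+DBFF.
    static boolean isHighSurrogate(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    // Low (trailing) surrogates occupy U+DC00..U+DFFF.
    static boolean isLowSurrogate(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    // Combine a valid surrogate pair into the code point it encodes.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Count code points; an unpaired surrogate is counted as one unit.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (isHighSurrogate(s.charAt(i))
                    && i + 1 < s.length()
                    && isLowSurrogate(s.charAt(i + 1))) {
                i++; // skip the low half of a well-formed pair
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a\uD801\uDC00b";
        System.out.println(countCodePoints(text)); // 3
        System.out.println(Integer.toHexString(
                toCodePoint(text.charAt(1), text.charAt(2)))); // 10400
    }
}
```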



    This archive was generated by hypermail 2.1.5 : Wed Jan 04 2006 - 11:22:06 CST