Re: Java 5 Strings

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jan 05 2006 - 07:02:45 CST

    From: "Markus Scherer" <markus.icu@gmail.com>
    To: <unicode@unicode.org>
    Sent: Thursday, January 05, 2006 1:06 AM
    Subject: Re: Java 5 Strings

    > On 1/4/06, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    >> In ICU4J there are new datatypes, for handling 32-bit code units/codepoints and for working with Unicode strings handled internally as vectors of codepoints, and conversion functions between java native String class and the alternate UString class. But using it is rarely justified (conversion of Strings has a significant performance cost on the VM).
    >
    > I don't know where you get this idea. ICU4J uses the regular Java
    > String class. It provides utility functions to work with code points
    > (as int values) much like what Java 5 added, and it correctly handles
    > surrogate pairs in Strings where appropriate, but there is no
    > separate/parallel ICU4J-specific String or UString class. It really
    > works like Java 5, except that you can use it with Java 1.4 as well.

    I should have checked the Java API again myself. Support for strings viewed as vectors of codepoints is in StringCharacterIterator and in UTF16.StringComparator (I confused this with the internals of ICU4C, when handling charset converters or on platforms with a 32-bit wchar_t). At one time in the development there was a class to represent strings as vectors of codepoints.

    Well, the actual name is UCharacter, not UChar. It's true that since Java 5 these ICU4J features are no longer essential, except when porting applications back to Java 1.4 or lower (for platforms that still don't have Java 5 support). ICU4J remains useful for its extensive support of additional charsets, iterators, normalizers, transliterators, IDNA/StringPrep, and better locale support, including for message formatting.
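
    To illustrate what the Java 5 codepoint support looks like on a regular String, here is a minimal sketch using only java.lang methods added in Java 5 (codePointAt, codePointCount, Character.charCount). The class name CodePointDemo and its helper methods are mine, for illustration only; they are not part of the JDK or of ICU4J.

```java
// Illustrative sketch: walking a regular String by codepoint
// with the methods added to java.lang in Java 5.
final class CodePointDemo {

    /** Number of codepoints in s (a surrogate pair counts once). */
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Copies the codepoints of s into an int array. */
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);     // reads a full surrogate pair if present
            out[n++] = cp;
            i += Character.charCount(cp);  // advance by 1 or 2 UTF-16 code units
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a', U+1D538 (encoded as the pair D835 DD38), 'b'
        String s = "a\uD835\uDD38b";
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("codepoints: " + countCodePoints(s));
    }
}
```

    The loop advances by Character.charCount(cp) rather than by 1, which is what keeps a surrogate pair from being seen as two separate characters.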

    Given the newer architecture in Java 1.5 for character sequences (of which String is now a specialization), it would make sense to have an alternate internal representation with 32-bit codepoints, if it helps reduce the complexity of text-handling code (notably for text recognizers).
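
    As a rough sketch of what such an alternate representation could look like, here is a hypothetical immutable class that stores one int per codepoint. The name CodePointString and its methods are my own invention, not an existing JDK or ICU4J class. It deliberately does not implement CharSequence, since that interface is indexed by 16-bit char and would bring the surrogate handling back in.

```java
// Hypothetical sketch: a string stored internally as 32-bit codepoints.
// Not a JDK or ICU4J class; names are illustrative only.
final class CodePointString {
    private final int[] codePoints;

    /** Converts a UTF-16 String to codepoints once, up front. */
    CodePointString(String s) {
        int[] cps = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            cps[n++] = cp;
            i += Character.charCount(cp);
        }
        this.codePoints = cps;
    }

    /** Length in codepoints, not in UTF-16 code units. */
    int length() {
        return codePoints.length;
    }

    /** Constant-time access to the codepoint at the given index. */
    int codePointAt(int index) {
        return codePoints[index];
    }

    /** Converts back to a regular String (constructor available since Java 5). */
    @Override
    public String toString() {
        return new String(codePoints, 0, codePoints.length);
    }
}
```

    With this layout every index refers to a whole codepoint, so a recognizer never has to test for surrogates. The price is the conversion at both ends and twice the memory for text that is mostly in the BMP.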


