Re: Java 5 Strings

From: Markus Scherer (markus.icu@gmail.com)
Date: Wed Jan 04 2006 - 18:06:40 CST


    On 1/4/06, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    > In ICU4J there are new datatypes for handling 32-bit code units/codepoints and for working with Unicode strings handled internally as vectors of codepoints, as well as conversion functions between the native Java String class and the alternate UString class. But using it is rarely justified (conversion of Strings has a significant performance cost on the VM).

    I don't know where you get this idea. ICU4J uses the regular Java
    String class. It provides utility functions to work with code points
    (as int values) much like what Java 5 added, and it correctly handles
    surrogate pairs in Strings where appropriate, but there is no
    separate/parallel ICU4J-specific String or UString class. It really
    works like Java 5, except that you can use it with Java 1.4 as well.
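
    For illustration, here is a minimal sketch of the code point style of
    working with a regular String, using only java.lang methods that Java 5
    added (String.codePointAt, String.codePointCount, Character.charCount).
    The class and method names are made up for this example, and ICU4J's own
    helpers are not shown.

```java
// Illustrative sketch only: CodePointDemo and its methods are hypothetical names.
// It relies solely on java.lang APIs introduced in Java 5.
final class CodePointDemo {

    /** Counts code points, so a surrogate pair counts once. */
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Returns the code points of s as int values. */
    static int[] toCodePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // reads a whole surrogate pair if present
            result[n++] = cp;
            i += Character.charCount(cp);   // advance by 1 or 2 code units
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', U+10400 (encoded as the surrogate pair D801 DC00), 'b'
        String s = "a\uD801\uDC00b";
        System.out.println("code units:  " + s.length());
        System.out.println("code points: " + countCodePoints(s));
        int[] cps = toCodePoints(s);
        for (int i = 0; i < cps.length; i++) {
            System.out.println("U+" + Integer.toHexString(cps[i]).toUpperCase());
        }
    }
}
```

    The String itself stays an ordinary java.lang.String throughout; only
    the way it is indexed changes.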

    > I recommend that you look at the Java 5 API documentation instead of assuming there was a bug (there was none, and you could very well work in Java 1.4 and lower with any valid Unicode string containing non-BMP characters, even without the ICU4J library, provided that your code properly handled surrogate "char"s).

    Except that most JDK implementation code, for example for regular
    expressions, BreakIterator, etc., simply treated surrogate code units
    as separate characters. Please see the document to which Naoto
    pointed.
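
    To make the difference concrete, here is a small sketch contrasting code
    that treats every char as a character with code that pairs up surrogates
    by hand, which is roughly what application code had to do before Java 5.
    The class and method names are hypothetical, and this is not taken from
    the JDK or ICU4J sources.

```java
// Illustrative sketch only: SurrogateDemo and its methods are hypothetical names.
// It avoids the Java 5 code point methods and checks surrogate ranges by hand.
final class SurrogateDemo {

    /** Treats each 16-bit code unit as a separate character. */
    static int naiveCount(String s) {
        return s.length();
    }

    /** Counts a well-formed surrogate pair as a single character. */
    static int surrogateAwareCount(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean isHigh = c >= 0xD800 && c <= 0xDBFF;
            if (isHigh && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i++; // skip the low surrogate that completes the pair
                }
            }
            count++; // unpaired surrogates are still counted individually
        }
        return count;
    }

    /** Combines a high and a low surrogate into one supplementary code point. */
    static int combine(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        String s = "\uD801\uDC00"; // U+10400 as a surrogate pair
        System.out.println("naive count:           " + naiveCount(s));
        System.out.println("surrogate-aware count: " + surrogateAwareCount(s));
        System.out.println("code point: U+"
                + Integer.toHexString(combine(s.charAt(0), s.charAt(1))).toUpperCase());
    }
}
```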

    markus

    --
    Opinions expressed here may not reflect my company's positions unless
    otherwise noted.
    

