Re: Java 5 Strings

From: Markus Scherer
Date: Thu Jan 05 2006 - 12:37:15 CST

    On 1/5/06, Philippe Verdy wrote:
    > I should have checked the Java API again myself: support for strings viewed as vectors of codepoints is in StringCharacterIterator and in UTF16.StringComparator. (I've been confused with the internals of ICU4C, when handling charset converters or on platforms with 32-bit wchar_t.)

    It is true that these classes support supplementary characters, but
    they do so with semantics of Unicode 16-bit strings - in particular,
    using 16-bit-unit indexes and offsets. These classes work with
    standard Java String, StringBuffer, etc.
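
    To illustrate what "16-bit-unit indexes and offsets" means in
    practice, here is a minimal sketch that uses only the standard Java 5
    String methods. The class name Utf16IndexDemo and the sample string
    are made up for this example; nothing here is taken from ICU4J.

```java
class Utf16IndexDemo {
    // "a", then U+1D11E MUSICAL SYMBOL G CLEF (a supplementary character,
    // stored as the surrogate pair D834 DD1E), then "b".
    static final String SAMPLE = "a\uD834\uDD1Eb";

    public static void main(String[] args) {
        // length() counts 16-bit code units, not code points.
        System.out.println(SAMPLE.length());                            // 4
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));  // 3

        // codePointAt() takes a code unit index. At the lead surrogate it
        // returns the whole supplementary code point.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1))); // 1d11e

        // At the trail surrogate it returns just that surrogate value.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(2))); // dd1e

        // Moving forward by one code point from code unit index 1
        // lands on code unit index 3, skipping both surrogates.
        System.out.println(SAMPLE.offsetByCodePoints(1, 1));            // 3
    }
}
```

    The supplementary character is fully accessible, but every index
    passed in or returned is still a position in the 16-bit code unit
    sequence.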

    > At one time in the development there was a class to represent strings as vectors of codepoints.

    I can't find a trace of such a class, except for suggestions by you on
    this list that Java *could* add such a thing. Searching the list
    archives for "icu4j ustring" yields, for example:

    "Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)" Mon
    Nov 15 2004

    "Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and
    Azeri, was: Accented ij ligatures)" Sat Jul 12 2003

    (see the list archive homepage for the login parameters)

    I also see an email from Mark Davis replying to you on Wed Jun 04 2003
    ("Re: Encoding converion through JDBC") with "In ICU4J (which is an
    add-on package for Java), we don't have classes UChar and UString."

    > Well the effective name is UCharacter, not UChar. It's true that since Java5 these ICU4J features are no longer essential, except when porting applications back to Java 1.4 or lower (for platforms that still don't have Java5 support). ICU4J remains useful for its extensive support of additional charsets, iterators, normalizers, transliterators, IDNA/StringPrep, better locale support including for message formatting.

    Thanks for the endorsement :-)

    > Given the newer architecture in Java1.5 for character sequences (of which String is now a specialization), it would make sense to have an alternate internal representation with 32-bit codepoints, if it helps reduce the complexity of text handling code (notably for text recognizers).

    This was discussed and rejected by the Java working group that
    designed the support for supplementary characters. It would be
    expensive to have an internal representation based on Unicode 32-bit
    strings while maintaining 16-bit code unit indexes and offsets. It
    would also create memory and cache bottlenecks. There are
    presentations and reports about the work of JSR 204 which detail the
    options and choices.
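
    The cost argument can be sketched with a hypothetical class. The
    name CodePointBackedSequence and its design are assumptions made up
    for this illustration; it is not a class from the JDK, from ICU4J, or
    from the JSR 204 materials. It stores one 32-bit int per code point
    but still has to answer charAt() and length() in 16-bit code units,
    as the CharSequence contract requires.

```java
final class CodePointBackedSequence implements CharSequence {
    // Hypothetical 32-bit storage: 4 bytes per code point, versus
    // 2 bytes per code unit in a char[]-backed string.
    private final int[] codePoints;

    CodePointBackedSequence(String s) {
        int count = s.codePointCount(0, s.length());
        codePoints = new int[count];
        for (int i = 0, unit = 0; i < count; i++) {
            codePoints[i] = s.codePointAt(unit);
            unit += Character.charCount(codePoints[i]);
        }
    }

    // Length in 16-bit code units: needs a scan (or an extra cached field).
    public int length() {
        int units = 0;
        for (int cp : codePoints) {
            units += Character.charCount(cp);
        }
        return units;
    }

    // Code unit index lookup: linear in the string length, because the
    // mapping from code unit index to array slot is not constant.
    public char charAt(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int unit = 0;
        for (int cp : codePoints) {
            int n = Character.charCount(cp);
            if (index < unit + n) {
                return Character.toChars(cp)[index - unit];
            }
            unit += n;
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    public CharSequence subSequence(int start, int end) {
        return toString().subSequence(start, end);
    }

    @Override
    public String toString() {
        return new String(codePoints, 0, codePoints.length);
    }
}
```

    In this sketch charAt() goes from constant time to a linear scan,
    and text consisting mostly of BMP characters takes twice the memory.
    A real design could add an index table to speed up lookups, at the
    price of still more memory.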

    By the way, CharSequence was introduced in Java 1.4, not Java 5, and
    ICU4J supports CharSequence.
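
    As a small illustration of CharSequence as a common interface, the
    sketch below passes a String, a StringBuffer and a
    java.nio.CharBuffer to the same method. The class and method names
    are made up for this example; Character.codePointCount(CharSequence,
    int, int) is a Java 5 addition, while the interface itself dates
    from Java 1.4.

```java
import java.nio.CharBuffer;

class CharSequenceDemo {
    // Counts code points in any CharSequence implementation.
    // The begin and end arguments are 16-bit code unit indexes.
    static int countCodePoints(CharSequence text) {
        return Character.codePointCount(text, 0, text.length());
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 4 code units, 3 code points
        System.out.println(countCodePoints(s));                   // 3
        System.out.println(countCodePoints(new StringBuffer(s))); // 3
        System.out.println(countCodePoints(CharBuffer.wrap(s)));  // 3
    }
}
```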

    Best regards,

    Opinions expressed here may not reflect my company's positions unless
    otherwise noted.
