Re: Java 5 Strings

From: Markus Scherer (markus.icu@gmail.com)
Date: Thu Jan 05 2006 - 12:37:15 CST

Next message: John Hudson: "Re: Cyrillic "borrowed" letterforms (unicode Digest V5 #301)"

Previous message: Philippe Verdy: "Re: Java 5 Strings"
In reply to: Philippe Verdy: "Re: Java 5 Strings"
Next in thread: Naoto Sato: "Re: Java 5 Strings"

On 1/5/06, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
> I should have checked the Java API again myself: support for strings viewed as vectors of codepoints is in StringCharacterIterator and in UTF16.StringComparator. (I had been confused with the internals of ICU4C, when handling charset convertors or on platforms with 32-bit wchar_t.)

It is true that these classes support supplementary characters, but
they do so with semantics of Unicode 16-bit strings - in particular,
using 16-bit-unit indexes and offsets. These classes work with
standard Java String, StringBuffer, etc.
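
To illustrate what 16-bit-unit indexes mean in practice, here is a minimal
sketch using only the standard Java 5 String and Character methods. The class
name and the sample string are made up for illustration.

```java
public class CodeUnitIndexing {
    // Counts code points rather than 16-bit code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a", U+1D11E MUSICAL SYMBOL G CLEF (stored as a surrogate pair), "b"
        String s = "a\uD834\uDD1Eb";

        System.out.println(s.length());          // 4 (code units)
        System.out.println(codePointLength(s));  // 3 (code points)

        // Indexes count 16-bit code units: index 1 is the lead surrogate,
        // but codePointAt(1) reads the whole supplementary character.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1d11e

        // Iterate by code point while the index still advances in code units.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("index %d: U+%04X%n", i, cp);
            i += Character.charCount(cp);
        }
    }
}
```

The loop visits indexes 0, 1 and 3: the supplementary character occupies two
index positions even though it is read as one code point.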

> At one time in the development there was a class to represent strings as vectors of codepoints.

I can't find a trace of such a class, except for suggestions by you on
this list that Java *could* add such a thing. Searching the archives of
this list for "icu4j ustring" yields, for example:

"Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)" Mon
Nov 15 2004 http://www.unicode.org/mail-arch/unicode-ml/y2004-m11/0140.html

"Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and
Azeri, was: Accented ij ligatures)" Sat Jul 12 2003
http://www.unicode.org/mail-arch/unicode-ml/y2003-m07/0257.html

(see the list archive homepage for the login parameters -
http://www.unicode.org/mail-arch/)

I also see an email from Mark Davis replying to you on Wed Jun 04 2003
("Re: Encoding converion through JDBC") with "In ICU4J (which is an
add-on package for Java), we don't have classes UChar and UString." -
http://www.unicode.org/mail-arch/unicode-ml/y2003-m06/0083.html

> Well, the actual name is UCharacter, not UChar. It's true that since Java5 these ICU4J features are no longer essential, except when porting applications back to Java 1.4 or lower (for platforms that still don't have Java5 support). ICU4J remains useful for its extensive support of additional charsets, iterators, normalizers, transliterators, IDNA/StringPrep, better locale support including for message formatting.

Thanks for the endorsement :-)

> Given the newer architecture in Java1.5 for character sequences (of which String is now a specialization), it would make sense to have an alternate internal representation with 32-bit codepoints, if it helps reduce the complexity of text handling code (notably for text recognizers).

This was discussed and rejected by the Java working group that
designed the support for supplementary characters. It would be
expensive to have an internal representation based on Unicode 32-bit
strings while maintaining 16-bit code unit indexes and offsets. It
would also create memory and cache bottlenecks. There are
presentations and reports about the work of JSR 204 which detail the
options and choices.
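
As a rough illustration of the cost (this is my own sketch, not anything from
the JSR 204 materials), consider a hypothetical CharSequence backed by an
int[] of code points. The class name is invented. To keep the 16-bit code unit
contract of charAt() and length(), it has to translate every index by scanning
from the start, unless it maintains extra index tables.

```java
public final class CodePointBackedSequence implements CharSequence {
    // Hypothetical storage: one 32-bit int per code point.
    private final int[] codePoints;

    public CodePointBackedSequence(String s) {
        codePoints = new int[s.codePointCount(0, s.length())];
        int k = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            codePoints[k++] = cp;
            i += Character.charCount(cp);
        }
    }

    // Length in 16-bit code units: needs a full pass (or a cached value).
    @Override
    public int length() {
        int units = 0;
        for (int cp : codePoints) {
            units += Character.charCount(cp);
        }
        return units;
    }

    // A code unit index must be mapped to a code point by a linear scan.
    @Override
    public char charAt(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int unit = 0;
        for (int cp : codePoints) {
            int width = Character.charCount(cp);
            if (index < unit + width) {
                return Character.toChars(cp)[index - unit];
            }
            unit += width;
        }
        throw new IndexOutOfBoundsException("index " + index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // Simplest correct choice for a sketch: convert back to UTF-16 first.
        return toString().subSequence(start, end);
    }

    @Override
    public String toString() {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        CharSequence seq = new CodePointBackedSequence("a\uD834\uDD1Eb");
        System.out.println(seq.length());                             // 4
        System.out.println(Integer.toHexString(seq.charAt(2)));       // dd1e
    }
}
```

In this sketch charAt() is linear in the index where String.charAt() is a
plain array access, and text that is entirely in the BMP takes four bytes per
character instead of two.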

By the way, CharSequence was introduced in Java 1.4, not Java 5, and
ICU4J supports CharSequence.
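
For example, a method written against CharSequence works on String,
StringBuffer and StringBuilder alike. The sketch below uses only JDK calls;
note that the interface dates from Java 1.4, while
Character.codePointAt(CharSequence, int) was added in Java 5. The class and
method names are made up for illustration.

```java
public class CharSequenceCodePoints {
    // Counts supplementary characters in any CharSequence implementation.
    static int countSupplementary(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            if (Character.isSupplementaryCodePoint(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb";
        StringBuilder sb = new StringBuilder()
                .appendCodePoint(0x1D11E)
                .appendCodePoint(0x10000)
                .append('x');

        System.out.println(countSupplementary(s));   // 1
        System.out.println(countSupplementary(sb));  // 2
    }
}
```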

Best regards,
markus

--
Opinions expressed here may not reflect my company's positions unless
otherwise noted.


This archive was generated by hypermail 2.1.5 : Thu Jan 05 2006 - 12:43:45 CST