Re: Canonical equivalence in rendering: mandatory or recommended?

From: Markus Scherer
Date: Wed Oct 15 2003 - 11:32:34 CST

Philippe Verdy wrote:
> ... In fact, to further optimize and reduce the
> memory footprint of Java strings, I chose to store
> the String in an array of bytes with UTF-8, instead of an
> array of chars with UTF-16. The internal representation is

Whether this saves space and time depends on the average string contents and on what kind
of processing you do.
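
To illustrate the space side of that trade-off, here is a minimal sketch that compares the storage
needed for the same text in UTF-8 and in UTF-16. The class and method names (EncodedSize,
utf8Bytes, utf16Bytes) are made up for this example, and object overhead is ignored.

```java
import java.nio.charset.StandardCharsets;

public class EncodedSize {
    // Bytes needed to store the text as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to store the text as UTF-16: two per 16-bit code unit (no BOM).
    static int utf16Bytes(String s) {
        return 2 * s.length();
    }

    public static void main(String[] args) {
        String ascii = "hello";             // five ASCII letters
        String cjk = "\u65E5\u672C\u8A9E";  // three CJK ideographs
        // ASCII text: UTF-8 is half the size (prints "5 vs 10").
        System.out.println(utf8Bytes(ascii) + " vs " + utf16Bytes(ascii));
        // CJK text: UTF-8 is larger (prints "9 vs 6").
        System.out.println(utf8Bytes(cjk) + " vs " + utf16Bytes(cjk));
    }
}
```

The time side depends on the access pattern: indexing by position is constant-time on a 16-bit
array but requires scanning in a variable-width UTF-8 byte array.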

> chosen dynamically, depending on usage of that string: if
> the string is not accessed often with char indices (which in
> Java does not return actual Unicode codepoint indices as
> there may be surrogates) the UTF-8 representation uses less
> memory in most cases.
> It is possible, with a custom class loader, to override the default
> String class used in the Java core libraries (note that compiled
> Java .class files use UTF-8 for internally stored String constants,

No. It's close to UTF-8, but .class files use a proprietary encoding instead of UTF-8. See the
.class file documentation from Sun.
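
The difference can be observed from Java itself. As far as I know, DataOutputStream.writeUTF()
produces a "modified UTF-8" of the same kind as the one used for string constants in .class files;
treat that correspondence as an assumption and check the .class file documentation for the exact
rules. The sketch below compares it with standard UTF-8. The class name ModifiedUtf8Demo and its
helper methods are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Encode with writeUTF() and strip the 2-byte length prefix that it writes first.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encode with the standard UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\u0000";         // U+0000
        String supp = "\uD800\uDC00";  // U+10000 as a surrogate pair
        // U+0000: two bytes (C0 80) in the modified form, one byte in UTF-8.
        System.out.println(modifiedUtf8(nul).length + " vs " + standardUtf8(nul).length);
        // U+10000: each surrogate is encoded separately (3 + 3 bytes) in the
        // modified form; standard UTF-8 uses a single 4-byte sequence.
        System.out.println(modifiedUtf8(supp).length + " vs " + standardUtf8(supp).length);
    }
}
```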

> as this allows independence from the architecture, and this is the
> class loader that transforms the bytes storage of String constants
> into actual chars storage, i.e. currently UTF-16 at runtime.)
> Looking at the Java VM specification, there does not
> seem to be anything implying that a Java "char" is necessarily a
> 16-bit entity. So I think that there will someday be a conforming
> Java VM that will return UTF-32 codepoints in a single char, or
> some derived representation using 24-bit storage units.

I don't know about the VM spec, but the language and its APIs have 16-bit chars wired deeply into
them. It would be possible to _add_ a new char32 type, but that is not planned, as far as I know.
_Changing_ char would break all sorts of code. However, as far as I have heard, a future Java
release may provide access to Unicode code points and use ints for them.

(And please do not confuse using a single integer for a code point with UTF-32 - UTF-32 is an
encoding form for _strings_ requiring a certain bit pattern. Integers containing code points are
just that, integers containing code points, not any UTF.)
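
As a sketch of what "ints for code points" looks like in practice, the following uses the code
point methods that were added to String and Character in Java 5, which postdates this message. The
string itself stays a sequence of 16-bit chars; only individual code points are handed out as
ints. The class name CodePoints and the helper countCodePoints are made up for this example.

```java
public class CodePoints {
    // Number of Unicode code points in a 16-bit (UTF-16) string.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D50A, which needs a surrogate pair in UTF-16.
        String s = "a\uD835\uDD0A";

        System.out.println(s.length());          // 3 (16-bit code units)
        System.out.println(countCodePoints(s));  // 2 (code points)

        // The code point is a plain int, not a one-element UTF-32 string.
        int cp = s.codePointAt(1);
        System.out.println(Integer.toHexString(cp));  // 1d50a

        // Converting back yields the two 16-bit units of the surrogate pair.
        char[] units = Character.toChars(cp);
        System.out.println(units.length);        // 2
    }
}
```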

> So there already are some changes of representation for Strings in
> Java, and similar techniques could be used as well in C#, ECMAScript,
> and so on...

I am quite confident that existing languages like these will keep using 16-bit Unicode strings, for
the same reasons as for Java: Changing the string units would break all kinds of code.

Besides, most software with good Unicode support and non-trivial string handling uses 16-bit Unicode
strings, which avoids transformations where software components meet.

> ... Depending on runtime
> tuning parameters, the internal representation of String objects may
> (should) become transparent to applications. One future goal

The internal representation is already transparent in languages like Java. The API behavior has to
match the documentation, though, and cannot be changed on a whim.

> would be that a full Unicode String API will return real characters
> as grapheme clusters of varying length, in a way that can be
> comparable, orderable, etc... to better match what the users
> consider as a string "length" (i.e. a number of grapheme clusters,
> if not simply a combining sequence if we exclude the more complex
> case of Hangul Jamos and Brahmic clusters).

This is overkill for low-level string handling, and is available via library functions. Such library
functions might be part of a language's standard libraries, but won't replace low-level access
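
For example, Java's standard library offers java.text.BreakIterator for this. The sketch below
counts user-perceived characters with it while the string remains an ordinary 16-bit String.
Exactly where the boundaries fall depends on the rules implemented by the library version in use,
so take this as an illustration only. The class name GraphemeCount and the helper countClusters
are made up for this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCount {
    // Count character boundaries as seen by the library's character iterator.
    static int countClusters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        // Each call to next() advances to the following boundary, or returns DONE.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' + U+0301 COMBINING ACUTE ACCENT, followed by 'x'.
        String s = "e\u0301x";
        System.out.println(s.length());        // 16-bit code units
        System.out.println(countClusters(s));  // user-perceived characters
    }
}
```

Low-level code keeps indexing by code unit; only the code that needs user-level "characters" pays
for the boundary analysis.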

Best regards,

Opinions expressed here may not reflect my company's positions unless otherwise noted.
