RE: Canonical equivalence in rendering: mandatory or recommended?

From: Jill Ramonsky
Date: Wed Oct 15 2003 - 08:23:00 CST

> -----Original Message-----
> From: Philippe Verdy
> The same
> optimization can be done in Java by subclassing the String
> class to add a "form" field and related form conversion (getters)
> and test methods.

Only slightly confused about this. The Java String class is declared
*final* in the API, and therefore cannot be subclassed. One would have
to write an alternative String class (not rocket science of course, but
still a tad more involved than subclassing).
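
For what it's worth, the "alternative class" route might look roughly like
the sketch below: a wrapper that holds a String plus a form tag, rather
than a subclass. The names FormTaggedString, of, isIn and to are made up
for illustration, and I'm assuming java.text.Normalizer (which only
arrived in later Java releases) for the actual conversion. It is a guess
at the shape of the idea, not Philippe's implementation.

```java
import java.text.Normalizer;

// Hypothetical stand-in for "String plus a form field". String is final,
// so this class wraps one instead of extending it.
final class FormTaggedString implements CharSequence {
    private final String text;
    private final Normalizer.Form form; // null means "form not yet known"

    private FormTaggedString(String text, Normalizer.Form form) {
        this.text = text;
        this.form = form;
    }

    // Wrap a string whose normalization form has not been checked.
    static FormTaggedString of(String text) {
        return new FormTaggedString(text, null);
    }

    // The "test" method: use the cached tag if it matches, else scan the text.
    boolean isIn(Normalizer.Form wanted) {
        return form == wanted || Normalizer.isNormalized(text, wanted);
    }

    // The conversion "getter": skip the work when the tag already matches.
    FormTaggedString to(Normalizer.Form wanted) {
        if (form == wanted) {
            return this;
        }
        return new FormTaggedString(Normalizer.normalize(text, wanted), wanted);
    }

    @Override public int length() { return text.length(); }
    @Override public char charAt(int index) { return text.charAt(index); }
    @Override public CharSequence subSequence(int start, int end) {
        return text.subSequence(start, end);
    }
    @Override public String toString() { return text; }
}

class FormTaggedStringDemo {
    public static void main(String[] args) {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed)
        FormTaggedString decomposed = FormTaggedString.of("e\u0301");
        FormTaggedString composed = decomposed.to(Normalizer.Form.NFC);
        System.out.println(decomposed.length() + " -> " + composed.length()); // 2 -> 1
    }
}
```

The catch is the one already mentioned: nothing that takes a String will
accept this class directly, so callers have to go through toString() or
be written against CharSequence.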

> In fact, to further optimize and reduce the
> memory footprint of Java strings, I chose to store
> the String in an array of bytes

Okay. That explains that then.

> It is possible, with a custom class loader, to override the default
> String class used in the Java core libraries

Ouch. Never taken Java that far myself. I like the idea though. Is it

> Looking at the Java VM specification, there does not
> seem to be anything implying that a Java "char" is necessarily a
> 16-bit entity. So I think that there will someday be a conforming
> Java VM that will return UTF-32 code points in a single char, or
> some derived representation using 24-bit storage units.

I've wondered about that ever since Unicode went to 21 bits. Actually of
course, it's C (and C++), not Java, which has the real problem. C is
(supposed to be) portable, but fast on all architectures, so all of the
built-in types have platform-dependent widths. (So far so good). The
annoying thing is that, BY DEFINITION, the *sizeof()* operator returns
the size of an object /measured in chars/. Therefore, it is a violation
of the rules of C to have an addressable object smaller than a char. One
/can/ have 32-bit chars, but /only/ if you disallow bytes and 16-bit
words. *sizeof()* is not allowed to return a fraction. Sigh! If only C
had seen fit to measure addressable locations in /bits/, or even
architecture-specific /atoms/ (which would have been 8 bits wide on most
systems), then we could have had sizeof(char) returning 4 or something.
Ah well.
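
Coming back to the Java half of this for a moment: as far as I can tell,
the Java Language Specification does pin char to 16 bits, so in practice
the 21-bit range shows up as pairs of chars rather than as a wider char.
A small sketch of what that looks like, using only standard java.lang
calls from later Java releases (the class name CharWidthDemo and the
helper charUnits are made up for illustration):

```java
class CharWidthDemo {
    // Number of 16-bit char units Java needs to hold one code point.
    static int charUnits(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int gClef = 0x1D11E; // U+1D11E MUSICAL SYMBOL G CLEF, beyond 16 bits
        String s = new String(Character.toChars(gClef));

        System.out.println(Character.SIZE);                  // 16
        System.out.println(charUnits('A'));                  // 1
        System.out.println(charUnits(gClef));                // 2
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```

So length() counts storage units, not characters, which is the Java
flavour of the same "what is the unit" question that sizeof raises in C.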

> This leads to many discussions about what is a "character"

I think we just had that discussion. If it happens again I'm probably
not going to join in (though it was quite amusing).

