Re: Canonical equivalence in rendering: mandatory or recommended?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Oct 15 2003 - 11:36:15 CST


From: "Nelson H. F. Beebe" <beebe@math.utah.edu>
> >> * char, whose values are 16-bit unsigned integers representing Unicode
> >> characters (section 2.1)

Maybe I have missed this line, but the Java bytecode instruction set does
treat a native char as not equivalent to a short, as there are explicit
conversion instructions between chars and the other integer types.
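
(A small source-level illustration of that separation, as I understand it:
converting between char and short needs an explicit cast in both directions,
which javac compiles to the i2s / i2c conversion instructions.)

    char c = '\u0041';
    short s = (short) c;   // explicit cast required: char -> short
    char d = (char) s;     // explicit cast required: short -> char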

Java programs that assume that a char is 16 bits wide would have a big
problem because of the way String constants are serialized within class
files.

Of course it would be hard to include new primitive types in the JVM
(because it would require adding new instructions). And for the same reason,
the Java language "char" type is bound to the JVM native char type and cannot
be changed, meaning that String objects are bound to contain only "char"
elements (but this is enforced only through the String and StringBuffer
methods, as the backing array is normally not accessible).

I don't see where the issue is if char is widened from 16 bits to 32 bits
while keeping the existing bytecode instruction set and the String API (for
compatibility, the existing methods should use indices as if the backing
array were storing 16-bit entities, even if in fact it would now store
32-bit chars).

But an augmented set of methods could directly use the new indices into the
native 32-bit chars. Additionally, there could be an option bit set in
compiled classes to specify the behavior of the String API: with this bit
set, the class loader would bind the String API methods to the new 32-bit
version of the core library, and without this bit set (legacy compiled
classes), they would use the compatibility 16-bit API.

The javac compiler already sets version information for the target JVM, and
thus could be used to compile a class against the new 32-bit API instead of
the legacy one: in this mode, for example, the String.length() method in the
source would be compiled to call the String.length32() method of the new
JVM, or be remapped to String32.length(), with a replacement class name
(I use String32 here, but in fact it could be the UString class of ICU).
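
As an illustration only (String32 and length32 are names from this proposal,
not existing APIs), the remapped behavior can be sketched today with a small
wrapper that counts 32-bit chars over the current 16-bit String:

    final class String32 {
        // Count "32-bit chars" (code points) in a standard 16-bit String,
        // treating a valid surrogate pair as a single unit.
        static int length(String s) {
            int n = 0;
            for (int i = 0; i < s.length(); i++) {
                n++;
                char c = s.charAt(i);
                if (c >= '\uD800' && c <= '\uDBFF' && i + 1 < s.length()) {
                    char next = s.charAt(i + 1);
                    if (next >= '\uDC00' && next <= '\uDFFF') i++; // skip the low surrogate
                }
            }
            return n;
        }
    }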

I am not convinced that Java must be bound to a fixed size for its "char"
type, as it already includes "byte", "short", "int", and "long" for integer
types with known sizes (respectively 8, 16, 32, and 64 bits), and the JVM
bytecode instruction set clearly separates the native char type from the
native integer types, and does not allow arithmetic operations between chars
and integer types without an explicit conversion.

Note also that JNI is not altered by this change: when a JNI program uses
GetStringUTFChars(), it expects a UTF-8 string. When it uses
GetStringChars(), it expects a UTF-16 string (not necessarily the native
char encoding seen in Java), and an augmented JNI interface could define,
say, a GetStringUTF32Chars() if one wants maximum performance with no
conversion from the internal backing store used in the JVM.

To me the "caload" instruction, for example, just returns the "char"
(whatever its size) at a fixed and known index. There's no way to break this
"char" into its bit components without an explicit conversion to a native
integer type: the char is unbreakable, and the same is true for String
constants (and that's why they can be re-encoded internally into UTF-8 in
compiled classes, and why String constants can't be used to reliably store
arbitrary arrays of 16-bit integers, because of surrogates, as a String
constant cannot include invalid surrogate pairs).
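
(A small illustration of that point at the source level, with an explicit
conversion to int before any bit access:)

    char[] a = { '\u20AC' };
    int bits = (int) a[0];      // reading the element compiles to "caload"
    int highByte = bits >>> 8;  // the bits are only reachable on the int value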

Apart from String constants, all other strings in Java can be built from
serialized data only through an encoding converter working on arrays of
native integers. It's up to the converter (not to the JVM itself) to parse
the native integers in the array to build the internal String. The same is
true for StringBuffers, which could just as well use an internal backing
store with 32-bit chars.
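
For example, building a String from serialized bytes always goes through a
charset converter (here the standard UTF-8 decoder); how the resulting
String is stored internally is the JVM's own business:

    // U+10000 serialized as UTF-8; the decoder, not the application, parses it.
    byte[] data = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
    String s = new String(data, "UTF-8"); // may throw UnsupportedEncodingException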

So the real question when making this change from 16-bit to 32-bit is
whether and how it will affect the performance of existing applications: for
Strings, this may be an issue if conversions are needed to get a UTF-16 view
of a String that internally uses a UTF-32 backing store. But in fact, an
internal (private) "form" attribute could store this indicator, so that the
construction of Strings would not suffer from performance issues. In fact,
if the JVM could internally manage several alternate encoding forms for
Strings to reduce the footprint of a large application, and just provide a
"char" view through an emulation API to applications, this could benefit
performance (notably in server/networking applications using lots of
strings, such as SQL engines like Oracle that include a Java VM).
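
A rough sketch of that idea (FlexString, length16 and length32 are
illustrative names of mine, not existing classes):

    final class FlexString {
        static final byte FORM_UTF16 = 0;
        static final byte FORM_UTF32 = 1;

        private final byte form;     // which encoding form the backing array uses
        private final char[] utf16;  // valid when form == FORM_UTF16
        private final int[] utf32;   // valid when form == FORM_UTF32

        FlexString(char[] units) { form = FORM_UTF16; utf16 = units; utf32 = null; }
        FlexString(int[] codePoints) { form = FORM_UTF32; utf16 = null; utf32 = codePoints; }

        // Length as seen by the legacy 16-bit "char" view (emulated over UTF-32).
        int length16() {
            if (form == FORM_UTF16) return utf16.length;
            int n = 0;
            for (int i = 0; i < utf32.length; i++) n += (utf32[i] > 0xFFFF) ? 2 : 1;
            return n;
        }

        // Length as seen by the new 32-bit view (emulated over UTF-16).
        int length32() {
            if (form == FORM_UTF32) return utf32.length;
            int n = 0;
            for (int i = 0; i < utf16.length; i++) {
                n++;
                if (utf16[i] >= '\uD800' && utf16[i] <= '\uDBFF'
                        && i + 1 < utf16.length
                        && utf16[i + 1] >= '\uDC00' && utf16[i + 1] <= '\uDFFF') {
                    i++; // a valid surrogate pair counts as one 32-bit char
                }
            }
            return n;
        }
    }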

What would I see if I were a programmer in such an environment? The main
change would be in the character properties, where new Unicode block
identifiers would become accessible outside the BMP, and no "char" would
be identified as a low or high surrogate.

It would be quite simple to know whether the 32-bit char environment is in
use:
    final String test = "\uD800\uDC00"; // U+10000
this compiles a String constant into its UTF-8 serialization in the .class
file, thus compiling these bytes into the init data section of that .class
file:
    const byte[] test$ = {0xF0,0x90,0x80,0x80};

Whether the environment is 16-bit-char or 32-bit-char, the class loader
parses the .class file as a UTF-8 sequence, returning the single code point
U+10000. It is then an internal decision of the JVM to store it with a
16-bit or a 32-bit backing store, i.e. as a single {(char)'\U00010000'} or
as two {(char)'\uD800', (char)'\uDC00'}. Neither the existing String API nor
already compiled classes targeted at previous versions of the JVM need to be
changed.
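
So, under this proposal, detecting the environment would be as simple as the
(hypothetical) check below, using the constant above:

    // Hypothetical: on the proposed 32-bit-char VM the supplementary
    // character counts as one char; on today's 16-bit-char VM it counts
    // as a surrogate pair of two chars.
    final boolean wideChars = (test.length() == 1);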

For source code however, a compile-time flag could target the new VM and
automatically substitute java.lang.UString for java.lang.String, with an
alias java.lang.String16 available if one wants to force the emulated
(derived) 16-bit API on top of the new 32-bit String backing store. In the
classes compiled for the new VM, there would only remain references to
java.lang.String and java.lang.UString, and the code would still run on
older VMs, provided that an installable package for those VMs provides the
UString class. If needed, class names referenced in .class files would be
resolved and made accessible through reflection according to the
compatibility indicator detected by the class loader, so that even class
names would not need to be altered.

Known caveat with this change: if an application compares the successor of
'\uFFFF' with '\u0000', the comparison will yield true in a 16-bit-char
environment and false in the newer 32-bit environment. That's why the target
VM indicator inserted by the compiler in the .class file needs to be used:
it defines (alters) the semantics of the char native type, so that a forced
16-bit truncation is applied for classes in compatibility mode.
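
The caveat in code form (this is today's 16-bit behavior; under the proposed
32-bit char, the increment would not wrap unless the compatibility
truncation is forced):

    char c = '\uFFFF';
    c++;                                // today: the increment wraps modulo 2^16
    boolean wrapped = (c == '\u0000');  // true on a 16-bit-char VM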

However I still don't see any reason why char could not be 32-bit. This is
quite similar to the memory page mode of x86 processors, which allows
preserving the semantics of 16-bit programs in 32-bit environments (notably
for pointer sizes stored in the processor stack).


