Re: Opinions on this Java URL?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Nov 13 2004 - 20:27:03 CST


    From: "A. Vine" <andrea.vine@sun.com>
    >> I'm just curious about the \0 thing. What problems would having a \0 in
    >> UTF-8 present, that are not presented by having \0 in ASCII? I can't see
    >> any advantage there.
    >
    > Beats me, I wasn't there. None of the Java folks I know were there
    > either.

    The problem is in the way strings passed to JNI via the legacy
    *UTF() APIs are accessed: there's no indicator of the string length, so it
    would be impossible to know whether a \0 terminates the string if it were
    allowed in the content of the string data.
    The C0 80 encoding is a way to escape this character, so that it can be
    passed to JNI using the legacy *UTF() APIs that have existed since Java 1.0.
    This encoding is also part of the Java class file format, where string
    constants are encoded the same way. Note that the Java String object allows
    storing ANY UTF-16 code unit, including the noncharacters 0xFFFE and 0xFFFF,
    as well as isolated or unpaired surrogates. So Java internally does not use
    UTF-16 strictly. Using a plain UTF-8 representation would have prevented the
    class format from supporting such string instances, which are invalid for
    Unicode but not in Java. Using CESU-8 would not work either, as it still
    encodes \0 as a single null byte.
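    A small illustration: java.io.DataOutputStream.writeUTF emits this same
    modified UTF-8 (a 2-byte big-endian length prefix, then the escaped data),
    so the C0 80 escaping of \0 can be observed directly, and a String literal
    can hold an unpaired surrogate or a noncharacter without complaint:

    ```java
    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtfDemo {
        public static void main(String[] args) throws IOException {
            // writeUTF uses the same "modified UTF-8" as the class file
            // format: U+0000 is escaped as the two bytes C0 80.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF("\u0000");
            byte[] b = buf.toByteArray();
            // b[0..1] is the big-endian byte length (2), then the escaped NUL.
            System.out.printf("%02X %02X %02X %02X%n",
                    b[0] & 0xFF, b[1] & 0xFF, b[2] & 0xFF, b[3] & 0xFF);
            // -> 00 02 C0 80

            // A Java String may hold any 16-bit code unit, including an
            // unpaired surrogate or the noncharacter U+FFFF.
            String odd = "\uD800\uFFFF";
            System.out.println(odd.length());         // 2
            System.out.println((int) odd.charAt(0));  // 55296 (0xD800)
        }
    }
    ```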

    There are legacy Java applications that use the String object to store
    unrestricted arrays of unsigned 16-bit integers (the Java native type
    "char"), without any implication that they represent valid characters.
    This representation has the advantage of allowing fast loading of classes
    containing large constant pools: such classes avoid the long class
    initialization code generated when initializing an array of an integer
    type, and instead use the String constant pool, which is decoded and
    loaded into chars directly by native CPU code in the JVM rather than by
    interpreted bytecode that will never be compiled. This may seem a bad
    programming practice, but the Java language specs allow it, and Sun will
    not remove the possibility without breaking compatibility with those
    programs.
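    A hedged sketch of that trick (the table values here are arbitrary, chosen
    only to show that raw 16-bit values, including an unpaired surrogate, can
    live in a String constant):

    ```java
    public class CharTableDemo {
        // A lookup table packed into a String constant. The chars are raw
        // unsigned 16-bit values, not text: loading the class decodes one
        // constant-pool entry natively, instead of running bytecode that
        // stores each element of an int[] or char[] one by one.
        private static final String TABLE = "\u0001\u0010\u0400\uD800\uFFFF";

        static int lookup(int i) {
            return TABLE.charAt(i);  // charAt returns the raw unsigned value
        }

        public static void main(String[] args) {
            System.out.println(lookup(2));  // 1024 (0x0400)
            System.out.println(lookup(3));  // 55296: an unpaired surrogate
            System.out.println(lookup(4));  // 65535: the noncharacter 0xFFFF
        }
    }
    ```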

    This "modified UTF" should then be regarded as a specific encoding scheme
    that supports the unrestricted encoding form used by Java String instances
    (extended UTF-16, more exactly UCS-2) which, by initial design, can
    represent and store *more* than just valid Unicode strings.

    The newer JNI interface allows reading/returning String instance data
    directly in the UCS-2 encoding form, without using the specific "modified
    UTF" encoding scheme: there's an API parameter for the actual string
    length, so the interface is binary-safe. Applications can then use it to
    pass any valid Unicode string, or even invalid ones (with noncharacter
    code units or unpaired surrogates) if they wish. There's no requirement
    that this data represent only true characters. Note that even Windows uses
    an unrestricted UCS-2 representation in its "Unicode-enabled" Win32 APIs.

    The newer UCS-2 interface is enough for JNI extensions to generate true
    UTF-8 if they wish. I don't see the point of adding separate support for
    true UTF-8 in JNI, given that such support is trivial to implement using
    either the null-terminated *UTF() JNI APIs or the UCS-2-based JNI APIs...
    In addition, this support is not really needed for performance: the UCS-2
    interface is the fastest one for JNI, as it saves the JNI extension from
    allocating internal work buffers when calling native OS APIs that can also
    accept UCS-2 directly, without extra code converters.
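    The difference between true UTF-8 and the modified scheme shows up exactly
    on those "invalid" strings. A small demonstration in Java itself (rather
    than in JNI native code): the default true-UTF-8 encoder cannot represent
    an unpaired surrogate and substitutes '?', while writeUTF's modified UTF-8
    simply encodes the lone code unit as three bytes:

    ```java
    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class TrueVsModifiedUtf8 {
        public static void main(String[] args) throws IOException {
            String lone = "\uD800";  // unpaired surrogate: legal in a Java String

            // True UTF-8 cannot represent it; String.getBytes substitutes '?'.
            byte[] utf8 = lone.getBytes(StandardCharsets.UTF_8);
            System.out.printf("%02X%n", utf8[0] & 0xFF);  // 3F

            // Modified UTF-8 encodes the code unit as-is in three bytes.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(lone);
            byte[] mod = buf.toByteArray();  // 2-byte length (3), then data
            System.out.printf("%02X %02X %02X%n",
                    mod[2] & 0xFF, mod[3] & 0xFF, mod[4] & 0xFF);
            // -> ED A0 80
        }
    }
    ```

    A JNI extension that wants real UTF-8 can thus take the UCS-2 data from
    GetStringChars plus the length, and run exactly this kind of conversion
    itself, deciding its own policy for unpaired surrogates.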



    This archive was generated by hypermail 2.1.5 : Sat Nov 13 2004 - 20:28:49 CST