Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)

From: Philippe Verdy
Date: Mon Nov 15 2004 - 07:06:46 CST

  • Next message: Clark Cox: "Confusion about Collation"

    ----- Original Message -----
    From: "John Cowan" <>
    To: "Doug Ewell" <>
    Cc: "Unicode Mailing List" <>; "Philippe Verdy"
    <>; "Peter Kirk" <>
    Sent: Monday, November 15, 2004 7:05 AM
    Subject: Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)

    > Doug Ewell scripsit:
    >> As soon as you can think of one, let me know. I can think of plenty of
    >> *binary* protocols that require zero bytes, but no *text* protocols.
    > Most languages other than C define a string as a sequence of characters
    > rather than a sequence of non-null characters. The repertoire of
    > characters that can exist in strings usually has a lower bound, but its
    > full magnitude is implementation-specific. In Java, exceptionally, the
    > repertoire is defined by the standard rather than the implementation,
    > and it includes U+0000. In any case, I can think of no language other
    > than C which does not support strings containing U+0000 in most
    > implementations.

    It is exactly this inclusion of U+0000 as a valid character in Java strings
    that requires the character to be preserved in the JNI interface and in
    String serializations.

    Some here consider this broken behavior, but there is no simpler way to
    represent this character when passing a Java String instance to and from
    the JNI interface, or through serializations such as those in class files.
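    The serialization in question is Java's "modified UTF-8", which encodes
    U+0000 as the overlong byte pair 0xC0 0x80 so that no zero byte ever
    appears in the stream. A small sketch using `DataOutputStream.writeUTF`
    (which uses this same encoding for class files) illustrates it:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF("\u0000");
        byte[] b = baos.toByteArray();
        // writeUTF emits a 2-byte length prefix, then modified UTF-8:
        // U+0000 becomes the overlong pair 0xC0 0x80, never a raw zero byte,
        // so C code can still treat the buffer as NUL-terminated.
        System.out.printf("%02X %02X %02X %02X%n",
                b[0] & 0xFF, b[1] & 0xFF, b[2] & 0xFF, b[3] & 0xFF);
        // prints: 00 02 C0 80
    }
}
```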

    My opinion is that the Java behavior does not define a new encoding; it is
    rather a transfer encoding syntax (TES) that allows String instances to be
    serialized effectively. Strings are UCS-2 encoded using the 16-bit "char"
    Java datatype -- not merely the UTF-16 restriction of UCS-2, which requires
    paired surrogates. Java does not make the '\uFFFF' and '\uFFFE' chars or
    code units illegal: they are simply mapped to the U+FFFF and U+FFFE code
    points, even though those code points are permanently assigned as
    noncharacters in Unicode and ISO/IEC 10646.
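    A quick sketch of this permissiveness: any 16-bit value, including
    noncharacters and lone surrogates, is a legal "char" and can sit in a
    String without complaint.

```java
public class CodeUnitDemo {
    public static void main(String[] args) {
        // Every 16-bit value is a legal Java "char", including the
        // noncharacters U+FFFE/U+FFFF and lone (unpaired) surrogates.
        String s = "\uFFFF\uFFFE\uD800";
        System.out.println(s.length());                      // prints: 3
        System.out.println(Character.isSurrogate(s.charAt(2))); // prints: true
        // Rejecting such values is the job of interpretation layers
        // (e.g. a CharsetEncoder), not of String or char themselves.
    }
}
```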

    The internal working storage of Java Strings is not a character set (CCS
    or CES), and these strings are not necessarily bound to Unicode (even
    though Java provides many Unicode-based character properties and
    character-set conversion libraries): they can just as well store other
    charsets, using charset encoding/decoding libraries other than those found
    in the java.io.* and java.text.* packages. Once you admit that, Java
    String instances are just arrays of code units, not arrays of code points;
    their interpretation as encoded characters is left to other layers.

    Should there ever be a successor to Unicode (or should a Chinese
    implementation prefer to handle String instances internally with GB18030),
    with different mappings from code units to code points and characters, the
    working model of Java String instances and the "char" datatype would not
    be affected. This would still conform to the Java specifications, provided
    the standard java.text.*, java.io.*, and java.nio.* packages that perform
    the various mappings between code units, code points, characters, and byte
    streams are not modified: new alternate packages could be used without
    changing the String object or the unsigned 16-bit integer "char" datatype.
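    The layering described above can be seen in miniature with the standard
    java.nio charset machinery: the bytes below are "中" (U+4E2D) in GB18030,
    and the mapping to characters lives entirely in the Charset layer, not in
    String itself (assuming a JDK that ships the GB18030 charset, as standard
    Sun/Oracle JDKs do).

```java
import java.nio.charset.Charset;

public class CharsetLayerDemo {
    public static void main(String[] args) {
        // GB18030 bytes for the character "中" (U+4E2D).
        byte[] gb = { (byte) 0xD6, (byte) 0xD0 };
        // The Charset object performs the byte-to-code-unit mapping;
        // the resulting String just holds 16-bit code units.
        String s = new String(gb, Charset.forName("GB18030"));
        System.out.println(Integer.toHexString(s.charAt(0))); // prints: 4e2d
    }
}
```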

    In Java 1.5, Sun chose to support supplementary characters without
    changing the char and String representations: the "Character" class was
    extended with static methods that represent code points as 32-bit "int"
    values and map any Unicode code point in the 17 planes to and from "char"
    code units. The String class was then extended to allow parsing
    "char"-encoded strings by "int" code points (with automatic detection of
    surrogate pairs), while the legacy interface was preserved. In ICU4J, by
    contrast, the "UCharacter" class does not use a static representation but
    stores code points directly as "int"; "Character" instances still only
    store a single 16-bit "char", and code-point support remains static only:
    there is still no "Character(int codepoint)" constructor, only
    "Character(char codeunit)", because "Character" keeps its past
    serialization for compatibility, and "Character" also remains bound to the
    16-bit "char" datatype for object boxing (automatic boxing only exists in
    Java 1.5; explicit boxing is still supported in previous and current
    versions).
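    The Java 1.5 code-point API described above can be sketched in a few
    lines: one supplementary code point becomes a surrogate pair of two
    "char" code units inside the String, and the "int"-based methods detect
    the pair automatically.

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // Character.toChars maps an int code point to its char code units;
        // U+10400 (a supplementary character) needs a surrogate pair.
        String s = new String(Character.toChars(0x10400));
        System.out.println(s.length());                      // prints: 2
        System.out.println(s.codePointCount(0, s.length())); // prints: 1
        // codePointAt reassembles the pair into the original code point.
        System.out.println(Integer.toHexString(s.codePointAt(0))); // prints: 10400
    }
}
```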

    If Java needs any further extension, it is to include a class like ICU4J's
    "UCharacter", able to store a 32-bit "int" code point and to be built from
    a "char"-coded surrogate pair of code units or from a "Character"
    instance; and also to add a "UString" class that internally uses arrays of
    "int"-coded code units, with converters between String and UString. Such
    an extension would not need any change to the JVM, just new supported
    packages.
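    A minimal sketch of the proposed "UString" idea: the class name comes from
    the text above, but this implementation is purely illustrative, storing
    code points as an int array with converters to and from java.lang.String.

```java
// Illustrative only: "UString" is the name proposed in the text; this
// implementation is a sketch, not an existing library class.
final class UString {
    private final int[] codePoints; // one "int" code unit per code point

    UString(int[] codePoints) { this.codePoints = codePoints.clone(); }

    // Convert from a "char"-coded String, merging surrogate pairs.
    static UString fromString(String s) {
        int n = s.codePointCount(0, s.length());
        int[] cps = new int[n];
        for (int i = 0, j = 0; i < s.length(); j++) {
            cps[j] = s.codePointAt(i);
            i += Character.charCount(cps[j]);
        }
        return new UString(cps);
    }

    // Convert back, re-expanding supplementary code points into pairs.
    String toJavaString() {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) sb.appendCodePoint(cp);
        return sb.toString();
    }

    int length() { return codePoints.length; }
}

public class UStringDemo {
    public static void main(String[] args) {
        UString u = UString.fromString("A\uD801\uDC00"); // 'A' + U+10400
        System.out.println(u.length());                  // prints: 2
        System.out.println(u.toJavaString().length());   // prints: 3
    }
}
```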

    But even with all these extensions, the U+0000 Unicode character would
    remain valid and supported, and the need to support it in JNI and in
    internal JVM serializations of String instances would remain. I really
    don't like the idea, advanced by some people here, of deprecating a widely
    used JNI interface that needs this special serialization when interfacing
    with C code.

    Also, the fact that C assigns the all-bits-zero char the role of
    end-of-string terminator does not imply that C is unable to represent the
    NUL character, given that all other "char" values have *no* required
    semantics or values (for example, '\r' and '\n' are not bound to fixed
    Unicode characters but to functions, and their interpretation remains
    compiler- and platform-dependent).

    This archive was generated by hypermail 2.1.5 : Mon Nov 15 2004 - 07:13:30 CST