Re: U+0000 in C strings

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Nov 15 2004 - 12:18:49 CST


    From: "Doug Ewell" <dewell@adelphia.net>
    > How does Java indicate the end of a string? It can't use the value
    > U+0000, as C does, because the "modified UTF-8" sequence C0 80 still
    > gets translated as U+0000. And if the answer is that Java uses a length
    > count, and therefore doesn't care about zero bytes, then why is there a
    > need to encode U+0000 specially?

    You seem to assume that Java (the language) uses this sequence. In fact the
    sequence is not for use by Java itself, but in its interfaces with other
    languages, including C.
    In 100% pure Java programming you never see that sequence: you just work
    with UTF-16 code units when parsing "String" instances or comparing
    "char" values.
    And if you perform I/O using the supported "UTF-8" Charset instance, Java
    properly encodes U+0000 as a single null byte. So why do people think that
    UTF-8 support in Java is broken? It is not.
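    To make the difference concrete, here is a minimal C sketch (BMP code
    points only) of the two encodings; the only divergence is the treatment
    of U+0000:

    #include <stdio.h>

    /* Sketch: encode one BMP code point to UTF-8. With modified = 0 this is
       standard UTF-8 (U+0000 -> a single 0x00 byte, as Java's "UTF-8"
       Charset produces); with modified = 1 it is the JNI/class-file
       "modified UTF-8" (U+0000 -> the overlong pair C0 80, so the encoded
       string never contains a 0x00 byte). */
    static int encode_utf8(unsigned int c, unsigned char *out, int modified) {
        if (c == 0 && modified) { out[0] = 0xC0; out[1] = 0x80; return 2; }
        if (c < 0x80)  { out[0] = (unsigned char)c; return 1; }
        if (c < 0x800) {
            out[0] = (unsigned char)(0xC0 | (c >> 6));
            out[1] = (unsigned char)(0x80 | (c & 0x3F));
            return 2;
        }
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }

    int main(void) {
        unsigned char buf[3];
        int i, n;
        n = encode_utf8(0x0000, buf, 0);            /* standard: prints 00 */
        for (i = 0; i < n; i++) printf("%02X ", buf[i]);
        printf("\n");
        n = encode_utf8(0x0000, buf, 1);            /* modified: prints C0 80 */
        for (i = 0; i < n; i++) printf("%02X ", buf[i]);
        printf("\n");
        return 0;
    }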
    The "modified UTF-8" encoding is only for use in the serialization of
    compiled classes that contain a constant string pool, and through the JNI
    interface to C-written modules using the legacy *UTF() APIs that want to
    work with C strings.
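    For example (a sketch, assuming a hypothetical native method
    "native void show(String s);" declared in a class pkg.Demo), the legacy
    *UTF() interface looks like this on the C side; the buffer it returns is
    modified UTF-8, hence null-terminated and free of embedded 0x00 bytes:

    #include <jni.h>
    #include <stdio.h>
    #include <string.h>

    JNIEXPORT void JNICALL
    Java_pkg_Demo_show(JNIEnv *env, jobject self, jstring s) {
        const char *utf = (*env)->GetStringUTFChars(env, s, NULL);
        (void)self;
        if (utf == NULL) return;   /* OutOfMemoryError was already thrown */

        /* Safe with char-based C string functions: any U+0000 in the Java
           string arrived as C0 80, so strlen() sees the full string. */
        printf("encoded length: %zu bytes\n", strlen(utf));

        (*env)->ReleaseStringUTFChars(env, s, utf);
    }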

    There's no requirement to use that legacy *UTF() interface in C, because
    you can also use the UTF-16 interface, which does not require the Java VM
    to allocate a byte buffer for the conversion from internal String storage.
    The UTF-16 JNI interface is more efficient, just a bit more complex to
    handle in C when you only use the standard char-based C library. If you
    use the wchar_t-based C library, you don't need this legacy interface;
    but support for wchar_t in standard libraries is not guaranteed on all
    platforms, even when there's a wchar_t datatype in the C libraries and
    headers, because ANSI C allows "wchar_t" to be defined as equal to
    "char", i.e. only 1 byte. If that is the case, the C program will need to
    use something other than "wchar_t", for example "unsigned short", which
    is guaranteed to be at least 16 bits.

    On Windows, Unix, Linux, and Mac OS or OS X, modern C compilers support
    a wchar_t of at least 16 bits, so this is not a problem. It may still be
    a problem if wchar_t is 32 bits, because the fastest UTF-16-based JNI
    interface requires 16-bit code units: in that case you won't be able to
    use wcslen() and so on...
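    A compile-time selection along those lines might look like this
    ("utf16_unit" is just a name invented for this sketch):

    #include <wchar.h>

    /* wcslen() and the other wide-string functions are only usable on jchar
       buffers when wchar_t is exactly 16 bits (as on Windows). Otherwise
       fall back to a plain 16-bit integer type plus explicit lengths. */
    #if WCHAR_MAX == 0xFFFF
    typedef wchar_t utf16_unit;         /* wide-string C library is usable */
    #else
    typedef unsigned short utf16_unit;  /* >= 16 bits per the C standard */
    #endif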

    With the fastest JNI 16-bit interface, note that the "wide-string" C
    libraries assume that U+0000, coded as a single null wchar_t code unit,
    is an end-of-string terminator; so if this is an issue, and your JNI
    component written in C must work with arbitrary Java String instances
    (which may contain embedded U+0000), you'll need to use memcpy() and
    similar functions with the separate length indicator to access all
    characters of a Java String instance. Such complication is not always
    necessary and sometimes causes unneeded errors. Using the legacy *UTF()
    JNI interface avoids this security risk or interpretation issue...
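    Such a length-based copy might look like this (a sketch; the helper name
    is invented here):

    #include <jni.h>
    #include <stdlib.h>
    #include <string.h>

    /* Copy all UTF-16 units of a Java string, including any embedded U+0000.
       The unit count is returned through *out_len; the caller must free()
       the buffer. */
    static jchar *copy_string_units(JNIEnv *env, jstring s, jsize *out_len) {
        jsize len = (*env)->GetStringLength(env, s);
        const jchar *units = (*env)->GetStringChars(env, s, NULL);
        if (units == NULL) return NULL;

        jchar *copy = malloc((size_t)(len + 1) * sizeof(jchar));
        if (copy != NULL) {
            memcpy(copy, units, (size_t)len * sizeof(jchar));
            copy[len] = 0;  /* convenience terminator only; rely on *out_len,
                               since U+0000 may occur before the end */
            *out_len = len;
        }
        (*env)->ReleaseStringChars(env, s, units);
        return copy;
    }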


