Re: Opinions on this Java URL?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Nov 15 2004 - 03:04:13 CST

    From: "Asmus Freytag" <asmusf@ix.netcom.com>
    >>CESU-8 is the documentation of someone's internal, non-standard
    >>implementation of UTF-8. Of course, the "someone" is large and
    >>important and their implementation affects a lot of users. If nobody
    >>else is motivated by the presence of UTR #26 to adopt this non-standard
    >>version, good.
    >
    > There are some UTF-8/UTF-16 interoperability aspects that are addressed
    > by CESU-8. These concerns are real, and affect multi-component
    > architectures
    > that must interchange data across component boundaries. Therefore a
    > standard
    > specification serves a useful purpose.
    >
    >>What worries me is that there might be other people in the world like
    >>Philippe
    >
    > Phillippe doesn't worry me ;-)

    I'd like to note that the Java modified UTF-8 format is not purely internal
    to Java and that it is used for interchange of data in a multi-component
    architecture, which is the JNI interface allowing external native libraries
    to interchange data with any Java-compliant VM.

    So it's not only Sun's implementation, but also part of every other VM
    implementation that supports JNI (the Java Native Interface). This format
    also appears within the class file format, which is standardized and is
    accessible too through Java's reflection mechanisms, allowing Java programs
    to control how classes are loaded or created at run-time, and interchanged
    as well with other hosts. The "component boundaries" above apply to Java as
    well.
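
    To make the byte-level difference concrete, here is a minimal sketch that
    compares standard UTF-8 with the modified form for a string containing
    U+0000. It assumes that java.io.DataOutputStream.writeUTF is an acceptable
    stand-in for the JNI and class-file encoding (it is documented as writing
    modified UTF-8, preceded by a 2-byte length that the sketch strips). The
    class and method names are only illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative names; not part of any Java API.
final class ModifiedUtf8Demo {

    /** Encodes s as modified UTF-8 via writeUTF, without the 2-byte length prefix. */
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A\u0000B";
        // Standard UTF-8 keeps the null as a single 0x00 byte.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // 41 00 42
        // Modified UTF-8 encodes U+0000 as the two bytes C0 80.
        System.out.println(hex(modifiedUtf8(s)));                    // 41 C0 80 42
    }
}
```

    The second form contains no 0x00 byte, so it can cross a boundary that
    expects null-terminated C strings without being truncated.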

    Finally, the capability of Java of storing and exchanging valid Unicode
    strings with embedded nulls (U+0000) is a feature rather than a limitation,
    notably when this interchange requires using fixed-sized structures
    containing variable-length strings, where these nulls serve as padding bytes
    (for example in fixed-width plain-text table formats, where the introduction
    of binary length prefixes would make the text file unreadable).

    The NULL character is mostly used in plain-text formats as an ignorable
    padding, with less ambiguity than the spaces commonly used in so many SQL
    engines or in XML formats. Some text editors are broken so that they will
    not correctly load a text file with embedded nulls: these editors truncate
    the read data instead of handling nulls as if they were ignorable
    whitespace, because they handle the text as C strings, where a null byte
    means end of string.

    There are also many places in data structures used for interchange where
    plain-text strings are encoded in data fields without any explicit length,
    because the field is extremely small. Nulls are used there as required
    padding and must not be truncated, because these structures would
    otherwise be desynchronized. Nulls are also used as filler bytes within some
    communication protocols based on plain-text data.
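
    As an illustration of such padding, the following sketch packs short ASCII
    strings into fixed-width fields filled with null bytes, then reads them
    back. The 4-byte field width, the ASCII restriction and all names are
    assumptions made for the example; no specific file format or protocol is
    implied.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical fixed-width record layout, for illustration only.
final class NulPaddedFields {

    /** Builds a record of equal-width fields; unused bytes stay 0x00. */
    public static byte[] record(int width, String... fields) {
        byte[] out = new byte[width * fields.length]; // zero-filled by Java
        for (int i = 0; i < fields.length; i++) {
            byte[] raw = fields[i].getBytes(StandardCharsets.US_ASCII);
            if (raw.length > width) {
                throw new IllegalArgumentException("field too long: " + fields[i]);
            }
            System.arraycopy(raw, 0, out, i * width, raw.length);
        }
        return out;
    }

    /** Reads field number index, treating trailing nulls as ignorable padding. */
    public static String field(byte[] record, int width, int index) {
        int start = index * width;
        int end = start + width;
        while (end > start && record[end - 1] == 0) end--;
        return new String(record, start, end - start, StandardCharsets.US_ASCII);
    }

    /** Mimics a reader that treats the whole buffer as one C string. */
    public static String truncateAtNul(byte[] record) {
        int end = 0;
        while (end < record.length && record[end] != 0) end++;
        return new String(record, 0, end, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] rec = record(4, "ab", "cde");
        System.out.println(field(rec, 4, 0));   // ab
        System.out.println(field(rec, 4, 1));   // cde
        System.out.println(truncateAtNul(rec)); // ab (the second field is lost)
    }
}
```

    The reader that stops at the first null never sees the second field, which
    is the truncation behaviour described above.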

    Like it or not, nulls are part of almost all character sets, from the
    oldest ones to the most recent ones (with one notable exception in GSM text
    for SMS, where the null byte is a printable character, as GSM doesn't
    need/want data fillers). The support of ignorable padding characters will
    remain needed for a long time (or forever) in plain text, even if a
    plain-text *file* does not need it (there are other uses of plain text than
    just complete files). The many people who expect that a file containing any
    null byte is not text but binary are restricting themselves to the use of
    "text/plain" in MIME message formats.

    A GSM message would embed null bytes without being considered as binary, and
    would contain no data filler; it could not be interchanged with a MIME
    "text/plain" datatype even with a "charset" qualifier, but it would still be
    plain-text in the definition accepted at the Unicode or ISO/IEC10646 level
    (they don't care much about which encoding schemes or transport syntaxes are
    used to interchange plain-text, but about the interpretation of the
    *decoded* code points; lower encoding levels in the Unicode standard are
    mandatory only if applications choose to implement these levels and label
    their data with the corresponding charset identifiers that have been
    reserved, and included in the Unicode standard).

    So it's a fact that Unicode's UTF-8 format is fully compatible with Unicode
    (i.e. it can encode any Unicode text, including text containing NULL
    characters), but not with C and other applications that cannot rely on the
    effective text length being specified out of band, and instead depend on an
    explicit and mandatory end-of-text marker. This is the place where transport
    syntaxes are used in MIME, to escape reserved bytes which have special
    functions in the embedding transport: hexadecimal, Base-64,
    quoted-printable, uuencode, COBS, ... or escaping control bytes by the 0xC0
    leading byte (unused in UTF-8) followed by the control byte with a 0x80
    offset. The string definition in C implies that nulls must be escaped if
    they are needed, or that the string length be encoded separately out of band
    (but in that case this is no longer a standard null-terminated C string).
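
    The last escaping scheme in that list can be sketched in a few lines. This
    is only one possible reading of it: the choice to escape every byte in the
    range 0x00..0x1F (rather than the null byte alone) is an assumption, and
    the class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;

// Illustrative escaping layer; the escaped range 0x00..0x1F is an assumption.
final class C0Escape {

    /** Replaces each control byte b (0x00..0x1F) with the pair 0xC0, 0x80 + b. */
    public static byte[] escape(byte[] utf8) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : utf8) {
            int v = b & 0xFF;
            if (v < 0x20) {
                out.write(0xC0);
                out.write(0x80 + v);
            } else {
                out.write(v);
            }
        }
        return out.toByteArray();
    }

    /** Reverses escape(); 0xC0 never occurs in valid UTF-8, so it is unambiguous. */
    public static byte[] unescape(byte[] escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length; i++) {
            int v = escaped[i] & 0xFF;
            if (v != 0xC0) {
                out.write(v);
                continue;
            }
            if (i + 1 >= escaped.length) {
                throw new IllegalArgumentException("truncated escape at end of input");
            }
            int next = escaped[++i] & 0xFF;
            if (next < 0x80 || next > 0x9F) {
                throw new IllegalArgumentException("invalid byte after 0xC0");
            }
            out.write(next - 0x80);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = {0x41, 0x00, 0x42};
        byte[] escaped = escape(plain);
        // The escaped form is 41 C0 80 42 and contains no null byte.
        System.out.println(escaped.length);           // 4
        System.out.println(unescape(escaped).length); // 3
    }
}
```

    With this range, the null byte becomes C0 80, which is the same pair that
    Java's format uses for U+0000.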

    C does not mandate any escaping mechanism, and Java's "modified UTF-8" is
    perfectly valid in this context as a transport syntax for CESU-8. In fact I
    don't like the term "modified UTF-8" used by Sun in its revised
    documentation; it is causing confusion, and it would be more exact
    if Sun said it is a "modified CESU-8" (so that it will match with
    how Java now handles supplementary characters), and if Sun documented that
    this format includes the bijective support of strings with non-character
    code units '\uFFFE' and '\uFFFF', and more critically of malformed strings
    with unpaired or isolated surrogates (which are normally not acceptable even
    in standard CESU-8).
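
    A small experiment shows both points, the CESU-8-like handling of
    supplementary characters and the lossless handling of unpaired surrogates.
    As a sketch, it again assumes that DataOutputStream.writeUTF and
    DataInputStream.readUTF are representative of the format under discussion,
    and it compares them with the JDK's standard UTF-8 encoder, which
    substitutes '?' for an unpaired surrogate. The names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative names; not part of any Java API.
final class SurrogateRoundTrip {

    /** Serializes s with writeUTF (2-byte length prefix plus modified UTF-8). */
    public static byte[] writeModified(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Reads back a string produced by writeModified. */
    public static String readModified(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Round-trips s through the standard UTF-8 charset. */
    public static String viaStandardUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String lone = "x\uD800y";          // unpaired high surrogate
        String supp = "\uD801\uDC00";      // U+10400 as a surrogate pair

        System.out.println(readModified(writeModified(lone)).equals(lone)); // true
        System.out.println(viaStandardUtf8(lone).equals(lone));             // false

        // Standard UTF-8 uses 4 bytes; writeUTF uses 3 bytes per surrogate.
        System.out.println(supp.getBytes(StandardCharsets.UTF_8).length);   // 4
        System.out.println(writeModified(supp).length - 2);                 // 6
    }
}
```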

    A better term without reference to UTF-8 or even CESU-8 would be useful
    (even if information is given that will refer to other standard UTF-8 and
    CESU-8 encoding schemes). As this encoding is needed for the *serialization*
    (data interchange over a byte-oriented stream) of Java String objects (which
    can contain malformed Unicode text with unpaired surrogates, and any valid
    or invalid or reserved or unassigned 16-bit code units), why not refer to
    this encoding as "Java-String-8" (or "JS-8" for short)?


