Re: Opinions on this Java URL?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Nov 13 2004 - 18:51:40 CST

    From: "Theodore H. Smith" <delete@elfdata.com>
    > http://java.sun.com/j2se/1.5.0/docs/api/java/io/
    > DataInput.html#modified-utf-8
    >
    > If only people could sue for suggesting bad coding practices ;o)

    It was not bad coding practice at the time Sun designed these APIs,
    because the design was explicitly based on the ISO/IEC 10646 definition of
    UTF-8, which was then the legacy version published in the RFC, where
    non-shortest encodings were allowed. Sun used it simply as a convenience:
    standard C libraries expect a NUL byte to terminate strings, and encoding
    U+0000 in a non-shortest form lets String objects contain NUL characters
    without producing a 0x00 byte. Also at that time, Unicode 1.0 was defined
    only as a 16-bit subset of ISO/IEC 10646, and the definitions for
    supporting other planes were missing.
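
    As a rough illustration of the NUL handling described above, the sketch
    below compares the bytes written by DataOutputStream.writeUTF with
    standard UTF-8 for a string containing U+0000. The class and helper names
    (ModifiedUtf8Nul, modifiedUtf8, hex) are made up for this example; only
    the java.io and java.nio calls are real APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Nul {
    /** Encodes s with writeUTF and drops the 2-byte length prefix it writes first. */
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A\u0000B";
        // Modified UTF-8: U+0000 becomes the two-byte form C0 80, so no 0x00 byte appears.
        System.out.println(hex(modifiedUtf8(s)));                      // 41 C0 80 42
        // Standard UTF-8: U+0000 is the single byte 00, which a C string would treat as the end.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));   // 41 00 42
    }
}
```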

    What is a shame is that Unicode did not consider this widely used legacy
    practice when it defined CESU-8 (which encodes supplementary characters the
    same way the Java modified UTF encoding does). CESU-8 could also have
    allowed encoding NUL (U+0000) as {0xC0,0x80}, something that is so useful
    for interoperability with standard C libraries.
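
    To make the supplementary-character case concrete, here is a small sketch
    that encodes U+10400 both ways. writeUTF encodes each UTF-16 surrogate as
    its own three-byte sequence (the CESU-8 style mentioned above), while
    standard UTF-8 uses a single four-byte sequence. The class and helper
    names are illustrative only, and U+10400 is just an arbitrary test
    character.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SupplementaryEncoding {
    /** Encodes s with writeUTF and drops the 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10400 is stored in a Java String as the surrogate pair D801 DC00.
        String s = new String(Character.toChars(0x10400));
        // Six bytes: one three-byte sequence per surrogate.
        System.out.println(hex(modifiedUtf8(s)));                      // ED A0 81 ED B0 80
        // Four bytes: one sequence for the whole code point.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));   // F0 90 90 80
    }
}
```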

    Now that CESU-8 is fixed and standardized, the Sun modified UTF encoding
    should have its own encoding label registered with something less ambiguous
    than the expression "modified UTF". Has Sun applied to register its
    encoding with an IANA/MIME charset identifier? (It is actually an encoding
    scheme, because the encoding form is plain UTF-16, even though the Sun
    scheme allows encoding isolated or unpaired surrogates, or the invalid
    code units 0xFFFE and 0xFFFF.) It would then be easier for Sun to reference
    this encoding by that label, if Sun published a public informative RFC for
    the IANA charset registration.
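
    The point about unpaired surrogates can be checked with a short round-trip
    sketch. It assumes nothing beyond the documented behaviour of
    writeUTF/readUTF and of String.getBytes, which substitutes a replacement
    byte for input it cannot encode. The class and method names are
    hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LoneSurrogateRoundTrip {
    /** Writes s with writeUTF and reads it back with readUTF. */
    static String roundTripModifiedUtf(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Encodes s as standard UTF-8 and decodes it again. */
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String lone = "a\uD800b";  // contains an unpaired high surrogate
        // The modified UTF scheme encodes the surrogate like any other 16-bit code unit.
        System.out.println(lone.equals(roundTripModifiedUtf(lone)));  // true
        // Standard UTF-8 cannot represent it, so getBytes replaces it and the string changes.
        System.out.println(lone.equals(roundTripUtf8(lone)));         // false
    }
}
```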

    Without this RFC, perhaps the informative page in the Java SDK
    documentation could be used as the reference for the IANA registration. But
    Sun should ensure that this page remains accessible (that's why extracting
    it into an isolated plain-text document for an informative RFC would be
    helpful).

    I won't support the idea of Sun suddenly removing one of its APIs (because
    it would break lots of JNI extensions that need it, even though there are
    APIs that can be used to pass String data directly in the native UTF-16
    format supported by the Java native char datatype). Also redefining the API
    with new names (without the UTF suffix) seems like overkill, and not needed
    for Unicode conformance: the API is self-contained and there's no
    restriction in Unicode about how an API function should be named.


