Re: Opinions on this Java URL?

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Nov 13 2004 - 19:50:26 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > It was not bad coding practice at the time when Sun designed these
    > APIs, because it was explicitly based on the ISO/IEC 10646 definition
    > of UTF-8, which was at that time the legacy version published in the
    > RFC, where non-shortest encodings were allowed.

    Good heavens, is that urban legend still floating around?

    From the ORIGINAL DEFINITION of FSS-UTF in September 1992:

    > When there are multiple ways to encode a value, for example
    > UCS 0, only the shortest encoding is legal.

    See http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt if you think
    I'm making this up.
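
    A quick way to see the shortest-form rule in practice: the sketch below
    (the class name is mine, purely illustrative) uses a strict Java UTF-8
    decoder, configured to report malformed input rather than substitute
    U+FFFD. The one-octet encoding of U+0000 decodes fine; the two-octet
    sequence C0 80 is rejected as malformed.

      import java.nio.ByteBuffer;
      import java.nio.charset.CharacterCodingException;
      import java.nio.charset.CharsetDecoder;
      import java.nio.charset.CodingErrorAction;
      import java.nio.charset.StandardCharsets;

      public class ShortestForm {
          public static void main(String[] args) {
              byte[] legal    = { (byte) 0x00 };               // U+0000, one octet
              byte[] overlong = { (byte) 0xC0, (byte) 0x80 };  // U+0000, two octets

              System.out.println(decode(legal));     // prints U+0000
              System.out.println(decode(overlong));  // prints rejected as malformed
          }

          // Strict decoding: report malformed input instead of substituting U+FFFD
          static String decode(byte[] bytes) {
              CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                      .onMalformedInput(CodingErrorAction.REPORT)
                      .onUnmappableCharacter(CodingErrorAction.REPORT);
              try {
                  String s = dec.decode(ByteBuffer.wrap(bytes)).toString();
                  return String.format("U+%04X", (int) s.charAt(0));
              } catch (CharacterCodingException e) {
                  return "rejected as malformed";
              }
          }
      }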

    And from RFC 2279, dated January 1998, which Philippe says "allowed
    non-shortest encodings":

    > It is important to note that the rows of the table are mutually
    > exclusive, i.e. there is only one valid way to encode a given UCS-4
    > character.

    and later:

    > For example, a parser might prohibit the NUL character when encoded
    > as the single-octet sequence 00, but allow the illegal two-octet
    > sequence C0 80 and interpret it as a NUL character. Another example
    > might be a parser which prohibits the octet sequence 2F 2E 2E 2F
    > ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.

    Note the use of the word "illegal" to describe non-shortest encodings.
    Note also that U+0000 is a specific example, both here and in the 1992
    document.
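
    To make the RFC's parser example concrete, here is a rough Java sketch
    (class and helper names are mine): a filter that scans raw octets for the
    literal sequence 2F 2E 2E 2F never fires on the disguised form, and only a
    decoder that rejects the overlong C0 AE keeps the byte view and the
    character view of the data in agreement.

      import java.nio.ByteBuffer;
      import java.nio.charset.CharacterCodingException;
      import java.nio.charset.CodingErrorAction;
      import java.nio.charset.StandardCharsets;

      public class OverlongDotDot {
          public static void main(String[] args) throws CharacterCodingException {
              // RFC 2279's disguised path: '/', '.' in overlong form (C0 AE), '.', '/'
              byte[] disguised = { 0x2F, (byte) 0xC0, (byte) 0xAE, 0x2E, 0x2F };

              // A byte-level filter for the literal sequence 2F 2E 2E 2F never fires,
              // because the first '.' is hidden inside the overlong encoding.
              System.out.println(containsDotDotSlash(disguised));   // false

              // A lenient decoder that interpreted overlong forms would hand "/../"
              // to the application after the filter had already passed the data.
              // A strict decoder simply refuses the C0 AE sequence:
              StandardCharsets.UTF_8.newDecoder()
                      .onMalformedInput(CodingErrorAction.REPORT)
                      .decode(ByteBuffer.wrap(disguised));   // throws MalformedInputException
          }

          static boolean containsDotDotSlash(byte[] b) {
              for (int i = 0; i + 3 < b.length; i++) {
                  if (b[i] == 0x2F && b[i + 1] == 0x2E && b[i + 2] == 0x2E && b[i + 3] == 0x2F) {
                      return true;
                  }
              }
              return false;
          }
      }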

    RFC 2044, which had been published for only 15 months before RFC 2279
    replaced it, did not include these passages, but neither did it
    expressly "allow" non-shortest sequences. And even RFC 2044 said:

    > Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
    > correspond to octets 00 to 7F (7 bit US-ASCII values).

    which seems unambiguous.

    Even the pre-3.1 versions of Unicode, which ill-advisedly allowed a
    conformant process to INTERPRET non-shortest forms, never allowed them
    to be GENERATED.

    This is either naïveté or intentional disinformation.

    > What is a shame is that Unicode did not consider this widely used
    > legacy practice when it defined CESU-8 (the way supplementary
    > characters are encoded with the Java-modified-UTF encoding), so that
    > it would also allow encoding NUL (U+0000) as {0xC0,0x80}, something
    > that is so useful to allow interoperability with standard C
    > libraries.

    What is a shame is that Unicode published a definition of the defective
    CESU-8 at all.
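
    For what it's worth, the Java behavior in question is easy to observe.
    The sketch below (class name mine, purely illustrative) compares standard
    UTF-8 from String.getBytes with the modified UTF-8 that DataOutput.writeUTF
    emits; the modified form should show U+0000 as C0 80 and a supplementary
    character as two three-byte surrogate sequences, CESU-8 style. The first
    two bytes of the writeUTF output are its length prefix.

      import java.io.ByteArrayOutputStream;
      import java.io.DataOutputStream;
      import java.io.IOException;
      import java.nio.charset.StandardCharsets;

      public class ModifiedUtf8Demo {
          public static void main(String[] args) throws IOException {
              dump("\0");            // NUL, U+0000
              dump("\uD800\uDC00");  // U+10000, a supplementary character
          }

          static void dump(String s) throws IOException {
              // Standard UTF-8, from the ordinary charset machinery
              byte[] standard = s.getBytes(StandardCharsets.UTF_8);

              // Modified UTF-8, as written by DataOutput.writeUTF
              // (the first two bytes are the length prefix writeUTF adds)
              ByteArrayOutputStream buf = new ByteArrayOutputStream();
              new DataOutputStream(buf).writeUTF(s);

              System.out.println("standard: " + hex(standard));
              System.out.println("modified: " + hex(buf.toByteArray()));
          }

          static String hex(byte[] bytes) {
              StringBuilder sb = new StringBuilder();
              for (byte b : bytes) sb.append(String.format("%02X ", b));
              return sb.toString().trim();
          }
      }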

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


