Re: Opinions on this Java URL?

From: Asmus Freytag
Date: Sun Nov 14 2004 - 04:54:03 CST


    At 11:36 PM 11/13/2004, Doug Ewell wrote:
    >Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    > >> What is a shame is that Unicode published a definition of the
    > >> defective CESU-8 at all.
    > >
    > > On that point at least we agree. I wonder why CESU-8 was created, if
    > > there actually exist applications needing it.
    >UTC could have simply acknowledged that certain applications and vendors
    >have created their own transformation formats for internal use, based
    >on, but incompatible with, existing Unicode encoding schemes. Oracle
    >has a UTF-8-like one which encodes supplementary code points with six
    >bytes instead of four.

    The way UTC formally 'acknowledges' something like that may involve
    the issuance of a specification for it. That was done for CESU-8, and
    incidentally also for UTF-EBCDIC.

    Sometimes the purpose of creating a label for a format is to be able
    to clearly identify data as *not* being in conformance to the Unicode
    specification. I've not seen evidence that UTR#26 has resulted in
    more or fewer implementations using CESU-8 style data. That is as
    expected, because the use of that format is driven by specific compatibility
    requirements, which neither get created nor removed by fiat from the UTC.
    On the other hand all implementations that do see a need to use that format
    can now safely warn all others of potential incompatibilities by
    correctly labelling their data. I see that as a win.
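
    To illustrate the incompatibility being labelled, here is a minimal
    Python sketch of the CESU-8 style encoding described above, in which a
    supplementary code point is first split into a UTF-16 surrogate pair
    and each surrogate is then written as its own three-byte sequence. The
    function name cesu8_encode and the sample character U+10400 are
    illustrative assumptions, not taken from UTR#26.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text CESU-8 style (illustrative helper, not a library API).

    Code points above U+FFFF are split into a UTF-16 surrogate pair, and
    each surrogate is serialized as a separate 3-byte UTF-8-like sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000
            units = (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))
        else:
            units = (cp,)
        for unit in units:
            # "surrogatepass" lets Python emit the bytes for a lone surrogate,
            # which a conformant UTF-8 encoder would refuse to produce.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "\U00010400"  # an arbitrary supplementary code point
utf8 = sample.encode("utf-8")
cesu8 = cesu8_encode(sample)

print(utf8.hex(" "))   # f0 90 90 80        (4 bytes, conformant UTF-8)
print(cesu8.hex(" "))  # ed a0 81 ed b0 80  (6 bytes, CESU-8 style)
```

    A strict UTF-8 decoder rejects the six-byte form, which is why data in
    this format needs its own label rather than being passed off as UTF-8.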

    > Sun has one like this which also encodes U+0000
    >as two bytes instead of one. Someone else might decide to use one of
    >the "zany" UTFs invented by Marco Cimarosti or me.

    I think people recognize a distinction between zany UTFs invented
    by some guys with too much time on their hands and the documentation
    of specific compatibility warts that (unfortunately) afflict a
    sizable group of users.

    >Whatever... but there was no need to publish a Technical Report
    >describing Oracle's custom format, giving it a formal-sounding name like
    >"CESU-8" and registering it as an IANA charset for interchange. Not
    >everyone outside this list is familiar with the fine distinction between
    >a UTR, officially approved by UTC, and a UTN, published but not approved
    >by UTC. I hope UTC does not ever go the "CESU-8" route with a UTN
    >describing Sun's broken format.

    A UTN is a different animal, as you are well aware. A UTN that says in
    effect "Java's string serialization is not conformant to UTF-8" (and
    explains the reason) is well within the parameters set for UTNs by the
    Unicode Consortium. It would also pass the sniff test for 'information
    useful to implementers and users of the standard'.
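
    As a rough sketch of the non-conformance in question, the Python below
    mimics the two deviations mentioned in this thread: U+0000 written as
    two bytes, and supplementary code points written as two three-byte
    surrogate sequences. The name modified_utf8_encode is a hypothetical
    helper; the two-byte length prefix that Java's DataOutput.writeUTF
    places in front of the data is deliberately left out.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Encode text in the style of Java's modified UTF-8 (illustrative only)."""
    # Work on UTF-16 code units, since the format serializes Java chars.
    raw = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(raw), 2):
        unit = int.from_bytes(raw[i:i + 2], "big")
        if unit == 0:
            # U+0000 becomes the overlong pair C0 80, never a bare 00 byte.
            out += b"\xc0\x80"
        else:
            # Surrogates are written individually as 3-byte sequences.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


print(modified_utf8_encode("A\x00").hex(" "))       # 41 c0 80
print("A\x00".encode("utf-8").hex(" "))             # 41 00
print(modified_utf8_encode("\U00010400").hex(" "))  # ed a0 81 ed b0 80
```

    Both C0 80 and the encoded surrogates are ill-formed in UTF-8, so a
    conformant decoder will reject output of this kind.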

    As Sun is discouraging the use of their format for all but Java-specific
    and reasonably low level serialization of class data - an option not open
    to the users of CESU-8 or UTF-EBCDIC who face the issue of interchange
    at least among the components of certain distributed implementations -
    there's not the same call for a formal specification and label.

    But a UTN would make a nice place to capture the information
    that gets dredged up every so often when this issue percolates on this
    and related mailing lists. UTNs, after all, are intended to allow for the
    documentation of such issues, without requiring UTC endorsement.

