From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Nov 14 2004 - 04:54:03 CST
At 11:36 PM 11/13/2004, Doug Ewell wrote:
>Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
> >> What is a shame is that Unicode published a definition of the
> >> defective CESU-8 at all.
> >
> > On that point at least we agree. I wonder why CESU-8 was created, if
> > there really are applications that need it.
>
>UTC could have simply acknowledged that certain applications and vendors
>have created their own transformation formats for internal use, based
>on, but incompatible with, existing Unicode encoding schemes. Oracle
>has a UTF-8-like one which encodes supplementary code points with six
>bytes instead of four.
The way UTC formally 'acknowledges' something like that may involve
the issuance of a specification for it. That was done for CESU-8, and
incidentally also for UTF-EBCDIC.
Sometimes the purpose of creating a label for a format is to be able
to clearly identify data as *not* being in conformance to the Unicode
specification. I've not seen evidence that UTR#26 has resulted in
more or fewer implementations using CESU-8 style data. That is as
expected, because the use of that format is driven by specific compatibility
requirements, which neither get created nor removed by fiat from the UTC.
On the other hand all implementations that do see a need to use that format
can now safely warn all others of potential incompatibilities by
correctly labelling their data. I see that as a win.
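For readers who want the incompatibility spelled out: the six-byte form comes from applying the UTF-8 bit layout to each UTF-16 surrogate separately, instead of to the supplementary code point itself. A minimal sketch in Python follows; the function name cesu8_encode is made up for illustration and is not part of any library.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text in the CESU-8 style,
    i.e. the UTF-8 bit layout applied to each UTF-16 code unit."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points: same bytes as UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code points: split into a surrogate pair,
            # then encode each surrogate as a 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    sample = "\U00010400"  # DESERET CAPITAL LETTER LONG I
    print("UTF-8: ", sample.encode("utf-8").hex(" "))
    print("CESU-8:", cesu8_encode(sample).hex(" "))
```

For U+10400 this gives ED A0 81 ED B0 80, where UTF-8 gives F0 90 90 80. For BMP-only text the two forms are byte-identical, which is why unlabelled data of this kind is so easy to mistake for UTF-8.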
>Sun has one like this which also encodes U+0000
>as two bytes instead of one. Someone else might decide to use one of
>the "zany" UTFs invented by Marco Cimarosti or me.
I think people recognize a distinction between zany UTFs invented by
some guys with too much time on their hands and the documentation of
specific compatibility warts that (unfortunately) afflict a sizable
group of users.
>Whatever... but there was no need to publish a Technical Report
>describing Oracle's custom format, giving it a formal-sounding name like
>"CESU-8" and registering it as an IANA charset for interchange. Not
>everyone outside this list is familiar with the fine distinction between
>a UTR, officially approved by UTC, and a UTN, published but not approved
>by UTC. I hope UTC does not ever go the "CESU-8" route with a UTN
>describing Sun's broken format.
A UTN is a different animal, as you are well aware. A UTN that says in
effect "Java's string serialization is not conformant to UTF-8" (and
explains the reason) is well within the parameters set for UTNs by the
Unicode Consortium. It would also pass the sniff test for 'information
useful to implementers and users of the standard'.
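As a rough illustration of what such a note would have to explain, here is a sketch in Python of the two deviations mentioned in this thread: U+0000 written as the two bytes C0 80, and supplementary code points written as two 3-byte surrogate sequences. The function names are hypothetical, and the sketch leaves out the two-byte length prefix that Java's DataOutput.writeUTF puts in front of the data.

```python
def modified_utf8_encode(text: str) -> bytes:
    """Hypothetical helper mimicking the Java-style serialization form."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 becomes an overlong two-byte sequence,
            # so the output never contains a zero byte.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # Everything else in the BMP matches UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code points are encoded surrogate by surrogate.
            cp -= 0x10000
            for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def is_strict_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    encoded = modified_utf8_encode("A\x00")
    print(encoded.hex(" "), is_strict_utf8(encoded))
```

The string "A" followed by U+0000 comes out as 41 C0 80, which a conformant UTF-8 decoder must reject because C0 80 is an overlong form.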
As Sun is discouraging the use of their format for all but Java-specific
and reasonably low level serialization of class data - an option not open
to the users of CESU-8 or UTF-EBCDIC who face the issue of interchange
at least among the components of certain distributed implementations -
there's not the same call for a formal specification and label.
But a UTN would make a nice place to capture the information that gets
dredged up every so often when this issue percolates on this and related
mailing lists. UTNs, after all, are intended to allow for the
documentation of such issues without requiring UTC endorsement.
A./