From: Doug Ewell (firstname.lastname@example.org)
Date: Sun Nov 14 2004 - 12:21:01 CST
Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
> The way UTC formally 'acknowledges' something like that may involve
> the issuance of a specification for it. That was done for CESU-8, and
> incidentally also for UTF-EBCDIC.
Throughout all of this, I had completely missed the fact that the Tech
Note for CESU-8 had been upgraded to a Tech Report, two and a half years
ago, in fact. Perhaps I was in denial. Anyway, that puts CESU-8 on the
same plane as UTF-EBCDIC, and invalidates many of my comments, which
assumed that CESU-8 was defined in a Tech Note whose standing non-listers
might mistake for the relative sanction of a Tech Report.
> Sometimes the purpose of creating a label for a format is to be able
> to clearly identify data as *not* being in conformance to the Unicode
> specification. I've not seen evidence that UTR#26 has resulted in
> more or fewer implementations using CESU-8 style data. That is as
> expected, because the use of that format is driven by specific
> compatibility requirements, which neither get created nor removed by
> fiat from the UTC. On the other hand all implementations that do see a
> need to use that format can now safely warn all others of potential
> incompatibilities by correctly labelling their data. I see that as a
CESU-8 is the documentation of someone's internal, non-standard
implementation of UTF-8. Of course, the "someone" is large and
important and their implementation affects a lot of users. If nobody
else is motivated by the presence of UTR #26 to adopt this non-standard
What worries me is that there might be other people in the world like
Philippe who think Sun's "modified UTF-8" is a good and useful thing,
because it allows arbitrary data to be stored in C-style strings, and
who might propagate its use in a way that, thankfully, you haven't seen
with CESU-8. There are perfectly good data structures available for
storing arbitrary binary data. Strings of text are not one of them.
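For readers who have not looked at the byte level, here is a minimal
sketch of how the two non-standard forms differ from UTF-8. It assumes
CESU-8 as described in UTR #26 (each supplementary character is written
as a surrogate pair, with each surrogate encoded in three bytes) and
Sun's "modified UTF-8" as that same scheme plus the two-byte form C0 80
for U+0000. The function names encode_cesu8 and encode_modified_utf8 are
mine, for illustration only, and are not part of any library.

```python
def encode_cesu8(text: str) -> bytes:
    """Illustrative CESU-8 encoder (hypothetical helper, not a library API).

    BMP characters are encoded exactly as in UTF-8. Supplementary
    characters are split into a UTF-16 surrogate pair, and each
    surrogate is written as its own 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # high (lead) surrogate
            low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate
            for s in (high, low):
                out += bytes([
                    0xE0 | (s >> 12),
                    0x80 | ((s >> 6) & 0x3F),
                    0x80 | (s & 0x3F),
                ])
        else:
            # Raises UnicodeEncodeError for a lone surrogate in the input.
            out += ch.encode("utf-8")
    return bytes(out)


def encode_modified_utf8(text: str) -> bytes:
    """Illustrative "modified UTF-8" encoder (assumption: CESU-8 plus
    U+0000 written as the overlong pair C0 80, so no 0x00 byte appears)."""
    # In CESU-8 output a 0x00 byte can only come from U+0000, because
    # every byte of a multi-byte sequence is 0x80 or above.
    return encode_cesu8(text).replace(b"\x00", b"\xc0\x80")


if __name__ == "__main__":
    s = "\U00010400"
    print(s.encode("utf-8").hex(" "))              # standard UTF-8, 4 bytes
    print(encode_cesu8(s).hex(" "))                # CESU-8, 6 bytes
    print(encode_modified_utf8("a\x00b").hex(" ")) # no 0x00 byte in output
```

A conformant UTF-8 decoder rejects both outputs, the six-byte surrogate
form and the C0 80 pair alike, which is exactly why such data needs its
own label rather than being passed off as UTF-8.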
> A UTN is a different animal, as you are well aware. A UTN that says in
> effect "Java's string serialization is not conformant to UTF-8" (and
> explains the reason) is well within the parameters set for UTNs by the
> Unicode Consortium. It would also pass the sniff test for 'information
> useful to implementers and users of the standard'.
I am aware of the difference, and so are all (or most) list members.
How far that awareness extends beyond this list is left as an exercise
for the reader. But again, everything I said about UTNs is moot,
because I assumed CESU-8 was documented in a UTN, which did not confer
the appearance of Unicode sanction. The fact that it is a UTR is
actually more discouraging.
At least in the case of UTF-EBCDIC, the creators did not merely take an
existing, broken implementation of an existing character encoding scheme
and get it documented. They created an algorithm similar to and
inspired by UTF-8, but not in any way mistakable for it, and added a
1-to-1 EBCDIC translation layer. It's actually quite elegant.
> As Sun is discouraging the use of their format for all but Java-
> specific and reasonably low level serialization of class data - an
> option not open to the users of CESU-8 or UTF-EBCDIC who face the
> issue of interchange at least among the components of certain
> distributed implementations - there's not the same call for a formal
> specification and label.
That's good to know.
> But a UTN would make a nice place that one could use to capture the
> information that gets dredged up every so often when this issue
> percolates on this and related mail lists. UTNs after all, are
> intended to allow for the documentation of such issues, without
> requiring UTC endorsement.
While we're on the subject of UTNs, I think it's a shame that BOCU-1, a
genuinely novel and potentially useful compression scheme that was
invented from scratch, is only documented in a "no-endorsement" UTN,
when a draft UTR-upgrade that adds a white-box algorithm was written
almost a year ago but has not been approved. This places BOCU-1 *below*
CESU-8 in the food chain, which seems badly wrong.