From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 14 2004 - 01:36:18 CST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>> What is a shame is that Unicode published a definition of the
>> defective CESU-8 at all.
>
> On that point at least we agree. I wonder why CESU-8 was created, if
> there effectively exists applications needing it.

UTC could have simply acknowledged that certain applications and vendors
have created their own transformation formats for internal use, based
on, but incompatible with, existing Unicode encoding schemes. Oracle
has a UTF-8-like one which encodes supplementary code points with six
bytes instead of four. Sun has one like this which also encodes U+0000
as two bytes instead of one. Someone else might decide to use one of
the "zany" UTFs invented by Marco Cimarosti or me.
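
To make the byte counts concrete, here is a rough Python sketch of the two variants as described above. It is only an illustration: the function names cesu8_encode and modified_utf8_encode are made up for this message and are not any vendor's API, and the sketch assumes the six-byte form is produced by encoding each half of the UTF-16 surrogate pair as its own three-byte sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text the CESU-8 way (illustrative sketch, not a vendor API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary code point: split into a UTF-16 surrogate pair,
            # then encode each surrogate as a 3-byte sequence (6 bytes total).
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def modified_utf8_encode(text: str) -> bytes:
    """Same as above, but U+0000 becomes the two bytes C0 80."""
    # A 0x00 byte can only come from U+0000, so a byte-level replace is safe.
    return cesu8_encode(text).replace(b"\x00", b"\xc0\x80")


if __name__ == "__main__":
    sample = "\U00010400"  # a supplementary code point
    print(len(sample.encode("utf-8")), "bytes in UTF-8")
    print(len(cesu8_encode(sample)), "bytes in the six-byte variant")
    print(modified_utf8_encode("a\x00b"))
```

For U+10400, UTF-8 gives F0 90 90 80, while the six-byte form gives ED A0 81 ED B0 80, which a conformant UTF-8 decoder must reject.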

Whatever... but there was no need to publish a Technical Report
describing Oracle's custom format, giving it a formal-sounding name like
"CESU-8" and registering it as an IANA charset for interchange. Not
everyone outside this list is familiar with the fine distinction between
a UTR, officially approved by UTC, and a UTN, published but not approved
by UTC. I hope UTC does not ever go the "CESU-8" route with a UTN
describing Sun's broken format.

> On the other side, the Java modified UTF-8 (in fact more near from
> CESU-8) has proven to be useful and is widely used... Simply because
> it is compatible with standard C libraries for null-terminated
> strings.

An unusual type of "compatible" that makes a special allowance for
strings with embedded nulls, impossible by definition in C.
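
A small Python sketch of what that "compatibility" amounts to. The helper c_string_view is a hypothetical stand-in for how a strlen-style C routine sees a buffer; it is not a real library call. The point is only that the C0 80 trick hides the null from C code, at the price of producing bytes that are not valid UTF-8.

```python
def c_string_view(buf: bytes) -> bytes:
    """Return the bytes a NUL-terminated C API would see (hypothetical helper)."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


# The same five-character text, with an embedded U+0000, in both forms.
standard = "ab\x00cd".encode("utf-8")
modified = standard.replace(b"\x00", b"\xc0\x80")  # U+0000 written as C0 80

if __name__ == "__main__":
    print(c_string_view(standard))  # stops at the embedded null
    print(c_string_view(modified))  # the whole buffer survives
    try:
        modified.decode("utf-8")
    except UnicodeDecodeError as err:
        print("strict UTF-8 decoder rejects it:", err.reason)
```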

If the Java architects had wanted a variable-length array of arbitrary
byte data, they should have created such a type in the first place,
instead of overloading the string type. Strings are for text. Text
does not need nulls.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/