Re: CodePage Information

From: Kenneth Whistler
Date: Thu May 22 2003 - 21:13:35 EDT


    > From: "Kenneth Whistler" <>
    > > So Doug is correct. 0xC0 0x80 is not a permissible representation
    > > of U+0000 in UTF-8, and it is bad advice to recommend to people
    > > that they should use it.

    Philippe Verdy retorted:

    > This is not what I said or meant.

    What you said and (I presume) meant was:

    "... encode a NULL codepoint with the pair of bytes (0xC0; 0x80)."

    I am not claiming that you are claiming that that is legal in
    UTF-8. You stated it was a "trivial extension" of UTF-8.

    I am claiming that that is bad advice.

    > The main reason why the 0x00 byte causes problems is because it
    > is most often used as a string terminator, unlike what ASCII or
    > Unicode defines for the NULL character.

    No. The reason why the 0x00 byte causes problems is because
    people who have not sufficiently familiarized themselves with
    the structure of the standard assume that they can treat
    byte-serialized UTF-16 with standard C string API's (and
    similar protocols). And when they discover that a UTF-16
    byte serialization uses all byte values, including 0x00, they
    tend to assert that that is a problem with Unicode instead
    of being a problem with their choice of encoding schemes.

    As you originally indicated, choice of UTF-8 as an encoding scheme
    deals with this problem. It deals with it because U+0000
    is represented as 0x00, and 0x00 never appears in a valid
    UTF-8 byte serialization meaning anything other than U+0000.
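    That property can be sketched in C (the function name and its
    reduced scope are mine, not from any standard library): a strict
    UTF-8 well-formedness check accepts the single byte 0x00 as
    U+0000, but rejects 0xC0 as a lead byte outright, because any
    two-byte sequence it begins would decode below U+0080:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Returns the number of bytes consumed by one well-formed UTF-8
     * sequence starting at s, or 0 if the sequence is ill-formed.
     * Lead bytes 0xC0 and 0xC1 are rejected unconditionally: any
     * two-byte sequence they start is an overlong encoding, so
     * <0xC0 0x80> can never be a valid spelling of U+0000. */
    size_t utf8_sequence_len(const uint8_t *s, size_t n) {
        if (n == 0) return 0;
        uint8_t b = s[0];
        if (b < 0x80) return 1;             /* U+0000..U+007F, including NUL */
        if (b < 0xC2) return 0;             /* stray continuation or overlong lead */
        if (b < 0xE0) {                     /* two-byte sequence: U+0080..U+07FF */
            if (n < 2 || (s[1] & 0xC0) != 0x80) return 0;
            return 2;
        }
        /* (three- and four-byte cases elided for brevity) */
        return 0;
    }
    ```

    So a validating reader sees 0x00 only where the text really
    contains U+0000, which is exactly what makes UTF-8 safe for
    byte-oriented protocols.
    
    
    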

    > In this case, one cannot encode it because the device or protocol
    > does not support sending a separate length specifier and needs
    > the 0x00 to terminate the string, and thus a NULL character in
    > a Unicode string could not be encoded even if it's needed.

    Or an *ASCII* string! You are missing the main point that
    this has nothing to do with Unicode.

    If I use only 8-bit ASCII data with C string API's, then
    I am similarly precluded from embedding the ASCII NUL (0x00)
    character in a string, because of the string termination
    convention used in C. In that case I don't have some
    option to "escape" the NUL to 0xC0 0x80 to use it embedded
    in the string.

    Well, 8-bit ASCII data is also valid UTF-8 data, and it
    should behave no differently. If I have a NUL character,
    it should be represented as 0x00, just as it is in ASCII.

    If I have such character data in an array, with a NUL
    character in it, well, I obviously can't just point that
    array at strlen() and get the right results. But that
    is no different for ASCII than for UTF-8.
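    A minimal C illustration (the function is mine, for demonstration
    only): strlen() reports only the prefix before the first embedded
    NUL, and it does so identically whether the bytes are called
    ASCII or UTF-8.

    ```c
    #include <stddef.h>
    #include <string.h>

    /* With a NUL embedded in the content, strlen() sees only the
     * prefix. This is purely a C string-termination issue; the byte
     * values are valid ASCII and valid UTF-8 alike. */
    size_t reported_length(void) {
        static const char data[] = {'a', 'b', '\0', 'c', 'd'};  /* 5 content bytes */
        return strlen(data);  /* stops at the first 0x00, yielding 2 */
    }
    ```

    The real length (5) has to travel alongside the buffer; no
    encoding trick changes that.
    
    
    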

    > This is the case where an escaping mechanism, using other
    > unused parts of UTF-8 can make sense,

    It never makes sense to "use other unused parts of UTF-8".
    The UTC has gone to rather extreme lengths to keep ad hoc
    "trivial extensions" of UTF-8 from being promulgated, so
    as to preserve the interoperability of all UTF-8 data.

    > and I don't think that Sun made an error when using such
    > escaping mechanism to allow sending strings containing a
    > significant NULL character through JNI (and at the time
    > when Sun used it for Java, it was a valid and compliant
    > UTF-8 encoding for that character,

    This is an erroneous claim. <0xC0 0x80> has *never* been a
    "valid and compliant UTF-8 encoding" for U+0000.
    ISO/IEC 10646-1:1993/AMD.2: 1994 (E), which added UTF-8 to
    10646, clearly maps U+0000 to UTF-8 octet 0x00, and
    disallows <0xC0 0x80> as the UTF-8 mapping of any UCS
    code position. <0xC0 0x80> would be a "malformed sequence"
    by P.7 in that Amendment. Even the predecessor of UTF-8,
    published in Unicode 1.1 as "FSS-UTF", clearly stated:

      "When there are multiple ways to encode a value, for
       example U+0000, only the shortest encoding is legal."
          UTR #4, The Unicode Standard, Version 1.1, p. 28 (1993)

    It was wishful thinking, on the part of some implementers,
    that it would be o.k. to use non-shortest forms of UTF-8
    to represent characters. And people took shortcuts in
    programming their UTF-8 decoders, because it is easier to
    just let the algorithm bit-shift <0xC0 0x80> to U+0000 than
    to range check and raise an exception for the illegal
    sequences. The wording in Unicode 2.0 and Unicode 3.0
    unfortunately encouraged such short-cuts, but because of
    the trouble that has caused, the UTC clamped down even on those
    laxities and stated that it means what it says about non-shortest
    forms being disallowed.
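    The shortcut and its remedy can be shown side by side in C (a
    sketch, with a function name of my own): the bit-shift arithmetic
    happily maps <0xC0 0x80> to 0, and only the added range check
    makes the decoder conformant.

    ```c
    #include <stdint.h>

    /* Decode one two-byte UTF-8 sequence. The naive bit-shift alone
     * would map <0xC0 0x80> to U+0000; a conformant decoder must add
     * the shortest-form range check, because two-byte sequences may
     * only encode U+0080..U+07FF. Returns the code point, or -1 for
     * an ill-formed sequence. */
    int32_t decode_two_byte(uint8_t b1, uint8_t b2) {
        if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80) return -1;
        int32_t cp = ((int32_t)(b1 & 0x1F) << 6) | (b2 & 0x3F);
        if (cp < 0x80) return -1;  /* overlong: rejects <0xC0 0x80> et al. */
        return cp;
    }
    ```

    The lazy decoders omitted exactly that last comparison, which is
    why the non-shortest forms slipped through for so long.
    
    
    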

    So I return to my original point: it is bad advice to encourage
    people to use <0xC0 0x80> to represent U+0000 as an extension
    of UTF-8.

    > and I see no good reason why Sun would change this without
    > breaking the ascending compatibility of JNI, which is a
    > *published* interface since long, but not an internal encoding
    > used only within compiled/serialized classes).

    Well, existing JNI interfaces won't change. But data encoded
    with non-shortest forms is *not* UTF-8, and people need to
    understand if they are using such data, it can lead to
    interoperability problems.

    > I never said that such (0xC0; 0x80) sequence is now a valid
    > UTF-8 encoding (yes now it's prohibited).

    It was *always* prohibited. It was merely *tolerated* when people
    did what was prohibited anyway. Now people are spanked instead
    of being given a wink and a nod.

    > I just say that this is an upper-level encoding on top of
    > UTF-8 needed for the very common case where the 0x00 byte is
    > interpreted as a string terminator and is not part of the
    > string content, and there's no other way to specify a total
    > encoded length to integrate that null byte as a significant
    > character.

    And I just say that this is a C string termination issue, and
    has nothing to do with UTF-8. The exact same problem for NUL
    applies to *every* other character encoding in any kind of
    widespread use, including all the "ASCII"-based ones (and DBCS)
    and the EBCDIC code pages. You don't go to 8859-1 or Code Page 437
    or MacRoman or Code Page 037 or GB 2312 and create "upper-level
    encodings" with escape mechanisms for 0x00 just so you could
    put NUL's into the strings for them for use with C runtime
    libraries. If you need to embed NUL characters in character
    arrays and treat them as strings (for *any* of these encodings),
    you modify and extend your string libraries so that they
    don't depend on null-termination of the strings.
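    A minimal sketch of such an extension, with a type and names
    entirely of my own invention: carry the length explicitly, and
    0x00 becomes just another content byte, in ASCII, UTF-8, or any
    legacy code page.

    ```c
    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* A counted string: the explicit length replaces the NUL
     * terminator, so U+0000 may appear anywhere in the content. */
    typedef struct {
        size_t len;
        unsigned char *bytes;
    } CountedString;

    /* Copy len bytes, NULs included, into a counted string.
     * (Allocation-failure handling elided for brevity.) */
    CountedString cstr_make(const void *data, size_t len) {
        CountedString s;
        s.len = len;
        s.bytes = malloc(len);
        memcpy(s.bytes, data, len);
        return s;
    }
    ```

    Once the libraries take (pointer, length) pairs, the question of
    escaping NUL never arises for any encoding.
    
    
    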

    > It may be the only way to represent Unicode strings that need
    > to include NULL characters with a huge set of C libraries that
    > depend on the fact that 0x00 is NOT part of the encoded string
    > and is ALWAYS a string terminator.

    The implication is totally erroneous, because this is nothing
    Unicode-specific, but applies to *every* character encoding in
    any significant use. You imply that the "only way" to handle
    this problem for Unicode strings is by a "trivial extension"
    to a Unicode-specific encoding form, when the problem is the
    same for every character encoding, and nobody advocates hacking
    up all the other encodings to fix the same problem for them.

    > But for now such derived encoding has no new formal name:
    > the old definition of UTF-8 was enough,

    ?? It did not allow it. Why would it have been given a formal
    name? It had an *informal* name: non-shortest form UTF-8,
    and was designated not to be a legal form.

    > but the new restriction of UTF-8 forgot to assign a name to
    > this case

    See above. It was not an oversight.

    > (only CUSE-8 was considered has meriting a technical report
    > and a new name but this addresses a distinct problem or legacy
    > usage). I think that both UTF-8 or CUSE-8 should have a variant
    > accepting this escaping mechanism for the NULL character as the
    > only way to represent it safely (UTF-8-NULL? CUSE-8-NULL ?)

    And I think this is a *terrible* idea, which will be roundly
    rejected. Let me state it one last time: it is bad advice to
    recommend that people use <0xC0 0x80> to represent U+0000
    (as any kind of extension to UTF-8).


    This archive was generated by hypermail 2.1.5 : Thu May 22 2003 - 22:09:16 EDT