RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 07 2004 - 19:34:38 CST


    Lars,

    I'm going to step in here, because this argument seems to
    be generating more heat than light.

    > I never said it doesn't violate any existing rules. Stating that it does,
    > doesn't help a bit. Rules can be changed.

    > I ask you to step back and try to see the big picture.

    First, I'm going to summarize what I think Lars Kristan is
    suggesting, to test whether my understanding of the proposal
    is correct or not.

    I do not think this is a proposal to amend UTF-8 to allow
    invalid sequences. So we should get that off the table.

    What I think this suggestion amounts to is adding 128 characters
    to represent byte values in conversion to Unicode, for cases where
    the byte values are uninterpretable as characters. Why 128 instead
    of 256 I find a little mysterious, but presumably the intent
    is to represent 0x80..0xFF as raw, uninterpreted byte values that
    are otherwise unconvertible to Unicode characters.

    This is suggested by Lars' use case of:

    > Storing UNIX filenames in a Windows database.

    ... since UNIX filenames are simply arrays of bytes, and cannot,
    on interconnected systems, necessarily be interpreted in terms
    of well-defined characters.

    Apparently Lars is currently using PUA U+E080..U+E0FF
    (or U+EE80..U+EEFF ?) for this purpose, enabling round-tripping
    of byte values that cannot be interpreted as characters, and
    is asking for standard Unicode code points for this purpose, instead.

    The other use case that Lars seems to be talking about is
    existing documents containing data corruptions, which can
    often happen when Latin-1 data gets dropped into UTF-8 data
    or vice versa due to mislabeled email or whatever.

    > So you would drop the data. There are only two options with current designs.
    > Dropping invalid sequences, or storing it separately (which probably means
    > the whole document is dead until manually decoded). Dropping invalid
    > sequences is actually a better choice. And would even be justifiable (but
    > still sometimes inconvenient) if we were living in world where everything is
    > in UTF-8. In a world, trying to transition from legacy encodings to Unicode,
    > there could be a lot of data lost and a lot of angry users.

    And I am assuming this refers primarily to the second case.
    The extreme scenario Lars is envisioning would be, for
    example, one where each point in a system was hyper-alert to
    invalid sequences and simply tossed or otherwise sequestered
    entire documents if they contained these kinds of data corruptions.
    In such a case, I can understand the concern about
    angry users. How many people on this list would be cursing if
    every bit of email that had a character set conversion error in
    it, resulting in some bit hash or other, simply got tossed in the
    bit bucket instead of being delivered with the glorious hash
    intact, at least giving you the chance to see if you could
    figure out what was intended?

    > A UTF-16 based program will only be able to process valid UTF-8
    > data. A UTF-8 based program will in many cases preserve invalid sequences
    > even without any effort. Let me guess, you will say it is a flaw in the
    > UTF-8 based program. If validation is desired, yes. But then I think you
    > would want all UTF-8 based programs to do that. That will not happen. What
    > will happen is that UTF-8 based programs will be better text editors
    > (because they will not lose data or constantly complain), while UTF-16 based
    > programs will produce cleaner data. You will opt for the latter.

    This is, I think, the basic point at which people are talking past
    each other.

    Notionally, Doug is correct that UTF-8 and UTF-16 are equivalent
    encoding forms, and anything represented (correctly) in one can
    be represented (correctly) in the other. In that sense, there is
    no difference between representation of text in UTF-8 or UTF-16,
    and no reason to postulate that a "UTF-8 based program" will have
    any advantages or disadvantages over a "UTF-16 based program" when
    it comes to dealing with corrupted data.

    What Lars is talking about is a broad class of UNIX-based software
    which is written to handle strings essentially as
    opaque bags of bytes, not caring what they contain for many
    purposes. Such software generally keeps working just fine if you
    pump UTF-8 at it, which is by design for UTF-8 -- precisely because
    UTF-8 leaves untouched all the 0x00..0x7F byte values that may
    have particular significance for those processes. Most of that
    software treats 0x80..0xFF just as bit hash from the get-go, and
    neither cares nor has any way of knowing if the particular
    sequence of bit hash is valid UTF-8 or Shift-JIS or Latin-1 or
    EUC-JIS or some mix or whatever.
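
    To make the "by design" point concrete, here is a small,
    throwaway Python check (mine, purely illustrative) of the
    property those byte-oriented tools depend on: no byte of a
    multi-byte UTF-8 sequence ever falls in the 0x00..0x7F range.

        # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so ASCII
        # bytes such as '/', NUL or LF can never appear inside the encoding
        # of some other character -- byte-bag software keeps working.
        for cp in range(0x80, 0x110000):
            if 0xD800 <= cp <= 0xDFFF:      # surrogate code points: not encodable
                continue
            for b in chr(cp).encode('utf-8'):
                assert b >= 0x80, hex(cp)
        print("0x00..0x7F never occur inside multi-byte UTF-8 sequences")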

    > And I for
    > the former. But will users know exactly what they've got? Will designers
    > know exactly what they're gonna get? This is where all this started. I
    > stated that there is an important difference between deciding for UTF-8 or
    > for UTF-16 (or UTF-32).

    This is where this is all getting derailed. Whatever the solutions
    for representation of corrupt data bytes or uninterpreted data
    bytes on conversion to Unicode may be, they are irrelevant to the
    question of whether an application uses UTF-8 or UTF-16
    or UTF-32.

    This has been like the Miðgarð (<== Latin-1 data corruption opportunity)
    Serpent, where things go around and around because the cosmic snake is
    holding its tail in its mouth. UTF-8 applications on UNIX are easy to
    write because they don't care about UTF-8 data corruption -- they
    keep working just fine. But then because such applications pass
    corrupted UTF-8 data around all the time, we have a legacy problem
    of ensuring the preservation of corrupted UTF-8 in documents. And
    furthermore, because such applications may corrupt data in other
    character encodings as well, we have to have means of preserving the
    data corruptions on conversion to UTF-8, so that we can roundtrip
    the data corruptions, as well as the data.

    That about it?

    > > Data stored in UTF-8 and UTF-16 and UTF-32 must remain completely
    > > interchangeable, from one encoding form to another. That is not
    > > negotiable.
    > (smiles) It should be.

    And here we apparently have the clash of conflicting worldviews.

    Unicode encoding forms represent code points. They are completely
    interconvertible, by *definition*, whether we are talking about
    encoded characters or unassigned code points.

    The UNIX world sees strings as uninterpreted byte streams, and
    sees the necessity of preserving the integrity of the byte
    stream, no matter what crazy process or user may have stuck
    some byte into the stream contrary to a charset definition.
    And it sees UTF-8 as simply one interpretation slapped on top
    of the fundamental construct of the byte stream.

    There's more to it, of course, but this is, I believe, at the
    bottom of the reason why, for 12 years now, people have been
    fundamentally misunderstanding each other about UTF-8.

    > Besides, surrogates are not completely interchangeable. Frankly, they are,
    > but do not need to be, right? Instead of using the PUA, I could have chosen
    > unpaired surrogates. But would risk that a UTF-16 validator drops them. The
    > 128 codepoints I am proposing definitely need to have a special status, like
    > the surrogates. And like I once said, UTF-16 got a big chunk of the BMP, and
    > a lot of exceptions. The same can be done for UTF-8. With only 128
    > codepoints.

    As stated, this sounds like nonsense to a Unicode standardizer.

    So let me try to restate it, and see if this is what Lars is
    actually after -- in a way that a Unicode standardizer could
    interpret.

    Say a process gets handed a "UTF-8" string that contains the
    byte sequence <61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94>.
                            ^^                               ^^

    The 93 and 94 are just corrupt data -- they cannot be interpreted
    as UTF-8, and may have been introduced by some process that
    screwed up smart quotes when mixing Code Page 1252 and UTF-8, for
    example. Interpreting the string, we have:

    <U+0061, U+0062, U+0063, ???, U+004D, U+0430, U+4E8C, U+10302, ???>
      
    Now *if* I am interpreting Lars correctly, he is using 128
    PUA code points to *validly* represent any such byte, so that
    it can be retained. If the range he is using is U+EE80..U+EEFF,
    then the string would be reinterpreted as:

    <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C, U+10302, U+EE94>

    which in UTF-8 would be the byte sequence:

    <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94>
              ^^^^^^^^                               ^^^^^^^^

    This is now well-formed UTF-8, which anybody could deal with.
    And if you interpret U+EE93 as meaning "a placeholder for the
    uninterpreted or corrupt byte 0x93 in the original source",
    and so on, you could use this representation to exactly
    preserve the original information, including corruptions,
    which you could feed back out, byte-for-byte, if you reversed
    the conversion.
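
    To spell out the mechanics, here is a minimal sketch in Python
    of that round trip (not anything Lars has posted; the error
    handler name and the U+EE00 + byte mapping are just the
    interpretation above made executable):

        import codecs

        def ee_escape(exc):
            # Replace each undecodable byte 0xXX with the PUA code point U+EEXX.
            if not isinstance(exc, UnicodeDecodeError):
                raise exc
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(0xEE00 + b) for b in bad), exc.end

        codecs.register_error('ee-escape', ee_escape)

        raw = bytes.fromhex('61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94')
        text = raw.decode('utf-8', errors='ee-escape')
        # text == 'abc\uee93M\u0430\u4e8c\U00010302\uee94'

        well_formed = text.encode('utf-8')
        # == <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94>

        # Reversing the conversion: fold U+EE80..U+EEFF back down to bytes.
        restored = bytearray()
        for ch in text:
            if 0xEE80 <= ord(ch) <= 0xEEFF:
                restored.append(ord(ch) - 0xEE00)
            else:
                restored.extend(ch.encode('utf-8'))
        assert bytes(restored) == raw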

    Stated this way, at least I think the Unicode standardizers
    can understand what the proposal is aiming at -- if I haven't
    grossly misinterpreted it myself.

    > If you can guarantee that all data will be valid Unicode, then there would
    > be no need for the latter. And it's not arbitrary. It is about small
    > portions of data within otherwise valid UTF-8 data. Those can be legacy
    > encoded filenames, someone mistakenly inserting Latin 1 into a UTF-8
    > document, transmission errors, whatever. I think preserving data should be
    > possible. Programs that explicitly need to have clean data can validate,
    > drop or whatever. It's about the choice. Currently there isn't one.

    This is the statement from Lars that leads me to my interpretation,
    by the way. I think it fits the intent of what he was after.

    > > Characters don't get moved from PUA to BMP unless UTC assigns them
    > > there.
    > Yes, that is what I meant.

    Understood, I think.

    > And why do you think <99 C9> would become U+E000 and U+E001?! It's U+E099
    > and U+E0C9.
    > And no, my solution does not interprete UTF-8 correctly. Why should it.
    > Codepoints used for the roundtrip area are not supposed to be valid. They
    > are again stored as invalid sequences.
    >
    > And, it's not E0, it's EE, if anyone cares.

    From which I derive the above interpretation.

    > > I assure you, nobody will reject this scheme on the basis that it had
    > > not been considered before.
    > I am not so sure. Although, I am afraid somebody would try to reject it
    > because IT HAS been considered before. But has not been explained well
    > enough.

    Actually, what was considered before was a proposal to encode
    characters for byte values 0x00..0xFF, and in a somewhat different
    context than described here.

    Now moving from interpretation to critique, I think it unlikely
    that the UTC would actually want to encode 128 such characters
    to represent byte values -- and the reasons would be similar to
    those adduced for rejecting the earlier proposal. Effectively,
    in either case, these are proposals for enabling representation
    of arbitrary, embedded binary data (byte streams) in plain text.
    And that concept is pretty fundamentally antithetical to the
    Unicode concept of plain text.

    The response is likely to be to simply find another way around
    the problem, without trying to define maintenance of roundtrip
    integrity of unconvertible, corrupt string data as a *plain
    text* requirement.

    Storing UNIX filenames in a Windows database, for example,
    can be done with BINARY fields, which correctly capture them
    for what they are: unconvertible arrays of
    byte values, not convertible strings in some particular
    code page.
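
    A minimal sketch of what that looks like, using sqlite3 merely
    as a stand-in for whatever database is actually involved (the
    "files" table and its column are hypothetical):

        import os
        import sqlite3

        # Store UNIX filenames as raw bytes in a BLOB (BINARY-style) column,
        # so no character conversion is attempted at all.
        db = sqlite3.connect(':memory:')
        db.execute('CREATE TABLE files (name BLOB PRIMARY KEY)')

        for entry in os.listdir(b'.'):     # byte form of the directory listing
            db.execute('INSERT OR IGNORE INTO files (name) VALUES (?)', (entry,))

        # Whatever encoding (if any) the names were in, the bytes round-trip.
        for (name,) in db.execute('SELECT name FROM files'):
            assert isinstance(name, bytes)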

    As for the data corruption problem, the issue is simply how
    to deal with:

    <U+0061, U+0062, U+0063, ???, U+004D, U+0430, U+4E8C, U+10302, ???>

    in such a way as to preserve the source identity of the "???"
    on conversion, rather than mapping everything to U+FFFD, which
    loses information on roundtripping.
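
    For instance (a throwaway Python illustration), decoding the
    byte sequence from the earlier example with the standard
    replacement character collapses both corrupt bytes into the
    same thing:

        raw = bytes.fromhex('61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94')
        print(raw.decode('utf-8', errors='replace'))
        # 'abc\ufffdM\u0430\u4e8c\U00010302\ufffd' -- 0x93 and 0x94 are now
        # indistinguishable, and the original bytes cannot be recovered.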

    Effectively, you just need a scheme for representing:

    "???(<0x93)" and "???(<0x94)"

    distinctly.

    In my opinion, trying to do that with a set of encoded characters
    (these 128 or something else) is *less* likely to solve the
    problem than using some visible markup convention instead.
    After all, that is what the various hex conventions already
    in use address, in part. In other words, I see little
    advantage to:

    <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C, U+10302, U+EE94>

    (PUA), or:

    <U+0061, U+0062, U+0063, U+XX93, U+004D, U+0430, U+4E8C, U+10302, U+XX94>

    (standard on BMP), over

    <U+0061, U+0062, U+0063, "=93", U+004D, U+0430, U+4E8C, U+10302, "=94">

    with whatever escape you need in place to deal with your escape
    convention itself. In either case, the essential problem is
    getting applications to universally support the convention
    for maintaining and interpreting the corrupt bytes. Simply
    encoding 128 characters in the Unicode Standard ostensibly to
    serve this purpose is no guarantee whatsoever that anyone would
    actually implement and support them in the universal way you
    envision, any more than they might a "=93", "=94" convention.
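
    If it helps to see what such a convention entails, here is a
    rough sketch (the details are mine, not any existing standard)
    of an "=XX" hex convention in Python, including the escape of
    the escape character itself:

        def escape_bytes(raw: bytes) -> str:
            """Decode the longest valid UTF-8 runs; mark stray bytes as =XX."""
            out, i = [], 0
            while i < len(raw):
                for j in range(len(raw), i, -1):
                    try:
                        chunk = raw[i:j].decode('utf-8')
                    except UnicodeDecodeError:
                        continue
                    out.append(chunk.replace('=', '=3D'))   # escape the escape
                    i = j
                    break
                else:
                    out.append('=%02X' % raw[i])            # corrupt byte, kept visible
                    i += 1
            return ''.join(out)

        def unescape_text(text: str) -> bytes:
            out, i = bytearray(), 0
            while i < len(text):
                if text[i] == '=' and i + 3 <= len(text):
                    out.append(int(text[i+1:i+3], 16))
                    i += 3
                else:
                    out.extend(text[i].encode('utf-8'))
                    i += 1
            return bytes(out)

        raw = bytes.fromhex('61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94')
        assert unescape_text(escape_bytes(raw)) == raw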

    --Ken


