Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Dec 06 2004 - 14:11:31 CST


    Lars Kristan wrote:

    >> I could not disagree more with the basic premise of Lars' post. It
    >> is a fundamental and critical mistake to try to "extend" Unicode with
    >> non-standard code unit sequences to handle data that cannot be, or
    >> has not been, converted to Unicode from a legacy standard. This is
    >> not what any character encoding standard is for.
    >
    > What a standard is or is not for is a decision. And Unicode consortium
    > is definitely the body that makes the decision in this case.

    Actually, the Unicode Technical Committee. But you are correct: it is up
    to the UTC to decide whether they want to redefine UTF-8 to permit
    invalid sequences, which are to be interpreted as unknown characters
    from an unknown legacy coding standard, and to prohibit conversion from
    this redefined UTF-8 to other encoding schemes, or directly to Unicode
    code points. We will have to wait and see what UTC members think of
    this.

    > But this decision should not be based solely on theory and ideal
    > worlds.

    Right. Uh-huh.

    >> This is simply what you have to do. You cannot convert the data into
    >> Unicode in a way that says "I don't know how to convert this data
    >> into Unicode." You must either convert it properly, or leave the
    >> data in its original encoding (properly marked, preferably).
    >
    > Here lies the problem. Suppose you have a document in UTF-8, which
    > somehow got corrupted and now contains a single invalid sequence. Are
    > you proposing that this document needs to be stored separately?

    Of course not. That is not at all the same as INTENTIONALLY storing
    invalid sequences in UTF-8 and expecting the decoding mechanism to
    preserve the invalid bytes for posterity.

    > Everything else in the database would be stored in UTF-16, but now one
    > must add the capability to store this document separately. And
    > probably not index it. Regardless of any useful data in it. But if you
    > use UTF-8 storage instead, you can put it in with the rest (if you can
    > mark it, even better, but you only need to do it if that is a
    > requirement).

    And do what with it, Lars? Keep it on a shelf indefinitely in case some
    archaeologist unearths a new legacy encoding that might unlock the
    mystery data?

    Is this really worth the effort of redefining UTF-8 and disallowing free
    conversion between UTF-8 and Unicode code points?

    Do you have a use case for this?

    > I can reinterpret your example. Using the French word is exactly the
    > solution I am proposing, and I see your solution is to replace the
    > word with a placeholder which says "a word that does not exist in
    > German". Even worse, you want to use the same placeholder for all the
    > unknown words. Numbering them would be better, but awkward, since you
    > don't know how to assign numbers. Fortunately, with bytes in invalid
    > sequences, the numbering is trivial and has a meaning.

    So with your plan, you have invalid sequence #1, invalid sequence #2,
    and so forth. Now, what do the sequences mean? Is there any way to
    interpret them? No, there isn't, because by definition these sequences
    represent characters from an unknown coding standard. Either (a) nobody
    has gone to the trouble to find out what characters they truly
    represent, (b) the original standard is lost and we will *never* know,
    or (c) we are waiting for the archaeologist to save the day.

    In the meantime, the UTF-8 data with invalid sequences must be kept
    isolated from every process that would interpret the sequences as code
    points and raise an exception on the invalid ones; in other words, from
    all existing processes that handle UTF-8.

    > Let's compare UTF-8 to UTF-16 conversion to an automated translation
    > from German to French. What the Unicode Standard says can be interpreted
    > as follows:
    >
    > * All input text must be valid German language.
    > * All output text must be valid French language.
    > * Any unknown words shall be replaced by a (single) 'unknown word'
    > placeholder.

    If you have French words that cannot be translated into German at all,
    and nobody in the target audience is capable of understanding French,
    then what you have is an inscrutable collection of mystery data, perhaps
    suitable for research and examination by linguists, but not something
    that the audience can make any sense of. In that case, converting all
    the mystery data to a single "unknown word" placeholder is no worse than
    any other solution, and in particular, no worse than a solution that
    converts 100 different mystery words into 100 different placeholders,
    *none* of which the audience can decipher.

    > And that last statement goes for German words missing in your
    > dictionary, misspelled words, Spanish words, proper nouns...

    The underlying assumption is that somebody, somewhere, will be able to
    recognize these "foreign" or "unrecognized" words and make some sense of
    them. But in your character encoding example, the premise is that we
    DON'T know what the original encoding was, and it's too difficult or
    impossible to find out, so we just shoehorn them into UTF-8. That's not
    consistent with the German example.

    > I never said it is valid UTF-8. The fact remains I can store legacy
    > data in the same store as UTF-8 data. But cannot do that if storage is
    > UTF-16 based.

    Data stored in UTF-8 and UTF-16 and UTF-32 must remain completely
    interchangeable, from one encoding form to another. That is not
    negotiable.
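
    A Python 3 sketch of what that interchangeability means in practice
    (the sample string is mine, picked at random): any valid string of
    code points survives the trip through all three encoding forms
    unchanged.

        # Round-trip a valid Unicode string through all three encoding forms.
        text = "NESTL\u00c9\u2122 \u0259"     # "NESTLÉ™ ə", an arbitrary sample

        utf8  = text.encode("utf-8")
        utf16 = text.encode("utf-16-le")
        utf32 = text.encode("utf-32-le")

        # Decoding any of the three yields the identical sequence of code points.
        assert utf8.decode("utf-8") == utf16.decode("utf-16-le") \
            == utf32.decode("utf-32-le") == text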

    > Now suppose you have a UNIX filesystem, containing filenames in a
    > legacy encoding (possibly even more than one). If one wants to switch
    > to UTF-8 filenames, what is one supposed to do? Convert all filenames
    > to UTF-8?

    Well, yes. Doesn't the file system dictate what encoding it uses for
    file names? How would it interpret file names with "unknown" characters
    from a legacy encoding? How would they be handled in a directory
    search?

    > Who will do that? And when? Will all users agree?

    Agree about what? The conversion of characters from a legacy character
    set to Unicode? That's not up to the users; there are well-defined
    conversion tables that take care of this, for virtually every legacy
    character set you will ever run into. And for the ones we don't know,
    we don't know.

    > Should all
    > filenames that do not conform to UTF-8 be declared invalid?

    If you have a UTF-8 file system, yes.

    > And those files inaccessible?

    Those files should never have gotten into the file system in the first
    place if they have invalid names. As I said before, how would you go
    about accessing them? If your OS has a UTF-8 file system, it is
    probably Unicode-based in general, so you would be using databases or
    editors or whatever that work with Unicode characters. Except, whoops,
    this file name here contains an invalid UTF-8 sequence. That means it
    contains one or more characters whose mapping to Unicode is unknown or
    non-existent. Those files are ALREADY inaccessible.
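
    If that sounds abstract, here is what "already inaccessible" looks
    like to a strict UTF-8 consumer; a Python 3 sketch with a hypothetical
    Latin-1 file name:

        # A file name created under a Latin-1 locale: "NESTLÉ.rtf" as raw bytes.
        # (Hypothetical name; 0xC9 is É in Latin-1 and in CP1252.)
        legacy_name = b"NESTL\xc9.rtf"

        try:
            legacy_name.decode("utf-8")            # what a Unicode-based tool must do
        except UnicodeDecodeError as exc:
            print("undecodable file name:", exc)   # it cannot even display the name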

    > If you keep all processing in UTF-8, then this is
    > a decision you can postpone. But if you start using UTF-32
    > applications for processing filenames, invalid sequences will be
    > dropped and those files can in fact become inaccessible. And then
    > you'll be wondering why users don't want to start using Unicode.

    Right.

    > I didn't encourage users to mix UTF-8 filenames and Latin 1 filenames.
    > Do you want to discourage them?

    Absolutely.

    >> Among other things, you run the risk that the mystery data happens to
    >> form a valid UTF-8 sequence, by sheer coincidence. The example of
    >> "NESTLÉ™" in Windows CP1252 is applicable here. The last two
    >> bytes are C9 99, a valid UTF-8 sequence for U+0259. By applying the
    >> concept of "adaptive UTF-8" (as Dan Oscarsson called it in 1998),
    >> this sequence would be interpreted as valid UTF-8, and data loss
    >> would occur.
    >
    > I am well aware of that risk. But all you risk is that you lose the
    > 'late detection' of what happened.

    And you want to apply this strategy to file names, where unambiguous
    matching is critically important?

    (leans toward door) Security!
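
    For anyone who wants to watch the coincidence happen, a Python 3
    sketch using the standard cp1252 codec (the snippet is mine, not part
    of anybody's proposal):

        # "NESTLÉ™" encoded in Windows CP1252: É = 0xC9, ™ = 0x99.
        cp1252_bytes = "NESTL\u00c9\u2122".encode("cp1252")
        print(cp1252_bytes.hex())               # 4e4553544cc999

        # <C9 99> happens to be well-formed UTF-8 for U+0259, so an "adaptive"
        # decoder accepts the whole string; the trademark sign is silently gone.
        print(cp1252_bytes.decode("utf-8"))     # NESTLə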

    > But actually data has not been lost in this case. It is however lost
    > in all other cases, for example in NESTLÉ followed by a space. That
    > is, if for whatever reason this byte sequence happens to be in a
    > UTF-8 encoded text. Or if you named your file NESTLÉ.rtf and then
    > switched your locale to UTF-8.

    Why am I switching from locale A to locale B and expecting the files I
    named under locale A to remain unchanged under locale B? Do I expect
    that for other text? If so, why have locales?

    > Ability to store code points (and define all the codepoints that we
    > need) is one thing. Detecting corrupted data and trying to get rid of
    > legacy text for which encoding is not known is another. The two goals
    > should not be tied one to another. Especially not since we cannot be
    > sure that goal #2 can be achieved at all. Nor that everybody agrees it
    > is a goal in the first place. IMHO, preserving data is more important,
    > but so far it seems it is not a goal at all. With a simple argument -
    > that Unicode only defines how to process Unicode data. Understandably
    > so, but this doesn't mean it needs to remain so.

    There are plenty of tables available for converting data from other
    standards to Unicode. ICU has hundreds of them. That is how Unicode
    supports the processing of non-Unicode data, by defining how it can be
    converted to Unicode.
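
    Using one of those tables is a one-line operation in most
    environments. A Python 3 sketch, with an ISO 8859-2 sample I made up
    for illustration:

        # Legacy bytes with a known encoding convert cleanly; no invalid UTF-8 needed.
        legacy = b"\xe8a\xe8a"                    # "čača" in ISO 8859-2
        text = legacy.decode("iso8859-2")         # proper Unicode code points now
        print(text, text.encode("utf-8").hex())   # čača c48d61c48d61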

    Unicode is a standard for character encoding. It is not, *and should
    not be*, a standard for storing arbitrary binary data.

    >> What sort of "exception" is to be raised? What sort of "additional
    >> caution" should the user take? What if this process is not
    >> interactive, and contains no user intervention?
    >
    > Precisely the same questions I need to ask you for the process of
    > converting the data. You are saying that data that cannot be converted
    > should not be converted.

    Go back and read that last sentence again, please.

    Done? Yes, that is what I am saying.

    > Then what if THIS process is not interactive? What if there is no
    > manpower that could ever make it interactive? Simply resort to
    > conversion with data loss?

    You ALREADY HAVE data loss. You have character data that cannot be
    converted to Unicode, which you are trying to fit into a Unicode
    environment. Where is the practical difference between converting the
    mystery "characters" to U+FEFF and converting them to invalid UTF-8
    sequences that still cannot be interpreted?

    >> UTF-8 and UTF-16, used correctly, are perfectly interchangeable. It
    >> is not in any way a fault of UTF-16 that it cannot be used to store
    >> arbitrary binary data.
    >
    > According to the requirements that were taken into account when
    > designing UTF-16 - no, it is not a fault. The fact remains that UTF-8
    > can coexist with binary data. If anyone finds that to be an advantage,
    > then he might see this property of UTF-16 as a fault.

    Any data format can coexist with arbitrary binary data, if you violate
    the format sufficiently.

    >> Because it is an incredibly bad idea.
    >
    > Until proven otherwise.

    I'm just waiting for UTC members to join this thread and provide the
    "proof" you need.

    >> You *should* use the PUA for this purpose. It is an excellent
    >> application of the PUA. But do not be surprised if someone else,
    >> somewhere, decides to use the same 128 PUA code points for some other
    >> purpose. That does not make your data "non-standard," because all
    >> PUA data, by definition, is "non-standard."
    >
    > So we agree it is "non-standard", even now. And that there are risks
    > involved. Hence, this is not a good solution. Unfortunately no other
    > solution keeps the data in the same store, allowing it to be indexed
    > as a whole.

    How do you propose to "index" data that you cannot identify?

    > It is the best (I hope) solution given the current limitations.

    The "current limitations" are the way Unicode *and every other character
    encoding standard* are designed. It is how Unicode will remain, unless,
    as I said, UTC comes up with a real shocker for us all.

    > It is a solution that is simple and efficient.

    The PUA part of your solution is simple and efficient. The "invalid
    UTF-8" solution breaks interoperability with the rest of the Unicode
    Standard, and goes nowhere toward solving the real problem of
    undecipherable data.

    > Why look for other solutions that will ultimately yield the same
    > result but will just be a lot of effort to get around the consequences
    > of the fact that someone once said that Unicode is not supposed to
    > assist in solving this problem.

    Again, I ask for a realistic use case which will demonstrate that "this
    problem" of uninterpretable legacy data is something that Unicode should
    assist in solving.

    > When that is realized, the logical next step is to get rid of the
    > remaining problems. That is, to move from PUA to BMP.

    Characters don't get moved from PUA to BMP unless UTC assigns them
    there.

    > And bite the sour apple of security consequences.

    You think the "invalid UTF-8" scheme has fewer security consequences
    than using the PUA?

    > Though, you have a serious security issue even with your design. Two
    > strings may compare different in UTF-8, but same if converted to
    > UTF-16. Yes, assuming they contained invalid sequences.

    If I convert two different invalid UTF-8 sequences to the same Unicode
    code point (U+FFFD), or otherwise raise the same error condition for
    both, as directed by conformance clause C12a, then this is a serious
    security issue with my design. Hmm, yes, I can see that.
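
    For completeness, here is the behavior in question as a Python 3
    sketch, with the 'replace' error handler standing in for whatever
    C12a-conformant action a process chooses:

        a  = b"abc\xc9 def"   # stray CP1252 E-acute stranded in otherwise valid UTF-8
        b2 = b"abc\x99 def"   # a lone continuation byte in the same position

        assert a != b2                            # different as UTF-8 byte strings
        assert a.decode("utf-8", "replace") == b2.decode("utf-8", "replace")
        # both come out as "abc\ufffd def"; this is why validation must come first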

    > So, you should validate any UTF-8 data before processing it. A nasty
    > prerequisite. Thought all UTFs were equal...

    They are. You have to validate any UTF sequence before processing it.
    Data in UTF-16 may not contain surrogate code points, except where a
    high surrogate is followed by a low surrogate (Definition D35). Data in
    UTF-32 may not contain surrogates, or any value above 0x10FFFF
    (Definition D31). Nothing special about UTF-8 here.
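
    If it helps, here are the corresponding checks for the other two
    forms, sketched in Python 3 purely for illustration (this is not a
    conformance tester):

        def utf16_ok(units):
            """Reject unpaired surrogate code units in a sequence of 16-bit values."""
            i = 0
            while i < len(units):
                u = units[i]
                if 0xD800 <= u <= 0xDBFF:          # high surrogate: must be paired
                    if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                        return False
                    i += 2
                elif 0xDC00 <= u <= 0xDFFF:        # stray low surrogate
                    return False
                else:
                    i += 1
            return True

        def utf32_ok(units):
            """Reject surrogate code points and anything above U+10FFFF."""
            return all(u <= 0x10FFFF and not (0xD800 <= u <= 0xDFFF) for u in units)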

    > If you adopt the 128 codepoints I am proposing, the two strings would
    > compare different also in UTF-16. Not saying that my approach doesn't
    > introduce any new security considerations, but it is interesting that
    > it also solves some.

    It provides matching for characters declared not to be convertible to
    Unicode. Please provide a use case that demonstrates that this problem
    occurs in the real world, and needs to be solved within Unicode.

    >> What you are doing with the PUA is far more standard, and far more
    >> interoperable, than writing invalid UTF-8 sequences and expecting
    >> parsers to interpret them as "undeciphered 8-bit legacy text of some
    >> sort."
    >
    > Well, you may have a wrong assumption here. You probably think that I
    > convert invalid sequences into PUA characters and keep them as such in
    > UTF-8. That is not the case. Any invalid sequences in UTF-8 are left
    > as they are. If they need to be converted to UTF-16, then PUA is used.
    > If they are then converted to UTF-8, they are converted back to their
    > original bytes, hence the incorrect sequences are re-created.

    So if you have, say, the invalid UTF-8 sequence <99 C9>, you treat it as
    two mystery characters and map them to U+E000 and U+E001. Then if you
    have the valid UTF-8 sequences <EE 80 80> and <EE 80 81>, you also map
    those to U+E000 and U+E001, because of course your solution still
    interprets valid UTF-8 correctly.

    Then if you want to convert these two examples back to UTF-8, you
    convert both of them to... what?

    (stands up and looks down hall) SECURITY!!
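
    To make the collision explicit, a Python 3 sketch; the sequential
    E000/E001 mapping simply follows the example above and is not Lars's
    exact table:

        # Two different UTF-8 inputs from the example above:
        invalid = bytes([0x99, 0xC9])                        # two undecodable bytes
        valid = bytes([0xEE, 0x80, 0x80, 0xEE, 0x80, 0x81])  # well-formed U+E000 U+E001

        # Under the proposal both arrive in UTF-16 as the same two code points:
        as_utf16 = "\ue000\ue001"

        # Converting back, one rule must apply to both.  Standard encoding keeps
        # the genuine PUA text but loses the raw <99 C9>:
        print(as_utf16.encode("utf-8").hex())                # ee8080ee8081
        # ...while re-emitting raw bytes for PUA code points would corrupt the
        # genuine U+E000/U+E001 text instead.  The round trip cannot satisfy both.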

    > The other name for this is roundtripping. Currently, Unicode allows a
    > roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are several
    > reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more valuable,
    > even if it means that the other roundtrip is no longer guaranteed:

    Oh, good, now we have "UTF-16(32)". What is that?

    > * Legacy data is more likely to creep into UTF-8 data than into any
    > other UTF.
    >
    > * It is increasingly evident that UTF-8 is probably the most
    > likely candidate for any future storage and data exchange.

    "Crime is more likely to occur in poor, crowded urban areas, and these
    areas are growing larger and more numerous. Therefore, we should remove
    all police from those areas to allow the inevitable to happen."

    > One more example of data loss that arises from your approach:
    > If a single bit is changed in UTF-16 or UTF-32, that is all that will
    > happen (in more than 99% of the cases). If a single bit changes in
    > UTF-8, you risk that the entire character will be dropped or replaced
    > with the U+FFFD. But funny, only if it ever gets converted to the
    > UTF-16 or UTF-32. Not that this is a major problem on its own, but it
    > indicates that there is something fishy in there.

    Never mind Unicode. You have a lot to learn about data structures, data
    conversion, security, and standardization in general.

    > There was a discussion on nul characters not so long ago. Many text
    > editors do not properly preserve nul characters in text files. But it
    > is definitely a nice thing if they do. While preserving nul characters
    > only has a limited value, preserving invalid sequences in text files
    > could be crucial. A UTF-8 based editor can easily do this. A UTF-16
    > based editor cannot do it at all. If you say that UTF-16 is not
    > intended for such a purpose, then so be it. But this also means that
    > UTF-8 is superior. You can say that this is no longer UTF-8, since it
    > is misused. But that doesn't make a difference. If it is superior
    > because it can be misused, it is still superior.

    (wipes eyes)
    I'll leave this particular gem for someone else to reply to.

    > Yes, it is not related much. Except for the fact I was trying to see
    > if UTF-32 is needed at all. If one can do everything in UTF-8, then
    > invalid sequences can be preserved. If, however, certain things can
    > only be done by converting to UTF-32, then the application is again
    > risking data losses. Can be avoided, but extra care needs to be
    > taken. And if the way to do it would be standardized (hopefully also
    > simplified), it would help a lot.

    You are not really talking about UTF-32 so much as the whole idea of
    code points. You think UTF-8 byte sequences *themselves* should be the
    representation of the character. This is, very simply, not the way it
    is.

    >> This is completely wrong-headed. The whole purpose of UTF-8 is to
    >> represent valid Unicode code points, convertible to any other
    >> encoding form.
    >
    > I agree. Again, this is a question of requirements. Current
    > requirements have been met, no doubt about that. I am simply saying
    > that new requirements should be added. Or at least considered. But
    > fairly, not rejected simply because they were not considered before.

    I assure you, nobody will reject this scheme on the basis that it had
    not been considered before.

    >> Is it really "stateless" to store textual data in some mystery
    >> encoding, such that we don't know what it is, but we insist on
    >> storing it anyway, undeciphered, like some relic from an
    >> archaeological dig? What is the use of such data? Can it really be
    >> considered plain text at all?
    >
    > Still much better than rejecting the data. Not that rejecting the data
    > is not a valid option sometimes. But currently it is not an option, it
    > is the only thing that can be done.

    It is the only reasonable thing for a character encoding standard to do,
    if the identity of the data is unknown.

    >> Sorry to be so harsh and sound so personal. Frankly, I am amazed
    >> that in the 39 hours since your message was posted, this mailing list
    >> has not become absolutely flooded with responses explaining what a
    >> truly bad idea this is.
    >
    > I hope it is because it is not as bad as you still think.

    It is far, far worse than you think.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


