RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan (
Date: Wed Dec 08 2004 - 04:01:39 CST

    Doug Ewell wrote:
    > How do file names work when the user changes from one SBCS to another
    > (let's ignore UTF-8 for now) where the interpretation is different? For
    > example, byte C3 is U+00C3, A with tilde (Ã) in ISO 8859-1, but U+0102,
    > A with breve (Ă) in ISO 8859-2. If a file name contains byte C3, is its
    > name different depending on the current locale?
    It displays differently, but compares the same. Whether or not it is the
    same name is a philosophical question.
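
    A minimal Python sketch of the byte C3 example, assuming Python's standard
    iso8859_1 and iso8859_2 codecs stand in for the two locales. The variable
    names are illustrative only:

```python
# The single byte 0xC3 from the quoted example, as stored on disk.
name = b"\xc3"

# The same byte interpreted under two single-byte character sets.
as_latin1 = name.decode("iso8859_1")  # U+00C3, A with tilde
as_latin2 = name.decode("iso8859_2")  # U+0102, A with breve

print(f"ISO 8859-1: U+{ord(as_latin1):04X}")
print(f"ISO 8859-2: U+{ord(as_latin2):04X}")

# The displayed characters differ between locales ...
displays_differently = as_latin1 != as_latin2

# ... but each locale maps its character back to the same stored byte,
# so a byte-wise comparison of the names still succeeds.
compares_the_same = (
    as_latin1.encode("iso8859_1") == as_latin2.encode("iso8859_2") == name
)
```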

    > Is it accessible in all locales?
    Typically, yes for all SBCS, but not really guaranteed for all MBCS. Depends
    on whether you validate the string or not. The way UNIX is being developed,
    those files are typically still accessible since the programs are still
    working with 8-bit strings. And that is what I am saying. A UTF-8 program (a
    hypothetical 'UNIX Commander 8') would have no problems accessing the files.
    A UTF-16 program (a hypothetical 'UNIX Commander 16') on the other hand
    would have problems.

    > (Not every SBCS defines a character at every code point.
    > There's no C3 in ISO 8859-3, for example.)
    It works just like unassigned codepoints in Unicode work. How they are
    displayed is not defined, but they can be passed around and compared for
    equality. Collation is again not defined, but simple sorting does give
    useful results.
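
    As a hedged illustration in Python: the standard iso8859_3 codec leaves
    byte C3 undefined, so a strict decode fails, while the raw bytes can still
    be compared and sorted. The helper can_decode and the sample names are
    hypothetical:

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if every byte of data has a mapping in the encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical file names; two of them contain the byte 0xC3.
names = [b"b\xc3r", b"abc", b"b\xc3r", b"zebra"]

# 0xC3 is assigned in ISO 8859-1 but unassigned in ISO 8859-3.
defined_in_latin1 = can_decode(names[0], "iso8859_1")
defined_in_latin3 = can_decode(names[0], "iso8859_3")

# Without any interpretation, the byte strings still compare for equality
# and sort in a simple, stable byte order.
equal_names = names[0] == names[2]
ordered = sorted(set(names))
```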

    > Does this work with MBCS other than UTF-8? I know you said other MBCS,
    > like Shift-JIS, are not often used alongside other encodings except
    > ASCII, but we can't guarantee that since we're not in a perfect world.
    > :-) What if they were?
    I don't know whether they were, or how much. But I am assuming UTF-8 will
    be used alongside other encodings on a much larger scale. At least that's
    what we are hoping for, aren't we? Of course it would be even better if we
    were using only UTF-8 (or any other Unicode format), but the transition has
    to come first.

    > I fear Ken is not correct when he says you are not arguing for the
    > legalization of invalid UTF-8 sequences.
    I am arguing for a mechanism that allows processing invalid UTF-8 sequences,
    for those who need to do so. You can still think of them as invalid. Exactly
    what they will be called and to what extent they will be discouraged still
    needs to be investigated and defined.

    > This isn't about UTF-8 versus other encoding forms. UTF-8-based
    > programs will reject these invalid sequences because they don't map to
    > code points, and because they are supposed to reject them.
    The problem is, until now a text editor typically preserved all data if a
    file was opened and saved immediately. Even binary data. And the data could
    be interpreted as Latin 1, Latin 2, ... But you cannot interpret the data
    as UTF-8 and preserve all the data at the same time. Well, actually it is
    possible, which is exactly what I am saying is the advantage of UTF-8. But
    if you insist on validation, you break it. Fine, you get your Unicode world,
    and UTF-16 is then just as good as UTF-8. But you are now losing data where
    previously it wasn't lost. Well, you had better remember to put a disclaimer
    in your license agreement...
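
    The open-and-save behaviour can be sketched in Python. Note the assumption:
    Python's "surrogateescape" error handler maps undecodable bytes to lone
    surrogates (U+DC80..U+DCFF), not to PUA code points, so it is only an
    analogous mechanism and not the scheme discussed in this thread:

```python
# Every possible byte value, standing in for arbitrary (even binary) data.
data = bytes(range(256))

# Interpreted as Latin 1, open-and-save preserves every byte.
latin1_saved = data.decode("latin-1").encode("latin-1")

# Interpreted as strictly validated UTF-8, the file cannot even be opened.
try:
    data.decode("utf-8")
    strict_opens = True
except UnicodeDecodeError:
    strict_opens = False

# Replacing invalid bytes with U+FFFD opens the file but loses data on save.
replaced_saved = data.decode("utf-8", errors="replace").encode("utf-8")

# Escaping invalid bytes instead keeps them recoverable on save.
escaped_text = data.decode("utf-8", errors="surrogateescape")
escaped_saved = escaped_text.encode("utf-8", errors="surrogateescape")
```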

    > > Besides, surrogates are not completely interchangeable. Frankly, they
    > > are, but do not need to be, right?
    > They are not completely. In UTF-8 and UTF-32, they are not allowed at
    > all. In UTF-16, they may only occur in the proper context: a high
    > surrogate may only occur before a low surrogate, and a low surrogate may
    > only appear after a high surrogate. No other usage of surrogates is
    > permitted, because if unpaired surrogates could be interpreted, the
    > interpretation would be ambiguous.
    Well, yes, that's the theory. But as usual, I look at how things that are
    not yet defined actually behave. Following the conversion algorithms,
    unpaired surrogates convert pretty well. Unless they start to pair up, of
    course. But there are cases where one knows they cannot (no concatenation
    is done).
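
    A small Python sketch of the pairing-up problem, assuming the
    "surrogatepass" error handler as a stand-in for a converter that lets
    unpaired surrogates through. The standard forbids this, so it is for
    illustration only:

```python
high, low = "\ud800", "\udc00"  # one unpaired high and one unpaired low surrogate

# Converted separately, each unpaired surrogate becomes a 3-byte sequence
# and converts back to itself.
high_utf8 = high.encode("utf-8", errors="surrogatepass")  # b'\xed\xa0\x80'
low_utf8 = low.encode("utf-8", errors="surrogatepass")    # b'\xed\xb0\x80'
lone_roundtrip_ok = (
    high_utf8.decode("utf-8", errors="surrogatepass") == high
    and low_utf8.decode("utf-8", errors="surrogatepass") == low
)

# Concatenated as UTF-16 code units, the two pair up into U+10000 ...
units = (
    high.encode("utf-16-le", errors="surrogatepass")
    + low.encode("utf-16-le", errors="surrogatepass")
)
paired_text = units.decode("utf-16-le")
paired_utf8 = paired_text.encode("utf-8")  # b'\xf0\x90\x80\x80'

# ... so converting after concatenation no longer matches concatenating the
# separately converted pieces.
pieces_utf8 = high_utf8 + low_utf8
```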

    Let me bring up one issue again. I want to standardize a mechanism that
    allows a roundtrip for 8-bit data. And I already stated that by doing that,
    you lose the roundtrip for 16-bit data. Now I ask myself again, is that
    true? Yes and no. For the case I mentioned above (no concatenation),
    roundtrip is currently really possible. But generally speaking, it is not
    always possible. And last but not least, you don't even care for it, right?
    Good, because that means my proposal doesn't make anything worse.
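
    The trade-off can be shown with Python's "surrogateescape" handler, used
    here as an assumed analogue (it escapes to lone surrogates rather than PUA
    code points). The helpers to_text and to_bytes are hypothetical names:

```python
def to_text(data: bytes) -> str:
    """8-bit to code points, escaping bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="surrogateescape")

def to_bytes(text: str) -> bytes:
    """Code points to 8-bit, turning escape code points back into bytes."""
    return text.encode("utf-8", errors="surrogateescape")

# 8-bit roundtrip: a Latin 1 name that is invalid as UTF-8 survives intact.
raw = b"caf\xe9"
eight_bit_ok = to_bytes(to_text(raw)) == raw

# 16-bit roundtrip: text that already holds two escape code points in a row
# becomes the bytes C3 A9, which is valid UTF-8 and decodes to a single
# ordinary character instead of the original two code points.
text = "\udcc3\udca9"
back = to_text(to_bytes(text))
sixteen_bit_ok = back == text
```

    Here the 8-bit direction always round-trips, while the 16-bit direction
    fails exactly when escape code points line up into a valid sequence.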

    > I admit my error with regard to the handling of file names by Unix-style
    > file systems, and I appreciate being set straight.

    Sorry for rubbing it in, but... could it be that a lot of the conclusions
    you have about what Unicode should or should not be are also wrong, if they
    were based on such incorrect assumptions?

    > I think preserving transmission errors is carrying things too far. Your
    > Unix file system already doesn't guarantee that; if a byte gets changed
    > to 00 or 2F, you will have problems.
    Like this one. Transmission, disk, and memory errors (unless data is
    compressed) are typically 1-bit errors. And one case where things go really
    wrong doesn't invalidate the importance of the many cases where things
    remain within certain limits.

    > On the other hand, if the user is typing UTF-8 bytes directly into a
    > non-UTF-8-aware editor, then of course anything is possible. But that
    > seems like a bad way to live.
    On UNIX, files are also concatenated, and assembled in many other ways. By
    scripts, by the system... Again, eventually, it will all be UTF-8. But if
    there are problems in the transition period... hmmmm, who knows.

    > Now we're getting somewhere. We are no longer talking about a
    > mysterious, unknown encoding in arbitrary text, but about file names
    > known to be in Latin-1 instead of UTF-8. If the security risk is
    > determined to be low, you *may* be able to get away with interpreting
    > invalid UTF-8 as Latin-1. But in that case, the bytes need to be
    > converted to real Unicode characters in the range U+0080..U+00FF, not to
    > PUA characters, and they must not be written back as invalid UTF-8.
    No, that is not what I am talking about.
    * First, there were never any mysterious encodings. I was always referring
    to existing, well defined encodings (except when I was talking about
    transmission errors).
    * The 'unknown encoding' stood for the fact that there is no information
    about WHICH encoding was used. And in the above example, this encoding is
    Latin 1. You know it, I know it. We can see it. But the computer doesn't,
    because there is no information about it, because it is plain text.
    * The assumption is that most other data is already UTF-8 (or the user
    chose to set the locale to UTF-8 in order to start using it). Hence, the
    program will attempt to interpret the data as UTF-8, not Latin 1.
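
    A hedged Python sketch contrasting the two approaches for file names. The
    function decode_with_latin1_fallback models the quoted suggestion (invalid
    UTF-8 read as Latin 1 and never written back as invalid UTF-8); the escaping
    variant uses Python's "surrogateescape" handler as an assumed analogue of
    the proposed mechanism. All names and sample data are hypothetical:

```python
def decode_with_latin1_fallback(name: bytes) -> str:
    """Read a name as UTF-8; if that fails, read it as Latin 1 instead."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return name.decode("latin-1")

# One UTF-8 name and one legacy Latin 1 name on the same disk.
stored = [b"r\xc3\xa9sum\xc3\xa9.txt", b"caf\xe9.txt"]

# Fallback approach: names display nicely, but writing them back as valid
# UTF-8 changes the bytes of the legacy name, so it no longer matches the
# name stored on disk.
fallback_written = [decode_with_latin1_fallback(n).encode("utf-8") for n in stored]

# Escaping approach: the program still treats everything as UTF-8, and the
# bytes of both names come back unchanged.
escaped_written = [
    n.decode("utf-8", errors="surrogateescape").encode("utf-8", errors="surrogateescape")
    for n in stored
]
```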

    > Maybe not. I think your scheme involves converting invalid UTF-8 to PUA
    > code points in UTF-16, and back to invalid UTF-8. I'm saying the PUA
    > part is sensible, and the invalid-UTF-8 part is not. (I know... only if
    > I'm afraid to break some eggs...)
    Well, if the invalid UTF-8 part is not sensible, then why have the PUA part,
    right? But remember filesystems: are you saying that having UTF-8 filenames
    on the same disk as legacy-encoded filenames is not sensible? How do you
    suppose users will get from one state to the other? Or should they switch
    their locale each time they want to work with the other group of filenames?
    And what if they want to work with both at the same time? Backup?

    Well, you might as well force them to never mix the two. Let them buy a new
    machine for UTF-8 and keep the eggs intact.

    > > Was that sarcastic or.....
    > Yes, and I apologize for that.
    It's not about my feelings; it almost led to a misunderstanding.

    > But I disagree that following the
    > standard -- even if you think it is flawed -- constitutes a "serious
    > security issue" in my design.
    A security issue is a security issue. If it is a result of following a
    standard, this doesn't make it less of an issue. Rather more, I would say.

    > As a programmer, I will say:
    > - Validating conversion is *part* of supporting Unicode, not a frill.
    > - Validating conversion is one of the easiest parts of supporting
    > Unicode, not a major source of struggle.
    > - The standard is very clear about validation; there is no controversy
    > over where to start and where to end.
    > - Strict validation is required by the standard, and not that
    > difficult.
    > - Validation of conversion can be very efficient.
    > - Validation of conversion from a well-defined charset is
    > straightforward, and can easily be guaranteed.

    As a software architect, I will say:

    Validating the conversion, yes. But for UTF-8 programs to work like UTF-16
    programs, you need to validate data even when no conversion is done.
    By "where to start and where to end" I meant: how do you tell programmers
    WHERE to validate? Input? Yes. Output? Maybe. Oh, but what is input and what
    is output? Of the program? Of a function?
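
    One way to see why the placement of validation matters, as a minimal Python
    sketch: validity is not preserved by splitting or concatenating byte
    strings, so a check at one boundary says little about another. The helper
    is_valid_utf8 is a hypothetical name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Strict check: True only if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A valid two-byte sequence, split between two buffers (or two files that
# a script later concatenates).
whole = "\u00e9".encode("utf-8")  # b'\xc3\xa9'
first, second = whole[:1], whole[1:]

valid_whole = is_valid_utf8(whole)
valid_first = is_valid_utf8(first)     # fails a per-function check
valid_second = is_valid_utf8(second)   # fails a per-function check
valid_joined = is_valid_utf8(first + second)
```

    A function that validates each buffer would reject both halves, while a
    program that validates only its final output would accept the result.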

    > > Can you, please, provide a description specific problems with my
    > > design? I mean other than that it violates certain rules, clauses or
    > > whatever.
    > Well, there's that. That's not trivial, is it?
    I know it isn't. That's why I am addressing this mailing list.

    > Why don't you write a proposal for this to the UTC? They may be able to
    > provide you with a more satisfactory answer than I can. Be sure to be
    > thorough in describing what you want.
    Maybe I will. But as long as the general opinion is that this is complete
    nonsense and has been dealt with before, I don't stand a chance, do I?

