RE: Roundtripping in Unicode

From: Lars Kristan (lars.kristan@hermes.si)
Date: Mon Dec 13 2004 - 10:00:53 CST

    Marcin 'Qrczak' Kowalczyk wrote:
    > UTF-8 is painful to process in the first place. You are making it
    > even harder by demanding that all functions which process UTF-8 do
    > something sensible for bytes which don't form valid UTF-8. They even
    > can't temporarily convert it to UTF-32 for internal processing for
    > convenience.
    My point exactly. I am proposing to provide a conversion so that you can.
    All you need is to assign 128 codepoints and define their properties. They
    would be printable characters, non-spaces, would have no upper/lower case
    properties, would collate (for example) after all letters but before any
    special characters, and so on. Then you don't need to fix anything in the
    functions. You just need to convert (and even convert from a byte stream to
    UTF-8) on boundaries where you expect such data, and decide whether you need
    to prevent anything for security reasons. If not, then you're done.
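
    To make the conversion concrete, here is a rough Python sketch. The block
    U+EE80..U+EEFF, the handler name "escape128" and the function names
    bytes_to_text / text_to_bytes are assumptions made up for illustration; the
    proposal does not say where the 128 codepoints would live. The sketch also
    does not address input that already contains those codepoints, which would
    not roundtrip.

```python
import codecs

# Hypothetical block of 128 code points standing in for bytes 0x80..0xFF.
# The private-use range U+EE80..U+EEFF is an assumption, not part of the proposal.
ESCAPE_BASE = 0xEE80


def _escape_invalid_byte(err):
    """Decoder error handler: escape one invalid byte, then resume after it."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start]          # first byte of the invalid sequence
    return chr(ESCAPE_BASE + bad - 0x80), err.start + 1


codecs.register_error("escape128", _escape_invalid_byte)


def bytes_to_text(raw: bytes) -> str:
    """Decode UTF-8, mapping each invalid byte to one escape code point."""
    return raw.decode("utf-8", errors="escape128")


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text: escape code points become the original bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE <= cp < ESCAPE_BASE + 128:
            out.append(cp - ESCAPE_BASE + 0x80)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# A Latin-1 file name seen in a UTF-8 locale: 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"
text = bytes_to_text(raw)
assert text_to_bytes(text) == raw        # the byte stream roundtrips
```

    Valid UTF-8 passes through untouched; only the bytes that fail to decode
    are replaced, one codepoint per byte.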

    So, no, I am not demanding that UTF-8 functions behave differently.
    Existing functions work perfectly well, assuming you convert to UTF-8 (so,
    use three bytes to represent each invalid byte as a valid codepoint). It
    would be beneficial if they handled invalid bytes directly, but that is a
    separate issue. It would need to be determined which functions could do so.
    Maybe all could, maybe only some could, maybe none should. It needs to be
    investigated before anything is changed. This is in line with what I said
    about validation. Processing functions may do validation implicitly, but
    this is not a requirement unless you make it so. In my opinion, it is
    better to separate validation from processing. In that case you can even
    prescribe exactly what processing functions should do with invalid data:
    exactly what they would do if the data were converted to UTF-8 according
    to my conversion. But again, this is the next step, which needn't be done
    at all.
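
    As a small illustration of the "three bytes" point, assuming (purely for
    the example) that the 128 codepoints sit at U+EE80..U+EEFF in the BMP, an
    escaped byte becomes an ordinary three-byte UTF-8 sequence that a strict,
    unmodified decoder accepts. The helper name escape_code_point is
    hypothetical.

```python
# Hypothetical base of the 128 escape code points (an assumption for illustration).
ESCAPE_BASE = 0xEE80


def escape_code_point(byte: int) -> str:
    """Code point standing in for one invalid byte in the range 0x80..0xFF."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF need escaping")
    return chr(ESCAPE_BASE + byte - 0x80)


# "caf" + invalid byte 0xE9 + ".txt", with the invalid byte already escaped.
name = "caf" + escape_code_point(0xE9) + ".txt"
utf8 = name.encode("utf-8")              # strictly valid UTF-8

# Each escaped byte costs three bytes in UTF-8, as any U+0800..U+FFFF code point does.
assert len(escape_code_point(0xE9).encode("utf-8")) == 3

# Unmodified, strict functions accept the result and process it as usual.
assert utf8.decode("utf-8") == name
stem, ext = name.rsplit(".", 1)
```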

    >
    > > Listing files in a directory should not signal anything. It MUST
    > > return all files and it should also return them in a way that this
    > > list can be used to access each of the files.
    >
    > Which implies that they can't be interpreted as UTF-8.
    >
    > By masking an error you are not encouraging users to fix it.
    > Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
    Failure to process such files is also an error. Think of virus scanners
    and backup tools.
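
    For comparison, here is a sketch of a scanner-style loop in Python. It
    relies on Python's own mechanism, not on the 128 codepoints proposed
    above: on POSIX systems os.listdir with a str path escapes undecodable
    bytes using the "surrogateescape" error handler (lone surrogates
    U+DC80..U+DCFF), and open() maps them back to the original bytes. The
    function name read_all_files is hypothetical.

```python
import os


def read_all_files(directory: str) -> dict:
    """Return {name: size in bytes} for every regular file in `directory`,
    including files whose names are not valid UTF-8."""
    sizes = {}
    for name in os.listdir(directory):       # undecodable bytes arrive escaped
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:          # the escaped name reaches the same file
            sizes[name] = len(f.read())
    return sizes
```

    Every entry is returned, and every returned name can be used to open its
    file, which is the property a backup tool or virus scanner needs.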

    > > The interesting thing is that if you do start using my conversion,
    > > you can actually get rid of the need to validate UTF-8 strings
    > > in the first scenario. That of course means you will allow users
    > > with invalid UTF-8 sequences, but if one determines that this is
    > > acceptable (or even desired), then it makes things easier. But the
    > > choice is yours.
    >
    > For me it's not acceptable, so I will not support declaring it valid.
    As I said, the choice is yours. My proposal does not prevent you from doing it
    your way. You don't need to change anything and it will still work the way
    it worked before. OK? I just want 128 codepoints so I can make my own
    choice. And once and for all, you can treat those 128 codepoints just as you
    do today.

    Lars


