Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk
Date: Mon Dec 13 2004 - 08:59:21 CST

    Lars Kristan writes:

    > But, as I once already said, you can do it with UTF-8, you simply
    > keep the invalid sequences as they are, and really handle them
    > differently only when you actually process them or display them.

    UTF-8 is painful to process in the first place. You are making it
    even harder by demanding that all functions which process UTF-8 do
    something sensible for bytes which don't form valid UTF-8. They can't
    even temporarily convert it to UTF-32 for internal processing for
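
    A minimal Python sketch of the conversion problem, assuming a strict
    decoder: turning UTF-8 into a list of code points (roughly what a
    UTF-32 buffer holds) works only for well-formed input, so a function
    that must tolerate invalid bytes cannot take this route. The helper
    name to_code_points is hypothetical.

```python
def to_code_points(data: bytes) -> list:
    """Strictly decode UTF-8 and return the code points as integers."""
    # bytes.decode defaults to errors="strict", so ill-formed input raises.
    return [ord(ch) for ch in data.decode("utf-8")]


valid = "zażółć".encode("utf-8")
invalid = b"abc\xff"  # 0xFF never occurs in well-formed UTF-8

print(to_code_points(valid))

try:
    to_code_points(invalid)
except UnicodeDecodeError as exc:
    # There is no code point to assign to the stray byte.
    print("cannot convert:", exc.reason)
```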

    > Listing files in a directory should not signal anything. It MUST
    > return all files and it should also return them in a way that this
    > list can be used to access each of the files.

    Which implies that they can't be interpreted as UTF-8.

    By masking an error you are not encouraging users to fix it.
    Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
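
    As an illustration only: if a directory listing has to return every
    name in a form that can be used to open the file, the names have to
    be handled as raw bytes, and checking whether they are UTF-8 becomes
    a separate step. The sketch below uses Python's os.listdir with a
    bytes argument, which returns bytes names; the helper names
    is_valid_utf8 and classify_entries are hypothetical.

```python
import os


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def classify_entries(directory: bytes = b"."):
    """Return (valid, invalid) lists of raw byte filenames in directory."""
    valid, invalid = [], []
    for name in os.listdir(directory):  # bytes in, bytes out
        (valid if is_valid_utf8(name) else invalid).append(name)
    return sorted(valid), sorted(invalid)


ok, bad = classify_entries(b".")
print(len(ok), "names are valid UTF-8;", len(bad), "are not")
```

    Every name in either list can still be passed back to open() or
    os.stat() as bytes; only the names in the first list can be shown
    as text without some further convention.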

    > Let's start with UTF-8 usernames. This is a likely scenario, since I
    > think UTF-8 will typically be used in network communication. If you
    > store the usernames in UTF-16, the conversion will signal an error
    > and you will not have any users with invalid UTF-8 sequences nor
    > will any invalid sequence be able to match any user. If you later on
    > start comparing users somewhere else, in UTF-8, then you must not
    > only strcmp them, but also validate each string. This is just a fact
    > and I am not complaining about it.

    If usernames are supposed to be UTF-8, and in fact they are not,
    then it's normal that some software will signal an error instead
    of processing them. The proper way is to fix the username database,
    not to change programs.
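
    A small sketch of the "validate, then compare" rule described in the
    quoted text, under the assumption that usernames arrive as raw UTF-8
    bytes. The function names are hypothetical; the byte comparison
    plays the role of strcmp.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def same_user(stored: bytes, presented: bytes) -> bool:
    """Match only if both names are valid UTF-8 and byte-identical."""
    if not (is_valid_utf8(stored) and is_valid_utf8(presented)):
        # An ill-formed name is treated as an error, never as a match.
        return False
    return stored == presented


print(same_user("łukasz".encode("utf-8"), "łukasz".encode("utf-8")))
print(same_user(b"ann\xff", b"ann\xff"))
```

    With this rule an invalid sequence can never match any user, which
    is the same outcome as storing the names in UTF-16 after a strict
    conversion.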

    > The interesting thing is that if you do start using my conversion,
    > you can actually get rid of the need to validate UTF-8 strings
    > in the first scenario. That of course means you will allow users
    > with invalid UTF-8 sequences, but if one determines that this is
    > acceptable (or even desired), then it makes things easier. But the
    > choice is yours.

    For me it's not acceptable, so I will not support declaring it valid.

       __("<         Marcin Kowalczyk