Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Mon Dec 06 2004 - 12:35:15 CST


    Lars Kristan <lars.kristan@hermes.si> writes:

    >> This is simply what you have to do. You cannot convert the data
    >> into Unicode in a way that says "I don't know how to convert this
    >> data into Unicode." You must either convert it properly, or leave
    >> the data in its original encoding (properly marked, preferably).
    >
    > Here lies the problem. Suppose you have a document in UTF-8, which
    > somehow got corrupted and now contains a single invalid sequence.
    > Are you proposing that this document needs to be stored separately?

    He is not proposing that.

    > Everything else in the database would be stored in UTF-16, but now
    > one must add the capability to store this document separately.

    No, it can be stored in UTF-16 or whatever else is used. Except for the
    corrupted part, of course; but it's corrupted, and thus useless, so it
    doesn't matter what happens to it.

    > Now suppose you have a UNIX filesystem, containing filenames in a legacy
    > encoding (possibly even more than one). If one wants to switch to UTF-8
    > filenames, what is one supposed to do? Convert all filenames to UTF-8?

    Yes.
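
    Roughly like this, for example - a sketch in Python, assuming the
    legacy encoding is known; ISO-8859-2 and the path are arbitrary
    example choices of mine, not anything Lars specified:

        import os

        LEGACY = 'iso-8859-2'   # assumed legacy encoding of the old filenames

        def convert_tree(root: bytes) -> None:
            # Walk bottom-up so that directories are renamed only after
            # their contents have been handled.
            for dirpath, dirnames, filenames in os.walk(root, topdown=False):
                for name in filenames + dirnames:
                    try:
                        name.decode('utf-8')
                        continue             # already valid UTF-8 (or plain ASCII)
                    except UnicodeDecodeError:
                        pass
                    new_name = name.decode(LEGACY).encode('utf-8')
                    os.rename(os.path.join(dirpath, name),
                              os.path.join(dirpath, new_name))

        # convert_tree(b'/home')   # run on the real tree once it's been tested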

    > Who will do that?

    A system administrator (because he has access to all files).

    > And when?

    When the owners of the computer system decide to switch to UTF-8.

    > Will all users agree?

    It depends on who decides about such things. Either they don't have a
    voice, or they agree and the change is made, or they don't agree and
    the change is not made. What's the point?

    > Should all filenames that do not conform to UTF-8 be declared invalid?

    What do you mean by "invalid"? They are valid from the point of view
    of the OS, but they will not work with reasonable applications which
    use Unicode internally.

    > If you keep all processing in UTF-8, then this is a decision you can
    > postpone.

    You mean, various programs will break at various points in time,
    instead of working correctly from the beginning?

    If it's broken, fix it, instead of applying patches which sometimes
    hide the fact that it's broken and sometimes don't.

    > I didn't encourage users to mix UTF-8 filenames and Latin 1 filenames.
    > Do you want to discourage them?

    Mixing any two incompatible filename encodings on the same file system
    is a bad idea.

    > IMHO, preserving data is more important, but so far it seems it is
    > not a goal at all. With a simple argument - that Unicode only
    > defines how to process Unicode data. Understandably so, but this
    > doesn't mean it needs to remain so.

    If you don't know the encoding and want to preserve the values of
    bytes, then don't convert it to Unicode.

    > Well, you may have a wrong assumption here. You probably think that
    > I convert invalid sequences into PUA characters and keep them as
    > such in UTF-8. That is not the case. Any invalid sequences in UTF-8
    > are left as they are. If they need to be converted to UTF-16, then
    > PUA is used. If they are then converted to UTF-8, they are converted
    > back to their original bytes, hence the incorrect sequences are
    > re-created.

    This does not make sense. If you want to preserve the bytes instead
    of working in terms of characters, don't convert it at all - keep the
    original byte stream.
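
    For reference, the round trip you describe amounts to something like
    this (a sketch in Python; the choice of U+E000..U+E0FF as the escape
    range is my assumption - you didn't name one):

        PUA_BASE = 0xE000   # assumed escape range U+E000..U+E0FF

        def decode_with_escapes(data: bytes) -> str:
            """Decode UTF-8, mapping each undecodable byte to PUA_BASE + byte."""
            out, i = [], 0
            while i < len(data):
                # Try the longest decodable chunk of at most 4 bytes at i.
                for j in range(min(len(data), i + 4), i, -1):
                    try:
                        out.append(data[i:j].decode('utf-8'))
                        i = j
                        break
                    except UnicodeDecodeError:
                        pass
                else:
                    out.append(chr(PUA_BASE + data[i]))   # escape one invalid byte
                    i += 1
            return ''.join(out)

        def encode_with_escapes(text: str) -> bytes:
            """Re-create the original bytes, turning PUA escapes back into bytes."""
            out = bytearray()
            for ch in text:
                if PUA_BASE <= ord(ch) <= PUA_BASE + 0xFF:
                    out.append(ord(ch) - PUA_BASE)
                else:
                    out += ch.encode('utf-8')
            return bytes(out)

        assert encode_with_escapes(decode_with_escapes(b'ok \xff\xfe!')) == b'ok \xff\xfe!'

    Even as a sketch it shows the catch: text which legitimately contains
    U+E000..U+E0FF can no longer be told apart from escaped garbage, so
    the round trip is only lossless for data which never uses that part
    of the PUA.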

    > One more example of data loss that arises from your approach: if a
    > single bit is changed in UTF-16 or UTF-32, that is all that will
    > happen (in more than 99% of the cases). If a single bit changes in
    > UTF-8, you risk that the entire character will be dropped or
    > replaced with U+FFFD. But, funnily enough, only if it ever gets
    > converted to UTF-16 or UTF-32. Not that this is a major problem on
    > its own, but it indicates that there is something fishy in there.

    If you change one bit in a file compressed by gzip, you might not be
    able to recover any part of it. What's the point?

    The UTF-x encodings were not designed to minimize the impact of
    corruption of the encoded bytes. If you want to preserve the text
    despite occasional corruption, use a higher-level protocol for that
    (if I remember correctly, RAR can add additional information to an
    archive which allows the data to be recovered even if parts of the
    archive, entire blocks, have been lost).
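
    For what it's worth, the effect you describe looks roughly like this
    (a sketch; the sample string and the bits being flipped are arbitrary
    choices of mine):

        def flip_bit(data: bytes, index: int, mask: int) -> bytes:
            return data[:index] + bytes([data[index] ^ mask]) + data[index + 1:]

        text = 'héllo'

        utf8 = text.encode('utf-8')                    # b'h\xc3\xa9llo'
        bad8 = flip_bit(utf8, 1, 0x40)                 # damage the lead byte of the 'é'
        print(bad8.decode('utf-8', errors='replace'))  # the 'é' is lost, replaced by U+FFFD

        utf16 = text.encode('utf-16-le')
        bad16 = flip_bit(utf16, 2, 0x40)               # damage the code unit of the 'é'
        print(bad16.decode('utf-16-le'))               # 'h©llo': one character changed, rest intact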

    > There was a discussion on nul characters not so long ago. Many text
    > editors do not properly preserve nul characters in text files.
    > But it is definitely a nice thing if they do. While preserving nul
    > characters only has a limited value, preserving invalid sequences
    > in text files could be crucial.

    An editor should alert the user that the file is not in the expected
    encoding, or that it's corrupted, instead of trying to guess which
    characters were supposed to be there.

    If it's supposed to edit binary files too, it should work on the bytes
    instead of decoded characters.
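
    Schematically, something like this - the editor and the function are
    hypothetical, the policy is simply "refuse to guess":

        def load_text_file(path: str, encoding: str = 'utf-8') -> str:
            """Hypothetical editor load routine: alert instead of guessing."""
            with open(path, 'rb') as f:
                data = f.read()
            try:
                return data.decode(encoding)
            except UnicodeDecodeError as e:
                # Tell the user the file is not valid in the declared encoding
                # (or is corrupted) rather than silently substituting characters;
                # they can reopen it as raw bytes if that is what they want.
                raise ValueError(f'{path}: not valid {encoding} at byte offset '
                                 f'{e.start}; open it in binary mode instead') from e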

    > A UTF-8 based editor can easily do this. A UTF-16 based editor
    > cannot do it at all. If you say that UTF-16 is not intended for such
    > a purpose, then so be it. But this also means that UTF-8 is superior.

    It's much easier with CP-1252, which shows that it's superior to UTF-8
    :-)

    > Yes, it is not related much. Except for the fact that I was trying to
    > see if UTF-32 is needed at all. If one can do everything in UTF-8,

    UTF-8 is poorly suited for internal processing of strings in a
    modern programming language (i.e. one which doesn't already have a
    pile of legacy functions working on bytes, but which can be designed
    to make Unicode convenient from the start). This is because code
    points have variable lengths in bytes, so extracting individual
    characters is almost meaningless (unless you care only about the
    ASCII subset, and sequences of all other characters are treated as
    uninterpreted bags of bytes). You can't even have a correct
    equivalent of C's isspace().
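
    For example (a sketch; the point is that no classification indexed by
    a single byte can be correct once characters take more than one byte):

        s = 'a\u00a0b'                 # 'a', NO-BREAK SPACE, 'b'
        raw = s.encode('utf-8')        # b'a\xc2\xa0b': the space is now two bytes

        # A byte-at-a-time "isspace" sees 0xC2 and 0xA0 separately.  Neither
        # byte by itself identifies a space, and 0xA0 also appears as the
        # trailing byte of unrelated characters (U+07E0 encodes as 0xDF 0xA0),
        # so no 256-entry table indexed by one byte can ever get this right.
        print([hex(b) for b in raw])           # ['0x61', '0xc2', '0xa0', '0x62']

        # On decoded code points the question is trivial:
        print([ch.isspace() for ch in s])      # [False, True, False]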

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

