RE: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 09:40:39 CST


    Antoine Leca wrote:

    > But it is when you try to port it outside that problems
    > surge. Particularly
    > when the same file is written "w" and read "rb", by the same
    > set of programs
    > (yes, it happens).

    How about using _fmode? Whoever can't display the result (Notepad?) should
    be blamed. If you can't read others' files, you should consider fixing that
    anyway, since you may well encounter them on UNIX. But, yes, I know it's not
    always as simple as that. Still, if most of those files are internal, you
    might get results quicker.
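    For what it's worth, a minimal sketch of what I mean, assuming the Microsoft
    CRT (where _fmode is the global default-mode variable and _O_BINARY comes
    from <fcntl.h>):

        /* Minimal sketch: make "w"/"r" behave like "wb"/"rb" on the MSVC CRT,
         * so a file written in text mode and read back in binary mode (or the
         * other way around) is not mangled by CR/LF translation. */
        #include <stdlib.h>
        #include <fcntl.h>
        #include <stdio.h>

        int main(void)
        {
            _fmode = _O_BINARY;              /* default mode for all later fopen() calls */

            FILE *f = fopen("out.txt", "w"); /* now effectively "wb": no CR inserted */
            if (f) {
                fputs("line one\n", f);      /* written as a bare LF */
                fclose(f);
            }
            return 0;
        }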

    > Windows is probably a necessary evil, but this is not an
    > excuse to take the
    > short track and do the things in a way that only raise problems to the
    > others.

    UNIX causes problems for Windows, and Windows causes problems for UNIX. And
    the short track is the quickest. You can choose the other track; it
    typically takes longer, maybe because it swaps.

    But on a more serious note - defining BOM to be quasi harmless so it can
    sometimes be considered to be part of the text is also the short track.
    Those who use it that way are doing things in a way that only raises
    problems to the others.

    UNIX gracefully processes binary data as text. Some call it processing
    garbage. Well, as UTF-8 prevails, the garbage will gradually disappear. I
    think it is not that bad to process garbage for a while - better than making
    the effort to clean and label all of it, only to discard most of it soon
    thereafter.

    Windows chooses the BOM. If UNIX data is garbage, the BOM is clutter, just
    like the CRs. Processing clutter is not cheap. Hence, Windows swaps.
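    To be concrete about the cost, here is roughly what a UNIX tool ends up
    doing to cope with Windows-produced UTF-8 text (just a rough sketch, nothing
    authoritative):

        /* Rough sketch: strip the clutter a Windows-produced UTF-8 text file
         * may carry - a leading EF BB BF signature and a CR before each LF -
         * before handing the line to code that expects plain UNIX text. */
        #include <stdio.h>
        #include <string.h>

        static void strip_bom(char *line)
        {
            if ((unsigned char)line[0] == 0xEF &&
                (unsigned char)line[1] == 0xBB &&
                (unsigned char)line[2] == 0xBF)
                memmove(line, line + 3, strlen(line + 3) + 1);
        }

        static void strip_cr(char *line)
        {
            size_t len = strlen(line);
            if (len >= 2 && line[len - 2] == '\r' && line[len - 1] == '\n') {
                line[len - 2] = '\n';
                line[len - 1] = '\0';
            }
        }

        int main(void)
        {
            char line[4096];
            int first = 1;

            while (fgets(line, sizeof line, stdin)) {
                if (first) { strip_bom(line); first = 0; }
                strip_cr(line);
                fputs(line, stdout);
            }
            return 0;
        }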

    > Of course, this is not against you, there is nothing personnal here.

    I wouldn't expect otherwise. I suppose it must look like I am a UNIX fan.
    Well, I just sit in between and get the problems of both. And it looks like
    it's UNIX that needs to be able to process invalid sequences. Actually,
    Windows needs it more.

    > > BTW (yes, again and again): This is something Windows is not able
    > > to achieve.
    >
    > What do you mean (simple curiosity, I did not get your point)?
    > Yes Windows does extravagant contortions about codepages with
    > filenames (and
    > this is disappearing, fortunately), but that should be
    > irrelevant. Yes, C
    > programs on Windows "eat" CR.
    > But I fail to see significant examples beyond those (I
    > consider the "feature"
    > of BOM at the beginning of Notepad/RichEdit/whatever UTF-8
    > files to be an
    > outright bug that had been missed in due time and is now
    > entrenched).

    Not a bug. Windows relies on it. At least in Notepad. But it's true that it
    is sometimes also emitted when it shouldn't be. That is because it is hard
    to determine when to do what - hard even to define, let alone determine at
    run time.

    As to what I meant - well, the answer should be in the other posts. In short,
    Windows is not able to process data in an unknown encoding, not without a
    risk of losing data. And it typically loses it silently, yielding incorrect
    results.
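    For example (a hedged sketch using MultiByteToWideChar; exactly what happens
    to the bad byte without MB_ERR_INVALID_CHARS - dropped or turned into U+FFFD
    - depends on the Windows version):

        /* Sketch: converting bytes of unknown provenance as if they were UTF-8.
         * Without MB_ERR_INVALID_CHARS the call succeeds and the invalid byte
         * is silently dropped or replaced (version-dependent); with the flag
         * the call fails and GetLastError() reports the problem, so at least
         * you know something was wrong. */
        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            const char input[] = "abc\x80" "def";   /* 0x80 is not valid UTF-8 here */
            wchar_t out[32];

            int n = MultiByteToWideChar(CP_UTF8, 0, input, -1, out, 32);
            printf("lenient: %d wide chars, no error reported\n", n);

            n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, input, -1, out, 32);
            if (n == 0)
                printf("strict: failed, GetLastError() = %lu\n", GetLastError());
            return 0;
        }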

    > Sorry: you mean, a Unicode application is able to "absorb" any stream
    > (including erroneous encodings) when programmed in UTF-8
    > while unable when
    > programmed in UTF-16?
    > I would have expected just the reverse (because of the
    > requirements for the
    > illegality of the overlong encodings).

    Interesting choice of words. Absorb. If the application only consumes
    (absorbs) data, then dropping unrecognised data (invalid sequences or
    unassigned values in legacy encodings) often doesn't matter.

    But what if you need to change the data and pass it on? Then dropping data
    is a problem, even more so when it is done silently. And refusing the stream
    is not always an answer, especially if you have already processed and passed
    on half of it.

    Now an addition to the first scenario, absorption. Dropping data is sometimes
    a problem there too. OK, Windows has a habit of really dropping invalid
    sequences, meaning even a simple word count might fail. Next, you fix it and
    replace invalid sequences with 0xFFFD. Word count now works (at least on
    reasonable data, which is what I need). What if I then choose to search for
    0xFFFD? And mean 0xFFFD?! It's close to searching for a '?', except I have
    no way of saying that I do not want it to be treated as a wildcard.
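    In code the trap looks innocent enough. A toy illustration (the "decoder"
    here is a deliberately crude stand-in - ASCII passes through, everything
    else becomes U+FFFD - not a real UTF-8 decoder):

        /* Toy illustration: once the substitution is done, the original byte
         * is gone, and a later search for U+FFFD cannot say whether it found
         * a character the author really wrote or a placeholder for garbage. */
        #include <stdio.h>
        #include <stddef.h>

        #define REPLACEMENT 0xFFFDu

        /* Crude stand-in for "decode UTF-8, substituting U+FFFD for invalid
         * sequences": ASCII passes through, any other byte is replaced. */
        static size_t decode_lossy(const unsigned char *in, size_t n, unsigned *out)
        {
            size_t i;
            for (i = 0; i < n; i++)
                out[i] = (in[i] < 0x80) ? in[i] : REPLACEMENT;  /* byte value lost here */
            return n;
        }

        int main(void)
        {
            const unsigned char bytes[] = { 'w', 'o', 'r', 0x80, 'd' };  /* 0x80: invalid */
            unsigned cps[sizeof bytes];
            size_t i, n = decode_lossy(bytes, sizeof bytes, cps);

            for (i = 0; i < n; i++)
                if (cps[i] == REPLACEMENT)
                    printf("U+FFFD at index %zu - intended, or replaced garbage?\n", i);
            return 0;
        }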

    UTF-16 is very good at absorbing. Too good. But it's not its fault, it's the
    fault of the UTF-8 to UTF-16 conversion. UTF-8 is good at preserving.

    Of course, none of that holds if you consider both applications to be
    conformant. If you think only of conformant applications and what happens
    when you feed them non-conformant data, then it is indeed easier to implement
    validation in a UTF-16 application than in a UTF-8 application. The problem
    is that UTF-8 to UTF-16 conversion implies partial validation, while UTF-16
    to UTF-8 conversion does not necessarily validate anything.
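    The asymmetry is easy to see in a sketch (not anybody's production
    converter): a UTF-8 reader has to branch on byte patterns, so malformed
    input forces a decision, while a naive UTF-16 writer will happily encode an
    unpaired surrogate such as 0xD800 and emit bytes (ED A0 80) that are not
    valid UTF-8, without ever noticing:

        /* Naive: encodes any single UTF-16 code unit (BMP-style, three bytes
         * at most), with no surrogate check at all. */
        #include <stdio.h>

        static int encode_unit(unsigned u, unsigned char out[3])
        {
            if (u < 0x80)  { out[0] = (unsigned char)u; return 1; }
            if (u < 0x800) {
                out[0] = (unsigned char)(0xC0 | (u >> 6));
                out[1] = (unsigned char)(0x80 | (u & 0x3F));
                return 2;
            }
            out[0] = (unsigned char)(0xE0 | (u >> 12));
            out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (u & 0x3F));
            return 3;
        }

        int main(void)
        {
            unsigned char buf[3];
            int i, n = encode_unit(0xD800, buf);   /* unpaired surrogate sails through */

            for (i = 0; i < n; i++)
                printf("%02X ", buf[i]);
            printf("\n");                          /* prints: ED A0 80 */
            return 0;
        }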

    Lars


