RE: Subject: Re: 32'nd bit & UTF-8

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 04:53:34 CST

    > I wonder if this is all a bit of a storm in a teacup. When will the
    > problem actually occur? It seems to be restricted to UTF-8 files
    > generated by Windows and perhaps some other systems and read by Unix
    > and perhaps some other systems. I really don't see how BOMs will end
    > up in filenames - or does Windows put BOMs in filenames?

    Here is how it can happen:

    Suppose you want to convert some filenames to UTF-8.
    You use ls to generate a list of the files and save that list to a file.
    Then you open the list in Notepad and save it as another file in UTF-8.
    You then use a script that takes the first list and renames each file to
    the name specified on the same line of the second list. The first file
    will get a BOM at the start of its new name.
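
    A minimal Python sketch of the scenario above. The file names and the
    helper names are hypothetical, and the bytes stand in for what a
    BOM-emitting editor would save. It only shows how a reader that does not
    consume the BOM ends up with U+FEFF in the first name, and how a
    BOM-aware reader avoids it:

```python
import codecs


def read_names_naive(data: bytes) -> list[str]:
    """Decode a UTF-8 name list the way a BOM-unaware tool would."""
    return data.decode("utf-8").splitlines()


def read_names_bom_aware(data: bytes) -> list[str]:
    """Decode the same list, dropping one leading BOM if it is present."""
    return data.decode("utf-8-sig").splitlines()


# Hypothetical original names, as listed by ls.
old_names = ["first.txt", "second.txt"]

# The same list after being re-saved as UTF-8 with a leading BOM.
saved = codecs.BOM_UTF8 + "first.txt\nsecond.txt\n".encode("utf-8")

naive = read_names_naive(saved)
aware = read_names_bom_aware(saved)

# A rename script would pair old and new names line by line.
rename_plan = list(zip(old_names, naive))
print(rename_plan)
```

    With the naive reader, only the first target name differs from its
    source, which is why the damage is easy to miss.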

    This is just a stupid example, but you can think of a number of scenarios
    where the same thing would happen, especially if other tools start emitting
    BOMs while you keep using older tools that don't consume them.

    Now, you would think that this only happens if you mix UNIX and Windows, or
    if you introduce BOM emitting tools to UNIX. But it also happens on Windows
    alone. Not everything is in Unicode, and not all tools consume or tolerate
    a BOM. In particular, stdin and stdout are still 8-bit, in the ANSI code
    page (ACP). cmd.exe will not recognise Notepad's "text documents" in
    UTF-8. And this is not as easy
    to fix as one would think. The best solution I've come up with involves
    proper handling of invalid sequences. It is not only UNIX that can benefit
    from it, Windows can too.
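
    As an illustration of what lossless handling of invalid sequences can
    look like, here is a sketch using Python's "surrogateescape" error
    handler. This is an existing mechanism in Python, not necessarily the
    solution meant above; the sample bytes and function names are
    hypothetical. Each undecodable byte is mapped to a reserved code point,
    so the original bytes can be recovered exactly:

```python
def decode_lossless(raw: bytes) -> str:
    """Decode as UTF-8, keeping invalid bytes as escape code points."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Reverse decode_lossless, restoring the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical legacy name: 0xE9 is Latin-1 e-acute, not valid UTF-8 here.
legacy = b"caf\xe9.txt"

text = decode_lossless(legacy)
round_tripped = encode_lossless(text)

print(ascii(text))
print(round_tripped == legacy)
```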

    As for whether Windows puts BOMs in filenames - of course I did not mean it
    just does that all the time. But it can happen. Now, I already suggested
    that the BOM should really be a noncharacter. Then Windows should NOT allow
    creation of such filenames. But, hell, it surely allows unpaired surrogates
    (Windows is still pretty much UCS-2). And it also allows U+FFFF. Well, it
    looks like filenames on Windows are not really text, they are binary data.
    Not that I believe that, but I've been told to process UNIX filenames as
    binary data. Guess the same is then true for Windows filenames. Nice.
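
    A small Python sketch of the checks implied above, with hypothetical
    helper names. It flags names that contain an unpaired surrogate (which
    cannot be encoded as strict UTF-8) or a Unicode noncharacter such as
    U+FFFF. Note that U+FEFF itself is not a noncharacter today, which is
    the point of the suggestion above:

```python
def has_lone_surrogate(name: str) -> bool:
    """True if the name cannot be encoded as strict UTF-8."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def has_noncharacter(name: str) -> bool:
    """True if any code point in the name is a noncharacter."""
    return any(is_noncharacter(ord(ch)) for ch in name)


# Hypothetical names of the kind a UCS-2 style filesystem would accept.
print(has_lone_surrogate("bad\ud800name"))
print(has_noncharacter("bad\uffffname"))
print(has_noncharacter("\ufeffname"))
```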

    Lars



    This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 04:54:26 CST