RE: Subject: Re: 32'nd bit & UTF-8

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 04:53:34 CST

Next message: Lars Kristan: "RE: Conformance (was UTF, BOM, etc)"

Previous message: Christopher Fynn: "Re: Conformance (was UTF, BOM, etc)"
Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"
Next in thread: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"

> I wonder if this is all a bit of a storm in a teacup. When will the
> problem actually occur? It seems to be restricted to UTF-8 files
> generated by Windows and perhaps some other systems and read by Unix
> and perhaps some other systems. I really don't see how BOMs will end
> up in filenames - or does Windows put BOMs in filenames?

Here is how it can happen:

Suppose you want to convert some filenames to UTF-8.
You use ls to generate a list of the files and save it to a file. Then you
open that file in Notepad and save it as a second file in UTF-8. Finally, you
run a script that takes each name in the first list and renames the file to
the corresponding name in the second list. Notepad writes a BOM at the start
of the second file, so the first file will get a BOM in its name.
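
A minimal sketch of that scenario in Python. The file contents and the
helper name read_names are made up for illustration; the only real pieces
are the standard library's codecs.BOM_UTF8 constant and the "utf-8-sig"
codec, which drops a single leading BOM when decoding.

```python
import codecs


def read_names(raw: bytes, strip_bom: bool) -> list[str]:
    """Decode a UTF-8 list of filenames, one name per line."""
    # "utf-8-sig" removes one leading BOM if present; plain "utf-8" keeps it
    # as U+FEFF at the start of the first line.
    encoding = "utf-8-sig" if strip_bom else "utf-8"
    return raw.decode(encoding).splitlines()


# What a BOM-emitting editor would save: the BOM, then the list itself.
saved = codecs.BOM_UTF8 + "alpha.txt\nbeta.txt\n".encode("utf-8")

naive = read_names(saved, strip_bom=False)    # tool that does not consume the BOM
careful = read_names(saved, strip_bom=True)   # tool that does

print([ascii(name) for name in naive])
print([ascii(name) for name in careful])
```

Only the first name is affected, which is why this kind of damage is easy
to miss when you glance at the result.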

This is just a stupid example. But you can think of a number of scenarios
where the same thing would happen, especially if other tools start emitting
BOMs while you keep using some older tools that don't consume them.

Now, you would think that this only happens if you mix UNIX and Windows, or
if you introduce BOM-emitting tools to UNIX. But it also happens on Windows
alone. Not everything is in Unicode, and not all tools consume or tolerate a
BOM. In particular, stdin and stdout are still 8-bit, in the ANSI code page
(ACP). cmd.exe will not recognise Notepad's "text documents" in UTF-8. And
this is not as easy to fix as one would think. The best solution I've come up
with involves proper handling of invalid sequences. It is not only UNIX that
can benefit from it; Windows can too.
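
To make "handling of invalid sequences" concrete, here is a sketch of one
existing approach, Python's "surrogateescape" error handler. It is shown
only as an illustration and is not necessarily the scheme meant above. The
helper names and the sample filename are assumptions. Each byte that is not
valid UTF-8 is mapped to a lone surrogate in U+DC80..U+DCFF, so the original
bytes can be recovered exactly.

```python
def decode_lossless(raw: bytes) -> str:
    """Decode UTF-8, keeping invalid bytes as escape code points."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: escape code points become the raw bytes again."""
    return text.encode("utf-8", errors="surrogateescape")


# A filename in an 8-bit legacy encoding: 0xE9 is not valid UTF-8 here.
legacy = b"caf\xe9.txt"
text = decode_lossless(legacy)
roundtrip = encode_lossless(text)

print(ascii(text))
print(roundtrip == legacy)
```

Valid UTF-8 input passes through unchanged, so the same code path can be
used for both well-formed and damaged names.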

As for whether Windows puts BOMs in filenames - of course I did not mean that
it does so all the time. But it can happen. Now, I already suggested
that the BOM should really be a noncharacter. Then Windows should NOT allow
creation of such filenames. But, hell, it surely allows unpaired surrogates
(Windows is still pretty much UCS-2). And it also allows U+FFFF. Well, it
looks like filenames on Windows are not really text; they are binary data.
Not that I believe that, but I've been told to process UNIX filenames as
binary data. I guess the same is then true for Windows filenames. Nice.
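
A small sketch of why such names are awkward to treat as text, using Python
strings as a stand-in for sequences of 16-bit units. The helper name and the
sample names are hypothetical. A strict UTF-8 encoder refuses a lone
surrogate, while the noncharacter U+FFFF encodes without complaint.

```python
def has_unpaired_surrogate(name: str) -> bool:
    """Return True if the name cannot be encoded as well-formed UTF-8."""
    try:
        name.encode("utf-8")  # the strict encoder rejects lone surrogates
    except UnicodeEncodeError:
        return True
    return False


lone = "report\ud800.txt"     # unpaired high surrogate
nonchar = "report\uffff.txt"  # noncharacter, but still encodable

print(has_unpaired_surrogate(lone))
print(has_unpaired_surrogate(nonchar))
```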

Lars


This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 04:54:26 CST