From: Lars Kristan (lars.kristan@hermes.si)
Date: Fri Jan 21 2005 - 16:20:30 CST
Richard Gillam wrote:
> The current committee is EXTREMELY vigilant
> and won't let these things happen.
I think this vigilance (though typically warranted) has gone too far. So
far, in fact, that even though it would be possible to prevent data loss
when invalid sequences are encountered in UTF-8, the cost of 128 codepoints
still seems to be considered too high. Yes, the million codepoints will
suffice for a very long time.
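To make the 128-codepoint idea concrete, here is a rough sketch. It is not
the proposal itself: it assumes Python and its "surrogateescape" error
handler, which maps each undecodable byte 0x80..0xFF to one of 128 lone
surrogates (U+DC80..U+DCFF) rather than to newly assigned codepoints. The
file name is made up for illustration.

```python
# A name containing a stray Latin-1 byte: 0xE9 on its own is not valid UTF-8.
raw = b"report-\xe9.txt"

# Strict decoding refuses the data outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Replacement decoding succeeds, but the original byte is gone for good.
lossy = raw.decode("utf-8", errors="replace")
lossy_roundtrip = lossy.encode("utf-8")

# Escaping keeps every invalid byte as one of 128 reserved values,
# so the exact original bytes can be written back later.
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, lossy_roundtrip == raw, restored == raw)
```

The point is only that a range of 128 values is enough to carry every
possible invalid byte through a decode/encode round trip.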
> That sequence has no special meaning in UTF-8;
> it's just a zero-width non-breaking space.
Ummmmm, is it really? I thought it no longer is. And in this particular case
it definitely is not. It is not a part of the text, is it? It is a part (well,
the only part) of the document that encapsulates the text. If your text is a
letter, you won't see the difference, but if the text is a batch file, then
you will. Or will you say programs are not written in plain text?
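A small sketch of why a script notices. It assumes a UNIX-style loader that
looks for "#!" in the first two bytes of the file; the has_shebang helper is
hypothetical and only mimics that check. What actually happens to such a
file depends on the system and the shell.

```python
import codecs

script = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + script  # EF BB BF placed in front of the text


def has_shebang(data: bytes) -> bool:
    """Hypothetical stand-in for a loader that checks the first two bytes."""
    return data.startswith(b"#!")


# A plain UTF-8 decode keeps the signature as U+FEFF inside the first line.
first_line = with_bom.decode("utf-8").splitlines()[0]

print(has_shebang(script), has_shebang(with_bom), repr(first_line))
```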
> So EF BB BF at the top of the UTF-8 file does occur
> in practice and it's good for software to be aware
> of it (but relatively harmless if it isn't).
Again, think beyond books and letters. Think plain text databases (a list of
files to process, if you wish), batch files (called scripts on UNIX), and a
number of other things, like configuration files. Harmless? On the contrary.
But we shouldn't really be talking about whether or not to have the BOM. We
should be talking about how to make things work. Either with the BOM,
without the BOM, or with the BOM being optional. Those are the three paths
we could take. It is not even necessary that everybody take the same
path.
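The three paths could be sketched roughly like this. The decode_utf8
function and its policy names are made up for illustration; the only
library behaviour relied on is Python's "utf-8-sig" codec, which drops a
leading EF BB BF when one is present.

```python
import codecs


def decode_utf8(data: bytes, bom: str = "optional") -> str:
    """Decode UTF-8 under one of three hypothetical BOM policies."""
    if bom not in ("required", "optional", "forbidden"):
        raise ValueError("unknown BOM policy: " + bom)
    has_bom = data.startswith(codecs.BOM_UTF8)
    if bom == "required" and not has_bom:
        raise ValueError("expected EF BB BF at the start of the data")
    if bom == "forbidden":
        # No signature is recognised: a leading EF BB BF is ordinary
        # content and stays in the result as U+FEFF.
        return data.decode("utf-8")
    # "required" and "optional": strip the signature if it is there.
    return data.decode("utf-8-sig")


print(repr(decode_utf8(b"\xef\xbb\xbfabc")))
print(repr(decode_utf8(b"abc")))
print(repr(decode_utf8(b"\xef\xbb\xbfabc", bom="forbidden")))
```

A reader and a writer that pick different policies will still disagree, so
the sketch says nothing about which path is right, only that each one can
be stated precisely.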
Lars