From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sun Sep 24 2006 - 23:45:59 CST
On Sun, 24 Sep 2006, Doug Ewell wrote:
> A process that claims to be able to "support Unicode" 
> should at least be able to follow the simple rule, "If the file or stream 
> starts with EF BB BF, throw them away and treat the remainder of the file or 
> stream as UTF-8."
No, that would be incorrect if the character encoding of the data has been 
declared. It would be a mistake to start interpreting the octets of data 
in a manner other than the declared encoding, at least as long as the data 
is formally correct according to the encoding. If the declared encoding 
is, say, ISO-8859-1, then EF BB BF has a well-defined meaning (the three 
characters "ï»¿") that has absolutely nothing to do with a BOM. Even if the 
data happens to violate a higher-level protocol, such as the HTML 
specification, it would be wrong to interpret it at the character level in 
a manner that violates fundamental protocols.
> Even the W3C FAQ says: "In some browsers, the presence of a UTF-8 signature 
> will cause the browser to interpret the text as UTF-8 regardless of any 
> character encoding declarations to the contrary." That's exactly what it 
> should do.
No, it's definitely something that browsers must not do when the character 
encoding has been declared, as it should be, according to the protocols. In 
the absence of any declaration of the encoding (HTTP header, meta tag, etc.), 
the browser may guess, and will, for obvious reasons. _Then_ the octets 
EF BB BF at the start of the data may and should be treated as a good reason 
to make the heuristic guess that the data is UTF-8 encoded.
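
As a rough illustration of the rule argued for above, here is a minimal 
Python sketch. The function name decode_body and the ISO-8859-1 fallback 
are assumptions made for the example, not anything taken from a browser 
or a specification: a declared encoding always wins, and the EF BB BF 
check is used only as a guess when nothing has been declared.

```python
import codecs
from typing import Optional


def decode_body(data: bytes,
                declared: Optional[str] = None,
                fallback: str = "iso-8859-1") -> str:
    """Decode octets, letting a declared encoding override BOM sniffing.

    `declared` stands for whatever the protocol supplied (HTTP header,
    meta tag, etc.); `fallback` is an arbitrary assumption for the case
    where nothing is declared and no UTF-8 signature is present.
    """
    if declared is not None:
        # The encoding was declared: interpret the octets accordingly,
        # even if they happen to begin with EF BB BF.
        return data.decode(declared)
    if data.startswith(codecs.BOM_UTF8):
        # Nothing declared: EF BB BF is a good reason to guess UTF-8.
        # The signature itself is dropped from the decoded text.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # Nothing declared and no signature: fall back to the assumed default.
    return data.decode(fallback)
```

With declared="iso-8859-1", the octets EF BB BF 61 decode to four 
characters ("ï»¿a"); with no declaration, the same octets decode to "a".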
>
> The argument about accidentally throwing away a U+FEFF that was intended as a 
> ZWNBSP is becoming increasingly irrelevant;
I'm not sure exactly which argument you are referring to. When performing 
file insertion via SSI or otherwise, it is certainly safe and 
advisable to drop any U+FEFF that appears at the start of an 
included file. There's hardly any argument about this, though there might 
be practical problems in implementing it (depending on how much control you 
have over the insertion mechanism).
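
For what it's worth, the dropping itself is trivial once the included file 
has been decoded. The following Python sketch assumes a hypothetical 
insertion mechanism that replaces a marker string in the parent document; 
the names strip_leading_bom and splice are invented for the example and do 
not correspond to any real SSI implementation.

```python
def strip_leading_bom(text: str) -> str:
    """Drop a single U+FEFF, but only at the very start of the text."""
    return text[1:] if text.startswith("\ufeff") else text


def splice(parent: str, marker: str, included: str) -> str:
    """Insert `included` at each occurrence of `marker` in `parent`.

    The leading U+FEFF of the included text, if any, is removed first so
    that it does not end up in the middle of the resulting document.
    """
    return parent.replace(marker, strip_leading_bom(included))
```

A U+FEFF elsewhere in the included text is left alone; only the one at 
the start, which is almost certainly a signature, is removed.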
> U+2060 has been recommended over 
> ZWNBSP for over 4 years now, and few applications used ZWNBSP anyway.
I'm afraid U+2060 is not widely supported, to put it mildly.
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/