From: Mark Davis (firstname.lastname@example.org)
Date: Mon Sep 25 2006 - 08:12:37 CST
On 9/24/06, Jukka K. Korpela <email@example.com> wrote:
> On Sun, 24 Sep 2006, Doug Ewell wrote:
> > A process that claims to be able to "support Unicode"
> > should at least be able to follow the simple rule, "If the file or
> > stream starts with EF BB BF, throw them away and treat the remainder
> > of the file or stream as UTF-8."
> No, that would be incorrect if the character encoding of the data has been
> declared. It would be a mistake to start interpreting the octets of data
> in a manner other than the declared encoding, at least as long as the data
> is formally correct according to the encoding.
In theory, that's correct. In practice, however, the charset is set
incorrectly so, so often. In a browser, the user can reset the charset
manually if he or she sees that it is wrong. That option is not available to
more mechanical processes like search engines -- there, the process simply
can't afford to always believe the charset parameter(s), any more than it
can always depend on the HTML being valid.
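
As a rough illustration of the two positions above, here is a minimal Python
sketch of how a mechanical process might combine them: honor a leading
EF BB BF as a UTF-8 signature, and otherwise try the declared charset without
trusting it blindly. The function name decode_bytes, its parameters, and the
fallback order are assumptions made for this example; they are not taken from
any particular search engine or library.

```python
import codecs


def decode_bytes(data, declared=None):
    """Return (text, encoding_used). Hypothetical helper for illustration."""
    # Rule from the thread: a leading EF BB BF means UTF-8; drop those bytes.
    if data.startswith(codecs.BOM_UTF8):
        try:
            return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
        except UnicodeDecodeError:
            pass  # signature present, but the rest is not valid UTF-8

    # Otherwise try the declared charset first, then UTF-8 as a guess.
    for name in (declared, "utf-8"):
        if not name:
            continue
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset label; try the next one

    # Last resort (an assumption of this sketch): ISO-8859-1 maps every
    # byte to a character, so this decode cannot fail.
    return data.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    # BOM present: the declared charset is overridden.
    print(decode_bytes(b"\xef\xbb\xbfhi", declared="iso-8859-1"))
    # Declared as UTF-8, but the bytes are not valid UTF-8.
    print(decode_bytes(b"caf\xe9", declared="utf-8"))
```

Note the limit of this approach: a single-byte charset such as ISO-8859-1
never fails to decode, so a wrong declaration of that kind is only caught
here when the signature is present. Anything beyond that would need
statistical detection, which this sketch does not attempt.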