Re: Problem with SSI and BOM

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Sep 25 2006 - 08:12:37 CST

    On 9/24/06, Jukka K. Korpela <jkorpela@cs.tut.fi> wrote:
    >
    > On Sun, 24 Sep 2006, Doug Ewell wrote:
    >
    > > A process that claims to be able to "support Unicode"
    > > should at least be able to follow the simple rule, "If the file or
    > > stream starts with EF BB BF, throw them away and treat the remainder
    > > of the file or stream as UTF-8."
    >
    > No, that would be incorrect if the character encoding of the data has been
    > declared. It would be a mistake to start interpreting the octets of data
    > in a manner other than the declared encoding, at least as long as the data
    > is formally correct according to the encoding.

    In theory, that's correct. In practice, however, the charset is very
    often set incorrectly. In a browser, the user can reset the charset
    manually if he or she sees that it is wrong. That option is not available to
    more mechanical processes like search engines -- there, the process simply
    can't afford to always believe the charset parameter(s), any more than it
    can always depend on the HTML being valid.
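
    As a rough illustration of that pragmatic stance, here is a minimal
    Python sketch of how a mechanical process might combine the EF BB BF
    rule quoted above with a declared charset. The function name
    decode_bytes, the order of the checks, and the final ISO-8859-1
    fallback are all assumptions made for this example; nothing in the
    thread specifies them, and a real search engine would likely use
    richer heuristics.

```python
import codecs
from typing import Optional, Tuple


def decode_bytes(data: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used) for a byte stream.

    Hypothetical helper: the policy below is one possible heuristic,
    not a standard algorithm.
    """
    # Rule from the thread: a leading EF BB BF is taken as a UTF-8
    # signature, dropped, and the remainder is decoded as UTF-8.
    if data.startswith(codecs.BOM_UTF8):
        try:
            return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
        except UnicodeDecodeError:
            pass  # looked like a BOM, but the rest is not valid UTF-8

    # Otherwise try the declared charset, without trusting it blindly.
    if declared:
        try:
            return data.decode(declared), declared
        except (UnicodeDecodeError, LookupError):
            pass  # wrong or unknown charset label

    # Assumed fallbacks: strict UTF-8 first, then ISO-8859-1, which
    # accepts any byte sequence.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("latin-1"), "latin-1"
```

    With this sketch, decode_bytes(b"\xef\xbb\xbfhello", "iso-8859-1")
    ignores the declared charset and yields "hello" as UTF-8. That is
    exactly the behaviour objected to above: the same three octets are
    formally valid ISO-8859-1, so the signature check is a deliberate
    bet that the label is wrong more often than the data really starts
    with those three characters.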

    Mark


