Re: Problem with SSI and BOM

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Sep 25 2006 - 00:39:47 CST

    Jukka K. Korpela <jkorpela at cs dot tut dot fi> wrote:

    >> A process that claims to be able to "support Unicode" should at least
    >> be able to follow the simple rule, "If the file or stream starts with
    >> EF BB BF, throw them away and treat the remainder of the file or
    >> stream as UTF-8."
    >
    > No, that would be incorrect if the character encoding of the data has
    > been declared.

    You are right, and I withdraw that statement.

    >> Even the W3C FAQ says: "In some browsers, the presence of a UTF-8
    >> signature will cause the browser to interpret the text as UTF-8
    >> regardless of any character encoding declarations to the contrary."
    >> That's exactly what it should do.
    >
    > No, it's definitely something that browsers must not do when the
    > character encoding has been declared, as it should, by the protocols.

    Again, I withdraw the statement. I did not read the thread carefully
    enough.
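
    To illustrate the precedence conceded above, here is a minimal Python
    sketch. It assumes the declared encoding (for example from an HTTP
    header) is passed in by the caller; the names choose_encoding and
    decode_text and the ISO-8859-1 fallback are hypothetical and not taken
    from any server or browser.

```python
import codecs
from typing import Optional

UTF8_SIGNATURE = codecs.BOM_UTF8  # the bytes EF BB BF


def choose_encoding(data: bytes, declared: Optional[str] = None,
                    fallback: str = "iso-8859-1") -> str:
    """Pick a codec name; a declared encoding always takes precedence."""
    if declared:
        # The protocol has declared the encoding, so the signature is not consulted.
        return declared
    if data.startswith(UTF8_SIGNATURE):
        # No declaration: treat EF BB BF as a UTF-8 signature and discard it.
        return "utf-8-sig"
    # Assumed default for this sketch only.
    return fallback


def decode_text(data: bytes, declared: Optional[str] = None) -> str:
    """Decode data with the codec selected by choose_encoding."""
    return data.decode(choose_encoding(data, declared))
```

    With no declaration, decode_text(b"\xef\xbb\xbfabc") drops the
    signature. With declared="iso-8859-1", the same three bytes are kept
    and decoded as ordinary ISO-8859-1 characters.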

    >> The argument about accidentally throwing away a U+FEFF that was
    >> intended as a ZWNBSP is becoming increasingly irrelevant;
    >
    > I'm not sure exactly which argument you are referring to. When
    > performing file insertion via SSI or otherwise, it is certainly safe
    > and recommendable to drop an eventual U+FEFF if it appears at the
    > start of an included file. There's hardly any argument about this,
    > though there might be practical problems in implementing (depending on
    > how much control you have over the insertion mechanism).
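
    The file-insertion case described in the quoted paragraph might be
    sketched as follows in Python. This assumes the included file is UTF-8
    and that the host document marks the insertion point with a literal
    marker string. The function include_fragment and the marker are
    hypothetical and do not correspond to any particular SSI
    implementation.

```python
def include_fragment(host_text: str, marker: str, fragment_bytes: bytes) -> str:
    """Insert a UTF-8 fragment at marker, dropping a leading U+FEFF if present."""
    fragment = fragment_bytes.decode("utf-8")
    if fragment.startswith("\ufeff"):
        # A U+FEFF at the very start of the included file is a signature,
        # so it is dropped. Any later U+FEFF is left untouched.
        fragment = fragment[1:]
    return host_text.replace(marker, fragment)
```

    Only the first character of the fragment is examined, so a U+FEFF
    appearing later in the included file passes through unchanged.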

    In the past there were significant objections to the suggestion that
    initial U+FEFF could be discarded. The concern was that the U+FEFF
    might be intended as a ZWNBSP rather than a BOM or signature, especially
    if the file or stream had been split into pieces (for transmission or
    some other reason) and the first character of each piece might not be
    first in the file or stream. My reply was always that (a) ZWNBSP at the
    true start of a file or stream makes no sense and (b) processes that
    work with partial files or streams have to be aware of the
    initial/medial/final state of each block in order to parse UTF-8
    sequences correctly across blocks.
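
    Point (b) can be sketched in Python with the standard library's
    incremental UTF-8 decoder, which carries an incomplete byte sequence
    over from one block to the next. The function name decode_blocks is
    hypothetical. The sketch assumes the blocks arrive in order and that
    the first block is the true start of the stream.

```python
import codecs
from typing import Iterable, Iterator


def decode_blocks(blocks: Iterable[bytes]) -> Iterator[str]:
    """Decode UTF-8 blocks in order, dropping U+FEFF only at the true start."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    at_start = True  # True until the first character of the stream is seen
    for block in blocks:
        # The decoder buffers a trailing partial sequence until the next block.
        text = decoder.decode(block)
        if at_start and text:
            if text[0] == "\ufeff":
                text = text[1:]  # initial signature: discard
            at_start = False
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the stream ends mid-sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

    Because the initial state is tracked separately from block boundaries,
    a U+FEFF that merely happens to fall at the start of a later block is
    preserved. Python's "utf-8-sig" codec provides similar behavior.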

    >> U+2060 has been recommended over ZWNBSP for over 4 years now, and few
    >> applications used ZWNBSP anyway.
    >
    > I'm afraid U+2060 is not widely supported, to put it mildly.

    You're right, but the same is probably true of U+FEFF as ZWNBSP.

    --
    Doug Ewell
    Fullerton, California, USA
    http://users.adelphia.net/~dewell/
    RFC 4645  *  UTN #14
    

