Re: Problem with SSI and BOM

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sat Sep 23 2006 - 02:08:19 CDT

    On Fri, 22 Sep 2006, Addison Phillips wrote:

    > See: http://www.w3.org/International/questions/qa-utf8-bom

    That page is not very specific in its statements about browser behavior.
    It discusses BOM handling in both browsers and editors and mainly the
    appearance of BOM at the start of data.

    Indirectly, the statement "Note that a number of more recent browsers,
    such as the latest versions of Internet Explorer (Win), Mozilla (Netscape)
    and Opera, do not exhibit this behavior." seems to say that BOM in UTF-8
    is not much of a problem in most browsing situations.

    Checking some less common browsers, I noticed that Netscape 4.5 shows the
    BOM as a square box (probably because it's trying to render it as a
    visible character), and Lynx 2.8.5 shows it as the inverted question mark
    character (don't ask me why - the browser can handle UTF-8 in general,
    though it is often used in environments where the browser can _display_
    e.g. ISO Latin 1 characters only).

    (By the way, the page contains two descriptions of what a UTF-8 encoded
    BOM looks like when interpreted as ISO 8859-1. The first one, ï»¿, in the
    first paragraph is correct, whereas the second occurrence, ï«¿, has got
    the guillemet changed.)
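
    The byte-level picture behind those three characters can be sketched in
    Python. This is only an illustration: the helper name show_as is made up
    for the example, and ISO 8859-1 is assumed as the "wrong" encoding a
    program might apply to the data.

```python
# The three bytes that encode U+FEFF (the BOM) in UTF-8.
BOM_UTF8 = b"\xef\xbb\xbf"


def show_as(data: bytes, encoding: str) -> str:
    """Decode raw bytes with the given encoding (hypothetical helper)."""
    return data.decode(encoding)


# Read as ISO 8859-1, each byte becomes one visible character:
# i with diaeresis, right-pointing guillemet, inverted question mark.
as_latin1 = show_as(BOM_UTF8, "iso-8859-1")

# Read as UTF-8, the same bytes are the single character U+FEFF.
as_utf8 = show_as(BOM_UTF8, "utf-8")

# A variant with the left-pointing guillemet (byte 0xAB instead of 0xBB)
# is not a BOM at all.
wrong_guillemet = show_as(b"\xef\xab\xbf", "iso-8859-1")

if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in as_latin1])
    print([f"U+{ord(c):04X}" for c in as_utf8])
    print([f"U+{ord(c):04X}" for c in wrong_guillemet])
```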

    > The BOM is often rendered in the page, throwing off other display elements.

    I can't agree with the adverb "often". And I didn't see any empty lines,
    though I did see some other faulty renderings.

    > While one might expect
    > this to act as a "no-op" character, in practice, it isn't.

    We might expect the BOM, i.e. U+FEFF, inside data to act as a control
    character according to its old Unicode semantics (zero width no-break
    space), which have been retained although the use of U+FEFF for that
    purpose has been deprecated in favor of the word joiner, U+2060. That is,
    data should not contain U+FEFF except at the start of data as a BOM, but
    programs should still interpret it in a specific way when it does occur
    elsewhere.
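
    As a rough sketch of how a program could follow that rule, assuming (as
    the subject line suggests) that the stray U+FEFF characters come from
    included fragments that each start with a BOM: the function names below
    are invented for the example, and the utf-8-sig codec is the standard
    Python codec that drops one leading BOM when decoding.

```python
from typing import Iterable

BOM = "\ufeff"          # U+FEFF: BOM at the start of data
WORD_JOINER = "\u2060"  # U+2060: preferred replacement inside text


def normalize_bom(text: str) -> str:
    """Drop one leading BOM and turn any interior U+FEFF into U+2060."""
    if text.startswith(BOM):
        text = text[1:]
    return text.replace(BOM, WORD_JOINER)


def join_fragments(fragments: Iterable[bytes]) -> str:
    """Decode UTF-8 fragments, removing each fragment's leading BOM.

    This avoids producing interior U+FEFF characters in the first place.
    """
    return "".join(fragment.decode("utf-8-sig") for fragment in fragments)


if __name__ == "__main__":
    page = join_fragments([b"\xef\xbb\xbf<h1>Head</h1>", b"\xef\xbb\xbf<p>Body</p>"])
    print(page)
```

    Whether to replace an interior U+FEFF with U+2060 or simply delete it is
    a design choice; the replacement keeps the "no line break here" meaning,
    while deletion suits the case where the character is only a leftover BOM.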

    Then again, HTML specifications do not require browsers to observe Unicode
    semantics for characters in general. In fact, Internet Explorer, for
    example, fails to do so for U+FEFF inside text. The browser does not try
    to render the character in any visible way, which is good, but it does not
    interpret it as forbidding line breaks before and after it. That's too
    bad, since if it did, we would have a standards-conforming and relatively
    safe way of forbidding a line break after a hyphen-minus, for example.
    (Using the nonbreaking hyphen character is not a realistic option, because
    it creates problems far too often, due to its absence in most fonts.)

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

