RE: BOM in HTML

From: Jon Hanna (jon@hackcraft.net)
Date: Sat Jan 22 2005 - 12:31:10 CST


    > But this OOB data is lost when a file is saved to disk. Or
    > are any applications already using some tagging in the OOB
    > data of the filesystem? Really, I've never thought of it, how
    > does IE handle this when saving files?

    Saving to its web cache, it stores the encoding (and other header
    information). Saving through the "Save As..." option, it alters the file.
    Saving by right-clicking a link and selecting "Save Target As...", it stores
    the file blindly and may fail to display it properly later (similarly, it
    loses content-type information and may fail to recognise the type of a file
    saved from the web).

    > > HTML files can contain <meta /> elements that MAY be used to determine
    > > encoding in the absence of such out of band information, and all the
    >
    > One would expect that the priority is, low from high, OOB,
    > BOM, meta, which is also the natural order of processing,
    > meaning each directive would simply override previous ones.
    > And not retroactively. Although, yes, a disagreement between
    > BOM and meta could be problematic.
    >
    > Is there a deeper reason why the actual order is reversed and
    > OOB overrides meta?

    Well, we can take <meta /> out of the list - there's no standard that says
    that they MUST or even SHOULD be honoured in this way, just one that says
    they MAY but that doing so is problematic. As such they're last on the list
    because they're a kludgy last resort when all other means of determining
    encoding have failed.

    So we have OOB followed by BOM + XML declarations. XML declarations aren't
    allowed to give impossible information. E.g. if you've just seen a BOM then
    a declaration saying the encoding is ISO 8859-1 is wrong - either that or
    the XML document starts with the BOM's bytes read as ISO 8859-1 characters
    (ï»¿ for a UTF-8 BOM), which isn't well-formed XML - and so the document is
    in error. So really the BOM and declaration are part of the same process of
    an XML parser working out what encoding to use in the absence of OOB info.
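
    To make that concrete, here is a rough Python sketch of the check. It is
    only an illustration, not what any particular parser does: the names
    sniff_bom and check_declaration are made up, the regular expression is a
    simplification of the real XML declaration grammar, and treating "UTF-16"
    as compatible with either byte order is my assumption.

```python
import codecs
import re

# BOMs to test, UTF-32 first so its LE form is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Simplified pattern for the encoding pseudo-attribute (an assumption,
# not the full grammar from the XML specification).
DECL = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def check_declaration(data):
    """Return (bom_encoding, consistent) for the start of a document."""
    bom_enc, skip = sniff_bom(data)
    if bom_enc is None:
        return None, True  # no BOM, so nothing for the declaration to contradict
    head = data[skip:skip + 200].decode(bom_enc, errors="replace")
    match = DECL.match(head)
    if match is None:
        return bom_enc, True  # no declared encoding, the BOM stands alone
    try:
        declared = codecs.lookup(match.group(1)).name
    except LookupError:
        return bom_enc, False  # unknown name cannot agree with the BOM
    # "utf-16-le" belongs to the "utf-16" family; "utf-8" is its own family.
    family = bom_enc if bom_enc == "utf-8" else bom_enc.rsplit("-", 1)[0]
    return bom_enc, declared in (bom_enc, family)
```

    A document that opens with a UTF-8 BOM and then declares ISO-8859-1 comes
    back as ("utf-8", False), which is the "document is in error" case.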

    So the question is why OOB info is considered to override BOM + XML.

    One reason is efficiency: it's generally more efficient to know what
    encoding you are using and then start processing than to read a bit into a
    file, work out the encoding, and then proceed. This can even go as far as a
    process knowing "by the time it gets to me, it's going to be UTF-8".

    Which hints at another: we can pipe input and output between processes more
    easily if they rely on OOB. In particular we can negotiate for encodings: a
    requesting process can state that it only understands, say, UTF-8, UTF-16
    and US-ASCII, and a responding process can produce UTF-8 from the ISO-8859-2
    file it is serving. This negotiation can be done by a responding process
    with no special knowledge of XML or HTML as long as OOB info is given
    priority.
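
    A minimal sketch of such a responding process, in Python. The function name
    negotiate and its signature are invented for illustration, and real HTTP
    negotiation (Accept-Charset with q-values) is more involved than a plain
    preference list.

```python
import codecs


def negotiate(data, source_charset, accepted):
    """Return (body, charset_label) in a charset the requester accepts.

    data           -- the raw bytes of the file being served
    source_charset -- the charset the file is stored in (known out of band)
    accepted       -- charset labels the requester understands, in preference order
    """
    source = codecs.lookup(source_charset).name
    if source in [codecs.lookup(label).name for label in accepted]:
        return data, source_charset  # already acceptable, serve the bytes as-is
    text = data.decode(source)
    for label in accepted:
        try:
            return text.encode(label), label
        except UnicodeEncodeError:
            continue  # this charset cannot represent the text, try the next
    raise ValueError("no acceptable charset can represent this document")


stored = "Łódź".encode("iso-8859-2")
body, charset = negotiate(stored, "ISO-8859-2", ["UTF-8", "UTF-16", "US-ASCII"])
```

    Note that the function never looks inside the bytes for an XML declaration
    or <meta />, so anything inside the document that still claims ISO-8859-2
    is now stale. The charset label it returns is the only reliable statement
    of the encoding, which is why it has to take priority.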

    Another example of an operation on XML that doesn't work on the document *as
    XML* is viewing the source - we can only view the source in a text editor if
    we know the encoding and text editors don't know <?xml version="1.0"
    encoding="iso-2022-jp"?> from Adam.
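
    A quick Python illustration of the text editor's position (view_source is a
    made-up name and the sample document is mine): the editor just decodes
    bytes with whatever charset it has been told, and never parses the
    declaration.

```python
def view_source(raw, charset):
    """Decode bytes for display the way a plain text editor would."""
    # errors="replace" so that a wrong charset shows up as visible junk
    # rather than raising an exception.
    return raw.decode(charset, errors="replace")


text = '<?xml version="1.0" encoding="iso-2022-jp"?>\n<p>日本語</p>'
raw = text.encode("iso-2022-jp")

right = view_source(raw, "iso-2022-jp")  # charset supplied out of band
wrong = view_source(raw, "iso-8859-1")   # the editor guessing Latin-1
```

    With the right charset the source comes back intact. With the Latin-1 guess
    the declaration is still readable, since that part is plain ASCII, but the
    Japanese text turns into escape sequences and junk.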

    See RFC 3023 for a better explanation than what I'm dashing off.

    > > Encoding *is not* switched in the middle of a document.
    >
    > Could even be, theoretically. As long as each directive only
    > applies for the data that follows it, possibly with the first
    > one being an exception to the rule. But I guess there is no
    > need for it.

    No, and doing so would gravely complicate matters (do you really want to
    switch from ISO-8859-9 to UTF-16BE in the middle of a document?).

    Regards,
    Jon Hanna
    Work: <http://www.selkieweb.com/>
    Play: <http://www.hackcraft.net/>
    Chat: <irc://irc.freenode.net/selkie>


