RE: BOM in HTML

From: Jon Hanna (jon@hackcraft.net)
Date: Sat Jan 22 2005 - 12:31:10 CST

Next message: Doug Ewell: "UCData (was: Re: The "JDGI" file grows)"

Previous message: Doug Ewell: "Re: Subject: Re: 32'nd bit & UTF-8"
In reply to: Lars Kristan: "RE: BOM in HTML"

> But this OOB data is lost when a file is saved to disk. Or
> are any applications already using some tagging in the OOB
> data of the filesystem? Really, I've never thought of it, how
> does IE handle this when saving files?

Saving to its web cache, it stores the encoding (and other header
information). Saving through the "Save As..." option, it alters the file.
Saving by right-clicking a link and selecting "Save Target As...", it stores
the file blindly and may fail to display it properly later (similarly, it
loses content-type information and may fail to recognise the type of a file
saved from the web).

> > HTML files can contain <meta /> elements that MAY be used
> to determine
> > encoding in the absence of such out of band information,
> and all the
>
> One would expect that the priority is, low from high, OOB,
> BOM, meta, which is also the natural order of processing,
> meaning each directive would simply override previous ones.
> And not retroactively. Although, yes, a disagreement between
> BOM and meta could be problematic.
>
> Is there a deeper reason why the actual order is reversed and
> OOB overrides meta?

Well we can take <meta /> out of the list - there's no standard that says
that they MUST or even SHOULD be honoured in this way, just one that says
they MAY but that doing so is problematic. As such they're last on the list
because they're a kludgy last resort when all other means of determining
encoding have failed.
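
As a rough sketch of that ordering for HTML, assuming a hypothetical helper
pick_html_encoding() (the name, the 1024-byte scan window and the
windows-1252 fallback are my own illustrative choices, not taken from any
standard): out-of-band information wins, then a BOM, and the <meta /> scan is
only tried when everything else has come up empty.

```python
import codecs
import re

# BOMs this sketch recognises (UTF-32 left out for brevity).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Crude scan for charset=... inside a <meta> tag; a real browser does more.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def pick_html_encoding(data, oob_charset=None, fallback="windows-1252"):
    """Return an encoding label: OOB first, then BOM, then <meta />."""
    if oob_charset:                      # e.g. charset from a Content-Type header
        return oob_charset.lower()
    for bom, name in BOMS:               # in-band, but unambiguous
        if data.startswith(bom):
            return name
    match = META_CHARSET.search(data[:1024])
    if match:                            # the kludgy last resort
        return match.group(1).decode("ascii").lower()
    return fallback                      # arbitrary assumption of this sketch


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-2"></head></html>')
print(pick_html_encoding(page))
print(pick_html_encoding(page, oob_charset="UTF-8"))
```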

So we have OOB followed by BOM + XML declarations. XML declarations aren't
allowed to give impossible information (e.g. if you've just seen a BOM then
a declaration saying the encoding is ISO 8859-1 is wrong - either that or
the XML document starts with ï»¿ (the UTF-8 BOM bytes read as ISO 8859-1),
which isn't well-formed XML - and so the document is in error) so really the
BOM and declaration are part of the same process of an XML parser working
out what encoding to use in the absence of OOB info.
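
A minimal sketch of that combined BOM + declaration step, assuming a
hypothetical sniff_xml_encoding() function (the name and the simplifications
are mine: without a BOM it only handles ASCII-compatible encodings, and it
skips the other autodetection cases a real parser covers). The point is just
that a declaration contradicting the BOM is reported as an error rather than
obeyed.

```python
import codecs
import re

# UTF-32 first, so FF FE 00 00 is not mistaken for a UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

XML_DECL = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def _family(label):
    """Normalise a label and drop the byte-order suffix (utf-16-le -> utf-16)."""
    name = codecs.lookup(label).name
    for suffix in ("-be", "-le"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


def sniff_xml_encoding(data):
    """Work out the encoding from BOM + XML declaration (no OOB info)."""
    bom_encoding = None
    for bom, name in BOMS:
        if data.startswith(bom):
            bom_encoding, data = name, data[len(bom):]
            break
    # Without a BOM, assume an ASCII-compatible encoding to read the declaration.
    head = data[:200].decode(bom_encoding or "latin-1", errors="replace")
    match = XML_DECL.match(head)
    declared = match.group(1).lower() if match else None
    if bom_encoding is None:
        return declared or "utf-8"       # XML's default when nothing is stated
    if declared is None:
        return bom_encoding
    try:
        consistent = _family(declared) == _family(bom_encoding)
    except LookupError:                  # label unknown to this Python
        consistent = False
    if not consistent:
        raise ValueError(f"declaration {declared!r} contradicts {bom_encoding} BOM")
    return bom_encoding
```

For instance, a UTF-8 BOM followed by a declaration of ISO-8859-1 raises
ValueError here, while a UTF-16 LE BOM followed by encoding="UTF-16" is
accepted.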

So the question is why OOB info is considered to override the BOM + XML
declaration.

One reason is efficiency: it's generally more efficient to know what encoding you
are using and then start processing than to read a bit into a file, work out
the encoding, and then proceed. This can even go as far as a process knowing
"by the time it gets to me, it's going to be UTF-8".

Which hints at another: we can pipe input and output between processes more
easily if they rely on OOB. In particular we can negotiate for encodings: a
requesting process can state that it only understands, say, UTF-8, UTF-16
and US-ASCII and a responding process can produce UTF-8 from the ISO-8859-2
file it is serving. This negotiation can be done by a responding process
with no special knowledge of XML or HTML as long as OOB info is given
priority.
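
To make the negotiation concrete, here is a small sketch, assuming a
hypothetical serve() function that is handed the stored bytes, the encoding
they are stored in, and the list of encodings the requester accepts (all
three names are illustrative). It never looks inside the document; it only
transcodes bytes and reports the new label out of band.

```python
import codecs


def _same_encoding(a, b):
    """True if two labels name the same codec (e.g. US-ASCII and ascii)."""
    return codecs.lookup(a).name == codecs.lookup(b).name


def serve(data, stored_encoding, accepted):
    """Return (body, charset) where charset is one the requester accepts."""
    for label in accepted:
        if _same_encoding(label, stored_encoding):
            return data, label           # already acceptable, send as-is
    target = accepted[0]                 # otherwise use the first preference
    text = data.decode(stored_encoding)  # no XML or HTML knowledge needed
    return text.encode(target), target


stored = "Zażółć gęślą jaźń".encode("iso-8859-2")
body, charset = serve(stored, "iso-8859-2", ["utf-8", "utf-16", "us-ascii"])
print(charset)
```

Note that the transcoded body may still contain a declaration naming the old
encoding, which is exactly why the OOB label has to take priority for this
to work.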

Another example of an operation on XML that doesn't work on the document *as
XML* is viewing the source - we can only view the source in a text editor if
we know the encoding and text editors don't know <?xml version="1.0"
encoding="iso-2022-jp"?> from Adam.
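
A quick illustration of the view-source case, assuming a hypothetical
view_source() that stands in for a text editor which only knows the encoding
it is told (or guesses): the same bytes are readable when the label is
supplied from outside and garbled when the editor falls back on a guess.

```python
def view_source(data, encoding):
    """Decode bytes the way a plain text editor would: no XML awareness."""
    return data.decode(encoding, errors="replace")


doc = '<?xml version="1.0" encoding="iso-2022-jp"?><p>日本語</p>'
raw = doc.encode("iso-2022-jp")

guessed = view_source(raw, "latin-1")       # editor guessing a default
labelled = view_source(raw, "iso-2022-jp")  # editor told the encoding OOB

print(labelled == doc)
print(guessed == doc)
```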

See RFC 3023 for a better explanation than what I'm dashing off.

> > Encoding *is not* switched in the middle of a document.
>
> Could even be, theoretically. As long as each directive only
> applies for the data that follows it, possibly with the first
> one being an exception to the rule. But I guess there is no
> need for it.

No, and doing so would gravely complicate matters (do you really want to
switch from ISO-8859-9 to UTF-16BE in the middle of a document?).

Regards,
Jon Hanna
Work: <http://www.selkieweb.com/>
Play: <http://www.hackcraft.net/>
Chat: <irc://irc.freenode.net/selkie>


This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 12:33:19 CST