From: Jon Hanna (jon@hackcraft.net)
Date: Tue Jan 25 2005 - 11:51:14 CST
> Whatever the HTTP protocol specs say, it is not mandating
> anything about how to interpret Content-Types.
Yes it is. RTFRFC.
> HTTP just
> offers a way to transport the Content-Type information, and
> then leaves the interpretation of this content type to MIME
> specifications.
No it doesn't. It is based on MIME, but it is not identical to MIME. This is
what RFC 2616 means by such phrases as "MIME-like messages", "similar to
that used by Internet mail as defined by the Multipurpose Internet Mail
Extensions (MIME)", "analogous to ... MIME", and in particular "HTTP is not
a MIME-compliant protocol." and even the ability to state a MIME-Version
header to indicate that the message is in compliance with MIME comes with
the caveat "However, HTTP/1.1 message parsing and semantics are defined by
this document and not the MIME specification."
RTFRFC.
> In other words, it DOES NOT specify any default charset for
> the transported document, should it be "text/html", or
> whatever other "text/*" content-type !
RTFRFC:
'The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data. When no explicit charset parameter
is provided by the sender, media subtypes of the "text" type are defined to
have a default charset value of "ISO-8859-1" when received via HTTP. Data in
character sets other than "ISO-8859-1" or its subsets MUST be labeled with
an appropriate charset value. See section 3.4.1 for compatibility problems.'
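To make the quoted rule concrete, here is a minimal sketch of how a client might apply it. The function name effective_charset is made up for illustration, and Python's email.message parser is borrowed purely to split the header value into type and parameters; that is a convenience assumption, not a claim that HTTP headers are MIME headers.

```python
from email.message import Message


def effective_charset(content_type):
    """Return the charset a client would assume for a Content-Type value.

    Hypothetical helper: an explicit charset parameter wins; otherwise
    subtypes of "text" fall back to ISO-8859-1, as in the RFC 2616 text
    quoted above. Other media types get no default at all.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_content_charset()  # lowercased value, or None
    if explicit is not None:
        return explicit
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"
    return None


print(effective_charset("text/html"))
print(effective_charset("text/html; charset=UTF-8"))
print(effective_charset("image/png"))
```

The last call shows the other half of the rule: the default only exists for "text" subtypes, so nothing is assumed for image/png.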
> It's important to note this because the relevant information
> is not in RFC 2616. HTTP is ONLY a transport protocol that
> allows querying documents along with their meta-data.
HTTP is not a transport protocol, it is an application protocol that sits on
top of a transport protocol, though it is frequently abused to serve as a
transport protocol (e.g. in SOAP). It is transport-protocol neutral (and
some of the changes between 1.0 and 1.1 increase the range of transport
protocols it can work on top of). The most common transport protocol it is
used on top of is TCP, but other reliable protocols can be and are used
(generally these in turn sit on top of TCP and offer privacy or other
advantages, but this doesn't have to be the case: one could implement HTTP
immediately on top of Ethernet, for example, though one isn't likely to find
this useful).
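As a rough illustration of that layering, the sketch below builds an HTTP/1.1 request as plain bytes and pushes it through a local socket pair instead of a TCP connection to a server. The helper names and the host example.org are assumptions for the example only. On most Unix systems socket.socketpair() gives Unix domain sockets, so no TCP is involved; on other platforms it may be emulated differently.

```python
import socket


def build_get_request(host, path="/"):
    """Serialize a minimal HTTP/1.1 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_over(stream, payload):
    """Write the message to any reliable byte stream with sendall()."""
    stream.sendall(payload)


# A connected pair of local sockets stands in for "some reliable transport".
client, server = socket.socketpair()
send_over(client, build_get_request("example.org"))
client.close()

received = b""
while True:
    chunk = server.recv(4096)
    if not chunk:  # empty read means the peer closed its end
        break
    received += chunk
server.close()

print(received.decode("ascii"))
```

Nothing in build_get_request knows what carries the bytes, which is the point: the message format belongs to the application protocol, and the stream underneath is interchangeable.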
> HTTP
> does not describe or mandate any of these meta-data.
The majority of the text of the spec does exactly that. RTFRFC.
> The only mandatory requirements in HTTP for the
> interpretation of headers are those effectively used in HTTP,
> to specify the origin host of the document, to sign its
> content or certify it against alteration, to see if the
> document can be replicated or cached, or to change its
> transport encoding syntax to bypass some limitations (most
> HTTP gateways however are binary safe today, so the only
> current use of transfer-encoding syntaxes in HTTP is for data
> compression or security, for example by inserting partial
> checksums, and allowing altered parts of the document to be
> reloaded from the source)...
Actually, clients are free to assume origin host based on connection, to
ignore MD5 hashes, to implement their own caching rules if they are at the
end of the connection (i.e. not a proxy) and to not cache any content on
arbitrary grounds even if it is marked as cacheable. You've come pretty
close to identifying the set of HTTP headers that aren't mandatory.
> Don't forget that: HTTP is only a transport protocol, but not
Again it is not. It is an application protocol. TCP is a transport protocol.
> fact that HTTP is most often used for HTML documents since
> its origin does not conceptually binds it to the HTML
> requirements.
Of course the only way to determine what is an HTML document and what is a
text file that happens to contain the likes of <head> etc. is by examining
the content-type (IE is buggy in this regard though).
> In fact HTTP does not even specify that all
> HTML documents should be transported with a "text/html"
> content-type (it could be other types including some XML
> variants, or application specific content-types, even if the
> document will first be parsed as HTML, depending on the
> client application requirements.)
This is noted in the specs which *do* specify the text/html and
application/xhtml+xml MIME types. Notably the most recent registration for
text/html notes the feature of HTTP with regard to default charset
parameters you are claiming does not exist.
> The "Content-Type:" header is then only standardized as a
> container for MIME related information, but it's not to the
> HTTP spec to say how it will be interpreted.
Again, MIME-like; not MIME, merely based on it. RTFRFC.
> Notably the
> absence of a "charset=..." attribute in the content-type
> value means or implies NOTHING in HTTP,
RTFRFC.
> which is open to any
> other content-types for which the simple concept of a
> "charset" is not significant.
Yes, the rule for 'media subtypes of the "text" type' does not apply to
media subtypes that are not of the "text" type.
RTFRFC.
> So please don't assert such things. HTTP will just indicate
> you that the document is of a "text/html" content-type, and
> then it's up to the client to interpret it according to the
> definition of this content-type in MIME, where it is
> registered and bound in reference to the HTML standard.
No, it's up to the client to interpret it according to the adaptation of
MIME used by HTTP. RTFRFC.
> Then comes the HTML standard: this is where the "text/html"
> content-type will be described with its charset attribute.
> The HTML standard is clear:
The HTML standards explicitly refer to exactly the feature of HTTP that I
mentioned, which you claim does not exist, and the practical issues with it
that I also mentioned. In other words they say to RTFRFC.
> (8) In some cases, the browser will need to reload the
> document from its source by performing a new request to its
> URL (this will be true if the source indicates that the
> document is not cachable and generated on the fly, or
> secured. Unfortunately, if the source document came from a
> dynamic POST request, the document may be the result of a
> active query, so generally the browser will first ask to the
> user whever it wants to resend its last form to get the
> generated document).
Browsers are free to retrieve data from private caches when they are merely
refreshing a *view* on a page or if the user is going backwards and forwards
through the browser history. It is therefore not necessary to repeat the
POST to comply with a document being labelled as not cacheable (RTFRFC).
> However, more modern browsers will cache internally the bytes
> stream coming from HTTP, to be able to change its meta-data
> on the fly without having to reload the source: this may
> cause problems if the HTML document was parsed a first time,
> and refered to other objects whose URL may be active;
> reparsing the document will possibly change the list of
> refered active objects.
>
> To avoid this nightmare, notably for actively generated
> documents, all of them should really specify reliably the
> charset needed to parse them without needing to requery the
> source. This is to the website designer to ensure this!
This is both unlikely and not a "nightmare", merely an efficiency issue,
since those sub-objects of an HTML document would be retrieved through GET,
which (unlike POST, PUT and DELETE) has safe semantics.
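A small sketch of how a client might act on that distinction follows. The names SAFE_METHODS and needs_confirmation_before_repeat are hypothetical, and the set reflects my reading of RFC 2616 section 9.1.1, which treats GET and HEAD as safe.

```python
# Methods RFC 2616 section 9.1.1 describes as "safe" (retrieval only).
SAFE_METHODS = frozenset({"GET", "HEAD"})


def needs_confirmation_before_repeat(method):
    """Return True if silently repeating the request could trigger an action.

    Hypothetical policy helper: safe methods may be re-issued freely,
    anything else should be confirmed with the user first.
    """
    return method.upper() not in SAFE_METHODS


for method in ("GET", "POST", "PUT", "DELETE"):
    print(method, needs_confirmation_before_repeat(method))
```

Under this policy, re-fetching the images or stylesheets a page refers to costs only bandwidth, while repeating the POST that produced the page is the one case worth asking about.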
Regards,
Jon Hanna
Work: <http://www.selkieweb.com/>
Play: <http://www.hackcraft.net/>
Chat: <irc://irc.freenode.net/selkie>