(no subject)

From: Philippe VERDY (verdy_p@wanadoo.fr)
Date: Tue Jan 25 2005 - 10:58:46 CST

     --- Jon Hanna <jon@hackcraft.net> wrote:
    > > or if there's no charset specification in HTTP headers, but there's an
    > > internal charset specified in the document that indicates it's using the
    > > UTF-8 "charset"
    > *Strictly* in the absence of a charset parameter the header "Content-Type:
    > text/html" is supposed to be taken as having a default charset parameter of
    > "charset=iso-8859-1", which is one of the minor changes RFC 2616 (HTTP) made
    > in its use of MIME (under which the default charset parameter would have
    > been "charset=us-ascii").

    Whatever the HTTP protocol spec says, it does not mandate anything about how to interpret content types. HTTP just offers a way to transport the Content-Type information, and leaves the interpretation of that content type to the MIME specifications.

    In other words, it DOES NOT specify any default charset for the transported document, whether it is "text/html" or any other "text/*" content type!

    It's important to note this because the relevant information is not in RFC 2616. HTTP is ONLY a transport protocol that allows querying documents along with their meta-data. HTTP does not describe or mandate any of these meta-data.

    The only headers whose interpretation HTTP makes mandatory are those that HTTP itself uses: to specify the origin host of the document, to sign its content or certify it against alteration, to tell whether the document can be replicated or cached, or to change its transfer encoding to work around transport limitations. Most HTTP gateways are binary-safe today, however, so the only current uses of transfer encodings in HTTP are data compression and security, for example inserting partial checksums so that altered parts of the document can be reloaded from the source.

    Don't forget that HTTP is only a transport protocol, not a way to specify how documents should be represented. The fact that HTTP has most often been used for HTML documents since its origin does not conceptually bind it to the HTML requirements. In fact, HTTP does not even specify that all HTML documents should be transported with a "text/html" content type: other types are possible, including some XML variants or application-specific content types, even if the document will first be parsed as HTML, depending on the client application's requirements.

    The "Content-Type:" header is therefore only standardized as a container for MIME-related information; it is not up to the HTTP spec to say how it will be interpreted. Notably, the absence of a "charset=..." attribute in the content-type value means or implies NOTHING in HTTP, which is open to any other content types for which the simple concept of a "charset" is not significant.

    So please don't assert such things. HTTP will just tell you that the document has a "text/html" content type, and then it's up to the client to interpret it according to the definition of this content type in MIME, where it is registered and bound by reference to the HTML standard.
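
    To make the distinction concrete, here is a minimal Python sketch of what a client sees at the header level. It only illustrates that a Content-Type value can be split into a media type and an optional charset parameter, and that an absent parameter is reported as absent rather than replaced by a default. The helper name split_content_type is hypothetical; the parsing itself relies on the standard library's email.message.Message.

```python
from email.message import Message


def split_content_type(header_value):
    """Return (media_type, charset_or_None) for a Content-Type header value.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when there is no charset parameter;
    # it does not substitute any default value.
    return msg.get_content_type(), msg.get_content_charset()


if __name__ == "__main__":
    print(split_content_type("text/html"))
    print(split_content_type("text/html; charset=UTF-8"))
```

    What the client does with a missing charset is then a separate decision, taken after the header has been read.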

    Then comes the HTML standard: this is where the "text/html" content-type will be described with its charset attribute. The HTML standard is clear:

    (1) if the content-type associated with a document specifies a charset, then this charset must be used to parse the HTML document

    (2) if there's no charset attribute, or if there's not even a known MIME content-type, this information can come from the OS or browser integration, which may determine a default charset from other sources of meta-data

    (3) if this does not reveal a charset, the document may be parsed according to an XML schema and syntax, if it starts with a <?xml ...?> signature. If there's such a signature, then it may specify a charset which will become the default.

    (4) if the HTML document is not XML, the document may then contain a <meta> tag in the header specifying the content-type. If it's present, then the document will be reparsed using that information.

    (5) if there's no such information, the browser may try to determine the charset automatically. Most browsers will for example auto-detect a leading BOM and then deduce the charset associated with the corresponding UTF-* encoding scheme with which the document is encoded.

    (6) browsers will then be free to guess the charset by heuristics, or will then use a user's preference to parse the document.

    (7) Once the document is parsed (successfully or not), users may select manually another charset to reinterpret the document.

    (8) In some cases, the browser will need to reload the document from its source by performing a new request to its URL (this will be true if the source indicates that the document is not cacheable and is generated on the fly, or is secured). Unfortunately, if the source document came from a dynamic POST request, the document may be the result of an active query, so the browser will generally first ask the user whether they want to resend the last form to get the generated document.

    However, more modern browsers will internally cache the byte stream coming from HTTP, so that they can change its metadata on the fly without having to reload the source. This may cause problems if the HTML document was parsed a first time and referred to other objects whose URLs may be active: reparsing the document may change the list of referenced active objects.

    To avoid this nightmare, notably for actively generated documents, all of them should reliably specify the charset needed to parse them, without requiring the source to be queried again. It is up to the website designer to ensure this!
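
    The precedence described in steps (1) to (6) above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not how any particular browser works: the function name choose_charset, the two regular expressions, the 1024-byte scan window and the "windows-1252" fallback are all hypothetical choices, and the heuristic guessing of step (6) is left out. The XML and <meta> checks also assume an ASCII-compatible encoding.

```python
import codecs
import re

# Simplified patterns, assumed for illustration (ASCII-compatible input only).
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)

# UTF-32 is tested before UTF-16 because the UTF-32LE BOM begins with the
# same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def choose_charset(body, header_charset=None, external_charset=None,
                   user_default="windows-1252"):
    """Return (charset, source) following the order of steps (1) to (6)."""
    if header_charset:                      # (1) charset in the content-type
        return header_charset, "content-type"
    if external_charset:                    # (2) OS or browser metadata
        return external_charset, "external metadata"
    match = XML_DECL.match(body)            # (3) <?xml ... encoding="..."?>
    if match:
        return match.group(1).decode("ascii"), "xml declaration"
    match = META_CHARSET.search(body[:1024])  # (4) <meta> in the header
    if match:
        return match.group(1).decode("ascii"), "meta element"
    for bom, name in BOMS:                  # (5) leading BOM
        if body.startswith(bom):
            return name, "byte order mark"
    return user_default, "user preference"  # (6) heuristics omitted here


if __name__ == "__main__":
    page = b'<html><head><meta charset="koi8-r"></head></html>'
    print(choose_charset(page))
    print(choose_charset(page, header_charset="iso-8859-2"))
```

    The second call shows the point made above: a charset that arrives with the content-type wins, and nothing in the document has to be reparsed.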

    A BOM in HTML is standard, even if there's nothing mandatory in the HTML standard about it. It should be recognized as such as soon as the HTML document is interpreted with a charset that is a standard Unicode encoding scheme accepting it: the "UTF-8", "UTF-16" or "UTF-32" charsets (and also "CESU-8", "BOCU-1" or "SCSU"?).

    Conceptually, this BOM is not part of the encoded text, but only of the encoding scheme used to represent it in the transported document. It therefore does not influence how HTML documents are parsed, because HTML parsers should not be exposed to the presence or absence of this BOM: an HTML parser must take its input from the output of the charset decoder, which converts the stream of bytes into parsable character entities. Today these are most often Unicode code units or code points, simply because HTML *parsers*, not the input charset decoder itself, also need to recognize the numeric character entities present in the HTML source, which are bound to Unicode/ISO/IEC 10646 code points.
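
    The layering can be sketched in Python, assuming a hypothetical decode_stream helper that stands in for the charset decoder and html.unescape standing in for the part of an HTML parser that resolves numeric character references. It is only meant to show that the BOM is consumed in the decoding layer, so the parsing layer gets the same characters whether or not a BOM was present.

```python
import codecs
import html


def decode_stream(raw, charset):
    """Charset-decoder layer: bytes in, characters out, BOM consumed here.

    Hypothetical helper for illustration only.
    """
    text = raw.decode(charset)
    # Python's "utf-16" and "utf-32" codecs already consume a leading BOM,
    # but its "utf-8" codec keeps it as U+FEFF, so drop it explicitly.
    return text[1:] if text.startswith("\ufeff") else text


if __name__ == "__main__":
    source = "<p>caf&#233;</p>".encode("utf-8")
    with_bom = decode_stream(codecs.BOM_UTF8 + source, "utf-8")
    without_bom = decode_stream(source, "utf-8")
    # The parser layer receives identical input in both cases.
    print(with_bom == without_bom)
    # Numeric character references are resolved by the parser layer,
    # in terms of Unicode code points (U+00E9 here).
    print(html.unescape(with_bom))
```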

    This archive was generated by hypermail 2.1.5 : Tue Jan 25 2005 - 11:06:44 CST