Re: [OT] HTML charset declarations (was: GSM and Unicode)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 05 2003 - 15:00:02 EST


    YTang0648@aol.com wrote:
    > I think the reason we see pages that have
    > <meta charset=""> is that the old charset detection
    > code we put into Netscape 2.0 back in early 1996 is
    > very "loose". That code is not built into the parser but into
    > a pre-parsing STREAM filter. It is a simple sniffer, for
    > performance reasons, and it cannot be in the parser because
    > you need to detect and convert the charset before handing
    > the data to the parser, because of ISO-2022-JP.
    > Because that old meta charset detection code is so loose,
    > pages that have those tags will ALSO work gracefully with
    > Netscape 2.0 through 4.x. (We were forced to make it work
    > for Netscape 7 and Mozilla later because of that, I believe.
    > MS probably did the same thing for IE for the same
    > reason.) Although Netscape never published any document
    > advertising it, people somehow found that it is shorter
    > than the "right way" and that it worked with the major browsers
    > of the time, so some people started to use it. I believe
    > that is where it all comes from.

    As long as the document does not declare strict HTML 4 conformance with a
    <!DOCTYPE> declaration, this is safe within the loose document schema.
    There is, however, absolutely no guarantee that it will work in compliance
    with any standard, as this interpretation remains private to Netscape 2.x
    to 4.x.

    Newer browsers may legitimately fail to render a page correctly if this is
    the only source for the charset, such as in:
        <meta http-equiv="content-type" content="text/html" charset="UTF-8">
    This fails because the content attribute only says that the document is an
    HTML file. The same applies to web servers that build HTTP headers from
    these elements: they will not see the extra charset attribute, and will
    thus generate this HTTP header before transmitting the document:
        Content-Type: text/html

    As the HTTP header takes precedence over ANY charset specified by the
    document in compliant browsers, they will always ignore the extra attribute.
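
    To make the point concrete, here is a small Python sketch (my own
    illustration, not code from any browser or server) that parses the two
    Content-Type values the way a generic header parser would. The helper name
    charset_param is hypothetical; it relies only on the standard library's
    email.message.Message.

```python
from email.message import Message


def charset_param(content_type_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the value carries no such parameter.
    return msg.get_content_charset()


# The value taken from content="text/html" carries no charset at all;
# a separate charset="UTF-8" attribute never becomes part of it.
print(charset_param("text/html"))
# The value taken from content="text/html;charset=UTF-8" does carry one.
print(charset_param("text/html;charset=UTF-8"))
```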

    This extra attribute is only safe if its value is exactly the same as the
    one specified in the content attribute:
        <meta http-equiv="content-type" content="text/html;charset=UTF-8" charset="UTF-8">
    but it must be removed with the strict HTML 4.01 document type.
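
    A rough way to check this rule on an existing page, sketched in Python with
    the standard html.parser module. The class MetaCharsets and the function
    extra_charset_is_redundant are names I made up for this illustration; the
    check only looks at <meta http-equiv="content-type"> elements and assumes
    the page is already available as a string.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsets(HTMLParser):
    """Collect (charset in content, extra charset attribute) pairs."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag != "meta" or (a.get("http-equiv") or "").lower() != "content-type":
            return
        declared = None
        if a.get("content"):
            # Parse the content value like a Content-Type header.
            msg = Message()
            msg["Content-Type"] = a["content"]
            declared = msg.get_content_charset()
        self.pairs.append((declared, a.get("charset")))


def extra_charset_is_redundant(html):
    """True if every extra charset attribute repeats the content charset."""
    parser = MetaCharsets()
    parser.feed(html)
    parser.close()
    return all(
        extra is None or (declared or "").lower() == extra.lower()
        for declared, extra in parser.pairs
    )
```

    The first tag quoted earlier (content="text/html" with a separate charset
    attribute) fails this check, while the tag just above passes it.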

    One can however use it safely with XHTML, because XHTML documents are XML
    documents which may specify explicitly another document schema that includes
    this extra attribute (thanks to the modular model of XHTML). But you'll have
    to provide your own XML schema...

    Note that for XHTML, which must be a valid XML document, UTF-8 is the
    default if nothing is specified. But an XML declaration may be added at the
    top to specify the charset to use when parsing the XML document. In that
    case, the XML declaration in the document takes precedence over the
    external HTTP header, which itself takes precedence over the
    <meta http-equiv /> elements.
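
    The UTF-8 default and the effect of the XML declaration can be observed
    with any XML parser. Below is a minimal Python sketch using the standard
    xml.etree.ElementTree module; the sample documents and the helper names
    parse_text and parses are made up for the illustration. It only shows how
    an XML parser reads the bytes of the document itself, and says nothing
    about HTTP headers.

```python
import xml.etree.ElementTree as ET


def parse_text(data):
    """Parse XML bytes and return the text of the root element."""
    return ET.fromstring(data).text


def parses(data):
    """Return True if the bytes are accepted by the XML parser."""
    try:
        parse_text(data)
        return True
    except ET.ParseError:
        return False


# No XML declaration: the parser assumes UTF-8.
utf8_no_decl = "<p>caf\u00e9</p>".encode("utf-8")

# ISO-8859-1 bytes announced by the XML declaration.
latin1_decl = (
    '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")

# ISO-8859-1 bytes with no declaration: not valid UTF-8, so rejected.
latin1_no_decl = "<p>caf\u00e9</p>".encode("iso-8859-1")

print(parse_text(utf8_no_decl))
print(parse_text(latin1_decl))
print(parses(latin1_no_decl))
```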

    So if you want full XML compliance and support for legacy browsers, you need
    to:

        - use a leading <?xml ?> declaration with an explicit encoding
    pseudo-attribute naming the charset.

        - declare the <!DOCTYPE > with your own schema, make this extended
    schema accessible at the referenced SYSTEM URL, and give it a specific
    PUBLIC doctype name.

        - use a <meta http-equiv /> tag very early in your <head> section, even
    before any possibly internationalized string like the <title></title>
    element (in fact it is recommended to put ALL <meta http-equiv /> elements
    before the required <title></title> element, and only then put the other
    <meta name /> elements such as robots control tags, description and
    keywords)

        - avoid all line breaks within <meta http-equiv /> elements (needed for
    some web servers that are tuned for performance and parse the HTML document
    lazily before generating HTTP headers), unless you can control the
    generation of HTTP headers (with an external server control file like
    .httpd.conf or similar features, or by generating the headers yourself
    within a server-side script)

        - make sure you insert a space before every abbreviated element
    terminator "/>"

        - always specify explicitly the "iso-8859-1" document charset with the
    above method, if this is the one you use, as the default charset differs
    between HTML (which defaults to ISO-8859-1) and XHTML (which defaults to
    UTF-8, per XML conformance, unless there's a leading BOM to specify UTF-16
    or UTF-32)
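
    Three of the points above (meta before title, no line breaks inside
    <meta http-equiv />, a space before "/>") are mechanical enough to check
    with a script. Here is a rough Python sketch; lint_head is a hypothetical
    name, the checks are plain regular expressions rather than a real parser,
    and the messages are my own wording.

```python
import re


def lint_head(source):
    """Return a list of checklist violations found in an (X)HTML string."""
    problems = []

    # <meta http-equiv ...> elements should not contain line breaks.
    for match in re.finditer(r"<meta\b[^>]*>", source, re.IGNORECASE):
        tag = match.group(0)
        if "http-equiv" in tag.lower() and "\n" in tag:
            problems.append("line break inside <meta http-equiv>")

    # Abbreviated terminators should be written " />", with a space.
    if re.search(r"\S/>", source):
        problems.append("missing space before />")

    # The <meta http-equiv> element should come before <title>.
    lower = source.lower()
    meta_pos = lower.find("http-equiv")
    title_pos = lower.find("<title")
    if title_pos != -1 and (meta_pos == -1 or meta_pos > title_pos):
        problems.append("<meta http-equiv> missing or after <title>")

    return problems


good = (
    "<head>\n"
    '<meta http-equiv="content-type" content="text/html;charset=iso-8859-1" />\n'
    "<title>T</title>\n"
    "</head>"
)
bad = (
    "<head><title>T</title>\n"
    '<meta http-equiv="content-type"\n'
    ' content="text/html;charset=iso-8859-1"/></head>'
)

print(lint_head(good))
print(lint_head(bad))
```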


