Re: [OT] HTML charset declarations (was: GSM and Unicode)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 05 2003 - 15:00:02 EST

Next message: YTang0648@aol.com: "Re: UTF-16 inside UTF-8"

Previous message: John Hudson: "RE: ZWJ/ZWNJ in combining mark sequences"
In reply to: YTang0648@aol.com: "Re: GSM and Unicode"
Next in thread: YTang0648@aol.com: "Re: [OT] HTML charset declarations (was: GSM and Unicode)"
Maybe reply: YTang0648@aol.com: "Re: [OT] HTML charset declarations (was: GSM and Unicode)"

YTang0648@aol.com wrote:
> I think the reason we see pages that have
> <meta charset=""> is that the old charset detection
> code we put into Netscape 2.0 back in early 1996 is
> very "loose". That code is not built into the parser but into
> a pre-parsing STREAM filter. It is a simple sniffer, for
> performance reasons, and it cannot be in the parser because
> you need to detect and convert the charset before handing
> the data to the parser, because of ISO-2022-JP.
> Because that old meta charset detection code is so loose,
> pages that have those tags will ALSO work gracefully with
> Netscape 2.0 through 4.x. (We are forced to make it work
> for Netscape 7 and Mozilla as well because of that, I believe.
> MS probably does the same thing for IE for the same
> reason.) Although Netscape never published any document
> advertising this, people somehow found that it is shorter
> than the "right way" and that it worked with the major browsers
> of that time, so some people started to use it. I believe that is
> where it all comes from.

As long as the document does not claim strict HTML 4 compliance in its
<!DOCTYPE> declaration, this is safe within the loose document schema (but
there's absolutely no guarantee that it will work in compliance with any
standard, as this interpretation remains private to Netscape 2.x to 4.x).
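
To picture how "loose" such a pre-parse sniffer can be, here is a minimal
sketch in Python. It is NOT the Netscape code, which I have not reproduced
here; the pattern, the function name loose_sniff and the 1024-byte limit are
all assumptions made for illustration. It simply accepts any "charset=" token
found anywhere inside a <meta> tag near the start of the byte stream, which is
why both the standard form and the shorter private form are accepted:

```python
import re

# Hypothetical loose pattern: any "charset = token" inside a <meta ...> tag,
# whether it sits in the content="" value or is a separate attribute.
LOOSE_META_CHARSET = re.compile(
    rb"<meta\b[^>]*?\bcharset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def loose_sniff(head_bytes, limit=1024):
    """Return the lowercased charset label found in the first `limit` bytes, or None."""
    match = LOOSE_META_CHARSET.search(head_bytes[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Because it works on raw bytes before any parsing, a filter of this kind can
pick the decoder first and hand already-converted data to the parser.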

Newer browsers may legitimately fail to render a page correctly if this is
the only source for the charset, as in:
<meta http-equiv="content-type" content="text/html" charset="UTF-8">
This fails because the content attribute only says that the document is an
HTML file. The same goes for web servers that read this element: they will
not see the extra charset attribute, and will thus generate this HTTP header
before transmitting the document:
Content-Type: text/html

As the HTTP header takes precedence over ANY charset specified by the document
in compliant browsers, they will always ignore the extra attribute.
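
The difference can be checked with a small Python sketch that reads only what
the content attribute says, as a strict consumer would. The class name
ContentTypeMeta and the function declared_charset are names I made up for this
illustration; the parsing itself relies on the standard html.parser and
email.message modules:

```python
from email.message import Message
from html.parser import HTMLParser


class ContentTypeMeta(HTMLParser):
    """Collects the content="" value of <meta http-equiv="content-type"> elements."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "content-type":
            # Only the content attribute is kept; a stray charset="" is ignored.
            self.values.append(attributes.get("content") or "")


def declared_charset(html_text):
    """Return the charset parameter of the first content-type meta, or None."""
    parser = ContentTypeMeta()
    parser.feed(html_text)
    parser.close()
    if not parser.values or not parser.values[0]:
        return None
    header = Message()
    header["Content-Type"] = parser.values[0]
    return header.get_content_charset()
```

With the element shown above (content="text/html" plus a separate charset
attribute) this returns None, while content="text/html;charset=UTF-8" yields
"utf-8".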

This extra attribute is only safe if its value is exactly the same as the one
specified in the content attribute:
<meta http-equiv="content-type" content="text/html;charset=UTF-8"
charset="UTF-8">
but it must be removed under the strict HTML 4.01 document type.

One can however use it safely with XHTML, because XHTML documents are XML
documents which may specify explicitly another document schema that includes
this extra attribute (thanks to the modular model of XHTML). But you'll have
to provide your own XML schema...

Note that for XHTML, which must be a valid XML document, UTF-8 is the
default if nothing is specified. But an XML declaration may be added at the
top to specify the charset to use when parsing the XML document. In that case,
the XML declaration in the document takes precedence over the external HTTP
header, which itself takes precedence over the <meta http-equiv /> elements.

So if you want full XML compliance and support for legacy browsers, you need
to:

- use a leading <?xml ?> declaration with the explicit charset
pseudo-attribute.

- declare the <!DOCTYPE > with your own schema, make this extended
schema accessible at the referenced SYSTEM URL, and give it a specific
PUBLIC doctype name.

- use a <meta http-equiv /> tag very early in your <head> section, even
before any possibly internationalized string like the <title></title>
element (in fact it is recommended to put ALL <meta http-equiv /> elements
before the required <title></title> element, and only then put the other
<meta name /> elements such as robots control tags, description and
keywords)

- avoid all line breaks within <meta http-equiv /> elements (needed for
some web servers that are tuned for performance and may parse the HTML
document lazily before generating HTTP headers), unless you can control the
generation of HTTP headers (with an external server control file like
httpd.conf or similar features, or by generating the headers yourself within
a server-side script)

- make sure you insert a space before every abbreviated element
terminator "/>"

- always specify explicitly the "iso-8859-1" document charset with the
above method, if this is the one you use, as the default charset differs
between HTML (which defaults to ISO-8859-1) and XHTML (which defaults to
UTF-8, per XML conformance, unless there's a leading BOM to specify UTF-16
or UTF-32)
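
The checklist above can be sketched as a small Python helper that emits a
document skeleton in the recommended order. This is only an illustration: the
function name xhtml_skeleton is mine, and the PUBLIC identifier and SYSTEM URL
passed to it are hypothetical placeholders for your own extended schema.

```python
from html import escape


def xhtml_skeleton(charset, title, public_id, system_url):
    """Build an XHTML skeleton: XML declaration, DOCTYPE, then meta before title."""
    lines = [
        # Leading XML declaration with the explicit encoding pseudo-attribute.
        f'<?xml version="1.0" encoding="{charset}"?>',
        # public_id and system_url are placeholders for your own schema.
        f'<!DOCTYPE html PUBLIC "{public_id}" "{system_url}">',
        '<html xmlns="http://www.w3.org/1999/xhtml">',
        "<head>",
        # Kept on one line, before the title, with a space before "/>".
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}" />',
        f"<title>{escape(title)}</title>",
        "</head>",
        "<body></body>",
        "</html>",
    ]
    return "\n".join(lines) + "\n"
```

A call such as xhtml_skeleton("iso-8859-1", "My page",
"-//EXAMPLE//DTD XHTML Extended//EN", "http://www.example.org/dtd/xhtml-ext.dtd")
states the charset explicitly in both places, so the document does not depend
on which default (HTML or XML) the receiving software applies.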

This archive was generated by hypermail 2.1.5 : Wed Nov 05 2003 - 15:56:52 EST