Re: Not snazzy (was: New Unicode Savvy Logo)

From: Philippe Verdy (
Date: Thu May 29 2003 - 10:33:44 EDT

    From: <>
    > > there are still (even more) browsers that do not display UTF-8
    > > correctly...
    > > who still use very often a browser that supports some form their
    > > national encoding (SJIS, GB2312, Big5, KSC5601), sometimes with
    > > ISO2022-* but shamely do not decode UTF-8 properly (even when the
    > > page is correctly labelled...
    > > but the same browsers really know how to use Unicode
    > > codepoints and even know UTF-8, but refuse to switch to it because
    > > they do not interpret the meta information that both the page
    > > content and the HTTP header specify! I have found that these
    > > browsers simply do not recognize ANY encoding markup or meta-data
    > > and always use the user setting (which is stupid in that case,
    > > unless the page was incorrectly labelled).
    > IIRC, there are still problems with recent versions of browsers in relation
    > to NCRs: some understand hex but not decimal, or vice versa.
    > Sounds like what's needed more than a logo to identify pages in UTF-8 is a
    > logo to identify browsers (and probably HTML editors) that do the right
    > thing wrt encoding.

    Browsers that do not understand NCRs (either decimal or hexadecimal) are not HTML4 compliant (and cannot be made compliant with XML or XHTML either). I think this case should be exceptional by now (HTML4 is now an old standard).

    But the HTML standard does not specify how the character encoding can be indicated. There are two ways to do this:

    1) outside the document, using the HTTP "Content-Type:" header, which allows specifying a MIME content type; however, the value is not standardized in HTTP itself, but in the MIME content-type registry.

    2) within the document, using the <meta> element (the element is standardized in HTML, but not in XML or XHTML, and this usage has been deprecated due to problems with XML)... Here it is just a fallback method, and the value of the <meta> element refers to another specification that allows specifying HTTP equivalents (http-equiv) within the header of the document, following the rules of HTTP (which describes the role of each equivalent HTTP header name, but not its values).

    So we are left with 2 separate specifications outside the scope of the HTML standard. Moreover, these two methods interact with each other. There are technical interoperability problems, because sometimes the HTTP header contradicts the <meta http-equiv> setting in the document, and too many browsers ignore the now deprecated <meta> tag in favor of the HTTP header (and this causes deployment problems, as many web servers cannot be configured to send the appropriate HTTP header, due to security restrictions).
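
    To make the interaction between the two methods concrete, here is a minimal Python sketch of how a client could resolve the encoding. It assumes the HTTP header wins over <meta http-equiv> when both are present (the priority order HTML 4.01 describes) and falls back to ISO-8859-1 otherwise. The function names, the regular expression, and the 1024-byte scan window are illustrative assumptions, not what any particular browser does.

```python
import re
from email.message import Message

# Hypothetical, simplified matcher for <meta http-equiv="Content-Type" content="...">.
# It assumes http-equiv appears before content and is not a real HTML parser.
META_RE = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*'
    rb'content\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def charset_from_meta(body):
    """Look for a <meta http-equiv> declaration near the start of the page."""
    match = META_RE.search(body[:1024])
    if match is None:
        return None
    return charset_from_content_type(match.group(1).decode("ascii", "replace"))


def resolve_charset(http_content_type, body, default="iso-8859-1"):
    """HTTP header first, then <meta>, then the default (assumed order)."""
    from_header = charset_from_content_type(http_content_type) if http_content_type else None
    return from_header or charset_from_meta(body) or default
```

    With a page that declares charset=gb2312 in its <meta> element, resolve_charset("text/html; charset=UTF-8", page) picks the header value, while resolve_charset("text/html", page) falls back to the <meta> value. A browser that skips both steps and uses only the user setting shows the behaviour complained about above.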

    Some browsers will NOT autodetect the UTF-8 BOM (because it is NOT recommended by Unicode...) and so will not switch automatically to UTF-8 in the absence of a header or <meta> element.

    Such standardization occurred only recently, so in most cases it is safer to encode a page with NCRs using ISO-8859-1 as the base encoding of the document (decimal recommended, as many more browsers recognize decimal NCRs than hexadecimal ones).

    Decimal NCRs are a legal way (and the most interoperable for now) to specify Unicode characters, even with browsers that have an implementation of UTF-8 (due to the nightmare of conflicting settings in servers and proxies, deprecation of <meta>, and user settings).
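
    As an illustration of this approach, here is a short Python sketch that serializes text as ISO-8859-1 bytes and replaces everything outside that repertoire with decimal NCRs. The function name is a hypothetical helper; the "xmlcharrefreplace" error handler is a standard Python codec option that emits decimal references.

```python
import html


def to_latin1_with_ncrs(text):
    """Encode text as ISO-8859-1, writing unencodable characters as decimal NCRs.

    Note: this does not escape markup characters such as & or <; it only
    handles characters that ISO-8859-1 cannot represent.
    """
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


sample = "caf\u00e9 \u4e2d\u6587"
encoded = to_latin1_with_ncrs(sample)
print(encoded)  # b'caf\xe9 &#20013;&#25991;'

# Round trip: decode the bytes as ISO-8859-1, then resolve the NCRs.
assert html.unescape(encoded.decode("iso-8859-1")) == sample
```

    The resulting bytes are valid ISO-8859-1 whatever the browser's encoding setting, which is why this form survives conflicting server, proxy, and user settings.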

    Once Netscape 4, IE3, and early versions of Opera or Lynx become insignificant, we will be able to use UTF-8 everywhere. For now it's too risky for any commercial website that does not have its home page also accessible in a language encoded with ISO-8859-1. We can promote UTF-8, but for now we must still maintain ISO-8859-1 as the de facto standard for default homepages in most Western European languages (from where a user can select and try another language).

    I have experienced this with a website publishing a Chinese translation. Most Chinese users complained that the UTF-8 page was not rendered automatically with the proper characters (they had to manually select the UTF-8 encoding in their browser). All attempts to specify the encoding in the HTTP header and in the <meta http-equiv> failed. All complaints stopped immediately when the Chinese pages were reverted to ISO-8859-1 using decimal NCRs!

    We could have used GB2312 for them (most Chinese users seem to have browsers that render it correctly, as GB2312, and now the newer GB18030, is mandatory in China), but maintaining pages in this encoding is really too complicated, as it constantly requires re-encoding with an external tool.

    This is proof that browsers, although they understand the Unicode standard, do not understand the other standards, which sometimes conflict with each other but are still needed...

    I do hope that old legacy browsers will remove the bugs in automatically selecting the appropriate encoding used in the pages, but the deprecation of the <meta http-equiv> element for specifying the encoding, and the deployment problem with HTTP headers, are still issues.

    I do think that the best solution would be to use a leading BOM in HTML documents encoded with UTF-8 (even if Unicode does not recommend it), and have browsers interpret it correctly. This interpretation of the BOM in UTF-8 is outside the Unicode standard but part of the HTML/XML standard.
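
    A minimal Python sketch of what such BOM handling could look like on the receiving side follows. The function name and the (encoding, payload) return convention are assumptions made for illustration; only the three-byte UTF-8 signature (EF BB BF, available as codecs.BOM_UTF8) is standard.

```python
import codecs


def sniff_utf8_bom(data):
    """Return ("utf-8", payload without BOM) if data starts with the UTF-8 BOM.

    Otherwise return (None, data) so the caller can fall back to the HTTP
    header, the <meta> element, or the user's default setting.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", data[len(codecs.BOM_UTF8):]
    return None, data


encoding, payload = sniff_utf8_bom(codecs.BOM_UTF8 + "<p>\u4e2d\u6587</p>".encode("utf-8"))
if encoding is not None:
    print(payload.decode(encoding))  # <p>中文</p>
```

    Because the check needs only the first three bytes of the document, it does not depend on server configuration or on any markup being parsed first.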

    -- Philippe.
