RE: HTML5 encodings (was: Re: BOCU patent)

From: Phillips, Addison (addison@amazon.com)
Date: Tue Dec 29 2009 - 13:52:32 CST


    (this is a personal response)

     
    > The HTML5 draft uses disclaimers such as this one to justify such
    > decisions:
    >
    > "This algorithm is a willful violation of the HTTP specification, which
    ...

    > This was from a table in Section 9.2.2.1, where browser developers are
    > encouraged to choose a default encoding that is not Unicode 2/3 of the
    > time based on "the user's locale."

    You need to be careful here. This section deals with selecting the character encoding used to interpret a document when all other possibilities have been exhausted, including auto-detection of UTF-8 and the user's personal preference. If you already know a document is not UTF-8, interpreting it as UTF-8 is not very useful.
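
To make the "last resort" point concrete, here is a rough Python sketch of a fallback chain. It is not the draft's actual algorithm: the function names, the ordering of the steps, and the tiny locale table are all assumptions made up for illustration.

```python
from typing import Optional

# Illustrative only: a tiny, hypothetical locale-to-default table.
LOCALE_DEFAULTS = {"ru": "windows-1251", "ja": "shift_jis"}


def looks_like_utf8(raw: bytes) -> bool:
    """Crude auto-detection: does the byte string decode as strict UTF-8?"""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def pick_encoding(raw: bytes,
                  declared: Optional[str] = None,
                  user_pref: Optional[str] = None,
                  locale: str = "en") -> str:
    """Hypothetical fallback chain; the locale default is reached only
    after every better source of information has been exhausted."""
    if declared:                 # explicit label from the document or transport
        return declared
    if looks_like_utf8(raw):     # auto-detection of UTF-8
        return "utf-8"
    if user_pref:                # the user's personal preference
        return user_pref
    # Last resort: at this point we already know the bytes are not UTF-8.
    return LOCALE_DEFAULTS.get(locale, "windows-1252")
```

With this sketch, `pick_encoding("café".encode("utf-8"))` yields `"utf-8"`, while the legacy bytes `b"caf\xe9"` fall through to the locale default.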

    Discussion of exactly how to word that paragraph is ongoing :-).

    >
    > More "willful violations" appear in Section 9.2.2.2, in which browsers
    > are required to "misinterpret for compatibility" ISO and
    > national-standard character sets as Windows code pages, even when
    > the author specified the ISO or national character set.

    It usually isn't clear what the author meant to specify. The vast majority of people have no idea what a character encoding is.

    >
    > The implications are that (1) the authors of the present draft know
    > better than authors of previous works on character encoding and (2)
    > compatibility with existing, incorrectly or incompletely marked HTML
    > documents is more important than adherence to standards. This is a
    > departure from all other HTML and XHTML specifications I've ever
    > seen from the W3C.
    >

    Yes, it is. However...

    Although CharMod [1] (which is the thing being willfully violated) says you MUST NOT do this (requirements C028 and C030), the simple fact is that users are much better off when user agents (and web sites and search engines and...) apply this particular misinterpretation. Superset encodings are a fact of life.

    In particular, the C1 control characters (0x80 through 0x9F) in the ISO 8859 series of standards have no useful interpretation in a Web context. Microsoft's "appropriation" of these bytes for the Windows code pages makes it very likely that any such bytes in a Web page, stylesheet, etc. actually represent the Microsoft encoding and not an ISO or national character set. Users do not appreciate the distinction, data using these bytes is prevalent throughout the Internet, and user agents (and web sites and search engines and...) have accommodated these facts via "willful violation" in the past. Given all that, making these specific mappings makes more sense than requiring user agents to do something that their customers will NOT appreciate.
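
For anyone who wants to see the difference on actual bytes, here is a small Python sketch. The sample bytes, the `decode_labeled` helper, and its one-entry label table are hypothetical; the draft's real table covers more encodings than this.

```python
import unicodedata

# Hypothetical sample: bytes from a page labeled ISO-8859-1 but authored
# with a Windows editor that emitted "smart quotes" and a dash.
raw = b"\x93quoted\x94 \x96 caf\xe9"

# Strict reading of the label: 0x93, 0x94 and 0x96 become C1 controls.
as_iso = raw.decode("iso-8859-1")

# The superset reading: the same bytes become printable punctuation.
as_win = raw.decode("windows-1252")

# Labels that this sketch reinterprets (illustrative, not the full table).
SUPERSET_OF = {"iso-8859-1": "windows-1252", "latin1": "windows-1252"}


def decode_labeled(data: bytes, label: str) -> str:
    """Decode using the superset code page when the label has one."""
    key = label.strip().lower()
    return data.decode(SUPERSET_OF.get(key, key))


# Every byte in 0x80..0x9F is a control character under ISO-8859-1.
c1_controls = [b for b in range(0x80, 0xA0)
               if unicodedata.category(bytes([b]).decode("iso-8859-1")) == "Cc"]
```

Here `as_iso` starts with the invisible control U+0093, whereas `as_win` starts with a left double quotation mark (U+201C). Bytes outside the C1 range, such as 0xE9, decode identically either way.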

    Addison

    [1] http://www.w3.org/TR/charmod/#sec-EncodingIdent

    Addison Phillips
    Globalization Architect -- Lab126
    Chair -- W3C Internationalization WG

    Internationalization is not a feature.
    It is an architecture.


