RE: HTML5 encodings (was: Re: BOCU patent)

From: Phillips, Addison (
Date: Wed Dec 30 2009 - 12:58:32 CST


    (personal response)
    > >
    > > You need to be careful here. This section deals with the
    > selection of
    > > the character encoding used to interpret a document when all
    > other
    > > possibilities are exhausted---including auto-detection of UTF-8
    > and
    > > the user's personal preference. If you already know a document is
    > not
    > > UTF-8, using UTF-8 to interpret it is not that useful.
    > OK, I did miss something there. But then I wonder why UTF-8 is
    > still recommended in this table for some languages.

    Because some people thought it appropriate to recommend it.

    There are two reasons. First, the W3C I18N WG notes that some languages do not have their own widely supported, standardized encoding. Various improvised or font-based encodings certainly exist, but users of these languages are better served by switching to UTF-8 as soon as practical, and UTF-8 is the encoding most likely to give them good results. Since the resulting page is pretty likely to have mojibake all over it anyway, choosing UTF-8 helps prompt users to do the right thing when fixing their pages.

    The second reason, though, explains the contents of the table in the current editor's copy: someone looked to see what specific browsers (notably Mozilla) do for a given locale and wrote it down.

    The I18N WG doesn't agree with having a normative table in the spec. Currently the table is not normative, merely "suggested", but I still hope it is removed in future versions.
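
    To make the order of fallbacks concrete, here is a minimal Python sketch of the "last resort" selection described in the quoted text. It is only an illustration: the function names, the single locale entry, and the final windows-1252 fallback are assumptions made for this example, not the contents of the table in the draft.

```python
from typing import Optional

# Placeholder entry for illustration only; NOT the table from the draft.
LOCALE_DEFAULTS = {"ru": "windows-1251"}
# Assumed catch-all for locales with no entry above.
LAST_RESORT = "windows-1252"


def looks_like_utf8(data: bytes) -> bool:
    """Crude auto-detection: does the byte string decode cleanly as UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def pick_encoding(
    data: bytes,
    declared: Optional[str] = None,
    user_pref: Optional[str] = None,
    locale: str = "en",
) -> str:
    """Return the encoding label to use, consulting the locale default last."""
    if declared:                 # an explicit label wins
        return declared
    if looks_like_utf8(data):    # auto-detection of UTF-8
        return "utf-8"
    if user_pref:                # the user's personal preference
        return user_pref
    # Only now, with everything else exhausted, consult the locale table.
    return LOCALE_DEFAULTS.get(locale, LAST_RESORT)
```

    Note that by the time the locale table is consulted, the bytes have already failed to decode as UTF-8, which is why a UTF-8 entry in that table is of limited use as a decoder and serves mostly as a nudge to authors.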

    > >> More "willful violations" appear in Section, in which
    > >> browsers are required to "misinterpret for compatibility" ISO
    > and
    > >> national-standard character sets as Windows code pages, even
    > when the
    > >> author specified the ISO or national character set.
    > >
    > > It usually isn't clear what the author meant to specify. The vast
    > > preponderance of people have no idea what a character encoding is.
    > True, and again I don't have much of a problem with encouraging or
    > recommending this sort of behavior, but I do have a problem with
    > requiring it.

    "Encouraging" or "recommending" doesn't play very well with normative language or the goals of HTML5 (which is intended to provide for consistent rendering of the same Web page on compliant browsers). Reducing the normative language from MUST to SHOULD or even MAY would not alter the actual implementations already obeying this diktat. And future implementers would be given less-clear guidance on how to handle pages.
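
    For anyone who has not seen the "misinterpretation" in practice, a small Python sketch follows. The one-entry override map and the helper names are hypothetical and exist only for this illustration; the list in the draft is longer and is not reproduced here.

```python
# Hypothetical override map for illustration; the draft's list is longer.
COMPAT_OVERRIDES = {
    "iso-8859-1": "windows-1252",
}


def effective_encoding(declared_label: str) -> str:
    """Map a declared label to the encoding a browser would actually use."""
    label = declared_label.strip().lower()
    return COMPAT_OVERRIDES.get(label, label)


def decode_for_display(data: bytes, declared_label: str) -> str:
    """Decode bytes using the compatibility mapping rather than the label."""
    return data.decode(effective_encoding(declared_label), errors="replace")


# Bytes 0x93 and 0x94 are curly quotes in windows-1252, but invisible C1
# control characters in ISO-8859-1 proper.
raw = b"\x93quoted\x94"

strict = raw.decode("iso-8859-1")               # honors the label literally
compat = decode_for_display(raw, "ISO-8859-1")  # what the author likely meant

print(repr(strict))
print(repr(compat))
```

    A page labeled ISO-8859-1 that contains those bytes was almost certainly produced on Windows, so the strict reading gives the user control characters while the compatible reading gives the quotation marks the author typed.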

    Note too: HTML5 has a couple of different publication "modes". You are reading the version intended for user agent implementers. The version intended for authors of HTML5 documents does NOT willfully violate CharMod. It tells users to label their content and its encoding correctly (ideally in UTF-8).

