Re: Problems encoding the spanish o

From: Philippe Verdy
Date: Mon Nov 17 2003

    From: "Marco Cimarosti" <>
    To: "'Pim Blokland'" <>; "Unicode mailing list"
    > Pim Blokland wrote:
    > > Not only that, but the process making the mistake of thinking it is
    > > UTF-8 also makes the mistake of not generating an error for
    > > encountering malformed byte sequences,
    > BTW, this process has a name: "Internet Explorer".

    Don't blame IE too much for attempting to interpret the text as UTF-8:
    the page is explicitly tagged with a UTF-8 charset. Granted, IE should stop
    using this erroneous charset tag as soon as it sees a violation of the
    UTF-8 rules, and should fall back to its "automatic selection" instead. But
    it is also true that IE still uses the legacy UTF-8 decoding, which
    accepted non-shortest sequences.
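
    To illustrate the behaviour argued for above, here is a minimal Python
    sketch of a strict decoder that gives up on the declared charset at the
    first UTF-8 violation. The function name decode_page and the fixed
    ISO-8859-1 fallback are assumptions made for the example; the fallback is
    only a stand-in for a real "automatic selection", not IE's actual logic.

```python
def decode_page(raw: bytes, declared: str = "utf-8",
                fallback: str = "iso-8859-1"):
    """Return (text, charset_used).

    Decode strictly with the declared charset; on the first violation,
    distrust the label and use the fallback charset instead.
    """
    try:
        return raw.decode(declared, errors="strict"), declared
    except UnicodeDecodeError:
        # The label was wrong (or the data is corrupt): stop trusting it.
        return raw.decode(fallback), fallback


# "codificación" saved as ISO-8859-1: the o-acute is the single byte 0xF3,
# which is not a valid UTF-8 sequence when followed by "n".
latin1_bytes = "codificaci\u00f3n".encode("iso-8859-1")
utf8_bytes = "codificaci\u00f3n".encode("utf-8")

# A non-shortest ("overlong") two-byte encoding of "/".
overlong_slash = b"\xc0\xaf"

print(decode_page(utf8_bytes))      # decoded with the declared charset
print(decode_page(latin1_bytes))    # falls back to iso-8859-1
print(decode_page(overlong_slash))  # rejected as UTF-8, falls back
```

    Python's built-in UTF-8 codec already refuses non-shortest sequences, so
    the overlong slash never decodes to "/" under the declared charset.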

    I think this bug does not occur in recent updates of IE, notably since
    MSHTML.DLL was corrected to close the security hole by no longer
    interpreting non-shortest sequences. If IE really wants to keep some
    compatibility, it could accept CESU-8 only as one possible choice in its
    "automatic selection" of charsets, or display a visible replacement
    character (such as a narrow white box) for invalid sequences (which could
    internally be handled as if they represented U+FFFF).
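
    The replacement-character option can be sketched in Python with the
    codec's "replace" error handler. Note that this handler substitutes
    U+FFFD (REPLACEMENT CHARACTER), not the U+FFFF mentioned above; the
    variable names are hypothetical and only serve the example.

```python
# The o-acute saved as the ISO-8859-1 byte 0xF3, then wrongly read as UTF-8.
mislabelled = "codificaci\u00f3n".encode("iso-8859-1")

# errors="replace" keeps the valid parts and marks each invalid sequence
# with U+FFFD instead of raising an exception.
shown = mislabelled.decode("utf-8", errors="replace")
print(shown == "codificaci\ufffdn")

# A non-shortest ("overlong") encoding of "/" must never decode to "/".
overlong_slash = b"\xc0\xaf"
shown_overlong = overlong_slash.decode("utf-8", errors="replace")
print("/" in shown_overlong)
```

    The reader then sees an obvious marker where the data was invalid, rather
    than a silently misinterpreted character.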

    But if the user forces UTF-8 decoding in the GUI, IE should still not
    accept any invalid UTF-8 sequence; it should interpret each one as an
    invalid character such as U+FFFF or, even better, disable this UTF-8 choice in the user

    So this is really an effect of the collision of multiple Unicode violations,
    both in the User-Agent interpreting the coded strings, and in the content of
    the page, incorrectly labelled UTF-8 when it is not (here: complain to your
    web page designer, or blame yourself if you created this page with invalid

    Beware, when editing a UTF-8 page that explicitly includes the UTF-8
    charset meta tag, that your editor does not save it as ISO-8859-1 merely
    because it thinks this will save storage space...
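
    A quick way to catch such a mismatch before publishing is to compare the
    declared charset with the bytes actually saved. The sketch below is an
    assumption-laden illustration: the regular expression is a deliberately
    crude approximation of meta-tag parsing, and the helper names are
    hypothetical.

```python
import re

# Crude match for charset=... inside a <meta> tag (not a full HTML parser).
META_CHARSET = re.compile(
    rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def declared_charset(raw: bytes):
    """Return the charset named in the first meta tag, or None."""
    m = META_CHARSET.search(raw)
    return m.group(1).decode("ascii").lower() if m else None


def label_matches_bytes(raw: bytes) -> bool:
    """True if the page decodes strictly under its own declared charset."""
    charset = declared_charset(raw)
    if charset is None:
        return True  # nothing declared, so nothing to contradict
    try:
        raw.decode(charset, errors="strict")
    except (UnicodeDecodeError, LookupError):
        # Invalid bytes for that charset, or an unknown charset name.
        return False
    return True


page = ('<meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"><p>codificaci\u00f3n</p>')

print(label_matches_bytes(page.encode("utf-8")))       # label is honest
print(label_matches_bytes(page.encode("iso-8859-1")))  # label is wrong
```

    This only detects the direction discussed here (ISO-8859-1 bytes labelled
    as UTF-8). The reverse mistake passes the check, because every byte
    sequence is valid ISO-8859-1.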

    There are also some bogus "web site optimizers" that perform this kind
    of encoding optimization (in addition to removing unnecessary spaces and
    newlines, or compressing/obfuscating the JavaScript code and CSS stylesheet
    class names) and don't take care to change the value of this meta tag...

    Changing the internal encoding of any text file without an explicit request
    from the user should never be done automatically without confirmation and
    logging of the actions taken.
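
    As a rough sketch of that policy, a re-encoding tool could refuse to act
    without an explicit confirmation flag, update the meta tag together with
    the bytes, and log what it did. The function transcode_page, its
    parameters, and the logger name are all hypothetical.

```python
import logging
import re

log = logging.getLogger("transcode")

# Crude match for the charset value inside a <meta> tag.
CHARSET_ATTR = re.compile(
    r'(<meta[^>]*charset=["\']?)([A-Za-z0-9_\-]+)', re.IGNORECASE)


def transcode_page(raw: bytes, src: str, dst: str,
                   confirmed: bool = False) -> bytes:
    """Re-encode a page from src to dst, keeping its charset label in sync."""
    if not confirmed:
        raise PermissionError(
            f"re-encoding {src} -> {dst} needs explicit user confirmation")
    text = raw.decode(src)  # strict: fails if raw is not valid src
    # Rewrite the declared charset so the label matches the new bytes.
    text = CHARSET_ATTR.sub(lambda m: m.group(1) + dst, text)
    out = text.encode(dst)  # strict: fails if dst cannot represent the text
    log.info("re-encoded page from %s to %s (%d -> %d bytes)",
             src, dst, len(raw), len(out))
    return out


original = '<meta charset="utf-8"><p>codificaci\u00f3n</p>'.encode("utf-8")
converted = transcode_page(original, "utf-8", "iso-8859-1", confirmed=True)
print(converted)
```

    Both the decode and the encode step are strict, so a page that cannot be
    represented in the target charset raises an error instead of being
    silently damaged.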

