RE: Fun with proof by analogy, was Re: Mojibake on my Web pages

From: Jill Ramonsky (Jill.Ramonsky@Aculab.com)
Date: Mon Sep 29 2003 - 11:01:49 EDT


    I don't see anything wrong with the spec. So far as I can see, it is
    doing the right thing, although the behaviour of the described server
    could be better.

    First point - if no information is present, assume "us-ascii". Sounds
    /extremely sensible/ to me. ASCII is the common subset of Latin-1,
    UTF-8, and various other commonly used encodings. Moreover, for a
    client even to /read/ the name of the encoding, that name must itself
    have been encoded in /something/. It makes sense to me to assume the
    absolute minimum. If you want more than the minimum, declare your
    encoding. This should not be a problem.
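    A minimal Python sketch of the "common subset" point. The helper name
    decodes_identically and the sample meta tag are illustrative
    assumptions, not anything taken from the spec. It checks that bytes in
    the ASCII range, including a charset declaration itself, read the same
    under ASCII, Latin-1 and UTF-8, while a byte above 0x7F does not.

```python
def decodes_identically(data, encodings=("ascii", "latin-1", "utf-8")):
    """Return True if `data` decodes to the same text under every encoding."""
    try:
        texts = {data.decode(enc) for enc in encodings}
    except UnicodeDecodeError:
        # At least one encoding cannot represent these bytes at all.
        return False
    return len(texts) == 1


# Every byte value in the ASCII range 0x00-0x7F.
all_ascii = bytes(range(0x80))

# A hypothetical declaration: the bytes that name the encoding are ASCII.
declaration = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

# 0xE9 is a letter in Latin-1 but is not valid ASCII or standalone UTF-8.
high_byte = b"\xe9"

print(decodes_identically(all_ascii))    # True
print(decodes_identically(declaration))  # True
print(decodes_identically(high_byte))    # False
```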

    Second point - the "search order" - (1) HTTP header sent by the
    server; (2) XML declaration; (3) HTML meta tag. This also makes sense
    to me. Yes, the document author should know best, but it is the
    /_server_/, not the /_client_/, which should take notice of the meta
    tag.

    As far as the browser is concerned, meta tags in the document _/must
    not/_ override the headers, as this could result in security holes
    exploitable by attackers.
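    The search order can be sketched roughly as follows, in Python. This
    is a simplified illustration under stated assumptions, not a
    conforming implementation: the function names are hypothetical, the
    regular expressions only handle common spellings of each declaration,
    and a Content-Type header without a charset simply falls through to
    the next source, which is looser than the rule John describes in the
    quoted message below.

```python
import re

DEFAULT_CHARSET = "us-ascii"  # assumed when no source declares anything


def charset_from_content_type(value):
    """Charset parameter of an HTTP Content-Type value, or None."""
    if not value:
        return None
    m = re.search(r'charset\s*=\s*"?([^";\s]+)', value, re.IGNORECASE)
    return m.group(1).lower() if m else None


def charset_from_xml_declaration(body):
    """Encoding named in a leading <?xml ... ?> declaration, or None."""
    m = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', body)
    return m.group(1).decode("ascii").lower() if m else None


def charset_from_meta(body):
    """Charset named in an HTML meta tag, or None."""
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)',
                  body, re.IGNORECASE)
    return m.group(1).decode("ascii").lower() if m else None


def resolve_charset(content_type, body):
    """Return the first charset found: header, XML declaration, meta tag."""
    candidates = (
        charset_from_content_type(content_type),
        charset_from_xml_declaration(body),
        charset_from_meta(body),
    )
    for candidate in candidates:
        if candidate:
            return candidate  # stop at the first source that says anything
    return DEFAULT_CHARSET


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head></html>'

# The header wins even though the meta tag disagrees.
print(resolve_charset("text/html; charset=ISO-8859-1", page))  # iso-8859-1
# With no header at all, the meta tag is consulted.
print(resolve_charset(None, page))                              # utf-8
# With nothing declared anywhere, fall back to the minimum.
print(resolve_charset(None, b"<html></html>"))                  # us-ascii
```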

    The issue is slightly more complicated, though. The browser /must/
    believe the HTTP headers. However, if the meta tags and HTTP headers
    are in conflict, then I believe _the server is at fault_ for not
    making the correct declaration. If the document author says (in a meta
    tag) "this is in UTF-8", then the server should (in my opinion) send
    the document to the browser with a declared encoding of UTF-8. That
    is, the server should ensure that the HTTP header does not conflict
    with the meta tag, by changing the HTTP header to match it. If a
    server fails to do this, the browser must still believe the HTTP
    header.
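    On the server side, the idea might look something like this Python
    sketch. It is only an illustration under assumptions: the function
    name content_type_for, the 1024-byte scan window and the regular
    expression are all hypothetical, and no particular web server is
    claimed to work this way. It reads the charset out of the document's
    meta tag and builds the Content-Type header value from it, so the two
    cannot disagree.

```python
import re

# Matches common spellings of a charset inside an HTML meta tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def content_type_for(body, media_type="text/html", default_charset=None):
    """Build a Content-Type value whose charset matches the meta tag.

    Only the first 1024 bytes are scanned (an arbitrary choice for this
    sketch). If no meta tag is found, `default_charset` is used, and if
    that is None too, no charset parameter is sent.
    """
    m = META_CHARSET.search(body[:1024])
    charset = m.group(1).decode("ascii") if m else default_charset
    if charset:
        return f"{media_type}; charset={charset}"
    return media_type


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head></html>'

print(content_type_for(page))              # text/html; charset=UTF-8
print(content_type_for(b"<html></html>"))  # text/html
```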

    Jill

    > -----Original Message-----
    > From: John Cowan [mailto:cowan@mercury.ccil.org]
    > Sent: Saturday, September 27, 2003 3:48 PM
    > To: jameskass@att.net
    > Cc: unicode@unicode.org
    > Subject: Re: Fun with proof by analogy, was Re: Mojibake on
    > my Web pages
    >
    >
    > jameskass@att.net scripsit:
    >
    > > First, the browser checks the HTTP header, then the XML declaration
    > > (which is not relevant to HTML), then the HTML meta tag.
    > >
    > > Apparently, upon finding character set information, the operation
    > > stops, so if information is present in the HTTP header, the meta
    > > tag won't be consulted.
    >
    > It's worse than that. If the HTTP header says "text/xml" or
    > "text/html", and no charset information is provided, a fully
    > conforming browser MUST treat this as if the charset "us-ascii" is
    > specified. That's just insane, but such are the rules.
    >
    > Only if there is no header, or if the header says "application/xml",
    > do we get to proceed to other sources of knowledge.
    >
    > > All of the data should be consulted and there should be some kind
    > > of protocol in place to handle conflicting character set info.
    >
    > It *is* in place and fully specified. It's just that most of us
    > don't care for the results, and most programs don't fully conform
    > for that reason.
    >
    > --
    > Some people open all the Windows;       John Cowan
    > wise wives welcome the spring           jcowan@reutershealth.com
    > by moving the Unix.                     http://www.reutershealth.com
    >       --ad for Unix Book Units (U.K.)   http://www.ccil.org/~cowan
    >         (see http://cm.bell-labs.com/cm/cs/who/dmr/unix3image.gif)
    >


