Re: Bad Content-type headers on Unicode web site?

From: Markus Scherer (markus.icu@gmail.com)
Date: Fri Mar 04 2005 - 11:03:23 CST

    The problem, of course, is that web servers usually don't know which
    file has which encoding. A recent Apache update that made ISO-8859-1
    the default, and sent it rather than leaving the charset unspecified,
    is famous for wreaking havoc on content in other charsets. There is a
    way to specify per-file metadata, but that's a manual process and
    tends to get out of sync.

    You also can't declare the same charset for all UCD files because
    there are at least two in use (ISO-8859-1 and UTF-8) for different
    files.

    Unicode signatures (BOMs) might help, but they are controversial and
    may break UCD file parsers.
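
    To illustrate both halves of that, here is a minimal Python sketch.
    The sample line and the helper name detect_signature are made up for
    illustration; they are not taken from any actual UCD tool. It shows
    how a server could recognize the UTF-8 signature, and how a parser
    that does not expect it ends up with U+FEFF glued to its first field.

```python
import codecs

def detect_signature(data: bytes):
    """Return 'UTF-8' if data starts with the UTF-8 signature, else None."""
    return "UTF-8" if data.startswith(codecs.BOM_UTF8) else None

# Hypothetical UCD-style line, prefixed with the UTF-8 signature (EF BB BF).
raw = codecs.BOM_UTF8 + b"0041;LATIN CAPITAL LETTER A\n"

# A parser that decodes as plain UTF-8 keeps U+FEFF in the first field,
# so the code point field no longer parses as hex.
naive_first_field = raw.decode("utf-8").split(";")[0]

# Python's "utf-8-sig" codec drops the signature if it is present.
clean_first_field = raw.decode("utf-8-sig").split(";")[0]
```

    A parser that is not signature-aware would need a change along the
    lines of the second decode call before signed files could be
    published safely.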

    It looks like there is no good solution. HTML and XML have mechanisms
    for internal charset declarations, but plain text doesn't. If you add
    some syntax, it becomes markup...

    I suppose the UCD files (the ones which are not in ISO-8859-1) could
    get a comment line with some syntax, and the web server could in
    principle parse the files and pick that up. That would be a custom
    solution, though. Or add the signature on the server and strip it
    while serving. (That would need a production tool change.)
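
    As a rough sketch of the comment-line idea in Python: the
    "# charset: ..." syntax and the function name content_type_for are
    purely hypothetical, since the UCD files define no such convention.
    Files without a declaration fall back to ISO-8859-1, matching the
    assumption that only the other files would get the comment.

```python
import re

# Hypothetical declaration syntax; UCD files define no such convention.
_DECL = re.compile(rb"^#[ \t]*charset:[ \t]*([A-Za-z0-9._-]+)", re.MULTILINE)

def content_type_for(head: bytes, default: str = "ISO-8859-1") -> str:
    """Build a Content-Type value from the first bytes of a file.

    Looks for a '# charset: NAME' comment line in `head`; if none is
    found, the default charset is used.
    """
    match = _DECL.search(head)
    charset = match.group(1).decode("ascii") if match else default
    return f"text/plain; charset={charset}"
```

    A server hook would read only the first few hundred bytes of the
    file, pass them in as head, and send the result as the Content-Type
    header.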

    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless
    otherwise noted.
    

