Re: Bad Content-type headers on Unicode web site?

From: Erik van der Poel (erik@vanderpoel.org)
Date: Fri Mar 04 2005 - 12:41:18 CST

    Markus Scherer wrote:
    > I suppose the UCD files (the ones which are not in ISO-8859-1) could
    > get a comment line with some syntax, and the web server could in
    > principle parse the files and pick that up. That's a custom solution
    > then. Or add the signature on the server and strip it while serving.
    > (Production tool change.)

    It might be a good idea to convert all of unicode.org's non-UTF-8 *.txt
    files to UTF-8, if that wouldn't cause too many problems. I'm not very
    familiar with the ways in which those *.txt files are used, so I may be
    grossly underestimating the impact of such a change. If you can convert
    all of them to UTF-8, then Apache probably also has a way to specify the
    charset for *.txt site-wide. Alternatively, you might be able to do it
    on a per-directory basis. Per-file might not be good since it could more
    easily get out of sync, as you say. Look for Apache in:

    http://www.w3.org/International/tutorials/tutorial-char-enc/#declaring
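For illustration, the site-wide and per-directory approaches could look roughly like this in Apache configuration. `AddCharset` is the mod_mime directive that appends a `charset` parameter to the Content-Type for a given extension; the directory path below is purely hypothetical:

```apache
# Site-wide (httpd.conf): serve every *.txt file as UTF-8.
# This adds "; charset=UTF-8" to the Content-Type header.
AddCharset UTF-8 .txt

# Per-directory alternative: scope the declaration to just the
# converted files with a <Directory> block (or an .htaccess file
# in that directory). The path here is a made-up example.
<Directory "/www/htdocs/ucd">
    AddCharset UTF-8 .txt
</Directory>
```

The per-directory form would let the declaration be rolled out gradually, directory by directory, as files are actually converted.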

    Back in 1995 at Netscape, I thought we had the opportunity to force all
    Web servers to specify the charset for non-iso-8859-1 text. We were
    about to ship 2.0 with a number of new features, and I argued that that
    might be a good time to treat a missing charset parameter as iso-8859-1.
    Netscape had a large market share at that time. A lot of users would
    have started to use 2.0, and the Web servers would have had to follow
    suit. Or at least, that was my argument. The management said no. It is
    actually difficult to tell whether that change would have been
    successful. If it had led to a lot of pain and suffering, we might have
    shipped 2.01 immediately, with that decision reversed.

    So in this case, I don't think hindsight is 20-20, though it might have
    worked if we had tried it earlier in the game, say, in Netscape 1.1 or 1.0. It
    is somewhat embarrassing to admit all this on an open mailing list, but
    maybe there is a lesson to be learned from it.

    A somewhat similar issue is that of strict HTML parsing. Mosaic and
    Netscape are considered notorious in some circles for being so lax about
    HTML syntax. Some people say that is why we have so much *garbage* out
    there, and others say that is why we have *so much* "garbage" out there.
    See the *emphasis* I have placed in that sentence.

    The latter argument is that the Web exploded *because* it was so easy to
    create content, even by hand.

    There was one issue at around that time, though, where Netscape actually
    *did* decide to tighten the parser. We went round and round in circles,
    trying to decide, and we finally bit that bullet, and it worked. I
    forget the issue, but I think it had something to do with quotes in HTML
    attributes.

    Erik



    This archive was generated by hypermail 2.1.5 : Fri Mar 04 2005 - 12:42:34 CST