Date: Thu Jul 01 2010

    > > The problem with slavishly following the charset parameter is that it
    > > is often incorrect.

    > I wonder how you could draw such a conclusion. In order to make such
    > a statement, there must be some other (god-given?) parameter, which is the "real charset".

    > Each and every program (webbrowser, newsreader, e-mailer ...)

    Actually, historically, that's not quite right. NOW they do (if they're behaving), but in the past they often just used whatever the system code page was. Worse, people would write in one local code page, put the file on an en-US server, and then "test" it from the same source machine (same locale). So it "worked", but only for them. Once a machine with a different locale read it, it broke.

    Even worse, either the editing software or the server might mistag the code page while trying to fill in missing information. There was also a common abuse of the ISO code page labels for data that was really encoded in a Windows code page.
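
    To make that ISO-versus-Windows mislabeling concrete, here is a minimal sketch, assuming Python 3 and its built-in codecs. The helper name `has_c1_controls` is made up for illustration and is not taken from any browser or library. Curly quotes saved by a Windows-1252 editor come out as C1 control characters when the data is decoded under an ISO-8859-1 label:

```python
# Hypothetical helper for illustration only.
def has_c1_controls(text: str) -> bool:
    """True if text contains U+0080..U+009F, which rarely occur in real prose."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)

# Curly quotes as a Windows-1252 editor would save them: bytes 0x93 and 0x94.
data = "\u201cquoted\u201d".encode("cp1252")

as_tagged = data.decode("iso-8859-1")  # follow the (wrong) ISO label
as_intended = data.decode("cp1252")    # what the author actually wrote

print(has_c1_controls(as_tagged))    # True: 0x93/0x94 became control characters
print(has_c1_controls(as_intended))  # False: they are real quotation marks
```

    A run of C1 controls is only a hint that the label is wrong, not proof, so treat this as a heuristic.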

    So now, in theory, and in well-behaved environments, the tagging is much more accurate. However, it can be difficult to distinguish correctly tagged data from mis-tagged data. Using UTF-8 helps a ton, because it's usually pretty obvious when data is UTF-8: non-ASCII text in a legacy code page rarely happens to form valid UTF-8 byte sequences.
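
    A rough sketch of why UTF-8 is easy to recognize, assuming Python 3. Both function names are hypothetical, and this is not a claim about what any particular browser or server does. A strict decode either succeeds or fails, and legacy-encoded non-ASCII text almost always fails:

```python
# Hypothetical helpers for illustration only.
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def choose_charset(data: bytes, declared: str) -> str:
    """Prefer UTF-8 when non-ASCII data is valid UTF-8; otherwise trust the tag."""
    if not data.isascii() and looks_like_utf8(data):
        return "utf-8"
    return declared

utf8_page = "caf\u00e9".encode("utf-8")     # b'caf\xc3\xa9'
legacy_page = "caf\u00e9".encode("cp1252")  # b'caf\xe9'

print(choose_charset(utf8_page, "iso-8859-1"))    # utf-8
print(choose_charset(legacy_page, "iso-8859-1"))  # iso-8859-1
```

    Pure ASCII is valid in all of these encodings, so the check only says something when non-ASCII bytes are present. When the strict decode fails, you still don't know which legacy code page the data is really in.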

    Anyway, I have no clue what Google's doing, but mis-tagging of data is a common problem in the industry, and a great reason to use Unicode. Some countries have an even bigger problem due to variations in implementations of their commonly used code pages, and extensions which may or may not be supported. It's also part of why you occasionally see things like badly marked up rich quotes on major news sites, even now.
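
    One plausible way those mangled rich quotes come about, sketched in Python 3 with made-up sample text: UTF-8 bytes for a typographic apostrophe get decoded under a Windows-1252 label, so one character turns into three.

```python
# Sample text is invented for illustration.
original = "it\u2019s"                 # "it's" with a typographic apostrophe
utf8_bytes = original.encode("utf-8")  # the apostrophe becomes bytes E2 80 99

# A reader that trusts a wrong Windows-1252 tag sees three characters instead:
# U+00E2 (a-circumflex), U+20AC (euro sign), U+2122 (trade mark sign).
garbled = utf8_bytes.decode("cp1252")

# If nothing else has touched the text, the damage can be reversed.
repaired = garbled.encode("cp1252").decode("utf-8")
```

    The reverse step only works while every garbled character still maps back to a Windows-1252 byte. Once the text has been re-saved or normalized a few times, it may not be recoverable.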


