Re: FYI: Google blog on Unicode

From: Doug Ewell (doug@ewellic.org)
Date: Mon Feb 08 2010 - 20:43:57 CST

    Mark Davis ☸ wrote:

    > There are really two methodologies in question.
    >
    > 1. Accept the charset tagging without question.
    > 2. Use charset detection, which uses a number of signals. The primary
    > signal is a statistical analysis of the bytes in the document, but the
    > charset tagging is taken into account (and can sometimes make a
    > difference).
    >
    > The issue is which of these, on balance, produces better
    > results for web pages and other documents. And with pretty exhaustive
    > side-by-side comparisons of encodings, it is clear that #2 does,
    > overwhelmingly.

    What about option 1½: use charset detection, assisted by the charset
    tagging? That is, if the content is valid UTF-8 or UTF-16, or something
    else unambiguous like GB18030, ignore the tagging and trust the
    detection algorithm fully. But if the algorithm shows that it could
    reasonably be any of 8859-1 or -2 or -15, and it is tagged as 8859-2,
    trust the tag. Just a thought.
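
A minimal Python sketch of the "option 1½" idea, to make the decision order concrete. It is an illustration under stated assumptions, not anyone's actual detector: `choose_charset`, `normalize`, and `looks_like_utf8` are hypothetical names, `candidates` stands in for whatever ranked list a statistical detector would report as plausible, UTF-16 is recognized only by its byte order mark, and the GB18030 case is left out.

```python
import codecs
from typing import Optional, Sequence


def normalize(name: str) -> Optional[str]:
    """Return Python's canonical codec name, or None if the label is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def looks_like_utf8(data: bytes) -> bool:
    """True if data contains non-ASCII bytes and decodes strictly as UTF-8."""
    if data.isascii():
        return False  # pure ASCII says nothing about the intended encoding
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def choose_charset(data: bytes, tag: Optional[str],
                   candidates: Sequence[str]) -> str:
    """Pick an encoding from the bytes, the declared tag, and detector output.

    `candidates` is assumed to be the detector's plausible encodings,
    best guess first (hypothetical input, not a real library's API).
    """
    # Step 1: unambiguous content overrides the tag entirely.
    # Assumption: UTF-16 is only recognized when a BOM is present.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if looks_like_utf8(data):
        return "utf-8"

    # Step 2: the detector is undecided, so the tag breaks the tie,
    # but only if it names one of the plausible candidates.
    plausible = [n for n in (normalize(c) for c in candidates) if n]
    tagged = normalize(tag) if tag else None
    if tagged in plausible:
        return tagged

    # Step 3: tag missing, unknown, or implausible: use the top guess.
    if not plausible:
        raise ValueError("no usable candidate encodings")
    return plausible[0]
```

For example, Czech text encoded as ISO 8859-2 and tagged as such keeps its tag when the detector cannot separate 8859-1, -2, and -15, while valid UTF-8 content carrying the same tag is still treated as UTF-8.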

    --
    Doug Ewell  |  Thornton, Colorado, USA  |  http://www.ewellic.org
    RFC 5645, 4645, UTN #14  |  ietf-languages @ http://is.gd/2kf0s
    

