Re: FYI: Google blog on Unicode

From: Mark Davis ☕ (mark@macchiato.com)
Date: Mon Feb 08 2010 - 13:06:15 CST


    It is unclear exactly what point you are trying to make.

    There are really two methodologies in question.

       1. Accept the charset tagging without question.
       2. Use charset detection, which relies on a number of signals. The primary
       signal is a statistical analysis of the bytes in the document, but the
       charset tagging is also taken into account (and can sometimes make a difference).

    The issue is which of these, on balance, produces better results for
    web pages and other documents. And with pretty exhaustive side-by-side
    comparisons of encodings, it is clear that #2 does, overwhelmingly.

    Of course, the less the contents of the document look like "real text", the
    more likely it is that #2 will produce incorrect results. But we have to go
    with the approach that produces the best results overall.
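
    For concreteness, here is a minimal sketch of what #2 looks like in code,
    using the open-source Python chardet library rather than the detector we
    actually use; the function name, confidence threshold, and sample text are
    illustrative only.

    import chardet

    def pick_encoding(raw_bytes, declared=None, confidence_floor=0.5):
        # Statistical guess from the bytes themselves (the primary signal).
        guess = chardet.detect(raw_bytes)
        if guess["encoding"] and guess["confidence"] >= confidence_floor:
            return guess["encoding"]
        # Weak or missing guess: fall back to the declared charset tag
        # (the secondary signal), then to a common default.
        return declared or guess["encoding"] or "windows-1252"

    # Bytes tagged ISO-8859-1 but actually encoded as ISO-8859-2 (Polish text).
    raw = "Zażółć gęślą jaźń".encode("iso-8859-2")
    print(pick_encoding(raw, declared="iso-8859-1"))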

    Mark

    On Mon, Feb 8, 2010 at 09:40, Andreas Prilop <prilop4321@trashmail.net> wrote:

    > On Fri, 29 Jan 2010, Mark Davis wrote upside-down:
    >
    > > It is encodings determined by a detection algorithm.
    >
    > This is so stupid!
    >
    > The results can be seen here:
    > http://groups.google.co.uk/group/pl.test/msg/1fa7fa753aad46a2
    >
    > Special characters are often messed up in groups.google
    > because your stupid algorithm takes ISO-8859-1 when the
    > message is actually ISO-8859-2 or ISO-8859-15 or whatever.
    >
    > http://groups.google.co.uk/group/pl.test/msg/359af83289a00e8e
    >
    > > The declarations for encodings (and language)
    > > are far too unreliable to be depended on.
    >
    > Unreliable is a guy who doesn't even know how to quote.
    >
    >


