Re: FYI: Google blog on Unicode

From: Mark Davis ☕ (mark@macchiato.com)
Date: Mon Feb 08 2010 - 13:06:15 CST


    It is unclear exactly what point you are trying to make.

    There are really two methodologies in question.

       1. Accept the charset tagging without question.
       2. Use charset detection, which relies on a number of signals. The primary
       signal is a statistical analysis of the bytes in the document, but the
       charset tagging is also taken into account (and can sometimes make a difference).

    The issue is which of these, on balance, produces better results for
    web pages and other documents. And with pretty exhaustive side-by-side
    comparisons of encodings, it is clear that #2 does, overwhelmingly.

    Of course, the less the contents of the document look like "real text", the
    more likely it is that #2 will produce incorrect results. But we have to go
    with the approach that produces the best results overall.
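
    For concreteness, here is a minimal sketch of what #2 looks like in code,
    using the open-source Python chardet library rather than the detector we
    actually use; the function name, confidence threshold, and sample text are
    illustrative only.

    import chardet

    def pick_encoding(raw_bytes, declared=None, confidence_floor=0.5):
        # Statistical guess from the bytes themselves (the primary signal).
        guess = chardet.detect(raw_bytes)
        if guess["encoding"] and guess["confidence"] >= confidence_floor:
            return guess["encoding"]
        # Weak or missing guess: fall back to the declared charset tag
        # (the secondary signal), then to a common default.
        return declared or guess["encoding"] or "windows-1252"

    # Bytes tagged ISO-8859-1 but actually encoded as ISO-8859-2 (Polish text).
    raw = "Zażółć gęślą jaźń".encode("iso-8859-2")
    print(pick_encoding(raw, declared="iso-8859-1"))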

    Mark

    On Mon, Feb 8, 2010 at 09:40, Andreas Prilop <prilop4321@trashmail.net> wrote:

    > On Fri, 29 Jan 2010, Mark Davis wrote upside-down:
    >
    > > It is encodings determined by a detection algorithm.
    >
    > This is so stupid!
    >
    > The results can be seen here:
    > http://groups.google.co.uk/group/pl.test/msg/1fa7fa753aad46a2
    >
    > Special characters are often messed up in groups.google
    > because your stupid algorithm takes ISO-8859-1 when the
    > message is actually ISO-8859-2 or ISO-8859-15 or whatever.
    >
    > http://groups.google.co.uk/group/pl.test/msg/359af83289a00e8e
    >
    > > The declarations for encodings (and language)
    > > are far too unreliable to be depended on.
    >
    > Unreliable is a guy who doesn't even know how to quote.
    >
    >


