From: John Burger (email@example.com)
Date: Fri Jul 02 2010 - 09:46:46 CDT
Asmus distinguishes between two kinds of cases: The first is guessing
the charset incorrectly in a way that completely degrades the text,
e.g. 8859-1 vs. 8859-2. The second is a more subtle kind of mistake,
and arguably much less objectionable, e.g., 8859-1 vs. 1252, or the "smart
quotes" case.
I like this distinction, and would point out that we can probably
quantify this into a continuum, in the sense that most of the code
points in 8859-1 and 1252 are equivalent, while fewer are so in 8859-1
and 8859-2. (If we wished, we could refine this further by assigning
different penalties for showing the wrong glyph for an alphabetic
character than for punctuation.)
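To make that concrete, here is a rough sketch (mine, not from Asmus' note) of how one might score charset "closeness" as the fraction of byte values that two single-byte charsets decode to the same character; bytes undefined in either charset are skipped:

```python
# Rough sketch: fraction of byte values two single-byte charsets decode
# identically. Higher means mis-tagging one as the other does less damage.
def agreement(cs_a, cs_b):
    same = total = 0
    for b in range(256):
        raw = bytes([b])
        try:
            a = raw.decode(cs_a)
            c = raw.decode(cs_b)
        except UnicodeDecodeError:
            continue  # byte undefined in one of the charsets; skip it
        total += 1
        same += (a == c)
    return same / total
```

On this measure 8859-1 vs. 1252 scores much higher than 8859-1 vs. 8859-2, matching the intuition that the former confusion is nearly harmless. (One could refine it by weighting alphabetic positions more heavily than punctuation, as suggested above.)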
> If I were to design a charset-verifier, I would distinguish between
> these two cases. If something came tagged with a region-specific
> charset, I would honor that, unless I found strong evidence of the
> "can't be right" nature. In some cases, to collect such evidence would
> require significant statistics. The rule here should be "do no harm",
> that is, destroying a document by incorrectly changing a true charset
> should receive a much higher penalty than failing to detect a broken
> charset. (That way, you don't penalize people who live by the
> rules :).
I have always thought that the "right way" to deal with determining
the correct charset of a document is to treat it as a statistical
classification problem. Given a collection of documents as training
data, we could extract features including the following:
- the "suggested" charset, document type, and other metadata from sources
such as HTTP Content-Type headers, HTML <META> tags, email headers, etc.
- various statistical signatures from the text itself, e.g., byte or
character n-grams
- top-level domain of the originating web site
- anything else we can think of
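A toy illustration of what such a feature vector might look like (the names and shape here are mine, purely illustrative):

```python
# Hypothetical feature extraction for one document. Nothing here is a
# real system's API; it just makes the feature list above concrete.
from collections import Counter

def features(raw_bytes, declared_charset, tld):
    # byte-bigram signature of the text itself
    ngrams = Counter(raw_bytes[i:i + 2] for i in range(len(raw_bytes) - 1))
    return {
        "declared": declared_charset,  # from HTTP/META/email headers
        "tld": tld,                    # e.g. "hu" is weak evidence for 8859-2
        "ngrams": ngrams,
        "high_byte_ratio": sum(b > 0x7F for b in raw_bytes)
                           / max(len(raw_bytes), 1),
    }
```

A learner would then weight these against each other, rather than trusting any single one.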
We can then apply one of many possible multi-class algorithms
developed by the machine learning community to this training set.
Such an algorithm would learn how to weight the different features so
as to tag the most documents correctly. (For some of these algorithms
we would have to tag each document in the training set with the "real"
charset, but there are also semi-supervised and unsupervised
algorithms that would discover the most consistent assignment, if we
were unable or unwilling to correctly tag everything in our dataset.)
I have always assumed that Google, or someone, must already be doing
this sort of thing (although perhaps not on Google Groups!).
Asmus' comments made me realize that the machine learning approach I
outline above can be taken even further: there are many classification
algorithms that can be trained with different penalties for different
kinds of mistakes. These penalties could be determined by hand, or
could come from quantifying the potential degradation as I describe
above. This provides a natural and principled way to require far more
evidence for overriding 8859-1 with 8859-2 than with 1252, for example.
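As a sketch of that decision rule (with made-up cost numbers, chosen only to illustrate the asymmetry): given a classifier's posterior over charsets, pick the label that minimizes expected cost rather than the most probable one.

```python
# Cost-sensitive decision sketch. COST[true][predicted] encodes that
# confusing 8859-1 with 1252 is nearly harmless, while confusing either
# with 8859-2 badly degrades the text. The numbers are illustrative only.
COST = {
    "8859-1": {"8859-1": 0.0, "1252": 0.1, "8859-2": 1.0},
    "1252":   {"8859-1": 0.1, "1252": 0.0, "8859-2": 1.0},
    "8859-2": {"8859-1": 1.0, "1252": 1.0, "8859-2": 0.0},
}

def decide(posterior):
    """posterior: dict mapping charset -> P(charset | document)."""
    def expected_cost(pred):
        return sum(p * COST[true][pred] for true, p in posterior.items())
    return min(posterior, key=expected_cost)
```

With such a matrix, overriding a declared 8859-1 with 8859-2 requires much stronger evidence than overriding it with 1252, which is exactly the "do no harm" behavior Asmus asks for.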
- John D. Burger
This archive was generated by hypermail 2.1.5 : Fri Jul 02 2010 - 09:51:04 CDT