Re: charset parameter in Google Groups

From: John Burger (john@mitre.org)
Date: Fri Jul 02 2010 - 09:46:46 CDT

    Asmus distinguishes two cases: the first is guessing the charset
    incorrectly in a way that completely degrades the text, e.g. 8859-1
    vs. 8859-2. The second is a more subtle kind of mistake, and arguably
    much less objectionable, e.g. 8859-1 vs. 1252, or the "smart quotes"
    problem.

    I like this distinction, and would point out that we could probably
    quantify it as a continuum, in the sense that most of the code points
    in 8859-1 and 1252 are equivalent, while far fewer are in 8859-1 and
    8859-2. (If we wished, we could refine this further by assigning a
    different penalty for showing the wrong glyph for an alphabetic
    character than for punctuation.)

    > If I were to design a charset-verifier, I would distinguish between
    > these two cases. If something came tagged with a region-specific
    > charset, I would honor that, unless I found strong evidence of the
    > "this can't be right" nature. In some cases, to collect such evidence
    > would require significant statistics. The rule here should be "do no
    > harm", that is, destroying a document by incorrectly changing a true
    > charset should receive a much higher penalty than failing to detect a
    > broken charset. (That way, you don't penalize people who live by the
    > rules :).

    I have always thought that the "right way" to deal with determining
    the correct charset of a document is to treat it as a statistical
    classification problem. Given a collection of documents as training
    data, we could extract features including the following:

    - "suggested" charset, document type, and other information from
    metadata,
       such as HTTP Content-Type, HTML <META> tags, email headers, etc.
    - various statistical signatures from the text itself, e.g. ngrams
    - top-level domain of the originating web site
    - anything else we can think of
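    To make this concrete, here is one possible feature extractor in
    Python. This is just my own sketch: the feature names, the
    byte-bigram signature, and the helper's interface are illustrative
    choices, not a description of any existing system.

    from collections import Counter

    def extract_features(raw_bytes, declared_charset=None, tld=None):
        """Turn one document into a feature dict for a charset classifier."""
        features = {}
        # Metadata hints: declared charset (HTTP/META/email header) and TLD.
        if declared_charset:
            features["declared=" + declared_charset.lower()] = 1.0
        if tld:
            features["tld=" + tld.lower()] = 1.0
        # Statistical signature of the raw text: byte-bigram frequencies.
        bigrams = Counter(zip(raw_bytes, raw_bytes[1:]))
        total = sum(bigrams.values()) or 1
        for (a, b), count in bigrams.most_common(500):
            features["bigram=%02x%02x" % (a, b)] = count / total
        return features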

    We can then apply one of many possible multi-class algorithms
    developed by the machine learning community to this training set.
    Such an algorithm would learn how to weight the different features so
    as to correctly tag as many documents as possible. (For some of these
    algorithms
    we would have to tag each document in the training set with the "real"
    charset, but there are also semi-supervised and unsupervised
    algorithms that would discover the most consistent assignment, if we
    were unable or unwilling to correctly tag everything in our dataset.)
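    As a sketch of the supervised version, assuming scikit-learn, a
    labeled corpus (raw document bytes paired with their true charsets),
    and the extract_features() helper from the previous sketch:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_charset_classifier(feature_dicts, true_charsets):
        """feature_dicts: list of feature dicts; true_charsets: labels."""
        model = make_pipeline(DictVectorizer(),
                              LogisticRegression(max_iter=1000))
        model.fit(feature_dicts, true_charsets)
        return model

    # Hypothetical usage, given (raw_bytes, declared, tld, label) tuples:
    # model = train_charset_classifier(
    #     [extract_features(b, cs, tld) for b, cs, tld, _ in corpus],
    #     [label for _, _, _, label in corpus])
    # guess = model.predict([extract_features(raw, declared, tld)])[0]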

    I have always assumed that Google, or someone, must already be doing
    this sort of thing (although perhaps not on Google Groups!).

    Asmus' comments made me realize that the machine learning approach I
    outline above can be taken even further: there are many classification
    algorithms that can be trained with different penalties for different
    kinds of mistakes. These penalties could be determined by hand, or
    could come from quantifying the potential degradation as I describe
    above. This provides a natural and principled way to require far more
    evidence for overriding 8859-1 with 8859-2 than with 1252, for example.
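    One illustrative way to realize this, building on the earlier
    sketches (again my own construction, not a description of any
    deployed system): derive a cost matrix from charset_distance() and
    output the charset that minimizes expected cost, rather than simply
    the most probable one.

    import numpy as np

    def min_expected_cost_charset(model, features):
        """Pick the charset whose expected misclassification cost is lowest."""
        charsets = list(model.classes_)  # label order matches predict_proba
        # cost[i][j] = penalty for answering charsets[j] when the truth
        # is charsets[i]; the diagonal is naturally zero.
        cost = np.array([[charset_distance(true, pred) for pred in charsets]
                         for true in charsets])
        probs = model.predict_proba([features])[0]
        expected = probs @ cost      # expected cost of each possible answer
        return charsets[int(np.argmin(expected))]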

    - John D. Burger
       MITRE


