Re: Detecting encoding in Plain text

From: John Burger
Date: Wed Jan 14 2004 - 10:16:41 EST

    Mark E. Shoulson wrote:

    > If it's a heuristic we're after, then why split hairs and try to make
    > all the rules ourselves? Get a big ol' mess of training data in as
    > many languages as you can and hand it over to a class full of CS
    > graduate students studying Machine Learning.

    Absolutely my reaction. All of these suggested heuristics are great,
    but would almost certainly simply fall out of a more rigorous approach
    using a generative probabilistic model, or some other classification
    technique. Useful features would include n-graph frequencies, as Mark
    suggests, as well as lots of other things. For particular
    applications, you could use a cache model, e.g., using statistics from
    other documents from the same web site, or other messages from the same
    email address, or even generalizing across country-of-origin.
    Additionally, I'm pretty sure that you could get some mileage out of
    unlabeled data, that is, not all of the documents in the training set
    need be labeled with language/encoding. And one thing we have a lot
    of on the web is unlabeled data.
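
    To make that concrete, here is a minimal sketch of one such
    generative model: naive Bayes over byte bigrams, with one class per
    language/encoding label. The class name, the label strings, the
    smoothing constant and the toy training text are all made up for
    illustration; a real system would train on far more data and would
    probably use richer features than this.

```python
import math
from collections import Counter, defaultdict


def byte_ngrams(data: bytes, n: int = 2):
    """Yield every overlapping n-byte window of the raw input."""
    return (data[i:i + n] for i in range(len(data) - n + 1))


class NgramEncodingGuesser:
    """Naive Bayes over byte n-grams (hypothetical name, illustration only)."""

    def __init__(self, n: int = 2, alpha: float = 0.5):
        self.n = n
        self.alpha = alpha                  # additive smoothing (assumed value)
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training docs

    def train(self, data: bytes, label: str) -> None:
        """Add one labeled document to the per-label n-gram statistics."""
        self.counts[label].update(byte_ngrams(data, self.n))
        self.docs[label] += 1

    def log_scores(self, data: bytes) -> dict:
        """Return log P(label) + sum of log P(n-gram | label) for each label."""
        vocab = 256 ** self.n               # number of possible byte n-grams
        total_docs = sum(self.docs.values())
        grams = Counter(byte_ngrams(data, self.n))
        scores = {}
        for label, counts in self.counts.items():
            total = sum(counts.values())
            score = math.log(self.docs[label] / total_docs)
            for gram, k in grams.items():
                p = (counts[gram] + self.alpha) / (total + self.alpha * vocab)
                score += k * math.log(p)
            scores[label] = score
        return scores

    def guess(self, data: bytes) -> str:
        """Return the label with the highest log score."""
        scores = self.log_scores(data)
        return max(scores, key=scores.get)


# Toy training set: the same German text in two encodings.
sample = "Grüße aus München, schöne Straße"
guesser = NgramEncodingGuesser()
guesser.train(sample.encode("utf-8"), "de/UTF-8")
guesser.train(sample.encode("latin-1"), "de/ISO-8859-1")
```

    Calling guesser.guess() on the raw bytes of a new message returns
    the best-scoring label. The cache model would amount to replacing
    the document-count prior with one estimated from the same site or
    sender.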

    I would be extremely surprised if such an approach couldn't achieve 99%
    accuracy - and I really do mean 99%, or better.

    By the way, I still don't quite understand what's special about Thai.
    Could someone elaborate?

    - John Burger
