Re: Detecting encoding in Plain text

From: John Burger (john@mitre.org)
Date: Wed Jan 14 2004 - 10:16:41 EST

Next message: Peter Kirk: "Re: New MS Mac Office and Unicode?"

Previous message: Peter Kirk: "Re: German characters not correct in output webform"
In reply to: Mark E. Shoulson: "Re: Detecting encoding in Plain text"
Next in thread: Peter Kirk: "Re: Detecting encoding in Plain text"
Reply: Peter Kirk: "Re: Detecting encoding in Plain text"
Reply: Doug Ewell: "Re: Detecting encoding in Plain text"
Reply: Frank Yung-Fong Tang: "Re: Detecting encoding in Plain text"

Mark E. Shoulson wrote:

> If it's a heuristic we're after, then why split hairs and try to make
> all the rules ourselves? Get a big ol' mess of training data in as
> many languages as you can and hand it over to a class full of CS
> graduate students studying Machine Learning.

Absolutely my reaction. All of these suggested heuristics are great,
but would almost certainly simply fall out of a more rigorous approach
using a generative probabilistic model, or some other classification
technique. Useful features would include n-graph frequencies, as Mark
suggests, as well as lots of other things. For particular
applications, you could use a cache model, e.g., using statistics from
other documents from the same web site, or other messages from the same
email address, or even generalizing across country-of-origin.
Additionally, I'm pretty sure that you could get some mileage out of
unlabeled data, that is, not all of the documents in the training set
need be labeled with language/encoding. And one thing we have a lot
of on the web is unlabeled data.
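
To make the idea concrete, here is a minimal sketch of one such
generative model: a naive Bayes classifier over byte bigrams with
add-one smoothing. It is an illustration only, not a tested detector.
The class name ByteNgramClassifier, the two toy training strings, and
the log_prior argument (standing in for a "cache model" prior drawn
from, say, other documents on the same site) are all assumptions made
for this example; a real system would train on far more data and on
many more language/encoding pairs.

```python
import math
from collections import Counter, defaultdict


class ByteNgramClassifier:
    """Naive Bayes over byte n-grams with add-one smoothing (a sketch)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> number of n-grams seen
        self.vocab = set()                  # all n-grams seen in training

    def _ngrams(self, data):
        # Slide a window of n bytes over the raw, undecoded input.
        return [data[i:i + self.n] for i in range(len(data) - self.n + 1)]

    def train(self, data, label):
        grams = self._ngrams(data)
        self.counts[label].update(grams)
        self.totals[label] += len(grams)
        self.vocab.update(grams)

    def log_scores(self, data, log_prior=None):
        # log P(label) + sum of log P(n-gram | label) for each label.
        # log_prior is optional; a missing label gets a prior term of 0.
        prior = log_prior or {}
        v = len(self.vocab) + 1  # one extra slot for unseen n-grams
        grams = self._ngrams(data)
        scores = {}
        for label in self.counts:
            s = prior.get(label, 0.0)
            for g in grams:
                s += math.log((self.counts[label][g] + 1)
                              / (self.totals[label] + v))
            scores[label] = s
        return scores

    def classify(self, data, log_prior=None):
        scores = self.log_scores(data, log_prior)
        return max(scores, key=scores.get)


# Toy training set: the same French sentence in two encodings.
text = "Les élèves français étudient à l'école près du château"
clf = ByteNgramClassifier(n=2)
for encoding in ("utf-8", "iso-8859-1"):
    clf.train(text.encode(encoding), encoding)

sample = "été à la forêt"
guess_utf8 = clf.classify(sample.encode("utf-8"))
guess_latin1 = clf.classify(sample.encode("iso-8859-1"))

# Pure ASCII input is ambiguous; a site-level prior can break the tie.
site_prior = {"utf-8": math.log(0.9), "iso-8859-1": math.log(0.1)}
guess_ascii = clf.classify(b"du", log_prior=site_prior)
```

With this toy data the two encodings of the sample are told apart by
the byte bigrams around the accented letters, and the prior decides
the pure-ASCII case. Labels here are encodings only; using
(language, encoding) pairs as labels would work the same way.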

I would be extremely surprised if such an approach couldn't achieve 99%
accuracy - and I really do mean 99%, or better.

By the way, I still don't quite understand what's special about Thai.
Could someone elaborate?

- John Burger
MITRE

This archive was generated by hypermail 2.1.5 : Wed Jan 14 2004 - 10:57:25 EST