From: Frank Yung-Fong Tang (email@example.com)
Date: Wed Jan 14 2004 - 18:43:40 EST
John Burger wrote on 1/14/2004, 7:16 AM:
> Mark E. Shoulson wrote:
> > If it's a heuristic we're after, then why split hairs and try to make
> > all the rules ourselves? Get a big ol' mess of training data in as
> > many languages as you can and hand it over to a class full of CS
> > graduate students studying Machine Learning.
> Absolutely my reaction. All of these suggested heuristics are great,
> but would almost certainly simply fall out of a more rigorous approach
> using a generative probabilistic model, or some other classification
> technique. Useful features would include n-graphs frequencies, as Mark
> suggests, as well as lots of other things. For particular
> applications, you could use a cache model, e.g., using statistics from
> other documents from the same web site, or other messages from the same
> email address, or even generalizing across country-of-origin.
> Additionally, I'm pretty sure that you could get some mileage out of
> unsupervised data, that is, all of the documents in the training set
> needn't be labeled with language/encoding. And one thing we have a lot
> of on the web is unsupervised data.
> I would be extremely surprised if such an approach couldn't achieve 99%
> accuracy - and I really do mean 99%, or better.
> By the way, I still don't quite understand what's special about Thai.
> Could someone elaborate?
For languages other than Thai, Chinese and Japanese, you will usually see
a space between words, so you should see a high count of SPACE in your
document. For text in those languages, SPACE should occupy probably
10%-15% of the code points. (This is just a guess: if the average word
length is 9 characters, you will get 10% SPACE; if the average is
shorter, the percentage of SPACE increases.) But in Thai, Chinese and
Japanese, spaces are not put between words, and therefore the percentage
of SPACE code points will be quite different. For Korean it is hard to
say: it depends on whether they are using IDEOGRAPHIC SPACE or the
single-byte SPACE. Also, for Korean, it will depend on which
normalization form is used. The percentage of spaces will differ
because in one normalization form one Korean character counts as one
Unicode code point, but in the decomposed form it may count as 3.
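
To make the counting concrete, here is a rough sketch in Python of the
SPACE percentage described above. It is only an illustration, not the
detector's actual code: the function names, the sample strings, and the
choice to count only U+0020 and U+3000 as spaces are all my assumptions.

```python
import unicodedata

# Assumption: only U+0020 SPACE and U+3000 IDEOGRAPHIC SPACE count as
# inter-word spaces here. A real detector would likely consider more.
SPACE_CHARS = {"\u0020", "\u3000"}


def space_ratio(text, form=None):
    """Return the fraction of code points in `text` that are spaces.

    If `form` is given (for example "NFC" or "NFD"), the text is
    normalized to that form before counting.
    """
    if form is not None:
        text = unicodedata.normalize(form, text)
    if not text:
        return 0.0
    spaces = sum(1 for ch in text if ch in SPACE_CHARS)
    return spaces / len(text)


def expected_space_ratio(avg_word_len):
    """Estimate the space ratio assuming one space follows each word."""
    return 1.0 / (avg_word_len + 1)


# Hypothetical sample strings, chosen only for illustration.
english = "the quick brown fox jumps over the lazy dog"
korean = "한국어 문장 입니다"

print("English:", space_ratio(english))
print("Korean, NFC:", space_ratio(korean, "NFC"))
print("Korean, NFD:", space_ratio(korean, "NFD"))
print("Estimate for 9-character words:", expected_space_ratio(9))
```

The same Korean string gives a lower SPACE ratio in NFD than in NFC,
because each precomposed syllable decomposes into two or three jamo code
points while the number of spaces stays the same. So a threshold tuned
on one normalization form should not be assumed to carry over to the
other.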
Shanjian Lee and Kat Momoi implemented a charset detector based on my
early work and direction. They summarized it in a paper and presented it
on Sept 11, 2001. See that paper for
details. It talks about a different issue and problem.
> - John Burger