Re: Detecting encoding in Plain text

From: Frank Yung-Fong Tang
Date: Wed Jan 14 2004 - 18:43:40 EST


    John Burger wrote on 1/14/2004, 7:16 AM:

    > Mark E. Shoulson wrote:
    > > If it's a heuristic we're after, then why split hairs and try to make
    > > all the rules ourselves? Get a big ol' mess of training data in as
    > > many languages as you can and hand it over to a class full of CS
    > > graduate students studying Machine Learning.
    > Absolutely my reaction. All of these suggested heuristics are great,
    > but would almost certainly simply fall out of a more rigorous approach
    > using a generative probabilistic model, or some other classification
    > technique. Useful features would include n-graph frequencies, as Mark
    > suggests, as well as lots of other things. For particular
    > applications, you could use a cache model, e.g., using statistics from
    > other documents from the same web site, or other messages from the same
    > email address, or even generalizing across country-of-origin.
    > Additionally, I'm pretty sure that you could get some mileage out of
    > unsupervised data, that is, all of the documents in the training set
    > needn't be labeled with language/encoding. And one thing we have a lot
    > of on the web is unsupervised data.
    > I would be extremely surprised if such an approach couldn't achieve 99%
    > accuracy - and I really do mean 99%, or better.
    > By the way, I still don't quite understand what's special about Thai.
    > Could someone elaborate?

    For languages other than Thai, Chinese, and Japanese, you will usually
    see spaces between words, so you should see a high count of SPACE in
    your document. In such languages, SPACE should probably occupy 10%-15%
    of the code points (just a guess: if the average word length is 9
    characters you will get about 10% SPACE, and if the average is shorter,
    the percentage of SPACE increases). But Thai, Chinese, and Japanese do
    not put spaces between words, so the percentage of SPACE code points
    will be quite different. For Korean, it is hard to say; it depends on
    whether the text uses IDEOGRAPHIC SPACE or the ordinary single-byte
    SPACE. Also, for Korean, it will depend on which normalization form is
    used. The percentage of SPACE will differ too, because in the
    precomposed form one Korean syllable counts as one Unicode code point,
    but in the decomposed form it may count as three.
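
    The space-frequency heuristic above, and the effect of Korean
    normalization on code-point counts, can be sketched in Python. This is
    an illustrative sketch, not the actual detector: the `space_ratio`
    helper and the sample strings are my own assumptions.

```python
import unicodedata

def space_ratio(text: str) -> float:
    """Fraction of code points that are U+0020 SPACE."""
    return text.count("\u0020") / len(text) if text else 0.0

# English-like text: with short words, SPACE lands well above 10%.
print(space_ratio("the quick brown fox jumps over the lazy dog"))

# Japanese sample (no spaces between words): the ratio is essentially zero.
print(space_ratio("これは日本語のテキストです"))

# Korean: one precomposed syllable (NFC) is a single code point,
# but decomposing it (NFD) yields up to three jamo.
syllable = "한"  # U+D55C HANGUL SYLLABLE HAN
print(len(unicodedata.normalize("NFC", syllable)))  # 1 code point
print(len(unicodedata.normalize("NFD", syllable)))  # 3 code points
```

    So a detector counting code points sees very different SPACE
    percentages for the same Korean text depending on the normalization
    form, which is why the heuristic needs to account for it.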

    Shanjian Lee and Kat Momoi implemented a charset detector based on my
    early work and direction. They summarised it in a paper presented on
    Sept 11, 2001; see for details. It talks about a different set of
    issues and problems.

    > - John Burger
    > MITRE

    This archive was generated by hypermail 2.1.5 : Wed Jan 14 2004 - 19:14:03 EST