Re: Detecting encoding in Plain text

From: Mark E. Shoulson (mark@kli.org)
Date: Wed Jan 14 2004 - 00:05:32 EST


    On 01/13/04 05:40, Marco Cimarosti wrote:

    >Peter Kirk wrote:
    >
    >
    >>This one also looks dangerous.
    >>
    >>
    >
    >What do you mean by "dangerous"? This is a heuristic algorithm, so it is
    >not supposed to work always, but only in some lucky cases.
    >
    >If lucky cases average to, say, 20% or less, then it is a bad and useless
    >algorithm; if they average to, say, 80% or more, then it is good and
    >useful. But you can't ask that it work in 100% of cases, or it
    >wouldn't be heuristic anymore.
    >
    >
    If it's a heuristic we're after, then why split hairs and try to make
    all the rules ourselves? Get a big ol' mess of training data in as many
    languages as you can and hand it over to a class full of CS graduate
    students studying Machine Learning. Throw it at some neural networks,
    go Bayesian with digraphs, whatever. Analyzing multigraph frequency
    (say, strings of up to four characters) would probably do a pretty
    decent job just by itself.
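
A rough sketch of the multigraph idea, nothing more: count byte n-grams of length 1 to 4 in sample text for each candidate encoding, then score unknown bytes against each table with add-one smoothing and pick the best. The function names (ngrams, train, score, guess), the single Russian pangram used as training data, and the three candidate encodings are all assumptions made for illustration; a real detector would need far more training text in many languages.

```python
import math
from collections import Counter

def ngrams(data, max_n=4):
    """Yield every byte n-gram of length 1..max_n found in data."""
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            yield data[i:i + n]

def train(samples, max_n=4):
    """Build one n-gram frequency table per label from {label: bytes}."""
    return {label: Counter(ngrams(data, max_n)) for label, data in samples.items()}

def score(model, data, vocab_size, max_n=4):
    """Sum of log-probabilities of the n-grams in data, add-one smoothed."""
    total = sum(model.values())
    return sum(math.log((model[g] + 1) / (total + vocab_size))
               for g in ngrams(data, max_n))

def guess(models, data, max_n=4):
    """Return the label whose frequency table explains data best."""
    vocab = set().union(*models.values())   # every n-gram seen in training
    vocab_size = len(vocab) + 1             # one extra slot for unseen n-grams
    return max(models, key=lambda label: score(models[label], data, vocab_size, max_n))

# Toy training set (assumption): one pangram, encoded three different ways.
PANGRAM = "съешь же ещё этих мягких французских булок, да выпей чаю"
ENCODINGS = ["utf-8", "koi8_r", "cp1251"]
models = train({enc: PANGRAM.encode(enc) for enc in ENCODINGS})

print(guess(models, "как дела".encode("koi8_r")))
```

Even with one sentence of training data the byte ranges of these three encodings barely overlap, so the toy case is easy. Telling apart encodings that share most byte values (the various Latin code pages, say) is where the amount and variety of training text would start to matter.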

    ~mark


