Re: Detecting encoding in Plain text

From: Mark E. Shoulson (mark@kli.org)
Date: Wed Jan 14 2004 - 00:05:32 EST

Next message: Don Osborn: "Re: New MS Mac Office and Unicode?"

Previous message: Doug Ewell: "Re: Cuneiform - Dynamic vs. Static"
In reply to: Marco Cimarosti: "RE: Detecting encoding in Plain text"
Next in thread: John Burger: "Re: Detecting encoding in Plain text"
Reply: John Burger: "Re: Detecting encoding in Plain text"

On 01/13/04 05:40, Marco Cimarosti wrote:

>Peter Kirk wrote:
>
>
>>This one also looks dangerous.
>>
>>
>
>What do you mean by "dangerous"? This is a heuristic algorithm, so it is
>not supposed to work always but only in some lucky cases.
>
>If lucky cases average to, say, 20% or less then it is a bad and useless
>algorithm; if they average to, say, 80% or more, then it is good and
>useful. But you can't ask that it work in 100% of cases, or it
>wouldn't be heuristic anymore.
>
>
If it's a heuristic we're after, then why split hairs and try to make
all the rules ourselves? Get a big ol' mess of training data in as many
languages as you can and hand it over to a class full of CS graduate
students studying Machine Learning. Throw it at some neural networks,
go Bayesian with digraphs, whatever. Analyzing multigraph frequency
(say, strings of up to four characters) would probably do a pretty
decent job just by itself.
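
To make the multigraph idea concrete, here is a minimal Python sketch, not a tested detector. It decodes the bytes under each candidate encoding and keeps the one whose 1- to 4-character strings look most like some training text. The function names (train, score, guess_encoding), the add-one smoothing, and the toy French sample are all assumptions made up for illustration; a real model would be trained on a large corpus per language.

```python
import math
from collections import Counter

def ngrams(text, max_n=4):
    """Yield every substring of length 1..max_n (the 'multigraphs')."""
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def train(sample, max_n=4):
    """Count n-gram frequencies in a sample of known-good text."""
    counts = Counter(ngrams(sample, max_n))
    return counts, sum(counts.values())

def score(text, model):
    """Average log-probability of the text's n-grams, add-one smoothed."""
    counts, total = model
    grams = list(ngrams(text))
    if not grams:
        return float("-inf")
    vocab = len(counts) + 1  # one extra slot stands in for unseen n-grams
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams) / len(grams)

def guess_encoding(data, candidates, model):
    """Return the candidate encoding whose decoding scores highest, or None."""
    best = None
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding at all
        s = score(text, model)
        if best is None or s > best[1]:
            best = (enc, s)
    return best[0] if best else None

# Hypothetical toy training text; far too small for real use.
SAMPLE = "le garçon a été à l'école près de la forêt, où il a mangé une crêpe"
model = train(SAMPLE)
mystery = "le garçon a été à l'école".encode("latin-1")
print(guess_encoding(mystery, ["utf-8", "latin-1", "cp1251"], model))
```

The same scoring could be fed per-language models and the best (language, encoding) pair picked, which is roughly the "go Bayesian" route with the independence assumption left unexamined.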

~mark


This archive was generated by hypermail 2.1.5 : Wed Jan 14 2004 - 00:51:12 EST