RE: Detecting encoding in Plain text

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Jan 13 2004 - 05:40:47 EST

Next message: Marco Cimarosti: "RE: Detecting encoding in Plain text"

Previous message: Bert Kemner: "German characters not correct in output webform"
Maybe in reply to: Brijesh Sharma: "Detecting encoding in Plain text"
Next in thread: Peter Kirk: "Re: Detecting encoding in Plain text"
Reply: Peter Kirk: "Re: Detecting encoding in Plain text"
Reply: Mark E. Shoulson: "Re: Detecting encoding in Plain text"

Peter Kirk wrote:
> This one also looks dangerous.

What do you mean by "dangerous"? This is a heuristic algorithm, so it is
not supposed to work always, but only in some lucky cases.

If the lucky cases average to, say, 20% or less, then it is a bad and useless
algorithm; if they average to, say, 80% or more, then it is good and
useful. But you can't ask that it work in 100% of cases, or it
wouldn't be a heuristic anymore.

> Some scripts include their own
> digits and punctuation; not all scripts use spaces; and controls are not
> necessarily used, if U+2028 LINE SEPARATOR is used for new lines.

Yes, but *all* these circumstances must occur together in order for the
algorithm to be totally useless for *that* language.

If a certain Unicode plain text file uses ASCII punctuation OR spaces OR
end-of-line characters, AND the file is neither too short nor very oddly
formatted, then the algorithm should work.
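
[Editor's sketch] The algorithm itself is not quoted in this message, so the
following is only an assumption about its general shape: guess the byte order
of UTF-16 text by counting code units that fall in the ASCII range (spaces,
punctuation, line ends) under each byte order. The function name
guess_utf16_byte_order and the min_votes threshold are hypothetical, chosen
here for illustration.

```python
from typing import Optional


def guess_utf16_byte_order(data: bytes, min_votes: int = 4) -> Optional[str]:
    """Return 'be', 'le', or None when the evidence is too thin to decide.

    Assumed heuristic (not taken from the original post): a 16-bit unit
    whose high byte is zero and whose low byte is a non-NUL ASCII value
    counts as one vote for the byte order that puts the zero byte there.
    """
    be_votes = 0
    le_votes = 0
    # Walk the buffer two bytes at a time; a trailing odd byte is ignored.
    for i in range(0, len(data) - 1, 2):
        first, second = data[i], data[i + 1]
        if first == 0 and 0 < second < 0x80:
            be_votes += 1  # looks like U+00xx stored big-endian
        elif second == 0 and 0 < first < 0x80:
            le_votes += 1  # looks like U+00xx stored little-endian
    # Very short files, or files with no ASCII-range characters, stay undecided.
    if be_votes + le_votes < min_votes:
        return None
    if be_votes > le_votes:
        return "be"
    if le_votes > be_votes:
        return "le"
    return None
```

With this sketch, any one of ASCII punctuation, spaces, or line ends is enough
to produce votes, and a text that has none of them (or is shorter than
min_votes characters) returns None instead of a wrong guess.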

> But there may be some characters U+??00 which are used rather
> commonly in a particular script and so occur commonly in
> some text files.

And those text files will not be detected correctly, particularly if they
are very short: that's part of the game.
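
[Editor's sketch] To see why U+??00 characters can mislead a heuristic of this
kind (assuming, as this message suggests but does not spell out, that it looks
for ASCII-range code units), take U+4E00, a very common CJK ideograph. Stored
big-endian, its bytes are indistinguishable from the ASCII letter "N" stored
little-endian:

```python
# U+4E00 has a zero low byte, so its big-endian bytes are 4E 00.
text = "\u4e00" * 8
raw = text.encode("utf-16-be")

# Reading the same bytes with the wrong byte order yields U+004E, i.e. "N".
misread = raw.decode("utf-16-le")
print(misread)  # NNNNNNNN
```

A short file made mostly of such characters would therefore look like ASCII
text in the opposite byte order.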

_ Marco

This archive was generated by hypermail 2.1.5 : Tue Jan 13 2004 - 06:22:09 EST