Re: Autodetection of CP437 vs. Latin-1

From: Frank Ellermann ([email protected])
Date: Sat Feb 10 2007 - 08:28:23 CST

Next message: Richard Wordingham: "Re: Query for Validity of Thai Sequence"

Previous message: Doug Ewell: "Autodetection of CP437 vs. Latin-1"
In reply to: Doug Ewell: "Autodetection of CP437 vs. Latin-1"
Next in thread: Mike: "Re: Autodetection of CP437 vs. Latin-1"
Reply: Mike: "Re: Autodetection of CP437 vs. Latin-1"
Reply: Asmus Freytag: "Re: Autodetection of CP437 vs. Latin-1"

Doug Ewell wrote:

> I'm looking for tips on automatically detecting text data in MS-DOS
> CP437 (or 850, etc.) versus Latin-1 or Windows CP1252. It doesn't
> have to be a perfect solution, but pretty good.

Tricky. For starters, 437 and "850" (= 858, with € instead of the
dotless i, on PC DOS and OS/2, probably also on MS DOS) are quite
different. It's easy if you know that the text is, say, "de": you'd
look for äöüßÄÖÜ.
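
A minimal sketch of that letter check, assuming Python's built-in codecs
(cp437, cp850, latin-1, cp1252). The names german_score and
guess_for_german are hypothetical, not from this thread.

```python
# Hypothetical helpers: count how many characters decode to German
# letters under each candidate code page.
GERMAN = set("äöüßÄÖÜ")
CANDIDATES = ("cp437", "cp850", "latin-1", "cp1252")

def german_score(data: bytes, encoding: str) -> int:
    # errors="replace" keeps the few undefined cp1252 bytes from raising
    text = data.decode(encoding, errors="replace")
    return sum(1 for ch in text if ch in GERMAN)

def guess_for_german(data: bytes) -> dict:
    # Map each candidate encoding to its count of German letters
    return {enc: german_score(data, enc) for enc in CANDIDATES}

sample = "Größe, Übung, Äpfel".encode("cp850")
scores = guess_for_german(sample)
```

These seven letters sit at the same byte values in 437 and 850, so a
score like this can separate DOS data from Latin-1 or CP1252, but it
cannot tell 437 from 850.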

If the data has no bytes in the range 0x80 up to 0x9F, your chances are
good that it's some kind of Latin (ISO 8859), where that range holds
only C1 control codes.
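
As a sketch, that range test is a one-liner in Python. The function name
has_c1_range_bytes is hypothetical.

```python
def has_c1_range_bytes(data: bytes) -> bool:
    """True if any byte falls in 0x80..0x9F (C1 controls in ISO 8859)."""
    return any(0x80 <= b <= 0x9F for b in data)

dos_sample = "Größe".encode("cp850")       # ö is 0x94 in the DOS code pages
latin_sample = "Größe".encode("latin-1")   # ö is 0xF6, ß is 0xDF
```

A True result does not rule out CP1252, which puts printable characters
such as curly quotes in that range.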

> One problem is detecting text with the MS-DOS box-drawing characters

And 850 (or "850") doesn't have all of them; it has only the complete
single and double sets. 437 also has all the mixed single/double
combinations.
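
A sketch of how that difference could be used, assuming Python's cp437
and cp850 codecs and taking "box-drawing" to mean the Unicode block
U+2500..U+257F. All names here are hypothetical.

```python
def box_drawing_bytes(encoding: str) -> set:
    """Byte values in 0x80..0xFF that decode to U+2500..U+257F."""
    result = set()
    for b in range(0x80, 0x100):
        ch = bytes([b]).decode(encoding, errors="replace")
        if "\u2500" <= ch <= "\u257f":
            result.add(b)
    return result

BOX_437 = box_drawing_bytes("cp437")
BOX_850 = box_drawing_bytes("cp850")
# Bytes that draw boxes in 437 but are something else in 850
MIXED_ONLY_437 = BOX_437 - BOX_850

def count_mixed_box_bytes(data: bytes) -> int:
    # Count bytes that are box-drawing characters only under 437
    return sum(1 for b in data if b in MIXED_ONLY_437)

frame = "╒═╕".encode("cp437")
```

In 850 those byte values are mostly accented letters, so a high count
is a hint toward 437 rather than proof.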

> Please don't tell me this is anachronistic; I know it is.

"850" _is_ my default charset, anachronistic or not... :-) You find
various obscure tools on my pages to deal with that, hm, "situation".

> I'm trying to migrate a lot of that anachronistic data to UTF-8, as
> automatically as possible.

Maybe you could try some plausible languages, use each as the best guess
for the discrimination, and finally check whether the UTF-8 result is
still plausible for the tested language. You'd need dictionaries.
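
A rough sketch of the dictionary idea, assuming Python's built-in
codecs. The word list is a tiny stand-in and every name here (WORDS_DE,
dictionary_hits, best_guess, to_utf8) is hypothetical; real data would
need a full dictionary per language.

```python
import re

# Tiny illustrative word list, not a real dictionary
WORDS_DE = {"größe", "übung", "straße", "für", "über"}
CANDIDATES = ("cp437", "cp850", "latin-1", "cp1252")

def dictionary_hits(text: str, words: set) -> int:
    # Split into runs of letters and count those found in the word list
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for t in tokens if t in words)

def best_guess(data: bytes, words: set) -> str:
    scored = []
    for enc in CANDIDATES:
        text = data.decode(enc, errors="replace")
        scored.append((dictionary_hits(text, words), enc))
    # On a tie, max() keeps the earliest candidate in CANDIDATES
    best_hits, best_enc = max(scored, key=lambda pair: pair[0])
    return best_enc

def to_utf8(data: bytes, words: set) -> bytes:
    # Decode with the best-scoring candidate, then re-encode as UTF-8
    text = data.decode(best_guess(data, words), errors="replace")
    return text.encode("utf-8")

latin_sample = "Größe für Übung".encode("latin-1")
dos_sample = "Größe für Übung".encode("cp850")
```

Ties are common (437 and 850 agree on many letters, as do Latin-1 and
CP1252 above 0x9F), so a tie-break such as a box-drawing count would
still be needed.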

Frank

This archive was generated by hypermail 2.1.5 : Sat Feb 10 2007 - 08:35:16 CST