From: Frank Ellermann (nobody@xyzzy.claranet.de)
Date: Sat Feb 10 2007 - 08:28:23 CST
Doug Ewell wrote:
> I'm looking for tips on automatically detecting text data in MS-DOS
> CP437 (or 850, etc.) versus Latin-1 or Windows CP1252. It doesn't
> have to be a perfect solution, but pretty good.
Tricky. For starters, 437 and "850" (= 858, with € instead of the
dotless i, on PC DOS and OS/2, probably also on MS DOS) are quite
different. It's easy if you know that the text is, say, "de": you'd
look for äöüßÄÖÜ. If you have something without any bytes from 0x80
up to 0x9F, your chances are good that it's some kind of Latin: those
are C1 controls in Latin-1, while 437 and 850 put common letters such
as ü (0x81) and ä (0x84) there.
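A rough sketch of those two checks in Python, assuming the data is
available as a bytes object. The function names are made up for this
illustration; the codec names ("cp850", "latin-1") are the ones Python
ships with.

```python
GERMAN = "äöüßÄÖÜ"


def has_high_control_bytes(data: bytes) -> bool:
    """True if any byte lies in 0x80..0x9F (C1 controls in Latin-1,
    ordinary letters in CP437 and CP850)."""
    return any(0x80 <= b <= 0x9F for b in data)


def count_german(data: bytes, encoding: str) -> int:
    """Count bytes that would be one of äöüßÄÖÜ under `encoding`."""
    wanted = set(GERMAN.encode(encoding))
    return sum(1 for b in data if b in wanted)


if __name__ == "__main__":
    sample = "Größe".encode("cp850")
    print(has_high_control_bytes(sample))
    print(count_german(sample, "cp850"), count_german(sample, "latin-1"))
```

Note that the German letters sit at the same positions in 437 and 850,
so this only separates "DOS" from "Latin", not 437 from 850.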
> One problem is detecting text with the MS-DOS box-drawing characters
And 850 (or "850") doesn't have all of them; it has only the complete
single and double sets. 437 also has all the single/double combinations.
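That difference can be turned into a hint. The sketch below derives
the byte sets from Python's own cp437 and cp850 codecs instead of
hard-coding them; the helper names and the "next to another box
character" rule are assumptions of mine, not a tested detector.

```python
def box_bytes(encoding: str) -> set:
    """Bytes 0x80..0xFF that decode to U+2500..U+257F (Box Drawing)."""
    result = set()
    for b in range(0x80, 0x100):
        ch = bytes([b]).decode(encoding)
        if 0x2500 <= ord(ch) <= 0x257F:
            result.add(b)
    return result


BOX_437 = box_bytes("cp437")
BOX_850 = box_bytes("cp850")
# Box drawing in 437, but letters or symbols in 850.
MIXED_ONLY_437 = BOX_437 - BOX_850


def looks_like_437_boxes(data: bytes) -> bool:
    """True if a 437-only box byte sits next to another box byte."""
    for i, b in enumerate(data):
        if b in MIXED_ONLY_437:
            neighbours = data[max(i - 1, 0):i] + data[i + 1:i + 2]
            if any(n in BOX_437 for n in neighbours):
                return True
    return False
```

A lone byte from MIXED_ONLY_437 inside a word is more likely an
accented letter in 850, which is why the sketch asks for a neighbour.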
> Please don't tell me this is anachronistic; I know it is.
"850" _is_ my default charset, anachronistic or not... :-) You find
various obscure tools on my pages to deal with that, hm, "situation".
> I'm trying to migrate a lot of that anachronistic data to UTF-8, as
> automatically as possible.
Maybe you could try some plausible languages, use that as the best
guess for the discrimination, and finally check whether the UTF-8
result is still plausible for the tested language. You'd need
dictionaries.
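As a minimal sketch of that idea in Python: decode with each candidate
charset, count how many words turn up in a word list for the assumed
language, and keep the best. The candidate list, the function names
and the scoring are illustrative assumptions; a real word list would
have to come from a proper dictionary.

```python
import re

CANDIDATES = ("cp437", "cp850", "latin-1", "cp1252")


def score(text: str, words: set) -> int:
    """Number of letter-only tokens found in the word list."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for t in tokens if t in words)


def guess_and_convert(data: bytes, words: set, candidates=CANDIDATES):
    """Return (best_encoding, utf8_bytes); earlier candidates win ties."""
    best = None
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # e.g. cp1252 has a few undefined bytes
        s = score(text, words)
        if best is None or s > best[0]:
            best = (s, enc, text)
    if best is None:
        raise ValueError("no candidate encoding could decode the data")
    return best[1], best[2].encode("utf-8")
```

Ties are common (437 and 850 agree on the German letters, Latin-1 and
1252 agree above 0x9F), so the order of CANDIDATES matters.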
Frank