Re: Autodetection of CP437 vs. Latin-1

From: Frank Ellermann (nobody@xyzzy.claranet.de)
Date: Sat Feb 10 2007 - 08:28:23 CST

    Doug Ewell wrote:

    > I'm looking for tips on automatically detecting text data in MS-DOS
    > CP437 (or 850, etc.) versus Latin-1 or Windows CP1252. It doesn't
    > have to be a perfect solution, but pretty good.

    Tricky. For starters, 437 and "850" (= 858, with € instead of the
    dotless i, on PC DOS and OS/2, probably also on MS DOS) are quite
    different. It's easy if you know that the text is, say, "de": you'd
    look for äöüßÄÖÜ.
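
    A minimal sketch of that letter check, assuming Python and its
    built-in cp437, cp850, latin-1 and cp1252 codecs. The function name
    is made up for illustration, and the score is only a hint, not a
    verdict:

```python
# Count how many German letters each candidate decoding yields.
# `german_letter_score` is a hypothetical helper, not an existing tool.
GERMAN = set("äöüßÄÖÜ")

def german_letter_score(data, codecs=("cp437", "cp850", "latin-1", "cp1252")):
    scores = {}
    for name in codecs:
        # cp1252 leaves a few bytes undefined, so replace instead of failing.
        text = data.decode(name, errors="replace")
        scores[name] = sum(ch in GERMAN for ch in text)
    return scores

sample = "Größe für Äpfel".encode("cp437")
print(german_letter_score(sample))
```

    Note that 437 and 850 keep these seven letters at the same byte
    values, so this check can separate the DOS family from the
    Latin-1/1252 family, but not 437 from 850.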

    If you have something without any bytes in the range 0x80 to 0x9F,
    your chances are good that it's some kind of Latin (ISO 8859), where
    that range holds only C1 control codes.
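
    As a rough sketch (Python assumed, helper name invented here), the
    range test is a one-liner. It only says the data is consistent with
    Latin-1; short or mostly-ASCII files can pass by accident:

```python
def has_c1_bytes(data):
    """Return True if any byte falls in 0x80..0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)

# In CP437 the letter ü is 0x81, inside the range; in Latin-1 it is 0xFC.
print(has_c1_bytes("für".encode("cp437")))
print(has_c1_bytes("für".encode("latin-1")))
```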

    > One problem is detecting text with the MS-DOS box-drawing characters

    And 850 (or "850") doesn't have all of them; it has only the complete
    single and double sets. 437 also has all the single/double
    combinations.
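
    One way to exploit that difference, sketched in Python under the
    assumption that its cp437 and cp850 codec tables are trusted: derive
    the bytes that are box-drawing characters (U+2500 to U+257F) in 437
    but not in 850, then count them. The names below are illustrative
    only:

```python
def box_bytes(codec):
    """Bytes 0x80..0xFF that decode to a box-drawing character."""
    return {b for b in range(0x80, 0x100)
            if 0x2500 <= ord(bytes([b]).decode(codec)) <= 0x257F}

# The mixed single/double pieces: box drawing in 437, something else in 850.
ONLY_437 = box_bytes("cp437") - box_bytes("cp850")

def mixed_box_hits(data):
    """Count bytes that are box-drawing characters only under CP437."""
    return sum(b in ONLY_437 for b in data)

print(mixed_box_hits("╒═╕".encode("cp437")))
```

    The same byte values are accented letters and symbols in 850, so a
    hit is only convincing when it sits next to other box-drawing bytes.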

    > Please don't tell me this is anachronistic; I know it is.

    "850" _is_ my default charset, anachronistic or not... :-) You find
    various obscure tools on my pages to deal with that, hm, "situation".

    > I'm trying to migrate a lot of that anachronistic data to UTF-8, as
    > automatically as possible.

    Maybe you could try some plausible languages, use that as best guess
    for the discrimination, and finally check if the UTF-8 result is still
    plausible for the tested language. You'd need dictionaries.
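
    A sketch of that idea in Python, with a toy word list standing in
    for a real dictionary (the list, the function name and the choice of
    candidate codecs are all assumptions made for illustration):

```python
import re

# Toy stand-in for a real German dictionary (assumption).
GERMAN_WORDS = {"größe", "für", "äpfel", "über", "straße", "schön"}

def best_codec(data, words=GERMAN_WORDS,
               codecs=("cp437", "cp850", "latin-1", "cp1252")):
    """Decode with each candidate and keep the one with most dictionary hits."""
    best, best_hits = None, -1
    for name in codecs:
        text = data.decode(name, errors="replace").lower()
        hits = sum(tok in words for tok in re.findall(r"\w+", text))
        if hits > best_hits:  # ties go to the earlier codec in the list
            best, best_hits = name, hits
    return best, best_hits

codec, hits = best_codec("Größe für Äpfel".encode("latin-1"))
print(codec, hits)
```

    The winning decoding is what you would then write out as UTF-8. With
    several languages you would run this once per dictionary and keep
    the best overall score.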

    Frank


