Re: Autodetection of CP437 vs. Latin-1

From: Addison Phillips (addison@yahoo-inc.com)
Date: Sat Feb 10 2007 - 11:05:37 CST


    Hi Doug,

    Doug Ewell wrote:
    > I'm looking for tips on automatically detecting text data in MS-DOS
    > CP437 (or 850, etc.) versus Latin-1 or Windows CP1252. It doesn't have
    > to be a perfect solution, but pretty good.

    Perfection in encoding detection is not really possible, since detection
    is typically based on the statistical distribution of byte pairs or byte
    sequences in various language-encoding combinations.

    There are several good encoding detection libraries available. ICU
    includes encoding detection, as does the 'chardet' component of
    Mozilla. There are also commercial libraries.

    Nearly all such solutions include language detection because
    language-related distribution of characters affects the outcome. If you
    know the language of the bytes you're checking, that can greatly
    increase the accuracy of the encoding detection (because it limits the
    range of encodings to choose from).

    Detection accuracy varies depending on your data. The more you can limit
    the range of encodings and languages being checked for, the more likely
    you will achieve an accurate result. Bear in mind that, because it is
    statistical, the more text you have, the more accurate the result. If
    you've got a database with a varchar(50) column and an unlimited range
    of encodings, you'll be wrong a lot.

    If your data is all Western European and strictly limited to "OEM" vs
    "ANSI" code pages, it is likely that you can detect the encoding
    accurately with shorter runs.
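
    As a rough illustration of that two-way case, here is a minimal
    Python sketch. It is not how ICU or chardet work; it is only a
    hand-rolled plausibility score. The names plausibility() and
    guess_encoding(), the candidate list, and all the weights are
    assumptions made up for this example. It rewards accented Latin
    letters that sit inside words and box-drawing characters that sit
    next to whitespace or more box-drawing, and penalizes the rest.

```python
import unicodedata

# Assumed candidate set: one "OEM" and one "ANSI" code page.
CANDIDATES = ("cp437", "cp1252")


def _is_box(ch: str) -> bool:
    # Box Drawing (U+2500..U+257F) plus Block Elements (U+2580..U+259F).
    return 0x2500 <= ord(ch) <= 0x259F


def plausibility(data: bytes, encoding: str) -> int:
    """Score how believable `data` looks when decoded as `encoding`.

    The weights are arbitrary illustrative values, not tuned on real data.
    """
    text = data.decode(encoding, errors="replace")
    score = 0
    for i, ch in enumerate(text):
        if ord(ch) < 0x80:
            continue  # ASCII is identical in both candidates
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if ch == "\ufffd":
            score -= 5  # byte is undefined in this code page
        elif ch.isalpha():
            latin = unicodedata.name(ch, "").startswith("LATIN")
            in_word = prev.isalpha() or nxt.isalpha()
            if latin and in_word:
                score += 2  # accented letter inside a word
            else:
                score -= 1  # stray Greek letter, isolated letter, etc.
        elif _is_box(ch):
            tidy = all(n.isspace() or _is_box(n) for n in (prev, nxt))
            score += 3 if tidy else -2
    return score


def guess_encoding(data: bytes) -> str:
    # On a tie (for example pure ASCII) the first candidate is returned.
    return max(CANDIDATES, key=lambda enc: plausibility(data, enc))


if __name__ == "__main__":
    for sample in (b"caf\x82", b"caf\xe9", b"\xda\xc4\xc4\xbf"):
        print(sample, guess_encoding(sample))
```

    A scorer like this only makes sense because the candidate set is so
    small; with more encodings or languages in play, one of the
    statistical libraries mentioned above is the better choice.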

    There was a good paper on migrating databases from the eBay folks at
    IUC29 that you might look for; it included a practical discussion of
    how to migrate data in a live application.

    >
    > One problem is detecting text with the MS-DOS box-drawing characters,
    > many of which occupy the same code points as Latin-1 accented letters.
    > This means that simple range-checking often doesn't work.

    Yep.
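
    To see the size of that overlap, a short Python sketch can list the
    byte values that are box-drawing characters in CP437 but letters in
    Latin-1. The function name box_vs_letter_overlap() is made up for
    this illustration, and "letter" here simply means whatever Python's
    str.isalpha() accepts.

```python
def box_vs_letter_overlap() -> dict:
    """Map each high byte to (CP437 char, Latin-1 char) where the CP437
    reading is a box-drawing character and the Latin-1 reading is a letter."""
    overlap = {}
    for b in range(0x80, 0x100):
        raw = bytes([b])
        oem = raw.decode("cp437")
        ansi = raw.decode("latin-1")
        # U+2500..U+257F is the Unicode Box Drawing block.
        if 0x2500 <= ord(oem) <= 0x257F and ansi.isalpha():
            overlap[b] = (oem, ansi)
    return overlap


if __name__ == "__main__":
    for byte, (oem, ansi) in sorted(box_vs_letter_overlap().items()):
        print(f"0x{byte:02X}  CP437 {oem}  Latin-1 {ansi}")
```

    Every byte it prints is valid in both encodings, which is why the
    byte value alone cannot settle the question and the surrounding
    characters have to be taken into account.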

    >
    > Please send replies off-list unless you feel they would interest the
    > list. Please don't tell me this is anachronistic; I know it is. I'm
    > trying to migrate a lot of that anachronistic data to UTF-8, as
    > automatically as possible.

    (laughing) It certainly isn't anachronistic... yet. Would that it were.

    Addison

    -- 
    Addison Phillips
    Globalization Architect -- Yahoo! Inc.
    Internationalization is an architecture.
    It is not a feature.
    

