Re: Autodetection of CP437 vs. Latin-1

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Feb 10 2007 - 13:39:05 CST

    Addison Phillips <addison at yahoo dash inc dot com> wrote:

    > Perfection in encoding detection is not really possible, since
    > detection is typically based on the statistical distribution of byte
    > pairs or byte sequences in various language-encoding combinations.

    I just wanted to make sure nobody thought I was expecting the
    impossible.
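
    The byte-pair idea quoted above can be sketched roughly as follows.
    This is only a minimal illustration, not how ICU or chardet actually
    work: the function names, the add-one smoothing, and the use of a
    single training string per encoding are all assumptions made for the
    example.

```python
from collections import Counter
import math

def bigram_counts(data: bytes) -> Counter:
    """Count adjacent byte pairs in a byte string."""
    return Counter(zip(data, data[1:]))

def log_likelihood(data: bytes, model: Counter) -> float:
    """Score data against a bigram model with add-one smoothing."""
    total = sum(model.values())
    vocab = 256 * 256  # number of possible byte pairs
    return sum(
        math.log((model[pair] + 1) / (total + vocab))
        for pair in zip(data, data[1:])
    )

def best_encoding(data: bytes, training_text: str, encodings: list) -> str:
    """Pick the encoding whose bigram model fits the bytes best.

    training_text is a hypothetical sample of the expected language; it is
    encoded once per candidate encoding to build that encoding's model.
    """
    models = {enc: bigram_counts(training_text.encode(enc)) for enc in encodings}
    return max(encodings, key=lambda enc: log_likelihood(data, models[enc]))
```

    With real data the training text would need to be far larger than a
    toy string, and a short input can still be misjudged.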

    > There are several good encoding detection libraries available out
    > there. ICU includes encoding detection, as does the 'chardet' bit of
    > Mozilla. There are also commercial libraries.

    I looked in the ICU documentation under "Character Set Detection" and
    couldn't find any mention of the DOS code pages, or any detection code
    at all that is available from C.

    In any case (though I forgot to mention it) I'm trying to stay
    lightweight with this particular project, though I know at some point
    I'll cross the Rubicon and start reflexively including ICU in all my
    projects.

    > If you know the language of the bytes you're checking, that can
    > greatly increase the accuracy of the encoding detection (because it
    > limits the range of encodings to choose from).

    Most of the text will be English; almost all the rest will be FIGS. I
    suppose I can eliminate the Portuguese- and Nordic-specific characters
    since I don't think I have any data in those.

    I'm really only concerned with 437 vs. 1252 for this case. I have
    almost no data in CP850, which was an excellent compromise encoding, and
    none in 858, its euro-enabled successor. But if 850 has enough in
    common with 437—which, according to one private response, it
    doesn't—then I'd like to detect that as well.
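
    For the narrow 437-vs-1252 question, one lightweight heuristic is to
    decode the bytes both ways and see which result contains more of the
    accented letters one would expect. The sketch below assumes a
    hand-picked letter set for English plus FIGS; the set, the function
    names, and the tie handling are illustrative guesses, not a tested
    detector.

```python
# Letters assumed plausible in English/FIGS text (an assumption of this sketch).
_LOWER = "àâäæçèéêëìíîïñòóôöùúûüÿßáœ"
EXPECTED = set(_LOWER + _LOWER.upper() + "¿¡«»")

def score(data: bytes, encoding: str) -> int:
    """Count non-ASCII characters that look like expected letters."""
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    text = data.decode(encoding, errors="replace")
    return sum(1 for ch in text if ord(ch) > 127 and ch in EXPECTED)

def guess_437_or_1252(data: bytes):
    """Return "cp437", "cp1252", or None when the scores are tied."""
    s437 = score(data, "cp437")
    s1252 = score(data, "cp1252")
    if s437 == s1252:
        return None  # pure ASCII, or genuinely ambiguous
    return "cp437" if s437 > s1252 else "cp1252"
```

    This works because the accented letters sit in different byte ranges
    in the two code pages: 437 keeps them mostly in 0x80-0xA5, while 1252
    keeps them in 0xC0-0xFF, where 437 has box-drawing and Greek
    characters instead.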

    I've already got detection of the UTF-* encodings worked out; those
    are easy. If I ever encounter "NESTLÉ™" coded in 1252, that will just
    have to be one of the rare exceptional cases that don't get detected
    properly.
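
    The "NESTLÉ™" case is awkward because its 1252 bytes happen to form
    well-formed UTF-8. A short sketch shows this, assuming UTF-8 detection
    is done by a simple strict-decode check (the helper name is
    hypothetical):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# In Windows-1252, É is byte 0xC9 and ™ is byte 0x99.
sample = "NESTLÉ™".encode("cp1252")

# 0xC9 0x99 is also a valid two-byte UTF-8 sequence (U+0259, "ə"),
# so a validity check alone accepts the string as UTF-8.
misread = sample.decode("utf-8")
```

    Most 1252 text with accented letters fails the strict decode, since an
    accented letter is rarely followed by a byte in the 0x80-0xBF range.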

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

