Re: Autodetection of CP437 vs. Latin-1

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Feb 12 2007 - 08:39:25 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > If your encoded data really contain real human text, then the common
    > parts of 437 and 850 are very likely to occur often. Don't exclude 850,
    > for example, if the international text contains French. In fact, 437
    > was used mostly in English sources, and 850 is the default OEM charset
    > for most legacy DOS-based European encodings, and is much more
    > frequent than 437, given that 437 is the default OEM codepage only for
    > US English installations, where characters outside ASCII are quite
    > rare (except for drawing boxes and rare English words imported from
    > French and Spanish, and people's names).

    The data I have to convert that is in an MS-DOS code page is
    overwhelmingly in 437. I'm not planning to "exclude" 850, but it is a
    secondary priority for this project.

    > My opinion is that 850 is much more frequent than 437, including in
    > databases (there were many small database engines (dBase, Paradox)
    > deployed in legacy applications where 850 was the only default, and
    > this was true also for IBM text terminals or terminal emulation on AIX,
    > where 850 was the default, instead of the DEC VT220 charset, which is
    > very near ISO 8859-1).

    This is very possibly true for the totality of data ever encoded, or
    still outstanding, in MS-DOS code pages. It is most emphatically not
    true for the particular collection of data that needs to be converted by
    this project.

    > In fact, it is rarely necessary to transcode text files; transcoding
    > most often occurs when transferring database dumps to a new engine.

    Well, thank goodness I don't really have to do this project after all!
    Thank you for expertly analyzing my data needs.

    > Consider also using a filesystem that can store more than just 8.3
    > filenames, to allow such tagging; today, all systems have such
    > capabilities (so forget FAT and FAT12; use FAT32 or NTFS to get long
    > filenames on storage media, or Unix/Linux partitions...)

    I am using NTFS under Windows XP SP2. That has precisely nothing to do
    with this. I have text files, accumulated over the past 20 years and
    encoded in various character sets, that I would like to convert, or at
    least view, with as much automatic charset recognition as possible.
    Renaming the files to identify the charset is not part of the solution.

    Thanks to all who sent constructive solutions, publicly and privately.
    Most of them made use of the observation that "extended Latin" letters
    typically don't occur in runs of two or more, and box-drawing characters
    typically do. I may post a better summary after I get this working.
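
    As a rough illustration of that heuristic (a sketch only, not the
    detector I will actually use; the function name, the simple run-length
    rule, and the tie-breaking are placeholder assumptions of mine), here
    is roughly what the check looks like in Python:

        def guess_cp437_or_latin1(data):
            """Guess 'cp437', 'latin-1', or 'ascii' for a chunk of bytes."""
            run_bytes = 0   # high bytes that occur in runs of two or more
            lone_bytes = 0  # high bytes that occur in isolation
            run = 0         # length of the current run of bytes >= 0x80

            for b in list(data) + [0]:      # trailing 0 flushes the final run
                if b >= 0x80:
                    run += 1
                    continue
                if run >= 2:
                    run_bytes += run        # long runs look like box-drawing graphics
                elif run == 1:
                    lone_bytes += 1         # isolated bytes look like accented letters
                run = 0

            if run_bytes == 0 and lone_bytes == 0:
                return 'ascii'              # pure ASCII: either mapping works
            return 'cp437' if run_bytes > lone_bytes else 'latin-1'

    A real detector would probably also weight the specific CP437
    box-drawing and block range (0xB0 through 0xDF) more heavily, but that
    is the general shape of it.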

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

