Re: Autodetection of CP437 vs. Latin-1

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Feb 12 2007 - 04:49:30 CST


    From: "Doug Ewell" <dewell@adelphia.net>
    > I'm really only concerned with 437 vs. 1252 for this case. I have
    > almost no data in CP850, which was an excellent compromise encoding, and
    > none in 858, its euro-enabled successor. But if 850 has enough in
    > common with 437—which, according to one private response, it
    > doesn't—then I'd like to detect that as well.

    If your encoded data really contain real human text, then the common parts of 437 and 850 are very likely to occur often. Don't exclude 850, for example, if the international text contains French. In fact, 437 was used mostly in English sources, while 850 is the default OEM charset for most legacy DOS-based European installations and is much more frequent than 437, given that 437 is the default OEM codepage only for US English installations, where characters outside ASCII are quite rare (except for box-drawing characters, the occasional English word borrowed from French or Spanish, and people's names).
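
    For illustration (my own sketch, not part of the workflow above), Python's built-in codecs can show exactly how much 437 and 850 have in common, by listing the high bytes that map to different characters; any byte outside that list decodes identically under either codepage, which is why text made only of the shared repertoire cannot be told apart:

        # Minimal sketch: enumerate the high bytes (0x80-0xFF) whose mapping
        # differs between CP437 and CP850, using Python's built-in codecs.
        differing = []
        for b in range(0x80, 0x100):
            ch437 = bytes([b]).decode("cp437")
            ch850 = bytes([b]).decode("cp850")
            if ch437 != ch850:
                differing.append((b, ch437, ch850))

        print(f"{len(differing)} of 128 high bytes differ between CP437 and CP850")
        for b, c437, c850 in differing:
            print(f"0x{b:02X}: CP437={c437!r}  CP850={c850!r}")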

    So consider studying the codepages that are most likely to occur for each language you wish to support in your conversion. The tricky case will be databases that contain only short text fields such as people's names and addresses; if you know the country where each person lives, you have a hint about the likely language, but there's no guarantee of being 100% accurate, as it depends heavily on the input interface that was used when the data was created and handled.
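
    As a rough sketch of that idea (the country-to-codepage table below is purely illustrative, an assumption of mine rather than anything from the data being discussed), a country hint can at least narrow the candidate decodings shown to a reviewer:

        # Illustrative sketch only: the candidate lists are assumptions, chosen to
        # show how a country hint narrows the codepage guesses for a short field.
        CANDIDATES_BY_COUNTRY = {
            "US": ["cp437", "cp1252"],
            "FR": ["cp850", "cp1252"],
            "DE": ["cp850", "cp1252"],
        }
        DEFAULT_CANDIDATES = ["cp850", "cp437", "cp1252"]

        def candidate_decodings(raw, country=""):
            """Decode a short field (name, address) under each plausible codepage
            for the given country, leaving the final choice to a human reviewer."""
            candidates = CANDIDATES_BY_COUNTRY.get(country, DEFAULT_CANDIDATES)
            return {enc: raw.decode(enc, errors="replace") for enc in candidates}

        # Example: byte 0x82 is 'é' in CP437 and CP850, but a low quote in CP1252.
        print(candidate_decodings(b"Andr\x82", country="FR"))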

    My opinion is that 850 is much more frequent than 437, including in databases: many small database engines (dBase, Paradox) were deployed in legacy applications where 850 was the only default, and the same was true for IBM text terminals and terminal emulation on AIX, where 850 was the default rather than the DEC VT220 charset, which is very close to ISO 8859-1.

    In fact, it is rarely necessary to transcode text files; transcoding is most often needed when transferring database dumps to a new engine. When possible, for text files, it's probably best to tag them as having an unidentified encoding, and to rename them progressively once they have been reviewed by a human in a program that allows selecting the encoding and remembering it (for example using an explicit filename extension; a small scripted example follows the list):
    * "something.txt" : untagged plain text document, needs to be reviewed
    * "something.CP850.txt" : tagged after review, probable source: West European DOS
    * "something.CP1252.txt" : tagged after review (may be VT220, ISO8859-1 or windows-1252); probable source : Windows, or Unix/X11 terminals (not IBM AIX)
    * "something.UTF-8.txt" : converted to UTF-8
    * "something.ASCII.txt" : no character present outside ASCII, no conversion needed.

    You should identify the possible sources of this legacy data; with explicit tagging like the above, you can progressively improve its quality without having to re-encode it in a way that would be destructive for the data.

    Consider also using a filesystem that can store more than just 8.3 filenames, to allow such tagging; today, all systems have this capability (so forget the old 8.3-only FAT variants such as FAT12, and use FAT32 or NTFS to get long filenames on storage media, or Unix/Linux partitions...).


