RE: Autodetection of CP437 vs. Latin-1

From: Philippe Verdy (
Date: Thu Feb 15 2007 - 14:51:03 CST


    > On behalf of Doug Ewell
    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    > > If your encoded data really contain real human text, then the common
    > > parts of 437 and 850 are very likely to occur often. Don't exclude 850
    > > for example if the international text contains French. In fact, 437
    > > was used mostly in English sources, and 850 is the default OEM charset
    > > for most legacy DOS-based European encodings, and is much more
    > > frequent than 437, given that 437 is the default OEM codepage only for
    > > US English installations, where characters outside ASCII are quite
    > > rare (except for drawing boxes and rare English words imported from
    > > French and Spanish, and people's names).
    > The data I have to convert that is in an MS-DOS code page is
    > overwhelmingly in 437. I'm not planning to "exclude" 850 but it is a
    > secondary priority for this project.
    > > My opinion is that 850 is much more frequent than 437, including in
    > > databases (there were many small database engines (dBase, Paradox)
    > > deployed in legacy applications where 850 was the only default, and
    > > this was true also for IBM text terminals or terminal emulation on AIX
    > > where 850 was the default, instead of the DEC VT220 charset which is
    > > very near ISO 8859-1).
    > This is very possibly true for the totality of data ever encoded, or
    > still outstanding, in MS-DOS code pages. It is most emphatically not
    > true for the particular collection of data that needs to be converted by
    > this project.

    It's probably because you're dealing with data whose source was located in
    the USA, where codepage 437 was the default for legacy DOS apps.

    But be careful with the origin of your files, especially if you intend to
    deploy your conversion tools on an organization-wide network which may
    include foreign subsidiaries.

    Those may have a lot of legacy text files to handle whose usage
    pattern will definitely not be primarily CP437.
    Just remember that CP850 comes second after CP437, and that it is even the
    first one outside the USA...
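
    As a rough illustration of why the two code pages are easy to confuse,
    the sketch below lists the byte values that CP437 and CP850 decode
    differently and checks which of them occur in a sample. It relies only on
    Python's built-in cp437 and cp850 codecs; the function names
    differing_bytes and ambiguous_bytes are made up for this example and are
    not part of any conversion tool discussed in this thread.

```python
def differing_bytes(enc_a="cp437", enc_b="cp850"):
    """Return the byte values (0x80-0xFF) that the two code pages decode differently."""
    return [
        b for b in range(0x80, 0x100)
        if bytes([b]).decode(enc_a) != bytes([b]).decode(enc_b)
    ]


def ambiguous_bytes(data, enc_a="cp437", enc_b="cp850"):
    """Return the sorted distinct bytes in `data` whose meaning depends on the code page."""
    diff = set(differing_bytes(enc_a, enc_b))
    return sorted({b for b in data if b in diff})


# Common French accented letters sit in the shared part of both code pages,
# so this sample gives no hint about which one was used.
sample = "déjà vu".encode("cp850")
print(ambiguous_bytes(sample))

# Byte 0x9B is a cent sign in CP437 but a lowercase o with stroke in CP850.
print(bytes([0x9B]).decode("cp437"), bytes([0x9B]).decode("cp850"))
```

    A file whose high bytes all fall in the shared part cannot be told apart
    by inspection alone, which is why knowing the origin of the files matters.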

    Now if you go to Canada, there are two candidates which are not CP437,
    depending mostly on the preferred language. It's important to have an exact
    view of the history of the organization's computing infrastructure and a
    correct estimate of which OSes were deployed, and in which versions.

    And remember also that the usage pattern will be extremely different
    according to the document type: plain text files and database files (like
    dBase and Paradox) will typically have their own encoding, independent of
    the OS on which they were used (I still have plenty of small local
    database engines that run with CP850 internal encoding, even though the
    applications that use them were fully converted to Windows long ago and
    use Windows-1252 or Unicode!). Code conversion occurs within the
    database engine or library, and is sometimes assisted by the applications
    using them, which specify (sometimes in a hard-coded way) the charset and
    encoding used.
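
    The kind of conversion such a database layer performs can be sketched as
    follows. This is only an assumed, minimal form of it: the function name
    transcode and the choice of replacing unmappable characters with "?" are
    illustrative, not the behaviour of dBase, Paradox or any specific engine.

```python
def transcode(raw, src="cp850", dst="cp1252"):
    """Convert stored bytes from the engine's code page to the application's.

    Characters with no equivalent in the target charset (for example the
    box-drawing characters of CP850) are replaced with '?' in this sketch.
    """
    return raw.decode(src).encode(dst, errors="replace")


# 0x82 is "é" in CP850; it becomes 0xE9 in Windows-1252.
print(transcode(b"caf\x82"))

# 0xC4 is a box-drawing horizontal line in CP850, absent from Windows-1252.
print(transcode(b"\xc4"))
```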

    A simple conversion to Unicode may have undesirable effects such as breaking
    database reports due to extra sorting classes, or uncovered ranges in data
    selection; I've seen some cases where some data fields use unique
    identifiers with a very precise sorting assumption, so that unique
    identifiers belong to several classes and can be split simply by comparing
    ranges, or using SQL syntaxes where the sort order is very significant. If
    you have to convert such things, make sure that the sort order remains
    consistent, and if you can't support the binary sort order implied by a
    legacy charset, you'll have not only to fix the identifiers, or modify the
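
    To make the sort-order concern concrete, here is a small sketch comparing
    the binary order of three letters under CP850 with their order by Unicode
    code point. The helper name legacy_sort_key is hypothetical, and a real
    database may apply its own collation rather than a plain byte comparison.

```python
def legacy_sort_key(text, encoding="cp850"):
    """Sort key reproducing the binary order of the legacy code page."""
    return text.encode(encoding)


names = ["Ä", "é", "ü"]

# Order by Unicode code point (U+00C4, U+00E9, U+00FC).
by_unicode = sorted(names)

# Order by CP850 byte value (ü = 0x81, é = 0x82, Ä = 0x8E).
by_legacy = sorted(names, key=legacy_sort_key)

print(by_unicode)
print(by_legacy)
```

    The two orders are reversed for these letters, so a range comparison
    written against the legacy binary order can select different rows once
    the data is stored as Unicode.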
