RE: Autodetection of CP437 vs. Latin-1

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Feb 15 2007 - 14:51:03 CST

Next message: Philippe Verdy: "RE: Autodetection of CP437 vs. Latin-1"

Previous message: Lokesh Joshi: "Re: Query for Validity of Thai Sequence"
In reply to: Doug Ewell: "Re: Autodetection of CP437 vs. Latin-1"
Next in thread: Philippe Verdy: "RE: Autodetection of CP437 vs. Latin-1"

> On behalf of Doug Ewell
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
> > If your encoded data really contain real human text, then the common
> > parts of 437 and 850 are very likely to occur often. Don't exclude 850
> > for example if the international text contains French. In fact, 437
> > was used mostly in English sources, and 850 is the default OEM charset
> > for most legacy DOS-based European encodings, and is much more
> > frequent than 437, given that 437 is the default OEM codepage only for
> > US English installations, where characters outside ASCII are quite
> > rare (except for drawing boxes and rare English words imported from
> > French and Spanish, and people's names).
>
> The data I have to convert that is in an MS-DOS code page is
> overwhelmingly in 437. I'm not planning to "exclude" 850 but it is a
> secondary priority for this project.
>
> > My opinion is that 850 is much more frequent than 437, including in
> > databases (there were many small database engines (dBase, Paradox)
> > deployed in legacy applications where 850 was the only default, and
> > this was true also for IBM text terminals or terminal emulation on AIX
> > where 850 was the default, instead of the DEC VT220 charset which is
> > very near ISO 8859-1).
>
> This is very possibly true for the totality of data ever encoded, or
> still outstanding, in MS-DOS code pages. It is most emphatically not
> true for the particular collection of data that needs to be converted by
> this project.

It's probably because you're dealing with data whose source was located in
the USA, where codepage 437 was the default for legacy DOS apps.

But be careful with the origin of your files, especially if you plan to
deploy your conversion tools on an organization-wide network which may
include foreign subsidiaries.

Those may have a lot of legacy text files to handle whose usage
pattern will definitely not be primarily CP437.
Keep in mind that CP850 comes second only to CP437, and that it is even the
first one outside the USA...
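
To see how much the two code pages actually disagree, here is a minimal
sketch in Python. It relies only on the standard "cp437" and "cp850" codecs;
the names build_diff_table and ambiguous_bytes are made up for this
illustration. It lists the bytes that decode differently, and reports which
of them occur in a given piece of data. If none occur, both code pages give
the same text and the choice does not matter for that file.

```python
def build_diff_table(enc_a="cp437", enc_b="cp850"):
    """Map each byte 0x80-0xFF that decodes differently to its two readings."""
    table = {}
    for b in range(0x80, 0x100):
        ch_a = bytes([b]).decode(enc_a)
        ch_b = bytes([b]).decode(enc_b)
        if ch_a != ch_b:
            table[b] = (ch_a, ch_b)
    return table


# Hypothetical helper table: byte value -> (CP437 character, CP850 character)
DIFF = build_diff_table()


def ambiguous_bytes(data):
    """Return the bytes present in data whose meaning depends on the code page."""
    return {b: DIFF[b] for b in set(data) if b in DIFF}


if __name__ == "__main__":
    sample = "caf\u00e9".encode("cp850")   # e-acute is 0x82 in both code pages
    print(ambiguous_bytes(sample))         # nothing ambiguous here
    print(ambiguous_bytes(b"\x9b"))        # cent sign in CP437, o-slash in CP850
```

Files for which ambiguous_bytes returns something are the only ones that
need a real decision (statistics, dictionary lookup, or a human).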

Now if you go to Canada, there are two candidates which are not CP437,
depending mostly on the preferred language. It's important to have an exact
view of the history of the organization's computing infrastructure and a
correct estimate of which OSes were deployed, and in which versions.

And remember also that the usage pattern will be extremely different
according to the document type: plain text files and database files (like
dBase and Paradox) will typically have their own encoding, independent of
the OS on which they were used. (I still have plenty of small local
database engines that run with a CP850 internal encoding, even though the
applications that use them were fully converted to Windows long ago and use
Windows-1252 or Unicode!) Code conversion occurs within the
database engine or library, and is sometimes assisted by the applications
using them, which specify (sometimes in a hard-coded way) the charset and
encoding used.
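
As a small illustration of why the storage encoding has to be named
explicitly rather than taken from the host OS, here is a sketch in Python.
The function name decode_field and the sample bytes are assumptions made for
this example, not part of any database library.

```python
def decode_field(raw, storage_encoding):
    """Decode a raw field using the encoding of the data store, not of the OS."""
    return raw.decode(storage_encoding)


# "cafe" with e-acute, as stored by an engine that uses CP850 internally
raw = b"caf\x82"

as_cp850 = decode_field(raw, "cp850")    # the engine's real internal encoding
as_cp1252 = decode_field(raw, "cp1252")  # wrong: the Windows charset of the host

if __name__ == "__main__":
    print(repr(as_cp850))
    print(repr(as_cp1252))
```

The second call does not fail; it silently yields a low quotation mark
(U+201A) instead of the accented letter, which is why this kind of error
tends to go unnoticed until someone reads the converted data.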

A simple conversion to Unicode may have undesirable effects such as breaking
database reports due to extra sorting classes, or uncovered ranges in data
selection. I've seen some cases where some data fields use unique
identifiers with a very precise sorting assumption, so that the
identifiers belong to several classes which can be split simply by comparing
ranges, or by using SQL syntaxes where the sort order is very significant. If
you have to convert such things, make sure that the sort order remains
consistent, and if you can't support the binary sort order implied by a
legacy charset, you'll have not only to fix the identifiers, but also to
modify the applications!
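
The sort order problem can be shown with a few lines of Python. This is only
a sketch: the identifiers are invented, and the "Unicode order" used here is
plain code point order, not a language-aware collation. The point is that
the binary order of CP850 bytes and the order of the corresponding Unicode
code points disagree, so a range test that was true before conversion can
become false afterwards.

```python
# Invented identifiers: e-acute, a-grave, c-cedilla, capital E-acute, plain z
ids = ["\u00e91", "\u00e01", "\u00e71", "\u00c91", "z1"]

# Order seen by a legacy engine comparing raw CP850 bytes
byte_order = sorted(ids, key=lambda s: s.encode("cp850"))

# Order seen after conversion, comparing Unicode code points
codepoint_order = sorted(ids)


def in_range_legacy(value, low, high, encoding="cp850"):
    """Range test on the encoded bytes, as a binary-sorting engine would do."""
    return low.encode(encoding) <= value.encode(encoding) <= high.encode(encoding)


def in_range_unicode(value, low, high):
    """The same range test on code points, after a naive conversion."""
    return low <= value <= high


if __name__ == "__main__":
    print(byte_order)
    print(codepoint_order)
    # e-acute sorts before a-grave in CP850 bytes, but after it in Unicode
    print(in_range_legacy("\u00e91", "a", "\u00e0"))
    print(in_range_unicode("\u00e91", "a", "\u00e0"))
```

Running a comparison like this over the real identifiers before converting
is a cheap way to find out whether any report or selection depends on the
legacy binary order.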


This archive was generated by hypermail 2.1.5 : Thu Feb 15 2007 - 14:53:58 CST