Re: Autodetection of CP437 vs. Latin-1

From: Addison Phillips (addison@yahoo-inc.com)
Date: Sat Feb 10 2007 - 11:05:37 CST

Next message: Richard Wordingham: "Re: missing symbol?"

Previous message: Peter Constable: "RE: Query for Validity of Thai Sequence"
In reply to: Doug Ewell: "Autodetection of CP437 vs. Latin-1"
Next in thread: Doug Ewell: "Re: Autodetection of CP437 vs. Latin-1"
Reply: Doug Ewell: "Re: Autodetection of CP437 vs. Latin-1"

Hi Doug,

Doug Ewell wrote:
> I'm looking for tips on automatically detecting text data in MS-DOS
> CP437 (or 850, etc.) versus Latin-1 or Windows CP1252. It doesn't have
> to be a perfect solution, but pretty good.

Perfection in encoding detection is not really possible, since detection
is typically based on the statistical distribution of byte pairs or byte
sequences in various language-encoding combinations.
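
To make the byte-pair idea concrete, here is a minimal sketch in
Python. It is not how ICU or chardet work internally. The function
names (bigram_model, log_likelihood, guess_encoding) and the tiny
SAMPLE text are made up for illustration. It builds a byte-pair
frequency table for each candidate encoding from the same sample text,
then picks the candidate whose table best explains the unknown bytes.
A real detector would train on far more text per language.

```python
import math
from collections import Counter

# Tiny illustrative training text (an assumption, not real training data).
# Every character in it exists in both CP437 and CP1252.
SAMPLE = (
    "Le café est déjà prêt près de la forêt, très agréable en été. "
    "Das Mädchen hört schöne Lieder über die Bäume."
)

def bigram_model(sample, encoding):
    """Count adjacent byte pairs of the sample as encoded in `encoding`."""
    data = sample.encode(encoding)
    return Counter(zip(data, data[1:]))

def log_likelihood(data, model):
    """Log-probability of the byte pairs in `data`, with add-one smoothing."""
    total = sum(model.values()) + 256 * 256
    return sum(math.log((model[pair] + 1) / total)
               for pair in zip(data, data[1:]))

def guess_encoding(data, candidates=("cp437", "cp1252")):
    """Return the candidate whose byte-pair model scores `data` highest."""
    return max(candidates,
               key=lambda enc: log_likelihood(data, bigram_model(SAMPLE, enc)))
```

With this sample, guess_encoding("Un café très agréable".encode("cp437"))
picks "cp437", and the same text encoded as CP1252 picks "cp1252". With
so little training text it will be wrong on anything that shares no
byte pairs with the sample.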

There are several good encoding detection libraries available. ICU
includes encoding detection, as does the 'chardet' component of
Mozilla. There are also commercial libraries.

Nearly all such solutions include language detection because
language-related distribution of characters affects the outcome. If you
know the language of the bytes you're checking, that can greatly
increase the accuracy of the encoding detection (because it limits the
range of encodings to choose from).

Detection accuracy varies depending on your data. The more you can limit
the range of encodings and languages being checked for, the more likely
you will achieve an accurate result. Bear in mind that, because it is
statistical, the more text you have, the more accurate the result. If
you've got a database with a varchar(50) column and an unlimited range
of encodings, you'll be wrong a lot.

If your data is all Western European and strictly limited to "OEM" vs.
"ANSI" code pages, it is likely that you can detect the encoding
accurately even with shorter runs of text.
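
For the restricted OEM-vs-ANSI case, a rough sketch in Python might
look like the following. It is a hand-rolled heuristic, not a library
API: the EXPECTED letter set, the scoring weights, and the function
names are all assumptions chosen for illustration. It decodes the
bytes under each candidate and rewards non-ASCII characters that are
plausible Western European letters sitting next to an ASCII letter. It
deliberately ignores box-drawing characters.

```python
# Letters one might expect in Western European text (illustrative, incomplete).
EXPECTED = set("àáâäåæçèéêëìíîïñòóôöùúûüÿÄÅÆÇÉÑÖÜ")

def is_ascii_letter(ch):
    return ch.isascii() and ch.isalpha()

def score(data, encoding):
    """Score how plausible `data` looks when decoded as `encoding`."""
    text = data.decode(encoding, errors="replace")
    total = 0
    for i, ch in enumerate(text):
        if ord(ch) < 128:
            continue  # ASCII decodes identically in both code pages
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        in_word = is_ascii_letter(before) or is_ascii_letter(after)
        if ch in EXPECTED:
            total += 1 if in_word else 0   # accented letter inside a word
        else:
            total -= 1                     # symbol, Greek letter, U+FFFD, etc.
    return total

def guess_oem_or_ansi(data, candidates=("cp1252", "cp437")):
    """Return the candidate with the highest score (first one wins ties)."""
    return max(candidates, key=lambda enc: score(data, enc))
```

For example, guess_oem_or_ansi("Señor Müller".encode("cp437")) returns
"cp437", because under CP1252 those two high bytes come out as a
currency sign and an undefined byte. Pure ASCII input is a tie, which
is harmless since both code pages decode it the same way.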

There was a good paper on migrating databases from the eBay folks at
IUC29, which you might look for: it included practical discussion of
how to migrate data in a live application.

>
> One problem is detecting text with the MS-DOS box-drawing characters,
> many of which occupy the same code points as Latin-1 accented letters.
> This means that simple range-checking often doesn't work.

Yep.
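
To illustrate the overlap, here is a small Python sketch. The helper
names (show_overlap, looks_like_box_drawing) and the neighbour rule are
assumptions for illustration, not an established algorithm. The first
function shows what the same byte means in each code page. The second
treats a high byte as box-drawing evidence only when it decodes to a
box or block character under CP437 and has no ASCII letter next to it,
which is roughly the context a range check throws away.

```python
def show_overlap(data):
    """List each byte with its CP437 and Latin-1 interpretation."""
    return [(f"0x{b:02X}",
             bytes([b]).decode("cp437"),
             bytes([b]).decode("latin-1")) for b in data]

def is_box(byte):
    """True if the byte is a box-drawing or block element under CP437."""
    ch = bytes([byte]).decode("cp437")
    return 0x2500 <= ord(ch) <= 0x259F

def is_ascii_letter(byte):
    return byte < 128 and chr(byte).isalpha()

def looks_like_box_drawing(data):
    """Heuristic: do the high bytes look more like line art than letters?"""
    box = letters = 0
    for i, b in enumerate(data):
        if b < 0x80:
            continue
        neighbours = data[max(i - 1, 0):i] + data[i + 1:i + 2]
        if any(is_ascii_letter(n) for n in neighbours):
            letters += 1   # high byte inside a word: probably an accented letter
        elif is_box(b):
            box += 1       # high byte away from letters, box-drawing in CP437
    return box > letters
```

The byte 0xC9, for instance, is a double-line corner in CP437 and
"É" in Latin-1. A border such as bytes C9 CD CD CD BB is flagged as
box drawing, while "Ärger" in Latin-1 is not. Line art drawn tight
against text, with no space between the border and a letter, will fool
this sketch.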

>
> Please send replies off-list unless you feel they would interest the
> list. Please don't tell me this is anachronistic; I know it is. I'm
> trying to migrate a lot of that anachronistic data to UTF-8, as
> automatically as possible.

(laughing) It certainly isn't anachronistic... yet. Would that it were.

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Internationalization is an architecture.
It is not a feature.

This archive was generated by hypermail 2.1.5 : Sat Feb 10 2007 - 11:08:09 CST