From: Doug Ewell (dewell@adelphia.net)
Date: Sat Feb 10 2007 - 13:39:05 CST
Addison Phillips <addison at yahoo dash inc dot com> wrote:
> Perfection in encoding detection is not really possible, since
> detection is typically based on the statistical distribution of byte
> pairs or byte sequences in various language-encoding combinations.
I just wanted to make sure nobody thought I was expecting the
impossible.
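
For anyone following along, here is a toy sketch of the statistical idea Addison describes. It is not how ICU or chardet actually work. It scores adjacent character pairs after decoding rather than raw byte pairs, and the names bigram_model, log_score and best_encoding are made up for illustration. A real detector would train on a large corpus per language.

```python
import math
from collections import Counter

def bigram_model(reference_text: str) -> Counter:
    """Count adjacent character pairs in a reference text (hypothetical helper)."""
    return Counter(zip(reference_text, reference_text[1:]))

def log_score(text: str, model: Counter) -> float:
    """Sum log(count + 1) over the pairs in text; unseen pairs contribute 0."""
    return sum(math.log(model[pair] + 1) for pair in zip(text, text[1:]))

def best_encoding(data: bytes, candidates, model: Counter) -> str:
    """Return the candidate whose decoding looks most like the reference text."""
    return max(
        candidates,
        key=lambda enc: log_score(data.decode(enc, errors="replace"), model),
    )
```

With a reference text in the expected language, the decoding that yields familiar letter pairs scores highest. The result is only ever a best guess.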
> There are several good encoding detection libraries available out
> there. ICU includes encoding detection, as does the 'chardet' bit of
> Mozilla. There are also commercial libraries.
I looked in the ICU documentation under "Character Set Detection" and
couldn't find the DOS code pages, or any detection code at all available
from C.
In any case (though I forgot to mention it) I'm trying to stay
lightweight with this particular project, though I know at some point
I'll cross the Rubicon and start reflexively including ICU in all my
projects.
> If you know the language of the bytes you're checking, that can
> greatly increase the accuracy of the encoding detection (because it
> limits the range of encodings to choose from).
Most of the text will be English; almost all the rest will be FIGS. I
suppose I can eliminate the Portuguese- and Nordic-specific characters
since I don't think I have any data in those.
I'm really only concerned with 437 vs. 1252 for this case. I have
almost no data in CP850, which was an excellent compromise encoding, and
none in 858, its euro-enabled successor. But if 850 has enough in
common with 437—which, according to one private response, it
doesn't—then I'd like to detect that as well.
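
A minimal sketch of the kind of lightweight 437 vs. 1252 check I have in mind, assuming Python and its built-in cp437 and cp1252 codecs. The FIGS_LETTERS set and the function names are my own assumptions, not a tested design. The idea is to decode the bytes both ways and see which decoding turns the high bytes into accented letters that are plausible in English or FIGS text.

```python
# Accented letters assumed plausible in English/FIGS text (an illustrative set).
FIGS_LETTERS = set("àâäçèéêëìíîïñòóôöùúûüßÀÂÄÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜáÁ")

def score(data: bytes, codec: str) -> int:
    """Count decoded characters that fall in the plausible-letter set."""
    text = data.decode(codec, errors="replace")
    return sum(1 for ch in text if ch in FIGS_LETTERS)

def guess_437_or_1252(data: bytes) -> str:
    """Return 'ascii', 'cp437', 'cp1252', or 'unknown' on a tie."""
    if all(b < 0x80 for b in data):
        return "ascii"  # no high bytes, so the two code pages agree
    s437 = score(data, "cp437")
    s1252 = score(data, "cp1252")
    if s437 > s1252:
        return "cp437"
    if s1252 > s437:
        return "cp1252"
    return "unknown"
```

This works because the accented letters sit mostly at 0x80-0xA5 in 437 and at 0xC0-0xFF in 1252. A few bytes decode to a plausible letter in both (0xE1 is ß in 437 and á in 1252), so short inputs can tie.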
I've already got detection of the UTF-* encodings worked out; those are easy. If I ever
encounter "NESTLÉ™" coded in 1252, that will just have to be one of the
rare exceptional cases that doesn't get detected properly.
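
To illustrate the exceptional case, here is a small sketch assuming the UTF-8 check is a plain validity test (the function name looks_like_utf8 is hypothetical). In 1252, "É™" is the byte pair C9 99, which also happens to be a well-formed UTF-8 sequence for U+0259, so a validity test alone would accept it as UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# "NESTLÉ™" encoded in Windows-1252 ends in the bytes C9 99.
sample = "NESTLÉ™".encode("cp1252")
misread = sample.decode("utf-8")  # decodes without error, to the wrong text
```

Most 1252 text with high bytes fails this test, because an accented letter followed by an ASCII character is not valid UTF-8. That is why the false positive is rare.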
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages