From: Philippe Verdy (firstname.lastname@example.org)
Date: Wed Aug 10 2005 - 13:10:11 CDT
> Maybe the new CharsetDetector in ICU 3.4 would be
> useful for this situation:
This is a draft only, and it is already deprecated...
Typically, such a class should be a provider that allows pluggable
customizations, using alternate statistical distributions, or that allows
building the statistics from a text corpus.
Also, there are different needs for such decoders: if the document to check
is quite long, you may need to limit the length of the initial text parsed,
because you'll want to start using the text on the fly. So the detection may
occur only within the first 1 or 2 KB of the encoded text.
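As a sketch of that idea (plain Java with no ICU dependency; the class and
method names here are invented for illustration), a detector front end could
buffer only the first 2 KB of the stream for scoring, leaving the rest to be
decoded on the fly once a charset is chosen:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class DetectPrefix {
    // Read at most maxBytes from the stream; detection then runs on this
    // prefix only, and decoding of the full stream can start on the fly.
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        byte[] buf = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes) {
            int n = in.read(buf, total, maxBytes - total);
            if (n < 0) break;                  // end of stream
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }
}
```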
Note that the statistics also depend on the language actually used. The
statistics for English will be quite different from those for Italian or
French, and in some cases it will be hard to decide between ISO-8859-1 and
ISO-8859-2 for some Nordic or Baltic languages.
Your detector may also try to match all candidate charsets in parallel, and
then stop once a candidate reaches a given confidence level. Currently this
draft class only has a getAllDetectableCharsets() API, which is probably not
sufficient. One would also need a setAllDetectableCharsets() to limit the
choice. You would then feed the detector with as much of the encoded byte
stream as you need, before calling a method that returns the array of
encoding accuracy levels.
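To make that concrete, here is a minimal sketch of such an API (plain Java,
not the ICU draft; setDetectableCharsets, feed and confidences are
hypothetical names, and the scoring is deliberately crude: 0.0 on any hard
decoding error, 1.0 otherwise, where a real detector would apply statistics):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SimpleDetector {
    private final List<String> candidates = new ArrayList<>();
    private final ByteArrayOutputStream fed = new ByteArrayOutputStream();

    // Hypothetical counterpart to getAllDetectableCharsets():
    // restrict the search to the caller's candidate list.
    void setDetectableCharsets(Collection<String> names) {
        candidates.clear();
        candidates.addAll(names);
    }

    // Feed as much of the encoded byte stream as desired before scoring.
    void feed(byte[] chunk) {
        fed.write(chunk, 0, chunk.length);
    }

    // Return an accuracy level per candidate charset: 0.0 on a hard
    // decoding error, otherwise 1.0 (placeholder for a statistical score).
    Map<String, Double> confidences() {
        Map<String, Double> out = new LinkedHashMap<>();
        byte[] bytes = fed.toByteArray();
        for (String name : candidates) {
            CharsetDecoder dec = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                dec.decode(ByteBuffer.wrap(bytes));
                out.put(name, 1.0);
            } catch (CharacterCodingException e) {
                out.put(name, 0.0);
            }
        }
        return out;
    }
}
```

For example, feeding the single byte 0xE9 ("é" in ISO-8859-1, but a truncated
sequence in UTF-8) scores UTF-8 at 0.0 and ISO-8859-1 at 1.0.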
In some cases, no charset will match with 100% accuracy. Such a class should
return a 0% level if there's an encoding error, but in some cases encoding
errors are acceptable (for example, encoding the Euro symbol as character
entity number 128 in an ISO-8859 charset): this is a place where tuning is
needed. So use this class with care.
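One possible tuning knob, sketched below (scoreLatin1 and penaltyPerC1 are
invented names): treat C1 bytes such as 128 as soft errors that merely lower
the ISO-8859-1 score in proportion to their frequency, instead of zeroing it:

```java
class TolerantScore {
    // Score a byte stream against ISO-8859-1, treating C1 control bytes
    // (0x80-0x9F) as soft errors: each one costs a tunable penalty
    // instead of forcing the score straight to 0%.
    static double scoreLatin1(byte[] data, double penaltyPerC1) {
        if (data.length == 0) return 1.0;
        int soft = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if (v >= 0x80 && v <= 0x9F) soft++;  // e.g. 0x80: Euro in Windows-1252
        }
        double score = 1.0 - penaltyPerC1 * soft / data.length;
        return Math.max(0.0, score);
    }
}
```

With the penalty set near 0, occasional Windows-1252 Euro bytes are tolerated;
with it set high, any C1 byte effectively disqualifies the charset.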
Building an accurate heuristic that can distinguish between legacy ISO
charsets is notoriously difficult, and all web browsers have difficulty
"autodetecting" the charset used on web pages when the effective charset is
not specified or is invalid:
- some web servers label all pages as ISO-8859-1 even though the content is
in another encoding or in a UTF. Encoding exceptions are detected by the fact
that HTML does not allow some control characters (but Internet Explorer
silently accepts C1 controls in ISO-8859-1 as if they were in fact valid
Windows-1252);
- and some servers label everything as UTF-8 even though the texts are
encoded with ISO-8859-1 (exceptions occur when the UTF-8 encoding
requirements are not respected within the document body; in that case, if
there is no leading BOM, IE tries to guess an alternate charset or displays
square boxes, depending on user preferences or manual selection in the
browser).
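The UTF-8 well-formedness check that such heuristics rely on can be sketched
as follows (plain Java; looksLikeUtf8 is an invented name, and overlong
sequences are only partially rejected here, so this is a sketch rather than a
full validator):

```java
class Utf8Check {
    // Structural UTF-8 check: rejects bare continuation bytes, invalid
    // lead bytes (0x80-0xC1, 0xF5-0xFF) and truncated sequences.
    static boolean looksLikeUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            int need;                            // continuation bytes expected
            if (b < 0x80) need = 0;              // ASCII
            else if (b >= 0xC2 && b <= 0xDF) need = 1;
            else if (b >= 0xE0 && b <= 0xEF) need = 2;
            else if (b >= 0xF0 && b <= 0xF4) need = 3;
            else return false;                   // never a valid lead byte
            for (int k = 1; k <= need; k++) {
                if (i + k >= data.length) return false;   // truncated sequence
                int c = data[i + k] & 0xFF;
                if (c < 0x80 || c > 0xBF) return false;   // not a continuation
            }
            i += need + 1;
        }
        return true;
    }
}
```

ISO-8859-1 text with any accented letter (e.g. the single byte 0xE9 for "é"
followed by ASCII) fails this check, which is exactly the "exception" a
browser can use to reject a bogus UTF-8 label.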
This archive was generated by hypermail 2.1.5 : Wed Aug 10 2005 - 13:11:33 CDT