Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 10 2005 - 13:10:11 CDT

  • Next message: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

    From: <eflarup@yahoo.com>
    > Maybe the new CharsetDetector in ICU 3.4 would be
    > useful for this situation:
    >
    > http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html

    This is a draft only, and it is already deprecated...
    Typically, such a class should be a provider that allows pluggable
    customizations: using alternate statistical distributions, or building
    the statistics from a text corpus.

    Also, there are different needs for such detectors: if the document to
    check is quite long, you may need to limit the length of the initial text
    parsed, because you'll want to start consuming the text on the fly. So the
    detection may occur only within the first 1 or 2 KB of the encoded text.
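    The idea of probing only a leading window can be sketched as follows. This is a minimal illustration, not any real detector API: the 2 KB cut-off and the candidate list are arbitrary choices for the example.

```python
# Probe only the first ~2 KB of a possibly long byte stream, so the
# rest can be consumed on the fly once a charset has been chosen.
PROBE_SIZE = 2048
CANDIDATES = ["utf-8", "cp1256", "iso-8859-1"]  # illustrative choice

def probe_charsets(data: bytes, limit: int = PROBE_SIZE) -> list:
    """Return the candidate charsets whose decoder accepts the leading bytes."""
    head = data[:limit]
    ok = []
    for name in CANDIDATES:
        try:
            head.decode(name)
            ok.append(name)
        except UnicodeDecodeError:
            pass
    return ok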

    Note that the statistics also depend on the language actually used. The
    statistics for English will be quite different from those for Italian or
    French, and in some cases it will be hard to decide between ISO-8859-1 and
    ISO-8859-2 for some Nordic or Baltic languages.
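    One simple way to make the statistics language-specific, along the lines suggested above, is to build a byte-pair frequency model from a corpus of known encoding and language, and score candidate byte streams against it. This is only a sketch of the idea; the tiny training text stands in for a real corpus.

```python
from collections import Counter

def train(corpus: bytes) -> Counter:
    """Count adjacent byte pairs in a corpus of known language/encoding."""
    return Counter(zip(corpus, corpus[1:]))

def score(data: bytes, model: Counter) -> int:
    """Higher score = the data's byte pairs look more like the corpus."""
    return sum(model[pair] for pair in zip(data, data[1:]))
```

    A detector tuned this way would keep one model per (language, charset) pair and pick the highest-scoring combination.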

    Your detector may also try to match all candidate charsets in parallel,
    and then stop at some confidence level. Currently this draft class only
    has a getAllDetectableCharsets() API, which is probably not sufficient.
    One would also need a setAllDetectableCharsets() to limit the choice. You
    would then feed the detector with as much of the encoded byte stream as
    needed, before calling a method that returns the array of encoding
    accuracy levels. In some cases, no charset will match with 100% accuracy.
    Such a class should return a 0% level if there's an encoding error, but in
    some cases encoding errors are acceptable (for example, encoding the Euro
    symbol as character entity number 128 in an ISO-8859 charset): this is a
    place where tuning is needed. So use this class with care.
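    The feed-then-query interface described above could look like the following sketch. The class and method names (SimpleDetector, set_detectable_charsets, accuracy_levels) are hypothetical, not the ICU draft API, and the scoring is deliberately crude: the fraction of input that decodes cleanly.

```python
class SimpleDetector:
    """Toy detector that matches all candidate charsets in parallel."""

    def __init__(self):
        self._candidates = ["utf-8", "cp1256", "iso-8859-1"]
        self._buf = bytearray()

    def set_detectable_charsets(self, names):
        """Limit the candidates, as a setAllDetectableCharsets() would."""
        self._candidates = list(names)

    def feed(self, chunk: bytes):
        """Accumulate encoded bytes; call as often as needed."""
        self._buf.extend(chunk)

    def accuracy_levels(self) -> dict:
        """Map each candidate charset to a 0.0-1.0 accuracy level."""
        levels = {}
        for name in self._candidates:
            decoded = bytes(self._buf).decode(name, errors="replace")
            bad = decoded.count("\ufffd")  # one marker per decoding error
            levels[name] = 1.0 - bad / max(len(decoded), 1)
        return levels
```

    Note that ISO-8859-1 accepts every byte value, so it always scores 1.0 here; this is exactly why pure error-counting is not enough and statistical tuning is needed.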

    Building an accurate heuristic that distinguishes only between legacy ISO
    charsets is notoriously difficult, and all web browsers have difficulty
    "autodetecting" the charset used on web pages when the effective charset
    is not specified or is invalid:

    - some web servers label all pages as ISO-8859-1 even when the content is
    in another encoding or in a UTF. Encoding exceptions are detected by the
    fact that HTML does not allow some control characters (but Internet
    Explorer silently accepts C1 controls in ISO-8859-1 as if they were in
    fact valid Windows-1252 characters)
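    The C1-controls point is easy to demonstrate: bytes 0x80-0x9F are (invisible) C1 control codes in ISO-8859-1, but printable characters in Windows-1252, which is why treating mislabelled pages as Windows-1252 often "works".

```python
euro = b"\x80"
# In ISO-8859-1, 0x80 is the C1 control U+0080, not a printable character.
assert euro.decode("iso-8859-1") == "\x80"
# In Windows-1252, the same byte is the Euro sign.
assert euro.decode("cp1252") == "\u20ac"  # "€"
```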

    - and some servers label everything as UTF-8 even though the text is
    encoded in ISO-8859-1. (Exceptions occur when the UTF-8 encoding
    requirements are not respected within the document body; in that case, if
    there is no leading BOM, IE tries to guess an alternate charset or
    displays square boxes, depending on user preferences or manual selection
    in the browser.)
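    The fallback behaviour described above can be sketched like this: treat the "UTF-8" label as a claim to verify, and fall back to ISO-8859-1 (or a user preference) when the bytes violate UTF-8's encoding rules. The charsets chosen here are just for illustration.

```python
def decode_with_fallback(data: bytes):
    """Return (text, charset actually used)."""
    try:
        # Strict UTF-8 decoding verifies the label's claim.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails,
        # though the result may be mojibake if the true charset differs.
        return data.decode("iso-8859-1"), "iso-8859-1"
```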



    This archive was generated by hypermail 2.1.5 : Wed Aug 10 2005 - 13:11:33 CDT