Re: encoding sniffing

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 14 2003 - 17:26:29 EDT

    On Monday, July 14, 2003 10:14 PM, Peter_Constable@sil.org <Peter_Constable@sil.org> wrote:

    > Are there any libraries out there (open-source or otherwise) that can
    > be used to detect the character encoding of a file or data stream?

    Yes, but such libraries really try to detect the encoded language
    along with the encoding. They first apply strict validity rules to
    narrow down the possible encodings, then statistical rules to match
    languages against their typical encoded byte sequences, and finally
    look for common words. The result is probabilistic: what you get is
    a ranked list of language-encoding pairs. In many cases the final
    decision is ambiguous, so the reader may have to adjust it.
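
    A minimal sketch of that pipeline, in Python, might look as follows.
    It is only an illustration, not what any particular library does: the
    candidate list, the tiny word lists and the names valid_encodings and
    rank are all made up for the example, and the statistical step is
    reduced to a common-word ratio.

```python
from typing import List, Tuple

# Hypothetical candidate list; a real detector would know many more encodings.
CANDIDATES = ["ascii", "utf-8", "euc_jp", "gb2312", "latin-1"]

# Hypothetical, deliberately tiny common-word lists (one per language).
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "fr": {"le", "la", "et", "les", "des", "est"},
}


def valid_encodings(data: bytes) -> List[str]:
    """Step 1: keep only the encodings under which the bytes decode strictly."""
    ok = []
    for name in CANDIDATES:
        try:
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


def rank(data: bytes) -> List[Tuple[str, str, float]]:
    """Steps 2-3 (simplified): score each (language, encoding) pair by the
    share of decoded words found in that language's common-word list."""
    results = []
    for enc in valid_encodings(data):
        words = data.decode(enc).lower().split()
        if not words:
            continue
        for lang, common in COMMON_WORDS.items():
            score = sum(w in common for w in words) / len(words)
            results.append((lang, enc, score))
    # Best guess first; ties remain, which is exactly the ambiguity at issue.
    results.sort(key=lambda r: r[2], reverse=True)
    return results
```

    Note that for plain English text every ASCII-compatible candidate
    passes the validity step and gets the same score, so the top of the
    list is a tie between several encodings.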

    Internet Explorer uses simple algorithms for its "auto-determined"
    mode, but it often fails, reporting Chinese text encoded in EUC-CN
    or UTF-7 when the text is in fact just plain English in ASCII. This
    failure occurs with Chinese simply because there is no actual
    dictionary against which to match the ideographs commonly used in
    Chinese text (notably its ideographic punctuation and square
    spaces).

    However, purely statistical rules often work when only the
    encoding has to be detected (though with no guarantee).
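
    To give an idea of what such a statistical rule can look like, here
    is a small Python sketch. It assumes only that EUC-CN encodes each
    Chinese character as two bytes in the range 0xA1-0xFE; the function
    names, the 0.9 threshold and the returned labels are arbitrary
    choices for the example, not taken from any existing detector.

```python
def euc_pair_ratio(data: bytes) -> float:
    """Share of high-bit bytes that belong to a well-formed EUC two-byte pair."""
    high = sum(b >= 0x80 for b in data)
    if high == 0:
        return 0.0
    paired = 0
    i = 0
    while i < len(data):
        lead_ok = 0xA1 <= data[i] <= 0xFE
        trail_ok = i + 1 < len(data) and 0xA1 <= data[i + 1] <= 0xFE
        if lead_ok and trail_ok:
            paired += 2
            i += 2  # consume the whole pair
        else:
            i += 1
    return paired / high


def guess(data: bytes, threshold: float = 0.9) -> str:
    """Very rough guess: 'ascii', 'euc-cn' or 'unknown' (threshold is arbitrary)."""
    if all(b < 0x80 for b in data):
        # No high-bit bytes at all: nothing here points to a Chinese encoding.
        return "ascii"
    return "euc-cn" if euc_pair_ratio(data) >= threshold else "unknown"
```

    With this rule, pure 7-bit text is never reported as EUC-CN, but the
    answer for 8-bit text is still only a likelihood.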

    I don't use Mozilla, but it may have such a mode for detecting
    the actual encoding; if so, it should be in its sources (I did not
    check).

    -- 
    Philippe.
    Spam not tolerated: any unsolicited message will be
    reported to your Internet service providers.
    

