Re: encoding sniffing

From: Philippe Verdy
Date: Mon Jul 14 2003 - 20:03:50 EDT


    On Tuesday, July 15, 2003 12:51 AM, Patrick Andries <> wrote:

    > ----- Original Message -----
    > From: "Philippe Verdy" <>
    > > On Monday, July 14, 2003 11:42 PM, Patrick Andries
    > > <> wrote:
    > >
    > > > In any case, I believe Peter has an idea how these libraries work
    > > > and their limitations, he is rather looking for one with its
    > > > limitations.
    > >
    > > Including the Chinese limitations? It will become tricky when
    > > dealing with traditional or scientific texts with many rare
    > > ideographs, because it's difficult to create an exhaustive
    > > morphological analysis for Chinese,
    > This product does no morphological analysis but uses a hidden Markov
    > Model. Did you try it ? (I just checked with
    > gave me Chinese, Big-5).

    I could find a few technical plain-text documents that are obviously
    English ASCII and that were nonetheless identified as Chinese GBK. These
    include some technical pages that I have (such as abuse analysis
    pages that are also sometimes interpreted as Chinese by Internet Explorer,
    and some Unicode text tables that contain pure US English ASCII
    among other numeric data).
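    Since pure-ASCII text is valid in almost every legacy encoding, a statistical
    sniffer has little to discriminate on, which is one way such misdetections
    happen. A cheap guard (a minimal sketch, not any particular product's
    method) is to short-circuit before invoking a detector at all:

```python
def looks_pure_ascii(data: bytes) -> bool:
    """Return True when every byte is in the 7-bit ASCII range.

    Pure-ASCII input decodes identically under ASCII, UTF-8, Latin-1,
    GBK, Big-5, etc., so skipping statistical detection here avoids
    spurious answers such as "Chinese GBK" for plain English tables.
    """
    return all(b < 0x80 for b in data)

sample = b"U+0041;LATIN CAPITAL LETTER A;Lu;0;L;\n"
assert looks_pure_ascii(sample)                    # decode as ASCII, done
assert not looks_pure_ascii("café".encode("utf-8"))  # needs real detection
```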

    I admit that such errors occur mostly on very technical documents, but
    technical documents that need correct identification of their encoding
    also include database table exports in flat files, in which it is sometimes
    hard for a human to spot an incorrect encoding identification when the file
    contains lots of digits and separators, or people's names and addresses,
    with a large majority of lines using a familiar script or orthography.

    For this reason, I am now used to importing data from multiple files into a
    joint database and adding a tracking import id, which allows discovering later
    whether some batches require a special full export and re-encoding: there's
    nothing worse than feeding a database with data incorrectly interpreted from
    multiple encodings, notably when the database is also very active and is used
    in parallel with internal applications whose encoding is well controlled. I
    know that there are some products that can identify a language/encoding
    pair on a fragment of a file or on a subselection of a database, but they
    are expensive. Using a human to review each record in a large database
    is also costly, slow and error-prone.
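    The batch-tagging idea above can be sketched as follows (a minimal
    illustration with a hypothetical table and column names, using SQLite;
    the same pattern applies to any database): every imported row carries
    the id of the batch it came from, so a batch later found to have been
    decoded with the wrong encoding can be pulled back out and re-encoded
    in isolation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        id        INTEGER PRIMARY KEY,
        import_id TEXT NOT NULL,   -- which source file/batch the row came from
        payload   TEXT NOT NULL
    )
""")

def import_batch(conn, import_id, lines):
    # Tag every row with its batch id so a batch whose encoding was
    # guessed wrong can be re-exported and re-encoded later.
    conn.executemany(
        "INSERT INTO records (import_id, payload) VALUES (?, ?)",
        [(import_id, line) for line in lines],
    )

import_batch(conn, "fileA", ["alpha", "beta"])
import_batch(conn, "fileB", ["gamma"])

# Later: extract only the suspect batch for re-encoding, leaving the
# rest of the (actively used) database untouched.
suspect = [row[0] for row in conn.execute(
    "SELECT payload FROM records WHERE import_id = ?", ("fileB",))]
assert suspect == ["gamma"]
```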

    This archive was generated by hypermail 2.1.5 : Mon Jul 14 2003 - 20:53:12 EDT