Re: encoding sniffing

From: Philippe Verdy
Date: Mon Jul 14 2003 - 20:03:50 EDT


    On Tuesday, July 15, 2003 12:51 AM, Patrick Andries <> wrote:

    > ----- Original Message -----
    > From: "Philippe Verdy" <>
    > > On Monday, July 14, 2003 11:42 PM, Patrick Andries
    > > <> wrote:
    > >
    > > > In any case, I believe Peter has an idea how these libraries work
    > > > and their limitations, he is rather looking for one with its
    > > > limitations.
    > >
    > > Including the Chinese limitations? It will become tricky when
    > > dealing with traditional or scientific texts with many rare
    > > ideographs, because it's difficult to create an exhaustive
    > > morphological analysis for Chinese,
    > This product does no morphological analysis but uses a hidden Markov
    > Model. Did you try it ? (I just checked with
    > gave me Chinese, Big-5).

    I could find a few technical plain-text documents that are obviously
    English ASCII and that were nonetheless identified as Chinese GBK. These
    include some technical pages that I have (such as abuse analysis
    pages that are also sometimes interpreted as Chinese by Internet Explorer,
    and some Unicode text tables that contain pure US English ASCII
    among other numeric data).
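    Since pure-ASCII text is valid in almost every legacy encoding, a statistical
    sniffer has little to discriminate on, which is one way such misdetections
    happen. A cheap guard (a minimal sketch, not any particular product's
    method) is to short-circuit before invoking a detector at all:

```python
def looks_pure_ascii(data: bytes) -> bool:
    """Return True when every byte is in the 7-bit ASCII range.

    Pure-ASCII input decodes identically under ASCII, UTF-8, Latin-1,
    GBK, Big-5, etc., so skipping statistical detection here avoids
    spurious answers such as "Chinese GBK" for plain English tables.
    """
    return all(b < 0x80 for b in data)

sample = b"U+0041;LATIN CAPITAL LETTER A;Lu;0;L;\n"
assert looks_pure_ascii(sample)                    # decode as ASCII, done
assert not looks_pure_ascii("café".encode("utf-8"))  # needs real detection
```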

    I admit that such errors occur mostly on very technical documents, but
    technical documents that need correct identification of their encoding
    also include database table exports in flat files, in which it is sometimes
    hard for a human to spot an incorrect encoding identification when the file
    contains lots of digits and separators, or people's names and addresses,
    with a large majority of lines using a familiar script or orthography.

    For this reason, I am now used to importing data from multiple files into a
    joint database and adding a tracking import id, which allows discovering later
    whether some batches require a special full export and re-encoding: there's
    nothing worse than feeding a database with data incorrectly interpreted from
    multiple encodings, notably when the database is also very active and is used
    in parallel with internal applications whose encoding is well controlled. I
    know that there are some products that can identify a language/encoding
    pair on a fragment of a file or on a subselection of a database, but they
    are expensive. Using a human to review each record in a large database
    is also costly, slow and error-prone.
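    The batch-tagging idea above can be sketched as follows (a minimal
    illustration with a hypothetical table and column names, using SQLite;
    the same pattern applies to any database): every imported row carries
    the id of the batch it came from, so a batch later found to have been
    decoded with the wrong encoding can be pulled back out and re-encoded
    in isolation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        id        INTEGER PRIMARY KEY,
        import_id TEXT NOT NULL,   -- which source file/batch the row came from
        payload   TEXT NOT NULL
    )
""")

def import_batch(conn, import_id, lines):
    # Tag every row with its batch id so a batch whose encoding was
    # guessed wrong can be re-exported and re-encoded later.
    conn.executemany(
        "INSERT INTO records (import_id, payload) VALUES (?, ?)",
        [(import_id, line) for line in lines],
    )

import_batch(conn, "fileA", ["alpha", "beta"])
import_batch(conn, "fileB", ["gamma"])

# Later: extract only the suspect batch for re-encoding, leaving the
# rest of the (actively used) database untouched.
suspect = [row[0] for row in conn.execute(
    "SELECT payload FROM records WHERE import_id = ?", ("fileB",))]
assert suspect == ["gamma"]
```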

    This archive was generated by hypermail 2.1.5 : Mon Jul 14 2003 - 20:53:12 EDT