Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Andy Heninger
Date: Thu Aug 11 2005 - 13:04:12 CDT

    It's more on language detection, and less on charset detection, but there is
    an interesting paper from IBM Research:

    Linguini: Language Identification for Multilingual Documents
    John M. Prager

    Some of the ideas from this ended up in the charset detection that was just
    added to the Java version of the ICU library.

    We are just now starting to look at doing a C version of that charset
    detection. If anyone would like to weigh in with opinions on how the API
    should look, the icu-design mail list is the place to do it.

    -- Andy Heninger

    On 8/11/05, Patrick Andries wrote:
    > You could present a page in an unknown language and character set and it
    > would guess both for you.
    > The trick is simply to train a hidden Markov model (modèle markovien
    > caché) on a large corpus of content tagged for both variables.
    > Incidentally, this probabilistic model, given enough documents, will
    > automatically detect the most common sequences of n consecutive bytes
    > (n = 2, 3, 4 as you wish) for a given pair <language, character set> as
    > one of its results (one should thus find "die, der, das" having a high
    > probability for <de, latin-1>, for instance, but "est, les, lui" for
    > <fr, latin-1>). Detecting the language and encoding is then "simply" a
    > matter of calculating the [compound] relative probability of a given
    > passage under each pair <language, character set> and choosing the pair
    > with the highest probability.
    > Used for <>
    > P. A.
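
The scoring step described in the quoted message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain byte-trigram frequency model with add-one smoothing rather than a full hidden Markov model, it is not the ICU detector, and the function names (`byte_ngrams`, `train`, `detect`) and the tiny `CORPUS` are hypothetical.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3):
    """Yield every run of n consecutive bytes in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def train(corpus, n: int = 3):
    """Count byte n-grams for each (language, charset) pair.

    corpus maps (language, charset) -> list of sample strings; each
    sample is encoded with that charset before counting.
    """
    models = {}
    for (lang, charset), texts in corpus.items():
        counts = Counter()
        for text in texts:
            counts.update(byte_ngrams(text.encode(charset), n))
        models[(lang, charset)] = counts
    return models


def detect(data: bytes, models, n: int = 3):
    """Return the (language, charset) pair with the highest log-probability."""
    # Shared vocabulary size so smoothed scores are comparable across models.
    vocab = len(set().union(*models.values())) + 1
    scores = {}
    for pair, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing: unseen n-grams get a small, non-zero probability.
        scores[pair] = sum(
            math.log((counts[gram] + 1) / (total + vocab))
            for gram in byte_ngrams(data, n)
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus; a real one would be far larger.
CORPUS = {
    ("de", "latin-1"): ["die der das und ist nicht der die das ein eine"],
    ("fr", "latin-1"): ["est les lui et la le des une est les que"],
}

models = train(CORPUS)
print(detect("das ist der hund".encode("latin-1"), models))
```

With samples this small the two models differ only in which ASCII trigrams they have seen; telling charsets apart would need training text containing non-ASCII characters encoded in each candidate charset.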

    This archive was generated by hypermail 2.1.5 : Thu Aug 11 2005 - 13:05:52 CDT