Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Patrick Andries (
Date: Thu Aug 11 2005 - 09:56:14 CDT

    Tom Emerson wrote:

    >Indeed: I wrote a detector for Arabic encodings that did exactly this,
    >in that it could differentiate between ISO-8859-6, Windows CP1256,
    >Unicode transformation formats, and ASMO-449. In this particular
    >application it was known a priori that the text was Arabic, just not
    >the encoding.

    About 8 years ago we had a tool that did this for many languages and
    character sets.

    You could present a page in an unknown language and character set and it
    would guess both for you.

    The trick is simply to train a hidden Markov model (HMM) on a large
    corpus of content tagged for both variables (language and character
    set). Incidentally, given enough documents, this probabilistic model
    will automatically detect, as one of its results, the most common
    sequences of n consecutive bytes (n = 2, 3, 4, as you wish) for a given
    <language, character set> pair. One should thus find "die, der, das"
    with a high probability for <de, latin-1>, for instance, but "est, les,
    lui" for <fr, latin-1>. Detecting the language and encoding is then
    "simply" a matter of calculating the [compound] relative probability of
    a given passage under each <language, character set> pair and choosing
    the pair with the highest probability.
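
To make the scoring step concrete, here is a minimal sketch in Python. It is not the tool described above. It swaps the hidden Markov model for a simpler byte-level Markov chain (an n-gram model with add-one smoothing), which is enough to show the "score the passage under every pair and pick the best" idea. The class name `ByteNgramModel`, the helper `detect`, and the tiny training strings are all hypothetical and chosen for illustration only. A real detector would need a large tagged corpus per pair.

```python
import math
from collections import Counter


class ByteNgramModel:
    """Markov chain over bytes: P(next byte | previous n-1 bytes).

    Illustrative stand-in for the HMM mentioned in the message.
    """

    def __init__(self, n: int = 3):
        self.n = n
        self.ngram_counts = Counter()    # counts of each n-byte sequence
        self.context_counts = Counter()  # counts of each (n-1)-byte prefix

    def train(self, data: bytes) -> None:
        for i in range(len(data) - self.n + 1):
            gram = data[i:i + self.n]
            self.ngram_counts[gram] += 1
            self.context_counts[gram[:-1]] += 1

    def log_prob(self, data: bytes) -> float:
        """Compound log-probability of a passage (add-one smoothing)."""
        total = 0.0
        for i in range(len(data) - self.n + 1):
            gram = data[i:i + self.n]
            seen = self.ngram_counts[gram]
            ctx = self.context_counts[gram[:-1]]
            # 256 possible next bytes, each given one pseudo-count
            total += math.log((seen + 1) / (ctx + 256))
        return total


def detect(passage: bytes, models: dict) -> tuple:
    """Return the (language, charset) pair whose model scores highest."""
    return max(models, key=lambda pair: models[pair].log_prob(passage))


# Toy tagged corpus (assumption: far too small for real use).
german = "der Hund und das Haus, die Katze und die Maus"
french = "les enfants sont avec lui, l'été est très beau"
corpus = {
    ("de", "latin-1"): german.encode("latin-1"),
    ("fr", "latin-1"): french.encode("latin-1"),
    ("fr", "utf-8"): french.encode("utf-8"),
}

models = {}
for pair, data in corpus.items():
    model = ByteNgramModel(n=3)
    model.train(data)
    models[pair] = model

if __name__ == "__main__":
    sample = "l'été est très beau".encode("utf-8")
    print(detect(sample, models))
    # The most frequent byte trigrams fall out of training as a by-product.
    print(models[("de", "latin-1")].ngram_counts.most_common(3))
```

Pure-ASCII passages cannot separate latin-1 from utf-8 in this sketch, because both encodings produce identical bytes for them. The pairs only diverge once accented characters appear.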

    Used for <>

    P. A.

    This archive was generated by hypermail 2.1.5 : Thu Aug 11 2005 - 09:57:25 CDT