Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Tom Emerson (tree@basistech.com)
Date: Thu Aug 11 2005 - 08:30:32 CDT


    Philippe Verdy writes:
    > For the case of Arabic, the first indicator is effectively the alphabet, but
    > I think that there are similar usage patterns that help distinguish
    > between Arabic and Urdu. Anyway, the various encodings used for the Arabic
    > script will be easily determined by letter occurrence statistics.

    Indeed: I wrote a detector for Arabic encodings that did exactly this.
    It could differentiate between ISO-8859-6, Windows CP1256, the
    Unicode transformation formats, and ASMO-449. In this particular
    application it was known a priori that the text was Arabic, just not
    the encoding.
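
To make the idea concrete, here is a minimal sketch of frequency-based encoding detection in Python. It is not the detector described above, only an illustration under stated assumptions: the function names, the toy reference sentence, and the choice of cosine similarity are all mine, and ASMO-449 is left out because Python's standard library ships no codec for it. Each candidate encoding is used to decode the bytes, and the decoding whose letter counts look most like known Arabic text wins.

```python
import math
from collections import Counter

# Toy reference sample (an assumption of this sketch); a real detector
# would build its letter profile from a large Arabic corpus.
REFERENCE_TEXT = "اللغة العربية هي لغة جميلة يتحدث بها الناس في بلاد كثيرة"

# Python codec names. ASMO-449 is omitted: the standard library has no codec for it.
CANDIDATES = ["utf_8", "utf_16", "iso8859_6", "cp1256"]


def letter_profile(text):
    """Count every non-whitespace character in the text."""
    return Counter(ch for ch in text if not ch.isspace())


def cosine(a, b):
    """Cosine similarity between two Counter-based frequency vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


REFERENCE_PROFILE = letter_profile(REFERENCE_TEXT)


def detect_arabic_encoding(data):
    """Return the candidate codec whose decoding best matches the Arabic
    letter profile, or None if no candidate produces any match."""
    best_name, best_score = None, 0.0
    for name in CANDIDATES:
        try:
            decoded = data.decode(name)  # strict: invalid bytes rule a codec out
        except UnicodeDecodeError:
            continue
        score = cosine(letter_profile(decoded), REFERENCE_PROFILE)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    sample = "هذا نص عربي قصير لاختبار كشف الترميز"
    print(detect_arabic_encoding(sample.encode("cp1256")))
```

The strict decode does a lot of the work here: bytes that are undefined in ISO-8859-6 or malformed as UTF-8 eliminate those candidates outright, and the frequency score only has to separate the decodings that survive.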

    I've found that unigram frequencies are usually enough to
    differentiate Arabic from Persian, and bigram frequencies enough to
    differentiate Arabic, Persian, Urdu, Pashto, and Kurdish when using an
    encoding that supports all of the writing systems. I have not looked
    at Uighur, though I expect bigrams will be enough there as well.
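
The n-gram comparison can be sketched along the same lines. The following Python example is only an illustration, not the method used in practice: the two training sentences are toy samples I made up, the function names are hypothetical, and cosine similarity is one of several reasonable ways to compare profiles. It builds a character n-gram profile per language and picks the language whose profile is closest to the input. Setting n=1 gives unigram profiles and n=2 gives bigram profiles.

```python
import math
from collections import Counter

# Toy training samples invented for this sketch; usable profiles
# would need a large corpus for each language.
TRAINING = {
    "arabic": "اللغة العربية هي لغة جميلة يتحدث بها الناس في بلاد كثيرة",
    "persian": "زبان فارسی یکی از زبان های ایران است و مردم زیادی به آن سخن می گویند",
}


def ngram_profile(text, n):
    """Count overlapping character n-grams, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two Counter-based frequency vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_profiles(samples, n=2):
    """Build one n-gram profile per language from the training samples."""
    return {lang: ngram_profile(text, n) for lang, text in samples.items()}


def identify_language(text, profiles, n=2):
    """Return the language whose profile is most similar to the text.
    The value of n must match the one used to build the profiles."""
    target = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))


PROFILES = build_profiles(TRAINING, n=2)

if __name__ == "__main__":
    print(identify_language("این یک متن کوتاه فارسی است", PROFILES))
```

Keeping the spaces in the n-grams lets word-initial and word-final letters contribute to the profile, which helps separate languages that share most of their alphabet.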

    One problem I've experienced with Urdu is the large number of
    font-specific encodings that are out there: historically, few pages
    have used Unicode, opting instead for a custom font and a unique
    encoding that may include presentation forms. This is where
    metadata (declared lang attributes, font names, or URL
    information) becomes absolutely necessary to narrow down the range
    of possible encodings.

    Peace,

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
     "You can't fake quality any more than you can fake a good meal." (W.S.B.)
    

