From: Jony Rosenne (email@example.com)
Date: Thu Aug 11 2005 - 15:02:51 CDT
I found that simple digram analysis is sufficient to distinguish between
English and several encodings of Hebrew.
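
A minimal sketch of what such a digram analysis might look like, not the actual code referred to above. It counts adjacent byte pairs in a sample for each candidate and scores unknown data against those counts. The function names, the tiny training strings, and the choice of UTF-8, cp1255 and cp862 as the Hebrew encodings are all illustrative assumptions; a real detector would be trained on far larger samples.

```python
import math
from collections import Counter

def digram_profile(data: bytes) -> Counter:
    """Count adjacent byte pairs (digrams) in raw, undecoded data."""
    return Counter(zip(data, data[1:]))

def score(data: bytes, profile: Counter) -> float:
    """Mean log-probability of the data's digrams under a profile.

    Add-one smoothing over all 256*256 possible byte pairs keeps
    unseen digrams from producing log(0).
    """
    pairs = list(zip(data, data[1:]))
    if not pairs:
        return float("-inf")
    total = sum(profile.values()) + 256 * 256
    return sum(math.log((profile[p] + 1) / total) for p in pairs) / len(pairs)

def classify(data: bytes, profiles: dict) -> str:
    """Return the label whose digram profile scores the data highest."""
    return max(profiles, key=lambda label: score(data, profiles[label]))

# Hypothetical, deliberately tiny training samples (assumptions for the demo).
ENGLISH = "the quick brown fox jumps over the lazy dog and then the other thing"
HEBREW = "שלום עולם זהו טקסט לדוגמה בעברית"

profiles = {"English/ASCII": digram_profile(ENGLISH.encode("ascii"))}
for enc in ("utf-8", "cp1255", "cp862"):
    profiles["Hebrew/" + enc] = digram_profile(HEBREW.encode(enc))

print(classify("the other dog".encode("ascii"), profiles))
print(classify("שלום לכולם".encode("cp862"), profiles))
```

Working on raw bytes rather than decoded characters is the point of the design: the same Hebrew text yields different byte pairs in each encoding, so one pass identifies language and encoding together. Encodings that place the Hebrew letters at the same byte values (for example cp1255 and ISO-8859-8 on unpointed text) cannot be separated this way.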
> -----Original Message-----
> From: firstname.lastname@example.org
> [mailto:email@example.com] On Behalf Of Tom Emerson
> Sent: Thursday, August 11, 2005 3:31 PM
> To: Philippe Verdy
> Cc: Doug Ewell; Unicode Mailing List
> Subject: Re: Cp1256 (Windows Arabic) Characters not supported by UTF8
> Philippe Verdy writes:
> > For the case of Arabic, the first indicator is effectively the
> > alphabet, but I think that there are similar usage patterns that help
> > distinguish between Arabic and Urdu. Anyway, the various encodings
> > used for the Arabic script will be easily determined by
> > letter-occurrence statistics.
> Indeed: I wrote a detector for Arabic encodings that did exactly this,
> in that it could differentiate between ISO-8859-6, Windows CP1256,
> Unicode transformation formats, and ASMO-449. In this particular
> application it was known a priori that the text was Arabic, just not
> the encoding.
> I've found that unigram frequencies are usually enough to
> differentiate Arabic from Persian, and bigram frequencies enough to
> differentiate Arabic, Persian, Urdu, Pashto, and Kurdish when using an
> encoding that supports all of the writing systems. I have not looked
> at Uighur, though I expect bigrams will be enough there as well.
> One problem I've experienced with Urdu is the large number of
> font-specific encodings that are out there: historically, few pages
> have used Unicode, opting instead for a custom font and a unique
> encoding that may include presentation forms. In such cases, using
> metadata (declared lang attributes, font names, or URL information)
> is absolutely necessary to narrow down the possible range of
> encodings.
> Tom Emerson
> Software Architect
> Basis Technology Corp.
> "You can't fake quality any more than you can fake a good
> meal." (W.S.B.)