Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 11 2005 - 07:36:10 CDT

    From: "Theo Veenker" <Theo.Veenker@let.uu.nl>
    > Did you check this one, it is a java port of mozilla's automatic charset
    > detection algorithm. The original C++ sources are provided as well.
    >
    > http://www.i18nfaq.com/chardet.html

    Not a bad resource, but it only addresses the autodetection of East-Asian
    charsets. There's nothing to help with the autodetection of European
    charsets (notably all those in ISO-8859-*, even if we exclude the Windows
    charsets which are extensions of these ISO charsets).

    Also missing is the detection of Vietnamese VISCII, and Russian/Ukrainian
    charsets which are more common than ISO-8859 Cyrillic.

    Add to this the need to detect legacy MacOS charsets and DOS/OEM codepages.

    Is there some project in Mozilla to add support for them? This would require
    adding accurate statistics for common European languages (notably French
    and Spanish, whose texts are sometimes incorrectly detected as using Asian
    charsets).
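
To illustrate what such per-language statistics could look like for single-byte charsets, here is a minimal Python sketch. It is not the Mozilla algorithm: the names EXPECTED, CANDIDATES, score and guess are hypothetical, and the hand-picked letter sets are a crude stand-in for frequency tables that a real detector would measure on a large corpus.

```python
# Hypothetical, hand-picked lowercase letters per language (an assumption
# made for illustration; real detectors use measured frequency tables).
# Lowercase only: running text is mostly lowercase, which helps to tell
# apart charsets that map the same bytes to the other letter case.
EXPECTED = {
    "French": set("àâçéèêëîïôùûüœ"),
    "Spanish": set("áéíóúñü¿¡"),
    "Russian": set("абвгдежзийклмнопрстуфхцчшщъыьэюя"),
}

# Candidate single-byte charsets, using Python's standard codec names.
CANDIDATES = ["iso-8859-1", "iso-8859-5", "koi8-r", "cp1251",
              "mac_roman", "cp850"]


def score(text, expected):
    """Fraction of non-ASCII characters that belong to the expected set."""
    high = [ch for ch in text if ord(ch) > 0x7F]
    if not high:
        return 0.0
    return sum(ch in expected for ch in high) / len(high)


def guess(data):
    """Return (score, charset, language) for the best-scoring pair.

    Returns (0.0, None, None) when nothing scores above zero,
    for example on pure ASCII input.
    """
    best = (0.0, None, None)
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this charset at all
        for lang, expected in EXPECTED.items():
            s = score(text, expected)
            if s > best[0]:  # ties keep the earlier candidate
                best = (s, enc, lang)
    return best
```

Calling guess("Привет, как дела".encode("koi8-r")) picks koi8-r over cp1251 and ISO-8859-5 only because the decoded text is mostly lowercase Cyrillic under that charset. On short or atypical input this kind of scoring is easily fooled, which is why measured statistics matter.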

    But now if you consider the subject of this thread, there's absolutely
    nothing there for the Arabic script...


