Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Tom Emerson (tree@basistech.com)
Date: Wed Aug 10 2005 - 13:36:12 CDT

    Philippe Verdy writes:
    > Note that the statistics also depend on the language actually used. The
    > statistics for English will be quite different with Italian or French, and
    > in some cases it will be hard to decide between ISO-8859-1 and ISO-8859-2
    > for some Nordic or Baltic languages.

    I've found that English is the "Great Corrupter" when it comes to
    training these things: not only are English words found everywhere,
    but English has borrowed (or had the "borrowing" thrust upon it, not
    that I'm bitter or anything ;-) so much from Germanic and Romance
    languages over the last 1000 years that English can be easily confused
    with French, Italian, or Dutch. Again, in my experience.

    > - some webservers are labelling all pages with ISO-8859-1 even though it is
    > another encoding or a UTF. Encoding exceptions are detected by the fact that
    > HTML does not allow using some controls (but Internet Explorer silently
    > accepts C1 controls in ISO-8859-1 as if they were in fact valid Windows-1252
    > characters)

    This happens *all* the time. I constantly encounter pages that are
    labeled as ISO-8859-1 (actually usually CP1252) and indeed, if you
    just look at the byte values, are valid Latin 1 (or even just
    US-ASCII). However, the content is encoded in HTML escapes, and is
    actually Arabic or Persian. Hence you have to do the detection in a
    couple of steps, since the presence of these entities (remember, an
    X?HTML page can include any character regardless of the declared
    "primary" encoding) opens up all of Unicode. A heuristic along the
    lines of: "If the page says it is (or detects as) Latin1 (or some
    form), and it has some largish number of contiguous HTML entities,
    transcode the whole thing into Unicode with the SGML entities
    expanded, then run your language id again." This assumes, of course,
    that you are interested in identifying the language: doing this is almost
    necessary if you want to differentiate the ISO-8859-n versions.
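
    A minimal sketch of that heuristic in Python, for illustration only.
    The threshold MIN_RUN and the function name expand_for_language_id
    are assumptions, not anything from a real product; "largish" is left
    for you to tune. Only the standard library (html.unescape) is used to
    expand the character references.

```python
import html
import re

# Hypothetical threshold: how many contiguous character references
# count as a "largish number". Tune against real pages.
MIN_RUN = 8

# One numeric (decimal or hex) or named character reference.
ENTITY = r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);"

# A run of at least MIN_RUN references, allowing whitespace between them.
ENTITY_RUN = re.compile(r"(?:%s\s*){%d,}" % (ENTITY, MIN_RUN))


def expand_for_language_id(raw, declared="iso-8859-1"):
    """Decode a page that claims (or detects as) Latin 1 and, if it
    contains a long run of character references, expand them so that
    language identification sees the real text."""
    text = raw.decode(declared, errors="replace")
    if ENTITY_RUN.search(text):
        # The bytes were Latin 1 / ASCII, but the content is not:
        # expand the references into actual Unicode characters.
        return html.unescape(text)
    return text


if __name__ == "__main__":
    page = b"<p>&#1587;&#1604;&#1575;&#1605; &#1583;&#1606;&#1740;&#1575;</p>"
    print(expand_for_language_id(page))
```

    The returned string is what you would hand to the language
    identifier on the second pass. Pages with only the odd &amp; or
    &nbsp; fall below the threshold and pass through unchanged.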

    > - and some servers are labelling all with UTF-8 despite the texts are
    > encoded with ISO-8859-1 (Exceptions occur when the UTF-8 encoding
    > requirements are not respected within the document body, so if there is no
    > leading BOM, IE tries to guess an alternate charset or displays square
    > boxes, depending on user preferences or manual selection in the browser).

    I've also seen misconfigured Apache 2 servers sending HTTP response
    headers with a different encoding than that specified in the page,
    usually to the detriment of all involved.
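
    One way to catch both kinds of mislabeling is to compare the charset
    in the HTTP Content-Type header with the one in the page's meta tag,
    and to check whether the body is well-formed UTF-8 at all. The sketch
    below is an assumption about how such a check might look; check_labels
    and its helpers are hypothetical names, and the meta regex is
    deliberately simplistic rather than a full HTML parser.

```python
import codecs
import re

HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)",
                            re.IGNORECASE)
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)",
                          re.IGNORECASE)


def normalize(label):
    """Map aliases (latin1, ISO-8859-1, ...) to one codec name."""
    if label is None:
        return None
    try:
        return codecs.lookup(label).name
    except LookupError:
        return label.lower()  # unknown to Python; compare as written


def header_charset(content_type):
    m = HEADER_CHARSET.search(content_type or "")
    return normalize(m.group(1)) if m else None


def meta_charset(body):
    # Only look near the top of the document, where the tag belongs.
    m = META_CHARSET.search(body[:4096])
    return normalize(m.group(1).decode("ascii")) if m else None


def is_valid_utf8(body):
    try:
        body.decode("utf-8")  # strict: raises on malformed sequences
        return True
    except UnicodeDecodeError:
        return False


def check_labels(content_type, body):
    h = header_charset(content_type)
    m = meta_charset(body)
    return {
        "header": h,
        "meta": m,
        "conflict": h is not None and m is not None and h != m,
        "valid_utf8": is_valid_utf8(body),
    }
```

    A page served as UTF-8 whose body fails the strict decode is almost
    certainly mislabeled, and a header/meta conflict tells you that at
    least one of the two labels cannot be trusted.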

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
     "You can't fake quality any more than you can fake a good meal." (W.S.B.)
    

