Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 10 2005 - 14:48:59 CDT


    From: "Tom Emerson" <tree@basistech.com>:
    > Philippe Verdy writes:
    >> This is absolutely not needed for a charset detector (i.e. the detection
    >> of the encoding used to serialize the text). HTML escapes are perfectly
    >> valid in HTML, and even if they refer to non-Latin-1 characters, this
    >> does not change the fact that the page remains encoded in ISO-8859-1.
    >>
    >> You don't need to take HTML escapes into account with regard to which
    >> encoding is used, because these escapes are independent of the actual
    >> encoding used.
    >
    > Agreed. But if you are interested in the language of the page as well
    > as the encoding, which some applications do care about, then you have
    > to take these into account. And, as I said, building a model that
    > accounts for language as well as encoding can help differentiate the
    > various Latin-n versions.

    OK, but then this is no longer just a text encoding detector: it means that
    you have to build a list of candidate charsets that pass at the plain-text
    level, then try to parse the text with an HTML parser to filter out the
    parts that should not count in the statistics:
    - the document type declaration and its inline DTD if any
    - processing instructions
    - the HTML comments
    - the syntactic HTML tag delimiters < = / > and quotes around attribute
    values
    - the element and attribute names
    - the spaces around block elements, and within the opening tags around the
    attributes
    - most attribute values, notably those of enumerated, ID or name attributes
    (but not all, as there are localizable CDATA attribute values)
    - a few text elements with specific syntax (for example the content of
    <script> and <style> elements) which are not considered renderable
    plain text.
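
    The filtering described in the list above can be sketched with Python's
    standard html.parser module. This is only a minimal illustration, not a
    complete filter: the names RenderableText, renderable_text,
    LOCALIZABLE_ATTRS and SKIPPED_ELEMENTS are hypothetical, and the choice of
    "alt", "title" and "summary" as localizable attributes is an assumption.

```python
from html.parser import HTMLParser

# Assumption: attributes whose values are localizable text (illustrative list).
LOCALIZABLE_ATTRS = {"alt", "title", "summary"}
# Elements whose content is not renderable plain text.
SKIPPED_ELEMENTS = {"script", "style"}


class RenderableText(HTMLParser):
    """Collect only the text that should count in the statistics."""

    def __init__(self):
        # convert_charrefs=True converts character references to characters
        # before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_ELEMENTS:
            self._skip_depth += 1
        # Element and attribute names are dropped; only localizable
        # attribute values are kept.
        for name, value in attrs:
            if name in LOCALIZABLE_ATTRS and value:
                self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in SKIPPED_ELEMENTS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style content and whitespace-only runs.
        if not self._skip_depth and data.strip():
            self.chunks.append(" ".join(data.split()))

    # Comments, the document type declaration and processing instructions
    # go to handle_comment / handle_decl / handle_pi, which do nothing by
    # default, so they never reach the collected text.


def renderable_text(html_source):
    parser = RenderableText()
    parser.feed(html_source)
    parser.close()
    return " ".join(parser.chunks)
```

    Calling renderable_text() on a page returns a single string holding the
    text nodes and the kept attribute values, which can then be passed to
    whatever statistics are used for language detection.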

    Once this is done, you can use the *parsed* text elements and attributes
    (where character entities like "&#x0380;" have been converted to their
    plain-text equivalents) to feed a statistics counter if you are trying to
    detect the language.
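
    As a small sketch of that step, assuming the text has already been
    extracted from the markup: html.unescape() from the Python standard
    library converts character references, and a Counter holds the per-letter
    statistics. The function name character_statistics and the sample string
    are made up for the illustration.

```python
import html
from collections import Counter


def character_statistics(parsed_text):
    """Count letters only, ignoring case, digits, spaces and punctuation."""
    return Counter(ch for ch in parsed_text.lower() if ch.isalpha())


# Character references are converted first, so the counter sees the
# characters they stand for rather than '&', '#', 'x' and digits.
raw = "caf&eacute; &#x3B1;&#x3B2;&#x3B1;"
text = html.unescape(raw)
stats = character_statistics(text)
```

    Counting before the conversion would inflate the ASCII statistics and
    hide exactly the non-Latin-1 characters that matter for the language.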

    You'll also have to consider the case where some or all of these text
    elements and attributes are already marked with a language indicator. In
    that case, the language autodetection should ignore them, and the character
    statistics should instead be computed separately for each indicated
    language.

    This means that you'll end up with several statistics vectors, one for each
    explicit language, plus one for the unspecified language (note that the
    document headers or HTTP headers may include their own language indicator;
    however, this indication is notoriously incorrect, especially in the HTTP
    headers, because it is often generated within common headers or page
    templates for a whole site, even if the HTML page uses another language).
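
    A rough sketch of keeping one vector per explicit language could look
    like the following. It assumes reasonably well-formed markup (every
    non-void element is closed), tracks only the "lang" attribute, and leaves
    out the script/style filtering for brevity. PerLanguageStats,
    language_vectors and the VOID_ELEMENTS list are hypothetical names chosen
    for this example; the key None stands for the unspecified language.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Assumption: elements that never have an end tag (partial list).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class PerLanguageStats(HTMLParser):
    """Keep one letter-frequency vector per declared language."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stats = defaultdict(Counter)  # key None = unspecified language
        self._lang_stack = [None]

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        # An element without its own lang attribute inherits the parent's.
        lang = dict(attrs).get("lang") or self._lang_stack[-1]
        self._lang_stack.append(lang)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and len(self._lang_stack) > 1:
            self._lang_stack.pop()

    def handle_data(self, data):
        current = self._lang_stack[-1]
        self.stats[current].update(ch for ch in data.lower() if ch.isalpha())


def language_vectors(html_source):
    parser = PerLanguageStats()
    parser.feed(html_source)
    parser.close()
    return dict(parser.stats)
```

    Only the vector stored under None would then be handed to the language
    autodetection; the others are already labelled by the author.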

    All of the above is specific to HTML. But there are other points to
    consider, which also apply to plain-text documents without markup:

    The other problem is that most authored pages fail to explicitly label the
    foreign language used in small spans of text. These spans can be very
    frequent, especially within technical documents (like a JavaDoc page, or a
    document discussing some standards, with lots of acronyms or
    untranslated terms).

    To detect a language, you could also try searching for very common terms
    like "the", "is", "are", "have", "and" in English, "le", "un", "a", "à",
    "est", "et" in French, "der", "das", "ist" in German. These general terms
    are exactly those that are generally ignored by search engines because of
    their frequency in each language. I could have taken examples other than
    these 3 languages, which generally use the same ISO-8859-1 charset, but the
    frequency of such terms in a text indicates that the document is probably
    not encoded in ISO-8859-2 or -4 or -15.
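
    The common-terms idea can be tried with a few lines of Python. This is
    only a toy sketch: the word lists are the ones quoted above, the names
    COMMON_WORDS and guess_language are invented for the example, and a
    usable detector would need much longer lists and some handling of ties
    and of very short texts.

```python
import re
from collections import Counter

# Word lists taken from the examples above; far too short for real use.
COMMON_WORDS = {
    "en": {"the", "is", "are", "have", "and"},
    "fr": {"le", "un", "a", "à", "est", "et"},
    "de": {"der", "das", "ist"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = Counter()
    for word in words:
        for lang, common in COMMON_WORDS.items():
            if word in common:
                scores[lang] += 1
    if not scores:
        return None
    return scores.most_common(1)[0][0]
```

    The text has to be decoded with one of the candidate charsets before it
    is split into words, so a candidate that yields many hits for one
    language gains credibility at the same time.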


