Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Tom Emerson (tree@basistech.com)
Date: Wed Aug 10 2005 - 15:05:21 CDT

    Philippe Verdy writes:
    > OK but this is not a text encoding decoder: this means that you have to
    > build a list of candidate charsets that pass at the plain-text level, then
    > try to parse the text using an HTML parser to filter out parts that should
    > not count in statistics:
    [...]

    Yes, yes, yes, we've done all of this, and more, for HTML and numerous
    other markup languages and document formats: I'm speaking from
    experience here, not just spouting off random ideas.
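
    The two steps Philippe describes can be sketched roughly as below.
    This is a minimal illustration, not our implementation: the candidate
    list, plausible_charsets(), TextOnly and visible_text() are
    hypothetical names, and "decodes without error" is only a coarse
    first filter (many single-byte charsets accept almost any input).

```python
from html.parser import HTMLParser

# Hypothetical candidate list; a real detector would consider many more.
CANDIDATES = ["utf-8", "cp1256", "cp1252", "iso-8859-6"]


def plausible_charsets(raw, candidates=CANDIDATES):
    """Return the candidates under which `raw` decodes without error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


class TextOnly(HTMLParser):
    """Collect character data, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(raw, charset):
    """Decode `raw` with `charset` and return only the text worth counting."""
    parser = TextOnly()
    parser.feed(raw.decode(charset))
    parser.close()
    # Collapse runs of whitespace so statistics are not skewed by layout.
    return " ".join(" ".join(parser.parts).split())
```

    Character statistics would then be computed on visible_text() for
    each surviving candidate, rather than on the raw bytes.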

    > You'll also have to consider the case where some or all of these text
    > elements and attributes are already marked with a language indicator. In that
    > case, the language autodetection should ignore them, and instead the
    > statistics of characters should be computed separately per indicated
    > language.

    A lot of the time we find that the language attribute on a given tag
    is wrong. User-supplied metadata is useful, but can rarely be
    trusted. More useful, often, are the font tags that they sprinkle
    around. These can be used to help infer language, and later, encoding.
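
    As a rough sketch of collecting those hints, the following groups
    text runs by the lang attribute and font face in effect, so each
    group can be examined separately. HintCollector and its fields are
    hypothetical names, the code assumes reasonably well-nested markup,
    and the hints are only evidence to weigh, not something to trust.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements with no end tag; they must not disturb the nesting stack.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class HintCollector(HTMLParser):
    """Group text by the (lang, font face) pair in effect where it occurs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = [(None, None)]  # innermost (lang, face) is last
        self.by_hint = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attributes = dict(attrs)
        lang, face = self.stack[-1]  # inherit from the enclosing element
        lang = attributes.get("lang") or lang
        if tag == "font":
            face = attributes.get("face") or face
        self.stack.append((lang, face))

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.by_hint[self.stack[-1]].append(text)
```

    After feed(), by_hint maps each (lang, face) pair to its text runs;
    a detector could then compare what the text looks like against what
    the markup claims.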

    > The other problem is that most composed pages forget to explicitly label the
    > foreign language used in small spans of text. These spans can be very
    > frequent, especially within technical documents (like a JavaDoc page, or
    > a document discussing some standards, with lots of acronyms or
    > untranslated terms).

    We have technology here that can detect occurrences of multiple
    languages in a single document, though not at the level of one or two
    words.

    > To detect a language, you could also try searching for very common terms
    > like "the", "is", "are", "have", "and" in English, "le", "un", "a", "à",
    > "est", "et" in French, "der", "das", "ist" in German. These general terms
    > are exactly those that are generally ignored by search engines due to their
    > frequency in each language.

    Right, isn't this what the Netscape detector does? Building these term
    lists is easily done, and can be useful indeed when disambiguating
    possible matches. You can also use these lists to differentiate very
    similar languages, like Malay and Indonesian, something we can do
    quite reliably when given enough text.
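
    A minimal sketch of the term-list idea, using only a few of the words
    quoted above. FUNCTION_WORDS and guess_language() are hypothetical
    names; real lists would be much larger, and this simple count is
    meant as a tie-breaker among candidates, not a detector on its own.

```python
import re

# Tiny illustrative lists; a usable system would need far more terms.
FUNCTION_WORDS = {
    "en": {"the", "is", "are", "have", "and"},
    "fr": {"le", "un", "a", "est", "et"},
    "de": {"der", "das", "ist"},
}


def guess_language(text, lists=FUNCTION_WORDS):
    """Count function-word hits per language; return (best or None, scores)."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in lists.items()
    }
    best = max(scores, key=scores.get)
    # With no hits at all there is no evidence, so report no guess.
    return (best if scores[best] else None), scores
```

    For closely related languages the same approach applies, but the
    lists would need to hold the terms on which the two differ, and more
    text is needed before the counts mean anything.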

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
     "You can't fake quality any more than you can fake a good meal." (W.S.B.)
    

