Re: Multi-lingual corpus?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 25 2005 - 05:04:54 CDT


    From: <jarkko.hietaniemi@nokia.com>
    > I believe a relatively simple exercise in statistics, playing with the
    > typical n-gram frequencies,
    > shows that you need to have dozens of letters to get any reasonably
    > reliable results.

    My intent was not to dismiss the idea of n-gram analysis, but to point
    out that in practice it fails to identify languages at the reported
    success rates. That's why I tested it starting with short phrases and
    lengthening them until it returned a positive identification.
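
    To make that experiment concrete, here is a minimal sketch in Python,
    assuming a simple rank-order n-gram profile in the style of Cavnar &
    Trenkle (1994); the tiny training strings, the two-language set, and
    the test sentence are placeholders, not real corpora:

        from collections import Counter

        def profile(text, n=3, size=300):
            """Rank the `size` most frequent character n-grams of `text`."""
            grams = Counter(text[i:i+n] for i in range(len(text) - n + 1))
            return {g: rank for rank, (g, _) in enumerate(grams.most_common(size))}

        def distance(doc, lang, max_penalty=300):
            """Cavnar-Trenkle "out-of-place" measure between two profiles."""
            return sum(abs(rank - lang.get(g, max_penalty)) for g, rank in doc.items())

        def identify(text, profiles):
            doc = profile(text)
            return min(profiles, key=lambda lang: distance(doc, profiles[lang]))

        # Placeholder training data; real profiles need large corpora.
        profiles = {
            "en": profile("the quick brown fox jumps over the lazy dog " * 50),
            "fr": profile("portez ce vieux whisky au juge blond qui fume " * 50),
        }

        # The experiment: feed the classifier longer and longer prefixes
        # until it settles on one language.
        sentence = "le juge blond fume une pipe sur le vieux banc"
        for length in range(5, len(sentence) + 1, 5):
            print(length, "chars ->", identify(sentence[:length], profiles))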

    And I found that in MANY cases, this method will NOT detect the correct
    language until given MUCH longer sentences than what has been reported
    in various places.

    Additionally, some other implementations give much better results using
    n-gram analysis alone. This suggests that the statistics used in their
    implementations are better tuned, probably built from a less
    constrained corpus of text and trained on realistic text samples.

    For example, I found that XEROX's language identifier performs much
    better, on MUCH shorter texts. I suspect that it uses a distinct
    mathematical model, and that it combines several analyses instead of a
    single heuristic with badly tuned statistical models. Notably, XEROX
    seems to use variable-length n-gram analysis instead of fixed-length
    n-grams, and it also uses short-word analysis (there have been reports
    where n-gram and short-word analyses identified languages at the same
    or similar success rates, but combining the two orthogonal approaches
    gives much more significant results; a generalized method would combine
    the two through variable-length n-gram analysis, and that's apparently
    what XEROX has done: variable-length analysis does not attempt to
    identify arbitrary fixed-length n-grams, but instead attempts to
    approximate the syllabic level).
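
    To illustrate how the two orthogonal cues can be combined (this is only
    a toy sketch of the idea, not XEROX's actual implementation, which is
    not public; the word lists, trigram sets, and equal weighting below are
    illustrative assumptions):

        SHORT_WORDS = {  # a handful of very frequent closed-class words
            "en": {"the", "of", "and", "to", "in", "is", "it"},
            "fr": {"le", "la", "de", "et", "un", "une", "est"},
        }

        # Placeholder trigram sets; real models come from large corpora.
        GRAM_MODELS = {
            "en": {"the", "he ", " th", "ing", "and", " an", "ion"},
            "fr": {"le ", " le", "es ", " de", "de ", "ent", "ion"},
        }

        def ngram_score(text, grams, n=3):
            """Fraction of the text's n-grams found in the language's set."""
            seen = [text[i:i+n] for i in range(len(text) - n + 1)]
            return sum(g in grams for g in seen) / len(seen) if seen else 0.0

        def short_word_score(text, words):
            """Fraction of tokens that are known short words of the language."""
            tokens = text.split()
            return sum(t in words for t in tokens) / len(tokens) if tokens else 0.0

        def identify(text, weight=0.5):
            text = text.lower()
            return max(GRAM_MODELS, key=lambda lang:
                       weight * ngram_score(text, GRAM_MODELS[lang])
                       + (1 - weight) * short_word_score(text, SHORT_WORDS[lang]))

        print(identify("le chat est sur le tapis"))  # -> fr (on this toy data)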

    A natural extension of this variable-length n-gram analysis would be to
    build fairly reliable syllable breakers without using huge
    dictionaries...
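
    A hedged sketch of that idea: estimate character transition frequencies
    from raw text and break words where the transition probability dips.
    The corpus and the threshold here are illustrative assumptions, not a
    tested syllabifier:

        from collections import Counter

        def train(corpus):
            """Bigram transition frequencies estimated from raw text."""
            pairs = Counter(corpus[i:i+2] for i in range(len(corpus) - 1))
            singles = Counter(corpus)
            return lambda a, b: pairs[a + b] / max(singles[a], 1)

        def break_word(word, prob, threshold=0.05):
            """Insert a break wherever P(next char | char) dips below threshold."""
            pieces, start = [], 0
            for i in range(1, len(word)):
                if prob(word[i-1], word[i]) < threshold:
                    pieces.append(word[start:i])
                    start = i
            pieces.append(word[start:])
            return pieces

        corpus = "la segmentation syllabique par simples statistiques " * 20
        prob = train(corpus)
        print(break_word("segmentation", prob))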

    For now, I can only conclude that basic fixed-length n-gram analysis
    fails in too many practical cases where language identification is
    needed. It succeeds only when parsing substantial monolingual
    non-technical texts (for example, articles about history in Wikipedia).
    There are tons of other texts for which one would need automatic
    language identification, notably in plain-text search engines and
    indexers.

    I am not speaking about what Google does, because Google already has a
    huge database of dictionaries available, which is constantly augmented
    by the very large corpus of web sites it indexes. Google can then
    identify languages not only by n-grams and short words, and most
    probably not by syllabic structure alone, but directly at the word and
    phrase level, and most probably at the semantic level as well (using
    the semantic relations created by matching occurrences of terms within
    the same paragraphs, across a large corpus of texts written by
    different sources). (Google could, for example, correlate the various
    conjugated forms of a verb using such an approach, and discover
    relations between singular/plural, feminine/masculine, or
    case-inflected forms, simply because of the relations that exist
    between words within phrases found in lots of documents.) For this
    reason, the heuristic used to identify languages is certainly MUCH more
    complex (I don't know whether it has been implemented in Google Desktop
    Search, or whether it just uses a heuristic tweaked in favor of the
    desktop user's locale; in fact I don't see a language selection in
    Google Desktop Search, so I doubt that it is implemented).
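
    As a toy illustration of that co-occurrence idea (purely hypothetical;
    nothing here reflects Google's actual methods), one can count which
    word pairs share paragraphs and look for recurrent pairs:

        from collections import Counter
        from itertools import combinations

        paragraphs = [  # placeholder corpus
            "the cat sleeps while the cats play in the garden",
            "a cat watches the cats from the garden wall",
            "the garden is quiet when the cat sleeps",
        ]

        cooc = Counter()
        for para in paragraphs:
            for a, b in combinations(sorted(set(para.split())), 2):
                cooc[(a, b)] += 1

        # Pairs that co-occur in several paragraphs are candidate related
        # forms (e.g. the singular/plural pair cat/cats surfaces here).
        for pair, count in cooc.most_common(8):
            print(pair, count)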

    Anyway, it seems that most of the work related to n-gram analysis (and
    short-word analysis) was finished by early 1996, and no significant
    results have been published since then. Nearly 10 years have elapsed,
    and I'm sure that other approaches now exist that could be combined to
    offer better identification results.


