Re: Multi-lingual corpus?

From: Tom Emerson (
Date: Wed Aug 24 2005 - 20:59:26 CDT


    Philippe Verdy writes:
    > From: "Tom Emerson" <>
    > >
    > >
    > > Indeed, all of the data van Noord uses is included in his distribution.
    > I tried his demo page just with French, and the conclusions are not good.

    Oh, his system is not very good... I didn't mean (if I did at all)
    that it was. It's just one that is raised repeatedly when people
    evaluate the language/encoding identifier my company sells. His
    training corpora are ridiculously small for building any useful
    model. What's more, as soon as you feed it unclean text, with weird
    capitalization (for example), it gives up the ghost completely.

    > I fear that it bases its results only on digrams, but does not use trigrams.

    The Cavnar and Trenkle algorithm (and van Noord's implementation of
    it) generates n-grams, 1 <= n <= 5, and keeps the 300 most
    frequent. These are usually the unigrams of the language, as well as
    some bigrams. Only when you train on a *lot* of data do you see
    n-grams in the top 300 with n > 3. I've successfully used their
    algorithm for dialect identification, for example, because it is so
    trivially implemented.
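
    To make the "trivially implemented" claim concrete, here is a minimal
    Python sketch of the Cavnar & Trenkle scheme described above: build a
    ranked profile of the 300 most frequent n-grams (1 <= n <= 5, with
    padded word boundaries as in their paper) and compare profiles with
    their "out-of-place" rank distance. The function names and padding
    character are illustrative choices, not van Noord's actual code.

    ```python
    from collections import Counter

    def ngram_profile(text, max_n=5, top_k=300):
        """Top-k most frequent n-grams (1 <= n <= max_n), ranked by count."""
        counts = Counter()
        for token in text.split():
            padded = f"_{token}_"  # mark word boundaries, as in the paper
            for n in range(1, max_n + 1):
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1
        return [gram for gram, _ in counts.most_common(top_k)]

    def out_of_place(doc_profile, lang_profile):
        """Sum of rank differences; a gram absent from the language
        profile incurs the maximum penalty."""
        rank = {g: i for i, g in enumerate(lang_profile)}
        max_penalty = len(lang_profile)
        return sum(abs(i - rank.get(g, max_penalty))
                   for i, g in enumerate(doc_profile))
    ```

    Classification is then just computing the distance from the document's
    profile to each trained language profile and taking the minimum.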

    > Now the quoted references are quite old (about 1996). There are certainly
    > better techniques today than just n-grams...

    Just n-grams gets you a long way, actually. However, there are other
    techniques that are used in larger and more accurate systems: I will
    dig up references to more recent work that utilizes hidden Markov
    models and other probabilistic methods to good effect.


    Tom Emerson                                          Basis Technology Corp.
    Software Architect                       
     "You can't fake quality any more than you can fake a good meal." (W.S.B.)

    This archive was generated by hypermail 2.1.5 : Wed Aug 24 2005 - 21:03:49 CDT