Re: Multi-lingual corpus?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 24 2005 - 18:19:15 CDT


    From: "Tom Emerson" <tree@basistech.com>
    > http://odur.let.rug.nl/~vannoord/TextCat/
    >
    > Indeed, all of the data van Noord uses is included in his distribution.

    I tried his demo page with French input only, and the results are not good:
    - starting with "essai", it replied Finnish
    - extending it to "un essai", it replied Romanian
    - extending it to "un essai long", "un essai plus long", or "un essai encore
    plus long", it replied "rumantsh"
    - extending it to "ceci est un essai long", "ceci est un essai trop long",
    "ceci est un essai encore trop long", "ceci est un essai suffisant", it
    replied Romanian again...

    I don't think it has the expected accuracy for texts that are only about 30
    characters long (a common minimum case where language identification is
    needed). Accuracy comes only after a minimum of about 50 characters for
    French (it seems to identify French positively only if the text has at
    least one French accent or a contraction with an apostrophe; otherwise it
    confuses it with other Romance languages).
    What is strange is that it identifies Romanian so often, when Romanian text
    should make frequent use of the cedilla (or of its distinctive comma below,
    if it is encoded that way, which is possible only in Unicode).
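
    As a rough illustration of that last point (this is not part of TextCat),
    the sketch below counts the Romanian s/t letters in both encodings: the
    cedilla forms (U+015E, U+015F, U+0162, U+0163) and the comma-below forms
    (U+0218, U+0219, U+021A, U+021B). The function name romanian_marks is
    hypothetical; a text with none of these letters gives no positive evidence
    for Romanian.

```python
import unicodedata

# Romanian s/t with comma below, and the visually similar cedilla forms.
COMMA_BELOW = "\u0218\u0219\u021A\u021B"
CEDILLA = "\u015E\u015F\u0162\u0163"

def romanian_marks(text):
    """Count Romanian-style s/t diacritics (hypothetical helper)."""
    # NFC composes a base letter followed by a combining comma below
    # (U+0326) or combining cedilla (U+0327) into the precomposed letter.
    text = unicodedata.normalize("NFC", text)
    return {
        "comma_below": sum(text.count(c) for c in COMMA_BELOW),
        "cedilla": sum(text.count(c) for c in CEDILLA),
    }
```

    Calling romanian_marks("ceci est un essai long") finds neither form, so a
    classifier that answers Romanian for that input is not relying on these
    letters.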

    I fear that it bases its results only on digrams and does not use trigrams.
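
    To make the digram/trigram question concrete, here is a minimal sketch of
    a rank-order n-gram classifier. It is not TextCat's actual code, and every
    name in it (ngrams, profile, out_of_place, identify, SAMPLES) is
    hypothetical, as are the toy training sentences. The n_max parameter
    switches between digram-only (n_max=2) and trigram (n_max=3) profiles.

```python
from collections import Counter

# Toy training samples (assumption): far too small for real use.
SAMPLES = {
    "french": "ceci est un essai de texte en francais pour le modele",
    "finnish": "tämä on lyhyt suomenkielinen koeteksti mallia varten",
}

def ngrams(text, n_min, n_max):
    """Yield character n-grams of each word, padded with '_' on both sides."""
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]

def profile(text, n_min=1, n_max=3, size=300):
    """Map the most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(ngrams(text, n_min, n_max))
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc, lang, size=300):
    """Sum of rank differences; n-grams missing from lang cost `size`."""
    return sum(abs(rank - lang[gram]) if gram in lang else size
               for gram, rank in doc.items())

def identify(text, models, n_min=1, n_max=3):
    """Return the name of the model with the smallest distance to text."""
    doc = profile(text, n_min, n_max)
    return min(models, key=lambda name: out_of_place(doc, models[name]))

def build_models(n_min=1, n_max=3):
    """Build one profile per toy sample with the given n-gram lengths."""
    return {name: profile(text, n_min, n_max)
            for name, text in SAMPLES.items()}
```

    A call such as identify("un essai", build_models(1, 2), 1, 2) compares
    digram-only profiles, while build_models(1, 3) adds trigrams. A single
    word like "essai" produces only a handful of distinct n-grams, so the
    penalty for missing n-grams dominates the distance, which may be one
    reason why very short inputs are classified so erratically.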

    Now, the author cites a competitor named "Languid", also based on the
    GPL-ed TextCat and supposed to support more languages, but the link
    provided is invalid. My Google search now turns up only dead domains for
    sale, theater sites, unrelated material about language IDs (most often ISO
    639 codes), the term "languid" in common English dictionaries, the French
    conjugated verb "languid", JavaScript methods for handling localized data
    in web pages, the name of a city in Laos, or discographic sites. Maybe the
    product has changed its name...
    So I tried changing its domain name from languid.cantbedone.org to
    www.cantbedone.org, which reveals that the domain is (was?) operating a web
    crawler for blogs, hosted at javelina.cet.middlebury.edu. This looks like
    a student project hosted there, which died when the student left the
    college, apparently in late 2003 (searching the college's website reveals
    nothing interesting). The project ran for less than one year and seems
    abandoned; or there is another pointer somewhere else; or, finally,
    TextCat got even better and maintaining Languid was no longer necessary.

    Now the quoted references are quite old (about 1996). There are certainly
    better techniques today than just n-grams...


