Re: Multi-lingual corpus?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 24 2005 - 18:19:15 CDT

Next message: Peter Kirk: "Re: Unicode TTF question"

Previous message: Philippe Verdy: "Re: Unicode TTF question"
In reply to: Tom Emerson: "Re: Multi-lingual corpus?"
Next in thread: Tom Emerson: "Re: Multi-lingual corpus?"
Reply: Tom Emerson: "Re: Multi-lingual corpus?"

From: "Tom Emerson" <tree@basistech.com>
> http://odur.let.rug.nl/~vannoord/TextCat/
>
> Indeed, all of the data van Noord uses is included in his distribution.

I tried his demo page with French only, and the results are not good:
- starting with "essai", it answered Finnish
- extending it to "un essai", it answered Romanian
- extending it to "un essai long", "un essai plus long", or "un essai encore
plus long", it answered "rumantsh"
- extending it to "ceci est un essai long", "ceci est un essai trop long",
"ceci est un essai encore trop long", "ceci est un essai suffisant", it
answered Romanian again...

I don't think that it will have the expected accuracy for texts as short as 30
characters (a common minimum case where language identification will be
needed). The accuracy comes only after a minimum of about 50 characters for
French (it seems that it will identify French positively only if the text
has at least one French accent or contraction with apostrophe, else it will
confuse it with other Romance languages).
What is strange is the fact that it identifies Romanian so often, when
Romanian should make frequent use of the cedilla (or of its distinctive comma
below, if it is encoded that way, which is possible only in Unicode).
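
To illustrate the point about Romanian: its s and t with a mark below exist
in Unicode both with a cedilla (U+015F, U+0163) and with a comma below
(U+0219, U+021B). Here is a minimal Python sketch that counts such letters in
a sample, whichever encoding is used. The function name
count_romanian_marks is hypothetical, and this is only an illustration of
the cue, not something the TextCat demo is known to do.

```python
import unicodedata

# Romanian s/t with a mark below, lowercase and uppercase, in both encodings.
CEDILLA_FORMS = "\u015f\u0163\u015e\u0162"      # s/t with cedilla
COMMA_BELOW_FORMS = "\u0219\u021b\u0218\u021a"  # s/t with comma below

def count_romanian_marks(text):
    """Count s/t letters carrying a cedilla or a comma below (hypothetical helper)."""
    # NFC composes "s" + combining comma below (U+0326) into U+0219, etc.
    text = unicodedata.normalize("NFC", text)
    marked = set(CEDILLA_FORMS + COMMA_BELOW_FORMS)
    return sum(1 for ch in text if ch in marked)
```

None of the French test phrases above contains any of these letters, so a
count of zero could be used as weak evidence against Romanian, though a short
Romanian sample may well contain none of them either.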

I fear that it bases its results only on digrams, but does not use trigrams.
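
For reference, here is a minimal Python sketch of rank-order n-gram
classification, the general kind of technique that n-gram identifiers such as
TextCat are built on. It is written under my own assumptions (the function
names, the "_" word padding, the profile size of 300 and the n_max parameter
are all illustrative) and is not van Noord's code. Setting n_max=2 restricts
it to digrams, n_max=3 adds trigrams.

```python
from collections import Counter

def ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams of each word, padded with '_' at both ends."""
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]

def profile(text, size=300, n_max=3):
    """Return the most frequent n-grams, from most to least frequent."""
    counts = Counter(ngrams(text, 1, n_max))
    # Sort by descending count, then alphabetically so that ranks are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile cost the maximum."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def classify(text, lang_profiles, n_max=3):
    """Return the language whose profile is closest to the text's profile."""
    doc = profile(text, n_max=n_max)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

Here lang_profiles would be a dict mapping a language name to the profile of
a training text in that language, built with the same n_max. A phrase like
"un essai" yields only a handful of n-grams, most of them shared by several
Romance languages, so the distances to the candidate profiles end up very
close, which would be consistent with the unstable answers above.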

Now, the author cites a competitor, also based on the GPL-ed TextCat, named
"Languid", which is supposed to support more languages, but the link he
provides is invalid (and my search on Google now turns up only dead domains
for sale, theater sites, unrelated distinctions about language ids, i.e.
ISO 639 codes most often, the term "languid" in common English
dictionaries, the French conjugated verb "languid", Javascript
methods for handling localized data in web pages, the name of a city in
Laos, or discographic sites...). Maybe the product has changed its name...
So I tried changing its domain name languid.cantbedone.org to
www.cantbedone.org, and it reveals that the domain is (was?) operating a web
crawler for blogs, hosted at javelina.cet.middlebury.edu. This looks like it
was a student project hosted there, which died when its author
left the college, apparently in late 2003 (searching the college's website
reveals nothing interesting). This project ran for less than 1 year and
seems abandoned, or there's another pointer somewhere else, or finally
TextCat got even better and maintaining Languid was no longer necessary.

Now the quoted references are quite old (about 1996). There are certainly
better techniques today than just n-grams...


This archive was generated by hypermail 2.1.5 : Wed Aug 24 2005 - 18:21:34 CDT