RE: Multi-lingual corpus?

Date: Thu Aug 25 2005 - 03:33:23 CDT

    > I tried his demo page just with French, and the conclusion
    > are not good.
    > - starting by "essai", it replied finnish
    > - extending it to "un essai", it replied romanian
    > - extending it to "un essai long", "un essai plus long", or
    >   "un essai encore plus long", it replied "rumantsh"
    > - extending it to "ceci est un essai long", "ceci est un essai trop long",
    >   "ceci est un essai encore trop long", "ceci est un essai suffisant", it
    >   replied again romanian...

    I think you are being much too harsh in your judgment. It would do well to sit
    down and think for a moment about what it does, based on what input, and what
    it outputs. Instead, you could have some fun, and see what it does:

    a irish
    au welsh
    auk malay
    auke german
    aukea basque
    aukeam malay
    aukeama swahili
    aukeamaa sanskrit
    aukeamaan finnish

    (The 'aukeamaan' being a valid Finnish word.) My main point, I guess, is this: take
    a look at the replies. 'a' is a valid word in MANY languages, but it replies with only
    one. Ditto for 'au', 'auk', and 'auke'. 'aukea', 'aukeama', and 'aukeamaa' are valid
    Finnish words, but apparently they could be Basque, Swahili, and Sanskrit.

    I believe a relatively simple exercise in statistics, playing with the typical n-gram frequencies,
    shows that you need to have dozens of letters to get any reasonably reliable results.
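
    To make that concrete, here is a rough sketch of the kind of n-gram comparison I have
    in mind. It is not how the demo page works (I have not seen its code). The two sample
    sentences, the trigram size, and the names char_ngrams, cosine, and rank are all my own
    assumptions for illustration; a real identifier would train on far larger corpora.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count character n-grams, padding word boundaries with single spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training data (an assumption for this sketch): one sentence per language.
SAMPLES = {
    "french": "ceci est un essai pour voir si le programme peut "
              "reconnaître la langue d'un texte très court",
    "finnish": "tämä on koe jolla katsotaan tunnistaako ohjelma "
               "kielen hyvin lyhyestä tekstistä",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def rank(text):
    """Return (language, score) pairs, best match first."""
    query = char_ngrams(text)
    scores = {lang: cosine(query, prof) for lang, prof in PROFILES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for text in ["a", "essai", "ceci est un essai suffisant", "aukeamaan"]:
        total = sum(char_ngrams(text).values())
        print(f"{text!r}: {total} trigrams -> {rank(text)}")
```

    With these toy profiles the input "a" yields a single trigram that occurs in neither
    sample, so every language scores zero and any answer would be arbitrary. A five-letter
    word yields only five trigrams. The ranking only has real evidence to work with once
    the input supplies a few dozen of them.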

