Re: Multi-lingual corpus?

From: Tom Emerson
Date: Wed Aug 24 2005 - 13:26:03 CDT

    Philippe Verdy writes:
    > I wonder if it's a good idea to provide him with such data, if he
    > does not want to publish anything in fact (there may be legal issues
    > with his source, notably if he used copyrighted materials such as
    > the paper he is citing).

    Well, the Cavnar and Trenkle paper has been around for a long time:
    it's a trivial algorithm to implement, and has served as the
    foundation for many of the open sourced or freely available
    language/encoding ID systems that are out there. Most notable is van
    Noord's Perl "TextCat" program, which has profiles for 77
    language/encoding pairs.

    Indeed, all of the data van Noord uses is included in his distribution.
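
    To give a sense of how little code is involved, here is a rough
    Python sketch of the Cavnar and Trenkle approach: rank the most
    frequent character n-grams of a text, then compare rank orders with
    an "out-of-place" distance. This is not van Noord's code. The
    function names, the underscore padding, and the simplified
    tokenization (lowercase, split on whitespace) are my own
    assumptions; the n-gram lengths of 1 to 5 and the 300-entry profile
    follow the values I recall from the paper.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first.

    Simplification (assumption): tokens are lowercased and split on
    whitespace, and each token is padded with one underscore per side.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = "_" + token + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles.

    An n-gram missing from lang_profile gets the maximum penalty,
    taken here to be the length of lang_profile.
    """
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, profiles):
    """Return the name of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))
```

    You would build one profile per language/encoding pair from
    training text with ngram_profile(), keep them in a dict keyed by
    name, and call identify() on the text to be classified. Real
    systems need far more training text per profile than a sentence or
    two.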

    The copyright issue is a real one, and he'll need to be careful if he
    decides to re-release the data.


    Tom Emerson                                          Basis Technology Corp.
    Software Architect                       
     "You can't fake quality any more than you can fake a good meal." (W.S.B.)
