Re: [OT] looking for electronic dictionaries

From: James E. Agenbroad (jage@loc.gov)
Date: Fri Aug 30 2002 - 08:24:58 EDT


On Thu, 29 Aug 2002, Eric Muller wrote:

> For my personal use, I would like to acquire electronic dictionaries,
> principally for the major European languages, with the following
> characteristics:
>
> - reputable source
>
> - "raw" datafiles accessible - I appreciate the interfaces that
> dictionary vendors may provide, but I want to be able to write my own
> code to find the data I am looking for
>
> - the wordlist is the principal aspect; I can live without definitions.
>
> - "markup" about the structure of words, for things like hyphenation,
> etc. (or from which hyphenation can be derived)
>
> - some form of frequency count would be nice
>
> For example, I'd like to compute something like: "the average French
> character occupies x bytes in UTF-8", with average defined in sync with
> the frequency count. And I'd like to compute things like spelling
> changes introduced by hyphenation in Dutch.
>
> Any pointers?
>
> Thanks,
> Eric.
                                             Friday, August 30, 2002
Eric,
    I have no sources to suggest, just a comment. The average UTF-8
length of a French word will depend to some extent on whether each
accented letter is encoded as a single precomposed code (letter +
diacritic) or as a base letter followed by a separate code for the
combining diacritic. It will matter more if you want the average length
of Czech or Polish words. Fortunately Vietnamese isn't European.
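
    As a rough illustration of the point above, here is a minimal Python
sketch that compares the frequency-weighted average UTF-8 length of a word
under the precomposed (NFC) and decomposed (NFD) normalization forms. The
function names and the tiny word list are assumptions made up for the
example, not data from any real dictionary or corpus.

```python
import unicodedata

def utf8_len(text, form):
    """Return the number of UTF-8 bytes in text after normalizing to form."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))

def avg_word_bytes(freqs, form):
    """Frequency-weighted average UTF-8 length of a word, in bytes."""
    total_bytes = sum(count * utf8_len(word, form) for word, count in freqs.items())
    total_words = sum(freqs.values())
    return total_bytes / total_words

# Hypothetical toy frequency list (word -> count), not real corpus data.
# Escapes are used so the literals do not depend on how this file is encoded.
sample = {
    "\u00e9t\u00e9": 3,        # "ete" with two acute accents
    "le": 10,
    "\u00e9l\u00e8ve": 1,      # "eleve" with one acute and one grave accent
}

if __name__ == "__main__":
    for form in ("NFC", "NFD"):
        print(form, round(avg_word_bytes(sample, form), 3))
```

    A precomposed e-acute (U+00E9) takes 2 bytes in UTF-8, while a plain
"e" followed by the combining acute accent (U+0301) takes 3, so the NFD
average comes out higher for any list that contains accented words.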

     Regards,
          Jim Agenbroad ( jage@LOC.gov )
     "It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams." Adapted
from a letter by Gabriel Garcia Marquez.
     The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
     Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE,
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.


