At 09:42 AM 4/18/2002, Markus Scherer wrote:
>Doug Ewell wrote:
>
>>The ICU package includes a sorted Thai word list in a UTF-8 file called
>>th18057.txt. Since you may not wish to download the whole package and I
>>don't know if the Thai file is available separately, I have uploaded it
>>(for a limited time only) to:
>
>
>Note that ICU has CVS and WebCVS, so you can get any of our files separately.
>For this one:
>http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/test/testdata/th18057.txt
>
>(ICU uses the X license. See http://oss.software.ibm.com/icu/)
>
>We use this word list for word break iteration, for which we have APIs.
That file is used to test Thai collation; there is a separate, binary
dictionary file that's used for word breaking. The dictionary is built
using ICU4J. You can pick up the source file here:
The source file is UTF-16 with a BOM at the front. Each word is followed by ^M ^J (CR LF).
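
For anyone who wants to read a file laid out that way, here is a minimal Python sketch. It assumes only what is stated above (UTF-16, a BOM at the front, CR LF after each word); the function name read_word_list is made up for illustration and is not part of ICU or ICU4J.

```python
import codecs


def read_word_list(data):
    """Decode a UTF-16 word list that starts with a BOM and has CR LF after each word.

    Hypothetical helper for illustration; not an ICU API.
    """
    # Refuse input without a BOM, since the byte order would otherwise be a guess.
    if not data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        raise ValueError("expected a UTF-16 byte order mark at the front")
    # The "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
    text = data.decode("utf-16")
    # Each word is terminated by CR LF; skip the empty piece after the last one.
    return [word for word in text.split("\r\n") if word]
```

To use it on a file on disk, read the file in binary mode (open(path, "rb").read()) and pass the bytes in; the path is whatever you saved the dictionary source file as.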
>PS: For details about CVS for ICU see
>http://oss.software.ibm.com/icu/develop/cvs.html
Eric Mader
IBM GCoC - San Jose
5600 Cottle Rd. M/S 50-2/B11
San Jose, CA 95193
This archive was generated by hypermail 2.1.2 : Thu Apr 18 2002 - 18:41:09 EDT