Re: extracting words

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Feb 12 2001 - 15:37:32 EST


Mark Davis said:

> BTW, someone on this thread made this topic out to be even more complex than
> it is: that Devanagari and Korean are written without spaces. While that may
> have been the case historically, I believe that the modern text does use
> spaces. Chinese, Japanese and Thai are the main languages written without
> spaces.

To that, add Khmer and Lao, which generally follow the Thai pattern of using
spaces not between words, but between phrases. (Chinese and Japanese don't
use spaces either between words or between phrases.)

And also add Tibetan, which is an unusual case. It doesn't use spaces to
indicate boundaries, but instead has an obligatory segmentation mark, the
tsek (U+0F0B), which occurs roughly between syllables. Yes, Tibetanists, I
know that the term "syllable" is not technically correct here, so please don't
nitpick me to death on this one. ;-)
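
For anyone extracting units from Tibetan text, here is a minimal Python
sketch of splitting on the tsek. The function name split_on_tsek and the
sample string are illustrative assumptions, and the sketch ignores other
Tibetan punctuation (the shad, for example) and the grouping of these
units into actual words.

```python
# U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG, the "tsek" discussed above.
TSEK = "\u0F0B"


def split_on_tsek(text):
    """Split text into tsek-delimited units, dropping empty pieces.

    Hypothetical helper for illustration only: it treats the tsek as a
    plain delimiter and does nothing about other punctuation or about
    combining the resulting units into words.
    """
    return [unit for unit in text.split(TSEK) if unit]


# Sample: "bod skad" ("Tibetan language"), written with escapes so the
# code points are unambiguous. Each unit is followed by a tsek.
sample = "\u0F56\u0F7C\u0F51\u0F0B\u0F66\u0F90\u0F51\u0F0B"
units = split_on_tsek(sample)
```

Here units holds two strings, one per tsek-delimited unit. Going from
those units to words still needs a dictionary or similar, much as it
does for Thai, Khmer and Lao phrases.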

--Ken


