Re: [OT] RE: FW: extracting words

From: Jungshik Shin (jshin@mailaps.org)
Date: Sun Feb 11 2001 - 19:00:26 EST


On Sun, 11 Feb 2001, Thomas Chan wrote:

> On Sun, 11 Feb 2001, Mike Lischke wrote:
>
> > > If you are willing to give up precision, then you can use heuristics.
> > >
> > > It's ugly but perhaps ok for a simple editor. You can improve the
> > > precision
> > > with better heuristics and more data, so you get to decide how much is
> > > good enough...
> >
> > So using white spaces for general word breaking and ideographs for CJK
> > would be an acceptable approach? What I wonder about is how to handle
>
> The handling of Japanese and Korean text is different from that of Chinese
> (lumping them together as "CJK" is inappropriate in this context), but I

I'm glad to see this. Lumping them together as "CJK" is inappropriate not
only in this context but in other cases as well. Chinese, Japanese and
Korean text processing certainly have a lot in common, but there are a
lot of differences too. In the case of Korean, the writing system, Hangul,
is not just syllabic (as Japanese Kana is) but also alphabetic, which
means that in some cases it needs to be handled the way Thai and Indic
scripts are. This point should not be overlooked if one wants to avoid
half-baked Korean support.
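
To make the "alphabetic" point concrete, here is a minimal sketch that
splits a precomposed Hangul syllable into its letters (jamo). It assumes
the arithmetic decomposition defined by the Unicode Standard; the function
name decompose_hangul is only an illustrative choice, not part of any
library.

```python
import unicodedata

# Constants from the Unicode Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT           # 11172 precomposed syllables in total


def decompose_hangul(ch):
    """Return the jamo of one precomposed syllable (hypothetical helper).

    Characters outside the Hangul Syllables block are returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [ch]
    jamo = [
        chr(L_BASE + s_index // N_COUNT),               # leading consonant
        chr(V_BASE + (s_index % N_COUNT) // T_COUNT),   # vowel
    ]
    t_index = s_index % T_COUNT
    if t_index:                                         # optional final consonant
        jamo.append(chr(T_BASE + t_index))
    return jamo


# The result should agree with canonical decomposition (NFD).
syllable = "\ud55c"  # U+D55C, the syllable "han"
print(decompose_hangul(syllable))
print("".join(decompose_hangul(syllable)) == unicodedata.normalize("NFD", syllable))
```

A single syllable block thus carries two or three letters, which is why
operations such as searching or sorting sometimes have to work below the
syllable level.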

The other day, somebody wrote to this list that most morphemes in CJK
might be monosyllabic. That's true of Chinese (as far as I can tell),
but could not be further from the truth for Japanese and Korean (although
it does hold for Chinese loanwords in Korean). Chinese is an isolating
language, whereas Japanese and Korean are agglutinative languages.
Geographic closeness doesn't necessarily lead to linguistic closeness:
the distance between Chinese on the one hand and Japanese and Korean on
the other is much, much greater than that between English and Sanskrit,
both of which belong to the Indo-European language family. IMHO, this
difference makes it harder to extract word roots (for search engines,
DBs, etc.) from Japanese and Korean text (and from highly inflected
languages) than from Chinese text.
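
As a rough illustration of that difficulty, here is a sketch of the
heuristic discussed earlier in the thread: break on white space, and
treat each ideograph as its own token. The name naive_tokens and the
choice of the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are
assumptions made for brevity, not a recommended design.

```python
import re

# One token per Han ideograph; otherwise a run of non-space characters.
# Only the basic CJK Unified Ideographs block is covered (an assumption).
TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")


def naive_tokens(text):
    """Hypothetical helper: white-space plus per-ideograph tokenization."""
    return TOKEN.findall(text)


# Korean: the noun for "school" followed by two different particles.
school_to = "\ud559\uad50\uc5d0"       # noun + locative particle
school_subj = "\ud559\uad50\uac00"     # noun + subject particle
tokens = naive_tokens(school_to + " " + school_subj)

print(tokens)
print(tokens[0] == tokens[1])          # the shared noun is not recovered
```

The two Korean tokens come out as different strings even though they
contain the same noun, so an index keyed on such tokens would miss the
connection unless the particles are stripped by some morphological
analysis.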

Jungshik Shin


