RE: extracting words

From: Mark Leisher (
Date: Mon Feb 12 2001 - 12:14:08 EST

>> > - line break (wrapping lines on the screen)
>> > - word break (for selection)
>> > - word/root extraction (for search)
>>
>> I recognize that the second and third case are really difficult to
>> handle.

    Jarkko> Root extraction is decidedly non-trivial and a highly
    Jarkko> language-specific problem, even more so than word breaking, it's a
    Jarkko> messy linguistic problem instead of a clean algorithmic problem.
    Jarkko> To start with, the choice of the term "extraction" shows that one
    Jarkko> has not understood the problem in all its g(l)ory: a more
    Jarkko> appropriate term would be "finding", or maybe, "reducing" the
    Jarkko> root.

The words we use in computational linguistics are "stemming" and less
frequently "lemmatization." This is often the step in morphological analysis
that precedes determining the part-of-speech. Jarkko is right that it is a
messy problem for many languages.
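
To make the messiness concrete, here is a deliberately crude suffix-stripping
sketch in Python. It is not the Porter stemmer or any published algorithm; the
suffix list, the function name crude_stem, and the min_stem cutoff are all
assumptions chosen for illustration, and it only pretends to handle English.

```python
# Toy suffix stripper: an illustration only, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, English-only list


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[: -len(suffix)]
    return lower


for w in ("walking", "walked", "boxes", "running", "went"):
    print(w, "->", crude_stem(w))
```

It gets "walking" and "walked" down to "walk", but leaves "running" as "runn"
and has no idea that "went" belongs with "go". Handling cases like those is
where the language-specific rules and lexicons come in.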

    Jarkko> - "syllablization" (is that a word?) as a third problem (for
    Jarkko> breaking words more nicely into lines), it would rank in
    Jarkko> difficulty somewhere between word breaking and root extraction.

I believe "syllabization" or perhaps "syllabification" might be the term.

>> But for word wrapping I assume line breaking is sufficient. But when I
>> don't have spaces to use for wrapping and/or don't know whether the
>> actual text part uses spaces at all (what about exotic languages like
>> Ogham or Anglo-saxon?) then how can I go to implement word wrapping?
>> Simply do it character by character?

Spaces and other punctuation come in handy for line breaking. Segmentation is
used with scripts that don't use this sort of intra-sentence term separation
(e.g. Chinese, Japanese, Thai). There are whole conferences devoted to
segmentation approaches. Another messy area of computational linguistics :-)
If segmentation is not available, then lines are often wrapped between
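
As a rough idea of what a dictionary-based segmenter does, here is a minimal
Python sketch of greedy forward maximum matching, one of the simplest
approaches. The function name max_match and the tiny lexicon are assumptions
for illustration; unspaced English stands in for a script written without
spaces, and characters not covered by the lexicon fall back to
single-character tokens.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching over a set of known words.

    At each position take the longest lexicon entry that matches; if none
    matches, emit a single character and move on.
    """
    max_len = max(map(len, lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon.
words = {"the", "cat", "cats", "sat", "at"}
print(max_match("thecatsat", words))  # ['the', 'cats', 'at']
```

The token boundaries are the candidate line-break positions. Note that the
greedy choice splits the sample as "the cats at" rather than "the cat sat",
which is a small taste of why segmentation is hard.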

Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

But there is no doubt but money is to the fore now. It is the romance, the
poetry of our age. It's the thing that chiefly strikes our imagination.
    -- The Rise of Silas Lapham, W. D. Howells
