Re: extracting words

From: Mark Davis
Date: Sun Feb 11 2001 - 15:55:30 EST

Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
recommended in my last message. The Unicode standard is online, as is the
TR. Both can be found by going to, and selecting the right
topic. The TR in particular discusses the recommended approach to line break
in great detail.

However, as with all Unicode functionality, you should not try to reinvent
the wheel. See if you can use the services of the OS/Platform, or get a
Unicode library (such as ICU or Basis), to cover your requirements. You
mentioned Java; it has an API for line break.
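
As a rough illustration of the Java route, here is a minimal sketch built on
java.text.BreakIterator.getLineInstance(), which reports the positions where a
line may legally be broken. The class LineBreakSketch and the helper wrap()
are hypothetical names made up for this example, and counting UTF-16 chars is
only an assumed stand-in for measuring rendered width. The JDK's break rules
may also differ in detail from the current Linebreak TR.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LineBreakSketch {
    // Hypothetical helper: greedily fills lines of at most maxChars UTF-16
    // code units, breaking only where BreakIterator reports an opportunity.
    // A chunk with no internal opportunity is kept whole, even if too long.
    static List<String> wrap(String text, int maxChars, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<String> lines = new ArrayList<>();
        int lineStart = it.first();   // start of the line being filled
        int lastBreak = lineStart;    // last opportunity that still fit
        for (int pos = it.next(); pos != BreakIterator.DONE; pos = it.next()) {
            if (pos - lineStart > maxChars && lastBreak > lineStart) {
                // The text up to pos no longer fits: emit up to the
                // previous opportunity and start a new line there.
                lines.add(text.substring(lineStart, lastBreak));
                lineStart = lastBreak;
            }
            lastBreak = pos;
        }
        if (lineStart < text.length()) {
            lines.add(text.substring(lineStart));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("The quick brown fox", 10, Locale.ENGLISH)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

The caller never looks for spaces itself; it only asks the iterator where
breaks are allowed, so the same loop works for whatever scripts the underlying
implementation supports. Trailing spaces are left on each line here; a real
layout engine would normally ignore them when measuring.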


P.S. It also helps communication if we use the same terms, e.g. "line
break", not "word wrapping".
P.P.S. As to the list settings: if we changed them to please you, we would
annoy someone else. We cannot simultaneously please everyone. And please,
nobody start another thread on this topic.

----- Original Message -----
From: "Mike Lischke"
To: "Unicode List"
Sent: Sunday, February 11, 2001 11:32
Subject: re: extracting words

> - line break (wrapping lines on the screen)
> - word break (for selection)
> - word/root extraction (for search)

I recognize that the second and third cases are really difficult to handle.
But for word wrapping I assume line breaking is sufficient. But when I don't
have spaces to use for wrapping and/or don't know whether the actual text
part uses spaces at all (what about exotic languages like Ogham or
Anglo-Saxon?), then how can I go about implementing word wrapping? Simply do
it character by character?

Ciao, Mike

PS: sorry for sending this mail to you privately first, but those
impractical list settings always make me send to the wrong place first.
It is difficult for me to get used to these strange settings. I'm answering
about 50 mails per day with a simple "reply", so I simply forget all the
time that I have to "reply all" (and the out-of-office bounces I get to my
private mail whenever I send a message to the Unicode list don't make the
task easier).
