Re: Default Word Boundary Definition
From: Mark Davis
Date: 2001-02-07

The XML Query group is interested in the possibility of using the Unicode default specifications (Table 5-4 Word Boundaries) for word boundaries in their full text search work. However, this specification has not received nearly the attention -- and refinement -- of the default line boundary specification (UAX #14: Line Breaking Properties). The Query group is requesting that we review this specification and fix any problems so that it could be utilized by them as a default specification. (They would allow tailored word boundaries to be used as well, so that language-specific engines could do a better job; that's consistent with what we expect of default specifications.)

Background. The word boundaries are related to the line boundaries, but are distinct. Here is an example of word boundaries.

Example 1: Word Boundaries

The   quick   ( " brown " )   fox   can't   jump   32.3   feet ,   right ?

There is a boundary, for example, on either side of the word brown. These are the boundaries that users would expect, for example, if they chose "Whole Word Search" (WWS). Matching brown with WWS works, since there is a boundary on either side. Matching brow doesn't. Matching "brown" also works, since there are boundaries between the parentheses and the quotation marks.
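As a rough illustration of the whole-word test described above, the following Python sketch checks that a match starts and ends on a boundary. The segment list is a hand-written transcription of Example 1 (assuming the underlying text is `The quick ("brown") fox can't jump 32.3 feet, right?`), and `whole_word_match` is a hypothetical helper name, not part of any specification.

```python
from itertools import accumulate

# Hand-written segments for Example 1; spaces count as segments too.
segments = ["The", " ", "quick", " ", "(", '"', "brown", '"', ")", " ",
            "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet",
            ",", " ", "right", "?"]
text = "".join(segments)

# Boundary offsets: start of text plus the end offset of every segment.
boundaries = {0} | set(accumulate(len(s) for s in segments))

def whole_word_match(text, boundaries, needle):
    """Return True if some occurrence of needle starts and ends on a boundary."""
    start = text.find(needle)
    while start != -1:
        if start in boundaries and start + len(needle) in boundaries:
            return True
        start = text.find(needle, start + 1)
    return False

print(whole_word_match(text, boundaries, "brown"))
print(whole_word_match(text, boundaries, "brow"))
print(whole_word_match(text, boundaries, '"brown"'))
```

The search string may span several segments, as the quoted case shows; only its two ends have to coincide with boundaries.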

The particular requirement that the Query group has is for proximity; seeing whether, for example, "monster" is within 3 words of "truck". That is done with the above boundaries by extracting any words that contain a letter or digit (whether or not digits are included would be left up to the implementation). Thus for proximity we get the following, so "fox" is within three words of "quick".

Example 2: Extracted Words

The quick brown fox can't jump 32.3 feet right
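The extraction and proximity steps can be sketched in Python as follows. This is only an illustration: the segment list is transcribed by hand from Example 1, `extract_words` and `within` are hypothetical names, and "within n words" is assumed here to mean that the positions in the extracted list differ by at most n.

```python
def extract_words(segments):
    """Keep only the segments that contain at least one letter or digit."""
    return [s for s in segments if any(ch.isalnum() for ch in s)]

def within(words, a, b, n):
    """True if some occurrence of a is at most n positions away from b."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

# Hand-written segments for Example 1; spaces count as segments too.
segments = ["The", " ", "quick", " ", "(", '"', "brown", '"', ")", " ",
            "fox", " ", "can't", " ", "jump", " ", "32.3", " ", "feet",
            ",", " ", "right", "?"]

words = extract_words(segments)
print(words)
print(within(words, "quick", "fox", 3))
```

An implementation that prefers to exclude numbers would swap `isalnum` for `isalpha` in the filter.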

The current definitions in Table 5-4 Word Boundaries basically break between letters and non-letters, with combining marks considered part of the letter. Clusters of CJK characters or katakana are considered single words (including trailing sequences of hiragana).

There are some problems here.

While the default definition can't do anything sophisticated with CJK (such as dictionary lookup), it would be better to have breaks around single CJK than to include a whole paragraph (potentially) as a single word.

Proposal. We should address these issues with a revised default specification, leveraging the definitions and character properties that we use in line-break where possible.

Note: As we do this, we must remember that we are supplying a default specification. As with our other default specifications, implementations are free to override (tailor) the results to meet the requirements of different environments or particular languages.

Here is a basic proposal that we could use for the basis of further discussion.

Table 5-4. Default Word Boundaries

Character Classes

sot Start of Text
eot End of Text
Hiragana General_Category = Letter AND Script = HIRAGANA
Katakana General_Category = Letter AND Script = KATAKANA
Letter (General_Category = Letter OR General_Category = Modifier_Symbol) AND NOT (Line_Break = Ideographic OR Hiragana OR Katakana)
MidLetter U+0027 (') apostrophe, U+2019 (’) curly apostrophe, U+003A (:) colon (used in Swedish), U+002E (.) period, U+00AD soft hyphen, U+05F3 (׳) geresh, U+05F4 (״) gershayim
Ignorable Join_Controls, Bidi_Controls, Word_Joiner, ZWNBSP, CGJ, OR (General_Category = Mark)
other Other categories are from Line_Break (using the long names from PropertyAliases)


Each Ignorable is treated as if it had the type of the preceding character.

X Ignorable => X X
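A minimal Python sketch of this substitution, operating on a list of class names that has already been computed for each character. The function name `resolve_ignorables` is hypothetical, and leaving an Ignorable at the start of text unchanged is an assumption, since the rule does not cover that case.

```python
def resolve_ignorables(classes):
    """Replace each Ignorable with the (resolved) class of the item before it."""
    resolved = []
    for cls in classes:
        if cls == "Ignorable" and resolved:
            # X Ignorable => X X; runs of Ignorables all inherit the same X.
            resolved.append(resolved[-1])
        else:
            # Assumption: an Ignorable at start of text keeps its own class.
            resolved.append(cls)
    return resolved

print(resolve_ignorables(["Letter", "Ignorable", "Ignorable", "Numeric"]))
```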

In the rules below, × marks a position where no break is allowed, and ÷ marks a position where a break is allowed.

Don't break between most letters

Letter × Letter

Don't break letters across certain punctuation

Letter × MidLetter Letter
Letter MidLetter × Letter

Don't break within sequences of digits, or digits adjacent to letters.

Numeric × Numeric
Letter × Numeric
Numeric × Letter

Don't break within sequences like: '-3.2'

Hyphen × Numeric
Numeric × Infix_Numeric Numeric
Numeric Infix_Numeric × Numeric
Prefix_Numeric × Numeric
Numeric × Postfix_Numeric

Don't break between Hiragana or Katakana

Hiragana × Hiragana
Katakana × Katakana

Otherwise, break everywhere (including around ideographs)

Any ÷
÷ Any
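To make the proposal easier to discuss, here is a rough Python sketch of a segmenter that applies the no-break rules listed above and breaks everywhere else. It is not a reference implementation. Python's `unicodedata` module exposes neither Script nor Line_Break, so the script test uses character-name prefixes and the Infix_Numeric, Prefix_Numeric, Postfix_Numeric, Hyphen and Ignorable classes are small hand-picked stand-in sets; all of these, and every function name, are assumptions made for illustration.

```python
import unicodedata

# Stand-in sets (assumptions); real code would read Line_Break property data.
MID_LETTER = set("'\u2019:.\u00ad\u05f3\u05f4")
INFIX_NUMERIC = set(".,")
PREFIX_NUMERIC = set("$+\u00a3\u00a5")
POSTFIX_NUMERIC = set("%\u00a2")
HYPHEN = set("-")
# Partial list: ZWNJ, ZWJ, LRM, RLM, word joiner, ZWNBSP, CGJ.
IGNORABLE = set("\u200c\u200d\u200e\u200f\u2060\ufeff\u034f")

def script_hint(ch):
    """Approximate the script from the character name (an assumption)."""
    name = unicodedata.name(ch, "")
    for prefix in ("HIRAGANA", "KATAKANA", "CJK"):
        if name.startswith(prefix):
            return prefix
    return ""

def is_kana(ch, script):
    return unicodedata.category(ch).startswith("L") and script_hint(ch) == script

def is_letter(ch):
    cat = unicodedata.category(ch)
    return (cat.startswith("L") or cat == "Sk") and script_hint(ch) == ""

def is_numeric(ch):
    return unicodedata.category(ch) == "Nd"

def is_ignorable(ch):
    return ch in IGNORABLE or unicodedata.category(ch).startswith("M")

def no_break(eff, i):
    """True if a no-break rule applies between eff[i - 1] and eff[i]."""
    a, b = eff[i - 1], eff[i]
    prev = eff[i - 2] if i >= 2 else None
    nxt = eff[i + 1] if i + 1 < len(eff) else None
    # Letter/Numeric adjacent to Letter/Numeric, in any combination.
    if (is_letter(a) or is_numeric(a)) and (is_letter(b) or is_numeric(b)):
        return True
    # Letter MidLetter Letter: no break on either side of the MidLetter.
    if is_letter(a) and b in MID_LETTER and nxt is not None and is_letter(nxt):
        return True
    if prev is not None and is_letter(prev) and a in MID_LETTER and is_letter(b):
        return True
    # Numeric sequences such as '-3.2'.
    if a in HYPHEN and is_numeric(b):
        return True
    if is_numeric(a) and b in INFIX_NUMERIC and nxt is not None and is_numeric(nxt):
        return True
    if prev is not None and is_numeric(prev) and a in INFIX_NUMERIC and is_numeric(b):
        return True
    if a in PREFIX_NUMERIC and is_numeric(b):
        return True
    if is_numeric(a) and b in POSTFIX_NUMERIC:
        return True
    # Runs of Hiragana or of Katakana stay together.
    return any(is_kana(a, s) and is_kana(b, s) for s in ("HIRAGANA", "KATAKANA"))

def segment(text):
    """Split text at every position where no no-break rule applies."""
    if not text:
        return []
    # Each Ignorable stands in for the character before it.
    eff = []
    for ch in text:
        eff.append(eff[-1] if eff and is_ignorable(ch) else ch)
    out, start = [], 0
    for i in range(1, len(text)):
        if not no_break(eff, i):
            out.append(text[start:i])
            start = i
    out.append(text[start:])
    return out

print(segment('The quick ("brown") fox can\'t jump 32.3 feet, right?'))
```

Run on the sentence from Example 1, this yields the segments shown there, with each space as its own segment. Ideographs fall through to the final rule and come out one per segment, while kana runs stay together.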