Re: \b{wb} from Richard Wordingham on 2015-08-22 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 22 Aug 2015 22:46:08 +0100

On Sat, 22 Aug 2015 14:08:14 -0600
Karl Williamson <public_at_khwilliamson.com> wrote:

> But it isn't such a replacement, creating some consternation, and the
> main reason is that, unlike \b, it treats the boundary between white
> space characters as a breaking opportunity, so that it doesn't create
> runs of them. Thus if you have two spaces after a full stop, it
> treats each as an individual word.
>
> My question is "Was this intentional, and if so, Why?"

See below.

> TR18 says \b{w} is a "Zero-width match at a Unicode word boundary.
> Note that this is different than \b alone, which corresponds to \w
> and \W."

Unless I'm being stupid, \b and \b{w} are indeed very different.
Consider a sequence <U+0020, U+1F1EB REGIONAL INDICATOR SYMBOL LETTER F,
U+1F1F7 REGIONAL INDICATOR SYMBOL LETTER R, U+0041 LATIN CAPITAL
LETTER A, U+0062 LATIN SMALL LETTER B>

That has two internal word boundaries, splitting it into a space, a
flag, and the word "Ab". Is this what you want?
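To make the difference concrete, here is a small sketch using Python's
re module as a stand-in for a traditional \w/\W-based \b (an assumption
for illustration; the thread itself concerns Perl). The UAX #29
positions are written out by hand from the description above, not
computed by any library.

```python
import re

# <SPACE, REGIONAL INDICATOR F, REGIONAL INDICATOR R, "A", "b">
sample = " \U0001F1EB\U0001F1F7Ab"

# Classic \b: positions where a \w character meets a \W character
# (or the string edge). Regional indicators are symbols, so they are \W.
re_boundaries = [m.start() for m in re.finditer(r"\b", sample)]

# Word boundaries as described above, listed by hand (code point offsets):
# start | space | flag | "Ab" | end
uax29_boundaries = [0, 1, 3, 5]

print(re_boundaries)     # [3, 5]
print(uax29_boundaries)  # [0, 1, 3, 5]
```

The classic \b sees nothing between the space and the flag, because
both sides are \W; the UAX #29 segmentation puts a boundary there.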

Worse, consider a short Thai sentence ผมไม่มีคอมพิวเตอร์ที่ดี. That
gets split by ICU into |ผม|ไม่มี|คอมพิวเตอร์|ที่|ดี| - 5 words and
4 internal word boundaries. Note that there's a word or two between
each boundary. Is this what you want?

> My question is "Was this intentional, and if so, Why?"

Take a look at the rules in UAX#29 Section 4.1.1. Apart from the first
two, the pair dealing with line breaks (WB3a and WB3b) and the last,
they all identify where word boundaries aren't. This is tidy - the
algorithm concentrates on working out where a word continues.
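The shape of such a rule set can be sketched with a toy segmenter. It
is emphatically not UAX #29: it has only two "do not break" rules
(between two alphanumerics, and inside a pair of regional indicators,
pairing them the way current versions of UAX #29 do) plus the final
"otherwise break everywhere" rule. The names is_ri and toy_segments are
made up for this illustration.

```python
def is_ri(ch):
    """True for REGIONAL INDICATOR SYMBOL LETTER A..Z."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def toy_segments(text):
    """Break everywhere except where a 'no break' rule applies."""
    segments = []
    ri_run = 0  # consecutive regional indicators ending at the previous char
    for ch in text:
        prev = segments[-1][-1] if segments else None
        joins = prev is not None and (
            # toy stand-in for the letter/digit rules
            (prev.isalnum() and ch.isalnum())
            # keep regional indicators together in pairs
            or (is_ri(prev) and is_ri(ch) and ri_run % 2 == 1)
        )
        if joins:
            segments[-1] += ch
        else:
            segments.append(ch)  # default rule: break
        ri_run = ri_run + 1 if is_ri(ch) else 0
    return segments

print(toy_segments(" \U0001F1EB\U0001F1F7Ab"))  # [' ', '🇫🇷', 'Ab']
print(toy_segments("a.  b"))                    # ['a', '.', ' ', ' ', 'b']
```

Because nothing says "do not break between two spaces", the default
rule splits them, which is exactly the behaviour being asked about.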

In principle, you could, I believe, extend the rules so that characters
outside words and regional indicator runs were not divided, but it
would make for a more complicated algorithm with plenty of
opportunities for error. I think the thought was that word-free runs
did not need to be assembled into runs of non-word material.

The short answer, of course, is that the regular expression engine
could do this final step of post-processing itself. This may get
tricky with customised word-breaking.
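A minimal sketch of that post-processing step, assuming the engine
already has the list of segments produced by its word breaker. The
function names and the test for "contains word material" (any
alphanumeric character) are assumptions for illustration; a customised
word breaker would need its own predicate, which is where it gets
tricky.

```python
def has_word_char(segment):
    """Assumed test for 'word material': any alphanumeric character."""
    return any(ch.isalnum() for ch in segment)

def merge_nonword_runs(segments, is_word=has_word_char):
    """Coalesce adjacent segments that contain no word material."""
    merged = []
    prev_nonword = False
    for seg in segments:
        nonword = not is_word(seg)
        if nonword and prev_nonword:
            merged[-1] += seg   # extend the current non-word run
        else:
            merged.append(seg)  # start a new segment
        prev_nonword = nonword
    return merged

# Two spaces after a full stop, as in the original question.
print(merge_nonword_runs(["stop", ".", " ", " ", "Next"]))
# ['stop', '.  ', 'Next']
```

Passing a stricter predicate, for example one that only treats
whitespace-only segments as mergeable, would keep the full stop
separate from the spaces.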

Richard.
Received on Sat Aug 22 2015 - 16:47:18 CDT
