Re: UAX #29: Ambiguities in WB4, and contributing back testcases from Richard Wordingham on 2016-12-22 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 22 Dec 2016 22:58:10 +0000

On Thu, 22 Dec 2016 14:05:18 -0800
Manish Goregaokar <manish_at_mozilla.com> wrote:

> I guess the confusion is, with → rules, do we apply them globally, or
> only apply them when considering subsequent rules?

I would say the latter. The logic is that, at each position between two
characters, you work through the whole set of rules in order.

> I suspect the answer here is that you only apply them in order. The
> list of rules is not a list of precedences, but rather a list with the
> order in which the rules are applied. So a → rule means "Treat the
> left side as if it were the right side in the context of all
> subsequent rules"

I would indeed say that you apply them in order. The relevant example
in the test suite (file auxiliary/WordBreakTest.txt in the UCD) is:

÷ 000D ÷ 0308 ÷ 000A ÷
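
In that line, WB3a puts a break after the CR and WB3b puts one before
the LF, and both are reached before WB4 could attach U+0308 to anything.
As a rough illustration of the in-order reading, here is a minimal
Python sketch. It covers only WB1 to WB5 and WB999, and its property
table is a toy one. The names wb_prop, break_before and mark_breaks are
invented for this example; real code would read the properties from
WordBreakProperty.txt and also handle Newline, Format, ZWJ and the
remaining rules.

```python
BREAK, NO_BREAK = "÷", "×"
HARD = {"CR", "LF"}  # Newline is left out of this toy property set


def wb_prop(ch):
    """Toy Word_Break property lookup (hypothetical, far from complete)."""
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if ch == "\u0308":
        return "Extend"
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    return "Other"


def break_before(props, i):
    """Decide the boundary between props[i-1] and props[i], rules in order."""
    left, right = props[i - 1], props[i]
    if left == "CR" and right == "LF":
        return False  # WB3: CR × LF
    if left in HARD:
        return True  # WB3a: break after CR / LF
    if right in HARD:
        return True  # WB3b: break before CR / LF
    if right == "Extend":
        return False  # WB4: never break before Extend
    # WB4, for all later rules: look through trailing Extend to its base,
    # but not past a character that follows CR / LF or starts the text.
    j = i - 1
    while j > 0 and props[j] == "Extend" and props[j - 1] not in HARD:
        j -= 1
    left = props[j]
    if left == "ALetter" and right == "ALetter":
        return False  # WB5: ALetter × ALetter
    return True  # WB999: otherwise break


def mark_breaks(text):
    """Render the result in the notation used by WordBreakTest.txt."""
    if not text:
        return ""
    props = [wb_prop(c) for c in text]
    out = [BREAK]  # WB1: break at start of text
    for i, ch in enumerate(text):
        if i > 0:
            out.append(BREAK if break_before(props, i) else NO_BREAK)
        out.append(f"{ord(ch):04X}")
    out.append(BREAK)  # WB2: break at end of text
    return " ".join(out)


print(mark_breaks("\r\u0308\n"))
```

Because WB3a and WB3b return before the WB4 step is reached, the
diaeresis is never attached to the CR, which is the behaviour the test
line records.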

Now, I am not sure whether the rules can be turned automatically into a
break iterator based on regular expressions. The last time I looked,
ICU was doing this conversion by hand. I would therefore deduce that an
automatic conversion is impossible, difficult, or produces highly
inefficient code. ICU has the added complication that it also needs to
invoke real break iterators for Southeast Asian scripts. When I looked,
their interface was not returning appropriate word-break properties for
the characters, but was itself a break iterator.

Richard.
Received on Thu Dec 22 2016 - 16:58:36 CST
