Re: Potential contradiction between the WordBreak test data and UAX #29

From: Tom Hacohen <tom_at_osg.samsung.com>
Date: Wed, 23 Nov 2016 11:00:53 +0000

On 23/11/16 10:52, Daniel Bünzli wrote:
> On Wednesday 23 November 2016 at 11:22, Tom Hacohen wrote:
>> Thank you for your reply, but I don't think the UAX, specifically the
>> line you quoted implies that. The line you quoted says that the process
>> is terminated when a rule matches and produces a boundary status. In
>> Table 1[1], the right-arrow (which is used in rule 4) is listed as a
>> boundary symbol,
>
> Precisely, rules with this *symbol* do not produce a boundary *status* which is either boundary or not boundary as mentioned in parens in the line I quoted.

That parenthetical looks like a mistaken statement rather than a binding rule.

>
>> so I would argue that one should stop the process and start it again from the start.
>
> At least in the current UAX there is no mention of an idea of stopping and restarting the process at all.

Even if that's true, look at my second statement (which you redacted in
your reply):

Furthermore, in the clarification to rule 4[2] it clearly states: "The
main purpose of this rule is to always treat a grapheme cluster as a
single character—that is, as if it were simply the first character of
the cluster".
This again supports my understanding that
X Extend Y
should behave exactly the same as
X Y
once the Extend part has been absorbed into X, which is exactly what
I'm arguing for.
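
To make that reading concrete, here is a minimal sketch of how I would
apply rule 4 before evaluating the remaining rules. It only illustrates
my interpretation, not anything taken from the UAX or the test data;
the function name collapse_ignored is hypothetical, and the input is
assumed to be a list of Word_Break property values, one per code point.

```python
# Sketch of the "treat X (Extend | Format)* as X" reading of rule 4.
# Assumption: each input item is a Word_Break property value name.
IGNORED = {"Extend", "Format"}
# Rule 4 does not attach ignored characters to these (or to start of text).
NO_ATTACH = {"CR", "LF", "Newline"}


def collapse_ignored(props):
    """Drop Extend/Format values that directly follow a character they attach to."""
    out = []
    for p in props:
        # `out` being empty stands for start of text, where nothing is absorbed.
        if p in IGNORED and out and out[-1] not in NO_ATTACH:
            continue  # absorbed into the preceding character
        out.append(p)
    return out


# Under this reading, "X Extend Y" is handed to the later rules as "X Y".
print(collapse_ignored(["ALetter", "Extend", "ALetter"]))
```

The later rules would then run on the collapsed sequence, which is why
I expect both inputs to produce the same boundaries.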

Also take another look at
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
specifically the table that shows another way of writing the ignore
rule. This again shows my understanding of rule 4 is correct.

Specifically, look at the following equivalence:
X Y × Z W ⇒ X (Extend | Format)* Y (Extend | Format)* × Z (Extend | Format)* W
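
The equivalence above can be reproduced mechanically. The following is
only a sketch of the rewriting as I read that table: expand_rule is a
hypothetical helper, and it covers just the "X Y × Z W" form, where
every element except the last one on the right is followed by the
ignored run.

```python
# Sketch: spell out the ignored characters in a rule of the form "left × right".
IGNORE = "(Extend | Format)*"


def expand_rule(left, right):
    """Return the explicit form of 'left × right' with ignored runs written out."""
    # Every element on the left may be followed by ignored characters.
    lhs = [f"{x} {IGNORE}" for x in left]
    # On the right, all elements but the final one get the same suffix.
    rhs = [f"{x} {IGNORE}" for x in right[:-1]] + right[-1:]
    return " ".join(lhs + ["×"] + rhs)


print(expand_rule(["X", "Y"], ["Z", "W"]))
# X (Extend | Format)* Y (Extend | Format)* × Z (Extend | Format)* W
```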

--
Tom
Received on Wed Nov 23 2016 - 05:01:23 CST
