Re: Potential contradiction between the WordBreak test data and UAX #29

From: Tom Hacohen <tom_at_osg.samsung.com>
Date: Wed, 23 Nov 2016 11:28:41 +0000

On 23/11/16 11:20, Philippe Verdy wrote:
> 2016-11-23 12:00 GMT+01:00 Tom Hacohen <tom_at_osg.samsung.com>:
>
>
> Also take another look at
> http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
> specifically the table that shows another way of writing the ignore
> rule. This again shows my understanding of rule 4 is correct.
>
> In particular, look at the following equivalence:
> X Y × Z W ⇒ X (Extend | Format)* Y (Extend | Format)* ×
> Z (Extend | Format)* W
>
>
> This expansion does not occur before rule WB4, so it cannot be used to
> transform rules WB1 to WB3c; this is explicitly stated in the algorithm.
> Because rule WB3c handles your case, you are misinterpreting the
> spec as if the expansion applied there too...
>

I took a look at the ICU sources, and they explicitly mention this case,
so it seems I misinterpreted the intent of the UAX. I still find it
confusing, but based on this thread, it seems to be just me.

Sorry for the noise.

The comment from the ICU source code:
# Rule 3c ZWJ x (Extended_Pict | EmojiNRK). Precedes WB4, so no
# intervening Extend chars allowed.
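
To make the ordering concrete, here is a minimal Python sketch of the two
readings discussed in this thread. It is not the ICU implementation: the
property sets are tiny hand-picked stand-ins rather than real Unicode data,
and both function names are hypothetical. A False result only means that
rule 3c does not apply at that position; the later rules then decide.

```python
# Toy stand-ins for Unicode properties (assumption: NOT the real data files).
ZWJ = "\u200d"
EXTEND = {"\u0308", "\ufe0f"}      # combining diaeresis, variation selector-16
PICT = {"\u2764", "\U0001F466"}    # heavy black heart, boy

def wb3c_literal(text, i):
    """Rule 3c as written: ZWJ immediately followed by a pictograph.

    i is the boundary between text[i-1] and text[i] (code point indices).
    Returns True if rule 3c forbids a break at i.
    """
    return 0 < i < len(text) and text[i - 1] == ZWJ and text[i] in PICT

def wb3c_with_wb4_expansion(text, i):
    """The mistaken reading: apply the WB4 expansion to rule 3c as well,
    so Extend characters between ZWJ and the pictograph are skipped."""
    if not (0 < i < len(text)) or text[i] not in PICT:
        return False
    j = i - 1
    while j > 0 and text[j] in EXTEND:
        j -= 1
    return text[j] == ZWJ

plain = "\U0001F468\u200d\U0001F466"               # man, ZWJ, boy
with_extend = "\U0001F468\u200d\u0308\U0001F466"   # man, ZWJ, U+0308, boy

print(wb3c_literal(plain, 2), wb3c_with_wb4_expansion(plain, 2))
print(wb3c_literal(with_extend, 3), wb3c_with_wb4_expansion(with_extend, 3))
```

The two readings agree on the plain sequence and differ only when an Extend
character sits between the ZWJ and the pictograph, which is the case the ICU
comment calls out.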

Thanks for your help,
Tom
Received on Wed Nov 23 2016 - 05:29:11 CST
