Re: Potential contradiction between the WordBreak test data and UAX #29 from Philippe Verdy on 2016-11-22 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 23 Nov 2016 03:49:08 +0100

IMHO, the ZWJ should glue with the last symbol in your examples.
But the combining diaeresis following the ZWJ extends it (even if in my
opinion it is "defective" and would likely display on a dotted circle in
renderers, it is not defective under the string definition of combining
sequences).
So ignore it and test whether the last symbol glues with the ZWJ (it should, so
there's no break in the reference implementation).

WB4: X (Extend | Format | ZWJ)* → X

Extend: [Grapheme_Extend=Yes] This includes:
  General_Category = Nonspacing_Mark (this includes the combining diaeresis)
  General_Category = Enclosing_Mark
  U+200C ZERO WIDTH NON-JOINER
  plus a few General_Category = Spacing_Mark needed for canonical
equivalence.
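
As a rough illustration of the list above, here is a minimal Python sketch. The
helper name is_extend_approx is hypothetical, and it is only an approximation:
the standard unicodedata module does not expose Grapheme_Extend directly, so the
sketch tests the General_Category values named above and deliberately leaves out
the few Spacing_Mark characters.

```python
import unicodedata

def is_extend_approx(ch: str) -> bool:
    """Approximate the Extend class from the categories listed above.

    Assumption: the handful of Spacing_Mark characters needed for
    canonical equivalence are NOT covered by this sketch.
    """
    if ch == "\u200c":  # U+200C ZERO WIDTH NON-JOINER
        return True
    # Mn = Nonspacing_Mark, Me = Enclosing_Mark
    return unicodedata.category(ch) in ("Mn", "Me")

print(is_extend_approx("\u0308"))  # COMBINING DIAERESIS is a Nonspacing_Mark
print(is_extend_approx("\u2764"))  # HEAVY BLACK HEART is a symbol, not a mark
```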

So yes, rule WB4 would give: ZWJ "COMBINING DIAERESIS" (EBG|Glue_After_Zwj) →
ZWJ (EBG|Glue_After_Zwj), eliminating the combining mark from the input
queue.

But rule WB3c comes before WB4 and prevents this:

WB3c: ZWJ × (Glue_After_Zwj | EBG)

This means that you have first:

ZWJ "COMBINING DIAERESIS" EBG → ZWJ × "COMBINING DIAERESIS" EBG

and this does not match rule WB4, which does not apply to:

X × (Extend | Format | ZWJ)* → X

(WB4 cannot remove the extenders if there's a no-break before them; it is
valid only when the break opportunity is still unspecified.) As soon as a
rule has produced a "break here" or "no break here" at a given position, you
must advance past this position (the rules are based on a small finite
state machine). So after:

ZWJ "COMBINING DIAERESIS" EBG → ZWJ × "COMBINING DIAERESIS" EBG

only this remains in your input queue:

"COMBINING DIAERESIS" EBG (because "ZWJ ×" is already processed, and so ZWJ
is eliminated)

Now comes WB4: X (Extend | Format | ZWJ)* → X

There is no longer any "X" to match before the combining diaeresis: your input
queue starts with the combining diaeresis, which matches "X". The following
character (EBG) does not match "(Extend | Format | ZWJ)*" (which then
matches the empty string and does not contain the combining diaeresis already
matched as "X"), so rule WB4 has no replacement effect and preserves the
initial "X" (i.e. the combining diaeresis).
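
To make the difference between the two readings concrete, here is a toy Python
sketch. It is not the reference implementation: it works on class labels rather
than real characters, models only WB3c, WB4 and the default rule WB999, and the
function names are made up for this illustration. The first function collapses
with WB4 before applying WB3c (the reading in the quoted question below); the
second applies WB3c only when ZWJ is immediately adjacent to GAZ/EBG.

```python
SKIPPABLE = {"Extend", "Format", "ZWJ"}  # classes absorbed by WB4
GLUE = {"GAZ", "EBG"}                    # right-hand side of WB3c

def breaks_collapse_first(classes):
    """Apply WB4 first, then WB3c on the collapsed sequence.

    Returns one bool per interior boundary; True means "break here".
    """
    result = []
    base = classes[0]  # the current "X" of WB4
    for cur in classes[1:]:
        if cur in SKIPPABLE:
            result.append(False)  # WB4: absorbed into X, no break
            continue
        # WB3c on the collapsed sequence, otherwise WB999 (break)
        result.append(not (base == "ZWJ" and cur in GLUE))
        base = cur
    return result

def breaks_adjacent_only(classes):
    """Apply WB3c only when ZWJ directly precedes GAZ/EBG."""
    result = []
    for prev, cur in zip(classes, classes[1:]):
        if prev == "ZWJ" and cur in GLUE:
            result.append(False)  # WB3c
        elif cur in SKIPPABLE:
            result.append(False)  # WB4: no break before Extend/Format/ZWJ
        else:
            result.append(True)   # WB999: break everywhere else
    return result

seq = ["ZWJ", "Extend", "GAZ"]     # U+200D U+0308 U+2764
print(breaks_collapse_first(seq))  # [False, False]
print(breaks_adjacent_only(seq))   # [False, True]
```

The second result, no break after the ZWJ and a break before the last symbol,
is the pattern recorded in the test data line "÷ 200D × 0308 ÷ 2764 ÷".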

2016-11-22 13:07 GMT+01:00 Tom Hacohen <tom_at_osg.samsung.com>:

> Dear,
>
> I recently updated libunibreak[1] according to Unicode 9.0.0. I thought I
> implemented it correctly, however it fails against two of the tests in the
> reference test data:
>
> ÷ 200D × 0308 ÷ 2764 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
> (Glue_After_Zwj) ÷ [0.3]
>
> and
>
> ÷ 200D × 0308 ÷ 1F466 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]
>
>
> More specifically, it fails in both after the "combining diaeresis". My
> implementation marks it as a break, whereas the test data does not. The
> reference implementation, as expected, agrees with the test data.
>
>
> However, looking at the test case and the UAX[2], this does not look
> correct. More specifically, because of rule 4:
> ZWJ Extend GAZ -> ZWJ GAZ
> And then according to rule 3c, there should be no break opportunity
> between them. The reference implementation, however, uses rule 999 here,
> which I believe is incorrect.
>
>
> Am I missing anything, or is this an issue with the reference test data
> and reference implementation?
>
> Thanks,
> Tom.
>
> [1]: https://github.com/adah1972/libunibreak
> [2]: http://www.unicode.org/reports/tr29/#WB1
>
Received on Tue Nov 22 2016 - 20:50:11 CST