Potential contradiction between the WordBreak test data and UAX #29 from Tom Hacohen on 2016-11-22 (Unicode Mail List Archive)

From: Tom Hacohen <tom_at_osg.samsung.com>
Date: Tue, 22 Nov 2016 12:07:16 +0000

Dear all,

I recently updated libunibreak[1] to Unicode 9.0.0. I thought I had
implemented it correctly; however, it fails two of the tests in the
reference test data:

÷ 200D × 0308 ÷ 2764 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
(Glue_After_Zwj) ÷ [0.3]

and

÷ 200D × 0308 ÷ 1F466 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]

More specifically, both fail at the position after the "combining
diaeresis". My implementation does not mark a break there, whereas the
test data does. The reference implementation, as expected, agrees with
the test data.
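
For readers unfamiliar with the test file notation, here is a minimal
sketch of how one of the lines above can be read. It assumes only what
the quoted lines show: ÷ marks a break opportunity, × marks no break,
code points are hexadecimal, and everything after "#" is a comment. The
function name is hypothetical and is not part of libunibreak.

```python
BREAK = "\u00f7"     # the ÷ marker: a break opportunity is expected here
NO_BREAK = "\u00d7"  # the × marker: no break is expected here


def parse_break_test_line(line):
    """Split one test line into code points and expected break flags."""
    # Drop the trailing comment, then split on whitespace.
    tokens = line.split("#", 1)[0].split()
    # Tokens alternate: marker, code point, marker, ..., marker.
    markers = tokens[0::2]
    code_points = [int(tok, 16) for tok in tokens[1::2]]
    # breaks[k] is the expectation for the position before code_points[k];
    # the last entry is the position at the end of the text.
    breaks = [m == BREAK for m in markers]
    return code_points, breaks


code_points, breaks = parse_break_test_line(
    "÷ 200D × 0308 ÷ 2764 ÷ # first failing test case"
)
```

Under this reading, breaks[2] is the position between 0308 and 2764,
which the test line records as ÷.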

However, looking at the test case and the UAX[2], this does not look
correct. Rule 4 (WB4) collapses the sequence:
ZWJ Extend GAZ -> ZWJ GAZ
Then, according to rule 3c (WB3c), there should be no break opportunity
between ZWJ and GAZ. The reference implementation, however, uses rule
999 here, which I believe is incorrect.
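
The reading described above can be sketched as follows. This is only an
illustration of the interpretation in this message, not the UAX #29
algorithm and not libunibreak's code. The property table is hand-written
from the class names in the two test lines, every name is hypothetical,
and only Extend is skipped when looking for the preceding character.

```python
ZWJ, EXTEND, GAZ, EBG, OTHER = "ZWJ", "Extend", "Glue_After_Zwj", "EBG", "Other"

# Hypothetical table covering only the code points in the two test cases.
WORD_BREAK = {0x200D: ZWJ, 0x0308: EXTEND, 0x2764: GAZ, 0x1F466: EBG}


def prop(cp):
    """Look up the word-break class, defaulting to Other."""
    return WORD_BREAK.get(cp, OTHER)


def is_break_after(code_points, i):
    """True if this reading puts a break between positions i and i + 1."""
    right = prop(code_points[i + 1])
    # Rule 4 as read here: never break before Extend or ZWJ ...
    if right in (EXTEND, ZWJ):
        return False
    # ... and ignore Extend when looking for the character on the left,
    # so that "ZWJ Extend GAZ" is treated like "ZWJ GAZ".
    j = i
    while j > 0 and prop(code_points[j]) == EXTEND:
        j -= 1
    left = prop(code_points[j])
    # Rule 3c: no break between ZWJ and Glue_After_Zwj or EBG.
    if left == ZWJ and right in (GAZ, EBG):
        return False
    # Rule 999: otherwise, break.
    return True


heart_case = is_break_after([0x200D, 0x0308, 0x2764], 1)
boy_case = is_break_after([0x200D, 0x0308, 0x1F466], 1)
```

Under these assumptions both heart_case and boy_case are False (no
break), while the test data records ÷ at that position in both cases.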

Am I missing anything, or is this an issue with the reference test data
and reference implementation?

Thanks,
Tom.

[1]: https://github.com/adah1972/libunibreak
[2]: http://www.unicode.org/reports/tr29/#WB1
Received on Tue Nov 22 2016 - 09:24:16 CST
