Public Review Issues

Accumulated Feedback on PRI #322

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Fri Mar 11 04:16:11 CST 2016
Name: Sascha Brawer
Report Type: Error Report
Opt Subject: Missing reference in TR14/TR41

Section 3.1 of TR14 has a broken link.

The final paragraph of http://www.unicode.org/reports/tr14/#BreakOpportunities
says: “In bidirectional text, line breaks are determined before applying rule
L1 of the Unicode Bidirectional Algorithm [Bidi].” The [Bidi] link points to
http://www.unicode.org/reports/tr41/tr41-17.html#Bidi but there is no #Bidi
anchor in TR41.

Date/Time: Wed Apr 20 15:19:33 CDT 2016
Name: Andy Heninger
Report Type: Error Report
Opt Subject: UAX 14 feedback, PRI #322


The UAX-14 line breaking of numbers beginning with a decimal point can be bad.
Consider the string "start .789 end".

With the default rules there will be only one break, "start .789 |end". Rule
LB13, "× IS", will prevent a break before the number.

With the tailoring of numbers from example 7 of section 8.2 there will be an
unexpected break after the full stop, yielding "start .|789 |end", because the
regular expression for numbers does not allow a character of class IS to
precede the first digit.
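
The mismatch can be reproduced with a small sketch. The class table below is
a hypothetical, heavily reduced stand-in for LineBreak.txt, the expression is
a transcription of the number pattern from example 7 that should be checked
against the current text of UAX #14, and the names lb_class and number_span
are illustrative only.

```python
import re

# Number pattern from example 7, written over space-terminated class names:
# (PR | PO)? (OP | HY)? NU (NU | SY | IS)* (CL | CP)? (PR | PO)?
NUMBER = re.compile(
    r"(?:(?:PR|PO) )?(?:(?:OP|HY) )?NU (?:(?:NU|SY|IS) )*"
    r"(?:(?:CL|CP) )?(?:(?:PR|PO) )?"
)

def lb_class(ch):
    """Hypothetical, reduced class lookup (assumption, not LineBreak.txt)."""
    if "0" <= ch <= "9":
        return "NU"
    if ch in ".,:;":
        return "IS"
    if ch == "/":
        return "SY"
    if ch == " ":
        return "SP"
    return "AL"

def number_span(text, start):
    """Return how many characters the number pattern covers from `start`."""
    tokens = "".join(lb_class(ch) + " " for ch in text[start:])
    m = NUMBER.match(tokens)
    return len(m.group(0).split()) if m else 0

text = "start .789 end"
print(number_span(text, 6))  # at the full stop: the pattern cannot start on IS
print(number_span(text, 7))  # at the first digit: covers only "789"
```

Under these assumptions the pattern never covers the leading full stop, so
the tailored number rule does not keep ".789" together.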

How this might be fixed will require some thought.

This problem was originally reported by Bernhard Fey in an ICU bug report,
http://bugs.icu-project.org/trac/ticket/12017

Date/Time: Tue Apr 26 15:03:57 CDT 2016
Name: Andy Heninger
Report Type: Public Review Issue
Opt Subject: UAX 14 feedback

Line Break rule LB1 says that, in the absence of other criteria, unknown 
characters (class XX) should be treated as alphabetic (class AL).

There is no break opportunity between alphabetic characters.

Emoji are affected by this. New emoji characters tend to be adopted
extremely quickly, so line-break implementations that have not been
updated see them as unknown, treat them as AL, and allow no break between
them. Treating unknown characters as class ID might give better results.

Alternatively, the default could be based on blocks, treating unassigned
characters from blocks for alphabetic scripts as AL and all others as ID.
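
The effect of the fallback choice can be sketched as follows. This is a
minimal illustration restricted to the classes AL and ID; resolve_unknown and
break_allowed are hypothetical names, and the only pair rule modelled is that
AL × AL is not a break opportunity while the other AL/ID pairs are.

```python
def resolve_unknown(lb_class, fallback="AL"):
    """LB1-style resolution: map class XX to a fallback class."""
    return fallback if lb_class == "XX" else lb_class

def break_allowed(before, after):
    """Break opportunity between two resolved classes (AL and ID only)."""
    if before not in ("AL", "ID") or after not in ("AL", "ID"):
        raise ValueError("this sketch only models AL and ID")
    # No break between two alphabetic characters; the remaining
    # AL/ID combinations are break opportunities.
    return not (before == "AL" and after == "AL")

# Two adjacent characters that an outdated implementation sees as XX.
pair = ["XX", "XX"]

as_al = [resolve_unknown(c, "AL") for c in pair]
as_id = [resolve_unknown(c, "ID") for c in pair]

print(break_allowed(*as_al))  # current default: resolved to AL
print(break_allowed(*as_id))  # suggested default: resolved to ID
```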

Date/Time: Fri Apr 29 16:06:51 CDT 2016
Name: Marcin Grzegorczyk
Report Type: Public Review Issue
Opt Subject: UAX 14 feedback (PRI #322)

The new rule LB30a (as of rev. 36 draft 1) cannot be implemented with a pair
table without extra processing. If the rule is to retain its extended context,
then the implementation presented in chapter 7 – not just the pair table, but
the sample code, too – will have to be updated to account for it.

One possible implementation of LB30a is to introduce an additional, artificial
line breaking class – let’s call it RI2 – and change RI into RI2 in the main
loop if the previous class (taking LB9 into account) was RI (RIs previously
mapped to RI2 would not count). Then LB30a can be rewritten as (RI × RI2) and
(RI2 ÷ RI), which can be implemented directly in the pair table.
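
The remapping described above might look like the following sketch. It
assumes the input is a list of class names in which combining marks have
already been absorbed per LB9, and it models only the two pair-table entries
named in the text; remap_regional_indicators, RI_PAIRS and ri_breaks are
hypothetical names.

```python
def remap_regional_indicators(classes):
    """Turn every second RI of a run into the artificial class RI2."""
    out = []
    prev = None
    for cls in classes:
        # Only a preceding real RI triggers the remap; an RI that was
        # itself remapped to RI2 does not count.
        if cls == "RI" and prev == "RI":
            cls = "RI2"
        out.append(cls)
        prev = cls
    return out

# The two pair-table entries that replace the extended-context rule.
RI_PAIRS = {
    ("RI", "RI2"): False,  # RI × RI2: no break inside a pair
    ("RI2", "RI"): True,   # RI2 ÷ RI: break between pairs
}

def ri_breaks(classes):
    """Break decisions between adjacent classes of a pure RI run."""
    mapped = remap_regional_indicators(classes)
    return [RI_PAIRS[pair] for pair in zip(mapped, mapped[1:])]

print(remap_regional_indicators(["RI"] * 4))
print(ri_breaks(["RI"] * 4))
```

Because the previous class is tracked after remapping, a run of four RIs
becomes RI RI2 RI RI2, and an odd trailing RI stays RI.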

This is part of a broader issue with regional indicators. Because there is
only a single set of regional indicator symbol letters (as opposed to separate
sets of leading and trailing letters), if some process accidentally breaks a
string of RIs on an odd boundary (e.g. due to a limited buffer size) the
entire part of that string following the break is corrupted. (This is similar
to the weakness inherent in several multi-byte character sets such as EUC-CN.)
The new rule LB30a provides a partial mitigation of the problem (direct break
opportunities in strings of RIs significantly reduce the number of cases where
an emergency line break is required), but the fundamental issue remains.
However, I guess it is too late now to add a second set of trailing RI symbol
letters, although it might be a good idea to re-include (perhaps in the main
Standard text) the recommendation to insert ZWSP (or WJ if a break is
undesirable) between pairs of RIs, since it would limit the potential damage
to a single RI pair.
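
A minimal sketch of that recommendation, assuming the regional indicator
symbol letters occupy U+1F1E6..U+1F1FF; separate_ri_pairs is a hypothetical
helper name, and pairs are counted from the start of each run.

```python
import re

RI = "[\U0001F1E6-\U0001F1FF]"

# A pair of regional indicators that is followed by at least one more.
_PAIR_BEFORE_MORE = re.compile("(" + RI + "{2})(?=" + RI + ")")

ZWSP = "\u200B"  # ZERO WIDTH SPACE: allows a break between pairs
WJ = "\u2060"    # WORD JOINER: use instead if a break is undesirable

def separate_ri_pairs(text, sep=ZWSP):
    """Insert `sep` after each RI pair that is followed by another RI."""
    return _PAIR_BEFORE_MORE.sub(lambda m: m.group(1) + sep, text)

de = "\U0001F1E9\U0001F1EA"  # regional indicators D, E
fr = "\U0001F1EB\U0001F1F7"  # regional indicators F, R

separated = separate_ri_pairs(de + fr)
print([hex(ord(ch)) for ch in separated])
```

With the separator in place, a run that is cut at an odd position can only
mispair the indicators up to the next separator.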