Public Review Issues

Accumulated Feedback on PRI #446

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Tue Apr 5 07:14:53 CDT 2022
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 446


In UAX #14, the descriptions of the line breaking classes Postfix_Numeric
(PO) and Prefix_Numeric (PR) don’t match the actual behaviour of the line
breaking algorithm when it comes to the treatment of intervening spaces.

The description of PO states:

    »Characters that usually follow a numerical expression may not be
    separated from preceding numeric characters or preceding closing
    characters, even if one or more space characters intervene. For
    example, there is no break opportunity in “(12.00) %”.«

And similarly, the description of PR states:

    »Characters that usually precede a numerical expression may not be
    separated from following numeric characters or following opening
    characters, even if a space character intervenes. For example,
    there is no break opportunity in “$ (100.00)”.«

However, the line breaking rules that govern these classes
(LB23a, LB24, LB25, LB27) contain no special provision for
intervening spaces. As a result, the strings given as examples *do* in fact
contain line breaking opportunities simply due to rule LB18 (Break after
spaces) – before the percent sign in the former and before the opening
parenthesis in the latter. This can be confirmed via the use of Unicode’s
online utility tool for breaks and segmentation
(https://util.unicode.org/UnicodeJsps/breaks.jsp).
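
The effect described above can be illustrated with a minimal Python sketch. It is a toy model, not an implementation of UAX #14: it hand-assigns classes for the characters in the two example strings and applies only simplified forms of LB7, LB13, LB14, LB18 and a subset of the LB25 pairs, falling back to "break everywhere else" (LB31). The names toy_class, toy_breaks and LB25_PAIRS are hypothetical and exist only for this illustration; results are meaningful only for strings like the two examples.

```python
# Toy model of a few UAX #14 rules; NOT a conformant line breaker.
# Class assignments are hand-written for the two example strings only.
TOY_CLASSES = {"(": "OP", ")": "CP", "%": "PO", "$": "PR", ".": "IS", " ": "SP"}

# Subset of the LB25 "do not break" pairs. Note that none of them involves SP.
LB25_PAIRS = {
    ("CP", "PO"), ("CP", "PR"), ("NU", "PO"), ("NU", "PR"),
    ("PO", "OP"), ("PO", "NU"), ("PR", "OP"), ("PR", "NU"),
    ("IS", "NU"), ("NU", "NU"),
}


def toy_class(ch):
    """Return the assumed line breaking class of a single character."""
    if ch.isdigit():
        return "NU"
    return TOY_CLASSES.get(ch, "AL")


def toy_breaks(text):
    """Return the indices before which this toy model allows a line break."""
    breaks = []
    for i in range(1, len(text)):
        before, after = toy_class(text[i - 1]), toy_class(text[i])
        if after == "SP":                   # LB7: do not break before spaces
            continue
        if after in {"CP", "IS"}:           # LB13 (simplified): no break before CP or IS
            continue
        if before == "OP":                  # LB14 (simplified, ignores SP*): no break after OP
            continue
        if before == "SP":                  # LB18: break after spaces
            breaks.append(i)
            continue
        if (before, after) in LB25_PAIRS:   # LB25: numeric context, no spaces involved
            continue
        breaks.append(i)                    # LB31: break everywhere else
    return breaks


print(toy_breaks("(12.00) %"))   # [8]: a break is allowed before "%"
print(toy_breaks("$ (100.00)"))  # [2]: a break is allowed before "("
```

Because LB18 fires before any LB25 pair is consulted, the space separates the PO or PR character from the numeric expression in this model, which matches the behaviour reported above.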

Date/Time: Sun Apr 10 20:12:11 CDT 2022
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: UAX 14

Section 3.1 of UAX 14 has the following description of the South East Asian
style of line breaking: “The third style is used for scripts such as Thai,
which do not use spaces, but which restrict word breaks to syllable
boundaries, whose determination requires knowledge of the language
comparable to that required by a hyphenation algorithm. Such an algorithm
is beyond the scope of the Unicode Standard.”

This description is odd in not starting out with line breaking, but with
word breaks, whose relevance to line breaking is not explained. The problem
statement I usually hear is that Thai, Lao, Khmer, and Myanmar allow line
breaks only at word boundaries, but do not mark word boundaries in any way,
so that they have to be determined by higher-level algorithms, typically
based on dictionaries. See, for example, the W3C layout requirements:

https://www.w3.org/International/sealreq/thai/#h_line_breaking
https://www.w3.org/International/sealreq/lao/#h_line_breaking
https://www.w3.org/International/sealreq/khmer/#h_line_breaking

The comparison with hyphenation algorithms is also questionable, as the
complexity of hyphenation algorithms can vary substantially between
languages.

Finally, Thai does use spaces to separate phrases.

I propose replacing the text quoted above with "The third style is used for
scripts such as Thai, which allow line breaks only at word boundaries, but
do not mark word boundaries in any way, so that the determination of line
break opportunities requires language dependent text analysis. Algorithms
and data for such analysis are beyond the scope of the Unicode Standard."

Date/Time: Fri Jun 3 10:22:13 CDT 2022
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: 446

UAX #14 says that U+23B6 BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET is a
member of class QU, but that has not been true for many years.

Date/Time: Fri Jun 3 09:10:34 CDT 2022
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: 446

The SY class is motivated by the commonness of URLs. Hebrew letters can
appear in URLs. What is the rationale for LB21b? Why is Hebrew special
among all scripts that can appear in URLs? Documenting the reason would
help implementers decide how to tailor the algorithm.

Maybe the reasoning is that, although Hebrew can appear in URLs, most URLs
are still ASCII, so a slash in Hebrew is probably not a URL slash and so
isn’t a break opportunity. However, if so, that reasoning applies to all
non-ASCII characters; the only reason Hebrew is treated specially is that
it happens to have its own line break class for an unrelated reason, not
because Hebrew is actually different from other scripts. If this is the
reason, there are two ways to make the algorithm more consistent. The first
is to delete LB21b. The second is to expand LB21b to all non-ASCII
alphabetic/symbol characters.
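
To make the three options concrete, here is a small Python sketch comparing them for a single letter following a solidus. It is only an illustration under stated assumptions: break_after_solidus and is_hebrew_letter are hypothetical helpers, Line_Break=HL is approximated by checking for "HEBREW LETTER" in the character name, the "expanded" variant covers only non-ASCII alphabetic characters (not symbols), and no other rules (for example LB25 for digits) are modeled.

```python
import unicodedata


def is_hebrew_letter(ch):
    """Rough stand-in for Line_Break=HL, based on the character name (an assumption)."""
    return unicodedata.name(ch, "").startswith("HEBREW LETTER")


def break_after_solidus(next_ch, variant):
    """Return True if this toy model allows a break between "/" and next_ch.

    variant is one of:
      "current"  - LB21b as specified (SY x HL)
      "deleted"  - LB21b removed
      "expanded" - LB21b widened to all non-ASCII alphabetic characters
    """
    if variant == "current":
        return not is_hebrew_letter(next_ch)
    if variant == "deleted":
        return True
    if variant == "expanded":
        return not (ord(next_ch) > 0x7F and next_ch.isalpha())
    raise ValueError(f"unknown variant: {variant}")


for ch in ("a", "\u00e9", "\u05d0"):  # Latin a, e with acute, Hebrew alef
    row = {v: break_after_solidus(ch, v) for v in ("current", "deleted", "expanded")}
    print(f"U+{ord(ch):04X}", row)
```

In this model only the "current" variant treats U+05D0 differently from U+00E9, which is the inconsistency the comment asks about.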

Date/Time: Fri Jun 3 19:49:05 CDT 2022
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: 446

L2/21-042 gives examples of U+2E55..U+2E5C within words, just like how
U+0029 is used in “(s)he”. It is central to these characters’ purpose to
appear within words, so it is likely that their line breaking works the
same as for U+0029. The closing characters U+2E56, U+2E58, U+2E5A, and
U+2E5C should therefore have Line_Break=Close_Parenthesis.

Date/Time: Mon Jul 11 20:35:10 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Anatolian hieroglyphic line breaks

The standard says that “Spaces are used in modern renditions of
[Anatolian] hieroglyphic text”; accordingly, most Anatolian hieroglyphs
have Line_Break=Alphabetic, such that there are no line break opportunities
within words. The only exceptions are U+145CE and U+145CF. If U+145CF
appears within a word, there is a line break opportunity after it. Is that
really true? It seems more likely that modern renditions of Anatolian
hieroglyphic text break on spaces, not within words. U+145CE and U+145CF
should therefore get Line_Break=Alphabetic.

Date/Time: Tue Jul 19 11:24:29 CDT 2022
Name: Brad Andalman
Report Type: Error Report
Opt Subject: UAX#14

UAX#14 [https://unicode.org/reports/tr14/] asserts that “Line breaks can
occur before and after an em dash.” It also claims that the only use for an
em dash is to “set off parenthetical text.” However, that is only one of
many ways that an em dash can be used in English.

The Chicago Manual of Style – beginning at entry 6.85 in the 17th edition –
enumerates numerous ways an em dash can be used. Entry 6.87 mentions that
an em dash should be used for “sudden breaks or interruptions.” One of the
examples it uses is as follows:

    “Well, I don’t know,” I began tentatively. “I thought I might—”
    “Might what?” she demanded.

If that trailing em dash followed by a quotation mark were to end on its own
line, it would look terrible. This is easy to make happen on a simple web
page (see my bug report to WebKit:
https://bugs.webkit.org/show_bug.cgi?id=242822), and it can often be seen
in Apple Books as well (e.g. when reading The Invisible Man by H.G. Wells).
This is because Apple Books is based on WebKit, which faithfully implements
the line-breaking behavior specified in UAX#14.

The Chicago Manual of Style addresses the problem of line breaks directly
(in 6.90): “In printed publications, line breaks should generally be made
after an em dash but not before, in the manner of hyphens. In the case of a
closing quotation mark (or any other mark of punctuation) immediately
following the dash, however, the quotation mark and dash *must not be
broken at the end of a line*” [emphasis mine].

It would be great if UAX#14 could be updated to reflect the varied uses of
em dashes in writing. Thanks!