Accumulated Feedback on PRI #469

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.


Date/Time: Fri Jan 6 18:26:42 CST 2023
Name: Marshall Stoner
Report Type: Error Report
Opt Subject: www.unicode.org/reports/tr29/


The Rule WB4 should be expanded and clarified.  As is, the algorithm may
break an Arabic numeric heading such as U+061C U+0600 U+0664 U+0666 in the
wrong place.  The word break rules should lead to "U+061C ÷ U+0600 ×
U+0664", *not* "U+061C × U+0600 ÷ U+0664". According to the same document,
the sequence "U+0600 U+0664" is a grapheme cluster that should not be
broken.  I think there should be a rule in addition to WB4 that clarifies
the break should come *after* most 'Format', 'Extend', or 'ZWJ' code
points, but 'Format' should exclude any format characters that are
subtending marks.  Format characters that are subtending marks should be
placed in a new category, and there should then be two rules:

    WB4a:    Any × ( Extend | Format | ZWJ )
    WB4b:    Prepend × Any

Therefore, if there is a sequence
[some letter] ( Extend | Format | ZWJ )* Prepend* [another letter],
the break should always occur after the "( Extend | Format | ZWJ )*" string
but *before* the "Prepend*" string.  Prepend should consist of the
characters excluded from Format.
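
As an illustration (not part of the submitted feedback), the two proposed
rules can be sketched in a few lines of Python. The class labels below are
assigned by hand for the example and are not real Word_Break property data,
the function name render_breaks is hypothetical, and the fallback that breaks
everywhere else is a toy stand-in for the rest of the word break rules.

```python
# Hand-assigned toy classes; an assumption for illustration only.
ATTACH_TO_PREVIOUS = {"Extend", "Format", "ZWJ"}  # proposed WB4a
ATTACH_TO_NEXT = {"Prepend"}                      # proposed WB4b


def render_breaks(items):
    """Render a list of (label, class) pairs with break marks between them."""
    out = [items[0][0]]
    for (_, left), (label, right) in zip(items, items[1:]):
        if right in ATTACH_TO_PREVIOUS:
            mark = "×"  # WB4a: Any × ( Extend | Format | ZWJ )
        elif left in ATTACH_TO_NEXT:
            mark = "×"  # WB4b: Prepend × Any
        else:
            mark = "÷"  # toy fallback, not the real remaining rules
        out += [mark, label]
    return " ".join(out)


heading = [("U+061C", "Format"), ("U+0600", "Prepend"), ("U+0664", "Numeric")]
print(render_breaks(heading))  # U+061C ÷ U+0600 × U+0664
```

Under these assumptions the subtending mark U+0600 stays attached to the
digit that follows it, which is the outcome the report asks for.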


Feedback above this line was reviewed during or prior to UTC #175 in April, 2023

Date/Time: Fri Jun 16 21:13:48 CDT 2023
ReportID: ID20230616211348
Name: Eiso Chan
Report Type: Public Review Issue
Opt Subject: 469

In Table 1c, “ri-sequence” and “RI-Sequence” are both used.

Maybe all “ri-sequence” in Table 1c should be “RI-Sequence”.

Date/Time: Tue Jun 20 13:51:08 CDT 2023
ReportID: ID20230620135108
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

I’m happy to see some progress in fixing UAX 29 for Brahmic scripts, even if
it’s initially only for 6 of the roughly 40 scripts that need a fix.

However, in the rule that defines consonant clusters, it’s not clear at all
whether the class ExtCccZwj includes or excludes the right characters. The
combining class for marks in Brahmic scripts (except for viramas and, up to
now, nuktas) should generally be 0, and assignments of other values were in
most cases mistakes that unfortunately cannot be corrected. Trying to
derive meaning from ccc values in Brahmic scripts is almost certainly a
mistake. Why should variation selectors be excluded from consonant
clusters? Is the exclusion of three Gujarati nuktas intentional? Is the
inclusion of Vedic tone marks intentional?

If combining classes are really considered the appropriate basis for
selecting characters that can occur within a consonant cluster, then this
should be explained. If not, then the class should be defined so as to
include the right characters, independent of ccc values.
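
As an illustration (not part of the submitted feedback), the combining class
values in question can be inspected with Python's standard unicodedata
module. The sketch assumes, as the comment implies, that membership in
ExtCccZwj depends on a nonzero ccc value; the helper name has_nonzero_ccc is
hypothetical, and the values reported depend on the Unicode version bundled
with the Python build.

```python
import unicodedata

# A few characters relevant to the comment above.
SAMPLES = [
    "\u094D",  # DEVANAGARI SIGN VIRAMA
    "\u093C",  # DEVANAGARI SIGN NUKTA
    "\u0951",  # DEVANAGARI STRESS SIGN UDATTA (a Vedic tone mark)
    "\u093F",  # DEVANAGARI VOWEL SIGN I
    "\uFE0F",  # VARIATION SELECTOR-16
]


def has_nonzero_ccc(ch):
    """True if the character's Canonical_Combining_Class is not 0."""
    return unicodedata.combining(ch) != 0


for ch in SAMPLES:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch):3d} "
          f"{unicodedata.name(ch)}")
```

A ccc-based test of this kind accepts the virama, nukta, and tone mark but
rejects the vowel sign and the variation selector, which is the sort of
outcome the comment questions.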

Date/Time: Wed Jun 21 10:21:17 CDT 2023
ReportID: ID20230621102117
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

UAX 29 uses the set operators “&” and “-” in several regular expressions. 
UTR 18 and Appendix A of The Unicode Standard have settled on “&&” 
and “--”. UAX 29 should follow.

Date/Time: Fri Jun 23 11:30:10 CDT 2023
ReportID: ID20230623113010
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

The proposed update of UAX 29 states twice in new text that “the default
grapheme clusters are also known as extended grapheme clusters”, and that
legacy grapheme clusters are defined as a profile. On the other hand,
existing text talks about a “key feature of default Unicode grapheme
clusters (both legacy and extended)”, notes that “default [i.e., extended]
Unicode grapheme clusters were previously referred to
as ‘locale-independent graphemes’” even though that note predates the
invention of extended grapheme clusters, has a section “Default Grapheme
Cluster Boundary Specification” that covers both legacy and extended
grapheme clusters, and requires “When citing the Unicode definition of
grapheme clusters, it must be clear which of the two alternatives are being
specified: extended versus legacy” as if there were no default.

The use of “default” and defaults with respect to grapheme clusters should
be reviewed and made consistent.

Date/Time: Fri Jun 23 11:30:48 CDT 2023
ReportID: ID20230623113048
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

UAX 29 has a note claiming that “The boundary between default Unicode
grapheme clusters can be determined by just the two adjacent characters”.
Looking at rules GB9c, GB11, GB12, and GB13, I don’t believe this is true.
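
As an illustration (not part of the submitted feedback), rules GB12 and GB13
pair up regional indicators, so the decision between two adjacent regional
indicators depends on how many of them precede the boundary. The following
is a minimal sketch of only that part of the rules; the function names are
hypothetical.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z


def is_ri(ch):
    return RI_FIRST <= ord(ch) <= RI_LAST


def ri_break_allowed(text, i):
    """Boundary between text[i-1] and text[i], both regional indicators.

    GB12/GB13 forbid a break after an odd number of regional indicators,
    so the run of indicators ending at text[i-1] has to be counted.
    """
    run = 0
    j = i - 1
    while j >= 0 and is_ri(text[j]):
        run += 1
        j -= 1
    return run % 2 == 0


# Regional indicators D, E, F, R (two flag pairs).
flags = "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7"
for i in range(1, len(flags)):
    print(i, "÷" if ri_break_allowed(flags, i) else "×")
```

Positions 1 and 2 both sit between two regional indicators, yet only
position 2 is a boundary, so the two adjacent characters alone do not decide
the result.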

Date/Time: Fri Jun 23 11:31:50 CDT 2023
ReportID: ID20230623113150
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

The description of Table 2a states “each macro represents a repeated union
of the basic Grapheme_Cluster property values”. This seems to be
incorrectly adapted from the descriptions of other tables. In reality, the
table uses intersection and difference rather than union, and uses several
other Unicode properties besides Grapheme_Cluster_Break (the real name
of “Grapheme_Cluster”).

The other macro tables in UAX 29 consider “represents” clear enough without
a “=” sign; I think this would work here too.

Date/Time: Fri Jun 23 11:32:29 CDT 2023
ReportID: ID20230623113229
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

When rule GB9c is rendered in a narrow view (such as a printed page), it appears as

LinkingConsonant ExtCccZwj* × LinkingConsonant
ConjunctLinker ExtCccZwj*

which invites a reading very different from the intended one.

The rendering could be improved by using “vertical-align: bottom” on the
last two cells of the row. 

Date/Time: Fri Jun 23 11:33:23 CDT 2023
ReportID: ID20230623113323
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

The introduction to word boundaries in UAX 29 has a paragraph on the
relationship between word boundaries and line boundaries. It should be
clarified that this relationship exists only in some scripts, not in
others. In Chinese, Japanese, Balinese, Brahmi, etc., line breaking pays no
attention to words. Also, thanks to hyphenation engines for languages where
words do matter for line breaking, line breaks within words are far more
common than the statement on SHY would imply.

The last paragraph in the same section mentions three Line_Break property
values and then states “that means that satisfactory treatment of languages
like Chinese or Thai requires special handling”. Chinese uses none of the
three Line_Break property values, and while word breaking for Chinese
requires special handling, that has nothing to do with its line breaking.

Date/Time: Sun Jun 25 06:15:35 CDT 2023
ReportID: ID20230625061535
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 469

Section 3, “Grapheme Cluster Boundaries”, states:

	»Word boundaries, line boundaries, and sentence boundaries should
	 not occur within a grapheme cluster: in other words, a grapheme
	 cluster should be an atomic unit with respect to the process of
	 determining these other boundaries.«

This does not actually hold true for line boundaries when an emoji modifier
is applied to a non-standard base character. For example, the
sequence <U+1F9DF, U+1F3FB> 🧟🏻 (ZOMBIE, EMOJI MODIFIER FITZPATRICK
TYPE-1-2) is a single grapheme cluster because emoji modifiers have
Grapheme_Cluster_Break=Extend, but nonetheless a line break is
theoretically allowed between the two characters because ZOMBIE has
Emoji_Modifier_Base=False and line break rule LB30b applies only to
characters with Emoji_Modifier_Base=True or unassigned code points with
Extended_Pictographic=True. In fact, Chromium-based web browsers will break
lines in the middle of these sequences.

Date/Time: Fri Jun 30 07:45:06 CDT 2023
ReportID: ID20230630074506
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 469

Table 1c defines the following regex pattern:

	conjunctCluster := LinkingConsonant ExtCccZwj* (ConjunctLinker ExtCccZwj* LinkingConsonant)+

If we expand the “(ConjunctLinker ExtCccZwj* LinkingConsonant)+” part, we
get a sequence pattern where ExtCccZwj can occur only *after* a
ConjunctLinker but not *before* it:

	ConjunctLinker ExtCccZwj* LinkingConsonant ConjunctLinker ExtCccZwj* LinkingConsonant ConjunctLinker ExtCccZwj* LinkingConsonant ...

This does not match rule GB9c which accounts for ExtCccZwj in both
positions, which is necessary because Indic scripts make use of combining
marks with CCC values both smaller and greater than 9 (Virama). Therefore I
think the definition should actually be:

	conjunctCluster := LinkingConsonant ExtCccZwj* (ConjunctLinker ExtCccZwj* LinkingConsonant ExtCccZwj*)+
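
As an illustration (not part of the submitted feedback), the difference
between the two definitions can be checked with ordinary regular expressions
by encoding each class as a single letter. The encoding is an assumption made
for this sketch: L stands for LinkingConsonant, E for ExtCccZwj, and V for
ConjunctLinker.

```python
import re

# One letter per class (hypothetical encoding for this sketch):
#   L = LinkingConsonant, E = ExtCccZwj, V = ConjunctLinker
TABLE_1C = re.compile(r"LE*(VE*L)+")    # definition as given in Table 1c
PROPOSED = re.compile(r"LE*(VE*LE*)+")  # definition suggested above


def matches(pattern, classes):
    """True if the whole class string is one conjunctCluster."""
    return pattern.fullmatch(classes) is not None


# An ExtCccZwj between the second consonant and the second linker.
sample = "LVLEVL"
print("Table 1c:", matches(TABLE_1C, sample))
print("Proposed:", matches(PROPOSED, sample))
```

The Table 1c pattern rejects the sample because no ExtCccZwj may follow a
non-initial consonant, while the suggested pattern accepts it. The suggested
pattern also accepts trailing ExtCccZwj after the final consonant (for
example "LVLE"), which the Table 1c pattern does not.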

Date/Time: Tue Jul 04 17:39:03 CDT 2023
ReportID: ID20230704173903
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

The discussion of Aksaras in UAX 29 states that “consonant cluster aksaras
are not incorporated into the default rules”. That’s no longer correct;
such aksaras are now incorporated for six scripts, and more will hopefully
follow.

The same paragraph mentions “additional prefixed consonants”. That seems to
reflect a Devanagari-centric view, as in many other scripts the additional
consonants are better described as “subjoined” or in other terms. I suggest
removing the word “prefixed”.

Date/Time: Tue Jul 04 17:39:43 CDT 2023
ReportID: ID20230704173943
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

The proposed update of UAX 29 states “Boundaries never occur within a
combining character sequence or conjoining sequence, so the boundaries
within non-NFD text can be derived from corresponding boundaries in the NFD
form of that text.” Unfortunately, the stated condition is not sufficient.
It would also be necessary that normalization didn’t reorder characters out
of character pairs that should not be broken up. As the section 
“Compatibility with normalization” of L2/23-141 discusses, it sometimes 
does, and workarounds are necessary to achieve the desired results in 
normalized text.
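
As an illustration (not part of the submitted feedback, and not an example
taken from L2/23-141), canonical reordering can be observed with Python's
standard unicodedata module. The sketch only shows that normalization
changes which marks are adjacent; whether a particular pair matters for a
given segmentation rule is not asserted here.

```python
import unicodedata

# DEVANAGARI LETTER KA, SIGN VIRAMA (ccc 9), SIGN NUKTA (ccc 7)
typed = "\u0915\u094D\u093C"
nfd = unicodedata.normalize("NFD", typed)


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


print("input:", code_points(typed))
print("NFD:  ", code_points(nfd))
```

Because the nukta has the lower combining class, NFD moves it in front of
the virama, so the virama is no longer adjacent to the consonant it followed
in the input.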

Date/Time: Tue Jul 04 17:40:02 CDT 2023
ReportID: ID20230704174002
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469

Document L2/23-140, Setting expectations for grapheme clusters, is intended
to be feedback to PRI 469.