Accumulated Feedback on PRI #355

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Mon Sep 25 10:21:46 CDT 2017
Name: Tsuyoshi Ito
Report Type: Error Report (PRI #355)
Opt Subject: The regular expressions for a grapheme cluster in Table 1b of UAX #29 do not match the rules in Section 3.1.1

Table 1b of UAX #29, “Unicode Text Segmentation”
(http://www.unicode.org/reports/tr29/tr29-31.html), shows the regular
expressions for a legacy and an extended grapheme cluster. Section 6.3 seems
to indicate that they are supposed to be equivalent to the rules in Section
3.1.1.

(However, to be honest, the wording of Section 6.3 is not very clear to me.
It says “The conversion into a regular expression is fairly straightforward
for the grapheme cluster boundaries of Table 2.” but Table 2 is a summary
of the Grapheme_Cluster_Break property values, not the rules to determine
grapheme cluster boundaries.)

However, I think that they are quite different. For example:

* According to the rules in Section 3.1.1, a string of more than two
regional indicator symbols is not a (legacy or extended) single grapheme
cluster. However, according to the regular expressions in Table 1b, it is
a single (legacy and extended) grapheme cluster.

* According to the rules in Section 3.1.1, an emoji zwj sequence is a single
grapheme cluster. However, neither the regular expression for a legacy
grapheme cluster nor the one for an extended grapheme cluster treats ZWJ in
a special way, so each puts a grapheme cluster boundary before and after ZWJ.
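
The first discrepancy can be illustrated with a minimal Python sketch. It
assumes, as the report implies, that the regular expression treats any run of
regional indicators as one cluster, and it models the Section 3.1.1 behaviour
as "break after every second regional indicator". The names RI_RUN, RI_PAIR
and clusters are hypothetical and appear in neither table.

```python
import re

# Regional indicator symbols occupy U+1F1E6..U+1F1FF.
RI = "[\U0001F1E6-\U0001F1FF]"

# Assumed shape of the regular expression described in the report:
# any run of regional indicators is matched as one cluster.
RI_RUN = re.compile(f"{RI}+")

# Pairing behaviour attributed to the Section 3.1.1 rules:
# at most two regional indicators per cluster.
RI_PAIR = re.compile(f"{RI}{RI}?")

def clusters(pattern, text):
    """Return the successive matches of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]

# Four regional indicators: D, E, F, R.
flags = "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7"

run_result = clusters(RI_RUN, flags)    # one cluster of four symbols
pair_result = clusters(RI_PAIR, flags)  # two clusters of two symbols
print(len(run_result), len(pair_result))
```

Under these assumptions the same four-symbol string yields one cluster with
the run pattern and two clusters with the pairing pattern.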

Please consider one of the following changes: Option 1: The regular
expressions in Table 1b (and the regular expressions in Table 1c used there)
should be updated to match the rules in Section 3.1.1. Option 2: The text of
Sections 3.1.1 and 6.3 should be updated to clarify that the regular
expressions in Table 1b do not necessarily match the rules in Section 3.1.1.

Date/Time: Tue Oct 10 10:06:06 CDT 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: PRI #355: Cursors in ligatures

“For example, the text editing framework must know if a digraph is represented 
as a single glyph in the font, which therefore cannot have a cursor separating 
its two parts.” That is not true: text editing frameworks can and do put cursors 
within ligature glyphs.

Date/Time: Tue Oct 10 10:15:49 CDT 2017
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #355: Devanagari kshi does not need tailoring

Table 1a lists ⟨क्षि⟩ as a tailored grapheme cluster, but it no longer needs tailoring.

Date/Time: Tue Oct 10 10:29:29 CDT 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: PRI #355: More LinkingConsonants

Grapheme_Cluster_Break=LinkingConsonant should be expanded to include 
Indic_Syllabic_Category=Vowel_Independent and Indic_Syllabic_Category=Consonant_Dead. 
Independent vowels may be subjoined in Khmer, and Bengali’s khanda ta may take a repha.

Date/Time: Wed Oct 11 14:18:02 CDT 2017
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #355: Indic clusters without virama

Some Indic consonant clusters do not use a virama. GB9c should be 
(StackingConsonant | Virama | ZWJ) × LinkingConsonant, where StackingConsonant 
is Indic_Syllabic_Category = Consonant_With_Stacker.
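
As a rough sketch only, the suggested rule can be read as a check on a pair of
adjacent break classes. The class names come from the report; the function
name gb9c_forbids_break is hypothetical, and the mapping from characters to
classes is deliberately left out.

```python
# Classes that, per the suggested GB9c, join to a following LinkingConsonant.
JOINING_CLASSES = {"StackingConsonant", "Virama", "ZWJ"}

def gb9c_forbids_break(before, after):
    """Return True when the suggested GB9c forbids a break between two classes."""
    return before in JOINING_CLASSES and after == "LinkingConsonant"

print(gb9c_forbids_break("StackingConsonant", "LinkingConsonant"))
print(gb9c_forbids_break("LinkingConsonant", "LinkingConsonant"))
```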

Date/Time: Wed Oct 11 14:33:49 CDT 2017
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #355: U+0BCD TAMIL SIGN VIRAMA

U+0BCD TAMIL SIGN VIRAMA generally does not create conjuncts. The exceptions 
are ⟨க்ஷ⟩ and ⟨ஶ்ரீ⟩. It may better match user expectations to exclude 
U+0BCD from GCB = Virama.

Date/Time: Thu Oct 19 17:50:25 CDT 2017
Name: Roozbeh Pournader
Report Type: Public Review Issue
Opt Subject: Virama and UAX #29

Characters with the Indic Syllabic Category of Virama have a dual usage, and
it appears to me that a virama is more frequently just a visible killer
than an invisible stacker.

Tamil is a common example where the visible killer frequency is much higher
than the invisible stacker frequency. But I expect several other scripts
would have a similar situation, and even for languages such as Hindi, the
frequency of visible killer usage would be too high for always disallowing a
grapheme break.

If breaks after this class are being forbidden, I suggest removing the
Indic_Syllabic_Category = Virama class from the new virama class, and
renaming the class to InvisibleStacker.

Also, forbidding breaks between ZWJ and LinkingConsonant appears incorrect.
ZWJ is generally used in Indic as an invisible letter. So in that usage, it
could be thought of as ending a cluster with a break allowed after it. Also
note that ZWJ is used after virama in the legacy representation of Malayalam
Chillus, which is still very common on the internet and in newly created
content. Forbidding a cluster break between a ZWJ and a consonant would be
incorrect in such usage.

Finally, note that it's not just characters of InSC=Consonant that take
post-stacker forms. Independent Vowels, Consonant Placeholders, and perhaps
Consonant_Deads and Consonant_With_Stacker may appear after stackers. There
may even be more.

Altogether, I think the proposed rules are based on a simplified version of
the Indic grapheme cluster patterns which needs much more research. They
should be rewritten to only discourage breaks in these non-controversial
cases:

1. Forbid grapheme breaks after all characters of InSC=Invisible_Stacker,
regardless of the character that comes after. (We don't need to worry about
odd cases, like when an Invisible_Stacker is followed by a space or
punctuation. These are malformed text, and it's OK to go either way on
malformed text.)

2. Forbid breaks before all InSC={Virama, Invisible_Stacker, Pure_Killer}
(note that this is already the case, since they are currently categorized as
Extend, but may be necessary if the Extend class is split).
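
The two rules above can be sketched in Python as follows. This is an
illustration under stated assumptions, not a proposed implementation: the
INSC table lists only a handful of characters by hand (a real implementation
would read IndicSyllabicCategory.txt), and the function name break_forbidden
is hypothetical.

```python
# Hand-written InSC values for a few characters, for illustration only.
INSC = {
    "\u0915": "Consonant",          # DEVANAGARI LETTER KA
    "\u094D": "Virama",             # DEVANAGARI SIGN VIRAMA
    "\u1000": "Consonant",          # MYANMAR LETTER KA
    "\u1039": "Invisible_Stacker",  # MYANMAR SIGN VIRAMA
    "\u103A": "Pure_Killer",        # MYANMAR SIGN ASAT
}

NO_BREAK_BEFORE = {"Virama", "Invisible_Stacker", "Pure_Killer"}

def break_forbidden(prev_char, next_char):
    """Apply the two suggested rules to a pair of adjacent characters."""
    # Rule 1: no break after an Invisible_Stacker, whatever follows it.
    if INSC.get(prev_char) == "Invisible_Stacker":
        return True
    # Rule 2: no break before a Virama, Invisible_Stacker, or Pure_Killer.
    if INSC.get(next_char) in NO_BREAK_BEFORE:
        return True
    return False

print(break_forbidden("\u1039", "\u1000"))  # stacker then consonant
print(break_forbidden("\u094D", "\u0915"))  # virama then consonant
```

Under this sketch a break after U+094D is not forbidden, which matches the
point that a virama is often a visible killer.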

Date/Time: Mon Dec 11 12:35:36 CST 2017
Name: Otto Stolz
Report Type: Error Report
Opt Subject: Proposed Update Unicode® Standard Annex #29

http://www.unicode.org/reports/tr29/tr29-32.html#Word_Boundaries

Figure 2 does not match the pertinent text which says:
“That is done with the above boundaries by ignoring any words that do not 
contain a letter, as in Figure 2.” In contrast, Figure 2 includes the word 
“32.3”, which does not contain any letter.
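
The filter described in the quoted text can be sketched as below. The segment
list is written by hand rather than produced by a UAX #29 implementation, the
function name words_with_letters is hypothetical, and Python's str.isalpha is
used as a stand-in for "contains a letter".

```python
def words_with_letters(segments):
    """Keep only the segments that contain at least one letter."""
    return [s for s in segments if any(ch.isalpha() for ch in s)]

# Hand-written segments, assumed to be the pieces between word boundaries.
segments = ["jump", " ", "32.3", " ", "feet", ",", " ", "right", "?"]

selected = words_with_letters(segments)
print(selected)  # ['jump', 'feet', 'right']
```

With this reading of the text, "32.3" is dropped along with the spaces and
punctuation, which is the mismatch the report points out.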