Accumulated Feedback on PRI #341

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Mon Nov 28 16:01:08 CST 2016
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #341 formatting


One of the example tailored grapheme clusters is ⟨kʷ⟩. This is 
encoded in HTML as `k<sup>w</sup>`. Why not use Unicode?
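
As an illustration of the plain-text alternative this comment asks about, here is a minimal sketch that writes the cluster with U+02B7 MODIFIER LETTER SMALL W instead of HTML markup. The variable name is made up for the example.

```python
import unicodedata

# "k" followed by U+02B7, a plain-text way to write the superscript w.
labialized_k = "k\u02b7"

# List the code points that make up the string.
for ch in labialized_k:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```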

Date/Time: Tue Jan 3 19:06:13 CST 2017
Name: Manish Goregaokar
Report Type: Error Report
Opt Subject: UAX #29: Property tables should be updated for emoji sequences

The spec lists GraphemeBreakProperty.txt[1] and WordBreakProperty.txt[2] as
the normative source for grapheme and word categorization respectively.

However, the spec also gives non-normative definitions of these
properties. In particular, it defines Glue_After_Zwj[3] as

>> Emoji characters that do not break from a previous ZWJ in a defined 
>> emoji zwj sequence, and are not listed as Emoji_Modifier_Base=Yes in emoji-data.txt. See [UTR51].

Going through emoji-zwj-sequences.txt[4], there are a lot of emoji
characters that satisfy this property. The kiss/heart emojis are like
this, as well as every object emoji in the "Gendered Role, with
object" section. However, we only count the kiss, heart, and speech
bubble emoji as GAZ in the property table.

The property table should include all role and gender modifiers as GAZ.

Could this be updated?

 [1]: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt
 [2]: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt
 [3]: http://www.unicode.org/reports/tr29/proposed.html#Glue_After_Zwj
 [4]: http://unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt
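
The check described in this report could be sketched roughly as follows: scan ZWJ sequences and collect every code point that directly follows a ZWJ and is not an Emoji_Modifier_Base. The sample lines are hand-written in a simplified version of the data file's layout, and the modifier-base set is an assumed two-entry subset; a real check would load both from emoji-zwj-sequences.txt and emoji-data.txt. The function name is hypothetical.

```python
ZWJ = 0x200D

def glue_after_zwj_candidates(lines, modifier_bases):
    """Collect code points that directly follow a ZWJ and are not modifier bases."""
    found = set()
    for line in lines:
        # Drop the trailing comment, then keep only the code point field.
        data = line.split("#", 1)[0].split(";", 1)[0].strip()
        if not data:
            continue
        cps = [int(tok, 16) for tok in data.split()]
        for prev, cur in zip(cps, cps[1:]):
            if prev == ZWJ and cur not in modifier_bases:
                found.add(cur)
    return found

# Hand-written sample lines (assumption: simplified from the real file).
SAMPLE = [
    "# comment lines are skipped",
    "1F468 200D 2764 FE0F 200D 1F468 ; Emoji_ZWJ_Sequence # couple with heart",
    "1F468 200D 2764 FE0F 200D 1F48B 200D 1F468 ; Emoji_ZWJ_Sequence # kiss",
    "1F468 200D 1F33E ; Emoji_ZWJ_Sequence # man farmer",
]

# Assumed subset of Emoji_Modifier_Base=Yes: U+1F468 MAN, U+1F469 WOMAN.
MODIFIER_BASES = {0x1F468, 0x1F469}

candidates = glue_after_zwj_candidates(SAMPLE, MODIFIER_BASES)
print(sorted(f"U+{cp:04X}" for cp in candidates))
```

With this sample, the object character U+1F33E from the role sequence is collected alongside the heart and kiss characters, which is the discrepancy the report points at.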

Date/Time: Wed Jan 4 04:31:52 CST 2017
Name: Manish Goregaokar
Report Type: Public Review Issue
Opt Subject: UAX #29: Avoiding grapheme breaks on Indic consonant clusters

I've often noticed that grapheme clusters in Indic scripts don't span consonant 
clusters. For example, "ग्रा" is not a single grapheme cluster, but two: <ग्> + <रा>.

There is reasoning given in the spec for this.

>> Grapheme clusters can be tailored to meet further requirements. Such 
>> tailoring is permitted, but the possible rules are outside of the scope of 
>> this document. One example of such a tailoring would be for the aksaras, 
>> or orthographic syllables, used in many Indic scripts. Aksaras usually 
>> consist of a consonant, sometimes with an inherent vowel and sometimes 
>> followed by an explicit, dependent vowel whose rendering may end up on any 
>> side of the consonant letter base. Extended grapheme clusters include such
>> simple combinations.

>> However, aksaras may also include one or more additional prefixed consonants,
>> typically with a virama (halant) character between each pair of consonants in
>> the sequence. Such consonant cluster aksaras are not incorporated into the
>> default rules for extended grapheme clusters, in part because not all such
>> sequences are considered to be single “characters” by users. Indic scripts
>> vary considerably in how they handle the rendering of such aksaras—in some
>> cases stacking them up into combined forms known as consonant conjuncts, and
>> in other cases stringing them out horizontally, with visible renditions of the
>> halant on each consonant in the sequence. There is even greater variability in
>> how the typical liquid consonants (or “medials”), ya, ra, la, and wa, are
>> handled for display in combinations in aksaras. So tailorings for aksaras may
>> need to be script-, language-, font-, or context-specific to be useful.

This really boils down to "it depends on the font" and "you can use a tailoring here". I'll note that:

 - For most consonant clusters in modern use, most fonts will produce a single glyph 
without a halant. It's only when you get to things like three-consonant clusters 
(rare) that it stops working, and even then for most three-consonant clusters 
(e.g. those involving a `ra` on one end) that come up you will have a glyph. More 
common is consonant clusters rendering as larger glyphs, but that shouldn't mean 
they get split up into separate grapheme clusters.

 - As far as the language is concerned, the halant and sans-halant forms are equivalent, 
but the sans-halant form is generally preferred. I've only seen the halant form used in 
complex clusters from Sanskrit and in typewriter-produced text.

 - As far as text segmentation is concerned you rarely want to break a consonant 
cluster. If, for example, I'm selecting a segment of a word to copy-paste, I will 
almost always select whole clusters.

 - As far as I can tell, tailoring is for ambiguous cases where it wouldn't make 
sense to use the tailoring as part of the default algorithm, either if you're trying 
for a very specific form of segmentation (e.g. backspace -- backspace usually gobbles 
individual combining characters, but in the case of flag emoji many input fields will 
delete the entire emoji -- this is not the regular algorithm for segmentation), or 
for shared scripts where you don't want to cause conflicts. This case seems to be mostly 
unambiguous, on the other hand.


Additionally, Hangul has a very similar problem, but it does have special handling 
for it. While modern Korean only uses choseong+jungseong+optional jongseong (LV or LVT) 
syllable blocks, the spec does allow for things like LLLLVTTT (e.g. <ᄀᄀᄀ각ᆨᆨ>). In 
this case, the whole sequence is considered a single grapheme cluster (it even selects 
without segmentation in Firefox and Chrome). There don't seem to be any fonts which 
handle anything more than LVT glyphs, however.
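
The Hangul behavior described above can be sketched as a small classifier plus a regular expression over the resulting classes. This is a simplified illustration that covers only the Hangul syllable rules, not the full grapheme cluster algorithm; the function names and the single-letter class codes (A for LV, B for LVT, X for anything else) are invented for the example.

```python
import re

def hangul_class(ch):
    """Map a character to a one-letter Hangul syllable class."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "A" if (cp - 0xAC00) % 28 == 0 else "B"
    return "X"

# L* (V+ | LV V* | LVT) T* | L+ | T+, with a one-character fallback.
SYLLABLE = re.compile(r"L*(?:V+|AV*|B)T*|L+|T+|.")

def hangul_clusters(text):
    """Split text into clusters using only the Hangul syllable pattern."""
    classes = "".join(hangul_class(ch) for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]

# Three leading consonants, one precomposed LVT syllable, two trailing consonants.
sample = "\u1100\u1100\u1100\uac01\u11a8\u11a8"
print(len(hangul_clusters(sample)))
```

Under these assumptions the six code points of the example sequence stay together as one cluster, while two ordinary precomposed syllables in a row are split between them.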

I think we should be consistent here, and try to match what would be expected in an Indic 
language. The simplest thing to do would be to define halant characters as non-breaking 
on either side. This does mean that if a halant character is side-by-side with something 
from a different script it will still form the same cluster, which is questionable (but 
we do that already with things like <gौ> being considered a single cluster). If that 
behavior is undesired, a system similar to the Hangul one can be devised, where an Indic 
grapheme cluster is defined as C(HC)*V* (one base consonant, possibly followed by 
halant-consonant pairs, followed by zero or more vowel modifiers).

Thanks!
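
The C(HC)*V* pattern proposed in this comment can be tried out with a short regular expression. This is only a sketch: the three character classes are assumed, simplified Devanagari ranges (consonants U+0915 to U+0939, halant U+094D, dependent vowel signs U+093E to U+094C), and the function name is hypothetical. Anything that does not start a C(HC)*V* match falls back to a single code point.

```python
import re

# Assumed, simplified Devanagari classes; real data would come from the UCD.
CONSONANT = "[\u0915-\u0939]"
HALANT = "\u094D"
VOWEL_SIGN = "[\u093E-\u094C]"

# C(HC)*V*, or any single code point as a fallback.
AKSARA = re.compile(
    f"{CONSONANT}(?:{HALANT}{CONSONANT})*{VOWEL_SIGN}*|.", re.DOTALL
)

def aksara_clusters(text):
    """Split text into clusters using the C(HC)*V* pattern."""
    return AKSARA.findall(text)

# GA, halant, RA, vowel sign AA: the example from the comment above.
gra = "\u0917\u094d\u0930\u093e"
print(len(aksara_clusters(gra)))
```

With these classes the example from the comment is kept as one cluster instead of two.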

Date/Time: Sat Jan 21 17:21:11 CST 2017
Name: Karl Williamson
Report Type: Error Report
Opt Subject: UAX29 and spans of space

I submitted a request last year suggesting that the Word Break property not consider 
each individual horizontal white space character in a span of them to be a separate 
word.  I was told that this might have merit, but it was too late for Unicode 9.0, 
but would be put out for public comment afterwards.  I did not follow up, assuming 
that you would.  But now, I see that this isn't being asked about in the 10.0 proposed 
UAX29.  I did read the minutes of the meetings since, and I don't believe there was 
any mention of this, so my guess is that this dropped through the cracks.
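
The requested behavior can be approximated as a post-processing step over already segmented text: adjacent segments that consist only of horizontal white space are merged into one. This is a sketch under assumptions, not the UAX #29 algorithm; "horizontal white space" is taken here to mean TAB plus characters of general category Zs, and the function names are invented.

```python
import unicodedata

def is_horizontal_space(ch):
    """Assumed definition: TAB or any space separator (general category Zs)."""
    return ch == "\t" or unicodedata.category(ch) == "Zs"

def is_space_run(segment):
    """True if the segment is non-empty and made only of horizontal white space."""
    return bool(segment) and all(is_horizontal_space(ch) for ch in segment)

def merge_space_runs(segments):
    """Merge adjacent white space segments into a single segment."""
    merged = []
    for seg in segments:
        if merged and is_space_run(seg) and is_space_run(merged[-1]):
            merged[-1] += seg
        else:
            merged.append(seg)
    return merged

# Each space arrives as its own "word", as described in the report.
words = ["foo", " ", " ", "\t", "bar"]
print(merge_space_runs(words))
```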

Feedback above this line was reviewed during UTC #150, January 2017.

Date/Time: Wed Mar 8 11:30:18 CST 2017
Name: Nick Wellnhofer
Report Type: Error Report
Opt Subject: RI characters in grapheme clusters

In revision 29 of UAX #29, the grapheme cluster rules were updated to break after 
each pair of RI characters (GB 12 and 13). But the text still contains the following 
paragraph (also in the draft for revision 30):

"The base can be single characters, or be any sequence of Hangul Jamo characters that 
form a Hangul Syllable, as defined by D133 in The Unicode Standard, or be any sequence 
of Regional_Indicator (RI) characters. The RI characters are used in pairs to denote 
Emoji national flag symbols corresponding to ISO country codes. Sequences of more than 
two RI characters should be separated by other characters, such as U+200B ZERO WIDTH SPACE (ZWSP)."

I think the paragraph should be updated to reflect the new rules.
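
For reference, the effect of breaking after each pair of RI characters can be sketched as below. This is only an illustration of the pairing behavior the report attributes to GB 12 and 13, applied to a run that contains nothing but RI characters; the function names are hypothetical.

```python
def is_regional_indicator(ch):
    """Regional_Indicator characters occupy U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def split_ri_run(run):
    """Break a run of RI characters after each pair, starting from the left."""
    if not all(is_regional_indicator(ch) for ch in run):
        raise ValueError("expected only Regional_Indicator characters")
    return [run[i:i + 2] for i in range(0, len(run), 2)]

# Five RI characters: two complete pairs and one left over.
flags = "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7\U0001F1E9"
print([len(piece) for piece in split_ri_run(flags)])
```

A run of five RI characters therefore yields two pairs and one unpaired character, without any separator such as ZWSP between them.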

Date/Time: Fri Apr 7 15:09:47 CDT 2017
Name: Andy Heninger
Report Type: Error Report
Opt Subject: Full Width Digits Word Break Property

Full-width ASCII digits (U+FF10 - U+FF19) have the word break property of "Other". 
It should probably be Numeric.

The full-width digits' existing line break property of Ideographic is correct; line 
wrapping within a full-width number is expected. But word selection should match a 
multi-digit number.

CLDR ticket http://unicode.org/cldr/trac/ticket/6555 was also filed for this problem. 
It was resolved as out of scope, with a suggestion to submit feedback to Unicode. 
As far as I can find, this did not happen.

Here is a number composed of full-width digits: １２３４.  Double-click it to check 
browser word break behavior. Chrome, at least, treats the digits as numeric.
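
As supporting context for this report, the other properties of the full-width digits can be inspected with Python's standard library. The sketch below does not query Word_Break itself, since the standard `unicodedata` module does not expose that property; it only shows that these characters are decimal digits in every other respect.

```python
import unicodedata

# FULLWIDTH DIGIT ONE through FULLWIDTH DIGIT FOUR.
fullwidth = "\uFF11\uFF12\uFF13\uFF14"

# General category of each character.
categories = [unicodedata.category(ch) for ch in fullwidth]
# East Asian width of each character.
widths = [unicodedata.east_asian_width(ch) for ch in fullwidth]
# Compatibility normalization folds them to ASCII digits.
folded = unicodedata.normalize("NFKC", fullwidth)
# int() accepts any Unicode decimal digits.
value = int(fullwidth)

print(categories, widths, folded, value)
```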