Accumulated Feedback on PRI #396

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Wed Jul 3 19:10:38 CDT 2019
Name: Johan Curcio Lindström
Report Type: Error Report
Opt Subject: ARABIC NUMBER SIGN Control or Prepend in UAX #29

Hello,

When implementing the extended grapheme cluster segmentation algorithm, I
noticed what appears to be a mistake in the specification.

The ARABIC NUMBER SIGN code point (U+0600) belongs to the Format general
category, which means that it will have a Grapheme_Cluster_Break of Control.
Since it also has Prepended_Concatenation_Mark = Yes, it could also be
considered for the Prepend value, but the specification states that Control
wins because it is higher up in the table.

GB4 states that we break after Control and GB9b that we don't break after
Prepend. Since Control won out, we should break after ARABIC NUMBER SIGN.
However, the provided test cases in GraphemeBreakTest.txt expect no break
after this code point, and the tables in GraphemeBreakProperty.txt include it
under Prepend and not under Control.
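
A minimal sketch of the two readings compared above, assuming the character
after U+0600 is an ordinary one such as a digit. The names below are
hypothetical and only GB4 and GB9b are modelled; this is not the full UAX #29
algorithm.

```python
# Toy model of GB4 and GB9b only; everything else falls through to "break".
CONTROL = "Control"
PREPEND = "Prepend"
OTHER = "Other"


def break_after(gcb_value: str) -> bool:
    """Return True if a boundary follows a character with this
    Grapheme_Cluster_Break value, looking only at GB4 and GB9b."""
    if gcb_value == CONTROL:  # GB4: (Control | CR | LF) ÷
        return True
    if gcb_value == PREPEND:  # GB9b: Prepend ×
        return False
    return True  # simplified stand-in for GB999: Any ÷ Any


# U+0600 under the two candidate classifications.
as_control = break_after(CONTROL)  # reading derived from the prose rules
as_prepend = break_after(PREPEND)  # reading implied by the data files
print(as_control, as_prepend)
```

The two classifications give opposite answers, which is why the precedence
between Control and Prepend matters for U+0600.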

In fact, this seems true for a range of code points in Prepend, but only
U+0600 is part of the tests, which is why I mention it.

Is there a part of the specification that I'm missing?

It is also a bit unclear what the difference is between the GB* rules that
use the Any value and those that just leave one side blank, e.g., "sot ÷ Any"
vs. "× SpacingMark". Why is GB1 not written as "sot ÷" or GB9a not as "Any ×
SpacingMark"?

Date/Time: Sat Jul 6 16:57:48 CDT 2019
Name: Charlotte Buff
Report Type: Error Report
Opt Subject: Line-Break Behaviour of Emoji Modifier Sequences

In Revision 33 of UAX #29 (Unicode Text Segmentation), the rules governing
emoji modifier sequences were simplified. In particular, emoji modifiers are
now considered generic extenders. This change has not carried over to the
line breaking algorithm, however, which still relies on the Emoji_Modifier
and Emoji_Modifier_Base properties. As a consequence, certain sequences of
characters now form a single grapheme cluster, but still theoretically allow
line breaks inside of them, which isn’t very sensible.

This discrepancy affects existing characters such as U+1F9DF 🧟 ZOMBIE, which
is available in different skin tones as part of Microsoft’s Segoe UI Emoji
font despite not being an official modifier base, as well as newly released
characters whose properties may not have been fully implemented yet.
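
A rough sketch of the discrepancy described above, using toy property data
rather than the real UCD files. The set contents and function names are
assumptions for illustration; only GB9 (grapheme clusters) and LB30b (line
breaking) are modelled, and every other rule is ignored.

```python
# Toy property data for illustration only; real values come from the UCD.
EMOJI_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}  # U+1F3FB..U+1F3FF
EMOJI_MODIFIER_BASES = {"\U0001F44B"}  # WAVING HAND SIGN; a tiny sample only

ZOMBIE = "\U0001F9DF"       # not an official modifier base
WAVING_HAND = "\U0001F44B"  # an official modifier base
MEDIUM_TONE = "\U0001F3FD"  # EMOJI MODIFIER FITZPATRICK TYPE-4


def grapheme_break_between(base: str, mark: str) -> bool:
    """GB9: no boundary before Extend; modifiers are treated as Extend."""
    return mark not in EMOJI_MODIFIERS


def line_break_between(base: str, mark: str) -> bool:
    """LB30b: EB × EM only applies when the base is a listed modifier base."""
    return not (base in EMOJI_MODIFIER_BASES and mark in EMOJI_MODIFIERS)


for base in (WAVING_HAND, ZOMBIE):
    print(
        f"U+{ord(base):04X}",
        "grapheme break:", grapheme_break_between(base, MEDIUM_TONE),
        "line break:", line_break_between(base, MEDIUM_TONE),
    )
```

In this toy model the zombie sequence is one grapheme cluster yet still admits
a line break opportunity, while the waving hand sequence is kept together by
both rules.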

I propose deprecating the line break properties E_Base and E_Modifier, and
merging the affected characters into Ideographic and Combining_Mark
respectively. This would synchronise the behaviour between line breaking and
text segmentation, and also automatically future‐proof the system for new
emoji modifier bases that might be added in the future.


Feedback above this line was reviewed and processed during UTC #160 in July 2019.

Date/Time: Tue Jul 30 12:55:18 CDT 2019
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #396: More modifier letters for Chinese tones

The examples in L2/04-107 show U+A700..A707 used the same way as the other
Chinese tone letters, so U+A700..A707 should have Word_Break = ALetter, like
the rest of the Chinese tone letters that are proposed to be ALetter.

Is the reason they weren’t proposed along with the others that they are used
with Han characters, which have different Word_Break values than Latin? If
so, page 7 shows some of them with Latin letters too.

Date/Time: Fri Aug 23 14:17:34 CDT 2019
Name: Theodore Beers
Report Type: Other Question, Problem, or Feedback
Opt Subject: ZWNJ in Annex 29

I think it would help if a sentence or two were added to clarify the
exclusion of ZWNJ (U+200C) as a grapheme boundary. In Persian, this
character is used to prevent letters from being connected to one another,
mainly at points where prefixes and suffixes are attached to words. This
(arguably) does not generate a new "user-perceived character," but rather
dictates which of the standard letter forms may be set at the point in
question.[0]

Somewhat similar is the use of ZWNJ in German, to prevent a ligature across
the stems of a compound word (e.g., no U+FB02 in Auflage).

Specialists have indicated that they agree with the current rule—i.e., that
ZWNJ by default should *not* be treated as a grapheme boundary, and thus
that it should be grouped with the preceding cluster. I'm not entirely
convinced… but more importantly, the annex might benefit from further detail
on this point. Where the ZWNJ is mentioned, it is in reference to Indic
languages, in which there are in fact unique user-perceived characters that
rely on the ZWNJ to be composed correctly. So those are cases where it is
obvious that ZWNJ cannot be a grapheme boundary. I think the proper
treatment of this character in a language like Persian is less self-evident.
There are, to be sure, advantages to the rule as it stands, e.g.,
facilitating cursor positioning and the counting of user-perceived
characters (if not their definition). Still, my feeling is that the annex
could make the rationale a bit clearer to non-cognoscenti.

[0] It seems relevant to me to note that ZWNJ has not always been readily
available for typing/typesetting/rendering in Persian (it's much easier
these days), and it remains the case that many people are unaware of its
availability or have not learned to use it. So it is still common to see a
full space entered where there ought to be a ZWNJ. This is not ideal, of
course—the result is breaking a compound word into two words. (There are
also people who hypercorrect, using ZWNJ before a suffix even in cases
where allowing the letters to connect would produce no ambiguity.) The
story of this character in Persian is extremely messy. There are words
where you might find ZWNJ, or a full space, or connected letters. Somehow
this exacerbates my confusion when it comes to the rule for segmenting
graphemes.
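
To make the current rule concrete, here is a minimal sketch that models only
GB9 (no boundary before Extend) with ZWNJ as the sole Extend character. The
function name and the sample word (Persian mi-khaham written with ZWNJ after
the prefix) are assumptions chosen for illustration; real text needs the full
algorithm.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def toy_clusters(text: str) -> list[str]:
    """Split text into one cluster per code point, except that ZWNJ is
    attached to the preceding cluster (GB9: × Extend)."""
    clusters: list[str] = []
    for ch in text:
        if ch == ZWNJ and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# MEEM, FARSI YEH, ZWNJ, KHAH, WAW, ALEF, HEH, MEEM
word = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645"
clusters = toy_clusters(word)
print(len(word), len(clusters))  # 8 code points, 7 clusters
```

The ZWNJ does not add to the cluster count; it travels with the FARSI YEH
before it, which is the behaviour relevant to cursor positioning and
character counting.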

Date/Time: Fri Sep 20 15:07:12 CDT 2019
Name: Yichao 'Peak' Ji
Report Type: Public Review Issue
Opt Subject: UAX #29: Full width comma in WB11 and WB12

(Note: Re-open for PRI #396, as discussed in the Unicore mail list.)

Actually I’ve submitted this issue before in PRI #355, but today I found a
Review Note about the exact same thing, so please allow me to elaborate
more:

In English and other languages, we use commas in long numbers for better
readability, and use comma plus a whitespace to separate clauses. But in
Chinese, we use full width commas without any whitespace to separate
clauses, and never use comma in numbers.

U+FF0C FULLWIDTH COMMA and U+FF1B FULLWIDTH SEMICOLON as MidNum would
prevent implementations from breaking legit Chinese clauses starting with
digits after clauses ending with digits.

For example, in “今晚19:30,2014大奖赛即将开幕” (The 2014 championship will start
at 19:30 tonight), a weird “30,2014” token will be generated.
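
A minimal sketch of how this token arises, assuming a toy tokenizer that only
models WB8, WB11 and WB12 for ASCII digits. The function name and the
`mid_num` parameter are hypothetical; the colon is left out of `mid_num`
because it is not MidNum in Word_Break.

```python
def numeric_tokens(text: str, mid_num: set[str]) -> list[str]:
    """Collect digit runs, joining across a single MidNum character that
    sits between two digits (WB11/WB12). All other rules are ignored."""

    def is_digit(ch: str) -> bool:
        return "0" <= ch <= "9"

    tokens: list[str] = []
    i, n = 0, len(text)
    while i < n:
        if not is_digit(text[i]):
            i += 1
            continue
        start = i
        while i < n:
            if is_digit(text[i]):
                i += 1  # WB8: Numeric × Numeric
            elif text[i] in mid_num and i + 1 < n and is_digit(text[i + 1]):
                i += 2  # WB11/WB12: Numeric MidNum × Numeric
            else:
                break
        tokens.append(text[start:i])
    return tokens


sample = "今晚19:30\uFF0C2014大奖赛即将开幕"  # \uFF0C is FULLWIDTH COMMA
with_fullwidth = numeric_tokens(sample, {",", "\uFF0C"})
without_fullwidth = numeric_tokens(sample, {","})
print(with_fullwidth)
print(without_fullwidth)
```

With U+FF0C in the set, "30" and "2014" merge into one token across the
clause boundary; without it they stay separate.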

This behavior affects a wide range of Chinese news articles. As I mentioned
before in the report, we found more than 5k invalid tokens like these in a
corpus of 2.8 million articles.

AFAIK, Chinese is the only language using the full width variant. Japanese
uses U+3001 (、) and Korean uses the ASCII comma. So I’d say it’s safe to
remove U+FF0C and U+FF1B from MidNum.