Accumulated Feedback on PRI #237

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Mon Jan 14 08:13:33 CST 2013
Contact: koji.a.ishii@mail.rakuten.com
Name: Koji Ishii
Report Type: Error Report
Opt Subject: UAX#14 for U+3033, 3034, and 3035


I would like to propose changing the UAX#14 Line Breaking Class 
of U+3033, U+3034, and U+3035 to IN.

These characters are used only in pairs, either U+3033 and U+3035 or 
U+3034 and U+3035, to build a single 2em-wide character. There are neither 
line break opportunities nor expansion opportunities between the two code points.

JLREQ solves this issue by defining these 3 code points as Inseparable (cl-08)
http://www.w3.org/TR/jlreq/#cl-08

In UAX#14, IN is defined as:
  These characters are intended to be used consecutively.
  There is never a line break between two characters of this class.
http://www.unicode.org/reports/tr14/#IN

Although the 2nd sentence can fix the issue for these 3 code points, 
strictly speaking, the characteristics of the 3 code points do not match 
the 1st sentence.

Another possible way to fix the issue is to make U+3035 CM. This is also 
a workable solution, but it still does not accurately describe the 
characteristics of U+3033 and U+3034.

Creating a new class just for this purpose looks like overkill to me.

So, my conclusion is that making these 3 code points IN would be the 
best fix. I'm happy to go with other solutions though.

If including these 3 code points makes the 1st sentence problematic, we 
could also consider changing the 1st sentence of IN to something like:
  These characters are generally intended to be used consecutively.
or:
  These characters are intended to be used consecutively,
  or used as a pair of characters in this class.

Thank you in advance for your continued support on this great UAX.

/koji
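
The effect of the proposal above can be illustrated with a small sketch. It models only the quoted IN rule ("There is never a line break between two characters of this class") and is not the UAX#14 algorithm. The table names, the function name, and the fallback class ID used for the "before" case are assumptions made for illustration; they are not taken from LineBreak.txt.

```python
# Hypothetical class table reflecting the proposal: all three code points are IN.
PROPOSED_CLASSES = {"\u3033": "IN", "\u3034": "IN", "\u3035": "IN"}

# Assumed baseline for comparison: no special entry, so the fallback class applies.
BASELINE_CLASSES = {}


def may_break_between(left, right, classes, default="ID"):
    """Toy check: is a line break allowed between two adjacent characters?

    Only the IN rule is modelled. Every other pair is reported as
    breakable, which is a deliberate simplification.
    """
    left_class = classes.get(left, default)
    right_class = classes.get(right, default)
    # Quoted IN rule: never a break between two characters of class IN.
    if left_class == "IN" and right_class == "IN":
        return False
    return True


for upper in ("\u3033", "\u3034"):
    before = may_break_between(upper, "\u3035", BASELINE_CLASSES)
    after = may_break_between(upper, "\u3035", PROPOSED_CLASSES)
    print(f"U+{ord(upper):04X} U+3035: baseline={before}, proposed={after}")
```

Under these assumptions, both pairs are breakable with the baseline table and unbreakable with the proposed one.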


Feedback above this line was considered at the February UTC meeting.

Added from mail archive per request from author:

From: Konstantin Ritt <ritt.ks_at_gmail.com>
Date: Sat, 2 Jun 2012 07:22:01 +0300

It seems like there is an inconsistency between what the default
grapheme clusters specification says and what the test results are
expected to be:

UAX#29 says:
> Another key feature (of default Unicode grapheme clusters) is that <b>default Unicode grapheme clusters are atomic units with respect to the process of determining the Unicode default line, word, and sentence boundaries</b>.
This is also mentioned in UAX#14:
> Example 6. Some implementations may wish to tailor the line breaking algorithm to resolve grapheme clusters according to Unicode Standard Annex #29, “Unicode Text Segmentation” [UAX29], as a first stage. <b>Generally, the line breaking algorithm does not create line break opportunities within default grapheme clusters</b>; therefore such a tailoring would be expected to produce results that are close to those defined by the default algorithm. However, if such a tailoring is chosen, characters that are members of line break class CM but not part of the definition of default grapheme clusters must still be handled by rules LB9 and LB10, or by some additional tailoring.

However, <U+0020 (SP), U+0308 (CM)> in the line breaking algorithm is
handled by the rules LB10+LB18 and produces a break opportunity, while
GB9 prohibits a break between <U+0020 (Other), U+0308 (Extend)>.
Section 9.2 "Legacy Support for Space Character as Base for Combining
Marks" in UAX#29 explains why a line break occurs there, but the fact
remains that the statements above are false and introduce some
ambiguity.
If the space character is no longer a grapheme base, the
grapheme cluster breaking rules need to be updated.

Kind regards,
Konstantin
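
The mismatch reported above can be reproduced with a minimal sketch that models only the rules named in the message: LB9, LB10 and LB18 from UAX#14, and GB9 (plus the control-character exception) from UAX#29. The function names and the hard-coded property values are assumptions for illustration. All other rules are omitted, so the functions are meaningful only for pairs like the one discussed.

```python
# Classes that LB9 does not accept as a base for a following combining mark.
LB9_NON_BASES = {"BK", "CR", "LF", "NL", "SP", "ZW"}


def line_break_allowed(left_lb, right_lb):
    """Toy UAX#14 check covering only LB9, LB10 and LB18."""
    if right_lb == "CM" and left_lb not in LB9_NON_BASES:
        # LB9: the mark attaches to its base, so no break before it.
        return False
    # LB10: a mark without a base is treated as AL.
    # LB18: break after spaces. Other rules are not modelled.
    return left_lb == "SP"


def grapheme_break_allowed(left_gcb, right_gcb):
    """Toy UAX#29 check covering only GB4 and GB9."""
    if left_gcb in {"CR", "LF", "Control"}:
        return True  # GB4: break after controls
    if right_gcb == "Extend":
        return False  # GB9: no break before Extend
    return True  # everything else: break (other rules omitted)


# <U+0020, U+0308>: SP + CM for line breaking, Other + Extend for graphemes.
line_break = line_break_allowed("SP", "CM")
cluster_break = grapheme_break_allowed("Other", "Extend")
print(f"line break opportunity: {line_break}")        # True
print(f"grapheme cluster boundary: {cluster_break}")  # False
```

In this sketch the pair yields a line break opportunity at a position that is not a grapheme cluster boundary, which is the situation the message describes.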