L2/19-055

Date: January 14, 2019
Title: Proposed Changes in the Segmentation Property Values for Fullwidth Digits
Source: Laurentiu Iancu (Microsoft Corporation) and Andy Heninger (Google, Inc.)
Action: For consideration by the UTC

Proposal

For the ten characters U+FF10 FULLWIDTH DIGIT ZERO through U+FF19 FULLWIDTH DIGIT NINE, change the Word_Break and Sentence_Break property values from Other to Numeric (from WB=XX, SB=XX to WB=NU, SB=NU).

Background and Analysis

Users have reported unexpected results in the word segmentation of CJK text that contains fullwidth digits. The issue has been reported multiple times since 2015 and was tracked by action item 144-A83.

Although the fullwidth digits have General_Category = Decimal_Number, their current classification is Word_Break = Other, the same as for non-decimal numerals (General_Category = Other_Number). Because of that assignment, the word segmentation algorithm finds word boundaries between adjacent fullwidth digits. As a result, numbers written in fullwidth digits are not treated as whole words but are fragmented into individual characters.

With the current Sentence_Break assignment, the sentence segmentation algorithm does not break sentences between fullwidth digits. However, the proposal includes a change in their Sentence_Break property values for consistency with the classification of all other decimal digits (General_Category = Decimal_Number).

The Line_Break property value of fullwidth digits is Ideographic (lb=ID), which also implies line-breaking opportunities between adjacent fullwidth digits. A change in the Line_Break classification is not proposed at this time.

In summary, the segmentation and line-breaking classification of fullwidth digits before and after the proposed changes looks as follows:

Before:
    Word_Break = Other (WB=XX)
    Sentence_Break = Other (SB=XX)
    Line_Break = Ideographic (lb=ID)

After:
    Word_Break = Numeric (WB=NU)
    Sentence_Break = Numeric (SB=NU)
    Line_Break = Ideographic (lb=ID)

The change in Word_Break property value for fullwidth digits was tested in the ICU word-break iterator and produced the expected results.
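
The effect of the Word_Break change can be illustrated with a minimal sketch. This is not the ICU word-break iterator and not the full word segmentation algorithm: it models only the rule that keeps two adjacent Numeric characters together, and breaks everywhere else. The helper names word_break_class and segment, the toy class lookup, and the sample string are assumptions made for illustration.

```python
import unicodedata

# U+FF10 FULLWIDTH DIGIT ZERO through U+FF19 FULLWIDTH DIGIT NINE
FULLWIDTH_DIGITS = {chr(cp) for cp in range(0xFF10, 0xFF1A)}

# The fullwidth digits are General_Category = Decimal_Number (Nd).
assert all(unicodedata.category(ch) == "Nd" for ch in FULLWIDTH_DIGITS)


def word_break_class(ch, fullwidth_is_numeric):
    """Toy Word_Break lookup (hypothetical): only the classes needed here."""
    if ch in FULLWIDTH_DIGITS:
        # Proposed value is NU; current value is XX.
        return "NU" if fullwidth_is_numeric else "XX"
    if "0" <= ch <= "9":
        return "NU"
    return "XX"


def segment(text, fullwidth_is_numeric):
    """Split text, joining only adjacent Numeric characters.

    Simplification: every other pair of characters gets a boundary.
    """
    segments = []
    prev_cls = None
    for ch in text:
        cls = word_break_class(ch, fullwidth_is_numeric)
        if segments and cls == "NU" and prev_cls == "NU":
            segments[-1] += ch  # no boundary between two Numeric characters
        else:
            segments.append(ch)  # boundary before this character
        prev_cls = cls
    return segments


sample = "２０１９年"
print(segment(sample, fullwidth_is_numeric=False))  # ['２', '０', '１', '９', '年']
print(segment(sample, fullwidth_is_numeric=True))   # ['２０１９', '年']
```

Under the current assignment the sketch fragments the number into four single-character segments; under the proposed assignment the four digits form one segment, which matches the behavior described above.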