L2/19-055

Date: January 14, 2019

Title: Proposed Changes in the Segmentation Property Values for Fullwidth Digits

Source: Laurentiu Iancu (Microsoft Corporation) and Andy Heninger (Google, Inc.)

Action: For consideration by the UTC

Proposal:

For the ten characters U+FF10 FULLWIDTH DIGIT ZERO through U+FF19 FULLWIDTH DIGIT NINE, 
change the Word_Break and Sentence_Break property values from Other to Numeric (from WB=XX, 
SB=XX to WB=NU, SB=NU).

Background and Analysis

Users have reported unexpected results in the word segmentation of CJK text that contains 
fullwidth digits.  The issue has been reported multiple times since 2015 and was tracked by 
action item 144-A83.

Although the fullwidth digits have General_Category = Decimal_Number, their current Word_Break 
classification is Word_Break = Other, the same as for non-decimal numerals (General_Category = 
Other_Number).  Because of that assignment, the word segmentation algorithm finds word 
boundaries between adjacent fullwidth digits.  As a result, a number written in fullwidth 
digits is not treated as a single word but is instead split into its individual characters.
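
The effect can be illustrated with a deliberately simplified sketch in Python.  It is not 
the UAX #29 algorithm: it applies only rule WB8 (Numeric x Numeric, no break) and falls back 
to breaking everywhere else (WB999).  The word_break lookup, the segment function, and the 
fullwidth_numeric flag are hypothetical names introduced for this illustration, and the 
lookup covers only ASCII and fullwidth digits.

```python
# Toy model of the word-boundary behavior described above.
# Assumption: only WB8 (Numeric x Numeric) and WB999 (break elsewhere) are modeled.

FULLWIDTH_DIGITS = {chr(cp) for cp in range(0xFF10, 0xFF1A)}  # U+FF10..U+FF19


def word_break(ch, fullwidth_numeric):
    """Return a toy Word_Break value: 'NU' (Numeric) or 'XX' (Other)."""
    if ch in "0123456789":
        return "NU"
    if ch in FULLWIDTH_DIGITS:
        # Current assignment is XX; the proposed assignment is NU.
        return "NU" if fullwidth_numeric else "XX"
    return "XX"


def segment(text, fullwidth_numeric):
    """Split text into segments, joining only adjacent Numeric characters."""
    if not text:
        return []
    words = [text[0]]
    for prev, cur in zip(text, text[1:]):
        prev_nu = word_break(prev, fullwidth_numeric) == "NU"
        cur_nu = word_break(cur, fullwidth_numeric) == "NU"
        if prev_nu and cur_nu:
            words[-1] += cur      # WB8: do not break between Numeric characters
        else:
            words.append(cur)     # WB999: break everywhere else
    return words


sample = "\uFF12\uFF10\uFF11\uFF19\u5E74"      # fullwidth "2019" followed by U+5E74
before = segment(sample, fullwidth_numeric=False)
after = segment(sample, fullwidth_numeric=True)
```

Under these assumptions, before holds five one-character segments, while after holds the 
four fullwidth digits as one segment followed by U+5E74.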

With the current Sentence_Break assignment, the sentence segmentation algorithm does not 
break sentences between fullwidth digits.  However, the proposal includes a change in their 
Sentence_Break property values for consistency with the classification of all other decimal 
digits (General_Category = Decimal_Number).
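
The General_Category of the fullwidth digits can be checked with Python's standard 
unicodedata module, which exposes General_Category, decimal digit values, and normalization 
but not the segmentation properties.  The short sketch below is only a consistency check of 
the Decimal_Number classification; the variable names are arbitrary.

```python
import unicodedata

# The ten fullwidth digits, U+FF10..U+FF19.
fullwidth = [chr(cp) for cp in range(0xFF10, 0xFF1A)]

# General_Category of each character; "Nd" is Decimal_Number.
categories = {unicodedata.category(ch) for ch in fullwidth}

# Decimal digit value carried by each character.
values = [unicodedata.decimal(ch) for ch in fullwidth]

# Compatibility normalization maps the fullwidth forms to ASCII digits.
normalized = unicodedata.normalize("NFKC", "".join(fullwidth))
```

Here categories contains only "Nd", values runs from 0 through 9, and normalized is the 
ASCII string "0123456789".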

The Line_Break property value of fullwidth digits is Ideographic (lb=ID), which also implies 
line-breaking opportunities between adjacent fullwidth digits.  A change in the Line_Break 
classification is not proposed at this time.

In summary, the segmentation and line-breaking classification of fullwidth digits before and 
after the proposed changes is as follows:

Before:
Word_Break = Other (WB=XX)
Sentence_Break = Other (SB=XX)
Line_Break = Ideographic (lb=ID)

After:
Word_Break = Numeric (WB=NU)
Sentence_Break = Numeric (SB=NU)
Line_Break = Ideographic (lb=ID)

The change in Word_Break property value for fullwidth digits was tested in the ICU word-break 
iterator and produced the expected results.