L2/12-310

Mark Davis

Live Document: http://goo.gl/KIMTs

We provide a crisp and useful definition of Numeric_Type=Decimal. However, we have not provided a crisp and useful definition of the other two types: Numeric_Type=Digit and Numeric_Type=Numeric. We’ve also drifted away from the (few) characterizations that we had in the text. This issue was raised at the last UTC meeting, so I spent some time looking at the current contents and at what would make a coherent definition.

Proposal

Put the following proposal up for public review, targeted at Unicode 6.2.1.

1. Add the following definitions

(The underlined examples would be changed from current properties.)

Numeric_Type=Decimal

Characters used in positional decimal systems, which are standard base-10 radix systems with contiguous digits 0..9 and are ordered most-significant-digit first (backing-store order). These are coextensive by definition with General_Category=Decimal_Number.

Rationale: This is simply a formulation of conditions that we already have.
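The conditions above can be checked mechanically. The following is a minimal sketch, assuming Python's standard unicodedata module, whose decimal() function returns a value only for characters that have a decimal digit value in the UCD data bundled with the interpreter. The helper names is_decimal and has_contiguous_run are hypothetical and are used here only for illustration.

```python
import unicodedata


def is_decimal(ch: str) -> bool:
    """True if the runtime's UCD data gives ch a decimal digit value."""
    return unicodedata.decimal(ch, None) is not None


def has_contiguous_run(ch: str) -> bool:
    """Check that ch sits in a contiguous run of ten Nd digits valued 0..9."""
    value = unicodedata.decimal(ch, None)
    if value is None:
        return False
    zero = ord(ch) - value  # code point where the run's zero should be
    return all(
        unicodedata.decimal(chr(zero + i), None) == i
        and unicodedata.category(chr(zero + i)) == "Nd"
        for i in range(10)
    )
```

For example, has_contiguous_run("7") and has_contiguous_run("\u0663") (ARABIC-INDIC DIGIT THREE) both hold, while is_decimal("\u2460") (CIRCLED DIGIT ONE) does not.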

Numeric_Type=Digit

Variants of positional decimal characters (Numeric_Type=Decimal) or sequences thereof. These include superscript and subscript forms, enclosed forms, and forms decorated by the addition of characters such as parentheses, dots, or commas.

Examples:

U+2080 ( ₀ ) SUBSCRIPT ZERO

U+2460 ( ① ) CIRCLED DIGIT ONE

U+2469 ( ⑩ ) CIRCLED NUMBER TEN

Rationale: This provides a cohesive, useful definition, and does not break series of related numbers like circled Western numbers, or include non-decimal numbers like Ethiopic:

- This moves characters like U+1369 ( ፩ ) ETHIOPIC DIGIT ONE to Numeric. These are not used in positional decimal systems; they are neither Nd nor variants of Nd characters. They don’t have a zero, and they are typically used with related characters having numeric values above 9 (like U+1372 ( ፲ ) ETHIOPIC NUMBER TEN). It makes more sense to put them all in Numeric_Type=Numeric.
- It also moves characters like U+2469 ( ⑩ ) CIRCLED NUMBER TEN to Digit. These are variants of sequences of characters in positional decimal systems, and it makes more sense not to break apart related characters that are representations of decimal numbers.

Numeric_Type=Numeric

Characters with numeric value, but that are neither Decimal nor Digit.

Examples:

U+2150 ( ⅐ ) VULGAR FRACTION ONE SEVENTH

U+2160 ( Ⅰ ) ROMAN NUMERAL ONE

U+1369 ( ፩ ) ETHIOPIC DIGIT ONE

U+1372 ( ፲ ) ETHIOPIC NUMBER TEN

U+0D72 ( ൲ ) MALAYALAM NUMBER ONE THOUSAND

U+3021 ( 〡 ) HANGZHOU NUMERAL ONE
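To compare the three definitions against what an implementation reports today, the following sketch classifies a character from the three numeric lookups in Python's standard unicodedata module. It reflects the current assignments in whatever UCD version the interpreter bundles, not the assignments proposed here. The function name numeric_type is hypothetical.

```python
import unicodedata


def numeric_type(ch: str) -> str:
    """Return the Numeric_Type of ch according to the runtime's UCD data."""
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return "None"


# Print the current type of the example characters used in the definitions.
for cp in (0x0031, 0x2080, 0x2460, 0x2469, 0x2150, 0x2160, 0x1369, 0x1372):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {numeric_type(ch)}")
```

With data that matches the current state described in this document, U+2469 should report Numeric and U+1369 should report Digit. Those are the two example assignments that the proposal would swap.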

2. Change the Numeric_Type (NT) values of certain characters, consistent with the above.

A. from Numeric_Type=Digit to Numeric_Type=Numeric

U+10E60 ( 𐹠 ) RUMI DIGIT ONE...U+10E68 ( 𐹨 ) RUMI DIGIT NINE

U+11052 ( 𑁒 ) BRAHMI NUMBER ONE...U+1105A ( 𑁚 ) BRAHMI NUMBER NINE

U+1369 ( ፩ ) ETHIOPIC DIGIT ONE...U+1371 ( ፱ ) ETHIOPIC DIGIT NINE

U+10A40 ( 𐩀 ) KHAROSHTHI DIGIT ONE...U+10A43 ( 𐩃 ) KHAROSHTHI DIGIT FOUR

U+19DA ( ᧚ ) NEW TAI LUE THAM DIGIT ONE

B. from Numeric_Type=Numeric to Numeric_Type=Digit

U+2469 ( ⑩ ) CIRCLED NUMBER TEN...U+2473 ( ⑳ ) CIRCLED NUMBER TWENTY

U+247D ( ⑽ ) PARENTHESIZED NUMBER TEN...U+2487 ( ⒇ ) PARENTHESIZED NUMBER TWENTY

U+2491 ( ⒑ ) NUMBER TEN FULL STOP...U+249B ( ⒛ ) NUMBER TWENTY FULL STOP

U+277F ( ❿ ) DINGBAT NEGATIVE CIRCLED NUMBER TEN

U+24EB ( ⓫ ) NEGATIVE CIRCLED NUMBER ELEVEN...U+24F4 ( ⓴ ) NEGATIVE CIRCLED NUMBER TWENTY

U+3251 ( ㉑ ) CIRCLED NUMBER TWENTY ONE...U+325F ( ㉟ ) CIRCLED NUMBER THIRTY FIVE

U+32B1 ( ㊱ ) CIRCLED NUMBER THIRTY SIX...U+32BF ( ㊿ ) CIRCLED NUMBER FIFTY

U+3248 ( ㉈ ) CIRCLED NUMBER TEN ON BLACK SQUARE...U+324F ( ㉏ ) CIRCLED NUMBER EIGHTY ON BLACK SQUARE

U+24FE ( ⓾ ) DOUBLE CIRCLED NUMBER TEN

U+2789 ( ➉ ) DINGBAT CIRCLED SANS-SERIF NUMBER TEN

U+2793 ( ➓ ) DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN
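The changes in lists A and B can be expressed as an override table applied on top of the current data. The following is a sketch only, assuming Python's standard unicodedata module for the current values; the names DIGIT_TO_NUMERIC, NUMERIC_TO_DIGIT, current_numeric_type, and proposed_numeric_type are hypothetical. The series CIRCLED NUMBER TWENTY ONE to CIRCLED NUMBER FIFTY is entered as two ranges because those characters are encoded in two separate runs, and the code points between the runs are not affected.

```python
import unicodedata

# List A: currently Numeric_Type=Digit, proposed Numeric_Type=Numeric.
DIGIT_TO_NUMERIC = [
    (0x10E60, 0x10E68),  # RUMI DIGIT ONE..NINE
    (0x11052, 0x1105A),  # BRAHMI NUMBER ONE..NINE
    (0x1369, 0x1371),    # ETHIOPIC DIGIT ONE..NINE
    (0x10A40, 0x10A43),  # KHAROSHTHI DIGIT ONE..FOUR
    (0x19DA, 0x19DA),    # NEW TAI LUE THAM DIGIT ONE
]

# List B: currently Numeric_Type=Numeric, proposed Numeric_Type=Digit.
NUMERIC_TO_DIGIT = [
    (0x2469, 0x2473),  # CIRCLED NUMBER TEN..TWENTY
    (0x247D, 0x2487),  # PARENTHESIZED NUMBER TEN..TWENTY
    (0x2491, 0x249B),  # NUMBER TEN FULL STOP..NUMBER TWENTY FULL STOP
    (0x277F, 0x277F),  # DINGBAT NEGATIVE CIRCLED NUMBER TEN
    (0x24EB, 0x24F4),  # NEGATIVE CIRCLED NUMBER ELEVEN..TWENTY
    (0x3251, 0x325F),  # CIRCLED NUMBER TWENTY ONE..THIRTY FIVE
    (0x32B1, 0x32BF),  # CIRCLED NUMBER THIRTY SIX..FIFTY
    (0x3248, 0x324F),  # CIRCLED NUMBER TEN..EIGHTY ON BLACK SQUARE
    (0x24FE, 0x24FE),  # DOUBLE CIRCLED NUMBER TEN
    (0x2789, 0x2789),  # DINGBAT CIRCLED SANS-SERIF NUMBER TEN
    (0x2793, 0x2793),  # DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN
]


def _in_ranges(cp: int, ranges) -> bool:
    """True if cp falls inside any inclusive (low, high) range."""
    return any(low <= cp <= high for low, high in ranges)


def current_numeric_type(ch: str) -> str:
    """Numeric_Type according to the UCD data bundled with the runtime."""
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return "None"


def proposed_numeric_type(ch: str) -> str:
    """Numeric_Type after applying the changes in lists A and B."""
    cp = ord(ch)
    if _in_ranges(cp, DIGIT_TO_NUMERIC):
        return "Numeric"
    if _in_ranges(cp, NUMERIC_TO_DIGIT):
        return "Digit"
    return current_numeric_type(ch)
```

Characters outside the two lists keep whatever value the runtime reports, so proposed_numeric_type("\u2460") (CIRCLED DIGIT ONE) stays Digit while proposed_numeric_type("\u2469") (CIRCLED NUMBER TEN) becomes Digit and proposed_numeric_type("\u1369") (ETHIOPIC DIGIT ONE) becomes Numeric.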

Background Information

Current contents, for comparison

- http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{nt=decimal}
- http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{nt=digit}
- http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{nt=numeric}

Text:

Numeric_Type=Decimal & General_Category=Decimal_Number

- Used in a decimal radix positional number system
- Represented in Unicode by 10 values, 0-9, contiguously encoded
- By definition, the two property values are coextensive
- Current UAX 44 gloss: a decimal digit

General_Category=Letter_Number

- Current UAX 44 gloss: a letterlike numeric character

General_Category=Other_Number

- Current UAX 44 gloss: a numeric character of other type

Numeric_Type=Digit

- Current UAX 44 gloss: This covers digits that need special handling, such as the compatibility superscript digits.

Numeric_Type=Numeric

- Current UAX 44 gloss: This includes fractions such as, for example, "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Some characters have these properties based on values from the Unihan data files. See Numeric_Type, Han.
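As an illustration of the fraction values mentioned in the gloss above, the following sketch recovers a value such as "1/5" from Python's standard unicodedata module. unicodedata.numeric() returns a float, so the code converts it back to a small rational. The function name numeric_value and the denominator bound of 1000 are assumptions made for this example.

```python
import unicodedata
from fractions import Fraction


def numeric_value(ch: str, max_denominator: int = 1000):
    """Return the numeric value of ch as a Fraction, or None if it has none."""
    value = unicodedata.numeric(ch, None)
    if value is None:
        return None
    # The runtime stores the value as a float; recover the small rational behind it.
    return Fraction(value).limit_denominator(max_denominator)
```

For U+2155 VULGAR FRACTION ONE FIFTH, str(numeric_value("\u2155")) gives "1/5", matching the form used in the gloss.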