L2/12-210

Title: Revert Cuneiform Numeric Changes for Unicode 6.2
Source: Ken Whistler
Date: June 6, 2012
Action: For consideration by UTC


Background

At UTC meeting #131, the UTC decided to change the numeric values of
six Cuneiform numeric signs for Unicode 6.2. The decision was taken
by consensus 131-C30, and the relevant changes for UnicodeData.txt
for Unicode 6.2 were recorded as:

1240F;CUNEIFORM NUMERIC SIGN FOUR U;Nl;0;L;;;;4;N;;;;;        4 → 40
12410;CUNEIFORM NUMERIC SIGN FIVE U;Nl;0;L;;;;5;N;;;;;        5 → 50
12411;CUNEIFORM NUMERIC SIGN SIX U;Nl;0;L;;;;6;N;;;;;        6 → 60
12412;CUNEIFORM NUMERIC SIGN SEVEN U;Nl;0;L;;;;7;N;;;;;        7 → 70
12413;CUNEIFORM NUMERIC SIGN EIGHT U;Nl;0;L;;;;8;N;;;;;        8 → 80
12414;CUNEIFORM NUMERIC SIGN NINE U;Nl;0;L;;;;9;N;;;;;        9 → 90

This change was made on the basis of general feedback on the UCD posted
in L2/12-160. The UTC did not record any independent rationale for the change,
beyond what was expressed in the report in L2/12-160. Essentially, the
point made there was that the Cuneiform "U" signs are used for the 10's
series (which is true), so their Unicode Numeric_Value property should
be updated to reflect those values, rather than retaining the digit values
they have in UnicodeData.txt for Unicode 6.1 (and all prior versions).
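
As an illustration only (not part of the UTC record), the following
Python sketch reads the six UnicodeData.txt records quoted above and
pairs each Numeric_Value with the value recorded under 131-C30. The
function name and the "multiply by ten" shortcut are assumptions made
for this example; the only thing taken from the file format itself is
that Numeric_Value is the ninth semicolon-delimited field.

```python
# Illustrative sketch: pull Numeric_Value out of UnicodeData.txt records.
# The records are the Unicode 6.1 lines quoted in this document.
RECORDS = """\
1240F;CUNEIFORM NUMERIC SIGN FOUR U;Nl;0;L;;;;4;N;;;;;
12410;CUNEIFORM NUMERIC SIGN FIVE U;Nl;0;L;;;;5;N;;;;;
12411;CUNEIFORM NUMERIC SIGN SIX U;Nl;0;L;;;;6;N;;;;;
12412;CUNEIFORM NUMERIC SIGN SEVEN U;Nl;0;L;;;;7;N;;;;;
12413;CUNEIFORM NUMERIC SIGN EIGHT U;Nl;0;L;;;;8;N;;;;;
12414;CUNEIFORM NUMERIC SIGN NINE U;Nl;0;L;;;;9;N;;;;;
"""

NUMERIC_FIELD = 8  # zero-based index of the Numeric_Value field


def parse_numeric_values(text):
    """Return {code point: (name, general category, numeric value string)}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        result[code_point] = (fields[1], fields[2], fields[NUMERIC_FIELD])
    return result


values = parse_numeric_values(RECORDS)

# Assumption for this example: the 131-C30 change is each digit value
# multiplied by ten (4 -> 40, ..., 9 -> 90), as listed above.
changed = {cp: str(int(nv) * 10) for cp, (_, _, nv) in values.items()}

for cp in sorted(values):
    name, gc, nv = values[cp]
    print(f"U+{cp:04X} {name} (gc={gc}): nv={nv} -> {changed[cp]}")
```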

Analysis

The change which the UTC made in consensus 131-C30 at first looks
innocuous, but it turns out to have hidden consequences which in hindsight
make it an undesirable change, in my opinion.

First, one can ask, what is this really fixing? The Cuneiform numbering system
is complicated and completely unlike any decimal radix system. Any
computational implementation of Cuneiform numbers by necessity requires
a lot of specialized code which cannot depend simply on UnicodeData
Numeric_Value property values, anyway. The Cuneiform 10's series, in
particular, is defective in this regard, because the representations of 10,
20, and 30 depend on *other* "U" signs which are encoded amongst the
main listing of Cuneiform signs (with gc=Lo and nv=NaN), because they
also function importantly as syllabic values. In one case (20), the numeric
sign has to be represented by a *sequence* of two single "U" signs, so
that instance is even more complicated.

So while the change in Numeric_Value for these 6 characters seems to be
making the UnicodeData values marginally more correct, that change
does not (and cannot) make the values correct for all of the signs.

But what does it hurt? Well, it turns out that there is at least one other
algorithm which is dependent on the *existing* Numeric_Value for these
6 characters, which is disrupted by changing them: the Unicode Collation
Algorithm. The program which is used to generate the DUCET table has
special-case code to handle digit values, so that all "4"'s (etc.) from all scripts
end up with the same primary weight, and are distinguished by manufactured
secondary weights specific to different scripts. Changing the Numeric_Value
for precisely 6 of the 90 different integral-value Cuneiform numeric signs in
the range U+12400..U+12459 meant that in production of the DUCET draft
for UCA 6.2, the collation weights for those 6 signs dropped out of the
branch which does comparable processing for all the "digits" 0..9. That
meant that those 6 signs, and only those signs, ended up with primary-ignorable
weights and would then end up in lists completely removed from the
other 84 Cuneiform numeric signs which consist of varying numbers of
stacked strokes 2 through 9.

To fix that problem for DUCET for UCA 6.2, I was forced to put in place
YACH (yet another crappy hack) for the sifter program, which had to
catch and *undo* precisely this change, so that the default collation for
the Cuneiform Numbers and Punctuation block would not be destabilized
by the changes to Numeric_Value for the 6 characters in question.
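
The effect described above, and the workaround, can be pictured with a
toy model in Python. This is a sketch only: the actual sifter program is
not shown in this document, and the function names, branch labels, and
override table below are all hypothetical. It assumes only what is
stated above: signs whose Numeric_Value is a digit 0..9 go through a
common digit branch, and the workaround restores the Unicode 6.1 values
for the six signs before that test is applied.

```python
# Toy model (hypothetical; NOT the real sifter): which branch a numeric
# sign takes when collation weights are generated.

# Numeric_Value for the six signs as changed by 131-C30.
CHANGED_NV = {
    0x1240F: 40, 0x12410: 50, 0x12411: 60,
    0x12412: 70, 0x12413: 80, 0x12414: 90,
}

# Hypothetical override table: the Unicode 6.1 digit values (4..9),
# used to undo the change for collation purposes only.
OVERRIDE_NV = {cp: nv // 10 for cp, nv in CHANGED_NV.items()}


def branch(nv):
    """Return 'digit' for values 0..9, 'other' for everything else."""
    return "digit" if 0 <= nv <= 9 else "other"


def classify(numeric_values, use_override):
    """Map each code point to the branch it would take."""
    result = {}
    for cp, nv in numeric_values.items():
        if use_override:
            nv = OVERRIDE_NV.get(cp, nv)
        result[cp] = branch(nv)
    return result


# The six changed signs plus one unchanged neighbour for comparison:
# U+12400 CUNEIFORM NUMERIC SIGN TWO ASH, Numeric_Value 2.
data = dict(CHANGED_NV)
data[0x12400] = 2

without_hack = classify(data, use_override=False)
with_hack = classify(data, use_override=True)

for cp in sorted(data):
    print(f"U+{cp:04X}: {without_hack[cp]} -> {with_hack[cp]}")
```

Without the override, only the six changed signs leave the digit branch
while their neighbour stays in it; with the override, all of them are
handled together again.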

Proposal

Because the change in Numeric_Value for the 6 Cuneiform numeric signs
U+1240F..U+12414 doesn't really accomplish anything significant for
implementation of Cuneiform numbers, and because it disturbs values
which have been in place since Unicode 5.0, and mostly because it
actually *creates* problems for the generation of DUCET for the UCA
(and inconsistency in that table for the handling of the Cuneiform
numeric signs), I propose that the UTC revert its decision of 131-C30
before the publication of Unicode 6.2 makes the change a fait accompli
(and a further permanent wart on the UCD).

To address the concerns originally expressed in the report in L2/12-160,
I suggest instead that the numeric signs in question simply be annotated
in the Unicode names list regarding their use in Cuneiform numbers.
That should suffice to draw attention to their status in expressing the
10's series, without destabilizing existing processing which depends
on the current digit Numeric_Value assignments.