L2/12-210

Title:  Revert Cuneiform Numeric Changes for Unicode 6.2
Source: Ken Whistler
Date:   June 6, 2012
Action: For consideration by UTC

Background

At UTC meeting #131, the UTC decided to change the numeric values of six Cuneiform numeric signs for Unicode 6.2. The decision was taken by consensus 131-C30, and the relevant changes for UnicodeData.txt for Unicode 6.2 were recorded as:

1240F;CUNEIFORM NUMERIC SIGN FOUR U;Nl;0;L;;;;4;N;;;;;      4 → 40
12410;CUNEIFORM NUMERIC SIGN FIVE U;Nl;0;L;;;;5;N;;;;;      5 → 50
12411;CUNEIFORM NUMERIC SIGN SIX U;Nl;0;L;;;;6;N;;;;;       6 → 60
12412;CUNEIFORM NUMERIC SIGN SEVEN U;Nl;0;L;;;;7;N;;;;;     7 → 70
12413;CUNEIFORM NUMERIC SIGN EIGHT U;Nl;0;L;;;;8;N;;;;;     8 → 80
12414;CUNEIFORM NUMERIC SIGN NINE U;Nl;0;L;;;;9;N;;;;;      9 → 90

This change was made based on general feedback on the UCD posted in L2/12-160. The UTC did not record any independent rationale for the change, beyond what was expressed in the report in L2/12-160. Essentially, the point made there was that the Cuneiform "U" signs are used for the 10's series (which is true), so their Unicode Numeric_Value property should be updated to reflect those values, rather than retaining the digit values they have in UnicodeData.txt for Unicode 6.1 (and all prior versions).

Analysis

The change which the UTC made in consensus 131-C30 at first looks innocuous, but it turns out to have hidden consequences which in hindsight make it an undesirable change, in my opinion.

First, one can ask, what is this really fixing? The Cuneiform numbering system is complicated and completely unlike any decimal radix system. Any computational implementation of Cuneiform numbers by necessity requires a lot of specialized code which cannot depend simply on UnicodeData Numeric_Value property values, anyway. The Cuneiform 10's series, in particular, is defective in this regard, because the representations of 10, 20, and 30 depend on *other* "U" signs which are encoded amongst the main listing of Cuneiform signs (with gc=Lo and nv=NaN), because they also function importantly as syllabic values. In one case (20), the numeric sign has to be represented by a *sequence* of two single "U" signs, so that instance is even more complicated. So while the change in Numeric_Value for these 6 characters seems to be making the UnicodeData values marginally more correct, that change does not (and cannot) make the values correct for all of the signs.

But what does it hurt? Well, it turns out that there is at least one other algorithm which depends on the *existing* Numeric_Value for these 6 characters, and which is disrupted by changing them: the Unicode Collation Algorithm. The program which is used to generate the DUCET table has special-case code to handle digit values, so that all "4"'s (etc.) from all scripts end up with the same primary weight, and are distinguished by manufactured secondary weights specific to different scripts. Changing the Numeric_Value for precisely 6 of the 90 different integral-value Cuneiform numeric signs in the range U+12400..U+12459 meant that, in production of the DUCET draft for UCA 6.2, the collation weights for those 6 signs were bled from the branch which does comparable processing for all the "digits" 0..9. That meant that those 6 signs, and only those signs, ended up with primary-ignorable weights and would then end up in lists completely removed from the other 84 Cuneiform numeric signs which consist of varying numbers of stacked strokes 2 through 9.
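To make the mechanism concrete, the following is a minimal sketch of the digit-handling idea just described, assuming a deliberately simplified model of the DUCET-generation program. It is not the actual sifter code; the function name, parameters, data shape, and weight values are all invented for illustration only.

    # Sketch of the digit-handling branch described above (not the actual
    # sifter). Helper names and weight values are hypothetical.

    DIGIT_PRIMARY_BASE = 0x2000  # hypothetical primary weight for numeric value 0

    def assign_weights(cp, script, nv, script_secondary, other_weighting):
        """Return (primary, secondary) weights for one character (sketch only).

        nv is the character's Numeric_Value, or None for nv=NaN.
        script_secondary and other_weighting stand in for the rest of the program.
        """
        if nv is not None and float(nv).is_integer() and 0 <= nv <= 9:
            # Digit branch: every "4" (etc.) from every script gets the same
            # primary weight; scripts are separated by a manufactured secondary.
            return (DIGIT_PRIMARY_BASE + int(nv), script_secondary(script))
        # Everything else is weighted by other rules. With nv changed from
        # 4..9 to 40..90, the six "U" signs no longer satisfy the 0..9 test
        # above and fall through here.
        return other_weighting(cp, script, nv)

Under the Unicode 6.1 values, U+1240F (nv=4) takes the digit branch along with the other digit-valued Cuneiform numeric signs; under the 6.2 draft values (nv=40) it does not, which is how the six signs came out primary-ignorable in the DUCET draft.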
To fix that problem for DUCET for UCA 6.2, I was forced to put in place YACH (yet another crappy hack) for the sifter program, which had to catch and *undo* precisely this change (see the sketch at the end of this document), so that the default collation for the Cuneiform Numbers and Punctuation block would not be destabilized by the changes to Numeric_Value for the 6 characters in question.

Proposal

Because the change in Numeric_Value for the 6 Cuneiform numeric signs U+1240F..U+12414 doesn't really accomplish anything significant for implementation of Cuneiform numbers, and because it disturbs values which have been in place since Unicode 5.0, and mostly because it actually *creates* problems for the generation of DUCET for the UCA (and inconsistency in that table for the handling of the Cuneiform numeric signs), I propose that the UTC revert its decision 131-C30 before the publication of Unicode 6.2 makes the change a fait accompli (and a further permanent wart on the UCD).

To address the concerns originally expressed in the report in L2/12-160, I suggest instead that the numeric signs in question simply be annotated in the Unicode names list regarding their use in Cuneiform numbers. That should suffice to draw attention to their status in expressing the 10's series, without destabilizing existing processing which depends on the current digit Numeric_Value assignments.
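For reference, here is the sketch mentioned in the Analysis section. The "YACH" workaround amounts to restoring the pre-6.2 digit values for exactly these six code points before the digit-handling branch runs. This is a hedged illustration under the same simplified model as the earlier sketch, not the real sifter's code or interfaces; the function below is hypothetical, while the code points and values come from the table in the Background section.

    # Sketch of the workaround: undo the Numeric_Value change for these six
    # code points before weights are assigned, so the default collation of
    # the Cuneiform Numbers and Punctuation block is unchanged.

    CUNEIFORM_NV_OVERRIDES = {
        0x1240F: 4,  # CUNEIFORM NUMERIC SIGN FOUR U   (6.2 draft nv: 40)
        0x12410: 5,  # CUNEIFORM NUMERIC SIGN FIVE U   (6.2 draft nv: 50)
        0x12411: 6,  # CUNEIFORM NUMERIC SIGN SIX U    (6.2 draft nv: 60)
        0x12412: 7,  # CUNEIFORM NUMERIC SIGN SEVEN U  (6.2 draft nv: 70)
        0x12413: 8,  # CUNEIFORM NUMERIC SIGN EIGHT U  (6.2 draft nv: 80)
        0x12414: 9,  # CUNEIFORM NUMERIC SIGN NINE U   (6.2 draft nv: 90)
    }

    def effective_numeric_value(cp, nv):
        """Numeric value as seen by the weight-assignment pass (sketch only)."""
        return CUNEIFORM_NV_OVERRIDES.get(cp, nv)

Reverting consensus 131-C30, as proposed above, would make this kind of override unnecessary.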