L2/10-304

Title: Response to Proposed Collation Changes for 6.0 (L2/10-275R)
Author: Ken Whistler
Date: August 6, 2010
Status: For consideration by the UTC

I have some comments to make about L2/10-275R. I won't reiterate the discussion already made in L2/10-275R -- it is assumed as background for these comments. Sections I refer to are the numbered sections in L2/10-275R.

*****************************************************************

Re the Section 1 recommendations about weighting of four characters in the DUCET for UCA, I strongly disagree with the recommendations.

U+20A8 RUPEE SIGN

The existing rupee sign isn't treated as a unitary sign like "$". When people use it, they are typically actually typing "Rs" as a sequence of letters, and it has a compatibility decomposition precisely for that reason. It differs from Pesetas, which had a long symbolic history inside IBM before it came to Unicode. I think the correct default weighting for U+20A8 is to treat it as roughly equal to the sequence "Rs".

Any *new* Indian Rupee Sign, on the other hand, will be such a unitary currency sign, and should be treated like the other currency signs for the purposes of collation.

It has been argued that other currency signs which resemble letters or are derived as diacritic modifications of letters are not weighted similarly to letters (which is true -- see, e.g. the yen sign, the euro sign, the pound sign, etc.), so this should be the case for *all* currency signs. However, "Rs" is not a diacritic modification of some letter to make it a symbol -- it simply is the sequence of "R" followed by "s".

U+FDFC RIAL SIGN

This currency sign is another of the Arabic word ligatures. I think the best *default* treatment of this is to simply weight it as if it were the spelled out "rial" word. This is most similar to the RUPEE SIGN, and should *not* be weighted as an arbitrary unitary symbol among the currency signs.
I could be convinced otherwise if somebody would bring convincing evidence that this really is treated as an unanalyzed, separate symbol, rather than as a word ligature with a special presentation. Otherwise, I think any special weighting behavior for this should be treated as a possible tailoring consideration for a Pashtun or Dari tailoring, but does not belong in the default table.

U+19DE NEW TAI LUE SIGN LAE
U+19DF NEW TAI LUE SIGN LAEV

The two New Tai Lue signs are word ligatures, and should be collated as such. In other words, their default collation weights are correct. If anything, the questionable thing about these two characters is their General_Category assignment, gc=Po.

If having primary weights for the signs amongst the other New Tai Lue letters is a problem for the proposal because these are gc=Po, the correct action is to change the General_Category assignment instead, to make them gc=Lo. (This, of course, may have ramifications for identifiers and other derived properties, which would have to be nailed down carefully.)

*****************************************************************

Re Section 2, I would have no objection to making U+0640 ARABIC TATWEEL (and by extension U+07FA and any other tatweel encoded in the future for any other script) completely ignorable. That does seem the correct thing to do. These are already special-cased in the sifter code for production of the DUCET table, and changing that special handling to make them ignorable is not complicated.

It should be noted that there are many precomposed characters in the standard with U+0640 as part of their compatibility decompositions -- e.g.

  U+FE71 ARABIC TATWEEL WITH FATHATAN ABOVE

But those also have special handling in the sifter code, and I think the current weightings would be consistent with a decision to make U+0640 ignorable.
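
For readers who want to look at the compatibility decompositions mentioned in Sections 1 and 2, here is a minimal sketch using Python's standard unicodedata module. The helper names (compat_expansion, code_points) are made up for this illustration. It only inspects decomposition data from whatever version of the UCD ships with the Python in use; it says nothing about DUCET weights or about what the sifter code does.

```python
import unicodedata

# Characters discussed above, by code point.
RUPEE_SIGN = "\u20A8"
RIAL_SIGN = "\uFDFC"
TATWEEL = "\u0640"
TATWEEL_WITH_FATHATAN_ABOVE = "\uFE71"


def compat_expansion(ch: str) -> str:
    """Return the full compatibility decomposition (NFKD) of one character."""
    return unicodedata.normalize("NFKD", ch)


def code_points(s: str) -> list:
    """Format each character of s as a U+XXXX label (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in s]


if __name__ == "__main__":
    for ch in (RUPEE_SIGN, RIAL_SIGN, TATWEEL_WITH_FATHATAN_ABOVE):
        print(unicodedata.name(ch), "->", code_points(compat_expansion(ch)))
```

The rupee sign expands to the letter sequence "Rs", the rial sign expands to a sequence of four Arabic letters, and U+FE71 expands to a sequence that begins with U+0640.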
*****************************************************************

Re Section 3, Grouping Punctuation

The basic idea presented here is fine, and it seems fine to produce a tailored DUCET that handles punctuation differently from symbols.

The thing I want to point out explicitly here, however, because it isn't made clear in L2/10-275R, is that there is a stability guarantee implied here. In fact, the reason why Section 1 of this document concerns itself with the two New Tai Lue word ligatures is because for this implementation of DUCET tailoring to work this way for *all* punctuation, all punctuation has to be weighted in a certain range in the default DUCET for UCA.

The missing statement in the document is:

  "... what we care about is that... all punctuation is Variable."

By "Variable" here is meant that any punctuation needs to be given one of the starred weights in the table, and those weights, by design, must be less than any regular primary weights. That is what the two New Tai Lue characters ran afoul of.

If the UTC goes along with this "informational" section, essentially it will be constrained in the future as to how it can either:

A. Weight by default any characters identified as gc=P in the UCD, for the DUCET table.

and/or

B. Decide the General_Category for an existing character should be gc=P, if it is already treated as Variable or not for the DUCET table.

In other words, there is an unspoken Stability Guarantee masquerading behind this informational section about how CLDR wants to tailor the DUCET. Maybe this is o.k., but I would prefer that the UTC take note of this eyes wide open, rather than eyes wide shut.
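
To make the implied constraint concrete, here is a rough sketch of how "all punctuation is Variable" could be checked mechanically against allkeys-style entries. Everything here is an assumption for illustration: the function name, the simplified entry pattern (single code point entries only, and only the first collation element is examined), and the sample weights, which are invented and are not real DUCET values. The General_Category lookup defaults to Python's bundled UCD, which may assign different categories than the version under discussion, so it can be overridden. The sketch checks only the starred marker, not that the primary weight is below all regular primaries.

```python
import re
import unicodedata

# One allkeys-style entry: "XXXX ; [*pppp.ssss.tttt...] # comment".
# A leading "*" marks a variable collation element, "." a non-variable one.
ENTRY = re.compile(r"^\s*([0-9A-Fa-f]{4,6})\s*;\s*\[([*.])")


def non_variable_punctuation(lines, category=unicodedata.category):
    """Return gc=P code points whose first collation element is not variable."""
    offenders = []
    for line in lines:
        m = ENTRY.match(line)
        if m is None:
            continue  # comments, blank lines, multi-code-point entries
        cp = int(m.group(1), 16)
        marker = m.group(2)
        if category(chr(cp)).startswith("P") and marker != "*":
            offenders.append(f"U+{cp:04X}")
    return offenders


# Illustrative sample only: the weights below are invented.
SAMPLE = [
    "0021 ; [*0260.0020.0002] # EXCLAMATION MARK",
    "0061 ; [.15D0.0020.0002] # LATIN SMALL LETTER A",
    "003F ; [.1700.0020.0002] # QUESTION MARK (deliberately mis-weighted)",
]

if __name__ == "__main__":
    print(non_variable_punctuation(SAMPLE))
```

A check of this kind would flag any character that is gc=P but carries a regular primary weight, which is the situation described above for the two New Tai Lue signs.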