L2/10-304
                                             
Title:  Response to Proposed Collation Changes for 6.0 (L2/10-275R)

Author: Ken Whistler

Date:   August 6, 2010

Status: For consideration by the UTC


I have some comments on L2/10-275R. I won't reiterate
the discussion in L2/10-275R -- it is assumed as
background for these comments. Section numbers below refer to
the numbered sections in L2/10-275R.

*****************************************************************

Re the Section 1 recommendations about weighting of four characters
in the DUCET for UCA, I strongly disagree with the recommendations.

U+20A8 RUPEE SIGN

The existing rupee sign isn't treated as a unitary
sign like "$". When people use it, they are typically actually
typing "Rs" as a sequence of letters, and it has a compatibility
decomposition precisely for that reason. It differs from Pesetas,
which had a long symbolic history inside IBM before it came to
Unicode. I think the correct default weighting for U+20A8 is
to treat it as roughly equal to the sequence "Rs".

Any *new* Indian Rupee Sign, on the other hand, will be such a unitary
currency sign, and should be treated like the other
currency signs for the purposes of collation.

It has been argued that other currency signs which resemble letters
or are derived as diacritic modifications of letters are not
weighted similarly to letters (which is true -- see, e.g. the yen
sign, the euro sign, the pound sign, etc.), so this should be
the case for *all* currency signs. However, "Rs" is not a
diacritic modification of some letter to make it a symbol -- it
simply is the sequence of "R" followed by "s".
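
The decomposition difference can be checked directly against the
UCD. The following is a minimal sketch, assuming Python's standard
unicodedata module; the helper name compat_expansion and the
choice of comparison signs are illustrative only. It says nothing
about DUCET weights themselves, only about which currency signs
carry a compatibility decomposition.

```python
import unicodedata

def compat_expansion(ch: str) -> str:
    """Return the full compatibility decomposition (NFKD) of ch."""
    return unicodedata.normalize("NFKD", ch)

# Currency signs chosen for comparison (illustrative selection).
signs = {
    "RUPEE SIGN": "\u20A8",
    "PESETA SIGN": "\u20A7",
    "DOLLAR SIGN": "\u0024",
    "POUND SIGN": "\u00A3",
    "YEN SIGN": "\u00A5",
    "EURO SIGN": "\u20AC",
}

for name, ch in signs.items():
    expanded = compat_expansion(ch)
    if expanded != ch:
        status = "decomposes to " + repr(expanded)
    else:
        status = "no compatibility decomposition"
    print(f"U+{ord(ch):04X} {name}: {status}")
```

Of the signs listed, only U+20A8 expands, and it expands to the
letter sequence "Rs".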

U+FDFC RIAL SIGN

This currency sign is another of the Arabic word ligatures.
I think the best *default* treatment of this is to simply
weight it as if it were the spelled out "rial" word. This
is most similar to the RUPEE SIGN, and should *not* be
weighted as an arbitrary unitary symbol among the currency
signs.

I could be convinced otherwise if somebody would bring
convincing evidence that this really is treated as an
unanalyzed, separate symbol, rather than as a word ligature
with a special presentation. Otherwise, I think any
special weighting behavior for this should be treated as
a possible tailoring consideration for a Pashtun or Dari
tailoring, but does not belong in the default table.

U+19DE NEW TAI LUE SIGN LAE
U+19DF NEW TAI LUE SIGN LAEV

The two New Tai Lue signs are word ligatures, and should
be collated as such. In other words, their default collation
weights are correct.

If anything, the questionable thing
about these two characters is their General_Category assignment,
gc=Po. If having primary weights for the signs amongst
the other New Tai Lue letters is a problem for the proposal
because these are gc=Po, the correct action is to change the
General_Category assignment instead, to make them gc=Lo.
(This, of course, may have ramifications for identifiers and
other derived properties, which would have to be nailed down
carefully.)

*****************************************************************

Re Section 2, I would have no objection to making U+0640
ARABIC TATWEEL (and by extension U+07FA and any other
tatweel encoded in the future for any other script) completely
ignorable. That does seem the correct thing to do. These
are already special-cased in the sifter code for production
of the DUCET table, and changing that special handling to
make them ignorable is not complicated. 

It should be noted that there are many precomposed characters
in the standard with U+0640 as part of their compatibility
decompositions -- e.g. U+FE71 ARABIC TATWEEL WITH FATHATAN ABOVE.
But those also have special handling in the sifter code, and
I think the current weightings would be consistent with
a decision to make U+0640 ignorable.
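
The affected set can be enumerated mechanically. The following is
a minimal sketch, assuming Python's standard unicodedata module;
the function name chars_decomposing_to_tatweel is hypothetical.
It lists the characters whose decomposition mapping contains
U+0640, using whatever UCD version the Python runtime ships. It
does not reproduce the sifter's special handling.

```python
import sys
import unicodedata

TATWEEL = "0640"  # U+0640 ARABIC TATWEEL, as it appears in mappings

def chars_decomposing_to_tatweel():
    """Code points whose decomposition mapping includes U+0640."""
    found = []
    for cp in range(sys.maxunicode + 1):
        # decomposition() returns e.g. "<medial> 0640 064B", or ""
        # when the character has no decomposition mapping.
        fields = unicodedata.decomposition(chr(cp)).split()
        if TATWEEL in fields:
            found.append(cp)
    return found

for cp in chars_decomposing_to_tatweel():
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: "
          f"{unicodedata.decomposition(ch)}")
```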


*****************************************************************

Re Section 3, Grouping Punctuation

The basic idea presented here is fine, and it seems fine
to produce a tailored DUCET that handles punctuation differently
from symbols.

The thing I want to point out explicitly here, however, because
it isn't made clear in L2/10-275R, is that there is a
stability guarantee implied here. In fact, the reason
Section 1 of L2/10-275R concerns itself with
the two New Tai Lue word ligatures is that, for this implementation
of DUCET tailoring to work this way for *all* punctuation, all
punctuation has to be weighted in a certain range in the
default DUCET for UCA.

The missing statement in the document is:

"... what we care about is that... all punctuation is Variable."

By "Variable" here is meant that any punctuation needs to be given
one of the starred weights in the table, and those weights,
by design, must be less than any regular primary weights.
That is what the two New Tai Lue characters ran afoul of.
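
The condition can be stated as a check over table entries. The
following is a minimal sketch, assuming entries in the style of
the allkeys.txt data file (code point, then collation elements
whose first character is "*" for Variable or "." otherwise). The
function name nonvariable_punctuation is hypothetical, and the
weights in SAMPLE are invented for illustration; they are not
taken from any released table. The General_Category lookup is a
parameter so that values from a particular UCD version can be
supplied in place of those shipped with the Python runtime.

```python
import re
import unicodedata

# Single-code-point entry and its first collation element, e.g.
#   0021 ; [*0255.0020.0002] # EXCLAMATION MARK
ENTRY = re.compile(r"^([0-9A-F]{4,6})\s*;\s*\[([.*])([0-9A-F]{4})")

def nonvariable_punctuation(lines, category=unicodedata.category):
    """Return gc=P code points whose first element is not Variable."""
    offenders = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m is None:
            continue  # comments, directives, multi-code-point keys
        cp = int(m.group(1), 16)
        is_variable = m.group(2) == "*"
        if category(chr(cp)).startswith("P") and not is_variable:
            offenders.append(cp)
    return offenders

# Invented weights; the third entry is contrived to violate the rule.
SAMPLE = [
    "0021 ; [*0255.0020.0002] # EXCLAMATION MARK",
    "0061 ; [.15D0.0020.0002] # LATIN SMALL LETTER A",
    "003F ; [.15D1.0020.0002] # QUESTION MARK (contrived)",
]

print([f"U+{cp:04X}" for cp in nonvariable_punctuation(SAMPLE)])
# prints ['U+003F']
```

Any character reported by such a check is in the same position as
the two New Tai Lue signs: gc=P, but with a regular primary weight.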

If the UTC goes along with this "informational" section, essentially
it will be constrained in the future as to how it can either:

   A. Weight by default any characters identified as gc=P
      in the UCD, for the DUCET table.
      
   and/or
   
   B. Decide that the General_Category for an existing character
      should be gc=P, whether or not it is already treated as
      Variable for the DUCET table.
      
In other words, there is an unspoken Stability Guarantee
masquerading behind this informational section about how
CLDR wants to tailor the DUCET.

Maybe this is o.k., but I would prefer that the UTC take note
of this eyes wide open, rather than eyes wide shut.