L2/12-230

Title: Suggested Changes to DUCET for UCA 6.2.1

Authors: Ken Whistler, Mark Davis, Markus Scherer

Date: July 17, 2012

Action: For consideration by the UTC


Summary

In offline discussion and analysis of implementations of the UCA,
we have identified a number of small changes to the way the DUCET
table for UCA is generated that could significantly improve the
handling of secondary weights.

We first present the proposed changes that we would like the UTC
to authorize for the *next* revision of UCA after UCA 6.2. Then we
discuss the rationale and implications for each. Finally, at the
bottom of the document, we provide a draft of the document that
we would ask to be submitted to the OWG-SORT, if the UTC authorizes
these changes to DUCET for the UCA, so that the corresponding
changes for the CTT for ISO 14651 could be tabled for consideration.

Proposal

Make the following changes to secondary weights in DUCET for UCA 6.2.1:

1. Remove all script-specific secondaries for digits in DUCET. Instead, weight
all the digits with 0020 for the secondary ("<BASE>" for 14651).

2. Remove all the gaps in the secondaries range which were created
originally in DUCET to correspond to combined accent symbols for
the CTT for ISO 14651.

3. Move a targeted few high-use combining marks in DUCET so they
end up with lower secondary weights. The recommended list (which
might be extended in discussion) is:

3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
0335 COMBINING SHORT STROKE OVERLAY

U+3099 and U+309A are associated with secondary weights which are
quite common in Japanese data. (Note that if these two characters have
their secondary weights reassigned in DUCET, the compatibility
characters U+FF9E and U+FF9F would also automatically have their
weights adjusted.)

U+0335 is associated with a secondary weight which is also fairly
common in samples of the web. It is seen in common letters for
Polish and Icelandic language material.

Although unrelated to secondary weights in DUCET, we also suggest the
following change:

4. Remove 4th level weights from the DUCET listing.

Rationale and Analysis

Regarding item 1, script secondaries were originally added to DUCET (and to
the CTT for ISO 14651) on the theory that different sets of digits were
notionally like diacritic-marked variants of letters, so it made sense to give
each distinct set of script-specific digits a separate (synthesized) secondary
weight to distinguish them in collation weighting. However, as Unicode has
grown and added more and more sets of script-specific digits, this weight
assignment protocol has, by necessity, led to use of more and more secondary
weights in the table. Expanding the use of secondary weights has a fairly
significant implication for the UCA algorithm, as optimizations of UCA often
try to pack down secondary weights into minimal sets of values, to minimize
generated key size, as well as collation table size.

As it turns out, there also is no really good reason for maintaining secondary
weights for sets of digits, anyway. Sets of digits from different scripts are
rarely mixed in real data, at least not in ways that matter for *sorting*
data. And even when more than one set of digits occurs in data sets which need
to be sorted, they would have to be mixed *within* number strings for the
secondary weights to start making a difference for sorting outcome. That
situation almost never occurs in non-artificial data. As a result, we think it
would be quite feasible to simply remove *all* of the secondary weights now
used in the DUCET for different sets of digits, and just replace them all with
the lowest secondary weight, 0020. This would work fine as a default for UCA.
Any highly specialized implementation could still distinguish sets of digits
with a secondary weight by simply tailoring the particular set(s) of digits
with which they are concerned.
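
The point about digits can be illustrated with a minimal sketch in Python.
The weight values, the table layout, and the helper names make_table and
sort_key are assumptions made up for this illustration; they are not actual
DUCET values, and the key is a simplified two-level key rather than a full
UCA sort key. The sketch suggests that a script-specific secondary only
affects comparison when two digit strings are already identical at the
primary level.

```python
BASE_SECONDARY = 0x0020  # the general, lowest secondary weight

def make_table(script_secondary):
    """Build a toy weight table: char -> (primary, secondary).

    ASCII and Arabic-Indic digits with the same numeric value share a
    (hypothetical) primary weight and differ only in the secondary.
    """
    table = {}
    for value in range(10):
        primary = 0x1000 + value  # made-up primary, not a DUCET value
        table[chr(ord("0") + value)] = (primary, BASE_SECONDARY)
        table[chr(0x0660 + value)] = (primary, script_secondary)
    return table

def sort_key(text, table):
    """Simplified key: all primaries first, then all secondaries."""
    primaries = [table[ch][0] for ch in text]
    secondaries = [table[ch][1] for ch in text]
    return (primaries, secondaries)

current = make_table(0x0101)           # hypothetical script-specific secondary
proposed = make_table(BASE_SECONDARY)  # item 1: every digit gets 0020

ascii_12 = "12"
arabic_12 = "\u0661\u0662"  # Arabic-Indic digits one and two

# With a script-specific secondary, the two strings differ only at level 2.
print(sort_key(ascii_12, current) < sort_key(arabic_12, current))
# With the proposed weights, they produce identical keys.
print(sort_key(ascii_12, proposed) == sort_key(arabic_12, proposed))
# A primary difference decides the order either way.
print(sort_key(arabic_12, current) < sort_key("13", current))
```

An implementation that does need to distinguish the two sets of digits
would, in this toy model, simply supply its own secondary for the set
it cares about, which corresponds to the tailoring mentioned above.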

Regarding item 2, there are approximately 62 gaps currently in the secondary
ranges for DUCET. These gaps were originally placed in the sequences to
account for combined accent symbols used in the generation of the CTT for ISO
14651. In later revisions of the UCA and ISO 14651, the way sequences of
accents were handled for generation of default weights changed. And currently
all of those combined accent symbols in the CTT are simply commented out. As a
result, the gaps serve no current function. They could simply be removed from
the generation of DUCET. That would pack down all the higher secondaries by a
delta value of up to 62, allowing many more of them to have default values
which can be packed more tightly in constructed keys. Removal of the gaps
would also simplify preprocessing of the allkeys.txt table for certain
implementations.
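
The effect of removing the gaps can be sketched as follows. This is only
an illustration: the function name close_gaps and the sample weight values
are assumptions, and the sifter utility that generates DUCET may do this
quite differently. The sketch remaps the secondary weights actually in use
onto a contiguous range while preserving their relative order.

```python
def close_gaps(used_secondaries, start=0x0020):
    """Map each secondary weight in use to a contiguous value.

    Relative order is preserved, so comparisons between any two remapped
    weights give the same result as before.
    """
    ordered = sorted(set(used_secondaries))
    return {old: start + index for index, old in enumerate(ordered)}

# Made-up secondaries with five unused values (0023, 0024, 0027-0029).
used = [0x0020, 0x0021, 0x0022, 0x0025, 0x0026, 0x002A]
remap = close_gaps(used)

for old in used:
    print(f"{old:04X} -> {remap[old]:04X}")
```

In this toy input the highest weight moves down by five, which is the
number of gaps below it; weights below the first gap are unchanged.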

Regarding item 3, after the initial sequence of secondary weights for some
high-frequency combining marks, most of the rest of the secondary weights in
DUCET are initialized in an order that is roughly defaulted to code point
order, with some modifications to keep combining marks from similar scripts
together. One of the bad consequences of this choice is that a few
high-frequency combining marks that occur extensively in data end up with fairly
high values for secondary weights. Optimizations of UCA would work better if
these high-frequency secondary weights were lower in numeric value. The
obvious candidates to move to a lower position are U+3099 and U+309A for
Hiragana and Katakana and U+0335 for Polish and Icelandic. But there may be a
few others which are high enough frequency to justify also moving them lower
in the sequence.
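
A minimal sketch of such a move, under stated assumptions: the symbol
names <KNVCE>, <KNSMV>, <BARRE>, <DAROUND> and <OVERLINE> are the CTT names
used in the draft at the bottom of this document, while <MARK_A>, <MARK_B>
and <MARK_C> are hypothetical placeholders for intervening entries, and the
helper names and starting weight are invented for illustration. Secondary
weights are assumed to be assigned consecutively in list order.

```python
def move_after(order, to_move, anchor):
    """Return a new ordering with the to_move symbols placed after anchor."""
    rest = [symbol for symbol in order if symbol not in to_move]
    position = rest.index(anchor) + 1
    return rest[:position] + list(to_move) + rest[position:]

def assign_weights(order, start=0x0021):
    """Assign consecutive (made-up) secondary weights in list order."""
    return {symbol: start + index for index, symbol in enumerate(order)}

# Toy ordering; only a handful of entries, most of them placeholders.
before = ["<DAROUND>", "<OVERLINE>", "<MARK_A>", "<MARK_B>",
          "<BARRE>", "<MARK_C>", "<KNVCE>", "<KNSMV>"]
high_use = ["<KNVCE>", "<KNSMV>", "<BARRE>"]

after = move_after(before, high_use, "<DAROUND>")
old_weights = assign_weights(before)
new_weights = assign_weights(after)

for symbol in high_use:
    print(f"{symbol}: {old_weights[symbol]:04X} -> {new_weights[symbol]:04X}")
```

The moved symbols end up with lower values, while the symbols they pass
over shift up by at most the number of moved symbols.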

Regarding item 4, the 4th weights in the DUCET tables are a holdover from
earlier design goals. The original purpose was to provide a last-resort
tie-breaker, but that is currently done algorithmically by Step S3.10 in the UCA;
the UCA algorithm itself does not actually make use of the 4th weight from the
DUCET table. Moreover, the 4th level weights in allkeys.txt are confusing
because they have nothing to do with the 4th level weights constructed for
variable weighting. And as it turns out, there are also quirks in the
assignment of 4th level weights in DUCET, which mean that those weights are
not well-formed according to the definitions in UCA.

Because of these issues and others, it turns out that implementations of UCA
typically do not make any use of the 4th level weights that are listed in
DUCET. The simplest solution here is to just not include any 4th level weights
in DUCET.
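
For implementers who want to preview the effect on their table parsers,
the following is a rough sketch of dropping the 4th field from each
collation element in an allkeys.txt line. It assumes the bracketed
four-field layout of the current file, with a leading "." or "*" and
hexadecimal fields separated by periods; the function name
drop_fourth_level is hypothetical, and the weights in the sample lines
are illustrative rather than quoted from a particular release.

```python
import re

# One collation element: "[", "." or "*", three weights, then the 4th
# field that is to be dropped, then "]".
COLLATION_ELEMENT = re.compile(
    r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.[0-9A-F]{4,6}\]"
)

def drop_fourth_level(line):
    """Rewrite every four-field collation element on the line with three."""
    return COLLATION_ELEMENT.sub(r"[\1\2.\3.\4]", line)

samples = [
    "0041  ; [.15EF.0020.0008.0041] # LATIN CAPITAL LETTER A",
    "0020  ; [*0209.0020.0002.0020] # SPACE",
    "00C0  ; [.15EF.0020.0008.0041][.0000.0035.0002.0300] # with grave",
]

for sample in samples:
    print(drop_fourth_level(sample))
```

Lines that contain no four-field collation element, such as comments,
pass through unchanged.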

Note that we are requesting that these changes for DUCET be introduced for UCA
6.2.1, and *not* as part of the UCA 6.2 update which is currently in beta
review. This would give plenty of time to make the necessary adjustments to
the sifter utility which generates DUCET and plenty of time for various
implementers to review and test the resulting updates to DUCET. The 6.2.1
timeframe would be perfect for this, because it would be a release without
repertoire additions. Because of that, people could be evaluating just the
changes to the secondary weights (and omission of 4th level weights from the
table), without the confounding complications of adapting to new repertoire
additions to the weight table.
 
Corresponding Proposal for the CTT for ISO 14651

If the UTC decides to make these suggested changes to DUCET for UCA 6.2.1,
we would ask that a filled-out version of the following document also
be submitted to the OWG-SORT for the October meeting in Thailand. This would
start the process of getting
corresponding changes approved for an amendment to ISO 14651. Note that
Item #4 is a no-op for ISO 14651. The 4th level weights can be dropped from
DUCET independently of any handling of 4th level weight symbols in the CTT
for ISO 14651.

===============================================================

[ header gorp ]

The UTC suggests that the following changes be made to the CTT for ISO 14651,
in a future amendment to that standard.

1. Delete all secondary weight symbols which are only used to distinguish
different sets of script-specific digits. This would consist of the 57
collating-symbol entries in the range:

collating-symbol <ARABIC>
...
collating-symbol <COUNTINGROD>

In the main body of the CTT, replace each of those secondary weight symbols
simply with the general, lowest secondary weight: "<BASE>".

2. Delete all *commented* lines in the range of collating-symbol definitions for
secondary weight symbols. This consists of 62 lines such as:

% "<DAMMATAN><SHADDA>"

These commented lines serve no current function in the CTT, as the weighting
of such symbol sequences is handled simply by concatenation of the single
symbols in the appropriate secondary portion of the weight definitions for
characters which have accent or other diacritic sequences associated with them.

3. Move the following 3 [or number TBD] collating-symbol definitions:

collating-symbol <KNVCE>  % COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
collating-symbol <KNSMV>  % COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
collating-symbol <BARRE>  % COMBINING SHORT STROKE OVERLAY

from their current position in the list to a position between the following entries:

collating-symbol <DAROUND>  % GENERIC MARK AROUND
collating-symbol <OVERLINE>  % COMBINING OVERLINE

[ Then copy here a suitably pared down version of the rationale for these
changes, appropriate to assessment of the changes specifically for the
CTT for ISO 14651.]

==================================================================