L2/12-230

Title: Suggested Changes to DUCET for UCA 6.2.1
Authors: Ken Whistler, Mark Davis, Markus Scherer
Date: July 17, 2012
Action: For consideration by the UTC

Summary

In offline discussion and analysis of implementations of the UCA, we have determined that there are a number of small changes to the way the DUCET table for UCA is generated, which could significantly improve the handling of secondary weights.

We first present the proposed changes that we would like the UTC to authorize for the *next* revision of UCA after UCA 6.2. Then we discuss the rationale and implications for each.

Finally, at the bottom of the document, we provide a draft of the document that we would ask to be submitted to the OWG-SORT, if the UTC authorizes these changes to DUCET for the UCA, so that the corresponding changes for the CTT for ISO 14651 could be tabled for consideration.

Proposal

Make the following changes to secondary weights in DUCET for UCA 6.2.1:

1. Remove all script-specific secondaries for digits in DUCET. Just weight all the digits with 0020 for the secondary. ("" for 14651).

2. Remove all the gaps in the secondaries range which were created originally in DUCET to correspond to combined accent symbols for the CTT for ISO 14651.

3. Move a targeted few high-use combining marks in DUCET so they end up with lower secondary weights. The recommended list (which might be extended in discussion) is:

   3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
   309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
   0335 COMBINING SHORT STROKE OVERLAY

U+3099 and U+309A are associated with secondary weights which are quite common in Japanese data. (Note that if these two characters have their secondary weights reassigned in DUCET, the compatibility characters U+FF9E and U+FF9F would also automatically have their weights adjusted.) U+0335 is associated with a secondary weight which is also fairly common in samples of the web.
It is seen in common letters for Polish and Icelandic language material.

Although unrelated to secondary weights in DUCET, we also suggest the following change:

4. Remove 4th level weights from the DUCET listing.

Rationale and Analysis

Regarding item 1, script secondaries were originally added to DUCET (and to the CTT for ISO 14651) on the theory that different sets of digits were notionally like diacritic-marked variants of letters, so it made sense to give each distinct set of script-specific digits a separate (synthesized) secondary weight to distinguish them in collation weighting. However, as Unicode has grown and added more and more sets of script-specific digits, this weight assignment protocol has, by necessity, led to use of more and more secondary weights in the table. Expanding the use of secondary weights has a fairly significant implication for the UCA, as optimizations of UCA often try to pack down secondary weights into minimal sets of values, to minimize generated key size, as well as collation table size.

As it turns out, there is also no really good reason for maintaining secondary weights for sets of digits, anyway. Sets of digits from different scripts are rarely mixed in real data, at least not in ways that matter for *sorting* data. And even when more than one set of digits occurs in data sets which need to be sorted, they would have to be mixed *within* number strings for the secondary weights to start making a difference for sorting outcome. That situation almost never occurs in non-artificial data.

As a result, we think it would be quite feasible to simply remove *all* of the secondary weights now used in the DUCET for different sets of digits, and just replace them all with the lowest secondary weight, 0020. This would work fine as a default for UCA. Any highly specialized implementation could still distinguish sets of digits with a secondary weight by simply tailoring the particular set(s) of digits with which they are concerned.
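The effect of item 1 on sort order can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the weight values are invented (they are not real DUCET values), the tables `CURRENT` and `PROPOSED` and the helper `sort_key` are hypothetical names, and the key is reduced to two levels (primary, then secondary) with the tertiary level omitted for brevity.

```python
# Illustrative weights only; these are NOT real DUCET values.
# Each digit maps to (primary, secondary). Digits with the same numeric
# value share a primary weight; the script-specific secondary is what
# item 1 proposes to drop.
COMMON_SECONDARY = 0x0020

CURRENT = {
    "1": (0x1001, 0x0020),
    "2": (0x1002, 0x0020),
    "\u0661": (0x1001, 0x0101),  # ARABIC-INDIC DIGIT ONE
    "\u0662": (0x1002, 0x0101),  # ARABIC-INDIC DIGIT TWO
}

# Proposed table: same primaries, every digit gets secondary 0020.
PROPOSED = {ch: (p, COMMON_SECONDARY) for ch, (p, _) in CURRENT.items()}


def sort_key(text, table):
    """Build a simplified two-level key: all primaries, then all secondaries."""
    weights = [table[ch] for ch in text]
    primaries = tuple(p for p, _ in weights)
    secondaries = tuple(s for _, s in weights)
    return (primaries, secondaries)


words = ["2", "\u0661", "1\u0662", "12", "1"]
current_order = sorted(words, key=lambda w: sort_key(w, CURRENT))
proposed_order = sorted(words, key=lambda w: sort_key(w, PROPOSED))

# The sequence of primary keys is identical under both tables; only the
# relative order of strings whose primaries tie can change.
current_primaries = [sort_key(w, CURRENT)[0] for w in current_order]
proposed_primaries = [sort_key(w, PROPOSED)[0] for w in proposed_order]
```

Under these invented weights, the two tables produce the same sequence of primary keys. The only strings whose relative order can change are those that are identical at the primary level, such as "12" written in two different digit sets, which is the case the text above describes as rare in real data.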

Regarding item 2, there are approximately 62 gaps currently in the secondary ranges for DUCET. These gaps were originally placed in the sequences to account for combined accent symbols used in the generation of the CTT for ISO 14651. In later revisions of the UCA and ISO 14651, the way sequences of accents were handled for generation of default weights changed. And currently all of those combining symbols in the CTT are simply commented out. As a result, the gaps serve no current function. They could simply be removed from the generation of DUCET. That would pack down all the higher secondaries by a delta value of up to 62, allowing many more of them to have default values which can be packed more tightly in constructed keys. Removal of the gaps would also simplify preprocessing of the allkeys.txt table for certain implementations.

Regarding item 3, after the initial sequence of secondary weights for some high-frequency combining marks, most of the rest of the secondary weights in DUCET are initialized in an order that is roughly defaulted to code point order, with some modifications to keep combining marks from similar scripts together. One of the bad consequences of this choice is that a few high-frequency combining marks that occur extensively in data end up with fairly high values for secondary weights. Optimizations of UCA would work better if these high-frequency secondary weights were lower in numeric value. The obvious candidates to move to a lower position are U+3099 and U+309A for Hiragana and Katakana and U+0335 for Polish and Icelandic. But there may be a few others which are high enough frequency to justify also moving them lower in the sequence.

Regarding item 4, the 4th weights in the DUCET tables are a holdover from earlier design goals.
The original purpose was to provide a last-resort tie-breaker, but that is currently done algorithmically by Step S3.10 in the UCA; the UCA itself does not actually make use of the 4th weight from the DUCET table. Moreover, the 4th level weights in allkeys.txt are confusing because they have nothing to do with the 4th level weights constructed for variable weighting. And as it turns out, there are also quirks in the assignment of 4th level weights in DUCET, which mean that those weights are not well-formed according to the definitions in UCA. Because of these issues and others, it turns out that implementations of UCA typically do not make any use of the 4th level weights that are listed in DUCET. The simplest solution here is to just not include any 4th level weights in DUCET.

Note that we are requesting that these changes for DUCET be introduced for UCA 6.2.1, and *not* as part of the UCA 6.2 update which is currently in beta review. This would give plenty of time to make the necessary adjustments to the sifter utility which generates DUCET and plenty of time for various implementers to review and test the resulting updates to DUCET. The 6.2.1 timeframe would be perfect for this, because it would be a release without repertoire additions. Because of that, people could be evaluating just the changes to the secondary weights (and omission of 4th level weights from the table), without the confounding complications of adapting to new repertoire additions to the weight table.

Corresponding Proposal for the CTT for ISO 14651

If the UTC decides to make these suggested changes to DUCET for UCA 6.2.1, we would ask that a filled-out version of the following document also be submitted to the OWG-SORT for the October meeting in Thailand. This would start the process of getting corresponding changes approved for an amendment to ISO 14651.

Note that Item #4 is a no-op for ISO 14651.
The 4th level weights can be dropped from DUCET independently of any handling of 4th level weight symbols in the CTT for ISO 14651.

===============================================================

[ header gorp ]

The UTC suggests that the following changes be made to the CTT for ISO 14651, in a future amendment to that standard.

1. Delete all secondary weight symbols which are only used to distinguish different sets of script-specific digits. This would consist of the 57 collating-symbol entries in the range:

   collating-symbol ... collating-symbol

   In the main body of the CTT, replace each of those secondary weight symbols simply with the general, lowest secondary weight: "".

2. Delete all *commented* lines in the range of collating-symbol definitions for secondary weight symbols. This consists of 62 lines such as:

   % ""

   These commented lines serve no current function in the CTT, as the weighting of such symbol sequences is handled simply by concatenation of the single symbols in the appropriate secondary portion of the weight definitions for characters which have accent or other diacritic sequences associated with them.

3. Move the following 3 [or number TBD] collating-symbol definitions:

   collating-symbol % COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
   collating-symbol % COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
   collating-symbol % COMBINING SHORT STROKE OVERLAY

   from their current position in the list to a position between the following entries:

   collating-symbol % GENERIC MARK AROUND
   collating-symbol % COMBINING OVERLINE

[ Then copy here a suitably pared down version of the rationale for these changes, appropriate to assessment of the changes specifically for the CTT for ISO 14651.]

==================================================================