L2/15-184

Title: Addressing SignWriting Collation in DUCET
Author: Ken Whistler
Date: July 21, 2015
Status: For Consideration by the UTC

Background

In AI 143-A39 I was asked to prepare a proposed change for SignWriting in DUCET for UCA 9.0 which addresses feedback from Steve Slevinski in L2/15-144.

To make the context clearer, I quote here the relevant section of feedback from L2/15-144:

======================================================================
I am concerned that the sorting of SignWriting symbols has not been properly addressed in the document http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4342.pdf . I believe a few additions to DUCET will solve the issue.

Specifically, here is a list of four symbols:

1) U+1D800
2) U+1D800 U+1DAA1
3) U+1D800 U+1DA9B
4) U+1D800 U+1DA9B U+1DAA1

The symbols in the above list are in the correct sort order; however, a binary string compare will incorrectly sort the symbols as 1, 3, 4, 2. I believe the sorting issue could be resolved by additions to the DUCET so that the Rotation modifiers (U+1DAA1 - U+1DAAF) are sorted before the Fill modifiers (U+1DA9B - U+1DA9F).
======================================================================

Analysis

As a reminder, the encoding model for SignWriting dealt with the multiple rotations and fills of various hand signs (and some others) by taking the base symbol (in this example U+1D800 SIGNWRITING HAND-FIST INDEX) as having the inherent fill 1 value (white) and the inherent rotation 1 value (0 degrees, upright). Then there are 5 fill modifiers encoded as combining marks to account for the other fills 2 through 6, and 15 rotation modifiers encoded as combining marks to account for the other rotations 2 through 16.

The expected sorting order for *individual* SignWriting signs, in all fills and all rotations, is illustrated in the proposal document, L2/12-321 (= WG2 N4342) in Figure 3. The order starts at upper left and then reads off the table in column first order.
Basically, the list takes each sign in fill 1 through the 16 rotations, then in fill 2 through the 16 rotations, ... to fill 6 through the 16 rotations.

If SignWriting had been encoded atomically, with all 96 combinations of fill and rotation for each base hand sign encoded as a single character, then we would be done for collation. The set of 96 signs would be weighted in binary order and essentially the expectations for string ordering would be met.

However, SignWriting was not encoded atomically (for good reasons), and the problem of exactly how to weight the fill and rotation modifying combining marks, which occur in sequences, is still causing issues -- as demonstrated by the feedback in L2/15-144 quoted above.

The feedback points out that a *binary* sort of the cited 4 strings will end up not in expected order. But a binary sort of *any* Unicode data seldom produces expected order for any script. So that was never in the cards.

However, in this particular case, the suggested solution for the DUCET, which basically recommends just giving the rotation modifier combining marks primary weights in DUCET lower than the fill modifier combining marks, doesn't solve the problem, either.

The essential problem here is that string ordering requires considering more than one sign at a time, but the simple example cited above is considering only a single sign, in an effort to get a simple string sort to replicate Figure 3 from L2/12-321.

Let me abstract the example a bit and add some weight values. First, some abbreviations:

1D800 SIGNWRITING HAND-FIST INDEX (HFI)
1DAA1 SIGNWRITING ROTATION MODIFIER-2 (R2)
1DA9B SIGNWRITING FILL MODIFIER-2 (F2)

Now giving them some arbitrary weights, per the recommendation above:

HFI = 100
Next the rotation modifiers, say 410.., so R2 = 410
Next the fill modifiers, say 420.., so F2 = 420

Then weight the 4 strings:

1. HFI          100
2. HFI R2       100 410
3. HFI F2       100 420
4. HFI F2 R2    100 420 410

This works, right?
Considering the weights, these strings order #1, #2, #3, #4, as required.

Well, no, it doesn't. Let's add another HFI base character *after* the end of each of the first 3 strings.

1. HFI          100
2. HFI R2       100 410
3. HFI F2       100 420
4. HFI F2 R2    100 420 410
5. HFI HFI      100 100
6. HFI R2 HFI   100 410 100
7. HFI F2 HFI   100 420 100

The expected order of the full strings is as shown: #1, #2, #3, #4, #5, #6, #7, but because the weight for the base (HFI) is lower than either the weight for the fill *or* the rotation modifiers, the actual order that would be calculated based on the weights is: #1, #5, #2, #6, #3, #7, #4, which is completely out of order.

This behavior results from two interacting aspects of the system. First, because fill 1 and rotation 1 are considered to be inherent to the signs, they aren't written with separate combining marks. That means that we need to compare strings where the significant signs for comparison could be written with either 1, 2, or 3 characters, as seen above.

Second, the way the fills and rotations are conceived, we are dealing with a requirement for a multi-level weighting, rather than simply trying to find primary weights in the right relative order for the fills and rotations. Essentially, a hand sign in all its fills and rotations should constitute one primary weight, and then the next level (each column in Figure 3) constitutes secondary weights for fills, and then the next level (each row in each column in Figure 3) constitutes the tertiary weights for the rotations.

Neither of those aspects of the system is adequately addressed by the proposed solution.

Current (UCA 8.0) State of DUCET for SignWriting

First let's look at the current state of DUCET for SignWriting.
Because it was clear that assigning primary weights for the fill modifier combining marks and the rotation modifier combining marks wouldn't really "solve the problem" for SignWriting, whichever order they were assigned, the DUCET took another tack for the default weights for SignWriting.

All of the base symbols (gc=So) for SignWriting were given default primary weights, in code chart order. All of the SignWriting combining marks, including the fill and rotation modifiers, were then given fully ignorable 0000.0000.0000 weights.

This approach gives a basic default ordering for the signs that follows the code chart order. It is a reasonable default for collation based searching, while ignoring fills and rotations.

Keep in mind that the DUCET default order is basically designed and applied in cases where the particular, detailed ordering for a particular script or notation is not in question -- it is the *default* ordering for all the *rest* of the Unicode characters which are not the focus of a particular tailoring. Keep in mind also that as symbols (gc=So), SignWriting signs would be largely ignored in most default collation anyway, which focuses instead on particular alphabetic orderings.

Alternative Approaches for SignWriting

The situation for ordering the hand signs plus fill and rotation modifiers in SignWriting is reminiscent of the collation problem for Hangul syllables in some regards. The problem is that the significant units for the collation are encoded in different string lengths. For an ordering requirement that specifies that all the "syllables" have to be correctly lined up and then compared to each other, regardless of their encoding length, some special techniques need to be applied.

If people want to pursue the comparison in more detail, see Section 7.1.5, Hangul Collation, in UTS #10. That section outlines 4 separate methods for weighting Hangul syllables and jamo sequences to get the correct collation results.
Such methods could be adapted to SignWriting, as well.

In brief, the example cited above, using abbreviations, can be recapitulated as:

1. HFI
2. HFI R2
3. HFI F2
4. HFI F2 R2

But effectively, what is needed is to treat each of those strings as a *syllable* of fixed length. Filling out to:

1. {HFI F1 R1}
2. {HFI F1 R2}
3. {HFI F2 R1}
4. {HFI F2 R2}

This could be accomplished by various means adapted from the Hangul approaches, including, for example, insertion of a syllable terminator and weighting it appropriately.

The problem here is that the hand signs cannot be considered in isolation -- the structure of SignWriting is not as uniform as Hangul syllables, and not all of the signs take the fill modifiers and the rotation modifiers in the same way. Furthermore, there are *other* modifying combining marks that would also need to be taken into account. The issue is not as straightforward as making an isolated hand sign plus its combinations fill out Figure 3 in the correct order.

Another approach that might make more sense, given the current encoding of SignWriting, is to fully embrace the multi-level collation weighting, instead. This would give all of the fill modifiers *secondary* collation weights, and then all of the rotation modifiers *tertiary* collation weights. Doing this, however, would also require taking account of the *other* 97 modifying combining marks in SignWriting, besides the 5 fills and 15 rotations.

A third approach, which would address *only* the hand signs plus fill and rotation modifiers, would be to construct a massive table of all possible contraction combinations. This would, for collation, effectively undo the sequential encoding and treat each possible combination *as if* it were encoded as an atomic character. Such a table would require 96 entries for each base hand sign subject to fills and rotations.
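
As a rough illustration only, the seven example strings from the Analysis can be run through a simplified three-level key builder to compare the primary-weight suggestion, the current fully ignorable treatment, and the multi-level idea. Everything in this sketch is an assumption made for illustration: the table names, the helper functions, the secondary and tertiary values (0x20, 0x21, 0x02, 0x03), and the key construction, which only approximates a UCA sort key and covers just HFI, F2, and R2. It says nothing about the other modifying combining marks.

```python
# Illustrative sketch: abbreviations follow the text (HFI, R2, F2).
# All weight values are hypothetical and are NOT taken from DUCET.

STRINGS = {
    1: ["HFI"],
    2: ["HFI", "R2"],
    3: ["HFI", "F2"],
    4: ["HFI", "F2", "R2"],
    5: ["HFI", "HFI"],
    6: ["HFI", "R2", "HFI"],
    7: ["HFI", "F2", "HFI"],
}

# Each entry is (primary, secondary, tertiary); 0 means ignorable at that level.

# Suggestion from L2/15-144: rotations get lower primary weights than fills.
PRIMARY_ONLY = {
    "HFI": (100, 0x20, 0x02),
    "R2": (410, 0x20, 0x02),
    "F2": (420, 0x20, 0x02),
}

# UCA 8.0 approach described above: modifiers are fully ignorable.
IGNORABLE_MARKS = {
    "HFI": (100, 0x20, 0x02),
    "R2": (0, 0, 0),
    "F2": (0, 0, 0),
}

# Multi-level idea: fills differ at the secondary level, rotations at tertiary.
MULTI_LEVEL = {
    "HFI": (100, 0x20, 0x02),
    "F2": (0, 0x21, 0x02),
    "R2": (0, 0, 0x03),
}


def sort_key(string, table):
    """Collect non-zero primaries, then secondaries, then tertiaries."""
    elements = [table[ch] for ch in string]
    return tuple(
        tuple(ce[level] for ce in elements if ce[level] != 0)
        for level in range(3)
    )


def sorted_ids(table):
    """Return the string numbers in the order the given table sorts them."""
    return sorted(STRINGS, key=lambda n: sort_key(STRINGS[n], table))


print("primary only:", sorted_ids(PRIMARY_ONLY))
print("multi-level: ", sorted_ids(MULTI_LEVEL))
```

With these made-up weights, the primary-only table reproduces the out-of-order result #1, #5, #2, #6, #3, #7, #4, while the multi-level table yields #1 through #7 in sequence. Under the fully ignorable table, strings #1 through #4 all receive the same key, which is the "ignoring fills and rotations" behavior of the current default.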
Recommendation

Given this analysis, my current recommendation is that the DUCET entries for SignWriting be left exactly as they are for UCA 9.0. I do not think the problem is amenable to a simple fix in DUCET, nor do I think that the requirement that SignWriting collation work by *default* meets the design criteria for DUCET.

Some further considerations:

1. Approaches which treat the hand sign + fill + rotation as "syllables" and attempt to weight accordingly are just outside the engineering scope of DUCET. Such approaches -- as for Hangul -- simply need to be dealt with as tailorings of UCA.

2. Trying to bake in a multi-level weighting by giving a large number of secondary and tertiary weights for all the combining marks of SignWriting would cause problems for the maintenance software that assigns default secondary and tertiary weights to *all* Unicode characters. In particular, assigning new tertiary weights breaks a number of assumptions regarding the relationship between particular default weights and case status and decomposition status of Unicode characters. Any collation requirement for a script or notation which mandates adding many new secondary weights and/or any new tertiary weights should *always* be done by tailoring, rather than by trying to make it happen in DUCET.

3. Trying to solve the problem for SignWriting by adding a large table of contractions for all the hand sign + fill + rotation combinations is also a very bad idea for DUCET. Such contraction tables add significant overhead for *all* implementations, including default implementations that for the most part will neither care about nor ever deal with SignWriting characters in particular. Again, any solution which requires addition of large tables of contractions should *always* be done by tailoring, unless there is no feasible alternative.

Given that it seems incorrect to apply any of these approaches to DUCET, the correct alternative would seem to be to work at a solution involving tailoring.
Some cautions are in order there, however, as well. It is not clear that sufficient analysis has been done of the overall requirement for string collation for SignWriting data as yet. All of the discussion has focused entirely on the limited issue, which amounts to the chart order of hand signs plus fills and rotations. No clear evidence has been presented for lexical order of longer strings of SignWriting data. Nor is it even clear that SignWriting material would be lexically ordered in ways that depend very much on a UCA-style multi-level string ordering algorithm.

A more detailed specification of how actual strings of data would be expected to behave in a context that requires some string collation -- such as indexed data fields in a database -- should first be presented before rushing in to start designing a CLDR tailoring for SignWriting or some other specialized tailoring.