L2/15-202

Title: Addressing SignWriting Collation in DUCET -- Rejoinder to L2/15-194
Author: Ken Whistler
Date: July 23, 2015
Status: For Consideration by the UTC

Background

The discussion by Stephen Slevinski in L2/15-194 (in response to my L2/15-184) makes it clear that the outcome desired by the SignWriting community for collation of the signs does not involve treating the fills and rotations as secondary or tertiary weights. Instead, the desired outcome for sorting is to treat each sign (together with its fill and rotation) *as if* it had been encoded atomically, rather than as a sequence -- and then each atomic symbol had been given a primary collation weight.

That observation should take any approach to DUCET changes involving secondary or tertiary weight assignments off the table.

However, the proposed approach of simply giving fills and rotations primary weights, with rotations having lesser weights than the fills, still has problems, for the reasons I outlined in L2/15-184 regarding the variable weight length of the sequences involving fills and rotations. Essentially, the problem is still analogous to the issue for Hangul syllables, because of the interaction of the sequences of weights.

Because, as it turns out, the short example I showed in L2/15-184 (incorrectly) assumed that the desired outcome would follow from secondary/tertiary weight treatments for fills and rotations, I have constructed a more extended example here to illustrate the problem resulting from the "syllable edge registration" issue for the weights.

===================================================

Extended Example

Here is a somewhat more extended example to consider, using the same conventions as the example in L2/15-184, but appending an arbitrary additional non-SignWriting character after either the first or second sign.
For this character, I use a stand-in 'a', again with an arbitrary primary weight, but this time higher than that for any of the SignWriting signs or fill or rotation modifiers. This example also presumes that the fill and rotation modifiers have already been given primary weights as requested.

First the set of example strings:

01. HFI             100
02. HFI R2          100 410
03. HFI F2          100 420
04. HFI F2 R2       100 420 410
05. HFI HFI         100 100
06. HFI R2 HFI      100 410 100
07. HFI F2 HFI      100 420 100
08. HFI 'a'         100 630
09. HFI R2 'a'      100 410 630
10. HFI F2 'a'      100 420 630
11. HFI F2 R2 'a'   100 420 410 630
12. HFI HFI 'a'     100 100 630
13. HFI R2 HFI 'a'  100 410 100 630
14. HFI F2 HFI 'a'  100 420 100 630
15. HFI 'a' HFI     100 630 100
16. HFI R2 'a' HFI  100 410 630 100
19. HFI F2 'a' HFI  100 420 630 100

===================================================

Next the order which results from using the assigned collation weights for the strings. For the equivalent short forms shown after the "==>" arrow, I use abbreviations for each of the 4 relevant fill/rotate forms of HFI, labeled '1' through '4' in the expected order for those:

HFI       = HFIf1r1 = '1'
HFI R2    = HFIf1r2 = '2'
HFI F2    = HFIf2r1 = '3'
HFI F2 R2 = HFIf2r2 = '4'

01. HFI             ==> 1    100
05. HFI HFI         ==> 11   100 100
12. HFI HFI 'a'     ==> 11a  100 100 630
02. HFI R2          ==> 2    100 410
06. HFI R2 HFI      ==> 21   100 410 100
13. HFI R2 HFI 'a'  ==> 21a  100 410 100 630
09. HFI R2 'a'      ==> 2a   100 410 630
16. HFI R2 'a' HFI  ==> 2a1  100 410 630 100
03. HFI F2          ==> 3    100 420
07. HFI F2 HFI      ==> 31   100 420 100
14. HFI F2 HFI 'a'  ==> 31a  100 420 100 630
04. HFI F2 R2       ==> 4    100 420 410
11. HFI F2 R2 'a'   ==> 4a   100 420 410 630
10. HFI F2 'a'      ==> 3a   100 420 630
19. HFI F2 'a' HFI  ==> 3a1  100 420 630 100
08. HFI 'a'         ==> 1a   100 630
15. HFI 'a' HFI     ==> 1a1  100 630 100

The problem here can now be seen: the high collation weight for the intervening non-SignWriting symbol here interferes with the interpretation of the weight sequences for the fills and rotations.
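The ordering listed above can be reproduced mechanically. What follows is only a minimal sketch in Python, assuming that comparing lists of the example's primary weights element by element is an adequate stand-in for a primary-level UCA comparison; the names EXAMPLES, WEIGHTS and naive_key are hypothetical and not part of any collation implementation.

```python
# Sketch only: one primary weight per character, compared lexicographically.
# The weights are the arbitrary example values used in this document.
WEIGHTS = {"HFI": 100, "R2": 410, "F2": 420, "a": 630}

# The example strings, one per line: label followed by the characters.
EXAMPLES = """
01 HFI
02 HFI R2
03 HFI F2
04 HFI F2 R2
05 HFI HFI
06 HFI R2 HFI
07 HFI F2 HFI
08 HFI a
09 HFI R2 a
10 HFI F2 a
11 HFI F2 R2 a
12 HFI HFI a
13 HFI R2 HFI a
14 HFI F2 HFI a
15 HFI a HFI
16 HFI R2 a HFI
19 HFI F2 a HFI
"""

STRINGS = {}
for line in EXAMPLES.strip().splitlines():
    label, *tokens = line.split()
    STRINGS[label] = tokens


def naive_key(tokens):
    """Return the sequence of primary weights, one per character."""
    return [WEIGHTS[t] for t in tokens]


# Python compares lists element by element, with a proper prefix sorting first.
order = sorted(STRINGS, key=lambda label: naive_key(STRINGS[label]))
print(" ".join(order))
# prints: 01 05 12 02 06 13 09 16 03 07 14 04 11 10 19 08 15
```

Strings 08 and 15, in which 'a' directly follows the unmodified sign, land at the very end, after every string that starts with a modified sign.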
As a result, the string order will cycle around for the initial sign, depending on the position of the following character:

1 < 11 < ... < 2 < 21 < ... < 31a < 4 < 4a < 3a < 3a1 < 1a < 1a1

That clearly is *not* the expected result here.

===================================================

Discussion

As the extended example shows, once characters other than the signs in the base+fill+rotation sets are introduced into the strings, the ordering of the strings breaks down. This inevitably follows from the variable length of the weightings for these sequences, once they start interacting with other characters.

Given the current encoding of SignWriting, there are basically two remaining approaches to "fix" collation for the signs that use the fill and rotation modifiers. Both of these approaches were briefly mentioned in L2/15-184, but I will elaborate a bit further here.

Approach #1: Contractions

Contraction tables could be generated, which would map all possible sequences of BASE or BASE + FILL or BASE + ROTATION or BASE + FILL + ROTATION into primary weights in the correct order.

The problem here is the size of the required contraction table. As L2/15-194 notes, SignWriting uses 37,811 glyphs. A good fraction of those are required for all the possible fill and rotation combinations, because each base hand sign (or other pertinent base) can occur in up to 96 configurations. In principle, then, the required contraction table needs tens of thousands of entries.

That approach is a non-starter for DUCET, because of the overhead it imposes on the basic table and all implementations of the default. Such an approach would work as a tailoring for SignWriting, but with such a large contraction table, it would still be unwieldy.
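To make the contraction idea concrete, here is a minimal sketch in Python applied to the example strings. It is an illustration under stated assumptions, not a real tailoring: the table CONTRACTIONS, its weights 101 through 104, and the function contraction_key are all hypothetical, and longest-match lookup over a small dictionary stands in for the contraction handling of an actual UCA implementation.

```python
# Sketch only: each BASE (+ FILL) (+ ROTATION) sequence contracts to ONE
# primary weight. The weights 101..104 are hypothetical illustration values.
CONTRACTIONS = {
    ("HFI",): 101,             # HFIf1r1, short form '1'
    ("HFI", "R2"): 102,        # HFIf1r2, short form '2'
    ("HFI", "F2"): 103,        # HFIf2r1, short form '3'
    ("HFI", "F2", "R2"): 104,  # HFIf2r2, short form '4'
    ("a",): 630,               # the non-SignWriting stand-in character
}
MAX_LEN = max(len(seq) for seq in CONTRACTIONS)

EXAMPLES = """
01 HFI
02 HFI R2
03 HFI F2
04 HFI F2 R2
05 HFI HFI
06 HFI R2 HFI
07 HFI F2 HFI
08 HFI a
09 HFI R2 a
10 HFI F2 a
11 HFI F2 R2 a
12 HFI HFI a
13 HFI R2 HFI a
14 HFI F2 HFI a
15 HFI a HFI
16 HFI R2 a HFI
19 HFI F2 a HFI
"""
STRINGS = {}
for line in EXAMPLES.strip().splitlines():
    label, *tokens = line.split()
    STRINGS[label] = tokens


def contraction_key(tokens):
    """Build a weight list by always taking the longest matching contraction."""
    key, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            seq = tuple(tokens[i:i + n])
            if seq in CONTRACTIONS:
                key.append(CONTRACTIONS[seq])
                i += n
                break
        else:
            raise KeyError(tokens[i])  # no table entry covers this character
    return key


order = sorted(STRINGS, key=lambda label: contraction_key(STRINGS[label]))
print(" ".join(order))
# prints: 01 05 12 08 15 02 06 13 09 16 03 07 14 10 19 04 11
```

With every sign reduced to a single weight, all strings starting with '1' sort before all strings starting with '2', and so on, regardless of what follows. The sketch needs four entries for one base; a table covering every base in every configuration is what runs to tens of thousands of entries.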

Approach #2: Pre-processing of "syllables"

A second approach is to do context-sensitive pre-processing of all SignWriting strings to be weighted, "normalizing" the representation of the signs involving fills and rotations into forms that *can* be compared without the length conundrums.

One subtype of this approach was already illustrated in L2/15-194: the "inherent" fill-1 and rotation-1 values are detected and turned into explicit separate weights. That is the equivalent of rewriting jamo sequences for Hangul syllables to insert fillers, so that every syllable ends up written in exactly three characters: Ci + V + Cf. With appropriately chosen weights for the fills and rotations, including the two weights for the fill-1 and rotation-1 values, this can basically solve the problem.

Alternatively, the strings can be pre-processed to insert special terminators for each "syllable" -- i.e., for SignWriting, for each detected instance of BASE or BASE + FILL or BASE + ROTATION or BASE + FILL + ROTATION, depending. With an appropriately assigned weight for the inserted terminator, this can also solve the problem.

The issue here, of course, is that pre-processing doesn't come for free. Data could be stored in a database in a pre-processed form, to simplify certain operations, but the preprocessed form wouldn't be the same as the interchange form for text. And outside of tightly controlled contexts, there would be little expectation that any such pre-processing would be applied systematically where sorting or searching might use general routines.

Recommendation

First, I stand by my conclusion in L2/15-184 that the problem here is not amenable to a simple fix in DUCET. So I do not think that changes should be made to the current values for SignWriting symbols in DUCET for UCA 9.0.
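For reference, the two pre-processing variants described under Approach #2 can be sketched in Python as follows. This is a sketch under stated assumptions only: the weights given to the inherent F1 and R1 values (415 and 405), the terminator weight (50), and all function names are hypothetical choices made for illustration. The only constraints relied on are that F1 sorts before F2, that R1 sorts before R2, and that the terminator sorts below every fill and rotation weight.

```python
# Sketch only. Weights for F1, R1 and TERMINATOR are hypothetical values.
WEIGHTS = {"HFI": 100, "R1": 405, "R2": 410, "F1": 415, "F2": 420, "a": 630}
TERMINATOR = 50  # must be lower than every fill and rotation weight
BASES, FILLS, ROTATIONS = {"HFI"}, {"F2"}, {"R2"}

EXAMPLES = """
01 HFI
02 HFI R2
03 HFI F2
04 HFI F2 R2
05 HFI HFI
06 HFI R2 HFI
07 HFI F2 HFI
08 HFI a
09 HFI R2 a
10 HFI F2 a
11 HFI F2 R2 a
12 HFI HFI a
13 HFI R2 HFI a
14 HFI F2 HFI a
15 HFI a HFI
16 HFI R2 a HFI
19 HFI F2 a HFI
"""
STRINGS = {}
for line in EXAMPLES.strip().splitlines():
    label, *tokens = line.split()
    STRINGS[label] = tokens


def split_signs(tokens):
    """Group tokens into (base, fill, rotation) tuples, making the inherent
    fill-1 and rotation-1 explicit; other tokens pass through unchanged."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        i += 1
        if tok not in BASES:
            out.append(tok)
            continue
        fill, rot = "F1", "R1"
        if i < len(tokens) and tokens[i] in FILLS:
            fill = tokens[i]
            i += 1
        if i < len(tokens) and tokens[i] in ROTATIONS:
            rot = tokens[i]
            i += 1
        out.append((tok, fill, rot))
    return out


def filler_key(tokens):
    """Variant 1: every sign always contributes exactly three weights."""
    key = []
    for item in split_signs(tokens):
        if isinstance(item, tuple):
            key.extend(WEIGHTS[part] for part in item)
        else:
            key.append(WEIGHTS[item])
    return key


def terminator_key(tokens):
    """Variant 2: signs keep their variable length but end in a terminator."""
    key = []
    for item in split_signs(tokens):
        if isinstance(item, tuple):
            base, fill, rot = item
            key.append(WEIGHTS[base])
            if fill != "F1":
                key.append(WEIGHTS[fill])
            if rot != "R1":
                key.append(WEIGHTS[rot])
            key.append(TERMINATOR)
        else:
            key.append(WEIGHTS[item])
    return key


by_filler = sorted(STRINGS, key=lambda k: filler_key(STRINGS[k]))
by_terminator = sorted(STRINGS, key=lambda k: terminator_key(STRINGS[k]))
print(" ".join(by_filler))
# prints: 01 05 12 08 15 02 06 13 09 16 03 07 14 10 19 04 11
print(by_filler == by_terminator)
# prints: True
```

Both variants give the order that would result if each sign were atomic, but only because every string passes through split_signs first, which is exactly the pre-processing cost discussed under Approach #2.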

However, I agree with the discussion in L2/15-194 that the basic intractability of the default collation problem for SignWriting ultimately stems from the original decision to opt for somewhat more compact text representation by making fill-1 and rotation-1 be inherent values, thus resulting in variable length representations for the base+fill+rotation sets.

One possible response, rather than attempting a hack at the current DUCET values (which doesn't work anyway -- see above), would be to go back to the drawing board for the encoding of SignWriting: explicitly add the encoding of FILL MODIFIER-1 and ROTATION MODIFIER-1 and change the text model to *require* an explicit fill modifier and rotation modifier in all cases.

That would make the collation (and searching) issue more tractable. But it would have serious downsides as well. Any implementation would need to define the fallback representation of sequences that were missing either a fill modifier or a rotation modifier (or both). If the fallbacks end up looking the same for HFI and HFI+F1 and HFI+R1 as for the "canonical" HFI+F1+R1 sequence, for example, then you have introduced a multiple representation problem. But if the fallbacks don't look like that, then other possible confusion can be introduced and/or you may end up with more complications for input methods and editing.

However, before jumping off in that direction, I want to point to another storm cloud on the horizon. In feedback on L2/15-184 not included in L2/15-194, Stephen Slevinski noted:

"The facial diacritic section has never been tested or supported. The only working font ignores the facial diacritic properties."

If the implication here is that all of the face symbols used in SignWriting *also* have to be treated *as if* they were atomically encoded, rather than as sequences of the base face sign U+1D9FF plus various diacritic modifiers as combining marks (presumably in combinations of variable length), then the collation problem for those combinations is *also* intractable, and would require either more entries in a large contraction table or extensive pre-processing of strings for comparison.

There are more observations about SignWriting implementation that follow from that, but I'll stop here -- as the basic point of this document is that there is no demonstrated simple fix to DUCET for SignWriting that would suffice.