L2/15-202


Title: Addressing SignWriting Collation in DUCET -- Rejoinder to L2/15-194

Author: Ken Whistler

Date: July 23, 2015

Status: For Consideration by the UTC


Background

The discussion by Stephen Slevinski in L2/15-194 (in response to
my L2/15-184) makes it clear that the outcome desired by
the SignWriting community for collation of the signs does
not involve treating the fills and rotations as secondary
or tertiary weights. Instead, the desired outcome for
sorting is to treat each sign (together with its fill and
rotation) *as if* it had been encoded atomically, rather
than as a sequence -- and then each atomic symbol had been
given a primary collation weight.

That observation should take any approach to DUCET changes
involving secondary or tertiary weight assignments off the
table.

However, the proposed approach of simply giving fills and
rotations primary weights, with rotations having lesser
weights than the fills, still has problems, for the reasons
I outlined in L2/15-184 regarding the variable weight length
of the sequences involving fills and rotations. Essentially,
the problem is still analogous to the issue for Hangul
syllables, because of the interaction of the sequences of
weights.

As it turns out, the short example I showed in L2/15-184
(incorrectly) assumed that the desired outcome would
follow from secondary/tertiary weight treatments for
fills and rotations. I have therefore constructed a more
extended example here to illustrate the problem resulting from
the "syllable edge registration" issue for the weights.

===================================================

Extended Example

Here is a somewhat more extended example to consider, using
the same conventions as the example in L2/15-184, but appending
an arbitrary additional non-SignWriting character after either
the first or second sign. For this character, I use a stand-in
'a', again with an arbitrary primary weight, but this time
higher than that for any of the SignWriting signs or fill or
rotation modifiers. This example also presumes that the fill
and rotation modifiers have already been given primary weights
as requested.

First the set of example strings:

01. HFI
    100

02. HFI R2
    100 410

03. HFI F2
    100 420

04. HFI F2  R2
    100 420 410

05. HFI HFI
    100 100

06. HFI R2  HFI
    100 410 100

07. HFI F2  HFI
    100 420 100

08. HFI 'a'
    100 630

09. HFI R2  'a'
    100 410 630

10. HFI F2  'a'
    100 420 630

11. HFI F2  R2  'a'
    100 420 410 630

12. HFI HFI 'a'
    100 100 630

13. HFI R2  HFI 'a'
    100 410 100 630

14. HFI F2  HFI 'a'
    100 420 100 630

15. HFI 'a' HFI
    100 630 100

16. HFI R2  'a' HFI
    100 410 630 100

19. HFI F2  'a' HFI
    100 420 630 100

===================================================

Next the order which results from using the assigned
collation weights for the strings. For the equivalent
short forms shown after the "==>" arrow, I use 
abbreviations for each of the 4 relevant fill/rotate
forms of HFI, labeled '1' through '4' in the
expected order for those:

HFI =       HFIf1r1 = '1'
HFI R2 =    HFIf1r2 = '2'
HFI F2 =    HFIf2r1 = '3'
HFI F2 R2 = HFIf2r2 = '4'

01. HFI                 ==> 1
    100

05. HFI HFI             ==> 11
    100 100

12. HFI HFI 'a'         ==> 11a
    100 100 630

02. HFI R2              ==> 2
    100 410

06. HFI R2  HFI         ==> 21
    100 410 100

13. HFI R2  HFI 'a'     ==> 21a
    100 410 100 630

09. HFI R2  'a'         ==> 2a
    100 410 630

16. HFI R2  'a' HFI     ==> 2a1
    100 410 630 100

03. HFI F2              ==> 3
    100 420

07. HFI F2  HFI         ==> 31
    100 420 100

14. HFI F2  HFI 'a'     ==> 31a
    100 420 100 630

04. HFI F2  R2          ==> 4
    100 420 410

11. HFI F2  R2  'a'     ==> 4a
    100 420 410 630

10. HFI F2  'a'         ==> 3a
    100 420 630

19. HFI F2  'a' HFI     ==> 3a1
    100 420 630 100

08. HFI 'a'             ==> 1a
    100 630

15. HFI 'a' HFI         ==> 1a1
    100 630 100

The problem here can now be seen: the high collation
weight for the intervening non-SignWriting symbol
here interferes with the interpretation of the
weight sequences for the fills and rotations. As
a result, the string order will cycle around for
the initial sign, depending on the position of the
following character:

1 < 11 < ... < 2 < 21 < ... < 31a < 4 < 4a < 3a < 3a1 < 1a < 1a1

That clearly is *not* the expected result here.
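
The cycling can be reproduced mechanically. The following is a
minimal Python sketch, assuming that primary-level comparison can
be modeled as plain lexicographic comparison of the weight
sequences (Python tuple comparison) and ignoring all other
collation levels. The weights and string labels are the ones used
in the example; the names WEIGHTS, EXAMPLES and primary_key are
illustrative only.

```python
# Primary weights from the example above.
WEIGHTS = {"HFI": 100, "R2": 410, "F2": 420, "a": 630}

# The example strings, keyed by the labels used in the example.
EXAMPLES = {
    "01": "HFI",           "02": "HFI R2",
    "03": "HFI F2",        "04": "HFI F2 R2",
    "05": "HFI HFI",       "06": "HFI R2 HFI",
    "07": "HFI F2 HFI",    "08": "HFI a",
    "09": "HFI R2 a",      "10": "HFI F2 a",
    "11": "HFI F2 R2 a",   "12": "HFI HFI a",
    "13": "HFI R2 HFI a",  "14": "HFI F2 HFI a",
    "15": "HFI a HFI",     "16": "HFI R2 a HFI",
    "19": "HFI F2 a HFI",
}


def primary_key(text):
    """Map a space-separated string to its tuple of primary weights."""
    return tuple(WEIGHTS[token] for token in text.split())


# Tuples compare element by element; a proper prefix sorts first.
order = sorted(EXAMPLES, key=lambda label: primary_key(EXAMPLES[label]))
print(" ".join(order))
# 01 05 12 02 06 13 09 16 03 07 14 04 11 10 19 08 15
```

Strings 08 and 15 (sign '1' followed by 'a') land at the very end,
after every string that starts with sign '2', '3' or '4'.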

===================================================

Discussion

As the extended example shows, once characters other
than the signs in the base+fill+rotation sets are
introduced into the strings, the ordering of the strings
breaks down. This inevitably follows from the variable length
of the weightings for these sequences, once they start
interacting with other characters.

Given the current encoding of SignWriting, there are
basically two remaining approaches to "fix" collation for
the signs that use the fill and rotation modifiers. Both of
these approaches were briefly mentioned in L2/15-184, but I will
elaborate a bit further here.

Approach #1: Contractions

Contraction tables could be generated, which would map all
possible sequences of BASE or BASE + FILL or BASE + ROTATION
or BASE + FILL + ROTATION into primary weights in the correct
order.

The problem here is the size of the required contraction table.
As L2/15-194 notes, SignWriting uses 37,811 glyphs. A good
fraction of those are required for all the possible fill and rotation
combinations, because each base hand sign (or other pertinent base)
can occur in up to 96 configurations. In principle, then, the
required contraction table needs tens of thousands of entries.
That approach is a non-starter for DUCET, because of the overhead
it imposes on the basic table and all implementations of the
default.

Such an approach would work as a tailoring for SignWriting, but
with such a large contraction table, it would still be unwieldy.
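
As a rough illustration of what such a contraction table involves,
here is a Python sketch for a single base sign. It is not a DUCET
or tailoring format. It assumes the 96 configurations split into
6 fills by 16 rotations, that contractions are applied by longest
match, and that the stand-in 'a' gets a weight above every sign
weight. All names and weight values are hypothetical.

```python
from itertools import product

N_FILLS, N_ROTATIONS = 6, 16  # assumed split of the 96 configurations

EXAMPLES = {
    "01": "HFI",           "02": "HFI R2",
    "03": "HFI F2",        "04": "HFI F2 R2",
    "05": "HFI HFI",       "06": "HFI R2 HFI",
    "07": "HFI F2 HFI",    "08": "HFI a",
    "09": "HFI R2 a",      "10": "HFI F2 a",
    "11": "HFI F2 R2 a",   "12": "HFI HFI a",
    "13": "HFI R2 HFI a",  "14": "HFI F2 HFI a",
    "15": "HFI a HFI",     "16": "HFI R2 a HFI",
    "19": "HFI F2 a HFI",
}


def build_contractions(bases, first_weight=1000):
    """Give each BASE (+ FILL) (+ ROTATION) sequence one primary weight."""
    table, weight = {}, first_weight
    for base in bases:
        fills = range(1, N_FILLS + 1)
        rotations = range(1, N_ROTATIONS + 1)
        for fill, rot in product(fills, rotations):
            seq = (base,)
            if fill > 1:          # fill-1 is inherent, so not written
                seq += (f"F{fill}",)
            if rot > 1:           # rotation-1 is inherent as well
                seq += (f"R{rot}",)
            table[seq] = weight
            weight += 1
    return table


def primary_key(text, table, other_weights):
    """Weight a string, preferring the longest matching contraction."""
    tokens, key, i = text.split(), [], 0
    while i < len(tokens):
        for length in (3, 2, 1):
            seq = tuple(tokens[i:i + length])
            if seq in table:
                key.append(table[seq])
                i += len(seq)
                break
        else:
            key.append(other_weights[tokens[i]])
            i += 1
    return tuple(key)


TABLE = build_contractions(["HFI"])
OTHER = {"a": 10_000}  # assumed: higher than any sign weight
order = sorted(EXAMPLES,
               key=lambda k: primary_key(EXAMPLES[k], TABLE, OTHER))
print(len(TABLE))       # 96 entries for this one base
print(" ".join(order))
# 01 05 12 08 15 02 06 13 09 16 03 07 14 10 19 04 11
```

With the contractions in place the initial sign no longer cycles,
but the table grows by up to 96 entries for every base sign.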

Approach #2: Pre-processing of "syllables"

A second approach is to do context-sensitive pre-processing of
all SignWriting strings to be weighted, "normalizing" the
representation of the signs involving fills and rotations into
forms that *can* be compared without the length conundrums.

One subtype of this approach was already illustrated in
L2/15-194: the "inherent" fill-1 and rotation-1 values are
detected and turned into explicit separate weights. That is
the equivalent of rewriting jamo sequences for Hangul syllables
to insert fillers, so that every syllable ends up written in
exactly three characters: Ci + V + Cf. With appropriately
chosen weights for the fills and rotations, including the
two weights for the fill-1 and rotation-1 values, this
can basically solve the problem.
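
A Python sketch of this filler-insertion subtype follows, applied
to the strings of the extended example. It assumes hypothetical
F1 and R1 tokens that exist only in the pre-processed form, with
assumed weights 415 and 405; the other weights come from the
example. The function and variable names are illustrative only.

```python
# F1 and R1 weights are assumptions; the rest are from the example.
WEIGHTS = {"HFI": 100, "R1": 405, "R2": 410,
           "F1": 415, "F2": 420, "a": 630}
BASES, FILLS, ROTATIONS = {"HFI"}, {"F2"}, {"R2"}

EXAMPLES = {
    "01": "HFI",           "02": "HFI R2",
    "03": "HFI F2",        "04": "HFI F2 R2",
    "05": "HFI HFI",       "06": "HFI R2 HFI",
    "07": "HFI F2 HFI",    "08": "HFI a",
    "09": "HFI R2 a",      "10": "HFI F2 a",
    "11": "HFI F2 R2 a",   "12": "HFI HFI a",
    "13": "HFI R2 HFI a",  "14": "HFI F2 HFI a",
    "15": "HFI a HFI",     "16": "HFI R2 a HFI",
    "19": "HFI F2 a HFI",
}


def add_fillers(tokens):
    """Rewrite every sign as exactly BASE + FILL + ROTATION."""
    out, i = [], 0
    while i < len(tokens):
        token = tokens[i]
        out.append(token)
        i += 1
        if token not in BASES:
            continue
        # Copy an explicit fill, or make the inherent fill-1 explicit.
        if i < len(tokens) and tokens[i] in FILLS:
            out.append(tokens[i])
            i += 1
        else:
            out.append("F1")
        # Same for the rotation.
        if i < len(tokens) and tokens[i] in ROTATIONS:
            out.append(tokens[i])
            i += 1
        else:
            out.append("R1")
    return out


def primary_key(text):
    return tuple(WEIGHTS[t] for t in add_fillers(text.split()))


order = sorted(EXAMPLES, key=lambda k: primary_key(EXAMPLES[k]))
print(" ".join(order))
# 01 05 12 08 15 02 06 13 09 16 03 07 14 10 19 04 11
```

Because every sign now contributes exactly three weights, the
weight that follows a sign is always compared against the weight
that follows the other sign, never against a fill or rotation.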

Alternatively, the strings can be pre-processed to insert
special terminators for each "syllable" -- i.e., for SignWriting,
for each detected instance of BASE or BASE + FILL or BASE + ROTATION
or BASE + FILL + ROTATION, depending. With an appropriately assigned
weight for the inserted terminator, this can also solve the problem.
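
The terminator variant can be sketched in Python in the same way.
The END token and its weight of 400 are assumptions: in this toy
model the terminator only needs to weigh less than every fill and
rotation modifier. The remaining weights are those of the extended
example, and the names are illustrative only.

```python
# END and its weight are assumptions; it must sort below R2 and F2.
WEIGHTS = {"HFI": 100, "END": 400, "R2": 410, "F2": 420, "a": 630}
BASES, FILLS, ROTATIONS = {"HFI"}, {"F2"}, {"R2"}

EXAMPLES = {
    "01": "HFI",           "02": "HFI R2",
    "03": "HFI F2",        "04": "HFI F2 R2",
    "05": "HFI HFI",       "06": "HFI R2 HFI",
    "07": "HFI F2 HFI",    "08": "HFI a",
    "09": "HFI R2 a",      "10": "HFI F2 a",
    "11": "HFI F2 R2 a",   "12": "HFI HFI a",
    "13": "HFI R2 HFI a",  "14": "HFI F2 HFI a",
    "15": "HFI a HFI",     "16": "HFI R2 a HFI",
    "19": "HFI F2 a HFI",
}


def add_terminators(tokens):
    """Append END after each BASE (+ FILL) (+ ROTATION) run."""
    out, i = [], 0
    while i < len(tokens):
        token = tokens[i]
        out.append(token)
        i += 1
        if token not in BASES:
            continue
        # Copy the optional fill and rotation that belong to this sign.
        if i < len(tokens) and tokens[i] in FILLS:
            out.append(tokens[i])
            i += 1
        if i < len(tokens) and tokens[i] in ROTATIONS:
            out.append(tokens[i])
            i += 1
        out.append("END")
    return out


def primary_key(text):
    return tuple(WEIGHTS[t] for t in add_terminators(text.split()))


order = sorted(EXAMPLES, key=lambda k: primary_key(EXAMPLES[k]))
print(" ".join(order))
# 01 05 12 08 15 02 06 13 09 16 03 07 14 10 19 04 11
```

Here the weight sequences stay variable in length, but the low
terminator weight closes off each sign before the weight of any
following character is compared.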

The issue here, of course, is that pre-processing doesn't come for
free. Data could be stored in a database in a pre-processed form,
to simplify certain operations, but the preprocessed form wouldn't
be the same as the interchange form for text. And outside of
tightly controlled contexts, there would be little expectation
that any such pre-processing would be applied systematically where
sorting or searching might use general routines.


Recommendation

First, I stand by my conclusion in L2/15-184 that the problem
here is not amenable to a simple fix in DUCET. So I do not think
that changes should be made to the current values for SignWriting
symbols in DUCET for UCA 9.0.

However, I agree with the discussion in L2/15-194 that the basic
intractability of the default collation problem for SignWriting
ultimately stems from the original decision to opt for somewhat
more compact text representation by making fill-1 and rotation-1
be inherent values, thus resulting in variable length representations
for the base+fill+rotation sets.

One possible response, rather than attempting a hack at the current
DUCET values (which doesn't work anyway -- see above), would be to
go back to the drawing board for the encoding of SignWriting:
explicitly add the encoding of FILL MODIFIER-1 and ROTATION MODIFIER-1 and
change the text model to *require* an explicit fill modifier and
rotation modifier in all cases. That would make the collation
(and searching) issue more tractable. But it would have serious
down sides as well. Any implementation would need to define the
fallback representation of sequences that were missing either
a fill modifier or a rotation modifier (or both). If the fallbacks
end up looking the same for HFI and HFI+F1 and HFI+R1 as for
the "canonical" HFI+F1+R1 sequence, for example, then you have
an introduced multiple representation problem. But if the fallbacks
don't look like that, then other possible confusion can be
introduced and/or you may end up with more complications for
input methods and editing.

However, before jumping off in that direction, I want to point to
another storm cloud on the horizon. In feedback on L2/15-184 not
included in L2/15-194, Stephen Slevinski noted:

    "The facial diacritic section has never been tested or supported.  
    The only working font ignores the facial diacritic properties."

If the implication here is that all of the face symbols used in
SignWriting *also* have to be treated *as if* they were atomically
encoded, rather than as sequences of the base face sign U+1D9FF
plus various diacritic modifiers as combining marks (presumably in
combinations of variable length), then the collation problem for
those combinations is *also* intractable, and would require either
more entries in a large contraction table or extensive pre-processing
of strings for comparison.

There are more observations about SignWriting implementation that
follow from that, but I'll stop here -- as the basic point of this
document is that there is no demonstrated simple fix to DUCET for
SignWriting that would suffice.