Collation Mechanism for Syllabic Scripts

Public Review Issue #22

Section 7.1.4, Trailing_Weights, of UTS #10: Unicode Collation Algorithm discusses a mechanism for handling syllabic scripts, notably Korean Hangul. The following alternative mechanism is proposed to allow the UCA and its tailorings to handle syllabic collation. The goal is for this mechanism to be lightweight, and thus easy to implement without affecting the performance of other characters, while having enough expressive power to meet the requirements of syllable collation.

The core notion is to add a mechanism that allows a terminator to be added to each syllable, so that syllables sort in the correct order. The terminator character would be tailored to be less than any other character that could occur in the syllable; thus a shorter syllable would sort before any longer syllable that contains the shorter one as its initial portion.
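
The effect of a low-weight terminator can be sketched in a few lines of Python. The weights and the romanized "syllables" below are purely hypothetical stand-ins chosen for illustration; they are not UCA weights or real Hangul data.

```python
# Hypothetical weights: the terminator weighs less than every letter.
TERMINATOR = 0
WEIGHTS = {"a": 1, "g": 2, "k": 3, "n": 4}


def key_without_terminator(syllables):
    """Flatten the letter weights of all syllables into one key."""
    return [WEIGHTS[ch] for syl in syllables for ch in syl]


def key_with_terminator(syllables):
    """Same as above, but close every syllable with the terminator weight."""
    key = []
    for syl in syllables:
        key.extend(WEIGHTS[ch] for ch in syl)
        key.append(TERMINATOR)
    return key


# Each word is a list of syllables: "gak" versus "ga" + "na".
words = [["gak"], ["ga", "na"]]

# Without a terminator, "k" (3) is compared against "n" (4), so "gak" sorts first.
print(sorted(words, key=key_without_terminator))

# With a terminator, "k" (3) is compared against the terminator (0),
# so the word starting with the shorter syllable "ga" sorts first.
print(sorted(words, key=key_with_terminator))
```

With these assumed weights, the terminator makes the outcome depend on the syllable boundary rather than on whichever letter happens to begin the next syllable.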

This would involve something like the following changes to the text of UTS #10:

In 3.2.1 File_Format, change <collationElementTable> to:

<collationElementTable> :=
      <version>
    | <variable>?
    | <insertionRule>*
    | <backwards>*
    | <entry>+

And add:

<insertionRule>   := '@add' <insertionString> ';' <rangeBefore>? ';' <rangeAfter>?
<insertionString> := <charList>
<rangeBefore>     := <charRange>
<rangeAfter>      := <charRange>
<charRange>       := <negation>? <rangeItem> ( ' ' <rangeItem>)*
<rangeItem>       := <char> ( '-' <char>)?
<negation>        := '^'
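
As a rough illustration of how little machinery this syntax needs, the following Python sketch parses a single `@add` line. The names `CharRange`, `InsertionRule`, `parse_char_range`, and `parse_insertion_rule` are hypothetical, and the handling of `#` comments and of empty fields (returned as `None`) is an assumption rather than part of the proposed text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CharRange:
    negated: bool
    items: List[Tuple[int, int]]  # inclusive (low, high) code point pairs


@dataclass
class InsertionRule:
    insertion: str
    before: Optional[CharRange]  # None when the field is empty
    after: Optional[CharRange]   # None when the field is empty


def parse_char_range(text: str) -> Optional[CharRange]:
    """Parse '^'? item (' ' item)* where item is HEX or HEX-HEX."""
    text = text.strip()
    if not text:
        return None
    negated = text.startswith("^")
    if negated:
        text = text[1:]
    items = []
    for item in text.split():
        low, _, high = item.partition("-")
        items.append((int(low, 16), int(high or low, 16)))
    return CharRange(negated, items)


def parse_insertion_rule(line: str) -> InsertionRule:
    """Parse one '@add <string>; <rangeBefore>; <rangeAfter>' line."""
    line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
    if not line.startswith("@add"):
        raise ValueError("not an insertion rule: %r" % line)
    fields = line[len("@add"):].split(";")
    if len(fields) != 3:
        raise ValueError("expected three ';'-separated fields: %r" % line)
    insertion = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    return InsertionRule(insertion,
                         parse_char_range(fields[1]),
                         parse_char_range(fields[2]))


rule = parse_insertion_rule(
    "@add 115F; 1100-1159; ^1100-1159 1160-11A2 11A8-11F9")
print(rule)
```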

The insertion rules change the production of a sort key in the following way: whenever the input string contains a sequence of two characters where, for some rule, the first character matches the rangeBefore and the second character matches the rangeAfter, the insertionString is inserted between them. A more exact formulation is given in the Main Algorithm.

Additional Validity Requirements:

Non-repetition: In each insertion rule, the last character of the <charList> cannot match the <rangeBefore>.
Non-collision: If any two rules would trigger at the same position in any string, then the file is invalid.
Negation: A negated <rangeBefore> also matches the position before the first character of a string, and a negated <rangeAfter> also matches the position after the last character.
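
The Negation and Non-repetition requirements can be sketched as follows. This is only one possible reading: the representation of a range as a `(negated, items)` pair and the function names are hypothetical, and the Non-collision check is not attempted here.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (negated, inclusive code point pairs).
Range = Tuple[bool, List[Tuple[int, int]]]


def range_matches(rng: Range, cp: Optional[int]) -> bool:
    """Return True if code point cp matches rng.

    cp is None when the position lies off the front or end of the string;
    only a negated range matches there.
    """
    negated, items = rng
    if cp is None:
        return negated
    inside = any(lo <= cp <= hi for lo, hi in items)
    return inside != negated


def violates_non_repetition(insertion: str, range_before: Range) -> bool:
    """True if the last inserted character matches the rule's rangeBefore."""
    return range_matches(range_before, ord(insertion[-1]))


L_RANGE = (False, [(0x1100, 0x1159)])
NOT_LVT = (True, [(0x1100, 0x1159), (0x1160, 0x11A2), (0x11A8, 0x11F9)])

print(range_matches(NOT_LVT, None))                 # off the end of the string
print(violates_non_repetition("\u115f", L_RANGE))   # U+115F is outside 1100-1159
```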

Example 1 (for Hangul Jamo):

# add a Filler character after an L, whenever not followed by L, V, or T
@add 115F; 1100-1159; ^1100-1159 1160-11A2 11A8-11F9

# add a Filler after a V, whenever not followed by a V or T
@add 115F; 1160-11A2; ^1160-11A2 11A8-11F9

# add a Filler after a T, whenever not followed by a T
@add 115F; 11A8-11F9; ^11A8-11F9

Note that these rules would add Fillers at the end of an input string whose last character is an L, V, or T, since a negated range also matches the position after the end of the input string.

Example 2 (showing empty ranges, strings):

# add 'a' after any 'b' or 'c'
@add 0061; 0062-0063;

# add 'dg' before any 'e' or 'f'
@add 0064 0067; ; 0065-0066

In 4.1 Normalize each input string, add:

S1.4 Walk through the input string. At any position in the input string that matches an insertion rule, insert the string specified by that rule. After an insertion, continue at the position corresponding to the end of the inserted string plus 1; that is, do not revisit characters that have been inserted.
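
One way to read step S1.4 is sketched below in Python. It rests on several assumptions that the text above does not settle: each gap between characters of the original string (including the gaps before the first and after the last character) is examined exactly once, an empty range places no constraint, and the first matching rule is used (a valid file has no collisions, so at most one rule should match). The rule tuples and function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (negated, inclusive code point pairs), or None
# for an empty range.
Range = Optional[Tuple[bool, List[Tuple[int, int]]]]
Rule = Tuple[str, Range, Range]  # (insertionString, rangeBefore, rangeAfter)


def matches(rng: Range, ch: Optional[str]) -> bool:
    """ch is None for the position off the front or end of the string."""
    if rng is None:  # assumption: an empty range matches anything
        return True
    negated, items = rng
    if ch is None:
        return negated
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in items) != negated


def apply_insertion_rules(text: str, rules: List[Rule]) -> str:
    out = []
    for i in range(len(text) + 1):
        prev_ch = text[i - 1] if i > 0 else None
        next_ch = text[i] if i < len(text) else None
        for insertion, before, after in rules:
            if matches(before, prev_ch) and matches(after, next_ch):
                out.append(insertion)  # inserted characters are never re-examined
                break
        if next_ch is not None:
            out.append(next_ch)
    return "".join(out)


# Three rules in the spirit of Example 1: a Filler after an L, V, or T
# that is not followed by a character continuing the syllable.
L = (0x1100, 0x1159)
V = (0x1160, 0x11A2)
T = (0x11A8, 0x11F9)
HANGUL_RULES = [
    ("\u115f", (False, [L]), (True, [L, V, T])),
    ("\u115f", (False, [V]), (True, [V, T])),
    ("\u115f", (False, [T]), (True, [T])),
]

result = apply_insertion_rules("\u1100\u1161\u1102\u1161", HANGUL_RULES)
print([hex(ord(ch)) for ch in result])
```

Under these assumptions, the L V L V input receives a Filler after each V, which gives every syllable the terminator described at the start of this proposal.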