L2/04-277

Subject: Assessment of Collation Proposals
Status: Personal Contribution for SC2 ad hoc on ISO 14651
Author: Mark Davis
Date: 2004-06-23

We've considered at some length a number of issues raised by Kent in recent documents regarding collation. This document contains an assessment of those proposals, and provides background information.

Assessment of Proposals

  1. Stability of 4.0. There should be no changes should be made in the version of 14651 that is to be in sync with Unicode 4.0. We should preserve that as is, and any new changes should be aimed at Unicode/UCA Version 4.1 and the following version of ISO 14651. (It is unclear whether Kent is asking for changes in this balloted version, but we need to be clear.)
  2. Logical-Order Exceptions. We have looked over the proposal for the logical-order exception characters, changing Thai / Lao to use contractions instead of logical order, as proposed by Kent. While this does make Thai / Lao data larger, it of acceptable size and is much simpler to implement than what is currently in UCA. A corresponding change in UCA would be required, retracting [S1.2].
  3. Jamo Decomposition. We do not see the necessity of a radical decomposition of Hangul (decomposing the Jamo). This has never been a customer complaint, and would produce bigger sort keys and be slower in comparison. We would need to verify any proposed change across a broad spectrum of Korean users — users who are made aware of the performance implications for sorting speed and data size.
  4. Trailing Hangul Weights. We agree with using Trailing Weights to fix Hangul cluster collation, but in tailorings — not in the default table. (For a discussion of Trailing Weights vs. Terminators, see Cluster Collation, below.) This is for a few reasons:
  5. Other Languages. For Thai, Lao, and especially Indic, we have not seen sufficient evidence that these languages must be sorted syllabically. And there is a very substantial cost in terms of performance and data size to using Trailing Weights when they are not needed (see Cluster Collation, below). Thus this should not be applied unless there is compelling evidence that it is required for the language, and there is substantial opportunity for review by government and industry.
  6. Default Table Criteria. It has become clear that the UTC and SC2 need to spend some time outlining the factors that we should use when assessing how to add characters to the default table (or whether to change them in a future release). Some initial suggestions are in Annex B.
  7. Small Thai Changes. For Thai, we had received information from our Thai contacts that the following changes should be made to make the UCA in accordance with the Royal Dictionary, and are sufficient to do so*. That should be reflected in Annex C.2 and in the default table. The other changes suggested by Kent have not been confirmed, and should not be made unless more evidence is forthcoming.
    1. After U+0E24 ฤ THAI CHARACTER RU
      Insert the sequence: U+0E24 ฤ {THAI CHARACTER RU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
    2. After U+0E26 ฦ THAI CHARACTER LU
      Insert the sequence: U+0E26 ฦ THAI CHARACTER LU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
    3. After U+0E44 ไ THAI CHARACTER SARA AI MAIMALAI
      Insert the character U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
    4. Before U+0E47  ็ THAI CHARACTER MAITAIKHU
      Insert the character U+0E4E ๎ THAI CHARACTER YAMAKKAN
    5. After U+0E4B ๋ THAI CHARACTER MAI CHATTAWA
      Insert the character U+0E4C ์ THAI CHARACTER THANTHAKHAT
      Then the character U+0E4D ํ THAI CHARACTER NIKHAHIT
    6. After the last secondary ignorable
      Insert the character U+0E2F ฯ THAI CHARACTER PAIYANNOI
      Then the character U+0E46 ๆ THAI CHARACTER MAIYAMOK
      Then the character U+0E4F ๏ THAI CHARACTER FONGMAN
      Then the character U+0E5A ๚ THAI CHARACTER ANGKHANKHU
      Then the character U+0E5B ๛ THAI CHARACTER KHOMUT
    7. * Actually, full conformance requires changing space and a couple of other characters to be secondary ignorables, but that cannot be done without a tailoring since it would disturb too many other languages.

Cluster Collation

With cluster collation, certain sequences (clusters) of characters sort as primary units. A special case of cluster collation is called syllabic collation: that is where the clusters correspond to syllables in the language. For background information, see [UCA]. What sorting as primary units means is that if a cluster C1 sorts as primary-less than a cluster C2, then that ordering cannot change based on following characters. In other words, we can never have an X and Y such that both of the following are true:

If all clusters in the language have the same number of characters, collation is not a problem. The normal character-by-character comparison works. If they are not all the same length, then something has to be done, otherwise we get the following problem. Suppose C2 above is really composed of C1 + C3. Then any X that sorts after C3 will cause the above conditions to be violated violated:

Case 1 Case 2
1 C1  
2 C1 C3
2 C1 C3 Y
1 C1 X  

Where there are a small number of clusters that are of different lengths, the longer clusters can just be treated as contractions. For example, Hungarian has the following clusters: [a-z á é í ó ú ö ü ő ű {cs} {dz} {dzs} {gy} {ly} {ny} {sz} {ty} {zs} {ccs} {ddz} {ddzs} {ggy} {lly} {nny} {ssz} {tty} {zzs}]. Most are single letters, but some are multiple letters (those between {...}), and are handled as contractions in collation. See, for example, the Hungarian CLDR data [Hun].

Where cluster collation has a significant impact is where the clusters have different lengths, and are generative (meaning that there are predictable patterns, but generate a very large number of combinations of clusters). If all possible combinations had to be expressed as contractions, that could lead to tens of thousands of contractions — resulting in poor performance and excessive data size.

Both of these factors are, of course, perceived as strong negatives by users.

Solutions

With the UCA / ISO 14651, much of this generative cluster collation can be accomplished by reordering weights (possibly supplemented by forming contractions) so that a character-by-character comparison produces the right answer, and reducing the need for huge numbers of contractions.

Let's take an example. Suppose that collation clusters in a particular language are of the form of a Base character plus zero to n Trailing characters: symbolically expressed as Bn Tn* (where x* means zero or more x's). If B1 is primary-less than B1T1, then B1X < B1T1, for all X that can start a cluster. The correct ordering can be achieved in UCA by ensuring that X < T1 for all X that can start a cluster. That means that each Tn have to be higher than each Bn, but also higher than each character outside of the alphabet, such as Latin, Greek, CJK-Ideographs, etc. There is a special area in UCA, the Trailing Weights area, for this purpose, so that the trailing characters can be tailored to that range, and be higher than all single characters. (See [7.1.4].)

This method does have some limitations. Where multiple initial letters need to act internally as a cluster, one has to fall back on contractions. Thus if B1B1 acts as a cluster, then a character-by-character comparison would end up with B1 > T1. The way to do this is to define the sequence <B1,B2> as a contraction that is primary-greater than B1.

Note that if you have two different languages that both need trailing weights, the above mechanism can still work. As long as the trailing weights cannot start a cluster, there is no conflict among them for different languages, in well formed text. (As always, we don't much care about the ordering of ill-formed clusters, as long as the fundamental consistency requirements of collation are still valid, e.g. transitivity).

The other major alternative for getting the right answer with a case-by-case comparison would be to use Terminators. This would involve adding some syntax to UCA / ISO 14651 that would insert a Terminator after every cluster. The Terminator would be a special weight which is less than all other primary weights in the UCA. For example, we could have syntax like:

@terminate between [B1...Bn T1...Tn] and [^T1...Tn]

This syntax would have the effect of adding a Terminator between any two characters that matched the above, where [^x..y] means all characters that are not in [x...y]. The following example shows how Terminators (shown here as ) would be solve the problem cited above. We would never be comparing any X against a trailing part of a cluster; we would always compare a first, which is always less than any character. So the correct order always results.

Case 1 Case 2
1 C1  
2 C1 C3
1 C1 X  
2 C1 C3 Y

Here is a table of comparisons of these methods:

  Trailing Weights Terminator
1 + Simple; only takes data changes - Requires new syntax and code
2 + Uncompressed Sort Keys shorter + Compressed Sort Keys shorter
3 - Requires use of contractions to work around restrictions + More general
4 - Contraction list might be quite large + Doesn't need contractions
5 + Faster (in general) - Slower (in general)

Notes

  1. Trailing Weights can be implemented now, either in the default table for UCA / ISO 14651, or in tailorings. Terminators would require new syntax, and all implementations would need to be upgraded to support that syntax.
  2. Compression of primary weights can produce significantly smaller sort keys, which affects both database size and speed. Compression depends, however, on contiguity; moving trailing characters to a different range breaks that contiguity. For example, by our figures, for n Hangul syllables, given an average syllable length of 2.5 jamo, we get approximately the following:
  3. Trailing Weights do require the use of contractions to handle initial sequences. And the approach is not as general as Termination. For example, suppose that we had a language with clusters of the form either (a) or (b) below:
    1. Bn Tn*
    2. Tn*
    That is, a Trailing character on its own constitutes a cluster. The Trailing Weights solution does not allow a (b) cluster to be sorted before an (a) cluster. However, for all the languages we know of, this kind of situation does not obtain.
  4. Because contractions are required for initial sequences, the data tables for Trailing Weights are larger than for Terminators.
  5. The speed may depend on the circumstances. As we pointed out above, any contraction is slower, because you have to do a look-ahead, so that works against Trailing Weights for those cases where it is required. On the other hand, almost every character in the alphabet requires a look-ahead with the Terminator solution.

Our overall conclusion is that we should use that Trailing Weights for now, and only consider adding Terminators if we run across a language that really cannot be correctly tailored without it.

Hangul

Hangul has multiple clusters, since the syllables are of the form L+ V+ T*. To do Hangul correctly with Trailing Weights, you need primary weights in the following order (where the left end is zero weight, and the right end is the top weight):

Xb L Xa T V

The Xb are all the scripts sorted before Hangul, and the Xa are all those sorted after. This ordering gives the right results among the following:

Chars Comments
L1V1Xa  
L1V1L...  
L1V1Xb  
L1V1T1 Works because T1 > all X and L
L1V1V2 Works because V2 > all T
L1L2V1 Works if L1L2 is a contraction.
 

This method requires that all the known LL combinations are treated as contractions. It would not handle any that were not. However, there is relatively good information about when LL combinations actually occurred — in ancient Hangul — and any others could be added over time.

The Terminator method would assign the following weights:

Xb T V L Xa

This ordering gives the right results among the following:

Chars Term Insert Comments
L1V1Xa L1V1Xa  
L1V1L... L1V1L...  
L1V1Xb L1V1Xb  
L1V1T1 L1V1T1 Works because T1 > all X and
L1V1V2 L1V1V2 Works because V2 > all T
L1L2V1 L1L2V1 Works because L1 > all V
 

Hangul is in a rather unique position, because of all the precomposed characters, and because those precomposed characters are the normal (NFC) form of interchanged text. Because of that, we have concluded that in implementation it is actually better to take an alternative approach, one that still preserves canonical equivalence, has the right performance/speed characteristics, and handles all characters correctly without requiring contractions.

We mention this only for background information; it is not necessary for the implementation of Hangul sorting. The method is given by the following (it may be easier to follow with the example given immediately after):

  1. Generate a modified weight table, in the following way:
    1. Each precomposed syllable has a 3-byte weight, where:
      1. the first byte is identical across the entire range. (An identical first byte allows for primary compression.)
      2. there are 2 gaps between each syllable's weight. That can be done because there are roughly 60K weights in 2 bytes, and 11,170 syllables to distribute across that 60K
        • The first gap is used for tailoring other characters (including Han Ideographs).
        • The second gap is reserved for additional syllables, as described below.
    2. Each Jamo has a 1-byte weight, ensuring that all Ts are below all Vs are below all Ls. The lowest weight is a terminator . These weights are separate from the default weights, and are just used internally. We'll refer to the byte weight of Ln as WLn
  2. When any string of Jamo and/or Hangul Syllables is encountered, break it into syllables according to the definition in the Unicode Standard, Chapter 3. Process each syllable separately:
    1. If a syllable is canonically equivalent to one of the precomposed Hangul Syllables, then just assign the weight as above
    2. If not, then find the least syllable that it is greater than; call that the base syllable. Find the reserved gap after the base syllable. Generate a weight sequence corresponding to the weights of the gap, followed by all the Jamo, followed by the terminator.

Example

Suppose we find a syllable of the form L1L2V1T1. We find the precomposed syllable just less than it: L1V1T1. The weights for that syllable and the following will look something like:

L1V1T1 W1, W2, W3
tailoring W1, W2, W4
gap W1, W2, W5
L1V1T2 W1, W2, W6

The generated weight will have the form: <W1, W2, W5, WL1, WL2, WV1, WT1, >. This weight will produce the right ordering, because:

This mechanism is unique to Hangul because of the algorithmic syllable formation in the Unicode Standard. The advantages over either Trailing Weights or Terminators are that all precomposed syllables — which are by a huge margin the most common cases — get 3 bytes of primary weight. For a sequence of n bytes, this compresses to 2 + 2n bytes. The other syllables sort correctly. They are rather long in bytes, but that doesn't matter because of their extreme infrequency.

This method does require that the Jamo and Hangul Syllables themselves not be tailored. Whenever that is done, we would turn off this mechanism, and fall back to Trailing Weights.

Annex B. Default Table Criteria

We cannot produce a hard and fast set of rule determining how we should place characters in the default table, but the following is a suggested list of factors that should be taken into account. Warning: these factors may in some cases conflict; it still takes judgment to weigh the relative importance of these factors for the particular case.

  1. Common Usage.
    1. If x < y in all languages that use x, that is strong reason for having that ordering in the default table. This is especially clear if there is only a single major language that uses a given script.
      • For example, Polish is the only real 'user' of the ł, so the placement of that character should follow Polish conventions.
    2. If some languages have x < y and others don't, then the choice should be based on frequency of occurrence in world data
      • Thus the ordering of ä (e.g. ä < b versus z < ä) should be based on its frequency of usage in the languages where it occurs (German, Swedish,...) and the relative amount of data in those languages, and the frequency of occurrence of the character in those languages. That argues for having ä sort as in German.
      • As another example, æ sorts as (a) an expansion of 'ae' in a wide number of languages (e.g. cæsium or Cæsar), but in infrequent usage; or (b) above z very commonly in a few languages (with relatively small populations); or (c) sorts after a in very few languages, with little data in actual use. Thus the best choice is either (a) or (b).
      • Note that behavior in clusters must be taken into account. Thus if "ch" sorts one way in Slovak, one has to look at all other languages where the sequence "ch" could occur — not just those languages where it is a cluster.
  2. Default Table Size / Complexity. However, if the amount of data required for a given language would unduly burden all users of the default table, that may be reason to not include the tailorings.
  3. Other Factors.
    1. If x is generally considered an accent variant of another character (e.g. å, ę ħ, ł, ñ, ʑ, ...) then it should be a secondary (level 2) variant of the other in the default table.
      • This may be overridden by a common usage factor, as above.
    2. If x is generally considered a presentation or case variant of another character (e.g. ❶, ➀, ➊, , , , z, ...), then it should be a tertiary (level 3) variant of the other in the default table.
      • In particular, the only current failure in the default table is that German ß uppercases to SS, so it should be a tertiary difference, not a secondary difference. This is patterned after DIN 5007; however, we believe it to be a mistake in that standard and in the default table. (Note also that DIN 5007 as it stands cannot be fully implemented with UCA / ISO 14651, since it requires human judgment: it was written more for librarians than computers.) See [DIN]
    3. Even if punctuation, digits, and other symbols may sort after some letters in a language, for uniformity they are all grouped together with similar sorts of characters in the default table.
  4. Stability.
    1. Stability of the ordering is important; where we are discussing a change to an existing order, the evidence should be very clear based on common usage.
    2. That being said, we have told national bodies that whereas we cannot change character positions in UCS, the appropriate action is to change the sort order by changing the default table; so we need to be responsive where there are clear cases.

References

[UCA] UTS #10: Unicode Collation Algorithm, http://www.unicode.org/reports/tr10/
[Hun] http://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/main/hu.xmlhttp://oss.software.ibm.com/cvs/icu/~checkout~/locale/common/collation/hu.xml.
[7.1.4] UTS #10: Unicode Collation Algorithm, Section 7.1.4 "Trailing Weights", http://www.unicode.org/reports/tr10/#Trailing_Weights
[S1.2] UTS #10: Unicode Collation Algorithm, Section 4 "Main Algorithm", Section 4.1 "Normalize each input string", S1.2 http://www.unicode.org/reports/tr10/#Step_1
[DIN] http://www.unicode.org/L2/L2004/04140-din-letter.html