UCA Conformance Test

2003-10-22

The following files provide conformance tests for the Unicode Collation Algorithm (UCA), specified in UTS #10.

CollationTest_SHIFTED.txt
CollationTest_NON_IGNORABLE.txt

These files are large, and thus packaged in zip format to save download time.

Format

There are two files, one for each of two of the Variable Weighting values described in Section 3.2.2 of the UCA: Shifted and Non-ignorable.

The format is illustrated by the following example:

0385 0021; #GREEK DIALYTIKA TONOS [0209 0237|0047 0032 0020|0002 0002 0002|]

The part before the semicolon is the hex representation of a sequence of Unicode code points. After the hash mark is a comment. This comment is purely informational, and may change in the future. Currently it consists of the name of the first code point, followed by a representation of the sort key for the sequence.

The sort key representation is in square brackets. It uses a vertical bar for the ZERO separator. Between the bars are the primary, secondary, tertiary, and quaternary weights (if any), in hex.

Note: The sort key is purely informational. UCA does not require the production of any particular sort key, as long as the results of comparisons match.
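As an illustration of the format described above, the following Python sketch parses one line into its Unicode string and its comment, and optionally splits the bracketed sort key into per-level weights. The function names `parse_test_line` and `parse_sort_key` are hypothetical, and the handling of blank lines and lines starting with a hash mark is a defensive assumption rather than something the format description states.

```python
import re

def parse_test_line(line):
    """Split one test-file line into (string, comment).

    Returns None for blank lines or lines starting with '#'
    (an assumption; the format description does not mention them).
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Everything before the first semicolon is the code point sequence in hex.
    hex_part, _, rest = line.partition(";")
    text = "".join(chr(int(cp, 16)) for cp in hex_part.split())
    # Everything after the hash mark is the informational comment.
    comment = rest.partition("#")[2].strip()
    return text, comment

def parse_sort_key(comment):
    """Return the bracketed sort key as a list of weight lists, one per level."""
    match = re.search(r"\[([^\]]*)\]", comment)
    if match is None:
        return None
    # Vertical bars separate the levels; weights within a level are hex.
    return [[int(w, 16) for w in level.split()]
            for level in match.group(1).split("|")]
```

Because the sort key is informational only, a test program can ignore the result of `parse_sort_key` entirely; it may still be useful when debugging a failed comparison.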


Testing

The files are designed so that each line orders as greater than or equal to the previous one, when using the UCA and the Default Unicode Collation Element Table. A test program can read in each line, compare it to the previous line, and signal an error if the order is not correct. The exact comparison that should be used is as follows:

1. Read the next line.
2. Parse the sequence up to the semicolon, and convert it into a Unicode string.
3. Compare that string with the string from the previous line, using the UCA implementation.
4. If the previous string is less than the current string, continue to the next line.
5. If the previous string is greater than the current string, stop with an error.
6. Otherwise the two strings are equal under the UCA; compare them in code point order.
7. If the previous string is greater than the current string in code point order, stop with an error.
8. Continue to the next line.

If there are any errors, then the UCA implementation is not conformant.
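The procedure above can be sketched in Python as follows. This is a minimal sketch, not a reference test harness: `find_first_error` and `code_points` are hypothetical names, and `uca_compare` stands for whatever UCA implementation is under test, assumed here to be a callable returning a negative, zero, or positive integer. Python compares `str` values by code point, which is what the tie-breaking comparison requires.

```python
def code_points(line):
    """Decode the hex code points before the semicolon into a string."""
    return "".join(chr(int(cp, 16)) for cp in line.split(";", 1)[0].split())

def find_first_error(lines, uca_compare):
    """Return the 1-based number of the first out-of-order line, or None.

    uca_compare(a, b) is an assumed three-way comparison supplied by the
    implementation under test: negative, zero, or positive.
    """
    previous = None
    for number, line in enumerate(lines, start=1):
        # Skipping blank and '#' lines is a defensive assumption.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        current = code_points(line)
        if previous is not None:
            result = uca_compare(previous, current)
            if result > 0:
                return number  # previous sorts after current under the UCA
            if result == 0 and previous > current:
                return number  # UCA tie, but code point order is wrong
        previous = current
    return None
```

A caller would open one of the test files, pass its lines together with the comparison function of the implementation under test, and treat any return value other than `None` as a conformance failure.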

Notes:

This test is only valid for an untailored Default Unicode Collation Element Table.

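In step 6, the strings are compared in code point order. Note that for UTF-16 this is not the same as binary (code unit) order: supplementary code points are encoded with surrogate code units in the range D800..DFFF, so in binary order they sort before the BMP code points E000..FFFF.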
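The difference between code point order and UTF-16 binary order can be seen with a small Python example. The helper `utf16_units` is a hypothetical name introduced here for illustration; the two sample characters are chosen only because they fall on opposite sides of the surrogate range.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

a = "\uFF61"      # a BMP code point above the surrogate range
b = "\U00010000"  # the first supplementary code point

# Code point order: U+FF61 comes before U+10000.
code_point_order_ok = a < b

# UTF-16 binary order: [0xFF61] versus [0xD800, 0xDC00], so b comes first.
utf16_order_matches = utf16_units(a) < utf16_units(b)
```

Here `code_point_order_ok` is true while `utf16_order_matches` is false, so an implementation that stores strings as UTF-16 cannot use a plain code unit comparison for the tie-breaking step.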

Certain lines trigger steps 2.1.1 through 2.1.3 of the UCA. As the UCA states, "Conformant implementations may skip steps 2.1.1 through 2.1.3 if their repertoire of supported character sequences does not require this level of processing."