Unicode Collation Algorithm Demo
 

This page contains a simple demonstration of collation code for the Unicode Collation Algorithm. The source was written in Java, and can be found at source. (If your browser doesn't display the source properly, choose the View > Source menu.)

Operation

Hit the Load button above. (If you don't see a Load button, there is a problem with Java in your browser.) Once the window opens, you will see a box labeled Strings containing sample strings to be sorted. Click on Sort to see the strings in sorted order in the Sorted box. The format is as follows:

ä => <0061,0308> => [06C3 0713| 0020 0020| 0070 0002| FFFF FFFF]

First is the string itself, then its Normalization Form D, then the (uncompressed) sort key. To improve applet download time, I am only loading information (decompositions and rules) for characters 0000-00FF (excluding most controls) and 0300-03FF.
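The middle field of that format can be reproduced with the standard Java library. The following is a minimal sketch, not the demo's own code: the class and method names (NfdDemo, toNfdHex) are made up for illustration, and it relies only on java.text.Normalizer.

```java
import java.text.Normalizer;
import java.util.StringJoiner;

class NfdDemo {
    // Returns the NFD form of s as hex code points, in the style <0061,0308>.
    static String toNfdHex(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringJoiner joiner = new StringJoiner(",", "<", ">");
        nfd.codePoints().forEach(cp -> joiner.add(String.format("%04X", cp)));
        return joiner.toString();
    }

    public static void main(String[] args) {
        // U+00E4 is LATIN SMALL LETTER A WITH DIAERESIS.
        System.out.println(toNfdHex("\u00E4")); // prints <0061,0308>
    }
}
```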

You can change the contents of the Strings box to try out different strings. If your system doesn't let you type in non-ASCII characters, use \uXXXX, where XXXX is the four-digit hexadecimal Unicode code point. For example, a micro sign would be \u00B5.
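How such escapes might be expanded can be sketched in a few lines of Java. This is only an illustration under assumptions: the names EscapeDemo and unescape are hypothetical, and the sketch handles exactly four hex digits per escape, which may differ from what the demo itself accepts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces every four-digit escape in the input with the character it names.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the Java compiler from expanding the escape itself.
        String typed = "5 \\u00B5m";
        System.out.println(unescape(typed).length()); // prints 4
    }
}
```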

The controls above the Sorted box allow you to fine-tune the results of the sort. As you click them on and off, the sample strings are re-sorted.

Changing Rules

The Rules box contains a set of rules. If you change any of them, click on Rebuild afterwards. (If you've made any errors, a message will appear in the bottom box.)

Sample Rearrangement

qaa
qaz
qa
a
qzz
qza
qz
z

Rearrangement is used in Thai and Lao. For illustration, both q and Q are set to rearrange in the sample rules. To see how it works, copy the sample in the sidebar into the Strings box, and click on Sort. You can delete the @rearrange line to see the difference.
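The effect of rearrangement on the sidebar sample can be sketched as follows. This is a simplified illustration and not the demo's code: it assumes that a rearranging character is simply swapped with the character that follows it, and it compares the rearranged strings by plain code unit order as a stand-in for real sort keys. The names RearrangeDemo, rearrange, and sortSample are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RearrangeDemo {
    // Assumption: q and Q rearrange, as in the sample rules.
    static final Set<Character> REARRANGING = new HashSet<>(Arrays.asList('q', 'Q'));

    // Swaps each rearranging character with the character after it.
    static String rearrange(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i + 1 < chars.length; i++) {
            if (REARRANGING.contains(chars[i])) {
                char tmp = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = tmp;
                i++; // skip the pair that was just swapped
            }
        }
        return new String(chars);
    }

    // Sorts by the rearranged form; plain string order stands in for sort keys.
    static List<String> sortSample(List<String> input) {
        List<String> result = new ArrayList<>(input);
        result.sort(Comparator.comparing((String s) -> rearrange(s)));
        return result;
    }

    public static void main(String[] args) {
        List<String> sample =
                Arrays.asList("qaa", "qaz", "qa", "a", "qzz", "qza", "qz", "z");
        System.out.println(sortSample(sample));
        // prints [a, qa, qaa, qaz, z, qz, qza, qzz]
    }
}
```

With the swap in place, qa is compared as aq, so it lands among the strings starting with a rather than with the other strings starting with q.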

To check out the handling of unsupported characters, you can remove some of the Rules, add those characters to the Strings box, and click on Rebuild. For example, delete the rule for e:

0065 ; [.0713.0020.0002.0065] # LATIN SMALL LETTER E

Then click on Rebuild. Now strings starting with e will sort at the end.
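A rule line like the one for e can be split into its parts with a small parser. The sketch below is an assumption-laden illustration, not the demo's parser: it handles only a line with a single collation element of the form [.w1.w2.w3.w4], and the names RuleLineDemo, Rule, and parse are hypothetical.

```java
class RuleLineDemo {
    static final class Rule {
        final int[] codePoints; // characters the rule applies to
        final int[] weights;    // weights of the single collation element
        final String comment;   // text after '#', usually the character name

        Rule(int[] codePoints, int[] weights, String comment) {
            this.codePoints = codePoints;
            this.weights = weights;
            this.comment = comment;
        }
    }

    private static int[] parseHex(String[] fields) {
        int[] values = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Integer.parseInt(fields[i], 16);
        }
        return values;
    }

    static Rule parse(String line) {
        String comment = "";
        int hash = line.indexOf('#');
        if (hash >= 0) {
            comment = line.substring(hash + 1).trim();
            line = line.substring(0, hash);
        }
        String[] parts = line.split(";", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("missing ';' in rule");
        }
        String element = parts[1].trim();
        if (element.length() < 3 || !element.startsWith("[") || !element.endsWith("]")) {
            throw new IllegalArgumentException("malformed collation element");
        }
        // Drop the brackets and the leading separator, then split on '.'.
        String inner = element.substring(2, element.length() - 1);
        return new Rule(parseHex(parts[0].trim().split("\\s+")),
                        parseHex(inner.split("\\.")),
                        comment);
    }

    public static void main(String[] args) {
        Rule r = parse("0065 ; [.0713.0020.0002.0065] # LATIN SMALL LETTER E");
        System.out.println(String.format("%04X", r.weights[0])); // prints 0713
    }
}
```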

The Rules box also includes some tailored test rules with contracting characters. These include ch as in traditional Spanish, ä as in traditional German, and å as in Danish. Notice also that intervening accents don't interfere with successive contracting accents unless an intervening accent has the same canonical class as the contracting accent. For example, with a + dot_under + umlaut, the a-umlaut will form, but in a + grave + umlaut, the grave will block the umlaut. (Intervening accents always block successive contracting base characters: for example, any accent between c and h will block ch from forming.)
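The blocking condition for accents can be sketched as a small check. This is an illustration of the rule as described above, not the demo's code: the names BlockingDemo, combiningClass, and canContract are hypothetical, and the canonical combining classes are hard-coded for just three marks (the standard Java library does not expose them).

```java
import java.util.HashMap;
import java.util.Map;

class BlockingDemo {
    // Canonical combining classes for the three marks used in the example.
    private static final Map<Integer, Integer> CCC = new HashMap<>();
    static {
        CCC.put(0x0300, 230); // COMBINING GRAVE ACCENT
        CCC.put(0x0308, 230); // COMBINING DIAERESIS (umlaut)
        CCC.put(0x0323, 220); // COMBINING DOT BELOW
    }

    // Anything not in the table is treated as class 0 (a base character).
    static int combiningClass(int cp) {
        return CCC.getOrDefault(cp, 0);
    }

    // True if the mark at markIndex can contract with the base at baseIndex,
    // i.e. no character in between has class 0 or the same class as the mark.
    static boolean canContract(int[] nfd, int baseIndex, int markIndex) {
        int target = combiningClass(nfd[markIndex]);
        for (int i = baseIndex + 1; i < markIndex; i++) {
            int c = combiningClass(nfd[i]);
            if (c == 0 || c == target) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] dotUnderThenUmlaut = {0x0061, 0x0323, 0x0308};
        int[] graveThenUmlaut = {0x0061, 0x0300, 0x0308};
        System.out.println(canContract(dotUnderThenUmlaut, 0, 2)); // prints true
        System.out.println(canContract(graveThenUmlaut, 0, 2));    // prints false
    }
}
```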

Note that contracting characters must be closed. That is, if you have a contracting character in the table, such as abc, then every initial substring of that string (e.g. ab and a) must also be present in the table. Otherwise you will get an error. To see this, cut the line for the letter a from the Rules box:

0061 ; [.06C3.0020.0002.0061] # LATIN SMALL LETTER A

When you click on Rebuild, you get an error. To fix it, paste back the line for a, or remove the contracting characters that start with a.
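The closure requirement can be checked mechanically. The following is a minimal sketch, not the demo's own validation: the names ClosureDemo and missingPrefixes are hypothetical, table keys are represented as plain strings, and characters outside the Basic Multilingual Plane are not treated specially.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ClosureDemo {
    // Returns every proper initial substring of a key that is not itself a key.
    static List<String> missingPrefixes(Set<String> keys) {
        List<String> missing = new ArrayList<>();
        for (String key : new TreeSet<>(keys)) { // sorted for stable output
            for (int end = 1; end < key.length(); end++) {
                String prefix = key.substring(0, end);
                if (!keys.contains(prefix) && !missing.contains(prefix)) {
                    missing.add(prefix);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The table is closed while a is present...
        Set<String> closed = new HashSet<>(Arrays.asList("a", "ab", "abc", "c", "ch"));
        System.out.println(missingPrefixes(closed)); // prints []

        // ...and no longer closed once the line for a has been cut.
        Set<String> open = new HashSet<>(Arrays.asList("ab", "abc", "c", "ch"));
        System.out.println(missingPrefixes(open)); // prints [a]
    }
}
```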