L2/14-230

Source: Peter Occil
Date: October 2, 2014
Subject: Suggested changes to UTS #10 Appendix A

Attached is a version of Appendix A showing all the proposed changes. Notes shown in red are not part of the changes but merely explain some of them. Note also that additional changes are proposed that are not mentioned in my first message.

Appendix A: Deterministic Sorting

There is often a good deal of confusion about what is meant by the terms "stable" or "deterministic" when applied to sorting or comparison. This confusion in terms often leads people to make mistakes in their software architecture, or make choices of language-sensitive comparison options that have significant impact in terms of performance and footprint on performance and memory use, and yet do not give the results that users expect.

A.1 Stable Sort

A stable sort is one an algorithm where two records will retain their order when sorted according to a particular field, even when the two fields have the same contents. Thus those two records come out in with the same contents will have the same relative order that they were in before sorting, although their positions relative to other records may change. Importantly, this is a property of the sort algorithm, not the comparison mechanism.

Two examples of differing sort algorithms are Quicksort and Merge sort. Quicksort is not stable while Merge sort is stable. (A Bubble sort, as typically implemented, is also stable.)

Assume the following records:

Original Records

Record Last_Name First_Name
1 Davis John
2 Davis Mark
3 Curtner Fred

The results of a Merge sort on the Last_Name field only are:

Merge Sort Results

Record Last_Name First_Name
3 Curtner Fred
1 Davis John
2 Davis Mark

The results of a Quicksort on the Last_Name field only are:

Quicksort Results

Record Last_Name First_Name
3 Curtner Fred
2 Davis Mark
1 Davis John

As is apparent, the Quicksort algorithm is not stable; records 1 and 2 are not in the same order they were in before sorting.

A stable sort is often desirable—for one thing, it allows records to be successively sorted according to different fields, and to retain the correct lexicographic order. Thus, with a stable sort, one an application could sort all the records by First_Name, and then sort them again by Last_Name, giving the desired results: that all records would be ordered by Last_Name, and in the case where the Last_Name values are the same, be further subordered by First_Name.

A.1.1 Forcing Stable Sorts

NOTE: Moved here from section A.3.3. The changes are marked individually for convenience.

Typically, what people really want when they say they want a deterministic comparison is actually a stable sort.

One can force a non-stable sort algorithm to produce stable results by how one does the comparison. However, this has literally nothing to do with making the comparison deterministic or not. Forcing stable results can be done by appending the current record number to the strings to be compared. (The implementation may not actually append the number; it may use some other mechanism, but the effect would be the same.)

A non-stable sort algorithm can be forced to produce stable results by comparing the current record number (or some other value that is guaranteed to be unique for each record) for otherwise equal strings.

If such a modified comparison is used, for example, it forces Quicksort to get the same results as a Merge sort. In that case, the irrelevant character ZWJ does not ignored characters such as zero-width joiner (ZWJ) do not affect the outcome. The correct results occur, as illustrated below. The results below are sorted first by last name, then by first name.

Note that {ZWJ} appears below in order to demonstrate the result better.

First then Last Last Name then First Name (Forced Stable Results)

Record Last_Name First_Name
3 Curtner Fred
1 Da{ZWJ}vis John
2 Davis Mark

If anything, this then is what users want when they say they want a deterministic comparison. See also Section 1.6, Merging Sort Keys.

A.2 Deterministic Sort

A deterministic sort is a very different beast. This different from a stable sort: this is a sort algorithm that returns the same results each time. On the face of it, it would seem odd for any sort algorithm to not be deterministic, but there are examples of real-world sort algorithms that are not.

The key concept is that these sort algorithms are deterministic when two records have unequal fields, but they may return different results at different times when two records have equal fields.

For example, a classic Quicksort algorithm works recursively on ranges of records. For any given range of records, it takes the first element as the pivot element. However, that algorithm performs badly with input data that happens to be already sorted (or mostly sorted). A randomized Quicksort, which picks a random element as the pivot, can on average be faster. Because of this random selection, different outputs can result from exactly the same input: the algorithm is not deterministic.

Enhanced Quicksort Results (sorting on last name only)

Record Last_Name First_Name
3 Curtner Fred
2 Davis John
1 Davis Mark

or

Record Last_Name First_Name
3 Curtner Fred
1 Davis Mark
2 Davis John

As another example, multiprocessor sort algorithms can be non-deterministic. The work of sorting different blocks of data is farmed out to different processors and then merged back together. The ordering of records with equal fields might be different according to when different processors finish different tasks.

Note that a deterministic sort is weaker than a stable sort. A stable sort is always deterministic, but not vice versa. Typically, when people say they want a deterministic sort, they really mean that they want a stable sort.

A.3 Deterministic Comparison

A deterministic comparison is different than either a stable sort or a deterministic sort; it is a property of a comparison function, not a sort algorithm. This is a comparison where strings that do not have identical binary contents (optionally, after some process of normalization) will compare as unequal. A deterministic comparison is sometimes called a stable (or semi-stable) comparison.

There are many people who confuse a deterministic comparison with a deterministic (or stable) sort, but this ignores the fundamental difference between a comparison and a sort. A comparison is used by a sort algorithm to determine the relative ordering of two fields, such as strings. Using a deterministic comparison cannot cause a sort to be deterministic, nor to be stable. Whether a sort is deterministic or stable is a property of the sort algorithm, not the comparison function, as the prior examples show.

A.3.1 Forcing Deterministic Comparisons

One can produce a deterministic comparison function from a non-deterministic one, in the following way (in pseudo-code):

int new_compare (String a, String b) {
  int result = old_compare(a, b);
  if (result == 0) {
    result = binary_compare(a, b);
  }
  return result;
}

Programs typically also provide the facility to generate a sort key, which is a sequences of bytes generated from a string in alignment with a comparison function. Two sort keys will binary-compare in the same order as their original strings. The simplest means to create a deterministic sort key that aligns with the above new_compare is to append a copy of the original string to the sort key. This will force the comparison to be deterministic.

byteSequence new_sort_key (String a) {
  return old_sort_key(a) + SEPARATOR + toByteSequence(a);
}

Because sort keys and comparisons must be aligned, a sort key generator is deterministic if and only if a comparison is.

A.3.2 Best Practice

A deterministic comparison is generally not best practice. First, it has a certain performance cost in comparison, and a quite substantial impact on sort key size. (For example, ICU language-sensitive sort keys are generally about the size of the original string, so appending a copy of the original string to force a deterministic comparison generally doubles the size of the sort key.)  A database using these sort keys can have substantially increased disk footprint and memory footprint, and consequently reduced performance use more memory and disk space and thus have reduced performance.

More importantly, a deterministic comparison function does not actually achieve the effect people think it will have. Look at the sorted examples above. Whether a deterministic comparison is used or not, there will be no effect on the Quicksort example, Second, a deterministic comparison function does not affect the order of equal fields. Even if such a function is used, the order of equal fields is still not guaranteed in the Quicksort example, because the two records in question have identical Last_Name fields. It does not make a non-deterministic sort into a deterministic one, nor does it make a non-stable sort into a stable one.

Thirdly Third, a deterministic comparison is often not what is wanted, when people look closely at the implications. This is especially the case when the comparison involves only fields that are not guaranteed to be unique.

NOTE: Alternatively or additionally, it could say "This is especially the case when the comparison is not essentially a binary comparison, as is the case for linguistically appropriate string comparison."

NOTE: Perhaps add another reason here that deterministic comparisons provide few benefits for stable sort algorithms.

To illustrate this, look Look at the example again, and suppose that this time the user is sorting first by last name, then by first name.

Original Records

Record Last_Name First_Name
1 Davis John
2 Davis Mark
3 Curtner Fred

The desired results are the following, which should result whether the sort algorithm is stable or not, because it uses both fields.

First then Last Last Name then First Name

Record Last_Name First_Name
3 Curtner Fred
1 Davis John
2 Davis Mark

Now suppose that in record 2, the source for the data caused the last name to contain a format control character, such as a ZWJ (used to request ligatures on display) zero-width joiner (ZWJ, used to request ligatures on display). In this case there is no visible distinction in the forms, because the font does not have any ligatures for these sequences of Latin letters. The default UCA collation weighting causes the ZWJ to be—correctly—ignored in comparison, since it should only affect rendering. However, if that comparison is changed to be deterministic (by appending the binary values for the original string), then unexpected results will occur.

Note that {ZWJ} appears below in order to demonstrate the result better.

First then Last Last Name then First Name (Deterministic)

Record Last_Name First_Name
3 Curtner Fred
2 Davis Mark
1 Da{ZWJ}vis John

This is because in this case, the First_Name and Last_Name fields are each not guaranteed to be unique (there can be multiple people named "Mark" and multiple people named "Davis", for example).

NOTE: Alternatively or additionally, it could say "This is because in this case, the First_Name and Last_Name fields use a linguistic string comparison, which is different from a binary string comparison. While the linguistic comparison treats both "Davis" strings as equal, the binary comparison doesn't, and thus record 1 gets sorted after record 2 regardless of the value of the First_Name field."

Typically, what people really want when they say they want a deterministic comparison is actually a stable sort. Forcing stable results for non-stable sorts is shown in A.1.1 Forcing Stable Sorts.

NOTE: A.3.3 moved to A.1.1.

A.4 Stable and Portable Comparison

NOTE: Most of the changes in this section are editorial.

There are a few other terms worth mentioning, simply because they are also subject to considerable confusion. Any or all of the following terms may be easily confused with the discussion above.

A stable comparison is one that does not change over successive software versions. That is, as one an application uses successive versions of an API, with the same "settings" (such as locale), one it gets the same results.

A stable sort key generator is one that generates the same binary sequence over successive software versions.

Warning: If the sort key generator is stable, then the associated comparison will perforce necessarily be. However, the reverse is not guaranteed. To take a trivial example, suppose the new version of the software always adds an 0xFF byte at the front the byte 0xFF at the start of every sort key. The results of any comparison of any two new keys would be identical to the results of the comparison of any two corresponding old keys. However, the bytes have changed, and the comparison of old and new keys would give different results. Thus one can have there can be a stable comparison, yet an associated non-stable sort key generator.

A portable comparison is where corresponding APIs for comparison produce the same results across different platforms. That is, if one an application uses the same "settings" (such as locale), one it gets the same results.

A portable sort key generator is where corresponding sort key APIs produce exactly the same sequence of bytes across different platforms.

Warning: As above, a comparison may be portable without the associated sort key generator being portable.

Ideally, all products would have the same string comparison and sort key generation for, say Swedish, and thus be portable. For historical reasons, this is not the case. Even if the main letters sort the same, there will be differences in the handling of other letters, or of symbols, punctuation, and other characters. There are some libraries that offer portable comparison, such as [ICUCollator], but in general the results of comparison or sort key generation may vary significantly between different platforms.

In a closed system, or in simple scenarios, portability may not matter. Where someone has a given set of data to present to a user, and just wants the output to be reasonably appropriate for Swedish, the exact order on the screen may not matter.

In other circumstances, differences can lead to data corruption. For example, suppose that two implementations do a database SELECT query for records between a pair of strings. If the collation is different in the least way, they can get different data results. Financial data might be different, for example, if a city is included in one SELECT query on one platform and excluded from the same SELECT query on another platform.