Re: Sorting numbers

From: Doug Ewell (dewell@compuserve.com)
Date: Mon Mar 20 2000 - 10:40:28 EST


Just for fun, sorry, I mean as a pedagogical exercise, I added a variant
of the numeric sorting algorithm described by Markus Kuhn to a special-
purpose sorting utility I wrote some time ago. The utility pre-sorts
lists of data items for an application designed to manage quality control
data in medical laboratories, so Markus's example:

> Di-hydro-2,9-hexacylanomol
>
> should really be sorted right before
>
> Di-hydro-2,10-hexacylanomol

was relevant and gave me a justification.

First off, I have to agree with Ken that proper numeric sorting is by no
means "a trivial extension" of basic sorting. This has nothing to do
with how good a programmer one is. You do need to maintain a structure
of some sort to represent the weighting, and if you are going to deal
with all the cases Ken mentioned (e.g. thousands separators) then it gets
really complicated indeed. However, my application needed only to deal
with a reduced version of the problem (integers consisting of consecutive
runs of no more than 5 digits from U+0030 through U+0039), so I could
ignore this for now.

I copied the string to a temporary buffer, prefacing each numeric run
with '0' followed by the count of digits. '0' is there to keep numbers
in their proper ASCII and Unicode order -- we want '9' to come before
'10' but not before '/' or after ':'.

I added this to the comparison routine called by qsort(). For non-C
programmers, qsort() is a standard library function that sorts a list of
data according to criteria specified in a user-defined comparison
routine. This is cool because you get to say what "sorted" means, while
not having to bother writing all the plumbing of a sort function every
time. You can equate "Mc" and "Mac," ignore or observe punctuation,
sort the 'n'th field of multi-field delimited records, whatever.

I already had a pretty complex comparison function due to the need for
'n'th field handling and special cases (that's why it's a special-purpose
sorting utility in the first place), so adding the overhead for numeric
sorting didn't make things much slower.

Everything worked fine. The grand irony, though, is that the main QC
application for which the sorting utility was written *expects* the data
to be sorted in pure ASCII order (except for the special cases), so the
"improved" sorting actually caused new problems!

This illustrates perhaps the most important consideration of this whole
sorting discussion, namely, that sorting must always take the intended
use of the data into account. "Culturally valid" sorting is only one
example of this; so are numeric ordering, caps before smalls (or vice
versa), and Mc/Mac phone-book equivalence. Everyone on this list who
says that users (a) need to be able to define this for themselves and (b)
often don't know what they want or need, at this fine level of detail,
truly grasps the real problem.

-Doug Ewell
 Fullerton, California



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:00 EDT