Re: Sorting Numbers

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Mon Mar 13 2000 - 07:28:44 EST


Otto Stolz wrote on 2000-03-13 10:49 UTC:
> In contrast, the current DIN 5007,
>
> - defines a peculiar rule for the ordering of numbers (based on their
> respective values rather than on their representation), which does
> not really fit in a framework for ordering of character strings.

Why not? Correct ordering of at least integer numbers is a trivial
extension of the algorithm:

  - you treat leading zeros as second level entities
  - you sort digits as first-level entities
  - you prefix conceptually any substring of digits
    with a first-level sorting marker that indicates the length of the
    digit string.

Example: you transform

1
11
111
2
22
222
01
011
0111

into

[1]1
[2]11
[3]111
[1]2
[2]22
[3]222
(0)[1]1
(0)[2]11
(0)[3]111

where () means "moved to second level" and [n] is an inserted
first-level sorting marker that appears in the sorting order in the
digits range.

The resulting sorting order will be

[1]1
(0)[1]1
[1]2
[2]11
(0)[2]11
[2]22
[3]111
(0)[3]111
[3]222

which after undoing the additional markup shown here only for illustration
purposes will lead to the sorting order

1
01
2
11
011
22
111
0111
222

which is pretty much what you would expect.

Catalog index entries for electronic components or chemical substances
often don't contain decimal separators, so that is already all you need
here:

  Di-hydro-2,9-hexacylanomol

should really be sorted right before

  Di-hydro-2,10-hexacylanomol

and

  Light bulb, 100 W

should be sorted before

  Light bulb, 90 W

The only remaining problem are entries with decimal fractions like

  Sheet metal, 1.2 mm
  Sheet metal, 1.25 mm
  Sheet metal, 1.3 mm

The solution here is to add to the tailoring parameters a value
DECIMAL_SEPARATOR that can usually have the values "comma", "dot", or
"none". Then once a digit substring is recognized, it will not be
prefixed with a length indicator if the string is directly preceded by
the DECIMAL_SEPARATOR character.

I never understood, why this can't be handled properly within the
international sorting ordering framework. Culturally correct sorting of
numbers is most definitely a function that would be highly useful to
have available. It is also most definitely a function that cannot be
delegated simply to "higher processing levels", because these higher
processing level have no way of encoding arbitrary length numbers in a
way such that the sorting engine will treat them correctly.

Example:

Imagine that I have a higher processing level that replaces written
numbers such as "one", "two", "twothousand-one", "three" by 1, 2, 2001,
3. If the resulting decimal numbers are not sorted correctly, the higher
layer has to perform zero-padding to circumvent this. However, zero
padding can only be performed efficiently if you know in advance an
upper limit for the number of digits. If you insert strings that have
undergone a sorting transform into a B-tree, then you can't just recode
the entire B-tree, just because a user fancied to enter a number higher
than the current zero prefixing scheme anticipated.

So if your preprocessor transformed "one", "two", "twothousand-one",
"three" into 000001, 000002, 002001, 000003 before inserting the
sortable strings into a B-tree, then I can play the nasty user by
entering "one billion" and your sorting scheme breaks down immediately.

Therefore, numerical sorting is not only necessary to sort numbers
directly. It is also an essential mechanism for flexible interfacing to
higher-level sorting preprocessors.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT