RE: Sorting Numbers

From: Karlsson Kent - keka (keka@im.se)
Date: Mon Mar 13 2000 - 08:08:47 EST


I agree in principle with Markus here.

ISO/IEC FCD 14651 describes (informatively) in an annex
similar methods, including how to handle numerals for
negative numbers, in order to get "numerical" order for
strings that differ only in numerals, while still using
the basic ordering mechanism.

                Kind regards
                /kent k

> -----Original Message-----
> From: Markus Kuhn [mailto:Markus.Kuhn@cl.cam.ac.uk]
> Sent: Monday, March 13, 2000 1:24 PM
> To: Unicode List
> Subject: Re: Sorting Numbers
>
>
> Otto Stolz wrote on 2000-03-13 10:49 UTC:
> > In contrast, the current DIN 5007,
> >
> > - defines a peculiar rule for the ordering of numbers
> (based on their
> > respective values rather than on their representation), which does
> > not really fit in a framework for ordering of character strings.
>
> Why not? Correct ordering of at least integer numbers is a trivial
> extension of the algorithm:
>
> - you treat leading zeros as second level entities
> - you sort digits as first-level entities
> - you prefix conceptually any substring of digits
> with a first-level sorting marker that indicates the length of the
> digit string.
>
> Example: you transform
>
> 1
> 11
> 111
> 2
> 22
> 222
> 01
> 011
> 0111
>
> into
>
> [1]1
> [2]11
> [3]111
> [1]2
> [2]22
> [3]222
> (0)[1]1
> (0)[2]11
> (0)[3]111
>
> where () means "moved to second level" and [n] is an inserted
> first-level sorting marker that appears in the sorting order in the
> digits range.
>
> The resulting sorting order will be
>
> [1]1
> (0)[1]1
> [1]2
> [2]11
> (0)[2]11
> [2]22
> [3]111
> (0)[3]111
> [3]222
>
> which after undoing the additional markup shown here only for
> illustration
> purposes will lead to the sorting order
>
> 1
> 01
> 2
> 11
> 011
> 22
> 111
> 0111
> 222
>
> which is pretty much what you would expect.
>
> Catalog index entries for electronic components or chemical substances
> often don't contain decimal separators, so that is already
> all you need
> here:
>
> Di-hydro-2,9-hexacylanomol
>
> should really be sorted right before
>
> Di-hydro-2,10-hexacylanomol
>
> and
>
> Light bulb, 100 W
>
> should be sorted before
>
> Light bulb, 90 W
>
> The only remaining problem are entries with decimal fractions like
>
> Sheet metal, 1.2 mm
> Sheet metal, 1.25 mm
> Sheet metal, 1.3 mm
>
> The solution here is to add to the tailoring parameters a value
> DECIMAL_SEPARATOR that can usually have the values "comma", "dot", or
> "none". Then once a digit substring is recognized, it will not be
> prefixed with a length indicator if the string is directly preceded by
> the DECIMAL_SEPARATOR character.
>
> I never understood, why this can't be handled properly within the
> international sorting ordering framework. Culturally correct
> sorting of
> numbers is most definitely a function that would be highly useful to
> have available. It is also most definitely a function that cannot be
> delegated simply to "higher processing levels", because these higher
> processing level have no way of encoding arbitrary length numbers in a
> way such that the sorting engine will treat them correctly.
>
> Example:
>
> Imagine that I have a higher processing level that replaces written
> numbers such as "one", "two", "twothousand-one", "three" by
> 1, 2, 2001,
> 3. If the resulting decimal numbers are not sorted correctly,
> the higher
> layer has to perform zero-padding to circumvent this. However, zero
> padding can only be performed efficiently if you know in advance an
> upper limit for the number of digits. If you insert strings that have
> undergone a sorting transform into a B-tree, then you can't
> just recode
> the entire B-tree, just because a user fancied to enter a
> number higher
> than the current zero prefixing scheme anticipated.
>
> So if your preprocessor transformed "one", "two", "twothousand-one",
> "three" into 000001, 000002, 002001, 000003 before inserting the
> sortable strings into a B-tree, then I can play the nasty user by
> entering "one billion" and your sorting scheme breaks down
> immediately.
>
> Therefore, numerical sorting is not only necessary to sort numbers
> directly. It is also an essential mechanism for flexible
> interfacing to
> higher-level sorting preprocessors.
>
> Markus
>
> --
> Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
>
>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT