Re: Sorting Numbers

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Mar 13 2000 - 16:03:57 EST


Markus,

>
> Otto Stolz wrote on 2000-03-13 10:49 UTC:
> > In contrast, the current DIN 5007,
> >
> > - defines a peculiar rule for the ordering of numbers (based on their
> > respective values rather than on their representation), which does
> > not really fit in a framework for ordering of character strings.
>
> Why not? Correct ordering of at least integer numbers is a trivial
> extension of the algorithm:
>

Well, as Kent has pointed out, the draft DIS text for 14651 *does*
include an informative annex about numeric sorting in the context of
the international string ordering. And it does follow the general
approach you have suggested. The basic principles are clear enough.

However, there is no consensus that numeric ordering should be required
as part of the normative mechanism of that standard or of the
Unicode Collation Algorithm. In fact, if anything, the consensus is
the opposite -- that the basic string ordering should be defined without
taking numeric values into account, and that sorting numbers should be
treated as an epiphenomenon, not required for conformance to the
standard(s).

Why? Because while you can characterize numeric sorting as a "trivial
extension of the algorithm", in practice it is not so trivial. Having
implemented such an algorithm, my rough estimate is that modifying
it to deal with even the simplest numeric ordering would roughly double
its complexity and halve its efficiency.

Problems:

Unlike implementation of the basic algorithm, which simply
has to do table lookup matches, numeric processing introduces what
is effectively a separate pass to parse and lex numbers out of any
candidate string for weighting. Depending on how clever that pass
is integrated with the rest of the weighting, I might still have to
implement extra buffer allocations and copies. And in any case, I have
to carry around the machinery for the parsing.

The parts of the algorithm implementations that do incremental weighting
of two strings, for efficiency, would be particularly impacted by
a requirement to identify and differentially weight numbers. Such
incremental weighting is already rather complex when it has to
deal with on-the-fly canonical equivalence and reordering, calculation
of the implications of reverse accent weighting for French, ignoring
or weighting of various levels (for fuzzy matches), and such. If, on
top of all that, you also have to put in the push points that say,
"Oops, this might be the start of a number, so scan ahead in the
parser to determine if it is, and construct the candidate weights
for the next n positions accordingly...", that can no longer be
considered a "trivial extension" in actual implementation.

The problem of numeric formatting also extends, unfortunately, to the
distant horizon. Any implementation is going to run afoul of locale-specific
differences in the expression of numbers. While the weighting may be able
to suppress differences for thousands separators, for example, when encountered,
the problem is that the *parser* is still going to have to recognize them,
so that it sees "123456789" and "123,456,789" as the *same* number for
weighting, and does not misapprehend the second as a list of three
numbers "123", "456", and "789" for weighting. And the parser is
going to have to be apprised of such differences as "123,456,789.01"
and "123.456.789,01", as well as the even more troublesome use of spaces
as thousands separators in some formatting conventions.

Worse for implementations: the expectations set up for an algorithm
that sorts *numbers* correctly will easily be extrpolated by users to
expectations that they will sort *currency amounts* correctly as well.
And that means that the parser will then also have to venture into the
even more complex territory of locale-specific formatting conventions
for currency.

Should number sorting be constrained to ASCII digits 0..9? What about
all the other decimal digits defined in the standard? Are they to be
treated distinctly or analogously? And if decimal digits get treated
according to their numeric values, why shouldn't Han numerics also
be treated according to their numeric values? If not, where is the
rationale to draw the line? If so, then the parser once again embarks
onto uncharted deep waters, trying to distinguish the "three" in
"sambiki" (3 head of...) versus the "three" in "Saburo" (personal name).

Should number sorting be constrained to decimal radix expressions?
Why not hexadecimal or other radices as well? The "trivial extension" to the
algorithm, if applied without care to hexadecimal expressions would produce bizarre
results for lists of hexadecimal numbers.

Where should numeric boundaries be specified? This again is a problem
of sophistication and standardization of the parser/lexer.
Presumably "(600)" and "[600]" would be identifier a numbers for
sorting, but what about "6-0-0" in a part id? or "6.0.0" in an index?
Are those single numbers with ignorable punctuation? Or 3 numbers
with delimiters? For that matter, in some conventions, "(600)" should
be sorted the same as "-600". And so on and so on.

This is just a major Pandora's box. And since even the simplest
extensions significantly complicate the implementations of the basic
algorithm, there is no agreement to force any of this into the
basic statement of the standards.

--Ken



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT