**Next message:**Robert A. Rosenberg: "Re: Sorting Numbers"**Previous message:**Kenneth Whistler: "Re: DIN 5007, Swiss Sorting"**Maybe in reply to:**Markus Kuhn: "Re: Sorting Numbers"**Next in thread:**Robert A. Rosenberg: "Re: Sorting Numbers"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

Markus,

*>
*

*> Otto Stolz wrote on 2000-03-13 10:49 UTC:
*

*> > In contrast, the current DIN 5007,
*

*> >
*

*> > - defines a peculiar rule for the ordering of numbers (based on their
*

*> > respective values rather than on their representation), which does
*

*> > not really fit in a framework for ordering of character strings.
*

*>
*

*> Why not? Correct ordering of at least integer numbers is a trivial
*

*> extension of the algorithm:
*

*>
*

Well, as Kent has pointed out, the draft DIS text for 14651 *does*

include an informative annex about numeric sorting in the context of

the international string ordering. And it does follow the general

approach you have suggested. The basic principles are clear enough.

However, there is no consensus that numeric ordering should be required

as part of the normative mechanism of that standard or of the

Unicode Collation Algorithm. In fact, if anything, the consensus is

the opposite -- that the basic string ordering should be defined without

taking numeric values into account, and that sorting numbers should be

treated as an epiphenomenon, not required for conformance to the

standard(s).

Why? Because while you can characterize numeric sorting as a "trivial

extension of the algorithm", in practice it is not so trivial. Having

implemented such an algorithm, my rough estimate is that modifying

it to deal with even the simplest numeric ordering would roughly double

its complexity and halve its efficiency.

Problems:

Unlike implementation of the basic algorithm, which simply

has to do table lookup matches, numeric processing introduces what

is effectively a separate pass to parse and lex numbers out of any

candidate string for weighting. Depending on how clever that pass

is integrated with the rest of the weighting, I might still have to

implement extra buffer allocations and copies. And in any case, I have

to carry around the machinery for the parsing.

The parts of the algorithm implementations that do incremental weighting

of two strings, for efficiency, would be particularly impacted by

a requirement to identify and differentially weight numbers. Such

incremental weighting is already rather complex when it has to

deal with on-the-fly canonical equivalence and reordering, calculation

of the implications of reverse accent weighting for French, ignoring

or weighting of various levels (for fuzzy matches), and such. If, on

top of all that, you also have to put in the push points that say,

"Oops, this might be the start of a number, so scan ahead in the

parser to determine if it is, and construct the candidate weights

for the next n positions accordingly...", that can no longer be

considered a "trivial extension" in actual implementation.

The problem of numeric formatting also extends, unfortunately, to the

distant horizon. Any implementation is going to run afoul of locale-specific

differences in the expression of numbers. While the weighting may be able

to suppress differences for thousands separators, for example, when encountered,

the problem is that the *parser* is still going to have to recognize them,

so that it sees "123456789" and "123,456,789" as the *same* number for

weighting, and does not misapprehend the second as a list of three

numbers "123", "456", and "789" for weighting. And the parser is

going to have to be apprised of such differences as "123,456,789.01"

and "123.456.789,01", as well as the even more troublesome use of spaces

as thousands separators in some formatting conventions.

Worse for implementations: the expectations set up for an algorithm

that sorts *numbers* correctly will easily be extrpolated by users to

expectations that they will sort *currency amounts* correctly as well.

And that means that the parser will then also have to venture into the

even more complex territory of locale-specific formatting conventions

for currency.

Should number sorting be constrained to ASCII digits 0..9? What about

all the other decimal digits defined in the standard? Are they to be

treated distinctly or analogously? And if decimal digits get treated

according to their numeric values, why shouldn't Han numerics also

be treated according to their numeric values? If not, where is the

rationale to draw the line? If so, then the parser once again embarks

onto uncharted deep waters, trying to distinguish the "three" in

"sambiki" (3 head of...) versus the "three" in "Saburo" (personal name).

Should number sorting be constrained to decimal radix expressions?

Why not hexadecimal or other radices as well? The "trivial extension" to the

algorithm, if applied without care to hexadecimal expressions would produce bizarre

results for lists of hexadecimal numbers.

Where should numeric boundaries be specified? This again is a problem

of sophistication and standardization of the parser/lexer.

Presumably "(600)" and "[600]" would be identifier a numbers for

sorting, but what about "6-0-0" in a part id? or "6.0.0" in an index?

Are those single numbers with ignorable punctuation? Or 3 numbers

with delimiters? For that matter, in some conventions, "(600)" should

be sorted the same as "-600". And so on and so on.

This is just a major Pandora's box. And since even the simplest

extensions significantly complicate the implementations of the basic

algorithm, there is no agreement to force any of this into the

basic statement of the standards.

--Ken

**Next message:**Robert A. Rosenberg: "Re: Sorting Numbers"**Previous message:**Kenneth Whistler: "Re: DIN 5007, Swiss Sorting"**Maybe in reply to:**Markus Kuhn: "Re: Sorting Numbers"**Next in thread:**Robert A. Rosenberg: "Re: Sorting Numbers"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

*
This archive was generated by hypermail 2.1.2
: Tue Jul 10 2001 - 17:20:59 EDT
*