Re: International Sorting Order

From: Alain LaBonté (alb@sct.gouv.qc.ca)
Date: Mon Sep 16 1996 - 16:44:25 EDT


At 15:22 16/09/1996 -0500, Markus G. Kuhn wrote:
>A trivial question about the International String Ordering:
>
>Will it sort
>
> label1
> label2
> label10
> label20
>
>(which I hope), or will it sort
>
> label1
> label10
> label2
> label20
>
>the awful problem with many existing simple-minded sorting
>applications. The numerically correct sorting can be implemented easily by
>prefixing every sequence of digits with a length indicator.
>
>The Unix command "ls" is one popular application of the ISO and there
>exist many applications (MH, news, etc.) where filenames contain
>variable length decimal numbers that should be sorted in numerical and
>not lexicographic order. The same applies for alphabetic lists of
>chemical terms like 102-tetra-2,41-dibenzol. If only integer numbers
>are sorted numerically correctly, then this is ok for real-world
>applications. The extension to correctly sort decimal fractions is
>difficult, but fortunately usually not necessary.
>
>Markus

[Alain]:

There is provision for pre-handling so that your list is presented as
follows to the comparison process (without touching the original):

  label01
  label02
  label10
  label20

And that would produce the result you expect.
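
As an illustration only, here is a minimal Python sketch of such a
pre-handling step. ISO/IEC 14651 does not prescribe this code; the function
name pad_digits and the fixed width of 10 are assumptions made for the
example. It builds a comparison key in which every run of ASCII digits is
zero-padded, and it leaves the original strings untouched:

```python
import re


def pad_digits(text: str, width: int = 10) -> str:
    """Return a comparison key with every ASCII digit run zero-padded.

    `width` is an assumed upper bound on the length of any digit run.
    """
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), text)


labels = ["label1", "label10", "label2", "label20"]

# The key is used only for comparison; the strings themselves are not changed.
ordered = sorted(labels, key=pad_digits)
print(ordered)
```

A fixed width only works while no digit run is longer than that width. An
application would have to choose the width from its own data, or use a
length indicator in front of each digit run as Markus suggests.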

There were numerous discussions over the years on this problem, and the
assumption you make, which is of course a legitimate requirement, was
considered application-dependent. *The ISO* (International String
Ordering, ISO/IEC 14651) does not specify how prehandling is done but gives
an example, in particular for numeric strings.

Numeric data can also be Roman numerals, and we discuss that as well. For
example, in French:

CHAPITRE DIX

may be interpreted as CHAPTER 10 or CHAPTER 509, depending on context and on
application.
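
The ambiguity can be made concrete with a small Python sketch. Everything
named here (roman_value, readings, and the tiny FRENCH_NUMBER_WORDS table)
is hypothetical and only for illustration; it is not taken from ISO/IEC
14651, and the Roman-numeral reader does not reject malformed numerals:

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Deliberately tiny, illustrative table of French number words.
FRENCH_NUMBER_WORDS = {"UN": 1, "DEUX": 2, "TROIS": 3, "DIX": 10}


def roman_value(token: str) -> int:
    """Read `token` as a Roman numeral using the subtractive rule."""
    total = 0
    for i, ch in enumerate(token):
        value = ROMAN[ch]
        following = ROMAN[token[i + 1]] if i + 1 < len(token) else 0
        # A smaller symbol placed before a larger one is subtracted.
        total += -value if value < following else value
    return total


def readings(token: str) -> dict:
    """Return every numeric interpretation this sketch knows for `token`."""
    token = token.upper()
    found = {}
    if token in FRENCH_NUMBER_WORDS:
        found["french_word"] = FRENCH_NUMBER_WORDS[token]
    if token and all(ch in ROMAN for ch in token):
        found["roman_numeral"] = roman_value(token)
    return found


print(readings("DIX"))
```

Both readings come back for DIX, so only the application, knowing its
context, can decide which one the pre-handling step should use.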

Furthermore there might be fractions like ¼, ½, ¾, or decimal fractions
using a decimal point or a decimal comma, and a triad separator which could
be a space (SI system), a comma (English-speaking countries), a point (much
of continental Europe), a ' (Switzerland) and so on.
[So far ¼, ½ and ¾ are sorted with weights between 0 and 1, though, as this
is easy].
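
A hedged Python sketch of both points follows. The helper names
fraction_weight and parse_number are assumptions for this example, and the
caller has to say which separators apply, since the text alone cannot tell.
The single-character fractions get their values from the Unicode character
database through the standard unicodedata.numeric function:

```python
import unicodedata
from decimal import Decimal


def fraction_weight(ch: str) -> float:
    """Numeric value of a single-character fraction such as one half."""
    return unicodedata.numeric(ch)


def parse_number(text: str, decimal_sep: str, group_sep: str) -> Decimal:
    """Parse `text` given the separators the application expects."""
    # Remove the triad separator first, then normalise the decimal separator.
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


# U+00BC, U+00BD, U+00BE: one quarter, one half, three quarters.
weights = [fraction_weight(c) for c in "\u00bc\u00bd\u00be"]

a = parse_number("1,234.5", decimal_sep=".", group_sep=",")
b = parse_number("1 234,5", decimal_sep=",", group_sep=" ")
c = parse_number("1'234.5", decimal_sep=".", group_sep="'")
print(weights, a, b, c)
```

The three differently written strings all parse to the same value, which is
what a comparison key for numeric data would need.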

Finally there can be Hindi digits, Chinese digits, and so on.
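
For scripts that have their own positional decimal digits, one possible
pre-handling step is to fold each digit to its ASCII equivalent before
comparing. The sketch below is an assumption about how an application might
do this, not something the standard specifies; fold_digits is a hypothetical
name, and unicodedata.decimal is the standard-library lookup it relies on.
Chinese numerals are not positional decimal digits in this sense and would
need a separate step:

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None if not a decimal digit
        out.append(str(value) if value is not None else ch)
    return "".join(out)


devanagari = "label\u0967\u0968"   # Devanagari digits one, two
arabic_indic = "label\u0667"       # Arabic-Indic digit seven

folded = [fold_digits(devanagari), fold_digits(arabic_indic)]
print(folded)
```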

So this issue is quite complex. But *the ISO* has provisions for prehandling
all of these.

Thanks for your question which is very pertinent.

As this is of interest to different lists, I forward the answer to several
of them, hoping not to start a long debate. All opinions about this are
legitimate requirements and should be able to be satisfied, but perhaps not
at the basic level. That is why the model used in *the ISO* is quite
general. The default is basic, but a considerable improvement over sorting
at the code level. For millions of people, this will mean finding
information much better and more reliably than computers have accustomed us to.

Alain LaBonté
Project Editor, ISO/IEC 14651


