From email@example.com Mon Feb 10 18:18:07 2003
From: Marc Wilhelm Kuester <firstname.lastname@example.org>
Subject: Sorting text
A few ideas on sorting
Marc Wilhelm Küster
A few loose terms:
sorting: a well-defined arrangement of two sortables using the
algorithm that is specified later on
sortable: a sequence of one or more orderables. The type of a
sortable is defined by the sequence of types of the orderables
orderable: a unit which has a well-defined ordering with regards to
other units of the same type
ordering: the arrangement of two units of the same type into a
Note: A special case of ordering is string ordering
Note: Two units can be considered to be of the same type if their
respective values have a comparison function that puts them into a
well-defined greater-than relationship.
Note: In reality, sortables will be read from some datasource. This
datasource can be a structured file (e.g. an XML-based file format),
a relational database, an object-oriented database or something else
that can deliver structured data.
Sorting vs. ordering
In usual English parlance sorting and ordering are roughly
equivalent terms. For the purpose of this text they have different
meanings, however: Ordering is the arrangement of basic units such
as strings or numbers into a well-defined order whereas sorting is
the arrangement of sequences of orderable units into a well-defined
An example for this distinction is the ordering of simple strings
such as abacus and abc in contrast to the sorting of phone book
entries where each entry consists of (e.g.) a family name, one or
more first names, a street name, a house number and finally a
The rules for sorting are culturally sensitive as well as
potentially subject to personal preferences. It is likewise
dependent on the type of sortable at hand and the field of
A document on this issue must define a sample syntax for a
unambiguous definition of the sorting rules within this framework.
The multikey algorithm
Any sorting problem can be solved by a the following algorithm,
usually known as the multikey algorithm. Two sortables can be
1. Taking the first orderable from both sortables
1a. If using "word-by-word ordering" for strings, split the
orderable along the desired split criterion (usually whitespace)
and follow this algorithm for each of the resulting parts
2. Apply any required preprocessing to each of the two orderables
3. Comparing the two resulting units. If they have a unique
ordering, than this is the sequence of the two sortables
Repeat this algorithm for all orderables until a unique sequence
could be found. If no sequence can be found, the sortables are
Preprocessing of orderables is for some types of orderables a
necessity in most fields of application.
For strings, preprocessing can be seen conceptionally as the
application of UNIX-style regular expressions to a string (though in
reality, it would rarely be defined in this manner).
Preprocessing is culturally sensitive. Rules for preprocessing are
traditionally defined in national sorting standards such as NF Z
44-001 for France, DIN 5007-2 for Germany and (to a lesser extend)
ANSI/NISO Z39.75 for the US.
A sample syntax must allow for the definition of (potentially
complex) preprocessing rules.
Each of the orderables needs a well-defined comparison method. For
some orderables such as natural numbers the ordering will rarely be
disputed and could hardly be considered to be culturally
sensitive. For other units such as strings the comparison method is
highly dependend on the culture in question (cf. ISO/IEC
14651). For yet other types such as complex numbers there is no
single accepted comparison method. Yet other types may have
monetary or other unit signs attached to them which will directly
influence the comparison (e.g. 1 cm compared to 1 inch or USD 1
compared to EUR 1).
A sample syntax must allow for the definition of ordering rules for
different types of orderables and potentially for different rules
for orderables of the same type when they appear in different
positions of the same sortable.
The TR must provide a sample syntax for the customized specification
of an internationalized sorting algorithm much in the same vein that
ISO/IEC 14651 provides a sample syntax for defining culturally
adapted ordering specifications.
This sample syntax will be XML based (and thus also be ISO 8879
conformant). It should be able to drive an application.
Preparation of a quality document
This paper sketches the basics for an internationalized sorting
algorithm. The details for the configuration and implementation of
such an algorithm are still open. In order to get a quality
document, an open-source reference sample application should be
developed in sync with the progress of this document.
The reference application should use object-oriented techniques
consistently to minimize the porting effort between OO-languages. C#
(ISO/IEC 23270) is suggested as a suitable language for the
Marc Wilhelm Küster
Tel.: (+49) / (0)7472 / 949 100
Fax: (+49) / (0)7472 / 949 114