SC22/WG20 N1014
From
[email protected] Mon Feb 10 18:18:07 2003
From:
Marc Wilhelm Kuester <[email protected]>
Subject:
Sorting text
--------------------------------------------------------------
A few ideas on sorting
�Marc Wilhelm K�ster
�A few loose terms:
� sorting: a well-defined arrangement of two
sortables using the
� algorithm that is specified later on
� sortable: a sequence of one or more
orderables. The type of a
� sortable is defined by the sequence of types
of the orderables
� orderable: a unit which has a well-defined
ordering with regards to
� other units of the same type
� ordering: the arrangement of two units of
the same type into a
� well-defined sequence
� Note: A special case of ordering is string
ordering
� Note: Two units can be considered to be of
the same type if their
� respective values have a comparison function
that puts them into a
� well-defined greater-than relationship.
� Note: In reality, sortables will be read
from some datasource. This
� datasource can be a structured file (e.g. an
XML-based file format),
� a relational database, an object-oriented
database or something else
� that can deliver structured data.
�Sorting vs. ordering
� In usual English parlance sorting and
ordering are roughly
� equivalent terms. For the purpose of this
text they have different
� meanings, however: Ordering is the
arrangement of basic units such
� as strings or numbers into a well-defined
order whereas sorting is
� the arrangement of sequences of orderable
units into a well-defined
� order.
� An example for this distinction is the
ordering of simple strings
� such as abacus and abc in contrast to the
sorting of phone book
� entries where each entry consists of (e.g.)
a family name, one or
� more first names, a street name, a house
number and finally a
� telephone number.
� The rules for sorting are culturally
sensitive as well as
� potentially subject to personal preferences.
It is likewise
� dependent on the type of sortable at hand
and the field of
� application.
� A document on this issue must define a
sample syntax for a
� unambiguous definition of the sorting rules
within this framework.
�The multikey algorithm
�
� Any sorting problem can be solved by a the
following algorithm,
� usually known as the multikey algorithm. Two
sortables can be
� compared by:
�� 1. Taking the first orderable from both
sortables
�� 1a. If using "word-by-word
ordering" for strings, split the
�� orderable along the desired split criterion
(usually whitespace)
�� and follow this algorithm for each of the
resulting parts
�� 2. Apply any required preprocessing to each
of the two orderables
�� 3. Comparing the two resulting units. If
they have a unique
����� ordering, than this is the sequence of
the two sortables
� Repeat this algorithm for all orderables
until a unique sequence
� could be found. If no sequence can be found,
the sortables are
� considered equivalent.
�Preprocessing
� Preprocessing of orderables is for some
types of orderables a
� necessity in most fields of application.
� For strings, preprocessing can be seen
conceptionally as the
� application of UNIX-style regular
expressions to a string (though in
� reality, it would rarely be defined in this
manner).
� Preprocessing is culturally sensitive. Rules
for preprocessing are
� traditionally defined in national sorting
standards such as NF Z
� 44-001 for France, DIN 5007-2 for Germany
and (to a lesser extend)
� ANSI/NISO Z39.75 for the US.
� A sample syntax must allow for the
definition of (potentially
� complex) preprocessing rules.
�Ordering
�� Each of the orderables needs a well-defined
comparison method. For
�� some orderables such as natural numbers the
ordering will rarely be
�� disputed and could hardly be considered to
be culturally
�� sensitive. For other units such as strings
the comparison method is
�� highly dependend on the culture in question
(cf. ISO/IEC
�� 14651). For yet other types such as complex
numbers there is no
�� single accepted comparison method. Yet
other types may have
�� monetary or other unit signs attached to
them which will directly
�� influence the comparison (e.g. 1 cm
compared to 1 inch or USD 1
�� compared to EUR 1).
�� A sample syntax must allow for the
definition of ordering rules for
�� different types of orderables and
potentially for different rules
�� for orderables of the same type when they
appear in different
�� positions of the same sortable.
�Syntax
� The TR must provide a sample syntax for the
customized specification
� of an internationalized sorting algorithm
much in the same vein that
� ISO/IEC 14651 provides a sample syntax for
defining culturally
� adapted ordering specifications.
� This sample syntax will be XML based (and
thus also be ISO 8879
� conformant). It should be able to drive an
application.
�Preparation of a quality document
� This paper sketches the basics for an
internationalized sorting
� algorithm. The details for the configuration
and implementation of
� such an algorithm are still open. In order
to get a quality
� document, an open-source reference sample
application should be
� developed in sync with the progress of this
document.
�Sample application
� The reference application should use object-oriented
techniques
� consistently to minimize the porting effort
between OO-languages. C#
� (ISO/IEC 23270) is suggested as a suitable
language for the
� reference implementation.
------------------------------------------------------------------------------
*************************
Marc
Wilhelm K�ster
Saphor
GmbH
Fronl�nder 22
D-72072
T�bingen
Tel.:
(+49) / (0)7472 / 949 100
Fax:
(+49) / (0)7472 / 949 114
E-Mail:
[email protected]
Web:
http://www.saphor.net