SC22/WG20 N1014
 
From
kuester@saphor.de Mon Feb 10 18:18:07 2003
From:
Marc Wilhelm Kuester <kuester@saphor.de>
Subject:
Sorting text
 
--------------------------------------------------------------
 
A few ideas on sorting
 
 Marc Wilhelm Küster
 
 A few loose terms:
 
  sorting: a well-defined arrangement of two
sortables using the
  algorithm that is specified later on 
 
  sortable: a sequence of one or more
orderables. The type of a
  sortable is defined by the sequence of types
of the orderables
 
  orderable: a unit which has a well-defined
ordering with regards to
  other units of the same type
 
  ordering: the arrangement of two units of
the same type into a
  well-defined sequence
 
  Note: A special case of ordering is string
ordering
 
  Note: Two units can be considered to be of
the same type if their
  respective values have a comparison function
that puts them into a
  well-defined greater-than relationship.
 
  Note: In reality, sortables will be read
from some datasource. This
  datasource can be a structured file (e.g. an
XML-based file format),
  a relational database, an object-oriented
database or something else
  that can deliver structured data.
 
 Sorting vs. ordering
 
  In usual English parlance sorting and
ordering are roughly
  equivalent terms. For the purpose of this
text they have different
  meanings, however: Ordering is the
arrangement of basic units such
  as strings or numbers into a well-defined
order whereas sorting is
  the arrangement of sequences of orderable
units into a well-defined
  order. 
 
  An example for this distinction is the
ordering of simple strings
  such as abacus and abc in contrast to the
sorting of phone book
  entries where each entry consists of (e.g.)
a family name, one or
  more first names, a street name, a house
number and finally a
  telephone number.
 
  The rules for sorting are culturally
sensitive as well as
  potentially subject to personal preferences.
It is likewise
  dependent on the type of sortable at hand
and the field of
  application. 
 
  A document on this issue must define a
sample syntax for a
  unambiguous definition of the sorting rules
within this framework.
 
 The multikey algorithm
  
  Any sorting problem can be solved by a the
following algorithm,
  usually known as the multikey algorithm. Two
sortables can be
  compared by:
 
   1. Taking the first orderable from both
sortables
 
   1a. If using "word-by-word
ordering" for strings, split the
   orderable along the desired split criterion
(usually whitespace)
   and follow this algorithm for each of the
resulting parts
 
   2. Apply any required preprocessing to each
of the two orderables
 
   3. Comparing the two resulting units. If
they have a unique
      ordering, than this is the sequence of
the two sortables
 
  Repeat this algorithm for all orderables
until a unique sequence
  could be found. If no sequence can be found,
the sortables are
  considered equivalent.
 
 Preprocessing
 
  Preprocessing of orderables is for some
types of orderables a
  necessity in most fields of application. 
 
  For strings, preprocessing can be seen
conceptionally as the
  application of UNIX-style regular
expressions to a string (though in
  reality, it would rarely be defined in this
manner).
 
  Preprocessing is culturally sensitive. Rules
for preprocessing are
  traditionally defined in national sorting
standards such as NF Z
  44-001 for France, DIN 5007-2 for Germany
and (to a lesser extend)
  ANSI/NISO Z39.75 for the US.
 
  A sample syntax must allow for the
definition of (potentially
  complex) preprocessing rules.
 
 Ordering
 
   Each of the orderables needs a well-defined
comparison method. For
   some orderables such as natural numbers the
ordering will rarely be
   disputed and could hardly be considered to
be culturally
   sensitive. For other units such as strings
the comparison method is
   highly dependend on the culture in question
(cf. ISO/IEC
   14651). For yet other types such as complex
numbers there is no
   single accepted comparison method. Yet
other types may have
   monetary or other unit signs attached to
them which will directly
   influence the comparison (e.g. 1 cm
compared to 1 inch or USD 1
   compared to EUR 1).
 
   A sample syntax must allow for the
definition of ordering rules for
   different types of orderables and
potentially for different rules
   for orderables of the same type when they
appear in different
   positions of the same sortable.
 
 Syntax
 
  The TR must provide a sample syntax for the
customized specification
  of an internationalized sorting algorithm
much in the same vein that
  ISO/IEC 14651 provides a sample syntax for
defining culturally
  adapted ordering specifications.
 
  This sample syntax will be XML based (and
thus also be ISO 8879
  conformant). It should be able to drive an
application.
 
 Preparation of a quality document
 
  This paper sketches the basics for an
internationalized sorting
  algorithm. The details for the configuration
and implementation of
  such an algorithm are still open. In order
to get a quality
  document, an open-source reference sample
application should be
  developed in sync with the progress of this
document.
 
 Sample application
 
  The reference application should use object-oriented
techniques
  consistently to minimize the porting effort
between OO-languages. C#
  (ISO/IEC 23270) is suggested as a suitable
language for the
  reference implementation.
 
------------------------------------------------------------------------------
*************************
Marc
Wilhelm Küster 
Saphor
GmbH
 
Fronländer 22
D-72072
Tübingen
 
Tel.:
(+49) / (0)7472 / 949 100
Fax:
(+49) / (0)7472 / 949 114
 
E-Mail:
kuester@saphor.net
Web:
http://www.saphor.net