Re: Sorting tags

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Fri Jun 20 1997 - 16:41:06 EDT


Alain LaBonté - SCT wrote on 1997-06-20 13:33 UTC:
> At 07:34 97-06-20 -0400, Winkler, Arnold F wrote:
> >I am sure you have been following the discussion on the Unicode list
> >about language tags. One of the reasons for tagging languages in plain
> >text files is always the need for culturally correct sorting.
> >
> Sorting is always on a set of records, according to user expectations.
> Therefore *for sorting purposes* the language tag (for one given script)
> has to be overridden by the user's language expectations which are not
> necessarily the same as the language tagged in a record or in a sub-record
> (or field), which is of course, imho, necessary for other purposes.

A more general remark about sorting and Unicode:

One thing that I think would be a nice addition to the work on sorting
Unicode is to specify, somewhere in the Unicode code space, a number of
sorting control characters that direct some preprocessing of strings
before they actually enter the sorting algorithm.

A simple example: in a list of names, the most significant letter for
sorting is the first letter of the surname, which is not always the
first letter of the string. In many libraries it is common practice for
the librarian who processes a newly purchased book to mark the first
letter of the surname with a pencil, so that the book is always sorted
the same way afterwards.

Let (1) be this marker. We could then store a list of names in a
database like this:

  John (1)Smith
  Mr. Joseph E. (1)Miller-Rubin, M.D.
  etc.
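
As a rough sketch of how a sort routine might use such a marker, here is
some Python. The literal string "(1)" only stands in for a real control
character, and the names MARKER, sort_key and display, as well as the
third sample name, are made up for illustration:

```python
MARKER = "(1)"  # hypothetical stand-in for a sorting control character


def sort_key(name: str) -> str:
    """Return the text from the marker onward, case-folded for comparison."""
    _, found, tail = name.partition(MARKER)
    # Assumption: a name without a marker is sorted on the whole string.
    return (tail if found else name).casefold()


def display(name: str) -> str:
    """Strip the marker so the stored name can be shown to the user."""
    return name.replace(MARKER, "")


names = [
    "John (1)Smith",
    "Mr. Joseph E. (1)Miller-Rubin, M.D.",
    "Zoe (1)Adams",  # invented entry, not from the original list
]

sorted_names = sorted(names, key=sort_key)
for entry in sorted_names:
    print(display(entry))
```

The stored string stays in its natural reading order; only the key
derived from it changes.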

There might even be a need for more sophisticated reordering markers
that transform

  (3)Mr. (2)Joseph E. (1)Miller-Rubin, (4)M.D.

into

  Miller-Rubin, Joseph E. Mr. M.D.
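
A minimal sketch of this transformation in Python, again with "(n)"
standing in for real control characters. The function name reorder is
hypothetical, and placing unmarked leading text last is my own
assumption, not part of any standard:

```python
import re

# Hypothetical textual form of a numbered reordering marker: "(1)", "(2)", ...
MARKER_RE = re.compile(r"\((\d+)\)")


def reorder(text: str) -> str:
    """Rearrange marked segments in ascending order of their marker number."""
    # With one capturing group, split() yields:
    # [leading text, number, segment, number, segment, ...]
    parts = MARKER_RE.split(text)
    leading = parts[0].strip()
    segments = [
        (int(parts[i]), parts[i + 1].strip())
        for i in range(1, len(parts), 2)
    ]
    segments.sort(key=lambda pair: pair[0])
    ordered = [segment for _, segment in segments if segment]
    # Assumption: text before the first marker is least significant.
    if leading:
        ordered.append(leading)
    return " ".join(ordered)


print(reorder("(3)Mr. (2)Joseph E. (1)Miller-Rubin, (4)M.D."))
```

Under the same assumption, "John (1)Smith" would become "Smith John".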

I seem to remember once seeing a German DIN standard for
bibliographical control characters. This character set contained lots of
combining diacritical marks for bibliographic purposes, but it also
contained around a dozen sorting preprocessing control characters.
Unfortunately, I can't look up the details right now.

Sorting control characters could be available for

  - suppression of substrings for sorting
  - replacing substrings for sorting (as in "Markus <G.|Guenther> Kuhn",
    with <|> being sorting replacement control characters)
  - hints for correct sorting of digit sequences
  - language tagging
  - various substring reordering mechanisms
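
To make the replacement and digit-sequence items a bit more concrete,
here is a small Python sketch. The characters "<", "|" and ">" stand in
for real control characters as in the example above. Treating
suppression as a replacement with an empty sort text is my assumption,
and all function names and the volume titles are invented:

```python
import re

# "<shown|sorted>" : hypothetical textual form of the replacement controls.
REPLACE_RE = re.compile(r"<([^|<>]*)\|([^|<>]*)>")


def display_form(text: str) -> str:
    """Keep the part before '|', which is what the reader should see."""
    return REPLACE_RE.sub(r"\1", text)


def sort_form(text: str) -> str:
    """Keep the part after '|', which is what the sort should see."""
    return REPLACE_RE.sub(r"\2", text)


def digit_aware_key(text: str) -> list:
    """Compare digit runs by numeric value, so "9" sorts before "10"."""
    key = []
    for digits, other in re.findall(r"([0-9]+)|([^0-9]+)", text):
        if digits:
            key.append((1, int(digits), ""))
        else:
            key.append((0, 0, other.casefold()))
    return key


print(display_form("Markus <G.|Guenther> Kuhn"))  # Markus G. Kuhn
print(sort_form("Markus <G.|Guenther> Kuhn"))     # Markus Guenther Kuhn
print(sort_form("<Mr. |>John Smith"))             # John Smith (suppression)
print(sorted(["Vol. 10", "Vol. 9", "Vol. 2"], key=digit_aware_key))
```

A real design would of course need dedicated code points instead of
printable brackets, so that ordinary text containing "<" is not
misread.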

Especially in bibliographical applications, where you can't simply split
names into just the Western firstname/lastname scheme, I guess that many
system designers will reinvent these sorting markers. It might therefore
be interesting to discuss whether this technique is mature enough to be
standardized, to increase interoperability.

Does anyone know the DIN standard I was talking about or similar systems?

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu


