Re: Sorting tags

From: 10646er@sesame.demon.co.uk
Date: Mon Jul 07 1997 - 07:38:55 EDT

Next message: Misha Wolf: "Re: Usage of CP1252 characters on www.msnbc.com"
Previous message: Dan: "Re: Latin00 (was Re: MES as an ISO standard?)"
Maybe in reply to: Markus G. Kuhn: "Re: Sorting tags"

Re: Sorting tags - from John Clews

NB: to avoid reinventing the wheel, especially if reinvented in a
different way, please note that for three decades librarians have
been tagging such data to a similar degree of complexity to that
described below.

Check any nation's MARC manual, and you will find this sort of tagging
prescribed. No doubt by now there is also a wide range of MARC-to-SGML
and MARC-to-HTML converters and the like.

Best wishes

John Clews

For information, the original text is quoted below:

Alain LaBonté on 1997-06-20 13:33 UTC, quoting
Winkler, Arnold F wrote:
> I am sure you have been following the discussion on the Unicode list
> about language tags. One of the reasons for tagging languages in plain
> text files is always the need for culturally correct sorting.
>
> Sorting is always on a set of records, according to user expectations.
> Therefore *for sorting purposes* the language tag (for one given script)
> has to be overridden by the user's language expectations which are not
> necessarily the same as the language tagged in a record or in a sub-record
> (or field), which is of course, imho, necessary for other purposes.

In message <9706202040.AA13606@unicode.org> kuhn@cs.purdue.edu
("Markus G. Kuhn") writes:
>
A more general remark about sorting and Unicode:

One thing that I think would be a nice addition to the work on sorting
Unicode is to specify somewhere in the Unicode code space a number of
sorting control characters, that direct some preprocessing on strings
before they actually enter the sorting algorithm.

A simple example: in a list of names, the most significant letter is the
first letter of the surname, which is not always the first letter of the
string. In many libraries it is common practice for the librarian who
processes a newly purchased book to mark the first letter of the surname
with a pencil, so that this book is later always sorted in the same way.

Let (1) be this marker; we could then store a list of names in a database
like

  John (1)Smith
  Mr. Joseph E. (1)Miller-Rubin, M.D.
  etc.
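
A minimal Python sketch of how such a marker could feed a sort key. It
assumes the literal string "(1)" stands in for the proposed control
character; the names MARKER and sort_key are hypothetical and not part of
any standard.

```python
MARKER = "(1)"  # stand-in for the proposed sorting control character


def sort_key(name: str) -> str:
    """Return the text from the marker onward, or the whole string if unmarked."""
    _, found, tail = name.partition(MARKER)
    # casefold() gives a simple case-insensitive comparison; a real system
    # would apply a language-aware collation at this point instead.
    return (tail if found else name).casefold()


names = [
    "John (1)Smith",
    "Mr. Joseph E. (1)Miller-Rubin, M.D.",
]
ordered = sorted(names, key=sort_key)
```

With these two entries, the Miller-Rubin record sorts ahead of the Smith
record, because comparison starts at the marked surname rather than at
"John" or "Mr.".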

There might even be a need for more sophisticated reordering markers
that transform

(3)Mr. (2)Joseph E. (1)Miller-Rubin, (4)M.D.

into

Miller-Rubin, Joseph E. Mr. M.D.
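
The transformation above can be sketched in Python as follows. This is
only an illustration under assumptions: the markers are taken to be the
literal strings "(1)", "(2)", and so on, each segment runs up to the next
marker, and any text before the first marker is ignored. The function name
reorder is hypothetical.

```python
import re

# A marker is a decimal rank in parentheses, e.g. "(3)".
_MARK = re.compile(r"\((\d+)\)")


def reorder(marked: str) -> str:
    """Rearrange marked segments into ascending order of their rank."""
    # With one capture group, split() yields
    # [leading text, rank, segment, rank, segment, ...].
    parts = _MARK.split(marked)
    segments = [
        (int(parts[i]), parts[i + 1].strip())
        for i in range(1, len(parts), 2)
    ]
    segments.sort(key=lambda pair: pair[0])  # stable sort on rank only
    return " ".join(text for _, text in segments if text)


result = reorder("(3)Mr. (2)Joseph E. (1)Miller-Rubin, (4)M.D.")
```

Here result holds the reordered form shown above, which could then be
passed to the ordinary sorting algorithm.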

I seem to remember once seeing a German DIN standard for
bibliographical control characters. This character set contained lots of
combining diacritical marks for bibliographic purposes, but it also
contained around a dozen sorting preprocessing control characters.
Unfortunately, I can't look up the details right now.

Sorting control characters could be available for

  - suppression of substrings for sorting
  - replacing substrings for sorting (like in "Markus <G.|Guenther> Kuhn"
    with <|> being sorting replacement control characters)
  - hints for correct sorting of digit sequences
  - language tagging
  - various substring reordering mechanisms
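
Two of the items in this list, substring replacement and digit-sequence
hints, can be sketched in Python. The sketch assumes a notation of the form
<shown|sorted>, where the text before the bar is displayed and the text
after it is used for sorting; an empty second half then gives suppression.
The function names and the "Vol." strings are hypothetical illustrations,
not taken from any standard.

```python
import re

# "<shown|sorted>": the first half is displayed, the second half is sorted on.
_REPLACE = re.compile(r"<([^|<>]*)\|([^<>]*)>")
_DIGITS = re.compile(r"(\d+)")


def display_form(s: str) -> str:
    """Keep the displayed half of each replacement group."""
    return _REPLACE.sub(r"\1", s)


def sort_form(s: str) -> str:
    """Keep the sorting half; an empty half suppresses the substring."""
    return _REPLACE.sub(r"\2", s)


def digit_aware_key(s: str) -> list:
    """Compare digit runs by numeric value rather than character by character."""
    # split() with a capture group alternates text (even index) and
    # digit runs (odd index), so element types line up between keys.
    parts = _DIGITS.split(sort_form(s))
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


shown = display_form("Markus <G.|Guenther> Kuhn")
keyed = sort_form("Markus <G.|Guenther> Kuhn")
volumes = sorted(["Vol. 10", "Vol. 9", "Vol. 2"], key=digit_aware_key)
```

In this sketch the displayed name keeps the initial "G." while the sort
form spells out "Guenther", and "Vol. 10" sorts after "Vol. 9" instead of
before "Vol. 2".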

Especially in bibliographical applications, where you can't simply split
names into just the Western firstname/lastname scheme, I guess that many
system designers will reinvent these sorting markers, so it might be
interesting to discuss whether this technique is mature enough to be
standardized to increase interoperability.

Does anyone know the DIN standard I was talking about or similar systems?

Markus

> --
> Markus G. Kuhn, Computer Science grad student, Purdue
> University, Indiana, USA -- email: kuhn@cs.purdue.edu

-- John Clews (Chair of ISO/TC46/SC2: Conversion of Written Languages)

SESAME Computer Projects, 8 Avenue Road, Harrogate, HG2 7PG, England Email: Converse@sesame.demon.co.uk; tel: +44 (0) 1423 888 432


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:35 EDT