Re: Sorting tags

From: Kenneth Whistler (
Date: Fri Jun 20 1997 - 20:02:19 EDT

> Kaihsu Tai wrote on 1997-06-20 22:11 UTC:
> > As Alain said once, different parts of names should be stored in different
> > _fields_, instead of putting these burdens onto the character set
> > encoding. They should be stored as something like
> >
> > {
> > surname=Smith
> > givenname=John
> > }
> >
> > {
> > surname=Miller-Rubin
> > middleinitial=E.
> > givenname=Joseph
> > academicdegree=M.D.
> > }

Markus Kuhn replied:

> I thought it was one of the big lessons learned from the X.400 O/R
> names, that this does not scale well, as the concept of having
> a surname, a middle initial, and a givenname is something that works only
> in the U.S. and in around 10 other countries. Only a flat linear
> sequence of characters (as it was introduced in the X.500 CN attribute)
> is really capable of representing human names adequately for all
> cultures.

Only if the requirement is to flatten them all out into a linear
sequence of characters. This is the least common denominator of
names (whether personal or other). It does not prevent appropriate,
locale-specific handling of names as fielded data. The one thing which is
clearly stupid is to try to define a universal fielded name format
and then attempt to stuff everybody's local conventions into it.

> Realizing that, adding some sorting control characters
> to this single common flat name string field suddenly makes a lot
> of sense to me.

I consider this an inappropriate use of the character encoding. This
is another attempt to make the character encoding solve some computational
problem which should be dealt with in other ways.

> I still have to see a data structure like you suggest that really
> covers *all* the various human naming schemes that are used on this
> planet adequately. A structured person name will probably be something
> like a five page long dense SGML DTD which no sane database designer
> is going to implement in separate fields.

This is exactly the wrong way to approach this. A far better way
is to take an object-oriented approach, subclassing the behaviors
on a Name object according to local preferences and practices.
Java has wonderful mechanisms for attaching locale-specific behavior
to objects, for example. The base class would simply have the name
expressed as a string--the least common denominator. An appropriate
hierarchy of subclasses could then implement the correct local
behavior in terms of fielding of names and the interpretation and
collation behavior of the fields.
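
A minimal sketch of that idea in Java, under stated assumptions: the
class names Name, GivenSurnameName, and NameSorter and the method
sortKey() are hypothetical, invented here for illustration. Only
java.text.Collator and java.util.Locale are real library classes. The
base class holds nothing but the flat string; one subclass adds fields
and overrides the sort key for a "given name + surname" convention.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Base class: the name is only a flat string, the least common denominator.
class Name {
    private final String display;

    Name(String display) {
        this.display = display;
    }

    String display() {
        return display;
    }

    // With no knowledge of fields, the whole string serves as the sort key.
    String sortKey() {
        return display;
    }
}

// Hypothetical subclass for one local convention: given name, then surname.
class GivenSurnameName extends Name {
    private final String given;
    private final String surname;

    GivenSurnameName(String given, String surname) {
        super(given + " " + surname);
        this.given = given;
        this.surname = surname;
    }

    // This convention sorts on the surname first, then the given name.
    @Override
    String sortKey() {
        return surname + ", " + given;
    }
}

class NameSorter {
    // Returns a sorted copy; each name supplies its own key and the
    // locale's Collator supplies the comparison rules.
    static List<Name> sorted(List<Name> names, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<Name> copy = new ArrayList<>(names);
        copy.sort((a, b) -> collator.compare(a.sortKey(), b.sortKey()));
        return copy;
    }
}
```

The stored string never carries sorting controls: a flat Name and a
fielded GivenSurnameName can sit in the same list, and the knowledge
of where the sort should start lives in the subclass.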

> Another good example of a naming structure so diverse that only
> an unstructured single field (according to ISO 11180 preferably 30x6
> chars large) with added markup can handle it is postal addresses from
> all over the world.

Yep, and addresses are another prime candidate for an object-oriented
approach that allows local behavior to accrete to subclasses of a
base class PostalAddress.
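
A comparable sketch for addresses, again under stated assumptions: only
the name PostalAddress comes from the paragraph above, while
UsStyleAddress, lines(), and postalCode() are hypothetical names chosen
for illustration, and the address format shown is simplified. The base
class stores free-form lines and knows nothing else; the subclass
knows which part is the postal code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Base class: an address is an ordered list of free-form lines.
class PostalAddress {
    private final List<String> lines;

    PostalAddress(List<String> lines) {
        this.lines = Collections.unmodifiableList(new ArrayList<>(lines));
    }

    List<String> lines() {
        return lines;
    }

    // The base class cannot tell which part, if any, is a postal code.
    Optional<String> postalCode() {
        return Optional.empty();
    }
}

// Hypothetical subclass for one local layout (simplified).
class UsStyleAddress extends PostalAddress {
    private final String zip;

    UsStyleAddress(String recipient, String street,
                   String city, String state, String zip) {
        super(Arrays.asList(recipient, street, city + ", " + state + " " + zip));
        this.zip = zip;
    }

    // Local knowledge: this layout carries its postal code as a field.
    @Override
    Optional<String> postalCode() {
        return Optional.of(zip);
    }
}
```

Code that only prints labels can work against PostalAddress.lines();
code that sorts mail by postal code can ask postalCode() and fall back
when the answer is empty.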

> > Argh, is this supposed to be in a "character set encoding"?
> I am not suggesting to make this part of the Unicode 3.0 standard itself,
> but reserving a few code positions for this purpose and defining some
> sorting control characters in another ISO standard won't hurt too
> much, right?

Wrong. Reserving positions in the standard for characters with such
behavior is tantamount to encoding them in the standard. If the definition
of sorting controls belongs in other standards (ISO or otherwise), that
is where it belongs--*not* in the universal character encoding standard.


> We could even assign suggested glyphs to these sorting
> control characters that will only be displayed when you edit a
> name but that are normally made invisible by software that just
> displays sorted strings. Glyphs just like this small hook that librarians
> make with a pencil in front of the first letter of the first author's
> surname on the title page.
> Markus
> --
> Markus G. Kuhn, Computer Science grad student, Purdue
> University, Indiana, USA -- email:
