Re: Sorting tags

From: Glen Perkins (
Date: Fri Jun 20 1997 - 21:04:22 EDT

Unicode Discussion wrote:
> Kaihsu Tai wrote on 1997-06-20 22:11 UTC:
> > As Alain said once, different parts of names should be stored in different
> > _fields_, instead of putting these burdens onto the character set
> > encoding. They should be stored as something like
> >
> > {
> > surname=Smith
> > givenname=John
> > }
> >
> > {
> > surname=Miller-Rubin
> > middleinitial=E.
> > givenname=Joseph
> > academicdegree=M.D.
> > }
> I thought it was one of the big lessons learned from the X.400 O/R
> names, that this does not scale well, as the concept of having
> a surname, a middle initial, and a givenname is something that works only
> in the U.S. and in around 10 other countries. Only a flat linear
> sequence of characters (as it was introduced in the X.500 CN attribute)
> is really capable of representing human names adequately for all
> cultures. Realizing that, adding some sorting control characters
> to this single common flat name string field suddenly makes a lot
> of sense to me.

I agree that only a flat linear sequence of characters is sufficient to
handle all names--except perhaps for the name formerly written as
"Prince". ;-) Breaking them up into "surname", "given name", etc.
doesn't scale well worldwide.

This is just an argument for having a "Whole name" field, though,
instead of trying to dissect the name and reassemble it algorithmically.

Now, with the whole name in a single field, you might also want to
identify portions of that name for collation purposes. If you can mark a
section of the name with control chars, you could also simply create
another field that contained the offsets of the character positions that
you would have marked. No need to define new characters. It's just
integer data in a custom data field.
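
A minimal sketch of the offset idea, assuming a record that keeps the whole name untouched plus a pair of integer offsets. The names NameRecord, sort_span, and sort_key are hypothetical, and casefold() only stands in for a real locale-aware collation:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NameRecord:
    whole_name: str              # the flat, linear name exactly as written
    sort_span: Tuple[int, int]   # (start, end) offsets of the part to collate on first

    def sort_key(self) -> Tuple[str, str]:
        # Slice out the marked portion; fall back to the whole name to break ties.
        start, end = self.sort_span
        primary = self.whole_name[start:end]
        return (primary.casefold(), self.whole_name.casefold())


records = [
    NameRecord("Glen C. Perkins", (8, 15)),
    NameRecord("Kaihsu Tai", (7, 10)),
    NameRecord("Joseph E. Miller-Rubin", (10, 22)),
]

# Sort on the marked portion, but show the unmodified whole names.
ordered = [r.whole_name for r in sorted(records, key=NameRecord.sort_key)]
print(ordered)
```

The name string itself carries no markup here; everything a control character would have said lives in the integer field next to it.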

I don't usually find storage space so tight that I need to use this
offset method, though. I usually just create additional fields
containing whatever I need, pre-built. One such field could contain the
name, pre-structured for collation. In other words, it's already in the
form that your first collation pass would produce when it parsed your
list of names marked up with collation control chars. Users don't
usually see the contents of this field. If the names are listed in a way
that is visible to the user, the *whole names* are displayed in a list
that was sorted on the collation field.
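
The pre-built collation field can be sketched roughly as follows. The field names whole_name and collation, and the particular form of the collation strings, are assumptions made up for illustration; a plain string comparison stands in for whatever collation the application actually uses:

```python
# Each row stores the whole name plus a collation string entered ahead of time.
rows = [
    {"whole_name": "Glen C. Perkins", "collation": "perkins glen c"},
    {"whole_name": "Kaihsu Tai", "collation": "tai kaihsu"},
    {"whole_name": "Joseph E. Miller-Rubin", "collation": "miller-rubin joseph e"},
]


def display_list(rows):
    # Sort on the hidden collation field, but return only the whole names.
    ordered = sorted(rows, key=lambda row: row["collation"])
    return [row["whole_name"] for row in ordered]


listing = display_list(rows)
print(listing)
```

No parsing happens at sort time; the cost is the extra storage for the second field.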

Besides collation, you might also need to understand the structure of a
name for purposes of creating a salutation, so that you can get "Dear
Mr. Perkins" on your junk mail instead of "Dear Mr. Glen C. Perkins...."
I'll also usually implement this with a field called "Salutation" that
will have a pre-built "Mr. Perkins" so that the salutation doesn't have
to be calculated from other fields. It *could* be generated on the fly,
though, using offset data. That would work as well as a name marked up
with special chars, but not as well as a pre-built salutation. Either
way, whatever effort would be needed to mark up your whole name with
special chars indicating which part was to be used in a salutation could
be used to enter either a pre-built salutation or offset data in a
separate field.
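
Both salutation variants might look something like this sketch. The field names salutation, salutation_span, and honorific are hypothetical, as is the rule of preferring the pre-built value when it is present:

```python
def salutation(record: dict) -> str:
    # Prefer the pre-built field: nothing has to be calculated.
    prebuilt = record.get("salutation")
    if prebuilt:
        return f"Dear {prebuilt}"
    # Otherwise generate it on the fly from offsets into the whole name.
    start, end = record["salutation_span"]
    return f"Dear {record['honorific']} {record['whole_name'][start:end]}"


prebuilt_record = {
    "whole_name": "Glen C. Perkins",
    "salutation": "Mr. Perkins",
}
offset_record = {
    "whole_name": "Glen C. Perkins",
    "honorific": "Mr.",
    "salutation_span": (8, 15),
}

print(salutation(prebuilt_record))
print(salutation(offset_record))
```

Either record takes the same data-entry effort as marking up the name with special characters would.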

Beyond collation and salutations, there are other things you might want
to do with names. Rather than assigning new characters for all of these
purposes, you could just create new fields in your database and either
enter offset data from the whole name or enter the actual characters
(faster, but takes more space). Very flexible, no need to wait for
political bodies to respond to your needs, no waste of code points that
could be used for actual characters.

__Glen Perkins__
Java CJK Developer
