Re: Encoding of personal names in official databases

From: Paul Keinanen (keinanen@sci.fi)
Date: Tue Mar 30 1999 - 14:32:13 EST


At 03:15 30.3.1999 -0800, Trond Trosterud wrote:
>Within the next month, I am going to write a memo to the Norwegian dept. of
>justice to comment upon the planned revision of the Norwegian laws for
>personal names. The goal of the revision is to allow other naming practices
>than the Norwegian one, due to a culturally more heterogenous population.
>
>My input will deal with the encoding of the names.
>
>Today, the official Norwegian population registry is coded with ascii,
>enriched with the Norwegian letters on the ascii positions [\]{|}
>(I guess the same solution is in use in Denmark, Sweden and Finland as
>well, but with äö for æø).
>
>My suggestion will be that they abandon their 7-bit systems and move to...
>
>and here I need your advice.
>
>In Norway, Sámi citizens use Sámi names, the diacritics (ACUTE ACCENT,
>CARON, HOOK, STROKE) are just stripped off in the registry. We have large
>amounts of Finns and Swedes, their äö are replaced with æø. Immigrants from
>other countries bring their letters (and alphabets) with them. A natural
>answer to this is of course: Use the UCS. But the bases are huge: Every
>single citizen is included.

I do not know how much data you store about each citizen :-), but even with
one or two kilobytes for each person, the whole population of Norway or
Finland could be stored on nearly every PC sold today (each having 5-10 GB
of disk storage). A RAID system of multiple disks would be needed for
reliability anyway, so I do not think the size or storage format of names is
of any real concern. The situation was of course different in the punched
card era with 72 or 80 column cards :-).
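
To put rough numbers on this, here is a minimal sketch in Python. The
population figures are approximate late-1990s values that I am assuming for
illustration; they are not taken from any registry.

```python
# Rough storage estimate. Population figures are approximate late-1990s
# values assumed for illustration only.
POPULATION = {"Norway": 4_400_000, "Finland": 5_200_000}


def registry_size_gb(people: int, bytes_per_person: int) -> float:
    """Return the raw size of a registry in gigabytes (10**9 bytes)."""
    return people * bytes_per_person / 1e9


for country, people in POPULATION.items():
    for kilobytes in (1, 2):  # "one or two kilobytes for each person"
        size = registry_size_gb(people, kilobytes * 1024)
        print(f"{country}, {kilobytes} KB per person: {size:.1f} GB")
```

Under these assumptions the totals come out at roughly 4.5 to 10.6 GB, the
same order of magnitude as a single PC disk.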

As Markus Kuhn already pointed out, a well defined subset of UCS (such as
MES) would make sense. Then there is the problem of legacy applications
running on either the 7 bit code or Latin-1. A well defined fall-back policy
is required: while some conversions can be exact, others may require some
fallback (as when converting to 7 bit).
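
One possible shape for such a fall-back policy is sketched below in Python.
The exact table is the Norwegian 7 bit arrangement (national letters on the
positions [\]{|}); the FALLBACK table and the "?" replacement are purely my
assumptions, since the real choices would be a policy decision. The sketch
also assumes that names never contain the ASCII bracket characters
themselves.

```python
import unicodedata
from typing import Tuple

# Norwegian 7 bit variant: these letters sit on the ASCII positions [\]{|}.
EXACT = {"Æ": "[", "Ø": "\\", "Å": "]", "æ": "{", "ø": "|", "å": "}"}

# Hypothetical fallbacks for letters that have no canonical decomposition
# (stroke and eng letters). The real choices are a policy decision.
FALLBACK = {"Đ": "D", "đ": "d", "Ŧ": "T", "ŧ": "t", "Ŋ": "N", "ŋ": "n"}


def to_legacy(name: str) -> Tuple[str, bool]:
    """Return (7 bit name, exact); exact is False if any fallback was used."""
    out = []
    exact = True
    for ch in name:
        if ch in EXACT:
            out.append(EXACT[ch])      # exact: same letter, 7 bit position
        elif ord(ch) < 0x80:
            out.append(ch)             # plain ASCII passes through unchanged
        else:
            exact = False
            ch = FALLBACK.get(ch, ch)
            # Decompose and drop combining marks, e.g. "á" -> "a", "Š" -> "S".
            decomposed = unicodedata.normalize("NFD", ch)
            base = "".join(c for c in decomposed
                           if not unicodedata.combining(c))
            out.append(base if base and base.isascii() else "?")
    return "".join(out), exact


print(to_legacy("Åse"))         # (']se', True)
print(to_legacy("Máret Sárá"))  # ('Maret Sara', False)
```

Returning the "exact" flag alongside the converted name lets the caller
record whether information was lost in the conversion.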

It might even make sense to store both the UCS form and the 7 bit legacy
form in the data base. When the existing data bases are converted from 7 bit
to UCS, the typical Norwegian names are converted correctly, but the
situation is more complex with names that had previously been manually
transcoded, say from ä to æ. In these cases the automatic conversion
would be wrong, and apparently some more or less manual methods are required
to rectify these problems.
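
The ambiguity can be illustrated with a small Python sketch. The decoding
table is the Norwegian 7 bit arrangement; the example name and the idea of
flagging every record that uses one of the six national positions are my
assumptions, and such a flag only narrows down the candidates for manual
review, it cannot decide which reading is right.

```python
# Norwegian 7 bit variant: positions [\]{|} carry the letters ÆØÅæøå.
LEGACY_TO_UCS = str.maketrans("[\\]{|}", "ÆØÅæøå")


def from_legacy(name7: str) -> str:
    """Decode a 7 bit registry name, assuming the Norwegian reading."""
    return name7.translate(LEGACY_TO_UCS)


def may_need_review(name7: str) -> bool:
    """True if the name uses a national position and so could be ambiguous."""
    return any(c in "[\\]{|}" for c in name7)


print(from_legacy("]se"))       # Åse: a typical Norwegian name, correct
print(from_legacy("J{rvinen"))  # Jærvinen: wrong if the person is Järvinen
print(may_need_review("J{rvinen"), may_need_review("Olsen"))  # True False
```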

On the other hand, when children are born, people change name due to
marriage or people immigrate, the correct UCS name would be entered directly
and the legacy 7 bit name would be generated automatically from it. A status
indication would be required to tell if the UCS name is generated from the
legacy 7 bit name or vice versa to prevent any automatic system from trying
to change the UCS name if it has been entered manually.
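
A record carrying both forms plus the status indication might look roughly
like the Python sketch below. All names here (NameRecord, Origin,
enter_ucs_name, refresh_from_legacy) are hypothetical, and the two converters
are simplified stand-ins rather than a real fall-back policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Origin(Enum):
    FROM_LEGACY = 1  # UCS name was generated from the 7 bit name
    FROM_UCS = 2     # UCS name was entered by hand, 7 bit name generated


@dataclass
class NameRecord:
    ucs_name: str
    legacy_name: str
    origin: Origin


def enter_ucs_name(rec: NameRecord, ucs_name: str,
                   to_legacy: Callable[[str], str]) -> None:
    """Manual entry: store the UCS name and derive the legacy form from it."""
    rec.ucs_name = ucs_name
    rec.legacy_name = to_legacy(ucs_name)
    rec.origin = Origin.FROM_UCS


def refresh_from_legacy(rec: NameRecord,
                        from_legacy: Callable[[str], str]) -> bool:
    """Automatic conversion: never overwrite a manually entered UCS name."""
    if rec.origin is Origin.FROM_UCS:
        return False
    rec.ucs_name = from_legacy(rec.legacy_name)
    return True


# Simplified stand-in converters (assumptions, not a real policy).
_TO_UCS = str.maketrans("[\\]{|}", "ÆØÅæøå")
_TO_7BIT = str.maketrans("ÆØÅæøåÄÖäö", "[\\]{|}[\\{|")


def from_legacy(name7: str) -> str:
    return name7.translate(_TO_UCS)


def to_legacy(name: str) -> str:
    return name.translate(_TO_7BIT)


rec = NameRecord("Jærvinen", "J{rvinen", Origin.FROM_LEGACY)
enter_ucs_name(rec, "Järvinen", to_legacy)    # clerk corrects the name
print(refresh_from_legacy(rec, from_legacy))  # False: correction is kept
print(rec.ucs_name, rec.legacy_name)          # Järvinen J{rvinen
```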

As the problems with Y2K show, some legacy applications can stay in use for
a very long time, so I guess that both the UCS and 7 bit legacy names would
have to be stored for many years.

The more I think about it, the more it makes sense to store both the UCS and
the legacy name in the data base, thus avoiding any real time heuristics,
which would otherwise have to be distributed across a very large number of
legacy program data base interfaces.

Paul Keinänen


