Re: Encoding of personal names in official databases

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Tue Mar 30 1999 - 08:02:32 EST


Trond Trosterud wrote on 1999-03-30 11:15 UTC:
> Within the next month, I am going to write a memo to the Norwegian dept. of
> justice to comment upon the planned revision of the Norwegian laws for
> personal names. The goal of the revision is to allow other naming practices
> than the Norwegian one, due to a culturally more heterogenous population.
>
> My input will deal with the encoding of the names.
[...]
> Since we need both the Sámi names and the names of new immigrants, 8 bits
> really are not enough. If we then use some UCS format, which one shall we
> use (16-bit, utf-8,... , in order to save space and have databases with
> fast retrieval?

Low-level detail decisions such as whether to use UCS-2 or UTF-8 should
not be made in high-level administrative regulations, but should be left
to the designers of the specific implementations, communication
protocols, and file formats. The choice is rather obvious: If you have
a requirement for backwards compatibility with ASCII transport channels
(e.g., you want to continue using VT100 terminals, much of the POSIX
infrastructure, etc.), or if you are worried about memory space but for
some reason cannot use a more efficient representation such as gzip
compression, then UTF-8 is the right choice. UCS-2 is of advantage where
you need a 1 word = 1 character relationship (e.g., in parsing and
regular expression processing). Conversion between these data
representations is quite trivial, so you can use each one at the places
where it is most appropriate. Characters outside Plane 0 are probably
not of concern in your applications, so you won't have to worry about
UCS-2 vs. UTF-16. So don't worry about the encoding.
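
To illustrate how little is at stake here, the following is a minimal
sketch in Python (my choice for illustration only; the sample name is
made up). It assumes that Python's "utf-16-be" codec may stand in for
UCS-2, which holds as long as only Plane 0 characters occur:

```python
# Illustrative sketch: the same name in UTF-8 and in a 16-bit form.
name = "S\u00e1mi"  # hypothetical sample value, "Sami" with a-acute

utf8 = name.encode("utf-8")
ucs2 = name.encode("utf-16-be")  # same bytes as UCS-2 for Plane 0 text

# characters, UTF-8 bytes, 16-bit-form bytes
print(len(name), len(utf8), len(ucs2))  # 4 5 8

# Converting between the two representations is a lossless round trip.
assert utf8.decode("utf-8").encode("utf-16-be") == ucs2
assert ucs2.decode("utf-16-be").encode("utf-8") == utf8
```

For mostly-ASCII names UTF-8 is the smaller of the two, and either form
can be regenerated from the other at any interface boundary.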

The really difficult and important design decision is which precise
subset of Unicode you want to support. This has enormous implications
with regard to user interface design, user training, unification and
search algorithms, output hardware requirements, risk of misentry, etc.

Assuming that your application is restricted to the Latin alphabet and
does not require Greek or Cyrillic letters, then I guess a reasonable
UCS subset for your application might be the following 360 UCS
characters:

# Plane 00
# Rows Positions (Cells)

  00 20-7E A0-FF
  01 00-2B 2E-7E B7 DE-EF
  02 92 BB C7 D8-DB DD
  20 15 18-19 1C-1D AC
  21 22 26 5B-5E 90-93
  26 6A

This is the proposed CEN MES-1 subset (includes ISO 8859-1/2/3/4/etc.,
ISO 6937, and the euro sign) plus the characters of ISO IR-158 (various
alleged Sami characters; it is unclear to me whether all of these are
really needed).
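
A table in this rows/positions notation is easy to turn into a set of
code points. The following Python sketch does so for the table above;
the function and variable names are hypothetical, and the parser assumes
exactly the layout shown (a hexadecimal row number followed by
hexadecimal cells or cell ranges):

```python
# The subset table above, copied verbatim: row number, then positions.
SUBSET = """
00 20-7E A0-FF
01 00-2B 2E-7E B7 DE-EF
02 92 BB C7 D8-DB DD
20 15 18-19 1C-1D AC
21 22 26 5B-5E 90-93
26 6A
"""

def parse_subset(table):
    """Return the set of code points named by a rows/positions table."""
    code_points = set()
    for line in table.split("\n"):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = int(fields[0], 16)
        for cells in fields[1:]:
            first, _, last = cells.partition("-")
            lo = int(first, 16)
            hi = int(last, 16) if last else lo  # single cell: lo == hi
            code_points.update(row * 0x100 + c for c in range(lo, hi + 1))
    return code_points

allowed = parse_subset(SUBSET)
print(len(allowed))       # 360
print(0x20AC in allowed)  # True (EURO SIGN)
```

The count of 360 printed here matches the number given above.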

You should be aware that the above proposal contains the following
21 characters, which are not currently in Windows Glyph List 4 (WGL4),
and which therefore will probably not be widely available in the form
of Microsoft fonts:

01B7 # LATIN CAPITAL LETTER EZH
01DE # LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
01DF # LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
01E0 # LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
01E1 # LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
01E2 # LATIN CAPITAL LETTER AE WITH MACRON
01E3 # LATIN SMALL LETTER AE WITH MACRON
01E4 # LATIN CAPITAL LETTER G WITH STROKE
01E5 # LATIN SMALL LETTER G WITH STROKE
01E6 # LATIN CAPITAL LETTER G WITH CARON
01E7 # LATIN SMALL LETTER G WITH CARON
01E8 # LATIN CAPITAL LETTER K WITH CARON
01E9 # LATIN SMALL LETTER K WITH CARON
01EA # LATIN CAPITAL LETTER O WITH OGONEK
01EB # LATIN SMALL LETTER O WITH OGONEK
01EC # LATIN CAPITAL LETTER O WITH OGONEK AND MACRON
01ED # LATIN SMALL LETTER O WITH OGONEK AND MACRON
01EE # LATIN CAPITAL LETTER EZH WITH CARON
01EF # LATIN SMALL LETTER EZH WITH CARON
0292 # LATIN SMALL LETTER EZH
02BB # MODIFIER LETTER TURNED COMMA

All these characters are from ISO IR-158. MES-1 and ISO 6937 are already
fully covered by WGL4. If you do not need any of these characters, then
just use the 335 characters of MES-1, and you will have excellent
coverage of European names written with Latin characters:

# Plane 00
# Rows Positions (Cells)

  00 20-7E A0-FF
  01 00-13 16-2B 2E-4D 50-7E
  02 C7 D8-DB DD
  20 15 18-19 1C-1D AC
  21 22 26 5B-5E 90-93
  26 6A
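
Once a subset is fixed, checking a name against it at data entry is
straightforward. The Python sketch below uses the MES-1 table above;
the function names are hypothetical, and normalizing the input to NFC
first is my own assumption, made so that a letter entered as base
character plus combining mark is not rejected needlessly:

```python
import unicodedata

# MES-1 ranges from the table above as (first, last), inclusive.
MES1_RANGES = [
    (0x0020, 0x007E), (0x00A0, 0x00FF),
    (0x0100, 0x0113), (0x0116, 0x012B), (0x012E, 0x014D), (0x0150, 0x017E),
    (0x02C7, 0x02C7), (0x02D8, 0x02DB), (0x02DD, 0x02DD),
    (0x2015, 0x2015), (0x2018, 0x2019), (0x201C, 0x201D), (0x20AC, 0x20AC),
    (0x2122, 0x2122), (0x2126, 0x2126), (0x215B, 0x215E), (0x2190, 0x2193),
    (0x266A, 0x266A),
]

def in_mes1(ch):
    """True if the single character ch lies in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in MES1_RANGES)

def rejected_characters(name):
    """Return the characters of name that fall outside the subset."""
    # Assumption: compose base letter + combining mark where possible.
    composed = unicodedata.normalize("NFC", name)
    return [ch for ch in composed if not in_mes1(ch)]

print(rejected_characters("Trond Trosterud"))  # []
print(rejected_characters("\u01e6e"))          # one entry, U+01E6
```

A real system would also have to decide what to do with rejected
input (refuse it, transliterate it, or escalate to a clerk), which is
a policy question rather than a coding one.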

The proposed CEN MES-1 subset is described in

  http://www.indigo.ie/egt/standards/iso10646/pdf/p10-1998-11-18.pdf

Warning: this is still a draft under discussion.

When selecting a UCS subset, you also have to take into account the
restrictions of those output devices that cannot be upgraded as easily
as laser printers. For instance, it matters whether you are using any
of the OCR alphabets, which are only specified for small subsets of UCS.

Hope this helped, ...

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


