RE: Encoding of personal names in official databases

From: Christophe PIERRET (cpierret@businessobjects.com)
Date: Tue Mar 30 1999 - 07:58:31 EST


On March 30, 1999 1:16 PM, Trond Trosterud [SMTP:Trond.Trosterud@hum.uit.no]
wrote:
>Today, the official Norwegian population registry is coded with ASCII,
>enriched with the Norwegian letters ÆØÅæøå on the ASCII positions [\]{|}
>(I guess the same solution is in use in Denmark, Sweden and Finland as
>well, but with äö for æø).

So we can assume that more than 95% of the characters will be ASCII.
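
As an illustration of the migration step, here is a minimal Python sketch that turns a field from the 7-bit registry into Unicode. It assumes the six positions map in exactly the order quoted above ([\]{|} to ÆØÅæøå); the function name and the sample name are made up for the example, and a real registry would need its own mapping checked first.

```python
# Hypothetical translation table: the six repurposed ASCII positions,
# mapped in the order given in the quoted message.
LEGACY_TO_UNICODE = str.maketrans("[\\]{|}", "ÆØÅæøå")


def decode_legacy_name(raw: bytes) -> str:
    """Decode a 7-bit registry field and restore the Norwegian letters."""
    # decode("ascii") raises UnicodeDecodeError on any byte above 0x7F,
    # which is a useful sanity check on supposedly 7-bit data.
    return raw.decode("ascii").translate(LEGACY_TO_UNICODE)


name = decode_legacy_name(b"S|ren \\deg}rd")
utf8_name = name.encode("utf-8")  # what would be stored after migration
```

Note that the mapping is lossy in the other direction: a genuine bracket or brace in the old data cannot be told apart from a letter.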

>
>My suggestion will be that they abandon their 7-bit systems and move to...
Good idea indeed ;-)

>
>and here I need your advice.
>
>In Norway, Sámi citizens use Sámi names, the diacritics (ACUTE ACCENT,
>CARON, HOOK, STROKE) are just stripped off in the registry. We have large
>amounts of Finns and Swedes, their äö are replaced with æø. Immigrants from
>other countries bring their letters (and alphabets) with them. A natural
>answer to this is of course: Use the UCS. But the bases are huge: Every
>single citizen is included.
>
>Does anyone on this list have experience with similar cases? What is being
>done around the world? Do other countries use 7-bit solutions as well? Are
>there plans to migrate to 8 bits? 16 bits?

Where the source character set and language suggest that more than 50% of
the characters will be ASCII (say, Western European languages and character
sets, but not Greek or Cyrillic), store the data in UTF-8. Other cases get
UTF-16.

For French text, the increase in size between ISO-8859-1 and UTF-8 is less
than 2% ...
For English text, less than 1%.
For Norwegian, I expect it to be about the same as for French text.

This saves space over UTF-16 or UCS-4 (roughly 50% or 75% respectively) for
Norwegian and the other Nordic languages.
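
To check such figures on your own data, a small Python sketch like the following compares the stored size of a string in each encoding. The function name and the sample name are only illustrative, and a single name with three non-ASCII letters has a much higher non-ASCII density than running text, so the UTF-8 overhead here is larger than the percentages above.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in several encodings (no BOM)."""
    return {
        "iso-8859-1": len(text.encode("iso-8859-1")),
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants are used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "ucs-4": len(text.encode("utf-32-le")),
    }


sizes = encoded_sizes("Søren Ødegård")
for encoding, size in sizes.items():
    print(f"{encoding:>10}: {size} bytes")
```

Running the same function over a sample of real registry names would give a better estimate than any single example.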

Since Boyer-Moore-Horspool substring search works unchanged on UTF-8 bytes,
partial name searches will be as fast as on ASCII for Norwegian names (and
most European names as well).
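
A minimal Python sketch of that search, working purely on bytes, could look like this. It is a plain textbook Horspool, not tuned code, and the function name is my own. It works unchanged because UTF-8 is self-synchronizing: a correctly encoded needle can only match at a character boundary of a correctly encoded haystack.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of the first match of needle, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: for each byte of the needle except the last,
    # the distance from its rightmost occurrence to the end of the needle.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Shift according to the haystack byte aligned with the needle's end.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1


name = "Søren Ødegård".encode("utf-8")
offset = horspool_find(name, "Ødeg".encode("utf-8"))
```

The result is a byte offset, not a character index, so it has to be converted if a character position is needed.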

You can also use regular expressions, but they will not work as-is on names
that are not pure ASCII.

Since UTF-8 storage is efficient for the languages you plan to support,
retrieval will be nearly as fast as if it were ASCII.
UTF-8 is also supported by most recent database engines. If yours does not
support it, you can declare the data as raw variable-length 8-bit binary and
handle UTF-8 yourself (you will then have to check that the binary strings
are valid UTF-8).
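
If you do handle UTF-8 yourself, the validity check can be sketched in Python as below. The function name is an assumption of mine; the check simply relies on Python's strict decoder, which rejects truncated sequences, stray continuation bytes and overlong forms.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` is well-formed UTF-8, False otherwise."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


ok = is_valid_utf8("Sámi".encode("utf-8"))   # properly encoded name
bad = is_valid_utf8(b"S\xe1mi")              # ISO-8859-1 bytes stored by mistake
```

A check like this would run before a value is written to the binary column, so that badly encoded input is rejected at the door rather than discovered at retrieval time.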

>
>Since we need both the Sámi names and the names of new immigrants, 8 bits
>really are not enough. If we then use some UCS format, which one shall we

The 8-bit UTF-8 encoding is enough: UTF-8 encodes each UCS code value as a
sequence of 8-bit integers.
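
To make this concrete, here is a small Python sketch that prints the 8-bit values UTF-8 produces for an ASCII letter and a few letters used in Sámi names. The helper name and the choice of letters are mine, for illustration only.

```python
def utf8_bytes(ch: str) -> list:
    """Return the UTF-8 encoding of one character as a list of 8-bit integers."""
    return list(ch.encode("utf-8"))


for ch in "aáčŋ":
    octets = " ".join(f"{b:02X}" for b in utf8_bytes(ch))
    print(f"U+{ord(ch):04X} -> {octets}")
```

The ASCII letter stays a single byte with its ASCII value, while each accented letter becomes two bytes, both with the high bit set, so they can never be mistaken for ASCII.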

>use (16-bit, utf-8,... , in order to save space and have databases with
>fast retrieval?
>
>Greetings,
>

Go for UTF-8: it's the clean and efficient solution when you only need
support for languages written in the Latin script.

______________________________________________________
Christophe Pierret
    Software Development Engineer
    Product Group
    Professional E-mail: cpierret@businessobjects.com
    Personal E-mail: pach2@club-internet.fr

Business Objects S.A.
    1, square Chaptal, 92309 Levallois-Perret, FRANCE
    http://www.businessobjects.com
 


