On March 30, 1999 1:16 PM, Trond Trosterud [SMTP:[email protected]]
wrote:
>Today, the official Norwegian population registry is coded with ASCII,
>enriched with the Norwegian letters ÆØÅæøå on the ASCII positions [\]{|}
>(I guess the same solution is in use in Denmark, Sweden and Finland as
>well, but with äö for æø).
So we can assume that more than 95% of the characters will be ASCII.
>
>My suggestion will be that they abandon their 7-bit systems and move to...
Good idea indeed ;-)
>
>and here I need your advice.
>
>In Norway, Sámi citizens use Sámi names, the diacritics (ACUTE ACCENT,
>CARON, HOOK, STROKE) are just stripped off in the registry. We have large
>amounts of Finns and Swedes, their äö are replaced with æø. Immigrants from
>other countries bring their letters (and alphabets) with them. A natural
>answer to this is of course: Use the UCS. But the bases are huge: Every
>single citizen is included.
>
>Does anyone on this list have experience with similar cases? What is being
>done around the world? Do other countries use 7-bit solutions as well? Are
>there plans to migrate to 8 bits? 16 bits?
For data where the source character set and language suggest that more than
50% of the characters will be ASCII (say, Western European languages and
character sets, but not Greek or Cyrillic), you can store the data in UTF-8.
Other cases get UTF-16.
For French text, the increase in size between ISO-8859-1 and UTF-8 is less
than 2% ...
For English text, less than 1%.
For Norwegian, I expect it to be about the same as for French text.
Compared with UTF-16 or UCS-4, UTF-8 saves roughly 50% or 75% of the space for
Norwegian and other Nordic languages.
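The size comparison above can be checked on your own data. What follows is only
a minimal sketch in Python: the names are made-up sample data, not registry
content, and the helper name encoded_sizes is hypothetical. It measures the
same text in ISO-8859-1, UTF-8, UTF-16 and UTF-32 (the same width as UCS-4).

```python
# Made-up sample names; all characters exist in ISO-8859-1.
names = ["Kari Nordmann", "Bjørn Ødegård", "Åse Sæther", "Máret Ánne Sara"]


def encoded_sizes(text):
    """Return the byte length of text in several encodings."""
    return {
        "iso-8859-1": len(text.encode("iso-8859-1")),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # explicit byte order, no BOM
        "utf-32": len(text.encode("utf-32-le")),  # same width as UCS-4
    }


sample = "\n".join(names)
sizes = encoded_sizes(sample)
for encoding, size in sizes.items():
    print(f"{encoding:>10}: {size} bytes")
```

Each letter outside ASCII costs one extra byte in UTF-8 compared with
ISO-8859-1, so the overhead depends on how often such letters occur in the
real data.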
Since you can use an unchanged Boyer-Moore-Horspool substring search on UTF-8,
partial name searches will be as fast as on ASCII for Norwegian names (and most
European names as well).
You can also use regular expressions, but a byte-oriented regex engine will not
work as is on names that are not pure ASCII.
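To illustrate the substring search point, here is a minimal sketch of
Boyer-Moore-Horspool running directly on UTF-8 bytes. The function name
horspool_find and the sample name are assumptions for illustration only. The
search needs no knowledge of UTF-8: provided both strings are valid UTF-8, a
match always starts on a character boundary, because lead bytes and
continuation bytes use disjoint value ranges.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of the first match of needle, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character shift table over the 256 possible byte values.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Shift according to the byte aligned with the needle's last position.
        pos += shift[haystack[pos + m - 1]]
    return -1


record = "Bjørn Ødegård".encode("utf-8")
offset = horspool_find(record, "Ødeg".encode("utf-8"))
print(offset)
```

Note that the result is a byte offset, not a character index; decode the
prefix record[:offset] if you need the character position.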
Since UTF-8 storage is efficient for the languages you plan to support,
retrieval will be nearly as fast as if it were ASCII.
UTF-8 is also supported by most recent database engines. If yours does not
support it, you can declare the data as raw variable-length 8-bit binary and
handle UTF-8 yourself (you'll have to check that the binary strings are valid
UTF-8).
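If you do store UTF-8 in a raw binary column, the correctness check could look
like the sketch below. It relies on Python's built-in strict UTF-8 decoder; the
function name is_valid_utf8 is hypothetical, and where you call it (on insert,
on import, in a batch job) is left open.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is a well-formed UTF-8 byte string."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8("Sámi".encode("utf-8")))       # well-formed UTF-8
print(is_valid_utf8("Sámi".encode("iso-8859-1")))  # Latin-1 bytes, not UTF-8
```

Rejecting invalid input when it is written is usually simpler than repairing
it later, since a stray ISO-8859-1 byte cannot be told apart from corruption
once it is in the database.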
>
>Since we need both the Sámi names and the names of new immigrants, 8 bits
>really are not enough.
An 8-bit encoding is enough if it is UTF-8: UTF-8 encodes the UCS code values
into a sequence of 8-bit integers.
>If we then use some UCS format, which one shall we
>use (16-bit, utf-8,... , in order to save space and have databases with
>fast retrieval?
>
>Greetings,
>
Go for UTF-8. It's the clean and efficient solution when you only need
support for Latin-script-based languages.
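To show how UTF-8 turns a UCS code value into a sequence of 8-bit integers,
here is a small hand-written encoder. It is a sketch only: the function name
utf8_encode_code_point is hypothetical, and it assumes the range U+0000 to
U+10FFFF with surrogates excluded, so the longer forms for larger UCS-4 values
are not covered. In practice you would use your platform's own encoder.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point as UTF-8 (assumes U+0000..U+10FFFF, no surrogates)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp < 0x80:
        # ASCII: one byte, identical to the ASCII code.
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for ch in "Aåš":
    print(ch, utf8_encode_code_point(ord(ch)).hex())
```

ASCII letters stay one byte each, while letters such as å and š take two
bytes, which is why mostly-ASCII name data grows so little in UTF-8.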
______________________________________________________
Christophe Pierret
Software Development Engineer
Product Group
Professional E-mail: [email protected]
Personal E-mail: [email protected]
Business Objects S.A.
1, square Chaptal, 92309 Levallois-Perret, FRANCE
http://www.businessobjects.com