RE: Cost of transition to UTF-8 for central census authorities

From: ktadenev@ups.com
Date: Mon Jan 12 2009 - 09:41:14 CST

Next message: ktadenev@ups.com: "RE: Cost of transition to UTF-8 for central census authorities"

Previous message: Curtis Clark: "Re: Emoji: emoticons vs. literacy"
In reply to: Tim Greenwood: "Re: Cost of transition to UTF-8 for central census authorities"
Next in thread: Christopher Fynn: "Re: Cost of transition to UTF-8 for central census authorities"

I agree, most of these are not valid issues. Just wanted to clarify a couple of points:
1. In most cases varchar lengths are counted in bytes and nvarchar lengths in characters. It is always best to refer to the respective DBMS documentation for details.
2. Major databases support two types of sort in UTF-8: binary (the default for UTF-8) and culturally sensitive. The latter is designed to provide proper order based on a national alphabet or other writing system. It is also useful to keep in mind that many sort products are UTF-8 compliant. UTF-8 is also widely supported by all kinds of transport solutions such as FTP, HTTP, WSMQ, JDBC, etc.
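
To illustrate the difference between the two kinds of sort in point 2, here is a minimal Python sketch. It is not how any particular DBMS implements collation; the alphabet table and the sample names are assumptions made up for the example, and a real system would rely on the database's own collation or a collation library.

```python
import unicodedata

# Assumption for this sketch: a hand-written Norwegian alphabet, in which
# ae, o-slash and a-ring come after z. Real collations handle far more cases.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"
RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}

def norwegian_key(name):
    # Letters missing from the table sort after all known letters.
    return [(RANK.get(ch, len(RANK)), ch) for ch in name.lower()]

# Hypothetical sample names, normalized to precomposed (NFC) form.
names = [unicodedata.normalize("NFC", n)
         for n in ["Åse", "Ørjan", "Ægir", "Zakarias"]]

# Binary sort: compare the raw UTF-8 bytes.
binary_order = sorted(names, key=lambda s: s.encode("utf-8"))

# Culturally sensitive sort: compare by position in the national alphabet.
cultural_order = sorted(names, key=norwegian_key)

print(binary_order)    # Zakarias, then the A-ring, AE and O-slash names
print(cultural_order)  # Zakarias, then the AE, O-slash and A-ring names
```

The binary order places Å before Æ and Ø because that is their code point order, while the alphabet-based key places Å last, as a Norwegian reader would expect.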

Konstantin Tadenev

________________________________
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Tim Greenwood
Sent: Sunday, January 11, 2009 12:13 PM
To: Trond Trosterud
Cc: Unicode List
Subject: Re: Cost of transition to UTF-8 for central census authorities

Most databases still define the schema in terms of characters, not bytes. So a varchar(3) is 3 characters (or perhaps code points) no matter whether the database is storing it in Latin1 or UTF-8.

Is sorting and searching done inside the database? If so, then point 2 is a no-op.

All decent databases will convert output to the codeset required by the client, converting in ODBC or similar. So conversion of client programs to work with UTF-8, if needed at all, can be phased in.
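
As a rough sketch of that kind of boundary conversion (in Python rather than in an ODBC driver, and with a hypothetical function name), text held as Unicode can be encoded to whatever codeset a legacy client expects. Characters the client codeset cannot represent have to be substituted, and the "?" substitution shown here is only one possible policy.

```python
def to_client(text, client_charset="latin-1"):
    # Hypothetical helper: encode Unicode text for a client that only
    # understands client_charset. Unrepresentable characters become "?".
    return text.encode(client_charset, errors="replace")

# A name within the Latin-1 repertoire converts without loss.
legacy_bytes = to_client("\u00c5r\u00f8")   # A-ring, r, o-slash

# A letter outside Latin-1 (S with caron) is replaced for the old client.
degraded = to_client("\u0160")

# A client that has already been moved to UTF-8 gets the full text.
modern_bytes = to_client("\u0160", client_charset="utf-8")
```

This is why clients can be phased in: unconverted programs keep receiving 8859-1 bytes, and only names that use the new letters are degraded for them.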

Tim
On Sun, Jan 11, 2009 at 10:02 AM, Trond Trosterud <trond.trosterud@hum.uit.no> wrote:
I have the following question to the list:

In Norway, we have large census databases (https://infobank.edb.com) containing the names, social security numbers, addresses, cars, companies, boats, etc. of all Norwegian citizens. Today the data uses the 8859-1 character repertoire, probably encoded as 8859-1 itself (some old registries may be EBCDIC, but with the same character repertoire or a subset).

Now, Norway wants to be able to use Sámi in that register, i.e., 6x2 letters from the Latin Extended-A block in Unicode. ISO/IEC 8859-4 and -10 are possible, but a natural solution is UTF-8.
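
A small Python sketch of why 8859-1 is not enough for these letters. The message does not list the six letter pairs, so the sketch assumes they are the Northern Sámi letters C, S and Z with caron, D and T with stroke, and eng; that choice is an assumption, not something stated above.

```python
# Assumed letter set (upper and lower case), written as escapes:
# C-caron, D-stroke, Eng, S-caron, T-stroke, Z-caron.
SAMI_LETTERS = ("\u010c\u010d\u0110\u0111\u014a\u014b"
                "\u0160\u0161\u0166\u0167\u017d\u017e")

def encodable(ch, charset):
    # True if the single character ch can be represented in charset.
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# None of the assumed letters exists in ISO 8859-1.
in_latin1 = [ch for ch in SAMI_LETTERS if encodable(ch, "latin-1")]

# In UTF-8 each of them takes two bytes.
utf8_sizes = {ch: len(ch.encode("utf-8")) for ch in SAMI_LETTERS}

# All of them sit in the Latin Extended-A block, U+0100 to U+017F.
in_latin_ext_a = all(0x0100 <= ord(ch) <= 0x017F for ch in SAMI_LETTERS)
```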

Now, what will this cost?

According to key personnel, this transition will require a transition period of approximately 10 years, and a relatively high cost (politeness towards the authors of the transition plans prevents me from quoting numbers).

Governmental experts see the following drawbacks with UTF-8:

1. The field length in the database will be longer than the display field. So, given a surname "Årø", we will have a display length of 3 (letters), as compared to the database length of 5 bytes.
2. There will have to be a new sorting routine and a new search routine.
3. Programs may no longer search for characters as single bytes, but must in some cases allow for searching for sequences of bytes.
4. Many common programs only support 8-bit character sets.
5. Data must be removed from registries, converted and replaced.
6. Millions of lines of code must be changed and tested.
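
Points 1, 3 and 5 can be made concrete with a short Python sketch. It only illustrates the byte arithmetic; the helper name is hypothetical and nothing here reflects how the registries are actually implemented.

```python
import unicodedata

# The surname from point 1, in precomposed (NFC) form.
surname = unicodedata.normalize("NFC", "Årø")

latin1 = surname.encode("latin-1")
utf8 = surname.encode("utf-8")

char_count = len(surname)   # display length in letters
latin1_len = len(latin1)    # storage in 8859-1: one byte per letter
utf8_len = len(utf8)        # storage in UTF-8: A-ring and o-slash take 2 bytes

# Point 3: searching for o-slash means searching for a two-byte sequence.
needle = "\u00f8".encode("utf-8")
byte_offset = utf8.find(needle)

# Point 5: converting stored 8859-1 data is a decode followed by an encode.
converted = latin1.decode("latin-1").encode("utf-8")

def fits_in_bytes(value, max_bytes):
    # Hypothetical check for a column whose limit is counted in bytes.
    return len(value.encode("utf-8")) <= max_bytes
```

With these definitions, fits_in_bytes(surname, 3) is false while fits_in_bytes(surname, 5) is true, which is the whole of drawback 1: a byte-counted field needs headroom, but the display length is unchanged.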

To me it seems most of these points are not real problems, but either a description of the conversion process, or unfounded fear.

My question to the list is this:

a. How can the variable field length be a problem? The field must in any case allow for longer names, e.g. the 9 letters of my name (Trosterud) require 9 bytes, more than the 5 of Årø. Can there be database solutions that generate database fields on the basis of the number of characters? The opposite (viewing fields on the basis of bytes) should be no problem; it will only give [Årø ] and [Trosterud].

b. Will it really be necessary to change millions of lines of code? How can even old, badly written code require such changes?

c. The problem with the discussion is that the experts within the registries are presenting their conclusions, and not the premises behind them. Politicians listening to them are thus lost. I have been invited to comment on the process, but it is not easy, as I get so little information about it. So, what kind of information do I need in order to evaluate these estimates?

d. Other comments, or perhaps better: experiences from other countries?

Trond Trosterud.

----------------------------------------------------------------------
Trond Trosterud t +47 7764 4763
Institutt for språkvitskap, Det humanistiske fakultet m +47 950 70140
N-9037 Universitetet i Tromsø, Noreg f +47 7764 5216
Trond.Trosterud (a) hum.uit.no http://www.hum.uit.no/a/trond/
dn------------------------------------------------------------------đŋ


This archive was generated by hypermail 2.1.5 : Mon Jan 12 2009 - 09:45:25 CST