RE: Cost of transition to UTF-8 for central census authorities

From: Don Osborn (dzo@bisharat.net)
Date: Sun Jan 11 2009 - 09:41:35 CST

    Hi Trond,

    I can't answer your questions but would offer the thought that this sort of practical question is the kind of thing that should be raised during the International Year of Languages (which is technically still on until the formal close on International Mother Language Day, 21 Feb.).

    I'll be very interested to read other comments in response to your post, esp. as regards experience in other countries. A longer term issue is how to get the kind of question Norway is dealing with on the agenda in other multilingual countries where UTF-8 would better accommodate the range of orthographies used.

    Put another way, it seems that a question we are facing is whether the Latin script gets accepted as a complex script to accommodate minority and non-official languages, or policies to ASCIIfy (or Latin-1-ify) transcriptions prevail.

    Don

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > Behalf Of Trond Trosterud
    > Sent: Sunday, January 11, 2009 10:02 AM
    > To: Unicode List
    > Subject: Cost of transition to UTF-8 for central census authorities
    >
    > I have the following question to the list:
    >
    > In Norway, our large census database (https://infobank.edb.com)
    > contains the names, social security numbers, addresses, cars,
    > companies, boats, etc., of all Norwegian citizens. Today, it is
    > encoded in an 8-bit charset, probably 8859-1 (some old registries may
    > be EBCDIC, but with the same character repertoire or a subset).
    >
    > Now, Norway wants to be able to use Sámi in that register, i.e., 6x2
    > letters from the Latin Extended-A block in Unicode. ISO/IEC 8859-4 and
    > -10 are possible, but a natural solution is UTF-8.
    >
    > Now, what will this cost?
    >
    > According to key personnel, this transition will require a transition
    > period of approx. 10 years, and a relatively high cost (politeness
    > towards the authors of the transition plans prevents me from quoting
    > numbers).
    >
    > Governmental experts see 6 drawbacks with UTF-8:
    >
    > 1. The field length in the database will be longer than the display
    > field. So, given a surname "Årø", we will have a display length of 3
    > (letters), as compared to the database length of 5 bytes.
    > 2. There will have to be a new sorting routine, and a new search
    > routine.
    > 3. Programs may no longer search for characters as single bytes, but
    > must in some cases allow for searching for sequences of bytes.
    > 4. Many common programs only support 8-bit character sets.
    > 5. Data must be removed from registries, converted and replaced.
    > 6. Millions of lines of code must be changed and tested.
    >
    > To me it seems most of these points are not real problems, but either
    > a description of the conversion process, or unfounded fear.
    >
    > My question to the list is this:
    >
    > a. How can the variable field length be a problem? The field must in
    > any case allow for longer names, e.g. the 9 letters of my name
    > (Trosterud) require 9 bytes, more than the 5 of Årø. Can there be
    > database solutions that generate database fields on the basis of the
    > number of characters? The opposite (view fields on the basis of bytes)
    > should be no problem; it will only give [Årø ] and [Trosterud].
    >
    > b. Will it really be necessary to change millions of lines of code?
    > How can even old, badly written code require such changes?
    >
    > c. The problem with the discussion is that the experts within the
    > registries are presenting their conclusions, and not the premises
    > behind them. Politicians listening to them are thus lost. I am invited
    > to comment on the process, but it is not easy, as I get so little
    > information about it. So, what kind of information do I need in
    > order to evaluate these estimates?
    >
    > d. Other comments, or perhaps better: experiences from other countries?
    >
    > Trond Trosterud.
    >
    >
    > ----------------------------------------------------------------------
    > Trond Trosterud t +47 7764 4763
    > Institutt for språkvitskap, Det humanistiske fakultet m +47 950 70140
    > N-9037 Universitetet i Tromsø, Noreg f +47 7764 5216
    > Trond.Trosterud (a) hum.uit.no http://www.hum.uit.no/a/trond/
    > dn------------------------------------------------------------------đŋ
    >
    >
    >
    >
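
As a rough illustration of drawbacks 1, 3 and 5 in the quoted list, here is a minimal Python sketch. The function names are hypothetical and nothing here reflects how the Norwegian registries are actually implemented; it only shows character count versus UTF-8 byte count, a byte-level search for a multi-byte letter, and a conversion from ISO 8859-1, which maps every byte to the Unicode code point of the same value.

```python
# Illustrative sketch only: all function names are hypothetical and are not
# taken from any registry system discussed in this thread.


def char_and_byte_lengths(name: str) -> tuple:
    """Return (number of code points, number of UTF-8 bytes) for a name."""
    return len(name), len(name.encode("utf-8"))


def latin1_to_utf8(raw: bytes) -> bytes:
    """Convert ISO 8859-1 bytes to UTF-8 bytes.

    Every byte value is a valid ISO 8859-1 character, so the decode step
    cannot fail; whether the input really is 8859-1 is an assumption.
    """
    return raw.decode("iso-8859-1").encode("utf-8")


def find_in_utf8(haystack: bytes, needle: str) -> int:
    """Return the byte offset of needle in UTF-8 data, or -1 if absent.

    The needle is encoded to its UTF-8 byte sequence and searched for as a
    plain byte string. UTF-8 lead bytes and continuation bytes use disjoint
    ranges, so a match cannot start in the middle of another character.
    """
    return haystack.find(needle.encode("utf-8"))


if __name__ == "__main__":
    surname = "\u00c5r\u00f8"  # "Årø", written with escapes to avoid encoding doubts
    print(char_and_byte_lengths(surname))
    print(char_and_byte_lengths("Trosterud"))

    stored_latin1 = b"\xc5r\xf8"  # "Årø" as it would be stored in ISO 8859-1
    print(latin1_to_utf8(stored_latin1))

    record = ("Trond " + surname).encode("utf-8")
    print(find_in_utf8(record, "\u00f8"))  # byte offset, not character index
```

The offset returned by the search is counted in bytes, so code that uses it to slice or pad display fields would need to work on decoded text instead.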



    This archive was generated by hypermail 2.1.5 : Sun Jan 11 2009 - 09:44:06 CST