Cost of transition to UTF-8 for central census authorities

From: Trond Trosterud (trond.trosterud@hum.uit.no)
Date: Sun Jan 11 2009 - 09:02:13 CST


    I have the following question to the list:

    In Norway, our large census database (https://infobank.edb.com)
    contains the names, social security numbers, addresses, cars,
    companies, boats, etc. of all Norwegian citizens. Today it uses the
    8859-1 character repertoire, probably encoded as 8859-1 itself (some
    old registries may be EBCDIC, but with the same character repertoire
    or a subset).

    Now, Norway wants to be able to use Sámi in that register, i.e., 6x2
    letters from the Latin Extended-A block in Unicode. ISO/IEC 8859-4 and
    -10 are possible, but a natural solution is UTF-8.
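
    To make the "6x2 letters" concrete, here is a minimal Python sketch.
    The letter list is my assumption of the Northern Sámi letters outside
    ASCII, and the helper name is made up for illustration; it only shows
    which of them 8859-1 cannot represent and how many bytes each needs
    in UTF-8.

```python
import unicodedata

# Assumed list of Northern Sámi letters outside ASCII (illustration only).
SAMI_LETTERS = "ÁáČčĐđŊŋŠšŦŧŽž"

def not_in_latin1(text):
    """Return the characters of text that ISO 8859-1 cannot encode."""
    missing = []
    # Normalize to NFC so each letter is a single precomposed code point.
    for ch in unicodedata.normalize("NFC", text):
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

missing = not_in_latin1(SAMI_LETTERS)

# Number of bytes each missing letter occupies in UTF-8.
utf8_sizes = {ch: len(ch.encode("utf-8")) for ch in missing}

print(len(missing), "letters are outside 8859-1:", "".join(missing))
print("UTF-8 bytes per letter:", sorted(set(utf8_sizes.values())))
```

    Á and á are already in 8859-1, so only the remaining six letter pairs
    are reported as missing.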

    Now, what will this cost?

    According to key personnel, this transition will require a transition
    period of approx. 10 years and come at a relatively high cost
    (politeness towards the authors of the transition plans prevents me
    from quoting numbers).

    Governmental experts see 6 drawbacks with UTF-8:

    1. The field length in the database will be longer than the display
    field. So, given a surname "Årø", we will have a display length of 3
    (letters), as compared to the database length of 5 bytes.
    2. There will have to be a new sorting routine and a new search routine.
    3. Programs may no longer search for characters as single bytes, but
    must in some cases allow for searching for sequences of bytes.
    4. Many common programs only support 8-bit character sets.
    5. Data must be removed from the registries, converted and put back.
    6. Millions of lines of code must be changed and tested.
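
    Points 1, 3 and 5 can be illustrated with a short Python sketch. It is
    not taken from the registries' systems; the variable names are mine,
    and it only assumes that the existing data really is 8859-1.

```python
# "Årø" written with escapes so the example does not depend on file encoding.
name = "\u00c5r\u00f8"

latin1 = name.encode("iso-8859-1")  # one byte per letter
utf8 = name.encode("utf-8")         # Å and ø take two bytes each

char_len = len(name)      # display length in letters
latin1_len = len(latin1)  # stored length today
utf8_len = len(utf8)      # stored length after conversion

# Point 5: converting a stored value is a decode followed by an encode.
converted = latin1.decode("iso-8859-1").encode("utf-8")

# Point 3: searching for a letter becomes searching for its byte sequence.
# The UTF-8 encoding of one character never occurs inside the encoding of
# another, so a plain byte search cannot give a false match.
needle = "\u00f8".encode("utf-8")
byte_pos = utf8.find(needle)

# Translate the byte offset back to a letter offset when needed.
char_pos = len(utf8[:byte_pos].decode("utf-8"))

print(char_len, latin1_len, utf8_len, byte_pos, char_pos)
```

    The letter ø is found at byte offset 3 but letter offset 2, which is
    exactly the difference between byte length and display length that
    point 1 describes.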

    To me it seems most of these points are not real problems, but either
    descriptions of the conversion process or unfounded fears.

    My questions to the list are these:

    a. How can the variable field length be a problem? The field must in
    any case allow for longer names, e.g. the 9 letters of my name
    (Trosterud) require 9 bytes, more than the 5 of Årø. Can there be
    database solutions that generate database fields on the basis of the
    number of characters? The opposite (view fields on the basis of bytes)
    should be no problem, it will only give [Årø ] and [Trosterud].
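
    The one place I can imagine the byte length mattering is a fixed-width
    field measured in bytes. A hedged Python sketch of that case follows;
    the function names and the 9-wide field are hypothetical, chosen only
    to match the two names above.

```python
def fit_to_bytes(text, max_bytes):
    """Cut text so its UTF-8 form fits max_bytes without splitting a letter."""
    data = text.encode("utf-8")[:max_bytes]
    # A multi-byte letter cut in half at the end is dropped, not garbled.
    return data.decode("utf-8", errors="ignore")

def display_field(text, width):
    """Pad text to a display width counted in letters, not bytes."""
    return "[" + text.ljust(width) + "]"

short_name = "\u00c5r\u00f8"  # "Årø": 3 letters, 5 bytes in UTF-8
long_name = "Trosterud"       # 9 letters, 9 bytes in UTF-8

cut = fit_to_bytes(short_name, 4)       # the 4th byte is half of ø
whole = fit_to_bytes(long_name, 9)      # fits exactly
padded = display_field(short_name, 9)   # padded by letter count

print(cut, whole, padded)
```

    With a 4-byte limit "Årø" is cut to "År", since the remaining byte is
    only half of ø; with a 9-letter display field both names line up.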

    b. Will it really be necessary to change millions of lines of code?
    How can even old, badly written code require such changes?

    c. The problem with the discussion is that the experts within the
    registries are presenting their conclusions, but not the premises
    behind them. Politicians listening to them are thus lost. I have been
    invited to comment on the process, but that is not easy, as I get so
    little information about it. So, what kind of information do I need
    in order to evaluate these estimates?

    d. Other comments, or perhaps better: experiences from other countries?

    Trond Trosterud.

    ----------------------------------------------------------------------
    Trond Trosterud t +47 7764 4763
    Institutt for språkvitskap, Det humanistiske fakultet m +47 950 70140
    N-9037 Universitetet i Tromsø, Noreg f +47 7764 5216
    Trond.Trosterud (a) hum.uit.no http://www.hum.uit.no/a/trond/
    dn------------------------------------------------------------------đŋ


