Re: Cost of transition to UTF-8 for central census authorities

From: philip chastney (philip_chastney@yahoo.com)
Date: Wed Jan 14 2009 - 14:51:03 CST


    --- On Sun, 11/1/09, Trond Trosterud <trond.trosterud@hum.uit.no> wrote:
    From: Trond Trosterud <trond.trosterud@hum.uit.no>
    Subject: Cost of transition to UTF-8 for central census authorities
    To: "Unicode List" <unicode@unicode.org>
    Date: Sunday, 11 January, 2009, 3:02 PM

    I have the following question to the list:

    In Norway, our large census database (https://infobank.edb.com) contains the
    names, social security numbers, addresses, cars, companies, boats, etc., of
    all Norwegian citizens. Today it uses the 8859-1 character set, probably
    encoded as 8859-1 (some old registries may be EBCDIC, but with the same
    character repertoire or a subset).

    Now, Norway wants to be able to use Sámi in that register, i.e., 6x2 letters
    from the Latin Extended-A block in Unicode. ISO/IEC 8859-4 and -10 are
    possible, but a natural solution is UTF-8.

    icebergs spring to mind here

    Sámi may be the trigger, but it is part of a bigger issue

    how are names of East European immigrants handled, for instance?
    surely they are not all unregistered?

    and what happens to East Europeans who want to adopt Norwegian citizenship?
    are they required to renounce their diacritical markings?

    issues like these are going to have to be faced  --  wouldn't it be more cost-effective to adopt a solution to the stated problem (Sámi) which also solves the problem with other languages?

    .... and other scripts (let us not forget that Norway shares a border with Russia)
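
    To make the repertoire argument concrete, here is a minimal sketch in Python. The sample strings are hypothetical illustrations (they are not taken from any register); only the codec names are real. It checks which of the encodings mentioned in this thread can represent the six Sámi letter pairs, a Polish name, and a Russian name.

```python
# Hypothetical sample data, chosen only to illustrate the repertoire issue.
SAMPLES = {
    "Northern Sámi letters": "ČčĐđŊŋŠšŦŧŽž",  # the 6x2 letters outside 8859-1
    "Polish name": "Łukasz Wiśniewski",
    "Russian name": "Дмитрий",
}

# Python codec names for ISO/IEC 8859-1, -4, -10 and UTF-8.
ENCODINGS = ["iso8859_1", "iso8859_4", "iso8859_10", "utf_8"]


def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def coverage(samples=SAMPLES, encodings=ENCODINGS):
    """Map each sample label to {encoding: representable?}."""
    return {
        label: {enc: can_encode(text, enc) for enc in encodings}
        for label, text in samples.items()
    }


if __name__ == "__main__":
    for label, row in coverage().items():
        print(label, row)
```

    Under these assumptions the Sámi letters fit in 8859-4 and 8859-10 but not in 8859-1, while the Cyrillic name fits only in UTF-8, which is the sense in which a Sámi-only fix leaves the larger problem unsolved.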

    to put it another way: a move to Unicode will have to be made sometime, and the longer it takes to commit to that move, the more it will cost to convert

    once that point is generally accepted as policy, where and when these Unicode strings are stored/transmitted/processed as UTF-8 or UTF-32 is a completely separate technical issue
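
    A short sketch of why the encoding form is a separate question from the decision to use Unicode: the same string round-trips through UTF-8 and UTF-32 unchanged, and only the byte counts differ. The function names and the sample word are hypothetical, chosen for illustration.

```python
def encoded_sizes(text):
    """Count code points and the bytes needed in each encoding form."""
    return {
        "code points": len(text),
        "utf-8 bytes": len(text.encode("utf-8")),
        # "utf-32-be" fixes the byte order so no byte-order mark is added.
        "utf-32 bytes": len(text.encode("utf-32-be")),
    }


def round_trips(text):
    """Check that both encoding forms preserve the string exactly."""
    return all(
        text.encode(enc).decode(enc) == text
        for enc in ("utf-8", "utf-32-be")
    )


if __name__ == "__main__":
    word = "Áššu"  # hypothetical sample containing Sámi letters
    print(encoded_sizes(word))
    print(round_trips(word))
```

    UTF-32 always uses four bytes per code point, while UTF-8 uses one to four, so the choice affects storage and transmission size but not which names can be represented.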

    it is not common for IT projects to under-run their estimated costs, but if the costs and time-scales for conversion seem inflated, the estimators may be ill-informed, cautious, or hoping for the contract  ...   or just plain wrong

    /phil


