RE: Cost of transition to UTF-8 for central census authorities

From: Erkki I. Kolehmainen (eik@iki.fi)
Date: Sun Jan 11 2009 - 11:49:20 CST


    Trond and others,

    There is considerable interest within the EU (which Norway, however, isn't a
    member of) and, consequently, within CEN (of which Norway is a member, due
    to its membership of EFTA), to come up with a common, interoperable
    repertoire for proper registration of names in the national population
    registries. The urgency for this stems from the free movement of people and
    goods within the EU. As Don pointed out, the issue of wrong registrations
    (with implications ranging from legal issues to politeness) should be
    brought up in the context of the UN Year of Languages. This is exactly what
    was done on September 26th, 2008 (the European Day of Languages), in the
    main EU event "États généraux du multilinguisme" held at the Sorbonne. I had
    been asked to make a statement on the impact of ICT standardization on the
    support of multilingualism, and this was one of the more important issues
    that I addressed in my statement. Right now it would appear that
    considerable progress is being made.

    I don't wish to comment further on the highly surprising statements made by
    the Norwegian authorities. However, I'm totally lost with the apparent lack
    of desire to support Pan-European name writing even for the Latin script,
    leading eventually to further isolation of Norway by limiting the extension
    to the names of Sámi origin.

    It shouldn't take all that long before more information will be available on
    the European-wide effort.

    Sincerely,

    Erkki I. Kolehmainen
    Tilkankatu 12 A 3, FI-00300 Helsinki, Finland
    Puh. (09) 4368 2643, 0400 825 943; Tel. +358 9 4368 2643, +358 400 825 943

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    On Behalf Of Don Osborn
    Sent: 11 January 2009 17:42
    To: 'Trond Trosterud'; 'Unicode List'
    Subject: RE: Cost of transition to UTF-8 for central census authorities

    Hi Trond,

    I can't answer your questions but would offer the thought that this sort of
    practical question is the kind of thing that should be raised during the
    International Year of Languages (which is technically still on until the
    formal close on International Mother Language Day, 21 Feb.).

    I'll be very interested to read other comments in response to your post,
    esp. as regards experience in other countries. A longer term issue is how to
    get the kind of question Norway is dealing with on the agenda in other
    multilingual countries where UTF-8 would better accommodate the range of
    orthographies used.

    Put another way, it seems that a question we are facing is whether the Latin
    script gets accepted as a complex script to accommodate minority and
    non-official languages, or policies to ASCIIfy (or Latin-1-ify)
    transcriptions prevail.

    Don

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    > On Behalf Of Trond Trosterud
    > Sent: Sunday, January 11, 2009 10:02 AM
    > To: Unicode List
    > Subject: Cost of transition to UTF-8 for central census authorities
    >
    > I have the following question to the list:
    >
    > In Norway, our large census database (https://infobank.edb.com)
    > contains the names, social security numbers, addresses, cars,
    > companies, boats, etc. of all Norwegian citizens. Today, it uses the
    > 8859-1 character repertoire, probably encoded as 8859-1 (some old
    > registries may be EBCDIC, but with the same character repertoire or a
    > subset).
    >
    > Now, Norway wants to be able to use Sámi in that register, i.e., 6x2
    > letters from the Latin Extended-A block in Unicode. ISO/IEC 8859-4 and
    > -10 are possible, but a natural solution is UTF-8.
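The encoding options named above can be checked with a short Python sketch. It assumes that the "6x2 letters" are the Northern Sámi letters missing from Latin-1 (Č, Đ, Ŋ, Š, Ŧ, Ž in both cases); the message itself does not list them, so SAMI_LETTERS and the helper can_encode are illustrative names only.

```python
# Assumption: these are the six upper/lower-case pairs the message refers to.
SAMI_LETTERS = "ČčĐđŊŋŠšŦŧŽž"


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Compare the current charset with the candidates mentioned in the message.
for codec in ("iso8859_1", "iso8859_4", "iso8859_10", "utf_8"):
    print(codec, can_encode(SAMI_LETTERS, codec))
```

Running it shows which of the candidate charsets can hold all twelve letters without changing anything else in the data.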
    >
    > Now, what will this cost?
    >
    > According to key personnel, this transition will require a transition
    > period of approx. 10 years, and a relatively high cost (politeness
    > towards the authors of the transition plans prevents me from quoting
    > numbers).
    >
    > Governmental experts see the following drawbacks with UTF-8:
    >
    > 1. The field length in the database will be longer than the display
    > field. So, given a surname "Årø", we will have a display length of 3
    > (letters), as compared to the database length of 5 bytes.
    > 2. There will have to be a new sorting routine, and a new search
    > routine.
    > 3. Programs may no longer search for characters as single bytes, but
    > must in some cases allow for searching for sequences of bytes.
    > 4. Many common programs only support 8-bit character sets.
    > 5. Data must be removed from registries, converted and replaced.
    > 6. Millions of lines of code must be changed and tested.
    > To me it seems most of these points are not real problems, but either
    > a description of the conversion process, or unfounded fear.
    >
    > My question to the list is this:
    >
    > a. How can the variable field length be a problem? The field must in
    > any case allow for longer names, e.g. the 9 letters of my name
    > (Trosterud) require 9 bytes, more than the 5 of Årø. Can there be
    > database solutions that generate database fields on the basis of the
    > number of characters? The opposite (view fields on the basis of bytes)
    > should be no problem, it will only give [Årø ] and [Trosterud].
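The distinction in question (a) between sizing a field by characters and sizing it by bytes can be sketched as follows. The helpers pad_for_display and fits_byte_field are hypothetical names for illustration, and the sketch counts Unicode code points, which matches the letter count for precomposed names such as these but not necessarily for text using combining marks.

```python
def pad_for_display(name: str, width: int) -> str:
    """Pad a name to a fixed number of characters (not bytes) for display."""
    return "[" + name.ljust(width) + "]"


def fits_byte_field(name: str, max_bytes: int) -> bool:
    """Check whether the UTF-8 form of a name fits a byte-sized field."""
    return len(name.encode("utf_8")) <= max_bytes


for name in ("Årø", "Trosterud"):
    # Display width is counted in characters, storage in UTF-8 bytes.
    print(pad_for_display(name, 9), len(name.encode("utf_8")))
```

A field declared as 9 bytes holds both names here, which is the point of the question: the storage limit and the display width are separate quantities, and only the former changes with the encoding.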
    >
    > b. Will it really be necessary to change millions of lines of code?
    > How can even old, badly written code require such changes?
    >
    > c. The problem with the discussion is that the experts within the
    > registries are presenting their conclusions, and not the premises
    > behind them. Politicians listening to them are thus lost. I am invited
    > to comment on the process, but it is not easy, as I get so little
    > information about it. So, what kind of information do I need in order
    > to evaluate these estimates?
    >
    > d. Other comments, or perhaps better: experiences from other
    > countries?
    >
    > Trond Trosterud.
    >
    >
    > ----------------------------------------------------------------------
    > Trond Trosterud t +47 7764 4763
    > Institutt for språkvitskap, Det humanistiske fakultet m +47 950 70140
    > N-9037 Universitetet i Tromsø, Noreg f +47 7764 5216
    > Trond.Trosterud (a) hum.uit.no http://www.hum.uit.no/a/trond/
    > ----------------------------------------------------------------------
    >
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Sun Jan 11 2009 - 11:51:17 CST