RE: Cost of transition to UTF-8 for central census authorities

Date: Mon Jan 12 2009 - 09:45:45 CST


    Major database management systems support UTF-8 and/or UTF-16, not UTF-32.
    Java's external string representation is UTF-8 in most cases, and XML defaults to UTF-8.
    In essence, one needs to study the technology stack involved to pick the best Unicode implementation.

    Konstantin Tadenev

    -----Original Message-----
    From: [] On Behalf Of Adam Twardoch
    Sent: Sunday, January 11, 2009 11:01 AM
    To: Trond Trosterud; Unicode List
    Subject: Re: Cost of transition to UTF-8 for central census authorities

    Trond Trosterud wrote:
    > 1. The field length in the database will be longer than the display
    > field. So, given a surname "Årø", we will have a display length of 3
    > (letters), as compared to the database length of 5 bytes.
    > 2. There will have to be a new sorting routine, and a new search routine
    > 3. Programs may no longer search for characters as single bytes, but
    > must in some cases allow for searching for a sequence of bytes.
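
    Points 1 and 3 can be checked with a few lines of Python. This is only
    a sketch: it assumes the surname is stored in precomposed form, and the
    variable names are illustrative rather than taken from any registry
    system.

```python
# The surname from the example, written with escapes so that the
# precomposed letters U+00C5 and U+00F8 are unambiguous.
surname = "\u00c5r\u00f8"  # "Årø"

# What a display field counts: code points.
display_length = len(surname)

# What a database column stores: bytes, which depend on the encoding.
latin1_data = surname.encode("iso-8859-1")
utf8_data = surname.encode("utf-8")

# A legacy routine that looks for "ø" as the single byte 0xF8.
found_in_latin1 = b"\xf8" in latin1_data
found_in_utf8 = b"\xf8" in utf8_data

# A UTF-8-aware routine searches for the encoded byte sequence instead.
found_sequence = "\u00f8".encode("utf-8") in utf8_data

print(display_length, len(latin1_data), len(utf8_data))   # 3 3 5
print(found_in_latin1, found_in_utf8, found_sequence)     # True False True
```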

    All of the above can be solved by using UTF-32 rather than UTF-8. Sure,
    the size of the data will grow 4x but the software will be "easier". Or
    at least, the software should be migrated to use UTF-32 (i.e.
    scalar-based Unicode) *internally* and convert from UTF-8 as early as
    possible, and convert to UTF-8 as late as possible.
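
    A minimal sketch of the "convert early, convert late" pattern, in
    Python. The function names and the truncation rule are hypothetical.
    Python's str is used here as a stand-in for a scalar-based internal
    form: it is indexed by code point, although it is not literally stored
    as UTF-32.

```python
def decode_input(raw):
    """Convert incoming UTF-8 bytes to a str as early as possible."""
    return raw.decode("utf-8")


def fit_to_field(name, width):
    """Truncate to at most `width` code points (hypothetical display rule)."""
    return name[:width]


def encode_output(text):
    """Convert back to UTF-8 bytes as late as possible."""
    return text.encode("utf-8")


raw = b"\xc3\x85r\xc3\xb8"      # "Årø" as UTF-8 bytes
name = decode_input(raw)
short = fit_to_field(name, 2)    # slices between code points, never inside one
stored = encode_output(short)
```

    Slicing the raw bytes instead (for example raw[:1]) would cut the
    first letter in half. Slicing by code point avoids that, though it can
    still separate a base letter from a following combining mark.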

    The advantage of using UTF-32 in the "new" storage rather than UTF-8 is
    that with UTF-8, it is relatively easy to confuse (for either software
    or human) whether the data is actually UTF-8 or still ISO 8859-1. With
    UTF-32, it is much more obvious and striking. I believe debugging code
    that deals with UTF-32 is much easier than debugging code that deals
    with UTF-8.
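
    The confusion described above can be sketched as follows, in Python.
    The helper is_valid_utf8 is a name made up for this illustration, and
    "Hansen" is just an arbitrary ASCII-only surname.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


surname = "\u00c5r\u00f8"  # "Årø"
utf8_data = surname.encode("utf-8")
latin1_data = surname.encode("iso-8859-1")

# UTF-8 bytes read as ISO 8859-1 decode without any error; the result is
# silently wrong text (five characters instead of three).
misread = utf8_data.decode("iso-8859-1")

# ISO 8859-1 bytes read as UTF-8 happen to be rejected for this name.
latin1_passes_as_utf8 = is_valid_utf8(latin1_data)

# For ASCII-only names the two encodings are byte-for-byte identical,
# so nothing distinguishes old data from new data.
ascii_same = "Hansen".encode("utf-8") == "Hansen".encode("iso-8859-1")

# UTF-32 looks different even for ASCII: every letter carries zero bytes.
utf32_data = "Hansen".encode("utf-32-le")
```

    Here misread has length 5, latin1_passes_as_utf8 is False, ascii_same
    is True, and utf32_data is 24 bytes long for a six-letter name.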

    For example, I've recently dealt with custom UTF-8 software solutions
    and at some point I discovered that very rarely, problems were creeping
    in because the scalar-to-UTF-8 conversion only worked well for BMP
    scalar values.
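
    For illustration, a scalar-to-UTF-8 conversion that covers the whole
    Unicode range might look like the Python sketch below. It is not a
    reconstruction of the software mentioned above, whose details are not
    given here; encode_scalar_utf8 is a hypothetical name. The last branch
    is the one a BMP-only converter would lack.

```python
def encode_scalar_utf8(cp):
    """Encode one Unicode scalar value as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:        # 1 byte: ASCII
        return bytes([cp])
    if cp < 0x800:       # 2 bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: rest of the BMP
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: supplementary planes, U+10000 to U+10FFFF
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against Python's built-in encoder, including non-BMP values.
samples = [0x72, 0xC5, 0x20AC, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF]
mismatches = [cp for cp in samples
              if encode_scalar_utf8(cp) != chr(cp).encode("utf-8")]
```

    A test set that includes values at and above U+10000 is what would
    expose a rarely triggered problem of this kind.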

    > c. The problem with the discussion is that the experts within the
    > registries are presenting their conclusions, and not the premises behind
    > them. Politicians listening to them are thus lost. I am invited in to
    > comment on the process, but it is not easy, as I get so little information
    > about the process. So, what kind of information is it that I need to
    > evaluate these estimates?

    I would illustrate the particular issue mentioned above this way:

    Moving from ISO 8859-1 to UTF-8 is like changing the official color of a
    flag from pine green to shamrock green.

    Moving from ISO 8859-1 to UTF-32 is like changing the official color of
    a flag from pine green to dark blue.

    The first approach can be done gradually, so the cost can be spread
    over several years, but during the process it's very difficult to tell the
    old flags and the new flags apart, and you run the risk of using the old
    flag rather than the new flag on an official occasion, which would be
    embarrassing. It may happen that some people won't be able to see the
    difference, so they'll need to consult an expert, and it may even happen
    that some stuff will be replaced twice or changed back and forth because
    of the confusion.

    The second approach needs longer preparation and more intense financing
    in that phase, but then the switch can be done in a more decisive way,
    and after the switch it's very easy to spot anything out of the
    ordinary. So even an average office clerk will be able to tell early
    that something's wrong.

    Hope this helps,

    Adam Twardoch
    | Language Typography Unicode Fonts OpenType
    I hate to advocate drugs, alcohol, violence, or
    insanity to anyone, but they've always worked for me.
    (Hunter S. Thompson)
