Re: UnicodeData.txt problem

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Dec 10 2005 - 00:09:10 CST


    Theo Veenker <Theo dot Veenker at let dot uu dot nl> wrote:

    >>>You could even create a double set of data files: a new reorganized
    >>>set of data files, and a set for backwards compatibility (extracted
    >>>from the new set).
    >>
    >>
    >> Yep, but that creates another mountain of maintenance work for
    >> each release.
    >
    > Yeah you're right. You have to type the "extract-old-format-files"
    > command on each release.

    That's my approach. I write a program, or (if I'm lucky) a line or two
    of grep, to convert the file I get into the file I want. The effort of
    writing the filter is expended just once. Well, not always... sometimes
    I have to upgrade the filter to account for changes in the source file,
    which I suspect is what Ken is talking about.

    Case in point: Every 6 months I download an updated copy of the main
    UN/LOCODE data file (see
    http://www.unece.org/cefact/locode/service/main.htm). This is a
    UN-sponsored database that lists a unique country-and-location code for
    about 44,000 locations worldwide, including about 18,000 with
    latitude/longitude coordinates. I have quite a few diverse uses for
    this data.

    My choices for download are space-delimited text, comma-quote-delimited
    text, and Access. For various reasons I find the comma-quote version
    the best for my needs, but what I'd really like for easiest parsing is a
    tab- or bar-delimited file. Plus, the file contains two fields for the
    name of the location, one with the name spelled "correctly" with
    diacritical marks, and one smashed down to straight ASCII; I have no use
    for the latter. So every time I download the file UNECE offers me,
    whose format I do NOT have control over, I run a simple program that
    converts it to the format I really want. Easy.
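
    A minimal sketch of what such a conversion filter might look like,
    not the actual program described above. It reads comma-quote-delimited
    rows, drops one column, and writes bar-delimited lines. The function
    name convert, the column index, and the sample rows are all invented
    for illustration; the real UN/LOCODE file has a different layout.

```python
import csv
import io


def convert(src, dst, drop_index, delimiter="|"):
    """Copy comma-quote-delimited rows from src to dst as delimited lines.

    drop_index is the (hypothetical) position of the column to discard,
    e.g. the ASCII-only copy of the location name.
    """
    for row in csv.reader(src):
        kept = [field for i, field in enumerate(row) if i != drop_index]
        dst.write(delimiter.join(kept) + "\n")


# Invented sample rows: country, code, name with diacritics, ASCII name.
sample = io.StringIO(
    '"XX","AAA","Exämple","Example"\n'
    '"XX","BBB","Sample, Town","Sample, Town"\n'
)
out = io.StringIO()
convert(sample, out, drop_index=3)
print(out.getvalue(), end="")
```

    Using a CSV parser rather than splitting on commas means a quoted
    field that itself contains a comma survives the conversion intact.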

    Now every once in a while, UNECE throws me a knuckleball. The latest
    version includes a record with an embedded vertical bar where a dotted-z
    was supposed to be (it's a long story), so clearly there are pitfalls in
    blindly converting this file to bar-delimited! I had to build logic
    into my conversion program to check for embedded vertical bars, and fix
    the one that's already there (since I don't know how long it'll be
    before they fix it at their end).
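
    The kind of check described here could be sketched as follows. This
    is an assumption about how one might do it, not the logic of the
    actual program: known bad values are patched from a lookup table, and
    any other field that still contains the output delimiter is reported
    instead of being written out silently. The table entry and the names
    KNOWN_FIXES and clean_row are invented for illustration.

```python
# Hypothetical table of known-bad field values and their corrections.
KNOWN_FIXES = {"Pla|a": "Plaża"}


def clean_row(row, delimiter="|", known_fixes=KNOWN_FIXES):
    """Return row with known errors patched; refuse embedded delimiters."""
    cleaned = []
    for field in row:
        field = known_fixes.get(field, field)
        if delimiter in field:
            # An unrecognized embedded delimiter would corrupt the output,
            # so stop and let a human look at it.
            raise ValueError(f"embedded {delimiter!r} in field {field!r}")
        cleaned.append(field)
    return cleaned


print(clean_row(["XX", "Pla|a"]))
```

    Failing loudly on an unrecognized embedded bar is a design choice: a
    new upstream error then gets noticed at conversion time rather than
    in every program that later reads the converted file.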

    So this regular conversion process isn't perfect, but it sure beats
    trying to teach every program how to read a quirky format. I do the
    same thing at work, with data files I need that come in a format I hate.

    Bottom line: Any programmer who can write a program to interpret the
    existing UnicodeData.txt can also write a program to massage that file
    into a format more to their liking. Once you do that, YOU have control
    over the format, and can add to it or modify it to suit your own work.
    I suggest this approach, as opposed to trying to get UTC to change a
    file format that they obviously don't want to change.
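
    As an illustration of the point, here is a short sketch that massages
    UnicodeData.txt lines (semicolon-delimited fields) into a tab-delimited
    form that keeps only the code point, character name, and general
    category. The choice of fields, the output delimiter, and the function
    name massage are assumptions made for the example; anyone's preferred
    format will differ.

```python
def massage(lines, keep=(0, 1, 2), delimiter="\t"):
    """Yield one delimited line per UnicodeData.txt record.

    keep lists the zero-based indices of the semicolon-separated fields
    to retain (0 = code point, 1 = name, 2 = general category).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(";")
        yield delimiter.join(fields[i] for i in keep)


# One record in UnicodeData.txt layout, used here as sample input.
sample = ["0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"]
converted = list(massage(sample))
print(converted[0])
```

    In practice the lines argument would be an open file object for a
    local copy of UnicodeData.txt, since iterating over a file yields its
    lines in the same way.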

    --
    Doug Ewell
    Fullerton, California, USA
    http://users.adelphia.net/~dewell/
    

