Re: UnicodeData.txt problem

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Dec 10 2005 - 00:09:10 CST

Next message: Richard Wordingham: "Re: UnicodeData.txt problem"

Previous message: Tom Emerson: "Re: UnicodeData.txt problem"
In reply to: Theo Veenker: "Re: UnicodeData.txt problem"
Next in thread: Richard Wordingham: "Re: UnicodeData.txt problem"
Reply: Richard Wordingham: "Re: UnicodeData.txt problem"

Theo Veenker <Theo dot Veenker at let dot uu dot nl> wrote:

>>>You could even create a double set of data files: a new reorganized
>>>set of data files, and a set for backwards compatibility (extracted
>>>from the new set).
>>
>>
>> Yep, but that creates another mountain of maintenance work for
>> each release.
>
> Yeah you're right. You have to type the "extract-old-format-files"
> command on each release.

That's my approach. I write a program, or (if I'm lucky) a line or two
of grep, to convert the file I get into the file I want. The effort of
writing the filter is expended just once. Well, not always... sometimes
I have to upgrade the filter to account for changes in the source file,
which I suspect is what Ken is talking about.

Case in point: Every 6 months I download an updated copy of the main
UN/LOCODE data file (see
http://www.unece.org/cefact/locode/service/main.htm). This is a
UN-sponsored database that lists a unique country-and-location code for
about 44,000 locations worldwide, including about 18,000 with
latitude/longitude coordinates. I have quite a few diverse uses for
this data.

My choices for download are space-delimited text, comma-quote-delimited
text, and Access. For various reasons I find the comma-quote version
the best for my needs, but what I'd really like for easiest parsing is a
tab- or bar-delimited file. Plus, the file contains two fields for the
name of the location, one with the name spelled "correctly" with
diacritical marks, and one smashed down to straight ASCII; I have no use
for the latter. So every time I download the file UNECE offers me,
whose format I do NOT have control over, I run a simple program that
converts it to the format I really want. Easy.
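
As an illustration only (this is not the program described above), a
filter of this kind might look like the following Python sketch. It
assumes the input is ordinary comma-quote-delimited text that the
standard csv module can parse. The function name convert, the column
position drop_index=3 for the ASCII-only name, and the sample rows are
all made up for the example; they are not the real UN/LOCODE layout or
data.

```python
import csv
import io

def convert(src, dst, drop_index, delimiter="\t"):
    """Copy comma-quote-delimited rows from src to dst as delimited text,
    leaving out the column at position drop_index."""
    for row in csv.reader(src):
        kept = [field for i, field in enumerate(row) if i != drop_index]
        dst.write(delimiter.join(kept) + "\n")

# Invented sample rows; column 3 plays the role of the ASCII-only name.
sample = io.StringIO(
    '"XX","AAA","Exämple Town","Example Town"\n'
    '"XX","BBB","Plain, with comma","Plain, with comma"\n'
)
out = io.StringIO()
convert(sample, out, drop_index=3)
print(out.getvalue(), end="")
```

Letting the csv module do the parsing means a comma inside a quoted
field, as in the second sample row, does not split the field.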

Now every once in a while, UNECE throws me a knuckleball. The latest
version includes a record with an embedded vertical bar where a dotted-z
was supposed to be (it's a long story), so clearly there are pitfalls in
blindly converting this file to bar-delimited! I had to build logic
into my conversion program to check for embedded vertical bars, and fix
the one that's already there (since I don't know how long it'll be
before they fix it at their end).
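
The embedded-delimiter check can be sketched roughly as below. This is
a hypothetical Python helper, not the actual logic from my conversion
program: the name to_bar_delimited, its parameters, and the sample
fields are assumptions for illustration. It refuses to write a record
whose fields contain a vertical bar unless the caller supplies a
replacement for it.

```python
def to_bar_delimited(fields, line_no, replacement=None):
    """Join fields with '|'. A field that already contains '|' is either
    repaired with the given replacement or reported as an error."""
    cleaned = []
    for field in fields:
        if "|" in field:
            if replacement is None:
                raise ValueError(f"line {line_no}: embedded '|' in {field!r}")
            field = field.replace("|", replacement)
        cleaned.append(field)
    return "|".join(cleaned)

# Invented record in which a '|' stands where a dotted z belongs.
fixed = to_bar_delimited(["XX", "AAA", "Exa|mple"], 1, replacement="ż")
print(fixed)
```

Raising an error by default means a new bad record stops the
conversion instead of silently producing a line with an extra field.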

So this regular conversion process isn't perfect, but it sure beats
trying to teach every program how to read a quirky format. I do the
same thing at work, with data files I need that come in a format I hate.

Bottom line: Any programmer who can write a program to interpret the
existing UnicodeData.txt can also write a program to massage that file
into a format more to their liking. Once you do that, YOU have control
over the format, and can add to it or modify it to suit your own work.
I suggest this approach, as opposed to trying to get UTC to change a
file format that they obviously don't want to change.
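
For UnicodeData.txt the same idea might look like this minimal Python
sketch. It relies only on the file being semicolon-delimited, with the
code point, character name, and General Category in the first three
fields. The function name reformat_line and the choice of which fields
to keep are assumptions for the example.

```python
def reformat_line(line, keep=(0, 1, 2), delimiter="\t"):
    """Split one UnicodeData.txt line on ';' and rejoin only the
    fields listed in keep, using the given delimiter."""
    fields = line.rstrip("\n").split(";")
    return delimiter.join(fields[i] for i in keep)

# The UnicodeData.txt entry for U+0041.
entry = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"
short = reformat_line(entry)
print(short)
```

Changing keep or delimiter is all it takes to get a different layout,
which is the point: the output format is yours to define.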

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/

