Re: UNIHAN.TXT

From: John Jenkins (jenkins@apple.com)
Date: Fri Apr 30 2004 - 13:22:43 EDT


    On Apr 30, 2004, at 1:12 AM, jameskass@att.net wrote:

    >
    > Like UNIHAN.TXT, brevity is not a feature of the following...
    >
    > Tabs... In addition to the points Mike made about the tab character
    > having different semantics depending on the application/platform, I
    > just don't think a control character like tab belongs in a *.TXT file
    > period.

    I'm sorry, but I still don't get this point. To say that a tab doesn't
    belong in a plain text file makes as much sense to me as saying that a
    carriage return or a line feed doesn't belong in a plain text file.

    > Although
    > UNIHAN.TXT is referred to as a database, it isn't.

    Well, I guess we're going to have to figure out what we mean by
    "database".

    > Rather, it's the raw
    > material for a database offered in plain-text form.

    True.

    > Still, tabs are arguably
    > OK. It's easy enough to strip them out when they're not wanted. (I'd
    > rather deal with tabs in a text file which is to be imported into a
    > database than ASCII quotes.)
    >
    > Unix -vs- DOS... I'll stick with the tools I've been using for a
    > quarter century and their descendants, thanks just the same. With
    > respect to the idea that a text editor is not the proper tool with
    > which to open a *.TXT file, well...
    >

    It's not that a text editor isn't the proper tool; it's that some text
    editors barf when they encounter files that are too big or that don't
    follow a certain set of line break conventions.

    > Trivial -vs- non-trivial... Once the raw data has been imported into
    > a database, it's trivial to massage or manipulate it. It's easy enough
    > to generate a CSV file from a database application, and I've done so.
    > But, the only reason that I wanted it in CSV in the first place was to
    > make it easy to import the data into the database application. This
    > was *not* trivial to do; it involved a lot of coding and counting, and
    > a bit of trial-and-error with various field lengths. Still, the task
    > managed to keep me quiet for a few days...
    >

    Perl is your friend. It would be easy to write a Perl script to do the
    job of converting the existing file to CSV. That would be better to
    post than a duplicate of the .txt file, anyway, because it would be
    longer-lived and smaller.
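
    A minimal sketch of such a conversion, written here in Python rather
    than Perl. It assumes the file consists of "#" comment lines plus data
    lines of the form code point, tab, field name, tab, value, and that
    the wanted CSV has one row per code point and one column per field.
    The function names and the sample values are illustrative only, not
    part of any posted tool.

```python
import csv
import io


def read_unihan(lines):
    """Group tab-separated lines into one dict of fields per code point."""
    records = {}          # code point -> {field name: value}
    field_names = set()   # every field name seen, for the CSV header
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the first two tabs only, so the value keeps any later tabs.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # not a data line in the assumed format
        code_point, field, value = parts
        records.setdefault(code_point, {})[field] = value
        field_names.add(field)
    return records, sorted(field_names)


def write_csv(records, field_names, out):
    """Write one CSV row per code point; missing fields become empty cells."""
    writer = csv.DictWriter(
        out, fieldnames=["code_point"] + field_names, restval=""
    )
    writer.writeheader()
    for code_point, fields in records.items():
        writer.writerow({"code_point": code_point, **fields})


if __name__ == "__main__":
    # Illustrative sample lines, not copied from the real file.
    sample = [
        "# sample comment line\n",
        "U+4E00\tkMandarin\tYI1\n",
        "U+4E00\tkDefinition\tone; a, an\n",
        "U+4E8C\tkMandarin\tER4\n",
    ]
    records, names = read_unihan(sample)
    buffer = io.StringIO()
    write_csv(records, names, buffer)
    print(buffer.getvalue())
```

    The csv module quotes any value containing a comma, which avoids the
    hand-counting of field lengths described above. Note that this sketch
    holds every record in memory before writing.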

    > With a CSV file, importing data from a text file into a database file
    > simply involves a single line command in the interactive mode (once
    > the database file structure has been established). This is true for
    > dBASE, FoxPro, and related database applications.
    >

    But not, apparently, MySQL, which is what we use to maintain the Unihan
    database.

    > But, if you wanted to modify only one field, it's more efficient to
    > skip through 71098 records reading and modifying only the appropriate
    > field in the record than to go skipping through all 1063127. Easier to
    > program, too.

    No, not really. It depends on your programming tools. Personally, I
    find it much easier to write programs that process the file as-is than
    it would be if the file had a more CSV-like syntax. (Or XML, or
    whatever.)
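
    As one illustration of processing the file as-is, the sketch below (in
    Python) rewrites the value of a single field while streaming through
    the lines, leaving everything else untouched. It assumes the same
    layout of code point, tab, field name, tab, value; the function name
    and the sample values are hypothetical.

```python
def rewrite_field(lines, target_field, transform):
    """Yield lines unchanged, except that values of target_field are transformed."""
    for line in lines:
        # Separate the line ending so it can be restored exactly.
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        # Comments and anything without a tab pass straight through.
        if body.startswith("#") or "\t" not in body:
            yield line
            continue
        parts = body.split("\t", 2)
        if len(parts) == 3 and parts[1] == target_field:
            parts[2] = transform(parts[2])
            yield "\t".join(parts) + ending
        else:
            yield line


if __name__ == "__main__":
    # Illustrative sample lines, not copied from the real file.
    sample = [
        "# sample comment line\n",
        "U+4E00\tkMandarin\tYI1\n",
        "U+4E00\tkDefinition\tone\n",
    ]
    for out_line in rewrite_field(sample, "kMandarin", str.lower):
        print(out_line, end="")
```

    Because each line names its own field, the loop needs no record
    structure or column positions and keeps only one line in memory at a
    time.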

    IOW, there's no way we can maximize ease-of-use for everybody. No
    matter what format we pick, somebody's going to be inconvenienced by
    it.

    > (Suppose you were a purist who wanted to see Stimson's pronunciations
    > using the actual characters that Stimson used?

    You can use the next edition of the file; we're switching over. :-)

    > John Jenkins wrote,
    >
    >> Unfortunately, nobody's come up with a good strategy for migrating to
    >> something else.
    >
    > I could send you the CSV file for posting, if you think anyone else
    > would want it.
    >

    In this case, just having the CSV file doesn't really help. The
    problem with migration is that anything that depends on the current
    format will break if we switch formats. It's easier IMHO to make
    available techniques for people to massage the data into alternate
    forms if they really want it that way.

    ========
    John H. Jenkins
    jenkins@apple.com
    jhjenkins@mac.com
    http://homepage.mac.com/jhjenkins/


