UNIHAN.TXT

From: jameskass@att.net
Date: Fri Apr 30 2004 - 03:12:29 EDT

    As with UNIHAN.TXT, brevity is not a feature of the following...

    Tabs... In addition to the points Mike made about the tab character having
    different semantics depending on the application/platform, I just don't
    think a control character like tab belongs in a *.TXT file, period. Although
    UNIHAN.TXT is referred to as a database, it isn't. Rather, it's the raw
    material for a database offered in plain-text form. Still, tabs are arguably
    OK. It's easy enough to strip them out when they're not wanted. (I'd
    rather deal with tabs in a text file which is to be imported into a database
    than ASCII quotes.)

    Unix -vs- DOS... I'll stick with the tools I've been using for a quarter century
    and their descendants, thanks just the same. With respect to the idea that a
    text editor is not the proper tool with which to open a *.TXT file, well...

    Trivial -vs- non-trivial... Once the raw data has been imported into a database,
    it's trivial to massage or manipulate it. It's easy enough to generate a CSV
    file from a database application, and I've done so. But, the only reason that
    I wanted it in CSV in the first place was to make it easy to import the data
    into the database application. This was *not* trivial to do; it involved a lot
    of coding and counting, and a bit of trial-and-error with various field lengths.
    Still, the task managed to keep me quiet for a few days...

    With a CSV file, importing the data into a database file simply
    involves a single-line command in interactive mode (once the database
    file structure has been established). This is true for dBASE, FoxPro, and
    related database applications.
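
    The dBASE/FoxPro command itself is not reproduced here. As a stand-in
    only, the following minimal sketch uses Python's csv and sqlite3 modules
    to show the same idea: once the table structure exists, loading the CSV
    is a single statement. The sample rows, the table name "unihan" and the
    column name "codepoint" are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical two-character sample; the column names are assumptions.
SAMPLE_CSV = (
    "codepoint,kDefinition,kMandarin\n"
    'U+4E00,"one; a, an; alone",YI1\n'
    "U+4E01,,DING1\n"
)

def import_csv(conn, csv_text):
    """Create a table from the CSV header and load every row into it."""
    rows = csv.reader(io.StringIO(csv_text))
    header = next(rows)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f"CREATE TABLE unihan ({columns})")
    placeholders = ", ".join("?" for _ in header)
    # Once the structure exists, the import itself is this one statement.
    conn.executemany(f"INSERT INTO unihan VALUES ({placeholders})", rows)
    return conn

conn = import_csv(sqlite3.connect(":memory:"), SAMPLE_CSV)
```

    The empty kDefinition cell for U+4E01 is stored as an empty string, which
    is how an absent field appears in a one-row-per-character layout.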

    Of course, the same kind of single-line command can be (and was) used to
    import the data from the UNIHAN.TXT file into a database, but this
    produces a huge database file [266844944 bytes] which *still* does not
    have proper fields. It still has one record/one field just like the original
    UNIHAN.TXT file. That means that, to get the information for a certain
    field of a certain character, you have to go skipping through all 1063127
    records checking each one, rather than the mere 71098 records that the
    database actually requires. (Of course, you'd use an index file rather than
    skipping through all those records in either case.)
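
    The difference between the two layouts can be sketched in Python. This is
    not the code that was actually used for the conversion; it is a minimal
    illustration that assumes each data line of UNIHAN.TXT has the form
    code point, tab, field name, tab, value, and that comment lines start
    with "#". The sample lines and the function names pivot and to_csv are
    hypothetical.

```python
import csv
import io

# Hypothetical excerpt: code point, tab, field name, tab, value.
SAMPLE = (
    "# comment lines start with '#'\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E00\tkMandarin\tYI1\n"
    "U+4E01\tkMandarin\tDING1\n"
)

def pivot(lines):
    """Group one-field-per-line input into one dict per code point."""
    records = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        records.setdefault(codepoint, {})[field] = value
    return records

def to_csv(records):
    """Write one CSV row per character; absent fields become empty cells."""
    fields = sorted({name for rec in records.values() for name in rec})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["codepoint"] + fields)
    for codepoint in sorted(records):  # plain string order, enough for a sketch
        rec = records[codepoint]
        writer.writerow([codepoint] + [rec.get(name, "") for name in fields])
    return out.getvalue()

records = pivot(SAMPLE.splitlines())
csv_text = to_csv(records)
```

    Three input lines collapse into two records here, one per character, and
    the csv module quotes the definition because it contains a comma.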

    But, if you wanted to modify only one field, it's more efficient to skip
    through 71098 records reading and modifying only the appropriate field
    in the record than to go skipping through all 1063127. Easier to program, too.
    (Suppose you were a purist who wanted to see Stimson's pronunciations using
    the actual characters that Stimson used? Or, say you wanted pronunciations
    in lower case rather than upper case and preferred that the tone marks be
    superscripted? Hmmm, maybe you'd want those Japanese pronunciations
    in kana instead of romaji...)
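
    As a rough illustration of the lower-case/superscript case, here is a
    minimal Python sketch that touches one field in each per-character record
    and leaves the rest alone. It assumes readings are stored as upper-case
    syllables with a trailing ASCII tone digit (for example "YI1"), separated
    by spaces; the sample records and the names restyle and restyle_field are
    hypothetical.

```python
# Map ASCII tone digits to Unicode superscript digits.
SUPERSCRIPT = str.maketrans("012345", "⁰¹²³⁴⁵")

def restyle(reading):
    """Lower-case a reading such as 'YI1' and superscript its tone digit."""
    return reading.lower().translate(SUPERSCRIPT)

def restyle_field(records, field):
    """Rewrite one field in every record that has it; return how many changed."""
    changed = 0
    for rec in records.values():
        if field in rec:
            rec[field] = " ".join(restyle(r) for r in rec[field].split())
            changed += 1
    return changed

# Hypothetical per-character records, keyed by code point.
sample = {
    "U+4E00": {"kMandarin": "YI1", "kDefinition": "one; a, an; alone"},
    "U+4E01": {"kMandarin": "DING1 ZHENG1"},
}
restyle_field(sample, "kMandarin")
```

    Only the named field is read and modified; every other field in the
    record is left as it was.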

    So, UNIHAN.TXT is 27592561 bytes, but the CSV text file is 13384544 bytes.
    Zipped, UNIHCSV.ZIP is 3477887 bytes. (The CSV file lacks the initial 802 lines
    of comments in the source UNIHAN.TXT file.)

    That only cut the size roughly in half, not as great a saving as I'd
    imagined. This is because many of the "fields" in the source UNIHAN.TXT
    are actually empty, and thus don't occupy a line in the file, while empty
    fields in the CSV file still require a single byte for that comma.
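
    The figures quoted above can be checked with a few lines of Python. The
    numbers are the ones given in this message; the variable names and the
    helper empty_row_commas are illustrative only. The lines-per-character
    value is an upper estimate if the 1063127 count includes comment lines.

```python
UNIHAN_TXT_BYTES = 27592561  # size quoted for UNIHAN.TXT
CSV_BYTES = 13384544         # size quoted for the CSV file
ZIP_BYTES = 3477887          # size quoted for UNIHCSV.ZIP
LINE_RECORDS = 1063127       # one record per line of the source file
CHAR_RECORDS = 71098         # one record per character

csv_ratio = CSV_BYTES / UNIHAN_TXT_BYTES      # CSV size relative to the source
zip_ratio = ZIP_BYTES / UNIHAN_TXT_BYTES      # zipped CSV relative to the source
lines_per_char = LINE_RECORDS / CHAR_RECORDS  # average source lines per character

def empty_row_commas(n_fields):
    """Commas a CSV row still needs even when every one of its fields is empty."""
    return max(n_fields - 1, 0)

print(f"CSV is {csv_ratio:.1%} of UNIHAN.TXT, zipped CSV is {zip_ratio:.1%}")
print(f"about {lines_per_char:.1f} source lines per character")
```

    The CSV comes to a little under half the size of the source, and each
    character averages just under 15 source lines, so a row with many more
    columns than that spends most of its commas on empty fields.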

    D. Starner wrote,

    > Because it's a data file, and it's easier to process without all that HTML
    > junk to discard.

    Right on!

    John Jenkins wrote,

    > Now that UTF-8 support is relatively common, we're moving more and more
    > data in the file to non-ASCII form.

    It is a delight to observe this happening already.

    >> But, changing the format of the file might make it harder for some
    >> users to find the data they seek. So, I'm not necessarily proposing
    >> any change, but rather pointing out that alternatives exist.
    >>
    > That's the *real* problem. Goodness knows the current format has real
    > problems, and brevity is not among its virtues. (OTOH, the format it
    > replaces was brief to the point of being incomprehensible.)
    > Unfortunately, nobody's come up with a good strategy for migrating to
    > something else.

    I could send you the CSV file for posting, if you think anyone else would
    want it.

    Doug Ewell wrote,

    > And as John said, converting LF to CRLF is quite a simple task -- it can
    > even be done by your FTP client, while downloading the file -- and
    > should not be thought of as a deficiency in the current plain-text
    > format.

    Right. It's not a deficiency, it simply adds one more step to a multi-step
    process for some of us.
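
    For anyone who would rather do that extra step locally than in the FTP
    client, a minimal Python sketch follows. The function names and the file
    names in the usage note are assumptions; the conversion works on bytes so
    that the UTF-8 content of the file is left untouched.

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CRLF without doubling existing CRLFs."""
    # Normalise any CRLF to LF first, then expand every LF to CRLF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as bytes and write a CRLF-terminated copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(lf_to_crlf(data))
```

    A call such as convert_file("Unihan.txt", "UNIHAN.TXT") would write the
    converted copy; both paths are placeholders.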

    Benjamin Peterson wrote,

    > Wow -- I'd hate to see your idea of a non-trivial solution!

    Me too!

    Edward H. Trager wrote,

    > People tend to use what they know best, ...

    Exactly.

    > Absolutely. The existence of Cygwin makes work on Windows much more tolerable,
    > especially since Cygwin provides the OpenSSH client, XFree86, Perl,
    > console vim, egrep, etc. However, I still haven't figured out how to display
    > a UTF-8 file with non-latin characters in the Cygwin bash shell (on Win2K). As
    > far as I know, this shell really just sits on top of a DOS shell. And
    > as far as I can tell, "chcp 65001" still doesn't let you see, for example,
    > CJK characters in the terminal. I don't think it is possible. Since I also
    > can't figure out how to see non-latin characters in the graphical
    > version of vim (Gvim 6.2) on Windows, I rest my case that Windows is annoying.

    To see non-Latin characters in the DOS window of Windows, it's
    necessary to install a "console font" covering the characters, and then
    activate (or enable) that font for the "console window". Everson Mono
    Terminal should work fine for non-Han characters which don't require
    complex shaping.

    Best regards,

    James Kass



    This archive was generated by hypermail 2.1.5 : Fri Apr 30 2004 - 03:52:52 EDT