Re: Unihan.txt and the four dictionary sorting algorithm

From: John Jenkins (jenkins@apple.com)
Date: Tue Apr 20 2004 - 21:39:49 EDT


    On Apr 20, 2004, at 5:11 PM, jameskass@att.net wrote:

    > The DOS editor chokes on such a large text file, so does my older hex
    > editor. Thank goodness for BabelPad, otherwise it would've been hard
    > to insert proper (for my system) line breaks into the file.
    >

    BBEdit on the Mac tends to be unhappy with it, too.

    > The tab "character" is used in the file. Arguably, this "character"
    > should never appear in a plain text file, rather it should be
    > converted to an appropriate number of U+0020 characters by the
    > application on save.
    > Of course, this would make the file even bigger.
    >

    Tab-separated data files are quite common. (Indeed, I tend to get
    annoyed with the main UCD file because it's semicolon-separated.) I'm
    not sure why you'd want a tab never to appear in a plain-text file.

    > Instead of (for instance) "KUA4", why not "KUA⁴"?
    >

    I think your text got garbled here, but in any event, you've replaced
    one four-character word with another one. :-)

    Realistically, the earliest versions of the Unihan.txt file predate the
    ability to safely exchange or use anything other than ASCII. Our
    Mandarin romanization dates back to those days.

    Now that UTF-8 support is relatively common, we're moving more and more
    data in the file to non-ASCII form.
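
    For anyone who wants the superscript form in their own copy of the
    data, a minimal sketch follows. It assumes the tone is written as
    trailing ASCII digits, as in "KUA4"; the function name is made up for
    illustration and is not part of any Unihan tooling.

```python
# Map ASCII digits to their Unicode superscript counterparts.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")

def superscript_tone(syllable):
    """Rewrite trailing tone digits as superscripts (hypothetical helper)."""
    head = syllable.rstrip("0123456789")   # romanized letters
    tone = syllable[len(head):]            # trailing digit(s), possibly empty
    return head + tone.translate(SUPERSCRIPT_DIGITS)

print(superscript_tone("KUA4"))
```

    The result only displays correctly where the output encoding and font
    cover the superscript digits, which is the same constraint that kept
    the file ASCII-only in the first place.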

    > Much of the text in UNIHAN.TXT is redundant, the hex character
    > is repeated along with each field name over and over again.
    >
    > Putting the hex character at the beginning of each line, with one
    > character per line and CSVs would make UNIHAN.TXT *much* smaller.
    > Of course, commas would have to be removed from the definition
    > fields. (Hmmm, maybe definition field commas could be replaced
    > with MIDDLE DOT?)
    >

    Hmm. Interesting suggestion.

    OTOH, the current format lends itself nicely to use with some
    utilities, like the Unix grep command.

    Fundamentally, any format we select would be nice in some situations
    and not so nice in others.
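
    To make the trade-off concrete, here is a rough sketch of regrouping
    the file into one record per character. It assumes each data line has
    the form code point, tab, field name, tab, value, and that comment
    lines start with "#". The function name and the sample values are
    illustrative only.

```python
from collections import defaultdict

def group_by_code_point(lines):
    """Collect 'U+XXXX<TAB>field<TAB>value' lines into one dict per character."""
    records = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first two tabs only, so the value is kept whole.
        code_point, field, value = line.split("\t", 2)
        records[code_point][field] = value
    return dict(records)

# Illustrative sample in the assumed layout, not an excerpt to rely on.
sample = [
    "# sample lines",
    "U+4E00\tkMandarin\tYI1",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E8C\tkMandarin\tER4",
]
records = group_by_code_point(sample)
for code_point, fields in records.items():
    print(code_point, fields)
```

    The grouped form is smaller once written out, but a plain grep for a
    field name no longer returns the code point and the value on the same
    line, which is the convenience the current layout provides.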

    > But, changing the format of the file might make it harder for some
    > users to find the data they seek. So, I'm not necessarily proposing
    > any change, but rather pointing out that alternatives exist.
    >

    That's the *real* problem. Goodness knows the current format has real
    problems, and brevity is not among its virtues. (OTOH, the format it
    replaces was brief to the point of being incomprehensible.)
    Unfortunately, nobody's come up with a good strategy for migrating to
    something else.

    (Which is why we're stuck with a misspelling in one of the field names.)

    And, of course, you're perfectly free to massage the data as suits your
    own purposes. My Unihan lookup tool for Mac OS X converts it all to
    XML, for instance.
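
    As a sketch of what such a conversion could look like (this is not
    the format the lookup tool actually uses; the element and attribute
    names "unihan", "char", "cp", "field" and "name" are invented for the
    example), assuming the same tab-separated input lines:

```python
import xml.etree.ElementTree as ET

def unihan_to_xml(lines):
    """Build an XML tree with one <char> element per code point (hypothetical schema)."""
    root = ET.Element("unihan")
    chars = {}  # code point -> its <char> element
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, value = line.split("\t", 2)
        char = chars.get(code_point)
        if char is None:
            char = ET.SubElement(root, "char", cp=code_point)
            chars[code_point] = char
        ET.SubElement(char, "field", name=field).text = value
    return root

# Illustrative sample lines, not an excerpt to rely on.
sample = [
    "U+4E00\tkMandarin\tYI1",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E8C\tkMandarin\tER4",
]
xml_text = ET.tostring(unihan_to_xml(sample), encoding="unicode")
print(xml_text)
```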

    > In spite of its unwieldy size, UNIHAN.TXT is a useful tool and I'm
    > grateful for its existence.
    >

    Thanks.

    ========
    John H. Jenkins
    jenkins@apple.com
    jhjenkins@mac.com
    http://homepage.mac.com/jhjenkins/


