Re: Unihan.txt and the four dictionary sorting algorithm

From: jameskass@att.net
Date: Tue Apr 20 2004 - 19:11:29 EDT


    Raymond Mercier wrote,

    > John Jenkins writes
    > >>Also, even though the full Unihan database is 25+ Mb in size, given the
    > cheapness of disk space nowadays, it's not all *that* big, surely.
    > <<
    >
    > The problem of the size of Unihan has nothing at all to do with the cost of
    > storage, and everything to do with the functioning of programs that might
    > open and read it.
    > Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D, this
    > means that when opened in notepad the lines are not separated. Notepad does
    > have the advantage that the UTF-8 encoding is recognized, and the characters
    > are displayed.

    UNIHAN.TXT isn't going to get any smaller by itself. The trend indicates
    that it will just keep on growing, even if VS characters are used with CJK.

    The DOS editor chokes on such a large text file, and so does my older hex
    editor. Thank goodness for BabelPad; otherwise it would have been hard
    to insert proper (for my system) line breaks into the file.
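
For anyone without an editor that can open the file, the line breaks can also be rewritten by a short script. This is only a minimal sketch: the function name `lf_to_crlf` and both path arguments are illustrative assumptions, and it reads the file one line at a time so the whole 25+ MB never has to sit in memory at once.

```python
def lf_to_crlf(src_path, dst_path):
    """Copy src_path to dst_path, turning bare LF line endings into CR LF.

    Lines that already end in CR LF are copied unchanged.
    """
    # Binary mode keeps the UTF-8 bytes untouched and disables any
    # newline translation by Python itself.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # iterating a binary file splits on b"\n"
            if line.endswith(b"\n") and not line.endswith(b"\r\n"):
                line = line[:-1] + b"\r\n"
            dst.write(line)
```

A call might look like `lf_to_crlf("Unihan.txt", "Unihan-crlf.txt")`, where both file names are placeholders.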

    The tab "character" is used in the file. Arguably, this "character" should
    never appear in a plain text file; rather, it should be converted to an
    appropriate number of U+0020 characters by the application on save.
    Of course, this would make the file even bigger.
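
To put a rough number on that growth, here is a small sketch using Python's built-in `str.expandtabs`. The sample line and the tab width of 8 are assumptions chosen for illustration; the actual increase depends on the tab stops an application uses.

```python
def expand_tabs(line, tab_size=8):
    """Replace each tab with the spaces needed to reach the next tab stop."""
    return line.expandtabs(tab_size)


# Hypothetical sample line in a tab-separated layout.
sample = "U+4E00\tkDefinition\tone"
expanded = expand_tabs(sample)

# The expanded line contains no tabs and is longer than the original.
print(len(sample), len(expanded))
```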

    Instead of (for instance) "KUA4", why not "KUA⁴"?
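
Such a conversion would be mechanical. The sketch below assumes that a tone is always a single ASCII digit directly following a letter, which may not hold for every reading field; the names `SUPERSCRIPT_DIGITS` and `superscript_tones` are made up for this example.

```python
import re

# Map ASCII digits to their Unicode superscript counterparts.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def superscript_tones(reading):
    """Turn a digit that directly follows a letter into a superscript digit."""
    return re.sub(
        r"(?<=[A-Za-zÜü])\d",
        lambda m: m.group().translate(SUPERSCRIPT_DIGITS),
        reading,
    )


print(superscript_tones("KUA4"))
```

Digits that do not follow a letter, such as those in a code point or a page reference, are left alone by this rule.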

    Much of the text in UNIHAN.TXT is redundant: the hex character
    is repeated along with each field name, over and over again.

    Putting the hex character at the beginning of each line, with one
    character per line and CSVs would make UNIHAN.TXT *much* smaller.
    Of course, commas would have to be removed from the definition
    fields. (Hmmm, maybe definition field commas could be replaced
    with MIDDLE DOT?)
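
A rough sketch of that reshaping follows. It assumes each data line has the form code point, tab, field name, tab, value, and that comment lines start with "#"; the function name `unihan_to_rows` and the "code" column header are invented for the example. One column per field is only one possible reading of "CSVs", and it leaves empty cells wherever a character lacks a field.

```python
MIDDLE_DOT = "\u00b7"


def unihan_to_rows(lines):
    """Group tab-separated (code, field, value) lines into one row per code.

    Returns a list of rows; the first row is the header.
    """
    records = {}  # code point -> {field name: value}, in first-seen order
    fields = []   # field names, in first-seen order
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        if field not in fields:
            fields.append(field)
        # Commas would clash with the CSV separator, so swap them out.
        records.setdefault(code, {})[field] = value.replace(",", MIDDLE_DOT)
    rows = [["code"] + fields]
    for code, rec in records.items():
        rows.append([code] + [rec.get(name, "") for name in fields])
    return rows


sample = [
    "# comment line",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E00\tkMandarin\tYI1",
    "U+4E01\tkMandarin\tDING1",
]
for row in unihan_to_rows(sample):
    print(",".join(row))
```

The sample values are illustrative, not quoted from the file. Whether the result is really smaller depends on how sparse the fields are, since every missing value still costs one separator.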

    But, changing the format of the file might make it harder for some
    users to find the data they seek. So, I'm not necessarily proposing
    any change, but rather pointing out that alternatives exist.

    In spite of its unwieldy size, UNIHAN.TXT is a useful tool and I'm
    grateful for its existence.

    Best regards,

    James Kass


