Re: Unihan.txt and the four dictionary sorting algorithm

From: Doug Ewell
Date: Wed Apr 21 2004 - 11:24:22 EDT

    Raymond Mercier <RaymondM at compuserve dot com> wrote:

    > The problem of the size of Unihan has nothing at all to do with the
    > cost of storage, and everything to do with the functioning of programs
    > that might open and read it.
    > Since the lines in Unihan are separated by 0x0A alone, not 0x0D 0x0A,
    > this means that when opened in Notepad the lines are not separated...

    I have to agree that an ordinary plain-text editor is probably not the
    right tool for browsing a 25-megabyte data file, even though I've been
    known to do the same with UnicodeData.txt (which is admittedly an order
    of magnitude smaller).

    Even though Unihan is packaged as plain text, one record per
    LF-terminated line (well, sort of), it's really more appropriate to
    think of it as a data file, intended to be read by software. Something
    like a batch file that calls grep (or another plain-text search tool)
    would be a better way to look things up in it.
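
    As a rough illustration of that approach, here is a minimal Python
    sketch of a grep-like lookup that reads the file one line at a time
    instead of loading it into an editor. It assumes each record is three
    tab-separated fields (code point, field name, value) and that comment
    lines start with "#"; the function name grep_unihan and the sample
    records are made up for illustration.

```python
import io


def grep_unihan(lines, codepoint=None, field=None):
    """Yield (codepoint, field, value) tuples for matching records.

    `lines` is any iterable of text lines, such as an open file.
    Assumed layout: "U+XXXX<TAB>kFieldName<TAB>value", with "#" comments.
    """
    for line in lines:
        line = line.rstrip("\r\n")  # tolerate both LF and CRLF endings
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # not a record in the assumed layout
        cp, fld, value = parts
        if codepoint is not None and cp != codepoint:
            continue
        if field is not None and fld != field:
            continue
        yield cp, fld, value


if __name__ == "__main__":
    # Illustrative records only; a real run would pass an open file, e.g.
    # open("Unihan.txt", encoding="utf-8", newline="")
    sample = io.StringIO(
        "# sample data in the assumed layout\n"
        "U+4E00\tkDefinition\tone; a, an; alone\n"
        "U+4E00\tkTotalStrokes\t1\n"
        "U+4E8C\tkDefinition\ttwo; twice\n"
    )
    for record in grep_unihan(sample, field="kDefinition"):
        print("\t".join(record))
```

    Because the file is consumed as an iterator, memory use stays small
    no matter how large the file is.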

    And as John said, converting LF to CRLF is quite a simple task -- it can
    even be done by your FTP client, while downloading the file -- and
    should not be thought of as a deficiency in the current plain-text format.
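
    For anyone who would rather do the conversion locally, a minimal
    Python sketch follows. The function names lf_to_crlf and convert_file
    are hypothetical; the file is handled as raw bytes, so the character
    encoding is left untouched.

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Return `data` with every line ending written as CRLF (0x0D 0x0A)."""
    # Collapse any existing CRLF to LF first so it is not doubled to CR CR LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, converting line endings as it goes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Iterating a binary file yields chunks ending at each LF, so a
        # CRLF pair is never split across two chunks.
        for line in src:
            dst.write(lf_to_crlf(line))
```

    A call such as convert_file("Unihan.txt", "Unihan-crlf.txt") would
    write a copy that Notepad displays with line breaks; both file names
    here are only examples.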

    -Doug Ewell
     Fullerton, California
