Re: Unihan.txt and the four dictionary sorting algorithm

From: Andrew C. West (
Date: Wed Apr 21 2004 - 04:55:50 EDT

    On Tue, 20 Apr 2004 22:36:48 +0100, "Raymond Mercier" wrote:
    > The problem of the size of Unihan has nothing at all to do with the cost of
    > storage, and everything to do with the functioning of programs that might
    > open and read it.
    > Since the lines in Unihan are separated by 0x0A alone, not 0x0D 0x0A, this
    > means that when opened in notepad the lines are not separated. Notepad does
    > have the advantage that the UTF-8 encoding is recognized, and the characters
    > are displayed.
    > If opened in Wordpad the Chinese characters do not appear, perhaps the UTF-8
    > encoding does not function.
    > If I try MS Word the machine grinds to a halt - and this is a good modern
    > machine (XP with 120Mb HD and 512Mb RAM).
    > Similarly if I open in IE6, with UTF-8 encoding, the text opens up to around
    > U+4C00, and then grinds to a halt.
    > I can open it in the HexWorkshop byte editor, or in the editor in Visual C
    > 6, but these do not recognize UTF-8 encoding, and they hardly count as
    > suitable readers for such a file.

    I've never managed to get either Notepad or Word to open Unihan.txt (or at least
    I've never had the patience to wait for the operation to complete), and editing
    very large files with Notepad is next to impossible, as it re-renders the entire
    file on every edit or window resize.

    As James mentioned, my BabelPad text editor for Windows will open and edit
    Unihan.txt with no problem (tip - disable undo/redo functionality if you're
    going to make global replacements) - it takes about 20 seconds to open on my
    (rather old) machine. On the other hand, Visual Studio 7.1 opens Unihan.txt
    correctly (autodetecting it as UTF-8) in less than 10 seconds, and has regular
    expression find/replace functionality, which makes it quite powerful.

