Re: Unihan.txt and the four dictionary sorting algorithm

From: Edward H. Trager (
Date: Fri Apr 23 2004 - 12:12:57 EDT

    I've been following this thread initiated by Raymond Mercier's comments on the
    Unihan database with some slight amusement but mostly dismay that some
    readers of this list are using the completely wrong software tools for dealing
    with a *database* file like the Unihan table.

    My sincerest advice to all of you who have ever had trouble opening multi-megabyte
    data files using Notepad, Word, or IE is:

       (1) First, spend one day learning how to use grep and egrep.
       (2) Second, spend one week learning how to use Perl.
       (3) Third, spend another week learning the basics of vim.

    You can download and install all of these tools on Windows.

    Alternatively, you will find them pre-installed on Linux distributions. Major
    Linux distributions (Red Hat, SuSE, probably others) now default to UTF-8-based
    locales in most cases, so you would be ready to roll as soon as you finished
    installing Linux.

    Mac OS X also comes with GNU egrep, Perl, and the console
    version of Vim pre-installed.

    While there are lots of other tools that you could use, learning just the basics
    of these three tools -- which handle UTF-8 and very large files painlessly -- will
    save you hours of frustration. I guarantee it.
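    To illustrate why line-oriented tools cope with a file this size, here is a
    minimal sketch of a grep-like filter. It is written in Python rather than grep
    or Perl, and it assumes the tab-separated layout of Unihan.txt (code point,
    field name, value, with "#" comment lines). The function name grep_unihan and
    the sample pattern are hypothetical, chosen only for illustration.

```python
import re


def grep_unihan(lines, field, pattern):
    """Yield (codepoint, value) for entries of `field` whose value matches `pattern`.

    `lines` can be any iterable of text lines, including an open file, so only
    one line is held in memory at a time.
    """
    regex = re.compile(pattern)
    for line in lines:
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        # Assumed layout: U+XXXX <tab> field name <tab> value
        parts = line.rstrip("\r\n").split("\t", 2)
        if len(parts) != 3:
            continue
        codepoint, name, value = parts
        if name == field and regex.search(value):
            yield codepoint, value
```

    Passing a file opened with open("Unihan.txt", encoding="utf-8") as the first
    argument streams the file line by line, which is the property that editors
    such as Notepad and Word lack.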

    One issue you might run into with these terminal-based tools on Windows and
    Mac OS X, and one that I myself don't know how to solve, is how to switch to a
    UTF-8 locale so that terminal programs such as xterm or the Cygwin terminal
    display characters beyond ASCII correctly. My own solution to this problem was trivially
    easy: don't use Windows or Mac OS X for multilingual database work; use Linux
    instead. Because recent Linux distributions ship with UTF-8 locales
    by default, you should not need to configure anything. It will just work right
    out of the box.

    Perhaps someone else on this list can tell us how to get Apple's Terminal application
    or xterm running on OS X to display UTF-8 characters correctly (it probably just needs
    the correct UTF-8-based locale setting). There must also be solutions to this
    problem for Windows terminals; I just don't know what they are. Perhaps the suggestion
    to use Visual Studio's editor, which has the same regexp functionality that you would
    get with the aforementioned Unix tools, is the best way to go on Windows.
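    As a rough diagnostic for the locale question, the following Python sketch
    reports whether the current session appears to be set up for UTF-8. It does
    not change any settings, and it is not a fix for Windows or Mac OS X
    terminals. The function names are hypothetical; the calls used are from
    Python's standard codecs, locale, and sys modules.

```python
import codecs
import locale
import sys


def is_utf8(encoding_name):
    """Return True if the name is an alias of UTF-8 (e.g. 'UTF8', 'utf_8')."""
    if not encoding_name:
        return False
    try:
        # codecs.lookup normalizes aliases to a canonical codec name.
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        return False


def report_terminal_encoding():
    """Report the encodings implied by the locale and by standard output."""
    locale_enc = locale.getpreferredencoding(False)
    stdout_enc = getattr(sys.stdout, "encoding", None)
    return {
        "locale": locale_enc,
        "stdout": stdout_enc,
        "utf8_ready": is_utf8(locale_enc) and is_utf8(stdout_enc),
    }
```

    If "utf8_ready" comes back False, the terminal's locale setting is the first
    thing to check, as suggested above.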

    On Wednesday 2004.04.21 01:55:50 -0700, Andrew C. West wrote:
    > On Tue, 20 Apr 2004 22:36:48 +0100, "Raymond Mercier" wrote:
    > >
    > > The problem of the size of Unihan has nothing at all to do with the cost of
    > > storage, and everything to do with the functioning of programs that might
    > > open and read it.
    > > Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D, this
    > > means that when opened in notepad the lines are not separated. Notepad does
    > > have the advantage that the UTF-8 encoding is recognized, and the characters
    > > are displayed.
    > >
    > > If opened in Wordpad the Chinese characters do not appear, perhaps the UTF-8
    > > encoding does not function.
    > >
    > > If I try MS Word the machine grinds to a halt - and this is a good modern
    > > machine (XP with 120Mb HD and 512Mb RAM).
    > >
    > > Similarly if I open in IE6, with UTF-8 encoding, the text opens up to around
    > > U+4C00, and then grinds to a halt.
    > >
    > > I can open it in the HexWorkshop byte editor, or in the editor in Visual C
    > > 6, but these do not recognize UTF-8 encoding, and they hardly count as
    > > suitable readers for such a file.
    > >
    > I've never managed to get either Notepad or Word to open Unihan.txt (or at least
    > I've never had the patience to wait for the operation to complete), and editing
    > very large files with Notepad is next to impossible as it rerenders the entire
    > file on every edit operation or window resizing operation.
    > As James mentioned, my BabelPad text editor for Windows will open and edit
    > Unihan.txt with no problem (tip - disable undo/redo functionality if you're
    > going to make global replacements) - it takes about 20 seconds to open on my
    > (rather old) machine. On the other hand, Visual Studio 7.1 opens Unihan
    > correctly (autodetecting as UTF-8) in less than 10 seconds, and has regular
    > expression find/replace functionality, which makes it quite powerful.
    > Andrew
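
    Raymond's observation about line separators can be checked without opening the
    file in an editor at all. Windows tools such as Notepad expect each line to end
    in CR LF (0x0D 0x0A), while Unihan.txt is reported above to use LF (0x0A)
    alone. The sketch below counts both kinds of terminator in a binary stream;
    the function name count_line_endings is hypothetical.

```python
import io


def count_line_endings(stream):
    """Count CRLF and bare-LF line terminators in a binary stream."""
    counts = {"crlf": 0, "lf": 0}
    # Iterating a binary stream splits on b"\n" only, one line at a time.
    for raw in stream:
        if raw.endswith(b"\r\n"):
            counts["crlf"] += 1
        elif raw.endswith(b"\n"):
            counts["lf"] += 1
    return counts


# Small in-memory demonstration; a real file would be opened with mode "rb".
demo = count_line_endings(io.BytesIO(b"one\ntwo\r\nthree"))
```

    A file that reports zero CRLF terminators will show up as one unbroken line in
    an editor that only recognizes CR LF.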

    This archive was generated by hypermail 2.1.5 : Fri Apr 23 2004 - 12:16:41 EDT