Re: Unihan.txt and the four dictionary sorting algorithm

From: Edward H. Trager (ehtrager@umich.edu)
Date: Fri Apr 23 2004 - 12:12:57 EDT

Next message: Frank Yung-Fong Tang: "OT: Standardize TimeZone ID"

Previous message: Philippe Verdy: "Re: CLDR and locale designations (was: [OT] Even viruses are now i18n!)"
In reply to: Andrew C. West: "Re: Unihan.txt and the four dictionary sorting algorithm"
Next in thread: Tom Emerson: "Re: Unihan.txt and the four dictionary sorting algorithm"
Reply: Tom Emerson: "Re: Unihan.txt and the four dictionary sorting algorithm"
Reply: Benjamin Peterson: "Re: Unihan.txt and the four dictionary sorting algorithm"

I've been following this thread initiated by Raymond Mercier's comments on the
Unihan database with some slight amusement but mostly dismay that some
readers of this list are using the completely wrong software tools for dealing
with a *database* file like the Unihan table.

My sincerest advice to all of you who have ever had trouble opening multi-megabyte
data files using Notepad, Word, or IE is:

   (1) First, spend one day learning how to use grep and egrep.
   (2) Second, spend one week learning how to use Perl.
   (3) Third, spend another week learning the basics of vim.

You can download and install all of these tools on Windows. Alternatively, you
will find them already pre-installed on Linux distributions. Major Linux
distributions (Red Hat, SuSE, probably others) now default to UTF-8-based
locales in most cases, so you would be ready to roll as soon as you finished
installing Linux. Mac OS X also comes with GNU egrep, Perl, and the console
version of Vim pre-installed.

While there are lots of other tools that you could use, learning just the basics
of these three tools -- which handle UTF-8 and very large files painlessly -- will
save you hours of frustration. I guarantee it.
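
To make the "treat it as a database file" point concrete, here is a minimal
sketch of the grep-style, line-at-a-time approach. It is written in Python
purely for illustration (grep or a Perl one-liner would do the same job), and
it assumes the usual Unihan layout of tab-separated "U+XXXX, field, value"
lines with "#" comment lines. The names unihan_records, grep_field, and SAMPLE
are made up for this example, and the sample lines are illustrative rather
than copied from the real file.

```python
import io
import re
from typing import Iterable, Iterator, Tuple


def unihan_records(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (code point, field, value) from Unihan-style lines, one at a time."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) == 3:
            yield parts[0], parts[1], parts[2]


def grep_field(lines: Iterable[str], field: str, pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield (code point, value) where the named field's value matches the regex."""
    regex = re.compile(pattern)
    for code_point, name, value in unihan_records(lines):
        if name == field and regex.search(value):
            yield code_point, value


# Illustrative sample in the assumed layout (not the real file contents).
SAMPLE = (
    "# sample comment line\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E00\tkMandarin\tYI1\n"
    "U+4E8C\tkDefinition\ttwo; twice\n"
)

if __name__ == "__main__":
    # For the real file, replace io.StringIO(SAMPLE) with something like:
    #   open("Unihan.txt", encoding="utf-8", newline="")
    # A file object is iterated line by line, so the whole file is never
    # held in memory at once.
    for code_point, value in grep_field(io.StringIO(SAMPLE), "kDefinition", r"\btwo\b"):
        print(code_point, value)  # prints: U+4E8C two; twice
```

Because the file is read one line at a time, memory use stays small no matter
how large the file is, which is exactly what Notepad and Word fail to do.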

There is one issue you might confront with these terminal-based tools on
Windows and Mac OS X that I myself don't know how to solve: I don't know how
to switch to a UTF-8 locale on either system so that terminal programs such as
xterm or the Cygwin terminal display UTF-8 characters beyond ASCII correctly.
My own solution to this problem was trivially easy: don't use Windows or
Mac OS X for multilingual database work; use Linux instead. Because recent
Linux distributions ship with UTF-8 locales by default, you should not need to
configure anything. It will just work right out of the box.

Perhaps someone else on this list can tell us how to get Apple's terminal
application or xterm running on OS X to display UTF-8 characters correctly (it
probably just needs the correct UTF-8-based locale setting). There must also be
some solution to this problem for Windows terminals; I just don't know what it
is. Perhaps the suggestion to use Visual Studio's editor, which has the same
regexp functionality that you would get with the aforementioned Unix tools, is
the best way to go on Windows.
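
For anyone unsure whether their session is actually running in a UTF-8
locale, a small check like the following may help. It is only a sketch, again
in Python for illustration: names_utf8 is a hypothetical helper that inspects
a POSIX-style locale name such as "en_US.UTF-8", and the two standard-library
calls at the bottom simply report what the interpreter thinks the encodings
are. It does not fix a misconfigured terminal, and a terminal emulator can
still render incorrectly even when the locale is right.

```python
import locale
import sys


def names_utf8(locale_name: str) -> bool:
    """Return True if a POSIX-style locale name carries a UTF-8 codeset.

    Accepts forms such as 'en_US.UTF-8', 'en_US.utf8', or 'en_US.UTF-8@euro'.
    Names without an explicit codeset (for example 'C' or 'en_US') return False.
    """
    base = locale_name.split("@", 1)[0]  # drop any '@modifier' suffix
    if "." not in base:
        return False
    codeset = base.split(".", 1)[1]
    return codeset.replace("-", "").lower() == "utf8"


if __name__ == "__main__":
    # What Python believes the locale's text encoding and stdout encoding are.
    print("preferred encoding:", locale.getpreferredencoding(False))
    print("stdout encoding:", sys.stdout.encoding)
```

If both values report UTF-8 and non-ASCII characters still display as garbage,
the problem is more likely in the terminal emulator or its font than in the
locale setting.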

On Wednesday 2004.04.21 01:55:50 -0700, Andrew C. West wrote:
> On Tue, 20 Apr 2004 22:36:48 +0100, "Raymond Mercier" wrote:
> >
> > The problem of the size of Unihan has nothing at all to do with the cost of
> > storage, and everything to do with the functioning of programs that might
> > open and read it.
> > Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D, this
> > means that when opened in notepad the lines are not separated. Notepad does
> > have the advantage that the UTF-8 encoding is recognized, and the characters
> > are displayed.
> >
> > If opened in Wordpad the Chinese characters do not appear, perhaps the UTF-8
> > encoding does not function.
> >
> > If I try MS Word the machine grinds to a halt - and this is a good modern
> > machine (XP with 120Mb HD and 512Mb RAM).
> >
> > Similarly if I open in IE6, with UTF-8 encoding, the text opens up to around
> > U+4C00, and then grinds to a halt.
> >
> > I can open it in the HexWorkshop byte editor, or in the editor in Visual C
> > 6, but these do not recognize UTF-8 encoding, and they hardly count as
> > suitable readers for such a file.
> >
>
> I've never managed to get either Notepad or Word to open Unihan.txt (or at least
> I've never had the patience to wait for the operation to complete), and editing
> very large files with Notepad is next to impossible as it rerenders the entire
> file on every edit operation or window resizing operation.
>
> As James mentioned, my BabelPad text editor for Windows will open and edit
> Unihan.txt with no problem (tip - disable undo/redo functionality if you're
> going to make global replacements) - it takes about 20 seconds to open on my
> (rather old) machine. On the other hand, Visual Studio 7.1 opens Unihan
> correctly (autodetecting as UTF-8) in less than 10 seconds, and has regular
> expression find/replace functionality, which makes it quite powerful.
>
> Andrew
>
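
On the line-separator point in the quoted message: whether a file uses bare
LF (0x0A) or the Windows CR LF pair (0x0D 0x0A) is easy to check before
blaming the editor. The sketch below, in Python for illustration, counts each
style in a byte string; count_line_endings is a name made up for this example.

```python
def count_line_endings(data: bytes) -> dict:
    """Count LF-only, CR LF, and CR-only line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CR LF pair also contains one LF and one CR, so subtract the pairs.
    lf_only = data.count(b"\n") - crlf
    cr_only = data.count(b"\r") - crlf
    return {"LF": lf_only, "CRLF": crlf, "CR": cr_only}


if __name__ == "__main__":
    sample = b"first\nsecond\r\nthird\n"
    print(count_line_endings(sample))
```

For a real file, read it in binary mode (open(path, "rb")) so that no newline
translation happens before the bytes are counted.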


This archive was generated by hypermail 2.1.5 : Fri Apr 23 2004 - 12:16:41 EDT