UNIHAN.TXT

From: jameskass@att.net
Date: Fri Apr 30 2004 - 03:12:29 EDT

Next message: Antoine Leca: "Re: Variation selectors and vowel marks"

Previous message: jameskass@att.net: "Re: Public Review Issues Updated"
Next in thread: John Jenkins: "Re: UNIHAN.TXT"
Reply: John Jenkins: "Re: UNIHAN.TXT"

As with UNIHAN.TXT, brevity is not a feature of the following...

Tabs... In addition to the points Mike made about the tab character having
different semantics depending on the application/platform, I just don't
think a control character like tab belongs in a *.TXT file, period. Although
UNIHAN.TXT is referred to as a database, it isn't. Rather, it's the raw
material for a database offered in plain-text form. Still, tabs are arguably
OK. It's easy enough to strip them out when they're not wanted. (I'd
rather deal with tabs in a text file which is to be imported into a database
than ASCII quotes.)

Unix -vs- DOS... I'll stick with the tools I've been using for a quarter century
and their descendants, thanks just the same. With respect to the idea that a
text editor is not the proper tool with which to open a *.TXT file, well...

Trivial -vs- non-trivial... Once the raw data has been imported into a database,
it's trivial to massage or manipulate it. It's easy enough to generate a CSV
file from a database application, and I've done so. But, the only reason that
I wanted it in CSV in the first place was to make it easy to import the data
into the database application. This was *not* trivial to do; it involved a lot
of coding and counting, and a bit of trial-and-error with various field lengths.
Still, the task managed to keep me quiet for a few days...

With a CSV file, importing data from a text file into a database file simply
involves a single-line command in interactive mode (once the database
file structure has been established). This is true for dBASE, FoxPro, and
related database applications.

Of course, the same kind of single-line command can be (and was) used to
import the data from the UNIHAN.TXT file into a database, but this
produces a huge database file [266844944 bytes] which *still* does not
have proper fields. It still has one record/one field just like the original
UNIHAN.TXT file. This means that, to get the information for a certain
field of a certain character, you have to go skipping through all
1063127 records checking each one rather than the mere 71098 records
that the database actually requires. (Of course, you'd use an index file
rather than skipping through all those records in either case.)
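
A minimal sketch of the restructuring described above, not the code actually
used for the conversion. It assumes each data line has the form
code point, tab, field name, tab, value, and that comment lines start with "#".
The function names and the sample lines are illustrative only.

```python
import csv
import io
from collections import OrderedDict


def pivot_unihan(lines):
    """Group 'U+XXXX<TAB>kField<TAB>value' lines into one dict per code point."""
    records = OrderedDict()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the leading comment block
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # ignore anything that is not a three-column data line
        code_point, field, value = parts
        records.setdefault(code_point, {})[field] = value
    return records


def to_csv(records):
    """Write one CSV row per code point; missing fields become empty cells."""
    fields = sorted({f for rec in records.values() for f in rec})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["codepoint"] + fields)
    for code_point, rec in records.items():
        writer.writerow([code_point] + [rec.get(f, "") for f in fields])
    return out.getvalue()


# Illustrative sample lines in the assumed three-column layout.
sample = [
    "# comment line",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E00\tkMandarin\tYI1",
    "U+4E01\tkMandarin\tDING1",
]
records = pivot_unihan(sample)
csv_text = to_csv(records)
```

Four input lines collapse to two records here, one per character, with an
empty cell wherever a character has no value for a field.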

But, if you wanted to modify only one field, it's more efficient to skip
through 71098 records reading and modifying only the appropriate field
in the record than to go skipping through all 1063127. Easier to program, too.
(Suppose you were a purist who wanted to see Stimson's pronunciations using
the actual characters that Stimson used? Or, say you wanted pronunciations
in lower case rather than upper case and preferred that the tone marks be
superscripted? Hmmm, maybe you'd want those Japanese pronunciations
in kana instead of romaji...)
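
As a rough illustration of why the one-record-per-character layout is easier
to program against, here is a sketch that rewrites a single field in every
record. The column names follow the Unihan field names, but the sample rows
are made up for the example and the helper name is hypothetical.

```python
import csv
import io


def lowercase_field(csv_text, field):
    """Return a copy of the CSV text with one named column lower-cased."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[field] = row[field].lower()  # only the chosen field is touched
        writer.writerow(row)
    return out.getvalue()


# Made-up sample rows, one record per character.
table = "codepoint,kMandarin,kJapaneseOn\nU+4E00,YI1,ICHI ITSU\nU+4E01,DING1,\n"
result = lowercase_field(table, "kMandarin")
```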

So, UNIHAN.TXT is 27592561 bytes, but the CSV text file is 13384544 bytes.
Zipped, UNIHCSV.ZIP is 3477887 bytes. (The CSV file lacks the initial 802 lines
of comments in the source UNIHAN.TXT file.)

That cuts the size only by about half, not as great a savings as I'd imagined. This
is because many of the "fields" in the source UNIHAN.TXT are actually
empty, and thus don't occupy a line in the file, while empty fields
in the CSV file still require a single byte for that comma.
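
The figures quoted above can be checked with a little arithmetic. This sketch
uses only the numbers given in this message; the variable names are
illustrative, and the lines-per-record figure is approximate because the
1063127 count comes from the raw import of UNIHAN.TXT.

```python
# Sizes and counts as quoted in the message.
unihan_bytes = 27592561   # UNIHAN.TXT
csv_bytes = 13384544      # CSV text file
zip_bytes = 3477887       # UNIHCSV.ZIP
source_lines = 1063127    # records after importing UNIHAN.TXT line by line
records = 71098           # one record per character

csv_ratio = csv_bytes / unihan_bytes      # CSV size relative to the source
zip_ratio = zip_bytes / csv_bytes         # zipped size relative to the CSV
lines_per_record = source_lines / records # source lines per character record

print(f"CSV is {csv_ratio:.1%} of UNIHAN.TXT")
print(f"ZIP is {zip_ratio:.1%} of the CSV")
print(f"about {lines_per_record:.2f} source lines per character")
```

So the CSV is a little under half the size of the source, and each character
accounts for roughly fifteen lines of UNIHAN.TXT on average.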

D. Starner wrote,

> Because it's a data file, and it's easier to process without all that HTML
> junk to discard.

Right on!

John Jenkins wrote,

> Now that UTF-8 support is relatively common, we're moving more and more
> data in the file to non-ASCII form.

It is a delight to observe this happening already.

>> But, changing the format of the file might make it harder for some
>> users to find the data they seek. So, I'm not necessarily proposing
>> any change, but rather pointing out that alternatives exist.
>>
> That's the *real* problem. Goodness knows the current format has real
> problems, and brevity is not among its virtues. (OTOH, the format it
> replaces was brief to the point of being incomprehensible.)
> Unfortunately, nobody's come up with a good strategy for migrating to
> something else.

I could send you the CSV file for posting, if you think anyone else would
want it.

Doug Ewell wrote,

> And as John said, converting LF to CRLF is quite a simple task -- it can
> even be done by your FTP client, while downloading the file -- and
> should not be thought of as a deficiency in the current plain-text
> format.

Right. It's not a deficiency, it simply adds one more step to a multi-step
process for some of us.
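
For anyone who would rather fold that extra step into a script, a minimal
sketch of the LF to CRLF conversion follows. The function name is
hypothetical, and it works on bytes so that the UTF-8 content is left alone.

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CRLF without doubling existing CRLFs."""
    normalized = data.replace(b"\r\n", b"\n")  # collapse any CRLF to LF first
    return normalized.replace(b"\n", b"\r\n")


converted = lf_to_crlf(b"a\nb\r\nc\n")
```

Normalizing to LF first means a file that already has some CRLF line endings
does not end up with stray extra carriage returns.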

Benjamin Peterson wrote,

> Wow -- I'd hate to see your idea of a non-trivial solution!

Me too!

Edward H. Trager wrote,

> People tend to use what they know best, ...

Exactly.

> Absolutely. The existence of Cygwin makes work on Windows much more tolerable,
> especially since Cygwin provides the OpenSSH client, XFree86, Perl,
> console vim, egrep, etc. However, I still haven't figured out how to display
> a UTF-8 file with non-latin characters in the Cygwin bash shell (on Win2K). As
> far as I know, this shell really just sits on top of a DOS shell. And
> as far as I can tell, "chcp 65001" still doesn't let you see, for example,
> CJK characters in the terminal. I don't think it is possible. Since I also
> can't figure out how to see non-latin characters in the graphical
> version of vim (Gvim 6.2) on Windows, I rest my case that Windows is annoying.

To see non-Latin characters in the DOS window of Windows, it's
necessary to install a "console font" covering the characters, and then
activate (or enable) that font for the "console window". Everson Mono
Terminal should work fine for non-Han characters which don't require
complex shaping.

Best regards,

James Kass

This archive was generated by hypermail 2.1.5 : Fri Apr 30 2004 - 03:52:52 EDT