From: John Jenkins (firstname.lastname@example.org)
Date: Tue Apr 20 2004 - 21:39:49 EDT
On Apr 20, 2004, at 5:11 PM, email@example.com wrote:
> The DOS editor chokes on such a large text file, so does my older hex
> editor. Thank goodness for BabelPad, otherwise it would've been hard
> to insert proper (for my system) line breaks into the file.
BBEdit on the Mac tends to be unhappy with it, too.
> The tab "character" is used in the file. Arguably, this "character"
> should never appear in a plain text file; rather, it should be converted to an
> appropriate number of U+0020 characters by the application on save.
> Of course, this would make the file even bigger.
Tab-separated data files are quite common. (Indeed, I tend to get
annoyed with the main UCD file because it's semicolon-separated.) I'm
not sure why you'd want a tab never to appear in a plain-text file.
> Instead of (for instance) "KUA4", why not "KUA⁴"?
I think your text got garbled here, but in any event, you've replaced
one four-character word with another one. :-)
Realistically, the earliest versions of the Unihan.txt file predate the
ability to safely exchange or use anything other than ASCII. Our
Mandarin romanization dates back to those days.
Now that UTF-8 support is relatively common, we're moving more and more
data in the file to non-ASCII form.
> Much of the text in UNIHAN.TXT is redundant, the hex character
> is repeated along with each field name over and over again.
> Putting the hex character at the beginning of each line, with one
> character per line and CSVs would make UNIHAN.TXT *much* smaller.
> Of course, commas would have to be removed from the definition
> fields. (Hmmm, maybe definition field commas could be replaced
> with MIDDLE DOT?)
Hmm. Interesting suggestion.
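As a rough sketch of the one-line-per-character idea, assuming each data line has the form "U+XXXX<tab>field<tab>value" and that comment lines start with "#". The function name pivot, the column order, and the sample lines are illustrative assumptions, not part of any official tool or copied from the real file:

```python
import csv
import io

# Illustrative sample lines (assumed layout: codepoint, field name, value).
SAMPLE = (
    "# comment line\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E00\tkMandarin\tYI1\n"
    "U+4E01\tkMandarin\tDING1\n"
)

def pivot(lines, fields):
    """Collapse field-per-line records into one CSV row per character."""
    rows = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        # Replace commas with MIDDLE DOT, as suggested in the quoted text.
        rows.setdefault(code, {})[field] = value.replace(",", "\u00b7")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["codepoint"] + fields)
    for code, values in rows.items():
        # Fields missing for a character become empty cells.
        writer.writerow([code] + [values.get(f, "") for f in fields])
    return out.getvalue()

print(pivot(SAMPLE.splitlines(), ["kMandarin", "kDefinition"]))
```

The code point appears once per row rather than once per field, which is where the size saving would come from.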
OTOH, the current format lends itself nicely to use with some
utilities, like the Unix grep command.
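For instance, because every line carries its own code point and field name, something like `grep kMandarin Unihan.txt` returns self-contained records. A minimal Python sketch of the same filtering, assuming the tab-separated "U+XXXX<tab>field<tab>value" layout (grep_field and the sample lines are hypothetical, for illustration only):

```python
def grep_field(lines, field):
    """Yield (codepoint, value) pairs for every line carrying the given field."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.rstrip("\n").split("\t", 2)
        # Unlike a plain substring grep, match the field column exactly.
        if len(parts) == 3 and parts[1] == field:
            yield parts[0], parts[2]

# Illustrative sample lines, not copied from the real file.
sample = [
    "# comment line",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E00\tkMandarin\tYI1",
    "U+4E01\tkMandarin\tDING1",
]

for code, value in grep_field(sample, "kMandarin"):
    print(code, value)
```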
Fundamentally, any format we select would be nice in some situations
and not so nice in others.
> But, changing the format of the file might make it harder for some
> users to find the data they seek. So, I'm not necessarily proposing
> any change, but rather pointing out that alternatives exist.
That's the *real* problem. Goodness knows the current format has real
problems, and brevity is not among its virtues. (OTOH, the format it
replaces was brief to the point of being incomprehensible.)
Unfortunately, nobody's come up with a good strategy for migrating to a new format.
(Which is why we're stuck with a misspelling in one of the field names.)
And, of course, you're perfectly free to massage the data as suits your
own purposes. My Unihan lookup tool for Mac OS X converts it all to
XML, for instance.
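A conversion along those lines might look like the sketch below. It is only an illustration: the element names ("unihan", "char"), the "cp" attribute, and the function to_xml are assumptions, not the schema the lookup tool actually uses, and the input is assumed to be tab-separated "U+XXXX<tab>field<tab>value" lines.

```python
import xml.etree.ElementTree as ET

def to_xml(lines):
    """Group field-per-line records under one <char> element per code point."""
    root = ET.Element("unihan")  # hypothetical root element name
    chars = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        char = chars.get(code)
        if char is None:
            char = ET.SubElement(root, "char", {"cp": code})
            chars[code] = char
        # Each field becomes a child element named after the field.
        ET.SubElement(char, field).text = value
    return ET.tostring(root, encoding="unicode")

# Illustrative sample lines, not copied from the real file.
sample_lines = [
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E00\tkMandarin\tYI1",
    "U+4E01\tkMandarin\tDING1",
]

print(to_xml(sample_lines))
```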
> In spite of its unwieldy size, UNIHAN.TXT is a useful tool and I'm
> grateful for its existence.
John H. Jenkins