Re: Unihan.txt and the four dictionary sorting algorithm

From: John Jenkins (jenkins@apple.com)
Date: Tue Apr 20 2004 - 21:39:49 EDT

Next message: Mike Ayers: "RE: Unihan.txt and the four dictionary sorting algorithm"

Previous message: jameskass@att.net: "Re: Unihan.txt and the four dictionary sorting algorithm"
In reply to: jameskass@att.net: "Re: Unihan.txt and the four dictionary sorting algorithm"
Next in thread: Mike Ayers: "RE: Unihan.txt and the four dictionary sorting algorithm"

On Apr 20, 2004, at 5:11 PM, jameskass@att.net wrote:

> The DOS editor chokes on such a large text file, so does my older hex
> editor. Thank goodness for BabelPad, otherwise it would've been hard
> to insert proper (for my system) line breaks into the file.
>

BBEdit on the Mac tends to be unhappy with it, too.

> The tab "character" is used in the file. Arguably, this "character"
> should never appear in a plain text file, rather it should be
> converted to an appropriate number of U+0020 characters by the
> application on save.
> Of course, this would make the file even bigger.
>

Tab-separated data files are quite common. (Indeed, I tend to get
annoyed with the main UCD file because it's semicolon-separated.) I'm
not sure why you'd want a tab never to appear in a plain-text file.
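
As an illustration of why tab-separated data is convenient, here is a minimal Python sketch that splits each line into a code point, a field name, and a value. The sample lines and the function name parse_unihan_lines are assumptions made for the example; the sketch only assumes the three-column, tab-separated layout, with comment lines starting with "#".

```python
# Illustrative sample in the assumed layout: code point, field name, value,
# separated by tabs. Comment lines start with "#".
SAMPLE = (
    "# illustrative sample, not the real file\n"
    "U+4E00\tkMandarin\tYI1\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E01\tkMandarin\tDING1 ZHENG1\n"
)


def parse_unihan_lines(text):
    """Yield (code_point, field, value) tuples from tab-separated text."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first two tabs only, so the value is kept whole.
        code_point, field, value = line.split("\t", 2)
        yield code_point, field, value


records = list(parse_unihan_lines(SAMPLE))
for record in records:
    print(record)
```

Because the separator is a tab, values containing commas, semicolons, or spaces need no quoting or escaping.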

> Instead of (for instance) "KUA4", why not "KUA⁴"?
>

I think your text got garbled here, but in any event, you've replaced
one four-character word with another one. :-)

Realistically, the earliest versions of the Unihan.txt file predate the
ability to safely exchange or use anything other than ASCII. Our
Mandarin romanization dates back to those days.

Now that UTF-8 support is relatively common, we're moving more and more
data in the file to non-ASCII form.

> Much of the text in UNIHAN.TXT is redundant, the hex character
> is repeated along with each field name over and over again.
>
> Putting the hex character at the beginning of each line, with one
> character per line and CSVs would make UNIHAN.TXT *much* smaller.
> Of course, commas would have to be removed from the definition
> fields. (Hmmm, maybe definition field commas could be replaced
> with MIDDLE DOT?)
>

Hmm. Interesting suggestion.

OTOH, the current format lends itself nicely to use with some
utilities, like the Unix grep command.

Fundamentally, any format we select would be nice in some situations
and not so nice in others.
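
To make the trade-off concrete, here is a rough Python sketch under stated assumptions: the sample lines, the helper names group_by_code_point and to_csv, and the column order are all hypothetical. It first does a grep-style filter on the one-field-per-line layout, then regroups the same data into one row per character. Python's csv module quotes any value containing a comma, so in this sketch the definition text is left unchanged.

```python
import csv
import io
from collections import defaultdict

# Illustrative sample in the assumed three-column, tab-separated layout.
SAMPLE = (
    "U+4E00\tkMandarin\tYI1\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E01\tkMandarin\tDING1 ZHENG1\n"
)

# Line-oriented layout: a grep-style filter is a one-liner.
mandarin_lines = [ln for ln in SAMPLE.splitlines() if "\tkMandarin\t" in ln]


def group_by_code_point(text):
    """Collect every field for a code point into one dictionary."""
    grouped = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        code_point, field, value = line.split("\t", 2)
        grouped[code_point][field] = value
    return dict(grouped)


def to_csv(grouped, fields):
    """Write one row per character; missing fields become empty cells."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["code_point"] + fields)
    # Sort numerically by code point rather than by string.
    for cp in sorted(grouped, key=lambda s: int(s[2:], 16)):
        writer.writerow([cp] + [grouped[cp].get(f, "") for f in fields])
    return out.getvalue()


grouped = group_by_code_point(SAMPLE)
csv_text = to_csv(grouped, ["kMandarin", "kDefinition"])
print(csv_text)
```

The grouped form is shorter because each code point appears once, but a plain text search no longer tells you which field a match came from.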

> But, changing the format of the file might make it harder for some
> users to find the data they seek. So, I'm not necessarily proposing
> any change, but rather pointing out that alternatives exist.
>

That's the *real* problem. Goodness knows the current format has real
problems, and brevity is not among its virtues. (OTOH, the format it
replaces was brief to the point of being incomprehensible.)
Unfortunately, nobody's come up with a good strategy for migrating to
something else.

(Which is why we're stuck with a misspelling in one of the field names.)

And, of course, you're perfectly free to massage the data as suits your
own purposes. My Unihan lookup tool for Mac OS X converts it all to
XML, for instance.
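
For anyone wanting to try a similar conversion, here is a minimal Python sketch. It is not the lookup tool's actual code or schema: the element names "unihan", "char", and "field", the attribute names, and the sample lines are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample in the assumed three-column, tab-separated layout.
SAMPLE = (
    "U+4E00\tkMandarin\tYI1\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
    "U+4E01\tkMandarin\tDING1 ZHENG1\n"
)


def to_xml(text):
    """Build a hypothetical XML tree with one <char> element per code point."""
    root = ET.Element("unihan")
    chars = {}  # code point -> its <char> element
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        code_point, field, value = line.split("\t", 2)
        if code_point not in chars:
            chars[code_point] = ET.SubElement(root, "char", {"cp": code_point})
        # Each field becomes a child element holding the value as text.
        ET.SubElement(chars[code_point], "field", {"name": field}).text = value
    return root


root = to_xml(SAMPLE)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```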

> In spite of its unwieldy size, UNIHAN.TXT is a useful tool and I'm
> grateful for its existence.
>

Thanks.

========
John H. Jenkins
jenkins@apple.com
jhjenkins@mac.com
http://homepage.mac.com/jhjenkins/


This archive was generated by hypermail 2.1.5 : Tue Apr 20 2004 - 22:30:46 EDT