Re: Unihan.txt and the four dictionary sorting algorithm

From: jameskass@att.net
Date: Tue Apr 20 2004 - 19:11:29 EDT

Next message: John Jenkins: "Re: Unihan.txt and the four dictionary sorting algorithm"

Previous message: John Cowan: "Re: Unihan.txt and the four dictionary sorting algorithm"
Maybe in reply to: Ernest Cline: "Unihan.txt and the four dictionary sorting algorithm"
Next in thread: John Jenkins: "Re: Unihan.txt and the four dictionary sorting algorithm"
Reply: John Jenkins: "Re: Unihan.txt and the four dictionary sorting algorithm"

Raymond Mercier wrote,

> John Jenkins writes
> >>Also, even though the full Unihan database is 25+ Mb in size, given the
> cheapness of disk space nowadays, it's not all *that* big, surely.
> <<
>
> The problem of the size of Unihan has nothing at all to do with the cost of
> storage, and everything to do with the functioning of programs that might
> open and read it.
> Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D, this
> means that when opened in notepad the lines are not separated. Notepad does
> have the advantage that the UTF-8 encoding is recognized, and the characters
> are displayed.

UNIHAN.TXT isn't going to get any smaller by itself. The trend indicates
that it will just keep on growing, even if variation selectors (VS) are
used with CJK ideographs.

The DOS editor chokes on such a large text file, and so does my older hex
editor. Thank goodness for BabelPad; otherwise it would have been hard
to insert proper (for my system) line breaks into the file.
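
For readers without an editor that can handle the file, the line-break
conversion can also be scripted. The following is only a minimal sketch in
Python, not the method used above; the function names and file paths are
hypothetical. It reads the file in binary mode one line at a time, so the
whole file never has to fit in memory.

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Convert bare LF line endings to CR LF, leaving existing CR LF pairs alone."""
    # Normalize first so an existing CR LF does not turn into CR CR LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path with CR LF line endings (paths are hypothetical)."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Iterating a binary file yields chunks ending in b"\n",
        # so only one line is held in memory at a time.
        for line in src:
            dst.write(lf_to_crlf(line))
```

A call such as convert_file("Unihan.txt", "Unihan-crlf.txt") would write a
copy that older DOS and Windows tools display with separate lines.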

The tab "character" is used in the file. Arguably, this "character" should
never appear in a plain text file; rather, it should be converted to an
appropriate number of U+0020 characters by the application on save.
Of course, this would make the file even bigger.
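
How much bigger can be estimated with a small sketch. It assumes tab stops
every 8 columns and uses an illustrative sample line in the
tab-separated layout (code point, field name, value); neither assumption
comes from the file's documentation.

```python
def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each tab with U+0020 characters up to the next tab stop."""
    return line.expandtabs(tab_width)


# Illustrative sample line: code point, field name, value.
sample = "U+4E00\tkTotalStrokes\t1"
expanded = expand_tabs(sample)

# Compare the size in bytes before and after expansion.
before = len(sample.encode("utf-8"))
after = len(expanded.encode("utf-8"))
print(before, after)
```

Each tab costs one byte, while its replacement costs between one and eight,
so the growth depends on where the fields happen to fall relative to the
tab stops.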

Instead of (for instance) "KUA4", why not "KUA⁴"?
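
Such a substitution is mechanical. As a rough sketch (the function name is
hypothetical, and it assumes every ASCII digit in a reading is a tone
number):

```python
# Map ASCII digits to their superscript counterparts.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def superscript_tone(reading: str) -> str:
    """Return the reading with ASCII digits replaced by superscript digits."""
    return reading.translate(SUPERSCRIPT_DIGITS)
```

Note that U+2074 SUPERSCRIPT FOUR takes three bytes in UTF-8 where the ASCII
digit takes one, so this change would improve readability rather than reduce
the file size.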

Much of the text in UNIHAN.TXT is redundant: the hex code point
is repeated along with each field name, over and over again.

Putting the hex code point at the beginning of each line, with one
character per line and comma-separated values, would make UNIHAN.TXT
*much* smaller.
Of course, commas would have to be removed from the definition
fields. (Hmmm, maybe definition field commas could be replaced
with MIDDLE DOT?)
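
The regrouping described above could look something like the sketch below.
It assumes each data line has the form code point, tab, field name, tab,
value, and that lines starting with "#" are comments. The function names,
the column order and the sample data in the usage note are all
hypothetical. A real converter would also need a header line naming the
columns, since the field names no longer appear on each row.

```python
MIDDLE_DOT = "\u00b7"  # U+00B7 MIDDLE DOT, standing in for commas inside values


def regroup(lines):
    """Collect tab-separated records into one dict of fields per code point."""
    rows = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        rows.setdefault(codepoint, {})[field] = value.replace(",", MIDDLE_DOT)
    return rows


def to_csv_lines(rows, fields):
    """Yield one comma-separated line per code point, in the given column order."""
    for codepoint, values in rows.items():
        # Fields a character lacks become empty columns.
        yield ",".join([codepoint] + [values.get(f, "") for f in fields])
```

Given three illustrative input lines, two for U+4E00 (a kDefinition of
"one; a, an; alone" and a kTotalStrokes of 1) and one for U+4E8C (a
kDefinition of "two; twice"), and the column order kDefinition,
kTotalStrokes, this yields "U+4E00,one; a· an; alone,1" and
"U+4E8C,two; twice,". The code point appears once per character, and a
missing field costs only a single comma.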

But changing the format of the file might make it harder for some
users to find the data they seek. So I'm not necessarily proposing
any change, but rather pointing out that alternatives exist.

In spite of its unwieldy size, UNIHAN.TXT is a useful tool and I'm
grateful for its existence.

Best regards,

James Kass

