Re: Unihan.txt and other possible representations of the data

From: Ernest Cline
Date: Wed Apr 21 2004 - 19:52:41 EDT

    > [Original Message]
    > From: Tom Emerson
    > To: Gary P. Grosso
    > Date: 4/21/2004 12:58:38 PM
    > Subject: Re: Unihan.txt and other possible representations of the data
    > Gary P. Grosso writes:
    > > There may be value in an HTML representation, utilizing links
    > > and multiple files. What would the logical division(s) be?
    > > Or has this already been done?
    > I'm working on a proposal for generating different representations of
    > Unihan, and this includes logical divisions. I'll post a draft when I
    > have something ready.

    The obvious division is to put the dictionary stuff in one document
    (or group of documents) and to put the encoding equivalencies in
    another document, and the numeric information in a third.
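
    As a rough illustration of that three-way division, here is a minimal
    Python sketch. It assumes the tab-separated "U+XXXX<TAB>tag<TAB>value"
    line layout of Unihan.txt. The assignment of tags to groups and the
    names classify and split_unihan are my own assumptions, not part of
    any proposal, and the sample lines are only illustrative.

```python
from collections import defaultdict

# Assumed grouping of a few Unihan tags; the real division is up for discussion.
DICTIONARY = {"kDefinition", "kMandarin", "kCantonese"}
MAPPINGS = {"kBigFive", "kGB0", "kJis0"}
NUMERIC = {"kAccountingNumeric", "kOtherNumeric", "kPrimaryNumeric"}


def classify(tag):
    """Return the (hypothetical) document a tag would be placed in."""
    if tag in NUMERIC:
        return "numeric"
    if tag in MAPPINGS:
        return "mappings"
    if tag in DICTIONARY:
        return "dictionary"
    return "other"


def split_unihan(lines):
    """Group data lines by category, skipping comments and blank lines."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        _codepoint, tag, _value = line.split("\t", 2)
        groups[classify(tag)].append(line)
    return dict(groups)


# Illustrative sample lines, not copied from the real file.
SAMPLE = [
    "# comment line",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E00\tkBigFive\tA440",
    "U+4E00\tkPrimaryNumeric\t1",
    "U+4E00\tkMandarin\tYI1",
]

result = split_unihan(SAMPLE)
for name, entries in sorted(result.items()):
    print(name, len(entries))
```

    Each group could then be written to its own file; tags that fit none
    of the three sets fall into "other" rather than being dropped.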

    However, if backward compatibility could be sacrificed, there would
    be an easy way to shave 2 MB off the size of Unihan.txt: get rid of
    the initial "U+" on every line. It may be only 10%, but it's an
    irritating 10% because it's totally worthless. Removing it wouldn't
    do much to shrink the compressed file, though: the prefix is so
    redundant that any good compression scheme already takes advantage
    of it.
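
    For anyone who wants to check the effect on their own copy, here is a
    minimal Python sketch. It uses synthetic lines in the Unihan.txt
    layout rather than the real file, and zlib as a stand-in for whatever
    compressor is used for the distributed archive. The helper names
    strip_prefix and sizes are hypothetical.

```python
import zlib


def strip_prefix(lines):
    """Drop the leading "U+" from each line that has one."""
    return [line[2:] if line.startswith("U+") else line for line in lines]


def sizes(lines):
    """Return (raw byte count, zlib-compressed byte count) for the lines."""
    raw = "\n".join(lines).encode("utf-8")
    return len(raw), len(zlib.compress(raw, 9))


# Synthetic stand-in data: 5000 lines shaped like Unihan.txt entries.
lines = [
    f"U+{cp:04X}\tkTotalStrokes\t{cp % 30 + 1}"
    for cp in range(0x4E00, 0x4E00 + 5000)
]
stripped = strip_prefix(lines)

raw_before, zip_before = sizes(lines)
raw_after, zip_after = sizes(stripped)

# The raw saving is exactly two bytes per line.
raw_saved = raw_before - raw_after
zip_saved = zip_before - zip_after

print("raw bytes saved:", raw_saved)
print("compressed bytes saved:", zip_saved)
```

    Pointing sizes() at the lines of the real Unihan.txt would show the
    actual figures; the synthetic data only shows the shape of the
    comparison.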

    This archive was generated by hypermail 2.1.5 : Wed Apr 21 2004 - 20:41:23 EDT