Fwd: Unihan SQL access

From: Uriah Eisenstein (uriaheisenstein@gmail.com)
Date: Sat Oct 16 2010 - 09:07:42 CDT

    Well, I've added support for the remaining few fields, and while at it
    upgraded to Unihan 6.0.0 which is just out, and made quite a few other
    The only remaining piece of data not handled is a single line in each of
    kFenn and kHKGlyph that includes two entries instead of one; I wasn't sure
    whether this is intentional or not. Nice to see some of the questionable
    entries (1- or 2-character kDefinition values, so far) have been fixed
    already in the new Unihan version :)

    ---------- Forwarded message ----------
    From: Uriah Eisenstein <uriaheisenstein@gmail.com>
    Date: Thu, Sep 30, 2010 at 8:48 PM
    Subject: Fwd: Unihan SQL access
    To: unicode List <unicode@unicode.org>

    As usual this took longer than I thought... But an initial version is
    finally ready, and can be found in
    It requires access to the Unihan.zip file and a JDBC driver; there are
    explanations on the web page which I hope will be enough. Quite a few
    improvements are already planned... I'd be glad to hear anyone finds it

    While at it, I found a couple of apparent typos in the source indications
    of variants (using SELECT DISTINCT SOURCE FROM VARIANT_SOURCE). These all
    come from the kSemanticVariant field:

    SELECT * FROM kSemanticVariant_source
    WHERE kSemanticVariant_source IN ('kMathews', 'kMeterWempe')

    [U+3C92] 勽 [U+52FD] kMathews
    勽 [U+52FD] [U+3C92] kMathews
    [U+25500] 渹 [U+6E39] kMeterWempe
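
    The check above can be sketched outside the original tool. The following
    is a minimal illustration, not the author's Java/JDBC program: it uses
    Python's sqlite3 module as a stand-in database, and the table name
    variant_source, its columns, and the helper functions are all
    hypothetical. The sample lines are reconstructed in Unihan's
    tab-separated layout from the three rows quoted above, and KNOWN_SOURCES
    is only a partial list of Unihan dictionary field names used for
    comparison.

```python
import difflib
import sqlite3

# Partial list of Unihan dictionary field names, for comparison only.
KNOWN_SOURCES = ["kMatthews", "kMeyerWempe", "kLau", "kFenn", "kHanYu"]

# Reconstructed from the three rows quoted in the message (assumed layout:
# code point, field name, value, separated by tabs).
SAMPLE_LINES = [
    "U+3C92\tkSemanticVariant\tU+52FD<kMathews",
    "U+52FD\tkSemanticVariant\tU+3C92<kMathews",
    "U+25500\tkSemanticVariant\tU+6E39<kMeterWempe",
]


def parse_variant_sources(line):
    """Yield (code_point, variant, source) triples from one variant line."""
    code_point, _field, value = line.rstrip("\n").split("\t")
    for entry in value.split():              # several variants may share a line
        variant, _, sources = entry.partition("<")
        for source in filter(None, sources.split(",")):
            # Drop any ":T"-style suffix that follows the source name.
            yield code_point, variant, source.split(":")[0]


def suspicious_sources(lines):
    """Map each source name not in KNOWN_SOURCES to its closest known name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variant_source (code_point TEXT, variant TEXT, source TEXT)"
    )
    for line in lines:
        conn.executemany(
            "INSERT INTO variant_source VALUES (?, ?, ?)",
            parse_variant_sources(line),
        )
    rows = conn.execute(
        "SELECT DISTINCT source FROM variant_source ORDER BY source"
    ).fetchall()
    conn.close()
    return {
        source: difflib.get_close_matches(source, KNOWN_SOURCES, n=1)
        for (source,) in rows
        if source not in KNOWN_SOURCES
    }


if __name__ == "__main__":
    for source, guess in suspicious_sources(SAMPLE_LINES).items():
        print(source, "->", guess)
```

    Comparing against a list of known names is one way to surface likely
    typos automatically; reading the SELECT DISTINCT output by eye, as in
    the message, works just as well for a list this short.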

    Uriah Eisenstein

    ---------- Forwarded message ----------
    From: Uriah Eisenstein <uriaheisenstein@gmail.com>
    Date: Sun, Sep 12, 2010 at 5:57 PM
    Subject: Unihan SQL access
    To: unicode List <unicode@unicode.org>

    I'm nearing completion of a simple Java program which loads Unihan data from
    the source files into a DB, and provides SQL access to it. There's still at
    least a week or so of work on issues I consider essential, but once ready
    I'd be happy to make it available on the Internet if anyone's interested.
    So far I've used it to search for possibly erroneous data in Unihan; my
    latest find is that 73 characters have a kTaiwanTelegraph value of 0000,
    which seems doubtful. It can also be useful for various statistical
    information such as how many characters are listed under each radical, or
    which blocks include IICore characters.
    I'm also considering adding the contents of the Unicode Character Database
    at a later phase.
    Uriah Eisenstein
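
    As a rough sketch of the load-and-query idea described in this message,
    the example below reads Unihan-style tab-separated lines into an
    in-memory SQLite table and runs the two kinds of query mentioned: values
    of 0000 in kTaiwanTelegraph, and a count of characters per radical. It
    is not the author's Java program; the unihan table, its columns, and the
    function names are assumptions made for illustration. The sample rows
    are illustrative only and are not guaranteed to match the real Unihan
    files; in particular, the 0000 value is invented to exercise the query.

```python
import sqlite3

# Illustrative rows in the assumed layout: code point, field, value (tabs).
# The 0000 entry is invented for this sketch, not taken from Unihan.
SAMPLE_LINES = [
    "# comment lines and blank lines are skipped",
    "",
    "U+4E00\tkRSUnicode\t1.0",
    "U+4E01\tkRSUnicode\t1.1",
    "U+4E59\tkRSUnicode\t5.0",
    "U+4E00\tkTaiwanTelegraph\t0001",
    "U+4E01\tkTaiwanTelegraph\t0002",
    "U+4E59\tkTaiwanTelegraph\t0000",
]


def load_unihan(lines):
    """Load tab-separated Unihan-style lines into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE unihan (code_point TEXT, field TEXT, value TEXT)")
    rows = (
        line.rstrip("\n").split("\t", 2)
        for line in lines
        if line.strip() and not line.startswith("#")
    )
    conn.executemany("INSERT INTO unihan VALUES (?, ?, ?)", rows)
    return conn


def zero_telegraph(conn):
    """Return code points whose kTaiwanTelegraph value is 0000."""
    cursor = conn.execute(
        "SELECT code_point FROM unihan "
        "WHERE field = 'kTaiwanTelegraph' AND value = '0000' "
        "ORDER BY code_point"
    )
    return [code_point for (code_point,) in cursor]


def radical_counts(conn):
    """Count characters per radical, taking the text before '.' in kRSUnicode."""
    cursor = conn.execute(
        "SELECT substr(value, 1, instr(value, '.') - 1) AS radical, COUNT(*) "
        "FROM unihan WHERE field = 'kRSUnicode' "
        "GROUP BY radical ORDER BY CAST(radical AS INTEGER)"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = load_unihan(SAMPLE_LINES)
    print(zero_telegraph(conn))
    print(radical_counts(conn))
```

    A single three-column table keeps the sketch short; a real loader would
    likely want one table per field or per field group, and would need to
    handle kRSUnicode lines that carry more than one value.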
