unihan-etl: create exports of UNIHAN db to csv, json and yaml from Tony Narlock via Unicode on 2017-05-30 (Unicode Mail List Archive)

From: Tony Narlock via Unicode <unicode_at_unicode.org>
Date: Tue, 30 May 2017 08:07:05 -0700

I have created a tool in Python to extract and transform the UNIHAN database's
information. It’s open source (MIT-licensed) and offers users customized
outputs. It’s documented extensively at https://unihan-etl.git-pull.com. In
addition, the project’s source code can be found at
https://github.com/cihai/unihan-etl.

I split this tool off into its own project because studying the fields and
extracting the information correctly took considerable time and effort. The
hope is that one day a traveller going down the same path will find it useful.

The need for this has come up on this list at least once before, back in 2004:
http://unicode.org/mail-arch/unicode-ml/y2004-m04/0255.html

> I'm trying to pare Unihan.txt down to a less unwieldy size for my own use
> by eliminating properties that are of no interest to me and would like to
> be certain that eliminating the four properties containing the actual
> values for those dictionaries can be done safely because the information
> can be reconstituted if necessary from the kIRG* properties since I'm not
> certain if those properties are of interest to me.

There are developers who may only want to extract a pre-determined set of
fields.
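
For context, the raw Unihan files are tab-separated lines of code point,
field name and value, with comment lines starting with '#'. Filtering them by
hand might look roughly like the Python sketch below. The function name
filter_fields and the sample lines are illustrative only and are not part of
unihan-etl.

```python
# Sample lines laid out like the raw Unihan files (illustrative data).
SAMPLE = """\
# Comment lines in Unihan files start with '#'
U+4E00\tkMandarin\tyī
U+4E00\tkTotalStrokes\t1
U+4E8C\tkMandarin\tèr
"""


def filter_fields(lines, wanted):
    """Collect {code_point: {field: value}} for the wanted field names only."""
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Each data line is: code point, field name, value (tab-separated).
        ucn, field, value = line.split("\t", 2)
        if field in wanted:
            result.setdefault(ucn, {})[field] = value
    return result


rows = filter_fields(SAMPLE.splitlines(), {"kMandarin"})
print(rows)
```

Doing this correctly for every field is where the effort goes, since many
values have their own internal structure.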

$ pip install --user unihan-etl

Then export the values to a CSV file (UNIHAN is downloaded automatically):

$ unihan-etl

Only pull custom fields (once downloaded, Unihan.zip is cached for reuse):

$ unihan-etl -f kMandarin kNelson kMorohashi

This pulls out only those fields. For structured output in JSON (empty
values are pruned automatically):

$ unihan-etl -f kMandarin kNelson kMorohashi -F json
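
The pruning of empty values can be pictured with the short Python sketch
below. It shows the idea only and is not unihan-etl's actual implementation;
the record layout (a "char" key plus one key per field), the helper name
prune_empty and the blank sample values are assumptions for illustration.

```python
import json


def prune_empty(record):
    """Return a copy of record without keys whose value is None or empty."""
    return {k: v for k, v in record.items() if v not in (None, "", [], {})}


# Hypothetical records; empty strings and None stand in for missing fields.
records = [
    {"char": "一", "kMandarin": "yī", "kNelson": "", "kMorohashi": None},
    {"char": "二", "kMandarin": "èr", "kNelson": "", "kMorohashi": ""},
]

pruned = [prune_empty(r) for r in records]

# ensure_ascii=False keeps the CJK characters readable in the output.
print(json.dumps(pruned, ensure_ascii=False, indent=2))
```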

With pyyaml installed, you can also use -F yaml:

$ pip install pyyaml
$ unihan-etl -f kMandarin kNelson kMorohashi -F yaml

To see all the command line options:
http://unihan-etl.git-pull.com/en/latest/cli.html

Container format: To keep the data exports as portable as possible, they
follow the Data Packages standard (
http://frictionlessdata.io/data-packages/). UNIHAN is a trickier data set,
since its fields pack quite a bit of detail into each value. Other data sets
such as CEDict will also be made available as data packages.
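
As a rough illustration of the container format: a Data Package is a
directory with a datapackage.json descriptor that lists its resources. The
Python sketch below writes a minimal descriptor. The package name, resource
name and file path are hypothetical and do not reflect what unihan-etl itself
emits.

```python
import json
import os
import tempfile

# Minimal descriptor: a package name plus a list of resources.
descriptor = {
    "name": "unihan-export-example",  # hypothetical package name
    "resources": [
        {
            "name": "unihan",        # hypothetical resource name
            "path": "unihan.csv",    # hypothetical path, relative to the descriptor
            "format": "csv",
        }
    ],
}


def write_descriptor(directory, descriptor):
    """Write datapackage.json into directory and return its path."""
    path = os.path.join(directory, "datapackage.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(descriptor, f, indent=2)
    return path


# Write the descriptor to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    descriptor_path = write_descriptor(tmp, descriptor)
    with open(descriptor_path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded["resources"][0]["path"])
```

A consumer only needs to read the descriptor to discover which files belong to
the package and in what format they are stored.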

Backstory: I am trying to create a spiritual successor to cjklib (
https://pypi.python.org/pypi/cjklib). The project aims to pull in CJK
datasets and make them accessible under one library. Datasets are also
going to be available a la carte via a consistent data standard (Data
Packages). I am opting to use the UNIHAN database as the core of the CJK data
sources. The project’s homepage is https://cihai.git-pull.com.
Received on Tue May 30 2017 - 10:22:20 CDT
