Re: SQL version of the Unicode database?

From: Kenneth Whistler
Date: Fri Jul 06 2007 - 13:15:31 CDT

    Stephane Bortzmeyer asked:

    > For various studies of the Unicode database, I prefer to work with a
    > SQL version.
    > I do not find one (even unofficial) on

    There isn't one.

    > Writing a conversion tool from
    > the CSV-like format of
    > to SQL seems
    > trivial but, being lazy, I wonder if it has been done already?

    Please note that "The Unicode Character Database" comprises
    everything under Public/x.y.z/ucd/ for a particular version
    of the standard, and is not *just* UnicodeData.txt. There
    are many character properties defined in the other data files,
    and the complete collection of character properties is thus
    much more complex than just UnicodeData.txt.
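
    For the UnicodeData.txt part alone, the conversion asked about might
    look roughly like the sketch below. It assumes the usual layout of 15
    semicolon-separated fields per line and loads them into a SQLite table.
    The table name, the column names, and the load_unicode_data() helper
    are illustrative choices, not anything published with the UCD. Every
    field is stored as text, and the "First"/"Last" range entries are not
    expanded. It covers none of the other data files.

```python
import sqlite3

# Column names are illustrative labels for the 15 UnicodeData.txt fields.
COLUMNS = [
    "code", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
]

# A few lines in the UnicodeData.txt layout, used here instead of the file.
SAMPLE_LINES = [
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]


def load_unicode_data(lines, conn):
    """Create a hypothetical unicode_data table and fill it from `lines`.

    Returns the number of rows inserted. Empty fields become NULL.
    """
    column_defs = ", ".join(f"{name} TEXT" for name in COLUMNS)
    conn.execute(f"CREATE TABLE unicode_data ({column_defs})")

    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields: {line!r}")
        rows.append([field or None for field in fields])

    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT INTO unicode_data VALUES ({placeholders})", rows
    )
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_unicode_data(SAMPLE_LINES, connection)
    query = "SELECT code, name FROM unicode_data WHERE general_category = 'Lu'"
    for row in connection.execute(query):
        print(row)
```

    To run this against the real file, pass an open file object for
    UnicodeData.txt in place of SAMPLE_LINES.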

    > More generally, any unofficial repository of the UCD in SQL / XML /
    > JSON / whatever, somewhere?

    The UTC has been working for some time now on an XML version
    of the Unicode Character Database. At some point not too
    far off there should be a public review issue to review
    the schema for that version. It is feasible to put the
    entire UCD into XML, but the UTC has had to resolve a number
    of edge cases on property definitions to make everything

    I'm not entirely sure what you mean by "the UCD in SQL", however.
    A SQL database is a series of tables defined *in* a SQL
    DBMS, rather than a flat file or set of files posted in
    a directory. Perhaps you are referring to a set of exported
    SQL database backup files that could then be imported into
    another SQL DBMS to create a copy of the database. However,
    this runs afoul of the same issue that has made it difficult
    to publish an XML version of the data -- you need to first
    have a consistent schema for the entire database before you
    can create an actual database and import all the data into it.

    There *is* a SQL database specifically for the Unihan
    portion of the UCD. That is running on a live MySQL
    DBMS, and you can make queries on it from:

    The Unihan.txt data file is actually just a periodic export
    of a specified number of data fields from the SQL database.
    And the queries include access to publicly available
    dictionary information about Han characters, as well as
    the rest of the Unihan.txt information.
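
    Going the other way, from the exported file back into a local SQL
    table, could be sketched as follows. This assumes the Unihan.txt
    layout of tab-separated lines (code point as U+XXXX, field tag,
    value) with "#" comment lines. The unihan table and the load_unihan()
    helper are hypothetical and say nothing about the schema of the live
    MySQL database. The sample lines are for illustration only.

```python
import sqlite3

# Illustrative lines in the Unihan.txt layout: U+XXXX <tab> field <tab> value.
SAMPLE_UNIHAN = [
    "# comment lines start with a hash mark",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E8C\tkTotalStrokes\t2",
]


def load_unihan(lines, conn):
    """Load (code point, field, value) triples into a hypothetical table.

    Returns the number of rows inserted.
    """
    conn.execute(
        "CREATE TABLE unihan ("
        " code_point INTEGER NOT NULL,"
        " field TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (code_point, field))"
    )

    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        code, field, value = line.split("\t", 2)
        if not code.startswith("U+"):
            raise ValueError(f"unexpected code point syntax: {code!r}")
        # Store the code point as an integer so range queries are simple.
        rows.append((int(code[2:], 16), field, value))

    conn.executemany("INSERT INTO unihan VALUES (?, ?, ?)", rows)
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_unihan(SAMPLE_UNIHAN, connection)
    query = "SELECT field, value FROM unihan WHERE code_point = ?"
    for row in connection.execute(query, (0x4E00,)):
        print(row)
```

    One row per (code point, field) pair avoids needing a column for
    every field tag, at the cost of keeping every value as text.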

    This archive was generated by hypermail 2.1.5 : Fri Jul 06 2007 - 13:18:56 CDT