Re: SQL version of the Unicode database?

From: Stephane Bortzmeyer (bortzmeyer@nic.fr)
Date: Fri Sep 07 2007 - 09:47:56 CDT

    On Fri, Jul 06, 2007 at 11:32:36AM +0200,
     Stephane Bortzmeyer <bortzmeyer@nic.fr> wrote
     a message of 11 lines which said:

    > For various studies of the Unicode database, I prefer to work with a
    > SQL version.

    Several suggestions have been made.

    1) Some people suggested simply loading UnicodeData.txt into a DBMS
    (most DBMSs can easily load a CSV or CSV-like file). This is not a
    good solution, because part of the data lives in other files (such
    as the Han properties) and because UnicodeData.txt gives some
    character ranges only by their first and last code points.

    2) Some people suggested waiting for the XML version of the UCD (which
    is now in beta test, see http://www.unicode.org/review/pr-109.html).
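
    To illustrate the range problem from suggestion 1), here is a minimal
    sketch in Python (not the attached Lua program). It assumes the
    semicolon-separated layout of UnicodeData.txt, where a range appears
    as a "First" line and a "Last" line. The SAMPLE excerpt and the
    function name expand are made up for this example, and the label
    given to range members is only the range label, not the real
    character name.

```python
# Sketch only: SAMPLE is a three-line excerpt in UnicodeData.txt format
# (range end as in Unicode 5.0), not the full file.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FBB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def expand(lines):
    """Yield (codepoint, name, general_category), one tuple per character.

    A "<..., First>" / "<..., Last>" pair is expanded into one tuple per
    code point in the range, which a plain CSV import would not do.
    """
    range_start = None
    for line in lines:
        if not line.strip():
            continue
        fields = line.split(";")
        codepoint = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = codepoint
        elif name.endswith(", Last>"):
            # Strip the leading "<" and the trailing ", Last>".
            label = name[1:-len(", Last>")]
            for cp in range(range_start, codepoint + 1):
                yield cp, label, category
            range_start = None
        else:
            yield codepoint, name, category


rows = list(expand(SAMPLE.splitlines()))
print(len(rows), "rows from", len(SAMPLE.splitlines()), "lines")
```

    A naive import of the same three lines would produce three rows, two
    of which are not characters at all.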

    So I wrote my own (very incomplete) solution. It is a simple program
    (257 lines, but it is far from handling everything in the UCD, which
    is a rich and complicated database). It was more complicated than
    foreseen because the UCD is complex and the structure of its text
    files is not always easy to handle. But it works for my purposes, and
    I can now write things like:

    SELECT To_U(Characters.codepoint) AS UCodepoint, name, definition
       FROM Characters, Han_Properties WHERE
       Characters.codepoint = Han_Properties.codepoint AND
       definition ILIKE '%turtle%';
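
    For readers without the attached schema, the following is a rough,
    self-contained sketch of the same kind of query, using Python's
    sqlite3 module instead of PostgreSQL. Everything in it is an
    assumption made for illustration: the two-column tables are a guess
    at the minimum the query needs (not the real schema), to_u is a
    hypothetical stand-in for the To_U function, and the definitions are
    placeholder text rather than data copied from the Unihan files.
    SQLite has no ILIKE, but its LIKE is already case-insensitive for
    ASCII.

```python
import sqlite3


def to_u(codepoint):
    """Hypothetical stand-in for To_U: format an integer as U+XXXX."""
    return "U+%04X" % codepoint


conn = sqlite3.connect(":memory:")
conn.create_function("To_U", 1, to_u)

# Assumed minimal schema; the real one has many more columns and tables.
conn.executescript("""
    CREATE TABLE Characters (codepoint INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Han_Properties (
        codepoint INTEGER REFERENCES Characters(codepoint),
        definition TEXT
    );
""")

# Illustrative rows with placeholder definitions.
conn.executemany(
    "INSERT INTO Characters VALUES (?, ?)",
    [(0x4E00, "CJK UNIFIED IDEOGRAPH-4E00"),
     (0x9F9C, "CJK UNIFIED IDEOGRAPH-9F9C")],
)
conn.executemany(
    "INSERT INTO Han_Properties VALUES (?, ?)",
    [(0x4E00, "one"), (0x9F9C, "turtle or tortoise")],
)

rows = conn.execute("""
    SELECT To_U(Characters.codepoint) AS UCodepoint, name, definition
      FROM Characters
      JOIN Han_Properties
        ON Characters.codepoint = Han_Properties.codepoint
     WHERE definition LIKE '%turtle%'
""").fetchall()

for row in rows:
    print(row)
```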

    In case some people find them useful, I attach the SQL schema
    (tested on PostgreSQL; remember that very few real-world SQL files
    are portable) and the program, written in Lua
    (http://www.lua.org/). Feel free to use them as you want.
