Re: SQL version of the Unicode database?

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jul 06 2007 - 13:15:31 CDT

Next message: Sinnathurai Srivas: "Re: Subj: uniscribe and Tamil U+0BB6"

Previous message: Stephane Bortzmeyer: "Re: SQL version of the Unicode database?"
Maybe in reply to: Stephane Bortzmeyer: "SQL version of the Unicode database?"
Next in thread: Mike: "Re: SQL version of the Unicode database?"
Reply: Mike: "Re: SQL version of the Unicode database?"
Reply: Stephane Bortzmeyer: "Re: SQL version of the Unicode database?"

Stephane Bortzmeyer asked:

> For various studies of the Unicode database, I prefer to work with a
> SQL version.
>
> I do not find one (even unofficial) on
> http://www.unicode.org/Public/UNIDATA/.

There isn't one.

> Writing a conversion tool from
> the CSV-like format of
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt to SQL seems
> trivial but, being lazy, I wonder if it has been done already?

Please note that "The Unicode Character Database" comprises
everything under Public/x.y.z/ucd/ for a particular version
of the standard, and is not *just* UnicodeData.txt. There
are many character properties defined in the other data files,
and the complete collection of character properties is thus
much more complex than just UnicodeData.txt.
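
For the UnicodeData.txt part alone, the conversion can be sketched roughly as below. This is a minimal illustration in Python with SQLite, not an existing tool: the table name, the column names, and the function load_unicode_data are hypothetical labels for the 15 semicolon-separated fields, and the three sample lines stand in for the real file. It stores range markers such as "<CJK Ideograph, First>" as ordinary rows and does not touch any of the other UCD data files.

```python
import sqlite3

# Hypothetical column labels for the 15 fields of a UnicodeData.txt line.
COLUMNS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal_value", "digit_value", "numeric_value",
    "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
]

# A few lines in UnicodeData.txt format, standing in for the full file.
SAMPLE = [
    "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]


def load_unicode_data(conn, lines):
    """Create one table and insert one row per input line."""
    cols = ", ".join(f"{name} TEXT" for name in COLUMNS[1:])
    conn.execute(f"CREATE TABLE unicode_data (code INTEGER PRIMARY KEY, {cols})")
    placeholders = ", ".join("?" * len(COLUMNS))
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(";")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields: {line!r}")
        # Code point as an integer; empty fields become NULL.
        row = [int(fields[0], 16)] + [f or None for f in fields[1:]]
        conn.execute(f"INSERT INTO unicode_data VALUES ({placeholders})", row)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_unicode_data(conn, SAMPLE)
upper = conn.execute(
    "SELECT name FROM unicode_data WHERE general_category = 'Lu'"
).fetchall()
```

With the real file one would pass an open file object instead of SAMPLE.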

> More generally, any unofficial repository of the UCD in SQL / XML /
> JSON / whatever, somewhere?

The UTC has been working for some time now on an XML version
of the Unicode Character Database. At some point not too
far off there should be a public review issue covering
the schema for that version. It is feasible to put the
entire UCD into XML, but the UTC has had to resolve a number
of edge cases on property definitions to make everything
consistent.

I'm not entirely sure what you mean by "the UCD in SQL", however.
A SQL database is a series of tables defined *in* a SQL
DBMS, rather than a flat file or set of files posted in
a directory. Perhaps you are referring to a set of exported
SQL database backup files that could then be imported into
another SQL DBMS to create a copy of the database. However,
this runs afoul of the same issue that has made it difficult
to publish an XML version of the data -- you need to first
have a consistent schema for the entire database before you
can create an actual database and import all the data into it.
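
To illustrate the schema question with one small case: many of the other data files list a property value per code point range rather than per character, so they need a differently shaped table from UnicodeData.txt. The following is only a sketch in Python with SQLite under my own assumptions: the table script_ranges, the functions parse_property_lines and script_of, and the abbreviated sample lines (written in the style of Scripts.txt) are all hypothetical, and this is not a proposed schema for the UCD.

```python
import sqlite3

# Abbreviated sample lines in the style of Scripts.txt (illustrative only).
SAMPLE = [
    "# comment lines and blank lines are skipped",
    "",
    "0041..005A    ; Latin # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00AA          ; Latin # FEMININE ORDINAL INDICATOR",
    "0391..03A1    ; Greek # GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO",
]


def parse_property_lines(lines):
    """Yield (first, last, value) for each 'range ; value # comment' line."""
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        first, _, last = fields[0].partition("..")
        # A single code point is treated as a one-element range.
        yield int(first, 16), int(last or first, 16), fields[1]


def script_of(conn, code):
    """Return the script value covering a code point, or None."""
    row = conn.execute(
        "SELECT script FROM script_ranges WHERE ? BETWEEN first AND last",
        (code,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE script_ranges (first INTEGER, last INTEGER, script TEXT)"
)
conn.executemany(
    "INSERT INTO script_ranges VALUES (?, ?, ?)", parse_property_lines(SAMPLE)
)
conn.commit()
```

Here script_of(conn, 0x42) returns "Latin", while a code point outside every listed range returns None; deciding what such a gap should mean for each property is exactly the kind of edge case a consistent schema has to settle.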

There *is* a SQL database specifically for the Unihan
portion of the UCD. That is running on a live MySQL
DBMS, and you can make queries on it from:

http://www.unicode.org/charts/unihan.html

The Unihan.txt data file is actually just a periodic export
of a specified number of data fields from the SQL database.
And the queries include access to publicly available
dictionary information about Han characters, as well as
the rest of the Unihan.txt information.
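
Going the other direction, the exported Unihan.txt file can be loaded back into a local table fairly directly, since each data line is a tab-separated triple of code point, field name, and value. The sketch below uses Python with SQLite; the table unihan and the function parse_unihan are hypothetical names, the sample lines are illustrative, and nothing here reflects how the live MySQL database is actually organized.

```python
import sqlite3

# Illustrative lines in Unihan.txt format: U+XXXX <tab> field <tab> value.
SAMPLE = [
    "# comment lines start with '#'",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E8C\tkTotalStrokes\t2",
]


def parse_unihan(lines):
    """Yield (code point, field name, value) for each data line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # Split on the first two tabs only, so the value is kept whole.
        code, field, value = line.split("\t", 2)
        yield int(code[2:], 16), field, value


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE unihan (code INTEGER, field TEXT, value TEXT, "
    "PRIMARY KEY (code, field))"
)
conn.executemany("INSERT INTO unihan VALUES (?, ?, ?)", parse_unihan(SAMPLE))
conn.commit()

strokes = conn.execute(
    "SELECT code, value FROM unihan WHERE field = 'kTotalStrokes' ORDER BY code"
).fetchall()
```

A single three-column table like this avoids fixing the set of fields in advance, at the cost of storing every value as text.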

This archive was generated by hypermail 2.1.5 : Fri Jul 06 2007 - 13:18:56 CDT