Adding Unicode Support to an Open Source SQL Database: Experience with MySQL
Thomas R. Emerson - Basis Technology Corporation
MySQL is a popular open source, portable, high-performance SQL database server available for most Unix platforms, OS/2, and Windows NT. It is widely used as the back end for Web applications and is supported by PHP, Phorum, and Midgaard, among others. APIs are also available for C, C++, Python, Perl, PHP, Tcl, Dylan, and a number of other languages.
By default the MySQL server uses ISO 8859-1 as its standard character set. It is possible, at compile time, to change the character set/language used by the server for sorting behavior and text encoding. Supported languages include Chinese (Simplified and Traditional), Croatian, Czech, Danish, German, Greek, Hebrew, Japanese, Korean, Russian, Swedish, and Ukrainian. No separation is provided between encoding and language, and the server cannot support multiple languages in a single executable. If you need to maintain data in multiple languages or encodings, you are out of luck.
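To illustrate why encoding and language need to be separated, the following minimal sketch sorts the same Latin-1-encodable strings under two language conventions. It is not MySQL code. The key functions `swedish_key` and `german_key` and the word list are hypothetical and greatly simplified; they assume only that Swedish places å, ä, and ö after z, while German dictionary ordering treats ä as a.

```python
# Illustrative only: two language-specific orderings over the same encoding.
# Both key functions are simplified assumptions, not real collation tables.

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def swedish_key(word: str) -> list:
    """Order letters by their position in the Swedish alphabet.

    Raises ValueError for characters outside the alphabet above.
    """
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


# German dictionary-style ordering folds umlauts onto their base letters.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def german_key(word: str) -> str:
    """Order words after folding umlauts to base letters."""
    return word.lower().translate(GERMAN_FOLD)


words = ["zon", "ära", "arm"]  # arbitrary sample strings

# Every word fits in ISO 8859-1, so the encoding alone cannot pick an order.
for w in words:
    w.encode("latin-1")

swedish_order = sorted(words, key=swedish_key)
german_order = sorted(words, key=german_key)

print(swedish_order)
print(german_order)
```

The two results differ even though the stored bytes are identical, which is why a single compile-time choice that fixes both encoding and sort order cannot serve multilingual data.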
Basis Technology maintains large amounts of lexical information in a variety of languages, including Japanese, Simplified and Traditional Chinese, Korean, Thai, and English. Recently we moved all of our Chinese data (almost one million UTF-8 encoded cross-referenced lexical items) as well as several mapping tables into a MySQL database. This was done using the default character set (Latin-1) while keeping the data itself in UTF-8. While this provides a usable environment, the lack of direct Unicode support in MySQL is problematic: the server treats every byte as a separate Latin-1 character, so sorting, comparison, and length limits do not operate on the actual characters.
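The kind of problem that arises when UTF-8 data is kept in a Latin-1 environment can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of the author's database: the sample string and the helper `truncate_as_latin1` are hypothetical, and the truncation stands in for any operation that counts bytes as characters.

```python
# Illustrative only: what a Latin-1 view of UTF-8 bytes looks like.

word = "中文"                                # two Chinese characters
utf8_bytes = word.encode("utf-8")           # six bytes in UTF-8
seen_as_latin1 = utf8_bytes.decode("latin-1")  # six one-byte "characters"

char_count = len(word)
latin1_count = len(seen_as_latin1)


def truncate_as_latin1(data: bytes, max_chars: int) -> bytes:
    """Hypothetical helper: cut data to max_chars Latin-1 characters.

    In Latin-1 one character is one byte, so this is a plain byte cut.
    """
    return data.decode("latin-1")[:max_chars].encode("latin-1")


# Cutting at four "characters" splits the second UTF-8 sequence.
truncated = truncate_as_latin1(utf8_bytes, 4)
try:
    truncated.decode("utf-8")
    truncation_is_valid_utf8 = True
except UnicodeDecodeError:
    truncation_is_valid_utf8 = False

# The bytes survive a round trip untouched, which is why the setup is usable.
round_trip_ok = seen_as_latin1.encode("latin-1").decode("utf-8") == word

print(char_count, latin1_count, truncation_is_valid_utf8, round_trip_ok)
```

Storage and retrieval work because Latin-1 maps every byte value to a character, but any length-based operation can split a multi-byte sequence.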
This paper describes the first stages of the full Unicode enablement of MySQL being undertaken by Basis Technology. I describe the design decisions and issues encountered, including MySQL-specific extensions allowing the user to specify the language used for each column in a table, and to import legacy-encoded data into a column. A demo of the Unicode-enabled MySQL being used alone and in conjunction with PHP/Apache will be presented.
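Importing legacy-encoded data ultimately requires transcoding it into a Unicode encoding. The abstract does not describe how the MySQL extension does this, so the following is only a minimal sketch of the general idea using Python's standard codecs. The function `to_utf8` and the sample strings are hypothetical.

```python
# Illustrative only: transcoding legacy-encoded bytes to UTF-8.
# This does not represent the MySQL extension described in the paper.


def to_utf8(raw: bytes, legacy_encoding: str) -> bytes:
    """Decode bytes from a named legacy encoding and re-encode as UTF-8.

    Raises UnicodeDecodeError if raw is not valid in legacy_encoding.
    """
    return raw.decode(legacy_encoding).encode("utf-8")


# Sample legacy data, built here from known text so the result can be checked.
big5_data = "中文".encode("big5")           # Traditional Chinese encoding
sjis_data = "日本語".encode("shift_jis")    # Japanese encoding

big5_as_utf8 = to_utf8(big5_data, "big5")
sjis_as_utf8 = to_utf8(sjis_data, "shift_jis")

print(big5_as_utf8.decode("utf-8"))
print(sjis_as_utf8.decode("utf-8"))
```

The source encoding has to be supplied explicitly, since legacy byte sequences do not in general identify their own encoding.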
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
18 Jun 2000, Webmaster