RE: Subj: MySQl and Unicode support

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Feb 22 2007 - 16:56:15 CST

  • Next message: Philippe Verdy: "RE: Subj: MySQl and Unicode support"

    Could you send your precisions to the Unicode@unicode.org mailing list,
    where there are other responses too to your initial question?
    I can't forward your new message below without your authorization.
    Note that MySQL has several versions, not all of them are equal, and there
    are also dependencies with the backend table handlers.

    MySQL is 8-bit clean in all its versions, and MySQL 4.2 does support UTF-8.

    I am not aware of any limitation to 3 bytes instead of 4 when it handles
    UTF-8 (I just suppose that it uses 3 bytes if you use the UTF-32 encoding,
    although there's no real reason to do that, when UTF-8 will be more compact,
    even for Chinese, where non-BMP characters will be exceptional, so UTF-8 or
    SCSU would be a better choice for creating your database).

    > -----Message d'origine-----
    > De : Sinead Martin [mailto:sinead.martin@traventec.com]
    > Envoyé : jeudi 22 février 2007 16:44
    > À : verdy_p@wanadoo.fr
    > Cc : unicode@unicode.org
    > Objet : RE: Subj: MySQl and Unicode support
    >
    > Thank you for your very informative mail Phillippe,
    >
    > I am in the process of selecting a database that must support 10 languages
    > at once in the database.
    > However I still have some concerns about mySQL, perhaps you may be able to
    > address these:
    >
    > 1) One Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes in
    > UTF-8 encoding. As mySQL only supports 3 bytes it cannot support the
    > supplementary characters used by the Han characterset for Chinese and
    > Japanese which can commonly be used in Chinese and Japanese names. Is this
    > correct?
    >
    > 2) You say if I need to sort the strings it may not work as expected - is
    > this not a worry? For example if I had a table with regions or towns and
    > needed to display them in a drop down list I would have to have them
    > sorted alphabetically. Are you saying this may not be possible to do in
    > the database?
    >
    > 3) When you say 'uncased string matching' and 'case conversion' do not
    > work as expected - how then can I search for a string? Normally I would
    > convert the string to upper case and search ... how else can this be done?
    > Perhaps I have misunderstood.
    > Is this why you say to do this using the programming language (in my case
    > Java). Is this not a performance overhead?
    > You also mention the possibility of using an extra column based on the
    > collation key. but as you point out this has a performance overhead and
    > would not be acceptable to me if other opensource databases i.e.
    > PostgreSQL and Oracle XE can support this sorting feature without the need
    > to program it into the system.
    >
    > It seems the answer is that mySQL does support Unicode - but you need to
    > design the support into your database design and application, therefore
    > negating any advantage that mySQL has in performance over other databases
    > such as Postgres and Oracle who provide a more standard and robust
    > UNICODE support.
    >
    > Would you agree or disagree?
    > Thanks in advance,
    > Sinead.
    >
    > -----Original Message-----
    > From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
    > Sent: 22 February 2007 14:00
    > To: unicode@unicode.org
    > Cc: Sinead Martin
    > Subject: RE: Subj: MySQl and Unicode support
    >
    >
    > MySQL car perfectly support Unicode texts, using UTF-8 as the base
    > encoding
    > for the strings; however some SQL operators will not work as expected,
    > because they use byte positions, instead of character positions (in UTF-8,
    > Uniode characters are encoded with a variable number of bytes, from 1 byte
    > for ASCII characters up to U+007F, to 4).
    >
    > If you don't use the SQL string functions that compute:
    > * substrings based on character indices instead of byte indices
    > * case conversion,
    > * uncased string matching,
    > * sorting the results,
    > then you an perfectly use it. Some newer versions of MySQL have the
    > support
    > of "nvarchar" to support Unicode strings with the Unicode character
    > semantics, but be careful: the data type maximum length will still be
    > based
    > on the internal code units, not really on the Unicode characters (i.e.
    > code
    > points here). To allow this support, you need to install MySQL with extra
    > support libraries, notably a database of character properties (used for
    > proper case conversion and uncased/case-folded string matching).
    >
    > If your MySQL does not have such support, you can still use MySQL, but all
    > string handling that require Unicode semantics should be performed on the
    > SQL client side (for example in JSP, ASP, PHP, C, C++, #, ... provided
    > that
    > the client side language is also installed with the Unicode support).
    >
    > In fact, ANY database engine that supports storing and retrieving vectors
    > of
    > 8-bit bytes without altering them (for example a database that fully
    > supports ISO-8859-1) can be used to store UTF-8 encoded text. If you
    > carefully use the good practice of CASE design for your database, all
    > queries to the SQL engine will not require any Unicode-dependant features.
    >
    > Note that sorting (SELECT... ORDER BY...) may be an issue if you expect
    > the
    > results from the database to be in the appropriate order; but sorting also
    > has a cost on the database engine and should not influence the results (if
    > you follow the good practices of CASE design, using unique IDs for table
    > rows rather than matching them with complex joins with strings), so you
    > may
    > as well accumulate the results on the client-side, and realize the sort in
    > the client, rather than in the database engine.
    >
    > If you still want to use strings for table row unique identifiers used as
    > joining keys, make sure that your system will accept variable length keys;
    > if the key represents a person name, make sure that the string length
    > restriction in the table declaration will be sufficient to store all
    > possible encoded key lengths after computing the effective key string, if
    > the key string can contain any Unicode character.
    >
    > Note also that if you need a particular sort to be still performed by the
    > SQL engine, one solution is to include in your tables a column for storing
    > the sort keys using a binary encoding of collation keys (see the Unicode
    > UCA
    > algorithm about how to compute collation keys); then store the collation
    > key
    > in the table in addition to the effective string (this will of course have
    > an impact on the table storage size, and on the performance due to extra
    > I/O, as less rows will fit in the same page on disk).
    >
    > Note that database engines that have the full support for Unicode use
    > UCA-computed collation keys when they create an index on columns declared
    > to
    > use an Unicode-compatible string data type, for example NVARCHAR(n). You
    > don't need to declare an additional column for the collation key, as the
    > database engine performs itself the computation of collation keys (used
    > for
    > sorts, sub-selections, range-selection, and joining) within its indexes.
    >
    > There are many applications, notably for the web where MySQL is used with
    > PHP, that DO support Unicode strings which are stored in a MySQL-driven
    > database. If you know the limitations of MySQL before designing the
    > database, then it's not complicate to design the client-side code so that
    > it
    > will make SQL queries without requiring Unicode-support in the MySQL
    > database, when you can provide this support directly with the client-side
    > libraries (for example in PHP, there are tons of modules that do accept
    > Unicode strings, perform encoding conversions, ...).
    >
    > Note that PHP itself is quite similar to MySQL: it also has no native
    > support of Unicode for its strings, which are just vectors of 8-bit units.
    > But despite this limitation (where a "char" or "string" in PHP have not
    > the
    > same meaning as in Unicode), the supported data type is enough to allow
    > handling UTF-8 strings correctly. There are constraints to respect in your
    > code to avoid breaking Unicode-compatibility: the language will not
    > prevent
    > you to break those rules (notably not breaking a UTF-8 encoded character
    > by
    > splitting it in the middle of a sequence), but it is perfectly possible.
    >
    > The situation is almost identical in MOST other languages:
    > * JavaScript/ECMA-Script for example uses characters and strings that are
    > effectively vectors of 16-bit code units, and not Unicode characters (i.e.
    > full code points); the difference is that code points above U+FFFF must be
    > represented by two "characters" in JavaScript, each one corresponding to
    > "surrogates", and the language does not enforce the rule of not breaking a
    > character in the middle of a surrogate pair.
    > * C/C++ may use 8-bit "char" or 8/16/32-bit "wchar_t"; you have to study
    > the
    > platform specification of the supported data types to find which format of
    > code units are supported. Then you need to use appropriate code or
    > libraries
    > to use vectors of code units according to the expected Unicode semantics
    > and
    > requirement). The language itself will not enforce the Unicode rules, but
    > it
    > perfectly has all what it needed to support it
    >
    > In fact you could as well program in assembly language with the same
    > capabilities and lacks of native support: it's up to you to add this
    > support
    > or to use libraries that will help you in this task.
    >
    > Note that if your MySQL instance only supports Unicode characters encoded
    > with up to 3 bytes per characters, it is enough to support UTF-32 (after
    > dropping the high byte which is always null and not stored, but will be
    > regenerated implicitly at retrieval time). What this means is that it will
    > use fixed-length encoding for fixed-length Unicode strings, so the storage
    > space can be easily sized to a safe maximum. It will however probably not
    > be
    > very efficient in terms of I/O because most text will then be stored with
    > 1
    > or possibly 2 null bytes per characters, so there will be less rows per
    > disk
    > page.
    >
    > Really, for compact database storage, a SQL engine should propose some
    > standard compaction algorithm (SCSU being a good candidate); what it needs
    > to enforce is the effective Unicode string length, but not the internal
    > storage length, but it must offer a way for database designers to estimate
    > the average size and maximum size for the storage of tables. The average
    > size will be data-dependant, but not the maximum which is very strict and
    > can be enforced by the SQL engine as a simple measure in addition to the
    > maximum Unicode string length. The effective internal representation used
    > in
    > the database storage should not affect the semantics of the strings you
    > are
    > storing and retrieving, if your database says it supports Unicode.
    >
    > But no database should say it supports Unicode if it cannot accept strings
    > containing EVERY character of the 17 planes (including NULL and other
    > control characters, but possibly excluding non-characters whose exclusion
    > should better be enforced by the database engine).
    >
    > Full support of Unicode is illusory: new algorithms get developed
    > everyday,
    > as well as new characters and new properties. A SQL engine really conforms
    > only to a set of Unicode algorithms specified in a precise version, which
    > should be documented. The behaviour of the database with Unicode
    > characters
    > that were still not standardized in that Unicode version is not specified:
    >
    > Some database will adopt a "safe" behaviour and will reject strings
    > containing them, others will accept them using default properties, but
    > this
    > may affect future evolutions when the engine gets updated to support a new
    > version of Unicode: indices may need to be rebuilt for correct sorting or
    > range sub-selections according to newer UCA collation rules including the
    > newer characters. But this reordering of indices and queries may affect
    > applications and that's why some database engines offer you the choice,
    > either in the table store format (where the strict enforcement to the
    > known
    > standardized Unicode version may be encoded), or in the client connection
    > handshakes, or in the stored user properties (that determine the client
    > compatibility, but this may produce inconsistent results if there are
    > different user profiles for the same database).
    >
    > I personally think that the best place for such Unicode version
    > enforcement
    > is in the application, not in the database itself, and not in the user
    > profiles; but this is just an opinion, which may be wrong in other
    > environments where there are multiple applications with different needs
    > connecting to the same database.
    >
    > The per-user profile is for me a bad choice because it easily produces
    > inconstant results (where the queries do not reflect the actual content of
    > the database, which may produce sometimes inconsistent results, like
    > rejecting a row creation due to a unique key constraint, despite that a
    > row
    > with that key could not be retrieved by a simple SELECT, even within the
    > same SQL transaction). However, an application may still be built safely
    > with such per-user profiles (regarding handshaked Unicode versions), if it
    > is prepared to handle row creation exceptions, and do not assume that a
    > ROW
    > NOT FOUND indication (or COUNT(*)==0) after a simple SELECT (or a
    > range-SELECT) means that the row will be creatable in the same
    > transaction.
    >
    > > -----Message d'origine-----
    > > De : unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] De
    > la
    > > part de Magda Danish (Unicode)
    > > Envoyé : mercredi 21 février 2007 23:34
    > > À : unicode@unicode.org
    > > Cc : sinead.martin@traventec.com
    > > Objet : FW: Subj: MySQl and Unicode support
    > >
    > > Sinead --
    > > I am forwarding your question to the Unicode mailing list.
    > >
    > > Unicode subscribers --
    > > Does anyone have an answer to Sinead's email?
    > > -----Original Message-----
    > > Date/Time: Tue Feb 20 07:22:21 CST 2007
    > > Contact: sinead.martin@traventec.com
    > > Name: Sinead Martin
    > > Report Type: Other Question, Problem, or Feedback Opt Subject: MySQl
    > and
    > > Unicode support
    > >
    > > Hi there,
    > >
    > > I notice that MySQl is not listed on your website as a database that
    > > supports Unicode.
    > > Would you recommend mySQL as a database that needs to support 10
    > languages
    > > incuding Traditional / simplified Chinese and Japanese?
    > >
    > > I wondered whether the fact that MySQL support only 3 bytes instead of 4
    > > has any significance?
    > >
    > > Thanks in advance,
    > > Sinead Martin
    >
    >
    >
    >
    > This message (including any attachments) is intended only for
    > the use of the individual or entity to which it is addressed and
    > may contain information that is non-public, proprietary,
    > privileged, confidential, and exempt from disclosure under
    > applicable law or may constitute as attorney work product.
    > If you are not the intended recipient, you are hereby notified
    > that any use, dissemination, distribution, or copying of this
    > communication is strictly prohibited. If you have received this
    > communication in error, notify us immediately by telephone and
    > (i) destroy this message if a facsimile or (ii) delete this message
    > immediately if this is an electronic communication.
    >
    > Thank you.
    >
    > --------------------------------------------------------------------------
    > -------------
    > Orange vous informe que cet e-mail a ete controle par l'anti-virus mail.
    > Aucun virus connu a ce jour par nos services n'a ete detecte.
    >
    >



    This archive was generated by hypermail 2.1.5 : Thu Feb 22 2007 - 17:00:09 CST