Re: UTF-16 inside UTF-8

From: Philippe Verdy
Date: Tue Nov 04 2003 - 15:49:14 EST


    From: "Peter Kirk"

    > On 04/11/2003 06:37, Jill Ramonsky wrote:
    > > ... (a very old, legacy application, unaware of the existence of
    > > codepoints above U+FFFF) ...
    > Such applications are not "very old", they are still being written. For
    > example, MySQL
    > 4.1 adds UCS-2 and UTF-8 support to previous versions but for single
    > two-byte codes in UCS-2 and up to three bytes per UTF-8 character only
    > :-( - and this is still in alpha!

    Once MySQL correctly implements UCS-2, it will just be a matter of
    convention among scrupulous software writers to use it in accordance with
    Unicode when storing text in the database in a UTF-16BE/LE encoding.

    Even with that restriction, it's not difficult to comply with Unicode: they
    can use CESU-8 if they need to store characters outside the BMP, even if for
    now it won't be possible to output them with a UTF-8 label. The other issue
    is that they will have to handle UTF-16 code units, and they won't be able
    to sort strings containing characters outside the BMP other than by a
    binary sort in this area.
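
    To make the CESU-8 idea concrete, here is a minimal sketch in Python. The
    function name cesu8_encode is made up for illustration and is not part of
    MySQL or of any library. It assumes the usual CESU-8 definition: each
    UTF-16 code unit is encoded on its own, so a supplementary character
    becomes two 3-byte sequences instead of one 4-byte UTF-8 sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative helper, not a library API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters use the same 1 to 3 bytes as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split the code point into a UTF-16 surrogate pair, then
            # encode each surrogate as its own 3-byte sequence.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# Binary order versus code point order for one BMP and one supplementary
# character: U+FF5E sorts before U+10000 by code point, but after it in
# UTF-16 and CESU-8 binary order.
bmp, supp = "\uff5e", "\U00010000"
print(bmp < supp)                                          # True
print(bmp.encode("utf-8") < supp.encode("utf-8"))          # True
print(bmp.encode("utf-16-be") < supp.encode("utf-16-be"))  # False
print(cesu8_encode(bmp) < cesu8_encode(supp))              # False
```

    So a three-bytes-per-character store can hold such text, but a binary sort
    on it follows UTF-16 code unit order rather than code point order.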

    Adding UCA support for characters outside the BMP, and optimizing it, would
    be a tremendous effort in the SQL engine. But it can still be developed by
    applications independently of MySQL, or by computing collation keys
    externally and storing them (with a warning for SQL regular expressions
    with LIKE, for some operators like TOUPPER() or TOLOWER() in SQL
    expressions, and for the LENGTH() of a VARCHAR containing surrogates for
    Han supplementary characters).
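
    The length caveat can be illustrated with a short Python sketch. It
    assumes a store that counts 16-bit code units (how a given MySQL version
    actually counts is not asserted here), and both function names are
    hypothetical helpers an application might write for itself.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


def count_code_points(utf16_be: bytes) -> int:
    """Count characters in big-endian UTF-16 data by skipping low surrogates."""
    count = 0
    for i in range(0, len(utf16_be), 2):
        unit = int.from_bytes(utf16_be[i:i + 2], "big")
        # A low surrogate (DC00..DFFF) is the second half of a pair and
        # does not start a new character.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


han = "\U00020000"  # a supplementary Han ideograph (CJK Extension B)
print(utf16_code_units(han))                       # 2
print(count_code_points(han.encode("utf-16-be")))  # 1
```

    An application that keeps surrogates in a UCS-2 column would have to do
    this kind of counting itself instead of trusting the engine's result.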

    But this also means that MySQL will not fit the Chinese market, which
    needs correct support of the GB18030 standard. That standard requires full
    support of the supplementary planes as well as a few other conventions, plus
    the necessary conversion tables between Unicode code points and GB18030
    positions for characters in the BMP up to Unicode 3, and the algorithmic
    mapping for characters assigned later by Unicode.
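
    As an illustration of the algorithmic part, here is a minimal Python
    sketch of the linear four-byte mapping that GB18030 uses for the
    supplementary planes (U+10000 maps to 90 30 81 30, and the following code
    points follow in order). The function name is made up for this example;
    the BMP side still needs the conversion tables and is not covered.

```python
def gb18030_supplementary(cp: int) -> bytes:
    """Map a supplementary code point to its four GB18030 bytes (sketch)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points are handled here")
    index = cp - 0x10000
    # The four bytes form a mixed-radix number: the byte ranges are
    # 0x90.., 0x30..0x39, 0x81..0xFE, 0x30..0x39 (radices 10, 126, 10).
    b4 = 0x30 + index % 10
    index //= 10
    b3 = 0x81 + index % 126
    index //= 126
    b2 = 0x30 + index % 10
    index //= 10
    b1 = 0x90 + index
    return bytes((b1, b2, b3, b4))


print(gb18030_supplementary(0x10000).hex())  # 90308130
print(gb18030_supplementary(0x20000).hex())  # 95328236
```

    The result can be checked against an existing converter, for instance
    Python's own "gb18030" codec.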
