Re: UTF-16 inside UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 04 2003 - 15:49:14 EST

Next message: YTang0648@aol.com: "Re: GSM and Unicode"

Previous message: Philippe Verdy: "Re: GSM and Unicode"
In reply to: Peter Kirk: "Re: UTF-16 inside UTF-8"
Next in thread: Doug Ewell: "Re: UTF-16 inside UTF-8"

> On 04/11/2003 06:37, Jill Ramonsky wrote:
>
> > ... (a very old, legacy application, unaware of the existence of
> > codepoints above U+FFFF) ...
>
> Such applications are not "very old", they are still being written. For
> example (see http://www.mysql.com/doc/en/Charset-Unicode.html), MySQL
> 4.1 adds UCS-2 and UTF-8 support to previous versions but for single
> two-byte codes in UCS-2 and up to three bytes per UTF-8 character only
> :-( - and this is still in alpha!

Once MySQL correctly implements UCS-2, using it in accordance with Unicode
will just be a matter of convention among scrupulous software writers, who
can store text in the database in a UTF-16BE/LE encoding scheme.

Even with that restriction, it's not difficult to comply with Unicode: they
can use CESU-8 if they need to store characters outside the BMP, even if for
now it won't be possible to output them with a UTF-8 label. The other issue
is that they will have to handle UTF-16 code units, and strings containing
characters outside the BMP can only be ordered by a binary sort in that
range.
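
As a rough illustration of the workaround above, here is a minimal Python
sketch of storing a supplementary character as two UTF-16 surrogates, each
written as its own three-byte sequence (the scheme Unicode documents as
CESU-8). The function name cesu8_encode is hypothetical; it is not a MySQL
or Python library call.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text so that no character takes more than three bytes.

    BMP characters are encoded exactly as in UTF-8. Each supplementary
    character is split into a UTF-16 surrogate pair, and each surrogate
    is written as a separate three-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # surrogatepass lets a lone surrogate through unchanged
            out += ch.encode("utf-8", errors="surrogatepass")
        else:
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # lead surrogate
            low = 0xDC00 | (v & 0x3FF)   # trail surrogate
            for unit in (high, low):
                out += bytes((
                    0xE0 | (unit >> 12),
                    0x80 | ((unit >> 6) & 0x3F),
                    0x80 | (unit & 0x3F),
                ))
    return bytes(out)


# U+10400 becomes six bytes instead of the four bytes of real UTF-8.
print(cesu8_encode("\U00010400").hex())

# A binary sort on these bytes follows UTF-16 code unit order, so U+10000
# sorts before U+FF5E, unlike code point order.
print(sorted(["\uff5e", "\U00010000"], key=cesu8_encode))
```

The result is not valid UTF-8, which is why it should not be sent out under
a UTF-8 label.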

Adding UCA support for characters outside the BMP would take a tremendous
effort to implement and optimize in the SQL engine. But it can still be
developed by applications independent of MySQL, or by computing collation
keys externally and storing them (with a caveat for SQL pattern matching
with LIKE, for some operators like TOUPPER() or TOLOWER() in SQL
expressions, and for the LENGTH() of a VARCHAR containing surrogates for Han
supplementary ideographs).
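
The length caveat can be seen with a small Python sketch. It assumes an
engine that counts UTF-16 code units rather than code points; the helper
name utf16_length is hypothetical and only stands in for whatever such an
engine would report.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, as a UCS-2-only engine would see them."""
    return len(text.encode("utf-16-le")) // 2


han = "\U00020000"  # a supplementary Han ideograph (CJK Extension B)

print(len(han))           # code points as Python counts them
print(utf16_length(han))  # code units: one surrogate pair
```

An application that knows its data may contain surrogate pairs has to
correct such counts itself, since the engine treats each surrogate as a
separate character.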

But this also means that MySQL will not suit the Chinese market, which needs
correct support of the GB18030 standard. That standard requires full support
of the supplementary planes as well as a few other conventions, plus the
conversion tables between Unicode code points and GB18030 positions for BMP
characters up to Unicode 3, and the algorithmic mapping for characters
assigned later by Unicode.
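
For the supplementary planes, the algorithmic part of the mapping can be
sketched in Python as below. This is a minimal sketch covering only
U+10000..U+10FFFF, which GB18030 maps linearly onto four-byte codes starting
at 90 30 81 30; the function name gb18030_supplementary is hypothetical, and
the table-driven BMP mapping is not covered.

```python
def gb18030_supplementary(cp: int) -> bytes:
    """Map a supplementary code point to its four-byte GB18030 code."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are handled here")
    # 189000 is the linear index of the four-byte code 90 30 81 30.
    linear = 189000 + (cp - 0x10000)
    # Four-byte codes use the digits 81..FE, 30..39, 81..FE, 30..39.
    linear, b4 = divmod(linear, 10)
    linear, b3 = divmod(linear, 126)
    b1, b2 = divmod(linear, 10)
    return bytes((0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4))


# Cross-check against Python's built-in gb18030 codec.
for cp in (0x10000, 0x20000, 0x10FFFF):
    assert gb18030_supplementary(cp) == chr(cp).encode("gb18030")
    print(f"U+{cp:04X} -> {gb18030_supplementary(cp).hex()}")
```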


This archive was generated by hypermail 2.1.5 : Tue Nov 04 2003 - 16:49:36 EST