Re: UTF-16 inside UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 05 2003 - 08:35:48 EST

Next message: Marco Cimarosti: "Re-distributing the files in http://www.unicode.org/Public/MAPPINGS/VENDORS"

Previous message: Peter Kirk: "Re: UTF-16 inside UTF-8"
In reply to: Peter Kirk: "Re: UTF-16 inside UTF-8"
Next in thread: YTang0648@aol.com: "Re: UTF-16 inside UTF-8"

> Agreed. But to be fair to MySQL, they do mention as a potential problem
> that three bytes have to be allocated in strings for each UTF-8
> character. For full UTF-8 support they would need four bytes per
> character which would, from their perspective, be a greater problem.
> Also I suspect that Unicode data is actually being stored in 16-bit
> entities, and that the major issue is the extra complication of handling
> surrogate pairs within that representation (rather than the trivial one
> of converting such pairs to and from valid UTF-8).

Modern database engines now offer multiple encoding strategies for the
storage of characters. In SQL engines, the key issue is performance (notably
in terms of storage I/O or network I/O), but this is completely orthogonal to
the logical correctness of SQL functions and selections, which should be
based internally on Unicode characters, independently of their actual
encoding in storage (as UTF-8, CESU-8, UTF-16BE/LE, UTF-32, GB18030, or any
other legacy charset).

So I do think that it is quite easy to implement UTF-8 and be fully
compliant with it for I/O in the query language and in its results, as well
as for storage, where it is certainly better than CESU-8.
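
To make the storage comparison concrete, here is a minimal Python sketch
that encodes one character outside the BMP both ways. Python has no
built-in CESU-8 codec, so the helper cesu8_encode below is a hypothetical
illustration written for this message, not part of any library or of MySQL.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: every UTF-16 code unit is encoded
    separately, so a surrogate pair becomes two 3-byte sequences."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP
utf8 = ch.encode("utf-8")
cesu8 = cesu8_encode(ch)

print(utf8.hex(), len(utf8))    # 4 bytes in UTF-8
print(cesu8.hex(), len(cesu8))  # 6 bytes in CESU-8
```

For text restricted to the BMP the two encodings produce identical bytes,
so the difference only shows up once supplementary characters are stored.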

The hard part is not in these interfaces (MySQL, for example, is unique in
that it supports several alternate storage formats for its tables), but in
the core engine itself, when it performs identity selection, sorting, range
selection, and substring extraction.
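
A small Python sketch of why the core engine is the hard part: the same two
characters sort differently depending on whether the comparison works on
code points, UTF-8 bytes, or UTF-16 code units, and their lengths differ
too. The two sample characters are arbitrary choices made for illustration.

```python
bmp = "\uff61"       # U+FF61, a BMP character above the surrogate range
supp = "\U00010400"  # U+10400, a supplementary character

# Python compares str values by code point.
by_code_point = sorted([bmp, supp])
# Binary comparison of UTF-8 bytes gives the same order as code points.
by_utf8 = sorted([bmp, supp], key=lambda s: s.encode("utf-8"))
# Binary comparison of UTF-16 code units puts the surrogate pair first,
# because 0xD801 < 0xFF61. CESU-8 bytes sort the same way as UTF-16.
by_utf16 = sorted([bmp, supp], key=lambda s: s.encode("utf-16-be"))

# Length and substring offsets also depend on the unit being counted.
len_code_points = len(supp)
len_utf16_units = len(supp.encode("utf-16-be")) // 2

print(by_code_point == by_utf8, by_utf8 == by_utf16)
print(len_code_points, len_utf16_units)
```

An engine that stores 16-bit units but promises character semantics has to
reconcile these differences in every comparison, range scan, and substring
operation.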

The other part of the problem is interoperability with MySQL clients. As
long as these clients are not prepared to receive character data outside
the BMP, they should connect with a CESU-8 encoding profile. If they are
prepared for it, they should use UTF-8 instead. But is the MySQL client
protocol compatible enough to support explicit tagging of the encoding used
for strings? That is the real question. This may require an update to the
protocol, and this may not be the first priority for MySQL, which wants
first to prepare its core engines and get them to connect to external data
sources or storage such as Oracle, Sybase, MS-SQL, UTF-8 text files, Access
MDB files, XML data files, and possibly more recent extensions of the
Berkeley DB table format that now support characters outside the BMP.

Tracking the required re-encoding between these components connected to the
core engine may be tricky to develop, unless the interfaces between these
components and the engine are prepared to support explicit labelling of the
charsets and encodings actually usable to interoperate, along with some
negotiation protocol in these interfaces (something like the "Accept-*"
headers in HTTP), notably if there are transcoding issues (which may affect
very serious database integrity constraints, notably for uniqueness and
existence, but also in triggers).


This archive was generated by hypermail 2.1.5 : Wed Nov 05 2003 - 09:21:36 EST