Re: UTF-16 inside UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 05 2003 - 08:35:48 EST

    From: "Peter Kirk" <peterkirk@qaya.org>

    > Agreed. But to be fair to MySQL, they do mention as a potential problem
    > that three bytes have to be allocated in strings for each UTF-8
    > character. For full UTF-8 support they would need four bytes per
    > character which would, from their perspective, be a greater problem.
    > Also I suspect that Unicode data is actually being stored in 16-bit
    > entities, and that the major issue is the extra complication of handling
    > surrogate pairs within that representation (rather than the trivial one
    > of converting such pairs to and from valid UTF-8).

    Modern database engines now offer multiple encoding strategies for storing
    characters. In SQL engines, the key issue is performance (notably in terms
    of storage I/O or network I/O), but this is completely orthogonal to the
    logical correctness of SQL functions and selections, which should be based
    internally on Unicode characters, independently of their actual encoding in
    storage (UTF-8, CESU-8, UTF-16BE/LE, UTF-32, GB18030, or any legacy
    charset).

    So I do think that it is quite easy to implement UTF-8 and be fully
    compliant with it for I/O in the query language and in its results, as well
    as for storage, where it is certainly better than CESU-8.
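
    To make the difference concrete, here is a minimal Python sketch comparing
    the two encodings for one character outside the BMP. The helper
    cesu8_encode is a hypothetical name written for this illustration (Python
    has no built-in CESU-8 codec), and U+10400 is just an arbitrary
    supplementary character.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative helper, not a library function).

    Supplementary characters are first split into a UTF-16 surrogate pair,
    and each surrogate is then written as its own 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)    # high (leading) surrogate
            low = 0xDC00 + (cp & 0x3FF)   # low (trailing) surrogate
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


sample = "\U00010400"  # one character outside the BMP
print(sample.encode("utf-8").hex(" "))   # f0 90 90 80        (4 bytes)
print(cesu8_encode(sample).hex(" "))     # ed a0 81 ed b0 80  (6 bytes)
```

    For BMP-only text the two forms are byte-for-byte identical; they diverge
    only for supplementary characters, where CESU-8 needs six bytes instead of
    four.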

    The hard part is not in these interfaces (MySQL, for example, is unique in
    that it supports several alternative storage formats for its tables), but
    in the core engine itself when it performs identity selection, sorting,
    range selection, and substring extraction.
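
    One way to see why the core engine is the hard part: comparing encoded
    bytes does not give the same order under every encoding, and lengths
    counted in code units differ from lengths counted in characters. The
    following Python sketch uses two arbitrarily chosen characters; it only
    illustrates the effect and says nothing about how MySQL actually
    compares strings.

```python
bmp = "\uE000"        # a BMP character above the surrogate range
supp = "\U00010000"   # the first character outside the BMP

# Python compares str values by code point.
by_code_point = sorted([supp, bmp])

# A binary comparison of the UTF-8 bytes preserves code point order.
by_utf8 = sorted([supp, bmp], key=lambda s: s.encode("utf-8"))

# A binary comparison of UTF-16 code units does not: the surrogate pair
# D800 DC00 sorts before the single unit E000.
by_utf16 = sorted([supp, bmp], key=lambda s: s.encode("utf-16-be"))

print(by_code_point == [bmp, supp])  # True
print(by_utf8 == [bmp, supp])        # True
print(by_utf16 == [supp, bmp])       # True

# Substring and length functions face the same issue: one character,
# but two 16-bit code units.
utf16_units = len(supp.encode("utf-16-be")) // 2
print(len(supp), utf16_units)        # 1 2
```

    An engine that stores 16-bit units and compares or slices them directly
    would therefore sort, select ranges, and cut substrings differently from
    one that works on whole Unicode characters.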

    The other part of the problem is interoperability with MySQL clients. As
    long as these clients are not prepared to receive character data outside
    the BMP, they should connect with a CESU-8 encoding profile. If they are
    prepared for it, they had better use UTF-8. But is the MySQL client
    protocol flexible enough to support explicit tagging of the encoding used
    for strings? That is the real question. This may require an update to the
    protocol, and it may not be the first priority for MySQL, which wants
    first to prepare its core engines and get them to connect to external data
    sources or storage such as Oracle, Sybase, MS-SQL, UTF-8 text files, Access
    MDB files, XML data files, and possibly more recent extensions of the
    Berkeley DB table format that now support characters outside the BMP.
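
    As a rough illustration of what explicit tagging could look like, here is
    a Python sketch in which every string travels with a label naming its
    encoding. TaggedString, cesu8_decode and decode_tagged are hypothetical
    names invented for this example; nothing here describes the real MySQL
    client protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    """A string payload plus an explicit encoding label (hypothetical)."""
    encoding: str
    payload: bytes


def cesu8_decode(data: bytes) -> str:
    # Step 1: read each 3-byte surrogate sequence as a lone surrogate.
    halves = data.decode("utf-8", "surrogatepass")
    # Step 2: a round trip through UTF-16 joins surrogate pairs into
    # single characters. This lenient sketch does not reject 4-byte
    # UTF-8 sequences, which strict CESU-8 forbids.
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def decode_tagged(value: TaggedString) -> str:
    label = value.encoding.lower()
    if label == "cesu-8":
        return cesu8_decode(value.payload)
    if label == "utf-8":
        return value.payload.decode("utf-8")  # strict: rejects surrogates
    raise ValueError(f"unsupported encoding label: {value.encoding}")


from_old_client = TaggedString("CESU-8", bytes.fromhex("eda081edb080"))
from_new_client = TaggedString("UTF-8", bytes.fromhex("f0909080"))
print(decode_tagged(from_old_client) == decode_tagged(from_new_client))  # True
```

    Without the label, the receiver would have to guess: a strict UTF-8
    decoder rejects the CESU-8 bytes above, so the same character could not
    be exchanged reliably.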

    Tracking the required re-encoding between the components connected to the
    core engine may be tricky to develop, unless the interfaces between these
    components and the engine are prepared to support explicit labelling of the
    charsets and encodings that can actually be used to interoperate, together
    with some negotiation protocol in these interfaces (something like the
    "Accept-*" headers in HTTP), notably if there are transcoding issues (which
    may affect very important database integrity constraints, notably
    uniqueness and existence constraints, but also triggers).
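
    A negotiation step of this kind could be as simple as the following Python
    sketch, loosely modelled on how "Accept-Charset" works in HTTP. The
    function negotiate_charset and the charset lists are assumptions made up
    for this illustration, not part of any existing interface.

```python
from typing import Iterable, Optional


def negotiate_charset(client_accepts: Iterable[str],
                      server_supports: Iterable[str]) -> Optional[str]:
    """Return the first charset in the client's preference order that the
    server also supports, or None if they have nothing in common."""
    supported = {name.lower() for name in server_supports}
    for name in client_accepts:
        if name.lower() in supported:
            return name.lower()
    return None


server = ["utf-8", "cesu-8", "latin1"]

# A client that handles characters outside the BMP prefers UTF-8.
print(negotiate_charset(["UTF-8", "CESU-8"], server))  # utf-8

# A BMP-only client offers CESU-8 alone.
print(negotiate_charset(["CESU-8"], server))           # cesu-8

# No common charset: the connection should be refused or fall back
# explicitly, rather than transcoding silently.
print(negotiate_charset(["UTF-32"], server))           # None
```

    Returning None instead of guessing matters here, because a silent lossy
    transcoding is exactly what could break uniqueness constraints.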


