Re: UTF-16 inside UTF-8

From: Peter Kirk
Date: Wed Nov 05 2003 - 06:38:24 EST

    On 04/11/2003 21:49, Doug Ewell wrote:

    >Peter Kirk <peterkirk at qaya dot org> wrote:
    >>>... (a very old, legacy application, unaware of the existence of
    >>>codepoints above U+FFFF) ...
    >>Such applications are not "very old", they are still being written.
    >>For example (see,
    >>MySQL 4.1 adds UCS-2 and UTF-8 support to previous versions but for
    >>single two-byte codes in UCS-2 and up to three bytes per UTF-8
    >>character only :-( - and this is still in alpha!
    >At the risk of upsetting the open-source faithful, that is just plain
    >lazy. Anyone who can master the wizardly details of building a powerful
    >(and commercially successful) database program can figure out how to
    >slap two surrogates together without destroying performance.
    >Constraining UTF-8 to the BMP is even less defensible, since there is no
    >performance penalty in allowing four-byte UTF-8 sequences.
    >-Doug Ewell
    > Fullerton, California
    Agreed. But to be fair to MySQL, they do mention as a potential problem
    that three bytes have to be allocated in strings for each UTF-8
    character. For full UTF-8 support they would need four bytes per
    character, which from their perspective would be a greater problem.
    I also suspect that Unicode data is actually being stored in 16-bit
    code units, and that the major issue is the extra complication of
    handling surrogate pairs within that representation (rather than the
    trivial one of converting such pairs to and from valid UTF-8).
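
    To illustrate why the conversion itself is the trivial part, here is a
    minimal sketch in Python. It is not taken from MySQL or any other
    product; the function names (combine_surrogates, split_code_point,
    encode_utf8) are hypothetical and chosen only for this example. It
    assumes the standard UTF-16 and UTF-8 definitions: a surrogate pair
    maps to a code point in U+10000..U+10FFFF, which takes four bytes in
    UTF-8, while anything in the BMP takes at most three.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def encode_utf8(cp: int) -> bytes:
    """Encode one code point as UTF-8 (one to four bytes)."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Lone surrogates are not valid scalar values in UTF-8.
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code points cannot be encoded")
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    if cp <= 0x10FFFF:
        return bytes([
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("code point above U+10FFFF")


# Example: U+10348 is stored in UTF-16 as the pair D800 DF48.
cp = combine_surrogates(0xD800, 0xDF48)
utf8 = encode_utf8(cp)  # four bytes, versus at most three for the BMP
```

    The harder work in a 16-bit implementation is elsewhere: every routine
    that counts, indexes, truncates or compares characters has to treat the
    two code units of a pair as one unit, which this sketch does not
    attempt to show.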

    Peter Kirk
