Re: UTF-16 inside UTF-8

From: Doug Ewell (dewell@adelphia.net)
Date: Fri Nov 21 2003 - 03:21:40 EST


    Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:

    > One real example I found recently is Tcl. Tcl has had so-called UTF-8
    > support since 8.1. But if you look at the implementation in Tcl 8.4.4
    > (from http://www.tcl.tk ) you will find that the UTF-8 implementation:
    > a. does not align with the Unicode 3.2/4.0 or RFC 3629 definition and
    > accepts non-shortest forms.
    > b. by default does not accept 4-byte UTF-8.
    > c. accepts 4-, 5-, and 6-byte UTF-8 only if a certain compiler flag is
    > turned on: TCL_MAX_UTF (default 3, can be set to 4, 5, or 6).
    > d. has no documentation that mentions surrogates.
    > e. uses unsigned int for Tcl_UniChar if TCL_MAX_UTF is 4 to 6, and
    > unsigned short if TCL_MAX_UTF is 3 (which looks like a very, very, very
    > bad decision).
    > f. offers no way to use UTF-16 internally and still accept 4-byte UTF-8.
    > You can either use up to 3 bytes in UTF-8 with UTF-16 internally,
    > or support up to 6 bytes (which is wrong; it should stop at 4) with
    > UTF-32 (not really) support internally.
    > g. really outputs CESU-8, not UTF-8, when the UTF-16 text
    > (TCL_MAX_UTF = 3, or undefined as by default) contains a surrogate pair.

    Then, Frank, the Tcl implementation is *not valid UTF-8* and needs to be
    fixed. Plain and simple. If a system like Tcl only supports the BMP,
    that is its choice, but it *must not* accept non-shortest UTF-8 forms or
    output CESU-8 disguised as UTF-8.
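
    To make the two requirements above concrete, here is a minimal sketch
    of a strict decoder, written in Python for illustration. It is not Tcl's
    code, and the function name decode_utf8_strict is hypothetical. It
    rejects non-shortest forms, encoded surrogates (which is what CESU-8
    looks like to a UTF-8 decoder), and anything above U+10FFFF.

```python
def decode_utf8_strict(data):
    """Decode bytes into a list of code points, rejecting ill-formed UTF-8.

    Hypothetical helper for illustration only.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Pick sequence length and the smallest code point that may
        # legally use that length (anything smaller is non-shortest form).
        if b0 < 0x80:
            cp, length, minimum = b0, 1, 0x0
        elif 0xC2 <= b0 <= 0xDF:
            cp, length, minimum = b0 & 0x1F, 2, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, length, minimum = b0 & 0x0F, 3, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, length, minimum = b0 & 0x07, 4, 0x10000
        else:
            # 0x80-0xC1 and 0xF5-0xFF can never start a valid sequence;
            # this also rules out 5- and 6-byte forms.
            raise ValueError("invalid lead byte 0x%02X at offset %d" % (b0, i))
        if i + length > n:
            raise ValueError("truncated sequence at offset %d" % i)
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte at offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError("non-shortest form at offset %d" % i)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("encoded surrogate (CESU-8 style) at offset %d" % i)
        if cp > 0x10FFFF:
            raise ValueError("code point above U+10FFFF at offset %d" % i)
        out.append(cp)
        i += length
    return out
```

    With this sketch, the 4-byte form F0 90 80 80 decodes to U+10000, while
    the overlong C0 80 and the 6-byte CESU-8 form ED A0 80 ED B0 80 both
    raise ValueError.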

    > If you still think adding 4-byte UTF-8 support is < 1% of the task,
    > then please join the Tcl project and help me fix that. I would
    > appreciate your efforts there, and I believe a lot of people would
    > thank you for your contribution.

    I'll be happy to supply UTF-8 code that handles 4-byte sequences. That
    is not the same thing as converting an entire system from 16-bit to
    32-bit integers, or adding proper UTF-16 surrogate support to a
    UCS-2-only system. Of course that is more work.
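
    As a rough illustration of the surrogate arithmetic involved, here is
    a small Python sketch (again not Tcl code; the names to_surrogates,
    from_surrogates, and utf16_units_to_utf8 are hypothetical). It shows a
    converter that combines each surrogate pair into one code point before
    encoding, which is the step a CESU-8 producer skips.

```python
def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: U+%04X" % cp)
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_units_to_utf8(units):
    """Convert a sequence of 16-bit code units to well-formed UTF-8.

    Pairs are combined first, so a supplementary character comes out
    as one 4-byte sequence rather than two 3-byte ones.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            if i + 1 >= len(units):
                raise ValueError("unpaired high surrogate")
            cp = from_surrogates(u, units[i + 1])
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            cp = u
            i += 1
        out += chr(cp).encode("utf-8")
    return bytes(out)
```

    For example, U+1D11E splits into D834 DD1E, and converting those two
    code units yields the four bytes F0 9D 84 9E.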

    Remember, AGAIN, that this thread was originally about taking an
    application like MySQL that did not support Unicode at all, and adding
    Unicode support to it, **BUT ONLY FOR THE 16-BIT BMP.** That is what I
    can't imagine -- making BMP-only assumptions *today*, in 2003, knowing
    that you'll have to go back and fix them some day. That is certainly
    more work than adding support for the full Unicode range at once. I
    think you thought I said the opposite, that such retrofitting is easy,
    and are now trying hard to disprove it.

    > double your memory cost and size from UTF-8. x4 the size for your
    > ASCII data. Changing the implementation of an ASCII-compatible
    > application to UTF-16 is already hard, since people who only care
    > about ASCII will be upset that the data size doubles for all "their"
    > data. It is already a hard battle most of the time for someone like me.
    > If we tell them to change to UCS-4, that means they need not just x2
    > the memory but x4 the memory.

    I can't fight this battle with people who would rather stay with ASCII
    and 7/8 bits per character. They are not living in a Unicode world.

    1024 × 768 screen resolution takes over 150% more display memory than
    640 × 480, too.
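
    For reference, the x2 and x4 figures quoted above apply to pure ASCII
    text; the ratios are different for other scripts. A quick Python sketch
    (the helper name encoded_sizes is hypothetical) that measures the byte
    counts, using the little-endian codecs so that no BOM is counted:

```python
def encoded_sizes(text):
    """Return the encoded size of text in bytes for each encoding form."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


print(encoded_sizes("hello"))        # {'UTF-8': 5, 'UTF-16': 10, 'UTF-32': 20}
print(encoded_sizes("日本語"))        # {'UTF-8': 9, 'UTF-16': 6, 'UTF-32': 12}
print(encoded_sizes("\U0001D11E"))   # {'UTF-8': 4, 'UTF-16': 4, 'UTF-32': 4}
```

    So ASCII data does double in UTF-16 and quadruple in UTF-32, while BMP
    CJK text is actually smaller in UTF-16 than in UTF-8.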

    > For web services or applications that spend multiple millions on
    > memory and databases, it means adding millions of dollars to their
    > cost. They may have to add some millions in cost to support
    > international customers by using UTF-16. They are probably not willing
    > to add multiple millions of dollars in cost to change it to use UCS-4.
    > In fact, some people have proposed storing UTF-8 in a hacky way in the
    > database instead of using UTF-16 or UCS-4, to save cost. They have to
    > add restrictions on using the API and build an upper-level API to do
    > conversion and hacky operations. That means it will introduce some
    > fixed development cost (not dependent on the size of the data) to the
    > project, but it will save millions of dollars of memory cost, which
    > depends on the size of the data. I don't like that approach, but
    > usually my word and what is "right" are less important than multiple
    > millions of dollars to a commercial company.

    I would truly be surprised if full 17-plane Unicode support in a single
    app could be demonstrated to be a matter of "multiple millions of
    dollars."

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


