Re: UTF-16 inside UTF-8

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Tue Dec 02 2003 - 19:05:09 EST

    Doug Ewell wrote:

    > Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:
    >
    > Then, Frank, the Tcl implementation is *not valid UTF-8* and needs to be
    > fixed. Plain and simple. If a system like Tcl only supports the BMP,
    > that is its choice, but it *must not* accept non-shortest UTF-8 forms or
    > output CESU-8 disguised as UTF-8.
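
To make the distinction concrete, here is a minimal Python sketch (not Tcl's actual code). It assumes Python 3's strict "utf-8" codec as the reference for what counts as valid; `cesu8` and `is_valid_utf8` are hypothetical helper names, and U+10400 is just an arbitrary character outside the BMP.

```python
def cesu8(text: str) -> bytes:
    """Hypothetical helper: encode each UTF-16 code unit separately as UTF-8.

    For characters outside the BMP this yields two 3-byte sequences
    (one per surrogate), which is CESU-8 and not valid UTF-8.
    """
    units = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def is_valid_utf8(data: bytes) -> bool:
    """Use Python's strict decoder as the validity check."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ch = "\U00010400"                 # a supplementary-plane character
real_utf8 = ch.encode("utf-8")    # 4 bytes: F0 90 90 80
fake_utf8 = cesu8(ch)             # 6 bytes: ED A0 81 ED B0 80
overlong_slash = b"\xc0\xaf"      # non-shortest form of "/" (U+002F)

print(real_utf8.hex(), is_valid_utf8(real_utf8))
print(fake_utf8.hex(), is_valid_utf8(fake_utf8))
print(overlong_slash.hex(), is_valid_utf8(overlong_slash))
```

A strict decoder accepts only the 4-byte form; both the surrogate-pair bytes and the overlong sequence are rejected.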

    I agree with you. I just want to make the point that the implementation
    is not "< 1%" of the work.

    >
    > > If you still think adding 4-byte UTF-8 support is < 1% of the task,
    > > then please join the Tcl project and help me fix that. I appreciate
    > > your efforts there and I believe a lot of people will thank you for
    > > your contribution.
    >
    > I'll be happy to supply UTF-8 code that handles 4-byte sequences. That
    > is not the same thing as converting an entire system from 16-bit to
    > 32-bit integers, or adding proper UTF-16 surrogate support to a
    > UCS-2-only system. Of course that is more work.

    Your view is based on the assumption that the internal code is UCS-4
    instead of UTF-16.
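
The two pieces of work being contrasted here can be sketched in Python. This is an illustration only, under the assumption of a hand-written decoder; `decode_4byte` and `to_surrogates` are hypothetical names, not functions from Tcl or any library. With a UCS-4 internal form the first function is the whole job; with a UTF-16 internal form every result also has to go through the second one, and all downstream code has to cope with pairs.

```python
def decode_4byte(seq: bytes) -> int:
    """Decode one 4-byte UTF-8 sequence to a code point, rejecting bad input."""
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("not a 4-byte UTF-8 lead byte")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("bad continuation byte")
    cp = ((seq[0] & 0x07) << 18) | ((seq[1] & 0x3F) << 12) \
        | ((seq[2] & 0x3F) << 6) | (seq[3] & 0x3F)
    # Below U+10000 would be a non-shortest form; above U+10FFFF is not Unicode.
    if cp < 0x10000 or cp > 0x10FFFF:
        raise ValueError("overlong or out-of-range sequence")
    return cp


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


cp = decode_4byte(b"\xf0\x90\x90\x80")
high, low = to_surrogates(cp)
print(hex(cp), hex(high), hex(low))
```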

    >
    > Remember, AGAIN, that this thread was originally about taking an
    > application like MySQL that did not support Unicode at all, and adding
    > Unicode support to it, **BUT ONLY FOR THE 16-BIT BMP.** That is what I
    > can't imagine -- making BMP-only assumptions *today*, in 2003, knowing
    > that you'll have to go back and fix them some day. That is certainly
    > more work than adding support for the full Unicode range at once. I
    > think you thought I said the opposite, that such retrofitting is easy,
    > and are now trying hard to disprove it.

    There is nothing wrong with people choosing UTF-16 instead of UCS-4 in
    the API, even in 2003. Do you agree?

    If people do use UTF-16 in the API, it is natural for people who care
    about the BMP but not about Planes 1-16 to work only on the BMP, right?
    I am not saying they do the right thing. I am saying they do the
    "natural" thing. Remember, the text describing surrogates in the Unicode
    4.0 standard is probably only 5-10 pages in total out of that 1462-page
    standard. For a developer who is not going to implement the other 1000
    pages correctly, it is natural to think "why do I need to get these 10
    pages right?"
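
A small Python sketch of what the "natural" BMP-only shortcut looks like on a UTF-16 API. The function names are hypothetical and the string is made-up sample data; the second counter assumes well-formed UTF-16 (no unpaired surrogates).

```python
import struct


def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units of text, as a UTF-16 API would expose them."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def count_bmp_only(units: list[int]) -> int:
    """The 'natural' shortcut: one 16-bit unit equals one character."""
    return len(units)


def count_surrogate_aware(units: list[int]) -> int:
    """Count code points by skipping the trailing half of each surrogate pair."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


units = utf16_units("a\U00010400b")   # 'a', one supplementary character, 'b'
print(count_bmp_only(units))          # 4: the pair is counted as two characters
print(count_surrogate_aware(units))   # 3
```

The two counts agree for every BMP-only string, which is why the shortcut goes unnoticed until text from Planes 1-16 shows up.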

    >
    > > double your memory cost and size from UTF-8. x4 the size for your
    > > ASCII data. Changing the implementation of an ASCII-compatible
    > > application to UTF-16 is already hard, since people who only care
    > > about ASCII will be upset that the data size is x2 for all "their"
    > > data. It is already a hard battle most of the time for someone like
    > > me. If we tell them to change to UCS-4, that means they need not only
    > > x2 the memory but x4 the memory.
    >
    > I can't fight this battle with people who would rather stay with ASCII
    > and 7/8 bits per character. They are not living in a Unicode world.

    But what about the UTF-16 vs. UCS-4 battle?

    >
    > 1024 × 768 screen resolution takes 150% more display memory than 640 ×
    > 480, too.
    >
    > > For web services or applications which spend multiple millions on
    > > memory and databases, it means adding millions of dollars to their
    > > cost. They may have to add some millions in cost to support
    > > international customers by using UTF-16. They probably are not willing
    > > to add multiple millions of dollars more to change to UCS-4. In fact,
    > > some people have proposed storing UTF-8 in a hacky way in the database
    > > instead of using UTF-16 or UCS-4, to save cost. They have to restrict
    > > how the API is used and build an upper-level API to do the conversion
    > > and the hacky operations. That means it introduces some fixed
    > > development cost to the project (not dependent on the size of the
    > > data), but it saves millions of dollars of memory cost, which does
    > > depend on the size of the data. I don't like that approach, but usually
    > > my word and what is "right" are less important than multiple millions
    > > of dollars to a commercial company.
    >
    > I would truly be surprised if full 17-plane Unicode support in a single
    > app could be demonstrated to be a matter of "multiple millions of
    > dollars."

    It is not the full 17-plane Unicode support that will contribute to it.
    It is

    (Number of ASCII-only records X sizeof(record in UCS-4)) - (Number of
    ASCII-only records X sizeof(record in ASCII))

    that contributes to it, compared to

    (Number of ASCII-only records X sizeof(record in UTF-8)) - (Number of
    ASCII-only records X sizeof(record in ASCII))

    or

    (Number of ASCII-only records X sizeof(record in UTF-16)) - (Number
    of ASCII-only records X sizeof(record in ASCII))

    The other comparison is

    (Number of BMP-only records X sizeof(record in UCS-4)) - (Number of
    BMP-only records X sizeof(record in UTF-8))

    and

    (Number of BMP-only records X sizeof(record in UCS-4)) - (Number of
    BMP-only records X sizeof(record in UTF-16))

    Of course, sizeof() here really means the "average size of a record
    with that kind of data".
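
The arithmetic above can be tried out with a rough Python sketch. Everything concrete in it is an assumption for illustration: the two sample records, the record count of one million, and the helper names `encoded_sizes` and `extra_bytes`. Real numbers would come from the average record size in the actual database.

```python
def encoded_sizes(record: str) -> dict[str, int]:
    """Size in bytes of one record in each encoding form (no BOM)."""
    return {
        "UTF-8": len(record.encode("utf-8")),
        "UTF-16": len(record.encode("utf-16-le")),
        "UCS-4": len(record.encode("utf-32-le")),
    }


def extra_bytes(n_records: int, new_size: int, old_size: int) -> int:
    """(N x sizeof(record in new form)) - (N x sizeof(record in old form))."""
    return n_records * new_size - n_records * old_size


ascii_record = "Frank Tang"   # made-up ASCII-only record, 10 characters
bmp_record = "統一碼"           # made-up BMP-only record, 3 CJK characters

ascii_sizes = encoded_sizes(ascii_record)
bmp_sizes = encoded_sizes(bmp_record)
ascii_size = len(ascii_record.encode("ascii"))

n = 1_000_000  # assumed record count
for form, size in ascii_sizes.items():
    print("ASCII-only, ASCII ->", form, extra_bytes(n, size, ascii_size))
for form in ("UTF-8", "UTF-16"):
    print("BMP-only,", form, "-> UCS-4",
          extra_bytes(n, bmp_sizes["UCS-4"], bmp_sizes[form]))
```

For the ASCII-only sample, UTF-8 costs nothing extra, UTF-16 doubles the record and UCS-4 quadruples it. For the CJK sample, UTF-16 is the smallest of the three forms.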

    >
    > -Doug Ewell
    > Fullerton, California
    > http://users.adelphia.net/~dewell/
    >

    -- 
    --
    Frank Yung-Fong Tang
    Šýštém Årçhîtéçt, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
    AIM:yungfongta   mailto:ytang0648@aol.com Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan
    

