Re: UTF-16 inside UTF-8

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Tue Dec 02 2003 - 19:05:09 EST

Next message: John Jenkins: "Re: MS Windows and Unicode 4.0 ?"

Previous message: Michael Everson: "Re: MS Windows and Unicode 4.0 ?"
Next in thread: Philippe Verdy: "RE: UTF-16 inside UTF-8"
Reply: Philippe Verdy: "RE: UTF-16 inside UTF-8"
Reply: Doug Ewell: "Re: UTF-16 inside UTF-8"

Doug Ewell wrote:

> Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:
>
> Then, Frank, the Tcl implementation is *not valid UTF-8* and needs to be
> fixed. Plain and simple. If a system like Tcl only supports the BMP,
> that is its choice, but it *must not* accept non-shortest UTF-8 forms or
> output CESU-8 disguised as UTF-8.
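To make the two things Doug rules out concrete, here is a rough Python sketch. It is only an illustration and has nothing to do with the actual Tcl source. The helper names `cesu8_encode` and `is_valid_utf8` are made up for this example, and U+10400 is just a convenient sample character from plane 1. It shows the bytes a CESU-8 encoder produces for a supplementary character next to the real 4-byte UTF-8 form, and it shows that a strict decoder (here Python's built-in one) refuses both the CESU-8 bytes and a non-shortest form.

```python
def cesu8_encode(cp):
    """Encode one supplementary code point the CESU-8 way (hypothetical helper).

    The code point is first split into a UTF-16 surrogate pair, and each
    surrogate is then written as its own 3-byte sequence (6 bytes in total).
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point in planes 1-16")
    v = cp - 0x10000
    units = (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF))
    out = bytearray()
    for u in units:
        out += bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])
    return bytes(out)


def is_valid_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


utf8 = chr(0x10400).encode("utf-8")   # real UTF-8: one 4-byte sequence
cesu8 = cesu8_encode(0x10400)         # CESU-8: two 3-byte sequences
overlong_slash = b"\xc0\xaf"          # non-shortest form of "/" (U+002F)

print(utf8.hex(), is_valid_utf8(utf8))
print(cesu8.hex(), is_valid_utf8(cesu8))
print(overlong_slash.hex(), is_valid_utf8(overlong_slash))
```

The point of the sketch is only that the two byte strings differ and that a conforming decoder treats the second and third as errors.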

I agree with you. I just want to make the point that the implementation
is not "< 1%" of the work.

>
> > If you still think adding 4-byte UTF-8 support is < 1% of the task,
> > then please join the Tcl project and help me fix that. I appreciate
> > your efforts there and I believe a lot of people will thank you for
> > your contribution.
>
> I'll be happy to supply UTF-8 code that handles 4-byte sequences. That
> is not the same thing as converting an entire system from 16-bit to
> 32-bit integers, or adding proper UTF-16 surrogate support to a
> UCS-2-only system. Of course that is more work.

Your view is based on the assumption that the internal code is UCS4
instead of UTF-16.

>
> Remember, AGAIN, that this thread was originally about taking an
> application like MySQL that did not support Unicode at all, and adding
> Unicode support to it, **BUT ONLY FOR THE 16-BIT BMP.** That is what I
> can't imagine -- making BMP-only assumptions *today*, in 2003, knowing
> that you'll have to go back and fix them some day. That is certainly
> more work than adding support for the full Unicode range at once. I
> think you thought I said the opposite, that such retrofitting is easy,
> and are now trying hard to disprove it.

There is nothing wrong with people choosing to use UTF-16 instead of UCS4
in the API, even in 2003. Do you agree?

If people do use UTF-16 in the API, it is natural for people who care
about the BMP but not about planes 1-16 to work only on the BMP, right?
I am not saying they do the right thing. I am saying they do the
"natural" thing. Remember, the text describing surrogates in the Unicode
4.0 standard is probably only 5-10 pages in total out of that 1462-page
standard. For a developer who is not going to implement the other 1000
pages correctly, it is natural to think "why do I need to get these 10
pages right?"
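For reference, the surrogate arithmetic itself is small. The following is a minimal Python sketch, not code from any real library. The function names `to_surrogates` and `from_surrogates` are my own, and U+10400 is just a sample. It also shows the typical BMP-only mistake: counting UTF-16 code units and calling the result "characters".

```python
def to_surrogates(cp):
    """Split a code point in planes 1-16 into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only planes 1-16 need surrogates")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10400)
print(hex(high), hex(low))

# A BMP-only view of the same text counts code units, not characters.
text = "A" + chr(0x10400)
code_units = len(text.encode("utf-16-le")) // 2
print(len(text), code_units)
```

The arithmetic is the easy part. The work is in finding every place in an existing code base that assumes one 16-bit unit is one character.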

>
> > double your memory cost and size from UTF-8. x4 of the size for your
> > ASCII data. Changing the implementation of an ASCII-compatible
> > application to UTF-16 is already hard, since people who only care
> > about ASCII will be upset that the data size is x2 for all "their"
> > data. It is already a hard battle most of the time for someone like
> > me. If we tell them to change to UCS-4, that means they need not only
> > x2 the memory but x4 the memory.
>
> I can't fight this battle with people who would rather stay with ASCII
> and 7/8 bits per character. They are not living in a Unicode world.

But how about the UTF-16 vs UCS4 battle?

>
> 1024 × 768 screen resolution takes 150% more display memory than 640 ×
> 480, too.
>
> > For web services or applications which spend multiple millions on
> > memory and databases, it means adding millions of dollars to their
> > cost. They may have to add some millions in cost to support
> > international customers by using UTF-16. They probably are willing to
> > add multiple millions of dollars of cost to change it to use UCS4. In
> > fact, some people have proposed storing UTF-8 in a hacky way in the
> > database instead of using UTF-16 or UCS4, to save cost. They have to
> > restrict use of the API and build an upper-level API to do the
> > conversion and hacky operations. That means it will introduce some
> > fixed (not dependent on the size of the data) development cost to the
> > project, but it will save millions of dollars of memory cost, which
> > depends on the size of the data. I don't like that approach, but
> > usually my word and what is "right" are less important than multiple
> > millions of dollars to a commercial company.
>
> I would truly be surprised if full 17-plane Unicode support in a single
> app could be demonstrated to be a matter of "multiple millions of
> dollars."

It is not the full 17-plane Unicode support that contributes to it.
What contributes to it is the growth of the existing ASCII-only data:

(Number of ASCII-only records X sizeof(record in UCS4)) - (Number of
ASCII-only records X sizeof(record in ASCII))

compared to

(Number of ASCII-only records X sizeof(record in UTF-8)) - (Number of
ASCII-only records X sizeof(record in ASCII))

or

(Number of ASCII-only records X sizeof(record in UTF-16)) - (Number of
ASCII-only records X sizeof(record in ASCII))

The other comparisons are, for BMP-only data,

(Number of BMP-only records X sizeof(record in UCS4)) - (Number of
BMP-only records X sizeof(record in UTF-8))

and

(Number of BMP-only records X sizeof(record in UCS4)) - (Number of
BMP-only records X sizeof(record in UTF-16))

Of course, sizeof() here really means the "average size of a record
holding that kind of data".
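The comparison can be tried on a toy data set. This is only a sketch in Python with made-up sample records; the function names `total_size` and `extra_bytes` are hypothetical, and a real estimate would have to use the actual record counts and average record sizes. Summing the encoded sizes over all records is the same as "number of records X average size" in the expressions above. I use "utf-32-le" as a stand-in for UCS4 and the "-le" variants so that no BOM is counted.

```python
def total_size(records, encoding):
    """Total bytes needed to store all records in the given encoding."""
    return sum(len(r.encode(encoding)) for r in records)


def extra_bytes(records, encoding, baseline):
    """Extra bytes paid for storing records in `encoding` instead of `baseline`."""
    return total_size(records, encoding) - total_size(records, baseline)


# Made-up sample data: 15 ASCII characters and 5 BMP (CJK) characters.
ascii_records = ["hello", "unicode", "tcl"]
bmp_records = ["\u4e2d\u6587", "\u65e5\u672c\u8a9e"]

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print("ASCII-only data,", enc, "vs ASCII:",
          extra_bytes(ascii_records, enc, "ascii"))

for base in ("utf-8", "utf-16-le"):
    print("BMP-only data, utf-32-le vs", base + ":",
          extra_bytes(bmp_records, "utf-32-le", base))
```

For ASCII-only data the extra cost is zero in UTF-8, equal to the original size in UTF-16, and three times the original size in UCS4, which is where the x2 and x4 figures come from.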

>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
>

-- 
--
Frank Yung-Fong Tang
Šýštém Årçhîtéçt, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
AIM:yungfongta   mailto:ytang0648@aol.com Tel:650-937-2913
Yahoo! Msg: frankyungfongtan


This archive was generated by hypermail 2.1.5 : Tue Dec 02 2003 - 19:44:56 EST