Re: UTF-16 inside UTF-8

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Wed Nov 19 2003 - 12:48:13 EST

  • Next message: Addison Phillips [wM]: "RE: Proposed Successor to RFC 3066 (language tags)"

    Dear Doug:
    Thank you for your reply. What you said about how to do it is exactly
    how it should be done. The point of asking those questions was not to
    seek an answer. Instead, I just want to show from the answers that
    adding surrogate support is not trivial.

    You wrote earlier:
    "For UTF-8 in particular, I can't imagine why
    one would choose to implement the 1-, 2-, and 3-byte forms in one stage
    and add the 4-byte forms in a later stage."

    Can you imagine now? The tasks you listed below are additional tasks
    that people need to perform before they can add 4-byte UTF-8. They
    don't need that work if they only support 2- or 3-byte UTF-8. That does
    not imply they should not add 4-byte support. It only means that people
    who want to add the support need to plan extra tasks and time for it.
    All of the following tasks cause the support to come "later". The
    "later" could be 1 day, it could be 1 week, it could be one milestone
    (from alpha 1 to alpha 2). But the fact that developers do need to
    spend effort on those tasks makes it "late".
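    To make the "extra task" concrete, here is a minimal decoder sketch
    (plain Python, not any project's actual code) feeding a 16-bit internal
    string type. The 4-byte branch and the surrogate-pair emission at the
    end are exactly the parts that a 1-/2-/3-byte implementation never had
    to write:

```python
def utf8_to_utf16_units(data: bytes) -> list[int]:
    """Decode UTF-8 into a list of UTF-16 code units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                       # 1 byte: U+0000..U+007F
            cp, n = b, 1
        elif b >> 5 == 0b110:              # 2 bytes: up to U+07FF
            cp = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
            n = 2
        elif b >> 4 == 0b1110:             # 3 bytes: up to U+FFFF
            cp = ((b & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6) \
                 | (data[i + 2] & 0x3F)
            n = 3
        elif b >> 3 == 0b11110:            # 4 bytes: the "later" stage
            cp = ((b & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12) \
                 | ((data[i + 2] & 0x3F) << 6) | (data[i + 3] & 0x3F)
            n = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if cp > 0xFFFF:
            # A 16-bit internal type forces a surrogate pair here --
            # more new code the 3-byte-only decoder never needed.
            cp -= 0x10000
            units += [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
        else:
            units.append(cp)
        i += n
    return units

# U+10400 (4-byte UTF-8) becomes the surrogate pair D801 DC00:
assert utf8_to_utf16_units(b"\xf0\x90\x90\x80") == [0xD801, 0xDC00]
```

    (Note that this sketch skips shortest-form and continuation-byte
    validation, which a real decoder also needs; that is yet more of the
    "extra task".)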

    One real example I found recently is Tcl. Tcl has had so-called UTF-8
    support since 8.1. But if you look at the implementation of Tcl 8.4.4
    (from http://www.tcl.tk ) you will find that the UTF-8 implementation:
    a. does not align with the Unicode 3.2/4.0 or RFC 3629 definition, and
    accepts non-shortest forms
    b. by default does not accept 4-byte UTF-8.
    c. accepts 4-, 5-, and 6-byte UTF-8 only if a certain compile-time flag
    is turned on: TCL_UTF_MAX (default 3; can be set to 4, 5, or 6)
    d. has no documentation mentioning surrogates.
    e. uses unsigned int for Tcl_UniChar if TCL_UTF_MAX is 4 to 6, and
    unsigned short if TCL_UTF_MAX is 3 (which looks like a very, very, very
    bad decision).
    f. gives you no way to use UTF-16 internally and still accept 4-byte
    UTF-8. You can either accept up to 3 bytes in UTF-8 and use UTF-16
    internally, or support up to 6 bytes (which is wrong; it should stop at
    4) with UTF-32 (not really) support internally.
    g. actually outputs CESU-8, not UTF-8, whenever the UTF-16 string
    (TCL_UTF_MAX = 3 or undefined, the default) contains a surrogate pair.
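    Point (g) can be demonstrated with a short sketch (plain Python, not
    Tcl's code): if you encode each UTF-16 code unit, surrogates included,
    as its own UTF-8-style sequence, you get the 6-byte CESU-8 form instead
    of the correct 4-byte UTF-8 form:

```python
def cesu8(s: str) -> bytes:
    """Encode by UTF-16 code unit: each surrogate gets its own 3-byte
    sequence. The result is CESU-8, which is NOT valid UTF-8."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp > 0xFFFF:  # split into a UTF-16 surrogate pair
            v = cp - 0x10000
            units = [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
        else:
            units = [cp]
        for u in units:  # encode each 16-bit unit as if it were a scalar
            if u < 0x80:
                out.append(u)
            elif u < 0x800:
                out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
            else:
                out += bytes([0xE0 | (u >> 12),
                              0x80 | ((u >> 6) & 0x3F),
                              0x80 | (u & 0x3F)])
    return bytes(out)

s = "\U00010400"                                # DESERET CAPITAL LETTER LONG I
assert s.encode("utf-8") == b"\xf0\x90\x90\x80"  # 4 bytes, correct UTF-8
assert cesu8(s) == b"\xed\xa0\x81\xed\xb0\x80"   # 6 bytes, CESU-8
```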

    If you still think adding 4-byte UTF-8 support is < 1% of the task,
    then please join the Tcl project and help me fix that. I would
    appreciate your efforts there, and I believe a lot of people will thank
    you for your contribution.

    Doug Ewell wrote:

    > Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:
    >
    > What you do is, you go through the exact same process that API vendors
    > have had to go through since the beginning of multibyte character sets.
    > That is, you decide whether your API returns code units or characters,
    > you publicize that decision, and you stick to it. If the decision means
    > you have a function that isn't terribly useful, you have to define a new
    > function that does the right thing, and leave the old function on the
    > mountain to die.
    >
    > To cite a non-Unicode example, in ECMAScript (née JavaScript) there is a
    > function Date.getYear() that was intended to return the last two digits
    > of the year but actually returned the year minus 1900. Of course,
    > starting in 2000 the function returned a value that was useful to
    > practically nobody. Did Sun or ECMA change the definition of
    > Date.getYear()? No, they introduced a new function, Date.getFullYear(),
    > which does what users really want.
    >
    > Same thing here: you can't change the 16-bit UniChar, so you'll have to
    > declare that your old functions that return a UniChar are defined as
    > returning UTF-16 code units, and you'll probably want to define a new
    > UniChar32 type and functions like:
    >
    > UniChar32 ToLower(UniChar32 aChar)
    >
    > that do the obvious right thing.
    >
    > And I'm sorry, I know some people will cringe when I say this, but if
    > you're like me and get to define your own "UniChar" data type, you've
    > been making it 32 bits wide since about 1997.
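    Doug's UniChar32-style function can be sketched in Python, where a code
    point is just an int (the to_lower_uc32 name below is hypothetical, not
    any real API):

```python
def to_lower_uc32(cp: int) -> int:
    """Lowercase a whole Unicode code point, not a UTF-16 code unit."""
    lowered = chr(cp).lower()
    # Some case mappings expand to multiple characters (e.g. U+0130);
    # a single-code-point API has to leave those unchanged.
    return ord(lowered) if len(lowered) == 1 else cp

# Works outside the BMP, where a 16-bit UniChar cannot:
# U+10400 DESERET CAPITAL LETTER LONG I -> U+10428
assert to_lower_uc32(0x10400) == 0x10428
assert to_lower_uc32(ord("A")) == ord("a")
```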

    Making UniChar 32 bits doubles your memory cost relative to UTF-16, and
    quadruples the size of your ASCII data relative to UTF-8. Changing an
    ASCII-compatible application to UTF-16 is already hard, because people
    who only care about ASCII are upset that all "their" data doubles in
    size. It is already a hard battle most of the time for someone like me.
    If we tell them to change to UCS-4, that means not x2 but x4 the
    memory. For web services or applications that spend multiple millions
    of dollars on memory and databases, it means adding millions of dollars
    to their cost. They may be willing to add some millions of cost to
    support international customers by using UTF-16. They probably are not
    willing to add multiple millions of dollars of cost to change to UCS-4.
    In fact, some people have proposed storing UTF-8 in a hacky way in the
    database instead of using UTF-16 or UCS-4, to save cost. They have to
    add restrictions on how the API is used and build an upper-level API to
    do the conversion and the hacky operations. That means the project
    incurs some fixed development cost (independent of the size of the
    data), but it saves millions of dollars of memory cost, which does
    depend on the size of the data. I don't like that approach, but usually
    my word, and what is "right", matter less to a commercial company than
    multiple millions of dollars.
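    The size arithmetic behind that battle is easy to verify (a small
    Python sketch; the payload is just an illustrative ASCII string):

```python
# Relative storage cost of the same ASCII-heavy data in three encodings.
ascii_text = "GET /index.html HTTP/1.1\r\n" * 1000

utf8 = len(ascii_text.encode("utf-8"))       # 1 byte per ASCII char
utf16 = len(ascii_text.encode("utf-16-be"))  # 2 bytes per BMP char
ucs4 = len(ascii_text.encode("utf-32-be"))   # 4 bytes per char, always

assert utf16 == 2 * utf8   # the "x2" objection to UTF-16
assert ucs4 == 4 * utf8    # the "x4" objection to UCS-4
```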

    >
    > <many similar examples snipped />
    >
    > > 4. Memory and Performance trade off.
    > >
    > > You probably can get a sense of the difficulty if you look at how many
    > > specification changes MS needed to make to add surrogate support to
    > > OpenType fonts. That is just the specification change, not including
    > > code changes or API changes.
    > >
    > > 'cmap' http://www.microsoft.com/typography/otspec/cmap.htm
    >
    > Yes, as stated before, if your previous idea of a UniChar was 16 bits,
    > you have some work to do. I thought we were talking about apps like
    > MySQL to which Unicode support was being added for the first time.
    > But
    >
    > > It is easy to add surrogate support to your application if your
    > > application does nothing.
    >
    > Oooooh!

    -- 
    Frank Yung-Fong Tang
    Šýštém Årçhîtéçt, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
    AIM:yungfongta   mailto:ytang0648@aol.com Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan
    John 3:16 "For God so loved the world that he gave his one and only Son,
    that whoever believes in him shall not perish but have eternal life."
    Does your software display Thai language text correctly for Thailand users?
    -> Basic Concept of Thai Language linked from Frank Tang's
    Iñtërnâtiônàlizætiøn Secrets
    Want to translate your English text to something Thailand users can
    understand ?
    -> Try English-to-Thai machine translation at
    http://c3po.links.nectec.or.th/parsit/
    


    This archive was generated by hypermail 2.1.5 : Wed Nov 19 2003 - 13:50:46 EST