Re: UTF-16 inside UTF-8

From: Frank Yung-Fong Tang (ytang0648@aol.com)
Date: Wed Nov 19 2003 - 12:48:13 EST

  • Next message: Addison Phillips [wM]: "RE: Proposed Successor to RFC 3066 (language tags)"

    Dear Doug:
    Thank you for your reply. What you said about how to do it is exactly
    how it should be done. The point of asking those questions was not to
    seek an answer. Instead, I just want to show from the answers that
    adding surrogate support is not trivial.

    You wrote earlier:
    "For UTF-8 in particular, I can't imagine why
    one would choose to implement the 1-, 2-, and 3-byte forms in one stage
    and add the 4-byte forms in a later stage."

    Can you imagine now? The tasks you listed below are additional tasks
    that people need to perform before they can add 4-byte UTF-8. They
    don't need that work if they only support 2- or 3-byte UTF-8. That does
    not imply they should not add 4-byte support. It only means that people
    who want to add the support need to plan extra tasks and time for it.
    All of the following tasks cause the support to come "later". The
    "later" could be 1 day, it could be 1 week, it could be one milestone
    (from alpha 1 to alpha 2). But the fact that developers do need to
    spend effort on those tasks makes it "late".
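    To make the "extra task" concrete, here is a minimal decoder sketch
    (plain Python, not any project's actual code) feeding a 16-bit internal
    string type. The 4-byte branch and the surrogate-pair emission at the
    end are exactly the parts that a 1-/2-/3-byte implementation never had
    to write:

```python
def utf8_to_utf16_units(data: bytes) -> list[int]:
    """Decode UTF-8 into a list of UTF-16 code units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                       # 1 byte: U+0000..U+007F
            cp, n = b, 1
        elif b >> 5 == 0b110:              # 2 bytes: up to U+07FF
            cp = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
            n = 2
        elif b >> 4 == 0b1110:             # 3 bytes: up to U+FFFF
            cp = ((b & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6) \
                 | (data[i + 2] & 0x3F)
            n = 3
        elif b >> 3 == 0b11110:            # 4 bytes: the "later" stage
            cp = ((b & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12) \
                 | ((data[i + 2] & 0x3F) << 6) | (data[i + 3] & 0x3F)
            n = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if cp > 0xFFFF:
            # A 16-bit internal type forces a surrogate pair here --
            # more new code the 3-byte-only decoder never needed.
            cp -= 0x10000
            units += [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
        else:
            units.append(cp)
        i += n
    return units

# U+10400 (4-byte UTF-8) becomes the surrogate pair D801 DC00:
assert utf8_to_utf16_units(b"\xf0\x90\x90\x80") == [0xD801, 0xDC00]
```

    (Note that this sketch skips shortest-form and continuation-byte
    validation, which a real decoder also needs; that is yet more of the
    "extra task".)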

    One real example I found recently is Tcl. Tcl has had so-called UTF-8
    support since 8.1. But if you look at the implementation of Tcl 8.4.4
    (from http://www.tcl.tk ) you will find that the UTF-8 implementation:
    a. does not align with the Unicode 3.2/4.0 or RFC 3629 definition, and
    accepts non-shortest forms
    b. by default does not accept 4-byte UTF-8.
    c. accepts 4-, 5-, and 6-byte UTF-8 only if a certain compile-time flag
    is turned on: TCL_UTF_MAX (default 3; can be set to 4, 5, or 6)
    d. has no documentation mentioning surrogates.
    e. uses unsigned int for Tcl_UniChar if TCL_UTF_MAX is 4 to 6, and
    unsigned short if TCL_UTF_MAX is 3 (which looks like a very, very, very
    bad decision).
    f. gives you no way to use UTF-16 internally and still accept 4-byte
    UTF-8. You can either accept up to 3 bytes in UTF-8 and use UTF-16
    internally, or support up to 6 bytes (which is wrong; it should stop at
    4) with UTF-32 (not really) support internally.
    g. actually outputs CESU-8, not UTF-8, whenever the UTF-16 string
    (TCL_UTF_MAX = 3 or undefined, the default) contains a surrogate pair.
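    Point (g) can be demonstrated with a short sketch (plain Python, not
    Tcl's code): if you encode each UTF-16 code unit, surrogates included,
    as its own UTF-8-style sequence, you get the 6-byte CESU-8 form instead
    of the correct 4-byte UTF-8 form:

```python
def cesu8(s: str) -> bytes:
    """Encode by UTF-16 code unit: each surrogate gets its own 3-byte
    sequence. The result is CESU-8, which is NOT valid UTF-8."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp > 0xFFFF:  # split into a UTF-16 surrogate pair
            v = cp - 0x10000
            units = [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
        else:
            units = [cp]
        for u in units:  # encode each 16-bit unit as if it were a scalar
            if u < 0x80:
                out.append(u)
            elif u < 0x800:
                out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
            else:
                out += bytes([0xE0 | (u >> 12),
                              0x80 | ((u >> 6) & 0x3F),
                              0x80 | (u & 0x3F)])
    return bytes(out)

s = "\U00010400"                                # DESERET CAPITAL LETTER LONG I
assert s.encode("utf-8") == b"\xf0\x90\x90\x80"  # 4 bytes, correct UTF-8
assert cesu8(s) == b"\xed\xa0\x81\xed\xb0\x80"   # 6 bytes, CESU-8
```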

    If you still think adding 4-byte UTF-8 support is < 1% of the task,
    then please join the Tcl project and help me fix that. I would
    appreciate your efforts there, and I believe a lot of people will thank
    you for your contribution.

    Doug Ewell wrote:

    > Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:
    >
    > What you do is, you go through the exact same process that API vendors
    > have had to go through since the beginning of multibyte character sets.
    > That is, you decide whether your API returns code units or characters,
    > you publicize that decision, and you stick to it. If the decision means
    > you have a function that isn't terribly useful, you have to define a new
    > function that does the right thing, and leave the old function on the
    > mountain to die.
    >
    > To cite a non-Unicode example, in ECMAScript (née JavaScript) there is a
    > function Date.getYear() that was intended to return the last two digits
    > of the year but actually returned the year minus 1900. Of course,
    > starting in 2000 the function returned a value that was useful to
    > practically nobody. Did Sun or ECMA change the definition of
    > Date.getYear()? No, they introduced a new function, Date.getFullYear(),
    > which does what users really want.
    >
    > Same thing here: you can't change the 16-bit UniChar, so you'll have to
    > declare that your old functions that return a UniChar are defined as
    > returning UTF-16 code units, and you'll probably want to define a new
    > UniChar32 type and functions like:
    >
    > UniChar32 ToLower(UniChar32 aChar)
    >
    > that do the obvious right thing.
    >
    > And I'm sorry, I know some people will cringe when I say this, but if
    > you're like me and get to define your own "UniChar" data type, you've
    > been making it 32 bits wide since about 1997.
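    Doug's UniChar32-style function can be sketched in Python, where a code
    point is just an int (the to_lower_uc32 name below is hypothetical, not
    any real API):

```python
def to_lower_uc32(cp: int) -> int:
    """Lowercase a whole Unicode code point, not a UTF-16 code unit."""
    lowered = chr(cp).lower()
    # Some case mappings expand to multiple characters (e.g. U+0130);
    # a single-code-point API has to leave those unchanged.
    return ord(lowered) if len(lowered) == 1 else cp

# Works outside the BMP, where a 16-bit UniChar cannot:
# U+10400 DESERET CAPITAL LETTER LONG I -> U+10428
assert to_lower_uc32(0x10400) == 0x10428
assert to_lower_uc32(ord("A")) == ord("a")
```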

    Making UniChar 32 bits doubles your memory cost relative to UTF-16, and
    quadruples the size of your ASCII data relative to UTF-8. Changing an
    ASCII-compatible application to UTF-16 is already hard, because people
    who only care about ASCII are upset that all "their" data doubles in
    size. It is already a hard battle most of the time for someone like me.
    If we tell them to change to UCS-4, that means not x2 but x4 the
    memory. For web services or applications that spend multiple millions
    of dollars on memory and databases, it means adding millions of dollars
    to their cost. They may be willing to add some millions of cost to
    support international customers by using UTF-16. They probably are not
    willing to add multiple millions of dollars of cost to change to UCS-4.
    In fact, some people have proposed storing UTF-8 in a hacky way in the
    database instead of using UTF-16 or UCS-4, to save cost. They have to
    add restrictions on how the API is used and build an upper-level API to
    do the conversion and the hacky operations. That means the project
    incurs some fixed development cost (independent of the size of the
    data), but it saves millions of dollars of memory cost, which does
    depend on the size of the data. I don't like that approach, but usually
    my word, and what is "right", matter less to a commercial company than
    multiple millions of dollars.
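    The size arithmetic behind that battle is easy to verify (a small
    Python sketch; the payload is just an illustrative ASCII string):

```python
# Relative storage cost of the same ASCII-heavy data in three encodings.
ascii_text = "GET /index.html HTTP/1.1\r\n" * 1000

utf8 = len(ascii_text.encode("utf-8"))       # 1 byte per ASCII char
utf16 = len(ascii_text.encode("utf-16-be"))  # 2 bytes per BMP char
ucs4 = len(ascii_text.encode("utf-32-be"))   # 4 bytes per char, always

assert utf16 == 2 * utf8   # the "x2" objection to UTF-16
assert ucs4 == 4 * utf8    # the "x4" objection to UCS-4
```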

    >
    > <many similar examples snipped />
    >
    > > 4. Memory and Performance trade off.
    > >
    > > You probably can get a sense of the difficulty if you look at how many
    > > specification changes MS needed to make to add surrogate support to
    > > OpenType fonts. That is just the specification change, not including
    > > code changes or API changes.
    > >
    > > 'cmap' http://www.microsoft.com/typography/otspec/cmap.htm
    >
    > Yes, as stated before, if your previous idea of a UniChar was 16 bits,
    > you have some work to do. I thought we were talking about apps like
    > MySQL to which Unicode support was being added for the first time.
    > But
    >
    > > It is easy to add surrogate support to your application if your
    > > application does nothing.
    >
    > Oooooh!

    -- 
    Frank Yung-Fong Tang
    Šýštém Årçhîtéçt, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
    AIM:yungfongta   mailto:ytang0648@aol.com Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan
    John 3:16 "For God so loved the world that he gave his one and only Son,
    that whoever believes in him shall not perish but have eternal life."
    Does your software display Thai language text correctly for Thailand users?
    -> Basic Concept of Thai Language linked from Frank Tang's
    Iñtërnâtiônàlizætiøn Secrets
    Want to translate your English text to something Thailand users can
    understand ?
    -> Try English-to-Thai machine translation at
    http://c3po.links.nectec.or.th/parsit/
    


    This archive was generated by hypermail 2.1.5 : Wed Nov 19 2003 - 13:50:46 EST