Re: UTF-16 inside UTF-8

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Nov 05 2003 - 23:48:20 EST


    Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

    > OK.
    > An example to show you how difficult it is to support surrogates:
    >
    > Example 1: I have this API.
    > UniChar is defined to be two bytes holding 16 bits.
    > UniChar ToLower(UniChar aChar)
    > Tell me how to support surrogates?
    >
    > Example 2:
    > I have api
    >
    > int FindCharInString( String, UniChar)
    >
    > Tell me what the return value should mean. Should it mean the count
    > of UniChars from the beginning of String, or should it mean the count
    > of CHARACTERS from the beginning of the String? What should I do when
    > I start to add surrogate support?

    What you do is, you go through the exact same process that API vendors
    have had to go through since the beginning of multibyte character sets.
    That is, you decide whether your API returns code units or characters,
    you publicize that decision, and you stick to it. If the decision means
    you have a function that isn't terribly useful, you have to define a new
    function that does the right thing, and leave the old function on the
    mountain to die.
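
    To make that concrete, here is a rough sketch of the two flavors of
    FindCharInString. The names, the 16-bit UniChar typedef, and the exact
    signatures are my own illustration, not your actual API: one function
    keeps the legacy meaning (an offset in code units), the other counts
    characters, treating a surrogate pair as one.

    typedef unsigned short UniChar;   /* 16-bit UTF-16 code unit */

    /* Legacy meaning: returns the offset in code units, or -1 if absent. */
    int FindCharInString(const UniChar *s, int len, UniChar c)
    {
        for (int i = 0; i < len; i++)
            if (s[i] == c)
                return i;
        return -1;
    }

    /* Character-counting variant: a surrogate pair counts as one. */
    int FindCharInStringByChar(const UniChar *s, int len, UniChar c)
    {
        int chars = 0;
        for (int i = 0; i < len; i++, chars++) {
            if (s[i] == c)
                return chars;
            /* skip the low surrogate of a well-formed pair */
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                i + 1 < len && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                i++;
        }
        return -1;
    }

    Either meaning is defensible; what matters is that you document which
    one you chose and never quietly change it.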

    To cite a non-Unicode example, in ECMAScript (née JavaScript) there is
    a function Date.getYear() that was intended to return the last two
    digits of the year but actually returned the year minus 1900. Of
    course, starting in 2000 the function returned a value that was useful
    to practically nobody. Did Netscape or ECMA change the definition of
    Date.getYear()? No, they introduced a new function, Date.getFullYear(),
    which does what users really want.

    Same thing here: you can't change the 16-bit UniChar, so you'll have to
    declare that your old functions that return a UniChar are defined as
    returning UTF-16 code units, and you'll probably want to define a new
    UniChar32 type and functions like:

    UniChar32 ToLower(UniChar32 aChar)

    that do the obvious right thing.
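
    If it helps, here is a minimal sketch of how the new 32-bit function
    might be fed from an existing UTF-16 buffer. The names (UniChar32,
    ToLower32, NextCodePoint) are my own illustration -- I renamed ToLower
    only because C has no overloading -- and the ToLower32 body is a
    trivial ASCII stand-in so the sketch compiles, not real case-mapping
    data.

    typedef unsigned short UniChar;    /* 16-bit UTF-16 code unit */
    typedef unsigned int   UniChar32;  /* any scalar value, U+0000..U+10FFFF */

    /* Stand-in body: the real function would consult Unicode case data. */
    UniChar32 ToLower32(UniChar32 aChar)
    {
        if (aChar >= 'A' && aChar <= 'Z')
            return aChar + ('a' - 'A');
        return aChar;
    }

    /* Read one code point starting at *pos and advance *pos past it. */
    UniChar32 NextCodePoint(const UniChar *s, int len, int *pos)
    {
        UniChar32 u = s[(*pos)++];
        if (u >= 0xD800 && u <= 0xDBFF && *pos < len &&
            s[*pos] >= 0xDC00 && s[*pos] <= 0xDFFF) {
            /* combine the surrogate pair into a supplementary code point */
            u = 0x10000 + ((u - 0xD800) << 10) + (s[*pos] - 0xDC00);
            (*pos)++;
        }
        return u;
    }

    /* Usage: walk a UTF-16 string, lowercasing one code point at a time. */
    void LowerCaseExample(const UniChar *s, int len)
    {
        int pos = 0;
        while (pos < len) {
            UniChar32 lower = ToLower32(NextCodePoint(s, len, &pos));
            (void)lower;   /* write it back out here, re-encoded as UTF-16 */
        }
    }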

    And I'm sorry, I know some people will cringe when I say this, but if
    you're like me and get to define your own "UniChar" data type, you've
    been making it 32 bits wide since about 1997.

    <many similar examples snipped />

    > 4. Memory and Performance trade off.
    >
    > You can probably get a sense of the difficulty if you look at how many
    > specification changes MS needed to make to add surrogate support to the
    > OpenType font format. That is just the specification change, not
    > including code changes or API changes.
    >
    > 'cmap' http://www.microsoft.com/typography/otspec/cmap.htm

    Yes, as stated before, if your previous idea of a UniChar was 16 bits,
    you have some work to do. I thought we were talking about apps like
    MySQL to which Unicode support was being added for the first time.

    > It is easy to add surrogate support to your application if your
    > application does nothing.

    Oooooh!

    > It is difficult to add surrogate support (not
    > impossible) if your application does some data processing. It is hard
    > to add surrogate support if your software is a library which has a
    > previously defined API.
    >
    > Look at
    > Format 4: Segment mapping to delta values
    > Supporting 4-byte character codes
    >
    > I am not saying software should not support surrogates. I am saying
    > don't underestimate the effort. And when a piece of software does
    > support surrogates correctly, give it praise instead of taking it for
    > granted. It is hard work.

    OK, I'll grant that. Depending on the type of application and the
    previous state of Unicode support, adding support for supplementary
    characters (please, not "surrogates" -- that is a UTF-16-only concept)
    can be a lot of work. My point is that it should be done anyway, except
    perhaps for cases like GSM 03.38 where we can be reasonably sure the
    (BMP-only) subset to be supported will never change. And *new* Unicode
    applications and APIs should never "Y2K" themselves by supporting only
    the BMP at first and expecting to add the other million code points
    later.

    -Doug Ewell
    Fullerton, California
    http://users.adelphia.net/~dewell/


