Re: UTF-16 inside UTF-8

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Nov 05 2003 - 23:48:20 EST

Next message: Radovan Garabik: "Re: [OT] Voiced velar fricative"

Previous message: Jony Rosenne: "RE: Merging combining classes, was: New contribution N2676"
In reply to: YTang0648@aol.com: "Re: UTF-16 inside UTF-8"
Next in thread: John Cowan: "Re: UTF-16 inside UTF-8"
Reply: John Cowan: "Re: UTF-16 inside UTF-8"
Reply: Frank Yung-Fong Tang: "Re: UTF-16 inside UTF-8"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

> ok.
> example to show you how difficult to support surrogate:
>
> Example 1: I have this api
> UniChar is defined to be two byte holding 16 bits.
> UniChar ToLower(UniChar aChar)
> Tell me how to support Surrogate?
>
> Example 2:
> I have api
>
> int FindCharInString( String, UniChar)
>
> Tell me what the return value should mean ? Should it mean the count
> of UniChar from the beginning of String or should it mean the count of
> the CHARACTER from the beginning of the String. What should I do when
> I start to add surrogate support?

What you do is, you go through the exact same process that API vendors
have had to go through since the beginning of multibyte character sets.
That is, you decide whether your API returns code units or characters,
you publicize that decision, and you stick to it. If the decision means
you have a function that isn't terribly useful, you have to define a new
function that does the right thing, and leave the old function on the
mountain to die.
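A rough sketch of what that decision can look like for Example 2, in C. The old function keeps its name and is simply documented as returning an index in UTF-16 code units; the new one searches for a whole code point and returns an index in characters. The names FindCodePointInString, UniChar32 and decode_at are made up for illustration, and the explicit length parameter is an assumption, since the original String type was not shown.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UniChar;    /* one UTF-16 code unit, as in the quoted API */
typedef uint32_t UniChar32;  /* hypothetical: one full Unicode code point */

/* Decode the code point that starts at s[i]; *len receives 1 or 2 units. */
static UniChar32 decode_at(const UniChar *s, size_t n, size_t i, size_t *len)
{
    UniChar u = s[i];
    if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
        s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
        *len = 2;
        return 0x10000 + ((UniChar32)(u - 0xD800) << 10)
                       + (UniChar32)(s[i + 1] - 0xDC00);
    }
    *len = 1;
    return u;  /* BMP character, or an unpaired surrogate passed through */
}

/* Old contract, now documented: index in CODE UNITS, or -1 if absent. */
int FindCharInString(const UniChar *s, size_t n, UniChar c)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] == c)
            return (int)i;
    return -1;
}

/* New function: index in CHARACTERS (code points), or -1 if absent. */
int FindCodePointInString(const UniChar *s, size_t n, UniChar32 c)
{
    size_t i = 0;
    int count = 0;
    while (i < n) {
        size_t len;
        if (decode_at(s, n, i, &len) == c)
            return count;
        i += len;   /* skip one or two code units */
        count++;    /* but only one character */
    }
    return -1;
}
```

For a string holding U+10400 followed by "A", the old function finds "A" at code unit 2 and the new one finds it at character 1. Both answers are right, as long as each function says which one it gives.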

To cite a non-Unicode example, in ECMAScript (née JavaScript) there is a
function Date.getYear() that was intended to return the last two digits
of the year but actually returned the year minus 1900. Of course,
starting in 2000 the function returned a value which was useful to
practically nobody. Did Netscape or ECMA change the definition of
Date.getYear()? No, they introduced a new function, Date.getFullYear(),
which does what users really want.

Same thing here: you can't change the 16-bit UniChar, so you'll have to
declare that your old functions that return a UniChar are defined as
returning UTF-16 code units, and you'll probably want to define a new
UniChar32 type and functions like:

UniChar32 ToLower(UniChar32 aChar)

that do the obvious right thing.
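Here is a minimal sketch of that split, in C. Plain C has no overloading, so the new function gets the hypothetical name ToLower32 here. The mapping covers only ASCII and the Deseret capitals (U+10400..U+10427) as a stand-in for a real case-mapping table, which is an assumption made to keep the example short.

```c
#include <stdint.h>

typedef uint16_t UniChar;    /* one UTF-16 code unit */
typedef uint32_t UniChar32;  /* one full Unicode code point */

/* New function: takes and returns a whole code point.
 * Toy mapping only; a real one would consult the Unicode case tables. */
UniChar32 ToLower32(UniChar32 c)
{
    if (c >= 0x41 && c <= 0x5A)          /* ASCII A..Z */
        return c + 0x20;
    if (c >= 0x10400 && c <= 0x10427)    /* Deseret capital letters */
        return c + 0x28;
    return c;
}

/* Old function: contract unchanged, one code unit in, one code unit out.
 * A lone surrogate has no case by itself, so it is returned untouched. */
UniChar ToLower(UniChar u)
{
    if (u >= 0xD800 && u <= 0xDFFF)
        return u;
    return (UniChar)ToLower32(u);
}
```

The old ToLower keeps working for every existing BMP caller, and anyone who needs supplementary characters moves to ToLower32.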

And I'm sorry, I know some people will cringe when I say this, but if
you're like me and get to define your own "UniChar" data type, you've
been making it 32 bits wide since about 1997.

> 4. Memory and Performance trade off.
>
> You probably can get a sense of difficulty if you look at how many
> specification change MS need to make to add surrogate support to the
> OpenType font. That is just specification change not include code
> changes or API changes.
>
> 'cmap' http://www.microsoft.com/typography/otspec/cmap.htm

Yes, as stated before, if your previous idea of a UniChar was 16 bits,
you have some work to do. I thought we were talking about apps like
MySQL to which Unicode support was being added for the first time.

> It is easy to add surrogate support to your application if your
> application do nothing.

Oooooh!

> It is difficult to add surrogate support (not
> impossible) if your application do some data processing. It is hard to
> add surrogate support if your software is a library which have
> previous defined API.
>
> Look at
> Format 4: Segment mapping to delta values
> Supporting 4-byte character codes
>
> I am not saying software should not support surrogate. I am saying
> don't underestimate the efforts. And while a software does support
> surrogate correctly. Give them a praise instead of take it for
> granted. It is hard work.

OK, I'll grant that. Depending on the type of application and the
previous state of Unicode support, adding support for supplementary
characters (please, not "surrogates" -- that is a UTF-16-only concept)
can be a lot of work. My point is that it should be done anyway, except
perhaps for cases like GSM 03.38 where we can be reasonably sure the
(BMP-only) subset to be supported will never change. And *new* Unicode
applications and APIs should never "Y2K" themselves by supporting only
the BMP at first and expecting to add the other million code points
later.
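To illustrate why "surrogates" is a UTF-16-only concept, here is a small C sketch that encodes the same supplementary code point both ways. In UTF-16 it becomes a surrogate pair; in UTF-8 it is a single four-byte sequence with no surrogates anywhere. The function names EncodeUtf8 and EncodeUtf16 are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8. Returns bytes written (1..4),
 * or 0 if c is a surrogate or lies beyond U+10FFFF. */
size_t EncodeUtf8(uint32_t c, unsigned char out[4])
{
    if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return 0;
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (c >> 18));
    out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}

/* Encode one code point as UTF-16. Returns code units written (1 or 2),
 * or 0 if c is a surrogate or lies beyond U+10FFFF. */
size_t EncodeUtf16(uint32_t c, uint16_t out[2])
{
    if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return 0;
    if (c < 0x10000) {
        out[0] = (uint16_t)c;
        return 1;
    }
    c -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (c >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (c & 0x3FF)); /* low surrogate */
    return 2;
}
```

For U+10400, EncodeUtf16 yields D801 DC00 and EncodeUtf8 yields F0 90 90 80. An application built on UTF-8 or UTF-32 never sees a surrogate, yet it still has to handle the supplementary character.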

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Thu Nov 06 2003 - 00:38:20 EST