Re: Worst case scenarios on SCSU

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 31 2001 - 20:50:58 EST


David Starner wrote:

> On Wed, Oct 31, 2001 at 05:04:44PM -0800, Kenneth Whistler wrote:
> > And before going on, I'm not clear exactly what you are
> > trying to do. SCSU is defined on UTF-16 text.
>
> Why do you say that? I can't find the phrase "UTF-16" in UTS-6.

UTS #6 is a very early Unicode Technical Report. It was drafted,
and essentially completed, before UTF-8 was formally incorporated
into the Unicode Standard (in Unicode 3.0) and well before
UTF-32 was defined and formally incorporated into the Unicode
Standard (in Unicode 3.1). When it was written, Unicode *was*
UTF-16, and nobody went out of their way to make the distinction
in terms all the time. This is true of all Unicode documents from
the Unicode 2.0 era.

> It
> says that it's "a compression scheme for Unicode" and that "[SCSU] is
> mainly intended for use with short to medium length Unicode strings.".
> I noticed that the sample strings are in UTF-16, and count surrogate
> pairs as two characters (I think; for 9.4, I count 17 characters
> counting pairs as 1 and 19 as two, whereas the text claims 20), but
> that's merely informative anyway.
>
> All the SCSU pieces I've written work directly from UTF-32. I'll admit
> I haven't done much checking with other encoders/decoders, but my
> decoder can handle all the sample strings correctly, as well as
> everything my encoders put out.

I have no quarrel with the claim that the SCSU scheme could be
implemented directly on UTF-32 data. But as Unicode Technical Standard
#6 is currently written, that is not how to do it conformantly.

It seems to me that a rewrite of SCSU would be in order to explicitly
allow and define UTF-32 implementations as well as UTF-16 implementations
of SCSU.

>
> > I don't understand this analysis. The worst case for SCSU is always
> > UTF-16 length + 1 byte. This is because if any garden path down the
> > heuristics leads to further expansions, you can always represent the
> > text as:
> >
> > SCU + (the rest of the text in Unicode)
>
> Section 5.2.1: "Each reserved tag value collides with 256 Unicode
> characters." If you do that and have private use values in your UTF-16
> string, decoding the SCSU will produce a different text.

My mistake. I went back to my own implementation to remind myself of
the problem involved with the private use characters and the need
for tag quoting. You are correct that if you pick certain aberrant
combinations of PUA characters that themselves cannot compress, you
end up with 3/2 * UTF-16 length as the worst case.
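
To make the arithmetic concrete, here is a minimal sketch of the
"SCU + Unicode mode" fallback with tag quoting. It is not taken from
any particular implementation. It assumes the tag values as I read them
in UTS #6: SCU is 0x0F, UQU is 0xF0, and Unicode-mode tags occupy the
high-byte range 0xE0..0xF2. The function names are made up for
illustration.

```python
SCU = 0x0F  # single-byte-mode tag: switch to Unicode mode (assumed value)
UQU = 0xF0  # Unicode-mode tag: quote the next UTF-16 code unit (assumed value)


def needs_quote(high_byte: int) -> bool:
    """True if a code unit's high byte collides with a Unicode-mode tag."""
    return 0xE0 <= high_byte <= 0xF2


def scsu_unicode_fallback(text: str) -> bytes:
    """Encode text as SCU followed by big-endian UTF-16 code units.

    Code units U+E000..U+F2FF (all private use) are prefixed with UQU so
    that a decoder does not mistake their high byte for a tag.
    """
    out = bytearray([SCU])
    data = text.encode("utf-16-be")
    for i in range(0, len(data), 2):
        hi, lo = data[i], data[i + 1]
        if needs_quote(hi):
            out.append(UQU)  # costs one extra byte for this code unit
        out.append(hi)
        out.append(lo)
    return bytes(out)
```

With ordinary text the result is UTF-16 length + 1 byte. With a string
made up entirely of code units in U+E000..U+F2FF, every unit costs
3 bytes instead of 2, which is where the 3/2 factor comes from (plus
the one byte for SCU).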

--Ken


