Re: Worst case scenarios on SCSU

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 31 2001 - 20:04:44 EST


David Starner wrote:

> Has any one done worst case scenarios on SCSU, with respect to other
> methods of encoding Unicode characters?

As current Czar of Names Rectification, I must start protesting
here. SCSU is a means of *compressing* Unicode text. It is
not "[an]other method of encoding Unicode characters."

UTF-32, UTF-16, and UTF-8 are Unicode character encoding forms
(CEF's). Once can, of course, calculate the worst case length
of text in each of those encoding forms, which turns out to
be 4 bytes per character (code point) in each case. One can
also compare various mixtures of text in any of those encoding
forms with various compression schemes for the same text,
including SCSU. But it is important not to compare apples and
oranges here.

>
> The numbers I've got are:

And before going on, I'm not clear exactly what you are
trying to do. SCSU is defined on UTF-16 text. It would, of
course, be possible to create SCSU-like windowing compression
schemes that would work on UTF-32 or UTF-8 text, but those are
not part of UTS #6 as it is currently written.

At the moment, if you want to compare SCSU-compressed text
against the UTF-32 form, you would have to convert the UTF-32
text to UTF-16, and then compress it using SCSU. You don't
apply SCSU directly to UTF-32 data.

>
> UTF-32: Since all characters (including any necessary state changes)
> can be encoded in four characters, and four characters would be
                         ^bytes ^bytes
> necessary for a supplementary character outside any current window, the
> worst case scenario (for short strings) is an optimal SCSU length = the
> UTF-32 length. But in the long run, we must account for the windows. As
> an optimal sequence will probably look like SQX foo bar baz SQX foo bar
> baz SCn byte SQX foo bar baz . . . SCSU length = UTF-32 length * % of
> astral characters not in able to be covered by 7 windows + UTF-32 length
> * 2/4 * % of astral characters covered by 7 windows + 2 bytes * 7 windows
> (to initially set up the windows)
> = UTF-32 length * 8185/8192 + UTF-32 length * 7/16384 + 14
> = UTF-32 length * 16377/16384 + 14
> (actually, min of this and UTF-32 length.)

This analysis might apply to a UTF-32 adaptation of SCSU, but that
is a different animal than SCSU as it stands.

>
> UTF-16: This time, our worst case scenario is certain private use
> characters. Since certain private use characters take up 3 bytes (when
> encoded window-less) instead of two in UTF-16, preliminary guess is 3/2
  ^compressed
> the size of UTF-16. It's suseptible to the same problem as above, only
> worse. Encoding all characters in as either SDn window byte, SQU high
> low, or SCn byte, and using the reasoning above gets us
> = UTF-16 length * 3/2 * 61/62 + UTF-16 length * 1/62 + 16
> (This may be somewhat weak, as increasing the ration of private use
> characters makes windows more useful, and decreasing it makes Unicode
> mode more useful.)

I don't understand this analysis. The worst case for SCSU is always
UTF-16 length + 1 byte. This because if any garden path down the
heuristics leads to further expansions, you can always represent the
text as:

   SCU + (the rest of the text in Unicode)

>
> UTF-8: Worst case scenario is a series of NULs (or similar characters).
> Since this gives us a string with twice the length of the corresponding
> UTF-8 string, it can't be windowized, and there's no other characters
> that have much if any expansion, I'd say the worst case scenario is 2 *
> the UTF-8 length.

Here, you are saying that if I have a UTF-8 string 0x01 0x01 0x01 0x01...
I'd have to represent it in SCSU as 0x0F 0x00 0x01 0x00 0x01 0x00 0x01...?
(Actually NULs themselves would not be a problem, since they are passed
as single bytes 0x00.)

--Ken



This archive was generated by hypermail 2.1.2 : Wed Oct 31 2001 - 21:16:28 EST