Worst case scenarios on SCSU

From: David Starner (dstarner98@aasaa.ofe.org)
Date: Wed Oct 31 2001 - 18:44:19 EST


Has anyone worked out worst case scenarios for SCSU, with respect to
other methods of encoding Unicode characters?

The numbers I've got are:

UTF-32: Since all characters (including any necessary state changes)
can be encoded in four bytes, and four bytes would be necessary for a
supplementary character outside any current window, the worst case
scenario (for short strings) is an optimal SCSU length = the UTF-32
length. But in the long run, we must account for the windows. An
optimal sequence will probably look like SQX foo bar baz SQX foo bar
baz SCn byte SQX foo bar baz . . ., so
SCSU length = UTF-32 length * % of astral characters not able to be
covered by 7 windows + UTF-32 length * 2/4 * % of astral characters
covered by 7 windows + 2 bytes * 7 windows (to initially set up the
windows). The 16 supplementary planes hold 16 * 65536 / 128 = 8192
windows, so with the characters spread evenly over them this is
= UTF-32 length * 8185/8192 + UTF-32 length * 7/16384 + 14
= UTF-32 length * 16377/16384 + 14
(actually, min of this and UTF-32 length.)
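
The arithmetic above can be checked with a short Python sketch. It is
only a restatement of the estimate, not an SCSU encoder. The names are
made up for illustration, and it assumes that the 8192 in the
denominator is the number of 128-character windows in the 16
supplementary planes, with characters spread evenly over them.

```python
from fractions import Fraction

WINDOW_SIZE = 128                        # characters per SCSU window
SUPPLEMENTARY_CHARS = 16 * 0x10000       # planes 1 through 16
TOTAL_WINDOWS = SUPPLEMENTARY_CHARS // WINDOW_SIZE   # 8192 (assumed source of the figure)
USABLE_WINDOWS = 7                       # windows the estimate keeps defined
SETUP_BYTES = 2 * USABLE_WINDOWS         # the "+ 14" term, as given in the post


def scsu_worst_case_vs_utf32(utf32_len):
    """Estimated worst-case SCSU length in bytes for a UTF-32 length in bytes."""
    covered = Fraction(USABLE_WINDOWS, TOTAL_WINDOWS)
    uncovered = 1 - covered
    estimate = (utf32_len * uncovered                    # 4 bytes each, same as UTF-32
                + utf32_len * Fraction(2, 4) * covered   # 2 bytes each via a window
                + SETUP_BYTES)
    # The estimate is capped at the plain UTF-32 length.
    return min(estimate, Fraction(utf32_len))


if __name__ == "__main__":
    print(scsu_worst_case_vs_utf32(65536))
    print(scsu_worst_case_vs_utf32(4))
```

Under these assumptions the window setup only pays for itself once the
UTF-32 length passes 32768 bytes, which is where the 7/16384 saving
reaches 14 bytes.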

UTF-16: This time, our worst case scenario is certain private use
characters. Since certain private use characters take up 3 bytes (when
encoded window-less) instead of two in UTF-16, my preliminary guess is 3/2
the size of UTF-16. It's susceptible to the same problem as above, only
worse. Encoding all characters as either SDn window byte, SQU high
low, or SCn byte, and using the reasoning above gets us
= UTF-16 length * 3/2 * 61/62 + UTF-16 length * 1/62 + 16
(This may be somewhat weak, as increasing the ratio of private use
characters makes windows more useful, and decreasing it makes Unicode
mode more useful.)
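
For anyone who wants to plug in numbers, here is the UTF-16 estimate
as a small Python sketch. The 61/62 and 1/62 shares and the constant 16
are taken straight from the formula above without rederiving them; the
function and variable names are hypothetical.

```python
from fractions import Fraction

THREE_BYTE_SHARE = Fraction(61, 62)   # share costing 3 bytes instead of 2 (from the post)
SAME_SIZE_SHARE = Fraction(1, 62)     # share costing the same as UTF-16 (from the post)
CONSTANT_OVERHEAD = 16                # the "+ 16" term (from the post)


def scsu_worst_case_vs_utf16(utf16_len):
    """Estimated worst-case SCSU length in bytes for a UTF-16 length in bytes."""
    return (utf16_len * Fraction(3, 2) * THREE_BYTE_SHARE
            + utf16_len * SAME_SIZE_SHARE
            + CONSTANT_OVERHEAD)


if __name__ == "__main__":
    print(scsu_worst_case_vs_utf16(124))
    # Long-run expansion factor, ignoring the constant term.
    print(Fraction(3, 2) * THREE_BYTE_SHARE + SAME_SIZE_SHARE)
```

The long-run factor works out to 185/124, a little under the 3/2
preliminary guess.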

UTF-8: Worst case scenario is a series of C0 control characters that
SCSU has to quote, such as U+0001 (NUL itself, like TAB, LF and CR,
passes through as a single byte). Since this gives us a string with
twice the length of the corresponding UTF-8 string, it can't be
windowized, and there are no other characters that have much if any
expansion, I'd say the worst case scenario is 2 * the UTF-8 length.
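
The 2x figure can be illustrated with a byte counter for the ASCII and
C0 range only. This is a sketch, not a real SCSU encoder. It assumes,
from my reading of UTS #6, that in the initial single-byte mode the
bytes 0x00, 0x09, 0x0A, 0x0D and 0x20 to 0x7F pass through unchanged
while every other C0 control has to be quoted with SQ0, costing two
bytes. The function names are hypothetical.

```python
from fractions import Fraction

# Controls assumed to pass through as a single byte (NUL, TAB, LF, CR).
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def scsu_len_c0_ascii(text):
    """Count SCSU bytes for text limited to U+0000..U+007F (single-byte mode)."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp > 0x7F:
            raise ValueError("this sketch only handles U+0000..U+007F")
        if cp >= 0x20 or cp in PASS_THROUGH_CONTROLS:
            total += 1      # passed through as-is
        else:
            total += 2      # SQ0 tag byte plus the quoted byte
    return total


def expansion_vs_utf8(text):
    """Ratio of the SCSU byte count to the UTF-8 byte count."""
    return Fraction(scsu_len_c0_ascii(text), len(text.encode("utf-8")))


if __name__ == "__main__":
    print(expansion_vs_utf8("\x01" * 100))    # quoted controls only
    print(expansion_vs_utf8("plain text\n"))  # ordinary ASCII
```

A run of U+0001 is the kind of input that hits the factor of two, while
ordinary ASCII text stays at the same size as UTF-8.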

-- 
David Starner - dstarner98@aasaa.ofe.org
Pointless website: http://dvdeug.dhis.org
"I saw a daemon stare into my face, and an angel touch my breast; each 
one softly calls my name . . . the daemon scares me less."
- "Disciple", Stuart Davis


