Re: Worst case scenarios on SCSU

From: David Starner (dstarner98@aasaa.ofe.org)
Date: Wed Oct 31 2001 - 20:11:21 EST


On Wed, Oct 31, 2001 at 05:44:19PM -0600, David Starner wrote:
> UTF-16: This time, our worst case scenario is certain private use
> characters. Since certain private use characters take up 3 bytes (when
> encoded window-less) instead of two in UTF-16, preliminary guess is 3/2
> the size of UTF-16. It's susceptible to the same problem as above, only
> worse. Encoding all characters as either SDn window byte, SQU high
> low, or SCn byte, and using the reasoning above gets us
> = UTF-16 length * 3/2 * 61/62 + UTF-16 length * 1/62 + 16

Sorry, this is all wrong, as I forgot that some characters cannot be
put into windows. I find this case problematic, as a series of BMP Han
characters must be encoded in Unicode mode to get 2 bytes per character,
but the private-use characters must be encoded in UTF-16.
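
To make the Unicode-mode cost concrete, here is a minimal Python sketch
of the per-character byte count in SCSU Unicode mode. It assumes the
Unicode-mode tag bytes occupy 0xE0 to 0xF2 and that UQU (0xF0) is used to
quote a character whose high byte collides with a tag. The function name
unicode_mode_cost is made up for illustration, and the sketch only
covers BMP code units.

```python
# Sketch only: byte cost of one BMP code unit in SCSU Unicode mode.
TAG_FIRST, TAG_LAST = 0xE0, 0xF2  # Unicode-mode tag bytes

def unicode_mode_cost(cp: int) -> int:
    """Return the bytes needed for BMP code point cp while in Unicode mode."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("sketch covers BMP code points only")
    high = cp >> 8
    # A high byte in the tag range would be read as a command, so the
    # character is quoted with UQU (0xF0): 1 tag byte + 2 data bytes.
    if TAG_FIRST <= high <= TAG_LAST:
        return 3
    return 2  # plain big-endian UTF-16 code unit

print(unicode_mode_cost(0x4E00), unicode_mode_cost(0xE000))
```

A Han character such as U+4E00 costs 2 bytes here, while a private-use
character such as U+E000 costs 3.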

Some tests with the optimal SCSU encoder I'm working on give results
of 26 to 28 bytes for 10 randomly chosen characters in the Unicode
mode tag range (U+E000 to U+F2FF).
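
For comparison, a naive baseline is easy to work out. The sketch below
(Python, with hypothetical helper names) assumes an encoder that never
defines a window and simply quotes every character: SQU plus two bytes
in single-byte mode, or one SCU to switch modes followed by UQU plus two
bytes per character in Unicode mode. It is not the optimal encoder
mentioned above, only an upper bound to set those figures against.

```python
import random

# Code points whose high byte falls in the Unicode-mode tag range 0xE0..0xF2.
TAG_RANGE = range(0xE000, 0xF300)

def naive_single_byte_mode(chars):
    # SQU (0x0E) + high byte + low byte for every character
    return 3 * len(chars)

def naive_unicode_mode(chars):
    # one SCU (0x0F) to enter Unicode mode, then UQU (0xF0) + 2 bytes each
    return 1 + 3 * len(chars)

def distinct_windows(chars):
    # number of distinct 128-character windows the sample touches
    return len({cp >> 7 for cp in chars})

sample = random.sample(TAG_RANGE, 10)
print(naive_single_byte_mode(sample),
      naive_unicode_mode(sample),
      distinct_windows(sample))
```

The two baselines come to 30 and 31 bytes for 10 characters. Totals
below 30 are presumably possible when several of the characters share a
128-character window, since a window defined once can then serve more
than one character. That explanation is an assumption, not something
the tests above establish.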

-- 
David Starner - dstarner98@aasaa.ofe.org
Pointless website: http://dvdeug.dhis.org
"I saw a daemon stare into my face, and an angel touch my breast; each 
one softly calls my name . . . the daemon scares me less."
- "Disciple", Stuart Davis


