Re: Worst case scenarios on SCSU

From: David Starner (dstarner98@aasaa.ofe.org)
Date: Wed Oct 31 2001 - 20:28:59 EST


On Wed, Oct 31, 2001 at 05:04:44PM -0800, Kenneth Whistler wrote:
> And before going on, I'm not clear exactly what you are
> trying to do. SCSU is defined on UTF-16 text.

Why do you say that? I can't find the phrase "UTF-16" in UTS-6. It
says that it's "a compression scheme for Unicode" and that "[SCSU] is
mainly intended for use with short to medium length Unicode strings.".
I noticed that the sample strings are in UTF-16, and count surrogate
pairs as two characters (I think; for 9.4, I count 17 characters
counting pairs as one and 19 counting them as two, whereas the text
claims 20), but that's merely informative anyway.
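
For reference, the two counting conventions can be compared with a
minimal Python sketch. The function names and the sample string are
illustrative assumptions, not taken from UTS-6; the sample is not the
string from section 9.4.

```python
def count_code_points(text: str) -> int:
    # A Python str is a sequence of code points, so a supplementary
    # character (one surrogate pair in UTF-16) counts once.
    return len(text)


def count_utf16_units(text: str) -> int:
    # Encode to UTF-16 (big-endian, no BOM); every code unit is 2 bytes,
    # so a surrogate pair counts twice.
    return len(text.encode("utf-16-be")) // 2


# Hypothetical sample: two BMP letters around one supplementary character.
sample = "a\U00010000b"
print(count_code_points(sample), count_utf16_units(sample))
```

The two counts differ by exactly the number of supplementary
characters in the string.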

All the SCSU pieces I've written work directly from UTF-32. I'll admit
I haven't done much checking with other encoders/decoders, but my
decoder can handle all the sample strings correctly, as well as
everything my encoders put out.

> > UTF-32: Since all characters (including any necessary state changes)
> > can be encoded in four characters, and four characters would be
>                          ^bytes               ^bytes

Yes, sorry.

> I don't understand this analysis. The worst case for SCSU is always
> UTF-16 length + 1 byte. This because if any garden path down the
> heuristics leads to further expansions, you can always represent the
> text as:
>
> SCU + (the rest of the text in Unicode)

Section 5.2.1: "Each reserved tag value collides with 256 Unicode
characters." If you do that and have private use values in your UTF-16
string, decoding the SCSU will produce a different text.
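
The collision can be illustrated with a rough Python sketch of a
"switch to Unicode mode and dump UTF-16" encoder that works from
UTF-32 code points. The tag values (SCU = 0x0F, UQU = 0xF0, Unicode-mode
tags occupying high bytes 0xE0..0xF2) are as read from UTS-6 and should
be checked against the spec; the function names and the `quote`
parameter are made up for the example.

```python
SCU = 0x0F  # single-byte-mode tag: switch to Unicode mode (assumed from UTS-6)
UQU = 0xF0  # Unicode-mode tag: quote the next UTF-16 code unit (assumed)
TAG_HIGH_BYTES = range(0xE0, 0xF3)  # high bytes 0xE0..0xF2 read as tags


def utf16_units(code_points):
    """Yield UTF-16 code units for an iterable of UTF-32 code points."""
    for cp in code_points:
        if cp >= 0x10000:
            cp -= 0x10000
            yield 0xD800 | (cp >> 10)    # high surrogate
            yield 0xDC00 | (cp & 0x3FF)  # low surrogate
        else:
            yield cp


def encode_unicode_mode(code_points, quote=True):
    """SCU followed by big-endian UTF-16, quoting units that look like tags."""
    out = bytearray([SCU])
    for unit in utf16_units(code_points):
        if quote and (unit >> 8) in TAG_HIGH_BYTES:
            out.append(UQU)  # without this, the high byte is taken as a tag
        out += unit.to_bytes(2, "big")
    return bytes(out)


# U+E000 (private use): the unquoted form starts with 0xE0, a tag byte.
print(encode_unicode_mode([0xE000], quote=False).hex())
print(encode_unicode_mode([0xE000]).hex())
```

Under these assumptions, text made up entirely of characters in
U+E000..U+F2FF costs 3 bytes per character plus the leading SCU, which
is more than UTF-16 length + 1.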
 
> Here, you are saying that if I have a UTF-8 string 0x01 0x01 0x01 0x01...
> I'd have to represent it in SCSU as 0x0F 0x00 0x01 0x00 0x01 0x00 0x01...?
> (Actually NULs themselves would not be a problem, since they are passed
> as single bytes 0x00.)

Right. I was thinking of SQ0 0x01 SQ0 0x01 . . . but it's the same idea.
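
The two encodings of a run of U+0001 can be compared side by side in a
small Python sketch. It assumes SQ0 = 0x01 and SCU = 0x0F as in UTS-6,
and that SQ0 followed by a byte below 0x80 selects that code point from
static window 0; the function names are hypothetical.

```python
SQ0 = 0x01  # quote one character from window 0 (assumed from UTS-6)
SCU = 0x0F  # switch to Unicode mode (assumed from UTS-6)


def quote_each(n: int) -> bytes:
    # SQ0 0x01 SQ0 0x01 ... : two bytes per U+0001, no mode switch.
    return bytes([SQ0, 0x01]) * n


def unicode_mode(n: int) -> bytes:
    # SCU 0x00 0x01 0x00 0x01 ... : one tag byte, then UTF-16BE.
    return bytes([SCU]) + bytes([0x00, 0x01]) * n


for n in (1, 4):
    print(n, len(quote_each(n)), len(unicode_mode(n)))
```

Either way each U+0001 costs two bytes, against one byte in UTF-8;
the Unicode-mode form only adds the single leading SCU.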

-- 
David Starner - dstarner98@aasaa.ofe.org
Pointless website: http://dvdeug.dhis.org
"I saw a daemon stare into my face, and an angel touch my breast; each 
one softly calls my name . . . the daemon scares me less."
- "Disciple", Stuart Davis


