Re: Worst case scenarios on SCSU

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Nov 02 2001 - 12:34:15 EST


Dear fellow SCSU enthusiasts!

SCSU wanders wondrous worlds between CES (Character Encoding Scheme) and TES (Transfer Encoding Syntax).
But few people care - it is a way to get Unicode into and out of a byte stream, and as such it qualifies as a "charset" as used in Internet protocols. (A charset is defined as a method to get text _out_ of a byte stream.)

SCSU is registered as an IANA charset.

In the ICU implementation of the SCSU converter, I believe the worst case is 3 bytes per 16-bit code unit (UTF-16). The converter actually gets really close to the compression of the samples in UTS #6, but it is limited mostly because we allow buffering with arbitrarily small input/output buffer sizes. We cannot assume that we will see the entire text at once - or even more than one byte/code unit at a time. Still, it works quite well, though not optimally, and I tried to write it for good performance.
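
To make the 3-bytes-per-code-unit figure concrete, here is a minimal Python sketch of how one might size an output buffer from it. The function names are made up for illustration, and the factor 3 is only the believed worst case stated above, not a verified guarantee. (One way a code unit can cost 3 bytes in SCSU is the SQU quote tag followed by two data bytes, though that may not be the only case.)

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to represent text in UTF-16."""
    # Every UTF-16 code unit is 2 bytes; supplementary characters use 2 units.
    return len(text.encode("utf-16-le")) // 2


def scsu_output_bound(text: str, bytes_per_unit: int = 3) -> int:
    """Hypothetical helper: upper bound on SCSU output size in bytes.

    bytes_per_unit=3 is an assumption taken from the believed worst case
    for the ICU converter; adjust it if that belief turns out to be wrong.
    """
    return utf16_code_units(text) * bytes_per_unit


if __name__ == "__main__":
    for sample in ("abc", "Gr\u00fc\u00dfe", "\U0001F600"):
        print(repr(sample), utf16_code_units(sample), scsu_output_bound(sample))
```

Note that the bound is per UTF-16 code unit, so a supplementary character (two surrogates) counts twice.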

As a theoretical maximum for the output length, the answer is of course "unlimited" for pathological converters. This is because you can write an arbitrary number of useless state changes, like SC0 SC0 SC0 ... without encoding anything.
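
The SC0 padding can be shown with a small Python sketch. The decoder below is a deliberately tiny subset written for this illustration (single-byte mode only, default dynamic windows, no quoting, no window redefinition, no Unicode mode), not a conforming SCSU decoder, and the function names are hypothetical. It assumes the tag values and default window offsets as I recall them from UTS #6.

```python
SC0 = 0x10  # single-byte-mode tag: make dynamic window 0 the active window

# Default offsets of the eight dynamic windows (assumed from UTS #6).
DEFAULT_WINDOW_OFFSETS = [0x0080, 0x00C0, 0x0400, 0x0600,
                          0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode a tiny subset of SCSU single-byte mode (illustration only)."""
    active = 0  # the initial state has dynamic window 0 active
    out = []
    for b in data:
        if 0x10 <= b <= 0x17:
            # SC0..SC7: change the active window, emit nothing
            active = b - 0x10
        elif b >= 0x80:
            # upper half maps into the active dynamic window
            out.append(chr(DEFAULT_WINDOW_OFFSETS[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # ASCII and the pass-through control codes
            out.append(chr(b))
        else:
            raise ValueError(f"tag 0x{b:02X} is outside this sketch's subset")
    return "".join(out)


def pad_with_sc0(data: bytes, count: int) -> bytes:
    """Prepend `count` redundant SC0 tags, as a pathological encoder might."""
    return bytes([SC0]) * count + data


if __name__ == "__main__":
    plain = b"abc"
    padded = pad_with_sc0(plain, 1000)
    print(len(plain), len(padded))
    print(decode_scsu_subset(plain) == decode_scsu_subset(padded))
```

Both byte streams decode to the same text, and the padding count can be made as large as you like.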

markus


