Re: Unicode forms for internal storage

From: Doug Ewell
Date: Wed Jan 21 2004 - 01:59:53 EST

    Elliotte Rusty Harold <elharo at metalab dot unc dot edu> wrote:

    >> BZZZT! Sorry, thanks for playing. You can't get the
    >> advantages of both with no drawbacks. Given the octets 0x5B5B, how
    >> would you know if you had "[[" or a Chinese character?
    > Actually, it looks like SCSU may do exactly that. If I'm
    > understanding the algorithms, it actually encodes most BMP characters
    > in a single byte, compressing quite a bit better than my naive idea
    > to switch between UTF-8 and UTF-16.

    I too missed the point in Elliotte's original post that it was OK for
    this transformation to be stateful. Since that is the case, SCSU
    definitely will fit the bill.
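
    The ambiguity in the quoted question is easy to reproduce. The snippet
    below is only an illustration in Python (not something from the thread):
    the same two octets decode cleanly under both encodings, so a stateless
    mix of the two cannot be told apart without some outside signal.

```python
# The two octets from the quoted example.
data = bytes([0x5B, 0x5B])

# Read as UTF-8 (or ASCII), they are two left square brackets.
as_utf8 = data.decode("utf-8")

# Read as big-endian UTF-16, they are the single code point U+5B5B.
as_utf16 = data.decode("utf-16-be")

print(repr(as_utf8), len(as_utf8))
print(f"U+{ord(as_utf16):04X}", len(as_utf16))
```

    A stateful format avoids this because the decoder always knows which
    mode it is in when it reads those octets.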

    > All schemes I've seen do involve some sort of flag characters in the
    > data stream to switch between different code ranges. As long as you
    > can keep the number of flag characters added down below the savings,
    > you're good to go. My original idea was to simply use a null to
    > switch between ASCII and UTF-16. SCSU looks a lot more sophisticated.

    SCSU *can be* a lot more sophisticated, but as Markus noted, a subset of
    full-blown SCSU will often achieve really good compression.

    > Of course, neither of those schemes will compress truly random data,
    > but most data isn't random.

    No scheme will compress truly random data, at least not consistently.
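
    The reason is a counting argument: there are 2^n bit strings of length
    n, but only 2^n - 1 bit strings that are strictly shorter, so no
    lossless scheme can shorten every input. A small Python check of that
    count (the function name is just for illustration):

```python
def strings_shorter_than(n_bits: int) -> int:
    """Count all bit strings of length 0 .. n_bits - 1."""
    return sum(2 ** k for k in range(n_bits))

for n in (1, 8, 16, 64):
    inputs = 2 ** n                      # distinct inputs of exactly n bits
    shorter = strings_shorter_than(n)    # distinct outputs that would be shorter
    # There is always one output too few, so at least one input cannot shrink.
    print(n, inputs - shorter)
```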

    >> Hmmm - again, this may be asking for too much. The
    >> UTF-8/UTF-16 transform is pretty simple. Is it bogging you down?
    > It is a noticeable point in my profiling. I really did have to make a
    > choice between speed and space here. According to
    > it looks like SCSU is
    > faster for a lot of languages but 10-25% slower for English, French
    > and Japanese than the UTF-8/UTF-16 conversion.

    If you are using the "mini" version of SCSU where Latin-1 characters are
    stored as 1 byte each and everything else is stored as UTF-16 (using SCU
    and UC0 tags to switch between modes), you ought to achieve really good
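
    For reference, the "mini" subset described here could be sketched
    roughly as below. This is an unoptimized illustration in Python, not a
    complete SCSU implementation: the function names are made up, the
    encoder switches modes on every transition, and the decoder understands
    only the tags this encoder emits. Besides SCU and UC0 it needs two
    quoting tags (SQ0 and UQU) so that control characters and code units
    whose high byte collides with a tag value stay unambiguous. The tag
    values are taken from the SCSU specification (UTS #6).

```python
SQ0, SCU = 0x01, 0x0F      # single-byte mode: quote one char, switch to Unicode mode
UC0, UQU = 0xE0, 0xF0      # Unicode mode: switch back to window 0, quote one code unit
PASS_THROUGH = (0x00, 0x09, 0x0A, 0x0D)   # C0 controls that need no quoting

def encode_mini_scsu(text: str) -> bytes:
    out = bytearray()
    unicode_mode = False
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFF:
            # Latin-1: one byte each, using the default window 0 at U+0080.
            if unicode_mode:
                out.append(UC0)
                unicode_mode = False
            if cp < 0x20 and cp not in PASS_THROUGH:
                out.append(SQ0)   # other C0 bytes would be read as tags
            out.append(cp)
        else:
            # Everything else: big-endian UTF-16 code units.
            if not unicode_mode:
                out.append(SCU)
                unicode_mode = True
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                if 0xE0 <= units[i] <= 0xF2:
                    out.append(UQU)   # high byte would be read as a tag
                out += units[i:i + 2]
    return bytes(out)

def decode_mini_scsu(data: bytes) -> str:
    units = bytearray()   # collected as UTF-16BE, decoded at the end
    i = 0
    unicode_mode = False
    while i < len(data):
        b = data[i]
        if unicode_mode:
            if b == UC0:
                unicode_mode = False
                i += 1
            elif b == UQU:
                units += data[i + 1:i + 3]
                i += 3
            elif 0xE0 <= b <= 0xF2:
                raise ValueError(f"tag 0x{b:02X} is outside this subset")
            else:
                units += data[i:i + 2]
                i += 2
        else:
            if b == SCU:
                unicode_mode = True
                i += 1
            elif b == SQ0:
                units += bytes((0, data[i + 1]))
                i += 2
            elif b < 0x20 and b not in PASS_THROUGH:
                raise ValueError(f"tag 0x{b:02X} is outside this subset")
            else:
                units += bytes((0, b))
                i += 1
    return units.decode("utf-16-be")
```

    With this sketch, pure Latin-1 text costs one byte per character, and a
    run of other characters costs two bytes per code unit plus one tag byte
    on each side of the run.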

    >> If space usage is random/indeterminate/evenly distributed, then,
    >> assuming that any given string is primarily in a single language, a
    >> TLV type discriminating between UTF-8 and UTF-16 should do nicely.
    >> Precede each string with an OR of the MSB (0 for UTF-8, 1 for UTF-16)
    >> and the length, in octets, of the string (therefore max of 32,767
    >> octets per string, which shouldn't ordinarily be a problem).
    > That would be a problem. I definitely cannot rule out long strings,
    > where long is quite a bit larger than 32K.

    Despite the often-stated claims that SCSU and BOCU-1 are "optimized for
    short strings," they work just as well on arbitrarily long strings.
    It's just that the performance of general-purpose compression schemes
    gets *much* better as the input text gets larger, so the relative
    benefit of SCSU and BOCU-1 (compared to GP compression) is greatly
    reduced. But for an internal-storage need like Elliotte's, and
    especially where speed and simplicity are important, the compression
    formats look like winners.
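
    The effect of input size on general-purpose compression can be seen
    with a quick experiment. The Python sketch below uses zlib as a stand-in
    for "GP compression" and a made-up, highly repetitive sample text, so
    the numbers flatter the compressor and should not be read as a
    benchmark.

```python
import zlib

def zlib_ratio(text: str) -> float:
    """Compressed size divided by the size of the UTF-16 input."""
    raw = text.encode("utf-16-le")
    return len(zlib.compress(raw)) / len(raw)

short_text = "Unicode forms"
# Artificial long input: the same sentence repeated many times.
long_text = "Unicode forms for internal storage. " * 200

# Fixed overhead (header, checksum, block setup) dominates the short input;
# the long input gives the compressor enough context to pay that back.
print(f"short: {zlib_ratio(short_text):.2f}")
print(f"long:  {zlib_ratio(long_text):.3f}")
```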

    -Doug Ewell
     Fullerton, California
