Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Nov 25 2003 - 18:13:34 EST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > I say YES only for compressors that are supposed to work on Unicode
    > text (this applies to BOCU-1 and SCSU which are not intended to
    > compress anything other than Unicode text), but NO of course for
    > general purpose compressors (like deflate in zip files.)

    Of course.

    > I will say NO for encoding forms that are normally built to be
    > directly parsable code point by code point in any direction and from
    > random locations in strings. So a UTF encoding scheme is not supposed
    > to change the normalization form.

    Of course not. Or so I would imagine, anyway. After all, if a process
    (see Peter Kirk's question) that compresses Unicode text can silently
    change the normalization form, then why not a process that stores and
    retrieves Unicode text using, say, UTF-8? But that sounds wrong to me,
    although it's what C10 says.

    >> * Peter Kirk and Mark Shoulson say NO, it can't, because all the
    >> compressor really knows about is the byte stream, so it must be
    >> preserved byte-for-byte.
    >
    > But SCSU and BOCU-1 do not operate in the byte stream level, as their
    > use is invalid on random streams of bytes, but only defined in terms
    > of streams of code units...

    That's right. I tend to agree with the NO camp not because SCSU and
    BOCU-1 are going to be applied to arbitrary binary data, but because the
    *format* in which text is stored isn't normally expected to change the
    contents.

    Converting Unicode text from UTF-16LE to UTF-16BE, or UTF-16 to UTF-8,
    changes the bits. Everyone can see that. But the *code points*
    represented by those bits are not changed. If the UTF-16BE sequence <00
    61 03 01> (U+0061 U+0301) were converted to the UTF-8 sequence <C3 A1>
    (U+00E1), that would be a change not only in the bits, but in the code
    points as well. This is where the question lies.
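
    To make the distinction concrete, here is a minimal Python sketch
    (my own illustration, using only the standard codecs and the
    unicodedata module). It contrasts a plain UTF-16BE to UTF-8
    transcoding with a conversion that also applies NFC. The variable
    names are arbitrary.

```python
import unicodedata

# "a" followed by COMBINING ACUTE ACCENT: U+0061 U+0301
decomposed = "a\u0301"
utf16be = decomposed.encode("utf-16-be")
assert utf16be == bytes([0x00, 0x61, 0x03, 0x01])

# Plain transcoding: the bytes change, the code points do not.
transcoded = utf16be.decode("utf-16-be")
utf8 = transcoded.encode("utf-8")          # bytes 61 CC 81
same_code_points = (transcoded == decomposed)

# Transcoding plus NFC normalization: the code points change too.
composed = unicodedata.normalize("NFC", transcoded)   # U+00E1
composed_utf8 = composed.encode("utf-8")   # bytes C3 A1

print(utf8.hex(" "), same_code_points)
print(composed_utf8.hex(" "), composed == decomposed)
```

    The two results are canonically equivalent (normalizing either one
    to NFD gives the same string), but they are not the same sequence
    of code points, which is exactly the point in dispute.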

    > That's why I won't say that SCSU and BOCU-1 are really compressors,
    > but rather really encoding schemes (CES in the ISO10646 terminology).

    They are transfer encoding syntaxes (TES). And I believe this
    terminology is from Unicode, not 10646, though I could be wrong.

    I would say encoders for SCSU and BOCU-1 are compressors. They're just
    not general-purpose compressors.

    > In fact the result of BOCU-1 and SCSU encoding schemes can create a
    > file which has its own charset (i.e. CCS+CES in the ISO terminology),
    > and thus can also have its own label for MIME usage or in XML charset
    > declarations. This is not a limitation, as true compressors can still
    > be used if needed from this encoding scheme, or transparently within
    > transport layers (such as the "Content-Transfer-Encoding:" in MIME and
    > HTTP applications).

    Yes, you can take SCSU- or BOCU-1-encoded text and recompress it using a
    GP compression scheme. Atkin and Stansifer's paper from last year is
    all about that, and I spend a few pages on it in my paper as well. You
    can also re-Zip a Zip file, though, so I don't know what that proves
    about the compression formats.
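
    The layering itself is easy to demonstrate. The sketch below is an
    illustration only: Python's standard library has no SCSU or BOCU-1
    codec, so zlib (deflate) stands in for both layers, which makes it
    the "re-Zip a Zip file" case rather than SCSU followed by a GP
    compressor. The sample text is arbitrary.

```python
import zlib

# Arbitrary sample text; zlib stands in for both compression layers
# because SCSU and BOCU-1 are not available in the standard library.
text = ("Converting Unicode text from UTF-16 to UTF-8 changes the bits. "
        * 50).encode("utf-8")

once = zlib.compress(text)    # first compression layer
twice = zlib.compress(once)   # second layer applied to the first's output

# Undoing both layers in reverse order recovers the original bytes.
restored = zlib.decompress(zlib.decompress(twice))

print(len(text), len(once), len(twice), restored == text)
```

    That the layers stack and unstack losslessly shows only that the
    formats compose, not what kind of format either one is.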

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


