RE: Compression through normalization

From: D. Starner
Date: Wed Nov 26 2003 - 10:05:03 EST

    > I see no reason why you accept some limitations for this
    > encapsulation, but not ALL the limitations.

    Because I can convert the data from binary to Unicode text in UTF-16
    in a few lines of code if I don't worry about normalization. The rules
    suddenly become much more complex once I have to.

    The simple fact is that I can convert between UTF-8, UTF-16, and UTF-32
    with several utilities on my system, but none of them normalize. I don't
    know of any basic text tools that handle normalization, so if I edit a
    source file and email it to someone (over a transport that compresses and
    decompresses automatically), they're going to have trouble running diff
    on the code.
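
    To make the contrast concrete, here is a minimal Python sketch (the
    sample string and variable names are purely illustrative). It assumes
    only the standard codecs and the unicodedata module. Transcoding between
    the UTFs gives back the same code points, while normalizing to NFC
    rewrites them, which is what a diff would trip over.

```python
import unicodedata

# Illustrative sample: two decomposed "e + U+0301" pairs and one precomposed U+00E9.
text = "re\u0301sume\u0301 = caf\u00e9"

# Transcoding UTF-8 -> UTF-16 -> UTF-32 is lossless: the code points survive.
via_utf16 = text.encode("utf-8").decode("utf-8").encode("utf-16").decode("utf-16")
via_utf32 = via_utf16.encode("utf-32").decode("utf-32")
transcoding_preserved = (via_utf32 == text)

# NFC composes each "e + U+0301" pair into U+00E9, so the sequence changes.
nfc = unicodedata.normalize("NFC", text)
normalization_changed = (nfc != text)

print(transcoding_preserved, normalization_changed)
```
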
    > If you don't want such "denormalisation" to occur during the compression,
    > don't claim that your 9-bit encapsulator produces Unicode text (so don't
    > label it with a UTF-* encoding scheme or even a BOCU-* or SCSU character
    > encoding scheme, but use your own charset label)!

    The whole point of such a tool would be to send binary data over a
    transport that only allows Unicode text. In practice, you'd also have to
    remap the C0 and C1 control characters; but even then, mapping
    0x00-0x1F -> U+0250-U+026F and 0x80-0x9F -> U+0270-U+028F wouldn't be too
    complex. Normalization, on the other hand, would add a lot of complexity:
    you'd need a Unicode library in what could otherwise be coded in 4k.
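
    As a rough sketch of that remapping in Python: the function names are
    hypothetical, and mapping every other byte to the code point of the same
    value is my assumption here, not something fixed by the scheme discussed
    in this thread. Only the two control-character ranges come from the
    mapping given above.

```python
def encode_bytes(data: bytes) -> str:
    """Map each byte to one code point, moving C0/C1 controls out of the way."""
    out = []
    for b in data:
        if b <= 0x1F:                      # C0 controls -> U+0250..U+026F
            out.append(chr(0x0250 + b))
        elif 0x80 <= b <= 0x9F:            # C1 controls -> U+0270..U+028F
            out.append(chr(0x0270 + (b - 0x80)))
        else:                              # assumption: all other bytes map to themselves
            out.append(chr(b))
    return "".join(out)


def decode_text(text: str) -> bytes:
    """Inverse of encode_bytes; rejects code points the encoder never emits."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0x0250 <= cp <= 0x026F:
            out.append(cp - 0x0250)
        elif 0x0270 <= cp <= 0x028F:
            out.append(cp - 0x0270 + 0x80)
        elif 0x20 <= cp <= 0x7F or 0xA0 <= cp <= 0xFF:
            out.append(cp)
        else:
            raise ValueError(f"unexpected code point U+{cp:04X}")
    return bytes(out)


payload = bytes(range(256))
wrapped = encode_bytes(payload)
utf16 = wrapped.encode("utf-16-le")        # ready for a Unicode-text-only transport
assert decode_text(utf16.decode("utf-16-le")) == payload
```

    Note that this sketch is not safe under every normalization form: byte
    0xE9 becomes U+00E9, which NFD decomposes into U+0065 U+0301, and the
    decoder above would then reject the combining mark.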

    This archive was generated by hypermail 2.1.5 : Wed Nov 26 2003 - 10:57:30 EST