Compression through normalization (was: Re: Ternary search trees)

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Nov 24 2003 - 01:26:55 EST

    I wrote in response to Philippe Verdy:

    > Changing a sequence S to a canonically equivalent sequence S' is
    > explicitly allowed by conformance clause C10, but I would be surprised
    > if any designers or implementers of Unicode compression schemes have
    > considered such a thing. Does this violate users' expectations of
    > what "lossless" compression means? This might be a topic all by
    > itself.

    The more I think about this, the more it intrigues me. I can't decide
    what is the "right thing" to do here, or if there is a "right thing" at
    all.

    It certainly seems true that under some circumstances, some samples of
    Unicode text could be made more compressible by converting them to a
    different normalization form -- either NFC, NFD, or a "partial" form
    like the one Philippe described.

    For example, consider the Korean name "Kwan" encoded as jamos (NFD):

    U+110F U+116A U+11AB

    Assuming that this (very) short string appeared at the beginning of a
    text stream, it would be encoded as follows:

    UTF-8: E1 84 8F E1 85 AA E1 86 AB
    SCSU: 1B 22 8F EA 1C 23 AB
    BOCU-1: E1 79 BA D0 38

    (Note in the SCSU example that two dynamic windows have to be defined,
    for the half-blocks at U+1100 and U+1180. This is a one-time cost;
    subsequent references to these blocks within the same stream would use
    the same windows.)
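
    For anyone who wants to reproduce the UTF-8 figure, here is a minimal
    Python sketch (the variable name is mine; SCSU and BOCU-1 would of
    course require real encoder implementations):

        # The name "Kwan" as conjoining jamos -- the NFD form shown above
        kwan_nfd = "\u110F\u116A\u11AB"
        utf8 = kwan_nfd.encode("utf-8")
        print(" ".join(f"{b:02X}" for b in utf8))   # E1 84 8F E1 85 AA E1 86 AB
        print(len(utf8), "bytes")                   # 9 bytes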

    Now, this same name "Kwan" could be equivalently encoded (NFC) as the
    syllable U+CF74. Even though none of the three formats shown here is
    optimized for efficient representation of Hangul syllables, they would
    all see a big improvement compared to the jamos (again assuming this is
    at the beginning of a stream):

    UTF-8: EC BD B4
    SCSU: 0F CF 74
    BOCU-1: FB BC 10
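
    The precomposed code point is easy to verify: the Hangul composition
    arithmetic from TUS chapter 3 maps the three jamos straight to U+CF74,
    and NFC normalization gives the same answer. A minimal Python sketch
    (variable names are mine):

        import unicodedata

        L, V, T = 0x110F, 0x116A, 0x11AB
        # SBase = AC00, LBase = 1100, VBase = 1161, TBase = 11A7,
        # VCount * TCount = 21 * 28 = 588, TCount = 28 (TUS chapter 3)
        s = 0xAC00 + (L - 0x1100) * 588 + (V - 0x1161) * 28 + (T - 0x11A7)
        print(hex(s))                                # 0xcf74

        kwan_nfc = unicodedata.normalize("NFC", "\u110F\u116A\u11AB")
        print(" ".join(f"{b:02X}" for b in kwan_nfc.encode("utf-8")))  # EC BD B4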

    For longer, more realistic texts, SCSU would stay in Unicode mode and
    use 2 bytes per syllable, while BOCU-1 would take advantage of
    relatively short 2-byte (and occasional 1-byte) jumps within the Hangul
    Syllables super-block. Spaces and CR/LF pairs between syllables give an
    edge to BOCU-1, while other interspersed Basic Latin characters, such as
    full stops and other punctuation, give a (less frequent) edge to SCSU.
    (Remember how, just a day ago, I wondered why BOCU-1 did so much better
    on Korean text encoded as syllables? Now I know.)

    Regardless of the CES or TES, however, it is clear that modern Korean
    can be expressed more compactly using precomposed syllables (NFC) than
    using jamos (NFD).
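
    The difference is visible in raw UTF-8 even before any compression
    scheme gets involved: each precomposed syllable costs 3 bytes, while
    its decomposition costs 6 or 9. A small Python sketch (the sample word
    is mine):

        import unicodedata

        sample_nfc = "\uB300\uD55C\uBBFC\uAD6D"      # "Daehan Minguk", 4 syllables
        sample_nfd = unicodedata.normalize("NFD", sample_nfc)

        print(len(sample_nfc.encode("utf-8")))       # 12 bytes (3 per syllable)
        print(len(sample_nfd.encode("utf-8")))       # 33 bytes (2 or 3 jamos per syllable)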

    So the question becomes: Is it legitimate for a Unicode compression
    engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
    another (canonically equivalent) normalization form to improve its
    compressibility?

    Conformance clause C10 explicitly allows this:

    "When a process purports not to modify the interpretation of a valid
    coded character representation, it shall make no change to that coded
    character representation *other than the possible replacement of
    character sequences by their canonical-equivalent sequences* or the
    deletion of noncharacter code points." (TUS 4.0, p. 60; emphasis mine)

    However, in the non-text compression world there is a definite
    distinction between "lossy" and "lossless" compression methods. In the
    former, it is understood that a certain amount of information from the
    original source may be discarded, in the interest of improving
    compression. JPEG is the best-known example of a lossy image format;
    image information may be modified or discarded according to compression
    needs and the limitations of human vision.

    In the latter, there is a built-in promise that the compression process
    will not discard any information. GIF is an example of an image format
    that uses lossless compression; a 256-color bitmap, converted to GIF and
    then back to its original format, should be bit-for-bit equivalent to
    the original data (except for metadata used in the original format but
    not in GIF, such as the "important colors" field in Windows BMP).

    Converting Unicode data to another normalization form seems to fall
    somewhere between these two cases. Conformance clause C10 tells us that
    a process
    can replace NFD data with its NFC equivalent and still claim "not to
    modify" its interpretation; there is no "loss of data" in the sense of a
    bitmap being converted to JPEG. Yet there is no bit-for-bit equivalence
    either; for a given text T, there is no promise that:

    decompress(compress(T)) ≡ T
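
    The distinction is easy to demonstrate. In the sketch below (Python;
    the "compressor" is simulated by a bare NFC conversion, the extreme
    case for this question), the round-tripped text is canonically
    equivalent to T but not identical to it:

        import unicodedata

        t = "\u110F\u116A\u11AB"                # original text T, in NFD
        t_rt = unicodedata.normalize("NFC", t)  # stands in for decompress(compress(T))

        print(t_rt == t)                        # False: not identical code point for code point
        print(unicodedata.normalize("NFD", t_rt) ==
              unicodedata.normalize("NFD", t))  # True: still canonically equivalent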

    If a compressor were to take advantage of the potential to improve
    compression through normalization, would end users feel that the data
    had been changed in an unacceptable way? Suppose a compression format
    (SCSU or BOCU-1) is being used with a protocol that demands a particular
    normalization form. Would there be a conformance problem? What about
    any other "process" as described by C10 that feeds into another
    protocol? Can a higher-level protocol supersede C10 and *require* a
    given normalization form? The note under C10 says:

    "All processes and higher-level protocols are required to abide by C10
    as a minimum."

    which implies that they may not.

    I'm especially interested in what the compression and conformance
    experts have to say about this, but of course the thoughtful input of
    any list member is very much welcome.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


