RE: Compression through normalization

From: Arcane Jill
Date: Tue Nov 25 2003 - 05:32:26 EST

    I'm pretty sure it depends on whether you regard a text document as a
    sequence of characters or as a sequence of glyphs. (Er - I mean
    "default grapheme clusters", of course.) Regarded as a sequence of
    characters, normalisation changes that sequence. But regarded as a
    sequence of glyphs, normalisation leaves the sequence unchanged,
    because canonical normalisation only ever substitutes one canonically
    equivalent sequence for another, and canonically equivalent sequences
    are supposed to display identically. So a compression algorithm could
    legitimately claim to be "lossless" if it did normalisation but
    operated at the glyph level.
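    Just to illustrate what I mean, here's a little Python sketch (the
    strings are just an example): NFC and NFD give you different
    codepoint sequences, but the results are canonically equivalent, so a
    renderer shows the same glyph for both.

        import unicodedata

        decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT (two codepoints)
        precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one codepoint)

        # As character sequences, the two differ...
        assert decomposed != precomposed

        # ...but normalisation maps each onto the other's form, because
        # the two sequences are canonically equivalent.
        assert unicodedata.normalize("NFC", decomposed) == precomposed
        assert unicodedata.normalize("NFD", precomposed) == decomposed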

    I'm pretty sure you DON'T need to preserve the byte-stream bit for bit.
    For example, at the byte level, I see no reason to preserve invalid
    encoding sequences, and at the codepoint level I see no reason to
    preserve non-character codepoints. So - at the glyph level - we only
    need to preserve glyphs, no? It all depends on how the compression
    algorithm describes itself.
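    To make that concrete, here's a rough Python sketch of what the
    front end of such a self-describing compressor might do before the
    actual encoding stage. The names (canonicalize, is_noncharacter) are
    made up for illustration, and I'm assuming UTF-8 input:

        import unicodedata

        def is_noncharacter(cp):
            # Noncharacters: U+FDD0..U+FDEF, plus the last two codepoints
            # of every plane (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ...).
            return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

        def canonicalize(raw):
            # Hypothetical front end for a normalising compressor.
            # Invalid byte sequences are not preserved: the decoder
            # replaces each one with U+FFFD.
            text = raw.decode("utf-8", errors="replace")
            # Non-character codepoints are not preserved either.
            text = "".join(c for c in text if not is_noncharacter(ord(c)))
            # Canonical normalisation before handing off to the
            # compression stage proper.
            return unicodedata.normalize("NFC", text)

    Two canonically equivalent inputs then compress to the same output,
    which is exactly the sense in which the scheme is lossless at the
    glyph level but not at the byte level.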

    I think this might go wrong for "tailored grapheme clusters" - I
    don't know much about them, but since they're locale- and
    application-specific tailorings of the default segmentation, two
    processes might not even agree on what the glyph sequence *is*.

