RE: Compression through normalization

From: Philippe Verdy (
Date: Tue Nov 25 2003 - 18:10:53 EST

  • Next message: Doug Ewell: "Re: Compression through normalization"

    Mark Davis writes:
    > I would say that a compressor can normalize, if (a) when decompressing it
    > produces NFC, and (b) it advertises that it normalizes.

    Why condition (a)? NFD could be used as well, or even another
    normalization in which combining characters are sorted differently, or
    partly recomposed, or even recomposed while ignoring the composition
    exclusions, as long as the result is a canonical equivalent.

    Whatever the compressor produces, there's no way to specify the
    normalization form in the result: there's no standard to indicate it in the
    output stream.

    The relevant standard is using a MIME or IANA charset, which just specifies
    a pair consisting of a CCS (coded character set, i.e. for us the
    Unicode/ISO/IEC 10646 assigned codepoints) and a CES (for us, the character
    encoding scheme). There is no standard convention for advertising the
    normalization form.

    This implies that a transport protocol cannot assume any normalization form
    of Unicode, even if it's specified with UTF-*, UCS*, BOCU*, or SCSU.
    Normalization becomes a normal step in all interchanges, including for
    compression purposes. Unicode already says that all normalization forms are
    canonically equivalent and must be treated equally.

    I see no justification for accepting some VALID Unicode text while
    rejecting some other VALID text, when both texts are canonically
    equivalent. The interaction of C9 and C10 implies that any process that
    claims to respect canonical equivalence must normalize its input, or be
    SURE that the input is already normalized the same way it expects. There's
    no other way to be SURE of that, unless both processes are part of the same
    local system and share the same normalization library in their
    implementation at ALL times.
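    A minimal sketch of what such a process has to do: rather than comparing
    raw code point sequences, it normalizes both sides to a single form (NFD
    here; NFC would work equally well) before comparing.

    ```python
    import unicodedata

    def canonically_equal(a: str, b: str) -> bool:
        """Compare two strings under canonical equivalence by
        normalizing both to the same form before comparing."""
        return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    # Precomposed U+00E9 vs decomposed 'e' + U+0301: different bytes,
    # same canonical text.
    print(canonically_equal("\u00e9", "e\u0301"))  # True
    ```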

    If there's a delay between those two processes and the system is upgraded,
    you'll experience problems, unless the intermediate result from the first
    process is renormalized with the newer implementation before any use by the
    second process. If the intermediate result is, for example, an RDBMS
    database, the database needs to be checked and cleaned up with the new
    normalization to allow correct access to tables through binary-sorted
    indices with the upgraded RDBMS engine. In practice, this means rebuilding
    the indices, unless the database also stores somewhere which normalization
    form is used in its indices, and the engine performs the necessary
    normalization on the fly to match the storage requirements...
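    As an illustration of that rebuild step (the key values below are
    hypothetical, not from any particular RDBMS): renormalizing the key column
    collapses canonically equivalent keys to one binary form, so
    binary-comparison indices stay consistent after the upgrade.

    ```python
    import unicodedata

    # Hypothetical index keys stored before the upgrade: two canonically
    # equivalent spellings of the same word, which a binary-sorted index
    # would wrongly treat as distinct keys.
    old_keys = ["re\u0301sume\u0301", "r\u00e9sum\u00e9"]

    # Rebuilding the index: renormalize every key to NFC, then sort the
    # distinct binary values.
    rebuilt_index = sorted({unicodedata.normalize("NFC", k) for k in old_keys})

    print(len(rebuilt_index))  # 1: both spellings collapse to one key
    ```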

    For me, a process that accepts some text but not another canonically
    equivalent text is NOT conforming to the claim that it respects canonical
    equivalence, and so it is only a partial implementation of Unicode.


    This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 18:53:54 EST