RE: Compression through normalization

From: Philippe Verdy (
Date: Wed Nov 26 2003 - 08:56:49 EST


    Peter Kirk [peterkirk at qaya dot org] writes:

    > On 25/11/2003 16:38, Doug Ewell wrote:
    > >Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    > >
    > >>So SCSU and BOCU-* formats are NOT general purpose compressors. As
    > >>they are defined only in terms of stream of Unicode code points, they
    > >>are assumed to follow the conformance clauses of Unicode. As they
    > >>recognize their input as Unicode text, they can recognize canonical
    > >>equivalence, and thus this creates an opportunity for them to consider
    > >>if a (de)normalization or de/re-composition would result in higher
    > >>compression (interestingly, the composition exclusion could be
    > >>reconsidered in the case of BOCU-1 and SCSU compressed streams,
    > >>provided that the decompression to code points will redecompose the
    > >>excluded compositions).
    > >
    > >I have to say, if there's a flaw in Philippe's logic here, I don't see
    > >it. Anyone?
    > Yes, the compressor can make any canonically equivalent change, not just
    > composing composition exclusions but reordering combining marks in
    > different classes. The only flaw I see is that the compressor does not
    > have to undo these changes on decompression; at least no other process
    > is allowed to rely on it having done so.

    Being able to undo these changes on decompression is needed only if one
    wants to restore a canonically equivalent text that preserves all of its
    initial semantics.

    I am not saying that decompressors need to undo all these changes in order
    to be lossless, as long as the result of decompression is canonically
    equivalent to the original: the decompressor may keep sequences composed
    even though they are normally excluded from recomposition. (This
    restriction only applies to encoded streams that claim to be in NFC or
    NFKC form when parsed as streams of code points; in practice, in
    applications that handle code points as binary code units, it extends to
    streams of _code units_, but not to streams of _bytes_ of a UTF encoding
    _scheme_.)
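    The point above can be sketched with Python's standard `unicodedata`
    module (this is my own illustration of canonical equivalence and of a
    composition exclusion, not part of SCSU or BOCU-1 themselves):

```python
import unicodedata

# A decompressor is "lossless" in this sense if its output is canonically
# equivalent to the input, even when the exact code point sequence differs.
original = "e\u0301"      # 'e' + U+0301 COMBINING ACUTE ACCENT (decomposed)
decompressed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE (precomposed)

# The code point sequences differ...
assert original != decompressed
# ...but both normalize to the same NFC form: they are canonically equivalent.
assert unicodedata.normalize("NFC", original) == \
       unicodedata.normalize("NFC", decompressed)

# A composition exclusion: U+0958 DEVANAGARI LETTER QA never recomposes,
# so even its NFC form is the decomposed sequence U+0915 U+093C.
assert unicodedata.normalize("NFC", "\u0958") == "\u0915\u093c"
```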

    I see good reasons why a fully Unicode-compliant application, process or
    system can be built that handles Unicode text symbolically rather than as
    code units. For example, a Unicode text can be fully handled (and
    transformed with Unicode algorithms) as a linked list of items, where each
    item is a symbolic abstract character, or a complete object with its own
    interface for accessing its properties, transformation methods and
    associations, or an enumerated XML element with a distinct name. For such
    applications, the normalization form makes sense as the internal
    representation, and it has nothing to do with the glyph representation.
    There may even exist an object interface for interchanging these objects
    that does not use or transmit any code unit, or even a binary byte
    representation.
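    A minimal sketch of this symbolic view (the class and function names are
    my own, assumed for illustration; the post does not define an API): each
    item is an object carrying character properties, and a Unicode algorithm,
    here the Canonical Ordering Algorithm, operates on the objects themselves
    rather than on code units.

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class AbstractChar:
    """A symbolic abstract character: identified by name, not by code unit."""
    name: str            # the Unicode character name as symbolic identity
    combining_class: int # the ccc property needed by canonical ordering

    @classmethod
    def of(cls, ch: str) -> "AbstractChar":
        return cls(unicodedata.name(ch), unicodedata.combining(ch))

def canonical_order(chars: list[AbstractChar]) -> list[AbstractChar]:
    """Stable-sort each run of nonzero-ccc combining marks by combining
    class: the Canonical Ordering Algorithm, applied to objects."""
    out: list[AbstractChar] = []
    run: list[AbstractChar] = []
    for c in chars:
        if c.combining_class == 0:
            run.sort(key=lambda m: m.combining_class)  # stable sort
            out.extend(run)
            run = []
            out.append(c)
        else:
            run.append(c)
    run.sort(key=lambda m: m.combining_class)
    out.extend(run)
    return out

# 'e' followed by acute (ccc=230) then cedilla (ccc=202), i.e. misordered:
text = [AbstractChar.of(c) for c in "e\u0301\u0327"]
ordered = canonical_order(text)  # cedilla now precedes acute
```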

    In that case, the most important thing is not the code unit, nor even the
    code point itself, but the supported enumerated objects, i.e. the assigned
    abstract characters that are part of the Unicode CCS (coded character
    set). To me, code points are more symbolic than they appear, and they are
    not numeric values. If they were, we would not need the concept of code
    points at all, and could simply use the code units of the UTF-32 encoding
    form.

    What I mean here is that the numeric code assigned in GB18030 to an
    abstract character is as valid as a UTF-32 code unit: both represent the
    same abstract character. So UTF-32BE and GB18030 (for example) encode the
    same set of abstract characters (ISO/IEC 10646 would say they share the
    same subset, but with distinct numeric code positions, so they are two
    distinct coded character sets, a.k.a. CCS).
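    This is easy to demonstrate with Python's codecs (a sketch, using an
    arbitrary example character): the two coded character sets assign
    different numeric codes, yet both round-trip to the same abstract
    character.

```python
ch = "\u4e2d"  # U+4E2D, a CJK ideograph present in both repertoires

gb = ch.encode("gb18030")     # GB18030 code position, as bytes
u32 = ch.encode("utf-32-be")  # UTF-32 code unit, in big-endian byte order

# Distinct numeric code positions...
assert gb != u32
# ...but both decode back to the same abstract character.
assert gb.decode("gb18030") == u32.decode("utf-32-be") == ch
```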

    As long as ISO/IEC 10646 and Unicode had not formally merged their
    character sets and normative references so that they fully interoperate,
    it was impossible to consider normalizing Unicode texts within
    compressors. But now that there is a normative stability policy for
    canonically equivalent strings, it is clear that even ISO/IEC 10646 is
    more than just a coded character set: it includes the definition of
    canonically equivalent strings, bound very tightly to the code points
    assigned in the CCS.

    Ensuring compliance with canonical equivalence then requires indicating
    which character subset is supported, i.e. the version of the Unicode
    standard or of the ISO/IEC 10646 standard (which is augmented with new
    assignments more often than Unicode, until the new repertoires are merged
    by formal agreement between the two parties). Interoperability is
    guaranteed only if the character sets used in documents are strictly bound
    to the code points assigned in both published, versioned standards; but
    when this is done, one can immediately assume the rules for canonical
    equivalence of strings using these new characters.

    That is why I think both standards (Unicode and ISO/IEC 10646) MUST
    clearly and formally specify which versions they correspond to with
    regard to their common CCS. I note that this was not the case before
    Unicode 4.0, but it has been formally indicated since the official
    publication of Unicode 4.0, and I hope this normative reference will be
    kept in the future.


    This archive was generated by hypermail 2.1.5 : Wed Nov 26 2003 - 09:38:46 EST