RE: Compression through normalization

From: jon@hackcraft.net
Date: Mon Dec 01 2003 - 08:12:40 EST

    Quoting Philippe Verdy <verdy_p@wanadoo.fr>:

    > jon@hackcraft.net wrote:
    > > Further, a Unicode-aware algorithm would expect a choseong character to
    > > be followed by a jungseong, and a jongseong to follow a jungseong, and
    > > could essentially achieve the same benefits to compression that
    > > normalising to NFC provides, but without making an irreversible change
    > > (i.e. it could tokenise the Jamo sequences rather than normalising and
    > > then tokenising).
    >
    > Isn't it equivalent to what bzip2 does, but without knowledge of Unicode
    > composition rules, simply by discovering that jamos are structured
    > within their syllables, and creating, on the fly, code positions to
    > represent their composition?

    I imagine so.
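
    As an illustration only, a Unicode-aware tokeniser along these lines might
    look like the Python sketch below. It maps each conjoining Jamo sequence
    (choseong, jungseong, optional jongseong) to a single token, but keeps
    those tokens in a separate number space so that the original text,
    including any precomposed syllables it already contained, can be restored
    exactly. The names tokenise, detokenise and JAMO_TOKEN_BASE are
    hypothetical, and the token layout is an assumption, not part of any
    existing compressor.

```python
# Constants from the Unicode Hangul syllable composition arithmetic.
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

# Hypothetical token space just above the Unicode code point range, so a
# tokenised Jamo sequence never collides with a precomposed syllable.
JAMO_TOKEN_BASE = 0x110000


def tokenise(text):
    """Turn L V [T] Jamo sequences into single integer tokens."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        cp = ord(text[i])
        if L_BASE <= cp < L_BASE + L_COUNT and i + 1 < n:
            v = ord(text[i + 1]) - V_BASE
            if 0 <= v < V_COUNT:
                l = cp - L_BASE
                t, width = 0, 2
                if i + 2 < n:
                    t_idx = ord(text[i + 2]) - T_BASE
                    if 1 <= t_idx < T_COUNT:  # index 0 means "no jongseong"
                        t, width = t_idx, 3
                tokens.append(JAMO_TOKEN_BASE + (l * V_COUNT + v) * T_COUNT + t)
                i += width
                continue
        tokens.append(cp)  # everything else passes through unchanged
        i += 1
    return tokens


def detokenise(tokens):
    """Exact inverse of tokenise: restores the original code points."""
    out = []
    for tok in tokens:
        if tok >= JAMO_TOKEN_BASE:
            l, rest = divmod(tok - JAMO_TOKEN_BASE, V_COUNT * T_COUNT)
            v, t = divmod(rest, T_COUNT)
            out.append(chr(L_BASE + l))
            out.append(chr(V_BASE + v))
            if t:
                out.append(chr(T_BASE + t))
        else:
            out.append(chr(tok))
    return "".join(out)


# Decomposed "han" (three Jamo) followed by a precomposed syllable.
original = "\u1112\u1161\u11ab\uae00"
tokens = tokenise(original)
restored = detokenise(tokens)
```

    Because the precomposed syllable and the Jamo sequence end up as
    different tokens, the round trip is lossless, which is the property that
    normalising to NFC first would give up.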

    > A 2% difference can be explained by the fact that bzip2 must still
    > discover the new "clusters" by encoding them first in their decomposed
    > form before using codes to represent the composed forms for the rest of
    > the text.

    Yes. Do we care about that 2%? Can we improve upon it?
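
    One rough way to check a figure like this on one's own data is to
    compress the same text in decomposed and composed form and compare the
    results. The sketch below uses Python's standard bz2 and unicodedata
    modules; compare_sizes is a hypothetical helper, and the repeated sample
    string is only a placeholder, so its numbers say nothing about real
    Korean text.

```python
import bz2
import unicodedata


def compare_sizes(text):
    """Return raw and bzip2-compressed UTF-8 sizes for NFD and NFC forms."""
    nfd = unicodedata.normalize("NFD", text).encode("utf-8")
    nfc = unicodedata.normalize("NFC", text).encode("utf-8")
    return {
        "nfd_raw": len(nfd),
        "nfc_raw": len(nfc),
        "nfd_bz2": len(bz2.compress(nfd)),
        "nfc_bz2": len(bz2.compress(nfc)),
    }


# Placeholder input: two Hangul syllables and a space, repeated.
# Substitute a real corpus to get a meaningful comparison.
sample = "\ud55c\uae00 " * 200
sizes = compare_sizes(sample)
print(sizes)
```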

    > > > Whether a "silent" normalization to NFC can be a legitimate part of
    > > > Unicode compression remains in question. I notice the list is still
    > > > split as to whether this process "changes" the text (because checksums
    > > > will differ) or not (because C10 says processes must consider the text
    > > > to be equivalent).
    >
    > And what about a compressor that would identify the source as being
    > Unicode, and would convert it first to NFC, but including composed forms
    > for the compositions normally excluded from NFC? This seems marginal but
    > some languages would have better compression results when taking these
    > canonically equivalent compositions into account, such as pointed Hebrew
    > and Arabic.

    Agreed. If we are to rely upon the equivalence of sequences, then there is
    no need to exclude such compositions.
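
    To make the idea concrete, here is a minimal Python sketch, assuming the
    compressor is free to emit any canonically equivalent sequence. It
    collects the characters whose two-character canonical decomposition is
    not recomposed by NFC (which covers the composition exclusions), then
    applies those compositions on top of NFC. The names build_extra_table
    and compose_beyond_nfc are hypothetical; the output is canonically
    equivalent to the input but is deliberately not in NFC.

```python
import unicodedata


def build_extra_table():
    """Map (first, second) pairs to composites that NFC refuses to form."""
    table = {}
    for cp in range(0x110000):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility one
        parts = decomp.split()
        if len(parts) != 2:
            continue  # singleton decompositions are not compositions
        if unicodedata.normalize("NFC", ch) != ch:
            pair = (chr(int(parts[0], 16)), chr(int(parts[1], 16)))
            table[pair] = ch
    return table


def compose_beyond_nfc(text, table):
    """Normalise to NFC, then apply the compositions NFC leaves out."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if out and (out[-1], ch) in table:
            out[-1] = table[(out[-1], ch)]  # merge with the previous char
        else:
            out.append(ch)
    return "".join(out)


EXTRA = build_extra_table()

# HEBREW LETTER SHIN followed by HEBREW POINT SHIN DOT.
pointed = "\u05e9\u05c1"
packed = compose_beyond_nfc(pointed, EXTRA)
```

    A decompressor that wants NFC output back would simply normalise the
    result again, since the packed form is canonically equivalent to it.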


