Re: Compression through normalization

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Dec 01 2003 - 08:43:51 EST


    On 01/12/2003 04:25, Philippe Verdy wrote:

    > ...
    >
    >And what about a compressor that would identify the source as being
    >Unicode, and would convert it first to NFC, but including composed forms
    >for the compositions normally excluded from NFC? This seems marginal but
    >some languages would have better compression results when taking these
    >canonically equivalent compositions into account, such as pointed Hebrew
    >and Arabic.
    >
    To get an idea of what orders of magnitude we are talking about here:

    The Hebrew Bible consists of about 2,881,000 Unicode characters
    including accents, or 2,632,000 excluding accents - these figures
    include spaces. Of these, about 172,000 are U+05BC dagesh or mapiq,
    46,000 are shin dot (U+05C1) and 12,000 are sin dot (U+05C2). All,
    or very nearly all, of these can be canonically composed with the
    preceding base characters into the precomposed characters
    U+FB30-U+FB4A, saving about 230,000 characters. A significant
    number of further combinations could be composed into U+FB2E,
    U+FB2F and U+FB4B. So the Hebrew text could be compressed by
    something around 10% simply by composing it with characters already
    defined. This compressed version is canonically equivalent to the
    uncompressed one, but it is not normalised, because those
    precomposed characters are on the composition exclusion list.
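
    To see the mechanics on a single pair of characters - a minimal
    sketch in Python using the standard unicodedata module, with U+FB31
    chosen purely as an illustrative example:

        import unicodedata

        # U+05D1 HEBREW LETTER BET + U+05BC HEBREW POINT DAGESH OR MAPIQ
        decomposed = "\u05D1\u05BC"
        # U+FB31 HEBREW LETTER BET WITH DAGESH (composition-excluded)
        composed = "\uFB31"

        # Canonically equivalent: both spellings share the same NFD form.
        assert unicodedata.normalize("NFD", composed) == decomposed

        # But NFC never produces the precomposed form, and even undoes it,
        # because U+FB31 is on the composition exclusion list.
        assert unicodedata.normalize("NFC", decomposed) == decomposed
        assert unicodedata.normalize("NFC", composed) == decomposed

        # The composed spelling is one code point shorter - the saving
        # that adds up to roughly 230,000 characters over the whole text.
        print(len(composed), len(decomposed))   # 1 2

    Each such composition saves one code point per occurrence, which is
    where the figure of about 230,000 characters comes from.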

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

