New paper on Unicode compression

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Dec 31 2003 - 12:36:23 EST

    I'm pleased to announce the release of my new paper, "A survey of
    Unicode compression":

    http://users.adelphia.net/~dewell/compression.html

    This 21-page paper is a moderately technical discussion of the various
    ways in which Unicode text can be compressed for storage and
    interchange. Several different approaches are examined and evaluated.
    Specific topics include:

    * UTF-16, UTF-8, and 8-bit legacy character sets
    * the Unicode "compression formats," SCSU and BOCU-1
    * general-purpose compression algorithms (RLE, Huffman, LZW)
    * using multiple compression techniques together
    * using canonical equivalence to improve compression
    * a detailed description of a SCSU encoder

    Although the paper assumes a basic understanding of Unicode, it explains
    certain terms related to Unicode and information theory. No complicated
    mathematical theory is included. The paper is intended for anyone
    interested in the details of Unicode compression, not just programmers,
    although the sample SCSU encoder will probably be of interest only to
    programmers.

    It's available in HTML format directly from the URL given above, and it
    can also be downloaded in either Adobe PDF or Microsoft Word format
    (zipped or unzipped).

    Enjoy,

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


