Re: Amount of Space

From: Doug Ewell (dewell@roadrunner.com)
Date: Tue Jul 17 2007 - 00:31:22 CDT

Next message: WuAllen: "unsubscribe"

Previous message: Doug Ewell: "Re: Subj: Amount of Space Unicode Takes"
In reply to: William J Poser: "Amount of Space"

William J Poser <wjposer at ldc dot upenn dot edu> wrote:

> If you only need certain ranges, you may be able to find an ad hoc
> compression scheme that saves a lot of space. For example, if you need
> a range that encodes as three or four bytes in UTF-8 and otherwise
> only ASCII, you might save a lot of space simply by subtracting the
> base codepoint of the range from each codepoint and adding it
> again on decompression. Depending on the case, the fact that a
> particular code represents ASCII or the upper range could be indicated
> either by markup or by downshifting by the base codepoint minus 128,
> so that any codepoint above 127 would be in the non-ASCII range.

This is sort of a composite of the SCSU
(http://www.unicode.org/reports/tr6/) and BOCU-1
(http://www.unicode.org/notes/tn6/) approaches. Differential
compression works well when encoding the differences is cheaper than
encoding the code points themselves, whether directly, as offsets into
windows (as SCSU does), or in some other way.
BOCU-1 defines a system of lead bytes and trail bytes that keeps the
byte count down for typical cases.
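
To make the quoted downshifting idea concrete, here is a minimal Python
sketch. It is neither SCSU nor BOCU-1, only an illustration under stated
assumptions: the text contains nothing but ASCII plus one block of at
most 128 code points, and the function names and the choice of the
Devanagari block (base U+0900) are hypothetical.

```python
def downshift(text, base):
    """Pack ASCII plus one 128-code-point block into single bytes.

    Assumption: every non-ASCII character lies in [base, base + 128).
    """
    shift = base - 128  # "the base codepoint minus 128"
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(cp)            # ASCII is stored unchanged
        elif base <= cp < base + 128:
            out.append(cp - shift)    # lands in 128..255
        else:
            raise ValueError(f"U+{cp:04X} is outside the chosen range")
    return bytes(out)


def upshift(data, base):
    """Inverse of downshift: add the shift back to bytes above 127."""
    shift = base - 128
    return "".join(chr(b) if b < 128 else chr(b + shift) for b in data)


# Six Devanagari code points (three UTF-8 bytes each) plus three ASCII.
sample = "\u0928\u092e\u0938\u094d\u0924\u0947 hi"
packed = downshift(sample, 0x0900)
print(len(sample.encode("utf-8")), len(packed))  # 21 9
```

The saving disappears as soon as the text needs a second range, which
is the problem that SCSU's switchable windows address.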

I built a Huffman encoder for Unicode text, and stored the tree using
XOR differences between code points instead of using the code points
themselves. This technique led to either slightly worse or
substantially better results, depending on the language and script. So
William has a point about using differences, but only testing can show
whether the added compression justifies the effort.
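
As a rough illustration of XOR differencing, the sketch below replaces
each code point in a list with its XOR against the previous one. This
is not the tree format described above, whose details are not given
here; the function names and the sample symbol list are hypothetical.

```python
def xor_deltas(code_points):
    """Replace each code point with its XOR against the previous one."""
    prev = 0
    deltas = []
    for cp in code_points:
        deltas.append(cp ^ prev)
        prev = cp
    return deltas


def undo_xor_deltas(deltas):
    """Recover the original code points from their XOR differences."""
    prev = 0
    code_points = []
    for d in deltas:
        prev ^= d
        code_points.append(prev)
    return code_points


# Four Devanagari code points, as might appear as symbols in a tree.
symbols = [0x0928, 0x092E, 0x0938, 0x094D]
deltas = xor_deltas(symbols)
print([hex(d) for d in deltas])  # ['0x928', '0x6', '0x16', '0x75']
```

After the first entry, the differences are small whenever neighbouring
symbols share their high bits. Two neighbours that straddle a
power-of-two boundary behave differently: 0x7FF ^ 0x800 is 0xFFF, which
is larger than either input.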

I suspect Daniel has something much simpler in mind, and in the end the
best answer for him is probably to just use UTF-8.

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

