Re: Unicode & space in programming & l10n

From: Mark Davis (mark.davis@icu-project.org)
Date: Wed Sep 20 2006 - 16:25:53 CDT


    I strongly suspect that all of that would give only minor advantages over
    general-purpose algorithms like ZIP. But this is all academic -- I don't see
    anyone taking the time and effort to investigate it in the absence of a
    compelling need.

    Mark

    On 9/20/06, Hans Aberg <haberg@math.su.se> wrote:
    >
    >
    > On 20 Sep 2006, at 04:14, Doug Ewell wrote:
    >
    > > Hans Aberg <haberg at math dot su dot se> wrote:
    > >
    > >> It is probably more efficient to translate the stream into code
    > >> points and then use a compression technique on that, because then
    > >> the full character structure is taken into account. Then it does
    > >> not matter which character encoding is used.
    > >
    > > If you have not yet read Unicode Technical Note #14, particularly
    > > the sections on "general-purpose compression" and "two-layer
    > > compression," you might wish to do so.
    >
    > Relative to that stuff, I suggest compressing the character data, as
    > represented by the code points, rather than any character-encoded
    > data. Typically, a compression method builds a binary encoding based
    > on a statistical analysis of a sequence of data units. So when applied
    > to the character data, such a compression yields a character encoding.
    > Conversely, any character encoding can be viewed as a compression
    > method with certain statistical properties.
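    >
    > For instance, here is a rough sketch in Python, taking Huffman coding
    > as the statistical method (one possible choice among many; the
    > function name is mine):
    >
    >     import heapq
    >     from collections import Counter
    >
    >     def huffman_code(text):
    >         # Statistical analysis of the sequence of data units,
    >         # here the code points of the text.
    >         freq = Counter(text)
    >         # Degenerate case: a single distinct code point.
    >         if len(freq) == 1:
    >             return {next(iter(freq)): "0"}
    >         # Build the Huffman tree; the counter i breaks ties so
    >         # the heap never has to compare trees directly.
    >         heap = [(w, i, cp) for i, (cp, w) in enumerate(freq.items())]
    >         heapq.heapify(heap)
    >         i = len(heap)
    >         while len(heap) > 1:
    >             w1, _, a = heapq.heappop(heap)
    >             w2, _, b = heapq.heappop(heap)
    >             heapq.heappush(heap, (w1 + w2, i, (a, b)))
    >             i += 1
    >         # Walk the tree: the result is a binary encoding of the
    >         # characters derived purely from the input statistics.
    >         codes = {}
    >         def walk(node, prefix):
    >             if isinstance(node, tuple):
    >                 walk(node[0], prefix + "0")
    >                 walk(node[1], prefix + "1")
    >             else:
    >                 codes[node] = prefix
    >         walk(heap[0][2], "")
    >         return codes
    >
    > For example, huffman_code("abracadabra") assigns the frequent "a" a
    > shorter bit string than the rare "c" or "d".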
    >
    > When compressing character-encoded data, one first translates it into
    > character data and compresses that. It then does not matter which
    > character encoding was originally used in the input, as the character
    > data will be the same: the compressed output need only include, as
    > additional information, the original character encoding, so that the
    > data can be restored.
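    >
    > A rough sketch of this in Python, assuming zlib as the general-purpose
    > compressor and UTF-32 as the uniform serialization of the code points
    > (both arbitrary choices for illustration, as are the function names):
    >
    >     import zlib
    >
    >     def compress_text(raw, encoding):
    >         # Translate the encoded input into character data.
    >         text = raw.decode(encoding)
    >         # Compress one fixed serialization of the code points,
    >         # so the result does not depend on the input encoding.
    >         payload = zlib.compress(text.encode("utf-32-le"))
    >         # Record the original encoding to restore the input.
    >         return encoding.encode("ascii") + b"\x00" + payload
    >
    >     def restore_text(blob):
    >         name, _, payload = blob.partition(b"\x00")
    >         text = zlib.decompress(payload).decode("utf-32-le")
    >         return text.encode(name.decode("ascii"))
    >
    > Two inputs carrying the same text in, say, Latin-1 and UTF-8 then
    > compress to the same payload, differing only in the recorded name.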
    >
    > There is the problem of large translation tables. But that belongs to
    > the chapter of table compression; alternatively, one can use a set
    > of character encodings that, though not providing the most efficient
    > compression, admit compact translation functions. On the other
    > hand, a translation table of just a hundred thousand characters is
    > not so big anymore on today's computers.
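    >
    > A sketch in Python of one such compact scheme (my illustration): the
    > translation table is restricted to the code points that actually
    > occur in the input, so it stays small and can travel with the
    > compressed data:
    >
    >     def build_table(text):
    >         # The table need only cover the code points that occur,
    >         # which is usually far fewer than the full repertoire.
    >         inverse = sorted(set(text))
    >         table = {cp: i for i, cp in enumerate(inverse)}
    >         # Translate the characters into dense indices, which a
    >         # compressor can then encode; inverse restores them.
    >         indices = [table[cp] for cp in text]
    >         return indices, inverse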
    >
    > And one can go further, doing a statistical analysis of typical text
    > in the different languages, identifying words and their typical
    > frequencies. A compression would then identify common words, suitable
    > for compression, and give each of them one entry in the
    > translation table.
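    >
    > A rough sketch of that step in Python (the word pattern and the
    > cutoff are arbitrary illustrations):
    >
    >     import re
    >     from collections import Counter
    >
    >     def word_entries(text, max_words=1000):
    >         # Identify common words and give each one its own entry,
    >         # numbered after the code points already in the table.
    >         words = re.findall(r"\w+", text)
    >         common = [w for w, n in Counter(words).most_common(max_words)
    >                   if n > 1]
    >         base = len(set(text))
    >         return {w: base + i for i, w in enumerate(common)}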
    >
    > Hans Aberg
    >



    This archive was generated by hypermail 2.1.5 : Wed Sep 20 2006 - 16:34:01 CDT