Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Tue Sep 19 2006 - 14:05:53 CDT


    On 19 Sep 2006, at 20:35, Asmus Freytag wrote:

    > On 9/19/2006 4:47 AM, Hans Aberg wrote:
    >> On 19 Sep 2006, at 07:46, Asmus Freytag wrote:
    >>
    >>> In such situations, you cannot afford to compress/uncompress, as
    >>> most data is seen only once.
    >>
    >> Sure you can, you merely cannot base the compression on the
    >> whole of the data. So either divide it into subpackets, or
    >> make an assumption about what the statistical proportions might
    >> be. This is not as efficient as compressing the whole data, but
    >> modems and streaming video and the like use compression
    >> techniques, so it is surely possible to do it on the fly on a
    >> stream.
    > By "cannot afford" I did not say it was impossible, but meant to
    > say that it's pointless as it does not gain you anything
    > worthwhile. You incur the same cache misses whether you scan bulky
    > data for processing or compression. However, if processing involves
    > repeated data access, compression can pay off whenever there's a
    > sufficiently high effective compression rate.

    For efficient scanning it is important to keep active data in RAM,
    and if the scanned data is compressed, you need less RAM.

    So what you say is wholly wrong.
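    The on-the-fly compression being discussed can be sketched with
    Python's zlib streaming objects (my illustration, not something
    from this thread; the function names are my own):

```python
import zlib

def stream_compress(chunks):
    """Compress a stream packet by packet, never holding the whole data."""
    c = zlib.compressobj()
    for chunk in chunks:
        out = c.compress(chunk)
        if out:
            yield out
    yield c.flush()  # emit whatever is still buffered

def stream_decompress(chunks):
    """Inverse of stream_compress: decompress packet by packet."""
    d = zlib.decompressobj()
    for chunk in chunks:
        out = d.decompress(chunk)
        if out:
            yield out
    tail = d.flush()
    if tail:
        yield tail

# Mostly-ASCII text, fed in small subpackets as a modem or video
# stream would see it.
data = ("hello, world -- " * 200).encode("utf-8")
packets = [data[i:i+64] for i in range(0, len(data), 64)]

compressed = b"".join(stream_compress(packets))
restored = b"".join(stream_decompress([compressed]))
assert restored == data
assert len(compressed) < len(data)
```

    The point of the sketch is only that the compressor keeps a small
    rolling state, so neither side ever needs the whole stream in RAM
    at once.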

    >>> Finally, if most (much) of your data is ASCII due to the ASCII
    >>> bias of protocols, then any format that's close to ASCII is
    >>> beneficial. UTF-8 fits that bill. SCSU and BOCU take too much
    >>> processing time compared to UTF-8, and UTF-16/32 take too much
    >>> space given the assumptions.
    >>>
    >>> Add to that the fact that often data streams are already in
    >>> UTF-8, and that format becomes the format of choice for
    >>> applications that have the constraints mentioned. (As has been
    >>> pointed out, the 'bloat' for CJK is not a factor as long as the
    > data always contains a high proportion of ASCII.)
    >>
    >> It is probably more efficient to translate the stream into code
    >> points and then use a compression technique on that, because then
    >> the full character structure is taken into account. Then it does
    >> not matter which character encoding is used.
    > ???

    ???
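    The decode-then-compress idea quoted above can be sketched as
    follows (a hypothetical illustration; the thread names no specific
    compressor, and zlib plus a fixed-width canonical form are my own
    choices):

```python
import zlib

def compress_code_points(data: bytes, encoding: str) -> bytes:
    """Decode to code points first, then compress the code-point
    sequence, so the result does not depend on the input encoding."""
    code_points = data.decode(encoding)          # abstract code points
    canonical = code_points.encode("utf-32-be")  # one fixed-width form
    return zlib.compress(canonical)

text = "naïve café 漢字"
utf8_result = compress_code_points(text.encode("utf-8"), "utf-8")
utf16_result = compress_code_points(text.encode("utf-16-le"), "utf-16-le")
# Same code points, so the same compressed output either way.
assert utf8_result == utf16_result
```

    Whether this beats compressing the raw UTF-8 bytes directly would
    depend on the data and the compressor; the sketch only shows the
    encoding-independence being claimed.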

       Hans Aberg



    This archive was generated by hypermail 2.1.5 : Tue Sep 19 2006 - 14:10:41 CDT