Re: Unicode & space in programming & l10n

From: Asmus Freytag
Date: Tue Sep 19 2006 - 13:35:09 CDT


    On 9/19/2006 4:47 AM, Hans Aberg wrote:
    > On 19 Sep 2006, at 07:46, Asmus Freytag wrote:
    >> In such situations, you cannot afford to compress/uncompress, as most
    >> data is seen only once.
    > Sure you can; you merely cannot base the compression on the whole of
    > the data. So either divide it into subpackets, or make an assumption
    > about what the statistical proportions might be. This is not as
    > efficient as compressing the whole data, but modems, streaming video,
    > and the like use compression techniques, so it is surely possible to
    > do it on the fly on a stream.
    By "cannot afford" I did not mean that it is impossible, but that it is
    pointless, as it does not gain you anything worthwhile. You incur the
    same cache misses whether you scan bulky data for processing or for
    compression. However, if processing involves repeated data access,
    compression can pay off whenever there's a sufficiently high effective
    compression ratio.
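
For readers who want to see what compressing "on the fly" looks like, here is a minimal sketch using Python's standard zlib module. It is only an illustration of incremental compression over chunks, not anything either poster proposed; the function names and the sample packets are hypothetical.

```python
import zlib

def compress_stream(chunks, level=6):
    """Compress an iterable of byte chunks incrementally.

    Output is yielded as soon as the compressor produces any, so the
    whole input never has to be held in memory at once.
    """
    comp = zlib.compressobj(level)
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    # Emit whatever the compressor is still buffering.
    yield comp.flush()

def decompress_stream(chunks):
    """Inverse of compress_stream, also working chunk by chunk."""
    decomp = zlib.decompressobj()
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    tail = decomp.flush()
    if tail:
        yield tail

# Hypothetical sample data: repetitive, ASCII-heavy protocol text.
packets = [b"GET /index.html HTTP/1.1\r\n" * 50,
           b"Host: example.org\r\n" * 50]
compressed = b"".join(compress_stream(packets))
restored = b"".join(decompress_stream([compressed]))
```

Note that the compressor still has to touch every input byte once, which is the cost discussed above: if the data is scanned only once anyway, the compression pass does not save any memory traffic.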
    >> Finally, if most (much) of your data is ASCII due to the ASCII bias
    >> of protocols, then any format that's close to ASCII is beneficial.
    >> UTF-8 fits that bill. SCSU and BOCU take too much processing time
    >> compared to UTF-8, and UTF-16/32 take too much space given the
    >> assumptions.
    >> Add to that the fact that often data streams are already in UTF-8,
    >> and that format becomes the format of choice for applications that
    >> have the constraints mentioned. (As has been pointed out, the 'bloat'
    >> for CJK is not a factor as long as the data always contains a high
    >> proportion of ASCII.)
    > It is probably more efficient to translate the stream into code points
    > and then use a compression technique on that, because then the full
    > character structure is taken into account. Then it does not matter
    > which character encoding is used.
    > Hans Aberg
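
The size argument about ASCII-heavy data can be checked with a small sketch like the one below. It assumes Python's built-in codecs; the helper name and the sample strings are made up for illustration and are not from the thread.

```python
def encoded_sizes(text):
    """Return the byte length of text in each Unicode encoding form.

    The -le variants are used so that no byte order mark is counted.
    """
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Hypothetical sample: CJK content wrapped in ASCII markup.
markup = '<p class="title">' + "日本語のテキスト" + "</p>"
mixed = encoded_sizes(markup)

# The same CJK content with no ASCII around it.
pure_cjk = encoded_sizes("日本語のテキスト")

for label, sizes in (("mixed", mixed), ("pure CJK", pure_cjk)):
    print(label, sizes)
```

For the pure CJK string UTF-8 comes out larger than UTF-16, while for the marked-up string the ASCII portion more than offsets that, which is the situation the quoted paragraph describes.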

    This archive was generated by hypermail 2.1.5 : Tue Sep 19 2006 - 13:36:58 CDT