Re: Unicode & space in programming & l10n

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Sep 21 2006 - 08:57:22 CDT


    From: "Asmus Freytag" <asmusf@ix.netcom.com>
    > There are certain types of applications (web log processing for one)
    > where handling vast amount of "text" data is important. In these
    > situations, a reasonably dense representation of data will enable more
    > processing with many fewer cache misses. The size of the mass-storage
    > device is irrelevant, it's the size of your 'peephole' represented by
    > your cache that's the limiting factor.
    >
    > In such situations, you cannot afford to compress/uncompress, as most
    > data is seen only once.

    There's a contradiction here between the two paragraphs: if data is most often seen only once, then the size of the cache does not matter; what matters is the bandwidth and access time of your storage.

    > Finally, if most (much) of your data is ASCII due to the ASCII bias of
    > protocols, then any format that's close to ASCII is beneficial. UTF-8
    > fits that bill. SCSU and BOCU take too much processing time compared to
    > UTF-8, and UTF-16/32 take too much space given the assumptions.

    That's unbelievable. The processing time for compression schemes (even for general compression algorithms) is negligible for data that is seen mostly once; it only matters for data that is extensively reused, because then most of the processing takes place within the CPU rather than in the external cache, memory, or storage.
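
    For data that is seen only once, decompression can happen on the fly while streaming, without ever materializing the decompressed file. A minimal sketch in Java, using java.util.zip (the log file name is just a placeholder):

        import java.io.BufferedReader;
        import java.io.FileInputStream;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.GZIPInputStream;

        public class CountLines {
            public static void main(String[] args) throws Exception {
                // Hypothetical compressed web log, decompressed on the fly in a single pass;
                // only a small buffer of decompressed data ever exists in memory.
                try (BufferedReader r = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(new FileInputStream("access.log.gz")),
                        StandardCharsets.UTF_8))) {
                    System.out.println("lines: " + r.lines().count());
                }
            }
        }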

    > Add to that the fact that often data streams are already in UTF-8, and
    > that format becomes the format of choice for applications that have the
    > constraints mentioned. (As has been pointed out, the 'bloat' for CJK is
    > not a factor as long as the data always contains high proportion of ASCII.)

    If you consider the case of blogs, community forums, or text archives (for example legal resources or news articles), the proportion of text in the overall data will easily exceed the size of the other documents or applications. Database storage, and its access time, will become the most limiting factor, much more than the communications: network bandwidth is not so expensive now, and it can support lots of simultaneous users with good responsiveness. The real cost will not even be in the server handling user requests, but in the database engine, whose performance will be limited by the way the data is organized and compressed.

    Compression solves part of the problem, but the other important part is organizing the data so it can be retrieved fast; text indexing then becomes an issue, as it also adds to the required storage, and the speed of the application will depend heavily on how that storage is organized. A toy example is given below.
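
    As a reminder of why indexing adds to the storage, here is a toy inverted index in Java; the token-to-document map is an extra structure kept alongside the text itself (a real engine would also compress and sort its postings):

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Toy inverted index: maps each token to the ids of the documents containing it.
        // The postings map is stored in addition to the documents themselves.
        public class TinyIndex {
            private final Map<String, List<Integer>> postings = new HashMap<>();

            public void add(int docId, String text) {
                for (String token : text.toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        // duplicates are not removed in this sketch
                        postings.computeIfAbsent(token, t -> new ArrayList<>()).add(docId);
                    }
                }
            }

            public List<Integer> lookup(String token) {
                return postings.getOrDefault(token.toLowerCase(), List.of());
            }
        }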

    > SCSU was developed for a system that had bandwidth limitations in
    > straight transmission. Except for communications to remote areas
    > (wilderness, marine, space etc.) such severe limitations on transmission
    > bandwidth are a thing of the past, and even then, the block compression
    > algorithms can often be used to good advantage.

    SCSU is definitely not complicated to decompress: it can be implemented as a simple sequential stream in a very modest object instance, with very small state variables and a very simple algorithm that offers good locality and few operations. In most cases, when you have to manage large amounts of text, most of it is accessed read-only, and SCSU (or BOCU, or even general LZ-based algorithms) has a very small performance footprint. In fact, keeping the data compressed will dramatically reduce its memory footprint, reduce the number of paging operations to disk, increase data locality, and increase the benefit of processor caches (by reducing how often entries are discarded to make room for new data). This effect is far more positive for performance than the few cycles lost decompressing data with decompression algorithms as simple as SCSU or BOCU.
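
    To give an idea of how small that state is, here is a minimal sketch of SCSU's single-byte mode in Java. It only handles ASCII passthrough, the SC0-SC7 window-switch tags, and the eight default dynamic windows from UTR #6; it ignores Unicode mode, quoting, window redefinition, and supplementary planes, so it illustrates the technique rather than being a conforming decoder:

        // Minimal sketch of SCSU single-byte mode (not a conforming decoder).
        final class ScsuSketch {
            // Default dynamic window offsets (UTR #6): Latin-1, Latin Extended,
            // Cyrillic, Arabic, Devanagari, Hiragana, Katakana, Fullwidth ASCII.
            private static final int[] WINDOW_OFFSET = {
                0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00
            };
            private int activeWindow = 0; // the whole decoder state in this sketch

            String decode(byte[] in) {
                StringBuilder out = new StringBuilder(in.length);
                for (byte raw : in) {
                    int b = raw & 0xFF;
                    if ((b >= 0x20 && b < 0x80) || b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0D) {
                        out.append((char) b);              // ASCII passes through unchanged
                    } else if (b >= 0x10 && b <= 0x17) {
                        activeWindow = b - 0x10;           // SCn: switch the active dynamic window
                    } else if (b >= 0x80) {
                        // high byte: offset into the active window
                        out.append((char) (WINDOW_OFFSET[activeWindow] + (b - 0x80)));
                    }
                    // other tag bytes (SQn, SDn, SQU, SCU, ...) are ignored in this sketch
                }
                return out.toString();
            }
        }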

    Remember how you can really get more performance out of a PC: not so much by using a processor with more gigahertz, but first by adding RAM (to reduce the number of page swaps and get larger caches), then by adding more fast cache between RAM and the processor (because RAM sits on a bus shared by all sorts of devices, not only the processor), and then by using a faster data bus. The internal clock frequency of the processor plays a minor role when processing large amounts of data, as most of the time the CPU will just be idle, waiting for data to be retrieved because it is missing from the internal cache, the external cache, or memory (and sometimes also from the host itself, when the data is stored on an external storage system or database).

    Don't look at the time it takes to decompress a large zipped archive. Consider instead the case where your ZIP archive contains an index of the many files stored in it (look at the JARs of the standard Java distribution), and then consider how fast individual pieces can be accessed simply because the compressed archive is much smaller than the equivalent decompressed data. There is absolutely no need to decompress it and store the result, as the compressed data can be used directly through sequential read access.
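
    As an illustration (the archive and entry names are placeholders), java.util.zip.ZipFile reads the archive's central directory and then inflates only the requested entry, as a sequential stream, while the rest of the archive stays compressed on disk:

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.ZipEntry;
        import java.util.zip.ZipFile;

        public class ReadOneEntry {
            public static void main(String[] args) throws Exception {
                // Hypothetical archive and entry names, for illustration only.
                try (ZipFile zip = new ZipFile("logs.zip")) {
                    ZipEntry entry = zip.getEntry("2006-09-21.log");
                    // Only this entry is decompressed, streamed sequentially.
                    try (BufferedReader r = new BufferedReader(new InputStreamReader(
                            zip.getInputStream(entry), StandardCharsets.UTF_8))) {
                        r.lines().forEach(System.out::println);
                    }
                }
            }
        }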

    However, what I am really wondering is why we need BOCU or SCSU at all, when general compression algorithms (deflate, gzip, ...) have far more applications and can be used on many other things than just plain text.
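
    The general-purpose machinery really does work on any byte stream. A small sketch using java.util.zip to deflate UTF-8 text (the sample text is arbitrary):

        import java.io.ByteArrayOutputStream;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.DeflaterOutputStream;

        public class DeflateText {
            public static void main(String[] args) throws Exception {
                // Arbitrary sample text; any byte stream (text, logs, binary) would do.
                byte[] utf8 = "Some UTF-8 text, repeated enough to be compressible. "
                        .repeat(100).getBytes(StandardCharsets.UTF_8);
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (DeflaterOutputStream def = new DeflaterOutputStream(buf)) {
                    def.write(utf8);   // deflate knows nothing about character encodings
                }
                System.out.println("UTF-8 bytes:    " + utf8.length);
                System.out.println("Deflated bytes: " + buf.size());
            }
        }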

    Philippe.


