RE: Factor implements 24-bit string type for Unicode support

From: Philippe Verdy (
Date: Tue Feb 05 2008 - 11:09:21 CST

  • Next message: Asmus Freytag: "Re: Factor implements 24-bit string type for Unicode support"

    Hans Aberg wrote:
    > Sent: Monday, 4 February 2008, 14:49
    > To: Jeroen Ruigrok van der Werven
    > Cc:
    > Subject: Re: Factor implements 24-bit string type for Unicode support
    > On 3 Feb 2008, at 22:45, Jeroen Ruigrok van der Werven wrote:
    > >
    > >
    > > Personally I'd wonder about this. I can understand the desire to
    > > shave bytes
    > > off in-memory, but given a lot of platforms having issues with
    > > non-32 bit
    > > boundaries and the resulting performance or alignment issues I
    > > seriously
    > > wonder if it is worth the trade off of not just using UCS4 internally.
    > I think that 32-bit is probably best for internal use in programs for
    > speed, avoiding alignment problems; the best way to actually know is
    > to do some profiling. Externally, for distributed files, UTF-8 seems
    > best, because most agree on how to sort out the bits and the bytes.

    First of all, you should not assert that platforms have "problems" handling
    data at non-32-bit boundaries. It's true that they may suffer some
    additional cycle penalties, but this depends highly on the structure of the
    memory caches, and in fact it will most often be much more costly to suffer
    a cache-miss penalty just because you have wasted 25% of the fast cache.

    Reading memory byte by byte is not as costly as it appears, and rebuilding
    a 32-bit entity from 3 separate bytes has a negligible cost in comparison
    to the memory access time, which greatly depends on data locality (in
    memory, or even worse when that memory is paged out to disk). The
    performance penalty of a very basic compression scheme (one that can be
    decompressed with just three shifts and two ORs, which are parallelized in
    today's processors with multiple pipelines) is really very small compared
    to the benefit of using it.

    So yes, it's true that UTF-32 will be more efficient, but only when
    handling very small volumes of data (below about 1 MB in a single-threaded
    environment, or below about 64 KB in a multithreaded environment).

    Today, almost all environments are massively multithreaded and run on OSes
    with many concurrent processes as well; the multiple cores each run with
    their own very fast data cache, but each one is limited in size, and there
    are several stages of caches, including in the OS itself with paged-out
    memory, and in modern deployments where data is located on another remote
    host or server. At the same time, the total size of databases has also
    exploded, and computers are used to process much more massive quantities
    of text.

    As always, it is the bandwidth of the data pipes that limits performance,
    and some basic compression that saves 25% of the data size is certainly a
    good thing if it helps reduce cache misses in one of the various stages of
    data caches that are now used everywhere. You cannot conclude as a general
    rule that UTF-32 will always be better, and experience shows that data
    locality (and reduced data size) plays a large role in increased
    performance, given that the cost of compression/decompression keeps
    falling as technology evolves according to Moore's Law.

    The difficulty is finding the threshold at which compression saves time:
    it is no longer possible to determine it with a precompiled rule, without
    actual performance tests on the target platform (because there are
    thousands of possible configurations of CPU models, CPU speeds, internal
    cache sizes, external buses and caches, external disks...). You can only
    estimate that such a threshold exists, and good software should no longer
    be written assuming a unique external storage format or a specific
    compression scheme.

    This archive was generated by hypermail 2.1.5 : Tue Feb 05 2008 - 12:14:26 CST