RE: Factor implements 24-bit string type for Unicode support

From: Philippe Verdy (
Date: Wed Feb 06 2008 - 00:48:58 CST


    > -----Original Message-----
    > From: Asmus Freytag []
    > Sent: Tuesday, February 5, 2008 19:57
    > To:
    > Cc: 'Hans Aberg'; 'Jeroen Ruigrok van der Werven';
    > Subject: Re: Factor implements 24-bit string type for Unicode support
    > Philippe gave some very interesting arguments (complete with specific
    > figures)

    Actually no. I was not very precise, on purpose. My conclusion was that
    there's no evidence that using UTF-32 alone or UTF-8 alone, or any other
    variant alone will give a universal performance advantage.

    Well, after rereading the message as you received it, I should have
    corrected some obvious typos (missing letters, or "3é" instead of "32" from
    my French keyboard). Sorry for the inconvenience; the message went out too
    fast (blame Ctrl+Enter in Outlook for sending the message immediately while
    I was still editing it...).

    > but without citing his evidence or stating the assumptions. A
    > thorough comparison of the performance of large data volumes in the
    > various encoding forms would be interesting.

    My point is that all performance figures will be dependent on the volume of
    data to handle and on where it is located (or comes from, or will go to).
    That's because our modern architectures are much more complex now, with data
    going through many more data pipes with various performance bottlenecks, and
    various internal caches whose behaviour is impossible to predict with fixed
    rules in precompiled software.
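    As an illustrative sketch (not from the original discussion; the sample
    strings are hypothetical), the raw memory footprint of each encoding form
    already determines how much text fits in a cache of a given size, before
    any per-unit decoding cost is even considered:

    ```python
    # Hypothetical sketch: compare the byte footprint of UTF-8 vs UTF-32 for
    # different scripts. The smaller form fits more text into a fixed-size
    # cache, which is one reason no single encoding wins universally.
    samples = {
        "ascii-heavy": "hello world " * 1000,           # 1 byte/char in UTF-8
        "cjk-heavy": "\u4f60\u597d\u4e16\u754c" * 1000,  # 3 bytes/char in UTF-8
    }

    for name, text in samples.items():
        utf8 = text.encode("utf-8")
        utf32 = text.encode("utf-32-le")  # explicit endianness: no BOM emitted
        print(f"{name}: UTF-8 {len(utf8)} bytes, UTF-32 {len(utf32)} bytes, "
              f"ratio {len(utf32) / len(utf8):.2f}")
    ```

    On ASCII-heavy text UTF-32 costs four times the memory of UTF-8; on
    CJK-heavy text the gap shrinks to about 1.33, so the cache-pressure
    argument cuts differently depending on the script being processed.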

    I don't think that this will ever become simpler in the future, and we'll
    have to live with the fact that computing architectures will necessarily be
    heterogeneous, and that software will have to use auto-adaptive features
    (that's a good point for the newest architectures based on virtual machines,
    which are not optimized too early but allow auto-adaptation on the final
    target host on which the software will effectively run).

    I gave just some figures for the current situation with today's most common
    processors (which have internal cache sizes of about 64KB to 1MB; but this
    threshold is likely to change at any time, when a newer generation of
    processors runs at even higher frequencies but adds a new stage of data
    cache, which will be much more costly and therefore smaller than the
    previous legacy cache system, which will persist or be split into several
    levels).
    If processors continue their evolution, we'll soon have models with dozens
    of cores, each one having a small data cache, because it won't be possible
    to give 1MB to each of them; instead they will collaborate by exchanging
    data through several pipelines connected to larger caches (but these larger
    caches will have more concurrent accesses, so they will be slower, creating
    the need for the new data-cache stage in each core).

    The performance penalty of data misalignment is likely to disappear in
    practice, and it will no longer make a huge difference whether you read data
    byte by byte or as a whole 32-bit unit over a large bus. In fact, you'll
    immediately realize that even UTF-32 is misaligned given today's 64-bit
    processors, and that internally, processors use even larger buses when
    communicating with their fastest caches. (However, this current move may as
    well be reversed by reducing the bus width, due to synchronization issues at
    very high frequencies. Note that RAM technologies now favour serial 1-bit
    access over parallel buses for this reason: at very high frequencies, the
    exact length of each line becomes extremely important, and if it's not
    geometrically possible to ensure that every line delivers its data at the
    same time, the frequency must be limited to ensure correct synchronization
    of the data.)

    If processors follow what has happened in RAM technologies, they could as
    well reduce their internal working bus width and work in another way, using
    a network of many very small 1-bit cores with high redundancy, removing most
    of the mutual synchronization mechanisms that also require more energy to
    maintain their current state within internal buffers.
    Maybe you'll still be able to program your software using an x86 or IA64
    instruction set, but this will just be a virtual program that will be
    recompiled and reoptimized locally on the final host. If this happens, the
    predicted memory-alignment constraints baked into software will be a thing
    of the past. But what will survive is the fact that the "one-size-fits-all"
    optimization strategy will no longer apply, as various
    compression/decompression steps will be added everywhere, transparently, to
    provide the needed auto-adaptation and scalability to more heterogeneous
    environments.

    Now if you look at the level at which Unicode is specified, it is in terms
    of 32-bit code points. This will be the apparent level at which you'll
    program things, but it will not dictate the way the data is effectively
    stored in memory or on disk, or exchanged on the network, with various data
    compression steps (or even expansion to words larger than 32 bits!) applied
    when and where needed.
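    A small sketch of that separation (illustrative only; the sample text is
    hypothetical): the program-visible unit is the code point, while the stored
    bytes can use whichever encoding form the layer below prefers.

    ```python
    # Hypothetical sketch: the same sequence of code points can be stored in
    # several encoding forms; decoding any of them recovers identical values.
    text = "caf\u00e9 \U0001d11e"  # 'café' plus MUSICAL SYMBOL G CLEF (U+1D11E)

    stored_utf8 = text.encode("utf-8")       # compact, variable-width storage
    stored_utf32 = text.encode("utf-32-le")  # fixed-width 32-bit storage

    # The programming-level view: a sequence of 32-bit code point values.
    cps_utf8 = [ord(c) for c in stored_utf8.decode("utf-8")]
    cps_utf32 = [ord(c) for c in stored_utf32.decode("utf-32-le")]

    assert cps_utf8 == cps_utf32  # same code points, regardless of storage form
    print([hex(cp) for cp in cps_utf8])
    ```

    The two stored forms differ in size and alignment, but the decoded
    code-point sequence is identical, which is all the programming level sees.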

    This archive was generated by hypermail 2.1.5 : Wed Feb 06 2008 - 09:38:10 CST