Re: Nicest UTF

From: Antoine Leca (
Date: Mon Dec 06 2004 - 11:41:52 CST

    Asmus Freytag wrote:
    > A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
    > 3) additional cost of accessing 16-bit registers (per character)
    > For many processors, item 3 is not an issue.

    I do not know; I only know a few of them. For example, I do not know how
    Alpha, SPARC, or PowerPC handle 16-bit data (I have heard conflicting
    reports). I agree this was not an issue for the 80386, 80486, or Pentium.
    However, for the more recent processors (P6, Pentium 4, AMD K7 or K8), I am
    unsure, and I would appreciate insights.

    I remember reading that on the AMD K7, for instance, 16-bit instructions
    (all of them? a few? only ALU-related ones, i.e. excluding load and store,
    which is the point here? I do not know) are handled differently from the
    32-bit ones, e.g. with a reduced number of decoders. The impact could be
    significant.

    I also remember that when the P6 was launched (1995, as the Pentium Pro),
    Intel was widely criticised because 16-bit code actually performed worse
    than on an equivalent Pentium (although there was an advantage for 32-bit
    code). Of course, this should be seen in context: 16-bit (DOS/Windows 3.x)
    code was important then, and that has since faded. But I believe the
    reasoning behind the arguments still holds.

    Finally, there is certainly an issue with the need to add a prefix on x86
    processors. The issue is reduced on the Pentium 4 (because the prefix does
    not consume space in the L1 cache), but it still holds for the L2 cache.
    And the impact is noticeable: I do not have figures for access to UTF-16
    data, but I know that in 64-bit mode (on the AMD K8), the need for a prefix
    to access 64-bit data, which consumes code cache space, was given as the
    cause of a 1-3% penalty in execution time.
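
    To make the prefix cost concrete, here is a minimal sketch that compares
    hand-assembled encodings of the same load at three operand sizes. The
    byte values are the standard x86 encodings of a plain MOV from memory;
    the constant names and the helper prefix_overhead are made up for this
    illustration, and a real compiler may well pick other instructions.

```python
# Raw machine-code bytes for "load from the address in (e/r)bx".
# Names are illustrative only; the encodings are standard x86 MOV r, r/m.

# 32-bit mode, default operand size is 32 bits: no prefix needed.
MOV_EAX_MEM = bytes([0x8B, 0x03])        # mov eax, [ebx]

# 32-bit mode, 16-bit operand: needs the 0x66 operand-size prefix.
MOV_AX_MEM = bytes([0x66, 0x8B, 0x03])   # mov ax, [ebx]

# 64-bit mode, 64-bit operand: needs a REX.W prefix (0x48).
MOV_RAX_MEM = bytes([0x48, 0x8B, 0x03])  # mov rax, [rbx]


def prefix_overhead(prefixed: bytes, base: bytes) -> int:
    """Extra code bytes paid for the non-default operand size."""
    return len(prefixed) - len(base)


if __name__ == "__main__":
    print("16-bit load overhead:", prefix_overhead(MOV_AX_MEM, MOV_EAX_MEM))
    print("64-bit load overhead:", prefix_overhead(MOV_RAX_MEM, MOV_EAX_MEM))
```

    In both cases the cost is one extra byte per instruction, which is the
    code cache pressure referred to above.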

    Of course, such a tiny penalty is easily hidden by other factors, such as
    the others Dr. Freytag mentioned.

    > Given this little model and some additional assumptions about your
    > own project(s), you should be able to determine the 'nicest' UTF for
    > your own performance-critical case.

    My point was that the variability of these factors argues for keeping all
    three UTFs as candidates when one considers writing a "perfect-world"
    library. Can we say we are in agreement?

    By the way, this also means that the optimisations to consider inside the
    library could differ considerably, since the optimal uses can be
    significantly different. For example, use of UTF-32 might signal a user bias
    toward easy management of code points, disregarding memory use, so the
    library code for it should favour time over space (unrolling loops and
    similar techniques could be considered).
    UTF-8 /might/ be the reverse.
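
    The "easy management of code points" side of that trade-off can be
    sketched as follows. This is only an illustration, assuming little-endian
    UTF-32 without a BOM and well-formed UTF-8; the function names are
    hypothetical and not taken from any existing library.

```python
import struct


def nth_codepoint_utf32(buf: bytes, n: int) -> int:
    """Constant time: every code point occupies exactly 4 bytes."""
    return struct.unpack_from("<I", buf, 4 * n)[0]


def nth_codepoint_utf8(buf: bytes, n: int) -> int:
    """Linear time: scan for the n-th lead byte, then decode that sequence."""
    count = -1
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:  # not a continuation byte: a code point starts
            count += 1
            if count == n:
                j = i + 1
                # Extend over the continuation bytes of this sequence.
                while j < len(buf) and (buf[j] & 0xC0) == 0x80:
                    j += 1
                return ord(buf[i:j].decode("utf-8"))
    raise IndexError(n)


if __name__ == "__main__":
    text = "a\u00e9\U0001D11E z"
    print(hex(nth_codepoint_utf32(text.encode("utf-32-le"), 2)))
    print(hex(nth_codepoint_utf8(text.encode("utf-8"), 2)))
```

    The UTF-32 version is a single indexed load, at four bytes per code
    point; the UTF-8 version uses less memory for most text but has to walk
    the buffer.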

