Re: Nicest UTF

From: Andy Heninger
Date: Mon Dec 06 2004 - 11:57:05 CST


    Asmus Freytag wrote:
    > A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
    > 1) 1 extra test per character (to see whether it's a surrogate)

    In my experience with tuning a fair amount of utf-16 software, this test
    takes pretty close to zero time. All modern processors have branch and
    pipeline trickery that fairly effectively disappears the cost of a
    predictable branch within a tight loop. Occurrences of supplementary
    characters should generally be rare enough that the extra time to
    process them when they are encountered is not statistically significant.
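    The per-character test being discussed can be sketched as a plain
    code-point-counting loop over UTF-16 code units. This is a minimal
    illustration (the class and method names are mine, not from the
    thread); the surrogate check is the single, rarely-taken branch that
    branch prediction makes nearly free:

    ```java
    // Counting Unicode code points in UTF-16 text. The surrogate test
    // is one highly predictable branch per code unit; supplementary
    // characters are rare, so the branch is almost never taken.
    public class CodePointCount {
        static int countCodePoints(char[] text) {
            int count = 0;
            for (int i = 0; i < text.length; i++) {
                count++;
                // Test 1 from the cost model: is this a lead surrogate?
                if (Character.isHighSurrogate(text[i])
                        && i + 1 < text.length
                        && Character.isLowSurrogate(text[i + 1])) {
                    i++; // consume the trailing surrogate of the pair
                }
            }
            return count;
        }

        public static void main(String[] args) {
            // 'A', U+1D11E MUSICAL SYMBOL G CLEF (a surrogate pair), 'B'
            char[] s = "A\uD834\uDD1EB".toCharArray();
            System.out.println(countCodePoints(s)); // 3 code points in 4 code units
        }
    }
    ```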

    > 2) special handling every 100 to 1000 characters (say 10 instructions)
    > 3) additional cost of accessing 16-bit registers (per character)
    > 4) reduction in cache misses (each the equivalent of many instructions)

    This is a big deal. The cost of plowing through lots of text data with
    relatively simple processing appears to be dominated by the required
    memory bandwidth, assuming reasonably carefully written code.
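    A rough sketch of why bandwidth favors UTF-16 for mostly-BMP text
    (the sample string and arithmetic are mine, for illustration): each
    BMP character costs 2 bytes as a UTF-16 code unit but 4 bytes as a
    UTF-32 code point, so scanning the same text in UTF-32 form touches
    twice the memory.

    ```java
    public class FootprintSketch {
        public static void main(String[] args) {
            String text = "memory bandwidth dominates"; // BMP-only sample
            char[] utf16 = text.toCharArray();          // 2 bytes per element
            int[] utf32 = text.codePoints().toArray();  // 4 bytes per element
            System.out.println(utf16.length * 2L);      // bytes as UTF-16
            System.out.println(utf32.length * 4L);      // bytes as UTF-32: twice as many
        }
    }
    ```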

    > 5) reduction in disk access (each the equivalent of many many instructions)
    > For many operations, e.g. string length, both 1 and 2 are no-ops,
    > so you need to apply a reduction factor based on the mix of operations
    > you do perform, say 50%-75%.
    > For many processors, item 3 is not an issue.
    > For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each
    > occurrence depending on the architecture. Their relative weight depends
    > not only on cache sizes, but also on how many other instructions per
    > character are performed. For text scanning operations, their cost
    > does predominate with large data sets.
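    The point that items 1 and 2 are no-ops for string length can be
    seen directly in Java, where String is UTF-16 internally: the
    code-unit length involves no surrogate test at all, and only
    code-point-aware operations pay it. (The example string is mine.)

    ```java
    public class LengthVsCodePoints {
        public static void main(String[] args) {
            String s = "A\uD834\uDD1EB"; // 'A', U+1D11E (surrogate pair), 'B'
            // No surrogate handling needed: length is just the code-unit count.
            System.out.println(s.length());                      // 4 code units
            // Only code-point-aware operations perform the per-unit test.
            System.out.println(s.codePointCount(0, s.length())); // 3 code points
        }
    }
    ```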

          Andy Heninger


    This archive was generated by hypermail 2.1.5 : Mon Dec 06 2004 - 11:58:54 CST