From: Andy Heninger (firstname.lastname@example.org)
Date: Mon Dec 06 2004 - 11:57:05 CST
Asmus Freytag wrote:
> A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
> 1) 1 extra test per character (to see whether it's a surrogate)
In my experience tuning a fair amount of UTF-16 software, this test
takes pretty close to zero time. All modern processors have
branch-prediction and pipelining machinery that effectively hides the
cost of a predictable branch within a tight loop. Supplementary
characters should generally be rare enough that the extra time to
process them when they are encountered is not statistically significant.
> 2) special handling every 100 to 1000 characters (say 10 instructions)
> 3) additional cost of accessing 16-bit registers (per character)
> 4) reduction in cache misses (each the equivalent of many instructions)
This is a big deal. The cost of plowing through lots of text data with
relatively simple processing appears to be dominated by the required
memory bandwidth. Assuming reasonably carefully written code, the
smaller footprint of UTF-16 text translates fairly directly into a
speed advantage over UTF-32.
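As a rough illustration (assumed figures, not measurements): for
mostly-BMP text, each character costs 2 bytes in UTF-16 but always 4 in
UTF-32, so a scan over the 32-bit form touches about twice the memory:

    #include <cstdio>
    #include <cstddef>

    int main() {
        // Hypothetical document of a million mostly-BMP characters.
        std::size_t n = 1000000;
        std::printf("UTF-16: ~%zu bytes\n", n * sizeof(char16_t)); // ~2,000,000
        std::printf("UTF-32:  %zu bytes\n", n * sizeof(char32_t)); //  4,000,000
        return 0;
    }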
> 5) reduction in disk access (each the equivalent of many, many instructions)
> For many operations, e.g. string length, both 1 and 2 are no-ops,
> so you need to apply a reduction factor based on the mix of operations
> you do perform, say 50%-75%.
> For many processors, item 3 is not an issue.
> For 4 and 5, the multiplier is somewhere in the 100s or 1000s for each
> occurrence, depending on the architecture. Their relative weight depends
> not only on cache sizes, but also on how many other instructions per
> character are performed. For text-scanning operations, their cost
> does predominate with large data sets.
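To make that weighting concrete, a back-of-envelope cost model (the
cycle counts below are illustrative assumptions, not measurements):
amortizing one cache-line miss over the characters that fit in the line
shows why the memory terms dominate simple per-character work:

    #include <cstdio>

    int main() {
        const double workPerChar = 3.0;    // cycles of simple per-character processing
        const double missPenalty = 300.0;  // cycles per cache miss (the 100s-1000s multiplier)
        const double lineBytes   = 64.0;   // typical cache-line size

        // One miss per cache line of fresh text: UTF-16 fits 32
        // characters in a line, UTF-32 only 16, so UTF-32 misses
        // twice as often per character scanned.
        double utf16 = workPerChar + missPenalty / (lineBytes / 2.0);
        double utf32 = workPerChar + missPenalty / (lineBytes / 4.0);
        std::printf("estimated cycles/char: UTF-16 %.1f, UTF-32 %.1f\n",
                    utf16, utf32);
        return 0;
    }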
--
Andy Heninger
email@example.com