Re: Nicest UTF

From: Andy Heninger (andyh@jtcsv.com)
Date: Mon Dec 06 2004 - 11:57:05 CST

Next message: Edward H. Trager: "Re: OpenType not for Open Communication?"

Previous message: Antoine Leca: "Re: Nicest UTF"
In reply to: Asmus Freytag: "Re: Nicest UTF"
Next in thread: Arcane Jill: "RE: Nicest UTF"

Asmus Freytag wrote:
> A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
>
> 1) 1 extra test per character (to see whether it's a surrogate)

In my experience tuning a fair amount of UTF-16 software, this test
takes pretty close to zero time. All modern processors have branch
prediction and pipelining tricks that largely hide the cost of a
predictable branch within a tight loop. Supplementary characters
should generally be rare enough that the extra time spent processing
them when they do occur is not statistically significant.
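
To make the test in question concrete, here is a minimal sketch of a tight
UTF-16 loop that counts code points. The function name count_code_points is
made up for illustration and is not taken from any particular library. The
only per-character overhead relative to UTF-32 is the one mask-and-compare,
which is almost never true for typical text and so is easy for the branch
predictor to get right.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count code points in a UTF-16 buffer of `len`
   code units. An unpaired surrogate is counted as one code point. */
size_t count_code_points(const uint16_t *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        /* The "extra test": is this a lead surrogate (0xD800..0xDBFF)
           followed by a trail surrogate (0xDC00..0xDFFF)? */
        if ((s[i] & 0xFC00) == 0xD800 &&
            i + 1 < len &&
            (s[i + 1] & 0xFC00) == 0xDC00) {
            i++; /* rare path: consume the trail surrogate as well */
        }
        count++;
    }
    return count;
}
```

For a buffer holding 'A', one supplementary character (a surrogate pair),
and 'B', this returns 3 for 4 code units.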

>
> 2) special handling every 100 to 1000 characters (say 10 instructions)
>
> 3) additional cost of accessing 16-bit registers (per character)
>
> 4) reduction in cache misses (each the equivalent of many instructions)

This is a big deal. The cost of plowing through lots of text data with
relatively simple processing appears to be dominated by the required
memory bandwidth, assuming reasonably carefully written code.
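
As a rough illustration of the memory side, the sketch below computes how
many bytes a run of code points occupies in each encoding form. The function
names utf16_bytes and utf32_bytes are made up for this example, and it
assumes the input values are valid Unicode scalar values. For text that is
mostly in the BMP, the UTF-16 figure comes out at about half the UTF-32
figure, which is where the reduced cache and memory traffic comes from.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed to store n code points as UTF-32.
   Every code point takes exactly one 4-byte unit. */
size_t utf32_bytes(size_t n)
{
    return n * 4;
}

/* Hypothetical helper: bytes needed to store the same code points as
   UTF-16. BMP code points take one 2-byte unit; supplementary code
   points (U+10000 and above) take a surrogate pair, i.e. 4 bytes. */
size_t utf16_bytes(const uint32_t *cp, size_t n)
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++) {
        bytes += (cp[i] >= 0x10000) ? 4 : 2;
    }
    return bytes;
}
```

For the three code points U+0041, U+4E2D and U+1D538, utf16_bytes gives 8
and utf32_bytes gives 12.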

>
> 5) reduction in disk access (each the equivalent of many many instructions)
>
> For many operations, e.g. string length, both 1 and 2 are no-ops,
> so you need to apply a reduction factor based on the mix of operations
> you do perform, say 50%-75%.
>
> For many processors, item 3 is not an issue.
>
> For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each
> occurrence depending on the architecture. Their relative weight depends
> not only on cache sizes, but also on how many other instructions per
> character are performed. For text scanning operations, their cost
> does predominate with large data sets.
>

-- 
      Andy Heninger
      heninger@us.ibm.com


This archive was generated by hypermail 2.1.5 : Mon Dec 06 2004 - 11:58:54 CST