From: Mark Davis (email@example.com)
Date: Fri Dec 03 2004 - 11:56:38 CST
That's a good response. I would add a couple of other factors:
- What APIs will you be using? If most of the APIs take/return a particular
UTF, the cost of constant conversions will swamp many if not most other
- Asmus mentioned memory, but I'd like to add to that. When you are using
virtual memory, significant increases in memory usage will cause a
considerable slowdown because of swapping. This is especially important in
----- Original Message -----
From: "Asmus Freytag" <firstname.lastname@example.org>
To: "Doug Ewell" <email@example.com>; "Unicode Mailing List"
Sent: Friday, December 03, 2004 07:55
Subject: Re: Nicest UTF
> At 09:56 PM 12/2/2004, Doug Ewell wrote:
> >I use ... and UTF-32 for most internal processing that I write
> >myself. Let people say UTF-32 is wasteful if they want; I don't tend to
> >store huge amounts of text in memory at once, so the overhead is much
> >less important than one code unit per character.
> For performance-critical applications on the other hand, you need to use
> whichever UTF gives you the correct balance in speed and average storage
> size for your data.
> If you have very large amounts of data, you'll be sensitive to cache
> overruns. Enough so, that UTF-32 may be disqualified from the start.
> I have encountered systems for which that was true.
> If your 'per character' operations are based on parsing for ASCII symbols,
> e.g. HTML parsing, then both UTF-8 and UTF-16 allow you to process your
> data directly, w/o need to worry about the longer sequences. For such
> tasks, it may be that some processors will work faster if working in
> 32-bit chunks.
> However, many 'inner loop' algorithms, such as copy, can be implemented
> using native machine words, handling multiple characters, or parts of
> characters, at once, independent of the UTF.
> And even in those situations, the savings had better not be
> offset by cache limitations.
> A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider
> 1) 1 extra test per character (to see whether it's a surrogate)
> 2) special handling every 100 to 1000 characters (say 10 instructions)
> 3) additional cost of accessing 16-bit registers (per character)
> 4) reduction in cache misses (each the equivalent of many instructions)
> 5) reduction in disk access (each the equivalent of many many
> instructions)
> For many operations, e.g. string length, both 1 and 2 are no-ops,
> so you need to apply a reduction factor based on the mix of operations
> you do perform, say 50%-75%.
> For many processors, item 3 is not an issue.
> For 4 and 5, the multiplier is somewhere in the 100s or 1000s, for each
> occurrence depending on the architecture. Their relative weight depends
> not only on cache sizes, but also on how many other instructions per
> character are performed. For text scanning operations, their cost
> does predominate with large data sets.
> Given this little model and some additional assumptions about your
> own project(s), you should be able to determine the 'nicest' UTF for
> your own performance-critical case.
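
The quoted model can be turned into a rough back-of-the-envelope calculation. The sketch below is only one possible reading of it, not code from either message: the class name, the field names, the 64-byte cache line and the 100-instruction miss penalty are all assumptions, and the "reduction factor" is interpreted here as the fraction of operations that actually pay for items 1 and 2. Item 5 (disk access) is left out; it would enter the same way as item 4, with a larger block size and a larger penalty.

```python
from dataclasses import dataclass


@dataclass
class Utf16CostModel:
    """Per-character cost of UTF-16 relative to UTF-32, in instruction units.

    All defaults are illustrative assumptions, loosely taken from the
    ranges mentioned in the quoted message.
    """
    surrogate_interval: float = 1000.0  # item 2: one special case per this many characters
    surrogate_handling: float = 10.0    # item 2: instructions per special case
    applicable_fraction: float = 0.5    # share of operations that pay items 1 and 2
    reg16_penalty: float = 0.0          # item 3: zero on many processors
    cache_line_bytes: int = 64          # assumed cache line size
    cache_miss_cost: float = 100.0      # item 4: instructions per cache miss
    fits_in_cache: bool = False         # if True, item 4 brings no savings

    def extra_instructions(self) -> float:
        # Item 1 (one test per character) plus item 2 (amortized special
        # handling), scaled by the operation mix, plus item 3.
        per_char = 1.0 + self.surrogate_handling / self.surrogate_interval
        return self.applicable_fraction * per_char + self.reg16_penalty

    def cache_savings(self) -> float:
        # Item 4: in a sequential scan, misses per character are roughly
        # bytes per character divided by the cache line size.
        if self.fits_in_cache:
            return 0.0
        utf32_bytes = 4.0
        # Each surrogate pair costs one additional 2-byte code unit.
        utf16_bytes = 2.0 + 2.0 / self.surrogate_interval
        saved_bytes = utf32_bytes - utf16_bytes
        return saved_bytes / self.cache_line_bytes * self.cache_miss_cost

    def net_cost(self) -> float:
        # Positive: UTF-16 costs more per character than UTF-32.
        # Negative: UTF-16 comes out ahead.
        return self.extra_instructions() - self.cache_savings()


if __name__ == "__main__":
    for label, model in [
        ("large data set", Utf16CostModel()),
        ("data fits in cache", Utf16CostModel(fits_in_cache=True)),
    ]:
        print(f"{label}: net cost per character = {model.net_cost():+.3f}")
```

With these assumed numbers the sign of net_cost() flips depending on whether the data fits in cache, which matches the point that items 4 and 5 predominate for text scanning over large data sets. The figures themselves mean little until they are replaced with measurements from the target architecture.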