Re: Nicest UTF

From: Asmus Freytag (
Date: Fri Dec 03 2004 - 09:55:32 CST


    At 09:56 PM 12/2/2004, Doug Ewell wrote:
    >I use ... and UTF-32 for most internal processing that I write
    >myself. Let people say UTF-32 is wasteful if they want; I don't tend to
    >store huge amounts of text in memory at once, so the overhead is much
    >less important than one code unit per character.

    For performance-critical applications, on the other hand, you need to use
    whichever UTF gives you the right balance between speed and average
    storage size for your data.

    If you have very large amounts of data, you'll be sensitive to cache
    overruns, enough so that UTF-32 may be disqualified from the start.
    I have encountered systems for which that was true.

    If your 'per character' operations are based on parsing for ASCII symbols,
    e.g. HTML parsing, then both UTF-8 and UTF-16 allow you to process your
    data directly, without worrying about the longer sequences. For such
    tasks, some processors may work faster when working in 32-bit chunks.
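
    To illustrate the point about ASCII-driven parsing, here is a minimal
    sketch in C. The function names are made up for this example. It relies
    on two properties of the encodings: every byte of a multi-byte UTF-8
    sequence is 0x80 or above, and UTF-16 surrogates lie in 0xD800..0xDFFF,
    so a code unit equal to '<' can only be a real ASCII '<'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count '<' in a UTF-8 buffer without decoding it. Bytes that belong to
   multi-byte sequences are all >= 0x80, so they never compare equal to
   an ASCII symbol. (Hypothetical helper, for illustration only.) */
size_t count_lt_utf8(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x3C)
            n++;
    }
    return n;
}

/* Same scan over UTF-16 code units. Surrogates are 0xD800..0xDFFF and
   therefore never equal to 0x003C, so no surrogate test is needed here. */
size_t count_lt_utf16(const uint16_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x003C)
            n++;
    }
    return n;
}
```

    Neither loop ever looks at where a character starts or ends, which is
    why the longer sequences cost nothing for this kind of task.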

    However, many 'inner loop' algorithms, such as copy, can be implemented
    using native machine words, handling multiple characters, or parts of
    characters, at once, independent of the UTF.
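
    A rough sketch of what such a word-wise copy can look like, assuming a
    64-bit native word; copy_by_words is a name invented for this example,
    and in practice the library memcpy already does this kind of thing for
    you. The loop neither knows nor cares which UTF the bytes hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy nbytes from src to dst one 64-bit word at a time, then finish the
   tail byte by byte. A word may hold several characters or only part of
   one; the copy is correct either way. Buffers must not overlap.
   (Illustrative only; the 8-byte word size is an assumption.) */
void copy_by_words(void *dst, const void *src, size_t nbytes)
{
    assert((dst != NULL && src != NULL) || nbytes == 0);
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (nbytes >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);  /* alignment-safe load of one word */
        memcpy(d, &w, sizeof w);  /* alignment-safe store of one word */
        s += sizeof w;
        d += sizeof w;
        nbytes -= sizeof w;
    }
    for (; nbytes > 0; nbytes--)  /* remaining 0..7 bytes */
        *d++ = *s++;
}
```

    The fixed-size memcpy calls are the portable way to spell an unaligned
    word access in C; compilers typically turn each into a single load or
    store.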

    And even in those situations, any savings had better not be offset
    by cache limitations.

    A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider

    1) 1 extra test per character (to see whether it's a surrogate)

    2) special handling every 100 to 1000 characters (say 10 instructions)

    3) additional cost of accessing 16-bit registers (per character)

    4) reduction in cache misses (each the equivalent of many instructions)

    5) reduction in disk access (each the equivalent of many many instructions)
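
    Items 1 and 2 can be made concrete with a small sketch of a UTF-16
    decoding step in C. The name utf16_next and the choice to return U+FFFD
    for an unpaired surrogate are assumptions made for this example, not
    something prescribed by the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the code point starting at s[*pos] and advance *pos past it.
   Requires *pos < len. Unpaired surrogates yield U+FFFD (an assumption
   of this sketch). */
uint32_t utf16_next(const uint16_t *s, size_t len, size_t *pos)
{
    assert(s != NULL && pos != NULL && *pos < len);
    uint16_t u = s[(*pos)++];

    /* Item 1: the one extra test paid on every character. */
    if (u < 0xD800 || u > 0xDFFF)
        return u;

    /* Item 2: the rare special handling for a surrogate pair. */
    if (u <= 0xDBFF && *pos < len) {
        uint16_t lo = s[*pos];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*pos)++;
            return 0x10000 + (((uint32_t)(u - 0xD800) << 10)
                              | (uint32_t)(lo - 0xDC00));
        }
    }
    return 0xFFFD;  /* lone high or low surrogate */
}
```

    With UTF-32 the whole function collapses to a single array read, which
    is the difference the first two items are trying to price.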

    For many operations, e.g. string length, both 1 and 2 are no-ops,
    so you need to apply a reduction factor based on the mix of operations
    you do perform, say 50%-75%.

    For many processors, item 3 is not an issue.

    For 4 and 5, the multiplier is somewhere in the 100s or 1000s for each
    occurrence, depending on the architecture. Their relative weight depends
    not only on cache sizes, but also on how many other instructions per
    character are performed. For text scanning operations, their cost
    does predominate with large data sets.
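
    The five items can be folded into a back-of-the-envelope calculation.
    The sketch below is one possible reading of the model, not a measured
    result: the struct, its field names, and the way the reduction factor
    is applied (as the fraction of operations that pay items 1 and 2) are
    all assumptions for illustration. It returns the net cost of UTF-16
    relative to UTF-32 in instruction-equivalents per character; a negative
    value favors UTF-16.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameters; every value is something you would have to
   estimate or measure for your own project. */
typedef struct {
    double test_cost;        /* item 1: instructions per surrogate test   */
    double special_cost;     /* item 2: instructions per special case     */
    double special_interval; /* item 2: characters between special cases  */
    double register_cost;    /* item 3: 16-bit access penalty per char    */
    double op_fraction;      /* share of operations paying items 1 and 2  */
    double misses_saved;     /* item 4: cache misses avoided per char     */
    double miss_cost;        /* item 4: instructions per cache miss       */
    double disk_saved;       /* item 5: disk accesses avoided per char    */
    double disk_cost;        /* item 5: instructions per disk access      */
} utf16_cost_model;

/* Net per-character cost of UTF-16 over UTF-32: extra work (items 1-3)
   minus the savings from fewer cache misses and disk accesses (4-5). */
double utf16_net_cost(const utf16_cost_model *m)
{
    assert(m != NULL && m->special_interval > 0.0);
    double extra = m->op_fraction
                 * (m->test_cost + m->special_cost / m->special_interval)
                 + m->register_cost;
    double saved = m->misses_saved * m->miss_cost
                 + m->disk_saved * m->disk_cost;
    return extra - saved;
}
```

    As an example of plugging in numbers: if you assume 64-byte cache lines
    and a sequential scan over data much larger than the cache, UTF-32
    misses once per 16 characters and UTF-16 once per 32, so misses_saved
    is 1/32. With a miss costing 100 instructions that saving is about 3
    instructions per character, against well under 1 for items 1 and 2.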

    Given this little model and some additional assumptions about your
    own project(s), you should be able to determine the 'nicest' UTF for
    your own performance-critical case.


    This archive was generated by hypermail 2.1.5 : Fri Dec 03 2004 - 10:00:31 CST