RE: UTF-8 <> UCS-2/UTF-16 conversion for library use

From: Ayers, Mike (Mike_Ayers@bmc.com)
Date: Mon Sep 24 2001 - 12:23:49 EDT


> From: Asmus Freytag [mailto:asmusf@ix.netcom.com]
> Sent: Sunday, September 23, 2001 02:24 AM

> The typical situation involves cases where large data sets are cached
> in memory, for immediate access. Going to UTF-32 reduces the cache
> effectively by a factor of two, with no comparable increase in
> processing efficiency to balance out the extra cache misses. This is
> because each cache miss is orders of magnitude more expensive than a
> cache hit.

        For this situation you have a good point. For others, however, the
extra data space of UTF-32 is bound to cost less than having to check every
character for special meaning (i.e. whether it is part of a surrogate pair)
before passing it on.
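        A minimal C sketch of that per-character check, assuming plain
uint16_t/uint32_t buffers (names and signatures are illustrative only):
advancing by one character in UTF-16 means testing every code unit for a
high surrogate, while UTF-32 is a single indexed load.

    #include <stdint.h>
    #include <stddef.h>

    /* Read the code point at *i in a UTF-16 buffer and advance *i.
     * Every call pays for the surrogate test, even on pure-BMP data. */
    static uint32_t next_utf16(const uint16_t *buf, size_t len, size_t *i)
    {
        uint16_t hi = buf[(*i)++];
        if (hi >= 0xD800 && hi <= 0xDBFF && *i < len) {   /* high surrogate? */
            uint16_t lo = buf[*i];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {           /* low surrogate? */
                (*i)++;
                return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (lo - 0xDC00));
            }
        }
        return hi;                 /* BMP character (or unpaired surrogate) */
    }

    /* The UTF-32 equivalent: no special cases, just an indexed load. */
    static uint32_t next_utf32(const uint32_t *buf, size_t *i)
    {
        return buf[(*i)++];
    }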

> For specialized data sets (heavy in ASCII) keeping such a cache in
> UTF-8 might conceivably reduce cache misses further to a point where
> on-the-fly conversion to UTF-16 could get amortized. However, such an
> optimization is not robust, unless the assumption is due to the nature
> of the data (e.g. HTML) as opposed to merely their source (US). In the
> latter case, such an architecture scales badly with change in market.

        Maybe, maybe not. Latin characters are in heavy use wherever
computers are, at least for now.
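        For reference, the conversion being amortized in that scheme looks
roughly like the following hypothetical C sketch. It assumes well-formed
input and a sufficiently large destination buffer; neither check is shown.

    #include <stdint.h>
    #include <stddef.h>

    /* Convert a NUL-terminated UTF-8 string to UTF-16.
     * Returns the number of UTF-16 code units written.
     * Malformed sequences and buffer overflow are deliberately not handled. */
    static size_t utf8_to_utf16(const unsigned char *src, uint16_t *dst)
    {
        size_t out = 0;
        while (*src) {
            uint32_t cp;
            size_t n;
            if (*src < 0x80)      { cp = *src;        n = 1; }  /* ASCII      */
            else if (*src < 0xE0) { cp = *src & 0x1F; n = 2; }  /* 2-byte seq */
            else if (*src < 0xF0) { cp = *src & 0x0F; n = 3; }  /* 3-byte seq */
            else                  { cp = *src & 0x07; n = 4; }  /* 4-byte seq */
            for (size_t i = 1; i < n; i++)
                cp = (cp << 6) | (src[i] & 0x3F);    /* continuation bytes */
            src += n;
            if (cp < 0x10000) {
                dst[out++] = (uint16_t)cp;           /* BMP: one unit */
            } else {
                cp -= 0x10000;                       /* supplementary: pair */
                dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
                dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
            }
        }
        return out;
    }

On ASCII-heavy data the first branch dominates, which is why the conversion
cost can amortize; on other data every byte still goes through the decoder.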

> [The decision to use UTF-16, on the other hand, is much more robust,
> because the code paths that deal with surrogate pairs will be
> exercised with low frequency, due to the deliberate concentration of
> nearly all modern-use characters into the BMP (i.e. the first 64K).]

        Funny. You see robustness; I see latent bugs due to rarely
exercised code paths.
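        As a hypothetical example of the kind of bug that hides in such a
path: code that truncates a UTF-16 buffer at a fixed number of units passes
every BMP-only test, then splits a surrogate pair the first time a
supplementary character lands on the boundary.

    #include <stdint.h>
    #include <stddef.h>

    /* Return a safe truncation point no greater than 'max' code units.
     * The naive version simply returns 'max'; the surrogate-aware branch
     * below is exactly the path that BMP-heavy test data never exercises. */
    static size_t utf16_truncation_point(const uint16_t *buf, size_t len, size_t max)
    {
        if (len <= max)
            return len;
        if (max > 0 && buf[max - 1] >= 0xD800 && buf[max - 1] <= 0xDBFF)
            return max - 1;    /* do not cut between high and low halves */
        return max;
    }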

/|/|ike


