RE: [long] Use of Unicode in AbiWord

From: Christophe PIERRET (
Date: Fri Mar 19 1999 - 04:18:38 EST

On March 19, 1999 12:20 AM, Markus Kuhn []
> wrote on 1999-03-18 20:49 UTC:
> > you will still need to compare for each byte if it is <128 to pass it
> > through unchanged. doing so on 16b or even 32b should not cost much
> > if any. using a 16b x-font with UTF-8 should degrade your performance.
> Such ad-hoc predictions of degraded performance are very dangerous
> without actual measurements.
> [...]
> Just a few counterarguments: 16-bit strings require for most languages
> more cache reloads, and it is perfectly possible that some apparently
> less efficiently looking UTF-8 algorithm suddenly performs faster than
> the more efficiently looking UTF-16 variant. You can efficiently
> implement the <128 test for 8 characters at a time on a modern 64-bit
> processor, and your C library should do this invisibly for you anyway in
> mbtowcs(). In addition, in this example the UTF-8 to 16-bit conversion
> is only a negligible small amount of the computation necessary to
> actually display the glyph, such that the difference shouldn't matter
> anyway. PCs are orders of magnitude too fast today anyway, and many
> applications are desperately looking for useful things to do between the
> keystrokes of the horribly slow users ... ;-)
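
To make the "8 characters at a time" remark above concrete, here is a rough sketch of such a <128 test. It is only an illustration under assumptions: the function name all_ascii is hypothetical (it comes from neither AbiWord nor any C library), and it assumes a compiler that provides C99's <stdint.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if all n bytes at s are < 128 (pure ASCII), otherwise 0.
 * Hypothetical helper, written for illustration only. */
int all_ascii(const unsigned char *s, size_t n)
{
    /* A byte >= 128 has its top bit set; this mask selects the top bit
     * of each of the 8 bytes packed into one 64-bit word. */
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    assert(s != NULL || n == 0);

    /* Main loop: test 8 bytes per iteration. */
    for (; i + 8 <= n; i += 8) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word);  /* safe for unaligned input */
        if (word & high_bits)
            return 0;
    }

    /* Tail: test the remaining 0..7 bytes one at a time. */
    for (; i < n; i++) {
        if (s[i] & 0x80)
            return 0;
    }
    return 1;
}
```

A caller could use this as a fast path: if a run of text is all ASCII, each byte maps directly to the same 16-bit value, and only otherwise is a full UTF-8 decoder needed. Whether this actually pays off is, as said above, something to measure rather than predict.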

I had a very surprising benchmark result, with a UTF-8 algorithm running 30%
faster than the UTF-16 algorithm!
The benchmark used an implementation of a collation algorithm for both UTF-16
and UTF-8.
(Using Visual C++ 5 on a Pentium II 400 with 128 MB of memory.)
On Latin-script data (96% of the characters were ASCII), the UTF-8 version
slightly outperformed the UTF-16 version.
Since the only difference is that I extract one UCS-4 at a time from the
UTF-8 string and apply the same operations as for UTF-16, I expected it to
be slower ...
I found no significant difference that would explain this, even at the
assembler level.
The only explanation I could find is that UTF-8 (for Latin script) uses
less memory to store strings.
So even though the UTF-8 version does more computation (in registers), it was
faster.
The balance between memory access cost and register computation cost seems
to be changing ...
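
For readers unfamiliar with the extraction step mentioned above (pulling one UCS-4 value at a time out of a UTF-8 string), a minimal sketch follows. This is not the benchmarked code, which is not shown in this message. The name utf8_next is hypothetical, and the sketch follows the modern 4-byte definition of UTF-8 (RFC 3629) instead of the longer sequences that were still permitted in 1999.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from s (n bytes available) into *out.
 * Returns the number of bytes consumed (1..4), or 0 if the input is
 * empty, truncated, or malformed.
 * Hypothetical helper, written for illustration only. */
size_t utf8_next(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp, min;
    size_t len, i;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                 /* ASCII: the common case */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                      /* stray continuation or bad lead byte */
    }

    if (n < len)
        return 0;                      /* truncated sequence */

    /* Each continuation byte contributes its low 6 bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return len;
}
```

With 96% ASCII input, nearly every call returns from the first branch after one comparison, which is consistent with the extra decoding work costing little compared to the memory saved.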

Christophe Pierret
Business Objects S.A.
