Efficiency of text processing [was: That UTF-8 Rant]

From: Juliusz Chroboczek (jec@dcs.ed.ac.uk)
Date: Sun Aug 01 1999 - 16:25:58 EDT

Christophe PIERRET <cpierret@businessobjects.com>:

CP> Looking at my code and rethinking its design, I really prefer
CP> the UTF-16 versions to the UTF-8 ones (even if surrogates are
CP> not that simple to handle).

That sounds interesting. Could you please explain this? (My gut
instinct would be that the need to reduce to canonical form at some
point nullifies any advantages that UTF-16 gives over UTF-8.)
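
To illustrate the canonical-form point, here is a minimal Python sketch.
It assumes NFC as the canonical form (the thread does not say which form
is meant), and the helper name canonical() is hypothetical. It only shows
that two spellings of the same character differ byte-for-byte in both
encodings until they are normalized, so that step is needed either way.

```python
import unicodedata

# Two spellings of the same character: precomposed U+00E9, and
# "e" followed by the combining acute accent U+0301.
precomposed = "\u00e9"
decomposed = "e\u0301"

def canonical(text: str) -> str:
    """Reduce text to a canonical form (NFC is an assumption here)."""
    return unicodedata.normalize("NFC", text)

for encoding in ("utf-8", "utf-16-le"):
    # Raw comparison of the encoded bytes, before any normalization.
    raw_equal = precomposed.encode(encoding) == decomposed.encode(encoding)
    # The same comparison after both strings are normalized.
    normalized_equal = (canonical(precomposed).encode(encoding)
                        == canonical(decomposed).encode(encoding))
    print(encoding, "raw:", raw_equal, "normalized:", normalized_equal)
```

Whether the cost of that normalization actually cancels out any UTF-16
advantage is something only a measurement can settle.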

CP> I benchmarked most of my algorithms on Pentium II machines and the
CP> results are: Efficiency of Unicode text processing algorithms
CP> depends on encodings AND language.

CP> In my case: Sorting any Latin-script language is faster with UTF-8
CP> (using the Unicode Collation Algorithm) while it was faster to use
CP> UTF-16 for Japanese or Russian. The same applies to regex search.
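
One factor that plausibly contributes to the language dependence is how
many bytes each encoding needs per script. The Python sketch below is
only an illustration: the sample strings and the helper encoded_sizes()
are made up for this example, and it measures size, not speed.

```python
# Illustrative sample strings, not data from the benchmark above.
samples = {
    "English": "Unicode text processing",
    "Russian": "обработка текста",
    "Japanese": "テキスト処理",
}

def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for language, text in samples.items():
    utf8_bytes, utf16_bytes = encoded_sizes(text)
    print(f"{language}: {len(text)} chars, "
          f"UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

ASCII letters take one byte in UTF-8 and two in UTF-16, Cyrillic letters
take two in both, and kana and kanji in the BMP take three in UTF-8 but
two in UTF-16. Size alone does not explain the Russian result, so decoding
cost and memory access patterns presumably matter too.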

This might be a processor-specific thing. The Intel machines are
surprisingly good at dealing with unaligned and eight-bit memory
accesses. You'd probably find different results on machines with poor
support for such accesses, such as Alphas and most RISC processors.

I wonder what IA-64 (Merced) will be like in this respect.


