Re: Endless endianness annoyance

From: Robert_Ullmann/CAM/Lotus@lotus.com
Date: Wed Dec 03 1997 - 14:48:18 EST


Rather than byte-swapping the text to be searched, byte-swap the search
pattern. Works fine for simple patterns/alternatives, fails miserably
for ranges. :-)
Hi,

Doesn't fail as miserably as you think, if you use bit-mask patterns for
the initial screening. Really fast search algorithms on un-indexed data
will look for pattern signatures first anyway, using wide comparisons. For
example, suppose I'm looking for my last name; instead of walking through
octet-by-octet (in UTF-8) or 16 bits at a time, I build several masks:
"Ullm", "llma", "lman", "mann", then pick up the data 64 bits at a time,
and compare to all 4 masks. On (say) a dual issue Alpha, with the
instructions in the right sequence, I can do 8 comparisons in 10 cycles,
with one data fetch every 5 cycles. (assume some loop unrolling, etc.) So
1.25 cycles/character, at 600 Mhz I'm doing ~Gbyte/second. (Somewhat faster
than the disk drive ...) To do ranges I find the most restrictive bit masks
possible (which then work in either byte order or UTF).

The argument that the data might need to be pre-massaged to deal with
composed vs uncomposed (or whatever) is moot, being orthogonal: this can
happen regardless of whether the concrete encoding is UCS-2, UCS-2-Intel,
or UTF. (Or UCS-4)

UTF-8 is indeed the fastest, because (as I've pointed out on other
occasions) it is network/disk/whatever bandwidth that will always be the
limiting factor. The CPU can then unravel UTF into whatever format you want
2-3 orders of magnitude faster than any external interface will ever be.
(Since they both scale, but CPU scales faster and further :-)

Robert
Lotus/IBM multimedia



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:38 EDT