Re: Endless endianness annoyance

From: Mark Leisher
Date: Wed Dec 03 1997 - 16:42:50 EST

    Robert> Rather than byte-swapping the text to be searched, byte-swap
    Robert> the search pattern. Works fine for simple
    Robert> patterns/alternatives, fails miserably for ranges. :-)

That is actually what I am working on now. I was fishing for other
ideas :-)
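
For anyone following along, here is a rough sketch of the idea as I
read it, not anybody's actual implementation. The names swap16,
find_units and swap_unit are made up for illustration, and it assumes
plain 16-bit code units with no surrogates. The pattern is swapped
once and the text is left alone. The last few lines show why a
swapped range goes wrong: the swapped endpoints bracket code units
that were never in the original range.

```python
def swap16(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit code unit in the buffer."""
    if len(data) % 2:
        raise ValueError("buffer length must be even")
    out = bytearray(len(data))
    out[0::2] = data[1::2]
    out[1::2] = data[0::2]
    return bytes(out)


def find_units(text: bytes, pattern: bytes) -> int:
    """Return the code-unit index of the first aligned match, or -1."""
    pos = text.find(pattern)
    # Skip matches that straddle two code units (odd byte offsets).
    while pos != -1 and pos % 2:
        pos = text.find(pattern, pos + 1)
    return pos // 2 if pos != -1 else -1


def swap_unit(u: int) -> int:
    """Byte-swap a single 16-bit code unit given as an integer."""
    return ((u & 0xFF) << 8) | (u >> 8)


# Text is little-endian, the pattern was built big-endian:
# swap the short pattern instead of the long text.
text_le = "endless endianness".encode("utf-16-le")
pattern_le = swap16("endian".encode("utf-16-be"))
index = find_units(text_le, pattern_le)

# Ranges: swapping the endpoints of [A-Z] gives 0x4100..0x5A00, which
# also admits the swapped form of U+0142, a letter outside A-Z.
lo, hi = swap_unit(0x0041), swap_unit(0x005A)
stray = swap_unit(0x0142)
false_hit = lo <= stray <= hi
```

Literal patterns survive the swap because every unit is compared for
equality. A range compares by magnitude, and swapping the bytes does
not preserve ordering.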

    Robert> The argument that the data might need to be pre-massaged to
    Robert> deal with composed vs uncomposed (or whatever) is moot,
    Robert> being orthogonal: this can happen regardless of whether the
    Robert> concrete encoding is UCS-2, UCS-2-Intel, or UTF. (Or UCS-4)

    Robert> UTF-8 is indeed the fastest, because (as I've pointed out on
    Robert> other occasions) it is network/disk/whatever bandwidth that
    Robert> will always be the limiting factor. The CPU can then unravel
    Robert> UTF into whatever format you want 2-3 orders of magnitude
    Robert> faster than any external interface will ever be. (Since
    Robert> they both scale, but CPU scales faster and further :-)

Let me see if I have this right. You are saying:
1. UTF-8 is fastest because no byte swapping is necessary.
2. Normalization will be needed no matter the encoded form.
3. Changing text to some other form is no problem once in memory.
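
A small sketch of those three points, using Python's standard codecs
and the unicodedata module. The sample strings are my own, and NFC is
just one possible normalization chosen for illustration; nothing here
comes from Robert's message.

```python
import sys
import unicodedata

s = "r\u00e9sum\u00e9"

# 1. UTF-8 is a byte stream with a single serialization; the 16-bit
#    form has two, depending on byte order.
utf8 = s.encode("utf-8")
order_matters = s.encode("utf-16-le") != s.encode("utf-16-be")

# 2. Composed and decomposed spellings differ in every encoding, so
#    normalization is a separate problem from the encoded form.
composed, decomposed = "\u00e9", "e\u0301"
differ_everywhere = all(
    composed.encode(enc) != decomposed.encode(enc)
    for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le")
)
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# 3. Once the bytes are in memory, converting to a native-order
#    16-bit form is a simple pass over the data.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
in_memory = utf8.decode("utf-8").encode(native)
```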
