Re: Endless endianness annoyance

From: Mark Leisher
Date: Wed Dec 03 1997 - 16:42:50 EST

    Robert> Rather than byte-swapping the text to be searched, byte-swap
    Robert> the search pattern. Works fine for simple
    Robert> patterns/alternatives, fails miserably for ranges. :-)

That is actually what I am working on now. I was fishing for other
ideas :-)
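
For reference, the literal-pattern case might look roughly like the sketch
below. It assumes the text is UCS-2 in the opposite byte order and the pattern
is held in native order. The names swap16 and find_swapped are hypothetical
helpers made up for this illustration, not part of any library.

```python
def swap16(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit code unit."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    out = bytearray(len(data))
    out[0::2] = data[1::2]
    out[1::2] = data[0::2]
    return bytes(out)


def find_swapped(text: bytes, pattern: bytes) -> int:
    """Find a literal pattern in opposite-endian UCS-2 text.

    Returns the match position in code units, or -1 if there is no match.
    """
    needle = swap16(pattern)  # swap the short pattern once, not the long text
    start = 0
    while True:
        pos = text.find(needle, start)
        if pos < 0:
            return -1
        if pos % 2 == 0:  # accept only matches aligned on a code unit boundary
            return pos // 2
        start = pos + 1  # misaligned hit straddling two code units; keep looking
```

Ranges are where this breaks down: swapping the endpoints of [a-z]
(U+0061..U+007A) gives 0x6100..0x7A00, and a numeric comparison against that
interval also accepts swapped units such as 0x6101, which is really U+0161.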

    Robert> The argument that the data might need to be pre-massaged to
    Robert> deal with composed vs uncomposed (or whatever) is moot,
    Robert> being orthogonal: this can happen regardless of whether the
    Robert> concrete encoding is UCS-2, UCS-2-Intel, or UTF. (Or UCS-4)

    Robert> UTF-8 is indeed the fastest, because (as I've pointed out on
    Robert> other occasions) it is network/disk/whatever bandwidth that
    Robert> will always be the limiting factor. The CPU can then unravel
    Robert> UTF into whatever format you want 2-3 orders of magnitude
    Robert> faster than any external interface will ever be. (Since
    Robert> they both scale, but CPU scales faster and further :-)

Let me see if I have this right. You are saying:
1. UTF-8 is fastest because no byte swapping is necessary.
2. Normalization will be needed no matter the encoded form.
3. Changing text to some other form is no problem once in memory.
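
Point 3, as I read it, amounts to something like the following sketch: read
the UTF-8 bytes as they arrive, with no byte-order question, then transcode to
whatever the process wants in memory. The function name utf8_to_ucs2_native is
a hypothetical helper for illustration, and native-order UCS-2 is just an
assumed target format.

```python
import sys


def utf8_to_ucs2_native(data: bytes) -> bytes:
    """Decode UTF-8 and re-encode as UCS-2 in the machine's own byte order."""
    text = data.decode("utf-8")
    # UCS-2 has no surrogate pairs, so reject anything outside the BMP.
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("code point outside the BMP cannot be stored in UCS-2")
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return text.encode(codec)
```

The UTF-8 input is byte-for-byte the same on every machine; only the
in-memory result depends on the host byte order.
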

Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

"A designer knows he has achieved perfection
 not when there is nothing left to add, but
 when there is nothing left to take away."
    -- Antoine de Saint-Exupéry
