Re: Endless endianness annoyance

From: Michael Kung (MKUNG@us.oracle.com)
Date: Thu Dec 04 1997 - 12:32:50 EST


Interesting challenge.
 
My two cents:
 
Build the distributed system so that endianness is handled locally: the
search runs on the machine that holds the data and is invoked via a remote
command. For your case, we did create a large memory access method on NT
(see the Oracle announcement :-).
 
The only issue here is file access, or in RDBMS terms, large objects. There
will still be a performance hit if large UCS data is accessed from a machine
of a different endianness.
 
UTF-8 is not great for internal processing.
 
Regards,
 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Michael Kung
40P-972                                      Phone: (650) 506-6954
Manager, Server Globalization Technology     Fax:   (650) 506-7225
Languages and Object Relational Technology   Email: mkung@us.oracle.com
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


attached mail follows:


Endian problems make me cranky, so I thought I would whine about and
alert implementers to a problem I am facing now.

One of the more common performance techniques on Unix platforms is to
memory map large files for read-only activities like searching free
text. When memory mapping is used, the OS handles moving the data from
disk to memory and it usually happens *very* fast.
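
For readers who have not used the technique, here is a minimal sketch of a
memory-mapped, read-only search. It is written in Python purely for
illustration; the function name find_in_mapped_file is hypothetical and not
taken from any search software mentioned in this thread. Note that mmap
refuses to map an empty file.

```python
import mmap


def find_in_mapped_file(path, needle):
    """Return the byte offset of the first occurrence of needle, or -1.

    The file is mapped read-only, so the OS pages data in on demand
    instead of the program reading the whole file into its own buffers.
    """
    with open(path, "rb") as f:
        # Length 0 maps the entire file; an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped.find(needle)
```

The search works on raw bytes, which is exactly why the byte order of the
stored text matters.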

Assume you have about 5Gb of UCS2 text generated on a Windows NT
machine. The 5Gb is neatly distributed across a set of files of about
200Mb apiece. Also assume you are using a big-endian machine.

Let me tell you, byte swapping lots of 200Mb files before searching them
pretty much makes the whole task pointless unless you are running on a
machine with 1Gb or more of real memory. It takes forever!

So how do I fix it? I clear off another 9Gb hard drive, convert all the
files to big-endian on that other drive, and then move them back to the
local drive. Never mind that I cause delays for people using the other
9Gb hard drive and have the usual lengthy delays copying multi-gigabytes
back and forth from a non-local disk.
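
The conversion step itself does not need much memory if it is done in
chunks. The following is a sketch only, assuming the files are plain 16-bit
UCS-2 code units with no other framing; swap_ucs2_file and its chunk_size
parameter are hypothetical names. It bounds memory use but does nothing
about the extra disk space or the copy time complained about above.

```python
def swap_ucs2_file(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, swapping the two bytes of every code unit."""
    if chunk_size <= 0 or chunk_size % 2:
        raise ValueError("chunk_size must be a positive even number")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            if len(chunk) % 2:
                # Only possible at end of file: a dangling half code unit.
                raise ValueError("odd number of bytes; not valid UCS-2")
            swapped = bytearray(len(chunk))
            swapped[0::2] = chunk[1::2]  # second byte of each pair goes first
            swapped[1::2] = chunk[0::2]  # first byte of each pair goes second
            dst.write(swapped)
```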

Think of the implications in the other, more likely, direction: the data
was generated on a big-endian machine and your Windows NT search
software memory maps for performance reasons. When and where will all
the byte swapping happen?
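
One possible answer for plain substring search is to swap the search string
instead of the data: encode the pattern in the file's byte order and search
the mapped bytes as they are. The sketch below assumes the byte order of the
file is already known and that the text is UCS-2 (for BMP characters the
UTF-16 codecs used here produce the same bytes). The name find_ucs2 is
hypothetical. It does not help with case folding or other processing that
needs native-order code units.

```python
import mmap


def find_ucs2(path, text, little_endian):
    """Return the code-unit index of text in a UCS-2 file, or -1.

    The pattern is encoded in the file's byte order, so the mapped data
    is never byte-swapped.
    """
    needle = text.encode("utf-16-le" if little_endian else "utf-16-be")
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            pos = mapped.find(needle)
            # A hit at an odd byte offset straddles two code units and is
            # a false match, so keep looking from the next byte.
            while pos != -1 and pos % 2:
                pos = mapped.find(needle, pos + 1)
            return -1 if pos == -1 else pos // 2
```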

I don't happen to think that the customer should know a priori whether
text they receive is big or little-endian. Also ask yourself if
customers will buy enough physical memory to load and byte-swap a 200Mb
(or larger) file into memory while other programs are running at the
same time. And you know customers are going to ask and complain about
situations like this simply because they can!
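
If the files happen to begin with a byte order mark (U+FEFF), software can
work out the byte order without asking the customer. This is a sketch under
that assumption; nothing in this message says the files in question carry
one, and detect_ucs2_byte_order is a hypothetical name.

```python
def detect_ucs2_byte_order(path):
    """Return "little", "big", or None based on a leading byte order mark."""
    with open(path, "rb") as f:
        head = f.read(2)
    if head == b"\xff\xfe":  # U+FEFF stored low byte first
        return "little"
    if head == b"\xfe\xff":  # U+FEFF stored high byte first
        return "big"
    return None  # no BOM; the byte order must come from somewhere else
```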

Unlike problems such as world hunger which we can blame on Microsoft and
Intel, this is not something we can reasonably lay at their door and
expect to be fixed tomorrow, so customers are going to demand a software
solution.

What would you do?

The most obvious answer is to index the text because people expect
delays when indexing anyway. There are some significant problems with
that answer. Extra credit for anyone who realizes this and does not
send a reply saying we should just index it all.
------------------------------------------------------------------------
mleisher@crl.nmsu.edu
Mark Leisher                    "A designer knows he has achieved perfection
Computing Research Lab           not when there is nothing left to add, but
New Mexico State University      when there is nothing left to take away."
Box 30001, Dept. 3CRL                        -- Antoine de Saint-Exupéry
Las Cruces, NM 88003



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:38 EDT