Re: Why is "endianness" relevant when storing data on disks but not when in memory?

From: Bill Poser <billposer2_at_gmail.com>
Date: Sat, 5 Jan 2013 15:03:17 -0800

Endianness of data stored in memory is relevant, but only if you are
working at a very low level. Suppose that you have UTF-32 data stored as
unsigned C integers. On pretty much any modern computer, each code point
will occupy four 8-bit bytes. So long as you deal with that data via C, as
unsigned 32-bit integers, you don't need to know about endianness: the C
compiler and run-time routines take care of that for you. Endianness is
still relevant, in that your unsigned 32-bit integers could be composed of
bytes in different orders, but unless you work at the byte level, you don't
need to know about it.
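
A minimal C sketch of the point (the code point value here is just an
example I made up): arithmetic on a uint32_t never exposes byte order,
but looking at its individual bytes does.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* A UTF-32 code point stored as an unsigned 32-bit integer:
           U+0041 'A'. Arithmetic and comparisons on it never expose
           byte order. */
        uint32_t cp = 0x00000041;

        if (cp < 0x80)
            printf("U+%04X is an ASCII code point\n", (unsigned)cp);

        /* Only when we look at the individual bytes does endianness
           show up: on a little-endian machine the 0x41 comes first,
           on a big-endian machine it comes last. */
        const unsigned char *p = (const unsigned char *)&cp;
        printf("bytes in memory: %02X %02X %02X %02X\n",
               p[0], p[1], p[2], p[3]);
        return 0;
    }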

The reason that endianness is relevant to data stored on disk is that
there is no agreement between disks (and other external storage devices)
and your programming language as to what constitutes an unsigned 32-bit
integer. Whereas your program can ask the system for an unsigned 32-bit
integer from memory, it can't ask the disk for one, because there is no
agreement between the disk and your program as to what one of those
consists of. Your program has to ask the disk for four bytes and figure out
how to assemble them into an unsigned 32-bit integer.
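
A rough illustration in C, assuming a file we already know to be UTF-32BE
(the file name is hypothetical): the program reads four bytes at a time and
assembles them itself, so the result is the same no matter what the
machine's own byte order is.

    #include <stdint.h>
    #include <stdio.h>

    /* Assemble four bytes read from a file into a code point, with the
       byte order decided by the program, not by the CPU. */
    static uint32_t from_be(const unsigned char b[4]) {
        return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
               ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    }

    static uint32_t from_le(const unsigned char b[4]) {
        return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
               ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
    }

    int main(void) {
        unsigned char buf[4];
        FILE *f = fopen("text-utf32be.dat", "rb"); /* hypothetical name */
        if (!f) return 1;
        while (fread(buf, 1, 4, f) == 4) {
            uint32_t cp = from_be(buf); /* we *know* the file is UTF-32BE */
            printf("U+%04X\n", (unsigned)cp);
        }
        fclose(f);
        return 0;
    }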

Generally speaking, if you are working in a programming language that has
notions like "Unicode character" or "32-bit unsigned integer", the system
knows how those notions correspond to what is in memory and you don't have
to worry about it. In general the system cannot know what format the data
on an external storage device is in, so you may be forced to deal with the
details of the representation.
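
The write side is the same thing in reverse; a sketch (the file name and
code points are made up): the program decides the byte order when it
serializes, so the file is UTF-32BE regardless of the machine that wrote it.

    #include <stdint.h>
    #include <stdio.h>

    /* Serialize one code point as four big-endian bytes, so the file is
       UTF-32BE no matter what the byte order of the writing machine is. */
    static void put_be(uint32_t cp, FILE *f) {
        unsigned char b[4] = {
            (unsigned char)(cp >> 24),
            (unsigned char)(cp >> 16),
            (unsigned char)(cp >> 8),
            (unsigned char)(cp)
        };
        fwrite(b, 1, 4, f);
    }

    int main(void) {
        FILE *f = fopen("out-utf32be.dat", "wb"); /* hypothetical name */
        if (!f) return 1;
        put_be(0x0041, f);  /* 'A'      */
        put_be(0x1F600, f); /* U+1F600  */
        fclose(f);
        return 0;
    }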

On Sat, Jan 5, 2013 at 2:21 PM, Costello, Roger L. <costello_at_mitre.org> wrote:

> Hi Folks,
>
> In the book "Fonts & Encodings" it says (I think) that endianness is
> relevant only when storing data on disks.
>
> Why is endianness not relevant when data is in memory?
>
> On page 62 it says:
>
> ... when we store ... data on disk, we write
> not 32-bit (or 16-bit) numbers but series of
> four (or two) bytes. And according to the
> type of processor (Intel or RISC), the most
> significant byte will be written either first
> (the "little-endian" system) or last (the
> "big-endian" system). Therefore we have
> both a UTF-32BE and a UTF-32LE, a UTF-16BE
> and a UTF-16LE.
>
> Then, on page 63 it says:
>
> ... UTF-16 or UTF-32 ... if we specify one of
> these, either we are in memory, in which case
> the issue of representation as a sequence of
> bytes does not arise, or we are using a method
> that enables us to detect the endianness of the
> document.
>
> When data is in memory isn't it important to know whether the most
> significant byte is first or last?
>
> Does this mean that when exchanging Unicode data across the Internet the
> endianness is not relevant?
>
> Are these stated correctly:
>
> When Unicode data is in a file we would say, for example, "The file
> contains UTF-32BE data."
>
> When Unicode data is in memory we would say, "There is UTF-32 data in
> memory."
>
> When Unicode data is sent across the Internet we would say, "The
> UTF-32 data was sent across the Internet."
>
> /Roger
>
>
>