RE: Translated IUC10 Web pages: Experimental Results

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Thu Feb 06 1997 - 06:15:13 EST

Next message: Murray Sargent: "RE: Translated IUC10 Web pages: Experimental Results"
Previous message: Chris Wendt: "FW: Unicode Web pages"
In reply to: Murray Sargent: "RE: Translated IUC10 Web pages: Experimental Results"
Next in thread: David Goldsmith: "RE: Translated IUC10 Web pages: Experimental Results"

On Wed, 5 Feb 1997, Murray Sargent wrote:

> I believe the default for UCS2 is big endian, which is amusing since 95%
> of the world's computers are little endian. Evidently the majority
> doesn't always rule.

It's not too difficult to understand. Big-endian is the natural
way to do things. Just look at the way we write down numbers.
Little-endian was introduced at some time because of certain
hardware constraints and quirks that are not relevant anymore.
The fact that currently little-endian PCs are a majority
(whereas the majority is probably not as clear for little-endian
chips in general) is, on a certain timescale, a short-term
historical quirk. Although this might seem irrelevant from
a day-to-day viewpoint, preferring big-endian means that
in 2050, say, students learning about Unicode will not
have to be told: Well, you know, Unicode is always the wrong
way around because in the 1980s and 90s there was a time when, for
certain hardware reasons, that was more efficient on what
was then the majority of PCs. Unicode is designed for long-term
use, much longer than the lifetime of processors and operating systems.

But that's just for background. What is important is what the
standard says.

> Does someone know where The Unicode Standard
> defines UCS2 to be big endian?

The relevant statements can be found on page 3-1 of Unicode 2.0.
C3 says that "A process shall interpret a Unicode value that has been
serialized into a sequence of bytes, by most significant byte first,
in the absence of higher-level protocols."

The byte order mark can obviously be seen as a kind of higher-level
protocol. Other information, such as finding the data in an
application format or in a Windows Clipboard, can also constitute
such a higher-level protocol (if it is clearly specified).

The fact that a file appears on a web site, on the other hand,
definitely does not constitute such a higher-level protocol. So
all widely available plaintext files have to either start with
a BOM or be big-endian.
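
A minimal sketch of that rule in Python, for illustration only. The
function name decode_plain_unicode is hypothetical, and the sketch assumes
the input is raw 16-bit Unicode text with no other higher-level protocol
available to indicate the byte order:

```python
BOM_BE = b"\xfe\xff"  # U+FEFF serialized most significant byte first
BOM_LE = b"\xff\xfe"  # U+FEFF serialized least significant byte first


def decode_plain_unicode(data: bytes) -> str:
    """Decode raw 16-bit Unicode text (hypothetical helper).

    A leading BOM selects the byte order; without one, the bytes are
    read most significant byte first, as clause C3 requires.
    """
    if data.startswith(BOM_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(BOM_BE):
        return data[2:].decode("utf-16-be")
    # No BOM and no higher-level protocol: big-endian is the default.
    return data.decode("utf-16-be")


print(decode_plain_unicode(b"\x00H\x00i"))          # no BOM, read as big-endian
print(decode_plain_unicode(b"\xff\xfeH\x00i\x00"))  # little-endian BOM
```

Both calls yield the same text; only the serialization differs.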

> Windows NT has a lot of code that assumes little endian order. I don't
> think there are any big-endian builds of NT.

So UNIX is the only popular OS that works both ways?

> The Win32 predefined clipboard format CF_UNICODETEXT, which is used for
> internal Unicode plain-text data transfer, is defined to be little
> endian.

That's okay, as it is a higher-level convention. In the long run, it
would have been better otherwise, but I guess it's the way it is.

> Win32 plain-text Unicode files are written in little endian form
> starting with a little-endian byte-order mark, i.e., a 0xFF byte
> followed by 0xFE. See Sec. 2.4 of the Unicode Standard, Version 2.0,
> for further discussion of this convention. For example, NotePad writes
> and reads Unicode files in this format, as can the RichEdit 2.0 edit
> control using the TOM interfaces.

If the BOM is there, that's okay.

> In principle, files starting with a big-endian byte-order mark (a 0xFE
> byte followed by 0xFF), could be read and converted to little-endian for
> internal use, but I suspect that most Win32 software hasn't been taught
> to do so.

Now here comes the really *bad problem*! There is no "in principle".
If things such as NotePad, RichEdit and so on don't get it when the
file starts with FEFF, or interpret a raw Unicode file without a BOM
as little-endian, they are just non-conformant! Note that the fact
that the raw text file resides on a local disk cannot, for obvious
reasons, be taken to constitute a higher-level protocol. The comment
after conformance clause C3 on p. 3-1 is somewhat naive here.
It says that "The majority of all interchanges occurs with processes
running on the same or a similar configuration." With the global
Internet, this is less and less true. And even if it were true
for 99% of files, the problem is that you don't know which files are
going to, or came from, the remaining 1%. So the only solution for raw
text files is big-endian or a proper BOM.

> As others have pointed out, it's a trivial exercise to write
> a utility to invert the byte order of files.

It's a trivial exercise, but it's not the job of a utility.
As the standard very clearly specifies, and as is equally
clear from the point of view of user convenience, it's the job of
every single damn application. If the majority of Win32 software
hasn't been taught about it, it is HIGH TIME to tell them.
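
As a rough illustration of how little this asks of an application that
keeps its text little-endian internally, here is a Python sketch. The
names swap_byte_pairs and load_as_little_endian are hypothetical and do
not correspond to any Win32 API:

```python
def swap_byte_pairs(data: bytes) -> bytes:
    """Exchange the two bytes of every 16-bit code unit."""
    if len(data) % 2:
        raise ValueError("16-bit text must have an even number of bytes")
    out = bytearray(len(data))
    out[0::2] = data[1::2]  # low-order bytes move to the even positions
    out[1::2] = data[0::2]  # high-order bytes move to the odd positions
    return bytes(out)


def load_as_little_endian(data: bytes) -> bytes:
    """Return little-endian code units with any BOM removed."""
    if data.startswith(b"\xff\xfe"):
        return data[2:]                   # already little-endian
    if data.startswith(b"\xfe\xff"):
        return swap_byte_pairs(data[2:])  # big-endian BOM: swap on load
    return swap_byte_pairs(data)          # no BOM: assume big-endian
```

The swap happens once, when the file is read, so the rest of the
application can go on assuming its native byte order.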

Regards, Martin.

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT