RE: Translated IUC10 Web pages: Experimental Results

From: Murray Sargent
Date: Thu Feb 06 1997 - 15:20:40 EST

Thanks for pointing out that The Unicode Standard very clearly spells
out that UCS-2 is big endian; don't know how I missed it!

Just for the record, little endian was introduced by IBM with the 360
back in the mid-1960s. I guess now you can tell how old I am! Intel
just followed the leader in choosing little endian. I don't want to get
into a senseless discussion as to which order is natural (I do prefer
little endian); both orders exist so we need to deal with them.

Re Win32 apps maybe not handling big-endian plain-text files, I agree
it's very parochial, but it's probably reality. One thing I've found
again and again in software is that widespread conventions can just
happen, regardless of the good intentions of standards committees and
company architects. A better choice than plain text is probably HTML
with UTF-8, which avoids the issue altogether. But HTML is in a state
of rapid evolution...

In any event, I think it would be crazy to have a plain-text Unicode
file that didn't start with a BOM of some kind. Else you'd probably be
better off interpreting it as 8859-1, not Unicode. Remember that
backward compatibility is a major issue and 8859-1 (or 1252) is a pretty
good guess to make for a random plain-text file.
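
To make the policy above concrete, here is a minimal Python sketch of "trust the BOM, otherwise assume an 8-bit code page". It is only an illustration, not code from NotePad, RichEdit, or any other Win32 component. The function names are hypothetical, Python's "utf-16-le"/"utf-16-be" codecs stand in for UCS-2, and "cp1252" is an assumed choice for the 8-bit fallback.

```python
import codecs


def guess_plain_text_encoding(data: bytes) -> str:
    """Guess an encoding for a plain-text file from its leading bytes.

    Hypothetical helper: a BOM selects 16-bit Unicode in the matching
    byte order; anything else is treated as Windows code page 1252.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    # No BOM: fall back to the backward-compatible 8-bit guess.
    return "cp1252"


def read_plain_text(data: bytes) -> str:
    """Decode file contents using the guess above, dropping any BOM."""
    encoding = guess_plain_text_encoding(data)
    if encoding.startswith("utf-16"):
        data = data[2:]  # skip the 2-byte byte-order mark
    # errors="replace" keeps the sketch from raising on malformed input.
    return data.decode(encoding, errors="replace")
```

With this sketch, a file beginning FF FE or FE FF is read as Unicode in the indicated byte order, and a BOM-less file such as b"caf\xe9" is read as 8-bit text.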


>-----Original Message-----
>From: Martin J. Duerst
>Sent: Thursday, February 06, 1997 3:15 AM
>To: Murray Sargent
>Subject: RE: Translated IUC10 Web pages: Experimental Results
>On Wed, 5 Feb 1997, Murray Sargent wrote:
>> I believe the default for UCS2 is big endian, which is amusing since 95%
>> of the world's computers are little endian. Evidently the majority
>> doesn't always rule.
>It's not too difficult to understand. Big-endian is the natural
>way to do things. Just look at the way we write down numbers.
>Little-endian was introduced at some time because of certain
>hardware constraints and quirks that are not relevant anymore.
>The fact that currently little-endian PCs are a majority
>(whereas the majority is probably not as clear for little-endian
>chips in general) is, on a certain timescale, a short-term
>historical quirk. Although this might seem irrelevant from
>a day-to-day viewpoint, preferring big endian means that
>in 2050, say, students learning about Unicode won't have
>to be told: Well, you know, Unicode is always the wrong
>way around because in the 1980s and 90s there was a time when,
>for certain hardware reasons, that was more efficient on what
>was then the majority of PCs. Unicode is designed for long-term
>use, much longer than processors and operating systems.
>But that's just for background. What is important is what the
>standard says.
>> Does someone know where The Unicode Standard
>> defines UCS2 to be big endian?
>The relevant statements can be found on page 3-1 of Unicode 2.0.
>C3 says that "A process shall interpret a Unicode value that has been
>serialized into a sequence of bytes, by most significant byte first,
>in the absence of higher-level protocols."
>The byte order mark can obviously be seen as a kind of higher-level
>protocol. Other information, such as finding the data in an
>application format or in a Windows Clipboard, can also constitute
>such a higher-level protocol (if it is clearly specified).
>The fact that a file appears on a web site, on the other hand,
>definitely does not constitute such a higher-level protocol. So
>all widely available plaintext files have to either start with
>a BOM or be big-endian.
>> Windows NT has a lot of code that assumes little endian order. I don't
>> think there are any big-endian builds of NT.
>So UNIX is the only popular OS that works both ways?
>> The Win32 predefined clipboard format CF_UNICODETEXT, which is used for
>> internal Unicode plain-text data transfer, is defined to be little
>> endian.
>That's okay, as it is a higher-level convention. In the long run, it
>would have been better otherwise, but I guess it's the way it is.
>> Win32 plain-text Unicode files are written in little endian form
>> starting with a little-endian byte-order mark, i.e., a 0xFF byte
>> followed by 0xFE. See Sec. 2.4 of the Unicode Standard, Version 2.0,
>> for further discussion of this convention. For example, NotePad writes
>> and reads Unicode files in this format, as can the RichEdit 2.0 edit
>> control using the TOM interfaces.
>If the BOM is there, that's okay.
>> In principle, files starting with a big-endian byte-order mark (a 0xFE
>> byte followed by 0xFF), could be read and converted to little-endian for
>> internal use, but I suspect that most Win32 software hasn't been taught
>> to do so.
>Now here comes the really *bad problem*! There is no "in principle".
>If things such as NotePad, RichEdit and so on don't get it when the
>file starts with FEFF, or interpret a raw Unicode file without BOM
>as little-endian, they are just non-conformant! Note that the fact
>that the raw text file resides on a local disk cannot, for obvious
>reasons, be taken to constitute a higher-level protocol. The comment
>after conformance clause C3 on p. 3-1 is somewhat naive here.
>It says that "The majority of all interchanges occurs with processes
>running on the same or a similar configuration." With the global
>Internet, this is less and less true. And even if it were true
>for 99%, the problem is that you don't know which files are going to
>or came from the remaining 1%. So the only solution for raw text
>files is big endian or a proper BOM.
>> As others have pointed out, it's a trivial exercise to write
>> a utility to invert the byte order of files.
>It's a trivial exercise, but it's not the job of a utility.
>As the standard very clearly specifies, and as is quite
>clear from the standpoint of user convenience, it's the job of every
>single damn application. If the majority of Win32 software
>hasn't been taught about it, it is HIGH TIME to tell them.
>Regards, Martin.
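
For illustration, the rule Martin quotes from C3 (honor a BOM if one is present, otherwise read most significant byte first) and the "trivial" byte-order inversion can be sketched in a few lines of Python. This is a hedged sketch under stated assumptions, not code from the Unicode Standard or from any Win32 application: the function names are hypothetical, and Python's "utf-16-le"/"utf-16-be" codecs stand in for UCS-2.

```python
def swap_byte_order(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit (BOM included)."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]  # old low bytes move to even offsets
    swapped[1::2] = data[0::2]  # old high bytes move to odd offsets
    return bytes(swapped)


def decode_raw_unicode(data: bytes) -> str:
    """Decode raw 16-bit Unicode text: BOM if present, else big-endian."""
    if data[:2] == b"\xff\xfe":  # little-endian byte-order mark
        return data[2:].decode("utf-16-le")
    if data[:2] == b"\xfe\xff":  # big-endian byte-order mark
        return data[2:].decode("utf-16-be")
    # No BOM and no higher-level protocol: most significant byte first.
    return data.decode("utf-16-be")
```

A reader built this way accepts both BOM flavors directly, so no separate conversion utility is needed; swap_byte_order is shown only because the thread mentions it.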
