RE: Translated IUC10 Web pages: Experimental Results

From: Murray Sargent (murrays@microsoft.com)
Date: Wed Feb 05 1997 - 18:20:30 EST


I believe the default for UCS2 is big endian, which is amusing since 95%
of the world's computers are little endian. Evidently the majority
doesn't always rule. Does someone know where The Unicode Standard
defines UCS2 to be big endian?

Windows NT has a lot of code that assumes little endian order. I don't
think there are any big-endian builds of NT.

The Win32 predefined clipboard format CF_UNICODETEXT, which is used for
internal Unicode plain-text data transfer, is defined to be little
endian.

The new CF_HTML clipboard format is defined to use UTF-8, so byte order
is irrelevant.

Win32 plain-text Unicode files are written in little endian form
starting with a little-endian byte-order mark, i.e., a 0xFF byte
followed by 0xFE. See Sec. 2.4 of the Unicode Standard, Version 2.0,
for further discussion of this convention. For example, NotePad writes
and reads Unicode files in this format, as can the RichEdit 2.0 edit
control using the TOM interfaces.
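
The byte-order-mark convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `detect_byte_order` and `decode_unicode_text` are hypothetical (they are not Win32, NotePad, or RichEdit APIs), and Python's `utf-16-le`/`utf-16-be` codecs stand in for UCS2, which matches for text without surrogate pairs.

```python
# Byte-order marks as they appear on disk (U+FEFF serialized each way).
BOM_LE = b"\xff\xfe"  # little-endian: 0xFF byte followed by 0xFE
BOM_BE = b"\xfe\xff"  # big-endian: 0xFE byte followed by 0xFF


def detect_byte_order(data: bytes) -> str:
    """Return 'little', 'big', or 'unknown' based on the first two bytes."""
    if data.startswith(BOM_LE):
        return "little"
    if data.startswith(BOM_BE):
        return "big"
    return "unknown"


def decode_unicode_text(data: bytes) -> str:
    """Decode BOM-tagged 16-bit Unicode text (hypothetical helper)."""
    order = detect_byte_order(data)
    if order == "little":
        return data[2:].decode("utf-16-le")  # skip the 2-byte mark
    if order == "big":
        return data[2:].decode("utf-16-be")
    raise ValueError("no byte-order mark; byte order cannot be determined")
```

A reader built this way would accept files tagged in either byte order and refuse to guess when the mark is missing.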

In principle, files starting with a big-endian byte-order mark (a 0xFE
byte followed by 0xFF) could be read and converted to little-endian for
internal use, but I suspect that most Win32 software hasn't been taught
to do so. As others have pointed out, it's a trivial exercise to write
a utility to invert the byte order of files.
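
As a rough sketch of such a byte-order inversion utility, assuming the whole file fits in memory and consists of 16-bit code units: the names `swap_byte_order` and `swap_file` are hypothetical, chosen only for this example.

```python
def swap_byte_order(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit, including the byte-order mark."""
    if len(data) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]  # second byte of each pair moves first
    swapped[1::2] = data[0::2]  # first byte of each pair moves second
    return bytes(swapped)


def swap_file(src_path: str, dst_path: str) -> None:
    """Read src_path, invert its byte order, and write the result to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(swap_byte_order(data))
```

Because the byte-order mark is swapped along with the text, a big-endian file starting with 0xFE 0xFF comes out as a little-endian file starting with 0xFF 0xFE, and applying the swap twice restores the original bytes.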

Murray

> -----Original Message-----
> From: unicode@Unicode.ORG [SMTP:unicode@Unicode.ORG]
> Sent: Wednesday, February 05, 1997 7:05 AM
> To: unicode@Unicode.ORG
> Subject: RE: Translated IUC10 Web pages: Experimental Results
>
> On Tue, 4 Feb 1997 Chris Pratley wrote:
>
> > A few comments on these html files and Word97's capabilities.
> >
> > Word97 supports UCS2 (little-endian) for textfiles
> >
> > Word97 will not open big-endian UCS2:
> > http://194.75.134.50/unicode/iuc10/x-ucs2.html
>
> Very interesting. I thought that the default for UCS2 was
> big-endian. Even on little-endian machines, it would cost
> almost nothing to use that default from the start. And if
> there are really plans to make NT the main OS in the world,
> I hope it is designed so that it doesn't depend on little-
> endian hardware.
>
> The minimum would be to support reading in big-endian
> UCS2 as well as little-endian, if properly tagged, and
> to always write out a tag. The current situation is
> absolutely hilarious, and if not corrected immediately,
> could cause very bad publicity against Unicode.
>
> It is very sad to see that even though a problem has
> been known for years, and appropriate specifications
> and provisions have been included in the standard, a
> company that has been strongly involved in creating that
> standard and that has more resources than most others
> is not able to do at least the minimum necessary to
> let things work the way they were designed.
>
>
> > Word97 supports UTF-8 for HTML (but not UCS2)
> >
> > This is why Word opens the true UTF-8 sites such as
> > http://www.cm.spyglass.com/unicode/iuc10/x-utf8.html
> > as Web pages, and the UCS2 little-endian pages as plain text.
> >
> > Our assumption was that UTF-8 was the only Web-safe encoding that was
> > reasonably likely to be adopted by browsers in the near future. Is that
> > the consensus, or are raw UCS2 encodings being considered actively by
> > people on this alias?
>
> HTTP, the main protocol used to serve Web documents, has absolutely
> no problems transmitting UCS2 or any other kind of "binary" data.
> On a modern server, the character encoding can easily be included
> in the HTTP header. Also, a browser can easily use a simple heuristic
> to distinguish between Unicode (starting (hopefully!) with FEFF or
> (hopefully not!) with FFFE)
>
>
> Hope this helps. Regards, Martin.
>


