From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 06 1997 - 17:30:35 EST

Murray Sargent wrote:

> I believe the default for UCS2 is big endian, which is amusing since 95%
> of the world's computers are little endian. Evidently the majority
> doesn't always rule.

Perhaps 95% of the world's computers. But more significantly, about
50% of the major *platforms* are big endian. Any company concerned about
cross-platform implementation of software and cross-platform access to
data must deal with this issue. To get specific, I must deal with at least
the following platforms, all of which are big endian:

   Solaris (and earlier versions of SunOS)
   Silicon Graphics Irix
   AT&T SVR4
   IBM Open Edition Unix
   OpenVMS DEC Alpha
   and all versions of the MacOS

The fact that these don't run on the majority of Intel-based PCs sitting
on desktops at work or at home, or in the legions of notebook computers,
does not mean that they are not significant in the larger picture of
the world computing infrastructure. And since many of these platforms
are used to implement server technology, a vast amount of data resides on
such platforms and flows to and from such platforms.

The emergence of the Web as a major factor in world computing is
highlighting the importance of getting this right, as Unicode-enabled
browsers, Unicode-enabled content editors, and sites providing Unicode-encoded
data start to emerge. The Web is quintessentially cross-platform. It
also has some of the characteristics of server technology: effectively,
the Web is a vast, distributed, multi-headed hydra of servers dishing up
data and applications to all the thin clients sitting on the desktops
everywhere. It is crucial that we get this right for Unicode, to avoid
turning the Universal Character Set into more character hash for all
the end users.

The choice of big-endian as the default byte order for Unicode (and
UCS-2 in ISO 10646) was not capricious. It was agreed upon during the
initial standardization of Unicode in 1990. The wording of the conformance
clause was tightened up a bit in the publication of Unicode 2.0, partly
to clarify the relationship to ISO 10646. It might also be useful to
consider what the relevant clauses in the international standard,
ISO 10646, have to say on this subject, since Unicode and ISO 10646
are joined at the hip.
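
To make the default concrete, here is a minimal sketch (mine, not from
the standard's text) of serializing 16-bit UCS-2 code units in the
canonical big-endian octet order, with the more significant octet
preceding the less significant one:

```python
def ucs2_to_octets(code_units):
    """Serialize 16-bit code units most-significant-octet first
    (the default byte order described in this post)."""
    out = bytearray()
    for u in code_units:
        out.append((u >> 8) & 0xFF)  # more significant octet precedes...
        out.append(u & 0xFF)         # ...the less significant octet
    return bytes(out)
```

So U+0041 LATIN CAPITAL LETTER A serializes as the octets 00 41, not
41 00 as a little-endian machine would lay it out in memory.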

[Start excerpt from 10646]

Clause 6.3 Octet order

The sequence of the octets that represent a character, and the most
significant and least significant ends of it, shall be maintained
as shown above. [kenw: refers to Clause 6.2, which described MSB
order of canonical form in 10646] When serialised as octets, a more
significant octet shall precede less significant octets. When not
serialised as octets, the order of octets may be specified by agreement
between sender and recipient (see 17.1 and annex F). [kenw: clause
17.1 refers to various announcement and introducer mechanisms.]

Annex F (informative) The use of "signatures" to identify UCS

[kenw: discusses use of U+FEFF/U-000FEFF as signature for text]... When this
convention is used, a signature at the beginning of a stream of coded
characters indicates that the characters following are encoded in the
UCS-2 or UCS-4 coded representation, and indicates the ordering
of the octets within the coded representation of each character
(see 6.3). ... If an application which uses one of these signatures
recognises its coded representation in reverse sequence (e.g. hexadecimal
FFFE), the application can identify that the coded representations of
the following characters use the opposite octet sequence to the sequence
expected, and may take the necessary action to recognise the characters.

[End excerpt from 10646]
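
The Annex F signature logic above amounts to a two-octet check at the
head of the stream. A minimal sketch of it (my illustration, not text
from the standard):

```python
def detect_byte_order(octets):
    """Inspect the leading octets for the U+FEFF signature (Annex F).

    FE FF means the following coded representations are in the
    canonical (big-endian) octet order; FF FE is the signature seen
    in reverse sequence, so the opposite order is in use.
    """
    if octets[:2] == b'\xFE\xFF':
        return 'big-endian'
    if octets[:2] == b'\xFF\xFE':
        return 'little-endian'
    # No signature: octet order must be known by agreement
    # between sender and recipient (clause 6.3).
    return 'unknown'
```

Note that this works only because U+FFFE is guaranteed never to be a
valid character, so the reversed signature is unambiguous.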

--Ken Whistler

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT