Re: UCS-4, UCS-2, UTF-16, UTF-8

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Fri Feb 18 2000 - 05:55:20 EST


Yung-Fong Tang wrote on 2000-02-17 21:18 UTC:
> UCS-4 does not specify byte order, but UTF-32BE and
> UTF-32LE does.

No. UCS-2 and UCS-4 have always been bigendian. Read ISO 10646-1:1993,
section "6.3 Octet order" (page 7):

  When serialized as octets, a more significant octet shall
  precede less significant octets.
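
To make the rule concrete, here is a minimal sketch of what that octet
order means for one UCS-4 value. The function name ucs4_octets is made
up for this illustration and comes from neither the standard nor any
library; the 31-bit range check assumes the original UCS-4 code space.

```python
import struct

def ucs4_octets(code_point: int) -> bytes:
    """Serialize one code point as four octets, most significant first.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 code space")
    # ">I" selects big-endian byte order for an unsigned 32-bit integer.
    return struct.pack(">I", code_point)

# U+20AC EURO SIGN becomes the octets 00 00 20 AC, in that order.
print(ucs4_octets(0x20AC).hex())  # 000020ac
```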

ISO and ITU have fortunately always frowned upon Intel's horrible 1970s
decision of staying compatible with some obscure long-forgotten 1960s
mainframe for which they had bought some software when they made the
8080 a littleendian processor (Intel's microcontrollers by the way are
all bigendian, as is pretty much anything else that was not designed to
be Intel compatible).

Littleendian is technological nonsense that has only been cemented by
the Wintel cartel. It is a real pity that this went into Unicode and we
have now ended up with the BOM mess and almost a dozen different
encoding forms: UCS-2, UCS-4, UTF-1, UTF-7, UTF-8, UTF-16, UTF-32,
UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE. It would have been much
simpler, more robust, more convenient, and more efficient to standardize
only on bigendian than to require everyone to implement both plus the BOM.
Swapping byte orders costs practically no measurable time for modern
processors after all, but having two encodings in files and on networks
will remain a significant and completely unjustified hassle for the
future, motivated only by Microsoft's inability to produce software that
runs portably on machines of any byte sex (which has after all been the
state-of-the-art in the Unix world for over 25 years).
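
For illustration, this is roughly the extra work every reader of such
data ends up doing. It is only a sketch under stated assumptions: the
function name decode_with_bom and the choice to fall back to bigendian
UTF-16 when no BOM is present are mine, not anything a standard
prescribes in this form. The BOM constants are the ones in Python's
codecs module.

```python
import codecs

# BOM signatures, longest first: the UTF-32LE BOM (FF FE 00 00) begins
# with the UTF-16LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def decode_with_bom(data: bytes, default: str = "utf-16-be") -> str:
    """Strip a leading BOM, if any, and decode accordingly.

    Hypothetical helper; the bigendian default is an assumption made
    for this example.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # No BOM found: fall back to the assumed default byte order.
    return data.decode(default)
```

Even this small sketch has a trap: UTF-16LE text that starts with a BOM
followed by U+0000 is indistinguishable from UTF-32LE. With a single
byte order, none of the above would be needed.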

http://www.cl.cam.ac.uk/~mgk25/aes-endian.pdf

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


