Re: UCS-4, UCS-2, UTF-16, UTF-8

From: Doug Ewell (dewell@compuserve.com)
Date: Fri Feb 18 2000 - 09:44:21 EST


Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> wrote:

> It is a real pity that this went into Unicode and we have now ended
> up with the BOM mess and almost a dozen different encoding forms:
> UCS-2, UCS-4, UTF-1, UTF-7, UTF-8, UTF-16, UTF-32, UTF-16BE, UTF-16LE,
> UTF-32BE, UTF-32LE.

Only the last four have anything to do with byte order.

- UCS-2 is the 16-bit-only version of the "true" UCS-4, with no support
   for surrogates. It is the "original" Unicode.
- UTF-16 is UCS-2 with surrogate support added.
- UTF-32 is UCS-4 constrained to Unicode character semantics (as opposed
   to ISO 10646) and a range of U-00000000 to U-0010FFFF.
- UTF-1 was created to allow systems built around 8-bit characters to
   migrate to Unicode with less pain. (That means Unix and Linux as
   much as Windows.)
- UTF-8 is the much-improved version of UTF-1.
- UTF-7 was created to allow Unicode in e-mail despite the presence of
   network nodes that can't EVEN deal with the 8th bit. (Unix's hands
   are MUCH dirtier than Microsoft's here.)
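
To make the differences in the list above concrete, here is a minimal
Python sketch that computes a UTF-16 surrogate pair by hand and then
encodes one character in several of the forms. The helper names
(to_surrogates, encode_all) are hypothetical, and the codec names are
Python's own spellings, not anything defined by Unicode or ISO 10646.
UCS-2 and UTF-1 are left out because Python ships no codec for them.

```python
# Hypothetical helpers for illustration only.
CODECS = ["utf-8", "utf-7", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def encode_all(text):
    """Return the bytes of `text` under each codec in CODECS."""
    return {name: text.encode(name) for name in CODECS}


if __name__ == "__main__":
    # U+10400 lies outside the 16-bit range, so plain UCS-2 cannot hold it.
    high, low = to_surrogates(0x10400)
    print("surrogates: %04X %04X" % (high, low))
    for name, raw in encode_all("\U00010400").items():
        print("%-10s %s" % (name, raw.hex()))
```

Only the "-be" and "-le" variants differ from each other by byte
order; the UTF-8 and UTF-7 output is the same on every machine.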

And while the co-existence of big-endian and little-endian systems that
must communicate with each other is certainly a mess, I hardly consider
the BOM itself to be a mess. It's an elegant solution to an existing
problem that the developers of Unicode and ISO 10646 did not create,
but did anticipate.
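
As a rough illustration of how a receiver might use the BOM, here is a
minimal Python sketch. The function name detect_bom is hypothetical,
and falling back to big-endian UTF-16 when no BOM is present is an
assumption made for this example, not a rule taken from this message.

```python
import codecs

# Check the 4-byte signatures first: the UTF-32LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data, default="utf-16-be"):
    """Return (codec_name, payload) with any leading BOM stripped.

    `default` is an assumed fallback for data that carries no BOM.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data


if __name__ == "__main__":
    # A little-endian sender prefixes its text with the BOM.
    wire = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
    name, payload = detect_bom(wire)
    print(name, payload.decode(name))
```

The sender writes U+FEFF in its native byte order, and the receiver
learns that order from the first few bytes without any out-of-band
agreement.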

-Doug Ewell
 Fullerton, California


