Re: UCS-4, UCS-2, UTF-16, UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 17 2000 - 19:56:24 EST


Frank Tang wrote:

>
> Not only that. UCS-4 does not specify byte order, but UTF-32BE and
> UTF-32LE do. I think UTF-32 itself (neither UTF-32BE nor UTF-32LE) does
> not make too much sense. But remember byte order is essential in network
> transmission.
>

In the context of the Unicode Character Encoding Model (see Unicode
Technical Report #17, http://www.unicode.org/unicode/reports/tr17),
UTF-32 is a Character Encoding Form. It is the mapping from the set
of integers used in the Unicode Standard (the scalar values, within a
code space of 0..10FFFF) to 32-bit code units. In the case of UTF-32,
the mapping is, of course, trivial: each scalar value maps to a single
32-bit code unit of the same numerical value.
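
To make the "trivial" mapping concrete, here is a minimal sketch in Python.
The function names are hypothetical, chosen only for illustration. The
sketch assumes that surrogate code points (D800..DFFF) are excluded from
the scalar values, and it says nothing about byte order, since the
encoding form deals only in 32-bit integers.

```python
from typing import List

MAX_SCALAR = 0x10FFFF  # upper bound of the code space


def utf32_code_unit(scalar: int) -> int:
    """Map one Unicode scalar value to its single UTF-32 code unit."""
    if not 0 <= scalar <= MAX_SCALAR:
        raise ValueError("outside the code space 0..10FFFF")
    if 0xD800 <= scalar <= 0xDFFF:
        # Assumption: surrogate code points are not scalar values.
        raise ValueError("surrogate code points are not scalar values")
    return scalar  # identity mapping: same numerical value


def to_utf32_code_units(text: str) -> List[int]:
    """Apply the encoding form to every character of a string."""
    return [utf32_code_unit(ord(ch)) for ch in text]
```

For example, to_utf32_code_units("A\u20ac") gives the two code units
0x41 and 0x20AC, one per character, with no notion of bytes yet.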

UTF-32BE and UTF-32LE, on the other hand, are Character Encoding Schemes --
they map the code units into serialized byte sequences.
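
As a rough illustration of what an encoding scheme adds, the sketch
below serializes a list of 32-bit code units into bytes in either byte
order. serialize_utf32 is a hypothetical helper, not part of any
library; it uses only the standard struct module.

```python
import struct


def serialize_utf32(code_units, byte_order):
    """Serialize 32-bit code units to bytes; byte_order is "BE" or "LE"."""
    # ">I" is a big-endian unsigned 32-bit integer, "<I" is little-endian.
    fmt = {"BE": ">I", "LE": "<I"}[byte_order]
    return b"".join(struct.pack(fmt, unit) for unit in code_units)


# The same code unit yields two different byte sequences.
big = serialize_utf32([0x41], "BE")     # b"\x00\x00\x00A"
little = serialize_utf32([0x41], "LE")  # b"A\x00\x00\x00"
```

Python's built-in "utf-32-be" and "utf-32-le" codecs produce the same
bytes for the same text, which is a convenient cross-check.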

None of these are *officially* part of the Unicode Standard yet -- they
are proposed as part of the *Draft* Unicode Technical Report #19. It
is likely, however, that they will soon become part of the Unicode
Standard.

When they do, the relationship between UTF-32, UTF-32BE, and UTF-32LE
will be completely analogous to the relationship between UTF-16,
UTF-16BE, and UTF-16LE, as already specified in the standard. That
includes use and interpretation of the BOM (U+FEFF), which, of course,
in UTF-32 is U-0000FEFF.
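
A sketch of how a reader of unmarked "UTF-32" data might use the BOM,
by analogy with UTF-16: U-0000FEFF serializes as 00 00 FE FF in
big-endian order and FF FE 00 00 in little-endian order. The function
names are hypothetical. Falling back to big-endian when no BOM is
present is an assumption carried over from the UTF-16 rules, not
something the draft report is quoted as saying here.

```python
import struct

BOM_BE = b"\x00\x00\xfe\xff"  # U-0000FEFF, big-endian
BOM_LE = b"\xff\xfe\x00\x00"  # U-0000FEFF, little-endian


def detect_utf32_byte_order(data: bytes):
    """Return (byte_order, bom_length) for a UTF-32 byte sequence."""
    if data[:4] == BOM_BE:
        return "BE", 4
    if data[:4] == BOM_LE:
        return "LE", 4
    # Assumption: no BOM means big-endian, as with unmarked UTF-16.
    return "BE", 0


def decode_utf32(data: bytes) -> str:
    """Decode UTF-32 bytes, consuming a leading BOM if there is one."""
    order, offset = detect_utf32_byte_order(data)
    body = data[offset:]
    if len(body) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    fmt = ">I" if order == "BE" else "<I"
    units = (struct.unpack_from(fmt, body, i)[0] for i in range(0, len(body), 4))
    return "".join(chr(unit) for unit in units)
```

Data labeled explicitly as UTF-32BE or UTF-32LE would skip the
detection step, since the label already fixes the byte order.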

--Ken


