Re: UTF8 vs. Unicode (UTF16) in code

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Mar 13 2001 - 21:40:26 EST


Keld surmised:

> On Fri, Mar 09, 2001 at 10:56:30AM -0800, Yves Arrouye wrote:
> >
> > Since the U in UTF stands for Unicode, UTF-32 cannot represent more than
> > what Unicode encodes, which is 1+ million code points. Otherwise, you're
> > talking about UCS-4. But I thought that one of the latest revs of
> > ISO 10646 explicitly specified that UCS-4 will never encode more than
> > what Unicode can encode, and thus definitely not these 4 billion
> > characters you're alluding to.
>
> As far as I know the U in UTF stands for Universal - not Unicode.

Turn in your hymnals (the Unicode Standard, Version 3.0) to page 46,
definition D29.

"D29 A Unicode (or UCS) transformation format (UTF) transforms each
     Unicode scalar value into a unique sequence of code values."

You can see comparable entries in the printed and online glossaries.

The *preferred* Unicode interpretation is "Unicode Transformation
Format", but the standard also allows of "UCS Transformation Format"
as the interpretation of the acronym "UTF", since that is the usage
of ISO/IEC 10646-1:2000. Cf. Annex D in that document, entitled
"UCS Transformation Format 8 (UTF-8)". And the Unicode editors are
in favor of harmonic convergence of terminology where it makes
sense.

> ISO 10646 can encode characters beyond UTF-16, and should retain
> this capability.

This is technically correct. The wording in the standard states
"Planes 00 to FF in Groups 01 to 5F are reserved for future
standardization, and thus those code positions shall not be used
for any other purpose." That is a way of saying that SC2 *could*
encode characters there at some unspecified time in the future.

However, the proposal that someone mentioned on this thread can
be seen in Item 3 of the PDAM 1 to 10646-1, currently under ballot,
which removes user Planes E0..FF and user Groups 60..7F, placing
all those code positions into the same reserved status, and
disallowing their use as private use codes.

The *purpose* of that proposal is to restrict the committed encoding
space of 10646 to U+0000..U+10FFFF, so that UTF-16 and UTF-8 (and
UTF-32) are interoperable.

Furthermore, SC2/WG2 is on record, in its minutes, resolutions,
and principles and procedures, as not intending to encode anything
past U+10FFFF -- precisely because to do so would break interoperability
between UTF-16 and UTF-8.
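
The U+10FFFF ceiling falls directly out of the surrogate-pair arithmetic
of UTF-16. A minimal sketch of that arithmetic follows; the function and
constant names are illustrative assumptions, not taken from either standard.

```python
# Sketch: the highest code point a UTF-16 surrogate pair can reach.
# Names here are hypothetical and chosen only for illustration.
HIGH_MIN, HIGH_MAX = 0xD800, 0xDBFF  # high (leading) surrogate range
LOW_MIN, LOW_MAX = 0xDC00, 0xDFFF    # low (trailing) surrogate range


def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code point."""
    if not (HIGH_MIN <= high <= HIGH_MAX and LOW_MIN <= low <= LOW_MAX):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - HIGH_MIN) << 10) + (low - LOW_MIN)


utf16_ceiling = decode_surrogate_pair(HIGH_MAX, LOW_MAX)
print(hex(utf16_ceiling))  # 0x10ffff
```

Anything above that value has no UTF-16 representation, which is why
encoding characters there would break the interoperability in question.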

> There is a proposal to restrict UTF-8 to
> only encompass the same values as UTF-16,

Actually, that is a separate proposal that has not yet been
floated, which would drop the 5- and 6-byte ranges of UTF-8,
since they are not necessary for UTF-16 interoperability.
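
To see why the 5- and 6-byte forms are not needed, here is a rough
sketch that computes how many bytes a value takes under the original
1- to 6-byte UTF-8 scheme of 10646. The function name is a hypothetical
helper introduced for this illustration.

```python
# Sketch: byte length of a value under the original 6-byte UTF-8 scheme.
# utf8_length_31bit is a hypothetical helper name, not a library call.
UPPER_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_length_31bit(code: int) -> int:
    """Return the number of bytes needed to encode `code` (1 to 6)."""
    if code < 0:
        raise ValueError("code value must be non-negative")
    for length, limit in enumerate(UPPER_LIMITS, start=1):
        if code <= limit:
            return length
    raise ValueError("value exceeds the 31-bit code space")


print(utf8_length_31bit(0x10FFFF))    # 4: the UTF-16 ceiling fits in 4 bytes
print(utf8_length_31bit(0x200000))    # 5: first value needing a 5-byte form
print(utf8_length_31bit(0x7FFFFFFF))  # 6: top of the 31-bit space
```

Every value reachable from UTF-16 fits in at most 4 bytes, so the longer
forms only matter for code positions beyond U+10FFFF.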

> but UCS-4 still encodes
> the 31-bit code space.

Architecturally, this is still correct. 10646 structures the
codespace as 128 groups of 256 planes each, and the "Four-octet
canonical form" (UCS-4) requires the use of 4 "octets". So
this is a 31-bit code space.
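
The 31-bit figure can be checked with a line of arithmetic. The sketch
below assumes the usual 10646 subdivision of each plane into 256 rows of
256 cells, which the text above does not spell out; the constant names
are illustrative.

```python
# Sketch: size of the architectural UCS-4 code space.
GROUPS = 128            # G-octet uses 7 bits (0x00..0x7F)
PLANES_PER_GROUP = 256  # P-octet
ROWS_PER_PLANE = 256    # R-octet (assumed here, per the 10646 architecture)
CELLS_PER_ROW = 256     # C-octet (assumed here, per the 10646 architecture)

total_positions = GROUPS * PLANES_PER_GROUP * ROWS_PER_PLANE * CELLS_PER_ROW
print(total_positions == 2 ** 31)  # True
```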

Practically, however, the impact of other restrictions, and
the requirement for interoperability of UTF-16 and UTF-8 (and
UTF-32), plus the SC2/WG2 principles and procedures, means
that the G-octet will always be 0x00, and the P-octet will
always be in the range 0x00..0x10. In other words, 10646 as
a *Coded Character Set* (as opposed to an architecture for
encoding) has a 21-bit code space. And SC2/WG2 is perfectly
aware that it would be highly inadvisable (and damaging to
its own successful standard) to exceed that limit.
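
As a rough illustration of the G-octet and P-octet restriction, the
sketch below splits a UCS-4 value into its four octets and tests whether
it falls in the committed range. Both function names are hypothetical,
introduced only for this example.

```python
# Sketch: split a UCS-4 value into group, plane, row and cell octets.
# split_ucs4 and in_committed_range are hypothetical helper names.
def split_ucs4(code: int) -> tuple[int, int, int, int]:
    """Return the (G, P, R, C) octets of a 31-bit UCS-4 value."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 code space")
    return (code >> 24) & 0x7F, (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF


def in_committed_range(code: int) -> bool:
    """True when the G-octet is 0x00 and the P-octet is in 0x00..0x10."""
    group, plane, _row, _cell = split_ucs4(code)
    return group == 0x00 and plane <= 0x10


print(split_ucs4(0x10FFFF))          # (0, 16, 255, 255)
print(in_committed_range(0x10FFFF))  # True
print(in_committed_range(0x110000))  # False: the P-octet would be 0x11
print((0x10FFFF).bit_length())       # 21
```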

--Ken


