Re: U+xxxx, U-xxxxxx, and the basics

From: Doug Ewell (dewell@compuserve.com)
Date: Mon Mar 06 2000 - 10:03:15 EST


Dan Oscarsson <Dan.Oscarsson@trab.se> wrote:

> But UTF-8 is not as good designed as UTF-16. UTF-16 does not "overload"
> any value used in Unicode (UCS-2) (i.e. 16-bit representation).
> Unfortunately UTF-8 "overloads" values used in what would be UCS-1
> (codes 0-255) (i.e. 8-bit representation) making a conflict with
> those mostly needing codes 0-255.

UTF-16 accomplishes this by excluding the values U+D800 through U+DFFF
from the realm of "real" characters. To do this with an 8-bit character
set, you would need to exclude a significant number of characters from
the U+0080 through U+00FF range (or a smaller number, and then watch your
multibyte sequences grow to 16 or more bytes each).
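
For anyone who wants to see the UTF-16 side concretely, here is a minimal
Python sketch of the surrogate arithmetic. The function name is my own
invention for illustration; the constants are the standard UTF-16 ones.
It shows why U+D800 through U+DFFF had to be reserved: every code point
beyond plane 0 is split across two 16-bit units taken from that range.

```python
def to_surrogate_pair(cp):
    """Split a code point beyond plane 0 into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points beyond plane 0 need a surrogate pair")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits, lands in U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits, lands in U+DC00..U+DFFF
    return high, low


high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))  # 0xd801 0xdc00
```

Because no "real" character lives in U+D800..U+DFFF, a 16-bit unit in that
range can never be mistaken for a UCS-2 character.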

Either way, the dream of an "ISO 8859-1-compatible" UTF-8 cannot be
practically realized, because ISO 8859-1 is too densely packed.
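
The conflict Dan describes is easy to reproduce. This is only a small
Python sketch using the built-in "latin-1" and "utf-8" codecs; the
variable names are arbitrary. It shows that the byte ISO 8859-1 uses for
a character is not valid UTF-8 on its own, and that the UTF-8 bytes for
the same character read as two different 8859-1 characters.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = text.encode("latin-1")  # one byte: 0xE9
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9

# UTF-8 bytes viewed by software that assumes ISO 8859-1
misread = utf8_bytes.decode("latin-1")

# An ISO 8859-1 byte handed to a UTF-8 decoder: 0xE9 is a lead byte that
# expects continuation bytes, so a lone 0xE9 is rejected.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(latin1_bytes, utf8_bytes, misread, lone_byte_is_valid_utf8)
```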

The closest anyone has gotten to this goal is Jörg Knappen's UTF-7.5,
which both avoids the range U+0080 to U+009F and allows *some* 8859-1
legibility for humans by prefixing "commonly used" 8859-1 characters
(U+00C0 through U+00FF) with U+00A3 POUND SIGN. But it requires longer
multibyte sequences than UTF-8 and cannot handle values beyond plane 0
without employing UTF-16 (which will not make Dan happy).
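
To illustrate the "legibility" point only, here is a rough Python sketch
of the one rule mentioned above: U+00C0 through U+00FF written as the
POUND SIGN byte followed by the character's 8859-1 byte. It is not an
implementation of UTF-7.5. The function name is hypothetical, passing
ASCII through unchanged is my assumption, and everything else is rejected
rather than guessed at; the full scheme is in Knappen's description.

```python
POUND = 0xA3  # byte value of U+00A3 POUND SIGN in ISO 8859-1


def encode_latin1_subset(text):
    """Encode ASCII and U+00C0..U+00FF only; anything else is out of scope."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)             # assumption: ASCII is left as-is
        elif 0xC0 <= cp <= 0xFF:
            out += bytes([POUND, cp])  # prefix byte, then the 8859-1 byte
        else:
            raise ValueError("not covered by this sketch: U+%04X" % cp)
    return bytes(out)


encoded = encode_latin1_subset("J\u00f6rg")
# What a viewer that assumes ISO 8859-1 would display for those bytes
print(encoded.decode("latin-1"))
```

Under those assumptions, an 8859-1 viewer shows the name with a pound sign
in front of the accented letter, and no byte in the output falls in the
0x80 to 0x9F range.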

UTF-7.5 is described at:
http://vzdmzj.zdv.uni-mainz.de/~knappen/jk009.html

-Doug Ewell
 Fullerton, California


