Re: Unicode 3.1: UTF-8

From: David Starner (dstarner98@aasaa.ofe.org)
Date: Wed Jan 31 2001 - 17:04:45 EST


On Wed, Jan 31, 2001 at 11:18:37AM -0800, John Cowan wrote:
> I propose that the distinction between illegal and irregular UTF-8
> code sequences (D36bc) be eliminated. Since there are no code points
> between U+D7FF and U+E000 (the apparently intervening code points
> are UTF-16 code units, but not Unicode code points)
> the corresponding UTF-8 code sequences should be illegal.
>
> This can be achieved by replacing the U+1000..U+FFFF row in
> Table 3.1B as follows:
>
> U+1000..U+CFFF   E1..EC   80..BF   80..BF
> U+D000..U+D7FF   ED       80..9F   80..BF   [9F underscored]
> U+E000..U+FFFF   EE..EF   80..BF   80..BF
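
For illustration, the proposed rows can be read as a range check on
three-byte sequences. This is only a sketch in Python, not text from the
standard. The names THREE_BYTE_ROWS and is_legal_three_byte are made up
here, the U+0800..U+0FFF row is assumed from the existing table, and the
last row is assumed to cover lead bytes EE and EF, since both are needed
to reach U+FFFF.

```python
# Each row: (first byte low, first byte high, second byte low, second byte high).
# The third byte is always 80..BF.
THREE_BYTE_ROWS = [
    (0xE0, 0xE0, 0xA0, 0xBF),  # U+0800..U+0FFF (assumed: existing row, unchanged)
    (0xE1, 0xEC, 0x80, 0xBF),  # U+1000..U+CFFF
    (0xED, 0xED, 0x80, 0x9F),  # U+D000..U+D7FF (second byte capped at 9F)
    (0xEE, 0xEF, 0x80, 0xBF),  # U+E000..U+FFFF (assumed: EE and EF)
]


def is_legal_three_byte(seq: bytes) -> bool:
    """Return True if seq is a three-byte sequence allowed by the rows above."""
    if len(seq) != 3:
        return False
    b1, b2, b3 = seq
    if not 0x80 <= b3 <= 0xBF:
        return False
    return any(
        lo1 <= b1 <= hi1 and lo2 <= b2 <= hi2
        for lo1, hi1, lo2, hi2 in THREE_BYTE_ROWS
    )
```

Under these rows ED 9F BF (U+D7FF) passes, while ED A0 80, the three-byte
form of the code unit D800, is rejected.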

Do other people use irregular sequences? I'm forced to. I have to use them
in some UTF-8 source code to sneak them past a compiler that translates that
UTF-8 into UCS-2, so that my program can then convert them into proper UTF-8
and whatever else it needs. The occasional utility of using them to work
around older systems may well balance the (very rare) problems from using
them.
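
To illustrate the conversion step, here is a sketch in Python. It is not
the actual program, and repair_irregular_utf8 is a hypothetical name. It
assumes the input is otherwise well-formed UTF-8 in which each
supplementary character appears as a pair of three-byte surrogate
sequences.

```python
def repair_irregular_utf8(data: bytes) -> bytes:
    """Rewrite surrogate pairs encoded as two 3-byte sequences as 4-byte UTF-8."""
    # "surrogatepass" lets the decoder accept the three-byte forms of
    # D800..DFFF, which a strict UTF-8 decoder rejects.
    text = data.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins each high/low surrogate pair into a
    # single supplementary code point. An unpaired surrogate raises
    # UnicodeDecodeError on the strict decode.
    text = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")
```

For example, ED A0 80 ED B0 80 (the surrogate pair D800 DC00) becomes
F0 90 80 80, the regular four-byte form of U+10000.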

-- 
David Starner - dstarner98@aasaa.ofe.org
Pointless website: http://dvdeug.dhis.org
