Re: Unicode 3.1: UTF-8

From: David Starner
Date: Wed Jan 31 2001 - 17:04:45 EST

On Wed, Jan 31, 2001 at 11:18:37AM -0800, John Cowan wrote:
> I propose that the distinction between illegal and irregular UTF-8
> code sequences (D36bc) be eliminated. Since there are no code points
> between U+D7FF and U+E000 (the apparently intervening code points
> are UTF-16 code units, but not Unicode code points)
> the corresponding UTF-8 code sequences should be illegal.
> This can be achieved by replacing the U+1000..U+FFFF row in
> Table 3.1B as follows:
> U+1000..U+CFFF E1..EC 80..BF 80..BF
> U+D000..U+D7FF ED 80..9F 80..BF [9F underscored]
> U+E000..U+FFFF EE..EF 80..BF 80..BF
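
[Editorial sketch, not part of the original message.] The proposed rows can be read as a byte-range check on three-byte sequences. The Python below is a minimal illustration under two assumptions: the function name is hypothetical, and lead bytes EE and EF together are taken to cover U+E000..U+FFFF. It covers only the rows quoted above, so the E0 row for U+0800..U+0FFF is deliberately out of scope.

```python
def is_legal_three_byte(seq: bytes) -> bool:
    """Check a 3-byte sequence against the proposed rows for U+1000..U+FFFF.

    Hypothetical helper: only the three rows quoted above are modeled.
    """
    if len(seq) != 3:
        return False
    lead, second, third = seq
    # The final byte is an ordinary continuation byte in every row.
    if not 0x80 <= third <= 0xBF:
        return False
    # U+1000..U+CFFF (E1..EC) and U+E000..U+FFFF (EE..EF, assumed).
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return 0x80 <= second <= 0xBF
    # U+D000..U+D7FF: the second byte stops at 9F, which excludes
    # the surrogate range U+D800..U+DFFF (ED A0..BF xx).
    if lead == 0xED:
        return 0x80 <= second <= 0x9F
    return False
```

Under this check, ED 9F BF (U+D7FF) is accepted and ED A0 80 (the would-be encoding of D800) is rejected.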

Do other people use irregular sequences? I'm forced to. I have to use them
in some UTF-8 source code to sneak them past a compiler that translates that
UTF-8 into UCS-2, so I can convert them into proper UTF-8 and whatever else
in my program. I'm not sure that the occasional utility of using them to
work around older systems doesn't balance the (very rare) problems from
using them.
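
[Editorial sketch, not part of the original message.] The conversion step described above, turning irregular sequences (each half of a surrogate pair encoded as its own three-byte sequence) into proper UTF-8, might look like the following in Python. The message does not say what language or method the author's program used, so this is only one possible approach; the function name is hypothetical.

```python
def irregular_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-8 that contains surrogate pairs as proper UTF-8.

    Hypothetical helper illustrating the idea; not the author's code.
    """
    # 'surrogatepass' lets the UTF-8 decoder accept ED A0..BF xx
    # sequences and yields the surrogate code points unchanged.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each high/low surrogate pair
    # into a single supplementary code point. An unpaired surrogate
    # makes the strict UTF-16 decode raise UnicodeDecodeError.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")
```

For example, the six bytes ED A0 80 ED B0 80 (surrogates D800 and DC00) become the four bytes F0 90 80 80, the regular encoding of U+10000. Input that is already regular UTF-8 passes through unchanged.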

David Starner -
Pointless website:

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT