Re: Corrigendum #9

From: David Starner <>
Date: Wed, 2 Jul 2014 11:19:32 -0700

On Wed, Jul 2, 2014 at 8:02 AM, Karl Williamson <> wrote:
> In
> UTF-8, an example would be that Sun, I'm told, and for reasons I've
> forgotten or never knew, did not want raw NUL bytes to appear in text
> streams, so used the overlong sequence \xC0\x80 to represent them; overlong
> sequences generally being considered "bad" because they could be used to
> insert malicious payloads into the input.

In C, a NUL byte terminates a string. If you have to pass data that may
contain NUL characters through C string functions, you can't store the
NULs as \0. I might argue that 11111111b for 0x00 in UTF-8 would be
technically legal--the standard never specifies which bit sequences
correspond to which byte values--but \xC0\x80 would probably be more
reliably processed by existing code.
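
To make the point concrete, here is a minimal sketch of the \xC0\x80
substitution. The function name encode_nul_overlong and its buffer
contract are my own assumptions for illustration, not anything Sun
shipped, and the output is deliberately not well-formed UTF-8, so a
strict decoder would reject it.

```c
#include <stddef.h>

/* Hypothetical helper (name and contract are assumptions):
 * copy len bytes from src to dst, writing each raw 0x00 byte as the
 * overlong pair 0xC0 0x80, then append a real terminating NUL.
 * dst must have room for at least 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t encode_nul_overlong(const unsigned char *src, size_t len,
                           unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            dst[out++] = 0xC0;  /* lead byte of the overlong form */
            dst[out++] = 0x80;  /* continuation byte */
        } else {
            dst[out++] = src[i];
        }
    }
    dst[out] = 0x00;  /* the only 0x00 byte in dst is the terminator */
    return out;
}
```

With the three input bytes 'a', 0x00, 'b', strlen() on the raw data
stops after 'a' and reports 1, while strlen() on the encoded buffer
reports 4, so the embedded NUL survives a trip through C string
functions.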

Where there is life, there is hope.
Unicode mailing list
Received on Wed Jul 02 2014 - 13:20:29 CDT
