From: Doug Ewell (doug@ewellic.org)
Date: Sun May 31 2009 - 15:18:16 CDT
Hans Aberg  <haberg at math dot su dot se> wrote:
>>> In particular, it would be great to know if the range U+0080, ..., 
>>> U+009F is invalid.
>>
>> That bit is especially wrong.  I can at least imagine why there might 
>> be confusion about the noncharacters and surrogate code points, but 
>> not the C1 controls.
>
> It is a bit disappointing: I was looking for a beginning (escape) byte 
> sequence to tell that string isn't UTF-8, among other valid strings. 
> But perhaps it does not matter.
If you're thinking about inventing one, for your own use, then any byte 
sequence that is not valid UTF-8 should do the job.  One possibility 
would be {0xA0}.
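As a rough illustration of why {0xA0} works as a marker, here is a minimal Python sketch. The helper name is_valid_utf8 and the MARKER constant are made up for this example; the only assumption is that Python's strict "utf-8" codec rejects ill-formed sequences, which it does by default.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xA0 is a continuation byte; on its own it can never start a UTF-8 sequence.
MARKER = b"\xa0"

print(is_valid_utf8(MARKER))                 # False
print(is_valid_utf8(b"caf\xc3\xa9"))         # True ("café")
print(is_valid_utf8(MARKER + b"any bytes"))  # False, the marker poisons the string
```

Any string that begins with the marker fails validation, so a reader can treat that failure as the signal that the rest is not UTF-8.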
Be sure you understand the difference between an invalid *byte sequence* 
and an invalid *code point*.  There are many invalid byte sequences in 
UTF-8.  As Mark pointed out, the only invalid code points are the 
surrogates.
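The distinction can be checked from the code point side as well. The sketch below (Python; can_encode_utf8 is a hypothetical helper name) tries to encode individual code points and shows that the C1 controls and even the noncharacter U+FFFE are encodable, while the surrogates are not.

```python
def can_encode_utf8(code_point: int) -> bool:
    """Return True if the code point has a UTF-8 encoding (hypothetical helper)."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(can_encode_utf8(0x0080))  # True:  C1 control, encoded as C2 80
print(can_encode_utf8(0x009F))  # True:  C1 control, encoded as C2 9F
print(can_encode_utf8(0xFFFE))  # True:  noncharacter, but still encodable
print(can_encode_utf8(0xD800))  # False: surrogate
print(can_encode_utf8(0xDFFF))  # False: surrogate
```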
The section of the Wikipedia article you cited actually contains quite a 
concentration of misleading information:
    "Unpaired surrogate halves may indicate an invalid UTF-16 string was 
encoded, or a valid one with a CESU-8 converter."
Even in CESU-8, surrogate halves are expected to be paired 
appropriately.
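To make the CESU-8 point concrete, here is a small sketch of a CESU-8 style encoder in Python. Python has no built-in CESU-8 codec, so cesu8_encode is a made-up helper that goes through UTF-16 code units and relies on the "surrogatepass" error handler; treat it as an illustration under those assumptions, not a reference implementation.

```python
import struct


def cesu8_encode(text: str) -> bytes:
    """Encode text CESU-8 style: each UTF-16 code unit becomes its own
    1- to 3-byte sequence (hypothetical helper)."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for (unit,) in struct.iter_unpack(">H", utf16):
        # surrogatepass lets a surrogate code unit be written as 3 bytes
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# U+10400 is D801 DC00 in UTF-16, so CESU-8 carries both halves, paired.
print(cesu8_encode("\U00010400").hex(" "))    # ed a0 81 ed b0 80
print("\U00010400".encode("utf-8").hex(" "))  # f0 90 90 80
```

A supplementary character always yields a high surrogate followed by a low surrogate, so a correct CESU-8 converter never produces an unpaired half from valid input.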
    "U+FFFE may indicate encoding of a byte-swapped UTF-16 string as it 
is a backwards BOM."
While true, this has very little to do with UTF-8.  The process from 
which such data was received would have to have been smart enough to 
recognize UTF-16 text and convert it to UTF-8, but dumb enough to get 
the UTF-16 byte order wrong in the first place.
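For illustration, the scenario would look roughly like this in Python (the variable names are just for this example): a little-endian BOM is read as big-endian, producing U+FFFE, which is then encoded quite validly as UTF-8.

```python
# The BOM U+FEFF serialized as UTF-16LE is the byte pair FF FE.
bom_le = "\ufeff".encode("utf-16-le")
print(bom_le.hex(" "))  # ff fe

# Reading those bytes with the wrong byte order yields U+FFFE.
swapped = bom_le.decode("utf-16-be")
print(f"U+{ord(swapped):04X}")  # U+FFFE

# The resulting UTF-8 is well-formed; the mistake happened upstream.
print(swapped.encode("utf-8").hex(" "))  # ef bf be
```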
    "U+0080 through U+009F may indicate CP1252 was converted without 
translating the characters to Unicode"
This has to do with the original content, not the validity of the UTF-8. 
Single bytes of value 0x80 through 0x9F are simply errors.  Unicode 
scalar values from U+0080 through U+009F (represented in UTF-8 as {0xC2, 
0x80} through {0xC2, 0x9F}) may indicate that CP1252 was converted as if 
it were ISO 8859-1.  In that case, the UTF-8 is perfectly valid but the 
underlying data may not be correct.
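A short Python sketch of that mistake, using made-up sample bytes: CP1252 curly quotes (0x93, 0x94) are decoded as ISO 8859-1 and end up as C1 controls in perfectly valid UTF-8.

```python
raw = b"\x93quoted\x94"  # CP1252 bytes: curly double quotes around a word

right = raw.decode("cp1252")   # U+201C ... U+201D, as intended
wrong = raw.decode("latin-1")  # U+0093 ... U+0094, C1 controls

print([f"U+{ord(c):04X}" for c in (right[0], right[-1])])  # ['U+201C', 'U+201D']
print([f"U+{ord(c):04X}" for c in (wrong[0], wrong[-1])])  # ['U+0093', 'U+0094']

# The wrongly converted text still encodes to well-formed UTF-8.
print(wrong.encode("utf-8").hex(" "))  # c2 93 71 75 6f 74 65 64 c2 94
```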
    "U+0080 through U+009F and nothing greater than U+00FF may indicate 
double-converted UTF-8."
Again, this confuses validity of UTF-8 with validity of the underlying 
content.  In any event, incorrect conversion of CP1252 as if it were ISO 
8859-1 (above) would fall into this category.
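Double conversion can be sketched the same way (Python, with the euro sign as an arbitrary sample): UTF-8 bytes are misread as ISO 8859-1 and encoded to UTF-8 a second time. The result is valid UTF-8 whose code points all sit at or below U+00FF, including one in the C1 range.

```python
original = "\u20ac"             # EURO SIGN
once = original.encode("utf-8")  # E2 82 AC

# Misread the UTF-8 bytes as ISO 8859-1, then encode as UTF-8 again.
mangled = once.decode("latin-1")
twice = mangled.encode("utf-8")

print([f"U+{ord(c):04X}" for c in mangled])  # ['U+00E2', 'U+0082', 'U+00AC']
print(twice.hex(" "))                        # c3 a2 c2 82 c2 ac

# Undoing the extra layer recovers the original text.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```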
    "U+DC80 through U+DCFF may be reserved for converting invalid byte 
sequences (see above)"
This is flat wrong and bogus and ill-conceived and non-conformant, and 
should never, ever be done, full stop.
--
Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages