From: Peter Constable (firstname.lastname@example.org)
Date: Fri May 07 2004 - 12:36:57 CDT
> > UTF-8 encoded sequences can be up to 5 bytes long...
> How is that possible? I was under the impression that a UTF-8 sequence
> could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF).
Philippe chastised Chan for mentioning illegal sequences, but then went
on to make reference to there being other illegal sequences.
UTF-8 sequences, as originally defined, could be longer than four bytes
(up to six), in order to address code points in the vast expanse of UCS-4
beyond U+10FFFF, i.e. 0x110000..0x7FFFFFFF. Since the accepted code space
has been constrained to U+0000..U+10FFFF, at most four bytes are needed.
There are non-UTF-8s -- beasts that kind of look like UTF-8 but aren't --
in which sequences of varying length represent the same character and
sequences of more than four bytes appear; those byte sequences are
considered illegal in UTF-8.
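
To make the byte counts concrete, here is a minimal Python sketch. The helper names utf8_length and is_legal_utf8 are made up for illustration and do not come from any standard; the first follows the length tiers of the original 31-bit definition, and the second simply leans on Python's built-in strict decoder as a stand-in for a conformance check.

```python
# Illustrative sketch only: helper names are hypothetical, not from any spec.

# Exclusive upper bounds for 1..6-byte sequences under the original
# (31-bit) UTF-8 definition.
_ORIGINAL_LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def utf8_length(cp: int) -> int:
    """Bytes the original UTF-8 scheme would use for code value cp."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("outside the 31-bit range of the original scheme")
    for length, limit in enumerate(_ORIGINAL_LIMITS, start=1):
        if cp < limit:
            return length
    raise AssertionError("unreachable")


def is_legal_utf8(data: bytes) -> bool:
    """True if Python's strict decoder accepts data as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # The top of today's code space fits in four bytes.
    print(chr(0x10FFFF).encode("utf-8").hex(" "))
    # A five-byte pattern and an overlong two-byte NUL are both rejected.
    print(is_legal_utf8(b"\xf8\x88\x80\x80\x80"))
    print(is_legal_utf8(b"\xc0\x80"))
```

Under these tiers, five- and six-byte forms only come into play from 0x200000 upward, which is well outside U+0000..U+10FFFF; and the two-byte C0 80 form of NUL is an instance of the "varying length for the same character" problem mentioned above.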