From: Ernest Cline (firstname.lastname@example.org)
Date: Fri May 07 2004 - 15:55:58 CDT
> [Original Message]
> From: Jon Hanna <email@example.com>
> UTF-8 as defined in Unicode 4.0 can never be longer than 4 bytes per
> character. However, illegal sequences can be up to 6 (not just 5) bytes long.
> UTF-8 has been variously defined in various standards and specs as
> an encoding of either Unicode or of ISO 10646. ISO 10646 has space
> up to U+7FFFFFFF, although there is a commitment not to use anything
> above U+10FFFF to maintain compatibility with Unicode.
> Because of this some of the specifications for UTF-8 that have been
> published allow for U+7FFFFFFF and below to be encoded
> (U+7FFFFFFF would be encoded as FD BF BF BF BF BF). For
> example RFC 2279 (which is defined in terms of ISO 10646 alone)
> allows this, but it is obsoleted by RFC 3629 (STD 63) which references
> the Unicode standard.
Theoretically, it is possible to encounter valid 5- or 6-byte sequences
in UTF-8. ISO 10646, IIRC, had some private use areas above U+10FFFF.
Therefore a version of UTF-8 that referenced the earlier ISO 10646
definition could have data that referred to such a character. Why anyone
would need or want to do this is beyond me, but it would be possible
for there to exist such data. However, like the possibility of encountering
Unicode 1 Hangul syllables, it isn't something I'd especially worry about.
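
For anyone who wants to see what such data would look like, here is a rough
sketch in Python of the older 1 to 6 byte scheme as RFC 2279 describes it.
The function name encode_utf8_rfc2279 is my own invention for illustration,
not part of any library, and the sketch deliberately skips the surrogate
check that a conforming modern encoder would also make.

```python
def encode_utf8_rfc2279(cp: int) -> bytes:
    """Encode a code point with the older 1-6 byte UTF-8 scheme (RFC 2279).

    Hypothetical helper for illustration only. It does not reject
    surrogates (U+D800..U+DFFF), which a strict encoder would.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])  # ASCII is encoded as a single byte

    # (exclusive upper bound, lead-byte marker, number of continuation bytes)
    table = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),       # 5-byte form, not allowed by RFC 3629
        (0x80000000, 0xFC, 5),      # 6-byte form, not allowed by RFC 3629
    )
    for limit, lead, n_cont in table:
        if cp < limit:
            break

    out = []
    for _ in range(n_cont):
        out.append(0x80 | (cp & 0x3F))  # low 6 bits go in a continuation byte
        cp >>= 6
    out.append(lead | cp)               # remaining high bits go in the lead byte
    return bytes(reversed(out))


# The example from the quoted message: U+7FFFFFFF as six bytes.
print(encode_utf8_rfc2279(0x7FFFFFFF).hex(" "))  # fd bf bf bf bf bf
# The highest code point Unicode permits still fits in four bytes.
print(encode_utf8_rfc2279(0x10FFFF).hex(" "))    # f4 8f bf bf
```

A decoder written to RFC 3629 should reject the first of these outright,
since FD can never begin a legal sequence there; Python's own "utf-8" codec
does exactly that.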