From: Frank Ellermann (nobody@xyzzy.claranet.de)
Date: Thu Feb 08 2007 - 10:14:53 CST
Doug Ewell wrote:
>> In general, if you make an incompatible change - a change where an old
>> decoder cannot cope with the output from an updated encoder - then you
>> must change the name of the charset.
> UTF-8 was initially defined to work across the entire original 31-bit
> ISO 10646 code space, with sequences up to 6 bytes long, before Unicode
> and 10646 agreed to limit the range to U+10FFFF. The definition of
> UTF-8 appears to have been changed, and I've personally seen several
> decoders that recognized the longer sequences, but AFAIK the name
> "UTF-8" was never changed or qualified with a version number.
Old UTF-8 decoders can deal with valid "new" UTF-8. In theory a "new"
decoder is lost with "old" UTF-8 above U+10FFFF, but in practice that's
irrelevant.
The only real difference I'm aware of is the old overlong constructs.
When I implemented UTF-8 I used the old format for error recovery: after
a "new" invalid lead byte I replace it with a single U+FFFD, skipping all
plausible trailing bytes. This is an attempt to limit the reported errors
to a minimum. I do not do this for 0xFE or 0xFF, because those were
always invalid.
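A minimal sketch of that kind of recovery, in Python rather than the
language of the original implementation. The names old_style_length and
decode_with_recovery are made up for illustration, and as a simplifying
assumption the sketch treats every malformed sequence the same way, not
only the lead bytes that became invalid under the newer definition:

```python
REPLACEMENT = "\ufffd"

def old_style_length(lead: int) -> int:
    """Sequence length implied by a lead byte under the original
    (up to 6 byte) UTF-8 definition; 0 if the byte was never a lead."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # 0x80..0xBF (continuation bytes), 0xFE and 0xFF

def decode_with_recovery(data: bytes) -> str:
    """Hypothetical decoder: strict modern UTF-8, but on error it skips
    the trailing bytes the old format would have allowed and emits a
    single U+FFFD for the whole construct."""
    out = []
    i, n = 0, len(data)
    while i < n:
        length = old_style_length(data[i])
        if length == 0:
            # Never a valid lead byte: report it on its own.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Take the lead plus up to length-1 plausible trailing bytes.
        j = i + 1
        while j < n and j < i + length and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            # Strict check: rejects overlong forms, surrogates,
            # truncated sequences and anything above U+10FFFF.
            out.append(data[i:j].decode("utf-8"))
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i = j
    return "".join(out)
```

With this sketch an old 5-byte construct such as F8 88 80 80 80 comes
out as one U+FFFD instead of five, while FE 80 is reported as two
separate errors.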
Frank