Re: UTS#40 (BOCU-1) special handling of large blocks

From: Frank Ellermann (
Date: Thu Feb 08 2007 - 10:14:53 CST

    Doug Ewell wrote:
    >> In general, if you make an incompatible change - a change where an old
    >> decoder cannot cope with the output from an updated encoder - then you
    >> must change the name of the charset.
    > UTF-8 was initially defined to work across the entire original 31-bit
    > ISO 10646 code space, with sequences up to 6 bytes long, before Unicode
    > and 10646 agreed to limit the range to U+10FFFF. The definition of
    > UTF-8 appears to have been changed, and I've personally seen several
    > decoders that recognized the longer sequences, but AFAIK the name
    > "UTF-8" was never changed or qualified with a version number.

    Old UTF-8 decoders can deal with valid "new" UTF-8. In theory a "new"
    decoder is lost with "old" UTF-8 above U+10FFFF, but in practice that's

    The only real difference I'm aware of is the old overlong constructs. When
    I implemented UTF-8 I used the old format for error recovery: after a
    lead byte that is invalid in "new" UTF-8, I replace it with a single
    U+FFFD and skip all plausible trailing bytes. This is an attempt to keep
    the number of reported errors to a minimum. I don't do it for 0xFE or
    0xFF, because those were always invalid.
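
A rough sketch of that recovery strategy, in Python, might look like the following. It is not the implementation described above, only an illustration under some assumptions: the function names are made up, "plausible trailing bytes" is taken to mean up to as many continuation bytes (0x80..0xBF) as the old 6-byte layout would imply for the lead byte, and validity of each candidate sequence is delegated to Python's strict "utf-8" codec.

```python
REPLACEMENT = "\ufffd"


def old_style_length(lead: int) -> int:
    """Sequence length implied by the original (up to 6-byte) UTF-8 layout.

    Returns 0 for bytes that were never lead bytes: stray continuation
    bytes (0x80..0xBF) and 0xFE / 0xFF.
    """
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0


def decode_with_recovery(data: bytes) -> str:
    """Decode modern UTF-8, emitting one U+FFFD per bad old-style sequence."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        length = old_style_length(data[i])
        if length == 0:
            # Never a lead byte (includes 0xFE and 0xFF): report it alone
            # and do not skip anything after it.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Collect the trailing bytes the old layout would expect, stopping
        # early at the first byte that is not a continuation byte.
        j = i + 1
        while j < n and j < i + length and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            # Strict decoding rejects overlongs, surrogates, truncated
            # sequences, and anything above U+10FFFF.
            out.append(data[i:j].decode("utf-8"))
        except UnicodeDecodeError:
            out.append(REPLACEMENT)  # one error for the whole sequence
        i = j
    return "".join(out)


if __name__ == "__main__":
    # An old-style 5-byte sequence collapses to a single U+FFFD.
    print(ascii(decode_with_recovery(b"a\xf8\x88\x80\x80\x80b")))
```

With this sketch an old-style 5-byte sequence between "a" and "b" yields a single U+FFFD, whereas 0xFE followed by two continuation bytes yields three, because nothing is skipped after a byte that was never a valid lead.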

