RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan (
Date: Sat Dec 11 2004 - 05:47:50 CST

    John Cowan wrote:
    > However, although they are *technically* octet sequences, they
    > are *functionally* character strings. That's the issue.
    Nicely put! But the UTC does not seem to care.

    > > The point I'm making is that *whatever* you do, you are still
    > > asking for implementers to obey some convention on conversion
    > > failures for corrupt, uninterpretable character data.
    > > My assessment is that you'd have no better success at making
    > > this work universally well with some set of 128 magic bullet
    > > corruption pills on Plane 14 than you have with the
    > > existing Quoted-Unprintable as a convention.
    > It doesn't have to work universally; indeed, it becomes a QOI issue.
    > Allocating representations of bytes with "bits that are high" makes
    > it possible to do something recoverable, at very little expense to the
    > Unicode Consortium.
    Except that the expense should be slightly higher. The importance of these
    replacement codepoints is still underestimated. They belong in the BMP. And
    at least no one can blame the UTC for cultural bias in this case: these
    codepoints are universal.
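
    To make the "recoverable" part concrete, here is a minimal sketch of the
    general idea: each byte that cannot be interpreted as UTF-8 is mapped to
    its own reserved code point, so the original octets can be restored later.
    It does not use the codepoints discussed in this thread (none are
    assigned). As a stand-in it uses Python's "surrogateescape" error handler
    (PEP 383), which maps each undecodable byte 0x80..0xFF to a lone surrogate
    U+DC80..U+DCFF. The helper names and the sample filename are hypothetical.

```python
# Illustration only: Python's "surrogateescape" handler (PEP 383) stands in
# for the replacement codepoints proposed in the thread. It is a different
# mechanism (lone surrogates), chosen because it ships with Python 3.

def bytes_to_text(raw: bytes) -> str:
    """Decode as UTF-8; each invalid byte becomes one escape code point."""
    return raw.decode("utf-8", errors="surrogateescape")

def text_to_bytes(text: str) -> bytes:
    """Encode as UTF-8; escape code points turn back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")

# Hypothetical filename stored in Latin-1: the byte 0xE9 is invalid as UTF-8 here.
raw = b"caf\xe9.txt"
text = bytes_to_text(raw)        # the bad byte is carried as U+DCE9
roundtrip = text_to_bytes(text)  # identical to the original octets
```

    The point of the sketch is only that the round trip is lossless: the
    string can be handled as character data, and the original octet sequence
    comes back unchanged on output.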

    > > Further, as it turns out that Lars is actually asking for
    > > "standardizing" corrupt UTF-8, a notion that isn't going to
    > > fly even two feet, I think the whole idea is going to be
    > > a complete non-starter.
    > I agree that that part won't fly, absolutely.
    Then I'll have to restructure it.

