Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: John Cowan (jcowan@reutershealth.com)
Date: Wed Dec 08 2004 - 16:38:34 CST

  • Next message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

    Kenneth Whistler scripsit:

    > A Sybase ASE database has the same behavior running on Windows as
    > running on Sun Solaris or Linux, for that matter.

    Fair enough.

    > UNIX filenames are just one instance of this.

    However, although they are *technically* octet sequences, they
    are *functionally* character strings. That's the issue.

    > Failing that, then BINARY fields *are* the appropriate
    > way to deal with arbitrary arrays of bytes that cannot
    > be interpreted as characters.

    This is purism. All the filenames on my Unix system, for example, can
    be interpreted as character strings; the potential to create filenames
    that can't be is unutilized, and sensibly so. For that matter, the
    potential to create files containing C0 controls is also unutilized.

    > > in the same way that it would
    > > be overkill to encode all 8-bit strings in XML using Base-64
    > > just because some of them may contain control characters that are
    > > illegal in well-formed XML.
    >
    > Dunno about the XML issue here -- you're the expert on what
    > the expected level of illegality in usage is there.

    XML's policy is zero tolerance, both for illegal encodings and for
    illegal characters such as U+0001. So in order to be *100% sure* that
    a character string (ASCII, Latin-1, or UTF-*, it matters not) can be put
    into an XML document, one must treat it as binary and encode it as such,
    using QP or Base64 or what have you. But nobody does.
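To make the claim concrete, here is a minimal Python sketch (my illustration, not part of the original message) of the only 100%-safe route: check a string against the XML 1.0 Char production, and if it fails, treat it as binary and Base64-encode it.

```python
import base64

def xml10_safe(s: str) -> bool:
    """True if every character is legal in XML 1.0 content
    (Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
             | [#xE000-#xFFFD] | [#x10000-#x10FFFF])."""
    return all(
        c in '\t\n\r'
        or '\x20' <= c <= '\ud7ff'
        or '\ue000' <= c <= '\ufffd'
        or '\U00010000' <= c <= '\U0010ffff'
        for c in s
    )

s = 'report\x01data'          # contains U+0001, illegal even as &#x1; in XML 1.0
assert not xml10_safe(s)

# The only fully safe route: encode as binary, as the message says nobody does.
encoded = base64.b64encode(s.encode('utf-8')).decode('ascii')
decoded = base64.b64decode(encoded).decode('utf-8')
assert decoded == s           # lossless round trip
```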

    XML 1.1 allows the representation of every Unicode character except
    U+0000, which materially reduces the problem, but there is little support
    for XML 1.1 as yet.

    (XML 1.1 still forbids the C0 controls other than U+0000 from appearing
    literally, but it permits them as numeric character references such as
    &#x1;, which is what materially reduces the problem.)

    In any case, this is only an analogy, not an exact equivalent: the
    problem of representing illegal *characters* in an XML document is
    closely analogous to the problem of representing illegal *bytes* in a
    character string.

    > The point I'm making is that *whatever* you do, you are still
    > asking for implementers to obey some convention on conversion
    > failures for corrupt, uninterpretable character data.
    > My assessment is that you'd have no better success at making
    > this work universally well with some set of 128 magic bullet
    > corruption pills on Plane 14 than you have with the
    > existing Quoted-Unprintable as a convention.

    It doesn't have to work universally; indeed, it becomes a QOI
    (quality of implementation) issue.  Allocating representations for
    the 128 byte values with the high bit set makes it possible to do
    something recoverable, at very little expense to the Unicode
    Consortium.
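As an aside (my illustration, not part of this exchange): essentially this scheme was later standardized in Python as PEP 383's "surrogateescape" error handler, which maps each uninterpretable byte to one of 128 code points, U+DC80..U+DCFF (low surrogates rather than the Plane 14 allocation discussed here), so that corrupt UTF-8 survives a decode/encode round trip losslessly.

```python
# PEP 383 ("surrogateescape"): each undecodable byte 0x80..0xFF maps to
# a code point U+DC80..U+DCFF, making the conversion failure recoverable.
raw = b'valid \xff\xfe invalid'          # not valid UTF-8

text = raw.decode('utf-8', errors='surrogateescape')
assert text[6] == '\udcff'               # byte 0xFF -> U+DCFF

back = text.encode('utf-8', errors='surrogateescape')
assert back == raw                       # lossless round trip
```

This is exactly the "recoverable, at very little expense" behavior argued for above, just realized with 128 surrogate values instead of 128 Plane 14 characters.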

    > Further, as it turns out that Lars is actually asking for
    > "standardizing" corrupt UTF-8, a notion that isn't going to
    > fly even two feet, I think the whole idea is going to be
    > a complete non-starter.

    I agree that that part won't fly, absolutely.

    -- 
    In politics, obedience and support      John Cowan <jcowan@reutershealth.com>
    are the same thing.  --Hannah Arendt    http://www.ccil.org/~cowan
    


    This archive was generated by hypermail 2.1.5 : Wed Dec 08 2004 - 16:39:52 CST