From: John Cowan (firstname.lastname@example.org)
Date: Wed Dec 08 2004 - 16:38:34 CST
Kenneth Whistler scripsit:
> A Sybase ASE database has the same behavior running on Windows as
> running on Sun Solaris or Linux, for that matter.
> UNIX filenames are just one instance of this.
However, although they are *technically* octet sequences, they
are *functionally* character strings. That's the issue.
> Failing that, then BINARY fields *are* the appropriate
> way to deal with arbitrary arrays of bytes that cannot
> be interpreted as characters.
This is purism. All the filenames on my Unix system, for example, can
be interpreted as character strings; the potential to create filenames
that can't be is unutilized, and sensibly so. For that matter, the
potential to create filenames containing C0 controls is also unutilized.
> > in the same way that it would
> > be overkill to encode all 8-bit strings in XML using Base-64
> > just because some of them may contain control characters that are
> > illegal in well-formed XML.
> Dunno about the XML issue here -- you're the expert on what
> the expected level of illegality in usage is there.
XML's policy is zero tolerance, both for illegal encodings and for
illegal characters such as U+0001. So in order to be *100% sure* that
a character string (ASCII, Latin-1, or UTF-*, it matters not) can be put
into an XML document, one must treat it as binary and encode it as such,
using QP or Base64 or what have you. But nobody does.
XML 1.1 allows the representation of every Unicode character except
U+0000, which materially reduces the problem, but there is little support
for XML 1.1 as yet.
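
To make the "zero tolerance" point concrete, here is a minimal sketch in Python of the check one would need in order to be sure a string can go into an XML 1.0 document, falling back to Base64 when it cannot. The ranges are those of the XML 1.0 Char production; the function names and the ("text" | "base64") labelling are purely illustrative assumptions, not part of any standard API, and escaping of markup characters such as & and < is deliberately left out.

```python
import base64

def is_xml10_char(ch):
    """True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def xml10_safe(text):
    """True if every character of text is legal in XML 1.0."""
    return all(is_xml10_char(ch) for ch in text)

def encode_for_xml(text):
    """Hypothetical helper: return (label, payload).

    Legal strings pass through unchanged; anything else is treated
    as binary and Base64-encoded from its UTF-8 bytes.
    """
    if xml10_safe(text):
        return ("text", text)
    # surrogatepass lets lone surrogates through the UTF-8 encoder
    data = text.encode("utf-8", errors="surrogatepass")
    return ("base64", base64.b64encode(data).decode("ascii"))
```

In practice almost every string takes the first branch, which is why nobody bothers with the second.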
In any case, this is only an analogy, not an exact equivalent:
the problem of representing illegal *characters* in an XML document is
closely analogous to the problem of representing illegal *bytes* in a
> The point I'm making is that *whatever* you do, you are still
> asking for implementers to obey some convention on conversion
> failures for corrupt, uninterpretable character data.
> My assessment is that you'd have no better success at making
> this work universally well with some set of 128 magic bullet
> corruption pills on Plane 14 than you have with the
> existing Quoted-Unprintable as a convention.
It doesn't have to work universally; indeed, it becomes a QOI issue.
Allocating representations of bytes with "bits that are high" makes
it possible to do something recoverable, at very little expense to the
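
As an illustration of what "recoverable" can mean here, the following sketch uses Python's built-in "surrogateescape" error handler, which maps each undecodable byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF and back again. This is a different allocation from the Plane 14 code points discussed in this thread, so treat it as an analogous technique rather than the proposal itself; the two function names are made up for the example.

```python
def bytes_to_text(raw):
    """Decode as UTF-8, escaping each invalid byte as U+DC80 + (byte - 0x80)."""
    return raw.decode("utf-8", errors="surrogateescape")

def text_to_bytes(text):
    """Inverse of bytes_to_text: escaped code points become the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")

# A filename in Latin-1: 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"
name = bytes_to_text(raw)

# The bad byte survives as a single escaped code point ...
assert name == "caf\udce9.txt"
# ... and the original octets are recovered exactly.
assert text_to_bytes(name) == raw
```

Valid UTF-8 passes through untouched, so the cost falls only on the rare names that are not interpretable as characters.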
> Further, as it turns out that Lars is actually asking for
> "standardizing" corrupt UTF-8, a notion that isn't going to
> fly even two feet, I think the whole idea is going to be
> a complete non-starter.
I agree that that part won't fly, absolutely.
-- 
John Cowan <email@example.com> http://www.ccil.org/~cowan
In politics, obedience and support are the same thing. --Hannah Arendt
This archive was generated by hypermail 2.1.5 : Wed Dec 08 2004 - 16:39:52 CST