Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: John Cowan (jcowan@reutershealth.com)
Date: Wed Dec 08 2004 - 16:38:34 CST

  • Next message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

    Kenneth Whistler scripsit:

    > A Sybase ASE database has the same behavior running on Windows as
    > running on Sun Solaris or Linux, for that matter.

    Fair enough.

    > UNIX filenames are just one instance of this.

    However, although they are *technically* octet sequences, they
    are *functionally* character strings. That's the issue.

    > Failing that, then BINARY fields *are* the appropriate
    > way to deal with arbitrary arrays of bytes that cannot
    > be interpreted as characters.

    This is purism. All the filenames on my Unix system, for example, can
    be interpreted as character strings; the potential to create filenames
    that can't be is unutilized, and sensibly so. For that matter, the
    potential to create files containing C0 controls is also unutilized.

    > > in the same way that it would
    > > be overkill to encode all 8-bit strings in XML using Base-64
    > > just because some of them may contain control characters that are
    > > illegal in well-formed XML.
    >
    > Dunno about the XML issue here -- you're the expert on what
    > the expected level of illegality in usage is there.

    XML's policy is zero tolerance, both for illegal encodings and for
    illegal characters such as U+0001. So in order to be *100% sure* that
    a character string (ASCII, Latin-1, or UTF-*, it matters not) can be put
    into an XML document, one must treat it as binary and encode it as such,
    using QP or Base64 or what have you. But nobody does.
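To make the claim concrete, here is a minimal Python sketch (my illustration, not part of the original message) of the only 100%-safe route: check a string against the XML 1.0 Char production, and if it fails, treat it as binary and Base64-encode it.

```python
import base64

def xml10_safe(s: str) -> bool:
    """True if every character is legal in XML 1.0 content
    (Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
             | [#xE000-#xFFFD] | [#x10000-#x10FFFF])."""
    return all(
        c in '\t\n\r'
        or '\x20' <= c <= '\ud7ff'
        or '\ue000' <= c <= '\ufffd'
        or '\U00010000' <= c <= '\U0010ffff'
        for c in s
    )

s = 'report\x01data'          # contains U+0001, illegal even as &#x1; in XML 1.0
assert not xml10_safe(s)

# The only fully safe route: encode as binary, as the message says nobody does.
encoded = base64.b64encode(s.encode('utf-8')).decode('ascii')
decoded = base64.b64decode(encoded).decode('utf-8')
assert decoded == s           # lossless round trip
```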

    XML 1.1 allows the representation of every Unicode character except
    U+0000, which materially reduces the problem, but there is little support
    for XML 1.1 as yet.

    (XML 1.1 still forbids the C0 controls other than U+0000 from appearing
    literally, but it permits them as numeric character references such as
    &#x1;, which is what materially reduces the problem.)

    In any case, this is only an analogy, not an exact equivalent: the
    problem of representing illegal *characters* in an XML document is
    closely analogous to the problem of representing illegal *bytes* in a
    character string.

    > The point I'm making is that *whatever* you do, you are still
    > asking for implementers to obey some convention on conversion
    > failures for corrupt, uninterpretable character data.
    > My assessment is that you'd have no better success at making
    > this work universally well with some set of 128 magic bullet
    > corruption pills on Plane 14 than you have with the
    > existing Quoted-Unprintable as a convention.

    It doesn't have to work universally; indeed, it becomes a QOI
    (quality of implementation) issue.  Allocating representations for
    the 128 byte values with the high bit set makes it possible to do
    something recoverable, at very little expense to the Unicode
    Consortium.
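As an aside (my illustration, not part of this exchange): essentially this scheme was later standardized in Python as PEP 383's "surrogateescape" error handler, which maps each uninterpretable byte to one of 128 code points, U+DC80..U+DCFF (low surrogates rather than the Plane 14 allocation discussed here), so that corrupt UTF-8 survives a decode/encode round trip losslessly.

```python
# PEP 383 ("surrogateescape"): each undecodable byte 0x80..0xFF maps to
# a code point U+DC80..U+DCFF, making the conversion failure recoverable.
raw = b'valid \xff\xfe invalid'          # not valid UTF-8

text = raw.decode('utf-8', errors='surrogateescape')
assert text[6] == '\udcff'               # byte 0xFF -> U+DCFF

back = text.encode('utf-8', errors='surrogateescape')
assert back == raw                       # lossless round trip
```

This is exactly the "recoverable, at very little expense" behavior argued for above, just realized with 128 surrogate values instead of 128 Plane 14 characters.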

    > Further, as it turns out that Lars is actually asking for
    > "standardizing" corrupt UTF-8, a notion that isn't going to
    > fly even two feet, I think the whole idea is going to be
    > a complete non-starter.

    I agree that that part won't fly, absolutely.

    -- 
    In politics, obedience and support      John Cowan <jcowan@reutershealth.com>
    are the same thing.  --Hannah Arendt    http://www.ccil.org/~cowan
    


    This archive was generated by hypermail 2.1.5 : Wed Dec 08 2004 - 16:39:52 CST