Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Dec 08 2004 - 15:37:07 CST

    John Cowan responded:
     
    > > Storage of UNIX filenames on Windows databases, for example,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
    O.k., I just quoted this back from the original email, but
    it really is a complete misconception of the issue for
    databases. "Windows databases" is a misnomer to start with.

    There are some databases, like Access, that are Windows-only
    applications, but most serious SQL databases in production (DB2,
    Oracle, Sybase ASE and ASA, and so on) are cross-platform from
    the get-go, and have their *own* rules for what can and
    cannot legitimately be stored in data fields, independent
    of what platform you are running them on. A Sybase ASE
    database has the same behavior running on Windows as running
    on Sun Solaris or Linux, for that matter.

    > > can be done with BINARY fields, which correctly capture the
    > > identity of them as what they are: an unconvertible array of
    > > byte values, not a convertible string in some particular
    > > code page.
    >
    > This solution, however, is overkill,

    Actually, I don't think it is.

    One of the serious classes of fundamental errors that
    database administrators and database programmers run into
    when creating global applications is ignoring or misconstruing
    character set issues.

    In a database, if I define the database (or table or field)
    as containing UTF-8 data, it damn well better have UTF-8
    data in it, or I'm just asking for index corruptions, data
    corruptions or worse -- and calls from unhappy customers.
    When database programmers "lie" to the database about
    character sets -- by declaring a character set of Latin-1,
    say, and then pumping in data which is actually UTF-8,
    expecting it to come back out unchanged with no problems --
    they are skating on very thin ice ... which
    usually tends to break right in the middle of some critical
    application during a holiday while your customer service
    desk is also down. ;-)
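
    To make the failure mode concrete, here is a minimal sketch
    (Python, purely my own illustration -- not the behavior of any
    particular database product) of what happens once any layer
    starts trusting the declared character set and converting:

    # The stored bytes are really UTF-8 for "café", but the column
    # was declared Latin-1.
    utf8_bytes = "café".encode("utf-8")              # b'caf\xc3\xa9'

    # A layer that trusts the declared charset decodes as Latin-1 ...
    wrongly_decoded = utf8_bytes.decode("latin-1")   # 'cafÃ©' (mojibake)

    # ... and a later conversion to UTF-8 double-encodes the data.
    double_encoded = wrongly_decoded.encode("utf-8")

    assert double_encoded != utf8_bytes   # no longer "comes back out unchanged"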

    Such "lying to the database" is generally the tactic of
    first resort for "fixing" global applications when they
    start having to deal with mixed Japanese/European/UTF-8
    data on networks, but it is clearly a hack for not
    understanding and dealing with the character set
    architecture and interoperability problems of putting
    such applications together.

    UNIX filenames are just one instance of this. The first
    mistake is to network things together in ways that create
    a technical mismatch between what the users of the localized
    systems think the filenames mean and what somebody on the
    other end of such a system may end up interpreting the
    bag o' bytes to mean. The application should be constructed
    in such a way that the locale/charset state can be preserved
    on connection, with the "filename" interpreted in terms
    of characters in the realm that needs to deal with it
    that way, and restored to its bag o' bytes at the point
    that needs it that way. If you can't do that reliably
    with a "raw" UNIX set of applications, c'est la vie -- you
    should be building more sophisticated multi-tiered applications
    on top of your UNIX layer, applications which *can* track
    and properly handle locale and character set identities.
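
    As a rough sketch of that shape (Python just for brevity; the
    names here are invented for illustration, not any real API):
    carry the raw bytes together with the charset that was claimed
    for them at the connection boundary, interpret them as
    characters only in the realm that needs characters, and hand
    the untouched bytes back at the point that needs bytes.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FilenameRecord:
        raw: bytes             # the bag o' bytes from the UNIX layer
        claimed_charset: str   # charset state captured on connection

        def as_text(self) -> Optional[str]:
            # Interpret as characters only where characters are needed.
            try:
                return self.raw.decode(self.claimed_charset)
            except (UnicodeDecodeError, LookupError):
                return None    # not valid in the claimed charset

        def for_filesystem(self) -> bytes:
            # Restore the bag o' bytes, untouched, where bytes are needed.
            return self.raw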

    Failing that, then BINARY fields *are* the appropriate
    way to deal with arbitrary arrays of bytes that cannot
    be interpreted as characters. Trying to pump them into
    UTF-8 text data fields and processing them as such when
    they *aren't* UTF-8 text data is lying to the database
    and basically forfeiting your warranty that the database
    will do reasonable things with that data. It's as stupid
    as trying to store date or numeric types in text data
    fields without first converting them to formatted strings
    of text data.
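
    A concrete sketch of the BINARY approach (Python with its
    bundled sqlite3, just because it is self-contained; the table
    and column names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE unix_files (id INTEGER PRIMARY KEY, name BLOB NOT NULL)"
    )

    # Not valid UTF-8 -- and that is fine, because nothing here
    # claims it is text in any charset.
    raw_name = b"report-\xff\xfe-final.txt"
    conn.execute("INSERT INTO unix_files (name) VALUES (?)", (raw_name,))

    (stored,) = conn.execute("SELECT name FROM unix_files").fetchone()
    assert stored == raw_name    # byte-for-byte identity preserved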

    > in the same way that it would
    > be overkill to encode all 8-bit strings in XML using Base-64
    > just because some of them may contain control characters that are
    > illegal in well-formed XML.

    Dunno about the XML issue here -- you're the expert on what
    the expected level of illegality in usage is there.

    But for real database applications, there are usually
    mountains and mountains of stuff going on, most of it
    completely orthogonal to something as conceptually
    straightforward as maintaining the correct interpretation
    of a UNIX filename. It isn't really overkill, in my
    opinion, to design the appropriate tables and metadata
    needed for ensuring that your filename handling doesn't
    blow up somewhere because you've tried to do an UPDATE
    on a UTF-8 data field with some random bag o' bytes that
    won't validate as UTF-8 data.
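
    Checking whether a bag o' bytes actually validates as UTF-8 is
    easy enough; a sketch (Python again, with a hypothetical "files"
    table and made-up column names, purely for illustration):

    def is_valid_utf8(data: bytes) -> bool:
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def store_filename(cursor, file_id: int, raw_name: bytes) -> None:
        if is_valid_utf8(raw_name):
            # Really is UTF-8: safe to store in a UTF-8 text column.
            cursor.execute("UPDATE files SET name_text = ? WHERE id = ?",
                           (raw_name.decode("utf-8"), file_id))
        else:
            # Random bag o' bytes: keep it in a binary column instead.
            cursor.execute("UPDATE files SET name_blob = ? WHERE id = ?",
                           (raw_name, file_id))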

    >
    > > In my opinion, trying to do that with a set of encoded characters
    > > (these 128 or something else) is *less* likely to solve the
    > > problem than using some visible markup convention instead.
    >
    > The trouble with the visible markup, or even the PUA, is that
    > "well-formed filenames", those which are interpretable as
    > UTF-8 text, must also be encoded so as to be sure any
    > markup or PUA that naturally appears in the filename is
    > escaped properly. This is essentially the Quoted-Printable
    > encoding, which is quite rightly known to those stuck with
    > it as "Quoted-Unprintable".

    I wasn't actually suggesting that Quoted-Printable (which was,
    indeed, the model I had in mind) would be an appropriate
    solution to UNIX filename handling. It is actually more
    appropriate for the corrupted document issue, but as you
    note, even there, it basically just leaves you with a
    visibly readable corruption, but a corruption nonetheless.
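
    Just to make concrete what the Quoted-Printable model does to
    such bytes (using Python's standard quopri module, purely as an
    illustration):

    import quopri

    # Windows-1252 "smart quote" bytes in a stream that claims to
    # be ASCII or UTF-8.
    corrupt = b"He said \x93hello\x94 to me"

    # Quoted-Printable escapes the offending bytes as "=93" and
    # "=94" -- visibly readable, but a corruption nonetheless.
    print(quopri.encodestring(corrupt))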

    I don't think that having visible markup (or any other scheme
    for ostensibly carrying around "correct" corrupt data) is
    a substitute for fixing the application architecture and
    data conversion points to eliminate the corruptions in
    the first place.

    > > Simply
    > > encoding 128 characters in the Unicode Standard ostensibly to
    > > serve this purpose is no guarantee whatsoever that anyone would
    > > actually implement and support them in the universal way you
    > > envision, any more than they might a "=93", "=94" convention.
    >
    > Why not, when it's so easy to do so? And they'd be *there*,
    > reserved, unassignable for actual character encoding.
    >
    > Plane E would be a plausible location.

    The point I'm making is that *whatever* you do, you are still
    asking for implementers to obey some convention on conversion
    failures for corrupt, uninterpretable character data.
    My assessment is that you'd have no better success at making
    this work universally well with some set of 128 magic bullet
    corruption pills on Plane 14 than you have with the
    existing Quoted-Unprintable as a convention.

    Further, since it turns out that Lars is actually asking for
    "standardizing" corrupt UTF-8, a notion that isn't going to
    fly even two feet, I think the whole idea is going to be
    a complete non-starter.

    --Ken


