Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Dec 08 2004 - 15:37:07 CST

    John Cowan responded:
     
    > > Storage of UNIX filenames on Windows databases, for example,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
    O.k., I just quoted this back from the original email, but
    it really is a complete misconception of the issue for
    databases. "Windows databases" is a misnomer to start with.

    There are some databases, like Access, that are Windows-only
    applications, but most serious SQL databases in production (DB2,
    Oracle, Sybase ASE and ASA, and so on) are cross-platform from
    the get-go, and have their *own* rules for what can and
    cannot legitimately be stored in data fields, independent
    of what platform you are running them on. A Sybase ASE
    database has the same behavior running on Windows as running
    on Sun Solaris or Linux, for that matter.

    > > can be done with BINARY fields, which correctly capture the
    > > identity of them as what they are: an unconvertible array of
    > > byte values, not a convertible string in some particular
    > > code page.
    >
    > This solution, however, is overkill,

    Actually, I don't think it is.

    One of the serious classes of fundamental errors that
    database administrators and database programmers run into
    when creating global applications is ignoring or misconstruing
    character set issues.

    In a database, if I define the database (or table or field)
    as containing UTF-8 data, it damn well better have UTF-8
    data in it, or I'm just asking for index corruptions, data
    corruptions or worse -- and calls from unhappy customers.
    When database programmers "lie" to the database about
    character sets -- by declaring a character set of Latin-1,
    say, and then pumping in data which is actually UTF-8,
    expecting it to come back out unchanged with no problems --
    they are skating on very thin ice ... which
    usually tends to break right in the middle of some critical
    application during a holiday while your customer service
    desk is also down. ;-)
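
    To make the failure mode concrete, here is a minimal sketch
    (Python, purely my own illustration -- not the behavior of any
    particular database product) of what happens once any layer
    starts trusting the declared character set and converting:

    # The stored bytes are really UTF-8 for "café", but the column
    # was declared Latin-1.
    utf8_bytes = "café".encode("utf-8")              # b'caf\xc3\xa9'

    # A layer that trusts the declared charset decodes as Latin-1 ...
    wrongly_decoded = utf8_bytes.decode("latin-1")   # 'cafÃ©' (mojibake)

    # ... and a later conversion to UTF-8 double-encodes the data.
    double_encoded = wrongly_decoded.encode("utf-8")

    assert double_encoded != utf8_bytes   # no longer "comes back out unchanged"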

    Such "lying to the database" is generally the tactic of
    first resort for "fixing" global applications when they
    start having to deal with mixed Japanese/European/UTF-8
    data on networks, but it is clearly a hack for not
    understanding and dealing with the character set
    architecture and interoperability problems of putting
    such applications together.

    UNIX filenames are just one instance of this. The first
    mistake is to network things together in ways that create
    a technical mismatch between what the users of the localized
    systems think the filenames mean and what somebody on the
    other end of such a system may end up interpreting the
    bag o' bytes to mean. The application should be constructed
    in such a way that the locale/charset state can be preserved
    on connection, with the "filename" interpreted in terms
    of characters in the realm that needs to deal with it
    that way, and restored to its bag o' bytes at the point
    that needs it that way. If you can't do that reliably
    with a "raw" UNIX set of applications, c'est la vie -- you
    should be building more sophisticated multi-tiered applications
    on top of your UNIX layer, applications which *can* track
    and properly handle locale and character set identities.
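
    As a rough sketch of that shape (Python just for brevity; the
    names here are invented for illustration, not any real API):
    carry the raw bytes together with the charset that was claimed
    for them at the connection boundary, interpret them as
    characters only in the realm that needs characters, and hand
    the untouched bytes back at the point that needs bytes.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FilenameRecord:
        raw: bytes             # the bag o' bytes from the UNIX layer
        claimed_charset: str   # charset state captured on connection

        def as_text(self) -> Optional[str]:
            # Interpret as characters only where characters are needed.
            try:
                return self.raw.decode(self.claimed_charset)
            except (UnicodeDecodeError, LookupError):
                return None    # not valid in the claimed charset

        def for_filesystem(self) -> bytes:
            # Restore the bag o' bytes, untouched, where bytes are needed.
            return self.raw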

    Failing that, then BINARY fields *are* the appropriate
    way to deal with arbitrary arrays of bytes that cannot
    be interpreted as characters. Trying to pump them into
    UTF-8 text data fields and processing them as such when
    they *aren't* UTF-8 text data is lying to the database
    and basically forfeiting your warranty that the database
    will do reasonable things with that data. It's as stupid
    as trying to store date or numeric types in text data
    fields without first converting them to formatted strings
    of text data.
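
    A concrete sketch of the BINARY approach (Python with its
    bundled sqlite3, just because it is self-contained; the table
    and column names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE unix_files (id INTEGER PRIMARY KEY, name BLOB NOT NULL)"
    )

    # Not valid UTF-8 -- and that is fine, because nothing here
    # claims it is text in any charset.
    raw_name = b"report-\xff\xfe-final.txt"
    conn.execute("INSERT INTO unix_files (name) VALUES (?)", (raw_name,))

    (stored,) = conn.execute("SELECT name FROM unix_files").fetchone()
    assert stored == raw_name    # byte-for-byte identity preserved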

    > in the same way that it would
    > be overkill to encode all 8-bit strings in XML using Base-64
    > just because some of them may contain control characters that are
    > illegal in well-formed XML.

    Dunno about the XML issue here -- you're the expert on what
    the expected level of illegality in usage is there.

    But for real database applications, there are usually
    mountains and mountains of stuff going on, most of it
    completely orthogonal to something as conceptually
    straightforward as maintaining the correct interpretation
    of a UNIX filename. It isn't really overkill, in my
    opinion, to design the appropriate tables and metadata
    needed for ensuring that your filename handling doesn't
    blow up somewhere because you've tried to do an UPDATE
    on a UTF-8 data field with some random bag o' bytes that
    won't validate as UTF-8 data.
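
    Checking whether a bag o' bytes actually validates as UTF-8 is
    easy enough; a sketch (Python again, with a hypothetical "files"
    table and made-up column names, purely for illustration):

    def is_valid_utf8(data: bytes) -> bool:
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def store_filename(cursor, file_id: int, raw_name: bytes) -> None:
        if is_valid_utf8(raw_name):
            # Really is UTF-8: safe to store in a UTF-8 text column.
            cursor.execute("UPDATE files SET name_text = ? WHERE id = ?",
                           (raw_name.decode("utf-8"), file_id))
        else:
            # Random bag o' bytes: keep it in a binary column instead.
            cursor.execute("UPDATE files SET name_blob = ? WHERE id = ?",
                           (raw_name, file_id))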

    >
    > > In my opinion, trying to do that with a set of encoded characters
    > > (these 128 or something else) is *less* likely to solve the
    > > problem than using some visible markup convention instead.
    >
    > The trouble with the visible markup, or even the PUA, is that
    > "well-formed filenames", those which are interpretable as
    > UTF-8 text, must also be encoded so as to be sure any
    > markup or PUA that naturally appears in the filename is
    > escaped properly. This is essentially the Quoted-Printable
    > encoding, which is quite rightly known to those stuck with
    > it as "Quoted-Unprintable".

    I wasn't actually suggesting that Quoted-Printable (which was,
    indeed, the model I had in mind) would be an appropriate
    solution to UNIX filename handling. It is actually more
    appropriate for the corrupted document issue, but as you
    note, even there, it basically just leaves you with a
    visibly readable corruption, but a corruption nonetheless.
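
    Just to make concrete what the Quoted-Printable model does to
    such bytes (using Python's standard quopri module, purely as an
    illustration):

    import quopri

    # Windows-1252 "smart quote" bytes in a stream that claims to
    # be ASCII or UTF-8.
    corrupt = b"He said \x93hello\x94 to me"

    # Quoted-Printable escapes the offending bytes as "=93" and
    # "=94" -- visibly readable, but a corruption nonetheless.
    print(quopri.encodestring(corrupt))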

    I don't think that having visible markup (or any other scheme
    for ostensibly carrying around "correct" corrupt data) is
    a substitute for fixing the application architecture and
    data conversion points to eliminate the corruptions in
    the first place.

    > > Simply
    > > encoding 128 characters in the Unicode Standard ostensibly to
    > > serve this purpose is no guarantee whatsoever that anyone would
    > > actually implement and support them in the universal way you
    > > envision, any more than they might a "=93", "=94" convention.
    >
    > Why not, when it's so easy to do so? And they'd be *there*,
    > reserved, unassignable for actual character encoding.
    >
    > Plane E would be a plausible location.

    The point I'm making is that *whatever* you do, you are still
    asking for implementers to obey some convention on conversion
    failures for corrupt, uninterpretable character data.
    My assessment is that you'd have no better success at making
    this work universally well with some set of 128 magic bullet
    corruption pills on Plane 14 than you have with the
    existing Quoted-Unprintable as a convention.

    Further, since it turns out that Lars is actually asking for
    "standardizing" corrupt UTF-8, a notion that isn't going to
    fly even two feet, I think the whole idea is going to be
    a complete non-starter.

    --Ken


