Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: John Cowan (
Date: Tue Dec 07 2004 - 22:12:16 CST

    Kenneth Whistler scripsit:

    > Storage of UNIX filenames on Windows databases, for example,
    > can be done with BINARY fields, which correctly capture the
    > identity of them as what they are: an unconvertible array of
    > byte values, not a convertible string in some particular
    > code page.

    This solution, however, is overkill, in the same way that it would
    be overkill to encode all 8-bit strings in XML using Base-64
    just because some of them may contain control characters that are
    illegal in well-formed XML.

    > In my opinion, trying to do that with a set of encoded characters
    > (these 128 or something else) is *less* likely to solve the
    > problem than using some visible markup convention instead.

    The trouble with visible markup, or even the PUA, is that
    "well-formed filenames" (those that are interpretable as
    UTF-8 text) must also be encoded, to make sure that any
    markup or PUA character that naturally appears in the
    filename is escaped properly. This is essentially the
    Quoted-Printable encoding, which is quite rightly known to
    those stuck with it as "Quoted-Unprintable".
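
    The escaping cost described above can be illustrated with a
    minimal Python sketch of a "=XX" convention like the one
    mentioned in the quoted text. The function names are hypothetical,
    and Python's "surrogateescape" error handler is used only as a
    convenient way to locate the bytes that are not valid UTF-8.

```python
def escape_filename(raw: bytes) -> str:
    """Render raw filename bytes as text using a hypothetical "=XX" escape."""
    # surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # An invalid byte: emit it as visible markup.
            out.append("=%02X" % (cp - 0xDC00))
        elif ch == "=":
            # A literal "=" in a well-formed name must be escaped too,
            # otherwise it could not be told apart from the markup.
            out.append("=3D")
        else:
            out.append(ch)
    return "".join(out)


def unescape_filename(text: str) -> bytes:
    """Invert escape_filename (assumes its input was produced by it)."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "=":
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)
```

    Under this sketch, the perfectly valid name b"a=b" no longer
    displays as itself but as "a=3Db", which is the objection: the
    well-formed names pay for the existence of the ill-formed ones.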

    > Simply
    > encoding 128 characters in the Unicode Standard ostensibly to
    > serve this purpose is no guarantee whatsoever that anyone would
    > actually implement and support them in the universal way you
    > envision, any more than they might a "=93", "=94" convention.

    Why not, when it's so easy to do so? And they'd be *there*,
    reserved, unassignable for actual character encoding.

    Plane E would be a plausible location.
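
    A minimal Python sketch of how such a reserved block might be
    used follows. Everything specific in it is an assumption made for
    illustration: the base code point U+E1000 is an arbitrary choice
    in Plane 14 and is not reserved for this purpose by any standard,
    and the function names are hypothetical. Python's
    "surrogateescape" error handler serves only as a way to find the
    invalid bytes.

```python
# Hypothetical block of 128 code points standing in for bytes 0x80..0xFF.
# U+E1000 is an assumption for illustration, not an actual Unicode allocation.
RESERVED_BASE = 0xE1000


def decode_filename(raw: bytes) -> str:
    """Decode UTF-8, mapping each invalid byte to a reserved code point."""
    # surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(chr(RESERVED_BASE + (cp - 0xDC80)))
        else:
            out.append(ch)
    return "".join(out)


def encode_filename(text: str) -> bytes:
    """Invert decode_filename: reserved code points become single raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if RESERVED_BASE <= cp < RESERVED_BASE + 128:
            out.append(0x80 + (cp - RESERVED_BASE))
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

    In this sketch a well-formed name such as b"a=b" decodes to
    "a=b" unchanged, and only the invalid bytes are affected. The
    round trip holds only as long as the reserved code points never
    occur in genuine text, which is what reserving them is meant to
    ensure.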

    John Cowan <>
    I amar prestar aen, han mathon ne nen,
    han mathon ne chae, a han noston ne 'wilith.  --Galadriel, LOTR:FOTR