Re: Roundtripping in Unicode

From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Dec 15 2004 - 05:51:51 CST

    On 15/12/2004 00:22, Mike Ayers wrote:

    >
    > > From: Peter Kirk [mailto:peterkirk@qaya.org]
    > > Sent: Tuesday, December 14, 2004 3:37 PM
    >
    > > Thanks for the clarification. Perhaps the bifurcation could
    > > be better expressed as into "strings of characters as defined
    > > by the locale" and "strings of non-null octets". Then I could
    > > re-express this as "the only safe way out of this mess is
    > > never to process filenames as strings of characters as
    > > defined by the locale".
    >
    > That would not be correct for ISO 8859 locales, though
    > (amongst others). That's why I specified UTF-8. Although other
    > locales may have the problem of invalid sequences, we're only
    > interested in UTF-8 here.
    >

    But surely octets 0x80 to 0x9f are (at least mostly) invalid in ISO
    8859? Some applications may choose to process these invalid octets
    as if they were valid characters, displaying them as boxes or not at
    all (which is a security risk). Others, especially those concerned
    with security, do in fact treat them as errors in one way or another.
    For example, Marcin noted for Mozilla:

    >If a filename ... can be
    >converted but contains characters like 0x80-0x9F in ISO-8859-2,
    >they are displayed as question marks and the file is inaccessible.
    >

    It should be treated as a general issue with ALL locales and character
    sets (with perhaps just a few exceptions) that not all sequences of
    octets represent valid character strings. UTF-8 is by no means a special
    case here.
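
    As a rough illustration of this point, here is a minimal Python
    sketch. The helper name is_valid_text is hypothetical, and rejecting
    the C1 range is an assumed policy rather than anything the standards
    require. Python's own ISO 8859 codecs decode 0x80-0x9F to the C1
    control characters without raising an error, so the sketch checks
    for them explicitly.

```python
def is_valid_text(octets: bytes, encoding: str) -> bool:
    """Return True if octets decode strictly and contain no C1 controls.

    Hypothetical helper: treating C1 controls as invalid is an assumed
    policy, chosen to mirror the behaviour discussed in this thread.
    """
    try:
        text = octets.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        # The octet sequence is not well-formed in this encoding.
        return False
    # Python's ISO 8859 codecs map 0x80-0x9F to U+0080-U+009F instead of
    # raising, so reject those code points explicitly.
    return not any(0x80 <= ord(ch) <= 0x9F for ch in text)


if __name__ == "__main__":
    name = b"caf\xe9"  # "cafe" with e-acute, encoded as ISO 8859-2
    for enc in ("utf-8", "iso8859_2"):
        print(enc, is_valid_text(name, enc))
```

    The same octets pass in one locale and fail in another, and an
    octet such as 0x85 is rejected under both UTF-8 and ISO 8859-2 by
    this policy, though for different reasons.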

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

