RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Tue Dec 07 2004 - 12:14:36 CST

    Doug Ewell wrote:

    > John Cowan <jcowan at reutershealth dot com> wrote:
    >
    > > Windows filesystems do know what encoding they use. But a filename on
    > > a Unix(oid) file system is a mere sequence of octets, of which only 00
    > > and 2F are interpreted. (Filenames containing 20, and especially 0A,
    > > are annoying to handle with standard tools, but not illegal.)
    > >
    > > How these octet sequences are translated to characters, if at all,
    > > is no concern of the file system's. Some higher-level tools, such as
    > > directory listers and shells, have hardwired assumptions, others have
    > > changeable assumptions, but all are assumptions.
    >
    > OK, fair enough. Under a Unixoid file system, a file name consists of a
    > more or less arbitrary sequence of bytes, essentially unregulated by the
    > OS.
    >
    > If interpreted as UTF-8, some of these sequences may be invalid, and the
    > files may be inaccessible.
    >
    > This is *exactly* the same scenario as with GB 2312, or Shift-JIS, or KS
    > C 5601, or ISO 6937, or any other multibyte character encoding ever
    > devised.
    >
    > This is not a problem that needs to be solved within Unicode, any more
    > than it needed to be solved within those other encodings.
    >
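
    To make the scenario quoted above concrete, here is a minimal Python
    sketch. It assumes nothing about any real file system: the helper name
    is_valid_utf8 and the sample file names are made up for illustration.
    It only shows that the same visible name, stored as Latin-1 octets, is
    not a valid UTF-8 sequence.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw file-name octets decode as strict UTF-8."""
    try:
        name.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The same visible name, written by tools that assumed different encodings.
# (Hypothetical sample names, not taken from the thread.)
utf8_name = "caf\u00e9.txt".encode("utf-8")      # E9 becomes C3 A9
latin1_name = "caf\u00e9.txt".encode("latin-1")  # E9 stays a single octet

for raw in (utf8_name, latin1_name):
    status = "valid" if is_valid_utf8(raw) else "INVALID"
    print(raw, status, "as UTF-8")
```

    In the Latin-1 name the octet E9 looks like the lead byte of a
    three-byte UTF-8 sequence, but the next octet (2E, the dot) is not a
    continuation byte, so a strict decoder rejects the name.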

    Shift-JIS was typically not mixed with other encodings, except for pure
    7-bit ASCII. UTF-8 will be. And Shift-JIS had other serious problems, such
    as 0x5C (the ASCII backslash) appearing as a trail byte. UTF-8 has learned
    a lot from Shift-JIS. If there is anything still to learn, then let's
    welcome that.
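
    A small Python sketch of the backslash problem, for illustration only.
    It relies on Python's built-in shift_jis codec; the choice of U+8868 as
    the sample character is mine, not something from this thread.

```python
BACKSLASH = 0x5C  # the ASCII backslash octet

sample = "\u8868"  # a common kanji whose Shift-JIS trail byte is 0x5C

sjis = sample.encode("shift_jis")
utf8 = sample.encode("utf-8")

# In Shift-JIS the second octet of this character is the backslash octet,
# so byte-oriented tools that treat 5C as an escape or path separator can
# misread the text.
print("Shift-JIS:", sjis.hex(" "), "contains 5C:", BACKSLASH in sjis)

# In UTF-8 every octet of a non-ASCII character is 0x80 or above, so an
# ASCII octet such as 5C or 2F never appears inside a multibyte sequence.
print("UTF-8:    ", utf8.hex(" "), "contains 5C:", BACKSLASH in utf8)
```

    This is one of the lessons UTF-8 took on board: octets below 0x80
    always stand for themselves.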

    Also, Shift-JIS (and other MBCS encodings) were a must for those cultures.
    UTF-8 is not a must. If there are problems, there will be complaints.
    And resistance.

    Lars


