Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Doug Ewell ([email protected])
Date: Mon Dec 06 2004 - 14:59:03 CST

Next message: Patrick Andries: "Re: Pour sauver la patrimoine de l'Imprimerie Nationale de France"

Previous message: John Cowan: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
In reply to: John Cowan: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
Next in thread: Antoine Leca: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"

John Cowan <jcowan at reutershealth dot com> wrote:

> Windows filesystems do know what encoding they use. But a filename on
> a Unix(oid) file system is a mere sequence of octets, of which only 00
> and 2F are interpreted. (Filenames containing 20, and especially 0A,
> are annoying to handle with standard tools, but not illegal.)
>
> How these octet sequences are translated to characters, if at all,
> is no concern of the file system's. Some higher-level tools, such as
> directory listers and shells, have hardwired assumptions, others have
> changeable assumptions, but all are assumptions.

OK, fair enough. Under a Unixoid file system, a file name consists of a
more or less arbitrary sequence of bytes, essentially unregulated by the
OS.
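
To make the "essentially unregulated" point concrete, here is a minimal sketch in Python of the only check the file system itself applies to a single name. The function name is made up for illustration, and the sketch assumes a typical Unix file system; it ignores length limits such as NAME_MAX and the reserved entries "." and "..".

```python
def is_legal_unix_component(name: bytes) -> bool:
    """Return True if `name` could be one path component on a typical Unix file system.

    Hypothetical helper for illustration only. It ignores length limits
    (NAME_MAX) and the reserved entries "." and "..".
    """
    # Only two octets are interpreted: 0x00 ends the name and 0x2F ("/")
    # separates components. Every other octet is accepted as-is.
    return len(name) > 0 and b"\x00" not in name and b"/" not in name
```

Note that spaces, newlines, and bytes that form no valid character in any encoding all pass this check, which is exactly John's point.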

If these names are interpreted as UTF-8, some of the byte sequences may
be invalid, and the files may be inaccessible to software that insists
on well-formed UTF-8.
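
As a rough illustration, a tool could sort raw file names into those that are well-formed UTF-8 and those that are not. This is only a sketch: both function names are invented here, and it relies on Python's strict UTF-8 decoder as the validity test.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if `name` decodes as well-formed UTF-8."""
    try:
        name.decode("utf-8")  # error handling is strict by default
    except UnicodeDecodeError:
        return False
    return True


def split_by_utf8_validity(names):
    """Partition raw file names into (valid, invalid) lists of bytes."""
    valid = [n for n in names if is_valid_utf8(n)]
    invalid = [n for n in names if not is_valid_utf8(n)]
    return valid, invalid
```

In Python, passing a bytes path such as os.listdir(b".") returns the names as raw bytes, so a list obtained that way could be fed straight into split_by_utf8_validity.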

This is *exactly* the same scenario as with GB 2312, or Shift-JIS, or KS
C 5601, or ISO 6937, or any other multibyte character encoding ever
devised.
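
The same experiment can be run against several multibyte encodings at once. The sketch below is an assumption-laden illustration, not a survey: the function name is invented, Python's "euc_kr" codec stands in for KS C 5601, and ISO 6937 is left out because the Python standard library ships no codec for it.

```python
# Codec names as known to Python's standard library. "euc_kr" is used here
# as a stand-in for KS C 5601; ISO 6937 has no standard-library codec.
ENCODINGS = ["utf-8", "shift_jis", "gb2312", "euc_kr"]


def failing_encodings(name: bytes, encodings=ENCODINGS):
    """Return the encodings under which `name` is not a valid byte sequence."""
    failures = []
    for enc in encodings:
        try:
            name.decode(enc)  # strict decoding: any invalid sequence raises
        except UnicodeDecodeError:
            failures.append(enc)
    return failures
```

A plain ASCII name such as b"report.txt" decodes under all four, while a name ending in a stray 0xFF byte is rejected by every one of them, UTF-8 included.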

This is not a problem that needs to be solved within Unicode, any more
than it needed to be solved within those other encodings.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Mon Dec 06 2004 - 15:02:31 CST