mbox series

[GIT,PULL] pidfd updates

Message ID	20230421-kurstadt-stempeln-3459a64aef0c@brauner
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Christian Brauner <brauner@kernel.org> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christian Brauner <brauner@kernel.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [GIT PULL] pidfd updates Date: Fri, 21 Apr 2023 15:41:10 +0200 Message-Id: <20230421-kurstadt-stempeln-3459a64aef0c@brauner> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[GIT,PULL] pidfd updates \| [GIT,PULL] pidfd updates

Pull-request

git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/v6.4/pidfd.file

Message

Christian Brauner April 21, 2023, 1:41 p.m. UTC

  Hey Linus,

/* Summary */
This adds a new pidfd_prepare() helper which allows the caller to
reserve a pidfd number and allocates a new pidfd file that stashes the
provided struct pid.

It should be avoided installing a file descriptor into a task's file
descriptor table just to close it again via close_fd() in case an
error occurs. The fd has been visible to userspace and might already be
in use. Instead, a file descriptor should be reserved but not installed
into the caller's file descriptor table.

If another failure path is hit then the reserved file descriptor and
file can just be put without any userspace visible side-effects. And if
all failure paths are cleared the file descriptor and file can be
installed into the task's file descriptor table.

This helper is now used in all places that open coded this functionality
before. For example, this is currently done during copy_process() and
fanotify used pidfd_create(), which returns a pidfd that has already
been made visibile in the caller's file descriptor table, but then
closed it using close_fd().

In one of the next merge windows there is also new functionality coming
to unix domain sockets that will have to rely on pidfd_prepare().

/* Testing */
clang: Ubuntu clang version 15.0.6
gcc: (Ubuntu 12.2.0-3ubuntu1) 12.2.0

All patches are based on 6.3-rc4 and have been sitting in linux-next.
No build failures or warnings were observed. All old and new tests in
fstests, selftests, and LTP pass without regressions.

/* Conflicts */
At the time of creating this PR no merge conflicts were reported from
linux-next and no merge conflicts showed up doing a test-merge with
current mainline.

The following changes since commit 197b6b60ae7bc51dd0814953c562833143b292aa:

  Linux 6.3-rc4 (2023-03-26 14:40:20 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/v6.4/pidfd.file

for you to fetch changes up to eee3a0e93924f2aab8ebaa7f2e26fd0f3b33f9e7:

  fanotify: use pidfd_prepare() (2023-04-03 11:16:57 +0200)

Please consider pulling these changes from the signed v6.4/pidfd.file tag.

Thanks!
Christian

----------------------------------------------------------------
v6.4/pidfd.file

----------------------------------------------------------------
Christian Brauner (3):
      pid: add pidfd_prepare()
      fork: use pidfd_prepare()
      fanotify: use pidfd_prepare()

 fs/notify/fanotify/fanotify_user.c | 13 +++--
 include/linux/pid.h                |  1 +
 kernel/fork.c                      | 98 +++++++++++++++++++++++++++++++++-----
 kernel/pid.c                       | 19 +++-----
 4 files changed, 104 insertions(+), 27 deletions(-)

Comments

Linus Torvalds April 24, 2023, 8:24 p.m. UTC | #1

On Fri, Apr 21, 2023 at 6:42 AM Christian Brauner <brauner@kernel.org> wrote:
>
> This adds a new pidfd_prepare() helper which allows the caller to
> reserve a pidfd number and allocates a new pidfd file that stashes the
> provided struct pid.

So I've pulled this, but I have to say that I think the "cleanup" part
is pretty nasty - and that's also visible in the interface.

In particular, pidfd_prepare() ends up returning two things, so you
have that ugly "return by reference of the third arguments", but you
also end up not being able to have one single cleanup, you have to
(continue) to do a pair of them:

    put_unused_fd(pidfd);
    fput(pidfd_file);

and I really think the old model of just having a single return value,
and doing a single "close()" was a much nicer in many ways.

Now, the reason I pulled is that I agree that actually making the fd
*visible* to user space is a big mistake.

But I really think a potentially much nicer model would have been to
extend our "get_unused_fd_flags()" model.

IOW, we could have instead marked the 'struct file *' in the file
descriptor table as being "not ready yet".

I wonder how nasty it would have been to have the low bit of the
'struct file *' mark "not ready to be used yet" or something similar.
You already can't just access the 'fdt->fd[]' array willy-nilly since
we have both normal RCU issues _and_ the somewhat unusual spectre
array indexing issues.

So looking around with

    git grep -e '->fd\['

we seem to be pretty good about that and it probably wouldn't be too
horrid to add a "check low bit isn't set" to the rules.

Then pidfd_prepare() could actually install the file pointer in the fd
table, just marked as "not ready", and then instead of "fd_install()",
yuo'd have "fd_expose(fd)" or something.

I dislike interfaces that return two different things. Particularly
ones that are supposed to be there to make things easy for the user. I
think your pidfd_prepare() helper fails that "make it easy to use"
test.

Hmm?

Anyway, I think it's an improvement even so, but I wanted to just say
that I think this could maybe have been done with a nicer model.

               Linus

Linus Torvalds April 24, 2023, 8:35 p.m. UTC | #2

On Mon, Apr 24, 2023 at 1:24 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I really think a potentially much nicer model would have been to
> extend our "get_unused_fd_flags()" model.
>
> IOW, we could have instead marked the 'struct file *' in the file
> descriptor table as being "not ready yet".

Maybe that isn't worth it just for pdfd, but I have this feeling that
it might make some other code simpler too.

That pidfd case isn't the only one that has to carry both a file
pointer and a fd around.

Looking around, I get the feeling that quite a lot of users of
"fd_install()" might actually have been happier if they could just
install it early, and then just have a "fd_expose(fd)" for the success
case, and for the error cases have "put_unused_fd(fd)" also do the
fput on the file descriptor even if the low bit was set. One less
thing to worry about.

I dunno. Maybe not worth it. That "two return values" just made me go "Eww".

               Linus

pr-tracker-bot@kernel.org April 24, 2023, 9:45 p.m. UTC | #3

The pull request you sent on Fri, 21 Apr 2023 15:41:10 +0200:

> git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/v6.4/pidfd.file

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/ec40758b31ef6f492a48267e9e02edff6b4d62c9

Thank you!

Al Viro April 25, 2023, 6:04 a.m. UTC | #4

On Mon, Apr 24, 2023 at 01:24:24PM -0700, Linus Torvalds wrote:

> But I really think a potentially much nicer model would have been to
> extend our "get_unused_fd_flags()" model.
> 
> IOW, we could have instead marked the 'struct file *' in the file
> descriptor table as being "not ready yet".
> 
> I wonder how nasty it would have been to have the low bit of the
> 'struct file *' mark "not ready to be used yet" or something similar.
> You already can't just access the 'fdt->fd[]' array willy-nilly since
> we have both normal RCU issues _and_ the somewhat unusual spectre
> array indexing issues.
> 
> So looking around with
> 
>     git grep -e '->fd\['
> 
> we seem to be pretty good about that and it probably wouldn't be too
> horrid to add a "check low bit isn't set" to the rules.
> 
> Then pidfd_prepare() could actually install the file pointer in the fd
> table, just marked as "not ready", and then instead of "fd_install()",
> yuo'd have "fd_expose(fd)" or something.
> 
> I dislike interfaces that return two different things. Particularly
> ones that are supposed to be there to make things easy for the user. I
> think your pidfd_prepare() helper fails that "make it easy to use"
> test.
> 
> Hmm?

I'm not fond of "return two things" kind of helpers, but I'm even less
fond of "return fd, file is already there" ones, TBH.  {__,}pidfd_prepare()
users are thankfully very limited in the things they do to the file that
had been returned, but that really invites abuse.

The deeper in call chain we mess with descriptor table, the more painful it
gets, IME.

Speaking of {__,}pidfd_prepare(), I wonder if we wouldn't be better off
with get_unused_fd_flags() lifted into the callers - all three of those
(fanotify copy_event_to_user(), copy_process() and pidfd_create()).
Switch from anon_inode_getfd() to anon_inode_getfile() certainly
made sense, ditto for combining it with get_pid(), but mixing
get_unused_fd_flags() into that is a mistake, IMO.

As for your suggestion... let's see what it leads to.

	Suppose we add such entries (reserved, hold a reference to file,
marked "not yet available" in the LSB).  From the current tree POV those
would be equivalent to descriptor already reserved, but fd_install() not
done.  So behaviour of existing primitives should be the same as for this
situation, except for fd_install() and put_unused_fd().

	* pick_file(), __fget_files_rcu(), iterate_fd(), files_lookup_fd_raw(),
	  loop in dup_fd(), io_close() - treat odd pointers as NULL.
	* close_files() should, AFAICS, treat an odd pointer as "should never
happen" (and that xchg() in there needs to go anyway - it's pointless, since
we are freeing the the array immediately afterwards.
	* do_close_on_exec() should probably treat them as "should never happen".
	* do_dup2() - odd value should be treated as -EBUSY.

The interesting part, of course, is how to legitimize (or dispose of) such
a beast.  The former is your "fd_expose()" - parallel to fd_install(),
AFAICS.  The latter... another primitive that would
	grab ->files_lock
	pick_file() variant that *expects* an odd value
	drop ->files_lock
	clear LSB and pass to fput().

It's doable, but AFAICS doesn't make callers all that happier...

Christian Brauner April 25, 2023, 12:08 p.m. UTC | #5

On Mon, Apr 24, 2023 at 01:35:03PM -0700, Linus Torvalds wrote:
> On Mon, Apr 24, 2023 at 1:24 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > But I really think a potentially much nicer model would have been to
> > extend our "get_unused_fd_flags()" model.
> >
> > IOW, we could have instead marked the 'struct file *' in the file
> > descriptor table as being "not ready yet".
> 
> Maybe that isn't worth it just for pdfd, but I have this feeling that
> it might make some other code simpler too.
> 
> That pidfd case isn't the only one that has to carry both a file
> pointer and a fd around.
> 
> Looking around, I get the feeling that quite a lot of users of
> "fd_install()" might actually have been happier if they could just
> install it early, and then just have a "fd_expose(fd)" for the success
> case, and for the error cases have "put_unused_fd(fd)" also do the
> fput on the file descriptor even if the low bit was set. One less
> thing to worry about.
> 
> I dunno. Maybe not worth it. That "two return values" just made me go "Eww".

I'm not fond of "two return values" - in C at least - as well I think
open-coding get_unused_fd() is pretty nasty as well...
Let me see...

Christian Brauner April 25, 2023, 12:34 p.m. UTC | #6

On Tue, Apr 25, 2023 at 07:04:27AM +0100, Al Viro wrote:
> On Mon, Apr 24, 2023 at 01:24:24PM -0700, Linus Torvalds wrote:
> 
> > But I really think a potentially much nicer model would have been to
> > extend our "get_unused_fd_flags()" model.
> > 
> > IOW, we could have instead marked the 'struct file *' in the file
> > descriptor table as being "not ready yet".
> > 
> > I wonder how nasty it would have been to have the low bit of the
> > 'struct file *' mark "not ready to be used yet" or something similar.
> > You already can't just access the 'fdt->fd[]' array willy-nilly since
> > we have both normal RCU issues _and_ the somewhat unusual spectre
> > array indexing issues.
> > 
> > So looking around with
> > 
> >     git grep -e '->fd\['
> > 
> > we seem to be pretty good about that and it probably wouldn't be too
> > horrid to add a "check low bit isn't set" to the rules.
> > 
> > Then pidfd_prepare() could actually install the file pointer in the fd
> > table, just marked as "not ready", and then instead of "fd_install()",
> > yuo'd have "fd_expose(fd)" or something.
> > 
> > I dislike interfaces that return two different things. Particularly
> > ones that are supposed to be there to make things easy for the user. I
> > think your pidfd_prepare() helper fails that "make it easy to use"
> > test.
> > 
> > Hmm?
> 
> I'm not fond of "return two things" kind of helpers, but I'm even less
> fond of "return fd, file is already there" ones, TBH.  {__,}pidfd_prepare()
> users are thankfully very limited in the things they do to the file that
> had been returned, but that really invites abuse.

It's only exposed to kernel core code for good reasons.

> 
> The deeper in call chain we mess with descriptor table, the more painful it
> gets, IME.
> 
> Speaking of {__,}pidfd_prepare(), I wonder if we wouldn't be better off
> with get_unused_fd_flags() lifted into the callers - all three of those
> (fanotify copy_event_to_user(), copy_process() and pidfd_create()).
> Switch from anon_inode_getfd() to anon_inode_getfile() certainly
> made sense, ditto for combining it with get_pid(), but mixing
> get_unused_fd_flags() into that is a mistake, IMO.

I agree with mostly everything here except for get_unused_fd_flags()
being lifted into the callers. That's what I tried to get rid of in
kernel/fork.c.

It is rife with misunderstandings just looking at what we did in
kernel/fork.c earlier:

	retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
        [...]
        pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
                                     O_RDWR | O_CLOEXEC);

seeing where both get_unused_fd_flags() and both *_getfile() take flag
arguments. Sure, for us this is pretty straightforward since we've seen
that code a million times. For others it's confusing why there's two
flag arguments. Sure, we could use one flags argument but it still be
weird to look at.

But with this api we also force all users to remember that they need to
cleanup the fd and the file - but definitely one - depending on where
they fail.

But conceptually the fd and the file belong together. After all it's the
file we're about to make that reserved fd refer to.

But I'm not here to lambast this api. It works nicely overall and that
reserve + install model is pretty elegant. But see for my boring
compromise proposal below...

> 
> As for your suggestion... let's see what it leads to.
> 
> 	Suppose we add such entries (reserved, hold a reference to file,
> marked "not yet available" in the LSB).  From the current tree POV those
> would be equivalent to descriptor already reserved, but fd_install() not
> done.  So behaviour of existing primitives should be the same as for this
> situation, except for fd_install() and put_unused_fd().
> 
> 	* pick_file(), __fget_files_rcu(), iterate_fd(), files_lookup_fd_raw(),
> 	  loop in dup_fd(), io_close() - treat odd pointers as NULL.
> 	* close_files() should, AFAICS, treat an odd pointer as "should never
> happen" (and that xchg() in there needs to go anyway - it's pointless, since
> we are freeing the the array immediately afterwards.
> 	* do_close_on_exec() should probably treat them as "should never happen".
> 	* do_dup2() - odd value should be treated as -EBUSY.
> 
> The interesting part, of course, is how to legitimize (or dispose of) such
> a beast.  The former is your "fd_expose()" - parallel to fd_install(),
> AFAICS.  The latter... another primitive that would
> 	grab ->files_lock
> 	pick_file() variant that *expects* an odd value
> 	drop ->files_lock
> 	clear LSB and pass to fput().
> 
> It's doable, but AFAICS doesn't make callers all that happier...

In the context of using pidfd for some networking stuff we had a similar
discussion because it's the same problem only worse. Think of a
scenario where you need to allocate the fd and file early on and
multiple function calls later you only get to install fd and file. In
that case you need to drag that fd and file around everywhere so you can
then fd_install it... Sure, we could do this "semi-install fd into
fdtable thing" but I think that's too much subtlety and the fdtable is
traumatic enough as it is.

But what about something like the following where we just expose a very
barebones api that allows and encourages callers to bundle fd and file.

Hell, you could even extend that proposal below to wrap the
put_user()...

struct fd_file {
	struct file *file;
	int fd;
	int __user *fd_user;
};

and

static inline int fd_publish_user(struct fd_file *fdf)
{
	int ret = 0;

	if (fdf->fd_user)
		ret = put_user(fdf->fd, fdf->fd_user);

	if (ret)
		fd_discard(fdf)
	else
		fd_publish(fdf)

	return 0;
}

which is also a pretty common pattern...

From d6b48884d33f9a2e4506589d9380bf2036e43b8a Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Tue, 25 Apr 2023 14:21:18 +0200
Subject: [PATCH] UNTESTED UNTESTED DRAFT DRAFT

---
 include/linux/file.h | 20 ++++++++++++++++++++
 include/linux/pid.h  |  3 ++-
 kernel/fork.c        | 35 +++++++++++++++++++----------------
 kernel/pid.c         | 14 +++++++-------
 4 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/include/linux/file.h b/include/linux/file.h
index 39704eae83e2..fdadaaa9f70b 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -39,6 +39,11 @@ struct fd {
 #define FDPUT_FPUT       1
 #define FDPUT_POS_UNLOCK 2
 
+struct fd_file {
+	struct file *file;
+	int fd;
+};
+
 static inline void fdput(struct fd fd)
 {
 	if (fd.flags & FDPUT_FPUT)
@@ -90,6 +95,21 @@ extern void put_unused_fd(unsigned int fd);
 
 extern void fd_install(unsigned int fd, struct file *file);
 
+static inline void fd_publish(struct fd_file *fdf)
+{
+	if (fdf->file)
+		fd_install(fdf->fd, fdf->file);
+}
+
+static inline void fd_discard(struct fd_file *fdf)
+{
+	if (fdf->file) {
+		fput(fdf->file);
+		put_unused_fd(fdf->fd);
+
+	}
+}
+
 extern int __receive_fd(struct file *file, int __user *ufd,
 			unsigned int o_flags);
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index b75de288a8c2..0b0695459508 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -75,12 +75,13 @@ extern struct pid init_struct_pid;
 extern const struct file_operations pidfd_fops;
 
 struct file;
+struct fd_file;
 
 extern struct pid *pidfd_pid(const struct file *file);
 struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags);
 struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags);
 int pidfd_create(struct pid *pid, unsigned int flags);
-int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret);
+int pidfd_prepare(struct pid *pid, unsigned int flags, struct fd_file *fdf);
 
 static inline struct pid *get_pid(struct pid *pid)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index bfe73db1c26c..2072d8cc91d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1989,7 +1989,8 @@ const struct file_operations pidfd_fops = {
  *         error, a negative error code is returned from the function and the
  *         last argument remains unchanged.
  */
-static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
+static int __pidfd_prepare(struct pid *pid, unsigned int flags,
+			   struct fd_file *fdf)
 {
 	int pidfd;
 	struct file *pidfd_file;
@@ -2008,8 +2009,11 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 		return PTR_ERR(pidfd_file);
 	}
 	get_pid(pid); /* held by pidfd_file now */
-	*ret = pidfd_file;
-	return pidfd;
+
+	fdf->file = pidfd_file;
+	fdf->fd = pidfd;
+
+	return 0;
 }
 
 /**
@@ -2038,12 +2042,12 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  *         error, a negative error code is returned from the function and the
  *         last argument remains unchanged.
  */
-int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
+int pidfd_prepare(struct pid *pid, unsigned int flags, struct fd_file *fdf)
 {
 	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
 		return -EINVAL;
 
-	return __pidfd_prepare(pid, flags, ret);
+	return __pidfd_prepare(pid, flags, fdf);
 }
 
 static void __delayed_free_task(struct rcu_head *rhp)
@@ -2106,10 +2110,12 @@ __latent_entropy struct task_struct *copy_process(
 					int node,
 					struct kernel_clone_args *args)
 {
-	int pidfd = -1, retval;
+	int retval;
+	struct fd_file pidfd = {
+		.fd = -1,
+	};
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
-	struct file *pidfile = NULL;
 	const u64 clone_flags = args->flags;
 	struct nsproxy *nsp = current->nsproxy;
 
@@ -2395,12 +2401,11 @@ __latent_entropy struct task_struct *copy_process(
 	 */
 	if (clone_flags & CLONE_PIDFD) {
 		/* Note that no task has been attached to @pid yet. */
-		retval = __pidfd_prepare(pid, O_RDWR | O_CLOEXEC, &pidfile);
+		retval = __pidfd_prepare(pid, O_RDWR | O_CLOEXEC, &pidfd);
 		if (retval < 0)
 			goto bad_fork_free_pid;
-		pidfd = retval;
 
-		retval = put_user(pidfd, args->pidfd);
+		retval = put_user(pidfd.fd, args->pidfd);
 		if (retval)
 			goto bad_fork_put_pidfd;
 	}
@@ -2584,8 +2589,8 @@ __latent_entropy struct task_struct *copy_process(
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
-	if (pidfile)
-		fd_install(pidfd, pidfile);
+	if (clone_flags & CLONE_PIDFD)
+		fd_publish(&pidfd);
 
 	proc_fork_connector(p);
 	sched_post_fork(p);
@@ -2605,10 +2610,8 @@ __latent_entropy struct task_struct *copy_process(
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p, args);
 bad_fork_put_pidfd:
-	if (clone_flags & CLONE_PIDFD) {
-		fput(pidfile);
-		put_unused_fd(pidfd);
-	}
+	if (clone_flags & CLONE_PIDFD)
+		fd_discard(&pidfd);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
diff --git a/kernel/pid.c b/kernel/pid.c
index f93954a0384d..6cd46002aca3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -594,15 +594,15 @@ struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags)
  */
 int pidfd_create(struct pid *pid, unsigned int flags)
 {
-	int pidfd;
-	struct file *pidfd_file;
+	int ret;
+	struct fd_file pidfd;
 
-	pidfd = pidfd_prepare(pid, flags, &pidfd_file);
-	if (pidfd < 0)
-		return pidfd;
+	ret = pidfd_prepare(pid, flags, &pidfd);
+	if (ret < 0)
+		return ret;
 
-	fd_install(pidfd, pidfd_file);
-	return pidfd;
+	fd_publish(&pidfd);
+	return pidfd.fd;
 }
 
 /**

Al Viro April 25, 2023, 1:54 p.m. UTC | #7

On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:

> It is rife with misunderstandings just looking at what we did in
> kernel/fork.c earlier:
> 
> 	retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
>         [...]
>         pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
>                                      O_RDWR | O_CLOEXEC);
> 
> seeing where both get_unused_fd_flags() and both *_getfile() take flag
> arguments. Sure, for us this is pretty straightforward since we've seen
> that code a million times. For others it's confusing why there's two
> flag arguments. Sure, we could use one flags argument but it still be
> weird to look at.

First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and
anon_inode_getfile() - about O_CLOEXEC.  Duplicating the expression in
places like that is cargo-culting.

> But with this api we also force all users to remember that they need to
> cleanup the fd and the file - but definitely one - depending on where
> they fail.
> 
> But conceptually the fd and the file belong together. After all it's the
> file we're about to make that reserved fd refer to.

Why?  The common pattern is
	* reserve the descriptor
	* do some work that might fail
	* get struct file set up (which also might fail)
	* do more work (possibly including copyout, etc.)
	* install file into descriptor table

We want to reserve the descriptor early, since it's about the easiest
thing to undo.  Why bother doing it next to file setup?  That can be
pretty deep in call chain and doing it that way comes with headache
about passing the damn thing around.  And cleanup rules with your
variant end up being more complicated.

Keep the manipulations of descriptor table as close to the surface
as possible.  The real work is with file references; descriptors
belong in userland and passing them around kernel-side is asking
for headache.

Christian Brauner April 25, 2023, 2:36 p.m. UTC | #8

On Tue, Apr 25, 2023 at 02:54:29PM +0100, Al Viro wrote:
> On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:
> 
> > It is rife with misunderstandings just looking at what we did in
> > kernel/fork.c earlier:
> > 
> > 	retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
> >         [...]
> >         pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
> >                                      O_RDWR | O_CLOEXEC);
> > 
> > seeing where both get_unused_fd_flags() and both *_getfile() take flag
> > arguments. Sure, for us this is pretty straightforward since we've seen
> > that code a million times. For others it's confusing why there's two
> > flag arguments. Sure, we could use one flags argument but it still be
> > weird to look at.
> 
> First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and
> anon_inode_getfile() - about O_CLOEXEC.  Duplicating the expression in
> places like that is cargo-culting.

I distinctly remember us having that conversation about how to do this
nicely back then and fwiw this is your patch... ;)
6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups")

So sure, that was my point: people are confused why there's two flag
arguments and what exactly has to go into them and just copy-paste
whatever other users already have.

>  
> > But with this api we also force all users to remember that they need to
> > cleanup the fd and the file - but definitely one - depending on where
> > they fail.
> > 
> > But conceptually the fd and the file belong together. After all it's the
> > file we're about to make that reserved fd refer to.
> 
> Why?  The common pattern is
> 	* reserve the descriptor
> 	* do some work that might fail
> 	* get struct file set up (which also might fail)
> 	* do more work (possibly including copyout, etc.)
> 	* install file into descriptor table
> 
> We want to reserve the descriptor early, since it's about the easiest
> thing to undo.  Why bother doing it next to file setup?  That can be
> pretty deep in call chain and doing it that way comes with headache
> about passing the damn thing around.  And cleanup rules with your
> variant end up being more complicated.
> 
> Keep the manipulations of descriptor table as close to the surface
> as possible.  The real work is with file references; descriptors
> belong in userland and passing them around kernel-side is asking
> for headache.

Imho, there's different use-cases.

There's definitely one where people allocate a file descriptor early on
and then sometimes maybe way later allocate a struct file and install
it. And that's where exposing and using get_unused_fd_flags() and
fd_install() is great and works fine.

But there's also users that do the reserve an fd and allocate a file at
the same time or receive both at the same time as the seccomp notifier,
scm creds, or pidfds. The receive_fd* stuff is basically a version of
the sketch I did without the ability to simply use a struct and not
having to open-code everything multiple times.

I never expected excitement around this as it was difficult enough to
get that receive_fd* thing going so I'm not arguing for this to be the
ultima ratio...

Al Viro April 25, 2023, 3:48 p.m. UTC | #9

On Tue, Apr 25, 2023 at 04:36:27PM +0200, Christian Brauner wrote:
> On Tue, Apr 25, 2023 at 02:54:29PM +0100, Al Viro wrote:
> > On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:
> > 
> > > It is rife with misunderstandings just looking at what we did in
> > > kernel/fork.c earlier:
> > > 
> > > 	retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
> > >         [...]
> > >         pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
> > >                                      O_RDWR | O_CLOEXEC);
> > > 
> > > seeing where both get_unused_fd_flags() and both *_getfile() take flag
> > > arguments. Sure, for us this is pretty straightforward since we've seen
> > > that code a million times. For others it's confusing why there's two
> > > flag arguments. Sure, we could use one flags argument but it still be
> > > weird to look at.
> > 
> > First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and
> > anon_inode_getfile() - about O_CLOEXEC.  Duplicating the expression in
> > places like that is cargo-culting.
> 
> I distinctly remember us having that conversation about how to do this
> nicely back then and fwiw this is your patch... ;)
> 6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups")

Should've followed with "no need to pass nonsense flags to get_unused_fd_flags()
and anon_inode_getfile()"...

> So sure, that was my point: people are confused why there's two flag
> arguments and what exactly has to go into them and just copy-paste
> whatever other users already have.

> There's definitely one where people allocate a file descriptor early on
> and then sometimes maybe way later allocate a struct file and install
> it. And that's where exposing and using get_unused_fd_flags() and
> fd_install() is great and works fine.

FWIW, there's something I toyed with a while ago - a primitive along the
lines of
fd_set_file(fd, file)
{
	if (IS_ERR(file)) {
		put_unused_fd(fd);
		return ERR_PTR(file);
	}
	fd_install(fd, file);
	return fd;
}

That simplifies quite a few places, collapsing failure exit handling.
Not sure how many can be massaged into that form, though...

Linus Torvalds April 25, 2023, 4:28 p.m. UTC | #10

On Tue, Apr 25, 2023 at 5:34 AM Christian Brauner <brauner@kernel.org> wrote:
>
> Hell, you could even extend that proposal below to wrap the
> put_user()...
>
> struct fd_file {
>         struct file *file;
>         int fd;
>         int __user *fd_user;
> };

So I don't like this extended version, but your proposal patch below
looks good to me.

Why? Simply because the "two-word struct" is actually a good way to
return two values. But a three-word one would be passed on the stack.

Both gcc and clang return small structs (where "small" is literally
just two words) in registers, and it's part of most (all?) ABIs and
we've relied on that before.

It's kind of the same thing as returning a "long long" in two
registers, except it works with structs too.

We've relied on that for ages, with things like 'struct pte' often
being that kind of two-word struct (going back a *loong* time to the
bad old days of 32-bit x86 and PAE).

And in the vfs layer we actually also do it for 'struct fd', which is
a very similar construct.

So yes, doing something like this:

> +struct fd_file {
> +       struct file *file;
> +       int fd;
> +};

would make me happy, and still return just "one" value, it's just that
the value now is both of those things we need.

And then your helpers:

> +static inline void fd_publish(struct fd_file *fdf)
> +static inline void fd_discard(struct fd_file *fdf)

solves my other issue, except I'd literally make it pass those structs
by value, exactly like fdget/fdput does.

Now, since they are inline functions, the code generation doesn't
really change (compilers are smart enough to not actually generate any
pointer stuff), but I prefer to make things like that expliict, and
have source code that matches the code generation.

(Which is also why I do *not* endorse passing bigger structs by value,
because then the compiler will just pass it as a "pointer to a copy"
instead, again violating the whole concept of "source matches what
happens in reality")

I think the above helper could be improved further with Al's
suggestion to make 'fd_publish()' return an error code, and allow the
file pointer (and maybe even the fd index) to be an error pointer (and
error number), so that you could often unify the error/success paths.

IOW, I like this, and I think it's superior to my stupid original suggestion.

                  Linus

Al Viro April 25, 2023, 5:19 p.m. UTC | #11

On Tue, Apr 25, 2023 at 09:28:54AM -0700, Linus Torvalds wrote:

> Now, since they are inline functions, the code generation doesn't
> really change (compilers are smart enough to not actually generate any
> pointer stuff), but I prefer to make things like that expliict, and
> have source code that matches the code generation.
> 
> (Which is also why I do *not* endorse passing bigger structs by value,
> because then the compiler will just pass it as a "pointer to a copy"
> instead, again violating the whole concept of "source matches what
> happens in reality")
> 
> I think the above helper could be improved further with Al's
> suggestion to make 'fd_publish()' return an error code, and allow the
> file pointer (and maybe even the fd index) to be an error pointer (and
> error number), so that you could often unify the error/success paths.
> 
> IOW, I like this, and I think it's superior to my stupid original suggestion.

We'd better collect the data on the current callers first.  There are
several patterns; I'm going through the old (fairly sparse) notes and
the grep over the current tree right now, will post when I get through
that.

That's one area where we had a *lot* of recurring bugs - mostly of
leak/double put variety.  So we'd better have the calling conventions
right wrt how easy it is to fuck up in failure exits.  And we need
to document the patterns/rules for each/reasons for choosing one over
another.

Note that there's also "set the file up, then get descriptor and either
fd_install or fput, depending on get_unused_fd_flags() success";
sometimes it's the only approach (SCM_RIGHTS, for example), sometimes
it's better than "get descriptor, set the file up, then either install
or release descriptor", sometimes it's definitely worse (e.g. for
O_CREAT it's a non-starter).  It should be a deliberate choice.

Al Viro April 27, 2023, 1:07 a.m. UTC | #12

On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:

> struct fd_file {
> 	struct file *file;
> 	int fd;
> 	int __user *fd_user;

Why is it an int?  Because your case has it that way?

We have a bunch of places where we have an ioctl with
a field in some structure filled that way; any primitive
that combines put_user() with descriptor handling is
going to cause trouble as soon as somebody deals with
a structure where such member is unsigned long.  Gets
especially funny on 64bit big-endian...

And that objection is orthogonal to that 3-member structure -
even if you pass int __user * as an explicit argument into
your helper, the same trouble will be there.

Al, still going through that zoo...

Al Viro April 27, 2023, 7:39 a.m. UTC | #13

On Thu, Apr 27, 2023 at 02:07:15AM +0100, Al Viro wrote:
> On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:
> 
> 
> > struct fd_file {
> > 	struct file *file;
> > 	int fd;
> > 	int __user *fd_user;
> 
> Why is it an int?  Because your case has it that way?
> 
> We have a bunch of places where we have an ioctl with
> a field in some structure filled that way; any primitive
> that combines put_user() with descriptor handling is
> going to cause trouble as soon as somebody deals with
> a structure where such member is unsigned long.  Gets
> especially funny on 64bit big-endian...
> 
> And that objection is orthogonal to that 3-member structure -
> even if you pass int __user * as an explicit argument into
> your helper, the same trouble will be there.
> 
> Al, still going through that zoo...

FWIW, surprisingly large part of messy users in ioctls might get
simplified nicely if we add something along the lines of

int delayed_dup(struct file *file, unsigned flags)
{
	struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL);
	int fd;

	if (likely(p))
		fd = p->fd = get_unused_fd_flags(flags);
	else
		fd = -ENOMEM;
	if (likely(fd >= 0)) {
		p->file = file;
		init_task_work(&p->work, __do_delayed_dup);
		if (!task_work_add(&p, TWA_RESUME))
			return fd;
		put_unused_fd(fd);
		fd = -EINVAL;
	}
	fput(file);
	kfree(p);
	return fd;
}

with

struct delayed_dup {
	struct callback_head work;
	struct file *file;
	unsigned fd;
};

static void __do_delayed_dup(struct callback_head *work)
{
	struct delayed_dup *p = container_of(work, struct delayed_dup, work);

	if (!syscall_get_error(current, current_pt_regs())) {
		fd_install(p->fd, p->file);
	} else {
		put_unused_fd(p->fd);
		fput(p->file);
	}
	kfree(p);
}

Random example (completely untested - I'm not even sure it compiles):

diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c
index af57799c86ce..7c62becfddc9 100644
--- a/drivers/dma-buf/sync_file.c
+++ b/drivers/dma-buf/sync_file.c
@@ -207,55 +207,34 @@ static __poll_t sync_file_poll(struct file *file, poll_table *wait)
 static long sync_file_ioctl_merge(struct sync_file *sync_file,
 				  unsigned long arg)
 {
-	int fd = get_unused_fd_flags(O_CLOEXEC);
 	int err;
 	struct sync_file *fence2, *fence3;
+	struct sync_merge_data __user *argp = (void *)arg;
 	struct sync_merge_data data;
 
-	if (fd < 0)
-		return fd;
-
-	if (copy_from_user(&data, (void __user *)arg, sizeof(data))) {
-		err = -EFAULT;
-		goto err_put_fd;
-	}
+	if (copy_from_user(&data, argp, sizeof(data)))
+		return -EFAULT;
 
-	if (data.flags || data.pad) {
-		err = -EINVAL;
-		goto err_put_fd;
-	}
+	if (data.flags || data.pad)
+		return -EINVAL;
 
 	fence2 = sync_file_fdget(data.fd2);
-	if (!fence2) {
-		err = -ENOENT;
-		goto err_put_fd;
-	}
+	if (!fence2)
+		return -ENOENT;
 
 	data.name[sizeof(data.name) - 1] = '\0';
 	fence3 = sync_file_merge(data.name, sync_file, fence2);
-	if (!fence3) {
+	if (fence3)
+		err = delayed_dup(fence3->file, O_CLOEXEC);
+	else
 		err = -ENOMEM;
-		goto err_put_fence2;
-	}
 
-	data.fence = fd;
-	if (copy_to_user((void __user *)arg, &data, sizeof(data))) {
-		err = -EFAULT;
-		goto err_put_fence3;
+	if (err >= 0) {
+		data.fence = err;
+		err = copy_to_user(argp, &data, sizeof(data)) ? -EFAULT : 0;
 	}
 
-	fd_install(fd, fence3->file);
-	fput(fence2->file);
-	return 0;
-
-err_put_fence3:
-	fput(fence3->file);
-
-err_put_fence2:
 	fput(fence2->file);
-
-err_put_fd:
-	put_unused_fd(fd);
 	return err;
 }
 

IMO it's much easier to follow that way, and that's a fairly typical
example.  One primitive call instead of three, *much* simpler handling
of failure exits, no need to deal with unroll on copy_to_user() failures
(here those were handled; most of the drivers/gpu/drm stuff is buried
quite a few call levels deeper than the place where copyout is done,
so currently it doesn't even try to DTRT in that respect).

It's not a panacea - for that to be useful we need
	* no side effects of file opening that wouldn't be undone by fput()
	* enough work done to make the overhead negligible
	* moderate amount of files opened in one call
	* a pattern that could be massaged into "open file, then deal with
inserting it into descriptor table".

There's a plenty of fd_install() users that match that pattern.
Certainly not all of them, but almost everything in drivers does.

I'm still digging through the rest - there's a couple of other patterns
that seem to be common, but I'm not through the entire pile yet.

Christian Brauner April 27, 2023, 8:11 a.m. UTC | #14

On Thu, Apr 27, 2023 at 02:07:15AM +0100, Al Viro wrote:
> On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:
> 
> 
> > struct fd_file {
> > 	struct file *file;
> > 	int fd;
> > 	int __user *fd_user;
> 
> Why is it an int?  Because your case has it that way?
> 
> We have a bunch of places where we have an ioctl with
> a field in some structure filled that way; any primitive
> that combines put_user() with descriptor handling is
> going to cause trouble as soon as somebody deals with
> a structure where such member is unsigned long.  Gets
> especially funny on 64bit big-endian...
> 
> And that objection is orthogonal to that 3-member structure -
> even if you pass int __user * as an explicit argument into
> your helper, the same trouble will be there.

Ignoring for a second that there are other ways to achieve this. This is
literally the sketch on top of a sketch to encompass an api that _we do
already have today_...

Christian Brauner April 27, 2023, 8:33 a.m. UTC | #15

On Thu, Apr 27, 2023 at 08:39:08AM +0100, Al Viro wrote:
> On Thu, Apr 27, 2023 at 02:07:15AM +0100, Al Viro wrote:
> > On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:
> > 
> > 
> > > struct fd_file {
> > > 	struct file *file;
> > > 	int fd;
> > > 	int __user *fd_user;
> > 
> > Why is it an int?  Because your case has it that way?
> > 
> > We have a bunch of places where we have an ioctl with
> > a field in some structure filled that way; any primitive
> > that combines put_user() with descriptor handling is
> > going to cause trouble as soon as somebody deals with
> > a structure where such member is unsigned long.  Gets
> > especially funny on 64bit big-endian...
> > 
> > And that objection is orthogonal to that 3-member structure -
> > even if you pass int __user * as an explicit argument into
> > your helper, the same trouble will be there.
> > 
> > Al, still going through that zoo...
> 
> FWIW, surprisingly large part of messy users in ioctls might get
> simplified nicely if we add something along the lines of
> 
> int delayed_dup(struct file *file, unsigned flags)
> {
> 	struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL);
> 	int fd;
> 
> 	if (likely(p))
> 		fd = p->fd = get_unused_fd_flags(flags);
> 	else
> 		fd = -ENOMEM;
> 	if (likely(fd >= 0)) {
> 		p->file = file;
> 		init_task_work(&p->work, __do_delayed_dup);
> 		if (!task_work_add(&p, TWA_RESUME))
> 			return fd;
> 		put_unused_fd(fd);
> 		fd = -EINVAL;
> 	}
> 	fput(file);
> 	kfree(p);
> 	return fd;
> }
> 
> with
> 
> struct delayed_dup {
> 	struct callback_head work;
> 	struct file *file;
> 	unsigned fd;
> };
> 
> static void __do_delayed_dup(struct callback_head *work)
> {
> 	struct delayed_dup *p = container_of(work, struct delayed_dup, work);
> 
> 	if (!syscall_get_error(current, current_pt_regs())) {
> 		fd_install(p->fd, p->file);
> 	} else {
> 		put_unused_fd(p->fd);
> 		fput(p->file);
> 	}
> 	kfree(p);
> }
> 
> Random example (completely untested - I'm not even sure it compiles):
> 
> diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c
> index af57799c86ce..7c62becfddc9 100644
> --- a/drivers/dma-buf/sync_file.c
> +++ b/drivers/dma-buf/sync_file.c
> @@ -207,55 +207,34 @@ static __poll_t sync_file_poll(struct file *file, poll_table *wait)
>  static long sync_file_ioctl_merge(struct sync_file *sync_file,
>  				  unsigned long arg)
>  {
> -	int fd = get_unused_fd_flags(O_CLOEXEC);
>  	int err;
>  	struct sync_file *fence2, *fence3;
> +	struct sync_merge_data __user *argp = (void *)arg;
>  	struct sync_merge_data data;
>  
> -	if (fd < 0)
> -		return fd;
> -
> -	if (copy_from_user(&data, (void __user *)arg, sizeof(data))) {
> -		err = -EFAULT;
> -		goto err_put_fd;
> -	}
> +	if (copy_from_user(&data, argp, sizeof(data)))
> +		return -EFAULT;
>  
> -	if (data.flags || data.pad) {
> -		err = -EINVAL;
> -		goto err_put_fd;
> -	}
> +	if (data.flags || data.pad)
> +		return -EINVAL;
>  
>  	fence2 = sync_file_fdget(data.fd2);
> -	if (!fence2) {
> -		err = -ENOENT;
> -		goto err_put_fd;
> -	}
> +	if (!fence2)
> +		return -ENOENT;
>  
>  	data.name[sizeof(data.name) - 1] = '\0';
>  	fence3 = sync_file_merge(data.name, sync_file, fence2);
> -	if (!fence3) {
> +	if (fence3)
> +		err = delayed_dup(fence3->file, O_CLOEXEC);
> +	else
>  		err = -ENOMEM;
> -		goto err_put_fence2;
> -	}
>  
> -	data.fence = fd;
> -	if (copy_to_user((void __user *)arg, &data, sizeof(data))) {
> -		err = -EFAULT;
> -		goto err_put_fence3;
> +	if (err >= 0) {
> +		data.fence = err;
> +		err = copy_to_user(argp, &data, sizeof(data)) ? -EFAULT : 0;
>  	}
>  
> -	fd_install(fd, fence3->file);
> -	fput(fence2->file);
> -	return 0;
> -
> -err_put_fence3:
> -	fput(fence3->file);
> -
> -err_put_fence2:
>  	fput(fence2->file);
> -
> -err_put_fd:
> -	put_unused_fd(fd);
>  	return err;
>  }
>  
> 
> IMO it's much easier to follow that way, and that's a fairly typical
> example.  One primitive call instead of three, *much* simpler handling
> of failure exits, no need to deal with unroll on copy_to_user() failures
> (here those were handled; most of the drivers/gpu/drm stuff is buried
> quite a few call levels deeper than the place where copyout is done,
> so currently it doesn't even try to DTRT in that respect).
> 
> It's not a panacea - for that to be useful we need
> 	* no side effects of file opening that wouldn't be undone by fput()
> 	* enough work done to make the overhead negligible
> 	* moderate amount of files opened in one call
> 	* a pattern that could be massaged into "open file, then deal with
> inserting it into descriptor table".
> 
> There's a plenty of fd_install() users that match that pattern.
> Certainly not all of them, but almost everything in drivers does.
> 
> I'm still digging through the rest - there's a couple of other patterns
> that seem to be common, but I'm not through the entire pile yet.

So the earlier proposal plus added complexity to hide it from users.
I don't feel strongly that we really do need to do something here but if
this is something you feel strongly about then sure. But a few points:

* I don't think this is much of a simplification for file descriptor
  handling in those callers. Judging by the example you pasted here the
  simplifications to this function are much more based on changing the
  general structure. I suspect it'll be the same for a lot of other
  codepaths. But I'm happy to be convinced otherwise.
* This delayed_dup() thing is fairly complex with the task work stuff
  and the memory allocation mixed in there.
* It adds an async pattern into the fd installation path which
  feels like yet another complex add on.

Ultimately what I try to gauge with changes like that is how likely are
people going to use a pattern without someone having to go through the
tree every few months to make sure that the api isn't abused. I'm not
sure that this wouldn't just end up being misused.

File descriptor installation is not core functionality for drivers. It's
just something that they have to do and so it's not that people usually
put a lot of thought into it. So that's why I think an API has to be
dumb enough. A three call api may still be simpler to use than an overly
clever single call api.

Al Viro April 27, 2023, 8:59 a.m. UTC | #16

On Thu, Apr 27, 2023 at 10:33:38AM +0200, Christian Brauner wrote:

> File descriptor installation is not core functionality for drivers. It's
> just something that they have to do and so it's not that people usually
> put a lot of thought into it. So that's why I think an API has to be
> dumb enough. A three call api may still be simpler to use than an overly
> clever single call api.

Grep and you will see...  Seriously, for real mess take a look at e.g.
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c.  See those close_fd() calls in
there?  That's completely wrong - you can't undo the insertion into
descriptor table.

I'm not suggesting that for the core kernel, but there's a plenty of
drivers that do descriptor allocation.

Take a look at e.g. drivers/media/mc/mc-request.c:media_request_alloc().
OK, we open, insert, etc. and we pass the descriptor to caller in
*alloc_fd.  Caller is in drivers/media/mc/mc-device.c:
static long media_device_request_alloc(struct media_device *mdev, void *arg)
and descriptor goes into *(int *)arg there.  That is reached via
static const struct media_ioctl_info ioctl_info[] = {
        MEDIA_IOC(DEVICE_INFO, media_device_get_info, MEDIA_IOC_FL_GRAPH_MUTEX),
	MEDIA_IOC(ENUM_ENTITIES, media_device_enum_entities, MEDIA_IOC_FL_GRAPH_MUTEX),
	MEDIA_IOC(ENUM_LINKS, media_device_enum_links, MEDIA_IOC_FL_GRAPH_MUTEX),
	MEDIA_IOC(SETUP_LINK, media_device_setup_link, MEDIA_IOC_FL_GRAPH_MUTEX),
	MEDIA_IOC(G_TOPOLOGY, media_device_get_topology, MEDIA_IOC_FL_GRAPH_MUTEX),
	MEDIA_IOC(REQUEST_ALLOC, media_device_request_alloc, 0),
};      
used in
static long media_device_ioctl(struct file *filp, unsigned int cmd,
                               unsigned long __arg)
There we have
        ret = info->fn(dev, karg);

	if (info->flags & MEDIA_IOC_FL_GRAPH_MUTEX)
		mutex_unlock(&dev->graph_mutex);

	if (!ret && info->arg_to_user)
		ret = info->arg_to_user(arg, karg, cmd);

array elements are set by
#define MEDIA_IOC_ARG(__cmd, func, fl, from_user, to_user)              \
        [_IOC_NR(MEDIA_IOC_##__cmd)] = {                                \
                .cmd = MEDIA_IOC_##__cmd,                               \
                .fn = func,                                             \
                .flags = fl,                                            \
                .arg_from_user = from_user,                             \
                .arg_to_user = to_user,                                 \
        }

#define MEDIA_IOC(__cmd, func, fl)                                      \
        MEDIA_IOC_ARG(__cmd, func, fl, copy_arg_from_user, copy_arg_to_user)

so this ->arg_to_user() is copy_arg_to_user(), which is
static long copy_arg_to_user(void __user *uarg, void *karg, unsigned int cmd)
{
        if ((_IOC_DIR(cmd) & _IOC_READ) &&
            copy_to_user(uarg, karg, _IOC_SIZE(cmd)))
                return -EFAULT;

        return 0;
}

That copy_to_user() is not attempted until media_device_request_alloc()
returns.  And I don't see any way to make it unroll the insertion into
descriptor table without massive restructuring of the entire thing;
if you do, I'd love to hear it.

This is actually not the worst case - again, drm stuff has a bunch of
such crap, and I don't believe it's feasible to move drm folks from
their "we do copyin and copyout in generic ioctl code, passing the
copy to handlers supplied by drivers and copying whatever they modified
back to userland" approach.

Christian Brauner April 27, 2023, 9:40 a.m. UTC | #17

On Thu, Apr 27, 2023 at 09:59:52AM +0100, Al Viro wrote:
> On Thu, Apr 27, 2023 at 10:33:38AM +0200, Christian Brauner wrote:
> 
> > File descriptor installation is not core functionality for drivers. It's
> > just something that they have to do and so it's not that people usually
> > put a lot of thought into it. So that's why I think an API has to be
> > dumb enough. A three call api may still be simpler to use than an overly
> > clever single call api.
> 
> Grep and you will see...  Seriously, for real mess take a look at e.g.
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c.  See those close_fd() calls in
> there?  That's completely wrong - you can't undo the insertion into
> descriptor table.

I mean, yes. There's literally a description how that's wrong in my pr
message...

> 
> I'm not suggesting that for the core kernel, but there's a plenty of
> drivers that do descriptor allocation.

That's undisputed. What I tried to say was that this is not their main
focus. The point is their design thinking is not centered on fd handling
that's just something they have to do and so they don't think about
cleanly handling for example installing an fd and the closing it again
in some failure path. And I'm not sure one should be just harshly
judging them. They're almost bound to get this wrong. You'd need to know
a bit about file descriptors to have this in mind. File descriptor
handling is subtle.

I mean, look at cachefiles_ondemand_get_fd()...

Linus Torvalds April 27, 2023, 3:21 p.m. UTC | #18

On Thu, Apr 27, 2023 at 12:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> int delayed_dup(struct file *file, unsigned flags)

Ok, this is strange. Let me think about it.

But even without thinking about it, this part I hate:

>         struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL);

Sure, if this is only used in unimportant code where performance
doesn't matter, doing a kmalloc is fine.

But if that is the only use, I think this is too subtle an interface.

Could we instead limit it to "we only have one pending delayed dup",
and make this all be more like the restart-block thing, and be part of
struct task_struct?

I think it's conceptually quite similar to restart_block, ie a "some
pending system call state" thing.

(Obviously it's entirely different in details).

          Linus

Al Viro April 27, 2023, 5:02 p.m. UTC | #19

On Thu, Apr 27, 2023 at 08:21:34AM -0700, Linus Torvalds wrote:
> On Thu, Apr 27, 2023 at 12:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > int delayed_dup(struct file *file, unsigned flags)
> 
> Ok, this is strange. Let me think about it.
> 
> But even without thinking about it, this part I hate:
> 
> >         struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL);
> 
> Sure, if this is only used in unimportant code where performance
> doesn't matter, doing a kmalloc is fine.
> 
> But if that is the only use, I think this is too subtle an interface.

Still hadn't finished with the zoo...

> Could we instead limit it to "we only have one pending delayed dup",
> and make this all be more like the restart-block thing, and be part of
> struct task_struct?

Interesting...  FWIW, *anything* that wants several descriptors has
special needs - there are some places like that (binder, for one)
and they have rather weird code implementing those.

Just to restate the obvious: this is not applicable for the most frequent
caller - open(2).  For the reasons that have nothing to do with performance.
If opening the file has hard-to-reverse side effects (like directory
modification due to O_CREAT), the things are very different.

What I hope for is a small number of patterns, with clear rules for
choosing the one that is applicable and helpers for each that would
reduce the amount of headache when using it.  And I've no problem
with "this particular pattern is not usable if you are adding more
than one descriptor" - that's not hard to understand and verify.
So I'm fine with doing that for one descriptor only and getting
rid of the allocation.

BTW, another pattern is the same sans the "delayed" part.  I.e.
"here's an opened file, get a descriptor and either attach the
file to it or fput() the damn thing; in any case, file reference
is consumed and descriptor-or-error is returned".  That one is
definitely only for single descriptor case.

In any case, I want to finish the survey of the callers first, just to
see what's there and whether anything is worth nicking.

While we are at it, I want to make close_fd() use a very big red flag.
To the point of grepping for new callers in -next and asking the folks
who introduce those to explain WTF they are doing...

David Laight April 28, 2023, 8:40 a.m. UTC | #20

From: Linus Torvalds
> Sent: 25 April 2023 17:29
> 
> On Tue, Apr 25, 2023 at 5:34 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > Hell, you could even extend that proposal below to wrap the
> > put_user()...
> >
> > struct fd_file {
> >         struct file *file;
> >         int fd;
> >         int __user *fd_user;
> > };
> 
> So I don't like this extended version, but your proposal patch below
> looks good to me.
> 
> Why? Simply because the "two-word struct" is actually a good way to
> return two values. But a three-word one would be passed on the stack.
> 
> Both gcc and clang return small structs (where "small" is literally
> just two words) in registers, and it's part of most (all?) ABIs and
> we've relied on that before.

It is definitely architecture dependant.
x86-64 and arm-64 will return two 64bit values in registers.
x86-32 and arm-32 return two 32bit values on stack.

Pretty much everything passes short structures directly by value.
(I'm not sure about sparc-32 though, I'm sure it passed all
structures by reference back in the 1980s)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Linus Torvalds April 28, 2023, 6:26 p.m. UTC | #21

On Fri, Apr 28, 2023 at 1:40 AM David Laight <David.Laight@aculab.com> wrote:
>
> It is definitely architecture dependant.
> x86-64 and arm-64 will return two 64bit values in registers.
> x86-32 and arm-32 return two 32bit values on stack.

Yes, but I think the "return value in two registers" is the generic
thing. At least gcc has very special code to deal with a two-register
return value.

And yes, those registers will then obviously be different sizes on
different architectures, but a tuple like a '(ptr,fd)' thing will fit
regardless.

We very much do depend on this kind of pattern generating reasonable
code in some parts of the kernel.

               Linus

Christian Brauner May 2, 2023, 7:11 a.m. UTC | #22

On Thu, Apr 27, 2023 at 06:02:15PM +0100, Al Viro wrote:
> On Thu, Apr 27, 2023 at 08:21:34AM -0700, Linus Torvalds wrote:
> > On Thu, Apr 27, 2023 at 12:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > int delayed_dup(struct file *file, unsigned flags)
> > 
> > Ok, this is strange. Let me think about it.
> > 
> > But even without thinking about it, this part I hate:
> > 
> > >         struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL);
> > 
> > Sure, if this is only used in unimportant code where performance
> > doesn't matter, doing a kmalloc is fine.
> > 
> > But if that is the only use, I think this is too subtle an interface.
> 
> Still hadn't finished with the zoo...
> 
> > Could we instead limit it to "we only have one pending delayed dup",
> > and make this all be more like the restart-block thing, and be part of
> > struct task_struct?
> 
> Interesting...  FWIW, *anything* that wants several descriptors has
> special needs - there are some places like that (binder, for one)
> and they have rather weird code implementing those.
> 
> Just to restate the obvious: this is not applicable for the most frequent
> caller - open(2).  For the reasons that have nothing to do with performance.
> If opening the file has hard-to-reverse side effects (like directory
> modification due to O_CREAT), the things are very different.
> 
> What I hope for is a small number of patterns, with clear rules for
> choosing the one that is applicable and helpers for each that would
> reduce the amount of headache when using it.  And I've no problem
> with "this particular pattern is not usable if you are adding more
> than one descriptor" - that's not hard to understand and verify.
> So I'm fine with doing that for one descriptor only and getting
> rid of the allocation.
> 
> BTW, another pattern is the same sans the "delayed" part.  I.e.
> "here's an opened file, get a descriptor and either attach the
> file to it or fput() the damn thing; in any case, file reference
> is consumed and descriptor-or-error is returned".  That one is
> definitely only for single descriptor case.
> 
> In any case, I want to finish the survey of the callers first, just to
> see what's there and whether anything is worth nicking.
> 
> While we are at it, I want to make close_fd() use a very big red flag.
> To the point of grepping for new callers in -next and asking the folks
> who introduce those to explain WTF they are doing...

Yeah, I'd fully support this and would be very nice to have.