Message ID | 20230421-kurstadt-stempeln-3459a64aef0c@brauner |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp1080326vqo; Fri, 21 Apr 2023 06:49:28 -0700 (PDT) X-Google-Smtp-Source: AKy350Z4mPx7bPxOiN2IHTKgM11x6mtNCFLefWXsL36bAhFd4VVOVZhiCIZJ0+OhB3biAsffSueR X-Received: by 2002:a17:90b:190:b0:24b:27e1:92fc with SMTP id t16-20020a17090b019000b0024b27e192fcmr5021882pjs.44.1682084967979; Fri, 21 Apr 2023 06:49:27 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1682084967; cv=none; d=google.com; s=arc-20160816; b=F+NHQHStf4VGNloLlmFHgARt1nSVrUed1uIIO1OPEr1qJ4r1iLwcdoOmNstYO0/B9R 2qeW4Ka1b/BlmXUrBBckFtdFktoqw2NZc73ygZJiod01Y04j+xNqqEN70UV+3Gu63OV7 hDMjE3DBHKVFgKuS+fl3dmsTcJJEZm6G5EZEAqt4GyCoRHkvXjbQTaZPJS/FXVWEQH4F sy7iv4fIA5k+SHWUQufdpeUrNtfzIBFw4i7RQxflJRTTbHNDquhPYHn39dqsC9sbmQB0 I5Yuny42ScTOV/oE7qSIkbaIJz4Ev+eIbwaMmOeZV3BpdXZm6GPZcPJsuKiCJcg2p1tZ M3iw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=Pd7NFu8o9BL7AClYLIT0MTRySF/7BKRzAhMvG16bp+M=; b=m0+WxwbycN1RWHO4mp37UlGq4F5BeavOjF+6Uk/J0VY8oTFcTPCE3mKU+fhkSxB1nJ R9mnPMs53SHkiAi0F3fYOGT1K8BG9i2xacrudKYDTb5x1s+9tr1RtQiROMRMzCSpQYlj bDFvATezW+WRGUpYJBooc+YrLQ8LPdx/KBIKc3cbRk4wphsNpLd9hFc11deR25QxOefT f3DE5rajmgfyqTWYFB5QbXT6Zs27upyrXaXVdYFIObd0tuB/cS0BjNkQ5EVC5dlX09tn 8L819QKhxRNr6S7Xfg0VtWiJA+3yIK2d7BlhUcVYVabFI07GABXcSrVC+ljN8jeQfEQi 9rrA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=JNn1DqAb; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id lp7-20020a17090b4a8700b00240263ef11bsi5205306pjb.120.2023.04.21.06.49.14; Fri, 21 Apr 2023 06:49:27 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=JNn1DqAb; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232438AbjDUNnQ (ORCPT <rfc822;cjcooper78@gmail.com> + 99 others); Fri, 21 Apr 2023 09:43:16 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58922 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232422AbjDUNmt (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 21 Apr 2023 09:42:49 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 201861386C; Fri, 21 Apr 2023 06:42:13 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 27E7A616E9; Fri, 21 Apr 2023 13:42:02 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 32791C433EF; Fri, 21 Apr 2023 13:42:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1682084521; bh=jsN16Cs8YsE77h9jjLEg6GRIme2CQpBrwqLQqDJd2TQ=; h=From:To:Cc:Subject:Date:From; b=JNn1DqAbNx/sCs/osw5eIeOedCZVdloVIQo9EhPsYfTNj3HmFkVv/m43vOH5beif6 vaYLCAtdjcIF6mod/8BpKqSg7Dqhpvs9L0+UH68IZ8RGWVv6AdrDEDANMd/x+AdRKL 5nwqm4HJRu9y3RLi2PY/NVlciFSQv0+fhS02th9Jw7yMru8lHNc5VSM8tyBN61dCA+ usXloyPOuciyHmpABQk8tKSykCsGT8r1MKlaILdk+5zdmIIDso+C8xWz1gfWArJw8V k6qsAopG5fhOlg+QqbjyXJHiR0toLNZCoYbk6RkXQ+c4ntim1hKzj1XLPforMFlEJH SmC/QhxcC0cbQ== From: Christian Brauner <brauner@kernel.org> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christian Brauner <brauner@kernel.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [GIT PULL] pidfd updates Date: Fri, 21 Apr 2023 15:41:10 +0200 Message-Id: <20230421-kurstadt-stempeln-3459a64aef0c@brauner> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=2760; i=brauner@kernel.org; h=from:subject:message-id; bh=jsN16Cs8YsE77h9jjLEg6GRIme2CQpBrwqLQqDJd2TQ=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMaQ4Tfy1OuXJjEcvXBwfnbry4m3250vnTTL4LSOr/1+12yez fOLeix2lLAxiXAyyYoosDu0m4XLLeSo2G2VqwMxhZQIZwsDFKQATsZNi+O8YtvTPjkVFjsvevOw6c3 pTt8G639cM+X7t77yZKH/UKTaPkeGL1hPBYuHrt37ePcE6wdp4Rc+ilPR3kgd3eRWuW/9mfTQHAA== X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1763793927491662184?= X-GMAIL-MSGID: =?utf-8?q?1763793927491662184?= |
Series |
[GIT,PULL] pidfd updates
|
|
Pull-request
git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/v6.4/pidfd.fileMessage
Christian Brauner
April 21, 2023, 1:41 p.m. UTC
Hey Linus, /* Summary */ This adds a new pidfd_prepare() helper which allows the caller to reserve a pidfd number and allocates a new pidfd file that stashes the provided struct pid. It should be avoided installing a file descriptor into a task's file descriptor table just to close it again via close_fd() in case an error occurs. The fd has been visible to userspace and might already be in use. Instead, a file descriptor should be reserved but not installed into the caller's file descriptor table. If another failure path is hit then the reserved file descriptor and file can just be put without any userspace visible side-effects. And if all failure paths are cleared the file descriptor and file can be installed into the task's file descriptor table. This helper is now used in all places that open coded this functionality before. For example, this is currently done during copy_process() and fanotify used pidfd_create(), which returns a pidfd that has already been made visibile in the caller's file descriptor table, but then closed it using close_fd(). In one of the next merge windows there is also new functionality coming to unix domain sockets that will have to rely on pidfd_prepare(). /* Testing */ clang: Ubuntu clang version 15.0.6 gcc: (Ubuntu 12.2.0-3ubuntu1) 12.2.0 All patches are based on 6.3-rc4 and have been sitting in linux-next. No build failures or warnings were observed. All old and new tests in fstests, selftests, and LTP pass without regressions. /* Conflicts */ At the time of creating this PR no merge conflicts were reported from linux-next and no merge conflicts showed up doing a test-merge with current mainline. The following changes since commit 197b6b60ae7bc51dd0814953c562833143b292aa: Linux 6.3-rc4 (2023-03-26 14:40:20 -0700) are available in the Git repository at: git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/v6.4/pidfd.file for you to fetch changes up to eee3a0e93924f2aab8ebaa7f2e26fd0f3b33f9e7: fanotify: use pidfd_prepare() (2023-04-03 11:16:57 +0200) Please consider pulling these changes from the signed v6.4/pidfd.file tag. Thanks! Christian ---------------------------------------------------------------- v6.4/pidfd.file ---------------------------------------------------------------- Christian Brauner (3): pid: add pidfd_prepare() fork: use pidfd_prepare() fanotify: use pidfd_prepare() fs/notify/fanotify/fanotify_user.c | 13 +++-- include/linux/pid.h | 1 + kernel/fork.c | 98 +++++++++++++++++++++++++++++++++----- kernel/pid.c | 19 +++----- 4 files changed, 104 insertions(+), 27 deletions(-)
Comments
On Fri, Apr 21, 2023 at 6:42 AM Christian Brauner <brauner@kernel.org> wrote: > > This adds a new pidfd_prepare() helper which allows the caller to > reserve a pidfd number and allocates a new pidfd file that stashes the > provided struct pid. So I've pulled this, but I have to say that I think the "cleanup" part is pretty nasty - and that's also visible in the interface. In particular, pidfd_prepare() ends up returning two things, so you have that ugly "return by reference of the third arguments", but you also end up not being able to have one single cleanup, you have to (continue) to do a pair of them: put_unused_fd(pidfd); fput(pidfd_file); and I really think the old model of just having a single return value, and doing a single "close()" was a much nicer in many ways. Now, the reason I pulled is that I agree that actually making the fd *visible* to user space is a big mistake. But I really think a potentially much nicer model would have been to extend our "get_unused_fd_flags()" model. IOW, we could have instead marked the 'struct file *' in the file descriptor table as being "not ready yet". I wonder how nasty it would have been to have the low bit of the 'struct file *' mark "not ready to be used yet" or something similar. You already can't just access the 'fdt->fd[]' array willy-nilly since we have both normal RCU issues _and_ the somewhat unusual spectre array indexing issues. So looking around with git grep -e '->fd\[' we seem to be pretty good about that and it probably wouldn't be too horrid to add a "check low bit isn't set" to the rules. Then pidfd_prepare() could actually install the file pointer in the fd table, just marked as "not ready", and then instead of "fd_install()", yuo'd have "fd_expose(fd)" or something. I dislike interfaces that return two different things. Particularly ones that are supposed to be there to make things easy for the user. I think your pidfd_prepare() helper fails that "make it easy to use" test. Hmm? Anyway, I think it's an improvement even so, but I wanted to just say that I think this could maybe have been done with a nicer model. Linus
On Mon, Apr 24, 2023 at 1:24 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But I really think a potentially much nicer model would have been to > extend our "get_unused_fd_flags()" model. > > IOW, we could have instead marked the 'struct file *' in the file > descriptor table as being "not ready yet". Maybe that isn't worth it just for pdfd, but I have this feeling that it might make some other code simpler too. That pidfd case isn't the only one that has to carry both a file pointer and a fd around. Looking around, I get the feeling that quite a lot of users of "fd_install()" might actually have been happier if they could just install it early, and then just have a "fd_expose(fd)" for the success case, and for the error cases have "put_unused_fd(fd)" also do the fput on the file descriptor even if the low bit was set. One less thing to worry about. I dunno. Maybe not worth it. That "two return values" just made me go "Eww". Linus
The pull request you sent on Fri, 21 Apr 2023 15:41:10 +0200:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/v6.4/pidfd.file
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/ec40758b31ef6f492a48267e9e02edff6b4d62c9
Thank you!
On Mon, Apr 24, 2023 at 01:24:24PM -0700, Linus Torvalds wrote: > But I really think a potentially much nicer model would have been to > extend our "get_unused_fd_flags()" model. > > IOW, we could have instead marked the 'struct file *' in the file > descriptor table as being "not ready yet". > > I wonder how nasty it would have been to have the low bit of the > 'struct file *' mark "not ready to be used yet" or something similar. > You already can't just access the 'fdt->fd[]' array willy-nilly since > we have both normal RCU issues _and_ the somewhat unusual spectre > array indexing issues. > > So looking around with > > git grep -e '->fd\[' > > we seem to be pretty good about that and it probably wouldn't be too > horrid to add a "check low bit isn't set" to the rules. > > Then pidfd_prepare() could actually install the file pointer in the fd > table, just marked as "not ready", and then instead of "fd_install()", > yuo'd have "fd_expose(fd)" or something. > > I dislike interfaces that return two different things. Particularly > ones that are supposed to be there to make things easy for the user. I > think your pidfd_prepare() helper fails that "make it easy to use" > test. > > Hmm? I'm not fond of "return two things" kind of helpers, but I'm even less fond of "return fd, file is already there" ones, TBH. {__,}pidfd_prepare() users are thankfully very limited in the things they do to the file that had been returned, but that really invites abuse. The deeper in call chain we mess with descriptor table, the more painful it gets, IME. Speaking of {__,}pidfd_prepare(), I wonder if we wouldn't be better off with get_unused_fd_flags() lifted into the callers - all three of those (fanotify copy_event_to_user(), copy_process() and pidfd_create()). Switch from anon_inode_getfd() to anon_inode_getfile() certainly made sense, ditto for combining it with get_pid(), but mixing get_unused_fd_flags() into that is a mistake, IMO. As for your suggestion... let's see what it leads to. Suppose we add such entries (reserved, hold a reference to file, marked "not yet available" in the LSB). From the current tree POV those would be equivalent to descriptor already reserved, but fd_install() not done. So behaviour of existing primitives should be the same as for this situation, except for fd_install() and put_unused_fd(). * pick_file(), __fget_files_rcu(), iterate_fd(), files_lookup_fd_raw(), loop in dup_fd(), io_close() - treat odd pointers as NULL. * close_files() should, AFAICS, treat an odd pointer as "should never happen" (and that xchg() in there needs to go anyway - it's pointless, since we are freeing the the array immediately afterwards. * do_close_on_exec() should probably treat them as "should never happen". * do_dup2() - odd value should be treated as -EBUSY. The interesting part, of course, is how to legitimize (or dispose of) such a beast. The former is your "fd_expose()" - parallel to fd_install(), AFAICS. The latter... another primitive that would grab ->files_lock pick_file() variant that *expects* an odd value drop ->files_lock clear LSB and pass to fput(). It's doable, but AFAICS doesn't make callers all that happier...
On Mon, Apr 24, 2023 at 01:35:03PM -0700, Linus Torvalds wrote: > On Mon, Apr 24, 2023 at 1:24 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > But I really think a potentially much nicer model would have been to > > extend our "get_unused_fd_flags()" model. > > > > IOW, we could have instead marked the 'struct file *' in the file > > descriptor table as being "not ready yet". > > Maybe that isn't worth it just for pdfd, but I have this feeling that > it might make some other code simpler too. > > That pidfd case isn't the only one that has to carry both a file > pointer and a fd around. > > Looking around, I get the feeling that quite a lot of users of > "fd_install()" might actually have been happier if they could just > install it early, and then just have a "fd_expose(fd)" for the success > case, and for the error cases have "put_unused_fd(fd)" also do the > fput on the file descriptor even if the low bit was set. One less > thing to worry about. > > I dunno. Maybe not worth it. That "two return values" just made me go "Eww". I'm not fond of "two return values" - in C at least - as well I think open-coding get_unused_fd() is pretty nasty as well... Let me see...
On Tue, Apr 25, 2023 at 07:04:27AM +0100, Al Viro wrote: > On Mon, Apr 24, 2023 at 01:24:24PM -0700, Linus Torvalds wrote: > > > But I really think a potentially much nicer model would have been to > > extend our "get_unused_fd_flags()" model. > > > > IOW, we could have instead marked the 'struct file *' in the file > > descriptor table as being "not ready yet". > > > > I wonder how nasty it would have been to have the low bit of the > > 'struct file *' mark "not ready to be used yet" or something similar. > > You already can't just access the 'fdt->fd[]' array willy-nilly since > > we have both normal RCU issues _and_ the somewhat unusual spectre > > array indexing issues. > > > > So looking around with > > > > git grep -e '->fd\[' > > > > we seem to be pretty good about that and it probably wouldn't be too > > horrid to add a "check low bit isn't set" to the rules. > > > > Then pidfd_prepare() could actually install the file pointer in the fd > > table, just marked as "not ready", and then instead of "fd_install()", > > yuo'd have "fd_expose(fd)" or something. > > > > I dislike interfaces that return two different things. Particularly > > ones that are supposed to be there to make things easy for the user. I > > think your pidfd_prepare() helper fails that "make it easy to use" > > test. > > > > Hmm? > > I'm not fond of "return two things" kind of helpers, but I'm even less > fond of "return fd, file is already there" ones, TBH. {__,}pidfd_prepare() > users are thankfully very limited in the things they do to the file that > had been returned, but that really invites abuse. It's only exposed to kernel core code for good reasons. > > The deeper in call chain we mess with descriptor table, the more painful it > gets, IME. > > Speaking of {__,}pidfd_prepare(), I wonder if we wouldn't be better off > with get_unused_fd_flags() lifted into the callers - all three of those > (fanotify copy_event_to_user(), copy_process() and pidfd_create()). > Switch from anon_inode_getfd() to anon_inode_getfile() certainly > made sense, ditto for combining it with get_pid(), but mixing > get_unused_fd_flags() into that is a mistake, IMO. I agree with mostly everything here except for get_unused_fd_flags() being lifted into the callers. That's what I tried to get rid of in kernel/fork.c. It is rife with misunderstandings just looking at what we did in kernel/fork.c earlier: retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC); [...] pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid, O_RDWR | O_CLOEXEC); seeing where both get_unused_fd_flags() and both *_getfile() take flag arguments. Sure, for us this is pretty straightforward since we've seen that code a million times. For others it's confusing why there's two flag arguments. Sure, we could use one flags argument but it still be weird to look at. But with this api we also force all users to remember that they need to cleanup the fd and the file - but definitely one - depending on where they fail. But conceptually the fd and the file belong together. After all it's the file we're about to make that reserved fd refer to. But I'm not here to lambast this api. It works nicely overall and that reserve + install model is pretty elegant. But see for my boring compromise proposal below... > > As for your suggestion... let's see what it leads to. > > Suppose we add such entries (reserved, hold a reference to file, > marked "not yet available" in the LSB). From the current tree POV those > would be equivalent to descriptor already reserved, but fd_install() not > done. So behaviour of existing primitives should be the same as for this > situation, except for fd_install() and put_unused_fd(). > > * pick_file(), __fget_files_rcu(), iterate_fd(), files_lookup_fd_raw(), > loop in dup_fd(), io_close() - treat odd pointers as NULL. > * close_files() should, AFAICS, treat an odd pointer as "should never > happen" (and that xchg() in there needs to go anyway - it's pointless, since > we are freeing the the array immediately afterwards. > * do_close_on_exec() should probably treat them as "should never happen". > * do_dup2() - odd value should be treated as -EBUSY. > > The interesting part, of course, is how to legitimize (or dispose of) such > a beast. The former is your "fd_expose()" - parallel to fd_install(), > AFAICS. The latter... another primitive that would > grab ->files_lock > pick_file() variant that *expects* an odd value > drop ->files_lock > clear LSB and pass to fput(). > > It's doable, but AFAICS doesn't make callers all that happier... In the context of using pidfd for some networking stuff we had a similar discussion because it's the same problem only worse. Think of a scenario where you need to allocate the fd and file early on and multiple function calls later you only get to install fd and file. In that case you need to drag that fd and file around everywhere so you can then fd_install it... Sure, we could do this "semi-install fd into fdtable thing" but I think that's too much subtlety and the fdtable is traumatic enough as it is. But what about something like the following where we just expose a very barebones api that allows and encourages callers to bundle fd and file. Hell, you could even extend that proposal below to wrap the put_user()... struct fd_file { struct file *file; int fd; int __user *fd_user; }; and static inline int fd_publish_user(struct fd_file *fdf) { int ret = 0; if (fdf->fd_user) ret = put_user(fdf->fd, fdf->fd_user); if (ret) fd_discard(fdf) else fd_publish(fdf) return 0; } which is also a pretty common pattern... From d6b48884d33f9a2e4506589d9380bf2036e43b8a Mon Sep 17 00:00:00 2001 From: Christian Brauner <brauner@kernel.org> Date: Tue, 25 Apr 2023 14:21:18 +0200 Subject: [PATCH] UNTESTED UNTESTED DRAFT DRAFT --- include/linux/file.h | 20 ++++++++++++++++++++ include/linux/pid.h | 3 ++- kernel/fork.c | 35 +++++++++++++++++++---------------- kernel/pid.c | 14 +++++++------- 4 files changed, 48 insertions(+), 24 deletions(-) diff --git a/include/linux/file.h b/include/linux/file.h index 39704eae83e2..fdadaaa9f70b 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -39,6 +39,11 @@ struct fd { #define FDPUT_FPUT 1 #define FDPUT_POS_UNLOCK 2 +struct fd_file { + struct file *file; + int fd; +}; + static inline void fdput(struct fd fd) { if (fd.flags & FDPUT_FPUT) @@ -90,6 +95,21 @@ extern void put_unused_fd(unsigned int fd); extern void fd_install(unsigned int fd, struct file *file); +static inline void fd_publish(struct fd_file *fdf) +{ + if (fdf->file) + fd_install(fdf->fd, fdf->file); +} + +static inline void fd_discard(struct fd_file *fdf) +{ + if (fdf->file) { + fput(fdf->file); + put_unused_fd(fdf->fd); + + } +} + extern int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags); diff --git a/include/linux/pid.h b/include/linux/pid.h index b75de288a8c2..0b0695459508 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -75,12 +75,13 @@ extern struct pid init_struct_pid; extern const struct file_operations pidfd_fops; struct file; +struct fd_file; extern struct pid *pidfd_pid(const struct file *file); struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags); struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags); int pidfd_create(struct pid *pid, unsigned int flags); -int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret); +int pidfd_prepare(struct pid *pid, unsigned int flags, struct fd_file *fdf); static inline struct pid *get_pid(struct pid *pid) { diff --git a/kernel/fork.c b/kernel/fork.c index bfe73db1c26c..2072d8cc91d2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1989,7 +1989,8 @@ const struct file_operations pidfd_fops = { * error, a negative error code is returned from the function and the * last argument remains unchanged. */ -static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) +static int __pidfd_prepare(struct pid *pid, unsigned int flags, + struct fd_file *fdf) { int pidfd; struct file *pidfd_file; @@ -2008,8 +2009,11 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re return PTR_ERR(pidfd_file); } get_pid(pid); /* held by pidfd_file now */ - *ret = pidfd_file; - return pidfd; + + fdf->file = pidfd_file; + fdf->fd = pidfd; + + return 0; } /** @@ -2038,12 +2042,12 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re * error, a negative error code is returned from the function and the * last argument remains unchanged. */ -int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) +int pidfd_prepare(struct pid *pid, unsigned int flags, struct fd_file *fdf) { if (!pid || !pid_has_task(pid, PIDTYPE_TGID)) return -EINVAL; - return __pidfd_prepare(pid, flags, ret); + return __pidfd_prepare(pid, flags, fdf); } static void __delayed_free_task(struct rcu_head *rhp) @@ -2106,10 +2110,12 @@ __latent_entropy struct task_struct *copy_process( int node, struct kernel_clone_args *args) { - int pidfd = -1, retval; + int retval; + struct fd_file pidfd = { + .fd = -1, + }; struct task_struct *p; struct multiprocess_signals delayed; - struct file *pidfile = NULL; const u64 clone_flags = args->flags; struct nsproxy *nsp = current->nsproxy; @@ -2395,12 +2401,11 @@ __latent_entropy struct task_struct *copy_process( */ if (clone_flags & CLONE_PIDFD) { /* Note that no task has been attached to @pid yet. */ - retval = __pidfd_prepare(pid, O_RDWR | O_CLOEXEC, &pidfile); + retval = __pidfd_prepare(pid, O_RDWR | O_CLOEXEC, &pidfd); if (retval < 0) goto bad_fork_free_pid; - pidfd = retval; - retval = put_user(pidfd, args->pidfd); + retval = put_user(pidfd.fd, args->pidfd); if (retval) goto bad_fork_put_pidfd; } @@ -2584,8 +2589,8 @@ __latent_entropy struct task_struct *copy_process( syscall_tracepoint_update(p); write_unlock_irq(&tasklist_lock); - if (pidfile) - fd_install(pidfd, pidfile); + if (clone_flags & CLONE_PIDFD) + fd_publish(&pidfd); proc_fork_connector(p); sched_post_fork(p); @@ -2605,10 +2610,8 @@ __latent_entropy struct task_struct *copy_process( write_unlock_irq(&tasklist_lock); cgroup_cancel_fork(p, args); bad_fork_put_pidfd: - if (clone_flags & CLONE_PIDFD) { - fput(pidfile); - put_unused_fd(pidfd); - } + if (clone_flags & CLONE_PIDFD) + fd_discard(&pidfd); bad_fork_free_pid: if (pid != &init_struct_pid) free_pid(pid); diff --git a/kernel/pid.c b/kernel/pid.c index f93954a0384d..6cd46002aca3 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -594,15 +594,15 @@ struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags) */ int pidfd_create(struct pid *pid, unsigned int flags) { - int pidfd; - struct file *pidfd_file; + int ret; + struct fd_file pidfd; - pidfd = pidfd_prepare(pid, flags, &pidfd_file); - if (pidfd < 0) - return pidfd; + ret = pidfd_prepare(pid, flags, &pidfd); + if (ret < 0) + return ret; - fd_install(pidfd, pidfd_file); - return pidfd; + fd_publish(&pidfd); + return pidfd.fd; } /**
On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote: > It is rife with misunderstandings just looking at what we did in > kernel/fork.c earlier: > > retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC); > [...] > pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid, > O_RDWR | O_CLOEXEC); > > seeing where both get_unused_fd_flags() and both *_getfile() take flag > arguments. Sure, for us this is pretty straightforward since we've seen > that code a million times. For others it's confusing why there's two > flag arguments. Sure, we could use one flags argument but it still be > weird to look at. First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and anon_inode_getfile() - about O_CLOEXEC. Duplicating the expression in places like that is cargo-culting. > But with this api we also force all users to remember that they need to > cleanup the fd and the file - but definitely one - depending on where > they fail. > > But conceptually the fd and the file belong together. After all it's the > file we're about to make that reserved fd refer to. Why? The common pattern is * reserve the descriptor * do some work that might fail * get struct file set up (which also might fail) * do more work (possibly including copyout, etc.) * install file into descriptor table We want to reserve the descriptor early, since it's about the easiest thing to undo. Why bother doing it next to file setup? That can be pretty deep in call chain and doing it that way comes with headache about passing the damn thing around. And cleanup rules with your variant end up being more complicated. Keep the manipulations of descriptor table as close to the surface as possible. The real work is with file references; descriptors belong in userland and passing them around kernel-side is asking for headache.
On Tue, Apr 25, 2023 at 02:54:29PM +0100, Al Viro wrote: > On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote: > > > It is rife with misunderstandings just looking at what we did in > > kernel/fork.c earlier: > > > > retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC); > > [...] > > pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid, > > O_RDWR | O_CLOEXEC); > > > > seeing where both get_unused_fd_flags() and both *_getfile() take flag > > arguments. Sure, for us this is pretty straightforward since we've seen > > that code a million times. For others it's confusing why there's two > > flag arguments. Sure, we could use one flags argument but it still be > > weird to look at. > > First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and > anon_inode_getfile() - about O_CLOEXEC. Duplicating the expression in > places like that is cargo-culting. I distinctly remember us having that conversation about how to do this nicely back then and fwiw this is your patch... ;) 6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups") So sure, that was my point: people are confused why there's two flag arguments and what exactly has to go into them and just copy-paste whatever other users already have. > > > But with this api we also force all users to remember that they need to > > cleanup the fd and the file - but definitely one - depending on where > > they fail. > > > > But conceptually the fd and the file belong together. After all it's the > > file we're about to make that reserved fd refer to. > > Why? The common pattern is > * reserve the descriptor > * do some work that might fail > * get struct file set up (which also might fail) > * do more work (possibly including copyout, etc.) > * install file into descriptor table > > We want to reserve the descriptor early, since it's about the easiest > thing to undo. Why bother doing it next to file setup? That can be > pretty deep in call chain and doing it that way comes with headache > about passing the damn thing around. And cleanup rules with your > variant end up being more complicated. > > Keep the manipulations of descriptor table as close to the surface > as possible. The real work is with file references; descriptors > belong in userland and passing them around kernel-side is asking > for headache. Imho, there's different use-cases. There's definitely one where people allocate a file descriptor early on and then sometimes maybe way later allocate a struct file and install it. And that's where exposing and using get_unused_fd_flags() and fd_install() is great and works fine. But there's also users that do the reserve an fd and allocate a file at the same time or receive both at the same time as the seccomp notifier, scm creds, or pidfds. The receive_fd* stuff is basically a version of the sketch I did without the ability to simply use a struct and not having to open-code everything multiple times. I never expected excitement around this as it was difficult enough to get that receive_fd* thing going so I'm not arguing for this to be the ultima ratio...
On Tue, Apr 25, 2023 at 04:36:27PM +0200, Christian Brauner wrote: > On Tue, Apr 25, 2023 at 02:54:29PM +0100, Al Viro wrote: > > On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote: > > > > > It is rife with misunderstandings just looking at what we did in > > > kernel/fork.c earlier: > > > > > > retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC); > > > [...] > > > pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid, > > > O_RDWR | O_CLOEXEC); > > > > > > seeing where both get_unused_fd_flags() and both *_getfile() take flag > > > arguments. Sure, for us this is pretty straightforward since we've seen > > > that code a million times. For others it's confusing why there's two > > > flag arguments. Sure, we could use one flags argument but it still be > > > weird to look at. > > > > First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and > > anon_inode_getfile() - about O_CLOEXEC. Duplicating the expression in > > places like that is cargo-culting. > > I distinctly remember us having that conversation about how to do this > nicely back then and fwiw this is your patch... ;) > 6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups") Should've followed with "no need to pass nonsense flags to get_unused_fd_flags() and anon_inode_getfile()"... > So sure, that was my point: people are confused why there's two flag > arguments and what exactly has to go into them and just copy-paste > whatever other users already have. > There's definitely one where people allocate a file descriptor early on > and then sometimes maybe way later allocate a struct file and install > it. And that's where exposing and using get_unused_fd_flags() and > fd_install() is great and works fine. FWIW, there's something I toyed with a while ago - a primitive along the lines of fd_set_file(fd, file) { if (IS_ERR(file)) { put_unused_fd(fd); return ERR_PTR(file); } fd_install(fd, file); return fd; } That simplifies quite a few places, collapsing failure exit handling. Not sure how many can be massaged into that form, though...
On Tue, Apr 25, 2023 at 5:34 AM Christian Brauner <brauner@kernel.org> wrote: > > Hell, you could even extend that proposal below to wrap the > put_user()... > > struct fd_file { > struct file *file; > int fd; > int __user *fd_user; > }; So I don't like this extended version, but your proposal patch below looks good to me. Why? Simply because the "two-word struct" is actually a good way to return two values. But a three-word one would be passed on the stack. Both gcc and clang return small structs (where "small" is literally just two words) in registers, and it's part of most (all?) ABIs and we've relied on that before. It's kind of the same thing as returning a "long long" in two registers, except it works with structs too. We've relied on that for ages, with things like 'struct pte' often being that kind of two-word struct (going back a *loong* time to the bad old days of 32-bit x86 and PAE). And in the vfs layer we actually also do it for 'struct fd', which is a very similar construct. So yes, doing something like this: > +struct fd_file { > + struct file *file; > + int fd; > +}; would make me happy, and still return just "one" value, it's just that the value now is both of those things we need. And then your helpers: > +static inline void fd_publish(struct fd_file *fdf) > +static inline void fd_discard(struct fd_file *fdf) solves my other issue, except I'd literally make it pass those structs by value, exactly like fdget/fdput does. Now, since they are inline functions, the code generation doesn't really change (compilers are smart enough to not actually generate any pointer stuff), but I prefer to make things like that expliict, and have source code that matches the code generation. (Which is also why I do *not* endorse passing bigger structs by value, because then the compiler will just pass it as a "pointer to a copy" instead, again violating the whole concept of "source matches what happens in reality") I think the above helper could be improved further with Al's suggestion to make 'fd_publish()' return an error code, and allow the file pointer (and maybe even the fd index) to be an error pointer (and error number), so that you could often unify the error/success paths. IOW, I like this, and I think it's superior to my stupid original suggestion. Linus
On Tue, Apr 25, 2023 at 09:28:54AM -0700, Linus Torvalds wrote: > Now, since they are inline functions, the code generation doesn't > really change (compilers are smart enough to not actually generate any > pointer stuff), but I prefer to make things like that expliict, and > have source code that matches the code generation. > > (Which is also why I do *not* endorse passing bigger structs by value, > because then the compiler will just pass it as a "pointer to a copy" > instead, again violating the whole concept of "source matches what > happens in reality") > > I think the above helper could be improved further with Al's > suggestion to make 'fd_publish()' return an error code, and allow the > file pointer (and maybe even the fd index) to be an error pointer (and > error number), so that you could often unify the error/success paths. > > IOW, I like this, and I think it's superior to my stupid original suggestion. We'd better collect the data on the current callers first. There are several patterns; I'm going through the old (fairly sparse) notes and the grep over the current tree right now, will post when I get through that. That's one area where we had a *lot* of recurring bugs - mostly of leak/double put variety. So we'd better have the calling conventions right wrt how easy it is to fuck up in failure exits. And we need to document the patterns/rules for each/reasons for choosing one over another. Note that there's also "set the file up, then get descriptor and either fd_install or fput, depending on get_unused_fd_flags() success"; sometimes it's the only approach (SCM_RIGHTS, for example), sometimes it's better than "get descriptor, set the file up, then either install or release descriptor", sometimes it's definitely worse (e.g. for O_CREAT it's a non-starter). It should be a deliberate choice.
On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote: > struct fd_file { > struct file *file; > int fd; > int __user *fd_user; Why is it an int? Because your case has it that way? We have a bunch of places where we have an ioctl with a field in some structure filled that way; any primitive that combines put_user() with descriptor handling is going to cause trouble as soon as somebody deals with a structure where such member is unsigned long. Gets especially funny on 64bit big-endian... And that objection is orthogonal to that 3-member structure - even if you pass int __user * as an explicit argument into your helper, the same trouble will be there. Al, still going through that zoo...
On Thu, Apr 27, 2023 at 02:07:15AM +0100, Al Viro wrote: > On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote: > > > > struct fd_file { > > struct file *file; > > int fd; > > int __user *fd_user; > > Why is it an int? Because your case has it that way? > > We have a bunch of places where we have an ioctl with > a field in some structure filled that way; any primitive > that combines put_user() with descriptor handling is > going to cause trouble as soon as somebody deals with > a structure where such member is unsigned long. Gets > especially funny on 64bit big-endian... > > And that objection is orthogonal to that 3-member structure - > even if you pass int __user * as an explicit argument into > your helper, the same trouble will be there. > > Al, still going through that zoo... FWIW, surprisingly large part of messy users in ioctls might get simplified nicely if we add something along the lines of int delayed_dup(struct file *file, unsigned flags) { struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL); int fd; if (likely(p)) fd = p->fd = get_unused_fd_flags(flags); else fd = -ENOMEM; if (likely(fd >= 0)) { p->file = file; init_task_work(&p->work, __do_delayed_dup); if (!task_work_add(&p, TWA_RESUME)) return fd; put_unused_fd(fd); fd = -EINVAL; } fput(file); kfree(p); return fd; } with struct delayed_dup { struct callback_head work; struct file *file; unsigned fd; }; static void __do_delayed_dup(struct callback_head *work) { struct delayed_dup *p = container_of(work, struct delayed_dup, work); if (!syscall_get_error(current, current_pt_regs())) { fd_install(p->fd, p->file); } else { put_unused_fd(p->fd); fput(p->file); } kfree(p); } Random example (completely untested - I'm not even sure it compiles): diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c index af57799c86ce..7c62becfddc9 100644 --- a/drivers/dma-buf/sync_file.c +++ b/drivers/dma-buf/sync_file.c @@ -207,55 +207,34 @@ static __poll_t sync_file_poll(struct file *file, poll_table *wait) static long sync_file_ioctl_merge(struct sync_file *sync_file, unsigned long arg) { - int fd = get_unused_fd_flags(O_CLOEXEC); int err; struct sync_file *fence2, *fence3; + struct sync_merge_data __user *argp = (void *)arg; struct sync_merge_data data; - if (fd < 0) - return fd; - - if (copy_from_user(&data, (void __user *)arg, sizeof(data))) { - err = -EFAULT; - goto err_put_fd; - } + if (copy_from_user(&data, argp, sizeof(data))) + return -EFAULT; - if (data.flags || data.pad) { - err = -EINVAL; - goto err_put_fd; - } + if (data.flags || data.pad) + return -EINVAL; fence2 = sync_file_fdget(data.fd2); - if (!fence2) { - err = -ENOENT; - goto err_put_fd; - } + if (!fence2) + return -ENOENT; data.name[sizeof(data.name) - 1] = '\0'; fence3 = sync_file_merge(data.name, sync_file, fence2); - if (!fence3) { + if (fence3) + err = delayed_dup(fence3->file, O_CLOEXEC); + else err = -ENOMEM; - goto err_put_fence2; - } - data.fence = fd; - if (copy_to_user((void __user *)arg, &data, sizeof(data))) { - err = -EFAULT; - goto err_put_fence3; + if (err >= 0) { + data.fence = err; + err = copy_to_user(argp, &data, sizeof(data)) ? -EFAULT : 0; } - fd_install(fd, fence3->file); - fput(fence2->file); - return 0; - -err_put_fence3: - fput(fence3->file); - -err_put_fence2: fput(fence2->file); - -err_put_fd: - put_unused_fd(fd); return err; } IMO it's much easier to follow that way, and that's a fairly typical example. One primitive call instead of three, *much* simpler handling of failure exits, no need to deal with unroll on copy_to_user() failures (here those were handled; most of the drivers/gpu/drm stuff is buried quite a few call levels deeper than the place where copyout is done, so currently it doesn't even try to DTRT in that respect). It's not a panacea - for that to be useful we need * no side effects of file opening that wouldn't be undone by fput() * enough work done to make the overhead negligible * moderate amount of files opened in one call * a pattern that could be massaged into "open file, then deal with inserting it into descriptor table". There's a plenty of fd_install() users that match that pattern. Certainly not all of them, but almost everything in drivers does. I'm still digging through the rest - there's a couple of other patterns that seem to be common, but I'm not through the entire pile yet.
On Thu, Apr 27, 2023 at 02:07:15AM +0100, Al Viro wrote: > On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote: > > > > struct fd_file { > > struct file *file; > > int fd; > > int __user *fd_user; > > Why is it an int? Because your case has it that way? > > We have a bunch of places where we have an ioctl with > a field in some structure filled that way; any primitive > that combines put_user() with descriptor handling is > going to cause trouble as soon as somebody deals with > a structure where such member is unsigned long. Gets > especially funny on 64bit big-endian... > > And that objection is orthogonal to that 3-member structure - > even if you pass int __user * as an explicit argument into > your helper, the same trouble will be there. Ignoring for a second that there are other ways to achieve this. This is literally the sketch on top of a sketch to encompass an api that _we do already have today_...
On Thu, Apr 27, 2023 at 08:39:08AM +0100, Al Viro wrote: > On Thu, Apr 27, 2023 at 02:07:15AM +0100, Al Viro wrote: > > On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote: > > > > > > > struct fd_file { > > > struct file *file; > > > int fd; > > > int __user *fd_user; > > > > Why is it an int? Because your case has it that way? > > > > We have a bunch of places where we have an ioctl with > > a field in some structure filled that way; any primitive > > that combines put_user() with descriptor handling is > > going to cause trouble as soon as somebody deals with > > a structure where such member is unsigned long. Gets > > especially funny on 64bit big-endian... > > > > And that objection is orthogonal to that 3-member structure - > > even if you pass int __user * as an explicit argument into > > your helper, the same trouble will be there. > > > > Al, still going through that zoo... > > FWIW, surprisingly large part of messy users in ioctls might get > simplified nicely if we add something along the lines of > > int delayed_dup(struct file *file, unsigned flags) > { > struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL); > int fd; > > if (likely(p)) > fd = p->fd = get_unused_fd_flags(flags); > else > fd = -ENOMEM; > if (likely(fd >= 0)) { > p->file = file; > init_task_work(&p->work, __do_delayed_dup); > if (!task_work_add(&p, TWA_RESUME)) > return fd; > put_unused_fd(fd); > fd = -EINVAL; > } > fput(file); > kfree(p); > return fd; > } > > with > > struct delayed_dup { > struct callback_head work; > struct file *file; > unsigned fd; > }; > > static void __do_delayed_dup(struct callback_head *work) > { > struct delayed_dup *p = container_of(work, struct delayed_dup, work); > > if (!syscall_get_error(current, current_pt_regs())) { > fd_install(p->fd, p->file); > } else { > put_unused_fd(p->fd); > fput(p->file); > } > kfree(p); > } > > Random example (completely untested - I'm not even sure it compiles): > > diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c > index af57799c86ce..7c62becfddc9 100644 > --- a/drivers/dma-buf/sync_file.c > +++ b/drivers/dma-buf/sync_file.c > @@ -207,55 +207,34 @@ static __poll_t sync_file_poll(struct file *file, poll_table *wait) > static long sync_file_ioctl_merge(struct sync_file *sync_file, > unsigned long arg) > { > - int fd = get_unused_fd_flags(O_CLOEXEC); > int err; > struct sync_file *fence2, *fence3; > + struct sync_merge_data __user *argp = (void *)arg; > struct sync_merge_data data; > > - if (fd < 0) > - return fd; > - > - if (copy_from_user(&data, (void __user *)arg, sizeof(data))) { > - err = -EFAULT; > - goto err_put_fd; > - } > + if (copy_from_user(&data, argp, sizeof(data))) > + return -EFAULT; > > - if (data.flags || data.pad) { > - err = -EINVAL; > - goto err_put_fd; > - } > + if (data.flags || data.pad) > + return -EINVAL; > > fence2 = sync_file_fdget(data.fd2); > - if (!fence2) { > - err = -ENOENT; > - goto err_put_fd; > - } > + if (!fence2) > + return -ENOENT; > > data.name[sizeof(data.name) - 1] = '\0'; > fence3 = sync_file_merge(data.name, sync_file, fence2); > - if (!fence3) { > + if (fence3) > + err = delayed_dup(fence3->file, O_CLOEXEC); > + else > err = -ENOMEM; > - goto err_put_fence2; > - } > > - data.fence = fd; > - if (copy_to_user((void __user *)arg, &data, sizeof(data))) { > - err = -EFAULT; > - goto err_put_fence3; > + if (err >= 0) { > + data.fence = err; > + err = copy_to_user(argp, &data, sizeof(data)) ? -EFAULT : 0; > } > > - fd_install(fd, fence3->file); > - fput(fence2->file); > - return 0; > - > -err_put_fence3: > - fput(fence3->file); > - > -err_put_fence2: > fput(fence2->file); > - > -err_put_fd: > - put_unused_fd(fd); > return err; > } > > > IMO it's much easier to follow that way, and that's a fairly typical > example. One primitive call instead of three, *much* simpler handling > of failure exits, no need to deal with unroll on copy_to_user() failures > (here those were handled; most of the drivers/gpu/drm stuff is buried > quite a few call levels deeper than the place where copyout is done, > so currently it doesn't even try to DTRT in that respect). > > It's not a panacea - for that to be useful we need > * no side effects of file opening that wouldn't be undone by fput() > * enough work done to make the overhead negligible > * moderate amount of files opened in one call > * a pattern that could be massaged into "open file, then deal with > inserting it into descriptor table". > > There's a plenty of fd_install() users that match that pattern. > Certainly not all of them, but almost everything in drivers does. > > I'm still digging through the rest - there's a couple of other patterns > that seem to be common, but I'm not through the entire pile yet. So the earlier proposal plus added complexity to hide it from users. I don't feel strongly that we really do need to do something here but if this is something you feel strongly about then sure. But a few points: * I don't think this is much of a simplification for file descriptor handling in those callers. Judging by the example you pasted here the simplifications to this function are much more based on changing the general structure. I suspect it'll be the same for a lot of other codepaths. But I'm happy to be convinced otherwise. * This delayed_dup() thing is fairly complex with the task work stuff and the memory allocation mixed in there. * It adds an async pattern into the fd installation path which feels like yet another complex add on. Ultimately what I try to gauge with changes like that is how likely are people going to use a pattern without someone having to go through the tree every few months to make sure that the api isn't abused. I'm not sure that this wouldn't just end up being misused. File descriptor installation is not core functionality for drivers. It's just something that they have to do and so it's not that people usually put a lot of thought into it. So that's why I think an API has to be dumb enough. A three call api may still be simpler to use than an overly clever single call api.
On Thu, Apr 27, 2023 at 10:33:38AM +0200, Christian Brauner wrote: > File descriptor installation is not core functionality for drivers. It's > just something that they have to do and so it's not that people usually > put a lot of thought into it. So that's why I think an API has to be > dumb enough. A three call api may still be simpler to use than an overly > clever single call api. Grep and you will see... Seriously, for real mess take a look at e.g. drivers/gpu/drm/amd/amdkfd/kfd_chardev.c. See those close_fd() calls in there? That's completely wrong - you can't undo the insertion into descriptor table. I'm not suggesting that for the core kernel, but there's a plenty of drivers that do descriptor allocation. Take a look at e.g. drivers/media/mc/mc-request.c:media_request_alloc(). OK, we open, insert, etc. and we pass the descriptor to caller in *alloc_fd. Caller is in drivers/media/mc/mc-device.c: static long media_device_request_alloc(struct media_device *mdev, void *arg) and descriptor goes into *(int *)arg there. That is reached via static const struct media_ioctl_info ioctl_info[] = { MEDIA_IOC(DEVICE_INFO, media_device_get_info, MEDIA_IOC_FL_GRAPH_MUTEX), MEDIA_IOC(ENUM_ENTITIES, media_device_enum_entities, MEDIA_IOC_FL_GRAPH_MUTEX), MEDIA_IOC(ENUM_LINKS, media_device_enum_links, MEDIA_IOC_FL_GRAPH_MUTEX), MEDIA_IOC(SETUP_LINK, media_device_setup_link, MEDIA_IOC_FL_GRAPH_MUTEX), MEDIA_IOC(G_TOPOLOGY, media_device_get_topology, MEDIA_IOC_FL_GRAPH_MUTEX), MEDIA_IOC(REQUEST_ALLOC, media_device_request_alloc, 0), }; used in static long media_device_ioctl(struct file *filp, unsigned int cmd, unsigned long __arg) There we have ret = info->fn(dev, karg); if (info->flags & MEDIA_IOC_FL_GRAPH_MUTEX) mutex_unlock(&dev->graph_mutex); if (!ret && info->arg_to_user) ret = info->arg_to_user(arg, karg, cmd); array elements are set by #define MEDIA_IOC_ARG(__cmd, func, fl, from_user, to_user) \ [_IOC_NR(MEDIA_IOC_##__cmd)] = { \ .cmd = MEDIA_IOC_##__cmd, \ .fn = func, \ .flags = fl, \ .arg_from_user = from_user, \ .arg_to_user = to_user, \ } #define MEDIA_IOC(__cmd, func, fl) \ MEDIA_IOC_ARG(__cmd, func, fl, copy_arg_from_user, copy_arg_to_user) so this ->arg_to_user() is copy_arg_to_user(), which is static long copy_arg_to_user(void __user *uarg, void *karg, unsigned int cmd) { if ((_IOC_DIR(cmd) & _IOC_READ) && copy_to_user(uarg, karg, _IOC_SIZE(cmd))) return -EFAULT; return 0; } That copy_to_user() is not attempted until media_device_request_alloc() returns. And I don't see any way to make it unroll the insertion into descriptor table without massive restructuring of the entire thing; if you do, I'd love to hear it. This is actually not the worst case - again, drm stuff has a bunch of such crap, and I don't believe it's feasible to move drm folks from their "we do copyin and copyout in generic ioctl code, passing the copy to handlers supplied by drivers and copying whatever they modified back to userland" approach.
On Thu, Apr 27, 2023 at 09:59:52AM +0100, Al Viro wrote: > On Thu, Apr 27, 2023 at 10:33:38AM +0200, Christian Brauner wrote: > > > File descriptor installation is not core functionality for drivers. It's > > just something that they have to do and so it's not that people usually > > put a lot of thought into it. So that's why I think an API has to be > > dumb enough. A three call api may still be simpler to use than an overly > > clever single call api. > > Grep and you will see... Seriously, for real mess take a look at e.g. > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c. See those close_fd() calls in > there? That's completely wrong - you can't undo the insertion into > descriptor table. I mean, yes. There's literally a description how that's wrong in my pr message... > > I'm not suggesting that for the core kernel, but there's a plenty of > drivers that do descriptor allocation. That's undisputed. What I tried to say was that this is not their main focus. The point is their design thinking is not centered on fd handling that's just something they have to do and so they don't think about cleanly handling for example installing an fd and the closing it again in some failure path. And I'm not sure one should be just harshly judging them. They're almost bound to get this wrong. You'd need to know a bit about file descriptors to have this in mind. File descriptor handling is subtle. I mean, look at cachefiles_ondemand_get_fd()...
On Thu, Apr 27, 2023 at 12:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > int delayed_dup(struct file *file, unsigned flags) Ok, this is strange. Let me think about it. But even without thinking about it, this part I hate: > struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL); Sure, if this is only used in unimportant code where performance doesn't matter, doing a kmalloc is fine. But if that is the only use, I think this is too subtle an interface. Could we instead limit it to "we only have one pending delayed dup", and make this all be more like the restart-block thing, and be part of struct task_struct? I think it's conceptually quite similar to restart_block, ie a "some pending system call state" thing. (Obviously it's entirely different in details). Linus
On Thu, Apr 27, 2023 at 08:21:34AM -0700, Linus Torvalds wrote: > On Thu, Apr 27, 2023 at 12:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > int delayed_dup(struct file *file, unsigned flags) > > Ok, this is strange. Let me think about it. > > But even without thinking about it, this part I hate: > > > struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL); > > Sure, if this is only used in unimportant code where performance > doesn't matter, doing a kmalloc is fine. > > But if that is the only use, I think this is too subtle an interface. Still hadn't finished with the zoo... > Could we instead limit it to "we only have one pending delayed dup", > and make this all be more like the restart-block thing, and be part of > struct task_struct? Interesting... FWIW, *anything* that wants several descriptors has special needs - there are some places like that (binder, for one) and they have rather weird code implementing those. Just to restate the obvious: this is not applicable for the most frequent caller - open(2). For the reasons that have nothing to do with performance. If opening the file has hard-to-reverse side effects (like directory modification due to O_CREAT), the things are very different. What I hope for is a small number of patterns, with clear rules for choosing the one that is applicable and helpers for each that would reduce the amount of headache when using it. And I've no problem with "this particular pattern is not usable if you are adding more than one descriptor" - that's not hard to understand and verify. So I'm fine with doing that for one descriptor only and getting rid of the allocation. BTW, another pattern is the same sans the "delayed" part. I.e. "here's an opened file, get a descriptor and either attach the file to it or fput() the damn thing; in any case, file reference is consumed and descriptor-or-error is returned". That one is definitely only for single descriptor case. In any case, I want to finish the survey of the callers first, just to see what's there and whether anything is worth nicking. While we are at it, I want to make close_fd() use a very big red flag. To the point of grepping for new callers in -next and asking the folks who introduce those to explain WTF they are doing...
From: Linus Torvalds > Sent: 25 April 2023 17:29 > > On Tue, Apr 25, 2023 at 5:34 AM Christian Brauner <brauner@kernel.org> wrote: > > > > Hell, you could even extend that proposal below to wrap the > > put_user()... > > > > struct fd_file { > > struct file *file; > > int fd; > > int __user *fd_user; > > }; > > So I don't like this extended version, but your proposal patch below > looks good to me. > > Why? Simply because the "two-word struct" is actually a good way to > return two values. But a three-word one would be passed on the stack. > > Both gcc and clang return small structs (where "small" is literally > just two words) in registers, and it's part of most (all?) ABIs and > we've relied on that before. It is definitely architecture dependant. x86-64 and arm-64 will return two 64bit values in registers. x86-32 and arm-32 return two 32bit values on stack. Pretty much everything passes short structures directly by value. (I'm not sure about sparc-32 though, I'm sure it passed all structures by reference back in the 1980s) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Apr 28, 2023 at 1:40 AM David Laight <David.Laight@aculab.com> wrote: > > It is definitely architecture dependant. > x86-64 and arm-64 will return two 64bit values in registers. > x86-32 and arm-32 return two 32bit values on stack. Yes, but I think the "return value in two registers" is the generic thing. At least gcc has very special code to deal with a two-register return value. And yes, those registers will then obviously be different sizes on different architectures, but a tuple like a '(ptr,fd)' thing will fit regardless. We very much do depend on this kind of pattern generating reasonable code in some parts of the kernel. Linus
On Thu, Apr 27, 2023 at 06:02:15PM +0100, Al Viro wrote: > On Thu, Apr 27, 2023 at 08:21:34AM -0700, Linus Torvalds wrote: > > On Thu, Apr 27, 2023 at 12:39 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > int delayed_dup(struct file *file, unsigned flags) > > > > Ok, this is strange. Let me think about it. > > > > But even without thinking about it, this part I hate: > > > > > struct delayed_dup *p = kmalloc(sizeof(struct delayed_dup), GFP_KERNEL); > > > > Sure, if this is only used in unimportant code where performance > > doesn't matter, doing a kmalloc is fine. > > > > But if that is the only use, I think this is too subtle an interface. > > Still hadn't finished with the zoo... > > > Could we instead limit it to "we only have one pending delayed dup", > > and make this all be more like the restart-block thing, and be part of > > struct task_struct? > > Interesting... FWIW, *anything* that wants several descriptors has > special needs - there are some places like that (binder, for one) > and they have rather weird code implementing those. > > Just to restate the obvious: this is not applicable for the most frequent > caller - open(2). For the reasons that have nothing to do with performance. > If opening the file has hard-to-reverse side effects (like directory > modification due to O_CREAT), the things are very different. > > What I hope for is a small number of patterns, with clear rules for > choosing the one that is applicable and helpers for each that would > reduce the amount of headache when using it. And I've no problem > with "this particular pattern is not usable if you are adding more > than one descriptor" - that's not hard to understand and verify. > So I'm fine with doing that for one descriptor only and getting > rid of the allocation. > > BTW, another pattern is the same sans the "delayed" part. I.e. > "here's an opened file, get a descriptor and either attach the > file to it or fput() the damn thing; in any case, file reference > is consumed and descriptor-or-error is returned". That one is > definitely only for single descriptor case. > > In any case, I want to finish the survey of the callers first, just to > see what's there and whether anything is worth nicking. > > While we are at it, I want to make close_fd() use a very big red flag. > To the point of grepping for new callers in -next and asking the folks > who introduce those to explain WTF they are doing... Yeah, I'd fully support this and would be very nice to have.