Message ID | 20231124-vfs-fixes-3420a81c0abe@brauner |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:ce62:0:b0:403:3b70:6f57 with SMTP id o2csp1062453vqx; Fri, 24 Nov 2023 02:28:06 -0800 (PST) X-Google-Smtp-Source: AGHT+IGMmP2aJN1pqzM32e0j+iHCYvynUVt347v1gxhQVJWgGt6XGySFhGzQfJnzKWn2K4dsIrk2 X-Received: by 2002:a05:6830:3b0e:b0:6ce:25b9:d9b5 with SMTP id dk14-20020a0568303b0e00b006ce25b9d9b5mr2316676otb.38.1700821686085; Fri, 24 Nov 2023 02:28:06 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700821686; cv=none; d=google.com; s=arc-20160816; b=k6wvnKPgeB4PaUZvB3fmr/S+VyAq+3kGm3H9m79XeARg6HM5g0fGbubxbDZhbwxzpB uRK5xDyFOWitT4f0KPm73+I4jxNvzNQfRJEJh+mePAcIoED4joK2hiTchqLvcFyhoMmA KtAbP5L6Cwtin/jj8SJvcNb44P8EF/ehq9YvVhNJDmlw/iI3Lb7JbDKpCEkG2AXXlg3v QXTd+pyWCW3xcYP1mpMwNBk+pNJQskHicOPk2b2s+hUlQx4PtD7QfGjAELqrc0ayWIMa fT2Ckaqh8eldFux3DT3f4d0LtkBU6j8elQUv9pcHWYXql81eH2TasCRmi51uDBF3P1mj uBYg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=7hK+tej1jbvltr3Oa0cRwquUjwJ8yTM6L8s4Ele8w9Y=; fh=qDZ+YnS9HJbCVOaeWFr2FnYZt7eE+vt5IJIDIlhBdxU=; b=wbY6U9lfr3rQ27LPFgEugTIFgNjvNCjUhCYHlV2mawwMVqZYgtXNHfwkSNd6NilJAs 2qgyVTiHhrG+/RyyjTL8N9SlP87tznp6E0LE7RIeuZgyDVYhrVE5lo8ONOxnCwVUEXr6 dQGI6WragvIUlUzwZOYXrA3qccGkvQ0xYrLdCncKwDe2MxTFM01QXBWeNtH1BLXKARLr dTnkV2PFQCS2OXXZ9KU1jObE4dLfspfQ8OZ+mHZsqxH6WxyjI+C6IGOkxFDfQLS5Dj/q qsDbyeuGtcQyjGK4lxHgeqyU5LPHhhyHnlgmgVZZTxFAjqGIW54dOyjbVJFWZFOGplDL zPGQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=sYvu8MeE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from groat.vger.email (groat.vger.email. [2620:137:e000::3:5]) by mx.google.com with ESMTPS id bz3-20020a056a02060300b005be2508ce03si3562669pgb.569.2023.11.24.02.28.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 02:28:06 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) client-ip=2620:137:e000::3:5; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=sYvu8MeE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by groat.vger.email (Postfix) with ESMTP id 0447783BC6E2; Fri, 24 Nov 2023 02:28:03 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at groat.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230334AbjKXK1y (ORCPT <rfc822;ouuuleilei@gmail.com> + 99 others); Fri, 24 Nov 2023 05:27:54 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35070 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229580AbjKXK1w (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 24 Nov 2023 05:27:52 -0500 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5D9A392 for <linux-kernel@vger.kernel.org>; Fri, 24 Nov 2023 02:27:58 -0800 (PST) Received: by smtp.kernel.org (Postfix) with ESMTPSA id A33F0C433C8; Fri, 24 Nov 2023 10:27:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1700821678; bh=qVwJcFDiIpymt6eUkHVD8R3OpoVEzT8XHjqe8OUw3gY=; h=From:To:Cc:Subject:Date:From; b=sYvu8MeE8IkSjV0il7aiZosZ3Ymd/EG2GU0ZPtN5SOXpcSpNNccXSaX6RKlIjaO7U jvzp0UG0cp3SOzwlvfN0gdnHpTOEgaS/DXk49f+e9ywvaHydsCwhclZcOkluy1fQDb sD5ZtP0dSOQL1ukPvfGAkae4ZCkjpSLMFkEPNHNJ1RkEiDinqraTlWCapZ2moQGS68 XOib7SnGPn9thxaCkYqdZG84EVhknmbxJMXiDLZ/mP3b7wcfAYvYCG9BEUqDBPQlAX mp+M5zXVjvfmJLkZmuY4OVi1DG18xNl76hTe9eqiprWEJ5HUcI9y9vuvjdDvNNoI5X sfriUGYRWbqPA== From: Christian Brauner <brauner@kernel.org> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christian Brauner <brauner@kernel.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [GIT PULL] vfs fixes Date: Fri, 24 Nov 2023 11:27:28 +0100 Message-ID: <20231124-vfs-fixes-3420a81c0abe@brauner> X-Mailer: git-send-email 2.42.0 MIME-Version: 1.0 X-Developer-Signature: v=1; a=openpgp-sha256; l=5043; i=brauner@kernel.org; h=from:subject:message-id; bh=qVwJcFDiIpymt6eUkHVD8R3OpoVEzT8XHjqe8OUw3gY=; b=owGbwMvMwCU28Zj0gdSKO4sYT6slMaQmVE3I2WdSLlbuwSx69I7K3XS/i/s2F1zgC9+W4Pns1 D1WD0PljlIWBjEuBlkxRRaHdpNwueU8FZuNMjVg5rAygQxh4OIUgImYzmdkOPZDQmJrhfnLmA83 ZVyW8B3lczkmpzB52Srt3C7eErHFlxn+GUg4V/3hj+N4rxjw7m6x1+TfW4vXlPKXnXz+dUbror0 TWAE= X-Developer-Key: i=brauner@kernel.org; a=openpgp; fpr=4880B8C9BD0E5106FC070F4F7B3C391EFEA93624 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-1.2 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on groat.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (groat.vger.email [0.0.0.0]); Fri, 24 Nov 2023 02:28:03 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783440800114141069 X-GMAIL-MSGID: 1783440800114141069 |
Series |
[GIT,PULL] vfs fixes
|
|
Pull-request
git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.7-rc3.fixesMessage
Christian Brauner
Nov. 24, 2023, 10:27 a.m. UTC
Hey Linus, /* Summary */ This contains the usual miscellaneous fixes: * Avoid calling back into LSMs from vfs_getattr_nosec() calls. IMA used to query inode properties accessing raw inode fields without dedicated helpers. That was finally fixed a few releases ago by forcing IMA to use vfs_getattr_nosec() helpers. The goal of the vfs_getattr_nosec() helper is to query for attributes without calling into the LSM layer which would be quite problematic because incredibly IMA is called from __fput()... __fput() -> ima_file_free() What it does is to call back into the filesystem to update the file's IMA xattr. Querying the inode without using vfs_getattr_nosec() meant that IMA didn't handle stacking filesystems such as overlayfs correctly. So the switch to vfs_getattr_nosec() is quite correct. But the switch to vfs_getattr_nosec() revealed another bug when used on stacking filesystems: __fput() -> ima_file_free() -> vfs_getattr_nosec() -> i_op->getattr::ovl_getattr() -> vfs_getattr() -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr() -> security_inode_getattr() # calls back into LSMs Now, if that __fput() happens from task_work_run() of an exiting task current->fs and various other pointer could already be NULL. So anything in the LSM layer relying on that not being NULL would be quite surprised. Fix that by passing the information that this is a security request through to the stacking filesystem by adding a new internal ATT_GETATTR_NOSEC flag. Now the callchain becomes: __fput() -> ima_file_free() -> vfs_getattr_nosec() -> i_op->getattr::ovl_getattr() -> if (AT_GETATTR_NOSEC) vfs_getattr_nosec() else vfs_getattr() -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr() * Fix a bug introduced with the iov_iter rework from last cycle. This broke /proc/kcore by copying too much and without the correct offset. * Add a missing NULL check when allocating the root inode in autofs_fill_super(). * Fix stable writes for multi-device filesystems (xfs, btrfs etc) and the block device pseudo filesystem. Stable writes used to be a superblock flag only, making it a per filesystem property. Add an additional AS_STABLE_WRITES mapping flag to allow for fine-grained control. * Ensure that offset_iterate_dir() returns 0 after reaching the end of a directory so it adheres to getdents() convention. /* Testing */ clang: Debian clang version 16.0.6 (16) gcc: gcc (Debian 13.2.0-5) 13.2.0 All patches are based on v6.7-rc1 and have been sitting in linux-next. No build failures or warnings were observed. Passes xfstests. /* Conflicts */ At the time of creating this PR no merge conflicts were reported from linux-next and no merge conflicts showed up doing a test-merge with current mainline. The following changes since commit b85ea95d086471afb4ad062012a4d73cd328fa86: Linux 6.7-rc1 (2023-11-12 16:19:07 -0800) are available in the Git repository at: git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.7-rc3.fixes for you to fetch changes up to 796432efab1e372d404e7a71cc6891a53f105051: libfs: getdents() should return 0 after reaching EOD (2023-11-20 15:34:22 +0100) Please consider pulling these changes from the signed vfs-6.7-rc3.fixes tag. Thanks! Christian ---------------------------------------------------------------- vfs-6.7-rc3.fixes ---------------------------------------------------------------- Christoph Hellwig (4): filemap: add a per-mapping stable writes flag block: update the stable_writes flag in bdev_add xfs: clean up FS_XFLAG_REALTIME handling in xfs_ioctl_setattr_xflags xfs: respect the stable writes flag on the RT device Chuck Lever (1): libfs: getdents() should return 0 after reaching EOD Ian Kent (1): autofs: add: new_inode check in autofs_fill_super() Omar Sandoval (1): iov_iter: fix copy_page_to_iter_nofault() Stefan Berger (1): fs: Pass AT_GETATTR_NOSEC flag to getattr interface function block/bdev.c | 2 ++ fs/autofs/inode.c | 56 +++++++++++++++++----------------------------- fs/ecryptfs/inode.c | 12 ++++++++-- fs/inode.c | 2 ++ fs/libfs.c | 14 +++++++++--- fs/overlayfs/inode.c | 10 ++++----- fs/overlayfs/overlayfs.h | 8 +++++++ fs/stat.c | 6 ++++- fs/xfs/xfs_inode.h | 8 +++++++ fs/xfs/xfs_ioctl.c | 30 ++++++++++++++++--------- fs/xfs/xfs_iops.c | 7 ++++++ include/linux/pagemap.h | 17 ++++++++++++++ include/uapi/linux/fcntl.h | 3 +++ lib/iov_iter.c | 2 +- mm/page-writeback.c | 2 +- 15 files changed, 121 insertions(+), 58 deletions(-)
Comments
On Fri, 24 Nov 2023 at 02:28, Christian Brauner <brauner@kernel.org> wrote: > > * Fix a bug introduced with the iov_iter rework from last cycle. > > This broke /proc/kcore by copying too much and without the correct > offset. Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY broken. It does this: /* * vmalloc uses spinlocks, so we optimistically try to * read memory. If this fails, fault pages in and try * again until we are done. */ while (true) { read += vread_iter(iter, src, left); if (read == tsz) break; src += read; left -= read; if (fault_in_iov_iter_writeable(iter, left)) { ret = -EFAULT; goto out; } } and that is just broken beyond words for two totally independent reasons: (a) vread_iter() looks like it can fail because of not having a source, and return 0 (I dunno - it seems to try to avoid that, but it all looks pretty dodgy) At that point fault_in_iov_iter_writeable() will try to fault in the destination, which may work just fine, but if the source was the problem, you'd have an endless loop. (b) That "read += X" is completely broken anyway. It should be just a "=". So that whole loop is crazy broken, and only works for the case where you get it all in one go. This code is crap. Now, I think it all works in practice for one simple reason: I doubt anybody uses this (and it looks like the callees in the while loop try very hard to always fill the whole area - maybe people noticed the first bug and tried to work around it that way). I guess there is at least one test program, but it presumably doesn't trigger or care about the bugs here. But I think we can get rid of this all, and just deal with the KCORE_VMALLOC case exactly the same way we already deal with VMEMMAP and TEXT: by just doing copy_from_kernel_nofault() into a bounce buffer, and then doing a regular _copy_to_iter() or whatever. NOTE! I looked at the code, and threw up in my mouth a little, and maybe I missed something. Maybe it all works fine. But Omar - since you found the original problem, may I implore you to test this attached patch? I just like how the patch looks: 6 files changed, 1 insertion(+), 368 deletions(-) and those 350+ deleted lines really looked disgusting to me. This patch is on top of the pull I did, because obviously the fix in that pull was correct, I just think we should go further and get rid of this whole mess entirely. Linus
The pull request you sent on Fri, 24 Nov 2023 11:27:28 +0100:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.7-rc3.fixes
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/fa2b906f5148883e2d0be8952767469c2e3de274
Thank you!
On Fri, 24 Nov 2023 at 10:25, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I just like how the patch looks: > > 6 files changed, 1 insertion(+), 368 deletions(-) > > and those 350+ deleted lines really looked disgusting to me. Gaah. I guess it's the VM_IOREMAP case that is the cause of all this horridness. So we'd have to know not to mess with IO mappings. Annoying. But I think my patch could still act as a starting point, except that case KCORE_VMALLOC: would have to have some kind of "if this is not a regular vmalloc, just skip it" logic. So I guess we can't remove all those lines, but we *could* replace all the vread_iter() code with a "bool can_I_vread_this()" instead. So the fallback would still be to just do the bounce buffer copy. Or something. Oh well. Linus
On Fri, 24 Nov 2023 at 10:52, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Gaah. I guess it's the VM_IOREMAP case that is the cause of all this horridness. > > So we'd have to know not to mess with IO mappings. Annoying. Doing a debian code search, I see a number of programs that do a "stat()" on the kcore file, to get some notion of "system memory size". I don't think it's valid, but whatever. We probably shouldn't change it. I also see some programs that actually read the ELF notes and sections for dumping purposes. But does anybody actually run gdb on that thing or similar? That's the original model for that file, but it was always more of a gimmick than anything else. Because we could just say "read zeroes from KCORE_VMALLOC" and be done with it that way. Linus
> Because we could just say "read zeroes from KCORE_VMALLOC" and be done > with it that way. Let's try to do that and see what happens. If we get away with it then great, if not we can think about fixing this.
On Fri, Nov 24, 2023 at 10:25:15AM -0800, Linus Torvalds wrote: > On Fri, 24 Nov 2023 at 02:28, Christian Brauner <brauner@kernel.org> wrote: > > > > * Fix a bug introduced with the iov_iter rework from last cycle. > > > > This broke /proc/kcore by copying too much and without the correct > > offset. > > Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY broken. Ugh, I didn't even look at that closely because the fix was obviously correct for that helper alone. Let's try and just return zeroed memory like you suggested in your last mail before we bother fixing any of this. Long-term plan, it'd be nice to just get distros to start turning /proc/kcore off. Maybe I underestimate legitimate users but this requires CAP_SYS_RAW_IO so it really can only be useful to pretty privileged stuff anyway.
On Sat, Nov 25, 2023 at 02:10:52PM +0100, Christian Brauner wrote: > On Fri, Nov 24, 2023 at 10:25:15AM -0800, Linus Torvalds wrote: > > On Fri, 24 Nov 2023 at 02:28, Christian Brauner <brauner@kernel.org> wrote: > > > > > > * Fix a bug introduced with the iov_iter rework from last cycle. > > > > > > This broke /proc/kcore by copying too much and without the correct > > > offset. > > > > Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY broken. > > Ugh, I didn't even look at that closely because the fix was obviously > correct for that helper alone. Let's try and just return zeroed memory > like you suggested in your last mail before we bother fixing any of > this. > > Long-term plan, it'd be nice to just get distros to start turning > /proc/kcore off. Maybe I underestimate legitimate users but this > requires CAP_SYS_RAW_IO so it really can only be useful to pretty > privileged stuff anyway. drgn needs /proc/kcore for debugging the live kernel, which is a very important use case for lots of our users. And it does in fact read from KCORE_VMALLOC segments, which is why I found and fixed this bug. I'm happy to clean up this code, although it's a holiday weekend here so I won't get to it immediately of course. But please don't rip this out. Omar
On Sat, Nov 25, 2023 at 05:28:49AM -0800, Omar Sandoval wrote: > On Sat, Nov 25, 2023 at 02:10:52PM +0100, Christian Brauner wrote: > > On Fri, Nov 24, 2023 at 10:25:15AM -0800, Linus Torvalds wrote: > > > On Fri, 24 Nov 2023 at 02:28, Christian Brauner <brauner@kernel.org> wrote: > > > > > > > > * Fix a bug introduced with the iov_iter rework from last cycle. > > > > > > > > This broke /proc/kcore by copying too much and without the correct > > > > offset. > > > > > > Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY broken. > > > > Ugh, I didn't even look at that closely because the fix was obviously > > correct for that helper alone. Let's try and just return zeroed memory > > like you suggested in your last mail before we bother fixing any of > > this. > > > > Long-term plan, it'd be nice to just get distros to start turning > > /proc/kcore off. Maybe I underestimate legitimate users but this > > requires CAP_SYS_RAW_IO so it really can only be useful to pretty > > privileged stuff anyway. > > drgn needs /proc/kcore for debugging the live kernel, which is a very > important use case for lots of our users. And it does in fact read from > KCORE_VMALLOC segments, which is why I found and fixed this bug. I'm > happy to clean up this code, although it's a holiday weekend here so I > won't get to it immediately of course. But please don't rip this out. Ugh, I see. I just grepped through the drgn repo. I didn't realize that you were actively using it and not just testing it. If this is actively used then we won't break you ofc. And yeah, if you would fix it that would be great. Given that you're the main active user who happens to have kernel experience you might even want to be made responsible for this file in the future. ;)