Message ID | cover.1689074739.git.legion@kernel.org |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:a6b2:0:b0:3e4:2afc:c1 with SMTP id c18csp405757vqm; Tue, 11 Jul 2023 04:30:07 -0700 (PDT) X-Google-Smtp-Source: APBJJlHhdywxn29VCqK0q5dmMsA2N6m8g6jLwCUdRuEY2Szy6VWmqsxCpc1oGXur9qS2ZX/Fu5f5 X-Received: by 2002:a05:6808:1413:b0:3a3:a262:14d2 with SMTP id w19-20020a056808141300b003a3a26214d2mr20677701oiv.8.1689075007076; Tue, 11 Jul 2023 04:30:07 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1689075007; cv=none; d=google.com; s=arc-20160816; b=J5QfbeIJKh20yQJ5Zfp41EZI+mBldwE49WtQVMWGfLSrSjIGTy13d5aaxmsl9VYYeR QOHx5eLFmoweKhCcWwON4NjW7CP8WWR5vV5vzK99nPv/Xnq/RZLC2Fm9fYdZQ52BHGVN jSXw+f10fQ/KJN1QjSwFImh6d1YmkG5lK8tdwU58mVsM+LbaEP7yPnOsK4h0/v84mMnA btbMxNwqVdqoU7VqUvL5rWV1x/EFYKt1Qu0gyxZyyE7W3nWXmRZishxh+N9V7aVWJ2AE Gq36IdWFzZpDx+XWHtCDyTnX580MnjVexJUbEft9u09O4z9aqliIfkf0fCiALrY+VOgm x7gw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=5pIrunfZ2E5VKWScAQuLNVuQ2ZlRvPpMouogKalTvng=; fh=6ism3Dc4O3S5xaq3Wfu2tj6fXRIbA5d6TtO7Da76gaE=; b=Xw8kRkQLb9UALBhcq4KO95dus3HxdWTdyir8pnLGBVWfygItgZo82FK10+OG6pW1Vc 48KEzEQR+7emi0uaQidSyxNtv3776RVL0eX2ujn0im/rxiu/Z5uHIuJkJb3Vg56MAPPX sG05TmPD6ydu1w7mQJNkHSGk5IEHpuwYR7WHr8ly2Z2qzffG1njJbSfAclomCdEpFIsD AXwspqwY5gCcf4anikIRW6TmPV4SOWBFTvdki4zWWlQh/yVEp6JNn/ygVJ9t3sksRkF9 TOGrKV5UVH3KsUvsbpyMrIBhxK8IARGqNcqxJoZYM6ZFUJKbvBSq2u5hAba5opJFmgq+ INKw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id x5-20020a654145000000b00553b02a9a1asi1260326pgp.249.2023.07.11.04.29.53; Tue, 11 Jul 2023 04:30:07 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231932AbjGKL2N (ORCPT <rfc822;gnulinuxfreebsd@gmail.com> + 99 others); Tue, 11 Jul 2023 07:28:13 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52842 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231767AbjGKL2H (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 11 Jul 2023 07:28:07 -0400 Received: from us-smtp-delivery-44.mimecast.com (unknown [207.211.30.44]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D9EF5E7E for <linux-kernel@vger.kernel.org>; Tue, 11 Jul 2023 04:28:01 -0700 (PDT) Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-223-P9VisGTbPgGWqvxb_D7cHg-1; Tue, 11 Jul 2023 07:26:13 -0400 X-MC-Unique: P9VisGTbPgGWqvxb_D7cHg-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 3A4AF10504C9; Tue, 11 Jul 2023 11:26:13 +0000 (UTC) Received: from localhost.localdomain.com (unknown [10.45.225.44]) by smtp.corp.redhat.com (Postfix) with ESMTP id 1B08863F3C; Tue, 11 Jul 2023 11:25:59 +0000 (UTC) From: Alexey Gladkov <legion@kernel.org> To: LKML <linux-kernel@vger.kernel.org>, Arnd Bergmann <arnd@arndb.de>, linux-api@vger.kernel.org, linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk Cc: James.Bottomley@HansenPartnership.com, acme@kernel.org, alexander.shishkin@linux.intel.com, axboe@kernel.dk, benh@kernel.crashing.org, borntraeger@de.ibm.com, bp@alien8.de, catalin.marinas@arm.com, christian@brauner.io, dalias@libc.org, davem@davemloft.net, deepa.kernel@gmail.com, deller@gmx.de, dhowells@redhat.com, fenghua.yu@intel.com, firoz.khan@linaro.org, fweimer@redhat.com, geert@linux-m68k.org, glebfm@altlinux.org, gor@linux.ibm.com, hare@suse.com, heiko.carstens@de.ibm.com, hpa@zytor.com, ink@jurassic.park.msu.ru, jhogan@kernel.org, kim.phillips@arm.com, ldv@altlinux.org, linux-alpha@vger.kernel.org, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-ia64@vger.kernel.org, linux-m68k@lists.linux-m68k.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linux-s390@vger.kernel.org, linux-sh@vger.kernel.org, linux@armlinux.org.uk, linuxppc-dev@lists.ozlabs.org, luto@kernel.org, mattst88@gmail.com, mingo@redhat.com, monstr@monstr.eu, mpe@ellerman.id.au, namhyung@kernel.org, palmer@sifive.com, paul.burton@mips.com, paulus@samba.org, peterz@infradead.org, ralf@linux-mips.org, rth@twiddle.net, schwidefsky@de.ibm.com, sparclinux@vger.kernel.org, stefan@agner.ch, tglx@linutronix.de, tony.luck@intel.com, tycho@tycho.ws, will@kernel.org, x86@kernel.org, ysato@users.sourceforge.jp Subject: [PATCH v3 0/5] Add a new fchmodat4() syscall Date: Tue, 11 Jul 2023 13:25:41 +0200 Message-Id: <cover.1689074739.git.legion@kernel.org> In-Reply-To: <87o8pscpny.fsf@oldenburg2.str.redhat.com> References: <87o8pscpny.fsf@oldenburg2.str.redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.1 on 10.11.54.5 X-Spam-Status: No, score=-0.5 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_BLOCKED,RDNS_NONE,SPF_HELO_NONE,SPF_SOFTFAIL, T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1771123514791312818 X-GMAIL-MSGID: 1771123514791312818 |
Series |
Add a new fchmodat4() syscall
|
|
Message
Alexey Gladkov
July 11, 2023, 11:25 a.m. UTC
This patch set adds fchmodat4(), a new syscall. The actual implementation is super simple: essentially it's just the same as fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. I've attempted to make this match "man 2 fchmodat" as closely as possible, which says EINVAL is returned for invalid flags (as opposed to ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW). I have a sketch of a glibc patch that I haven't even compiled yet, but seems fairly straight-forward: diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c index 6d9cbc1ce9e0..b1beab76d56c 100644 --- a/sysdeps/unix/sysv/linux/fchmodat.c +++ b/sysdeps/unix/sysv/linux/fchmodat.c @@ -29,12 +29,36 @@ int fchmodat (int fd, const char *file, mode_t mode, int flag) { - if (flag & ~AT_SYMLINK_NOFOLLOW) - return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); -#ifndef __NR_lchmod /* Linux so far has no lchmod syscall. */ + /* There are four paths through this code: + - The flags are zero. In this case it's fine to call fchmodat. + - The flags are non-zero and glibc doesn't have access to + __NR_fchmodat4. In this case all we can do is emulate the error codes + defined by the glibc interface from userspace. + - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has + fchmodat4. This is the simplest case, as the fchmodat4 syscall exactly + matches glibc's library interface so it can be called directly. + - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does + not. In this case we must respect the error codes defined by the glibc + interface instead of returning ENOSYS. + The intent here is to ensure that the kernel is called at most once per + library call, and that the error types defined by glibc are always + respected. */ + +#ifdef __NR_fchmodat4 + long result; +#endif + + if (flag == 0) + return INLINE_SYSCALL (fchmodat, 3, fd, file, mode); + +#ifdef __NR_fchmodat4 + result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag); + if (result == 0 || errno != ENOSYS) + return result; +#endif + if (flag & AT_SYMLINK_NOFOLLOW) return INLINE_SYSCALL_ERROR_RETURN_VALUE (ENOTSUP); -#endif - return INLINE_SYSCALL (fchmodat, 3, fd, file, mode); + return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); } I've never added a new syscall before so I'm not really sure what the proper procedure to follow is. Based on the feedback from my v1 patch set it seems this is somewhat uncontroversial. At this point I don't think there's anything I'm missing, though note that I haven't gotten around to testing it this time because the diff from v1 is trivial for any platform I could reasonably test on. The v1 patches suggest a simple test case, but I didn't re-run it because I don't want to reboot my laptop. Changes since v2 [20190717012719.5524-1-palmer@sifive.com]: * Rebased to master. * The lookup_flags passed to sys_fchmodat4 as suggested by Al Viro. * Selftest added. Changes since v1 [20190531191204.4044-1-palmer@sifive.com]: * All architectures are now supported, which support squashed into a single patch. * The do_fchmodat() helper function has been removed, in favor of directly calling do_fchmodat4(). * The patches are based on 5.2 instead of 5.1. --- Alexey Gladkov (1): selftests: add fchmodat4(2) selftest Palmer Dabbelt (4): Non-functional cleanup of a "__user * filename" fs: Add fchmodat4() arch: Register fchmodat4, usually as syscall 451 tools headers UAPI: Sync files changed by new fchmodat4 syscall arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/open.c | 18 ++- include/linux/syscalls.h | 4 +- include/uapi/asm-generic/unistd.h | 5 +- tools/include/uapi/asm-generic/unistd.h | 5 +- .../arch/mips/entry/syscalls/syscall_n64.tbl | 1 + .../arch/powerpc/entry/syscalls/syscall.tbl | 1 + .../perf/arch/s390/entry/syscalls/syscall.tbl | 1 + .../arch/x86/entry/syscalls/syscall_64.tbl | 1 + tools/testing/selftests/Makefile | 1 + tools/testing/selftests/fchmodat4/.gitignore | 2 + tools/testing/selftests/fchmodat4/Makefile | 6 + .../selftests/fchmodat4/fchmodat4_test.c | 151 ++++++++++++++++++ 29 files changed, 207 insertions(+), 7 deletions(-) create mode 100644 tools/testing/selftests/fchmodat4/.gitignore create mode 100644 tools/testing/selftests/fchmodat4/Makefile create mode 100644 tools/testing/selftests/fchmodat4/fchmodat4_test.c
Comments
* Alexey Gladkov: > This patch set adds fchmodat4(), a new syscall. The actual > implementation is super simple: essentially it's just the same as > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. > I've attempted to make this match "man 2 fchmodat" as closely as > possible, which says EINVAL is returned for invalid flags (as opposed to > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW). > I have a sketch of a glibc patch that I haven't even compiled yet, but > seems fairly straight-forward: > > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c > index 6d9cbc1ce9e0..b1beab76d56c 100644 > --- a/sysdeps/unix/sysv/linux/fchmodat.c > +++ b/sysdeps/unix/sysv/linux/fchmodat.c > @@ -29,12 +29,36 @@ > int > fchmodat (int fd, const char *file, mode_t mode, int flag) > { > - if (flag & ~AT_SYMLINK_NOFOLLOW) > - return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); > -#ifndef __NR_lchmod /* Linux so far has no lchmod syscall. */ > + /* There are four paths through this code: > + - The flags are zero. In this case it's fine to call fchmodat. > + - The flags are non-zero and glibc doesn't have access to > + __NR_fchmodat4. In this case all we can do is emulate the error codes > + defined by the glibc interface from userspace. > + - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has > + fchmodat4. This is the simplest case, as the fchmodat4 syscall exactly > + matches glibc's library interface so it can be called directly. > + - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does If you define __NR_fchmodat4 on all architectures, we can use these constants directly in glibc. We no longer depend on the UAPI definitions of those constants, to cut down the number of code variants, and to make glibc's system call profile independent of the kernel header version at build time. Your version is based on 2.31, more recent versions have some reasonable emulation for fchmodat based on /proc/self/fd. I even wrote a comment describing the same buggy behavior that you witnessed: + /* Some Linux versions with some file systems can actually + change symbolic link permissions via /proc, but this is not + intentional, and it gives inconsistent results (e.g., error + return despite mode change). The expected behavior is that + symbolic link modes cannot be changed at all, and this check + enforces that. */ + if (S_ISLNK (st.st_mode)) + { + __close_nocancel (pathfd); + __set_errno (EOPNOTSUPP); + return -1; + } I think there was some kernel discussion about that behavior before, but apparently, it hasn't led to fixes. I wonder if it makes sense to add a similar error return to the system call implementation? > + not. In this case we must respect the error codes defined by the glibc > + interface instead of returning ENOSYS. > + The intent here is to ensure that the kernel is called at most once per > + library call, and that the error types defined by glibc are always > + respected. */ > + > +#ifdef __NR_fchmodat4 > + long result; > +#endif > + > + if (flag == 0) > + return INLINE_SYSCALL (fchmodat, 3, fd, file, mode); > + > +#ifdef __NR_fchmodat4 > + result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag); > + if (result == 0 || errno != ENOSYS) > + return result; > +#endif The last if condition is the recommended approach, but in the past, it broke container host compatibility pretty badly due to seccomp filters that return EPERM instead of ENOSYS. I guess we'll learn soon enough if that's been fixed by now. 8-P Thanks, Florian
On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote: > * Alexey Gladkov: > > > This patch set adds fchmodat4(), a new syscall. The actual > > implementation is super simple: essentially it's just the same as > > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. > > I've attempted to make this match "man 2 fchmodat" as closely as > > possible, which says EINVAL is returned for invalid flags (as opposed to > > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW). > > I have a sketch of a glibc patch that I haven't even compiled yet, but > > seems fairly straight-forward: > > > > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c > > index 6d9cbc1ce9e0..b1beab76d56c 100644 > > --- a/sysdeps/unix/sysv/linux/fchmodat.c > > +++ b/sysdeps/unix/sysv/linux/fchmodat.c > > @@ -29,12 +29,36 @@ > > int > > fchmodat (int fd, const char *file, mode_t mode, int flag) > > { > > - if (flag & ~AT_SYMLINK_NOFOLLOW) > > - return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); > > -#ifndef __NR_lchmod /* Linux so far has no lchmod syscall. */ > > + /* There are four paths through this code: > > + - The flags are zero. In this case it's fine to call fchmodat. > > + - The flags are non-zero and glibc doesn't have access to > > + __NR_fchmodat4. In this case all we can do is emulate the error codes > > + defined by the glibc interface from userspace. > > + - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has > > + fchmodat4. This is the simplest case, as the fchmodat4 syscall exactly > > + matches glibc's library interface so it can be called directly. > > + - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does > > If you define __NR_fchmodat4 on all architectures, we can use these > constants directly in glibc. We no longer depend on the UAPI > definitions of those constants, to cut down the number of code variants, > and to make glibc's system call profile independent of the kernel header > version at build time. > > Your version is based on 2.31, more recent versions have some reasonable > emulation for fchmodat based on /proc/self/fd. I even wrote a comment > describing the same buggy behavior that you witnessed: > > + /* Some Linux versions with some file systems can actually > + change symbolic link permissions via /proc, but this is not > + intentional, and it gives inconsistent results (e.g., error > + return despite mode change). The expected behavior is that > + symbolic link modes cannot be changed at all, and this check > + enforces that. */ > + if (S_ISLNK (st.st_mode)) > + { > + __close_nocancel (pathfd); > + __set_errno (EOPNOTSUPP); > + return -1; > + } > > I think there was some kernel discussion about that behavior before, but > apparently, it hasn't led to fixes. I think I've explained this somewhere else a couple of months ago but just in case you weren't on that thread or don't remember and apologies if you should already know. A lot of filesystem will happily update the mode of a symlink. The VFS doesn't do anything to prevent this from happening. This is filesystem specific. The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs. Specifically it comes from filesystems that call posix_acl_chmod(), e.g., btrfs via if (!err && attr->ia_valid & ATTR_MODE) err = posix_acl_chmod(idmap, dentry, inode->i_mode); Most filesystems don't implement i_op->set_acl() for POSIX ACLs. So posix_acl_chmod() will report EOPNOTSUPP. By the time posix_acl_chmod() is called, most filesystems will have finished updating the inode. POSIX ACLs also often aren't integrated into transactions so a rollback wouldn't even be possible on some filesystems. Any filesystem that doesn't implement POSIX ACLs at all will obviously never fail unless it blocks mode changes on symlinks. Or filesystems that do have a way to rollback failures from posix_acl_chmod(), or filesystems that do return an error on chmod() on symlinks such as 9p, ntfs, ocfs2. > > I wonder if it makes sense to add a similar error return to the system > call implementation? Hm, blocking symlink mode changes is pretty regression prone. And just blocking it through one interface seems weird and makes things even more inconsistent. So two options I see: (1) minimally invasive: Filesystems that do call posix_acl_chmod() on symlinks need to be changed to stop doing that. (2) might hit us on the head invasive: Try and block symlink mode changes in chmod_common(). Thoughts?
On Tue, Jul 11, 2023 at 05:14:24PM +0200, Christian Brauner wrote: > On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote: > > * Alexey Gladkov: > > > > > This patch set adds fchmodat4(), a new syscall. The actual > > > implementation is super simple: essentially it's just the same as > > > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. > > > I've attempted to make this match "man 2 fchmodat" as closely as > > > possible, which says EINVAL is returned for invalid flags (as opposed to > > > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW). > > > I have a sketch of a glibc patch that I haven't even compiled yet, but > > > seems fairly straight-forward: > > > > > > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c > > > index 6d9cbc1ce9e0..b1beab76d56c 100644 > > > --- a/sysdeps/unix/sysv/linux/fchmodat.c > > > +++ b/sysdeps/unix/sysv/linux/fchmodat.c > > > @@ -29,12 +29,36 @@ > > > int > > > fchmodat (int fd, const char *file, mode_t mode, int flag) > > > { > > > - if (flag & ~AT_SYMLINK_NOFOLLOW) > > > - return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); > > > -#ifndef __NR_lchmod /* Linux so far has no lchmod syscall. */ > > > + /* There are four paths through this code: > > > + - The flags are zero. In this case it's fine to call fchmodat. > > > + - The flags are non-zero and glibc doesn't have access to > > > + __NR_fchmodat4. In this case all we can do is emulate the error codes > > > + defined by the glibc interface from userspace. > > > + - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has > > > + fchmodat4. This is the simplest case, as the fchmodat4 syscall exactly > > > + matches glibc's library interface so it can be called directly. > > > + - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does > > > > If you define __NR_fchmodat4 on all architectures, we can use these > > constants directly in glibc. We no longer depend on the UAPI > > definitions of those constants, to cut down the number of code variants, > > and to make glibc's system call profile independent of the kernel header > > version at build time. > > > > Your version is based on 2.31, more recent versions have some reasonable > > emulation for fchmodat based on /proc/self/fd. I even wrote a comment > > describing the same buggy behavior that you witnessed: > > > > + /* Some Linux versions with some file systems can actually > > + change symbolic link permissions via /proc, but this is not > > + intentional, and it gives inconsistent results (e.g., error > > + return despite mode change). The expected behavior is that > > + symbolic link modes cannot be changed at all, and this check > > + enforces that. */ > > + if (S_ISLNK (st.st_mode)) > > + { > > + __close_nocancel (pathfd); > > + __set_errno (EOPNOTSUPP); > > + return -1; > > + } > > > > I think there was some kernel discussion about that behavior before, but > > apparently, it hasn't led to fixes. > > I think I've explained this somewhere else a couple of months ago but > just in case you weren't on that thread or don't remember and apologies > if you should already know. > > A lot of filesystem will happily update the mode of a symlink. The VFS > doesn't do anything to prevent this from happening. This is filesystem > specific. > > The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs. > Specifically it comes from filesystems that call posix_acl_chmod(), > e.g., btrfs via > > if (!err && attr->ia_valid & ATTR_MODE) > err = posix_acl_chmod(idmap, dentry, inode->i_mode); > > Most filesystems don't implement i_op->set_acl() for POSIX ACLs. > So posix_acl_chmod() will report EOPNOTSUPP. By the time > posix_acl_chmod() is called, most filesystems will have finished > updating the inode. POSIX ACLs also often aren't integrated into > transactions so a rollback wouldn't even be possible on some > filesystems. > > Any filesystem that doesn't implement POSIX ACLs at all will obviously > never fail unless it blocks mode changes on symlinks. Or filesystems > that do have a way to rollback failures from posix_acl_chmod(), or > filesystems that do return an error on chmod() on symlinks such as 9p, > ntfs, ocfs2. > > > > > I wonder if it makes sense to add a similar error return to the system > > call implementation? > > Hm, blocking symlink mode changes is pretty regression prone. And just > blocking it through one interface seems weird and makes things even more > inconsistent. > > So two options I see: > (1) minimally invasive: > Filesystems that do call posix_acl_chmod() on symlinks need to be > changed to stop doing that. > (2) might hit us on the head invasive: > Try and block symlink mode changes in chmod_common(). > > Thoughts? > We have third option. We can choose not to call chmod_common and return an error right away: diff --git a/fs/open.c b/fs/open.c index 39a7939f0d00..86a427a2a083 100644 --- a/fs/open.c +++ b/fs/open.c @@ -679,7 +679,9 @@ static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, int l retry: error = user_path_at(dfd, filename, lookup_flags, &path); if (!error) { - error = chmod_common(&path, mode); + error = -EOPNOTSUPP; + if (!(flags & AT_SYMLINK_NOFOLLOW) || !S_ISLNK(path.dentry->d_inode->i_mode)) + error = chmod_common(&path, mode); path_put(&path); if (retry_estale(error, lookup_flags)) { lookup_flags |= LOOKUP_REVAL; It doesn't seem to be invasive.
On Tue, Jul 25, 2023 at 01:05:40PM +0200, Alexey Gladkov wrote: > On Tue, Jul 11, 2023 at 05:14:24PM +0200, Christian Brauner wrote: > > On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote: > > > * Alexey Gladkov: > > > > > > > This patch set adds fchmodat4(), a new syscall. The actual > > > > implementation is super simple: essentially it's just the same as > > > > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. > > > > I've attempted to make this match "man 2 fchmodat" as closely as > > > > possible, which says EINVAL is returned for invalid flags (as opposed to > > > > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW). > > > > I have a sketch of a glibc patch that I haven't even compiled yet, but > > > > seems fairly straight-forward: > > > > > > > > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c > > > > index 6d9cbc1ce9e0..b1beab76d56c 100644 > > > > --- a/sysdeps/unix/sysv/linux/fchmodat.c > > > > +++ b/sysdeps/unix/sysv/linux/fchmodat.c > > > > @@ -29,12 +29,36 @@ > > > > int > > > > fchmodat (int fd, const char *file, mode_t mode, int flag) > > > > { > > > > - if (flag & ~AT_SYMLINK_NOFOLLOW) > > > > - return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); > > > > -#ifndef __NR_lchmod /* Linux so far has no lchmod syscall. */ > > > > + /* There are four paths through this code: > > > > + - The flags are zero. In this case it's fine to call fchmodat. > > > > + - The flags are non-zero and glibc doesn't have access to > > > > + __NR_fchmodat4. In this case all we can do is emulate the error codes > > > > + defined by the glibc interface from userspace. > > > > + - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has > > > > + fchmodat4. This is the simplest case, as the fchmodat4 syscall exactly > > > > + matches glibc's library interface so it can be called directly. > > > > + - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does > > > > > > If you define __NR_fchmodat4 on all architectures, we can use these > > > constants directly in glibc. We no longer depend on the UAPI > > > definitions of those constants, to cut down the number of code variants, > > > and to make glibc's system call profile independent of the kernel header > > > version at build time. > > > > > > Your version is based on 2.31, more recent versions have some reasonable > > > emulation for fchmodat based on /proc/self/fd. I even wrote a comment > > > describing the same buggy behavior that you witnessed: > > > > > > + /* Some Linux versions with some file systems can actually > > > + change symbolic link permissions via /proc, but this is not > > > + intentional, and it gives inconsistent results (e.g., error > > > + return despite mode change). The expected behavior is that > > > + symbolic link modes cannot be changed at all, and this check > > > + enforces that. */ > > > + if (S_ISLNK (st.st_mode)) > > > + { > > > + __close_nocancel (pathfd); > > > + __set_errno (EOPNOTSUPP); > > > + return -1; > > > + } > > > > > > I think there was some kernel discussion about that behavior before, but > > > apparently, it hasn't led to fixes. > > > > I think I've explained this somewhere else a couple of months ago but > > just in case you weren't on that thread or don't remember and apologies > > if you should already know. > > > > A lot of filesystem will happily update the mode of a symlink. The VFS > > doesn't do anything to prevent this from happening. This is filesystem > > specific. > > > > The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs. > > Specifically it comes from filesystems that call posix_acl_chmod(), > > e.g., btrfs via > > > > if (!err && attr->ia_valid & ATTR_MODE) > > err = posix_acl_chmod(idmap, dentry, inode->i_mode); > > > > Most filesystems don't implement i_op->set_acl() for POSIX ACLs. > > So posix_acl_chmod() will report EOPNOTSUPP. By the time > > posix_acl_chmod() is called, most filesystems will have finished > > updating the inode. POSIX ACLs also often aren't integrated into > > transactions so a rollback wouldn't even be possible on some > > filesystems. > > > > Any filesystem that doesn't implement POSIX ACLs at all will obviously > > never fail unless it blocks mode changes on symlinks. Or filesystems > > that do have a way to rollback failures from posix_acl_chmod(), or > > filesystems that do return an error on chmod() on symlinks such as 9p, > > ntfs, ocfs2. > > > > > > > > I wonder if it makes sense to add a similar error return to the system > > > call implementation? > > > > Hm, blocking symlink mode changes is pretty regression prone. And just > > blocking it through one interface seems weird and makes things even more > > inconsistent. > > > > So two options I see: > > (1) minimally invasive: > > Filesystems that do call posix_acl_chmod() on symlinks need to be > > changed to stop doing that. > > (2) might hit us on the head invasive: > > Try and block symlink mode changes in chmod_common(). > > > > Thoughts? > > > > We have third option. We can choose not to call chmod_common and return an > error right away: > > diff --git a/fs/open.c b/fs/open.c > index 39a7939f0d00..86a427a2a083 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -679,7 +679,9 @@ static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, int l > retry: > error = user_path_at(dfd, filename, lookup_flags, &path); > if (!error) { > - error = chmod_common(&path, mode); > + error = -EOPNOTSUPP; > + if (!(flags & AT_SYMLINK_NOFOLLOW) || !S_ISLNK(path.dentry->d_inode->i_mode)) > + error = chmod_common(&path, mode); > path_put(&path); > if (retry_estale(error, lookup_flags)) { > lookup_flags |= LOOKUP_REVAL; > > It doesn't seem to be invasive. https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=77b652535528770217186589d97261847f15f862