Message ID | cover.1674660533.git.legion@kernel.org |
---|---|
Headers | From: Alexey Gladkov <legion@kernel.org> To: LKML <linux-kernel@vger.kernel.org>, containers@lists.linux.dev, linux-fsdevel@vger.kernel.org Cc: Alexey Dobriyan <adobriyan@gmail.com>, Al Viro <viro@zeniv.linux.org.uk>, Andrew Morton <akpm@linux-foundation.org>, Christian Brauner <brauner@kernel.org>, Val Cowan <vcowan@redhat.com> Subject: [RFC PATCH v1 0/6] proc: Add allowlist for procfs files Date: Wed, 25 Jan 2023 16:28:47 +0100 Message-Id: <cover.1674660533.git.legion@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series | proc: Add allowlist for procfs files |
Message
Alexey Gladkov
Jan. 25, 2023, 3:28 p.m. UTC
The patch set extends the subset= option. If procfs is mounted with the
subset=allowlist option, the /proc/allowlist file will appear. This file
contains the filenames and directories that are allowed for this
mountpoint. By default, /proc/allowlist contains only its own name.
Changing the allowlist is possible as long as /proc/allowlist is present
in the allowlist itself.

This allowlist is applied in lookup/readdir, so files that modules create
after mounting will not be visible.

Compared to the previous patches [1][2], I switched from listing
filenames in the mount options to a special virtual file.

[1] https://lore.kernel.org/lkml/20200604200413.587896-1-gladkov.alexey@gmail.com/
[2] https://lore.kernel.org/lkml/YZvuN0Wqmn7XB4dX@localhost.localdomain/
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
Alexey Gladkov (6):
proc: Fix separator for subset option
proc: Add allowlist to control access to procfs files
proc: Check that subset= option has been set
proc: Allow to use the allowlist filter in userns
proc: Validate incoming allowlist
doc: proc: Add description of subset=allowlist
Documentation/filesystems/proc.rst | 10 +
fs/proc/Kconfig | 10 +
fs/proc/Makefile | 1 +
fs/proc/generic.c | 15 +-
fs/proc/inode.c | 16 +-
fs/proc/internal.h | 33 ++++
fs/proc/proc_allowlist.c | 300 +++++++++++++++++++++++++++++
fs/proc/root.c | 36 +++-
include/linux/proc_fs.h | 18 +-
9 files changed, 420 insertions(+), 19 deletions(-)
create mode 100644 fs/proc/proc_allowlist.c
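As a rough illustration of the lookup/readdir filtering described in the cover letter, the sketch below models it in userspace Python. It is not the code from fs/proc/proc_allowlist.c. The class name AllowlistModel, its methods, and the matching rules are assumptions made for the example: a listed directory exposes everything beneath it, and the ancestors of a listed path stay visible so the path can be reached. The thread itself does not spell out these details.

```python
from pathlib import PurePosixPath


def normalize(entry: str) -> PurePosixPath:
    """Turn '/sys/vm/', 'sys/vm' and similar into a path relative to /proc."""
    return PurePosixPath(entry.strip().strip("/"))


class AllowlistModel:
    """Hypothetical userspace model of a per-mount procfs allowlist."""

    def __init__(self, entries=("allowlist",)):
        # By default the list contains only the allowlist file itself.
        paths = (normalize(e) for e in entries)
        # Drop empty entries; an empty path would match everything.
        self.entries = {p for p in paths if p.parts}

    def lookup_allowed(self, path: str) -> bool:
        """Model of the lookup check (matching rules are assumptions)."""
        p = normalize(path)
        for e in self.entries:
            # Exact match, p lies below a listed directory,
            # or p is an ancestor needed to reach a listed entry.
            if p == e or e in p.parents or p in e.parents:
                return True
        return False

    def readdir(self, directory: str, names):
        """Model of readdir: keep only the names that pass the lookup check."""
        base = normalize(directory)
        return [n for n in names if self.lookup_allowed(str(base / n))]

    def writable(self) -> bool:
        """The list can be changed only while it lists itself."""
        return normalize("allowlist") in self.entries


model = AllowlistModel(["allowlist", "cpuinfo", "sys/vm/compact_memory"])
visible_root = model.readdir("", ["allowlist", "cpuinfo", "kcore", "sys"])
```

Here visible_root omits kcore, and an entry that a module registers after mounting stays hidden for the same reason: it is simply not in the list.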
Comments
On Wed, 25 Jan 2023 16:28:47 +0100 Alexey Gladkov <legion@kernel.org> wrote:

> The patch set extends the subset= option. If procfs is mounted with the
> subset=allowlist option, the /proc/allowlist file will appear. This file
> contains the filenames and directories that are allowed for this
> mountpoint. By default, /proc/allowlist contains only its own name.
> Changing the allowlist is possible as long as /proc/allowlist is present
> in the allowlist itself.
>
> This allowlist is applied in lookup/readdir, so files that modules create
> after mounting will not be visible.
>
> Compared to the previous patches [1][2], I switched from listing
> filenames in the mount options to a special virtual file.
>

Changelog doesn't explain why you think Linux needs this feature. The
[2/6] changelog hints that containers might be involved. IOW, please
fully describe the requirement and use-case(s).

Also, please describe why /proc/allowlist is made available via a mount
option, rather than being permanently present.

And why add to subset=, instead of a separate mount option.

Does /proc/allowlist work in subdirectories? Like, permit presence of
/proc/sys/vm/compact_memory?

I think the whole thing is misnamed, really. "allowlist" implies
access permissions. Some of the text here uses "visibility" and other
places use "presence", which are better. "presentlist" and
/proc/presentlist might be better. But why not simply /proc/contents?

Please run these patches through checkpatch and consider the result.
On Wed, Jan 25, 2023 at 03:36:28PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2023 16:28:47 +0100 Alexey Gladkov <legion@kernel.org> wrote:
>
> > The patch set extends the subset= option. If procfs is mounted with the
> > subset=allowlist option, the /proc/allowlist file will appear. This file
> > contains the filenames and directories that are allowed for this
> > mountpoint. By default, /proc/allowlist contains only its own name.
> > Changing the allowlist is possible as long as /proc/allowlist is present
> > in the allowlist itself.
> >
> > This allowlist is applied in lookup/readdir, so files that modules create
> > after mounting will not be visible.
> >
> > Compared to the previous patches [1][2], I switched from listing
> > filenames in the mount options to a special virtual file.
> >
>
> Changelog doesn't explain why you think Linux needs this feature. The
> [2/6] changelog hints that containers might be involved. IOW, please
> fully describe the requirement and use-case(s).
>
> Also, please describe why /proc/allowlist is made available via a mount
> option, rather than being permanently present.
>
> And why add to subset=, instead of a separate mount option.
>
> Does /proc/allowlist work in subdirectories? Like, permit presence of
> /proc/sys/vm/compact_memory?
>
> I think the whole thing is misnamed, really. "allowlist" implies
> access permissions. Some of the text here uses "visibility" and other
> places use "presence", which are better. "presentlist" and
> /proc/presentlist might be better. But why not simply /proc/contents?

Currently, a lot of container runtimes - even if they mount a new procfs
instance - overmount various procfs files and directories to ensure that
they're hidden from the container workload. (The motivations for this
are mixed and usually it's only needed for containers that run with the
same privilege level as the host.)

The consequence of overmounting is that we need to refuse mounting
procfs again somewhere else; otherwise the procfs instance might reveal
files and directories that were supposed to be hidden.

So this patchset moves the ability to hide entries into the kernel
through an allowlist. This way you can hide files and directories while
being able to mount procfs again because it will inherit the same
allowlist.

I get the motivation. The question is whether this belongs in the
kernel at all. I'm unfortunately not convinced.

This adds a lot of string parsing to procfs and I think we would also
need to decide what a reasonable maximum limit for such allowlists would
be. The data structure likely shouldn't be a linked list but at least an
rbtree, especially if the size isn't limited.

But fundamentally I think it moves something that should be and
currently is a userspace policy into the kernel, which I think is wrong.
Sure, you can't predict what files show up in procfs over time, but then
subset=pid is already your friend - even if not as fine-grained.

If this were another simple subset-style mount option that allowlists a
bunch of well-known global proc files then sure. But making this
dynamically configurable from userspace doesn't make sense to me. I
mean, users could write /gobble/dy/gook into /proc/allowlist or use it
to stash secrets or hashes or whatever, as we have no way of figuring out
whether the entry they allowlist does or will actually ever exist.

In general, such flexibility belongs in userspace imho.

Frankly, if that is really required it would almost make more sense to
be able to attach a new bpf program type to procfs that would allow
filtering procfs entries. Then the filter could be done purely in
userspace. If signed bpf lands one could then even ship signed programs
that are attachable by userns root.
On Wed, Jan 25, 2023 at 03:36:28PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2023 16:28:47 +0100 Alexey Gladkov <legion@kernel.org> wrote:
>
> > The patch set extends the subset= option. If procfs is mounted with the
> > subset=allowlist option, the /proc/allowlist file will appear. This file
> > contains the filenames and directories that are allowed for this
> > mountpoint. By default, /proc/allowlist contains only its own name.
> > Changing the allowlist is possible as long as /proc/allowlist is present
> > in the allowlist itself.
> >
> > This allowlist is applied in lookup/readdir, so files that modules create
> > after mounting will not be visible.
> >
> > Compared to the previous patches [1][2], I switched from listing
> > filenames in the mount options to a special virtual file.
> >
>
> Changelog doesn't explain why you think Linux needs this feature. The
> [2/6] changelog hints that containers might be involved. IOW, please
> fully describe the requirement and use-case(s).

Ok. I will.

Basically, as Christian described, the motivation is to give
containerization programs (docker, podman, etc.) a way to control the
content in procfs. Currently, container tools keep a list of dangerous
files that they hide by overmounting. But procfs is not a static
filesystem, so using a list of bad files to hide the dangerous ones
can't be the solution.

I believe that a container should define a list of files that it
considers useful within the container, and not try to hide what it
considers unwanted.

> Also, please describe why /proc/allowlist is made available via a mount
> option, rather than being permanently present.

Like subset=pid, this file is needed to change the visibility of files
in the procfs mountpoint.

> And why add to subset=, instead of a separate mount option.
>
> Does /proc/allowlist work in subdirectories? Like, permit presence of
> /proc/sys/vm/compact_memory?

Yes. But /proc/allowlist is limited in size to 128K.

> I think the whole thing is misnamed, really. "allowlist" implies
> access permissions. Some of the text here uses "visibility" and other
> places use "presence", which are better. "presentlist" and
> /proc/presentlist might be better. But why not simply /proc/contents?

I'm not attached to the name allowlist at all :) presentlist is perfect
for me. /proc/contents is confusing to me.

> Please run these patches through checkpatch and consider the result.

Ok. I will.
On Thu, Jan 26, 2023 at 11:16:07AM +0100, Christian Brauner wrote:
> On Wed, Jan 25, 2023 at 03:36:28PM -0800, Andrew Morton wrote:
> > On Wed, 25 Jan 2023 16:28:47 +0100 Alexey Gladkov <legion@kernel.org> wrote:
> >
> > > The patch set extends the subset= option. If procfs is mounted with the
> > > subset=allowlist option, the /proc/allowlist file will appear. This file
> > > contains the filenames and directories that are allowed for this
> > > mountpoint. By default, /proc/allowlist contains only its own name.
> > > Changing the allowlist is possible as long as /proc/allowlist is present
> > > in the allowlist itself.
> > >
> > > This allowlist is applied in lookup/readdir, so files that modules create
> > > after mounting will not be visible.
> > >
> > > Compared to the previous patches [1][2], I switched from listing
> > > filenames in the mount options to a special virtual file.
> > >
> >
> > Changelog doesn't explain why you think Linux needs this feature. The
> > [2/6] changelog hints that containers might be involved. IOW, please
> > fully describe the requirement and use-case(s).
> >
> > Also, please describe why /proc/allowlist is made available via a mount
> > option, rather than being permanently present.
> >
> > And why add to subset=, instead of a separate mount option.
> >
> > Does /proc/allowlist work in subdirectories? Like, permit presence of
> > /proc/sys/vm/compact_memory?
> >
> > I think the whole thing is misnamed, really. "allowlist" implies
> > access permissions. Some of the text here uses "visibility" and other
> > places use "presence", which are better. "presentlist" and
> > /proc/presentlist might be better. But why not simply /proc/contents?
>
> Currently, a lot of container runtimes - even if they mount a new procfs
> instance - overmount various procfs files and directories to ensure that
> they're hidden from the container workload. (The motivations for this
> are mixed and usually it's only needed for containers that run with the
> same privilege level as the host.)
>
> The consequence of overmounting is that we need to refuse mounting
> procfs again somewhere else; otherwise the procfs instance might reveal
> files and directories that were supposed to be hidden.
>
> So this patchset moves the ability to hide entries into the kernel
> through an allowlist. This way you can hide files and directories while
> being able to mount procfs again because it will inherit the same
> allowlist.
>
> I get the motivation. The question is whether this belongs in the
> kernel at all. I'm unfortunately not convinced.
>
> This adds a lot of string parsing to procfs and I think we would also
> need to decide what a reasonable maximum limit for such allowlists would
> be. The data structure likely shouldn't be a linked list but at least an
> rbtree, especially if the size isn't limited.

There is a limit. So far I've limited the file size to 128k. I think
this is a reasonable limit.

> But fundamentally I think it moves something that should be and
> currently is a userspace policy into the kernel, which I think is wrong.

We don't have mechanisms to implement this userspace policy. Overmounting
is not a solution; it only plugs holes in the absence of other ways to
control the visibility of files in procfs.

> Sure, you can't predict what files show up in procfs over time, but then
> subset=pid is already your friend - even if not as fine-grained.
>
> If this were another simple subset-style mount option that allowlists a
> bunch of well-known global proc files then sure. But making this
> dynamically configurable from userspace doesn't make sense to me. I
> mean, users could write /gobble/dy/gook into /proc/allowlist or use it
> to stash secrets or hashes or whatever, as we have no way of figuring out
> whether the entry they allowlist does or will actually ever exist.

BTW I only allow printable data to be written to the file. We can make
this file write-only, and then writing any extraneous data there will not
make sense.

> In general, such flexibility belongs in userspace imho.
>
> Frankly, if that is really required it would almost make more sense to
> be able to attach a new bpf program type to procfs that would allow
> filtering procfs entries. Then the filter could be done purely in
> userspace. If signed bpf lands one could then even ship signed programs
> that are attachable by userns root.

I'll ask the podman developers how much more comfortable they would be
using bpf to control file visibility in procfs. Thanks for the idea.
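The two constraints mentioned in this reply, the 128K size cap and the printable-only rule, could be checked along the following lines. This is a userspace sketch in Python under stated assumptions, not the validation code from the series: the names MAX_ALLOWLIST_BYTES and parse_allowlist are hypothetical, "printable" is taken to mean ASCII 0x20 through 0x7e, and entries are assumed to be separated by newlines.

```python
from typing import List

# 128K cap mentioned in the thread; the constant name is hypothetical.
MAX_ALLOWLIST_BYTES = 128 * 1024


def parse_allowlist(data: bytes) -> List[str]:
    """Validate a buffer written to the allowlist and split it into entries.

    Assumptions for this sketch: one entry per line, and only printable
    ASCII (0x20..0x7e) is accepted inside an entry.
    """
    if len(data) > MAX_ALLOWLIST_BYTES:
        raise ValueError("allowlist exceeds the 128K limit")

    entries = []
    for raw in data.split(b"\n"):
        if not raw:
            # Skip empty lines, including the one after a trailing newline.
            continue
        if any(byte < 0x20 or byte > 0x7E for byte in raw):
            raise ValueError("non-printable byte in allowlist entry")
        entries.append(raw.decode("ascii"))
    return entries


entries = parse_allowlist(b"allowlist\ncpuinfo\nsys/vm/compact_memory\n")
```

A check like this rejects binary blobs but, as noted in the review, it cannot tell whether a printable entry names a file that exists or ever will.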
On Thu, Jan 26, 2023 at 02:39:30PM +0100, Alexey Gladkov wrote:
> > In general, such flexibility belongs in userspace imho.
> >
> > Frankly, if that is really required it would almost make more sense to
> > be able to attach a new bpf program type to procfs that would allow
> > filtering procfs entries. Then the filter could be done purely in
> > userspace. If signed bpf lands one could then even ship signed programs
> > that are attachable by userns root.
>
> I'll ask the podman developers how much more comfortable they would be
> using bpf to control file visibility in procfs. Thanks for the idea.

For the record: after digging into eBPF, I came to the conclusion that
nothing needs to be done in kernel space. Access can be controlled via
"lsm/file_open", either per cgroup or per mountpoint, depending on the
task. Each project can make its own choice.

Many thanks for pointing out eBPF.