Message ID | 20230710183338.58531-1-ivan@cloudflare.com |
---|---|
State | New |
Headers |
From: Ivan Babrou <ivan@cloudflare.com>
To: linux-fsdevel@vger.kernel.org
Cc: kernel-team@cloudflare.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Tejun Heo <tj@kernel.org>, Hugh Dickins <hughd@google.com>, Andrew Morton <akpm@linux-foundation.org>, Amir Goldstein <amir73il@gmail.com>, Christoph Hellwig <hch@lst.de>, Jan Kara <jack@suse.cz>, Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>, Ivan Babrou <ivan@cloudflare.com>
Subject: [PATCH] kernfs: attach uuid for every kernfs and report it in fsid
Date: Mon, 10 Jul 2023 11:33:38 -0700
Message-ID: <20230710183338.58531-1-ivan@cloudflare.com>
Series | kernfs: attach uuid for every kernfs and report it in fsid |
Commit Message
Ivan Babrou
July 10, 2023, 6:33 p.m. UTC
The following two commits added the same thing for tmpfs:
* commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid")
* commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs")
Having fsid allows using fanotify, which is especially handy for cgroups,
where one might be interested in knowing when they are created or removed.
Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
---
fs/kernfs/mount.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
Comments
On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote: > The following two commits added the same thing for tmpfs: > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid") > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs") > > Having fsid allows using fanotify, which is especially handy for cgroups, > where one might be interested in knowing when they are created or removed. > > Signed-off-by: Ivan Babrou <ivan@cloudflare.com> > --- > fs/kernfs/mount.c | 13 ++++++++++++- > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > index d49606accb07..930026842359 100644 > --- a/fs/kernfs/mount.c > +++ b/fs/kernfs/mount.c > @@ -16,6 +16,8 @@ > #include <linux/namei.h> > #include <linux/seq_file.h> > #include <linux/exportfs.h> > +#include <linux/uuid.h> > +#include <linux/statfs.h> > > #include "kernfs-internal.h" > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry) > return 0; > } > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf) > +{ > + simple_statfs(dentry, buf); > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); > + return 0; > +} > + > const struct super_operations kernfs_sops = { > - .statfs = simple_statfs, > + .statfs = kernfs_statfs, > .drop_inode = generic_delete_inode, > .evict_inode = kernfs_evict_inode, > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc) > } > sb->s_flags |= SB_ACTIVE; > > + uuid_gen(&sb->s_uuid); Since kernfs has as lot of nodes (like hundreds of thousands if not more at times, being created at boot time), did you just slow down creating them all, and increase the memory usage in a measurable way? We were trying to slim things down, what userspace tools need this change? Who is going to use it, and what for? There were some benchmarks people were doing with booting large memory systems that you might want to reproduce here to verify that nothing is going to be harmed. thanks, greg k-h
On Mon, Jul 10, 2023 at 09:40:23PM +0200, Greg Kroah-Hartman wrote: > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote: > > The following two commits added the same thing for tmpfs: > > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid") > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs") > > > > Having fsid allows using fanotify, which is especially handy for cgroups, > > where one might be interested in knowing when they are created or removed. > > > > Signed-off-by: Ivan Babrou <ivan@cloudflare.com> > > --- > > fs/kernfs/mount.c | 13 ++++++++++++- > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > > index d49606accb07..930026842359 100644 > > --- a/fs/kernfs/mount.c > > +++ b/fs/kernfs/mount.c > > @@ -16,6 +16,8 @@ > > #include <linux/namei.h> > > #include <linux/seq_file.h> > > #include <linux/exportfs.h> > > +#include <linux/uuid.h> > > +#include <linux/statfs.h> > > > > #include "kernfs-internal.h" > > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry) > > return 0; > > } > > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf) > > +{ > > + simple_statfs(dentry, buf); > > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); > > + return 0; > > +} > > + > > const struct super_operations kernfs_sops = { > > - .statfs = simple_statfs, > > + .statfs = kernfs_statfs, > > .drop_inode = generic_delete_inode, > > .evict_inode = kernfs_evict_inode, > > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc) > > } > > sb->s_flags |= SB_ACTIVE; > > > > + uuid_gen(&sb->s_uuid); > > Since kernfs has as lot of nodes (like hundreds of thousands if not more > at times, being created at boot time), did you just slow down creating > them all, and increase the memory usage in a measurable way? > > We were trying to slim things down, what userspace tools need this > change? Who is going to use it, and what for? > > There were some benchmarks people were doing with booting large memory > systems that you might want to reproduce here to verify that nothing is > going to be harmed. Oh wait, is this just a per-superblock thing? confused, greg k-h
On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote: > > The following two commits added the same thing for tmpfs: > > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid") > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs") > > > > Having fsid allows using fanotify, which is especially handy for cgroups, > > where one might be interested in knowing when they are created or removed. > > > > Signed-off-by: Ivan Babrou <ivan@cloudflare.com> > > --- > > fs/kernfs/mount.c | 13 ++++++++++++- > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > > index d49606accb07..930026842359 100644 > > --- a/fs/kernfs/mount.c > > +++ b/fs/kernfs/mount.c > > @@ -16,6 +16,8 @@ > > #include <linux/namei.h> > > #include <linux/seq_file.h> > > #include <linux/exportfs.h> > > +#include <linux/uuid.h> > > +#include <linux/statfs.h> > > > > #include "kernfs-internal.h" > > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry) > > return 0; > > } > > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf) > > +{ > > + simple_statfs(dentry, buf); > > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); > > + return 0; > > +} > > + > > const struct super_operations kernfs_sops = { > > - .statfs = simple_statfs, > > + .statfs = kernfs_statfs, > > .drop_inode = generic_delete_inode, > > .evict_inode = kernfs_evict_inode, > > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc) > > } > > sb->s_flags |= SB_ACTIVE; > > > > + uuid_gen(&sb->s_uuid); > > Since kernfs has as lot of nodes (like hundreds of thousands if not more > at times, being created at boot time), did you just slow down creating > them all, and increase the memory usage in a measurable way? This is just for the superblock, not every inode. The memory increase is one UUID per kernfs instance (there are maybe 10 of them on a basic system), which is trivial. Same goes for CPU usage. > We were trying to slim things down, what userspace tools need this > change? Who is going to use it, and what for? The one concrete thing is ebpf_exporter: * https://github.com/cloudflare/ebpf_exporter I want to monitor cgroup changes, so that I can have an up to date map of inode -> cgroup path, so that I can resolve the value returned from bpf_get_current_cgroup_id() into something that a human can easily grasp (think system.slice/nginx.service). Currently I do a full sweep to build a map, which doesn't work if a cgroup is short lived, as it just disappears before I can resolve it. Unfortunately, systemd recycles cgroups on restart, changing inode number, so this is a very real issue. There's also this old wiki page from systemd: * https://freedesktop.org/wiki/Software/systemd/Optimizations Quoting from there: > Get rid of systemd-cgroups-agent. Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs. So a similar need to mine, but for different systemd-related needs. 
Initially I tried adding this for cgroup fs only, but the problem felt very generic, so I pivoted to having it in kernfs instead, so that any kernfs based filesystem would benefit. Given pretty much non-existing overhead and simplicity of this, I think it's a change worth doing, unless there's a good reason to not do it. I cc'd plenty of people to make sure it's not a bad decision. > There were some benchmarks people were doing with booting large memory > systems that you might want to reproduce here to verify that nothing is > going to be harmed. Skipping this given that overhead is per superblock and trivial.
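A minimal sketch of the kind of full sweep described above: walk the cgroup2 hierarchy and record inode-number-to-path pairs so that values from bpf_get_current_cgroup_id() can be resolved later. The mount point and output format are assumptions, not taken from ebpf_exporter:

```c
#define _GNU_SOURCE
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* Print one inode -> cgroup path pair; a real tool would keep these in a map
   and refresh it periodically (which is exactly what misses short-lived cgroups). */
static int record(const char *path, const struct stat *st, int type, struct FTW *ftw)
{
	(void)ftw;
	if (type == FTW_D)
		printf("%llu %s\n", (unsigned long long)st->st_ino, path);
	return 0;
}

int main(void)
{
	/* Assumes cgroup2 is mounted at /sys/fs/cgroup */
	return nftw("/sys/fs/cgroup", record, 16, FTW_PHYS) ? 1 : 0;
}
```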
On Tue, Jul 11, 2023 at 12:21 AM Ivan Babrou <ivan@cloudflare.com> wrote: > > On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote: > > > The following two commits added the same thing for tmpfs: > > > > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid") > > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs") > > > > > > Having fsid allows using fanotify, which is especially handy for cgroups, > > > where one might be interested in knowing when they are created or removed. > > > > > > Signed-off-by: Ivan Babrou <ivan@cloudflare.com> > > > --- > > > fs/kernfs/mount.c | 13 ++++++++++++- > > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > > > index d49606accb07..930026842359 100644 > > > --- a/fs/kernfs/mount.c > > > +++ b/fs/kernfs/mount.c > > > @@ -16,6 +16,8 @@ > > > #include <linux/namei.h> > > > #include <linux/seq_file.h> > > > #include <linux/exportfs.h> > > > +#include <linux/uuid.h> > > > +#include <linux/statfs.h> > > > > > > #include "kernfs-internal.h" > > > > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry) > > > return 0; > > > } > > > > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf) > > > +{ > > > + simple_statfs(dentry, buf); > > > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); > > > + return 0; > > > +} > > > + > > > const struct super_operations kernfs_sops = { > > > - .statfs = simple_statfs, > > > + .statfs = kernfs_statfs, > > > .drop_inode = generic_delete_inode, > > > .evict_inode = kernfs_evict_inode, > > > > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc) > > > } > > > sb->s_flags |= SB_ACTIVE; > > > > > > + uuid_gen(&sb->s_uuid); > > > > Since kernfs has as lot of nodes (like hundreds of thousands if not more > > at times, being created at boot time), did you just slow down creating > > them all, and increase the memory usage in a measurable way? > > This is just for the superblock, not every inode. The memory increase > is one UUID per kernfs instance (there are maybe 10 of them on a basic > system), which is trivial. Same goes for CPU usage. > > > We were trying to slim things down, what userspace tools need this > > change? Who is going to use it, and what for? > > The one concrete thing is ebpf_exporter: > > * https://github.com/cloudflare/ebpf_exporter > > I want to monitor cgroup changes, so that I can have an up to date map > of inode -> cgroup path, so that I can resolve the value returned from > bpf_get_current_cgroup_id() into something that a human can easily > grasp (think system.slice/nginx.service). Currently I do a full sweep > to build a map, which doesn't work if a cgroup is short lived, as it > just disappears before I can resolve it. Unfortunately, systemd > recycles cgroups on restart, changing inode number, so this is a very > real issue. > > There's also this old wiki page from systemd: > > * https://freedesktop.org/wiki/Software/systemd/Optimizations > > Quoting from there: > > > Get rid of systemd-cgroups-agent. Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). 
This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs. > > So a similar need to mine, but for different systemd-related needs. > > Initially I tried adding this for cgroup fs only, but the problem felt > very generic, so I pivoted to having it in kernfs instead, so that any > kernfs based filesystem would benefit. > > Given pretty much non-existing overhead and simplicity of this, I > think it's a change worth doing, unless there's a good reason to not > do it. I cc'd plenty of people to make sure it's not a bad decision. > I agree. I think it was a good decision. I have some followup questions though. I guess your use case cares about the creation of cgroups? as long as the only way to create a cgroup is via vfs vfs_mkdir() -> ... cgroup_mkdir() fsnotify_mkdir() will be called. Is that a correct statement? Because if not, then explicit fsnotify_mkdir() calls may be needed similar to tracefs/debugfs. I don't think that the statement holds for dieing cgroups, so explicit fsnotify_rmdir() are almost certainly needed to make inotify/fanotify monitoring on cgroups complete. I am on the fence w.r.t making the above a prerequisite to merging your patch. One the one hand, inotify monitoring of cgroups directory was already possible (I think?) with the mentioned shortcomings for a long time. On the other hand, we have an opportunity to add support to fanotify monitoring of cgroups directory only after the missing fsnotify hooks are added, making fanotify API a much more reliable option for monitoring cgroups. So I am leaning towards requiring the missing fsnotify hooks before attaching a unique fsid to cgroups/kernfs. In any case, either with or without the missing hooks, I would not want this patch merged until Jan had a chance to look at the implications and weigh in on the missing hooks question. Jan is on vacation for three weeks, so in the meanwhile, feel free to implement and test the missing hooks or wait for his judgement. On an unrelated side topic, I would like to point your attention to this comment in the patch that was just merged to v6.5-rc1: 69562eb0bd3e ("fanotify: disallow mount/sb marks on kernel internal pseudo fs") /* * mount and sb marks are not allowed on kernel internal pseudo fs, * like pipe_mnt, because that would subscribe to events on all the * anonynous pipes in the system. * * SB_NOUSER covers all of the internal pseudo fs whose objects are not * exposed to user's mount namespace, but there are other SB_KERNMOUNT * fs, like nsfs, debugfs, for which the value of allowing sb and mount * mark is questionable. For now we leave them alone. */ My question to you, as the only user I know of for fanotify FAN_REPORT_FID on SB_KERNMOUNT, do you have plans to use a mount or filesystem mark to monitor cgroups? or only inotify-like directory watches? Thanks, Amir.
On Mon, Jul 10, 2023 at 02:21:10PM -0700, Ivan Babrou wrote: > On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote: > > > The following two commits added the same thing for tmpfs: > > > > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid") > > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs") > > > > > > Having fsid allows using fanotify, which is especially handy for cgroups, > > > where one might be interested in knowing when they are created or removed. > > > > > > Signed-off-by: Ivan Babrou <ivan@cloudflare.com> > > > --- > > > fs/kernfs/mount.c | 13 ++++++++++++- > > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > > > index d49606accb07..930026842359 100644 > > > --- a/fs/kernfs/mount.c > > > +++ b/fs/kernfs/mount.c > > > @@ -16,6 +16,8 @@ > > > #include <linux/namei.h> > > > #include <linux/seq_file.h> > > > #include <linux/exportfs.h> > > > +#include <linux/uuid.h> > > > +#include <linux/statfs.h> > > > > > > #include "kernfs-internal.h" > > > > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry) > > > return 0; > > > } > > > > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf) > > > +{ > > > + simple_statfs(dentry, buf); > > > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); > > > + return 0; > > > +} > > > + > > > const struct super_operations kernfs_sops = { > > > - .statfs = simple_statfs, > > > + .statfs = kernfs_statfs, > > > .drop_inode = generic_delete_inode, > > > .evict_inode = kernfs_evict_inode, > > > > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc) > > > } > > > sb->s_flags |= SB_ACTIVE; > > > > > > + uuid_gen(&sb->s_uuid); > > > > Since kernfs has as lot of nodes (like hundreds of thousands if not more > > at times, being created at boot time), did you just slow down creating > > them all, and increase the memory usage in a measurable way? > > This is just for the superblock, not every inode. The memory increase > is one UUID per kernfs instance (there are maybe 10 of them on a basic > system), which is trivial. Same goes for CPU usage. Ah, ok, my fault, thanks for clearing that up. thanks, greg k-h
On Tue, Jul 11, 2023 at 2:49 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Tue, Jul 11, 2023 at 12:21 AM Ivan Babrou <ivan@cloudflare.com> wrote: > > > > On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > > > > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote: > > > > The following two commits added the same thing for tmpfs: > > > > > > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid") > > > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs") > > > > > > > > Having fsid allows using fanotify, which is especially handy for cgroups, > > > > where one might be interested in knowing when they are created or removed. > > > > > > > > Signed-off-by: Ivan Babrou <ivan@cloudflare.com> > > > > --- > > > > fs/kernfs/mount.c | 13 ++++++++++++- > > > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > > > > index d49606accb07..930026842359 100644 > > > > --- a/fs/kernfs/mount.c > > > > +++ b/fs/kernfs/mount.c > > > > @@ -16,6 +16,8 @@ > > > > #include <linux/namei.h> > > > > #include <linux/seq_file.h> > > > > #include <linux/exportfs.h> > > > > +#include <linux/uuid.h> > > > > +#include <linux/statfs.h> > > > > > > > > #include "kernfs-internal.h" > > > > > > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry) > > > > return 0; > > > > } > > > > > > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf) > > > > +{ > > > > + simple_statfs(dentry, buf); > > > > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); > > > > + return 0; > > > > +} > > > > + > > > > const struct super_operations kernfs_sops = { > > > > - .statfs = simple_statfs, > > > > + .statfs = kernfs_statfs, > > > > .drop_inode = generic_delete_inode, > > > > .evict_inode = kernfs_evict_inode, > > > > > > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc) > > > > } > > > > sb->s_flags |= SB_ACTIVE; > > > > > > > > + uuid_gen(&sb->s_uuid); > > > > > > Since kernfs has as lot of nodes (like hundreds of thousands if not more > > > at times, being created at boot time), did you just slow down creating > > > them all, and increase the memory usage in a measurable way? > > > > This is just for the superblock, not every inode. The memory increase > > is one UUID per kernfs instance (there are maybe 10 of them on a basic > > system), which is trivial. Same goes for CPU usage. > > > > > We were trying to slim things down, what userspace tools need this > > > change? Who is going to use it, and what for? > > > > The one concrete thing is ebpf_exporter: > > > > * https://github.com/cloudflare/ebpf_exporter > > > > I want to monitor cgroup changes, so that I can have an up to date map > > of inode -> cgroup path, so that I can resolve the value returned from > > bpf_get_current_cgroup_id() into something that a human can easily > > grasp (think system.slice/nginx.service). Currently I do a full sweep > > to build a map, which doesn't work if a cgroup is short lived, as it > > just disappears before I can resolve it. Unfortunately, systemd > > recycles cgroups on restart, changing inode number, so this is a very > > real issue. > > > > There's also this old wiki page from systemd: > > > > * https://freedesktop.org/wiki/Software/systemd/Optimizations > > > > Quoting from there: > > > > > Get rid of systemd-cgroups-agent. 
Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs. > > > > So a similar need to mine, but for different systemd-related needs. > > > > Initially I tried adding this for cgroup fs only, but the problem felt > > very generic, so I pivoted to having it in kernfs instead, so that any > > kernfs based filesystem would benefit. > > > > Given pretty much non-existing overhead and simplicity of this, I > > think it's a change worth doing, unless there's a good reason to not > > do it. I cc'd plenty of people to make sure it's not a bad decision. > > > > I agree. I think it was a good decision. > I have some followup questions though. > > I guess your use case cares about the creation of cgroups? > as long as the only way to create a cgroup is via vfs > vfs_mkdir() -> ... cgroup_mkdir() > fsnotify_mkdir() will be called. > Is that a correct statement? As far as I'm aware, this is the only way. We have the cgroups mailing list CC'd to confirm. I checked systemd and docker as real world consumers and both use mkdir and are visible in fanotify with this patch applied. > Because if not, then explicit fsnotify_mkdir() calls may be needed > similar to tracefs/debugfs. > > I don't think that the statement holds for dieing cgroups, > so explicit fsnotify_rmdir() are almost certainly needed to make > inotify/fanotify monitoring on cgroups complete. > > I am on the fence w.r.t making the above a prerequisite to merging > your patch. > > One the one hand, inotify monitoring of cgroups directory was already > possible (I think?) with the mentioned shortcomings for a long time. > > On the other hand, we have an opportunity to add support to fanotify > monitoring of cgroups directory only after the missing fsnotify hooks > are added, making fanotify API a much more reliable option for > monitoring cgroups. > > So I am leaning towards requiring the missing fsnotify hooks before > attaching a unique fsid to cgroups/kernfs. Unless somebody responsible for cgroups says there's a different way to create cgroups, I think this requirement doesn't apply. > In any case, either with or without the missing hooks, I would not > want this patch merged until Jan had a chance to look at the > implications and weigh in on the missing hooks question. > Jan is on vacation for three weeks, so in the meanwhile, feel free > to implement and test the missing hooks or wait for his judgement. Sure, I can definitely wait. > On an unrelated side topic, > I would like to point your attention to this comment in the patch that > was just merged to v6.5-rc1: > > 69562eb0bd3e ("fanotify: disallow mount/sb marks on kernel internal pseudo fs") > > /* > * mount and sb marks are not allowed on kernel internal pseudo fs, > * like pipe_mnt, because that would subscribe to events on all the > * anonynous pipes in the system. > * > * SB_NOUSER covers all of the internal pseudo fs whose objects are not > * exposed to user's mount namespace, but there are other SB_KERNMOUNT > * fs, like nsfs, debugfs, for which the value of allowing sb and mount > * mark is questionable. For now we leave them alone. 
> */ > > My question to you, as the only user I know of for fanotify FAN_REPORT_FID > on SB_KERNMOUNT, do you have plans to use a mount or filesystem mark > to monitor cgroups? or only inotify-like directory watches? My plan is to use FAN_MARK_FILESYSTEM for the whole cgroup mount, since that is what I was able to make work with my limited understanding of the whole fanotify thing. I started with fanotify_fid.c example from here: * https://man7.org/linux/man-pages/man7/fanotify.7.html My existing code does the mark this way (on v6.5-rc1 with my patch applied): ret = fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_ONLYDIR | FAN_MARK_FILESYSTEM, FAN_CREATE | FAN_DELETE | FAN_ONDIR, AT_FDCWD, argv[1]); My goal is to set a watch for all cgroups and drop capabilities, so that I can keep monitoring for events while being unprivileged. As far as I'm aware, this sort of recursive monitoring without races isn't possible with inode level monitoring (I might be wrong here). I do get -EINVAL for FAN_MARK_MOUNT instead of FAN_MARK_FILESYSTEM.
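For reference, a self-contained sketch of such a watcher, loosely following the fanotify_fid.c example from the fanotify(7) man page and using the same mark flags quoted above. Error handling and event parsing are kept minimal; it only prints the raw mask, fsid and handle size of each event:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd;
	char buf[4096];
	ssize_t len;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <cgroupfs mount point>\n", argv[0]);
		return 1;
	}

	/* FAN_REPORT_FID makes events carry an fsid + file handle instead of an fd */
	fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, 0);
	if (fd < 0) {
		perror("fanotify_init");
		return 1;
	}

	/* Same mark as above: filesystem-wide, directories only, create/delete */
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_ONLYDIR | FAN_MARK_FILESYSTEM,
			  FAN_CREATE | FAN_DELETE | FAN_ONDIR, AT_FDCWD, argv[1]) < 0) {
		perror("fanotify_mark");
		return 1;
	}

	while ((len = read(fd, buf, sizeof(buf))) > 0) {
		struct fanotify_event_metadata *ev = (struct fanotify_event_metadata *)buf;

		while (FAN_EVENT_OK(ev, len)) {
			if (!(ev->mask & FAN_Q_OVERFLOW)) {
				struct fanotify_event_info_fid *fid =
					(struct fanotify_event_info_fid *)(ev + 1);
				struct file_handle *handle = (struct file_handle *)fid->handle;

				printf("mask=0x%llx fsid=%08x%08x handle_bytes=%u\n",
				       (unsigned long long)ev->mask,
				       (unsigned)fid->fsid.val[0], (unsigned)fid->fsid.val[1],
				       handle->handle_bytes);
			}
			ev = FAN_EVENT_NEXT(ev, len);
		}
	}
	return 0;
}
```

Run against the cgroup2 mount point (e.g. /sys/fs/cgroup) and create or remove a cgroup directory to see events; without the fsid from this patch, fanotify_mark() on kernfs fails.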
On Wed, Jul 12, 2023 at 1:04 AM Ivan Babrou <ivan@cloudflare.com> wrote: > > On Tue, Jul 11, 2023 at 2:49 AM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Tue, Jul 11, 2023 at 12:21 AM Ivan Babrou <ivan@cloudflare.com> wrote: > > > > > > On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote: > > > > > The following two commits added the same thing for tmpfs: > > > > > > > > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid") > > > > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs") > > > > > > > > > > Having fsid allows using fanotify, which is especially handy for cgroups, > > > > > where one might be interested in knowing when they are created or removed. > > > > > > > > > > Signed-off-by: Ivan Babrou <ivan@cloudflare.com> > > > > > --- > > > > > fs/kernfs/mount.c | 13 ++++++++++++- > > > > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > > > > > index d49606accb07..930026842359 100644 > > > > > --- a/fs/kernfs/mount.c > > > > > +++ b/fs/kernfs/mount.c > > > > > @@ -16,6 +16,8 @@ > > > > > #include <linux/namei.h> > > > > > #include <linux/seq_file.h> > > > > > #include <linux/exportfs.h> > > > > > +#include <linux/uuid.h> > > > > > +#include <linux/statfs.h> > > > > > > > > > > #include "kernfs-internal.h" > > > > > > > > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry) > > > > > return 0; > > > > > } > > > > > > > > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf) > > > > > +{ > > > > > + simple_statfs(dentry, buf); > > > > > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); > > > > > + return 0; > > > > > +} > > > > > + > > > > > const struct super_operations kernfs_sops = { > > > > > - .statfs = simple_statfs, > > > > > + .statfs = kernfs_statfs, > > > > > .drop_inode = generic_delete_inode, > > > > > .evict_inode = kernfs_evict_inode, > > > > > > > > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc) > > > > > } > > > > > sb->s_flags |= SB_ACTIVE; > > > > > > > > > > + uuid_gen(&sb->s_uuid); > > > > > > > > Since kernfs has as lot of nodes (like hundreds of thousands if not more > > > > at times, being created at boot time), did you just slow down creating > > > > them all, and increase the memory usage in a measurable way? > > > > > > This is just for the superblock, not every inode. The memory increase > > > is one UUID per kernfs instance (there are maybe 10 of them on a basic > > > system), which is trivial. Same goes for CPU usage. > > > > > > > We were trying to slim things down, what userspace tools need this > > > > change? Who is going to use it, and what for? > > > > > > The one concrete thing is ebpf_exporter: > > > > > > * https://github.com/cloudflare/ebpf_exporter > > > > > > I want to monitor cgroup changes, so that I can have an up to date map > > > of inode -> cgroup path, so that I can resolve the value returned from > > > bpf_get_current_cgroup_id() into something that a human can easily > > > grasp (think system.slice/nginx.service). Currently I do a full sweep > > > to build a map, which doesn't work if a cgroup is short lived, as it > > > just disappears before I can resolve it. Unfortunately, systemd > > > recycles cgroups on restart, changing inode number, so this is a very > > > real issue. 
> > > > > > There's also this old wiki page from systemd: > > > > > > * https://freedesktop.org/wiki/Software/systemd/Optimizations > > > > > > Quoting from there: > > > > > > > Get rid of systemd-cgroups-agent. Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs. > > > > > > So a similar need to mine, but for different systemd-related needs. > > > > > > Initially I tried adding this for cgroup fs only, but the problem felt > > > very generic, so I pivoted to having it in kernfs instead, so that any > > > kernfs based filesystem would benefit. > > > > > > Given pretty much non-existing overhead and simplicity of this, I > > > think it's a change worth doing, unless there's a good reason to not > > > do it. I cc'd plenty of people to make sure it's not a bad decision. > > > > > > > I agree. I think it was a good decision. > > I have some followup questions though. > > > > I guess your use case cares about the creation of cgroups? > > as long as the only way to create a cgroup is via vfs > > vfs_mkdir() -> ... cgroup_mkdir() > > fsnotify_mkdir() will be called. > > Is that a correct statement? > > As far as I'm aware, this is the only way. We have the cgroups mailing > list CC'd to confirm. > > I checked systemd and docker as real world consumers and both use > mkdir and are visible in fanotify with this patch applied. > > > Because if not, then explicit fsnotify_mkdir() calls may be needed > > similar to tracefs/debugfs. > > > > I don't think that the statement holds for dieing cgroups, > > so explicit fsnotify_rmdir() are almost certainly needed to make > > inotify/fanotify monitoring on cgroups complete. > > > > I am on the fence w.r.t making the above a prerequisite to merging > > your patch. > > > > One the one hand, inotify monitoring of cgroups directory was already > > possible (I think?) with the mentioned shortcomings for a long time. > > > > On the other hand, we have an opportunity to add support to fanotify > > monitoring of cgroups directory only after the missing fsnotify hooks > > are added, making fanotify API a much more reliable option for > > monitoring cgroups. > > > > So I am leaning towards requiring the missing fsnotify hooks before > > attaching a unique fsid to cgroups/kernfs. > > Unless somebody responsible for cgroups says there's a different way > to create cgroups, I think this requirement doesn't apply. > I was more concerned about the reliability of FAN_DELETE for dieing cgroups without an explicit rmdir() from userspace. > > In any case, either with or without the missing hooks, I would not > > want this patch merged until Jan had a chance to look at the > > implications and weigh in on the missing hooks question. > > Jan is on vacation for three weeks, so in the meanwhile, feel free > > to implement and test the missing hooks or wait for his judgement. > > Sure, I can definitely wait. 
> > > On an unrelated side topic, > > I would like to point your attention to this comment in the patch that > > was just merged to v6.5-rc1: > > > > 69562eb0bd3e ("fanotify: disallow mount/sb marks on kernel internal pseudo fs") > > > > /* > > * mount and sb marks are not allowed on kernel internal pseudo fs, > > * like pipe_mnt, because that would subscribe to events on all the > > * anonynous pipes in the system. > > * > > * SB_NOUSER covers all of the internal pseudo fs whose objects are not > > * exposed to user's mount namespace, but there are other SB_KERNMOUNT > > * fs, like nsfs, debugfs, for which the value of allowing sb and mount > > * mark is questionable. For now we leave them alone. > > */ > > > > My question to you, as the only user I know of for fanotify FAN_REPORT_FID > > on SB_KERNMOUNT, do you have plans to use a mount or filesystem mark > > to monitor cgroups? or only inotify-like directory watches? > > My plan is to use FAN_MARK_FILESYSTEM for the whole cgroup mount, > since that is what I was able to make work with my limited > understanding of the whole fanotify thing. I started with > fanotify_fid.c example from here: > > * https://man7.org/linux/man-pages/man7/fanotify.7.html > > My existing code does the mark this way (on v6.5-rc1 with my patch applied): > > ret = fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_ONLYDIR | > FAN_MARK_FILESYSTEM, FAN_CREATE | FAN_DELETE | FAN_ONDIR, AT_FDCWD, > argv[1]); > > My goal is to set a watch for all cgroups and drop capabilities, so > that I can keep monitoring for events while being unprivileged. As long as you do not need to open_by_handle_at() or as long as you retain CAP_DAC_READ_SEARCH. > As far > as I'm aware, this sort of recursive monitoring without races isn't > possible with inode level monitoring (I might be wrong here). You are not wrong. > > I do get -EINVAL for FAN_MARK_MOUNT instead of FAN_MARK_FILESYSTEM. Right. FAN_CREATE | FAN_DELETE not supported with FAN_MARK_MOUNT. Thanks, Amir.
On Tue, Jul 11, 2023 at 10:44 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > I agree. I think it was a good decision. > > > I have some followup questions though. > > > > > > I guess your use case cares about the creation of cgroups? > > > as long as the only way to create a cgroup is via vfs > > > vfs_mkdir() -> ... cgroup_mkdir() > > > fsnotify_mkdir() will be called. > > > Is that a correct statement? > > > > As far as I'm aware, this is the only way. We have the cgroups mailing > > list CC'd to confirm. > > > > I checked systemd and docker as real world consumers and both use > > mkdir and are visible in fanotify with this patch applied. > > > > > Because if not, then explicit fsnotify_mkdir() calls may be needed > > > similar to tracefs/debugfs. > > > > > > I don't think that the statement holds for dieing cgroups, > > > so explicit fsnotify_rmdir() are almost certainly needed to make > > > inotify/fanotify monitoring on cgroups complete. > > > > > > I am on the fence w.r.t making the above a prerequisite to merging > > > your patch. > > > > > > One the one hand, inotify monitoring of cgroups directory was already > > > possible (I think?) with the mentioned shortcomings for a long time. > > > > > > On the other hand, we have an opportunity to add support to fanotify > > > monitoring of cgroups directory only after the missing fsnotify hooks > > > are added, making fanotify API a much more reliable option for > > > monitoring cgroups. > > > > > > So I am leaning towards requiring the missing fsnotify hooks before > > > attaching a unique fsid to cgroups/kernfs. > > > > Unless somebody responsible for cgroups says there's a different way > > to create cgroups, I think this requirement doesn't apply. > > > > I was more concerned about the reliability of FAN_DELETE for > dieing cgroups without an explicit rmdir() from userspace. I checked the code and cgroups are destroyed in cgroup_destroy_locked, which is only called from two places: * In cgroup_mkdir (only on failure) * In cgroup_rmdir See: https://elixir.bootlin.com/linux/v6.5-rc1/A/ident/cgroup_destroy_locked We should be covered here.
Hello. On Mon, Jul 10, 2023 at 02:21:10PM -0700, Ivan Babrou <ivan@cloudflare.com> wrote: > I want to monitor cgroup changes, so that I can have an up to date map > of inode -> cgroup path, so that I can resolve the value returned from > bpf_get_current_cgroup_id() into something that a human can easily > grasp (think system.slice/nginx.service). Have you considered cgroup_path_from_kernfs_id()? > Currently I do a full sweep to build a map, which doesn't work if a > cgroup is short lived, as it just disappears before I can resolve it. > Unfortunately, systemd recycles cgroups on restart, changing inode > number, so this is a very real issue. So, a historical map of cgroup id -> path is also useful for you, right? (IOW, cgroup_path_from_kernfs_id() is possible but it'd inflate log buffer size if full paths were stored instead of ids.) (I think a similar map would be beneficial for SCM_CGROUP [1] idea too.) > There's also this old wiki page from systemd: > > * https://freedesktop.org/wiki/Software/systemd/Optimizations The page also states: > Last edited Sat 18 May 2013 08:20:38 AM UTC Emptiness notifications via release_agent are so 2016 :-), unified hiearchy has more convenient API [2], this is FTR. My 0.02€, Michal [1] https://uapi-group.org/kernel-features/ [2] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html?#un-populated-notification
On Tue, Jul 25, 2023 at 7:07 AM Michal Koutný <mkoutny@suse.com> wrote: > > Hello. > > On Mon, Jul 10, 2023 at 02:21:10PM -0700, Ivan Babrou <ivan@cloudflare.com> wrote: > > I want to monitor cgroup changes, so that I can have an up to date map > > of inode -> cgroup path, so that I can resolve the value returned from > > bpf_get_current_cgroup_id() into something that a human can easily > > grasp (think system.slice/nginx.service). > > Have you considered cgroup_path_from_kernfs_id()? * It's not available from bpf from what I see (should I send a patch?) * It turns short numeric keys into large string keys (you mention this below) * It's a lot more work upfront for every event under a spinlock > > Currently I do a full sweep to build a map, which doesn't work if a > > cgroup is short lived, as it just disappears before I can resolve it. > > Unfortunately, systemd recycles cgroups on restart, changing inode > > number, so this is a very real issue. > > So, a historical map of cgroup id -> path is also useful for you, right? > (IOW, cgroup_path_from_kernfs_id() is possible but it'd inflate log > buffer size if full paths were stored instead of ids.) For the most part the historical map would not be necessary if we had cgroup paths (except for the points I mentioned above). > (I think a similar map would be beneficial for SCM_CGROUP [1] idea too.) Yes, it seems like it. > > There's also this old wiki page from systemd: > > > > * https://freedesktop.org/wiki/Software/systemd/Optimizations > > The page also states: > > > Last edited Sat 18 May 2013 08:20:38 AM UTC > > Emptiness notifications via release_agent are so 2016 :-), unified > hiearchy has more convenient API [2], this is FTR. Sure, but these aren't arguments against having fanotify for cgroup filesystem.
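For context, a libbpf-style sketch of the pattern under discussion: the BPF side keys a map by the compact u64 from bpf_get_current_cgroup_id(), and userspace later resolves those ids to cgroup paths. The tracepoint and map names are illustrative, not taken from ebpf_exporter:

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch: count events per cgroup id; userspace resolves ids to paths. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, __u64);   /* value of bpf_get_current_cgroup_id() */
	__type(value, __u64); /* event count */
} exec_per_cgroup SEC(".maps");

SEC("tracepoint/sched/sched_process_exec")
int count_exec(void *ctx)
{
	__u64 id = bpf_get_current_cgroup_id();
	__u64 one = 1, *count;

	count = bpf_map_lookup_elem(&exec_per_cgroup, &id);
	if (count)
		__sync_fetch_and_add(count, 1);
	else
		bpf_map_update_elem(&exec_per_cgroup, &id, &one, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

The userspace loader is omitted; the point is only that the map key is the short numeric cgroup id rather than a full path string.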
On Tue 11-07-23 12:49:05, Amir Goldstein wrote: > On Tue, Jul 11, 2023 at 12:21 AM Ivan Babrou <ivan@cloudflare.com> wrote: > > > > On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > > > > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote: > > > > The following two commits added the same thing for tmpfs: > > > > > > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid") > > > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs") > > > > > > > > Having fsid allows using fanotify, which is especially handy for cgroups, > > > > where one might be interested in knowing when they are created or removed. > > > > > > > > Signed-off-by: Ivan Babrou <ivan@cloudflare.com> > > > > --- > > > > fs/kernfs/mount.c | 13 ++++++++++++- > > > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > > > > index d49606accb07..930026842359 100644 > > > > --- a/fs/kernfs/mount.c > > > > +++ b/fs/kernfs/mount.c > > > > @@ -16,6 +16,8 @@ > > > > #include <linux/namei.h> > > > > #include <linux/seq_file.h> > > > > #include <linux/exportfs.h> > > > > +#include <linux/uuid.h> > > > > +#include <linux/statfs.h> > > > > > > > > #include "kernfs-internal.h" > > > > > > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry) > > > > return 0; > > > > } > > > > > > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf) > > > > +{ > > > > + simple_statfs(dentry, buf); > > > > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); > > > > + return 0; > > > > +} > > > > + > > > > const struct super_operations kernfs_sops = { > > > > - .statfs = simple_statfs, > > > > + .statfs = kernfs_statfs, > > > > .drop_inode = generic_delete_inode, > > > > .evict_inode = kernfs_evict_inode, > > > > > > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc) > > > > } > > > > sb->s_flags |= SB_ACTIVE; > > > > > > > > + uuid_gen(&sb->s_uuid); > > > > > > Since kernfs has as lot of nodes (like hundreds of thousands if not more > > > at times, being created at boot time), did you just slow down creating > > > them all, and increase the memory usage in a measurable way? > > > > This is just for the superblock, not every inode. The memory increase > > is one UUID per kernfs instance (there are maybe 10 of them on a basic > > system), which is trivial. Same goes for CPU usage. > > > > > We were trying to slim things down, what userspace tools need this > > > change? Who is going to use it, and what for? > > > > The one concrete thing is ebpf_exporter: > > > > * https://github.com/cloudflare/ebpf_exporter > > > > I want to monitor cgroup changes, so that I can have an up to date map > > of inode -> cgroup path, so that I can resolve the value returned from > > bpf_get_current_cgroup_id() into something that a human can easily > > grasp (think system.slice/nginx.service). Currently I do a full sweep > > to build a map, which doesn't work if a cgroup is short lived, as it > > just disappears before I can resolve it. Unfortunately, systemd > > recycles cgroups on restart, changing inode number, so this is a very > > real issue. > > > > There's also this old wiki page from systemd: > > > > * https://freedesktop.org/wiki/Software/systemd/Optimizations > > > > Quoting from there: > > > > > Get rid of systemd-cgroups-agent. 
Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs. > > > > So a similar need to mine, but for different systemd-related needs. > > > > Initially I tried adding this for cgroup fs only, but the problem felt > > very generic, so I pivoted to having it in kernfs instead, so that any > > kernfs based filesystem would benefit. > > > > Given pretty much non-existing overhead and simplicity of this, I > > think it's a change worth doing, unless there's a good reason to not > > do it. I cc'd plenty of people to make sure it's not a bad decision. > > I agree. I think it was a good decision. > I have some followup questions though. > > I guess your use case cares about the creation of cgroups? > as long as the only way to create a cgroup is via vfs > vfs_mkdir() -> ... cgroup_mkdir() > fsnotify_mkdir() will be called. > Is that a correct statement? > Because if not, then explicit fsnotify_mkdir() calls may be needed > similar to tracefs/debugfs. > > I don't think that the statement holds for dieing cgroups, > so explicit fsnotify_rmdir() are almost certainly needed to make > inotify/fanotify monitoring on cgroups complete. Yeah, as Ivan writes, we should already have all that is needed to generate CREATE and DELETE events for the cgroup filesystem. In theory inotify or fanotify for inodes could be already used with cgroupfs now. Thus I have no objection to providing fsid for it so that filesystem-wide notifications can be used for it as well. Feel free to add: Acked-by: Jan Kara <jack@suse.cz> to your patch. > On an unrelated side topic, > I would like to point your attention to this comment in the patch that > was just merged to v6.5-rc1: > > 69562eb0bd3e ("fanotify: disallow mount/sb marks on kernel internal pseudo fs") > > /* > * mount and sb marks are not allowed on kernel internal pseudo fs, > * like pipe_mnt, because that would subscribe to events on all the > * anonynous pipes in the system. > * > * SB_NOUSER covers all of the internal pseudo fs whose objects are not > * exposed to user's mount namespace, but there are other SB_KERNMOUNT > * fs, like nsfs, debugfs, for which the value of allowing sb and mount > * mark is questionable. For now we leave them alone. > */ > > My question to you, as the only user I know of for fanotify FAN_REPORT_FID > on SB_KERNMOUNT, do you have plans to use a mount or filesystem mark > to monitor cgroups? or only inotify-like directory watches? Yeah, for situations file cgroupfs the filesystem-wide watches do make sense and are useful at times as Ivan describes so I guess we'll leave those alone... Honza
On Mon, Jul 31, 2023 at 6:46 AM Jan Kara <jack@suse.cz> wrote: > > I agree. I think it was a good decision. > > I have some followup questions though. > > > > I guess your use case cares about the creation of cgroups? > > as long as the only way to create a cgroup is via vfs > > vfs_mkdir() -> ... cgroup_mkdir() > > fsnotify_mkdir() will be called. > > Is that a correct statement? > > Because if not, then explicit fsnotify_mkdir() calls may be needed > > similar to tracefs/debugfs. > > > > I don't think that the statement holds for dieing cgroups, > > so explicit fsnotify_rmdir() are almost certainly needed to make > > inotify/fanotify monitoring on cgroups complete. > > Yeah, as Ivan writes, we should already have all that is needed to > generate CREATE and DELETE events for the cgroup filesystem. In theory > inotify or fanotify for inodes could be already used with cgroupfs now. > Thus I have no objection to providing fsid for it so that filesystem-wide > notifications can be used for it as well. Feel free to add: > > Acked-by: Jan Kara <jack@suse.cz> > > to your patch. Thank you, I just sent v2: * https://lore.kernel.org/linux-fsdevel/20230731184731.64568-1-ivan@cloudflare.com/
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d49606accb07..930026842359 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -16,6 +16,8 @@
 #include <linux/namei.h>
 #include <linux/seq_file.h>
 #include <linux/exportfs.h>
+#include <linux/uuid.h>
+#include <linux/statfs.h>
 
 #include "kernfs-internal.h"
 
@@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
 	return 0;
 }
 
+int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	simple_statfs(dentry, buf);
+	buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b);
+	return 0;
+}
+
 const struct super_operations kernfs_sops = {
-	.statfs		= simple_statfs,
+	.statfs		= kernfs_statfs,
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= kernfs_evict_inode,
 
@@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc)
 	}
 	sb->s_flags |= SB_ACTIVE;
 
+	uuid_gen(&sb->s_uuid);
+
 	down_write(&root->kernfs_supers_rwsem);
 	list_add(&info->node, &info->root->supers);
 	up_write(&root->kernfs_supers_rwsem);
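A small userspace sketch of what the patch exposes: with it applied, statfs() on a kernfs-backed mount reports an fsid derived from the per-superblock UUID instead of the zeroes left by simple_statfs(). The mount point is an assumption and the __val field layout is glibc's:

```c
#include <stdio.h>
#include <sys/statfs.h>

int main(void)
{
	struct statfs buf;

	/* Any kernfs-backed mount works; cgroup2 is assumed to be mounted here */
	if (statfs("/sys/fs/cgroup", &buf) != 0) {
		perror("statfs");
		return 1;
	}

	/* With the patch, f_fsid comes from the per-superblock UUID;
	   without it, simple_statfs() leaves it as all zeroes. */
	printf("f_fsid = %08x%08x\n",
	       (unsigned)buf.f_fsid.__val[0], (unsigned)buf.f_fsid.__val[1]);
	return 0;
}
```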