Message ID | 20230717093612.40846-1-jiahao.os@bytedance.com |
---|---|
State | New |
Headers | From: Hao Jia <jiahao.os@bytedance.com>; To: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, mkoutny@suse.com; Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org; Subject: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants; Date: Mon, 17 Jul 2023 17:36:12 +0800; Message-Id: <20230717093612.40846-1-jiahao.os@bytedance.com> |
Series |
cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants
Commit Message
Hao Jia
July 17, 2023, 9:36 a.m. UTC
Now the member variable bstat of the structure cgroup_rstat_cpu
records the per-cpu time of the cgroup itself, but does not
include the per-cpu time of its descendants. The per-cpu time
including descendants is very useful for calculating the
per-cpu usage of cgroups.
Although we can indirectly obtain the total per-cpu time
of the cgroup and its descendants by accumulating the per-cpu
bstat of each descendant of the cgroup. But after a child cgroup
is removed, we will lose its bstat information. This will cause
the cumulative value to be non-monotonic, thus affecting
the accuracy of cgroup per-cpu usage.
So we add the cumul_bstat variable to record the total
per-cpu time of this cgroup and its descendants, which is
similar to "cpuacct.usage*" in cgroup v1. And this is
also helpful for the migration from cgroup v1 to cgroup v2.
After adding this variable, we can obtain the per-cpu time of
cgroup and its descendants in user mode through eBPF, etc.
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
---
include/linux/cgroup-defs.h | 3 +++
kernel/cgroup/rstat.c | 4 ++++
2 files changed, 7 insertions(+)
Comments
Hello,

On Mon, Jul 17, 2023 at 05:36:12PM +0800, Hao Jia wrote:
> Now the member variable bstat of the structure cgroup_rstat_cpu

You said "now" indicating that the behavior has changed recently but I don't
see what changed there. Can you elaborate?

> records the per-cpu time of the cgroup itself, but does not
> include the per-cpu time of its descendants. The per-cpu time

It does. The per-cpu delta is added to its parent and then that will in turn
be used to propagate to its parent.

> including descendants is very useful for calculating the
> per-cpu usage of cgroups.
>
> Although we can indirectly obtain the total per-cpu time
> of the cgroup and its descendants by accumulating the per-cpu
> bstat of each descendant of the cgroup. But after a child cgroup
> is removed, we will lose its bstat information. This will cause
> the cumulative value to be non-monotonic, thus affecting
> the accuracy of cgroup per-cpu usage.
>
> So we add the cumul_bstat variable to record the total
> per-cpu time of this cgroup and its descendants, which is
> similar to "cpuacct.usage*" in cgroup v1. And this is
> also helpful for the migration from cgroup v1 to cgroup v2.
> After adding this variable, we can obtain the per-cpu time of
> cgroup and its descendants in user mode through eBPF, etc.

I think you're misunderstanding how the code works. Can you please double
check?

Thanks.
Hello,

On 2023/7/18 Tejun Heo wrote:
> On Mon, Jul 17, 2023 at 05:36:12PM +0800, Hao Jia wrote:
>> Now the member variable bstat of the structure cgroup_rstat_cpu
>
> You said "now" indicating that the behavior has changed recently but I don't
> see what changed there. Can you elaborate?

Thank you for your review, sorry for confusing you with my expression. It is
true that it has not changed, bstat has always recorded the per-cpu time of
the cgroup itself.

>
>> records the per-cpu time of the cgroup itself, but does not
>> include the per-cpu time of its descendants. The per-cpu time
>
> It does. The per-cpu delta is added to its parent and then that will in turn
> be used to propagate to its parent.
>

Yes, so only cgrp->bstat contains the cpu time of the cgroup and its
descendants. rstatc->bstat only records the per-cpu time of the cgroup
itself.

>
> I think you're misunderstanding how the code works. Can you please double
> check?

--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -361,11 +361,15 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 	cgroup_base_stat_sub(&delta, &rstatc->last_bstat);
 	cgroup_base_stat_add(&cgrp->bstat, &delta);
 	cgroup_base_stat_add(&rstatc->last_bstat, &delta);

We add the current cgroup's per-cpu delta to its per-cpu variable
(cumul_bstat).

+	cgroup_base_stat_add(&rstatc->cumul_bstat, &delta);

 	/* propagate global delta to parent (unless that's root) */
 	if (cgroup_parent(parent)) {
 		delta = cgrp->bstat;
+		rstatc = cgroup_rstat_cpu(parent, cpu);
+		cgroup_base_stat_sub(&delta, &cgrp->last_bstat);

Since we hold the global cgroup_rstat_lock and disable interrupts in
cgroup_base_stat_flush(), and we only update cgrp->bstat here, the global
delta at this time is equal to the increment of this cgroup and its
descendants on this cpu (unless root cgroup), so we can add the global delta
to the cumul_bstat of its parent and propagate it upward.

+		cgroup_base_stat_add(&rstatc->cumul_bstat, &delta);
 		cgroup_base_stat_add(&parent->bstat, &delta);
 		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
 	}

I wrote a kernel module to verify and test the above kernel code. The kernel
module adds a cpu.stat_percpu_all file for each cgroup, which can output the
per-cpu time information of the cgroup and its descendants calculated in two
ways:
1. Cumulatively add the per-cpu bstat of the cgroup and each of its
   descendants.
2. Directly output @cumul_bstat read in the kernel (patch has been applied).

When the child cgroup is not removed, the results calculated by the two
methods should be equal.

NOTE: We need to flush the data before "cat cpu.stat_percpu_all", such as
"cat cpu.stat".

Code link: https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main
The testing procedure is in README.

I installed the 6.5.0-rc1 kernel on my machine (an Intel Xeon(R) Platinum
8260 machine, 96 CPUs in total) and did some tests. So far I haven't found
anything unusual. If I'm wrong, please point it out, it's very helpful to me,
thanks again!

Thanks,
Hao
On Tue, Jul 18, 2023 at 06:08:50PM +0800, Hao Jia wrote:
> https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main
Isn't that just adding the same numbers twice and verifying that? Maybe I'm
misunderstanding you. Here's a simpler case:
# cd /sys/fs/cgroup
# mkdir -p asdf/test0
# grep usage_usec asdf/test0/cpu.stat
usage_usec 0
# echo $$ > asdf/test0/cgroup.procs
# stress -c 1 & sleep 1; kill %%
[1] 122329
stress: info: [122329] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
# grep usage_usec asdf/test0/cpu.stat
usage_usec 1000956
[1]+ Terminated stress -c 1
# grep usage_usec asdf/cpu.stat
usage_usec 1002548
# echo $$ > /sys/fs/cgroup/cgroup.procs
# rmdir asdf/test0
# grep usage_usec asdf/cpu.stat
usage_usec 1006338
So, we run `stress -c 1` for 1 second in the asdf/test0 cgroup and
asdf/cpu.stat correctly reports the cumulative usage. After removing
asdf/test0 cgroup, asdf's usage_usec is still there. What's missing here?
What are you adding?
Thanks.
On 2023/7/19 Tejun Heo wrote:
> On Tue, Jul 18, 2023 at 06:08:50PM +0800, Hao Jia wrote:
>> https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main
>
> So, we run `stress -c 1` for 1 second in the asdf/test0 cgroup and
> asdf/cpu.stat correctly reports the cumulative usage. After removing
> asdf/test0 cgroup, asdf's usage_usec is still there. What's missing here?

Sorry, some of my expressions may have misled you.

Yes, cpu.stat will display the cumulative **global** cpu time of the cgroup
and its descendants (the corresponding kernel variable is "cgrp->bstat"),
and it will not be lost when the child cgroup is removed.

Similarly, we need a **per-cpu** variable to record the accumulated per-cpu
time of cgroup and its descendants.
The existing kernel variable "cgroup_rstat_cpu(cgrp, cpu)->bstat" is not
satisfied, it only records the per-cpu time of cgroup itself,
So I try to add "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" to record
per-cpu time of cgroup and its descendants.

In order to verify the correctness of my patch, I wrote a kernel module to
compare the results of calculating the per-cpu time of cgroup and its
descendants in two ways:
Method 1. Traverse and add the per-cpu rstatc->bstat of cgroup and each of
its descendants.
Method 2. Directly read "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" in the
kernel.

When the child cgroup is not removed, the results calculated by the two
methods should be equal.

> What are you adding?

I want to add a **per-cpu variable** to record the cumulative per-cpu time
of cgroup and its descendants, which is similar to the variable
"cgrp->bstat", but it is a per-cpu variable.
It is very useful and convenient for calculating the usage of cgroup on
each cpu, and its behavior is similar to the "cpuacct.usage*" interface of
cgroup v1.

Thanks,
Hao
On 2023/7/19 Hao Jia wrote:
>
> On 2023/7/19 Tejun Heo wrote:
>> On Tue, Jul 18, 2023 at 06:08:50PM +0800, Hao Jia wrote:
>>> https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main
>>
>> So, we run `stress -c 1` for 1 second in the asdf/test0 cgroup and
>> asdf/cpu.stat correctly reports the cumulative usage. After removing
>> asdf/test0 cgroup, asdf's usage_usec is still there. What's missing here?
>
> Sorry, some of my expressions may have misled you.
>
> Yes, cpu.stat will display the cumulative **global** cpu time of the
> cgroup and its descendants (the corresponding kernel variable is
> "cgrp->bstat"), and it will not be lost when the child cgroup is removed.
>
> Similarly, we need a **per-cpu** variable to record the accumulated
> per-cpu time of cgroup and its descendants.
> The existing kernel variable "cgroup_rstat_cpu(cgrp, cpu)->bstat" is not
> satisfied, it only records the per-cpu time of cgroup itself,
> So I try to add "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" to record
> per-cpu time of cgroup and its descendants.
>
> In order to verify the correctness of my patch, I wrote a kernel module
> to compare the results of calculating the per-cpu time of cgroup and its
> descendants in two ways:
> Method 1. Traverse and add the per-cpu rstatc->bstat of cgroup and
> each of its descendants.
> Method 2. Directly read "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" in
> the kernel.
>
> When the child cgroup is not removed, the results calculated by the two
> methods should be equal.
>
>> What are you adding?
> I want to add a **per-cpu variable** to record the cumulative per-cpu
> time of cgroup and its descendants, which is similar to the variable
> "cgrp->bstat", but it is a per-cpu variable.
> It is very useful and convenient for calculating the usage of cgroup on
> each cpu, and its behavior is similar to the "cpuacct.usage*" interface
> of cgroup v1.
>

Hello Tejun,

I don't know if I explained it clearly, and do you understand what I mean?

Would you mind adding a variable like this to facilitate per-cpu usage
calculations and migration from cgroup v1 to cgroup v2?

Thanks,
Hao
On Thu, Jul 27, 2023 at 08:05:44PM +0800, Hao Jia wrote:
> On 2023/7/19 Hao Jia wrote:
> > On 2023/7/19 Tejun Heo wrote:
> > > On Tue, Jul 18, 2023 at 06:08:50PM +0800, Hao Jia wrote:
> > > > https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main
> > >
> > > So, we run `stress -c 1` for 1 second in the asdf/test0 cgroup and
> > > asdf/cpu.stat correctly reports the cumulative usage. After removing
> > > asdf/test0 cgroup, asdf's usage_usec is still there. What's missing here?
> >
> > Sorry, some of my expressions may have misled you.
> >
> > Yes, cpu.stat will display the cumulative **global** cpu time of the
> > cgroup and its descendants (the corresponding kernel variable is
> > "cgrp->bstat"), and it will not be lost when the child cgroup is
> > removed.
> >
> > Similarly, we need a **per-cpu** variable to record the accumulated
> > per-cpu time of cgroup and its descendants.
> > The existing kernel variable "cgroup_rstat_cpu(cgrp, cpu)->bstat" is not
> > satisfied, it only records the per-cpu time of cgroup itself,
> > So I try to add "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" to record
> > per-cpu time of cgroup and its descendants.
> >
> > In order to verify the correctness of my patch, I wrote a kernel module
> > to compare the results of calculating the per-cpu time of cgroup and its
> > descendants in two ways:
> > Method 1. Traverse and add the per-cpu rstatc->bstat of cgroup and
> > each of its descendants.
> > Method 2. Directly read "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" in
> > the kernel.
> >
> > When the child cgroup is not removed, the results calculated by the two
> > methods should be equal.
> >
> > > What are you adding?
> > I want to add a **per-cpu variable** to record the cumulative per-cpu
> > time of cgroup and its descendants, which is similar to the variable
> > "cgrp->bstat", but it is a per-cpu variable.
> > It is very useful and convenient for calculating the usage of cgroup on
> > each cpu, and its behavior is similar to the "cpuacct.usage*" interface
> > of cgroup v1.
>
> Hello Tejun,
>
> I don't know if I explained it clearly, and do you understand what I mean?
>
> Would you mind adding a variable like this to facilitate per-cpu usage
> calculations and migration from cgroup v1 to cgroup v2?

Oh yeah, I do. I'm just thinking whether we also want to expose that in the
cgroupfs. We are currently not showing anything per-cpu and the output
formatting gets nasty with a huge number of CPUs, so maybe that's not going
to work out all that well. Anyways, I'll get back to you next week.

Thanks.
Hello,

On Thu, Jul 27, 2023 at 07:44:28AM -1000, Tejun Heo wrote:
> Oh yeah, I do. I'm just thinking whether we also want to expose that in the
> cgroupfs. We are currently not showing anything per-cpu and the output
> formatting gets nasty with a huge number of CPUs, so maybe that's not going
> to work out all that well. Anyways, I'll get back to you next week.

I couldn't come up with an answer. Let's go ahead with adding the field but
can you please do the followings?

* Name it to something like subtree_bstat instead of cumul_bstat. The
  counters are all cumulative.

* Are you sure the upward propagation logic is correct? It's calculating
  global delta and then propagating to the per-cpu delta of the parent. Is
  that correct because the two delta calculations always end up the same?

* Please add a comment explaining that the field is not currently used
  outside of being read from bpf / drgn and what not and that we're still
  trying to determine how to expose that in the cgroupfs interface.

Thanks.
On 2023/8/3 Tejun Heo wrote:
> I couldn't come up with an answer. Let's go ahead with adding the field but
> can you please do the followings?

Thank you for your suggestion. I am very willing to do it.

> * Name it to something like subtree_bstat instead of cumul_bstat. The
>   counters are all cumulative.

I did it in v2 patch.

> * Are you sure the upward propagation logic is correct? It's calculating
>   global delta and then propagating to the per-cpu delta of the parent. Is
>   that correct because the two delta calculations always end up the same?

Sorry, I made a mistake and misled you. These two deltas are not always
equal. I found and reproduced a bug case:

We build /sys/fs/cgroup/test, /sys/fs/cgroup/test/t1,
/sys/fs/cgroup/test/t1/tt1 in turn. And there are only tasks in
/sys/fs/cgroup/test/t1/tt1. After applying this patch, some operations and
corresponding data changes are as follows:

Step 1. cat /sys/fs/cgroup/test/t1/tt1/cpu.stat

*cpu 6* current flush cgroup /test/t1/tt1
per-cpu delta                     utime 0 stime 0 sum_exec_runtime 257341
/test/t1/tt1 cumul_bstat (cpu 6)  utime 0 stime 0 sum_exec_runtime 257341
/test/t1/tt1 cgrp->bstat          utime 0 stime 0 sum_exec_runtime 257341
if (cgroup_parent(parent)) {
    cgrp delta                    utime 0 stime 0 sum_exec_runtime 257341
    parent(/test/t1) ->bstat      utime 0 stime 0 sum_exec_runtime 257341
    parent(/test/t1) last_bstat   utime 0 stime 0 sum_exec_runtime 0
    parent(/test/t1) cumul_bstat  utime 0 stime 0 sum_exec_runtime 257341
}

Step 2. cat /sys/fs/cgroup/test/t1/tt1/cpu.stat

*cpu 12* current flush cgroup /test/t1/tt1
per-cpu delta                     utime 0 stime 1000000 sum_exec_runtime 747042
/test/t1/tt1 cumul_bstat          utime 0 stime 1000000 sum_exec_runtime 747042
/test/t1/tt1 cgrp->bstat          utime 0 stime 1000000 sum_exec_runtime 1004383
if (cgroup_parent(parent)) {
    cgrp delta                    utime 0 stime 1000000 sum_exec_runtime 747042
    parent(/test/t1) ->bstat      utime 0 stime 1000000 sum_exec_runtime 1004383
    parent(/test/t1) last_bstat   utime 0 stime 0 sum_exec_runtime 0
    parent(/test/t1) cumul_bstat  utime 0 stime 1000000 sum_exec_runtime 747042
}

Step 3. cat /sys/fs/cgroup/test/cpu.stat
(cgroup flush /test/t1/tt1 -> /test/t1 -> /test in turn)

*cpu 6* current flush cgroup /test/t1/tt1
per-cpu delta                     utime 0 stime 0 sum_exec_runtime 263468
/test/t1/tt1 cumul_bstat (cpu 6)  utime 0 stime 0 sum_exec_runtime 520809
/test/t1/tt1 cgrp->bstat          utime 0 stime 1000000 sum_exec_runtime 1267851
if (cgroup_parent(parent)) {
    cgrp delta                    utime 0 stime 0 sum_exec_runtime 263468
    parent(/test/t1) ->bstat      utime 0 stime 1000000 sum_exec_runtime 1267851
    parent(/test/t1) last_bstat   utime 0 stime 0 sum_exec_runtime 0
    parent(/test/t1) cumul_bstat  utime 0 stime 0 sum_exec_runtime 520809
}

*cpu 6* current flush cgroup /test/t1
per-cpu delta                     utime 0 stime 0 sum_exec_runtime 0
/test/t1 cumul_bstat (cpu 6)      utime 0 stime 0 sum_exec_runtime 520809
/test/t1 cgrp->bstat              utime 0 stime 1000000 sum_exec_runtime 1267851
if (cgroup_parent(parent)) {
    cgrp delta                    utime 0 stime 1000000 sum_exec_runtime 1267851  <---
    parent(/test) ->bstat         utime 0 stime 1000000 sum_exec_runtime 1267851
    parent(/test) last_bstat      utime 0 stime 0 sum_exec_runtime 0
    parent(/test) cumul_bstat (cpu 6)  utime 0 stime 1000000 sum_exec_runtime 1267851  <--- *error*

    ******
    Here cgrp delta is *not equal* to the per-cpu delta. The flush frequency
    of the cgroup (/test) and its child cgroup (/test/t1/tt1) is
    inconsistent. In other words, when we call cgroup_base_stat_flush(), we
    will not necessarily flush up to the highest-level cgroup below root
    (like steps 1 and 2 above). Therefore, cgrp delta may contain the
    cumulative value of multiple per-cpu deltas. The correct value of
    parent(/test) cumul_bstat should be utime 0 stime 0
    sum_exec_runtime 520809.
    ******
}

*cpu 6* current flush cgroup /test
per-cpu delta                     utime 0 stime 0 sum_exec_runtime 0
/test cumul_bstat                 utime 0 stime 1000000 sum_exec_runtime 1267851
/test ->bstat                     utime 0 stime 1000000 sum_exec_runtime 1267851
cgroup_parent(parent) is NULL, end.

So we should add a per-cpu variable subtree_last_bstat similar to
cgrp->last_bstat to record the last value. I have sent the v2 patch, please
review it again.
v2 link: https://lore.kernel.org/all/20230807032930.87785-1-jiahao.os@bytedance.com

> * Please add a comment explaining that the field is not currently used
>   outside of being read from bpf / drgn and what not and that we're still
>   trying to determine how to expose that in the cgroupfs interface.

Thanks, I did it in v2 patch.

Thanks,
Hao
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8a0d5466c7be..25114d62089e 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -341,6 +341,9 @@ struct cgroup_rstat_cpu {
 	 */
 	struct cgroup_base_stat last_bstat;
 
+	/* Record the cumulative per-cpu time of cgroup and its descendants */
+	struct cgroup_base_stat cumul_bstat;
+
 	/*
 	 * Child cgroups with stat updates on this cpu since the last read
 	 * are linked on the parent's ->updated_children through
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 2542c21b6b6d..7c0ddd2e90d7 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -361,11 +361,15 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 	cgroup_base_stat_sub(&delta, &rstatc->last_bstat);
 	cgroup_base_stat_add(&cgrp->bstat, &delta);
 	cgroup_base_stat_add(&rstatc->last_bstat, &delta);
+	cgroup_base_stat_add(&rstatc->cumul_bstat, &delta);
 
 	/* propagate global delta to parent (unless that's root) */
 	if (cgroup_parent(parent)) {
 		delta = cgrp->bstat;
+		rstatc = cgroup_rstat_cpu(parent, cpu);
+		cgroup_base_stat_sub(&delta, &cgrp->last_bstat);
+		cgroup_base_stat_add(&rstatc->cumul_bstat, &delta);
 		cgroup_base_stat_add(&parent->bstat, &delta);
 		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
 	}