Message ID | 20230804090858.7605-1-rui.zhang@intel.com |
---|---|
State | New |
Headers |
From: Zhang Rui <rui.zhang@intel.com>
To: mingo@redhat.com, peterz@infradead.org, vincent.guittot@linaro.org
Cc: linux-kernel@vger.kernel.org, tj@kernel.org, srinivas.pandruvada@intel.com
Subject: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance
Date: Fri, 4 Aug 2023 17:08:58 +0800
Message-Id: <20230804090858.7605-1-rui.zhang@intel.com> |
Series |
sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance
|
|
Commit Message
Zhang, Rui
Aug. 4, 2023, 9:08 a.m. UTC
Problem statement
-----------------
When using a cgroup isolated partition to isolate cpus including cpu0, it
is observed that cpu0 is woken up frequently but does nothing. This is
not good for power efficiency.
<idle>-0 [000] 616.491602: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
<idle>-0 [000] 616.491608: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
<idle>-0 [000] 616.491616: rcu_utilization: Start context switch
<idle>-0 [000] 616.491618: rcu_utilization: End context switch
<idle>-0 [000] 616.491637: tick_stop: success=1 dependency=NONE
<idle>-0 [000] 616.491637: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
<idle>-0 [000] 616.491638: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
The above pattern repeats every one or several ticks, resulting in a
total of 2000+ wakeups on cpu0 in 60 seconds, when running a workload on
the cpus that are not in the isolated partition.
Rootcause
---------
In NOHZ mode, an active cpu either sends an IPI or touches the idle
cpu's polling flag to wake it up, so that the idle cpu can pull tasks
from the busy cpu. The logic for selecting the target cpu is to use the
first idle cpu present in both nohz.idle_cpus_mask and
housekeeping_cpumask.
In the above scenario, when cpu0 is in the cgroup isolated partition,
its sched domain is detached, but it is still present in both of the
above cpumasks. As a result, cpu0
1. is always selected when kicking idle load balance,
2. is woken up from the idle loop,
3. calls __schedule() but cannot find any task to pull because it is not
   in any sched_domain; thus it does nothing and reenters idle.
Solution
--------
Fix the problem by skipping cpus with no sched domain attached during
NOHZ idle balance.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..ea3185a46962 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
 		if (ilb == smp_processor_id())
 			continue;
 
+		if (unlikely(on_null_domain(cpu_rq(ilb))))
+			continue;
+
 		if (idle_cpu(ilb))
 			return ilb;
 	}
Comments
On 2023-08-04 at 17:08:58 +0800, Zhang Rui wrote:
> Problem statement
> -----------------
> When using a cgroup isolated partition to isolate cpus including cpu0, it
> is observed that cpu0 is woken up frequently but does nothing. This is
> not good for power efficiency.
[...]
> Solution
> --------
> Fix the problem by skipping cpus with no sched domain attached during
> NOHZ idle balance.
>
> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> ---
>  kernel/sched/fair.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3e25be58e2b..ea3185a46962 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
>  		if (ilb == smp_processor_id())
>  			continue;
>  
> +		if (unlikely(on_null_domain(cpu_rq(ilb))))
> +			continue;
> +
>  		if (idle_cpu(ilb))
>  			return ilb;
>  	}

Is it possible to pass a valid cpumask to kick_ilb() via nohz_balancer_kick()
and let find_new_ilb() scan in that mask? So we could shrink the scan range
and also reduce the null-domain check in each loop. CPUs in different
cpusets are in different root domains; the busy CPU (in cpuset0) will not
ask the nohz idle CPU0 (in isolated cpuset1) to launch idle load balance.

	struct root_domain *rd = rq->rd;
	...
	kick_ilb(flags, rd->span)

thanks,
Chenyu
Hi, Yu,

On Wed, 2023-08-09 at 15:00 +0800, Chen Yu wrote:
> On 2023-08-04 at 17:08:58 +0800, Zhang Rui wrote:
[...]
> Is it possible to pass a valid cpumask to kick_ilb() via
> nohz_balancer_kick() and let find_new_ilb() scan in that mask? So we
> could shrink the scan range and also reduce the null domain check in
> each loop. CPUs in different cpusets are in different root domains;
> the busy CPU (in cpuset0) will not ask the nohz idle CPU0 (in isolated
> cpuset1) to launch idle load balance.
>
> 	struct root_domain *rd = rq->rd;
> 	...
> 	kick_ilb(flags, rd->span)
>
Yeah. This also sounds like a reasonable approach. I can make a patch to
confirm it works as expected.

thanks,
rui
Hi Rui,

On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
> Problem statement
> -----------------
[...]
> In the above scenario, when cpu0 is in the cgroup isolated partition,
> its sched domain is detached, but it is still present in both of the
> above cpumasks. As a result, cpu0

I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
set itself in nohz.idle_cpus_mask and thus should not be chosen as
ilb_cpu. I wonder what's stopping this from working?

> 1. is always selected when kicking idle load balance
> 2. is woken up from the idle loop
> 3. calls __schedule() but cannot find any task to pull because it is not
>    in any sched_domain, thus it does nothing and reenters idle.
[...]
> --
> 2.34.1
On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
> Hi Rui,
>
> On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
[...]
> I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
> set itself in nohz.idle_cpus_mask and thus should not be chosen as
> ilb_cpu. I wonder what's stopping this from working?

One thing I forgot to mention is that the problem is gone if we offline
and re-online those cpus. In that case, the isolated cpus are removed
from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are never
added back.

At runtime, the cpus can be removed from the nohz.idle_cpus_mask in the
below case only:

	trigger_load_balance()
	  if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
	    return;
	  nohz_balancer_kick(rq);
	    nohz_balance_exit_idle()

My understanding is that if a cpu is in nohz.idle_cpus_mask when it is
isolated, there is no chance to remove it from that mask later, so the
check in nohz_balance_enter_idle() does not help.

thanks,
rui
Hello Rui,

On 8/14/23 10:30, Zhang, Rui wrote:
> On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
[...]
>> I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
>> set itself in nohz.idle_cpus_mask and thus should not be chosen as
>> ilb_cpu. I wonder what's stopping this from working?
>
> One thing I forgot to mention is that the problem is gone if we offline
> and re-online those cpus. In that case, the isolated cpus are removed
> from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are never
> added back.
>
> At runtime, the cpus can be removed from the nohz.idle_cpus_mask in the
> below case only:
> 	trigger_load_balance()
> 	  if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
> 	    return;
> 	  nohz_balancer_kick(rq);
> 	    nohz_balance_exit_idle()
>
> My understanding is that if a cpu is in nohz.idle_cpus_mask when it is
> isolated, there is no chance to remove it from that mask later, so the
> check in nohz_balance_enter_idle() does not help.

As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped is
never updated. Instead of avoiding isolated CPUs at each find_new_ilb()
call as this patch does, maybe it would be better to permanently remove
these CPUs from nohz.idle_cpus_mask. This would lower the number of
checks done.

This could be done with something like:

@@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
 	 */
 	rq->has_blocked_load = 1;
 
+	/* If we're a completely isolated CPU, we don't play: */
+	if (on_null_domain(rq)) {
+		if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask)) {
+			cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
+			atomic_dec(&nohz.nr_cpus);
+		}
+
+		/*
+		 * Set nohz_tick_stopped on isolated CPUs to avoid keeping the
+		 * value that was stored when rebuild_sched_domains() was called.
+		 */
+		rq->nohz_tick_stopped = 1;
+	}
+
 	/*
 	 * The tick is still stopped but load could have been added in the
 	 * meantime. We set the nohz.has_blocked flag to trig a check of the
@@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
 	if (rq->nohz_tick_stopped)
 		goto out;
 
-	/* If we're a completely isolated CPU, we don't play: */
-	if (on_null_domain(rq))
-		return;
-
 	rq->nohz_tick_stopped = 1;
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);

Otherwise I could reproduce the issue and the patch was solving it, so:
Tested-by: Pierre Gondois <pierre.gondois@arm.com>

Also, your patch doesn't aim to solve that, but I think there is an issue
when updating cpuset.cpus when an isolated partition was already created:

// Create an isolated partition containing CPU0
# mkdir cgroup
# mount -t cgroup2 none cgroup/
# mkdir cgroup/Testing
# echo "+cpuset" > cgroup/cgroup.subtree_control
# echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
# echo 0 > cgroup/Testing/cpuset.cpus
# echo isolated > cgroup/Testing/cpuset.cpus.partition

// CPU0's sched domain is detached:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0  domain1

// Change the isolated partition to be CPU1
# echo 1 > cgroup/Testing/cpuset.cpus

// CPU[0-1] sched domains are not updated:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0  domain1

Regards,
Pierre
Hello Rui and Aaron,

On 9/11/23 18:23, Zhang, Rui wrote:
> On Mon, 2023-09-11 at 19:42 +0800, Aaron Lu wrote:
>> Hi Pierre,
>>
>> On Fri, Sep 08, 2023 at 11:43:50AM +0200, Pierre Gondois wrote:
[...]
>>> As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped
>>> is never updated. Instead of avoiding isolated CPUs at each
>>> find_new_ilb() call as this patch does, maybe it would be better to
>>> permanently remove these CPUs from nohz.idle_cpus_mask. This would
>>> lower the number of checks done.
>>
>> I agree.
>
> Sounds reasonable to me.
>
>>> This could be done with something like:
>>> @@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
>>>  	 */
>>>  	rq->has_blocked_load = 1;
>>> +	/* If we're a completely isolated CPU, we don't play: */
>>> +	if (on_null_domain(rq)) {
>>> +		if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask)) {
>>> +			cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
>>> +			atomic_dec(&nohz.nr_cpus);
>>> +		}
>>> +
>>> +		/*
>>> +		 * Set nohz_tick_stopped on isolated CPUs to avoid keeping the
>>> +		 * value that was stored when rebuild_sched_domains() was called.
>>> +		 */
>>> +		rq->nohz_tick_stopped = 1;
>>
>> It's not immediately clear to me what the above comment and code mean,
>> maybe that's because I know very little about sched domain related
>> code. Other than this, your change looks good to me.
>
> If we set rq->nohz_tick_stopped for the isolated cpu, next time when we
> invoke nohz_balance_exit_idle(), we clear rq->nohz_tick_stopped and
> decrease nohz.nr_cpus (for the second time). This seems like a problem.
>
> In the current code, when rq->nohz_tick_stopped is set, it means the
> cpu is in the nohz.idle_cpus_mask, and the code above breaks this
> assumption.

Yes right indeed,
This happens when putting a CPU offline (as you mentioned earlier,
putting a CPU offline clears the CPU in the idle_cpus_mask).

The load balancing related variables are unused if a CPU has a NULL rq
as it cannot pull any task. Ideally we should clear them once, when
attaching a NULL sd to the CPU.

The following snippet should do that and solve the issue you mentioned:

--- snip ---
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -9,8 +9,10 @@
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern int get_nohz_timer_target(void);
+extern void nohz_clean_sd_state(int cpu);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
+static inline void nohz_clean_sd_state(int cpu) { }
 #endif
 
 #ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..6fcabe5d08f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
 {
 	SCHED_WARN_ON(rq != this_rq());
 
+	if (on_null_domain(rq))
+		return;
+
 	if (likely(!rq->nohz_tick_stopped))
 		return;
 
@@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
 	rcu_read_unlock();
 }
 
+void nohz_clean_sd_state(int cpu) {
+	struct rq *rq = cpu_rq(cpu);
+
+	rq->nohz_tick_stopped = 0;
+	if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+		atomic_dec(&nohz.nr_cpus);
+	}
+	set_cpu_sd_state_idle(cpu);
+}
+
 /*
  * This routine will record that the CPU is going idle with tick stopped.
  * This info will be used in performing idle load balancing in the future.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..d31137b5f0ce 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
 		static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
 
 	rcu_read_lock();
-	for_each_cpu(i, cpu_map)
+	for_each_cpu(i, cpu_map) {
 		cpu_attach_domain(NULL, &def_root_domain, i);
+		nohz_clean_sd_state(i);
+	}
 	rcu_read_unlock();
 }
--- snip ---

Regards,
Pierre

>
>>> Otherwise I could reproduce the issue and the patch was solving it, so:
>>> Tested-by: Pierre Gondois <pierre.gondois@arm.com>
>
> Thanks for testing, really appreciated!
>
>>> Also, your patch doesn't aim to solve that, but I think there is an
>>> issue when updating cpuset.cpus when an isolated partition was
>>> already created:
[...]
> Interesting. Let me check and get back to you later on this. :)
>
> thanks,
> rui
Hi, Pierre, > > Yes right indeed, > This happens when putting a CPU offline (as you mentioned earlier, > putting a CPU offline clears the CPU in the idle_cpus_mask). > > The load balancing related variables including? > are unused if a CPU has a NULL > rq as it cannot pull any task. Ideally we should clear them once, > when attaching a NULL sd to the CPU. This sounds good to me. But TBH, I don't have enough confidence to do so because I'm not crystal clear about how these variables are used. Some questions about the code below. > > The following snipped should do that and solve the issue you > mentioned: > --- snip --- > --- a/include/linux/sched/nohz.h > +++ b/include/linux/sched/nohz.h > @@ -9,8 +9,10 @@ > #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) > extern void nohz_balance_enter_idle(int cpu); > extern int get_nohz_timer_target(void); > +extern void nohz_clean_sd_state(int cpu); > #else > static inline void nohz_balance_enter_idle(int cpu) { } > +static inline void nohz_clean_sd_state(int cpu) { } > #endif > > #ifdef CONFIG_NO_HZ_COMMON > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index b3e25be58e2b..6fcabe5d08f5 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq) > { > SCHED_WARN_ON(rq != this_rq()); > > + if (on_null_domain(rq)) > + return; > + > if (likely(!rq->nohz_tick_stopped)) > return; > if we force clearing rq->nohz_tick_stopped when detaching domain, why bother adding the first check? 
> > @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu) > rcu_read_unlock(); > } > > +void nohz_clean_sd_state(int cpu) { > + struct rq *rq = cpu_rq(cpu); > + > + rq->nohz_tick_stopped = 0; > + if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) { > + cpumask_clear_cpu(cpu, nohz.idle_cpus_mask); > + atomic_dec(&nohz.nr_cpus); > + } > + set_cpu_sd_state_idle(cpu); > +} > + detach_destroy_domains cpu_attach_domain update_top_cache_domain as we clear per_cpu(sd_llc, cpu) for the isolated cpu in cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here, no? thanks, rui > /* > * This routine will record that the CPU is going idle with tick > stopped. > * This info will be used in performing idle load balancing in the > future. > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index d3a3b2646ec4..d31137b5f0ce 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const > struct cpumask *cpu_map) > > static_branch_dec_cpuslocked(&sched_asym_cpucapacity); > > rcu_read_lock(); > - for_each_cpu(i, cpu_map) > + for_each_cpu(i, cpu_map) { > cpu_attach_domain(NULL, &def_root_domain, i); > + nohz_clean_sd_state(i); > + } > rcu_read_unlock(); > } > > --- snip --- > > Regards, > Pierre > > > > > > > > > > + } > > > > + > > > > /* > > > > * The tick is still stopped but load could have been > > > > added in the > > > > * meantime. 
We set the nohz.has_blocked flag to trig > > > > a > > > > check of the > > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu) > > > > if (rq->nohz_tick_stopped) > > > > goto out; > > > > - /* If we're a completely isolated CPU, we don't play: > > > > */ > > > > - if (on_null_domain(rq)) > > > > - return; > > > > - > > > > rq->nohz_tick_stopped = 1; > > > > cpumask_set_cpu(cpu, nohz.idle_cpus_mask); > > > > > > > > Otherwise I could reproduce the issue and the patch was solving > > > > it, > > > > so: > > > > Tested-by: Pierre Gondois <pierre.gondois@arm.com> > > > > Thanks for testing, really appreciated! > > > > > > > > > > > > > > > > > > > > > > > > Also, your patch doesn't aim to solve that, but I think there > > > > is an > > > > issue > > > > when updating cpuset.cpus when an isolated partition was > > > > already > > > > created: > > > > > > > > // Create an isolated partition containing CPU0 > > > > # mkdir cgroup > > > > # mount -t cgroup2 none cgroup/ > > > > # mkdir cgroup/Testing > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control > > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control > > > > # echo 0 > cgroup/Testing/cpuset.cpus > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition > > > > > > > > // CPU0's sched domain is detached: > > > > # ls /sys/kernel/debug/sched/domains/cpu0/ > > > > # ls /sys/kernel/debug/sched/domains/cpu1/ > > > > domain0 domain1 > > > > > > > > // Change the isolated partition to be CPU1 > > > > # echo 1 > cgroup/Testing/cpuset.cpus > > > > > > > > // CPU[0-1] sched domains are not updated: > > > > # ls /sys/kernel/debug/sched/domains/cpu0/ > > > > # ls /sys/kernel/debug/sched/domains/cpu1/ > > > > domain0 domain1 > > > > > > Interesting. Let me check and get back to you later on this. :) > > > > thanks, > > rui
On 9/14/23 11:23, Zhang, Rui wrote: > Hi, Pierre, > >> >> Yes right indeed, >> This happens when putting a CPU offline (as you mentioned earlier, >> putting a CPU offline clears the CPU in the idle_cpus_mask). >> >> The load balancing related variables > > including? I meant the nohz idle variables in the load balancing, so I was referring to: (struct sched_domain_shared).nr_busy_cpus (struct sched_domain).nohz_idle nohz.idle_cpus_mask nohz.nr_cpus (struct rq).nohz_tick_stopped > >> are unused if a CPU has a NULL >> rq as it cannot pull any task. Ideally we should clear them once, >> when attaching a NULL sd to the CPU. > > This sounds good to me. But TBH, I don't have enough confidence to do > so because I'm not crystal clear about how these variables are used. > > Some questions about the code below. >> >> The following snipped should do that and solve the issue you >> mentioned: >> --- snip --- >> --- a/include/linux/sched/nohz.h >> +++ b/include/linux/sched/nohz.h >> @@ -9,8 +9,10 @@ >> #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) >> extern void nohz_balance_enter_idle(int cpu); >> extern int get_nohz_timer_target(void); >> +extern void nohz_clean_sd_state(int cpu); >> #else >> static inline void nohz_balance_enter_idle(int cpu) { } >> +static inline void nohz_clean_sd_state(int cpu) { } >> #endif >> >> #ifdef CONFIG_NO_HZ_COMMON >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index b3e25be58e2b..6fcabe5d08f5 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq) >> { >> SCHED_WARN_ON(rq != this_rq()); >> >> + if (on_null_domain(rq)) >> + return; >> + >> if (likely(!rq->nohz_tick_stopped)) >> return; >> > if we force clearing rq->nohz_tick_stopped when detaching domain, why > bother adding the first check? Yes you're right. I added this check for safety, but this is not mandatory. 
> >> >> @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu) >> rcu_read_unlock(); >> } >> >> +void nohz_clean_sd_state(int cpu) { >> + struct rq *rq = cpu_rq(cpu); >> + >> + rq->nohz_tick_stopped = 0; >> + if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) { >> + cpumask_clear_cpu(cpu, nohz.idle_cpus_mask); >> + atomic_dec(&nohz.nr_cpus); >> + } >> + set_cpu_sd_state_idle(cpu); >> +} >> + > > detach_destroy_domains > cpu_attach_domain > update_top_cache_domain > > as we clears per_cpu(sd_llc, cpu) for the isolated cpu in > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here, > no? Yes you're right, cpu_attach_domain() and nohz_clean_sd_state() calls have to be inverted to avoid what you just described. It also seems that the current kernel doesn't decrease nr_busy_cpus when putting CPUs in an isolated partition. Indeed if a CPU is counted in nr_busy_cpus, putting the CPU in an isolated partition doesn't trigger any call to set_cpu_sd_state_idle(). So it might be an additional argument. Thanks for reading the patch, Regards, Pierre > > thanks, > rui >> /* >> * This routine will record that the CPU is going idle with tick >> stopped. >> * This info will be used in performing idle load balancing in the >> future. >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c >> index d3a3b2646ec4..d31137b5f0ce 100644 >> --- a/kernel/sched/topology.c >> +++ b/kernel/sched/topology.c >> @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const >> struct cpumask *cpu_map) >> >> static_branch_dec_cpuslocked(&sched_asym_cpucapacity); >> >> rcu_read_lock(); >> - for_each_cpu(i, cpu_map) >> + for_each_cpu(i, cpu_map) { >> cpu_attach_domain(NULL, &def_root_domain, i); >> + nohz_clean_sd_state(i); >> + } >> rcu_read_unlock(); >> } >> >> --- snip --- >> >> Regards, >> Pierre >> >>> >>>> >>>>> + } >>>>> + >>>>> /* >>>>> * The tick is still stopped but load could have been >>>>> added in the >>>>> * meantime. 
We set the nohz.has_blocked flag to trig >>>>> a >>>>> check of the >>>>> @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu) >>>>> if (rq->nohz_tick_stopped) >>>>> goto out; >>>>> - /* If we're a completely isolated CPU, we don't play: >>>>> */ >>>>> - if (on_null_domain(rq)) >>>>> - return; >>>>> - >>>>> rq->nohz_tick_stopped = 1; >>>>> cpumask_set_cpu(cpu, nohz.idle_cpus_mask); >>>>> >>>>> Otherwise I could reproduce the issue and the patch was solving >>>>> it, >>>>> so: >>>>> Tested-by: Pierre Gondois <pierre.gondois@arm.com> >>> >>> Thanks for testing, really appreciated! >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Also, your patch doesn't aim to solve that, but I think there >>>>> is an >>>>> issue >>>>> when updating cpuset.cpus when an isolated partition was >>>>> already >>>>> created: >>>>> >>>>> // Create an isolated partition containing CPU0 >>>>> # mkdir cgroup >>>>> # mount -t cgroup2 none cgroup/ >>>>> # mkdir cgroup/Testing >>>>> # echo "+cpuset" > cgroup/cgroup.subtree_control >>>>> # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control >>>>> # echo 0 > cgroup/Testing/cpuset.cpus >>>>> # echo isolated > cgroup/Testing/cpuset.cpus.partition >>>>> >>>>> // CPU0's sched domain is detached: >>>>> # ls /sys/kernel/debug/sched/domains/cpu0/ >>>>> # ls /sys/kernel/debug/sched/domains/cpu1/ >>>>> domain0 domain1 >>>>> >>>>> // Change the isolated partition to be CPU1 >>>>> # echo 1 > cgroup/Testing/cpuset.cpus >>>>> >>>>> // CPU[0-1] sched domains are not updated: >>>>> # ls /sys/kernel/debug/sched/domains/cpu0/ >>>>> # ls /sys/kernel/debug/sched/domains/cpu1/ >>>>> domain0 domain1 >>>>> >>> Interesting. Let me check and get back to you later on this. :) >>> >>> thanks, >>> rui >
Hi, Pierre, Sorry for the late response. I'm still ramping up on the related code. On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote: > > > On 9/14/23 11:23, Zhang, Rui wrote: > > Hi, Pierre, > > > > > > > > Yes right indeed, > > > This happens when putting a CPU offline (as you mentioned > > > earlier, > > > putting a CPU offline clears the CPU in the idle_cpus_mask). > > > > > > The load balancing related variables > > > > including? > > I meant the nohz idle variables in the load balancing, so I was > referring to: > (struct sched_domain_shared).nr_busy_cpus > (struct sched_domain).nohz_idle > nohz.idle_cpus_mask > nohz.nr_cpus > (struct rq).nohz_tick_stopped IMO, the problem is that, for an isolated CPU, 1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared) 2. it is not a busy cpu (sds->nr_busy_cpus should be decreased) But current code does not have a third state to describe this, so we need to either 1. add extra logic, like on_null_domain() checks or 2. rely on current logic, but update all related variables correctly, like you proposed. But in any case, we should stick with one direction. If we follow the first one, the original patch should be used, which IMO is simple and straightforward. If we follow the latter one, we'd better audit and remove the current on_null_domain() usage at the same time. TBH, I'm not confident enough to make such a change. But if you want to propose something, I'd be glad to test it. thanks, rui > > > > > > are unused if a CPU has a NULL > > > rq as it cannot pull any task. Ideally we should clear them once, > > > when attaching a NULL sd to the CPU. > > > > This sounds good to me. But TBH, I don't have enough confidence to > > do > > so because I'm not crystal clear about how these variables are > > used. > > > > Some questions about the code below. 
> > > > > > The following snipped should do that and solve the issue you > > > mentioned: > > > --- snip --- > > > --- a/include/linux/sched/nohz.h > > > +++ b/include/linux/sched/nohz.h > > > @@ -9,8 +9,10 @@ > > > #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) > > > extern void nohz_balance_enter_idle(int cpu); > > > extern int get_nohz_timer_target(void); > > > +extern void nohz_clean_sd_state(int cpu); > > > #else > > > static inline void nohz_balance_enter_idle(int cpu) { } > > > +static inline void nohz_clean_sd_state(int cpu) { } > > > #endif > > > > > > #ifdef CONFIG_NO_HZ_COMMON > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index b3e25be58e2b..6fcabe5d08f5 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq > > > *rq) > > > { > > > SCHED_WARN_ON(rq != this_rq()); > > > > > > + if (on_null_domain(rq)) > > > + return; > > > + > > > if (likely(!rq->nohz_tick_stopped)) > > > return; > > > > > if we force clearing rq->nohz_tick_stopped when detaching domain, > > why > > bother adding the first check? > > Yes you're right. I added this check for safety, but this is not > mandatory. > > > > > > > > > @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int > > > cpu) > > > rcu_read_unlock(); > > > } > > > > > > +void nohz_clean_sd_state(int cpu) { > > > + struct rq *rq = cpu_rq(cpu); > > > + > > > + rq->nohz_tick_stopped = 0; > > > + if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) { > > > + cpumask_clear_cpu(cpu, nohz.idle_cpus_mask); > > > + atomic_dec(&nohz.nr_cpus); > > > + } > > > + set_cpu_sd_state_idle(cpu); > > > +} > > > + > > > > detach_destroy_domains > > cpu_attach_domain > > update_top_cache_domain > > > > as we clears per_cpu(sd_llc, cpu) for the isolated cpu in > > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op > > here, > > no? 
> > Yes you're right, cpu_attach_domain() and nohz_clean_sd_state() calls > have to be inverted to avoid what you just described. > > It also seems that the current kernel doesn't decrease nr_busy_cpus > when putting CPUs in an isolated partition. Indeed if a CPU is > counted > in nr_busy_cpus, putting the CPU in an isolated partition doesn't > trigger > any call to set_cpu_sd_state_idle(). > So it might an additional argument. > > Thanks for reading the patch, > Regards, > Pierre > > > > > thanks, > > rui > > > /* > > > * This routine will record that the CPU is going idle with > > > tick > > > stopped. > > > * This info will be used in performing idle load balancing in > > > the > > > future. > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > > index d3a3b2646ec4..d31137b5f0ce 100644 > > > --- a/kernel/sched/topology.c > > > +++ b/kernel/sched/topology.c > > > @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const > > > struct cpumask *cpu_map) > > > > > > static_branch_dec_cpuslocked(&sched_asym_cpucapacity); > > > > > > rcu_read_lock(); > > > - for_each_cpu(i, cpu_map) > > > + for_each_cpu(i, cpu_map) { > > > cpu_attach_domain(NULL, &def_root_domain, i); > > > + nohz_clean_sd_state(i); > > > + } > > > rcu_read_unlock(); > > > } > > > > > > --- snip --- > > > > > > Regards, > > > Pierre > > > > > > > > > > > > > > > > > > + } > > > > > > + > > > > > > /* > > > > > > * The tick is still stopped but load could have > > > > > > been > > > > > > added in the > > > > > > * meantime. 
We set the nohz.has_blocked flag to > > > > > > trig > > > > > > a > > > > > > check of the > > > > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int > > > > > > cpu) > > > > > > if (rq->nohz_tick_stopped) > > > > > > goto out; > > > > > > - /* If we're a completely isolated CPU, we don't > > > > > > play: > > > > > > */ > > > > > > - if (on_null_domain(rq)) > > > > > > - return; > > > > > > - > > > > > > rq->nohz_tick_stopped = 1; > > > > > > cpumask_set_cpu(cpu, nohz.idle_cpus_mask); > > > > > > > > > > > > Otherwise I could reproduce the issue and the patch was > > > > > > solving > > > > > > it, > > > > > > so: > > > > > > Tested-by: Pierre Gondois <pierre.gondois@arm.com> > > > > > > > > Thanks for testing, really appreciated! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Also, your patch doesn't aim to solve that, but I think > > > > > > there > > > > > > is an > > > > > > issue > > > > > > when updating cpuset.cpus when an isolated partition was > > > > > > already > > > > > > created: > > > > > > > > > > > > // Create an isolated partition containing CPU0 > > > > > > # mkdir cgroup > > > > > > # mount -t cgroup2 none cgroup/ > > > > > > # mkdir cgroup/Testing > > > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control > > > > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control > > > > > > # echo 0 > cgroup/Testing/cpuset.cpus > > > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition > > > > > > > > > > > > // CPU0's sched domain is detached: > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/ > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/ > > > > > > domain0 domain1 > > > > > > > > > > > > // Change the isolated partition to be CPU1 > > > > > > # echo 1 > cgroup/Testing/cpuset.cpus > > > > > > > > > > > > // CPU[0-1] sched domains are not updated: > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/ > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/ > > > > > > domain0 domain1 > > > > 
> > > > > > Interesting. Let me check and get back to you later on this. :) > > > > > > > > thanks, > > > > rui > >
Hi Rui, On 9/20/23 09:24, Zhang, Rui wrote: > Hi, Pierre, > > Sorry for the late response. I'm still ramping up on the related code. No worries, I also need to read the code, > > On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote: >> >> >> On 9/14/23 11:23, Zhang, Rui wrote: >>> Hi, Pierre, >>> >>>> >>>> Yes right indeed, >>>> This happens when putting a CPU offline (as you mentioned >>>> earlier, >>>> putting a CPU offline clears the CPU in the idle_cpus_mask). >>>> >>>> The load balancing related variables >>> >>> including? >> >> I meant the nohz idle variables in the load balancing, so I was >> referring to: >> (struct sched_domain_shared).nr_busy_cpus >> (struct sched_domain).nohz_idle >> nohz.idle_cpus_mask >> nohz.nr_cpus >> (struct rq).nohz_tick_stopped > > IMO, the problem is that, for an isolated CPU, > 1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared) > 2. it is not a busy cpu (sds->nr_busy_cpus should be decreased) > > But current code does not have a third state to describe this, so we > need to either > 1. add extra logic, like on_null_domain() checks I'm not sure I understand: do you mean adding on_null_domain() in addition to the one in the present patch? > or > 2. rely on current logic, but update all related variables correctly, > like you proposed. > > But in any case, we should stick with one direction. > > If we follow the first one, the original patch should be used, which > IMO is simple and straight forward. > If we follow the later one, we'd better audit and remove the current > on_null_domain() usage at the same time. TBH, I'm not confident enough Here as well, I'm not sure I understand whether you are referring to the on_null_domain() call in the present patch or to the on_null_domain() calls in fair.c? Regards, Pierre > to make such a change. But if you want to propose something, I'd glad > to test it. > > thanks, > rui >
Hi Rui, On Wed, 20 Sept 2023 at 09:24, Zhang, Rui <rui.zhang@intel.com> wrote: > > Hi, Pierre, > > Sorry for the late response. I'm still ramping up on the related code. > > On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote: > > > > > > On 9/14/23 11:23, Zhang, Rui wrote: > > > Hi, Pierre, > > > > > > > > > > > Yes right indeed, > > > > This happens when putting a CPU offline (as you mentioned > > > > earlier, > > > > putting a CPU offline clears the CPU in the idle_cpus_mask). > > > > > > > > The load balancing related variables > > > > > > including? > > > > I meant the nohz idle variables in the load balancing, so I was > > referring to: > > (struct sched_domain_shared).nr_busy_cpus > > (struct sched_domain).nohz_idle > > nohz.idle_cpus_mask > > nohz.nr_cpus > > (struct rq).nohz_tick_stopped > > IMO, the problem is that, for an isolated CPU, > 1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared) > 2. it is not a busy cpu (sds->nr_busy_cpus should be decreased) > > But current code does not have a third state to describe this, so we > need to either > 1. add extra logic, like on_null_domain() checks > or > 2. rely on current logic, but update all related variables correctly, > like you proposed. Isn't the housekeeping cpu mask there to manage such a case? I was expecting that your isolated cpu should be cleared from the housekeeping cpumask used by the scheduler and ILB. I think that your solution is in the comment of the find_new_ilb() function: " * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED is not set * anywhere yet. " IMO, you should look at enabling and using HK_TYPE_SCHED for isolated CPUs. CCed Frederic to get his opinion > > But in any case, we should stick with one direction. > > If we follow the first one, the original patch should be used, which > IMO is simple and straight forward. > If we follow the later one, we'd better audit and remove the current > on_null_domain() usage at the same time. 
TBH, I'm not confident enough > to make such a change. But if you want to propose something, I'd glad > to test it. > > thanks, > rui > > > > > > > > > > are unused if a CPU has a NULL > > > > rq as it cannot pull any task. Ideally we should clear them once, > > > > when attaching a NULL sd to the CPU. > > > > > > This sounds good to me. But TBH, I don't have enough confidence to > > > do > > > so because I'm not crystal clear about how these variables are > > > used. > > > > > > Some questions about the code below. > > > > > > > > The following snipped should do that and solve the issue you > > > > mentioned: > > > > --- snip --- > > > > --- a/include/linux/sched/nohz.h > > > > +++ b/include/linux/sched/nohz.h > > > > @@ -9,8 +9,10 @@ > > > > #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) > > > > extern void nohz_balance_enter_idle(int cpu); > > > > extern int get_nohz_timer_target(void); > > > > +extern void nohz_clean_sd_state(int cpu); > > > > #else > > > > static inline void nohz_balance_enter_idle(int cpu) { } > > > > +static inline void nohz_clean_sd_state(int cpu) { } > > > > #endif > > > > > > > > #ifdef CONFIG_NO_HZ_COMMON > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > index b3e25be58e2b..6fcabe5d08f5 100644 > > > > --- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq > > > > *rq) > > > > { > > > > SCHED_WARN_ON(rq != this_rq()); > > > > > > > > + if (on_null_domain(rq)) > > > > + return; > > > > + > > > > if (likely(!rq->nohz_tick_stopped)) > > > > return; > > > > > > > if we force clearing rq->nohz_tick_stopped when detaching domain, > > > why > > > bother adding the first check? > > > > Yes you're right. I added this check for safety, but this is not > > mandatory. 
> > > > > > > > > > > > > @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int > > > > cpu) > > > > rcu_read_unlock(); > > > > } > > > > > > > > +void nohz_clean_sd_state(int cpu) { > > > > + struct rq *rq = cpu_rq(cpu); > > > > + > > > > + rq->nohz_tick_stopped = 0; > > > > + if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) { > > > > + cpumask_clear_cpu(cpu, nohz.idle_cpus_mask); > > > > + atomic_dec(&nohz.nr_cpus); > > > > + } > > > > + set_cpu_sd_state_idle(cpu); > > > > +} > > > > + > > > > > > detach_destroy_domains > > > cpu_attach_domain > > > update_top_cache_domain > > > > > > as we clears per_cpu(sd_llc, cpu) for the isolated cpu in > > > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op > > > here, > > > no? > > > > Yes you're right, cpu_attach_domain() and nohz_clean_sd_state() calls > > have to be inverted to avoid what you just described. > > > > It also seems that the current kernel doesn't decrease nr_busy_cpus > > when putting CPUs in an isolated partition. Indeed if a CPU is > > counted > > in nr_busy_cpus, putting the CPU in an isolated partition doesn't > > trigger > > any call to set_cpu_sd_state_idle(). > > So it might an additional argument. > > > > Thanks for reading the patch, > > Regards, > > Pierre > > > > > > > > thanks, > > > rui > > > > /* > > > > * This routine will record that the CPU is going idle with > > > > tick > > > > stopped. > > > > * This info will be used in performing idle load balancing in > > > > the > > > > future. 
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > > > index d3a3b2646ec4..d31137b5f0ce 100644 > > > > --- a/kernel/sched/topology.c > > > > +++ b/kernel/sched/topology.c > > > > @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const > > > > struct cpumask *cpu_map) > > > > > > > > static_branch_dec_cpuslocked(&sched_asym_cpucapacity); > > > > > > > > rcu_read_lock(); > > > > - for_each_cpu(i, cpu_map) > > > > + for_each_cpu(i, cpu_map) { > > > > cpu_attach_domain(NULL, &def_root_domain, i); > > > > + nohz_clean_sd_state(i); > > > > + } > > > > rcu_read_unlock(); > > > > } > > > > > > > > --- snip --- > > > > > > > > Regards, > > > > Pierre > > > > > > > > > > > > > > > > > > > > > > + } > > > > > > > + > > > > > > > /* > > > > > > > * The tick is still stopped but load could have > > > > > > > been > > > > > > > added in the > > > > > > > * meantime. We set the nohz.has_blocked flag to > > > > > > > trig > > > > > > > a > > > > > > > check of the > > > > > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int > > > > > > > cpu) > > > > > > > if (rq->nohz_tick_stopped) > > > > > > > goto out; > > > > > > > - /* If we're a completely isolated CPU, we don't > > > > > > > play: > > > > > > > */ > > > > > > > - if (on_null_domain(rq)) > > > > > > > - return; > > > > > > > - > > > > > > > rq->nohz_tick_stopped = 1; > > > > > > > cpumask_set_cpu(cpu, nohz.idle_cpus_mask); > > > > > > > > > > > > > > Otherwise I could reproduce the issue and the patch was > > > > > > > solving > > > > > > > it, > > > > > > > so: > > > > > > > Tested-by: Pierre Gondois <pierre.gondois@arm.com> > > > > > > > > > > Thanks for testing, really appreciated! 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Also, your patch doesn't aim to solve that, but I think > > > > > > > there > > > > > > > is an > > > > > > > issue > > > > > > > when updating cpuset.cpus when an isolated partition was > > > > > > > already > > > > > > > created: > > > > > > > > > > > > > > // Create an isolated partition containing CPU0 > > > > > > > # mkdir cgroup > > > > > > > # mount -t cgroup2 none cgroup/ > > > > > > > # mkdir cgroup/Testing > > > > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control > > > > > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control > > > > > > > # echo 0 > cgroup/Testing/cpuset.cpus > > > > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition > > > > > > > > > > > > > > // CPU0's sched domain is detached: > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/ > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/ > > > > > > > domain0 domain1 > > > > > > > > > > > > > > // Change the isolated partition to be CPU1 > > > > > > > # echo 1 > cgroup/Testing/cpuset.cpus > > > > > > > > > > > > > > // CPU[0-1] sched domains are not updated: > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/ > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/ > > > > > > > domain0 domain1 > > > > > > > > > > > > Interesting. Let me check and get back to you later on this. :) > > > > > > > > > > thanks, > > > > > rui > > > >
Hi, Vincent, Really appreciate your comments on this. On Wed, 2023-11-15 at 21:01 +0100, Vincent Guittot wrote: > Hi Rui, > > On Wed, 20 Sept 2023 at 09:24, Zhang, Rui <rui.zhang@intel.com> > wrote: > > > > Hi, Pierre, > > > > Sorry for the late response. I'm still ramping up on the related > > code. > > > > On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote: > > > > > > > > > On 9/14/23 11:23, Zhang, Rui wrote: > > > > Hi, Pierre, > > > > > > > > > > > > > > Yes right indeed, > > > > > This happens when putting a CPU offline (as you mentioned > > > > > earlier, > > > > > putting a CPU offline clears the CPU in the idle_cpus_mask). > > > > > > > > > > The load balancing related variables > > > > > > > > including? > > > > > > I meant the nohz idle variables in the load balancing, so I was > > > referring to: > > > (struct sched_domain_shared).nr_busy_cpus > > > (struct sched_domain).nohz_idle > > > nohz.idle_cpus_mask > > > nohz.nr_cpus > > > (struct rq).nohz_tick_stopped > > > > IMO, the problem is that, for an isolated CPU, > > 1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared) > > 2. it is not a busy cpu (sds->nr_busy_cpus should be decreased) > > > > But current code does not have a third state to describe this, so > > we > > need to either > > 1. add extra logic, like on_null_domain() checks > > or > > 2. rely on current logic, but update all related variables > > correctly, > > like you proposed. > > Isn't the housekeeping cpu mask there to manage such a case ? This is true for isolated CPUs using boot option "nohz_full=". But for CPUs in the cgroup isolated partition, the housekeeping cpumask is not updated. I don't know if this is intended or not. > I was > expecting that your isolated cpu should be cleared from the > housekeeping cpumask used by scheduler and ILB This patch is a direct fix for the problem I found, where the isolated CPUs are woken up by this piece of code. 
> > I think that your solution is the comment of the ffind_new_ilb() > unction: > " > * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED > is not set > * anywhere yet. > " > > IMO, you should look at enabling and using the HK_TYPE_SCHED for > isolated CPU yeah, this seems reasonable. I'm new to cgroup and I'm not sure what should be the proper behavior for CPUs in isolated partition. > > CCed Frederic to get his opinion Thanks. -rui > > > But in any case, we should stick with one direction. > > > > If we follow the first one, the original patch should be used, > > which > > IMO is simple and straight forward. > > If we follow the later one, we'd better audit and remove the > > current > > on_null_domain() usage at the same time. TBH, I'm not confident > > enough > > to make such a change. But if you want to propose something, I'd > > glad > > to test it. > > > > thanks, > > rui > > > > > > > > > > > > > > are unused if a CPU has a NULL > > > > > rq as it cannot pull any task. Ideally we should clear them > > > > > once, > > > > > when attaching a NULL sd to the CPU. > > > > > > > > This sounds good to me. But TBH, I don't have enough confidence > > > > to > > > > do > > > > so because I'm not crystal clear about how these variables are > > > > used. > > > > > > > > Some questions about the code below. 
> > > >
> > > > > The following snippet should do that and solve the issue you
> > > > > mentioned:
> > > > > --- snip ---
> > > > > --- a/include/linux/sched/nohz.h
> > > > > +++ b/include/linux/sched/nohz.h
> > > > > @@ -9,8 +9,10 @@
> > > > >  #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> > > > >  extern void nohz_balance_enter_idle(int cpu);
> > > > >  extern int get_nohz_timer_target(void);
> > > > > +extern void nohz_clean_sd_state(int cpu);
> > > > >  #else
> > > > >  static inline void nohz_balance_enter_idle(int cpu) { }
> > > > > +static inline void nohz_clean_sd_state(int cpu) { }
> > > > >  #endif
> > > > >
> > > > >  #ifdef CONFIG_NO_HZ_COMMON
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index b3e25be58e2b..6fcabe5d08f5 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
> > > > >  {
> > > > >  	SCHED_WARN_ON(rq != this_rq());
> > > > >
> > > > > +	if (on_null_domain(rq))
> > > > > +		return;
> > > > > +
> > > > >  	if (likely(!rq->nohz_tick_stopped))
> > > > >  		return;
> > > > >
> > > > if we force clearing rq->nohz_tick_stopped when detaching the
> > > > domain, why bother adding the first check?
> > >
> > > Yes you're right. I added this check for safety, but this is not
> > > mandatory.
> > > >
> > > > >
> > > > > @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
> > > > >  	rcu_read_unlock();
> > > > >  }
> > > > >
> > > > > +void nohz_clean_sd_state(int cpu)
> > > > > +{
> > > > > +	struct rq *rq = cpu_rq(cpu);
> > > > > +
> > > > > +	rq->nohz_tick_stopped = 0;
> > > > > +	if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
> > > > > +		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> > > > > +		atomic_dec(&nohz.nr_cpus);
> > > > > +	}
> > > > > +	set_cpu_sd_state_idle(cpu);
> > > > > +}
> > > > > +
> > > >
> > > > detach_destroy_domains
> > > >     cpu_attach_domain
> > > >         update_top_cache_domain
> > > >
> > > > as we clear per_cpu(sd_llc, cpu) for the isolated cpu in
> > > > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a
> > > > no-op here, no?
> > >
> > > Yes you're right, the cpu_attach_domain() and
> > > nohz_clean_sd_state() calls have to be inverted to avoid what you
> > > just described.
> > >
> > > It also seems that the current kernel doesn't decrease
> > > nr_busy_cpus when putting CPUs in an isolated partition. Indeed
> > > if a CPU is counted in nr_busy_cpus, putting the CPU in an
> > > isolated partition doesn't trigger any call to
> > > set_cpu_sd_state_idle().
> > > So it might be an additional argument.
> > >
> > > Thanks for reading the patch,
> > > Regards,
> > > Pierre
> > >
> > > > thanks,
> > > > rui
> > > > >  /*
> > > > >   * This routine will record that the CPU is going idle with
> > > > >   * tick stopped.
> > > > >   * This info will be used in performing idle load balancing
> > > > >   * in the future.
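The ordering bug Rui spots can be made concrete with a userspace sketch (invented names; `sd_llc` here is just a pointer standing in for `per_cpu(sd_llc, cpu)`): once the per-CPU LLC pointer is cleared by the attach of a NULL domain, a later cleanup that walks it has nothing to decrement, so the busy count leaks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 2

struct sd { int *nr_busy; };          /* stands in for the LLC sd     */
static int nr_busy;                   /* models sds->nr_busy_cpus     */
static struct sd llc;
static struct sd *sd_llc[NR_CPUS];    /* models per_cpu(sd_llc, cpu)  */

/* Mirrors set_cpu_sd_state_idle(): without an sd_llc, nothing to do. */
static void set_cpu_sd_state_idle_model(int cpu)
{
	if (sd_llc[cpu] && sd_llc[cpu]->nr_busy)
		(*sd_llc[cpu]->nr_busy)--;
}

/* Wrong order: detach first, then clean -> busy count leaks. */
int demo_wrong_order(void)
{
	llc.nr_busy = &nr_busy;
	nr_busy = 1;
	sd_llc[0] = &llc;

	sd_llc[0] = NULL;               /* cpu_attach_domain(NULL, ...) */
	set_cpu_sd_state_idle_model(0); /* no-op: sd_llc already gone   */
	return nr_busy;                 /* still 1                      */
}

/* Inverted order, as Pierre agrees is needed: clean, then detach. */
int demo_right_order(void)
{
	llc.nr_busy = &nr_busy;
	nr_busy = 1;
	sd_llc[0] = &llc;

	set_cpu_sd_state_idle_model(0); /* decrements while sd is valid */
	sd_llc[0] = NULL;
	return nr_busy;                 /* 0                            */
}
```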
> > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > > index d3a3b2646ec4..d31137b5f0ce 100644
> > > > > --- a/kernel/sched/topology.c
> > > > > +++ b/kernel/sched/topology.c
> > > > > @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
> > > > >  		static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
> > > > >
> > > > >  	rcu_read_lock();
> > > > > -	for_each_cpu(i, cpu_map)
> > > > > +	for_each_cpu(i, cpu_map) {
> > > > >  		cpu_attach_domain(NULL, &def_root_domain, i);
> > > > > +		nohz_clean_sd_state(i);
> > > > > +	}
> > > > >  	rcu_read_unlock();
> > > > >  }
> > > > >
> > > > > --- snip ---
> > > > >
> > > > > Regards,
> > > > > Pierre
> > > > >
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > >  	/*
> > > > > > > >  	 * The tick is still stopped but load could have been
> > > > > > > >  	 * added in the meantime. We set the nohz.has_blocked
> > > > > > > >  	 * flag to trig a check of the
> > > > > > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
> > > > > > > >  	if (rq->nohz_tick_stopped)
> > > > > > > >  		goto out;
> > > > > > > >
> > > > > > > > -	/* If we're a completely isolated CPU, we don't play: */
> > > > > > > > -	if (on_null_domain(rq))
> > > > > > > > -		return;
> > > > > > > > -
> > > > > > > >  	rq->nohz_tick_stopped = 1;
> > > > > > > >  	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > > > > > > >
> > > > > > > > Otherwise I could reproduce the issue and the patch was
> > > > > > > > solving it, so:
> > > > > > > > Tested-by: Pierre Gondois <pierre.gondois@arm.com>
> > > > > >
> > > > > > Thanks for testing, really appreciated!
> > > > > > > > Also, your patch doesn't aim to solve that, but I think
> > > > > > > > there is an issue when updating cpuset.cpus when an
> > > > > > > > isolated partition was already created:
> > > > > > > >
> > > > > > > > // Create an isolated partition containing CPU0
> > > > > > > > # mkdir cgroup
> > > > > > > > # mount -t cgroup2 none cgroup/
> > > > > > > > # mkdir cgroup/Testing
> > > > > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > > > > > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> > > > > > > > # echo 0 > cgroup/Testing/cpuset.cpus
> > > > > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> > > > > > > >
> > > > > > > > // CPU0's sched domain is detached:
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > > domain0  domain1
> > > > > > > >
> > > > > > > > // Change the isolated partition to be CPU1
> > > > > > > > # echo 1 > cgroup/Testing/cpuset.cpus
> > > > > > > >
> > > > > > > > // CPU[0-1] sched domains are not updated:
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > > domain0  domain1
> > > > > > >
> > > > > > > Interesting. Let me check and get back to you later on
> > > > > > > this. :)
> > > > > > >
> > > > > > > thanks,
> > > > > > > rui
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..ea3185a46962 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
 		if (ilb == smp_processor_id())
 			continue;
 
+		if (unlikely(on_null_domain(cpu_rq(ilb))))
+			continue;
+
 		if (idle_cpu(ilb))
 			return ilb;
 	}
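The effect of the one-line check above can also be shown as a userspace model (a sketch; `null_domain[cpu]` stands in for `on_null_domain(cpu_rq(cpu))`, and all other names are invented): a CPU whose sched domain was detached is simply never returned as the ILB target, so it is never kicked.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* null_domain[cpu] models on_null_domain(cpu_rq(cpu)):
 * true for a CPU whose sched domain was detached (isolated). */
static bool null_domain[NR_CPUS];
static bool idle[NR_CPUS]; /* models nohz.idle_cpus_mask */

static int find_new_ilb_patched(int self)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self)
			continue;
		/* the added check: never kick an isolated CPU */
		if (null_domain[cpu])
			continue;
		if (idle[cpu])
			return cpu;
	}
	return -1;
}

int patched_demo(void)
{
	for (int i = 0; i < NR_CPUS; i++) {
		null_domain[i] = false;
		idle[i] = false;
	}
	null_domain[1] = true;  /* CPU1 sits in an isolated partition */
	idle[1] = idle[3] = true;
	return find_new_ilb_patched(0); /* CPU1 skipped, CPU3 chosen */
}
```

This is the per-callsite variant of the fix: minimal and direct, at the cost of one more `on_null_domain()` check in a nohz path, which is exactly the trade-off debated in the thread.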