Message ID | 20230725193048.124796-1-mathieu.desnoyers@efficios.com |
---|---|
State | New |
Series | [RFC,1/1] sched: Extend cpu idle state for 1ms |
Commit Message
Mathieu Desnoyers
July 25, 2023, 7:30 p.m. UTC
Allow select_task_rq to consider a cpu as idle for 1ms after that cpu
has exited the idle loop.
This speeds up the following hackbench workload on a 192-core AMD EPYC
9654 96-Core Processor system (2 sockets):
hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
from 49s to 34s. (30% speedup)
My working hypothesis for why this helps is: queuing more than a single
task on the runqueue of a cpu which just exited idle, rather than
spreading work over other idle cpus, helps power efficiency on systems
with a large number of cores.
This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench.
It turned out that replacing this raw spinlock with a loop of 10000x
cpu_relax() within do_idle() had similar benefits.
This patch achieves a similar effect without the busy-waiting by
introducing a runqueue field which samples sched_clock() when exiting
idle, allowing select_task_rq to consider "as idle" a cpu which has
recently exited idle.
This patch should be considered "food for thought", and I would be glad
to hear feedback on whether it causes regressions on _other_ workloads,
and whether it helps with the hackbench workload on large Intel systems
as well.
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: x86@kernel.org
---
kernel/sched/core.c | 4 ++++
kernel/sched/sched.h | 3 +++
2 files changed, 7 insertions(+)
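For readers skimming the archive, the substance of the change (reassembled here from the diff quoted in the replies below, not an independent implementation) is a new per-runqueue timestamp plus an early-out in idle_cpu():

/* kernel/sched/sched.h: new constant and rq field */
#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */
	/* struct rq gains: */
	u64 idle_end_time;

/* kernel/sched/core.c: schedule_idle() records when this cpu leaves the idle loop */
	WRITE_ONCE(this_rq()->idle_end_time, sched_clock());

/* kernel/sched/core.c: idle_cpu() keeps reporting the cpu as idle for 1ms afterwards */
	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS)
		return 1;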
Comments
On 7/26/23 1:00 AM, Mathieu Desnoyers wrote: > Allow select_task_rq to consider a cpu as idle for 1ms after that cpu > has exited the idle loop. > > This speeds up the following hackbench workload on a 192 cores AMD EPYC > 9654 96-Core Processor (over 2 sockets): > > hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 > > from 49s to 34s. (30% speedup) > > My working hypothesis for why this helps is: queuing more than a single > task on the runqueue of a cpu which just exited idle rather than > spreading work over other idle cpus helps power efficiency on systems > with large number of cores. > > This was developed as part of the investigation into a weird regression > reported by AMD where adding a raw spinlock in the scheduler context > switch accelerated hackbench. > > It turned out that changing this raw spinlock for a loop of 10000x > cpu_relax within do_idle() had similar benefits. > > This patch achieve a similar effect without the busy-waiting by > introducing a runqueue state sampling the sched_clock() when exiting > idle, which allows select_task_rq to consider "as idle" a cpu which has > recently exited idle. > > This patch should be considered "food for thoughts", and I would be glad > to hear feedback on whether it causes regressions on _other_ workloads, > and whether it helps with the hackbench workload on large Intel system > as well. > > Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Valentin Schneider <vschneid@redhat.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Ben Segall <bsegall@google.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: Juri Lelli <juri.lelli@redhat.com> > Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com> > Cc: Aaron Lu <aaron.lu@intel.com> > Cc: x86@kernel.org > --- > kernel/sched/core.c | 4 ++++ > kernel/sched/sched.h | 3 +++ > 2 files changed, 7 insertions(+) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index a68d1276bab0..d40e3a0a5ced 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -6769,6 +6769,7 @@ void __sched schedule_idle(void) > * TASK_RUNNING state. > */ > WARN_ON_ONCE(current->__state); > + WRITE_ONCE(this_rq()->idle_end_time, sched_clock()); > do { > __schedule(SM_NONE); > } while (need_resched()); > @@ -7300,6 +7301,9 @@ int idle_cpu(int cpu) > { > struct rq *rq = cpu_rq(cpu); > > + if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS) Wouldn't this hurt the latency badly? Specially on a loaded system with a workload that does a lot of wakeup. ran schbench on a 50% loaded system with stress-ng. (there could be a better benchmark to measure latency) I see that latency takes a hit. specially tail latencies.full log below with different schbench groups. 
                 6.5-rc3      6.5-rc3+this patch
Groups: 1
  50.0th:        14.0         13.0
  75.0th:        16.0         16.0
  90.0th:        19.5         20.0
  95.0th:        53.0         226.0
  99.0th:        1969.0       2165.0
  99.5th:        2912.0       2648.0
  99.9th:        4680.0       4142.0
Groups: 2
  50.0th:        15.5         15.5
  75.0th:        18.0         19.5
  90.0th:        25.5         497.0
  95.0th:        323.0        1384.0
  99.0th:        2055.0       3144.0
  99.5th:        2972.0       4014.0
  99.9th:        6026.0       6560.0
Groups: 4
  50.0th:        18.0         18.5
  75.0th:        21.5         26.0
  90.0th:        56.0         940.5
  95.0th:        678.0        1896.0
  99.0th:        2484.0       3756.0
  99.5th:        3224.0       4616.0
  99.9th:        4960.0       6824.0
Groups: 8
  50.0th:        23.5         25.5
  75.0th:        30.5         421.5
  90.0th:        443.5        1722.0
  95.0th:        1410.0       2736.0
  99.0th:        3942.0       5496.0
  99.5th:        5232.0       7016.0
  99.9th:        7996.0       8896.0
Groups: 16
  50.0th:        33.5         41.5
  75.0th:        49.0         752.0
  90.0th:        1067.5       2332.0
  95.0th:        2093.0       3468.0
  99.0th:        5048.0       6728.0
  99.5th:        6760.0       7624.0
  99.9th:        8592.0       9504.0
Groups: 32
  50.0th:        60.0         79.0
  75.0th:        456.5        1712.0
  90.0th:        2788.0       3996.0
  95.0th:        4544.0       5768.0
  99.0th:        8444.0       9104.0
  99.5th:        9168.0       9808.0
  99.9th:        11984.0      12448.0

> + return 1;
> +
> if (rq->curr != rq->idle)
> return 0;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 81ac605b9cd5..8932e198a33a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -97,6 +97,8 @@
> # define SCHED_WARN_ON(x) ({ (void)(x), 0; })
> #endif
>
> +#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */
> +
> struct rq;
> struct cpuidle_state;
>
> @@ -1010,6 +1012,7 @@ struct rq {
>
> struct task_struct __rcu *curr;
> struct task_struct *idle;
> + u64 idle_end_time;
> struct task_struct *stop;
> unsigned long next_balance;
> struct mm_struct *prev_mm;
On 7/26/23 04:04, Shrikanth Hegde wrote: > > > On 7/26/23 1:00 AM, Mathieu Desnoyers wrote: >> Allow select_task_rq to consider a cpu as idle for 1ms after that cpu >> has exited the idle loop. >> >> This speeds up the following hackbench workload on a 192 cores AMD EPYC >> 9654 96-Core Processor (over 2 sockets): >> >> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 >> >> from 49s to 34s. (30% speedup) >> >> My working hypothesis for why this helps is: queuing more than a single >> task on the runqueue of a cpu which just exited idle rather than >> spreading work over other idle cpus helps power efficiency on systems >> with large number of cores. >> >> This was developed as part of the investigation into a weird regression >> reported by AMD where adding a raw spinlock in the scheduler context >> switch accelerated hackbench. >> >> It turned out that changing this raw spinlock for a loop of 10000x >> cpu_relax within do_idle() had similar benefits. >> >> This patch achieve a similar effect without the busy-waiting by >> introducing a runqueue state sampling the sched_clock() when exiting >> idle, which allows select_task_rq to consider "as idle" a cpu which has >> recently exited idle. >> >> This patch should be considered "food for thoughts", and I would be glad >> to hear feedback on whether it causes regressions on _other_ workloads, >> and whether it helps with the hackbench workload on large Intel system >> as well. >> >> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >> Cc: Ingo Molnar <mingo@redhat.com> >> Cc: Peter Zijlstra <peterz@infradead.org> >> Cc: Valentin Schneider <vschneid@redhat.com> >> Cc: Steven Rostedt <rostedt@goodmis.org> >> Cc: Ben Segall <bsegall@google.com> >> Cc: Mel Gorman <mgorman@suse.de> >> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> >> Cc: Vincent Guittot <vincent.guittot@linaro.org> >> Cc: Juri Lelli <juri.lelli@redhat.com> >> Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com> >> Cc: Aaron Lu <aaron.lu@intel.com> >> Cc: x86@kernel.org >> --- >> kernel/sched/core.c | 4 ++++ >> kernel/sched/sched.h | 3 +++ >> 2 files changed, 7 insertions(+) >> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index a68d1276bab0..d40e3a0a5ced 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -6769,6 +6769,7 @@ void __sched schedule_idle(void) >> * TASK_RUNNING state. >> */ >> WARN_ON_ONCE(current->__state); >> + WRITE_ONCE(this_rq()->idle_end_time, sched_clock()); >> do { >> __schedule(SM_NONE); >> } while (need_resched()); >> @@ -7300,6 +7301,9 @@ int idle_cpu(int cpu) >> { >> struct rq *rq = cpu_rq(cpu); >> >> + if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS) > > > Wouldn't this hurt the latency badly? Specially on a loaded system with > a workload that does a lot of wakeup. Good point ! Can you try your benchmark replacing the if () statement above by: + if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS && + READ_ONCE(rq->nr_running) <= 4) + return 1; It speeds up the hackbench test-case even more here. It's now 30s, and it should improve tail latency. Thanks, Mathieu > > ran schbench on a 50% loaded system with stress-ng. (there could be a better benchmark to measure latency) > I see that latency takes a hit. specially tail latencies.full log below with different schbench groups. 
> > 6.5-rc3 6.5-rc3+this patch > > Groups: 1 > 50.0th: 14.0 13.0 > 75.0th: 16.0 16.0 > 90.0th: 19.5 20.0 > 95.0th: 53.0 226.0 > 99.0th: 1969.0 2165.0 > 99.5th: 2912.0 2648.0 > 99.9th: 4680.0 4142.0 > > Groups: 2 > 50.0th: 15.5 15.5 > 75.0th: 18.0 19.5 > 90.0th: 25.5 497.0 > 95.0th: 323.0 1384.0 > 99.0th: 2055.0 3144.0 > 99.5th: 2972.0 4014.0 > 99.9th: 6026.0 6560.0 > > Groups: 4 > 50.0th: 18.0 18.5 > 75.0th: 21.5 26.0 > 90.0th: 56.0 940.5 > 95.0th: 678.0 1896.0 > 99.0th: 2484.0 3756.0 > 99.5th: 3224.0 4616.0 > 99.9th: 4960.0 6824.0 > > Groups: 8 > 50.0th: 23.5 25.5 > 75.0th: 30.5 421.5 > 90.0th: 443.5 1722.0 > 95.0th: 1410.0 2736.0 > 99.0th: 3942.0 5496.0 > 99.5th: 5232.0 7016.0 > 99.9th: 7996.0 8896.0 > > Groups: 16 > 50.0th: 33.5 41.5 > 75.0th: 49.0 752.0 > 90.0th: 1067.5 2332.0 > 95.0th: 2093.0 3468.0 > 99.0th: 5048.0 6728.0 > 99.5th: 6760.0 7624.0 > 99.9th: 8592.0 9504.0 > > Groups: 32 > 50.0th: 60.0 79.0 > 75.0th: 456.5 1712.0 > 90.0th: 2788.0 3996.0 > 95.0th: 4544.0 5768.0 > 99.0th: 8444.0 9104.0 > 99.5th: 9168.0 9808.0 > 99.9th: 11984.0 12448.0 > > >> + return 1; >> + >> if (rq->curr != rq->idle) >> return 0; >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index 81ac605b9cd5..8932e198a33a 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -97,6 +97,8 @@ >> # define SCHED_WARN_ON(x) ({ (void)(x), 0; }) >> #endif >> >> +#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */ >> + >> struct rq; >> struct cpuidle_state; >> >> @@ -1010,6 +1012,7 @@ struct rq { >> >> struct task_struct __rcu *curr; >> struct task_struct *idle; >> + u64 idle_end_time; >> struct task_struct *stop; >> unsigned long next_balance; >> struct mm_struct *prev_mm;
On 7/26/23 7:37 PM, Mathieu Desnoyers wrote: > On 7/26/23 04:04, Shrikanth Hegde wrote: >> >> >> On 7/26/23 1:00 AM, Mathieu Desnoyers wrote: >>> Allow select_task_rq to consider a cpu as idle for 1ms after that cpu >>> has exited the idle loop. >>> >>> This speeds up the following hackbench workload on a 192 cores AMD EPYC >>> 9654 96-Core Processor (over 2 sockets): >>> >>> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 >>> >>> from 49s to 34s. (30% speedup) >>> >>> My working hypothesis for why this helps is: queuing more than a single >>> task on the runqueue of a cpu which just exited idle rather than >>> spreading work over other idle cpus helps power efficiency on systems >>> with large number of cores. >>> >>> This was developed as part of the investigation into a weird regression >>> reported by AMD where adding a raw spinlock in the scheduler context >>> switch accelerated hackbench. Do you have SMT here? What is the system utilization when you are running this workload? >>> >>> It turned out that changing this raw spinlock for a loop of 10000x >>> cpu_relax within do_idle() had similar benefits. >>> >>> This patch achieve a similar effect without the busy-waiting by >>> introducing a runqueue state sampling the sched_clock() when exiting >>> idle, which allows select_task_rq to consider "as idle" a cpu which has >>> recently exited idle. >>> >>> This patch should be considered "food for thoughts", and I would be glad >>> to hear feedback on whether it causes regressions on _other_ workloads, >>> and whether it helps with the hackbench workload on large Intel system >>> as well. >>> >>> Link: >>> https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com >>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >>> Cc: Ingo Molnar <mingo@redhat.com> >>> Cc: Peter Zijlstra <peterz@infradead.org> >>> Cc: Valentin Schneider <vschneid@redhat.com> >>> Cc: Steven Rostedt <rostedt@goodmis.org> >>> Cc: Ben Segall <bsegall@google.com> >>> Cc: Mel Gorman <mgorman@suse.de> >>> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> >>> Cc: Vincent Guittot <vincent.guittot@linaro.org> >>> Cc: Juri Lelli <juri.lelli@redhat.com> >>> Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com> >>> Cc: Aaron Lu <aaron.lu@intel.com> >>> Cc: x86@kernel.org >>> --- >>> kernel/sched/core.c | 4 ++++ >>> kernel/sched/sched.h | 3 +++ >>> 2 files changed, 7 insertions(+) >>> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>> index a68d1276bab0..d40e3a0a5ced 100644 >>> --- a/kernel/sched/core.c >>> +++ b/kernel/sched/core.c >>> @@ -6769,6 +6769,7 @@ void __sched schedule_idle(void) >>> * TASK_RUNNING state. >>> */ >>> WARN_ON_ONCE(current->__state); >>> + WRITE_ONCE(this_rq()->idle_end_time, sched_clock()); >>> do { >>> __schedule(SM_NONE); >>> } while (need_resched()); >>> @@ -7300,6 +7301,9 @@ int idle_cpu(int cpu) >>> { >>> struct rq *rq = cpu_rq(cpu); >>> + if (sched_clock() < READ_ONCE(rq->idle_end_time) + >>> IDLE_CPU_DELAY_NS) >> >> >> Wouldn't this hurt the latency badly? Specially on a loaded system with >> a workload that does a lot of wakeup. > > Good point ! > > Can you try your benchmark replacing the if () statement above by: > > + if (sched_clock() < READ_ONCE(rq->idle_end_time) + > IDLE_CPU_DELAY_NS && > + READ_ONCE(rq->nr_running) <= 4) > + return 1; Tried with this change. I think it does help in reducing latency compared to earlier specially till 95th percentile. 
6.5-rc3 6.5-rc3+RFC_Patch 6.5-rc3_RFC_Patch + nr<4 4 Groups 50.0th: 18.00 18.50 18.50 75.0th: 21.50 26.00 23.50 90.0th: 56.00 940.50 501.00 95.0th: 678.00 1896.00 1392.00 99.0th: 2484.00 3756.00 3708.00 99.5th: 3224.00 4616.00 5088.00 99.9th: 4960.00 6824.00 8068.00 8 Groups 50.0th: 23.50 25.50 23.00 75.0th: 30.50 421.50 30.50 90.0th: 443.50 1722.00 741.00 95.0th: 1410.00 2736.00 1670.00 99.0th: 3942.00 5496.00 4032.00 99.5th: 5232.00 7016.00 5064.00 99.9th: 7996.00 8896.00 8012.00 16 Groups 50.0th: 33.50 41.50 32.50 75.0th: 49.00 752.00 47.00 90.0th: 1067.50 2332.00 994.50 95.0th: 2093.00 3468.00 2117.00 99.0th: 5048.00 6728.00 5568.00 99.5th: 6760.00 7624.00 6960.00 99.9th: 8592.00 9504.00 11104.00 32 Groups 50.0th: 60.00 79.00 53.00 75.0th: 456.50 1712.00 209.50 90.0th: 2788.00 3996.00 2752.00 95.0th: 4544.00 5768.00 5024.00 99.0th: 8444.00 9104.00 10352.00 99.5th: 9168.00 9808.00 12720.00 99.9th: 11984.00 12448.00 17624.00 > > It speeds up the hackbench test-case even more here. It's now 30s, and > it should > improve tail latency. > > Thanks, > > Mathieu > > >> >> ran schbench on a 50% loaded system with stress-ng. (there could be a >> better benchmark to measure latency) >> I see that latency takes a hit. specially tail latencies.full log >> below with different schbench groups. >> >> 6.5-rc3 6.5-rc3+this patch >> >> Groups: 1 >> 50.0th: 14.0 13.0 >> 75.0th: 16.0 16.0 >> 90.0th: 19.5 20.0 >> 95.0th: 53.0 226.0 >> 99.0th: 1969.0 2165.0 >> 99.5th: 2912.0 2648.0 >> 99.9th: 4680.0 4142.0 >> >> Groups: 2 >> 50.0th: 15.5 15.5 >> 75.0th: 18.0 19.5 >> 90.0th: 25.5 497.0 >> 95.0th: 323.0 1384.0 >> 99.0th: 2055.0 3144.0 >> 99.5th: 2972.0 4014.0 >> 99.9th: 6026.0 6560.0 >> >> Groups: 4 >> 50.0th: 18.0 18.5 >> 75.0th: 21.5 26.0 >> 90.0th: 56.0 940.5 >> 95.0th: 678.0 1896.0 >> 99.0th: 2484.0 3756.0 >> 99.5th: 3224.0 4616.0 >> 99.9th: 4960.0 6824.0 >> >> Groups: 8 >> 50.0th: 23.5 25.5 >> 75.0th: 30.5 421.5 >> 90.0th: 443.5 1722.0 >> 95.0th: 1410.0 2736.0 >> 99.0th: 3942.0 5496.0 >> 99.5th: 5232.0 7016.0 >> 99.9th: 7996.0 8896.0 >> >> Groups: 16 >> 50.0th: 33.5 41.5 >> 75.0th: 49.0 752.0 >> 90.0th: 1067.5 2332.0 >> 95.0th: 2093.0 3468.0 >> 99.0th: 5048.0 6728.0 >> 99.5th: 6760.0 7624.0 >> 99.9th: 8592.0 9504.0 >> >> Groups: 32 >> 50.0th: 60.0 79.0 >> 75.0th: 456.5 1712.0 >> 90.0th: 2788.0 3996.0 >> 95.0th: 4544.0 5768.0 >> 99.0th: 8444.0 9104.0 >> 99.5th: 9168.0 9808.0 >> 99.9th: 11984.0 12448.0 >> >> >>> + return 1; >>> + >>> if (rq->curr != rq->idle) >>> return 0; >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index 81ac605b9cd5..8932e198a33a 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -97,6 +97,8 @@ >>> # define SCHED_WARN_ON(x) ({ (void)(x), 0; }) >>> #endif >>> +#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */ >>> + >>> struct rq; >>> struct cpuidle_state; >>> @@ -1010,6 +1012,7 @@ struct rq { >>> struct task_struct __rcu *curr; >>> struct task_struct *idle; >>> + u64 idle_end_time; There is clock_idle already in the rq. Can that be used for the same? >>> struct task_struct *stop; >>> unsigned long next_balance; >>> struct mm_struct *prev_mm; >
On 7/26/23 13:40, Shrikanth Hegde wrote: > > > On 7/26/23 7:37 PM, Mathieu Desnoyers wrote: >> On 7/26/23 04:04, Shrikanth Hegde wrote: >>> >>> >>> On 7/26/23 1:00 AM, Mathieu Desnoyers wrote: >>>> Allow select_task_rq to consider a cpu as idle for 1ms after that cpu >>>> has exited the idle loop. >>>> >>>> This speeds up the following hackbench workload on a 192 cores AMD EPYC >>>> 9654 96-Core Processor (over 2 sockets): >>>> >>>> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 >>>> >>>> from 49s to 34s. (30% speedup) >>>> >>>> My working hypothesis for why this helps is: queuing more than a single >>>> task on the runqueue of a cpu which just exited idle rather than >>>> spreading work over other idle cpus helps power efficiency on systems >>>> with large number of cores. >>>> >>>> This was developed as part of the investigation into a weird regression >>>> reported by AMD where adding a raw spinlock in the scheduler context >>>> switch accelerated hackbench. > > Do you have SMT here? What is the system utilization when you are running > this workload? Yes, SMT is enabled, which brings the number of logical cpus to 384. CPU utilization (through htop): * 6.4.4: 27500% * 6.4.4 with the extend-idle+nr_running<=4 patch: 30500% > >>>> >>>> It turned out that changing this raw spinlock for a loop of 10000x >>>> cpu_relax within do_idle() had similar benefits. >>>> >>>> This patch achieve a similar effect without the busy-waiting by >>>> introducing a runqueue state sampling the sched_clock() when exiting >>>> idle, which allows select_task_rq to consider "as idle" a cpu which has >>>> recently exited idle. >>>> >>>> This patch should be considered "food for thoughts", and I would be glad >>>> to hear feedback on whether it causes regressions on _other_ workloads, >>>> and whether it helps with the hackbench workload on large Intel system >>>> as well. >>>> >>>> Link: >>>> https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com >>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >>>> Cc: Ingo Molnar <mingo@redhat.com> >>>> Cc: Peter Zijlstra <peterz@infradead.org> >>>> Cc: Valentin Schneider <vschneid@redhat.com> >>>> Cc: Steven Rostedt <rostedt@goodmis.org> >>>> Cc: Ben Segall <bsegall@google.com> >>>> Cc: Mel Gorman <mgorman@suse.de> >>>> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> >>>> Cc: Vincent Guittot <vincent.guittot@linaro.org> >>>> Cc: Juri Lelli <juri.lelli@redhat.com> >>>> Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com> >>>> Cc: Aaron Lu <aaron.lu@intel.com> >>>> Cc: x86@kernel.org >>>> --- >>>> kernel/sched/core.c | 4 ++++ >>>> kernel/sched/sched.h | 3 +++ >>>> 2 files changed, 7 insertions(+) >>>> >>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>>> index a68d1276bab0..d40e3a0a5ced 100644 >>>> --- a/kernel/sched/core.c >>>> +++ b/kernel/sched/core.c >>>> @@ -6769,6 +6769,7 @@ void __sched schedule_idle(void) >>>> * TASK_RUNNING state. >>>> */ >>>> WARN_ON_ONCE(current->__state); >>>> + WRITE_ONCE(this_rq()->idle_end_time, sched_clock()); >>>> do { >>>> __schedule(SM_NONE); >>>> } while (need_resched()); >>>> @@ -7300,6 +7301,9 @@ int idle_cpu(int cpu) >>>> { >>>> struct rq *rq = cpu_rq(cpu); >>>> + if (sched_clock() < READ_ONCE(rq->idle_end_time) + >>>> IDLE_CPU_DELAY_NS) >>> >>> >>> Wouldn't this hurt the latency badly? Specially on a loaded system with >>> a workload that does a lot of wakeup. >> >> Good point ! 
>> >> Can you try your benchmark replacing the if () statement above by: >> >> + if (sched_clock() < READ_ONCE(rq->idle_end_time) + >> IDLE_CPU_DELAY_NS && >> + READ_ONCE(rq->nr_running) <= 4) >> + return 1; > > > Tried with this change. I think it does help in reducing latency compared to > earlier specially till 95th percentile. For the records, I also tried with nr_running <= 2 and still had decent performance (32s with nr_running <= 2 instead of 30s for nr_running <= 4). It did drop with nr_running <= 1 (40s). nr_running <= 5 was similar to 4, and performances start degrading with nr_running <= 8 (31s). So it might be interesting to measure the latency with nr_running <= 2 as well. Perhaps nr_running <= 2 would be a good compromise between throughput and tail latency. > > 6.5-rc3 6.5-rc3+RFC_Patch 6.5-rc3_RFC_Patch > + nr<4 > 4 Groups > 50.0th: 18.00 18.50 18.50 > 75.0th: 21.50 26.00 23.50 > 90.0th: 56.00 940.50 501.00 > 95.0th: 678.00 1896.00 1392.00 > 99.0th: 2484.00 3756.00 3708.00 > 99.5th: 3224.00 4616.00 5088.00 > 99.9th: 4960.00 6824.00 8068.00 > 8 Groups > 50.0th: 23.50 25.50 23.00 > 75.0th: 30.50 421.50 30.50 > 90.0th: 443.50 1722.00 741.00 > 95.0th: 1410.00 2736.00 1670.00 > 99.0th: 3942.00 5496.00 4032.00 > 99.5th: 5232.00 7016.00 5064.00 > 99.9th: 7996.00 8896.00 8012.00 > 16 Groups > 50.0th: 33.50 41.50 32.50 > 75.0th: 49.00 752.00 47.00 > 90.0th: 1067.50 2332.00 994.50 > 95.0th: 2093.00 3468.00 2117.00 > 99.0th: 5048.00 6728.00 5568.00 > 99.5th: 6760.00 7624.00 6960.00 > 99.9th: 8592.00 9504.00 11104.00 > 32 Groups > 50.0th: 60.00 79.00 53.00 > 75.0th: 456.50 1712.00 209.50 > 90.0th: 2788.00 3996.00 2752.00 > 95.0th: 4544.00 5768.00 5024.00 > 99.0th: 8444.00 9104.00 10352.00 > 99.5th: 9168.00 9808.00 12720.00 > 99.9th: 11984.00 12448.00 17624.00 [...] >>>> @@ -1010,6 +1012,7 @@ struct rq { >>>> struct task_struct __rcu *curr; >>>> struct task_struct *idle; >>>> + u64 idle_end_time; > > There is clock_idle already in the rq. Can that be used for the same? Good point! And I'll change my use of "sched_clock()" in idle_cpu() for a proper "sched_clock_cpu(cpu_of(rq))", which will work better on systems without constant tsc. The updated patch: diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a68d1276bab0..1c7d5bd2968b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7300,6 +7300,10 @@ int idle_cpu(int cpu) { struct rq *rq = cpu_rq(cpu); + if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING && + sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS) + return 1; + if (rq->curr != rq->idle) return 0; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 81ac605b9cd5..57a49a5524f0 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -97,6 +97,9 @@ # define SCHED_WARN_ON(x) ({ (void)(x), 0; }) #endif +#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */ +#define IDLE_CPU_DELAY_MAX_RUNNING 4 + struct rq; struct cpuidle_state; And using it now brings the hackbench wall time at 28s :) Thanks, Mathieu > >>>> struct task_struct *stop; >>>> unsigned long next_balance; >>>> struct mm_struct *prev_mm; >>
On 7/26/23 14:56, Mathieu Desnoyers wrote: > On 7/26/23 13:40, Shrikanth Hegde wrote: [...] >> Do you have SMT here? What is the system utilization when you are running >> this workload? > > Yes, SMT is enabled, which brings the number of logical cpus to 384. Here is an additional interesting data point with nosmt=force on 6.4.4: hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 baseline: 90s with idle-delay+nr_running<=4: 87s (3% speedup) hackbench -g 16 -f 20 --threads --pipe -l 480000 -s 100 baseline: 52s with idle-delay+nr_running<=4: 32s (38% speedup) So the impact of the patch appears to depend on how much the system actually reaches idle, which does make sense. Thanks, Mathieu
On 2023-07-26 at 10:07:30 -0400, Mathieu Desnoyers wrote: > On 7/26/23 04:04, Shrikanth Hegde wrote: > > > > > > On 7/26/23 1:00 AM, Mathieu Desnoyers wrote: > > > Allow select_task_rq to consider a cpu as idle for 1ms after that cpu > > > has exited the idle loop. > > > > > > This speeds up the following hackbench workload on a 192 cores AMD EPYC > > > 9654 96-Core Processor (over 2 sockets): > > > > > > hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 > > > > > > from 49s to 34s. (30% speedup) > > > > > > My working hypothesis for why this helps is: queuing more than a single > > > task on the runqueue of a cpu which just exited idle rather than > > > spreading work over other idle cpus helps power efficiency on systems > > > with large number of cores. > > > This looks interesting. And it does help power efficiency but how it could improve throughput? Is it because of hot cache locality waking up task on it previous running CPU(because it will be easier to be treated as idle), or just reducing the time in select_idle_sibling()? > Good point ! > > Can you try your benchmark replacing the if () statement above by: > > + if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS && > + READ_ONCE(rq->nr_running) <= 4) If I understand correctly, this nr_running is to filter the case that the system is saturated? If that is the case, maybe rq->avg_idle >= sysctl_sched_migration_cost could be checked in case there is 1 long running task and we don't want to treat this cpu as 'idle'? thanks, Chenyu
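A minimal sketch of how Chen Yu's avg_idle suggestion could be combined with the updated (clock_idle based) condition quoted earlier in the thread; this is my illustration only, not code posted by any participant:

	if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING &&
	    READ_ONCE(rq->avg_idle) >= sysctl_sched_migration_cost &&
	    sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS)
		return 1;

The extra avg_idle test would keep a cpu dominated by one long-running task, which therefore rarely stays idle for long, from being reported as idle merely because it briefly passed through the idle loop in the last millisecond.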
On Wed, Jul 26, 2023 at 02:56:19PM -0400, Mathieu Desnoyers wrote: ... ... > The updated patch: > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index a68d1276bab0..1c7d5bd2968b 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -7300,6 +7300,10 @@ int idle_cpu(int cpu) > { > struct rq *rq = cpu_rq(cpu); > + if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING && > + sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS) > + return 1; > + > if (rq->curr != rq->idle) > return 0; > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 81ac605b9cd5..57a49a5524f0 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -97,6 +97,9 @@ > # define SCHED_WARN_ON(x) ({ (void)(x), 0; }) > #endif > +#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */ > +#define IDLE_CPU_DELAY_MAX_RUNNING 4 > + > struct rq; > struct cpuidle_state; > I gave this patch a run on Intel SPR(2 sockets/112cores/224cpus) and I also noticed huge improvement when running hackbench, especially for group=32/fds=20 case: when group=10/fds=20(400 tasks): time wakeups/migration tg->load_avg% base: 43s 27874246/13953871 25% this patch: 32s 33200766/244457 2% my patch: 37s 29186608/16307254 2% when group=20/fds=20(800 tasks): time wakeups/migrations tg->load_avg% base: 65s 27108751/16238701 27% this patch: 45s 35718552/1691220 3% my patch: 48s 37506974/24797284 2% when group=32/fds=20(1280 tasks): time wakeups/migrations tg->load_avg% base: 150s 36902527/16423914 36% this patch: 57s 30536830/6035346 6% my patch: 73s 45264605/21595791 3% One thing I noticed is, after this patch, the migration on wakeup path has dramatically reduced(see above wakeups/migrations, the number were captured for 5s during the run). I think this makes sense because now a cpu is more likely to be considered idle so a wakeup task will more likely stay on its prev_cpu. And when migrations is reduced, the cost of accessing tg->load_avg is also reduced(tg->load_avg% is the sum of update_cfs_group()% + update_load_avg()% as reported by perf). I think this is part of the reason why performance improved on this machine. Since I've been working on reducing the cost of accessing tg->load_avg[1], I also gave my patch a run. According to the result, even when the cost of accessing tg->load_avg is smaller for my patch, Mathieu's patch is still faster. It's not clear to me why, maybe it has something to do with cache reuse since my patch doesn't inhibit migration? I suppose ipc could reflect this? [1]: https://lore.kernel.org/lkml/20230718134120.81199-1-aaron.lu@intel.com/ Thanks, Aaron
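Aaron does not say how the wakeup/migration counts above were captured; one way to sample similar numbers system-wide for a 5s window during a run (an assumption on my part, not necessarily his method) is to count the scheduler tracepoints:

	perf stat -a -e sched:sched_wakeup -e sched:sched_migrate_task -- sleep 5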
On Thu, Jul 27, 2023 at 01:04:13PM +0800, Chen Yu wrote: > On 2023-07-26 at 10:07:30 -0400, Mathieu Desnoyers wrote: > > On 7/26/23 04:04, Shrikanth Hegde wrote: > > > > > > > > > On 7/26/23 1:00 AM, Mathieu Desnoyers wrote: > > > > Allow select_task_rq to consider a cpu as idle for 1ms after that cpu > > > > has exited the idle loop. > > > > > > > > This speeds up the following hackbench workload on a 192 cores AMD EPYC > > > > 9654 96-Core Processor (over 2 sockets): > > > > > > > > hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 > > > > > > > > from 49s to 34s. (30% speedup) > > > > > > > > My working hypothesis for why this helps is: queuing more than a single > > > > task on the runqueue of a cpu which just exited idle rather than > > > > spreading work over other idle cpus helps power efficiency on systems > > > > with large number of cores. > > > > > > This looks interesting. And it does help power efficiency but how it could > improve throughput? Is it because of hot cache locality waking up task on > it previous running CPU(because it will be easier to be treated as idle), > or just reducing the time in select_idle_sibling()? > According to my tests on Intel SPR, part of the reason is reduced migration cost due to tg->load_avg. I think it has similar effect as your previous patch(wake up short task on current CPU): https://lore.kernel.org/lkml/cover.1682661027.git.yu.c.chen@intel.com/ Both can reduce task migration somehow. Thanks, Aaron > > Good point ! > > > > Can you try your benchmark replacing the if () statement above by: > > > > + if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS && > > + READ_ONCE(rq->nr_running) <= 4) > > If I understand correctly, this nr_running is to filter the case that the system > is saturated? If that is the case, maybe > rq->avg_idle >= sysctl_sched_migration_cost > could be checked in case there is 1 long running task and we don't want to treat this > cpu as 'idle'? > > thanks, > Chenyu
On 2023-08-01 at 15:24:03 +0800, Aaron Lu wrote: > On Wed, Jul 26, 2023 at 02:56:19PM -0400, Mathieu Desnoyers wrote: > > ... ... > > > The updated patch: > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index a68d1276bab0..1c7d5bd2968b 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -7300,6 +7300,10 @@ int idle_cpu(int cpu) > > { > > struct rq *rq = cpu_rq(cpu); > > + if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING && > > + sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS) > > + return 1; > > + > > if (rq->curr != rq->idle) > > return 0; > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index 81ac605b9cd5..57a49a5524f0 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -97,6 +97,9 @@ > > # define SCHED_WARN_ON(x) ({ (void)(x), 0; }) > > #endif > > +#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */ > > +#define IDLE_CPU_DELAY_MAX_RUNNING 4 > > + > > struct rq; > > struct cpuidle_state; > > > > I gave this patch a run on Intel SPR(2 sockets/112cores/224cpus) and I > also noticed huge improvement when running hackbench, especially for > group=32/fds=20 case: > > when group=10/fds=20(400 tasks): > time wakeups/migration tg->load_avg% > base: 43s 27874246/13953871 25% > this patch: 32s 33200766/244457 2% > my patch: 37s 29186608/16307254 2% > > when group=20/fds=20(800 tasks): > time wakeups/migrations tg->load_avg% > base: 65s 27108751/16238701 27% > this patch: 45s 35718552/1691220 3% > my patch: 48s 37506974/24797284 2% > > when group=32/fds=20(1280 tasks): > time wakeups/migrations tg->load_avg% > base: 150s 36902527/16423914 36% > this patch: 57s 30536830/6035346 6% > my patch: 73s 45264605/21595791 3% > > One thing I noticed is, after this patch, the migration on wakeup path > has dramatically reduced(see above wakeups/migrations, the number were > captured for 5s during the run). I think this makes sense because now a > cpu is more likely to be considered idle so a wakeup task will more > likely stay on its prev_cpu. And when migrations is reduced, the cost of > accessing tg->load_avg is also reduced(tg->load_avg% is the sum of > update_cfs_group()% + update_load_avg()% as reported by perf). I think > this is part of the reason why performance improved on this machine. > > Since I've been working on reducing the cost of accessing tg->load_avg[1], > I also gave my patch a run. According to the result, even when the cost > of accessing tg->load_avg is smaller for my patch, Mathieu's patch is > still faster. It's not clear to me why, maybe it has something to do > with cache reuse since my patch doesn't inhibit migration? I suppose ipc > could reflect this? > Yeah, probably. I'm thinking that, since in hackbench the sender send the data to the receiver, and the receiver reads it, if the cache is still hot for the wakee(receiver) during every wakeup, it could improve performance. Maybe increase the default transfer data size 100 bytes could evict the L1/L2 more offen for each data send/receive to figure out if it is related to cache locallity. thanks, Chenyu > [1]: https://lore.kernel.org/lkml/20230718134120.81199-1-aaron.lu@intel.com/ > > Thanks, > Aaron
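The transfer size Chen Yu refers to is the -s argument of the hackbench invocation used throughout this thread, so the cache-locality hypothesis could be probed with a larger message size, for example (value chosen here only for illustration):

	hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 10000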
Hello Mathieu, On 7/27/2023 12:26 AM, Mathieu Desnoyers wrote: > On 7/26/23 13:40, Shrikanth Hegde wrote: >> >> >> On 7/26/23 7:37 PM, Mathieu Desnoyers wrote: >>> On 7/26/23 04:04, Shrikanth Hegde wrote: >>>> >>>> >>>> On 7/26/23 1:00 AM, Mathieu Desnoyers wrote: >>>>> Allow select_task_rq to consider a cpu as idle for 1ms after that cpu >>>>> has exited the idle loop. >>>>> >>>>> This speeds up the following hackbench workload on a 192 cores AMD EPYC >>>>> 9654 96-Core Processor (over 2 sockets): >>>>> >>>>> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 >>>>> >>>>> from 49s to 34s. (30% speedup) >>>>> >>>>> My working hypothesis for why this helps is: queuing more than a single >>>>> task on the runqueue of a cpu which just exited idle rather than >>>>> spreading work over other idle cpus helps power efficiency on systems >>>>> with large number of cores. >>>>> >>>>> This was developed as part of the investigation into a weird regression >>>>> reported by AMD where adding a raw spinlock in the scheduler context >>>>> switch accelerated hackbench. >> >> Do you have SMT here? What is the system utilization when you are running >> this workload? > > Yes, SMT is enabled, which brings the number of logical cpus to 384. > > CPU utilization (through htop): > > * 6.4.4: 27500% > * 6.4.4 with the extend-idle+nr_running<=4 patch: 30500% > >> >>>>> >>>>> It turned out that changing this raw spinlock for a loop of 10000x >>>>> cpu_relax within do_idle() had similar benefits. >>>>> >>>>> This patch achieve a similar effect without the busy-waiting by >>>>> introducing a runqueue state sampling the sched_clock() when exiting >>>>> idle, which allows select_task_rq to consider "as idle" a cpu which has >>>>> recently exited idle. >>>>> >>>>> This patch should be considered "food for thoughts", and I would be glad >>>>> to hear feedback on whether it causes regressions on _other_ workloads, >>>>> and whether it helps with the hackbench workload on large Intel system >>>>> as well. >>>>> >>>>> Link: >>>>> https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com >>>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >>>>> Cc: Ingo Molnar <mingo@redhat.com> >>>>> Cc: Peter Zijlstra <peterz@infradead.org> >>>>> Cc: Valentin Schneider <vschneid@redhat.com> >>>>> Cc: Steven Rostedt <rostedt@goodmis.org> >>>>> Cc: Ben Segall <bsegall@google.com> >>>>> Cc: Mel Gorman <mgorman@suse.de> >>>>> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> >>>>> Cc: Vincent Guittot <vincent.guittot@linaro.org> >>>>> Cc: Juri Lelli <juri.lelli@redhat.com> >>>>> Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com> >>>>> Cc: Aaron Lu <aaron.lu@intel.com> >>>>> Cc: x86@kernel.org >>>>> --- >>>>> kernel/sched/core.c | 4 ++++ >>>>> kernel/sched/sched.h | 3 +++ >>>>> 2 files changed, 7 insertions(+) >>>>> >>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>>>> index a68d1276bab0..d40e3a0a5ced 100644 >>>>> --- a/kernel/sched/core.c >>>>> +++ b/kernel/sched/core.c >>>>> @@ -6769,6 +6769,7 @@ void __sched schedule_idle(void) >>>>> * TASK_RUNNING state. >>>>> */ >>>>> WARN_ON_ONCE(current->__state); >>>>> + WRITE_ONCE(this_rq()->idle_end_time, sched_clock()); >>>>> do { >>>>> __schedule(SM_NONE); >>>>> } while (need_resched()); >>>>> @@ -7300,6 +7301,9 @@ int idle_cpu(int cpu) >>>>> { >>>>> struct rq *rq = cpu_rq(cpu); >>>>> + if (sched_clock() < READ_ONCE(rq->idle_end_time) + >>>>> IDLE_CPU_DELAY_NS) >>>> >>>> >>>> Wouldn't this hurt the latency badly? 
Specially on a loaded system with >>>> a workload that does a lot of wakeup. >>> >>> Good point ! >>> >>> Can you try your benchmark replacing the if () statement above by: >>> >>> + if (sched_clock() < READ_ONCE(rq->idle_end_time) + >>> IDLE_CPU_DELAY_NS && >>> + READ_ONCE(rq->nr_running) <= 4) >>> + return 1; >> >> >> Tried with this change. I think it does help in reducing latency compared to >> earlier specially till 95th percentile. > > For the records, I also tried with nr_running <= 2 and still had decent performance > (32s with nr_running <= 2 instead of 30s for nr_running <= 4). It did drop with > nr_running <= 1 (40s). nr_running <= 5 was similar to 4, and performances start > degrading with nr_running <= 8 (31s). > > So it might be interesting to measure the latency with nr_running <= 2 as well. > Perhaps nr_running <= 2 would be a good compromise between throughput and tail > latency. > >> 6.5-rc3 6.5-rc3+RFC_Patch 6.5-rc3_RFC_Patch >> + nr<4 >> 4 Groups >> 50.0th: 18.00 18.50 18.50 >> 75.0th: 21.50 26.00 23.50 >> 90.0th: 56.00 940.50 501.00 >> 95.0th: 678.00 1896.00 1392.00 >> 99.0th: 2484.00 3756.00 3708.00 >> 99.5th: 3224.00 4616.00 5088.00 >> 99.9th: 4960.00 6824.00 8068.00 >> 8 Groups >> 50.0th: 23.50 25.50 23.00 >> 75.0th: 30.50 421.50 30.50 >> 90.0th: 443.50 1722.00 741.00 >> 95.0th: 1410.00 2736.00 1670.00 >> 99.0th: 3942.00 5496.00 4032.00 >> 99.5th: 5232.00 7016.00 5064.00 >> 99.9th: 7996.00 8896.00 8012.00 >> 16 Groups >> 50.0th: 33.50 41.50 32.50 >> 75.0th: 49.00 752.00 47.00 >> 90.0th: 1067.50 2332.00 994.50 >> 95.0th: 2093.00 3468.00 2117.00 >> 99.0th: 5048.00 6728.00 5568.00 >> 99.5th: 6760.00 7624.00 6960.00 >> 99.9th: 8592.00 9504.00 11104.00 >> 32 Groups >> 50.0th: 60.00 79.00 53.00 >> 75.0th: 456.50 1712.00 209.50 >> 90.0th: 2788.00 3996.00 2752.00 >> 95.0th: 4544.00 5768.00 5024.00 >> 99.0th: 8444.00 9104.00 10352.00 >> 99.5th: 9168.00 9808.00 12720.00 >> 99.9th: 11984.00 12448.00 17624.00 > > [...] > >>>>> @@ -1010,6 +1012,7 @@ struct rq { >>>>> struct task_struct __rcu *curr; >>>>> struct task_struct *idle; >>>>> + u64 idle_end_time; >> >> There is clock_idle already in the rq. Can that be used for the same? > > Good point! And I'll change my use of "sched_clock()" in idle_cpu() for a > proper "sched_clock_cpu(cpu_of(rq))", which will work better on systems > without constant tsc. > > The updated patch: > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index a68d1276bab0..1c7d5bd2968b 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -7300,6 +7300,10 @@ int idle_cpu(int cpu) > { > struct rq *rq = cpu_rq(cpu); > > + if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING && > + sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS) > + return 1; > + > if (rq->curr != rq->idle) > return 0; > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 81ac605b9cd5..57a49a5524f0 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -97,6 +97,9 @@ > # define SCHED_WARN_ON(x) ({ (void)(x), 0; }) > #endif > > +#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */ > +#define IDLE_CPU_DELAY_MAX_RUNNING 4 > + > struct rq; > struct cpuidle_state; I have run some standard micro-benchmarks to test this patch on a 2 Socket Bergamo System which has 256C/512T (2 X 128C/Socket). The following are the results: base : 6.5.0-rc4 base + extend-idle : base + Original patch which has change to to extend idle state by 1ms. 
base + extend-idle + nr_running : base + updated patch which contains both extend idle and nr_running limit. ========================= hackbench ========================= Test: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4 1-groups: 19.95 (0.00 pct) 15.92 (20.20 pct) 15.30 (23.30 pct) 2-groups: 21.33 (0.00 pct) 23.00 (-7.82 pct) 19.70 (7.64 pct) 4-groups: 22.57 (0.00 pct) 30.02 (-33.00 pct) 26.74 (-18.47 pct) 8-groups: 24.68 (0.00 pct) 32.54 (-31.84 pct) 28.27 (-14.54 pct) 16-groups: 31.20 (0.00 pct) 32.47 (-4.07 pct) 27.70 (11.21 pct) Observation: Hackbench shows improvement with lower and higher number of groups still it shows dip in performance for middle order. ========================= new_schbench ========================= Metric: wakeup_lat_summary #workers: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4 1: 30.00 (0.00 pct) 33.00 (-10.00 pct) 33.00 (-10.00 pct) 2: 26.00 (0.00 pct) 32.00 (-23.07 pct) 33.00 (-26.92 pct) 4: 26.00 (0.00 pct) 32.00 (-23.07 pct) 32.00 (-23.07 pct) 8: 9.00 (0.00 pct) 10.00 (-11.11 pct) 9.00 (0.00 pct) 16: 8.00 (0.00 pct) 10.00 (-25.00 pct) 9.00 (-12.50 pct) 32: 8.00 (0.00 pct) 9.00 (-12.50 pct) 10.00 (-25.00 pct) 64: 8.00 (0.00 pct) 10.00 (-25.00 pct) 10.00 (-25.00 pct) 128: 8.00 (0.00 pct) 11.00 (-37.50 pct) 12.00 (-50.00 pct) 256: 102.00 (0.00 pct) 48.00 (52.94 pct) 50.00 (50.98 pct) 512: 20704.00 (0.00 pct) 23200.00 (-12.05 pct) 23200.00 (-12.05 pct) Metric: request_lat_summary #workers: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4 1: 6712.00 (0.00 pct) 6744.00 (-0.47 pct) 6760.00 (-0.71 pct) 2: 6792.00 (0.00 pct) 6840.00 (-0.70 pct) 6872.00 (-1.17 pct) 4: 6792.00 (0.00 pct) 6840.00 (-0.70 pct) 6856.00 (-0.94 pct) 8: 6776.00 (0.00 pct) 6824.00 (-0.70 pct) 6872.00 (-1.41 pct) 16: 6760.00 (0.00 pct) 6792.00 (-0.47 pct) 6872.00 (-1.65 pct) 32: 6808.00 (0.00 pct) 6776.00 (0.47 pct) 6872.00 (-0.94 pct) 64: 6808.00 (0.00 pct) 6776.00 (0.47 pct) 6872.00 (-0.94 pct) 128: 12208.00 (0.00 pct) 11856.00 (2.88 pct) 12784.00 (-4.71 pct) 256: 13264.00 (0.00 pct) 13296.00 (-0.24 pct) 13680.00 (-3.13 pct) 512: 84096.00 (0.00 pct) 100992.00 (-20.09 pct) 110208.00 (-31.05 pct) Metric: rps_summary #workers: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4 1: 305.00 (0.00 pct) 302.00 (0.98 pct) 296.00 (2.95 pct) 2: 607.00 (0.00 pct) 601.00 (0.98 pct) 593.00 (2.30 pct) 4: 1214.00 (0.00 pct) 1202.00 (0.98 pct) 1190.00 (1.97 pct) 8: 2436.00 (0.00 pct) 2420.00 (0.65 pct) 2372.00 (2.62 pct) 16: 4888.00 (0.00 pct) 4872.00 (0.32 pct) 4808.00 (1.63 pct) 32: 9776.00 (0.00 pct) 9744.00 (0.32 pct) 9680.00 (0.98 pct) 64: 19616.00 (0.00 pct) 19488.00 (0.65 pct) 19360.00 (1.30 pct) 128: 38592.00 (0.00 pct) 38720.00(-0.33 pct) 38080.00 (1.32 pct) 256: 41024.00 (0.00 pct) 40896.00 (0.31 pct) 38976.00 (4.99 pct) 512: 36800.00 (0.00 pct) 35776.00 (2.78 pct) 33728.00 (8.34 pct) Observation: new_schbench is showing regression in wakeup latencies while request latencies and rps latencies shows no change except for highly loaded case in request latency. 
========================= schbench ========================= #workers: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4 1: 185.00 (0.00 pct) 181.00 (2.16 pct) 181.00 (2.16 pct) 2: 191.00 (0.00 pct) 192.00 (-0.52 pct) 189.00 (1.04 pct) 4: 191.00 (0.00 pct) 192.00 (-0.52 pct) 194.00 (-1.57 pct) 8: 218.00 (0.00 pct) 198.00 (9.17 pct) 200.00 (8.25 pct) 16: 226.00 (0.00 pct) 225.00 (0.44 pct) 224.00 (0.88 pct) 32: 537.00 (0.00 pct) 537.00 (0.00 pct) 539.00 (-0.37 pct) 64: 605.00 (0.00 pct) 621.00 (-2.64 pct) 619.00 (-2.31 pct) 128: 765.00 (0.00 pct) 795.00 (-3.92 pct) 781.00 (-2.09 pct) 256: 1122.00 (0.00 pct) 1150.00 (-2.49 pct) 1150.00 (-2.49 pct) 512: 2276.00 (0.00 pct) 2476.00 (-8.78 pct) 2404.00 (-5.62 pct) ========================= tbench ========================= Clients: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4 1 386.07 (0.00 pct) 390.13 (1.05 pct) 463.49 (20.05 pct) 2 718.40 (0.00 pct) 856.37 (19.20 pct) 799.31 (11.26 pct) 4 1399.34 (0.00 pct) 1514.03 (8.19 pct) 1482.01 (5.90 pct) 8 2716.56 (0.00 pct) 3000.02 (10.43 pct) 2823.23 (3.92 pct) 16 5275.97 (0.00 pct) 5468.17 (3.64 pct) 5430.77 (2.93 pct) 32 10534.37 (0.00 pct) 10442.13 (-0.87 pct) 10386.92 (-1.39 pct) 64 22079.03 (0.00 pct) 19161.30 (-13.21 pct) 16773.73 (-24.02 pct) 128 41051.13 (0.00 pct) 29923.57 (-27.10 pct) 22510.13 (-45.16 pct) 256 55603.43 (0.00 pct) 43203.67 (-22.30 pct) 40343.17 (-27.44 pct) 512 130673.33 (0.00 pct) 76581.40 (-41.39 pct) 69152.47 (-47.07 pct) 1024 133323.67 (0.00 pct) 114910.67 (-13.81 pct) 107728.67 (-19.19 pct) 2048 143674.33 (0.00 pct) 123842.67 (-13.80 pct) 107202.00 (-25.38 pct) Observation: tbench is showing dip in throughput for 64 clients and onwards. ====================== stream 10 RUNS ====================== Test: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4 Copy: 354190.51 (0.00 pct) 356650.82 (0.69 pct) 353287.34 (-0.25 pct) Scale: 355427.44 (0.00 pct) 356686.34 (0.35 pct) 354406.79 (-0.28 pct) Add: 373800.46 (0.00 pct) 376610.56 (0.75 pct) 374609.00 (0.21 pct) Triad: 374697.25 (0.00 pct) 377635.98 (0.78 pct) 375343.44 (0.17 pct) ====================== stream 100 RUNS ====================== Test: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4 Copy: 357922.89 (0.00 pct) 356560.50 (-0.38 pct) 356507.22 (-0.39 pct) Scale: 358118.38 (0.00 pct) 357435.86 (-0.19 pct) 358033.29 (-0.02 pct) Add: 375307.34 (0.00 pct) 376046.70 (0.19 pct) 375586.33 (0.07 pct) Triad: 375656.40 (0.00 pct) 376674.43 (0.27 pct) 376581.95 (0.24 pct) ========================== netperf ========================== Clients: 6.5.0-rc4 (base) base + extend-idle base + extend-idle + nr_running<=4 1-clients: 114299.07 (0.00 pct) 110695.40 (-3.15 pct) 111533.30 (-2.41 pct) 2-clients: 114130.01 (0.00 pct) 110192.51 (-3.45 pct) 111682.01 (-2.14 pct) 4-clients: 109126.45 (0.00 pct) 107275.60 (-1.69 pct) 109574.77 (0.41 pct) 8-clients: 111209.21 (0.00 pct) 104360.40 (-6.15 pct) 106518.41 (-4.21 pct) 16-clients: 102955.20 (0.00 pct) 100968.38 (-1.92 pct) 101897.25 (-1.02 pct) 32-clients: 98537.18 (0.00 pct) 103018.09 ( 4.54 pct) 103917.70 (5.46 pct) 64-clients: 103619.68 (0.00 pct) 100376.37 (-3.13 pct) 102651.24 (-0.93 pct) 128-clients: 98536.55 (0.00 pct) 77845.69 (-20.99 pct) 98566.87 (0.03 pct) 256-clients: 51934.45 (0.00 pct) 52844.10 (1.75 pct) 53562.99 (3.13 pct) I have also tried the same experiment on one socket Genoa system with 96C/192T. On that system also I am seeing similar behavior. 
Can you share your build config just in case I am missing something.

>
> And using it now brings the hackbench wall time at 28s :)
>
> Thanks,
>
> Mathieu
>
>>
>>>>>         struct task_struct *stop;
>>>>>         unsigned long next_balance;
>>>>>         struct mm_struct *prev_mm;
>>>
>

--
Thanks and regards,
Swapnil
On 8/3/23 01:53, Swapnil Sapkal wrote:
[...]

Those are interesting metrics. I still have no clue why it behaves that
way though.

More specifically: I also noticed that the number of migrations is
heavily affected, and that select_task_rq behavior changes drastically.
I'm unsure why though.

>
> Can you share your build config just in case I am missing something.

Build config attached.

Thanks,

Mathieu

>
>>
>> And using it now brings the hackbench wall time at 28s :)
>>
>> Thanks,
>>
>> Mathieu
>>
>>>
>>>>>>         struct task_struct *stop;
>>>>>>         unsigned long next_balance;
>>>>>>         struct mm_struct *prev_mm;
>>>>
>>
> --
> Thanks and regards,
> Swapnil
On 8/1/23 03:24, Aaron Lu wrote:
> On Wed, Jul 26, 2023 at 02:56:19PM -0400, Mathieu Desnoyers wrote:
>
> ... ...
>
>> The updated patch:
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index a68d1276bab0..1c7d5bd2968b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7300,6 +7300,10 @@ int idle_cpu(int cpu)
>>  {
>>  	struct rq *rq = cpu_rq(cpu);
>>
>> +	if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING &&
>> +	    sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS)
>> +		return 1;
>> +
>>  	if (rq->curr != rq->idle)
>>  		return 0;
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 81ac605b9cd5..57a49a5524f0 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -97,6 +97,9 @@
>>  # define SCHED_WARN_ON(x)	({	(void)(x), 0;	})
>>  #endif
>>
>> +#define IDLE_CPU_DELAY_NS 1000000	/* 1ms */
>> +#define IDLE_CPU_DELAY_MAX_RUNNING 4
>> +
>>  struct rq;
>>  struct cpuidle_state;
>>
>
> I gave this patch a run on Intel SPR (2 sockets/112 cores/224 cpus) and I
> also noticed a huge improvement when running hackbench, especially for
> the group=32/fds=20 case:
>
> when group=10/fds=20 (400 tasks):
>                  time    wakeups/migrations    tg->load_avg%
> base:            43s     27874246/13953871     25%
> this patch:      32s     33200766/244457        2%
> my patch:        37s     29186608/16307254      2%
>
> when group=20/fds=20 (800 tasks):
>                  time    wakeups/migrations    tg->load_avg%
> base:            65s     27108751/16238701     27%
> this patch:      45s     35718552/1691220       3%
> my patch:        48s     37506974/24797284      2%
>
> when group=32/fds=20 (1280 tasks):
>                  time    wakeups/migrations    tg->load_avg%
> base:           150s     36902527/16423914     36%
> this patch:      57s     30536830/6035346       6%
> my patch:        73s     45264605/21595791      3%
>
> One thing I noticed is that, after this patch, migration on the wakeup
> path has dropped dramatically (see wakeups/migrations above; the numbers
> were captured for 5s during the run). I think this makes sense because
> now a cpu is more likely to be considered idle, so a wakeup task will
> more likely stay on its prev_cpu. And when migrations are reduced, the
> cost of accessing tg->load_avg is also reduced (tg->load_avg% is the sum
> of update_cfs_group()% + update_load_avg()% as reported by perf). I
> think this is part of the reason why performance improved on this
> machine.
>
> Since I've been working on reducing the cost of accessing tg->load_avg [1],
> I also gave my patch a run. According to the results, even when the cost
> of accessing tg->load_avg is smaller for my patch, Mathieu's patch is
> still faster. It's not clear to me why; maybe it has something to do
> with cache reuse, since my patch doesn't inhibit migration? I suppose
> ipc could reflect this?

I've also noticed a drastic reduction in the number of migrations with
my patch. I have noticed that the behavior of select_task_rq changes
drastically, but I have not figured out why yet.

I tried adding tons of schedstats counters within select_task_rq to try
to compare the decisions taken in the baseline vs modified
implementations of idle_cpu(). I also tried to count how many times the
target task rq changes (which implies a migration), with a breakdown by
cause (which branch within select_task_rq caused it). I could not find a
clear culprit yet though (and I am currently on vacation, so not working
on this actively).

Thanks,

Mathieu

>
> [1]: https://lore.kernel.org/lkml/20230718134120.81199-1-aaron.lu@intel.com/
>
> Thanks,
> Aaron
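To make the wakeup-path effect described above easier to picture, here is a
minimal userspace sketch (not kernel code; the helpers model_idle_cpu() and
pick_cpu() are hypothetical names invented for this illustration) of how the
relaxed idle test keeps a wakee on its prev_cpu for up to 1ms after that CPU
went idle, which is consistent with the observed drop in migrations:

/* Standalone model of the relaxed idle_cpu() check from the updated patch. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IDLE_CPU_DELAY_NS          1000000ULL /* 1ms */
#define IDLE_CPU_DELAY_MAX_RUNNING 4

struct cpu_model {
	uint64_t clock_idle;        /* last time this CPU entered idle (ns) */
	unsigned nr_running;        /* runnable tasks currently queued */
	bool     running_idle_task; /* currently running the idle task */
};

/* Mirrors the shape of the patched check: early "idle" within the window. */
static bool model_idle_cpu(const struct cpu_model *c, uint64_t now_ns)
{
	if (c->nr_running <= IDLE_CPU_DELAY_MAX_RUNNING &&
	    now_ns < c->clock_idle + IDLE_CPU_DELAY_NS)
		return true;
	return c->running_idle_task && c->nr_running == 0;
}

/* Crude stand-in for the wake_affine-style decision: stay on prev if "idle". */
static int pick_cpu(const struct cpu_model *prev, int prev_id,
		    int waker_id, uint64_t now_ns)
{
	if (model_idle_cpu(prev, now_ns))
		return prev_id;  /* no migration: reuse warm cache on prev_cpu */
	return waker_id;         /* otherwise fall back to the waking CPU */
}

int main(void)
{
	struct cpu_model prev = { .clock_idle = 1000, .nr_running = 2 };

	/* 0.5ms after prev went idle: still within the 1ms window -> cpu 3. */
	printf("target cpu: %d\n", pick_cpu(&prev, 3, 0, 1000 + 500000));
	/* 2ms after: window expired, prev no longer reported idle -> cpu 0. */
	printf("target cpu: %d\n", pick_cpu(&prev, 3, 0, 1000 + 2000000));
	return 0;
}

The second call in main() lands outside the 1ms window, so the model falls
back to the waking CPU, mirroring the pre-patch behavior.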
From: Shrikanth Hegde
> Sent: 26 July 2023 09:05
...
> > +	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS)
>
> Wouldn't this hurt the latency badly? Specially on a loaded system with
> a workload that does a lot of wakeup.

Having spotted this, I'm also rather worried about systems that are doing
(e.g.) real-time audio and need to wake up a lot of threads (fewer than
the number of cpus) every (say) 10ms.

It is hard enough waking up a lot of threads quickly without another 1ms
delay being added.
(I'm only talking about 30 threads as well, not 300.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 8/4/23 1:42 AM, Mathieu Desnoyers wrote:
> On 8/3/23 01:53, Swapnil Sapkal wrote:
> [...]
>
> Those are interesting metrics. I still have no clue why it behaves that
> way though.

I was thinking this might be the case: some workloads would benefit while
others would suffer, especially ones which favor latency over cache.

>
> More specifically: I also noticed that the number of migrations is
> heavily affected, and that select_task_rq behavior changes drastically.
> I'm unsure why though.
>

FWIU, load_balance uses idle_cpu to calculate the number of idle CPUs in
the sched_domain. Maybe that is getting confused by the 1ms delay concept.
Likely the sched_domains stay balanced because of this, and hence fewer
migrations.

In select_task_rq_fair, prev_cpu will be returned by wake_affine_idle
since idle_cpu will return true more often. Hence the task will get woken
on the same CPU as before instead of migrating.

On SMT systems, further gain comes from having the threads on a single
CPU, thereby making it ST mode (subject to utilization). Running in ST
mode is faster than running in SMT mode.

-------------------------------------------------------------------------------------------
Ran hackbench with perf stat on an SMT system. It indicates slightly
higher ST-mode cycles, and ipc improves slightly, thereby making it
faster.

baseline 6.5-rc1:
hackbench -pipe (50 groups)    Time: 0.67 (average of 50 runs)

     94,432,028,029   instructions            # 0.52 insn per cycle
    168,130,543,309   cycles
                                                (% of total cycles)
      1,162,153,934   PM_RUN_CYC_ST_MODE        ( 0.70% )
        613,018,646   PM_RUN_CYC_SMT2_MODE      ( 0.35% )
    166,358,778,832   PM_RUN_CYC_SMT4_MODE      (98.95% )

Latest patch in this series:
https://lore.kernel.org/lkml/447f756c-9c79-f801-8257-a97cc8256efe@efficios.com/#t
hackbench -pipe (50 groups)    Time: 0.62 (average of 50 runs)

     92,078,390,150   instructions            # 0.55 insn per cycle
    159,368,662,574   cycles
      1,330,653,107   PM_RUN_CYC_ST_MODE        ( 0.83% )
        656,950,636   PM_RUN_CYC_SMT2_MODE      ( 0.41% )
    157,384,470,123   PM_RUN_CYC_SMT4_MODE      (98.75% )

>>
>> Can you share your build config just in case I am missing something.
>
> Build config attached.
>
> Thanks,
>
> Mathieu
>
>>
>>>
>>> And using it now brings the hackbench wall time at 28s :)
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>>
>>>>>>>         struct task_struct *stop;
>>>>>>>         unsigned long next_balance;
>>>>>>>         struct mm_struct *prev_mm;
>>>>>
>>>
>> --
>> Thanks and regards,
>> Swapnil
>
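The load_balance side of the explanation above can be pictured with a
similar toy model (userspace only, with hypothetical helper names
strict_idle()/extended_idle(); the real accounting lives in the
load-balance statistics code in kernel/sched/fair.c): counting a
recently-idled but now-busy CPU as idle makes the domain look balanced,
so fewer balancing migrations get triggered.

/* Toy model: how many CPUs in a domain look "idle" under each rule. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IDLE_CPU_DELAY_NS 1000000ULL /* 1ms */

struct cpu_state {
	bool     running_idle_task; /* currently running the idle task */
	uint64_t clock_idle;        /* when the CPU last entered idle (ns) */
};

/* Pre-patch rule: idle means actually running the idle task. */
static bool strict_idle(const struct cpu_state *c)
{
	return c->running_idle_task;
}

/* Patched rule: also report idle within 1ms of last entering idle. */
static bool extended_idle(const struct cpu_state *c, uint64_t now)
{
	if (now < c->clock_idle + IDLE_CPU_DELAY_NS)
		return true;
	return c->running_idle_task;
}

int main(void)
{
	/* Two CPUs went idle 0.3ms ago and have since picked up work;
	 * two CPUs are truly idle. */
	struct cpu_state domain[4] = {
		{ .running_idle_task = false, .clock_idle = 9700000 },
		{ .running_idle_task = false, .clock_idle = 9700000 },
		{ .running_idle_task = true,  .clock_idle = 5000000 },
		{ .running_idle_task = true,  .clock_idle = 6000000 },
	};
	uint64_t now = 10000000; /* 10ms into the run */
	int strict = 0, extended = 0;

	for (int i = 0; i < 4; i++) {
		strict   += strict_idle(&domain[i]);
		extended += extended_idle(&domain[i], now);
	}

	/* strict=2 (only the truly idle CPUs); extended=4 (all look idle). */
	printf("strict=%d extended=%d\n", strict, extended);
	return 0;
}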
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..d40e3a0a5ced 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6769,6 +6769,7 @@ void __sched schedule_idle(void)
 	 * TASK_RUNNING state.
 	 */
 	WARN_ON_ONCE(current->__state);
+	WRITE_ONCE(this_rq()->idle_end_time, sched_clock());
 	do {
 		__schedule(SM_NONE);
 	} while (need_resched());
@@ -7300,6 +7301,9 @@ int idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS)
+		return 1;
+
 	if (rq->curr != rq->idle)
 		return 0;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81ac605b9cd5..8932e198a33a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -97,6 +97,8 @@
 # define SCHED_WARN_ON(x)	({	(void)(x), 0;	})
 #endif
 
+#define IDLE_CPU_DELAY_NS 1000000	/* 1ms */
+
 struct rq;
 struct cpuidle_state;
 
@@ -1010,6 +1012,7 @@ struct rq {
 
 	struct task_struct __rcu	*curr;
 	struct task_struct	*idle;
+	u64			idle_end_time;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
 	struct mm_struct	*prev_mm;