Message ID | 20230726093612.1882644-1-sshegde@linux.vnet.ibm.com |
---|---|
State | New |
Headers |
From: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
To: peterz@infradead.org, vincent.guittot@linaro.org
Cc: sshegde@linux.vnet.ibm.com, srikar@linux.vnet.ibm.com, linux-kernel@vger.kernel.org, mingo@redhat.com, dietmar.eggemann@arm.com, mgorman@suse.de
Subject: [RFC PATCH] sched/fair: Skip idle CPU search on busy system
Date: Wed, 26 Jul 2023 15:06:12 +0530
Message-Id: <20230726093612.1882644-1-sshegde@linux.vnet.ibm.com>
X-Mailer: git-send-email 2.31.1
MIME-Version: 1.0 |
Series | [RFC] sched/fair: Skip idle CPU search on busy system |
Commit Message
Shrikanth Hegde
July 26, 2023, 9:36 a.m. UTC
When the system is fully busy, there will not be any idle CPUs. In that
case, load_balance will be called mainly with the CPU_NOT_IDLE type.
should_we_balance() currently checks for an idle CPU if one exists. When
the system is 100% busy, there will not be an idle CPU, so these
idle_cpu() checks can be skipped. This avoids fetching those rq
structures.

This is a minor optimization for the specific case of 100% utilization.
.....
Coming to the current implementation: it is a very basic approach to the
issue and may not be the best/perfect way to do this. It works only in
the case of CONFIG_NO_HZ_COMMON. nohz.nr_cpus is a global counter that
tracks idle CPUs; AFAIU there isn't any other such information. If there
is, we can use that instead. nohz.nr_cpus is atomic, which might be
costly too.
An alternative would be to add a new attribute to sched_domain and
update it in the CPU idle entry/exit path, per CPU. The advantage is
that the check could be per env->sd instead of global. Slightly more
complicated, but maybe better; there could be other advantages at
wakeup, e.g. limiting the scan.

Your feedback would really help. Does this optimization make sense?
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)
--
2.31.1
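
For context, the nohz bookkeeping that the patch reads is a global
structure in kernel/sched/fair.c, defined only under CONFIG_NO_HZ_COMMON.
A sketch of the v6.4-era definition, trimmed to the fields relevant here
(see the kernel source for the authoritative version):

	static struct {
		cpumask_var_t	idle_cpus_mask;	/* CPUs currently in nohz idle */
		atomic_t	nr_cpus;	/* count of CPUs in idle_cpus_mask */
		unsigned long	next_balance;	/* when the next nohz balance is due */
	} nohz ____cacheline_aligned;

Every CPU entering or leaving nohz idle updates nr_cpus, which is why
the changelog notes that traffic on this global atomic might itself be
costly.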
Comments
On 2023-07-26 at 15:06:12 +0530, Shrikanth Hegde wrote:
> When the system is fully busy, there will not be any idle CPUs. In that
> case, load_balance will be called mainly with the CPU_NOT_IDLE type.
> should_we_balance() currently checks for an idle CPU if one exists. When
> the system is 100% busy, there will not be an idle CPU, so these
> idle_cpu() checks can be skipped. This avoids fetching those rq
> structures.

Yes, I guess this could help reduce the cost if the sched group has many
CPUs.

> This is a minor optimization for the specific case of 100% utilization.
>
> [...]
>
> An alternative would be to add a new attribute to sched_domain and
> update it in the CPU idle entry/exit path, per CPU. The advantage is
> that the check could be per env->sd instead of global. Slightly more
> complicated, but maybe better; there could be other advantages at
> wakeup, e.g. limiting the scan.

When checking the code, I found that there is a per-domain nr_busy_cpus.
However, that variable exists only for the LLC domain. Maybe extending
sd_share to the domains under NUMA is applicable, IMO.

thanks,
Chenyu

> Does this optimization make sense?
>
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> [...]
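
For reference, the per-LLC mechanism mentioned above: nr_busy_cpus lives
in struct sched_domain_shared and is maintained from the nohz idle
entry/exit paths. A trimmed sketch of the v6.4-era code (paraphrased
from include/linux/sched/topology.h and kernel/sched/fair.c; consult the
source tree for the authoritative version):

	struct sched_domain_shared {
		atomic_t	ref;
		atomic_t	nr_busy_cpus;	/* non-idle CPUs in this LLC */
		int		has_idle_cores;
		int		nr_idle_scan;
	};

	/* Tick restarting on a CPU that was in nohz idle: mark its LLC busier. */
	static void set_cpu_sd_state_busy(int cpu)
	{
		struct sched_domain *sd;

		rcu_read_lock();
		sd = rcu_dereference(per_cpu(sd_llc, cpu));

		if (!sd || !sd->nohz_idle)
			goto unlock;
		sd->nohz_idle = 0;

		atomic_inc(&sd->shared->nr_busy_cpus);
	unlock:
		rcu_read_unlock();
	}

Only domains that share package resources (i.e. up to the LLC) get a
sched_domain_shared instance, which is the limitation Chen Yu points at.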
On 7/27/23 12:55 PM, Chen Yu wrote:
> On 2023-07-26 at 15:06:12 +0530, Shrikanth Hegde wrote:
>> When the system is fully busy, there will not be any idle CPUs.
>> [...]
>
> Yes, I guess this could help reduce the cost if the sched group has
> many CPUs.

Thank you for the review, Chen Yu.

>> An alternative would be to add a new attribute to sched_domain and
>> update it in the CPU idle entry/exit path, per CPU.
>> [...]
>
> When checking the code, I found that there is a per-domain nr_busy_cpus.
> However, that variable exists only for the LLC domain. Maybe extending
> sd_share to the domains under NUMA is applicable, IMO.

True, I did see that. Doing this at every level when there are a large
number of CPUs will likely need a lock when updating sd_share, and that
could become a bottleneck as well. Since sd_share never makes sense for
NUMA, this would mean different code paths for NUMA and non-NUMA, even
though the main benefit in this corner case would be on NUMA systems,
since that is where the large CPU counts are. I will keep that thought
in mind and try to work out something along those lines.

> thanks,
> Chenyu
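
To make the contention concern concrete, a purely hypothetical sketch of
what extending such a counter to every domain level could look like (all
names here are invented for illustration; this is not proposed code):

	/*
	 * Hypothetical: if every sched_domain level carried a shared
	 * busy counter, each idle entry would touch one shared cacheline
	 * per level. At the widest levels that cacheline is written by
	 * every CPU in the domain, which is exactly the bottleneck
	 * discussed above.
	 */
	static void hypothetical_sd_idle_enter(int cpu)
	{
		struct sched_domain *sd;

		rcu_read_lock();
		for_each_domain(cpu, sd) {
			if (!sd->shared)	/* levels above the LLC have no sd_share today */
				continue;
			atomic_dec(&sd->shared->nr_busy_cpus);
		}
		rcu_read_unlock();
	}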
On Wed, Jul 26, 2023 at 03:06:12PM +0530, Shrikanth Hegde wrote:
> When the system is fully busy, there will not be any idle CPUs. In that
> case, load_balance will be called mainly with the CPU_NOT_IDLE type.
> [...]

Tested this patchset on top of v6.4.

5 runs of stress-ng (100% load) on a system with 16 CPUs, spawning 23
threads for 60 minutes.

stress-ng: 16 CPUs, 23 threads, 60 mins

- 6.4.0

| completion time (sec) |      user |        sys |
|-----------------------+-----------+------------|
|               3600.05 |  57582.44 |       0.70 |
|               3600.10 |  57597.07 |       0.68 |
|               3600.05 |  57596.65 |       0.47 |
|               3600.04 |  57596.36 |       0.71 |
|               3600.06 |  57595.32 |       0.42 |
|               3600.06 | 57593.568 |      0.596 | average
|           0.046904158 | 12.508392 | 0.27878307 | stddev

- 6.4.0+ (with patch)

| completion time (sec) |      user |         sys |
|-----------------------+-----------+-------------|
|               3600.04 |  57596.58 |        0.50 |
|               3600.04 |  57595.19 |        0.48 |
|               3600.05 |  57597.39 |        0.49 |
|               3600.04 |  57596.64 |        0.53 |
|               3600.04 |  57595.94 |        0.43 |
|              3600.042 | 57596.348 |       0.486 | average
|          0.0089442719 | 1.6529610 | 0.072938330 | stddev

The average system time is slightly lower in the patched version (0.486
seconds) than in 6.4.0 (0.596 seconds). The standard deviation for
system time is also lower in the patched version (0.0729 seconds vs.
0.2788 seconds), suggesting more consistent system-time results with the
patch.

vishal.c
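
(The exact stress-ng invocation is not given in the report; for the
setup described, something along the lines of
"stress-ng --cpu 23 --timeout 60m --metrics" would match. The options
shown are an assumption, not taken from the report.)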
On 8/10/23 12:14 AM, Vishal Chourasia wrote:
> On Wed, Jul 26, 2023 at 03:06:12PM +0530, Shrikanth Hegde wrote:
>> When the system is fully busy, there will not be any idle CPUs.
>
> Tested this patchset on top of v6.4.
[...]
> 5 runs of stress-ng (100% load) on a system with 16 CPUs, spawning 23
> threads for 60 minutes.
>
> [...]
>
> The average system time is slightly lower in the patched version (0.486
> seconds) than in 6.4.0 (0.596 seconds). The standard deviation for
> system time is also lower in the patched version (0.0729 seconds vs.
> 0.2788 seconds), suggesting more consistent system-time results with
> the patch.
>
> vishal.c

Thank you very much, Vishal, for trying this out. Meanwhile, I am yet to
try the suggestion given by Chen. Let me see if that works okay.
* Shrikanth Hegde <sshegde@linux.vnet.ibm.com> wrote:

> When the system is fully busy, there will not be any idle CPUs.
> [...]
>
> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>  		return 1;
>  	}
>
> +#ifdef CONFIG_NO_HZ_COMMON
> +	/* If the system is fully busy, it's better to skip the idle checks */
> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
> +		return group_balance_cpu(sg) == env->dst_cpu;
> +#endif

Not a big fan of coupling NOHZ to a scheduler optimization in this
fashion, and not a big fan of the nohz.nr_cpus global cacheline either.

I think it should be done unconditionally, via the scheduler topology
tree:

 - We should probably slow-propagate "permanently busy" status of a CPU
   down the topology tree, ie.:

    - mark a domain fully-busy with a delay & batching, probably driven
      by the busy-tick only,

    - while marking a domain idle instantly & propagating this up the
      domain tree only if necessary. The propagation can stop if it
      finds a non-busy domain, so usually it won't reach the root
      domain.

 - This approach ensures there's no real overhead problem in the domain
   tree: think of hundreds of CPUs all accessing the nohz.nr_cpus global
   variable... I bet it's a measurable problem already on large systems.

 - The "atomic_read(&nohz.nr_cpus) == 0" condition in your patch is
   simply the busy-flag checked at the root domain: a read-only global
   cacheline that never gets modified on a permanently busy system.

Thanks,

	Ingo
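
A rough illustration of the propagation scheme described above (entirely
speculative: the fully_busy field and all helpers are invented here to
visualize the idea, and do not exist in the kernel):

	/* Idle propagates up instantly, stopping at the first non-busy level. */
	static void sketch_mark_cpu_idle(int cpu)
	{
		struct sched_domain *sd;

		rcu_read_lock();
		for_each_domain(cpu, sd) {
			if (!READ_ONCE(sd->fully_busy))	/* hypothetical flag */
				break;
			WRITE_ONCE(sd->fully_busy, 0);
		}
		rcu_read_unlock();
	}

	/*
	 * Busy propagates slowly, batched from the busy tick, so a briefly
	 * idle CPU does not keep flipping flags across the tree.
	 */
	static void sketch_busy_tick(int cpu)
	{
		struct sched_domain *sd;

		rcu_read_lock();
		for_each_domain(cpu, sd) {
			if (READ_ONCE(sd->fully_busy))
				break;
			if (sketch_domain_busy_long_enough(sd))	/* hypothetical batching check */
				WRITE_ONCE(sd->fully_busy, 1);
		}
		rcu_read_unlock();
	}

With flags like these in place, the check in should_we_balance() would
become a read of the root domain's flag instead of the global
nohz.nr_cpus atomic, matching Ingo's last point.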
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..903d59b5290c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
 		return 1;
 	}
 
+#ifdef CONFIG_NO_HZ_COMMON
+	/* If the system is fully busy, it's better to skip the idle checks */
+	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
+		return group_balance_cpu(sg) == env->dst_cpu;
+#endif
+
 	/* Try to find first idle CPU */
 	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
 		if (!idle_cpu(cpu))