[v3,3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
Message ID | 4eacbaa236e680687dae2958378a6173654113df.1688770494.git.tim.c.chen@linux.intel.com |
---|---|
State | New |
Series | Enable Cluster Scheduling for x86 Hybrid CPUs |
Commit Message
Tim Chen
July 7, 2023, 10:57 p.m. UTC
From: Tim C Chen <tim.c.chen@linux.intel.com>

In the current prefer sibling load balancing code, there is an implicit
assumption that the busiest sched group and local sched group are
equivalent, hence the number of tasks to be moved is simply the difference
in the number of tasks between the two groups (i.e. the imbalance) divided
by two.

However, we may have a different number of cores between the cluster
groups, say when we take CPUs offline or we have hybrid groups. In that
case, we should balance between the two groups such that the #tasks/#cores
ratio is the same between both groups. Hence the imbalance computed will
need to reflect this.

Adjust the sibling imbalance computation to take the above considerations
into account.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)
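As a concrete illustration of the ratio rule, here is a minimal user-space
sketch of the arithmetic. The function name ratio_imbalance, the
plain-integer parameters, and the open-coded saturating subtraction are
illustrative stand-ins for the patch's sibling_imbalance() and the
kernel's lsub_positive(); this is not the kernel code. The result is kept
on the same scale as the old busiest-minus-local task difference, i.e.
roughly twice the number of tasks that should migrate, since the caller
later halves it.

#include <stdio.h>

/*
 * Sketch of the ratio-based imbalance: pick an imbalance such that
 * moving imbalance/2 tasks makes nr_running/ncores roughly equal in
 * both groups. All names here are illustrative stand-ins.
 */
static long ratio_imbalance(long ncores_busiest, long nr_busiest,
			    long ncores_local, long nr_local)
{
	/* Saturating subtraction, like the kernel's lsub_positive(). */
	long imbalance = ncores_local * nr_busiest;
	long sub = ncores_busiest * nr_local;

	imbalance -= (sub < imbalance) ? sub : imbalance;

	/* Scale to units of 2*tasks, rounding to the nearest integer. */
	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
	return imbalance / (ncores_local + ncores_busiest);
}

int main(void)
{
	/*
	 * A 4-core busiest group running 6 tasks vs. a 2-core local group
	 * running 1 task: equalizing 6/4 vs. 1/2 tasks-per-core needs
	 * about 1.3 tasks moved, so the scaled imbalance is 3.
	 */
	printf("%ld\n", ratio_imbalance(4, 6, 2, 1));	/* prints 3 */
	return 0;
}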
Comments
On 7/8/23 4:27 AM, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@linux.intel.com>
>
> In the current prefer sibling load balancing code, there is an implicit
> assumption that the busiest sched group and local sched group are
> equivalent, hence the tasks to be moved is simply the difference in
> number of tasks between the two groups (i.e. imbalance) divided by two.
>
> However, we may have different number of cores between the cluster groups,
> say when we take CPU offline or we have hybrid groups. In that case,
> we should balance between the two groups such that #tasks/#cores ratio
> is the same between the same between both groups. Hence the imbalance

nit: typo here. "the same between" is repeated.

> computed will need to reflect this.
>
> Adjust the sibling imbalance computation to take into account of the
> above considerations.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
>  1 file changed, 37 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f636d6c09dc6..f491b94908bf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9372,6 +9372,41 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
>  	return false;
>  }
>
> +static inline long sibling_imbalance(struct lb_env *env,
> +				    struct sd_lb_stats *sds,
> +				    struct sg_lb_stats *busiest,
> +				    struct sg_lb_stats *local)
> +{
> +	int ncores_busiest, ncores_local;
> +	long imbalance;

Can imbalance be unsigned int or unsigned long, since sum_nr_running is
unsigned int?

> +
> +	if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
> +		return 0;
> +
> +	ncores_busiest = sds->busiest->cores;
> +	ncores_local = sds->local->cores;
> +
> +	if (ncores_busiest == ncores_local) {
> +		imbalance = busiest->sum_nr_running;
> +		lsub_positive(&imbalance, local->sum_nr_running);
> +		return imbalance;
> +	}
> +
> +	/* Balance such that nr_running/ncores ratio are same on both groups */
> +	imbalance = ncores_local * busiest->sum_nr_running;
> +	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
> +	/* Normalize imbalance and do rounding on normalization */
> +	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
> +	imbalance /= ncores_local + ncores_busiest;
> +

Could this work for the case where the number of CPUs/cores differs
between two sched groups in a sched domain? Such as the problem pointed
out by Tobias on s390. It would be nice if this patch could work for that
case as well. Ran numbers for a few cases. It looks to work.
https://lore.kernel.org/lkml/20230704134024.GV4253@hirez.programming.kicks-ass.net/T/#rb0a7dcd28532cafc24101e1d0aed79e6342e3901

> +	/* Take advantage of resource in an empty sched group */
> +	if (imbalance == 0 && local->sum_nr_running == 0 &&
> +	    busiest->sum_nr_running > 1)
> +		imbalance = 2;
> +

I don't see how this case would be true. When there is an unequal number
of cores, local->sum_nr_running is 0, and busiest->sum_nr_running is at
least 2, imbalance will be at least 1.

Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>

> [...]
On 2023-07-14 15:14, Shrikanth Hegde wrote:
> On 7/8/23 4:27 AM, Tim Chen wrote:
>> From: Tim C Chen <tim.c.chen@linux.intel.com>
>>
>> [...]
>>
>> +	/* Balance such that nr_running/ncores ratio are same on both groups */
>> +	imbalance = ncores_local * busiest->sum_nr_running;
>> +	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
>> +	/* Normalize imbalance and do rounding on normalization */
>> +	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
>> +	imbalance /= ncores_local + ncores_busiest;
>> +
>
> Could this work for the case where the number of CPUs/cores differs
> between two sched groups in a sched domain? Such as the problem pointed
> out by Tobias on s390. It would be nice if this patch could work for that
> case as well. Ran numbers for a few cases. It looks to work.
> https://lore.kernel.org/lkml/20230704134024.GV4253@hirez.programming.kicks-ass.net/T/#rb0a7dcd28532cafc24101e1d0aed79e6342e3901

Just stumbled upon this patch series as well. In this version it looks
similar to the prototypes I played around with, but more complete.
So I'm happy that my understanding of the load balancer was kinda
correct :)

From a functional perspective this appears to address the issues we saw
on s390.

> [...]
On Fri, 2023-07-14 at 18:44 +0530, Shrikanth Hegde wrote:
>
> On 7/8/23 4:27 AM, Tim Chen wrote:
> > [...]
> > we should balance between the two groups such that #tasks/#cores ratio
> > is the same between the same between both groups. Hence the imbalance
>
> nit: typo here. "the same between" is repeated.

Thanks for catching.

> > +static inline long sibling_imbalance(struct lb_env *env,
> > +				    struct sd_lb_stats *sds,
> > +				    struct sg_lb_stats *busiest,
> > +				    struct sg_lb_stats *local)
> > +{
> > +	int ncores_busiest, ncores_local;
> > +	long imbalance;
>
> Can imbalance be unsigned int or unsigned long, since sum_nr_running is
> unsigned int?

It could be made unsigned long.

> > +	/* Balance such that nr_running/ncores ratio are same on both groups */
> > +	imbalance = ncores_local * busiest->sum_nr_running;
> > +	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
> > +	/* Normalize imbalance and do rounding on normalization */
> > +	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
> > +	imbalance /= ncores_local + ncores_busiest;
> > +
>
> Could this work for the case where the number of CPUs/cores differs
> between two sched groups in a sched domain?

Yes. That's the problem I was targeting.

> Such as the problem pointed out by Tobias on s390. It would be nice if
> this patch could work for that case as well. Ran numbers for a few
> cases. It looks to work.
> https://lore.kernel.org/lkml/20230704134024.GV4253@hirez.programming.kicks-ass.net/T/#rb0a7dcd28532cafc24101e1d0aed79e6342e3901
>
> [...]
>
> Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>

Thanks for the review.
On Fri, 2023-07-14 at 13:44 -0700, Tim Chen wrote:
>
> > > +static inline long sibling_imbalance(struct lb_env *env,
> > > +				    struct sd_lb_stats *sds,
> > > +				    struct sg_lb_stats *busiest,
> > > +				    struct sg_lb_stats *local)
> > > +{
> > > +	int ncores_busiest, ncores_local;
> > > +	long imbalance;
> >
> > Can imbalance be unsigned int or unsigned long, since sum_nr_running is
> > unsigned int?
>
> It could be made unsigned long.

Though in theory the imbalance can be either positive or negative, we are
considering only positive imbalance here, as we only pull tasks to the
local group and do not push tasks from the local group.

Tim
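For context, lsub_positive() in kernel/sched/fair.c is a
subtract-but-clamp-at-zero helper, which is what keeps the computed
imbalance non-negative, matching the pull-only behavior described above.
A rough user-space equivalent is sketched below; the macro body is an
approximation using GCC's __typeof__, not the kernel's exact definition.

#include <stdio.h>

/* Approximation of the kernel's subtract-and-clamp-at-zero helper. */
#define lsub_positive(ptr, val) do {				\
	__typeof__(*(ptr)) _v = (val);				\
	*(ptr) -= (_v < *(ptr)) ? _v : *(ptr);			\
} while (0)

int main(void)
{
	long imbalance = 2 * 3;		/* ncores_local * busiest->sum_nr_running */

	/* Local side is already busier per core: clamp to 0, never negative. */
	lsub_positive(&imbalance, 4 * 5);
	printf("%ld\n", imbalance);	/* prints 0 */
	return 0;
}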
On Fri, 2023-07-14 at 16:22 +0200, Tobias Huschle wrote:
>
> Just stumbled upon this patch series as well. In this version it looks
> similar to the prototypes I played around with, but more complete.
> So I'm happy that my understanding of the load balancer was kinda
> correct :)
>
> From a functional perspective this appears to address the issues we saw
> on s390.

Glad that this patch addresses this common issue across architectures.
I did aim to address the asymmetric groups balancing in general. Peter
pointed out this inherent problem when he reviewed the first version of
my patchset.

Tim
On Fri, 2023-07-14 at 18:44 +0530, Shrikanth Hegde wrote:
>
> > +	/* Take advantage of resource in an empty sched group */
> > +	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > +	    busiest->sum_nr_running > 1)
> > +		imbalance = 2;
> > +
>
> I don't see how this case would be true. When there is an unequal number
> of cores, local->sum_nr_running is 0, and busiest->sum_nr_running is at
> least 2, imbalance will be at least 1.

I think you are correct. With at least 2 tasks in the busiest group,
imbalance will be at least 1. This is the effect of the rounding done
when adding the (ncores_local + ncores_busiest) rounding factor.

Returning an imbalance value of 1 will not be correct, as we will be
dividing imbalance by 2 and we will still not move a task to the empty
group as intended. So this code should be updated as below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3fc8d3a3bd22..16bf75e6a775 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9400,7 +9400,7 @@ static inline long sibling_imbalance(struct lb_env *env,
 	imbalance /= ncores_local + ncores_busiest;
 
 	/* Take advantage of resource in an empty sched group */
-	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	if (imbalance <= 1 && local->sum_nr_running == 0 &&
 	    busiest->sum_nr_running > 1)
 		imbalance = 2;

Tim
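To see the corner case with concrete numbers, the following stand-alone
check reproduces the normalization arithmetic for an empty local group.
The group sizes and the helper name are illustrative, not taken from the
thread.

#include <assert.h>

/* Normalization arithmetic of sibling_imbalance() when local->sum_nr_running == 0. */
static long normalized(long ncores_busiest, long nr_busiest, long ncores_local)
{
	long imb = ncores_local * nr_busiest;

	imb = 2 * imb + ncores_local + ncores_busiest;	/* round to nearest ... */
	return imb / (ncores_local + ncores_busiest);	/* ... while normalizing */
}

int main(void)
{
	/*
	 * 1 local core, 4 busiest cores, 2 tasks on busiest, 0 on local:
	 * (2*2 + 5) / 5 == 1, so the "imbalance == 0" test never fires,
	 * yet halving 1 moves nothing; "imbalance <= 1" bumps it to 2.
	 */
	assert(normalized(4, 2, 1) == 1);
	return 0;
}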
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f636d6c09dc6..f491b94908bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9372,6 +9372,41 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 	return false;
 }
 
+static inline long sibling_imbalance(struct lb_env *env,
+				    struct sd_lb_stats *sds,
+				    struct sg_lb_stats *busiest,
+				    struct sg_lb_stats *local)
+{
+	int ncores_busiest, ncores_local;
+	long imbalance;
+
+	if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+		return 0;
+
+	ncores_busiest = sds->busiest->cores;
+	ncores_local = sds->local->cores;
+
+	if (ncores_busiest == ncores_local) {
+		imbalance = busiest->sum_nr_running;
+		lsub_positive(&imbalance, local->sum_nr_running);
+		return imbalance;
+	}
+
+	/* Balance such that nr_running/ncores ratio are same on both groups */
+	imbalance = ncores_local * busiest->sum_nr_running;
+	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
+	/* Normalize imbalance and do rounding on normalization */
+	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
+	imbalance /= ncores_local + ncores_busiest;
+
+	/* Take advantage of resource in an empty sched group */
+	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	    busiest->sum_nr_running > 1)
+		imbalance = 2;
+
+	return imbalance;
+}
+
 static inline bool
 sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 {
@@ -10230,14 +10265,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	}
 
 	if (busiest->group_weight == 1 || sds->prefer_sibling) {
-		unsigned int nr_diff = busiest->sum_nr_running;
 		/*
 		 * When prefer sibling, evenly spread running tasks on
 		 * groups.
 		 */
 		env->migration_type = migrate_task;
-		lsub_positive(&nr_diff, local->sum_nr_running);
-		env->imbalance = nr_diff;
+		env->imbalance = sibling_imbalance(env, sds, busiest, local);
 	} else {
 
 		/*
@@ -10424,7 +10457,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * group's child domain.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_nr_running > local->sum_nr_running + 1)
+	    sibling_imbalance(env, &sds, busiest, local) > 1)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
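For experimentation outside the kernel, here is a hedged user-space model
of the final helper with the "imbalance <= 1" follow-up fix folded in.
The kernel structures and the env->idle check are omitted and the
saturating subtraction is open-coded, so this sketches the arithmetic
only, not the actual kernel function.

#include <stdio.h>

static long model_sibling_imbalance(int ncores_busiest, unsigned int nr_busiest,
				    int ncores_local, unsigned int nr_local)
{
	long imbalance, sub;

	if (!nr_busiest)
		return 0;

	if (ncores_busiest == ncores_local) {
		/* Equal cores: fall back to the plain task difference. */
		imbalance = nr_busiest;
		imbalance -= (nr_local < imbalance) ? nr_local : imbalance;
		return imbalance;
	}

	/* Balance such that nr_running/ncores is the same on both groups. */
	imbalance = (long)ncores_local * nr_busiest;
	sub = (long)ncores_busiest * nr_local;
	imbalance -= (sub < imbalance) ? sub : imbalance;
	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
	imbalance /= ncores_local + ncores_busiest;

	/* Follow-up fix: make sure an empty group still pulls a task. */
	if (imbalance <= 1 && nr_local == 0 && nr_busiest > 1)
		imbalance = 2;

	return imbalance;
}

int main(void)
{
	printf("%ld\n", model_sibling_imbalance(4, 6, 4, 2));	/* equal cores: 4 */
	printf("%ld\n", model_sibling_imbalance(4, 6, 2, 1));	/* hybrid groups: 3 */
	printf("%ld\n", model_sibling_imbalance(4, 2, 1, 0));	/* empty local group: 2 */
	return 0;
}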