Message ID: 20230515114601.12737-2-huschle@linux.ibm.com
State: New
From: Tobias Huschle <huschle@linux.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, sshegde@linux.vnet.ibm.com, srikar@linux.vnet.ibm.com, linuxppc-dev@lists.ozlabs.org
Subject: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
Date: Mon, 15 May 2023 13:46:01 +0200
Message-Id: <20230515114601.12737-2-huschle@linux.ibm.com>
In-Reply-To: <20230515114601.12737-1-huschle@linux.ibm.com>
Series: sched/fair: Consider asymmetric scheduler groups in load balancer
Commit Message
Tobias Huschle
May 15, 2023, 11:46 a.m. UTC
The current load balancer implementation implies that scheduler groups
within the same domain all host the same number of CPUs. This is
reflected in the condition that a scheduler group which is load
balancing and classified as having spare capacity should pull work
from the busiest group if the local group runs fewer processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures. Nevertheless, s390 can have scheduler groups of unequal
size.

This introduces a performance degradation in the following scenario:
Consider a system with 8 CPUs, where 6 CPUs are located on one CPU
socket and the remaining 2 are located on another socket:
Socket -----1----- -2-
CPU 1 2 3 4 5 6 7 8
Placing some workload (x = one task) yields the following scenarios:
The first 5 tasks are distributed evenly across the two groups.
Socket -----1----- -2-
CPU 1 2 3 4 5 6 7 8
x x x x x
Adding a 6th task yields the following distribution:
Socket -----1----- -2-
CPU 1 2 3 4 5 6 7 8
SMT1 x x x x x
SMT2 x
The task is added to the 2nd scheduler group, as the scheduler assumes
that scheduler groups are of the same size, so they should also host
the same number of tasks. This makes CPU 7 run into SMT, which comes
with a performance penalty. This means that, in the window of 6-8
tasks, load balancing is done suboptimally, because SMT is used
although there is no reason to do so: fully idle CPUs are still
available.
Taking the weight of the scheduler groups into account ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Comments
On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com> wrote:
>
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs.
> [...]
> Adding a 6th task yields the following distribution:
>
> Socket -----1----- -2-
> CPU    1 2 3 4 5 6 7 8
> SMT1   x x x x x
> SMT2             x

Your description is a bit confusing for me. What you name CPU above
should be named Core, doesn't it?

Could you share with us your scheduler topology?

> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks.
> [...]
>
> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>          * group's child domain.
>          */
>         if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -           busiest->sum_nr_running > local->sum_nr_running + 1)
> +           busiest->sum_nr_running * local->group_weight >
> +                       local->sum_nr_running * busiest->group_weight + 1)

This is the prefer_sibling path. Could it be that you should disable
prefer_sibling between your sockets for such a topology? The default
path compares the number of idle CPUs when groups have spare capacity.

>                 goto force_balance;
>
>         if (busiest->group_type != group_overloaded) {
> --
> 2.34.1
On 2023-05-16 15:36, Vincent Guittot wrote:
> On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com> wrote:
>> [...]
>> Adding a 6th task yields the following distribution:
>>
>> Socket -----1----- -2-
>> CPU    1 2 3 4 5 6 7 8
>> SMT1   x x x x x
>> SMT2             x
>
> Your description is a bit confusing for me. What you name CPU above
> should be named Core, doesn' it ?
>
> Could you share with us your scheduler topology ?

You are correct, it should say core instead of CPU.

One actual configuration from one of my test machines (lscpu -e):

CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS
0   0    0      0    0      0    0:0:0      yes    yes        horizontal   0
1   0    0      0    0      0    1:1:0      yes    yes        horizontal   1
2   0    0      0    0      1    2:2:0      yes    yes        horizontal   2
3   0    0      0    0      1    3:3:0      yes    yes        horizontal   3
4   0    0      0    0      2    4:4:0      yes    yes        horizontal   4
5   0    0      0    0      2    5:5:0      yes    yes        horizontal   5
6   0    0      0    0      3    6:6:0      yes    yes        horizontal   6
7   0    0      0    0      3    7:7:0      yes    yes        horizontal   7
8   0    0      0    0      4    8:8:0      yes    yes        horizontal   8
9   0    0      0    0      4    9:9:0      yes    yes        horizontal   9
10  0    0      0    0      5    10:10:0    yes    yes        horizontal   10
11  0    0      0    0      5    11:11:0    yes    yes        horizontal   11
12  0    0      0    1      6    12:12:0    yes    yes        horizontal   12
13  0    0      0    1      6    13:13:0    yes    yes        horizontal   13
14  0    0      0    1      7    14:14:0    yes    yes        horizontal   14
15  0    0      0    1      7    15:15:0    yes    yes        horizontal   15

So, 6 cores / 12 CPUs in one group and 2 cores / 4 CPUs in the other.

If I run stress-ng with 8 cpu stressors on the original code I get a
distribution like this:

00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
 x     x     x     x                    x  x  x  x

Which means that the two cores in the smaller group are running into
SMT while two cores in the larger group are still idle. This is caused
by the prefer_sibling path, which really wants to see both groups run
the same number of tasks.

>> The task is added to the 2nd scheduler group, as the scheduler has the
>> assumption that scheduler groups are of the same size, so they should
>> also host the same number of tasks.
>> [...]
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>          * group's child domain.
>>          */
>>         if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> -           busiest->sum_nr_running > local->sum_nr_running + 1)
>> +           busiest->sum_nr_running * local->group_weight >
>> +                       local->sum_nr_running * busiest->group_weight + 1)
>
> This is the prefer_sibling path. Could it be that you should disable
> prefer_siling between your sockets for such topology ? the default
> path compares the number of idle CPUs when groups has spare capacity

If I disable prefer_sibling (I played around with it a bit), I run into
the problem that the tasks are distributed s.t. each group has the same
amount of idle CPUs, which yields distributions similar to this:

00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
 x  x  x  x  x  x  x  x

Now both groups have 4 idle CPUs, which fulfills the criteria imposed
by the load balancer, but the larger group is now running SMT while the
smaller one is just idle.

So, in this asymmetric setup, both criteria seem to not work in an
optimal way. Going for the same number of idle CPUs or alternatively
for the same number of running processes both cause a sub-optimal
distribution of tasks, leading to unnecessary SMT.

It seems also to be possible to address the regular load balancing path
by aiming to have the same unused capacity between groups instead of
the same number of idle CPUs. This seems to have been considered in the
past, but the choice went in favor of the number of idle CPUs. Since
this decision was actively taken already, I focused on the
prefer_sibling path.

The question now would be how to address this properly (or if I'm
missing something here). As mentioned in the cover letter, this was the
most simplistic and least invasive approach I could find; others might
be more sophisticated but also have some side-effects.

I have a bit of a hard time leaving this one as-is, as it just
introduces two additional multiplications with no effect for most
architectures. Maybe an architecture-specific inline function that the
compiler can optimize away if not needed?

>> goto force_balance;
>>
>> if (busiest->group_type != group_overloaded) {
>> --
>> 2.34.1
On Mon, May 15, 2023 at 01:46:01PM +0200, Tobias Huschle wrote:
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs.
> [...]
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures.

Mostly; there's historically the cpuset case where we can create
lopsided groups like that. And today we're growing all these hybrid
things that will also tickle this, except they're looking towards
different balancer extensions to deal with the IPC difference, so they
might not be immediately caring about this here issue.

> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>          * group's child domain.
>          */
>         if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -           busiest->sum_nr_running > local->sum_nr_running + 1)
> +           busiest->sum_nr_running * local->group_weight >
> +                       local->sum_nr_running * busiest->group_weight + 1)

Should that not be:

	busiest->group_weight * (local->sum_nr_running + 1)

?

I'm not opposed to this -- it seems fairly straightforward.

>                 goto force_balance;
>
>         if (busiest->group_type != group_overloaded) {
> --
> 2.34.1
Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :
> On 2023-05-16 15:36, Vincent Guittot wrote:
> > [...]
> > Your description is a bit confusing for me. What you name CPU above
> > should be named Core, doesn' it ?
> >
> > Could you share with us your scheduler topology ?
>
> You are correct, it should say core instead of CPU.
>
> One actual configuration from one of my test machines (lscpu -e):
> [...]
> So, 6 cores / 12 CPUs in one group and 2 cores / 4 CPUs in the other.

Thanks for the details.

> If I run stress-ng with 8 cpu stressors on the original code I get a
> distribution like this:
>
> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
> x x x x x x x x
>
> Which means that the two cores in the smaller group are running into
> SMT while two cores in the larger group are still idle. This is caused
> by the prefer_sibling path which really wants to see both groups run
> the same number of tasks.

Yes, and it considers that there are the same number of CPUs per group.

> > > [...]
> > > @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > >          * group's child domain.
> > >          */
> > >         if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
> > > +           busiest->sum_nr_running * local->group_weight >
> > > +                       local->sum_nr_running * busiest->group_weight + 1)

What you want to test here is that moving 1 task from busiest to the
local group would help and balance the ratio of tasks per CPU:

	(busiest->sum_nr_running - 1) / busiest->group_weight >
	(local->sum_nr_running + 1) / local->group_weight

which can be developed into:

	busiest->sum_nr_running * local->group_weight >=
	local->sum_nr_running * busiest->group_weight +
	busiest->group_weight + local->group_weight

And you also have to change how we calculate the imbalance, which
currently just provides half of the diff of nr_running, by something
like:

	(busiest->sum_nr_running * local->group_weight -
	 local->sum_nr_running * busiest->group_weight) /
	(busiest->group_weight + local->group_weight)

> > This is the prefer_sibling path. Could it be that you should disable
> > prefer_siling between your sockets for such topology ? the default
> > path compares the number of idle CPUs when groups has spare capacity
>
> If I disable prefer_sibling (I played around with it a bit), I run
> into the problem that the tasks are distributed s.t. each group has
> the same amount of idle CPUs, which yields distributions similar to
> this:
>
> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
> x x x x x x x x
>
> Now both groups have 4 idle CPUs which fulfills the criteria imposed
> by the load balancer, but the larger group is now running SMT while
> the smaller one is just idle.
>
> So, in this asymmetric setup, both criteria seem to not work in an
> optimal way. [...]

There is the same behavior and assumption here too.

> It seems also to be possible to address the regular load balancing
> path by aiming to have the same unused capacity between groups instead
> of the same number of idle CPUs. This seems to have been considered in
> the past, but the choice went in favor of the number of idle CPUs.

Unused capacity doesn't give the instantaneous state, so a group can be
idle but without unused capacity.

> Since this decision was actively taken already, I focused on the
> prefer_sibling path.
>
> The question now would be how to address this properly (or if I'm
> missing something here). As mentioned in the cover letter, this was
> the most simplistic and least invasive approach I could find, others
> might be more sophisticated but also have some side-effects.
>
> I have a bit of a hard time leaving this one as-is, as it just
> introduces two additional multiplications with no effect for most
> architectures. Maybe an architecture-specific inline function that the
> compiler can optimize away if not needed?
>
> > >                 goto force_balance;
> > >
> > >         if (busiest->group_type != group_overloaded) {
On 5/15/23 5:16 PM, Tobias Huschle wrote: > The current load balancer implementation implies that scheduler groups, > within the same domain, all host the same number of CPUs. This is > reflected in the condition, that a scheduler group, which is load > balancing and classified as having spare capacity, should pull work > from the busiest group, if the local group runs less processes than > the busiest one. This implies that these two groups should run the > same number of processes, which is problematic if the groups are not > of the same size. > > The assumption that scheduler groups within the same scheduler domain > host the same number of CPUs appears to be true for non-s390 > architectures. Nevertheless, s390 can have scheduler groups of unequal > size. > > This introduces a performance degredation in the following scenario: > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, > the remaining 2 are located on another socket: > > Socket -----1----- -2- > CPU 1 2 3 4 5 6 7 8 > > Placing some workload ( x = one task ) yields the following > scenarios: > > The first 5 tasks are distributed evenly across the two groups. > > Socket -----1----- -2- > CPU 1 2 3 4 5 6 7 8 > x x x x x > > Adding a 6th task yields the following distribution: > > Socket -----1----- -2- > CPU 1 2 3 4 5 6 7 8 > SMT1 x x x x x > SMT2 x > > The task is added to the 2nd scheduler group, as the scheduler has the > assumption that scheduler groups are of the same size, so they should > also host the same number of tasks. This makes CPU 7 run into SMT > thread, which comes with a performance penalty. This means, that in > the window of 6-8 tasks, load balancing is done suboptimally, because > SMT is used although there is no reason to do so as fully idle CPUs > are still available. 
> > Taking the weight of the scheduler groups into account, ensures that > a load balancing CPU within a smaller group will not try to pull tasks > from a bigger group while the bigger group still has idle CPUs > available. > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> > --- > kernel/sched/fair.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 48b6f0ca13ac..b1307d7e4065 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env) > * group's child domain. > */ > if (sds.prefer_sibling && local->group_type == group_has_spare && > - busiest->sum_nr_running > local->sum_nr_running + 1) > + busiest->sum_nr_running * local->group_weight > > + local->sum_nr_running * busiest->group_weight + 1) > goto force_balance; > I assume its SMT2 here. sched group is enclosed in [busy_cpus/idle_cpus/weight] Lets take an example: we will begin the with the said imbalance. [3/9/12] -- local group [3/1/4] -- busy group. we will evaluate 3*12 > (4*(3+1)) is true and proceeds further to balance. but calculate_imbalance will lead to zero as the imbalance no? in case of prefer sibling case it find the difference of sum_nr_running of the two groups. It will be 3-3 = 0 this would need modifications to calculate_imbalance. Maybe something like this will help? NOT TESTED AT ALL. 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a80a73909dc2..e027d4086edc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s return; } - if (busiest->group_weight == 1 || sds->prefer_sibling) { + if (busiest->group_weight == 1 || + (sds->prefer_sibling && + busiest->group_weight == local->group_weight)) { unsigned int nr_diff = busiest->sum_nr_running; /* * When prefer sibling, evenly spread running tasks on @@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s env->migration_type = migrate_task; lsub_positive(&nr_diff, local->sum_nr_running); env->imbalance = nr_diff; + } + if (sds->prefer_sibling && + busiest->group_weight != local->group_weight) { + env->migration_type = migrate_task; + env->imbalance = 1; } else { --------------------------------------------------------------------------------------------------- On a tangential dimension. Since prefer_siblings make inherent assumption that all groups have equal weight, it will cause complications when group_weights are different. I think that becomes very tricky when there are more than two groups. Depending on which cpu is doing load_balance there can be unnecessary migrations. Currently even in NUMA this can be similar case right? There will be unequal number of CPU's right? If we fix that case and remove prefer siblings in your arch, will that work? Maybe something like this? NOT TESTED AT ALL. 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..fc6377f48ced 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10514,6 +10514,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto out_balanced;
 
 	if (busiest->group_weight > 1 &&
+	    busiest->group_weight == local->group_weight &&
 	    local->idle_cpus <= (busiest->idle_cpus + 1))
 		/*
 		 * If the busiest group is not overloaded
@@ -10526,6 +10527,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		 */
 		goto out_balanced;
 
+	if ((busiest->group_weight != local->group_weight) &&
+	    (local->idle_cpus * busiest->group_weight <=
+	     local->group_weight * (busiest->idle_cpus + 1)))
+		goto out_balanced;
+
 	if (busiest->sum_h_nr_running == 1)
 		/*
 		 * busiest doesn't have any tasks waiting to run

> 	if (busiest->group_type != group_overloaded) {
On 2023-07-05 09:52, Vincent Guittot wrote: > Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit : >> On 2023-05-16 15:36, Vincent Guittot wrote: >> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com> >> > wrote: >> > > >> > > The current load balancer implementation implies that scheduler >> > > groups, >> > > within the same domain, all host the same number of CPUs. This is >> > > reflected in the condition, that a scheduler group, which is load >> > > balancing and classified as having spare capacity, should pull work >> > > from the busiest group, if the local group runs less processes than >> > > the busiest one. This implies that these two groups should run the >> > > same number of processes, which is problematic if the groups are not >> > > of the same size. >> > > >> > > The assumption that scheduler groups within the same scheduler domain >> > > host the same number of CPUs appears to be true for non-s390 >> > > architectures. Nevertheless, s390 can have scheduler groups of unequal >> > > size. >> > > >> > > This introduces a performance degredation in the following scenario: >> > > >> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, >> > > the remaining 2 are located on another socket: >> > > >> > > Socket -----1----- -2- >> > > CPU 1 2 3 4 5 6 7 8 >> > > >> > > Placing some workload ( x = one task ) yields the following >> > > scenarios: >> > > >> > > The first 5 tasks are distributed evenly across the two groups. >> > > >> > > Socket -----1----- -2- >> > > CPU 1 2 3 4 5 6 7 8 >> > > x x x x x >> > > >> > > Adding a 6th task yields the following distribution: >> > > >> > > Socket -----1----- -2- >> > > CPU 1 2 3 4 5 6 7 8 >> > > SMT1 x x x x x >> > > SMT2 x >> > >> > Your description is a bit confusing for me. What you name CPU above >> > should be named Core, doesn' it ? >> > >> > Could you share with us your scheduler topology ? >> > >> >> You are correct, it should say core instead of CPU. 
>> >> One actual configuration from one of my test machines (lscpu -e): >> > > [...] > >> >> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other. > > Thaks for the details > >> >> If I run stress-ng with 8 cpu stressors on the original code I get a >> distribution >> like this: >> >> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 >> x x x x x x x x >> >> Which means that the two cores in the smaller group are running into >> SMT >> while two >> cores in the larger group are still idle. This is caused by the >> prefer_sibling path >> which really wants to see both groups run the same number of tasks. > > yes and it considers that there are the same number of CPUs per group > >> >> > > >> > > The task is added to the 2nd scheduler group, as the scheduler has the >> > > assumption that scheduler groups are of the same size, so they should >> > > also host the same number of tasks. This makes CPU 7 run into SMT >> > > thread, which comes with a performance penalty. This means, that in >> > > the window of 6-8 tasks, load balancing is done suboptimally, because >> > > SMT is used although there is no reason to do so as fully idle CPUs >> > > are still available. >> > > >> > > Taking the weight of the scheduler groups into account, ensures that >> > > a load balancing CPU within a smaller group will not try to pull tasks >> > > from a bigger group while the bigger group still has idle CPUs >> > > available. >> > > >> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> >> > > --- >> > > kernel/sched/fair.c | 3 ++- >> > > 1 file changed, 2 insertions(+), 1 deletion(-) >> > > >> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> > > index 48b6f0ca13ac..b1307d7e4065 100644 >> > > --- a/kernel/sched/fair.c >> > > +++ b/kernel/sched/fair.c >> > > @@ -10426,7 +10426,8 @@ static struct sched_group >> > > *find_busiest_group(struct lb_env *env) >> > > * group's child domain. 
>> > > */ >> > > if (sds.prefer_sibling && local->group_type == >> > > group_has_spare && >> > > - busiest->sum_nr_running > local->sum_nr_running + 1) >> > > + busiest->sum_nr_running * local->group_weight > >> > > + local->sum_nr_running * >> > > busiest->group_weight + 1) > > > what you want to test here is that moving 1 task from busiest to local > group > would help and balance the ratio of tasks per cpu > > (busiest->sum_nr_running - 1) / busiest->group_weight > > (local->sum_nr_running + 1) / local->group_weight > > which can be develop into > > busiest->sum_nr_running * local->group_weight >= local->sum_nr_running > * busiest->group_weight + busiest->group_weight + local->group_weight > > and you also have to change how we calculate the imbalance which just > provide the half of the diff of nr_running > > by something like > > (busiest->sum_nr_running * local->group_weight) - > (local->sum_nr_running * busiest->group_weight) / > (busiest->group_weight + local->group_weight) > Ahh right, I had a look at the imbalance part now and your suggestion works pretty well. Just had to make some minor adjustments so far. Nice side effect is, that this allows the load balancer to behave exactly the same as before in the cases where local->group_weight == busiest->group_weight. The behavior only changes for the case where the groups are not of equal size. I will figure out a solution and send a patch soon which incorporates these adjustments plus a more detailed description. Thanks for the feedback. >> > >> > This is the prefer_sibling path. Could it be that you should disable >> > prefer_siling between your sockets for such topology ? the default >> > path compares the number of idle CPUs when groups has spare capacity >> > >> > >> >> If I disable prefer_sibling (I played around with it a bit), I run >> into the >> problem, >> that the tasks are distributed s.t. 
each group has the same amount of >> idle >> CPUs, which >> yields distributions similar to this: >> >> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 >> x x x x x x x x >> >> Now both groups have 4 idle CPUs which fulfills the criteria imposed >> by the >> load balancer, >> but the larger group is now running SMT while the smaller one is just >> idle. >> >> So, in this asymmetric setup, both criteria seem to not work in an >> optimal >> way. Going for >> the same number of idle CPUs or alternatively for the same number of >> running >> processes >> both cause a sub-optimal distribution of tasks, leading to unnecessary >> SMT. > > there is the same behavior and assumption here too > > >> >> It seems also to be possible to address the regular load balancing >> path by >> aiming to have the >> same unused capacity between groups instead of the same number of idle >> CPUs. >> This seems to >> have been considered in the past, but the choice went in favor of the >> number >> of idle CPUs. > > unused capacity doesn't give the instantaneous state so a group can be > idle but without > unused capacity > >> Since this decision was actively taken already, I focused on the >> prefer_sibling path. >> >> The question now would be how to address this properly (or if I'm >> missing >> something here). >> As mentioned in the cover letter, this was the most simplistic and >> least >> invasive approach >> I could find, others might be more sophisticated but also have some >> side-effects. >> >> I have a bit of a hard time leaving this one as-is, as it just >> introduces >> two additional >> multiplications with no effect for most architectures. Maybe an >> architectures specific >> inline function that the compiler can optimize away if not needed? >> >> > > goto force_balance; >> > > >> > > if (busiest->group_type != group_overloaded) { >> > > -- >> > > 2.34.1 >> > > >>
On 2023-07-04 15:40, Peter Zijlstra wrote: > On Mon, May 15, 2023 at 01:46:01PM +0200, Tobias Huschle wrote: >> The current load balancer implementation implies that scheduler >> groups, >> within the same domain, all host the same number of CPUs. This is >> reflected in the condition, that a scheduler group, which is load >> balancing and classified as having spare capacity, should pull work >> from the busiest group, if the local group runs less processes than >> the busiest one. This implies that these two groups should run the >> same number of processes, which is problematic if the groups are not >> of the same size. >> >> The assumption that scheduler groups within the same scheduler domain >> host the same number of CPUs appears to be true for non-s390 >> architectures. > > Mostly; there's historically the cpuset case where we can create > lopsided groups like that. And today we're growing all these hybrid > things that will also tickle this, except they're looking towards > different balancer extentions to deal with the IPC difference so might > not be immediately caring about this here issue. > > >> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> >> --- >> kernel/sched/fair.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 48b6f0ca13ac..b1307d7e4065 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -10426,7 +10426,8 @@ static struct sched_group >> *find_busiest_group(struct lb_env *env) >> * group's child domain. >> */ >> if (sds.prefer_sibling && local->group_type == group_has_spare && >> - busiest->sum_nr_running > local->sum_nr_running + 1) >> + busiest->sum_nr_running * local->group_weight > >> + local->sum_nr_running * busiest->group_weight + 1) > > Should that not be: busiest->group_weight * (local->sum_nr_running + 1) > ? 
I agree, adding the brackets makes more sense and is clearer about what's intended by this check, while also preserving the original behavior for local->group_weight == busiest->group_weight.

>
> I'm not opposed to this -- it seems fairly straight forward.

Appreciated, I will go ahead and send a patch once I have incorporated the other feedback I got. Thanks.

>
>> 		goto force_balance;
>>
>> 	if (busiest->group_type != group_overloaded) {
>> --
>> 2.34.1
>>
On 2023-07-06 19:19, Shrikanth Hegde wrote: > On 5/15/23 5:16 PM, Tobias Huschle wrote: >> The current load balancer implementation implies that scheduler >> groups, >> within the same domain, all host the same number of CPUs. This is >> reflected in the condition, that a scheduler group, which is load >> balancing and classified as having spare capacity, should pull work >> from the busiest group, if the local group runs less processes than >> the busiest one. This implies that these two groups should run the >> same number of processes, which is problematic if the groups are not >> of the same size. >> >> The assumption that scheduler groups within the same scheduler domain >> host the same number of CPUs appears to be true for non-s390 >> architectures. Nevertheless, s390 can have scheduler groups of unequal >> size. >> >> This introduces a performance degredation in the following scenario: >> >> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, >> the remaining 2 are located on another socket: >> >> Socket -----1----- -2- >> CPU 1 2 3 4 5 6 7 8 >> >> Placing some workload ( x = one task ) yields the following >> scenarios: >> >> The first 5 tasks are distributed evenly across the two groups. >> >> Socket -----1----- -2- >> CPU 1 2 3 4 5 6 7 8 >> x x x x x >> >> Adding a 6th task yields the following distribution: >> >> Socket -----1----- -2- >> CPU 1 2 3 4 5 6 7 8 >> SMT1 x x x x x >> SMT2 x >> >> The task is added to the 2nd scheduler group, as the scheduler has the >> assumption that scheduler groups are of the same size, so they should >> also host the same number of tasks. This makes CPU 7 run into SMT >> thread, which comes with a performance penalty. This means, that in >> the window of 6-8 tasks, load balancing is done suboptimally, because >> SMT is used although there is no reason to do so as fully idle CPUs >> are still available. 
>> >> Taking the weight of the scheduler groups into account, ensures that >> a load balancing CPU within a smaller group will not try to pull tasks >> from a bigger group while the bigger group still has idle CPUs >> available. >> >> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> >> --- >> kernel/sched/fair.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 48b6f0ca13ac..b1307d7e4065 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -10426,7 +10426,8 @@ static struct sched_group >> *find_busiest_group(struct lb_env *env) >> * group's child domain. >> */ >> if (sds.prefer_sibling && local->group_type == group_has_spare && >> - busiest->sum_nr_running > local->sum_nr_running + 1) >> + busiest->sum_nr_running * local->group_weight > >> + local->sum_nr_running * busiest->group_weight + 1) >> goto force_balance; >> > > > I assume its SMT2 here. sched group is enclosed in > [busy_cpus/idle_cpus/weight] > > Lets take an example: we will begin the with the said imbalance. > [3/9/12] -- local group > [3/1/4] -- busy group. > we will evaluate 3*12 > (4*(3+1)) is true and proceeds further to > balance. > but calculate_imbalance will lead to zero as the imbalance no? in case > of prefer sibling > case it find the difference of sum_nr_running of the two groups. It > will be 3-3 = 0 > > this would need modifications to calculate_imbalance. > Maybe something like this will help? NOT TESTED AT ALL. 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a80a73909dc2..e027d4086edc 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct > lb_env *env, struct sd_lb_stats *s > return; > } > > - if (busiest->group_weight == 1 || sds->prefer_sibling) > { > + if (busiest->group_weight == 1 || > + (sds->prefer_sibling && > + busiest->group_weight == local->group_weight)) > { > unsigned int nr_diff = busiest->sum_nr_running; > /* > * When prefer sibling, evenly spread running > tasks on > @@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct > lb_env *env, struct sd_lb_stats *s > env->migration_type = migrate_task; > lsub_positive(&nr_diff, local->sum_nr_running); > env->imbalance = nr_diff; > + } > + if (sds->prefer_sibling && > + busiest->group_weight != local->group_weight) { > + env->migration_type = migrate_task; > + env->imbalance = 1; > } else { > I also had a look at this when Vincent pointed out that this part is missing. The formula proposed by Vincent works pretty well, it is not even necessary to add additional if-statements here. > > --------------------------------------------------------------------------------------------------- > On a tangential dimension. > > > Since prefer_siblings make inherent assumption that all groups have > equal weight, > it will cause complications when group_weights are different. I think > that becomes very > tricky when there are more than two groups. Depending on which cpu is > doing > load_balance there can be unnecessary migrations. > > > Currently even in NUMA this can be similar case right? There will be > unequal number of CPU's right? > If we fix that case and remove prefer siblings in your arch, will that > work? > > Maybe something like this? NOT TESTED AT ALL. 
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..fc6377f48ced 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10514,6 +10514,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		goto out_balanced;
>  
>  	if (busiest->group_weight > 1 &&
> +	    busiest->group_weight == local->group_weight &&
>  	    local->idle_cpus <= (busiest->idle_cpus + 1))
>  		/*
>  		 * If the busiest group is not overloaded
> @@ -10526,6 +10527,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		 */
>  		goto out_balanced;
>  
> +	if ((busiest->group_weight != local->group_weight) &&
> +	    (local->idle_cpus * busiest->group_weight <=
> +	     local->group_weight * (busiest->idle_cpus + 1)))
> +		goto out_balanced;
> +
>  	if (busiest->sum_h_nr_running == 1)
>  		/*
>  		 * busiest doesn't have any tasks waiting to run
>
>> 	if (busiest->group_type != group_overloaded) {

I played around with alternate solutions as well; yours could be interesting in order to essentially declare the problematic state as balanced. I wouldn't be opposed to removing prefer_sibling. The question would be whether it is necessary to address both scenarios anyway.
On 7/7/23 1:14 PM, Tobias Huschle wrote: > On 2023-07-05 09:52, Vincent Guittot wrote: >> Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit : >>> On 2023-05-16 15:36, Vincent Guittot wrote: >>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com> >>> > wrote: >>> > > >>> > > The current load balancer implementation implies that scheduler >>> > > groups, >>> > > within the same domain, all host the same number of CPUs. This is >>> > > reflected in the condition, that a scheduler group, which is load >>> > > balancing and classified as having spare capacity, should pull work >>> > > from the busiest group, if the local group runs less processes than >>> > > the busiest one. This implies that these two groups should run the >>> > > same number of processes, which is problematic if the groups are not >>> > > of the same size. >>> > > >>> > > The assumption that scheduler groups within the same scheduler >>> domain >>> > > host the same number of CPUs appears to be true for non-s390 >>> > > architectures. Nevertheless, s390 can have scheduler groups of >>> unequal >>> > > size. >>> > > >>> > > This introduces a performance degredation in the following scenario: >>> > > >>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, >>> > > the remaining 2 are located on another socket: >>> > > >>> > > Socket -----1----- -2- >>> > > CPU 1 2 3 4 5 6 7 8 >>> > > >>> > > Placing some workload ( x = one task ) yields the following >>> > > scenarios: >>> > > >>> > > The first 5 tasks are distributed evenly across the two groups. >>> > > >>> > > Socket -----1----- -2- >>> > > CPU 1 2 3 4 5 6 7 8 >>> > > x x x x x >>> > > >>> > > Adding a 6th task yields the following distribution: >>> > > >>> > > Socket -----1----- -2- >>> > > CPU 1 2 3 4 5 6 7 8 >>> > > SMT1 x x x x x >>> > > SMT2 x >>> > >>> > Your description is a bit confusing for me. What you name CPU above >>> > should be named Core, doesn' it ? 
>>> > >>> > Could you share with us your scheduler topology ? >>> > >>> >>> You are correct, it should say core instead of CPU. >>> >>> One actual configuration from one of my test machines (lscpu -e): >>> >> >> [...] >> >>> >>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other. >> >> Thaks for the details >> >>> >>> If I run stress-ng with 8 cpu stressors on the original code I get a >>> distribution >>> like this: >>> >>> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 >>> x x x x x x x x >>> >>> Which means that the two cores in the smaller group are running into SMT >>> while two >>> cores in the larger group are still idle. This is caused by the >>> prefer_sibling path >>> which really wants to see both groups run the same number of tasks. >> >> yes and it considers that there are the same number of CPUs per group >> >>> >>> > > >>> > > The task is added to the 2nd scheduler group, as the scheduler >>> has the >>> > > assumption that scheduler groups are of the same size, so they >>> should >>> > > also host the same number of tasks. This makes CPU 7 run into SMT >>> > > thread, which comes with a performance penalty. This means, that in >>> > > the window of 6-8 tasks, load balancing is done suboptimally, >>> because >>> > > SMT is used although there is no reason to do so as fully idle CPUs >>> > > are still available. >>> > > >>> > > Taking the weight of the scheduler groups into account, ensures that >>> > > a load balancing CPU within a smaller group will not try to pull >>> tasks >>> > > from a bigger group while the bigger group still has idle CPUs >>> > > available. 
>>> > > >>> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> >>> > > --- >>> > > kernel/sched/fair.c | 3 ++- >>> > > 1 file changed, 2 insertions(+), 1 deletion(-) >>> > > >>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> > > index 48b6f0ca13ac..b1307d7e4065 100644 >>> > > --- a/kernel/sched/fair.c >>> > > +++ b/kernel/sched/fair.c >>> > > @@ -10426,7 +10426,8 @@ static struct sched_group >>> > > *find_busiest_group(struct lb_env *env) >>> > > * group's child domain. >>> > > */ >>> > > if (sds.prefer_sibling && local->group_type == >>> > > group_has_spare && >>> > > - busiest->sum_nr_running > local->sum_nr_running + 1) >>> > > + busiest->sum_nr_running * local->group_weight > >>> > > + local->sum_nr_running * >>> > > busiest->group_weight + 1) >> >> >> what you want to test here is that moving 1 task from busiest to local >> group >> would help and balance the ratio of tasks per cpu >> >> (busiest->sum_nr_running - 1) / busiest->group_weight > >> (local->sum_nr_running + 1) / local->group_weight >> >> which can be develop into >> >> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running >> * busiest->group_weight + busiest->group_weight + local->group_weight >> >> and you also have to change how we calculate the imbalance which just >> provide the half of the diff of nr_running >> >> by something like >> >> (busiest->sum_nr_running * local->group_weight) - >> (local->sum_nr_running * busiest->group_weight) / >> (busiest->group_weight + local->group_weight) >> > > Ahh right, I had a look at the imbalance part now and your suggestion works > pretty well. Just had to make some minor adjustments so far. > Nice side effect is, that this allows the load balancer to behave > exactly the > same as before in the cases where local->group_weight == > busiest->group_weight. > The behavior only changes for the case where the groups are not of equal > size. 
Not sure if it has been figured out/discussed already, but pointing out one possible scenario.

Taking the formula:
busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight

and calculate_imbalance:
((busiest->sum_nr_running * local->group_weight) - (local->sum_nr_running * busiest->group_weight)) / (busiest->group_weight + local->group_weight)

First, let's say the imbalance was like this, same example as before. A sched_group is given as [busy_cpus/idle_cpus/group_weight]:
[3/9/12] - local group
[3/1/4]  - busy group

3*12 >= 3*4+12+4 --> true, and the imbalance would be (3*12-3*4)/(12+4) -- 24/16 --> let's say 1.
We will balance, good.

[4/8/12]
[2/2/4]
There will be no further load balances. Good.

A new process comes; it would be scheduled on the [4/8/12] sched group as that would be the idlest:
[5/7/12]
[2/2/4]

A process running on [2/2/4] exits. Then, in this scenario, do you expect the balance to happen again, since balancing would result in optimal performance?
[5/7/12] - busy group
[1/3/4]  - local group
5*4 >= 1*12+12+4 --> will not balance.

[5/7/12] - local group
[1/3/4]  - busy group
1*12 >= 5*4+12+4 --> will not balance.

Does this scenario need to be handled as well?

>
>> I will figure out a solution and send a patch soon which incorporates these adjustments plus a more detailed description.
>>
>> Thanks for the feedback.
>>
>>> > This is the prefer_sibling path. Could it be that you should disable
>>> > prefer_sibling between your sockets for such topology ? the default
>>> > path compares the number of idle CPUs when groups has spare capacity
>>> >
>>>
>>> If I disable prefer_sibling (I played around with it a bit), I run into the problem
>>> that the tasks are distributed s.t.
each group has the same amount of >>> idle >>> CPUs, which >>> yields distributions similar to this: >>> >>> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 >>> x x x x x x x x >>> >>> Now both groups have 4 idle CPUs which fulfills the criteria imposed >>> by the >>> load balancer, >>> but the larger group is now running SMT while the smaller one is just >>> idle. >>> >>> So, in this asymmetric setup, both criteria seem to not work in an >>> optimal >>> way. Going for >>> the same number of idle CPUs or alternatively for the same number of >>> running >>> processes >>> both cause a sub-optimal distribution of tasks, leading to >>> unnecessary SMT. >> >> there is the same behavior and assumption here too >> >> >>> >>> It seems also to be possible to address the regular load balancing >>> path by >>> aiming to have the >>> same unused capacity between groups instead of the same number of >>> idle CPUs. >>> This seems to >>> have been considered in the past, but the choice went in favor of the >>> number >>> of idle CPUs. >> >> unused capacity doesn't give the instantaneous state so a group can be >> idle but without >> unused capacity >> >>> Since this decision was actively taken already, I focused on the >>> prefer_sibling path. >>> >>> The question now would be how to address this properly (or if I'm >>> missing >>> something here). >>> As mentioned in the cover letter, this was the most simplistic and least >>> invasive approach >>> I could find, others might be more sophisticated but also have some >>> side-effects. >>> >>> I have a bit of a hard time leaving this one as-is, as it just >>> introduces >>> two additional >>> multiplications with no effect for most architectures. Maybe an >>> architectures specific >>> inline function that the compiler can optimize away if not needed? >>> >>> > > goto force_balance; >>> > > >>> > > if (busiest->group_type != group_overloaded) { >>> > > -- >>> > > 2.34.1 >>> > > >>>
On 2023-07-07 16:33, Shrikanth Hegde wrote: > On 7/7/23 1:14 PM, Tobias Huschle wrote: >> On 2023-07-05 09:52, Vincent Guittot wrote: >>> Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit : >>>> On 2023-05-16 15:36, Vincent Guittot wrote: >>>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com> >>>> > wrote: >>>> > > >>>> > > The current load balancer implementation implies that scheduler >>>> > > groups, >>>> > > within the same domain, all host the same number of CPUs. This is >>>> > > reflected in the condition, that a scheduler group, which is load >>>> > > balancing and classified as having spare capacity, should pull work >>>> > > from the busiest group, if the local group runs less processes than >>>> > > the busiest one. This implies that these two groups should run the >>>> > > same number of processes, which is problematic if the groups are not >>>> > > of the same size. >>>> > > >>>> > > The assumption that scheduler groups within the same scheduler >>>> domain >>>> > > host the same number of CPUs appears to be true for non-s390 >>>> > > architectures. Nevertheless, s390 can have scheduler groups of >>>> unequal >>>> > > size. >>>> > > >>>> > > This introduces a performance degredation in the following scenario: >>>> > > >>>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, >>>> > > the remaining 2 are located on another socket: >>>> > > >>>> > > Socket -----1----- -2- >>>> > > CPU 1 2 3 4 5 6 7 8 >>>> > > >>>> > > Placing some workload ( x = one task ) yields the following >>>> > > scenarios: >>>> > > >>>> > > The first 5 tasks are distributed evenly across the two groups. >>>> > > >>>> > > Socket -----1----- -2- >>>> > > CPU 1 2 3 4 5 6 7 8 >>>> > > x x x x x >>>> > > >>>> > > Adding a 6th task yields the following distribution: >>>> > > >>>> > > Socket -----1----- -2- >>>> > > CPU 1 2 3 4 5 6 7 8 >>>> > > SMT1 x x x x x >>>> > > SMT2 x >>>> > >>>> > Your description is a bit confusing for me. 
On 7/7/23 9:29 PM, Tobias Huschle wrote:
> On 2023-07-07 16:33, Shrikanth Hegde wrote:
>> On 7/7/23 1:14 PM, Tobias Huschle wrote:
>>> On 2023-07-05 09:52, Vincent Guittot wrote:
>>>> On Monday, 05 June 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
>>>>> On 2023-05-16 15:36, Vincent Guittot wrote:
>>>>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com>
>>>>> > wrote:
>>>>> > >
>>>>> > > The current load balancer implementation implies that scheduler
>>>>> > > groups within the same domain all host the same number of CPUs.
>>>>> > > This is reflected in the condition that a scheduler group which
>>>>> > > is load balancing and classified as having spare capacity should
>>>>> > > pull work from the busiest group if the local group runs fewer
>>>>> > > processes than the busiest one. This implies that these two
>>>>> > > groups should run the same number of processes, which is
>>>>> > > problematic if the groups are not of the same size.
>>>>> > >
>>>>> > > The assumption that scheduler groups within the same scheduler
>>>>> > > domain host the same number of CPUs appears to be true for
>>>>> > > non-s390 architectures. Nevertheless, s390 can have scheduler
>>>>> > > groups of unequal size.
>>>>> > >
>>>>> > > This introduces a performance degradation in the following
>>>>> > > scenario:
>>>>> > >
>>>>> > > Consider a system with 8 CPUs, where 6 CPUs are located on one
>>>>> > > CPU socket and the remaining 2 are located on another socket:
>>>>> > >
>>>>> > > Socket   -----1-----    -2-
>>>>> > > CPU      1 2 3 4 5 6    7 8
>>>>> > >
>>>>> > > Placing some workload (x = one task) yields the following
>>>>> > > scenarios:
>>>>> > >
>>>>> > > The first 5 tasks are distributed evenly across the two groups.
>>>>> > > >>>>> > > Socket -----1----- -2- >>>>> > > CPU 1 2 3 4 5 6 7 8 >>>>> > > x x x x x >>>>> > > >>>>> > > Adding a 6th task yields the following distribution: >>>>> > > >>>>> > > Socket -----1----- -2- >>>>> > > CPU 1 2 3 4 5 6 7 8 >>>>> > > SMT1 x x x x x >>>>> > > SMT2 x >>>>> > >>>>> > Your description is a bit confusing for me. What you name CPU above >>>>> > should be named Core, doesn' it ? >>>>> > >>>>> > Could you share with us your scheduler topology ? >>>>> > >>>>> >>>>> You are correct, it should say core instead of CPU. >>>>> >>>>> One actual configuration from one of my test machines (lscpu -e): >>>>> >>>> >>>> [...] >>>> >>>>> >>>>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other. >>>> >>>> Thaks for the details >>>> >>>>> >>>>> If I run stress-ng with 8 cpu stressors on the original code I get a >>>>> distribution >>>>> like this: >>>>> >>>>> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 >>>>> x x x x x x x x >>>>> >>>>> Which means that the two cores in the smaller group are running >>>>> into SMT >>>>> while two >>>>> cores in the larger group are still idle. This is caused by the >>>>> prefer_sibling path >>>>> which really wants to see both groups run the same number of tasks. >>>> >>>> yes and it considers that there are the same number of CPUs per group >>>> >>>>> >>>>> > > >>>>> > > The task is added to the 2nd scheduler group, as the scheduler >>>>> has the >>>>> > > assumption that scheduler groups are of the same size, so they >>>>> should >>>>> > > also host the same number of tasks. This makes CPU 7 run into SMT >>>>> > > thread, which comes with a performance penalty. This means, >>>>> that in >>>>> > > the window of 6-8 tasks, load balancing is done suboptimally, >>>>> because >>>>> > > SMT is used although there is no reason to do so as fully idle >>>>> CPUs >>>>> > > are still available. 
>>>>> > > >>>>> > > Taking the weight of the scheduler groups into account, ensures >>>>> that >>>>> > > a load balancing CPU within a smaller group will not try to pull >>>>> tasks >>>>> > > from a bigger group while the bigger group still has idle CPUs >>>>> > > available. >>>>> > > >>>>> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> >>>>> > > --- >>>>> > > kernel/sched/fair.c | 3 ++- >>>>> > > 1 file changed, 2 insertions(+), 1 deletion(-) >>>>> > > >>>>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>>> > > index 48b6f0ca13ac..b1307d7e4065 100644 >>>>> > > --- a/kernel/sched/fair.c >>>>> > > +++ b/kernel/sched/fair.c >>>>> > > @@ -10426,7 +10426,8 @@ static struct sched_group >>>>> > > *find_busiest_group(struct lb_env *env) >>>>> > > * group's child domain. >>>>> > > */ >>>>> > > if (sds.prefer_sibling && local->group_type == >>>>> > > group_has_spare && >>>>> > > - busiest->sum_nr_running > local->sum_nr_running + 1) >>>>> > > + busiest->sum_nr_running * local->group_weight > >>>>> > > + local->sum_nr_running * >>>>> > > busiest->group_weight + 1) >>>> >>>> >>>> what you want to test here is that moving 1 task from busiest to local >>>> group >>>> would help and balance the ratio of tasks per cpu >>>> >>>> (busiest->sum_nr_running - 1) / busiest->group_weight > >>>> (local->sum_nr_running + 1) / local->group_weight >>>> >>>> which can be develop into >>>> >>>> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running >>>> * busiest->group_weight + busiest->group_weight + local->group_weight >>>> >>>> and you also have to change how we calculate the imbalance which just >>>> provide the half of the diff of nr_running >>>> >>>> by something like >>>> >>>> (busiest->sum_nr_running * local->group_weight) - >>>> (local->sum_nr_running * busiest->group_weight) / >>>> (busiest->group_weight + local->group_weight) >>>> >>> >>> Ahh right, I had a look at the imbalance part now and your suggestion >>> works >>> pretty well. 
>>> Just had to make some minor adjustments so far.
>>> A nice side effect is that this allows the load balancer to behave
>>> exactly the same as before in the cases where local->group_weight ==
>>> busiest->group_weight. The behavior only changes for the case where
>>> the groups are not of equal size.
>>
>> Not sure if it has been figured out/discussed already, pointing out
>> one possible scenario.
>>
>> Taking the formulas:
>>
>> busiest->sum_nr_running * local->group_weight >=
>> local->sum_nr_running * busiest->group_weight +
>> busiest->group_weight + local->group_weight
>>
>> and calculate_imbalance:
>>
>> ((busiest->sum_nr_running * local->group_weight) -
>>  (local->sum_nr_running * busiest->group_weight)) /
>> (busiest->group_weight + local->group_weight)
>>
> I was considering just using the imbalance as an indicator of whether
> we should balance or not, i.e. check if the second formula yields a
> value greater than 0. Will have to play around with that a bit though.
>
>> First, let's say the imbalance was like this, same example as before,
>> sched_group in [busy_cpus/idle_cpus/group_weight]:
>> [3/9/12] - local group.
>> [3/1/4] - busy group.
>>
>> 3*12 >= 3*4+12+4 --> true, and the imbalance would be
>> (3*12-3*4)/(12+4) = 24/16, let's say 1.
>> We will balance, good.
>>
>> [4/8/12]
>> [2/2/4]
>> There will not be further load balances. Good.
>>
>> A new process comes; it would be scheduled on the [4/8/12] sched group
>> as that would be the idlest.
>> [5/7/12]
>> [2/2/4]
>>
>> A process running on [2/2/4] exits. In this scenario, do you expect
>> the balance to happen again, since balancing would result in optimal
>> performance?
>> [5/7/12] - busy group
>> [1/3/4] - local group
>>
>> 5*4 >= 1*12+12+4 --> will not balance.
>>
>> [5/7/12] - local group
>> [1/3/4] - busy group
>> 1*12 >= 5*4+12+4 --> will not balance.
>>
>> Does this scenario need to be handled as well?
>
> So, from an SMT standpoint, we would not need to balance here; both
> groups should not run into SMT. Now, would it be beneficial to balance
> anyway? Now we have:
> [5/7/12] -> 42% busy
> [1/3/4] -> 25% busy
>
> If we would now balance and move one task around, we would get either
> [6/6/12] -> 50% busy
> [0/4/4] -> 0% busy
> or
> [4/8/12] -> 33% busy
> [2/2/4] -> 50% busy
>
> The first case probably does not make that much sense (unless we have
> a workload which would benefit from, say, cache locality) and we want
> everything to run in one group.
> The second case brings the smaller group right onto the edge of using
> SMT, while also creating the possibility (depending on the algorithm
> we would use) that the larger group will now attempt to pull work from
> the smaller group again, ending up in a back and forth between the
> two. This is obviously also true for the first variant.
>
> Could you maybe elaborate on what you meant by optimal performance?
>

I assumed it might be optimal to have both groups run at more or less
the same utilization point. Now that I read your description, it makes
sense; load balancing may not be needed in this case. I did a few
combinations for the check-to-balance condition, and I think it holds
good (haven't done all the cases). Sorry for the noise.

>>
>>> I will figure out a solution and send a patch soon which incorporates
>>> these adjustments plus a more detailed description.
>>>
>>> Thanks for the feedback.
>>>
>>>>> >
>>>>> > This is the prefer_sibling path. Could it be that you should
>>>>> > disable prefer_sibling between your sockets for such a topology?
>>>>> > The default path compares the number of idle CPUs when groups
>>>>> > have spare capacity.
>>>>> >
>>>>>
>>>>> If I disable prefer_sibling (I played around with it a bit), I run
>>>>> into the problem that the tasks are distributed s.t.
>>>>> each group has the same number of idle CPUs, which yields
>>>>> distributions similar to this:
>>>>>
>>>>> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
>>>>>  x  x  x  x  x  x  x  x
>>>>>
>>>>> Now both groups have 4 idle CPUs, which fulfills the criteria
>>>>> imposed by the load balancer, but the larger group is now running
>>>>> SMT while the smaller one is just idle.
>>>>>
>>>>> So, in this asymmetric setup, both criteria seem to not work in an
>>>>> optimal way. Going for the same number of idle CPUs, or
>>>>> alternatively for the same number of running processes, both cause
>>>>> a sub-optimal distribution of tasks, leading to unnecessary SMT.
>>>>
>>>> there is the same behavior and assumption here too
>>>>
>>>>> It also seems to be possible to address the regular load balancing
>>>>> path by aiming to have the same unused capacity between groups
>>>>> instead of the same number of idle CPUs. This seems to have been
>>>>> considered in the past, but the choice went in favor of the number
>>>>> of idle CPUs.
>>>>
>>>> unused capacity doesn't give the instantaneous state, so a group can
>>>> be idle but without unused capacity
>>>>
>>>>> Since this decision was actively taken already, I focused on the
>>>>> prefer_sibling path.
>>>>>
>>>>> The question now would be how to address this properly (or whether
>>>>> I'm missing something here). As mentioned in the cover letter, this
>>>>> was the most simplistic and least invasive approach I could find;
>>>>> others might be more sophisticated but also have some side effects.
>>>>>
>>>>> I have a bit of a hard time leaving this one as-is, as it just
>>>>> introduces two additional multiplications with no effect for most
>>>>> architectures. Maybe an architecture-specific inline function that
>>>>> the compiler can optimize away if not needed?
>>>>>
>>>>> > >  		goto force_balance;
>>>>> > >
>>>>> > >  	if (busiest->group_type != group_overloaded) {
>>>>> > > --
>>>>> > > 2.34.1
>>>>> > >
>>>>>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * group's child domain.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_nr_running > local->sum_nr_running + 1)
+	    busiest->sum_nr_running * local->group_weight >
+			local->sum_nr_running * busiest->group_weight + 1)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {