From patchwork Wed Oct 18 20:45:10 2023
X-Patchwork-Submitter: Mathieu Desnoyers
X-Patchwork-Id: 155143
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Ingo Molnar,
    Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
    Swapnil Sapkal, Aaron Lu, Chen Yu, Tim Chen, K Prateek Nayak,
    "Gautham R. Shenoy", x86@kernel.org
Shenoy" , x86@kernel.org Subject: [RFC PATCH 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature Date: Wed, 18 Oct 2023 16:45:10 -0400 Message-Id: <20231018204511.1563390-2-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20231018204511.1563390-1-mathieu.desnoyers@efficios.com> References: <20231018204511.1563390-1-mathieu.desnoyers@efficios.com> MIME-Version: 1.0 X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on morse.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (morse.vger.email [0.0.0.0]); Wed, 18 Oct 2023 13:46:08 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1780127598472253461 X-GMAIL-MSGID: 1780127598472253461 Introduce the UTIL_FITS_CAPACITY scheduler feature. The runqueue selection picks the previous, target, or recent runqueues if they have enough remaining capacity to enqueue the task before scanning for an idle cpu. This feature is introduced in preparation for the SELECT_BIAS_PREV scheduler feature. Its performance benefits are noticeable when combined with the SELECT_BIAS_PREV feature. The following benchmarks only cover the UTIL_FITS_CAPACITY feature. Those are performed on a v6.5.5 kernel with mitigations=off. The following hackbench workload on a 192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets) keeps relatively the same wall time (49s). hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 We can observe that the number of migrations is reduced significantly with this patch (improvement): Baseline: 117M cpu-migrations (9.355 K/sec) With patch: 67M cpu-migrations (5.470 K/sec) The task-clock utilization is reduced (degradation): Baseline: 253.275 CPUs utilized With patch: 223.130 CPUs utilized The number of context-switches is increased (degradation): Baseline: 445M context-switches (35.516 K/sec) With patch: 581M context-switches (47.548 K/sec) So the improvement due to reduction of migrations is countered by the degradation in CPU utilization and context-switches. The following SELECT_BIAS_PREV feature will address this. 
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Swapnil Sapkal
Cc: Aaron Lu
Cc: Chen Yu
Cc: Tim Chen
Cc: K Prateek Nayak
Cc: Gautham R. Shenoy
Cc: x86@kernel.org
---
 kernel/sched/fair.c     | 49 ++++++++++++++++++++++++++++++++++++-----
 kernel/sched/features.h |  6 +++++
 kernel/sched/sched.h    |  5 +++++
 3 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d9c2482c5a3..8058058afb11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4497,6 +4497,37 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	trace_sched_util_est_se_tp(&p->se);
 }
 
+/*
+ * Returns true if adding the task utilization to the estimated
+ * utilization of the runnable tasks on @cpu does not exceed the
+ * capacity of @cpu.
+ *
+ * This considers only the utilization of _runnable_ tasks on the @cpu
+ * runqueue, excluding blocked and sleeping tasks. This is achieved by
+ * using the runqueue util_est.enqueued, and by estimating the capacity
+ * of @cpu based on arch_scale_cpu_capacity and arch_scale_thermal_pressure
+ * rather than capacity_of() because capacity_of() considers
+ * blocked/sleeping tasks in other scheduler classes.
+ *
+ * The utilization vs capacity comparison is done without the margin
+ * provided by fits_capacity(), because fits_capacity() is used to
+ * validate whether the utilization of a task fits within the overall
+ * capacity of a cpu, whereas this function validates whether the task
+ * utilization fits within the _remaining_ capacity of the cpu, which is
+ * more precise.
+ */
+static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
+						     int cpu)
+{
+	unsigned long total_util, capacity;
+
+	if (!sched_util_fits_capacity_active())
+		return false;
+	total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
+	capacity = arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu);
+	return total_util <= capacity;
+}
+
 static inline int util_fits_cpu(unsigned long util,
 				unsigned long uclamp_min,
 				unsigned long uclamp_max,
@@ -7124,12 +7155,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	int i, recent_used_cpu;
 
 	/*
-	 * On asymmetric system, update task utilization because we will check
-	 * that the task fits with cpu's capacity.
+	 * With the UTIL_FITS_CAPACITY feature and on asymmetric system,
+	 * update task utilization because we will check that the task
+	 * fits with cpu's capacity.
 	 */
-	if (sched_asym_cpucap_active()) {
+	if (sched_util_fits_capacity_active() || sched_asym_cpucap_active()) {
 		sync_entity_load_avg(&p->se);
 		task_util = task_util_est(p);
+	}
+	if (sched_asym_cpucap_active()) {
 		util_min = uclamp_eff_value(p, UCLAMP_MIN);
 		util_max = uclamp_eff_value(p, UCLAMP_MAX);
 	}
@@ -7139,7 +7173,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
-	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+	if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
+	    task_fits_remaining_cpu_capacity(task_util, target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
@@ -7147,7 +7182,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
 	if (prev != target && cpus_share_cache(prev, target) &&
-	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
+	     task_fits_remaining_cpu_capacity(task_util, prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
 
@@ -7173,7 +7209,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu) ||
+	     task_fits_remaining_cpu_capacity(task_util, recent_used_cpu)) &&
 	    cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
 		return recent_used_cpu;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..9a84a1401123 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true)
 SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
 
+/*
+ * Select the previous, target, or recent runqueue if they have enough
+ * remaining capacity to enqueue the task. Requires UTIL_EST.
+ */
+SCHED_FEAT(UTIL_FITS_CAPACITY, true)
+
 SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..463e75084aed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2090,6 +2090,11 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 
 #endif /* SCHED_DEBUG */
 
+static __always_inline bool sched_util_fits_capacity_active(void)
+{
+	return sched_feat(UTIL_EST) && sched_feat(UTIL_FITS_CAPACITY);
+}
+
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;
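A usage note for benchmarking (not part of the patch; the path is assumed
to be the standard scheduler-features debugfs location on kernels built
with SCHED_DEBUG): the new feature should be toggleable at run time like
any other SCHED_FEAT entry, which makes A/B comparisons easy:

  echo NO_UTIL_FITS_CAPACITY > /sys/kernel/debug/sched/features
  echo UTIL_FITS_CAPACITY    > /sys/kernel/debug/sched/features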
From patchwork Wed Oct 18 20:45:11 2023
X-Patchwork-Submitter: Mathieu Desnoyers
X-Patchwork-Id: 155142
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Ingo Molnar,
    Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
    Swapnil Sapkal, Aaron Lu, Chen Yu, Tim Chen, K Prateek Nayak,
    "Gautham R. Shenoy", x86@kernel.org
Shenoy" , x86@kernel.org Subject: [RFC PATCH 2/2] sched/fair: Introduce SELECT_BIAS_PREV to reduce migrations Date: Wed, 18 Oct 2023 16:45:11 -0400 Message-Id: <20231018204511.1563390-3-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20231018204511.1563390-1-mathieu.desnoyers@efficios.com> References: <20231018204511.1563390-1-mathieu.desnoyers@efficios.com> MIME-Version: 1.0 X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on fry.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (fry.vger.email [0.0.0.0]); Wed, 18 Oct 2023 13:46:04 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1780127596177331135 X-GMAIL-MSGID: 1780127596177331135 Introduce the SELECT_BIAS_PREV scheduler feature to reduce the task migration rate. It needs to be used with the UTIL_FITS_CAPACITY scheduler feature to show benchmark improvements. For scenarios where the system is under-utilized (CPUs are partly idle), eliminate frequent task migrations from CPUs with sufficient remaining capacity left to completely idle CPUs by introducing a bias towards the previous CPU if it is idle or has enough capacity left. For scenarios where the system is fully or over-utilized (CPUs are almost never idle), favor the previous CPU (rather than the target CPU) if all CPUs are busy to minimize migrations. (suggested by Chen Yu) The following benchmarks are performed on a v6.5.5 kernel with mitigations=off. This speeds up the following hackbench workload on a 192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets): hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 from 49s to 26s. (47% speedup) Elapsed time comparison: Baseline: 48-49 s UTIL_FITS_CAPACITY: 45-50 s SELECT_BIAS_PREV: 48-50 s UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 26-27 s We can observe that the number of migrations is reduced significantly (-93%) with this patch, which may explain the speedup: Baseline: 117M cpu-migrations (9.355 K/sec) UTIL_FITS_CAPACITY: 67M cpu-migrations (5.470 K/sec) SELECT_BIAS_PREV: 75M cpu-migrations (5.674 K/sec) UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 8M cpu-migrations (0.928 K/sec) The CPU utilization is also improved: Baseline: 253.275 CPUs utilized UTIL_FITS_CAPACITY: 223.130 CPUs utilized SELECT_BIAS_PREV: 276.526 CPUs utilized UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 309.872 CPUs utilized Interestingly, the rate of context switch increases with the patch, but it does not appear to be an issue performance-wise: Baseline: 445M context-switches (35.516 K/sec) UTIL_FITS_CAPACITY: 581M context-switches (47.548 K/sec) SELECT_BIAS_PREV: 655M context-switches (49.074 K/sec) UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 597M context-switches (35.516 K/sec) This was developed as part of the investigation into a weird regression reported by AMD where adding a raw spinlock in the scheduler context switch accelerated hackbench. It turned out that changing this raw spinlock for a loop of 10000x cpu_relax within do_idle() had similar benefits. This patch achieves a similar effect without the busy-waiting by allowing select_task_rq to favor the previously used CPUs based on the utilization of that CPU. Feedback is welcome. 
Feedback is welcome. I am especially interested to learn whether this
patch has positive or detrimental effects on the performance of other
workloads.

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Swapnil Sapkal
Cc: Aaron Lu
Cc: Chen Yu
Cc: Tim Chen
Cc: K Prateek Nayak
Cc: Gautham R. Shenoy
Cc: x86@kernel.org
---
 kernel/sched/fair.c     | 28 ++++++++++++++++++++++++++--
 kernel/sched/features.h |  6 ++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8058058afb11..741d53b18d23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7173,15 +7173,30 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
+	/*
+	 * With the SELECT_BIAS_PREV feature, if the previous CPU is
+	 * cache affine and the task fits within the prev cpu capacity,
+	 * prefer the previous CPU to the target CPU to inhibit costly
+	 * task migration.
+	 */
+	if (sched_feat(SELECT_BIAS_PREV) &&
+	    (prev == target || cpus_share_cache(prev, target)) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
+	     task_fits_remaining_cpu_capacity(task_util, prev)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, prev))
+		return prev;
+
 	if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
 	    task_fits_remaining_cpu_capacity(task_util, target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
 	/*
-	 * If the previous CPU is cache affine and idle, don't be stupid:
+	 * Without the SELECT_BIAS_PREV feature, use the previous CPU if
+	 * it is cache affine and idle if the target cpu is not idle.
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (!sched_feat(SELECT_BIAS_PREV) &&
+	    prev != target && cpus_share_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
 	     task_fits_remaining_cpu_capacity(task_util, prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
 
@@ -7254,6 +7269,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	/*
+	 * With the SELECT_BIAS_PREV feature, if the previous CPU is
+	 * cache affine, prefer the previous CPU when all CPUs are busy
+	 * to inhibit migration.
+	 */
+	if (sched_feat(SELECT_BIAS_PREV) &&
+	    prev != target && cpus_share_cache(prev, target))
+		return prev;
+
 	return target;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9a84a1401123..a56d6861c553 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,6 +103,12 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
  */
 SCHED_FEAT(UTIL_FITS_CAPACITY, true)
 
+/*
+ * Bias runqueue selection towards the previous runqueue over the target
+ * runqueue.
+ */
+SCHED_FEAT(SELECT_BIAS_PREV, true)
+
 SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)