From patchwork Thu Oct 19 16:05:22 2023
X-Patchwork-Submitter: Mathieu Desnoyers
X-Patchwork-Id: 155645
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Ingo Molnar,
    Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
    Swapnil Sapkal, Aaron Lu, Chen Yu, Tim Chen, K Prateek Nayak,
    Gautham R. Shenoy, x86@kernel.org
Shenoy" , x86@kernel.org Subject: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2) Date: Thu, 19 Oct 2023 12:05:22 -0400 Message-Id: <20231019160523.1582101-2-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20231019160523.1582101-1-mathieu.desnoyers@efficios.com> References: <20231019160523.1582101-1-mathieu.desnoyers@efficios.com> MIME-Version: 1.0 X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on fry.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (fry.vger.email [0.0.0.0]); Thu, 19 Oct 2023 09:05:48 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1780200589348893531 X-GMAIL-MSGID: 1780200589348893531 Introduce the UTIL_FITS_CAPACITY scheduler feature. The runqueue selection picks the previous, target, or recent runqueues if they have enough remaining capacity to enqueue the task before scanning for an idle cpu. This feature is introduced in preparation for the SELECT_BIAS_PREV scheduler feature. The following benchmarks only cover the UTIL_FITS_CAPACITY feature. Those are performed on a v6.5.5 kernel with mitigations=off. The following hackbench workload on a 192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets) improves the wall time from 49s to 40s (18% speedup). hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 We can observe that the number of migrations is reduced significantly with this patch (improvement): Baseline: 117M cpu-migrations (9.355 K/sec) With patch: 47M cpu-migrations (3.977 K/sec) The task-clock utilization is increased (improvement): Baseline: 253.275 CPUs utilized With patch: 271.367 CPUs utilized The number of context-switches is increased (degradation): Baseline: 445M context-switches (35.516 K/sec) With patch: 586M context-switches (48.823 K/sec) Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/ Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/ Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/ Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/ Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/ Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/ Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/ Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/ Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/ Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/ Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/ Link: https://lore.kernel.org/lkml/20231018204511.1563390-1-mathieu.desnoyers@efficios.com/ Signed-off-by: Mathieu Desnoyers Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Vincent Guittot Cc: Juri Lelli Cc: Swapnil Sapkal Cc: Aaron Lu Cc: Chen Yu Cc: Tim Chen Cc: K Prateek Nayak Cc: Gautham R . 
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231018204511.1563390-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Swapnil Sapkal
Cc: Aaron Lu
Cc: Chen Yu
Cc: Tim Chen
Cc: K Prateek Nayak
Cc: Gautham R. Shenoy
Cc: x86@kernel.org
---
Changes since v1:
- Use scale_rt_capacity(),
- Use fits_capacity(), which leaves 20% unused capacity to account for
  metrics inaccuracy.
---
 kernel/sched/fair.c     | 40 ++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |  6 ++++++
 kernel/sched/sched.h    |  5 +++++
 3 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d9c2482c5a3..cc86d1ffeb27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4497,6 +4497,28 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	trace_sched_util_est_se_tp(&p->se);
 }
 
+static unsigned long scale_rt_capacity(int cpu);
+
+/*
+ * Returns true if adding the task utilization to the estimated
+ * utilization of the runnable tasks on @cpu does not exceed the
+ * capacity of @cpu.
+ *
+ * This considers only the utilization of _runnable_ tasks on the @cpu
+ * runqueue, excluding blocked and sleeping tasks. This is achieved by
+ * using the runqueue util_est.enqueued.
+ */
+static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
+						    int cpu)
+{
+	unsigned long total_util;
+
+	if (!sched_util_fits_capacity_active())
+		return false;
+	total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
+	return fits_capacity(total_util, scale_rt_capacity(cpu));
+}
+
 static inline int util_fits_cpu(unsigned long util,
 				unsigned long uclamp_min,
 				unsigned long uclamp_max,
@@ -7124,12 +7146,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	int i, recent_used_cpu;
 
 	/*
-	 * On asymmetric system, update task utilization because we will check
-	 * that the task fits with cpu's capacity.
+	 * With the UTIL_FITS_CAPACITY feature and on asymmetric system,
+	 * update task utilization because we will check that the task
+	 * fits with cpu's capacity.
	 */
-	if (sched_asym_cpucap_active()) {
+	if (sched_util_fits_capacity_active() || sched_asym_cpucap_active()) {
 		sync_entity_load_avg(&p->se);
 		task_util = task_util_est(p);
+	}
+	if (sched_asym_cpucap_active()) {
 		util_min = uclamp_eff_value(p, UCLAMP_MIN);
 		util_max = uclamp_eff_value(p, UCLAMP_MAX);
 	}
@@ -7139,7 +7164,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
-	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+	if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
+	     task_fits_remaining_cpu_capacity(task_util, target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
@@ -7147,7 +7173,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
 	if (prev != target && cpus_share_cache(prev, target) &&
-	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
+	     task_fits_remaining_cpu_capacity(task_util, prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
 
@@ -7173,7 +7200,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu) ||
+	     task_fits_remaining_cpu_capacity(task_util, recent_used_cpu)) &&
 	    cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
 		return recent_used_cpu;

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..9a84a1401123 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true)
 SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
 
+/*
+ * Select the previous, target, or recent runqueue if they have enough
+ * remaining capacity to enqueue the task. Requires UTIL_EST.
+ */
+SCHED_FEAT(UTIL_FITS_CAPACITY, true)
+
 SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..463e75084aed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2090,6 +2090,11 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 
 #endif /* SCHED_DEBUG */
 
+static __always_inline bool sched_util_fits_capacity_active(void)
+{
+	return sched_feat(UTIL_EST) && sched_feat(UTIL_FITS_CAPACITY);
+}
+
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;
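A note on the 20% headroom mentioned in the changelog above:
fits_capacity() is a pre-existing helper in kernel/sched/fair.c that
compares utilization scaled by 1280 against capacity scaled by 1024, so
a task "fits" only while the summed utilization stays below roughly 80%
of the CPU's capacity. A minimal standalone sketch of the arithmetic
(fits_with_margin() is an illustrative stand-in, not the kernel code
itself):

  #include <stdbool.h>

  /* Same check as fits_capacity(cap, max): cap * 1280 < max * 1024,
   * i.e. cap < ~0.8 * max, in pure integer arithmetic. */
  static inline bool fits_with_margin(unsigned long util,
                                      unsigned long capacity)
  {
          return util * 1280 < capacity * 1024;
  }

For example, with capacity 1024 and a runqueue util_est of 600, a task
with utilization 200 fits (800 * 1280 = 1024000 < 1048576), while a
task with utilization 220 does not (820 * 1280 = 1049600).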
From patchwork Thu Oct 19 16:05:23 2023
X-Patchwork-Submitter: Mathieu Desnoyers
X-Patchwork-Id: 155646
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Ingo Molnar,
    Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
    Swapnil Sapkal, Aaron Lu, Chen Yu, Tim Chen, K Prateek Nayak,
    Gautham R. Shenoy, x86@kernel.org
Shenoy" , x86@kernel.org Subject: [RFC PATCH v2 2/2] sched/fair: Introduce SELECT_BIAS_PREV to reduce migrations Date: Thu, 19 Oct 2023 12:05:23 -0400 Message-Id: <20231019160523.1582101-3-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20231019160523.1582101-1-mathieu.desnoyers@efficios.com> References: <20231019160523.1582101-1-mathieu.desnoyers@efficios.com> MIME-Version: 1.0 X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lipwig.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (lipwig.vger.email [0.0.0.0]); Thu, 19 Oct 2023 09:06:04 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1780200601869575267 X-GMAIL-MSGID: 1780200601869575267 Introduce the SELECT_BIAS_PREV scheduler feature to reduce the task migration rate. It needs to be used with the UTIL_FITS_CAPACITY scheduler feature to show benchmark improvements. For scenarios where the system is under-utilized (CPUs are partly idle), eliminate frequent task migrations from CPUs with sufficient remaining capacity left to completely idle CPUs by introducing a bias towards the previous CPU if it is idle or has enough capacity left. For scenarios where the system is fully or over-utilized (CPUs are almost never idle), favor the previous CPU (rather than the target CPU) if all CPUs are busy to minimize migrations. (suggested by Chen Yu) The following benchmarks are performed on a v6.5.5 kernel with mitigations=off. This speeds up the following hackbench workload on a 192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets): hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 from 49s to 29s. (41% speedup) Elapsed time comparison: Baseline: 48-49 s UTIL_FITS_CAPACITY: 39-40 s SELECT_BIAS_PREV: 48-50 s UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 29-30 s We can observe that the number of migrations is reduced significantly (-80%) with this patch, which may explain the speedup: Baseline: 117M cpu-migrations (9.355 K/sec) UTIL_FITS_CAPACITY: 48M cpu-migrations (3.977 K/sec) SELECT_BIAS_PREV: 75M cpu-migrations (5.674 K/sec) UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 23M cpu-migrations (2.316 K/sec) The CPU utilization is also improved: Baseline: 253.275 CPUs utilized UTIL_FITS_CAPACITY: 271.367 CPUs utilized SELECT_BIAS_PREV: 276.526 CPUs utilized UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 303.356 CPUs utilized Interestingly, the rate of context switch increases with the patch, but it does not appear to be an issue performance-wise: Baseline: 445M context-switches (35.516 K/sec) UTIL_FITS_CAPACITY: 586M context-switches (48.823 K/sec) SELECT_BIAS_PREV: 655M context-switches (49.074 K/sec) UTIL_FITS_CAPACITY+SELECT_BIAS_PREV: 571M context-switches (57.714 K/sec) This was developed as part of the investigation into a weird regression reported by AMD where adding a raw spinlock in the scheduler context switch accelerated hackbench. It turned out that changing this raw spinlock for a loop of 10000x cpu_relax within do_idle() had similar benefits. This patch achieves a similar effect without the busy-waiting by allowing select_task_rq to favor the previously used CPUs based on the utilization of that CPU. Feedback is welcome. 
Feedback is welcome. I am especially interested to learn whether this
patch has positive or detrimental effects on the performance of other
workloads.

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231018204511.1563390-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Swapnil Sapkal
Cc: Aaron Lu
Cc: Chen Yu
Cc: Tim Chen
Cc: K Prateek Nayak
Cc: Gautham R. Shenoy
Cc: x86@kernel.org
---
 kernel/sched/fair.c     | 28 ++++++++++++++++++++++++++--
 kernel/sched/features.h |  6 ++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc86d1ffeb27..0aae1f15cb0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7164,15 +7164,30 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
+	/*
+	 * With the SELECT_BIAS_PREV feature, if the previous CPU is
+	 * cache affine and the task fits within the prev cpu capacity,
+	 * prefer the previous CPU to the target CPU to inhibit costly
+	 * task migration.
+	 */
+	if (sched_feat(SELECT_BIAS_PREV) &&
+	    (prev == target || cpus_share_cache(prev, target)) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
+	     task_fits_remaining_cpu_capacity(task_util, prev)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, prev))
+		return prev;
+
 	if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
 	     task_fits_remaining_cpu_capacity(task_util, target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
 	/*
-	 * If the previous CPU is cache affine and idle, don't be stupid:
+	 * Without the SELECT_BIAS_PREV feature, use the previous CPU if
+	 * it is cache affine and idle if the target cpu is not idle.
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (!sched_feat(SELECT_BIAS_PREV) &&
+	    prev != target && cpus_share_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
 	     task_fits_remaining_cpu_capacity(task_util, prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
@@ -7245,6 +7260,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	/*
+	 * With the SELECT_BIAS_PREV feature, if the previous CPU is
+	 * cache affine, prefer the previous CPU when all CPUs are busy
+	 * to inhibit migration.
+	 */
+	if (sched_feat(SELECT_BIAS_PREV) &&
+	    prev != target && cpus_share_cache(prev, target))
+		return prev;
+
 	return target;
 }

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9a84a1401123..a56d6861c553 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,6 +103,12 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
  */
 SCHED_FEAT(UTIL_FITS_CAPACITY, true)
 
+/*
+ * Bias runqueue selection towards the previous runqueue over the target
+ * runqueue.
+ */
+SCHED_FEAT(SELECT_BIAS_PREV, true)
+
 SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
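Both features are plain SCHED_FEAT toggles, so on a kernel built with
CONFIG_SCHED_DEBUG they can be flipped at runtime for A/B comparisons
such as the tables above. A sketch, assuming debugfs is mounted at the
usual /sys/kernel/debug:

  # Disable for a baseline run, then re-enable:
  echo NO_UTIL_FITS_CAPACITY > /sys/kernel/debug/sched/features
  echo NO_SELECT_BIAS_PREV > /sys/kernel/debug/sched/features
  echo UTIL_FITS_CAPACITY > /sys/kernel/debug/sched/features
  echo SELECT_BIAS_PREV > /sys/kernel/debug/sched/features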