From patchwork Tue Dec 12 00:31:34 2023

From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
    youssefesmat@google.com, joelaf@google.com, roman.gushchin@linux.dev,
    yu.c.chen@intel.com, kprateek.nayak@amd.com, gautham.shenoy@amd.com,
    aboorvad@linux.vnet.ibm.com, wuyun.abel@bytedance.com, tj@kernel.org,
    kernel-team@meta.com
Subject: [PATCH v4 1/8] sched: Expose move_queued_task() from core.c
Date: Mon, 11 Dec 2023 18:31:34 -0600
Message-ID: <20231212003141.216236-2-void@manifault.com>
In-Reply-To: <20231212003141.216236-1-void@manifault.com>

The migrate_task_to() function exposed from kernel/sched/core.c migrates the
current task, which is silently assumed to also be its first argument, to the
specified CPU. The function uses stop_one_cpu() to migrate the task to the
target CPU, which won't work if @p is not the current task, as the
stop_one_cpu() callback isn't invoked on remote CPUs.

While this is sufficient for task_numa_migrate() in fair.c, it would be useful
to give move_queued_task() in core.c external linkage, as it can be used to
migrate any task to a CPU.

A follow-on patch will call move_queued_task() from fair.c when migrating a
task in a shared runqueue to a remote CPU.

Suggested-by: Peter Zijlstra
Signed-off-by: David Vernet
---
 kernel/sched/core.c  | 4 ++--
 kernel/sched/sched.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f..fb6f505d5792 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2518,8 +2518,8 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
  *
  * Returns (locked) new rq. Old rq's lock is released.
  */
-static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
-                                   struct task_struct *p, int new_cpu)
+struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+                            struct task_struct *p, int new_cpu)
 {
         lockdep_assert_rq_held(rq);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77d..5afdbd7e2381 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1787,6 +1787,9 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)

 #ifdef CONFIG_SMP
+
+extern struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+                                   struct task_struct *p, int new_cpu);

 static inline void
 queue_balance_callback(struct rq *rq,
                        struct balance_callback *head,
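As an illustration of the contract being exported, a minimal sketch (not part
of this patch; the helper name and affinity check are hypothetical, while the
locking rules come from move_queued_task() itself):

        /* Hypothetical caller: migrate queued task @p from @rq to @target_cpu. */
        static struct rq *pull_task_to_cpu(struct rq *rq, struct rq_flags *rf,
                                           struct task_struct *p, int target_cpu)
        {
                lockdep_assert_rq_held(rq);

                /* Respect the task's affinity before migrating it. */
                if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
                        return rq;

                /* Releases @rq's lock and returns @target_cpu's rq, locked. */
                return move_queued_task(rq, rf, p, target_cpu);
        }

Unlike migrate_task_to(), nothing here requires @p to be the current task; the
caller only needs to hold the lock of the rq that @p is queued on.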
From patchwork Tue Dec 12 00:31:35 2023

From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
    youssefesmat@google.com, joelaf@google.com, roman.gushchin@linux.dev,
    yu.c.chen@intel.com, kprateek.nayak@amd.com, gautham.shenoy@amd.com,
    aboorvad@linux.vnet.ibm.com, wuyun.abel@bytedance.com, tj@kernel.org,
    kernel-team@meta.com
Subject: [PATCH v4 2/8] sched: Move is_cpu_allowed() into sched.h
Date: Mon, 11 Dec 2023 18:31:35 -0600
Message-ID: <20231212003141.216236-3-void@manifault.com>
In-Reply-To: <20231212003141.216236-1-void@manifault.com>

is_cpu_allowed() exists as a static inline function in core.c. The
functionality offered by is_cpu_allowed() is useful to scheduling policies as
well, e.g. to determine whether a runnable task can be migrated to another
core that would otherwise go idle. Let's move it to sched.h.

Signed-off-by: David Vernet
---
 kernel/sched/core.c  | 31 -------------------------------
 kernel/sched/sched.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb6f505d5792..9ad7f0255e14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -48,7 +48,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -2469,36 +2468,6 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
         return rq->nr_pinned;
 }

-/*
- * Per-CPU kthreads are allowed to run on !active && online CPUs, see
- * __set_cpus_allowed_ptr() and select_fallback_rq().
- */
-static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
-{
-        /* When not in the task's cpumask, no point in looking further. */
-        if (!cpumask_test_cpu(cpu, p->cpus_ptr))
-                return false;
-
-        /* migrate_disabled() must be allowed to finish. */
-        if (is_migration_disabled(p))
-                return cpu_online(cpu);
-
-        /* Non kernel threads are not allowed during either online or offline. */
-        if (!(p->flags & PF_KTHREAD))
-                return cpu_active(cpu) && task_cpu_possible(cpu, p);
-
-        /* KTHREAD_IS_PER_CPU is always allowed. */
-        if (kthread_is_per_cpu(p))
-                return cpu_online(cpu);
-
-        /* Regular kernel threads don't get to stay during offline. */
-        if (cpu_dying(cpu))
-                return false;
-
-        /* But are allowed during online. */
-        return cpu_online(cpu);
-}
-
 /*
  * This is how migration works:
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5afdbd7e2381..53fe2294eec7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1206,6 +1207,36 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }

+/*
+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
+ * __set_cpus_allowed_ptr() and select_fallback_rq().
+ */
+static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
+{
+        /* When not in the task's cpumask, no point in looking further. */
+        if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+                return false;
+
+        /* migrate_disabled() must be allowed to finish. */
+        if (is_migration_disabled(p))
+                return cpu_online(cpu);
+
+        /* Non kernel threads are not allowed during either online or offline. */
+        if (!(p->flags & PF_KTHREAD))
+                return cpu_active(cpu) && task_cpu_possible(cpu, p);
+
+        /* KTHREAD_IS_PER_CPU is always allowed. */
+        if (kthread_is_per_cpu(p))
+                return cpu_online(cpu);
+
+        /* Regular kernel threads don't get to stay during offline. */
+        if (cpu_dying(cpu))
+                return false;
+
+        /* But are allowed during online. */
+        return cpu_online(cpu);
+}
+
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

 #define cpu_rq(cpu)             (&per_cpu(runqueues, (cpu)))
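As a sketch of the kind of policy-side use the changelog alludes to (the
helper name and selection loop are hypothetical; is_cpu_allowed(), idle_cpu()
and for_each_cpu() are existing kernel APIs):

        /* Hypothetical: find an allowed, idle CPU in @cpus for @p, or -1 if none. */
        static int find_allowed_idle_cpu(struct task_struct *p,
                                         const struct cpumask *cpus)
        {
                int cpu;

                for_each_cpu(cpu, cpus) {
                        if (is_cpu_allowed(p, cpu) && idle_cpu(cpu))
                                return cpu;
                }

                return -1;
        }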
From patchwork Tue Dec 12 00:31:36 2023

From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
    youssefesmat@google.com, joelaf@google.com, roman.gushchin@linux.dev,
    yu.c.chen@intel.com, kprateek.nayak@amd.com, gautham.shenoy@amd.com,
    aboorvad@linux.vnet.ibm.com, wuyun.abel@bytedance.com, tj@kernel.org,
    kernel-team@meta.com
Subject: [PATCH v4 3/8] sched: Tighten unpinned rq lock window in newidle_balance()
Date: Mon, 11 Dec 2023 18:31:36 -0600
Message-ID: <20231212003141.216236-4-void@manifault.com>
In-Reply-To: <20231212003141.216236-1-void@manifault.com>

In newidle_balance(), we may drop and reacquire the rq lock in the
load-balance phase of the function. We currently do this before we check
rq->rd->overload or rq->avg_idle, which is unnecessary. Let's tighten the
window where we call rq_unpin_lock().

Suggested-by: K Prateek Nayak
Signed-off-by: David Vernet
---
 kernel/sched/fair.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d..e1b676bb1fed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12296,14 +12296,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
         if (!cpu_active(this_cpu))
                 return 0;

-        /*
-         * This is OK, because current is on_cpu, which avoids it being picked
-         * for load-balance and preemption/IRQs are still disabled avoiding
-         * further scheduler activity on it and we're being very careful to
-         * re-start the picking loop.
-         */
-        rq_unpin_lock(this_rq, rf);
-
         rcu_read_lock();
         sd = rcu_dereference_check_sched_domain(this_rq->sd);

@@ -12318,6 +12310,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
         }
         rcu_read_unlock();

+        /*
+         * This is OK, because current is on_cpu, which avoids it being picked
+         * for load-balance and preemption/IRQs are still disabled avoiding
+         * further scheduler activity on it and we're being very careful to
+         * re-start the picking loop.
+         */
+        rq_unpin_lock(this_rq, rf);
         raw_spin_rq_unlock(this_rq);

         t0 = sched_clock_cpu(this_cpu);
@@ -12358,6 +12357,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
         rcu_read_unlock();

         raw_spin_rq_lock(this_rq);
+        rq_repin_lock(this_rq, rf);

         if (curr_cost > this_rq->max_idle_balance_cost)
                 this_rq->max_idle_balance_cost = curr_cost;
@@ -12384,8 +12384,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
         else
                 nohz_newidle_balance(this_rq);

-        rq_repin_lock(this_rq, rf);
-
         return pulled_task;
 }
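For readers unfamiliar with the pinning API, the window being narrowed has
roughly the following shape (an illustrative condensation of the code above,
not new code introduced by this patch):

        /* About to drop the rq lock: release the lockdep pin first. */
        rq_unpin_lock(this_rq, rf);
        raw_spin_rq_unlock(this_rq);

        /* ... load balancing, which may take other rq locks ... */

        raw_spin_rq_lock(this_rq);
        rq_repin_lock(this_rq, rf);

After this patch, the unpin/unlock only happens once the overload and
avg_idle checks have decided that load balancing is actually worth doing.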
From patchwork Tue Dec 12 00:31:37 2023

From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
    youssefesmat@google.com, joelaf@google.com, roman.gushchin@linux.dev,
    yu.c.chen@intel.com, kprateek.nayak@amd.com, gautham.shenoy@amd.com,
    aboorvad@linux.vnet.ibm.com, wuyun.abel@bytedance.com, tj@kernel.org,
    kernel-team@meta.com
Subject: [PATCH v4 4/8] sched: Check cpu_active() earlier in newidle_balance()
Date: Mon, 11 Dec 2023 18:31:37 -0600
Message-ID: <20231212003141.216236-5-void@manifault.com>
In-Reply-To: <20231212003141.216236-1-void@manifault.com>

In newidle_balance(), we check if the current CPU is inactive, and then
decline to pull any remote tasks to the core if so. Before this check,
however, we're currently updating rq->idle_stamp. If a core is offline,
setting its idle stamp is not useful. The core won't be chosen by any task in
select_task_rq_fair(), and setting the rq->idle_stamp is misleading anyway,
given that an inactive core should be expected to be idle for a very long
time. Let's set rq->idle_stamp in newidle_balance() only if the CPU is active.

Signed-off-by: David Vernet
---
 kernel/sched/fair.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e1b676bb1fed..49f047df5d9d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12284,18 +12284,18 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
         if (this_rq->ttwu_pending)
                 return 0;

-        /*
-         * We must set idle_stamp _before_ calling idle_balance(), such that we
-         * measure the duration of idle_balance() as idle time.
-         */
-        this_rq->idle_stamp = rq_clock(this_rq);
-
         /*
          * Do not pull tasks towards !active CPUs...
          */
         if (!cpu_active(this_cpu))
                 return 0;

+        /*
+         * We must set idle_stamp _before_ calling idle_balance(), such that we
+         * measure the duration of idle_balance() as idle time.
+         */
+        this_rq->idle_stamp = rq_clock(this_rq);
+
         rcu_read_lock();
         sd = rcu_dereference_check_sched_domain(this_rq->sd);
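Condensed, the top of newidle_balance() after this change orders the checks
like so (taken from the hunk above, context elided):

        if (this_rq->ttwu_pending)
                return 0;

        /* Do not pull tasks towards !active CPUs... */
        if (!cpu_active(this_cpu))
                return 0;

        /* Only an active CPU gets its idle_stamp updated. */
        this_rq->idle_stamp = rq_clock(this_rq);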
From patchwork Tue Dec 12 00:31:38 2023

From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
    youssefesmat@google.com, joelaf@google.com, roman.gushchin@linux.dev,
    yu.c.chen@intel.com, kprateek.nayak@amd.com, gautham.shenoy@amd.com,
    aboorvad@linux.vnet.ibm.com, wuyun.abel@bytedance.com, tj@kernel.org,
    kernel-team@meta.com
Subject: [PATCH v4 5/8] sched: Enable sched_feat callbacks on enable/disable
Date: Mon, 11 Dec 2023 18:31:38 -0600
Message-ID: <20231212003141.216236-6-void@manifault.com>
In-Reply-To: <20231212003141.216236-1-void@manifault.com>

When a scheduler feature is enabled or disabled, the sched_feat_enable() and
sched_feat_disable() functions are invoked respectively for that feature. For
features that don't require resetting any state, this works fine. However,
there will be an upcoming feature called SHARED_RUNQ which needs to drain all
tasks from a set of global shared runqueues in order to keep stale tasks from
staying in the queues after the feature has been disabled.

This patch therefore defines a new SCHED_FEAT_CALLBACK macro which allows
scheduler features to specify a callback that should be invoked when a feature
is enabled or disabled respectively. The SCHED_FEAT macro assumes a NULL
callback.

Signed-off-by: David Vernet
---
 kernel/sched/core.c  |  4 ++--
 kernel/sched/debug.c | 18 ++++++++++++++----
 kernel/sched/sched.h | 16 ++++++++++------
 3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9ad7f0255e14..045ac2539f37 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -124,12 +124,12 @@ DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
  * sysctl_sched_features, defined in sched.h, to allow constants propagation
  * at compile time and compiler optimization based on features default.
  */
-#define SCHED_FEAT(name, enabled)       \
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)  \
         (1UL << __SCHED_FEAT_##name) * enabled |
 const_debug unsigned int sysctl_sched_features =
 #include "features.h"
         0;
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 /*
  * Print a warning if need_resched is set for the given duration (if
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 168eecc209b4..0b72799c7e84 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -44,14 +44,14 @@ static unsigned long nsec_low(unsigned long long nsec)

 #define SPLIT_NS(x) nsec_high(x), nsec_low(x)

-#define SCHED_FEAT(name, enabled)       \
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)  \
         #name ,

 static const char * const sched_feat_names[] = {
 #include "features.h"
 };

-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 static int sched_feat_show(struct seq_file *m, void *v)
 {
@@ -72,22 +72,32 @@ static int sched_feat_show(struct seq_file *m, void *v)
 #define jump_label_key__true  STATIC_KEY_INIT_TRUE
 #define jump_label_key__false STATIC_KEY_INIT_FALSE

-#define SCHED_FEAT(name, enabled)       \
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)  \
         jump_label_key__##enabled ,

 struct static_key sched_feat_keys[__SCHED_FEAT_NR] = {
 #include "features.h"
 };

-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK
+
+#define SCHED_FEAT_CALLBACK(name, enabled, cb) cb,
+static const sched_feat_change_f sched_feat_cbs[__SCHED_FEAT_NR] = {
+#include "features.h"
+};
+#undef SCHED_FEAT_CALLBACK

 static void sched_feat_disable(int i)
 {
+        if (sched_feat_cbs[i])
+                sched_feat_cbs[i](false);
         static_key_disable_cpuslocked(&sched_feat_keys[i]);
 }

 static void sched_feat_enable(int i)
 {
+        if (sched_feat_cbs[i])
+                sched_feat_cbs[i](true);
         static_key_enable_cpuslocked(&sched_feat_keys[i]);
 }
 #else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 53fe2294eec7..517e67a0cc9a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2091,6 +2091,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 #endif
 }

+#define SCHED_FEAT(name, enabled) SCHED_FEAT_CALLBACK(name, enabled, NULL)
+
 /*
  * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
  */
@@ -2100,7 +2102,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 # define const_debug const
 #endif

-#define SCHED_FEAT(name, enabled)       \
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)  \
         __SCHED_FEAT_##name ,

 enum {
@@ -2108,7 +2110,7 @@ enum {
 #include "features.h"
         __SCHED_FEAT_NR,
 };

-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 #ifdef CONFIG_SCHED_DEBUG

@@ -2119,14 +2121,14 @@ enum {
 extern const_debug unsigned int sysctl_sched_features;

 #ifdef CONFIG_JUMP_LABEL
-#define SCHED_FEAT(name, enabled)                                       \
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)                          \
 static __always_inline bool static_branch_##name(struct static_key *key) \
 {                                                                       \
         return static_key_##enabled(key);                               \
 }

 #include "features.h"
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
@@ -2144,17 +2146,19 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
  * constants propagation at compile time and compiler optimization based on
  * features default.
  */
-#define SCHED_FEAT(name, enabled)       \
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)  \
         (1UL << __SCHED_FEAT_##name) * enabled |
 static const_debug __maybe_unused unsigned int sysctl_sched_features =
 #include "features.h"
         0;
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 #define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x))

 #endif /* SCHED_DEBUG */

+typedef void (*sched_feat_change_f)(bool enabling);
+
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;
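To sketch the intended usage (illustrative only; the SHARED_RUNQ line
anticipates a later patch in this series, and the specific feature names are
not part of this patch): a feature with no state to reset keeps using
SCHED_FEAT(), which supplies a NULL callback, while a feature that needs
setup/teardown registers a callback of type sched_feat_change_f:

        /* kernel/sched/features.h -- sketch */
        SCHED_FEAT(TTWU_QUEUE, true)                            /* NULL callback */
        SCHED_FEAT_CALLBACK(SHARED_RUNQ, false, shared_runq_toggle)

When such a feature is toggled through the sched features debugfs file,
sched_feat_enable() / sched_feat_disable() invoke the registered callback with
true / false before flipping the feature's static key.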
From patchwork Tue Dec 12 00:31:39 2023

From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
    youssefesmat@google.com, joelaf@google.com, roman.gushchin@linux.dev,
    yu.c.chen@intel.com, kprateek.nayak@amd.com, gautham.shenoy@amd.com,
    aboorvad@linux.vnet.ibm.com, wuyun.abel@bytedance.com, tj@kernel.org,
    kernel-team@meta.com
Subject: [PATCH v4 6/8] sched: Implement shared runqueue in fair.c
Date: Mon, 11 Dec 2023 18:31:39 -0600
Message-ID: <20231212003141.216236-7-void@manifault.com>
In-Reply-To: <20231212003141.216236-1-void@manifault.com>

Overview
========

The scheduler must constantly strike a balance between work conservation and
avoiding costly migrations which harm performance due to e.g. decreased cache
locality. The matter is further complicated by the topology of the system.
Migrating a task between cores on the same LLC may be more optimal than
keeping a task local to the CPU, whereas migrating a task between LLCs or NUMA
nodes may tip the balance in the other direction.

With that in mind, while CFS is by and large mostly a work conserving
scheduler, there are certain instances where the scheduler will choose to keep
a task local to a CPU when it would have been more optimal to migrate it to an
idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM is a VM
that JITs Hack and PHP code in service of web requests. Like other JIT /
compilation workloads, it tends to be heavily CPU bound and to exhibit
generally poor cache locality. To try and address this, we set several debugfs
(/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of context
switching (the other knobs). Collectively, these knobs provide a substantial
performance win; resulting in roughly a 20% improvement in throughput. Worth
noting, however, is that this improvement is _not_ at full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going idle
even when the host was overcommitted. In response, we wrote the "shared
runqueue" (SHARED_RUNQ) feature proposed in this patch set.
The idea behind SHARED_RUNQ is simple: it enables the scheduler to be more
aggressively work conserving by placing a waking task into a sharded per-LLC
FIFO queue that can be pulled from by another core in the LLC before it goes
idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement in
throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to the
wins from the debugfs knobs mentioned above, and to other benchmarks outlined
below in the Results section.

Design
======

Note that the design described here reflects sharding, which will be added in
a subsequent patch. The design is described that way in this commit summary as
the benchmarks described in the results section below all include sharded
SHARED_RUNQ. The patches are not combined into one to ease the burden of
review.

The design of SHARED_RUNQ is quite simple. A shared_runq is simply a list of
struct shared_runq_shard objects, which itself is simply a struct list_head of
tasks, and a spinlock:

struct shared_runq_shard {
        struct list_head list;
        raw_spinlock_t lock;
} ____cacheline_aligned;

struct shared_runq {
        u32 num_shards;
        struct shared_runq_shard shards[];
} ____cacheline_aligned;

We create a struct shared_runq per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs, and we
create a number of struct shared_runq_shard objects that are housed there.

When a task first wakes up, it enqueues itself in the shared_runq_shard of its
current LLC at the end of enqueue_task_fair(). Enqueues only happen if the
task was not manually migrated to the current core by select_task_rq(), and is
not pinned to a specific CPU.

A core will pull a task from the shards in its LLC's shared_runq at the
beginning of newidle_balance().

Difference between SHARED_RUNQ and SIS_NODE
===========================================

In [0] Peter proposed a patch that addresses Tejun's observations that when
workqueues are targeted towards a specific LLC on his Zen2 machine with small
CCXs, there would be significant idle time due to select_idle_sibling() not
considering anything outside of the current LLC.

This patch (SIS_NODE) is essentially the complement to the proposal here.
SIS_NODE causes waking tasks to look for idle cores in neighboring LLCs on the
same die, whereas SHARED_RUNQ causes cores about to go idle to look for
enqueued tasks. That said, in its current form, the two features are at a
different scope, as SIS_NODE searches for idle cores between LLCs, while
SHARED_RUNQ enqueues tasks within a single LLC.

The patch was since removed in [1], and we compared the results to SHARED_RUNQ
(previously called "swqueue") in [2]. SIS_NODE did not outperform SHARED_RUNQ
on any of the benchmarks, so we elect to not compare against it again for this
v2 patch set.

[0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
[1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
[2]: https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/

Worth noting as well is that it was pointed out in [3] that the logic behind
including SIS_NODE in the first place should apply to SHARED_RUNQ (meaning
that e.g. very small Zen2 CPUs with only 3/4 cores per LLC should benefit from
having a single shared_runq stretch across multiple LLCs).
I drafted a patch that implements this by having a minimum LLC size for
creating a shard, and stretches a shared_runq across multiple LLCs if they're
smaller than that size, and sent it to Tejun to test on his Zen2. Tejun
reported back that SIS_NODE did not seem to make a difference:

[3]: https://lore.kernel.org/lkml/20230711114207.GK3062772@hirez.programming.kicks-ass.net/

                                o____________o__________o
                                |    mean    | Variance |
                                o------------o----------o
    Vanilla:                    |  108.84s   |  0.0057  |
    NO_SHARED_RUNQ:             |  108.82s   |  0.119s  |
    SHARED_RUNQ:                |  108.17s   |  0.038s  |
    SHARED_RUNQ w/ SIS_NODE:    |  108.87s   |  0.111s  |
                                o------------o----------o

I similarly tried running kcompile on SHARED_RUNQ with SIS_NODE on my 7950X
Zen3, but didn't see any gain relative to plain SHARED_RUNQ.

Conclusion
==========

SHARED_RUNQ in this form provides statistically significant wins for several
types of workloads, and various CPU topologies. The reason for this is roughly
the same for all workloads: SHARED_RUNQ encourages work conservation inside of
a CCX by having a CPU do an O(# per-LLC shards) iteration over the shared_runq
shards in an LLC. We could similarly do an O(n) iteration over all of the
runqueues in the current LLC when a core is going idle, but that's quite
costly (especially for larger LLCs), and sharded SHARED_RUNQ seems to provide
a performant middle ground between doing an O(n) walk, and doing an O(1) pull
from a single per-LLC shared runq.

While SHARED_RUNQ in this form encourages work conservation, it of course does
not guarantee it given that we don't implement any kind of work stealing
between shared_runq's. In the future, we could potentially push CPU
utilization even higher by enabling work stealing between shared_runq's,
likely between CCXs on the same NUMA node.

Appendix
========

Worth noting is that various people's review feedback contributed to this
patch. Most notable is likely K Prateek Nayak, who suggested checking
this_rq->rd->overload to avoid unnecessary contention and load balancing
overhead with the SHARED_RUNQ patches, and also corrected how we were doing
idle cost accounting.

Originally-by: Roman Gushchin
Signed-off-by: David Vernet
---
 include/linux/sched.h   |   2 +
 init/init_task.c        |   3 +
 kernel/sched/core.c     |  13 ++
 kernel/sched/fair.c     | 332 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h |   4 +
 kernel/sched/sched.h    |   9 ++
 kernel/sched/topology.c |   4 +-
 7 files changed, 364 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d258162deb0..0e329040c2ed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -794,6 +794,8 @@ struct task_struct {
         unsigned long                   wakee_flip_decay_ts;
         struct task_struct              *last_wakee;

+        struct list_head                shared_runq_node;
+
         /*
          * recent_used_cpu is initially set as the last CPU used by a task
          * that wakes affine another task. Waker/wakee relationships can
diff --git a/init/init_task.c b/init/init_task.c
index 5727d42149c3..e57587988cb9 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -75,6 +75,9 @@ struct task_struct init_task
         .stack          = init_stack,
         .usage          = REFCOUNT_INIT(2),
         .flags          = PF_KTHREAD,
+#ifdef CONFIG_SMP
+        .shared_runq_node = LIST_HEAD_INIT(init_task.shared_runq_node),
+#endif
         .prio           = MAX_PRIO - 20,
         .static_prio    = MAX_PRIO - 20,
         .normal_prio    = MAX_PRIO - 20,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 045ac2539f37..f12aaa3674fa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4523,6 +4523,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
         p->wake_entry.u_flags = CSD_TYPE_TTWU;
         p->migration_pending = NULL;
+        INIT_LIST_HEAD(&p->shared_runq_node);
 #endif
         init_sched_mm_cid(p);
 }
@@ -9713,6 +9714,18 @@ int sched_cpu_deactivate(unsigned int cpu)
         return 0;
 }

+void sched_update_domains(void)
+{
+        const struct sched_class *class;
+
+        update_sched_domain_debugfs();
+
+        for_each_class(class) {
+                if (class->update_domains)
+                        class->update_domains();
+        }
+}
+
 static void sched_rq_cpu_starting(unsigned int cpu)
 {
         struct rq *rq = cpu_rq(cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49f047df5d9d..b2f4f8620265 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -91,7 +91,273 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 }
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

+/**
+ * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
+ * runnable tasks within an LLC.
+ * @list: The list of tasks in the shared_runq.
+ * @lock: The raw spinlock that synchronizes access to the shared_runq.
+ *
+ * WHAT
+ * ====
+ *
+ * This structure enables the scheduler to be more aggressively work
+ * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
+ * pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
+ * Waking tasks are enqueued in the calling CPU's struct shared_runq in
+ * __enqueue_entity(), and are opportunistically pulled from the shared_runq
+ * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
+ * to being pulled from the shared_runq, in which case they're simply dequeued
+ * from the shared_runq in __dequeue_entity().
+ *
+ * There is currently no task-stealing between shared_runqs in different LLCs,
+ * which means that shared_runq is not fully work conserving. This could be
+ * added at a later time, with tasks likely only being stolen across
+ * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
+ *
+ * Note that there is a per-CPU allocation of struct shared_runq objects to
+ * account for the possibility that sched domains are reconfigured during e.g.
+ * hotplug. In practice, most of these struct shared_runq objects are unused at
+ * any given time, with the struct shared_runq of a single core per LLC being
+ * referenced by all other cores in the LLC via a pointer in their struct
+ * cfs_rq.
+ *
+ * HOW
+ * ===
+ *
+ * A shared_runq is comprised of a list, and a spinlock for synchronization.
+ * Given that the critical section for a shared_runq is typically a fast list
+ * operation, and that the shared_runq is localized to a single LLC, the
+ * spinlock will typically only be contended on workloads that do little else
+ * other than hammer the runqueue.
+ *
+ * WHY
+ * ===
+ *
+ * As mentioned above, the main benefit of shared_runq is that it enables more
+ * aggressive work conservation in the scheduler. This can benefit workloads
+ * that benefit more from CPU utilization than from L1/L2 cache locality.
+ *
+ * shared_runqs are segmented across LLCs both to avoid contention on the
+ * shared_runq spinlock by minimizing the number of CPUs that could contend on
+ * it, as well as to strike a balance between work conservation, and L3 cache
+ * locality.
+ */
+struct shared_runq {
+        struct list_head list;
+        raw_spinlock_t lock;
+} ____cacheline_aligned;
+
 #ifdef CONFIG_SMP
+
+static DEFINE_PER_CPU(struct shared_runq, shared_runqs);
+DEFINE_STATIC_KEY_FALSE(__shared_runq_force_dequeue);
+
+static struct shared_runq *rq_shared_runq(struct rq *rq)
+{
+        return rq->cfs.shared_runq;
+}
+
+static void shared_runq_reassign_domains(void)
+{
+        int i;
+        struct shared_runq *shared_runq;
+        struct rq *rq;
+        struct rq_flags rf;
+
+        for_each_possible_cpu(i) {
+                rq = cpu_rq(i);
+                shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i));
+
+                rq_lock(rq, &rf);
+                rq->cfs.shared_runq = shared_runq;
+                rq_unlock(rq, &rf);
+        }
+}
+
+static void __shared_runq_drain(struct shared_runq *shared_runq)
+{
+        struct task_struct *p, *tmp;
+
+        raw_spin_lock(&shared_runq->lock);
+        list_for_each_entry_safe(p, tmp, &shared_runq->list, shared_runq_node)
+                list_del_init(&p->shared_runq_node);
+        raw_spin_unlock(&shared_runq->lock);
+}
+
+static void update_domains_fair(void)
+{
+        int i;
+        struct shared_runq *shared_runq;
+
+        /* Avoid racing with SHARED_RUNQ enable / disable. */
+        lockdep_assert_cpus_held();
+
+        shared_runq_reassign_domains();
+
+        /* Ensure every core sees its updated shared_runq pointers. */
+        synchronize_rcu();
+
+        /*
+         * Drain all tasks from all shared_runq's to ensure there are no stale
+         * tasks in any prior domain runq. This can cause us to drain live
+         * tasks that would otherwise have been safe to schedule, but this
+         * isn't a practical problem given how infrequently domains are
+         * rebuilt.
+         */
+        for_each_possible_cpu(i) {
+                shared_runq = &per_cpu(shared_runqs, i);
+                __shared_runq_drain(shared_runq);
+        }
+}
+
+void shared_runq_toggle(bool enabling)
+{
+        int cpu;
+
+        if (enabling) {
+                static_branch_enable_cpuslocked(&__shared_runq_force_dequeue);
+                return;
+        }
+
+        /* Avoid racing with hotplug. */
+        lockdep_assert_cpus_held();
+
+        /* Ensure all cores have stopped enqueueing / dequeuing tasks. */
+        synchronize_rcu();
+
+        for_each_possible_cpu(cpu) {
+                int sd_id;
+
+                sd_id = per_cpu(sd_llc_id, cpu);
+                if (cpu == sd_id)
+                        __shared_runq_drain(rq_shared_runq(cpu_rq(cpu)));
+        }
+        /*
+         * Disable dequeue _after_ ensuring that all of the shared runqueues
+         * are fully drained. Otherwise, a task could remain enqueued on a
+         * shared runqueue after the feature was disabled, and could exit
+         * before drain has completed.
+ */ + static_branch_disable_cpuslocked(&__shared_runq_force_dequeue); +} + +static struct task_struct *shared_runq_pop_task(struct rq *rq) +{ + struct task_struct *p; + struct shared_runq *shared_runq; + + shared_runq = rq_shared_runq(rq); + if (list_empty(&shared_runq->list)) + return NULL; + + raw_spin_lock(&shared_runq->lock); + p = list_first_entry_or_null(&shared_runq->list, struct task_struct, + shared_runq_node); + if (p && is_cpu_allowed(p, cpu_of(rq))) + list_del_init(&p->shared_runq_node); + else + p = NULL; + raw_spin_unlock(&shared_runq->lock); + + return p; +} + +static void shared_runq_push_task(struct rq *rq, struct task_struct *p) +{ + struct shared_runq *shared_runq; + + shared_runq = rq_shared_runq(rq); + raw_spin_lock(&shared_runq->lock); + list_add_tail(&p->shared_runq_node, &shared_runq->list); + raw_spin_unlock(&shared_runq->lock); +} + +static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p) +{ + /* + * Only enqueue the task in the shared runqueue if: + * + * - SHARED_RUNQ is enabled + * - The task isn't pinned to a specific CPU + * - The rq is empty, meaning the task will be picked next anyways. + */ + if (!sched_feat(SHARED_RUNQ) || + p->nr_cpus_allowed == 1 || + rq->nr_running < 1) + return; + + shared_runq_push_task(rq, p); +} + +static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) +{ + struct task_struct *p = NULL; + struct rq *src_rq; + struct rq_flags src_rf; + int ret = 0, cpu; + + p = shared_runq_pop_task(rq); + if (!p) + return 0; + + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + + src_rq = task_rq_lock(p, &src_rf); + + cpu = cpu_of(rq); + if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p) && + likely(!is_migration_disabled(p) && is_cpu_allowed(p, cpu))) { + update_rq_clock(src_rq); + src_rq = move_queued_task(src_rq, &src_rf, p, cpu); + ret = 1; + } + + if (src_rq != rq) { + task_rq_unlock(src_rq, p, &src_rf); + raw_spin_rq_lock(rq); + } else { + rq_unpin_lock(rq, &src_rf); + raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags); + } + rq_repin_lock(rq, rf); + + if (rq->nr_running != rq->cfs.h_nr_running) + ret = -1; + + return ret; +} + +static void shared_runq_dequeue_task(struct task_struct *p) +{ + struct shared_runq *shared_runq; + + /* + * Always dequeue a task if: + * - SHARED_RUNQ is enabled + * - The __shared_runq_force_dequeue static branch is enabled. + * + * The latter is necessary to ensure that we've fully drained the + * shared runqueues after the feature has been disabled. Otherwise, we + * could end up in a situation where we stop dequeuing tasks, and a + * task exits while still on the shared runqueue before it's been + * drained. + */ + if (!sched_feat(SHARED_RUNQ) && + !static_branch_unlikely(&__shared_runq_force_dequeue)) + return; + + if (!list_empty(&p->shared_runq_node)) { + shared_runq = rq_shared_runq(task_rq(p)); + raw_spin_lock(&shared_runq->lock); + if (likely(!list_empty(&p->shared_runq_node))) + list_del_init(&p->shared_runq_node); + raw_spin_unlock(&shared_runq->lock); + } +} + /* * For asym packing, by default the lower numbered CPU has higher priority. 
*/ @@ -114,6 +380,15 @@ int __weak arch_asym_cpu_priority(int cpu) * (default: ~5%) */ #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078) +#else +void shared_runq_toggle(bool enabling) +{} + +static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p) +{} + +static void shared_runq_dequeue_task(struct task_struct *p) +{} #endif #ifdef CONFIG_CFS_BANDWIDTH @@ -823,6 +1098,8 @@ RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity, */ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { + if (entity_is_task(se)) + shared_runq_enqueue_task(rq_of(cfs_rq), task_of(se)); avg_vruntime_add(cfs_rq, se); se->min_vruntime = se->vruntime; rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, @@ -834,6 +1111,8 @@ static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, &min_vruntime_cb); avg_vruntime_sub(cfs_rq, se); + if (entity_is_task(se)) + shared_runq_dequeue_task(task_of(se)); } struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq) @@ -8211,6 +8490,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) static void task_dead_fair(struct task_struct *p) { + WARN_ON_ONCE(!list_empty(&p->shared_runq_node)); remove_entity_load_avg(&p->se); } @@ -12299,15 +12579,47 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) rcu_read_lock(); sd = rcu_dereference_check_sched_domain(this_rq->sd); - if (!READ_ONCE(this_rq->rd->overload) || - (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) { + /* Skip all balancing if the root domain is not overloaded. */ + if (!READ_ONCE(this_rq->rd->overload)) { if (sd) update_next_balance(sd, &next_balance); rcu_read_unlock(); + goto out; + } else if (sched_feat(SHARED_RUNQ)) { + /* + * Ignore avg_idle and always try to pull a task from the + * shared_runq when enabled. The goal of SHARED_RUNQ is to + * maximize work conservation, so we want to avoid heuristics + * that could potentially negate that such as newidle lb cost + * tracking. + */ + pulled_task = shared_runq_pick_next_task(this_rq, rf); + if (pulled_task) { + rcu_read_unlock(); + goto out_swq; + } + + /* + * We drop and reacquire the rq lock when checking for tasks in + * the shared_runq shards, so check if there's a wakeup pending + * to potentially avoid having to do the full load_balance() + * pass. + */ + if (this_rq->ttwu_pending) { + rcu_read_unlock(); + return 0; + } + } + + if (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) { + update_next_balance(sd, &next_balance); + rcu_read_unlock(); + goto out; } + rcu_read_unlock(); /* @@ -12327,6 +12639,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) int continue_balancing = 1; u64 domain_cost; + /* + * Skip <= LLC domains as they likely won't have any tasks if + * the shared runq is empty. 
+ */ + if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES)) + continue; + update_next_balance(sd, &next_balance); if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) @@ -12359,6 +12678,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) raw_spin_rq_lock(this_rq); rq_repin_lock(this_rq, rf); +out_swq: if (curr_cost > this_rq->max_idle_balance_cost) this_rq->max_idle_balance_cost = curr_cost; @@ -12733,6 +13053,9 @@ static void attach_task_cfs_rq(struct task_struct *p) static void switched_from_fair(struct rq *rq, struct task_struct *p) { +#ifdef CONFIG_SMP + WARN_ON_ONCE(!list_empty(&p->shared_runq_node)); +#endif detach_task_cfs_rq(p); } @@ -13125,6 +13448,7 @@ DEFINE_SCHED_CLASS(fair) = { .task_dead = task_dead_fair, .set_cpus_allowed = set_cpus_allowed_common, + .update_domains = update_domains_fair, #endif .task_tick = task_tick_fair, @@ -13191,6 +13515,7 @@ __init void init_sched_fair_class(void) { #ifdef CONFIG_SMP int i; + struct shared_runq *shared_runq; for_each_possible_cpu(i) { zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i)); @@ -13202,6 +13527,9 @@ __init void init_sched_fair_class(void) INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i)); INIT_LIST_HEAD(&cpu_rq(i)->cfsb_csd_list); #endif + shared_runq = &per_cpu(shared_runqs, i); + INIT_LIST_HEAD(&shared_runq->list); + raw_spin_lock_init(&shared_runq->lock); } open_softirq(SCHED_SOFTIRQ, run_rebalance_domains); diff --git a/kernel/sched/features.h b/kernel/sched/features.h index a3ddf84de430..c38fac5dd042 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -88,3 +88,7 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) SCHED_FEAT(LATENCY_WARN, false) SCHED_FEAT(HZ_BW, true) + +#ifdef CONFIG_SMP +SCHED_FEAT_CALLBACK(SHARED_RUNQ, false, shared_runq_toggle) +#endif diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 517e67a0cc9a..79cbdb251ad5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -514,6 +514,12 @@ static inline bool cfs_task_bw_constrained(struct task_struct *p) { return false #endif /* CONFIG_CGROUP_SCHED */ +#ifdef CONFIG_SMP +extern void sched_update_domains(void); +#else +static inline void sched_update_domains(void) {} +#endif /* CONFIG_SMP */ + extern void unregister_rt_sched_group(struct task_group *tg); extern void free_rt_sched_group(struct task_group *tg); extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent); @@ -593,6 +599,7 @@ struct cfs_rq { #endif #ifdef CONFIG_SMP + struct shared_runq *shared_runq; /* * CFS load tracking */ @@ -2158,6 +2165,7 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features = #endif /* SCHED_DEBUG */ typedef void (*sched_feat_change_f)(bool enabling); +extern void shared_runq_toggle(bool enabling); extern struct static_key_false sched_numa_balancing; extern struct static_key_false sched_schedstats; @@ -2317,6 +2325,7 @@ struct sched_class { void (*rq_offline)(struct rq *rq); struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq); + void (*update_domains)(void); #endif void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 10d1391e7416..0f69209ba56a 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2612,6 +2612,8 @@ int __init sched_init_domains(const struct cpumask *cpu_map) doms_cur = &fallback_doms; cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN)); err = 
build_sched_domains(doms_cur[0], NULL); + if (!err) + sched_update_domains(); return err; } @@ -2780,7 +2782,7 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[], dattr_cur = dattr_new; ndoms_cur = ndoms_new; - update_sched_domain_debugfs(); + sched_update_domains(); } /* From patchwork Tue Dec 12 00:31:40 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Vernet X-Patchwork-Id: 176995 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:bcd1:0:b0:403:3b70:6f57 with SMTP id r17csp7426463vqy; Mon, 11 Dec 2023 16:32:27 -0800 (PST) X-Google-Smtp-Source: AGHT+IFVINndRdoCdONPG7jc4Dsb5ku7RR4jVlgVDxhy9ELZqXx7DnB7+43w2nal2KTi1apKdWS5 X-Received: by 2002:a17:90a:989:b0:28a:bd51:7205 with SMTP id 9-20020a17090a098900b0028abd517205mr126149pjo.43.1702341146801; Mon, 11 Dec 2023 16:32:26 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1702341146; cv=none; d=google.com; s=arc-20160816; b=zLEKMlR41UIIUdCE18IJLdkMPVBPho3p5qzE6rnlotsksCXljlDjdBc6L+TC/dyZ73 Xx7QBc3LDba0c/ZHi4/bWgTA4kwlpC8hx9UbzrwxiwNQ9LxbC32Bwf7G0OqKnWFM7Q7v /fp4n9AMq9C0omY2M3+mdC+Z/TL0qrYafz4OvOr4NRbraP8HFg14aYHx+dWoy9yPgH7d ZUQLUJe2JDQG9+LZ3bmIHsN9PJG89x9KvQW/pqOlr4oKQ/STp0N1qD4ybFSAazdvIgqo GpjAuaKWIiikqhlJz/wQ8FMY0a27CYhkoA1oXGn4zlRpopISjOhm3SAzrwCHJMyaodYM ZVOQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=Ae3mm2ON6OBR2SQ7MBp1B1n4BG08XaXKdV3B0rLDua0=; fh=M1Y4hH0c3aJh0Tk+r5qI6oW+pPzAHXWG1oPGqjHDYxM=; b=T/aTrvVnnbVV4b8kI+V8VkL49QX9IY/B5UCV1+uvbP1xIJm1ee6qKmLgtg/UR1N/wJ 9NM/d+dqHffWmNYPJhfxdIZNih+Y9jCLBBeEse/HgAhrrOUYNqyoSXBC44fDNsQfy+ED ZXiAf7A9EbWKkt+tj+2AmOVRhQF47Ziqp+9z7YQVYeo03w6waI3oJPYfYbByj9i+oeHa xNF3WHcTbugaNt3PS33sTVQtYHcfovh5eCPunxJ+w/L/VHogQCWSiiaTphBtI45mceKy zwCmCxOrEUKwr4fmC+dC9R+NrdVZdnNQiJTeO8aCzfNRkK1814TSeublcJti/yokISed 976Q== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:8 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from fry.vger.email (fry.vger.email. 
[2620:137:e000::3:8]) by mx.google.com with ESMTPS id mv6-20020a17090b198600b0028683c7ed04si6781375pjb.157.2023.12.11.16.32.26 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 11 Dec 2023 16:32:26 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:8 as permitted sender) client-ip=2620:137:e000::3:8; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:8 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by fry.vger.email (Postfix) with ESMTP id BE99A809F38A; Mon, 11 Dec 2023 16:32:20 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at fry.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345577AbjLLAcH (ORCPT + 99 others); Mon, 11 Dec 2023 19:32:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41610 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345476AbjLLAbz (ORCPT ); Mon, 11 Dec 2023 19:31:55 -0500 Received: from mail-io1-f46.google.com (mail-io1-f46.google.com [209.85.166.46]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 933E59C for ; Mon, 11 Dec 2023 16:32:00 -0800 (PST) Received: by mail-io1-f46.google.com with SMTP id ca18e2360f4ac-7b70de199f6so136500839f.2 for ; Mon, 11 Dec 2023 16:32:00 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702341119; x=1702945919; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Ae3mm2ON6OBR2SQ7MBp1B1n4BG08XaXKdV3B0rLDua0=; b=XCwCsFtv2rxEBxW7o55QYAbG5HIbuB0n3ThUKB60BadjaTrKXeQX9E0cKrBlBp7weM 2zX7gjPC9zl+NesoL1kq2/qrNmZDY71RlL8C3JpEegtyhH292bkVsDG/+KaaK4HoRx+c 2oTH19o485cXu4GPL+v11VIUIBcfMhM4P1aRXyOolWHpvtlk/AD3pFbK86wfltQcB2/j MAGWHydYVLNiED6VEFeAwsRpIPksi49svtiLBszIXO1vpTbPloBlLxXWqFrZauXsJ/k+ m+jgatftMiwU9JhthOqwqAkv80OIrnCx17PKek2HyDp3dOIyv8pUZYDE0VMceahVFUB0 Kqxg== X-Gm-Message-State: AOJu0Yxhxy9kC+bNNWSik/dsLcVQ1B7eOaF8Xs7/DquovA2oIupcK+mn I83k0iQZys6KeGxlTHINFbZybqc/c3kaEe3X X-Received: by 2002:a5d:9a86:0:b0:7b7:2bb3:2b24 with SMTP id c6-20020a5d9a86000000b007b72bb32b24mr965359iom.43.1702341119171; Mon, 11 Dec 2023 16:31:59 -0800 (PST) Received: from localhost (c-24-1-27-177.hsd1.il.comcast.net. 
[24.1.27.177]) by smtp.gmail.com with ESMTPSA id c4-20020a029604000000b00468e18cd2f6sm2138532jai.132.2023.12.11.16.31.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 11 Dec 2023 16:31:58 -0800 (PST) From: David Vernet To: linux-kernel@vger.kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, youssefesmat@google.com, joelaf@google.com, roman.gushchin@linux.dev, yu.c.chen@intel.com, kprateek.nayak@amd.com, gautham.shenoy@amd.com, aboorvad@linux.vnet.ibm.com, wuyun.abel@bytedance.com, tj@kernel.org, kernel-team@meta.com Subject: [PATCH v4 7/8] sched: Shard per-LLC shared runqueues Date: Mon, 11 Dec 2023 18:31:40 -0600 Message-ID: <20231212003141.216236-8-void@manifault.com> X-Mailer: git-send-email 2.42.1 In-Reply-To: <20231212003141.216236-1-void@manifault.com> References: <20231212003141.216236-1-void@manifault.com> MIME-Version: 1.0 X-Spam-Status: No, score=-0.8 required=5.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on fry.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (fry.vger.email [0.0.0.0]); Mon, 11 Dec 2023 16:32:20 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1785034070239369335 X-GMAIL-MSGID: 1785034070239369335 The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that tasks are put into on enqueue, and pulled from when a core in that LLC would otherwise go idle. For CPUs with large LLCs, this can sometimes cause significant contention, as illustrated in [0]. [0]: https://lore.kernel.org/all/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/ So as to try and mitigate this contention, we can instead shard the per-LLC runqueue into multiple per-LLC shards. While this doesn't outright prevent all contention, it does somewhat mitigate it. For example, if we run the following schbench command which does almost nothing other than pound the runqueue: schbench -L -m 52 -p 512 -r 10 -t 1 we observe with lockstats that sharding significantly decreases contention. 
3 shards: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &shard->lock: 31510503 31510711 0.08 19.98 168932319.64 5.36 31700383 31843851 0.03 17.50 10273968.33 0.32 ------------ &shard->lock 15731657 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510 &shard->lock 15756516 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530 &shard->lock 21766 [<00000000126ec6ab>] newidle_balance+0x45a/0x650 &shard->lock 772 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540 ------------ &shard->lock 23458 [<00000000126ec6ab>] newidle_balance+0x45a/0x650 &shard->lock 16505108 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530 &shard->lock 14981310 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510 &shard->lock 835 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540 No sharding: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &shard->lock: 117868635 118361486 0.09 393.01 1250954097.25 10.57 119345882 119780601 0.05 343.35 38313419.51 0.32 ------------ &shard->lock 59169196 [<0000000060507011>] __enqueue_entity+0xdc/0x110 &shard->lock 59084239 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0 &shard->lock 108051 [<00000000084a6193>] newidle_balance+0x45a/0x650 ------------ &shard->lock 60028355 [<0000000060507011>] __enqueue_entity+0xdc/0x110 &shard->lock 119882 [<00000000084a6193>] newidle_balance+0x45a/0x650 &shard->lock 58213249 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0 The contention is ~3-4x worse if we don't shard at all. This roughly matches the fact that we had 3 shards on the host where this was collected. This could be addressed in future patch sets by adding a debugfs knob to control the sharding granularity. If we make the shards even smaller (what's in this patch, i.e. 
a size of 6), the contention goes away almost entirely: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ &shard->lock: 13839849 13877596 0.08 13.23 5389564.95 0.39 46910241 48069307 0.06 16.40 16534469.35 0.34 ------------ &shard->lock 3559 [<00000000ea455dcc>] newidle_balance+0x45a/0x650 &shard->lock 6992418 [<000000002266f400>] __dequeue_entity+0x78/0xa0 &shard->lock 6881619 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110 ------------ &shard->lock 6640140 [<000000002266f400>] __dequeue_entity+0x78/0xa0 &shard->lock 3523 [<00000000ea455dcc>] newidle_balance+0x45a/0x650 &shard->lock 7233933 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110 Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench benchmark on Milan, but we contend even more on the rq lock: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &rq->__lock: 9617614 9656091 0.10 79.64 69665812.00 7.21 18092700 67652829 0.11 82.38 344524858.87 5.09 ----------- &rq->__lock 6301611 [<000000003e63bf26>] task_rq_lock+0x43/0xe0 &rq->__lock 2530807 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 109360 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 &rq->__lock 178218 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 ----------- &rq->__lock 3245506 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 1294355 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 &rq->__lock 2837804 [<000000003e63bf26>] task_rq_lock+0x43/0xe0 &rq->__lock 1627866 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 .................................................................................................................................................................................................. &shard->lock: 7338558 7343244 0.10 35.97 7173949.14 0.98 30200858 32679623 0.08 35.59 16270584.52 0.50 ------------ &shard->lock 2004142 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0 &shard->lock 2611264 [<00000000473978cc>] newidle_balance+0x45a/0x650 &shard->lock 2727838 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110 ------------ &shard->lock 2737232 [<00000000473978cc>] newidle_balance+0x45a/0x650 &shard->lock 1693341 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0 &shard->lock 2912671 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110 ................................................................................................................................................................................................... 
If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still contends the most, but it's significantly less than with it enabled: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &rq->__lock: 791277 791690 0.12 110.54 4889787.63 6.18 1575996 62390275 0.13 112.66 316262440.56 5.07 ----------- &rq->__lock 263343 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 19394 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 &rq->__lock 4143 [<000000003b542e83>] __task_rq_lock+0x51/0xf0 &rq->__lock 51094 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 ----------- &rq->__lock 23756 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 &rq->__lock 379048 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 677 [<000000003b542e83>] __task_rq_lock+0x51/0xf0 &rq->__lock 47962 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 In general, the takeaway here is that sharding does help with contention, but it's not necessarily one size fits all, and it's workload dependent. For now, let's include sharding to try and avoid contention, and because it doesn't seem to regress CPUs that don't need it such as the AMD 7950X. Suggested-by: Peter Zijlstra Signed-off-by: David Vernet --- kernel/sched/fair.c | 181 +++++++++++++++++++++++++++++-------------- kernel/sched/sched.h | 3 +- 2 files changed, 126 insertions(+), 58 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b2f4f8620265..3f085c122712 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -92,45 +92,44 @@ static int __init setup_sched_thermal_decay_shift(char *str) __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); /** - * struct shared_runq - Per-LLC queue structure for enqueuing and migrating - * runnable tasks within an LLC. - * @list: The list of tasks in the shared_runq. - * @lock: The raw spinlock that synchronizes access to the shared_runq. + * struct shared_runq_shard - A structure containing a task list and a spinlock + * for a subset of cores in a struct shared_runq. + * @list: The list of tasks in the shard. + * @lock: The raw spinlock that synchronizes access to the shard. * * WHAT * ==== * * This structure enables the scheduler to be more aggressively work - * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be - * pulled from when another core in the LLC is going to go idle. + * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can + * then be pulled from when another core in the LLC is going to go idle. + * + * struct rq stores two pointers in its struct cfs_rq: * - * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq. - * Waking tasks are enqueued in the calling CPU's struct shared_runq in - * __enqueue_entity(), and are opportunistically pulled from the shared_runq - * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior - * to being pulled from the shared_runq, in which case they're simply dequeued - * from the shared_runq in __dequeue_entity(). + * 1. 
The per-LLC struct shared_runq which contains one or more shards of + * enqueued tasks. + * + * 2. The shard inside of the per-LLC struct shared_runq which contains the + * list of runnable tasks for that shard. + * + * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard in + * __enqueue_entity(), and are opportunistically pulled from the shared_runq in + * newidle_balance(). Pulling from shards is an O(# shards) operation. * * There is currently no task-stealing between shared_runqs in different LLCs, * which means that shared_runq is not fully work conserving. This could be * added at a later time, with tasks likely only being stolen across * shared_runqs on the same NUMA node to avoid violating NUMA affinities. * - * Note that there is a per-CPU allocation of struct shared_runq objects to - * account for the possibility that sched domains are reconfigured during e.g. - * hotplug. In practice, most of these struct shared_runq objects are unused at - * any given time, with the struct shared_runq of a single core per LLC being - * referenced by all other cores in the LLC via a pointer in their struct - * cfs_rq. - * * HOW * === * - * A shared_runq is comprised of a list, and a spinlock for synchronization. - * Given that the critical section for a shared_runq is typically a fast list - * operation, and that the shared_runq is localized to a single LLC, the - * spinlock will typically only be contended on workloads that do little else - * other than hammer the runqueue. + * A struct shared_runq_shard is comprised of a list, and a spinlock for + * synchronization. Given that the critical section for a shared_runq is + * typically a fast list operation, and that the shared_runq_shard is localized + * to a subset of cores on a single LLC (plus other cores in the LLC that pull + * from the shard in newidle_balance()), the spinlock will typically only be + * contended on workloads that do little else other than hammer the runqueue. * * WHY * === @@ -144,11 +143,35 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); * it, as well as to strike a balance between work conservation, and L3 cache * locality. */ -struct shared_runq { +struct shared_runq_shard { struct list_head list; raw_spinlock_t lock; } ____cacheline_aligned; +/* This would likely work better as a configurable knob via debugfs */ +#define SHARED_RUNQ_SHARD_SZ 6 +#define SHARED_RUNQ_MAX_SHARDS \ + ((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0)) + +/** + * struct shared_runq - Per-LLC queue structure for enqueuing and migrating + * runnable tasks within an LLC. + * @num_shards: The number of shards currently active in the shared_runq. + * @shards: The shards of the shared_runq. Only @num_shards of these shards are + * active at any given time. + * + * A per-LLC shared_runq that is composed of one of more shards. There is a + * per-CPU allocation of struct shared_runq objects to account for the + * possibility that sched domains are reconfigured during e.g. hotplug. In + * practice, most of these struct shared_runq objects are unused at any given + * time, with the struct shared_runq of a single core per LLC being referenced + * by all other cores in the LLC via a pointer in their struct cfs_rq. 
+ */ +struct shared_runq { + unsigned int num_shards; + struct shared_runq_shard shards[SHARED_RUNQ_MAX_SHARDS]; +} ____cacheline_aligned; + #ifdef CONFIG_SMP static DEFINE_PER_CPU(struct shared_runq, shared_runqs); @@ -159,31 +182,61 @@ static struct shared_runq *rq_shared_runq(struct rq *rq) return rq->cfs.shared_runq; } +static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq) +{ + return rq->cfs.shard; +} + +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu) +{ + return (cpu >> 1) % runq->num_shards; +} + static void shared_runq_reassign_domains(void) { int i; struct shared_runq *shared_runq; struct rq *rq; struct rq_flags rf; + unsigned int num_shards, shard_idx; + + for_each_possible_cpu(i) { + if (per_cpu(sd_llc_id, i) == i) { + shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i)); + + num_shards = per_cpu(sd_llc_size, i) / SHARED_RUNQ_SHARD_SZ; + if (per_cpu(sd_llc_size, i) % SHARED_RUNQ_SHARD_SZ) + num_shards++; + shared_runq->num_shards = num_shards; + } + } for_each_possible_cpu(i) { rq = cpu_rq(i); shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i)); + shard_idx = shared_runq_shard_idx(shared_runq, i); rq_lock(rq, &rf); rq->cfs.shared_runq = shared_runq; + rq->cfs.shard = &shared_runq->shards[shard_idx]; rq_unlock(rq, &rf); } } static void __shared_runq_drain(struct shared_runq *shared_runq) { - struct task_struct *p, *tmp; + unsigned int i; - raw_spin_lock(&shared_runq->lock); - list_for_each_entry_safe(p, tmp, &shared_runq->list, shared_runq_node) - list_del_init(&p->shared_runq_node); - raw_spin_unlock(&shared_runq->lock); + for (i = 0; i < shared_runq->num_shards; i++) { + struct shared_runq_shard *shard; + struct task_struct *p, *tmp; + + shard = &shared_runq->shards[i]; + raw_spin_lock(&shard->lock); + list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node) + list_del_init(&p->shared_runq_node); + raw_spin_unlock(&shard->lock); + } } static void update_domains_fair(void) @@ -237,41 +290,38 @@ void shared_runq_toggle(bool enabling) /* * Disable dequeue _after_ ensuring that all of the shared runqueues * are fully drained. Otherwise, a task could remain enqueued on a - * shared runqueue after the feature was disabled, and could exit - * before drain has completed. + * shard after the feature was disabled, and could exit before drain + * has completed. 
*/ static_branch_disable_cpuslocked(&__shared_runq_force_dequeue); } -static struct task_struct *shared_runq_pop_task(struct rq *rq) +static struct task_struct * +shared_runq_pop_task(struct shared_runq_shard *shard, int target) { struct task_struct *p; - struct shared_runq *shared_runq; - shared_runq = rq_shared_runq(rq); - if (list_empty(&shared_runq->list)) + if (list_empty(&shard->list)) return NULL; - raw_spin_lock(&shared_runq->lock); - p = list_first_entry_or_null(&shared_runq->list, struct task_struct, + raw_spin_lock(&shard->lock); + p = list_first_entry_or_null(&shard->list, struct task_struct, shared_runq_node); - if (p && is_cpu_allowed(p, cpu_of(rq))) + if (p && is_cpu_allowed(p, target)) list_del_init(&p->shared_runq_node); else p = NULL; - raw_spin_unlock(&shared_runq->lock); + raw_spin_unlock(&shard->lock); return p; } -static void shared_runq_push_task(struct rq *rq, struct task_struct *p) +static void shared_runq_push_task(struct shared_runq_shard *shard, + struct task_struct *p) { - struct shared_runq *shared_runq; - - shared_runq = rq_shared_runq(rq); - raw_spin_lock(&shared_runq->lock); - list_add_tail(&p->shared_runq_node, &shared_runq->list); - raw_spin_unlock(&shared_runq->lock); + raw_spin_lock(&shard->lock); + list_add_tail(&p->shared_runq_node, &shard->list); + raw_spin_unlock(&shard->lock); } static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p) @@ -288,7 +338,7 @@ static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p) rq->nr_running < 1) return; - shared_runq_push_task(rq, p); + shared_runq_push_task(rq_shared_runq_shard(rq), p); } static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) @@ -296,9 +346,22 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) struct task_struct *p = NULL; struct rq *src_rq; struct rq_flags src_rf; + struct shared_runq *shared_runq; + struct shared_runq_shard *shard; + u32 i, starting_idx, curr_idx, num_shards; int ret = 0, cpu; - p = shared_runq_pop_task(rq); + shared_runq = rq_shared_runq(rq); + num_shards = shared_runq->num_shards; + starting_idx = shared_runq_shard_idx(shared_runq, cpu_of(rq)); + for (i = 0; i < num_shards; i++) { + curr_idx = (starting_idx + i) % num_shards; + shard = &shared_runq->shards[curr_idx]; + + p = shared_runq_pop_task(shard, cpu_of(rq)); + if (p) + break; + } if (!p) return 0; @@ -332,8 +395,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) static void shared_runq_dequeue_task(struct task_struct *p) { - struct shared_runq *shared_runq; - /* * Always dequeue a task if: * - SHARED_RUNQ is enabled @@ -350,11 +411,13 @@ static void shared_runq_dequeue_task(struct task_struct *p) return; if (!list_empty(&p->shared_runq_node)) { - shared_runq = rq_shared_runq(task_rq(p)); - raw_spin_lock(&shared_runq->lock); + struct shared_runq_shard *shard; + + shard = rq_shared_runq_shard(task_rq(p)); + raw_spin_lock(&shard->lock); if (likely(!list_empty(&p->shared_runq_node))) list_del_init(&p->shared_runq_node); - raw_spin_unlock(&shared_runq->lock); + raw_spin_unlock(&shard->lock); } } @@ -13514,8 +13577,9 @@ void show_numa_stats(struct task_struct *p, struct seq_file *m) __init void init_sched_fair_class(void) { #ifdef CONFIG_SMP - int i; + int i, j; struct shared_runq *shared_runq; + struct shared_runq_shard *shard; for_each_possible_cpu(i) { zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i)); @@ -13528,8 +13592,11 @@ __init void init_sched_fair_class(void) 
INIT_LIST_HEAD(&cpu_rq(i)->cfsb_csd_list); #endif shared_runq = &per_cpu(shared_runqs, i); - INIT_LIST_HEAD(&shared_runq->list); - raw_spin_lock_init(&shared_runq->lock); + for (j = 0; j < SHARED_RUNQ_MAX_SHARDS; j++) { + shard = &shared_runq->shards[j]; + INIT_LIST_HEAD(&shard->list); + raw_spin_lock_init(&shard->lock); + } } open_softirq(SCHED_SOFTIRQ, run_rebalance_domains); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 79cbdb251ad5..4b4534f08d25 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -599,7 +599,8 @@ struct cfs_rq { #endif #ifdef CONFIG_SMP - struct shared_runq *shared_runq; + struct shared_runq *shared_runq; + struct shared_runq_shard *shard; /* * CFS load tracking */ From patchwork Tue Dec 12 00:31:41 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Vernet X-Patchwork-Id: 176996 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:bcd1:0:b0:403:3b70:6f57 with SMTP id r17csp7426610vqy; Mon, 11 Dec 2023 16:32:43 -0800 (PST) X-Google-Smtp-Source: AGHT+IHd3+jvgH6zkPgYMs9958LK/DXike/79Colbgr+zjHmbiaSbZ66YfhTFKLCaMi0YD8/RUq4 X-Received: by 2002:a05:6e02:1c4c:b0:35d:a62c:bbaa with SMTP id d12-20020a056e021c4c00b0035da62cbbaamr5501184ilg.41.1702341163672; Mon, 11 Dec 2023 16:32:43 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1702341163; cv=none; d=google.com; s=arc-20160816; b=TZQgx1QD0MVfB9I4/MHsdPHQmejUhRYVv6CQnt2FfAEwCAvYIkeMZxKwWiypwXpHHU Bljm7LMomvrnuHd/DBlXOQuDjS7CZPJ65blm92gw2KD7WQC4n0VXW5IycPq33wKRpv+j FLYsC0jWj4mW14bpuwcfGXUNZiVaC+KZfAGY89GuJw67c+xy4ACMLoYj565otfAa9uHN tn3ahPbUFyAcGIH7vTq2ZwJ422GmuwMtzRYAdD9EIiKxCmbtFKxGmgNDvWVNgdqRUSKL a2a9ha/z9HW+/iKlMISQZxBptLp5OkxaUysIKgRKUTK2aEcwrUUWTNrRw193iiwlkQpe C+gA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=6w/uO62RCM4lgIOFIJSX/lpUSqBhcJ+S2LdMTwfnN3k=; fh=M1Y4hH0c3aJh0Tk+r5qI6oW+pPzAHXWG1oPGqjHDYxM=; b=wfg018Kju6GVJH7aiXrlFB9af/HlWanhMnfzn+X6Pcq958sQcYit/c0UcHoGPMBiI/ KBcm+a6NQU+e4hA/Ov7NtCN3gnGfAEX4kkRjt0ngp00kt/Cy22K9FFIXHEf61SuE47PM H5F49GjVhbYa0GisAgqa2z0Z2ibquFK/65mryvmwNdpVdCmhLWTyZoIg5m1BawDWEjDr uIOCJxChGeVX7FQP9zmTFx7mt0HfeBihW/4wUSQDgvautjagoyfKtuTS9AJNObFKuglS VLITOOMopjkQHneKyy27Ae+ivNTJMK7hor36apXOkfdMdWtbEp5uZ19ymXPZcMqLvr9I nO7w== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from morse.vger.email (morse.vger.email. 
[23.128.96.31]) by mx.google.com with ESMTPS id m7-20020a635807000000b005c673949a8dsi6872683pgb.129.2023.12.11.16.32.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 11 Dec 2023 16:32:43 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) client-ip=23.128.96.31; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by morse.vger.email (Postfix) with ESMTP id 660DB80CF521; Mon, 11 Dec 2023 16:32:40 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at morse.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345595AbjLLAcN (ORCPT + 99 others); Mon, 11 Dec 2023 19:32:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41660 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345491AbjLLAb4 (ORCPT ); Mon, 11 Dec 2023 19:31:56 -0500 Received: from mail-io1-f54.google.com (mail-io1-f54.google.com [209.85.166.54]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F2A6CB8 for ; Mon, 11 Dec 2023 16:32:01 -0800 (PST) Received: by mail-io1-f54.google.com with SMTP id ca18e2360f4ac-7b70139d54cso172239139f.1 for ; Mon, 11 Dec 2023 16:32:01 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1702341121; x=1702945921; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=6w/uO62RCM4lgIOFIJSX/lpUSqBhcJ+S2LdMTwfnN3k=; b=DqajHofp4GA+X471DNBObNwVjGRJSdWSJu0rm2VmQLmSV0KQCp0fkbCrVWL+9eM8vv QC545sHxEAc8ajvuLRa1UVI1V95WISF3QuHVN/k85fZRdURdNJSW5J+mnTn9DhewfyU7 Xl0TEvGzob7L+Cz2jDvGiQQddHIXkDNsFvpfJsifPAE5GudcVyFsqjx/zqZKD6Uw11dQ C3nCltu2UcTOgK7quGZLpagUtoG0ZBd7Pyu+83HSCo9dvVJ1wqgHgFfB/rT3LdFUKYLW S0d48vG2nSrX+lFLvBJUFNpHicksUeuzYAefcVV3VJrg3OVC/6VO4I0aDa243N4//JDn 7F7g== X-Gm-Message-State: AOJu0YyPF4762vXXRlZobgzXHtwDcxdN+Ky5yy3FLYBC6kpbR7n4cqpc cRdtEJKwpOgUR3+bjujszjT8UWyeCKldn5oa X-Received: by 2002:a05:6e02:154b:b0:35d:6984:c3a with SMTP id j11-20020a056e02154b00b0035d69840c3amr4359210ilu.32.1702341120831; Mon, 11 Dec 2023 16:32:00 -0800 (PST) Received: from localhost (c-24-1-27-177.hsd1.il.comcast.net. 
[24.1.27.177]) by smtp.gmail.com with ESMTPSA id bq10-20020a056e02238a00b0035d4633cf5dsm2645674ilb.61.2023.12.11.16.32.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 11 Dec 2023 16:32:00 -0800 (PST) From: David Vernet To: linux-kernel@vger.kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, youssefesmat@google.com, joelaf@google.com, roman.gushchin@linux.dev, yu.c.chen@intel.com, kprateek.nayak@amd.com, gautham.shenoy@amd.com, aboorvad@linux.vnet.ibm.com, wuyun.abel@bytedance.com, tj@kernel.org, kernel-team@meta.com Subject: [PATCH v4 8/8] sched: Add selftest for SHARED_RUNQ Date: Mon, 11 Dec 2023 18:31:41 -0600 Message-ID: <20231212003141.216236-9-void@manifault.com> X-Mailer: git-send-email 2.42.1 In-Reply-To: <20231212003141.216236-1-void@manifault.com> References: <20231212003141.216236-1-void@manifault.com> MIME-Version: 1.0 X-Spam-Status: No, score=-0.8 required=5.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on morse.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (morse.vger.email [0.0.0.0]); Mon, 11 Dec 2023 16:32:40 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1785034087971235529 X-GMAIL-MSGID: 1785034087971235529 We want to ensure that SHARED_RUNQ works as expected. Let's add a testcase to the sched/ subdirectory containing SHARED_RUNQ which enables and disables it in a loop, while stressing the system with rcutorture. Cc: Aboorva Devarajan Signed-off-by: David Vernet --- tools/testing/selftests/sched/Makefile | 5 ++++- tools/testing/selftests/sched/config | 2 ++ tools/testing/selftests/sched/test-swq.sh | 23 +++++++++++++++++++++++ 3 files changed, 29 insertions(+), 1 deletion(-) create mode 100755 tools/testing/selftests/sched/test-swq.sh diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile index 099ee9213557..22f4941ff76b 100644 --- a/tools/testing/selftests/sched/Makefile +++ b/tools/testing/selftests/sched/Makefile @@ -9,6 +9,9 @@ CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -Wl,-rpath=./ \ LDLIBS += -lpthread TEST_GEN_FILES := cs_prctl_test -TEST_PROGS := cs_prctl_test +TEST_PROGS := \ + cs_prctl_test \ + test-srq.sh + include ../lib.mk diff --git a/tools/testing/selftests/sched/config b/tools/testing/selftests/sched/config index e8b09aa7c0c4..6e1cbdb6eec3 100644 --- a/tools/testing/selftests/sched/config +++ b/tools/testing/selftests/sched/config @@ -1 +1,3 @@ CONFIG_SCHED_DEBUG=y +CONFIG_DEBUG_KERNEL=y +CONFIG_RCU_TORTURE_TEST=m diff --git a/tools/testing/selftests/sched/test-swq.sh b/tools/testing/selftests/sched/test-swq.sh new file mode 100755 index 000000000000..547088840a6c --- /dev/null +++ b/tools/testing/selftests/sched/test-swq.sh @@ -0,0 +1,23 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2023 Meta, Inc + +echo "TEST: SHARED_RUNQ stress test ..." + +modprobe rcutorture + +for i in {1..10}; do + echo "Beginning iteration $i" + echo "SHARED_RUNQ" > /sys/kernel/debug/sched/features + sleep 2.3 + echo "NO_SHARED_RUNQ" > /sys/kernel/debug/sched/features + sleep .8 + echo "Completed iteration $i" + echo "" +done + +rmmod rcutorture + +echo "DONE: SHARED_RUNQ stress test completed" + +exit 0