From patchwork Mon Jul 10 20:03:36 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 118057
From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, gautham.shenoy@amd.com, kprateek.nayak@amd.com, aaron.lu@intel.com, clm@meta.com, tj@kernel.org, roman.gushchin@linux.dev, kernel-team@meta.com
Subject: [PATCH v2 1/7] sched: Expose move_queued_task() from core.c
Date: Mon, 10 Jul 2023 15:03:36 -0500
Message-Id: <20230710200342.358255-2-void@manifault.com>
In-Reply-To: <20230710200342.358255-1-void@manifault.com>
References: <20230710200342.358255-1-void@manifault.com>
The migrate_task_to() function exposed from kernel/sched/core.c migrates the
current task, which is silently assumed to also be its first argument, to the
specified CPU. The function uses stop_one_cpu() to migrate the task to the
target CPU, which won't work if @p is not the current task, as the
stop_one_cpu() callback isn't invoked on remote CPUs.

While this operation is useful for task_numa_migrate() in fair.c, it would be
useful if move_queued_task() in core.c were given external linkage, as it can
actually be used to migrate any task to a CPU.

A follow-on patch will call move_queued_task() from fair.c when migrating a
task in a shared runqueue to a remote CPU.

Suggested-by: Peter Zijlstra
Signed-off-by: David Vernet
---
 kernel/sched/core.c  | 4 ++--
 kernel/sched/sched.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7db597e8175..167cd9f11ed0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2493,8 +2493,8 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
  *
  * Returns (locked) new rq. Old rq's lock is released.
  */
-static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
-				   struct task_struct *p, int new_cpu)
+struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+			    struct task_struct *p, int new_cpu)
 {
 	lockdep_assert_rq_held(rq);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50d4b61aef3a..94846c947d6e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1759,6 +1759,9 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)

 #ifdef CONFIG_SMP
+
+extern struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+				   struct task_struct *p, int new_cpu);
 static inline void queue_balance_callback(struct rq *rq,
 					  struct balance_callback *head,
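[Editor's note: for clarity, here is a minimal, hypothetical sketch of what a
call site for the newly-exported helper could look like. migrate_queued_to()
and its checks are illustrative assumptions, not code from this series; the
real caller is added by a later patch. The point is the locking contract:
the caller must hold @rq's lock, and move_queued_task() returns the
destination CPU's rq locked, with @rq's lock released.]

	/*
	 * Illustrative only -- not part of this patch. Migrate a task that is
	 * queued on @rq, but not currently running, over to @new_cpu.
	 */
	static struct rq *migrate_queued_to(struct rq *rq, struct rq_flags *rf,
					    struct task_struct *p, int new_cpu)
	{
		lockdep_assert_rq_held(rq);

		/* Only queued, non-running tasks can be moved this way. */
		if (!task_on_rq_queued(p) || task_on_cpu(rq, p))
			return rq;

		/* The destination must at least be in the task's affinity mask. */
		if (!cpumask_test_cpu(new_cpu, p->cpus_ptr))
			return rq;

		/* Returns @new_cpu's rq, locked; @rq's lock is released. */
		return move_queued_task(rq, rf, p, new_cpu);
	}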
From patchwork Mon Jul 10 20:03:37 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 118056
From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, gautham.shenoy@amd.com, kprateek.nayak@amd.com, aaron.lu@intel.com, clm@meta.com, tj@kernel.org, roman.gushchin@linux.dev, kernel-team@meta.com
Subject: [PATCH v2 2/7] sched: Move is_cpu_allowed() into sched.h
Date: Mon, 10 Jul 2023 15:03:37 -0500
Message-Id: <20230710200342.358255-3-void@manifault.com>
In-Reply-To: <20230710200342.358255-1-void@manifault.com>
References: <20230710200342.358255-1-void@manifault.com>
is_cpu_allowed() exists as a static inline function in core.c. The
functionality offered by is_cpu_allowed() is useful to scheduling policies as
well, e.g. to determine whether a runnable task can be migrated to another
core that would otherwise go idle.

Let's move it to sched.h.

Signed-off-by: David Vernet
---
 kernel/sched/core.c  | 31 -------------------------------
 kernel/sched/sched.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 167cd9f11ed0..1451f5aa82ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -48,7 +48,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -2444,36 +2443,6 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
 	return rq->nr_pinned;
 }

-/*
- * Per-CPU kthreads are allowed to run on !active && online CPUs, see
- * __set_cpus_allowed_ptr() and select_fallback_rq().
- */
-static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
-{
-	/* When not in the task's cpumask, no point in looking further. */
-	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
-		return false;
-
-	/* migrate_disabled() must be allowed to finish. */
-	if (is_migration_disabled(p))
-		return cpu_online(cpu);
-
-	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu) && task_cpu_possible(cpu, p);
-
-	/* KTHREAD_IS_PER_CPU is always allowed. */
-	if (kthread_is_per_cpu(p))
-		return cpu_online(cpu);
-
-	/* Regular kernel threads don't get to stay during offline. */
-	if (cpu_dying(cpu))
-		return false;
-
-	/* But are allowed during online. */
-	return cpu_online(cpu);
-}
-
 /*
  * This is how migration works:
 *

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 94846c947d6e..187ad5da5ef6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1199,6 +1200,36 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }

+/*
+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
+ * __set_cpus_allowed_ptr() and select_fallback_rq().
+ */
+static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
+{
+	/* When not in the task's cpumask, no point in looking further. */
+	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+		return false;
+
+	/* migrate_disabled() must be allowed to finish. */
+	if (is_migration_disabled(p))
+		return cpu_online(cpu);
+
+	/* Non kernel threads are not allowed during either online or offline. */
+	if (!(p->flags & PF_KTHREAD))
+		return cpu_active(cpu) && task_cpu_possible(cpu, p);
+
+	/* KTHREAD_IS_PER_CPU is always allowed. */
+	if (kthread_is_per_cpu(p))
+		return cpu_online(cpu);
+
+	/* Regular kernel threads don't get to stay during offline. */
+	if (cpu_dying(cpu))
+		return false;
+
+	/* But are allowed during online. */
+	return cpu_online(cpu);
+}
+
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
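[Editor's note: as a rough illustration of why a scheduling policy would want
this helper, the hypothetical snippet below scans an LLC for a CPU that a
runnable task could legally be migrated to. The function name and the
iteration strategy are assumptions made for the example, not code from this
series.]

	/*
	 * Illustrative only -- not part of this patch. Find a CPU in @llc_mask
	 * that @p may run on, now that is_cpu_allowed() is visible to
	 * scheduling-policy code via sched.h.
	 */
	static int find_allowed_cpu(struct task_struct *p,
				    const struct cpumask *llc_mask)
	{
		int cpu;

		for_each_cpu_wrap(cpu, llc_mask, task_cpu(p) + 1) {
			/* Skip CPUs that are offline, dying, or not in p's cpumask. */
			if (!is_cpu_allowed(p, cpu))
				continue;
			if (available_idle_cpu(cpu))
				return cpu;
		}

		return -1;
	}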
From patchwork Mon Jul 10 20:03:38 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 118058
From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, gautham.shenoy@amd.com, kprateek.nayak@amd.com, aaron.lu@intel.com, clm@meta.com, tj@kernel.org, roman.gushchin@linux.dev, kernel-team@meta.com
Subject: [PATCH v2 3/7] sched: Check cpu_active() earlier in newidle_balance()
Date: Mon, 10 Jul 2023 15:03:38 -0500
Message-Id: <20230710200342.358255-4-void@manifault.com>
In-Reply-To: <20230710200342.358255-1-void@manifault.com>
References: <20230710200342.358255-1-void@manifault.com>
In newidle_balance(), we check if the current CPU is inactive, and then
decline to pull any remote tasks to the core if so. Before this check,
however, we're currently updating rq->idle_stamp. If a core is offline,
setting its idle stamp is not useful. The core won't be chosen by any task in
select_task_rq_fair(), and setting rq->idle_stamp is misleading anyway, given
that an inactive core should be expected to have a very cold cache.

Let's set rq->idle_stamp in newidle_balance() only if the CPU is active.

Signed-off-by: David Vernet
---
 kernel/sched/fair.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..6e882b7bf5b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11837,18 +11837,18 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (this_rq->ttwu_pending)
 		return 0;

-	/*
-	 * We must set idle_stamp _before_ calling idle_balance(), such that we
-	 * measure the duration of idle_balance() as idle time.
-	 */
-	this_rq->idle_stamp = rq_clock(this_rq);
-
 	/*
 	 * Do not pull tasks towards !active CPUs...
 	 */
 	if (!cpu_active(this_cpu))
 		return 0;

+	/*
+	 * We must set idle_stamp _before_ calling idle_balance(), such that we
+	 * measure the duration of idle_balance() as idle time.
+	 */
+	this_rq->idle_stamp = rq_clock(this_rq);
+
 	/*
 	 * This is OK, because current is on_cpu, which avoids it being picked
 	 * for load-balance and preemption/IRQs are still disabled avoiding
From patchwork Mon Jul 10 20:03:39 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 118054
From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, gautham.shenoy@amd.com, kprateek.nayak@amd.com, aaron.lu@intel.com, clm@meta.com, tj@kernel.org, roman.gushchin@linux.dev, kernel-team@meta.com
Subject: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
Date: Mon, 10 Jul 2023 15:03:39 -0500
Message-Id: <20230710200342.358255-5-void@manifault.com>
In-Reply-To: <20230710200342.358255-1-void@manifault.com>
References: <20230710200342.358255-1-void@manifault.com>
For certain workloads in CFS, CPU utilization is of the utmost importance. For
example, at Meta, our main web workload benefits from a 1 - 1.5% improvement
in RPS, and a 1 - 2% improvement in p99 latency, when CPU utilization is
pushed as high as possible. This is likely something that would be useful for
any workload with long slices, or for which avoiding migration is unlikely to
result in improved cache locality.

We will soon be enabling more aggressive load balancing via a new feature
called shared_runq, which places tasks into a FIFO queue on the wakeup path,
and then dequeues them in newidle_balance(). We don't want to enable the
feature by default, so this patch defines and declares a new scheduler feature
called SHARED_RUNQ, which is disabled by default.

In addition, we add some calls to empty / skeleton functions in the relevant
fair codepaths where shared_runq will be utilized. A set of future patches
will implement these functions, and enable shared_runq for both single and
multi socket / CCX architectures.

Note as well that in future patches, the location of these calls may change.
For example, if we end up enqueueing tasks in a shared runqueue any time they
become runnable, we'd move the calls from enqueue_task_fair() and
pick_next_task_fair() to __enqueue_entity() and __dequeue_entity()
respectively.

Originally-by: Roman Gushchin
Signed-off-by: David Vernet
---
 kernel/sched/fair.c     | 32 ++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e882b7bf5b4..f7967be7646c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -140,6 +140,18 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

 #ifdef CONFIG_SMP
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
+				     int enq_flags)
+{}
+
+static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
+{
+	return 0;
+}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
+
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
  */
@@ -162,6 +174,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: ~5%)
  */
 #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
+#else
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
+				     int enq_flags)
+{}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
 #endif

 #ifdef CONFIG_CFS_BANDWIDTH
@@ -6386,6 +6405,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_new)
 		update_overutilized_status(rq);

+	if (sched_feat(SHARED_RUNQ))
+		shared_runq_enqueue_task(rq, p, flags);
+
 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);

@@ -6467,6 +6489,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
+
+	if (sched_feat(SHARED_RUNQ))
+		shared_runq_dequeue_task(p);
 }

 #ifdef CONFIG_SMP
@@ -8173,6 +8198,9 @@ done: __maybe_unused;

 	update_misfit_status(p, rq);

+	if (sched_feat(SHARED_RUNQ))
+		shared_runq_dequeue_task(p);
+
 	return p;

 idle:
@@ -11843,6 +11871,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;

+	if (sched_feat(SHARED_RUNQ) && shared_runq_pick_next_task(this_rq, rf))
+		return -1;
+
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
@@ -12343,6 +12374,7 @@ static void attach_task_cfs_rq(struct task_struct *p)

 static void switched_from_fair(struct rq *rq, struct task_struct *p)
 {
+	shared_runq_dequeue_task(p);
 	detach_task_cfs_rq(p);
 }

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..cd5db1a24181 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,4 @@ SCHED_FEAT(LATENCY_WARN, false)

 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+SCHED_FEAT(SHARED_RUNQ, false)
From patchwork Mon Jul 10 20:03:40 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 118061
From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, gautham.shenoy@amd.com, kprateek.nayak@amd.com, aaron.lu@intel.com, clm@meta.com, tj@kernel.org, roman.gushchin@linux.dev, kernel-team@meta.com
Subject: [PATCH v2 5/7] sched: Implement shared runqueue in CFS
Date: Mon, 10 Jul 2023 15:03:40 -0500
Message-Id: <20230710200342.358255-6-void@manifault.com>
In-Reply-To: <20230710200342.358255-1-void@manifault.com>
References: <20230710200342.358255-1-void@manifault.com>
Overview
========

The scheduler must constantly strike a balance between work conservation, and
avoiding costly migrations which harm performance due to e.g. decreased cache
locality. The matter is further complicated by the topology of the system.
Migrating a task between cores on the same LLC may be more optimal than
keeping a task local to the CPU, whereas migrating a task between LLCs or NUMA
nodes may tip the balance in the other direction.

With that in mind, while CFS is by and large a work conserving scheduler,
there are certain instances where the scheduler will choose to keep a task
local to a CPU when it would have been more optimal to migrate it to an idle
core.

An example of such a workload is the HHVM / web workload at Meta. HHVM is a VM
that JITs Hack and PHP code in service of web requests. Like other JIT /
compilation workloads, it tends to be heavily CPU bound and to exhibit
generally poor cache locality. To try and address this, we set several debugfs
(/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and to keep tasks running for
relatively long time slices so as to avoid the overhead of context switching
(the other knobs). Collectively, these knobs provide a substantial performance
win, resulting in roughly a 20% improvement in throughput. Worth noting,
however, is that this improvement is _not_ at full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going idle
even when the host was overcommitted. In response, we wrote the "shared
runqueue" (shared_runq) feature proposed in this patch set. The idea behind
shared_runq is simple: it enables the scheduler to be more aggressively work
conserving by placing a waking task into a sharded per-LLC FIFO queue which
can then be pulled from by another core in the LLC before it goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement in
throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to the
wins from the debugfs knobs mentioned above, and to other benchmarks outlined
below in the Results section.

Design
======

Note that the design described here reflects sharding, which will be added in
a subsequent patch. The design is described that way in this commit summary as
the benchmarks described in the results section below all include sharded
shared_runq. The patches are not combined into one to ease the burden of
review.

The design of shared_runq is quite simple.
A shared_runq is simply a list of struct shared_runq_shard objects, each of
which is itself simply a struct list_head of tasks, and a spinlock:

struct shared_runq_shard {
	struct list_head list;
	spinlock_t lock;
} ____cacheline_aligned;

struct shared_runq {
	u32 num_shards;
	struct shared_runq_shard shards[];
} ____cacheline_aligned;

We create a struct shared_runq per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs, and we
create a number of struct shared_runq_shard objects that are housed there.

When a task first wakes up, it enqueues itself in the shared_runq_shard of its
current LLC at the end of enqueue_task_fair(). Enqueues only happen if the
task was not manually migrated to the current core by select_task_rq(), and is
not pinned to a specific CPU.

A core will pull a task from the shards in its LLC's shared_runq at the
beginning of newidle_balance().
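[Editor's note: to make the shard operations concrete, below is a minimal,
illustrative sketch of the push/pull primitives implied by the design above.
It is not the code from this patch (the real implementation appears later in
the diff); the helper names, and the assumption that a shard is selected
elsewhere for the current CPU, are made up for the example. It only relies on
fields this series actually adds, namely p->shared_runq_node and the shard's
list and lock.]

	/* Illustrative sketch only -- not the patch's implementation. */
	static void shared_runq_shard_push(struct shared_runq_shard *shard,
					   struct task_struct *p)
	{
		unsigned long flags;

		spin_lock_irqsave(&shard->lock, flags);
		list_add_tail(&p->shared_runq_node, &shard->list);
		spin_unlock_irqrestore(&shard->lock, flags);
	}

	static struct task_struct *
	shared_runq_shard_pop(struct shared_runq_shard *shard)
	{
		struct task_struct *p = NULL;
		unsigned long flags;

		spin_lock_irqsave(&shard->lock, flags);
		if (!list_empty(&shard->list)) {
			/* FIFO order: pull the oldest waking task first. */
			p = list_first_entry(&shard->list, struct task_struct,
					     shared_runq_node);
			list_del_init(&p->shared_runq_node);
		}
		spin_unlock_irqrestore(&shard->lock, flags);

		return p;
	}

[How the real enqueue path disables IRQs, and how a shard is chosen for a
given CPU, is determined by the implementation later in this patch.]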
Difference between shared_runq and SIS_NODE
===========================================

In [0] Peter proposed a patch that addresses Tejun's observations that, when
workqueues are targeted towards a specific LLC on his Zen2 machine with small
CCXs, there would be significant idle time due to select_idle_sibling() not
considering anything outside of the current LLC.

This patch (SIS_NODE) is essentially the complement to the proposal here.
SIS_NODE causes waking tasks to look for idle cores in neighboring LLCs on the
same die, whereas shared_runq causes cores about to go idle to look for
enqueued tasks. That said, in their current forms the two features operate at
different scopes, as SIS_NODE searches for idle cores between LLCs, while
shared_runq enqueues tasks within a single LLC.

The patch was since removed in [1], and we compared the results to shared_runq
(previously called "swqueue") in [2]. SIS_NODE did not outperform shared_runq
on any of the benchmarks, so we elect not to compare against it again for this
v2 patch set.

[0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
[1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
[2]: https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/

Results
=======

Note that the motivation for the shared runqueue feature was originally
arrived at using experiments in the sched_ext framework that's currently being
proposed upstream. The ~1 - 1.6% improvement in HHVM throughput is similarly
visible using work-conserving sched_ext schedulers (even very simple ones like
global FIFO).

In both single and multi socket / CCX hosts, this can measurably improve
performance. In addition to the performance gains observed on our internal web
workloads, we also observed an improvement in common workloads such as kernel
compile and hackbench, when running shared runqueue.

On the other hand, some workloads suffer from shared_runq: namely, workloads
that hammer the runqueue hard, such as netperf UDP_RR, or schbench -L -m 52 -p
512 -r 10 -t 1. This can be mitigated somewhat by sharding the shared
datastructures within a CCX, but it doesn't seem to eliminate all contention
in every scenario. On the positive side, it seems that sharding does not
materially harm the benchmarks run for this patch series; and in fact seems to
improve some workloads such as kernel compile.

Note that for the kernel compile workloads below, the compilation was done by
running make -j$(nproc) built-in.a on several different types of hosts
configured with make allyesconfig on commit a27648c74210 ("afs: Fix setting of
mtime when creating a file/dir/symlink") on Linus' tree (boost and turbo were
disabled on all of these hosts when the experiments were performed).

Additionally, NO_SHARED_RUNQ refers to SHARED_RUNQ being completely disabled,
SHARED_RUNQ_WAKEUPS refers to sharded SHARED_RUNQ where tasks are enqueued in
the shared runqueue at wakeup time, and SHARED_RUNQ_ALL refers to sharded
SHARED_RUNQ where tasks are enqueued in the shared runqueue on every enqueue.
Results are not included for unsharded shared runqueue, as the results here
exceed the unsharded results already outlined in [2] as linked above.

=== Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000

Command: make -j$(nproc) built-in.a

                     o____________o_______o
                     |    mean    |  CPU  |
                     o------------o-------o
NO_SHARED_RUNQ:      |  582.46s   | 3101% |
SHARED_RUNQ_WAKEUPS: |  581.22s   | 3117% |
SHARED_RUNQ_ALL:     |  578.41s   | 3141% |
                     o------------o-------o

Takeaway: SHARED_RUNQ_WAKEUPS performs roughly the same as NO_SHARED_RUNQ, but
SHARED_RUNQ_ALL results in a statistically significant ~.7% improvement over
NO_SHARED_RUNQ. This suggests that enqueuing tasks in the shared runqueue on
every enqueue improves work conservation, and thanks to sharding, does not
result in contention.

Note that I didn't collect data for kernel compile with SHARED_RUNQ_ALL
_without_ sharding. The reason for this is that we know that CPUs with
sufficiently large LLCs will contend, so if we've decided to accommodate those
CPUs with sharding, there's not much point in measuring the results of not
sharding on CPUs that we know won't contend.

Command: hackbench --loops 10000

                     o____________o_______o
                     |    mean    |  CPU  |
                     o------------o-------o
NO_SHARED_RUNQ:      |  2.1912s   | 3117% |
SHARED_RUNQ_WAKEUP:  |  2.1080s   | 3155% |
SHARED_RUNQ_ALL:     |  1.9830s   | 3144% |
                     o------------o-------o

Takeaway: SHARED_RUNQ in both forms performs exceptionally well compared to
NO_SHARED_RUNQ here, with SHARED_RUNQ_ALL beating NO_SHARED_RUNQ by almost
10%. This was a surprising result, given that it would seem advantageous to
err on the side of avoiding migration in hackbench (tasks are short lived,
sending only 10k bytes worth of messages), but the results of the benchmark
suggest that minimizing runqueue delays is preferable.

Command: for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l $runtime & done

                     o_______________________o
                     |   mean  (throughput)  |
                     o-----------------------o
NO_SHARED_RUNQ:      |        25064.12       |
SHARED_RUNQ_WAKEUP:  |        24862.16       |
SHARED_RUNQ_ALL:     |        25287.73       |
                     o-----------------------o

Takeaway: No statistical significance, though it is worth noting that there is
no regression for shared runqueue on the 7950X, while there is a small
regression on the Skylake and Milan hosts for SHARED_RUNQ_WAKEUP as described
below.

=== Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===

CPU max MHz: 1601.0000
CPU min MHz: 800.0000

Command: make -j$(nproc) built-in.a

                     o____________o_______o
                     |    mean    |  CPU  |
                     o------------o-------o
NO_SHARED_RUNQ:      |  1535.46s  | 3417% |
SHARED_RUNQ_WAKEUP:  |  1534.56s  | 3428% |
SHARED_RUNQ_ALL:     |  1531.95s  | 3429% |
                     o------------o-------o

Takeaway: SHARED_RUNQ_ALL results in a ~.23% improvement over NO_SHARED_RUNQ.
Not a huge improvement, but consistently measurable.
The cause of this gain is presumably the same as the 7950X: improved work
conservation, with sharding preventing excessive contention on the shard lock.

Command: hackbench --loops 10000

                     o____________o_______o
                     |    mean    |  CPU  |
                     o------------o-------o
NO_SHARED_RUNQ:      |  5.5750s   | 3369% |
SHARED_RUNQ_WAKEUP:  |  5.5764s   | 3495% |
SHARED_RUNQ_ALL:     |  5.4760s   | 3481% |
                     o------------o-------o

Takeaway: SHARED_RUNQ_ALL results in a ~1.6% improvement over NO_SHARED_RUNQ.
Also statistically significant, but smaller than the almost 10% improvement
observed on the 7950X.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
         for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l $runtime & done

                     o______________________o
                     |   mean (throughput)  |
                     o----------------------o
NO_SHARED_RUNQ:      |       11963.08       |
SHARED_RUNQ_WAKEUP:  |       11943.60       |
SHARED_RUNQ_ALL:     |       11554.32       |
                     o----------------------o

Takeaway: NO_SHARED_RUNQ performs the same as SHARED_RUNQ_WAKEUP, but beats
SHARED_RUNQ_ALL by ~3.4%. This result makes sense -- the workload is very
heavy on the runqueue, so enqueuing tasks in the shared runqueue in
__enqueue_entity() would intuitively result in increased contention on the
shard lock. The fact that we're at parity with SHARED_RUNQ_WAKEUP suggests
that sharding the shared runqueue has significantly improved the contention
that was observed in v1, but that __enqueue_entity() puts it over the edge.

NOTE: Parity for SHARED_RUNQ_WAKEUP relies on choosing the correct shard size.
If we chose, for example, a shard size of 16, there would still be a
regression between NO_SHARED_RUNQ and SHARED_RUNQ_WAKEUP. As described below,
this suggests that we may want to add a debugfs tunable for the shard size.

=== Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===

CPU max MHz: 700.0000
CPU min MHz: 700.0000

Command: make -j$(nproc) built-in.a

                     o____________o_______o
                     |    mean    |  CPU  |
                     o------------o-------o
NO_SHARED_RUNQ:      |  1601.81s  | 6476% |
SHARED_RUNQ_WAKEUP:  |  1602.55s  | 6472% |
SHARED_RUNQ_ALL:     |  1602.49s  | 6475% |
                     o------------o-------o

Takeaway: No statistically significant variance. It might be worth
experimenting with work stealing in a follow-on patch set.

Command: hackbench --loops 10000

                     o____________o_______o
                     |    mean    |  CPU  |
                     o------------o-------o
NO_SHARED_RUNQ:      |  5.2672s   | 6463% |
SHARED_RUNQ_WAKEUP:  |  5.1476s   | 6583% |
SHARED_RUNQ_ALL:     |  5.1003s   | 6598% |
                     o------------o-------o

Takeaway: SHARED_RUNQ_ALL again wins, by about 3% over NO_SHARED_RUNQ in this
case.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
         for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l $runtime & done

                     o_______________________o
                     |   mean  (throughput)  |
                     o-----------------------o
NO_SHARED_RUNQ:      |        13819.08       |
SHARED_RUNQ_WAKEUP:  |        13907.74       |
SHARED_RUNQ_ALL:     |        13569.69       |
                     o-----------------------o

Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
SHARED_RUNQ_ALL, though by a slightly lower margin of ~1.8%.

Finally, let's look at how sharding affects the following schbench incantation
suggested by Chris in [3]:

schbench -L -m 52 -p 512 -r 10 -t 1

[3]: https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

The TL;DR is that sharding improves things a lot, but doesn't completely fix
the problem.
Here are the results from running the schbench command on the 18 core / 36
thread single CCX, single-socket Skylake:

--------------------------------------------------------------------------------------------------------------------------------------------------------
class name     con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
--------------------------------------------------------------------------------------------------------------------------------------------------------
&shard->lock:     31510503     31510711          0.08         19.98    168932319.64          5.36     31700383      31843851          0.03         17.50     10273968.33          0.32
------------
&shard->lock      15731657  [<0000000068c0fd75>]  pick_next_task_fair+0x4dd/0x510
&shard->lock      15756516  [<000000001faf84f9>]  enqueue_task_fair+0x459/0x530
&shard->lock         21766  [<00000000126ec6ab>]  newidle_balance+0x45a/0x650
&shard->lock           772  [<000000002886c365>]  dequeue_task_fair+0x4c9/0x540
------------
&shard->lock         23458  [<00000000126ec6ab>]  newidle_balance+0x45a/0x650
&shard->lock      16505108  [<000000001faf84f9>]  enqueue_task_fair+0x459/0x530
&shard->lock      14981310  [<0000000068c0fd75>]  pick_next_task_fair+0x4dd/0x510
&shard->lock           835  [<000000002886c365>]  dequeue_task_fair+0x4c9/0x540

These results are when we create only 3 shards (16 logical cores per shard),
so the contention may be a result of overly-coarse sharding. If we run the
schbench incantation with no sharding whatsoever, we see the following
significantly worse lock stats contention:

--------------------------------------------------------------------------------------------------------------------------------------------------------
class name     con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
--------------------------------------------------------------------------------------------------------------------------------------------------------
&shard->lock:    117868635    118361486          0.09        393.01   1250954097.25         10.57    119345882     119780601          0.05        343.35     38313419.51          0.32
------------
&shard->lock      59169196  [<0000000060507011>]  __enqueue_entity+0xdc/0x110
&shard->lock      59084239  [<00000000f1c67316>]  __dequeue_entity+0x78/0xa0
&shard->lock        108051  [<00000000084a6193>]  newidle_balance+0x45a/0x650
------------
&shard->lock      60028355  [<0000000060507011>]  __enqueue_entity+0xdc/0x110
&shard->lock        119882  [<00000000084a6193>]  newidle_balance+0x45a/0x650
&shard->lock      58213249  [<00000000f1c67316>]  __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all. This roughly matches
the fact that we had 3 shards on the first workload run above.
If we make the shards even smaller, the contention is comparatively much
lower:

--------------------------------------------------------------------------------------------------------------------------------------------------------
class name     con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
--------------------------------------------------------------------------------------------------------------------------------------------------------
&shard->lock:     13839849     13877596          0.08         13.23      5389564.95          0.39     46910241      48069307          0.06         16.40     16534469.35          0.34
------------
&shard->lock          3559  [<00000000ea455dcc>]  newidle_balance+0x45a/0x650
&shard->lock       6992418  [<000000002266f400>]  __dequeue_entity+0x78/0xa0
&shard->lock       6881619  [<000000002a62f2e0>]  __enqueue_entity+0xdc/0x110
------------
&shard->lock       6640140  [<000000002266f400>]  __dequeue_entity+0x78/0xa0
&shard->lock          3523  [<00000000ea455dcc>]  newidle_balance+0x45a/0x650
&shard->lock       7233933  [<000000002a62f2e0>]  __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench
benchmark on Milan as well, but we contend more on the rq lock than the shard
lock:

--------------------------------------------------------------------------------------------------------------------------------------------------------
class name     con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
--------------------------------------------------------------------------------------------------------------------------------------------------------
&rq->__lock:       9617614      9656091          0.10         79.64     69665812.00          7.21     18092700      67652829          0.11         82.38    344524858.87          5.09
-----------
&rq->__lock        6301611  [<000000003e63bf26>]  task_rq_lock+0x43/0xe0
&rq->__lock        2530807  [<00000000516703f0>]  __schedule+0x72/0xaa0
&rq->__lock         109360  [<0000000011be1562>]  raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         178218  [<00000000c38a30f9>]  sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock        3245506  [<00000000516703f0>]  __schedule+0x72/0xaa0
&rq->__lock        1294355  [<00000000c38a30f9>]  sched_ttwu_pending+0x3d/0x170
&rq->__lock        2837804  [<000000003e63bf26>]  task_rq_lock+0x43/0xe0
&rq->__lock        1627866  [<0000000011be1562>]  raw_spin_rq_lock_nested+0xa/0x10
..........................................................................................
&shard->lock:      7338558      7343244          0.10         35.97      7173949.14          0.98     30200858      32679623          0.08         35.59     16270584.52          0.50
------------
&shard->lock       2004142  [<00000000f8aa2c91>]  __dequeue_entity+0x78/0xa0
&shard->lock       2611264  [<00000000473978cc>]  newidle_balance+0x45a/0x650
&shard->lock       2727838  [<0000000028f55bb5>]  __enqueue_entity+0xdc/0x110
------------
&shard->lock       2737232  [<00000000473978cc>]  newidle_balance+0x45a/0x650
&shard->lock       1693341  [<00000000f8aa2c91>]  __dequeue_entity+0x78/0xa0
&shard->lock       2912671  [<0000000028f55bb5>]  __enqueue_entity+0xdc/0x110
..........................................................................................
If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still
contends the most, but significantly less than with SHARED_RUNQ enabled:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
       class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
      &rq->__lock:         791277         791690           0.12         110.54     4889787.63           6.18        1575996       62390275           0.13         112.66   316262440.56           5.07
      -----------
      &rq->__lock         263343   [<00000000516703f0>] __schedule+0x72/0xaa0
      &rq->__lock          19394   [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
      &rq->__lock           4143   [<000000003b542e83>] __task_rq_lock+0x51/0xf0
      &rq->__lock          51094   [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
      -----------
      &rq->__lock          23756   [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
      &rq->__lock         379048   [<00000000516703f0>] __schedule+0x72/0xaa0
      &rq->__lock            677   [<000000003b542e83>] __task_rq_lock+0x51/0xf0

Worth noting is that increasing the granularity of the shards generally
improves very runqueue-heavy workloads such as netperf UDP_RR and this
schbench command, but it doesn't necessarily make a big difference for every
workload, or for sufficiently small CCXs such as the 7950X. It may make sense
to eventually allow users to control this with a debugfs knob, but for now
we'll choose a default that resulted in good performance for the benchmarks
run for this patch series.

Conclusion
==========

shared_runq in this form provides statistically significant wins for several
types of workloads, and various CPU topologies. The reason for this is roughly
the same for all workloads: shared_runq encourages work conservation inside of
a CCX by having a CPU do an O(# per-LLC shards) iteration over the shared_runq
shards in an LLC. We could similarly do an O(n) iteration over all of the
runqueues in the current LLC when a core is going idle, but that's quite
costly (especially for larger LLCs), and a sharded shared_runq seems to
provide a performant middle ground between doing that O(n) walk, and having
every CPU in the LLC hammer a single shared queue.

For the workloads above, kernel compile and hackbench were clear winners for
shared_runq (especially in __enqueue_entity()). The reason for the improvement
in kernel compile is of course that we have a heavily CPU-bound workload where
cache locality doesn't mean much; getting a CPU is the #1 goal. As mentioned
above, while I didn't expect to see an improvement in hackbench, the results
of the benchmark suggest that minimizing runqueue delays is preferable to
optimizing for L1/L2 locality.

Not all workloads benefit from shared_runq, however. Workloads that hammer the
runqueue hard, such as netperf UDP_RR, or schbench -L -m 52 -p 512 -r 10 -t 1,
tend to run into contention on the shard locks; especially when enqueuing
tasks in __enqueue_entity(). This can be mitigated significantly by sharding
the shared data structures within a CCX, but it doesn't eliminate all
contention, as described above.

Worth noting as well is that Gautham Shenoy ran some interesting experiments
on a few more ideas in [4], such as walking the shared_runq on the pop path
until a task is found that can be migrated to the calling CPU.
I didn't run those experiments in this patch set, but it might be worth doing so. [4]: https://lore.kernel.org/lkml/ZJkqeXkPJMTl49GB@BLR-5CG11610CF.amd.com/ Finally, while shared_runq in this form encourages work conservation, it of course does not guarantee it given that we don't implement any kind of work stealing between shared_runqs. In the future, we could potentially push CPU utilization even higher by enabling work stealing between shared_runqs, likely between CCXs on the same NUMA node. Originally-by: Roman Gushchin Signed-off-by: David Vernet --- include/linux/sched.h | 2 + kernel/sched/core.c | 2 + kernel/sched/fair.c | 182 +++++++++++++++++++++++++++++++++++++++++- kernel/sched/sched.h | 2 + 4 files changed, 185 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 1292d38d66cc..5c05a3da3d50 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -770,6 +770,8 @@ struct task_struct { unsigned long wakee_flip_decay_ts; struct task_struct *last_wakee; + struct list_head shared_runq_node; + /* * recent_used_cpu is initially set as the last CPU used by a task * that wakes affine another task. Waker/wakee relationships can diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1451f5aa82ac..3ad437d4ea3d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4503,6 +4503,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) #ifdef CONFIG_SMP p->wake_entry.u_flags = CSD_TYPE_TTWU; p->migration_pending = NULL; + INIT_LIST_HEAD(&p->shared_runq_node); #endif init_sched_mm_cid(p); } @@ -9842,6 +9843,7 @@ void __init sched_init_smp(void) init_sched_rt_class(); init_sched_dl_class(); + init_sched_fair_class_late(); sched_smp_initialized = true; } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f7967be7646c..ff2491387201 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -139,18 +139,163 @@ static int __init setup_sched_thermal_decay_shift(char *str) } __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); +/** + * struct shared_runq - Per-LLC queue structure for enqueuing and pulling + * waking tasks. + * + * WHAT + * ==== + * + * This structure enables the scheduler to be more aggressively work + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be + * pulled from when another core in the LLC is going to go idle. + * + * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq. + * Waking tasks are enqueued in a shared_runq at the end of + * enqueue_task_fair(), and are opportunistically pulled from the shared_runq + * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior + * to being pulled from the shared_runq, in which case they're simply dequeued + * from the shared_runq. A waking task is only enqueued to a shared_runq when + * it was _not_ manually migrated to the current runqueue by + * select_task_rq_fair(). + * + * There is currently no task-stealing between shared_runqs in different LLCs, + * which means that shared_runq is not fully work conserving. This could be + * added at a later time, with tasks likely only being stolen across + * shared_runqs on the same NUMA node to avoid violating NUMA affinities. + * + * HOW + * === + * + * An shared_runq is comprised of a list, and a spinlock for synchronization. 
+ * Given that the critical section for a shared_runq is typically a fast list + * operation, and that the shared_runq is localized to a single LLC, the + * spinlock will typically only be contended on workloads that do little else + * other than hammer the runqueue. + * + * WHY + * === + * + * As mentioned above, the main benefit of shared_runq is that it enables more + * aggressive work conservation in the scheduler. This can benefit workloads + * that benefit more from CPU utilization than from L1/L2 cache locality. + * + * shared_runqs are segmented across LLCs both to avoid contention on the + * shared_runq spinlock by minimizing the number of CPUs that could contend on + * it, as well as to strike a balance between work conservation, and L3 cache + * locality. + */ +struct shared_runq { + struct list_head list; + spinlock_t lock; +} ____cacheline_aligned; + #ifdef CONFIG_SMP +static struct shared_runq *rq_shared_runq(struct rq *rq) +{ + return rq->cfs.shared_runq; +} + +static struct task_struct *shared_runq_pop_task(struct rq *rq) +{ + unsigned long flags; + struct task_struct *p; + struct shared_runq *shared_runq; + + shared_runq = rq_shared_runq(rq); + if (list_empty(&shared_runq->list)) + return NULL; + + spin_lock_irqsave(&shared_runq->lock, flags); + p = list_first_entry_or_null(&shared_runq->list, struct task_struct, + shared_runq_node); + if (p && is_cpu_allowed(p, cpu_of(rq))) + list_del_init(&p->shared_runq_node); + else + p = NULL; + spin_unlock_irqrestore(&shared_runq->lock, flags); + + return p; +} + +static void shared_runq_push_task(struct rq *rq, struct task_struct *p) +{ + unsigned long flags; + struct shared_runq *shared_runq; + + shared_runq = rq_shared_runq(rq); + spin_lock_irqsave(&shared_runq->lock, flags); + list_add_tail(&p->shared_runq_node, &shared_runq->list); + spin_unlock_irqrestore(&shared_runq->lock, flags); +} + static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p, int enq_flags) -{} +{ + bool task_migrated = enq_flags & ENQUEUE_MIGRATED; + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; + + /* + * Only enqueue the task in the shared runqueue if: + * + * - SWQUEUE is enabled + * - The task is on the wakeup path + * - The task wasn't purposefully migrated to the current rq by + * select_task_rq() + * - The task isn't pinned to a specific CPU + */ + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) + return; + + shared_runq_push_task(rq, p); +} static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) { - return 0; + struct task_struct *p = NULL; + struct rq *src_rq; + struct rq_flags src_rf; + int ret; + + p = shared_runq_pop_task(rq); + if (!p) + return 0; + + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + + src_rq = task_rq_lock(p, &src_rf); + + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) { + update_rq_clock(src_rq); + src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq)); + } + + if (src_rq->cpu != rq->cpu) + ret = 1; + else + ret = -1; + + task_rq_unlock(src_rq, p, &src_rf); + + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); + + return ret; } static void shared_runq_dequeue_task(struct task_struct *p) -{} +{ + unsigned long flags; + struct shared_runq *shared_runq; + + if (!list_empty(&p->shared_runq_node)) { + shared_runq = rq_shared_runq(task_rq(p)); + spin_lock_irqsave(&shared_runq->lock, flags); + list_del_init(&p->shared_runq_node); + spin_unlock_irqrestore(&shared_runq->lock, flags); + } +} /* * For asym packing, by default the lower numbered CPU has higher priority. 
@@ -12854,3 +12999,34 @@ __init void init_sched_fair_class(void) #endif /* SMP */ } + +__init void init_sched_fair_class_late(void) +{ +#ifdef CONFIG_SMP + int i; + struct shared_runq *shared_runq; + struct rq *rq; + struct rq *llc_rq; + + for_each_possible_cpu(i) { + if (per_cpu(sd_llc_id, i) == i) { + llc_rq = cpu_rq(i); + + shared_runq = kzalloc_node(sizeof(struct shared_runq), + GFP_KERNEL, cpu_to_node(i)); + INIT_LIST_HEAD(&shared_runq->list); + spin_lock_init(&shared_runq->lock); + llc_rq->cfs.shared_runq = shared_runq; + } + } + + for_each_possible_cpu(i) { + rq = cpu_rq(i); + llc_rq = cpu_rq(per_cpu(sd_llc_id, i)); + + if (rq == llc_rq) + continue; + rq->cfs.shared_runq = llc_rq->cfs.shared_runq; + } +#endif /* SMP */ +} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 187ad5da5ef6..8b573dfaba33 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -576,6 +576,7 @@ struct cfs_rq { #endif #ifdef CONFIG_SMP + struct shared_runq *shared_runq; /* * CFS load tracking */ @@ -2440,6 +2441,7 @@ extern void update_max_interval(void); extern void init_sched_dl_class(void); extern void init_sched_rt_class(void); extern void init_sched_fair_class(void); +extern void init_sched_fair_class_late(void); extern void reweight_task(struct task_struct *p, int prio); From patchwork Mon Jul 10 20:03:41 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Vernet X-Patchwork-Id: 118055 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:a6b2:0:b0:3e4:2afc:c1 with SMTP id c18csp45306vqm; Mon, 10 Jul 2023 13:08:15 -0700 (PDT) X-Google-Smtp-Source: APBJJlHc0On9mYQ2eI1LIVr4yOCDFLiFYPvQ5lJJhvXO2EK7tBS61BDaVhkMQvfdVznV3IJGrRtl X-Received: by 2002:a50:fa85:0:b0:51e:ebd:9f5b with SMTP id w5-20020a50fa85000000b0051e0ebd9f5bmr12531247edr.36.1689019695195; Mon, 10 Jul 2023 13:08:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1689019695; cv=none; d=google.com; s=arc-20160816; b=DHN3T6B27Xq2RwZ3kO6soyxlXv0IhdRZkfaCLEDUKW15/aiZnaDa8/VRit/wJDI2h3 X9zbS+qz/yDQpf730QruvYvWRSml1j/ADXxsiovK+lHvGqzFONwM5X9xpuapOfTAruKW CFhB1hl9tzhE2qSWuUgUW0XSrOpRZzVfnjgvqUzNKhNr3Eh6IiC3ecWDg6hCjoB/VIdd dLZGQHknaPEtUFXjXlais3rITK0SNbXqa81wImjvxRTErmDOev1RaSzdwOgmUei3Nxkt Jm5SPAk87lKOvnFnfqD80nREsfTwl9LtDmoPKOvm8i+G3/xR/zg7sSLxJlXdb6/2MDH5 RrRg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=nF2VH8kNHAb0EtloBoQBFvgN17tlh7Ri+RQTdgHw3fI=; fh=GCF3HUaVVQnsrFP0NhazYIuTwTFmsBL5+Yby9S9U12k=; b=cKjLMVfKNw1t7ZmCuqNIsQb9Wc2+aqeSyZpPx2P6bet/Xd2aXT+k5PO+prbliOjzj7 4HhoU484Z4SCMmGh4cYtbcqeYKVQTBrd2unEd8tIMGRyP75pHe7law/k6vj1aXegtixR Bx6HLNY4PGL4L4lY4guMuuhFuXnMrTgmjUNLLV/TCLxSZ9cpTFLYq5GrWcBQBl+JRjE6 yUJEjQ5WXsVjzmWogrSEzLXAq+BjMdb4FsevtjTp6/BtgTG5xaYe2Qlt8xKKZyrsu27r BskEZSClfH1PpxPKzmld2hqK4TqeYwN411Y9PyEWUPl12Dgm1JxkFANZRXkGY14IgDxA koAQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (out1.vger.email. 
[2620:137:e000::1:20]) by mx.google.com with ESMTP id ca26-20020aa7cd7a000000b0051e1a2770a3si285750edb.99.2023.07.10.13.07.51; Mon, 10 Jul 2023 13:08:15 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231811AbjGJUE2 (ORCPT + 99 others); Mon, 10 Jul 2023 16:04:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57226 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231715AbjGJUEX (ORCPT ); Mon, 10 Jul 2023 16:04:23 -0400 Received: from mail-qt1-f179.google.com (mail-qt1-f179.google.com [209.85.160.179]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 22BF61A7 for ; Mon, 10 Jul 2023 13:04:09 -0700 (PDT) Received: by mail-qt1-f179.google.com with SMTP id d75a77b69052e-400a39d4ffcso22557911cf.0 for ; Mon, 10 Jul 2023 13:04:09 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689019449; x=1691611449; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=nF2VH8kNHAb0EtloBoQBFvgN17tlh7Ri+RQTdgHw3fI=; b=d4Em6qLQa3BH4LlpvqmTtzM8BtlGKmeCreQAk4IBGPJ2in5x3GArlNHonRslc/NaC2 jvojJTwZ1kgqieJGc9SYphVpN6T63RsE6hssq+mn4GIt44oZRDKMu5BfLoVR4we9ZnUP Qft4Qjhvs/lpck6l06Z3CDWXizIfdh3oWd2UQVXEIwweUTGPfnwoqx1oQfDjI+mCPWJ8 6lOW33mzM1/U8I2PRIzpbcjThjtGPvzuFiuWrp46T8rZ/OHu/V7PK8JeJ7pzPTPdeiej 7b+n/vwtLQrSA2noEgbsDqhc+zNwaa3roVojKFl55eTImyZ615rf6RCVOYVXVT5E6SfD CsIg== X-Gm-Message-State: ABy/qLbqagvB0e/tJ41R3/7BvqLE0H7KOd9bjYmTedQ440ReVaUfZEe3 2d55TZO+UQVrXbzWNGhordGnpVzvb5WSwEeC X-Received: by 2002:ac8:7e95:0:b0:403:96e3:4740 with SMTP id w21-20020ac87e95000000b0040396e34740mr13111789qtj.25.1689019448421; Mon, 10 Jul 2023 13:04:08 -0700 (PDT) Received: from localhost ([2620:10d:c091:400::5:4850]) by smtp.gmail.com with ESMTPSA id k10-20020ac8074a000000b00401e04c66fesm260118qth.37.2023.07.10.13.04.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Jul 2023 13:04:08 -0700 (PDT) From: David Vernet To: linux-kernel@vger.kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, gautham.shenoy@amd.com, kprateek.nayak@amd.com, aaron.lu@intel.com, clm@meta.com, tj@kernel.org, roman.gushchin@linux.dev, kernel-team@meta.com Subject: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues Date: Mon, 10 Jul 2023 15:03:41 -0500 Message-Id: <20230710200342.358255-7-void@manifault.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230710200342.358255-1-void@manifault.com> References: <20230710200342.358255-1-void@manifault.com> MIME-Version: 1.0 X-Spam-Status: No, score=-1.4 required=5.0 tests=BAYES_00, FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,HEADER_FROM_DIFFERENT_DOMAINS, RCVD_IN_DNSWL_BLOCKED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: 
bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1771065516191506425 X-GMAIL-MSGID: 1771065516191506425 The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that tasks are put into on enqueue, and pulled from when a core in that LLC would otherwise go idle. For CPUs with large LLCs, this can sometimes cause significant contention, as illustrated in [0]. [0]: https://lore.kernel.org/all/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/ So as to try and mitigate this contention, we can instead shard the per-LLC runqueue into multiple per-LLC shards. While this doesn't outright prevent all contention, it does somewhat mitigate it. For example, if we run the following schbench command which does almost nothing other than pound the runqueue: schbench -L -m 52 -p 512 -r 10 -t 1 we observe with lockstats that sharding significantly decreases contention. 3 shards: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &shard->lock: 31510503 31510711 0.08 19.98 168932319.64 5.36 31700383 31843851 0.03 17.50 10273968.33 0.32 ------------ &shard->lock 15731657 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510 &shard->lock 15756516 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530 &shard->lock 21766 [<00000000126ec6ab>] newidle_balance+0x45a/0x650 &shard->lock 772 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540 ------------ &shard->lock 23458 [<00000000126ec6ab>] newidle_balance+0x45a/0x650 &shard->lock 16505108 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530 &shard->lock 14981310 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510 &shard->lock 835 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540 No sharding: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &shard->lock: 117868635 118361486 0.09 393.01 1250954097.25 10.57 119345882 119780601 0.05 343.35 38313419.51 0.32 ------------ &shard->lock 59169196 [<0000000060507011>] __enqueue_entity+0xdc/0x110 &shard->lock 59084239 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0 &shard->lock 108051 [<00000000084a6193>] newidle_balance+0x45a/0x650 ------------ &shard->lock 60028355 [<0000000060507011>] __enqueue_entity+0xdc/0x110 &shard->lock 119882 [<00000000084a6193>] newidle_balance+0x45a/0x650 &shard->lock 58213249 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0 The contention is ~3-4x worse if we don't shard at all. This roughly matches the fact that we had 3 shards on the host where this was collected. 
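To make the pull side of this concrete, here is a simplified standalone model
(plain C, not the kernel code; the real implementation is
shared_runq_pick_next_task() in the diff below): an idle CPU starts at its own
shard and walks the remaining shards in the LLC, so sharding spreads the
enqueue-side lock traffic without giving up work conservation on the pull
side.

  #include <stdio.h>

  #define NUM_SHARDS 3

  /* Queued-task counts per shard; stands in for the real per-shard task lists. */
  static int shard_len[NUM_SHARDS] = { 0, 0, 4 };

  /* Returns 1 if a "task" was pulled from shard idx, 0 if that shard was empty. */
  static int pull_from_shard(int idx)
  {
          if (!shard_len[idx])
                  return 0;
          shard_len[idx]--;
          return 1;
  }

  /* O(NUM_SHARDS) walk, starting from the calling CPU's own shard. */
  static int shared_runq_pick(int cpu)
  {
          int start = cpu % NUM_SHARDS;

          for (int i = 0; i < NUM_SHARDS; i++) {
                  int idx = (start + i) % NUM_SHARDS;

                  if (pull_from_shard(idx))
                          return idx;
          }
          return -1;      /* nothing anywhere; fall back to normal newidle balance */
  }

  int main(void)
  {
          /* CPU 1's own shard (shard 1) is empty, so it falls through to shard 2. */
          printf("cpu 1 pulled from shard %d\n", shared_runq_pick(1));
          printf("cpu 1 pulled from shard %d\n", shared_runq_pick(1));
          return 0;
  }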
This could be addressed in future patch sets by adding a debugfs knob to control the sharding granularity. If we make the shards even smaller (what's in this patch, i.e. a size of 6), the contention goes away almost entirely: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ &shard->lock: 13839849 13877596 0.08 13.23 5389564.95 0.39 46910241 48069307 0.06 16.40 16534469.35 0.34 ------------ &shard->lock 3559 [<00000000ea455dcc>] newidle_balance+0x45a/0x650 &shard->lock 6992418 [<000000002266f400>] __dequeue_entity+0x78/0xa0 &shard->lock 6881619 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110 ------------ &shard->lock 6640140 [<000000002266f400>] __dequeue_entity+0x78/0xa0 &shard->lock 3523 [<00000000ea455dcc>] newidle_balance+0x45a/0x650 &shard->lock 7233933 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110 Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench benchmark on Milan, but we contend even more on the rq lock: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &rq->__lock: 9617614 9656091 0.10 79.64 69665812.00 7.21 18092700 67652829 0.11 82.38 344524858.87 5.09 ----------- &rq->__lock 6301611 [<000000003e63bf26>] task_rq_lock+0x43/0xe0 &rq->__lock 2530807 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 109360 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 &rq->__lock 178218 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 ----------- &rq->__lock 3245506 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 1294355 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 &rq->__lock 2837804 [<000000003e63bf26>] task_rq_lock+0x43/0xe0 &rq->__lock 1627866 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 .................................................................................................................................................................................................. 
&shard->lock: 7338558 7343244 0.10 35.97 7173949.14 0.98 30200858 32679623 0.08 35.59 16270584.52 0.50 ------------ &shard->lock 2004142 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0 &shard->lock 2611264 [<00000000473978cc>] newidle_balance+0x45a/0x650 &shard->lock 2727838 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110 ------------ &shard->lock 2737232 [<00000000473978cc>] newidle_balance+0x45a/0x650 &shard->lock 1693341 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0 &shard->lock 2912671 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110 ................................................................................................................................................................................................... If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still contends the most, but it's significantly less than with it enabled: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &rq->__lock: 791277 791690 0.12 110.54 4889787.63 6.18 1575996 62390275 0.13 112.66 316262440.56 5.07 ----------- &rq->__lock 263343 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 19394 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 &rq->__lock 4143 [<000000003b542e83>] __task_rq_lock+0x51/0xf0 &rq->__lock 51094 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 ----------- &rq->__lock 23756 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 &rq->__lock 379048 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 677 [<000000003b542e83>] __task_rq_lock+0x51/0xf0 &rq->__lock 47962 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 In general, the takeaway here is that sharding does help with contention, but it's not necessarily one size fits all, and it's workload dependent. For now, let's include sharding to try and avoid contention, and because it doesn't seem to regress CPUs that don't need it such as the AMD 7950X. Suggested-by: Peter Zijlstra Signed-off-by: David Vernet --- kernel/sched/fair.c | 139 +++++++++++++++++++++++++++++-------------- kernel/sched/sched.h | 3 +- 2 files changed, 96 insertions(+), 46 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ff2491387201..97985f28a627 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -143,21 +143,28 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); * struct shared_runq - Per-LLC queue structure for enqueuing and pulling * waking tasks. * + * struct shared_runq_shard - A structure containing a task list and a spinlock + * for a subset of cores in a struct shared_runq. + * * WHAT * ==== * * This structure enables the scheduler to be more aggressively work - * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be - * pulled from when another core in the LLC is going to go idle. - * - * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq. - * Waking tasks are enqueued in a shared_runq at the end of - * enqueue_task_fair(), and are opportunistically pulled from the shared_runq - * in newidle_balance(). 
Tasks enqueued in a shared_runq may be scheduled prior - * to being pulled from the shared_runq, in which case they're simply dequeued - * from the shared_runq. A waking task is only enqueued to a shared_runq when - * it was _not_ manually migrated to the current runqueue by - * select_task_rq_fair(). + * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can + * then be pulled from when another core in the LLC is going to go idle. + * + * struct rq stores two pointers in its struct cfs_rq: + * + * 1. The per-LLC struct shared_runq which contains one or more shards of + * enqueued tasks. + * + * 2. The shard inside of the per-LLC struct shared_runq which contains the + * list of runnable tasks for that shard. + * + * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard at + * the end of enqueue_task_fair(), and are opportunistically pulled from the + * shared_runq in newidle_balance(). Pulling from shards is an O(# shards) + * operation. * * There is currently no task-stealing between shared_runqs in different LLCs, * which means that shared_runq is not fully work conserving. This could be @@ -167,11 +174,12 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); * HOW * === * - * An shared_runq is comprised of a list, and a spinlock for synchronization. - * Given that the critical section for a shared_runq is typically a fast list - * operation, and that the shared_runq is localized to a single LLC, the - * spinlock will typically only be contended on workloads that do little else - * other than hammer the runqueue. + * A struct shared_runq_shard is comprised of a list, and a spinlock for + * synchronization. Given that the critical section for a shared_runq is + * typically a fast list operation, and that the shared_runq_shard is localized + * to a subset of cores on a single LLC (plus other cores in the LLC that pull + * from the shard in newidle_balance()), the spinlock will typically only be + * contended on workloads that do little else other than hammer the runqueue. * * WHY * === @@ -185,48 +193,64 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); * it, as well as to strike a balance between work conservation, and L3 cache * locality. 
*/ -struct shared_runq { +struct shared_runq_shard { struct list_head list; spinlock_t lock; } ____cacheline_aligned; +struct shared_runq { + u32 num_shards; + struct shared_runq_shard shards[]; +} ____cacheline_aligned; + +/* This would likely work better as a configurable knob via debugfs */ +#define SHARED_RUNQ_SHARD_SZ 6 + #ifdef CONFIG_SMP static struct shared_runq *rq_shared_runq(struct rq *rq) { return rq->cfs.shared_runq; } -static struct task_struct *shared_runq_pop_task(struct rq *rq) +static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq) +{ + return rq->cfs.shard; +} + +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu) +{ + return cpu % runq->num_shards; +} + +static struct task_struct * +shared_runq_pop_task(struct shared_runq_shard *shard, int target) { unsigned long flags; struct task_struct *p; - struct shared_runq *shared_runq; - shared_runq = rq_shared_runq(rq); - if (list_empty(&shared_runq->list)) + if (list_empty(&shard->list)) return NULL; - spin_lock_irqsave(&shared_runq->lock, flags); - p = list_first_entry_or_null(&shared_runq->list, struct task_struct, + spin_lock_irqsave(&shard->lock, flags); + p = list_first_entry_or_null(&shard->list, struct task_struct, shared_runq_node); - if (p && is_cpu_allowed(p, cpu_of(rq))) + if (p && is_cpu_allowed(p, target)) list_del_init(&p->shared_runq_node); else p = NULL; - spin_unlock_irqrestore(&shared_runq->lock, flags); + spin_unlock_irqrestore(&shard->lock, flags); return p; } -static void shared_runq_push_task(struct rq *rq, struct task_struct *p) +static void shared_runq_push_task(struct shared_runq_shard *shard, + struct task_struct *p) { unsigned long flags; - struct shared_runq *shared_runq; - shared_runq = rq_shared_runq(rq); - spin_lock_irqsave(&shared_runq->lock, flags); - list_add_tail(&p->shared_runq_node, &shared_runq->list); - spin_unlock_irqrestore(&shared_runq->lock, flags); + spin_lock_irqsave(&shard->lock, flags); + list_add_tail(&p->shared_runq_node, &shard->list); + spin_unlock_irqrestore(&shard->lock, flags); } static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p, @@ -247,7 +271,7 @@ static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p, if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) return; - shared_runq_push_task(rq, p); + shared_runq_push_task(rq_shared_runq_shard(rq), p); } static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) @@ -256,8 +280,21 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) struct rq *src_rq; struct rq_flags src_rf; int ret; + struct shared_runq *shared_runq; + struct shared_runq_shard *shard; + u32 i, starting_idx, curr_idx, num_shards; - p = shared_runq_pop_task(rq); + shared_runq = rq_shared_runq(rq); + starting_idx = shared_runq_shard_idx(shared_runq, cpu_of(rq)); + num_shards = shared_runq->num_shards; + for (i = 0; i < num_shards; i++) { + curr_idx = (starting_idx + i) % num_shards; + shard = &shared_runq->shards[curr_idx]; + + p = shared_runq_pop_task(shard, cpu_of(rq)); + if (p) + break; + } if (!p) return 0; @@ -287,13 +324,13 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) static void shared_runq_dequeue_task(struct task_struct *p) { unsigned long flags; - struct shared_runq *shared_runq; + struct shared_runq_shard *shard; if (!list_empty(&p->shared_runq_node)) { - shared_runq = rq_shared_runq(task_rq(p)); - spin_lock_irqsave(&shared_runq->lock, flags); + shard = rq_shared_runq_shard(task_rq(p)); + 
spin_lock_irqsave(&shard->lock, flags); list_del_init(&p->shared_runq_node); - spin_unlock_irqrestore(&shared_runq->lock, flags); + spin_unlock_irqrestore(&shard->lock, flags); } } @@ -13003,19 +13040,31 @@ __init void init_sched_fair_class(void) __init void init_sched_fair_class_late(void) { #ifdef CONFIG_SMP - int i; + int i, j; struct shared_runq *shared_runq; + struct shared_runq_shard *shard; struct rq *rq; struct rq *llc_rq; + size_t shared_runq_size; + u32 num_shards, shard_idx; for_each_possible_cpu(i) { if (per_cpu(sd_llc_id, i) == i) { llc_rq = cpu_rq(i); - - shared_runq = kzalloc_node(sizeof(struct shared_runq), - GFP_KERNEL, cpu_to_node(i)); - INIT_LIST_HEAD(&shared_runq->list); - spin_lock_init(&shared_runq->lock); + num_shards = max(per_cpu(sd_llc_size, i) / + SHARED_RUNQ_SHARD_SZ, 1); + shared_runq_size = sizeof(struct shared_runq) + + num_shards * sizeof(struct shared_runq_shard); + + shared_runq = kzalloc_node(shared_runq_size, + GFP_KERNEL, cpu_to_node(i)); + shared_runq->num_shards = num_shards; + for (j = 0; j < num_shards; j++) { + shard = &shared_runq->shards[j]; + + INIT_LIST_HEAD(&shard->list); + spin_lock_init(&shard->lock); + } llc_rq->cfs.shared_runq = shared_runq; } } @@ -13024,9 +13073,9 @@ __init void init_sched_fair_class_late(void) rq = cpu_rq(i); llc_rq = cpu_rq(per_cpu(sd_llc_id, i)); - if (rq == llc_rq) - continue; rq->cfs.shared_runq = llc_rq->cfs.shared_runq; + shard_idx = shared_runq_shard_idx(rq->cfs.shared_runq, i); + rq->cfs.shard = &rq->cfs.shared_runq->shards[shard_idx]; } #endif /* SMP */ } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 8b573dfaba33..ca56a8120088 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -576,7 +576,8 @@ struct cfs_rq { #endif #ifdef CONFIG_SMP - struct shared_runq *shared_runq; + struct shared_runq *shared_runq; + struct shared_runq_shard *shard; /* * CFS load tracking */ From patchwork Mon Jul 10 20:03:42 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Vernet X-Patchwork-Id: 118064 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:a6b2:0:b0:3e4:2afc:c1 with SMTP id c18csp52238vqm; Mon, 10 Jul 2023 13:22:45 -0700 (PDT) X-Google-Smtp-Source: APBJJlFrM8zzc6pIIE7jyCPjSb/zxcbIb+Fp143dpfCkIjXGRgyLrkG+IQyz1/VxGv7pfAgt4Yx8 X-Received: by 2002:a05:6512:2029:b0:4fa:e7e5:66e0 with SMTP id s9-20020a056512202900b004fae7e566e0mr11070537lfs.48.1689020565273; Mon, 10 Jul 2023 13:22:45 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1689020565; cv=none; d=google.com; s=arc-20160816; b=d8c9Wbg7QbQIm4o2nImSOch/E3BqjstxwglC3eKEFj/EAefJB1nkWtLv9yNMXsbEmo aLp8v658rm5LeaT17/Bobb9KjPKMk1gfSFiINJSn4EEhPVzLrIhpWO2GU6xH1O7vlF66 NXwd/WfSr9QJmGit7Vr5cNDHvvblxupmnWJ+UZEhP9B/0Ulx+3YEjIees/6gFkHFR35o y/2zB5AYlkdIF89LplQCBJnJjFFUS9q/P1QFHX6bUO6bsgQSFsZUa5WwuP7LYQmVJBuq BsG3ACnVpHYpRxOIVGuphZP52r3BMgq9dL8bEjSQdAVwS7d1patIuc5LOY2N3bRx81An cpSQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=aywgOaCBg8taQC5AfKHDC49h4sDlP6UNnulT1NAP3Eg=; fh=GCF3HUaVVQnsrFP0NhazYIuTwTFmsBL5+Yby9S9U12k=; b=EXLXYvzGi6+Ob5Tip8ZC2zNL2l5MV6HhZCzfNnUFOqAV6VpL+SEwJ86b4zCLPnPAnC +0I8AjjUiOkwFFYEq87M0GpbWVwx1FvtVuim0G7y0VCHpabQgqbw3c9Py+SI0QG8og0V FSmVubwP3Za2R70xWFkXD1o93HxUsFZEcCu0PfZFWPeYglHJir0HGuoh5FqmrVSeByxA NJpdo1wvOT9CizoipnDbAquzypeOwTZpezFhCPjOOk1153GUD5Fgbj7zxH0unwi8nE9C 
j90Dw74uCZucux2TQcZsnBNf3LJCBIEGOM0s6yCtb7Klj+vm4rrSN9KALcSyDu+HaW3A D2iw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id m16-20020aa7c2d0000000b0051e0eba60bdsi291637edp.456.2023.07.10.13.22.21; Mon, 10 Jul 2023 13:22:45 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231842AbjGJUEb (ORCPT + 99 others); Mon, 10 Jul 2023 16:04:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57602 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231602AbjGJUEX (ORCPT ); Mon, 10 Jul 2023 16:04:23 -0400 Received: from mail-qv1-f49.google.com (mail-qv1-f49.google.com [209.85.219.49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5A3761B8 for ; Mon, 10 Jul 2023 13:04:11 -0700 (PDT) Received: by mail-qv1-f49.google.com with SMTP id 6a1803df08f44-635e0e6b829so37916866d6.0 for ; Mon, 10 Jul 2023 13:04:11 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689019450; x=1691611450; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=aywgOaCBg8taQC5AfKHDC49h4sDlP6UNnulT1NAP3Eg=; b=KpHcVIRLqOdeYHz7W7mMko4ses0wtKrliWuRx/7n3ciRXNMqDNWQYTPX+GjgzT5kXp xOwV56A9q6SEy7z+xOvDmowMtoCwx06fu7EVDf3kscvmLSOW4uS8C+ZtQkbDEgjjMMaj gVQ34PxNXJ5rkRWNHu6ZGMunh7tegmQ6rjFD8DdFBtNnM9ijL4eJKu7YwmAfsVsfcdnA 2juKJLdz2Cg73zbJVcJPcFQArLA9oZK/RDi2nm/bv84IlccYr95LXoqhVonvXi3a47t0 qPzy7x4pvZbjzo1JquyHePr4tRL4hiYNN+4Owe6byPuxthsi6aEP4RJZ4RKxiCKS9a+g ZeNA== X-Gm-Message-State: ABy/qLaEFYHdSU8am6Y38Qw6BX8waGXO0SSTGUOJJzr/xC1MGCdKNm8H tcdmJMqW/v9krsRlfO6AC00xT87F1TaiPXd+ X-Received: by 2002:a0c:cb8e:0:b0:636:14d4:4450 with SMTP id p14-20020a0ccb8e000000b0063614d44450mr11674967qvk.3.1689019449967; Mon, 10 Jul 2023 13:04:09 -0700 (PDT) Received: from localhost ([2620:10d:c091:400::5:4850]) by smtp.gmail.com with ESMTPSA id o11-20020a0ce40b000000b006301ec0d16fsm217825qvl.0.2023.07.10.13.04.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Jul 2023 13:04:09 -0700 (PDT) From: David Vernet To: linux-kernel@vger.kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, gautham.shenoy@amd.com, kprateek.nayak@amd.com, aaron.lu@intel.com, clm@meta.com, tj@kernel.org, roman.gushchin@linux.dev, kernel-team@meta.com Subject: [PATCH v2 7/7] sched: Move shared_runq to __{enqueue,dequeue}_entity() Date: Mon, 10 Jul 2023 15:03:42 -0500 Message-Id: <20230710200342.358255-8-void@manifault.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230710200342.358255-1-void@manifault.com> References: <20230710200342.358255-1-void@manifault.com> MIME-Version: 1.0 X-Spam-Status: No, 
score=-1.4 required=5.0 tests=BAYES_00, FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,HEADER_FROM_DIFFERENT_DOMAINS, RCVD_IN_DNSWL_BLOCKED,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1771066428326555520 X-GMAIL-MSGID: 1771066428326555520 In the thread at [0], Peter suggested experimenting with putting the shared_runq enqueue and dequeue calls in __enqueue_entity() and __dequeue_entity() respectively. This has the advantages of improve work conservation from utilizing the shared runq on all enqueues, as well as automagically causing shared_runq to work for SCHED_CORE. The possible disadvantage was that we're dong more enqueues/dequeues, which could result in untenable overhead or contention on the shard lock(s). [0]: https://lore.kernel.org/lkml/20230622105841.GH4253@hirez.programming.kicks-ass.net/ It turns out that this appears to improve shared_runq quite a bit: === Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 === CPU max MHz: 5879.8818 CPU min MHz: 3000.0000 Command: make -j$(nproc) built-in.a o____________o_______o | mean | CPU | o------------o-------o NO_SHARED_RUNQ: | 582.46s | 3101% | SHARED_RUNQ_WAKEUPS: | 581.22s | 3117% | SHARED_RUNQ_ALL: | 578.41s | 3141% | o------------o-------o Takeaway: SHARED_RUNQ_WAKEUPS performs roughly the same as NO_SHARED_RUNQ, but SHARED_RUNQ_ALL results in a statistically significant ~.7% improvement over NO_SHARED_RUNQ. This suggests that enqueuing tasks in the shared runqueue on every enqueue improves work conservation, and thanks to sharding, does not result in contention. Note that I didn't collect data for kernel compile with SHARED_RUNQ_ALL _without_ sharding. The reason for this is that we know that CPUs with sufficiently large LLCs will contend, so if we've decided to accommodate those CPUs with sharding, there's not much point in measuring the results of not sharding on CPUs that we know won't contend. Command: hackbench --loops 10000 o____________o_______o | mean | CPU | o------------o-------o NO_SHARED_RUNQ: | 2.1912s | 3117% | SHARED_RUNQ_WAKEUP: | 2.1080s | 3155% | SHARED_RUNQ_ALL: | 1.9830s | 3144% | o------------o-------o Takeaway: SHARED_RUNQ in both forms performs exceptionally well compared to NO_SHARED_RUNQ here, with SHARED_RUNQ_ALL beating NO_SHARED_RUNQ by almost 10%. This was a surprising result given that it seems advantageous to err on the side of avoiding migration in hackbench given that tasks are short lived in sending only 10k bytes worth of messages, but the results of the benchmark would seem to suggest that minimizing runqueue delays is preferable. Command: for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l $runtime & done o_______________________o | mean (thoughput) | o-----------------------o NO_SHARED_RUNQ: | 25064.12 | SHARED_RUNQ_WAKEUP: | 24862.16 | SHARED_RUNQ_ALL: | 25287.73 | o-----------------------o Takeaway: No statistical significance, though it is worth noting that there is no regression for shared runqueue on the 7950X, while there is a small regression on the Skylake and Milan hosts for SHARED_RUNQ_WAKEUP as described below. 
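For reference, the kernel compile and hackbench percentages quoted in the
takeaways above follow directly from the tables:

    kernel compile:  (582.46 - 578.41) / 582.46   ~= 0.7%
    hackbench:       (2.1912 - 1.9830) / 2.1912   ~= 9.5%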
=== Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake === CPU max MHz: 1601.0000 CPU min MHz: 800.0000 Command: make -j$(nproc) built-in.a o____________o_______o | mean | CPU | o------------o-------o NO_SHARED_RUNQ: | 1535.46s | 3417% | SHARED_RUNQ_WAKEUP: | 1534.56s | 3428% | SHARED_RUNQ_ALL: | 1531.95s | 3429% | o------------o-------o Takeaway: SHARED_RUNQ_ALL results in a ~.23% improvement over NO_SHARED_RUNQ. Not a huge improvement, but consistently measurable. The cause of this gain is presumably the same as the 7950X: improved work conservation, with sharding preventing excessive contention on the shard lock. Command: hackbench --loops 10000 o____________o_______o | mean | CPU | o------------o-------o NO_SHARED_RUNQ: | 5.5750s | 3369% | SHARED_RUNQ_WAKEUP: | 5.5764s | 3495% | SHARED_RUNQ_ALL: | 5.4760s | 3481% | o------------o-------o Takeaway: SHARED_RUNQ_ALL results in a ~1.6% improvement over NO_SHARED_RUNQ. Also statistically significant, but smaller than the almost 10% improvement observed on the 7950X. Command: netperf -n $(nproc) -l 60 -t TCP_RR for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l $runtime & done o______________________o | mean (thoughput) | o----------------------o NO_SHARED_RUNQ: | 11963.08 | SHARED_RUNQ_WAKEUP: | 11943.60 | SHARED_RUNQ_ALL: | 11554.32 | o----------------------o Takeaway: NO_SHARED_RUNQ performs the same as SHARED_RUNQ_WAKEUP, but beats SHARED_RUNQ_ALL by ~3.4%. This result makes sense -- the workload is very heavy on the runqueue, so enqueuing tasks in the shared runqueue in __enqueue_entity() would intuitively result in increased contention on the shard lock. The fact that we're at parity with SHARED_RUNQ_WAKEUP suggests that sharding the shared runqueue has significantly improved the contention that was observed in v1, but that __enqueue_entity() puts it over the edge. NOTE: Parity for SHARED_RUNQ_WAKEUP relies on choosing the correct shard size. If we chose, for example, a shard size of 16, there would still be a regression between NO_SHARED_RUNQ and SHARED_RUNQ_WAKEUP. As described below, this suggests that we may want to add a debugfs tunable for the shard size. === Single-socket | 72-core | 6-CCX | AMD Milan Zen3 === CPU max MHz: 700.0000 CPU min MHz: 700.0000 Command: make -j$(nproc) built-in.a o____________o_______o | mean | CPU | o------------o-------o NO_SHARED_RUNQ: | 1601.81s | 6476% | SHARED_RUNQ_WAKEUP: | 1602.55s | 6472% | SHARED_RUNQ_ALL: | 1602.49s | 6475% | o------------o-------o Takeaway: No statistically significant variance. It might be worth experimenting with work stealing in a follow-on patch set. Command: hackbench --loops 10000 o____________o_______o | mean | CPU | o------------o-------o NO_SHARED_RUNQ: | 5.2672s | 6463% | SHARED_RUNQ_WAKEUP: | 5.1476s | 6583% | SHARED_RUNQ_ALL: | 5.1003s | 6598% | o------------o-------o Takeaway: SHARED_RUNQ_ALL again wins, by about 3% over NO_SHARED_RUNQ in this case. Command: netperf -n $(nproc) -l 60 -t TCP_RR for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l $runtime & done o_______________________o | mean (thoughput) | o-----------------------o NO_SHARED_RUNQ: | 13819.08 | SHARED_RUNQ_WAKEUP: | 13907.74 | SHARED_RUNQ_ALL: | 13569.69 | o-----------------------o Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats SHARED_RUNQ_ALL, though by a slightly lower margin of ~1.8%. 
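On the NOTE above about a debugfs tunable for the shard size: a minimal sketch
of what such a knob could look like, assuming it gets handed the existing
"sched" debugfs directory and ignoring the harder problem of re-sharding
queues that were already sized at boot. This is not part of this series, just
an illustration, and the function name is hypothetical:

  #include <linux/debugfs.h>

  /* Hypothetical tunable; 6 matches SHARED_RUNQ_SHARD_SZ in this series. */
  static u32 shared_runq_shard_sz = 6;

  /* Sketch: would be called with the directory behind /sys/kernel/debug/sched. */
  void shared_runq_debugfs_init(struct dentry *sched_dir)
  {
          /* Exposes /sys/kernel/debug/sched/shared_runq_shard_sz. */
          debugfs_create_u32("shared_runq_shard_sz", 0644, sched_dir,
                             &shared_runq_shard_sz);
  }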
Suggested-by: Peter Zijlstra Signed-off-by: David Vernet --- kernel/sched/fair.c | 38 ++++++++++++-------------------------- 1 file changed, 12 insertions(+), 26 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 97985f28a627..bddb2bed4297 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -140,8 +140,8 @@ static int __init setup_sched_thermal_decay_shift(char *str) __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); /** - * struct shared_runq - Per-LLC queue structure for enqueuing and pulling - * waking tasks. + * struct shared_runq - Per-LLC queue structure for enqueuing and migrating + * runnable tasks within an LLC. * * struct shared_runq_shard - A structure containing a task list and a spinlock * for a subset of cores in a struct shared_runq. @@ -161,10 +161,9 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); * 2. The shard inside of the per-LLC struct shared_runq which contains the * list of runnable tasks for that shard. * - * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard at - * the end of enqueue_task_fair(), and are opportunistically pulled from the - * shared_runq in newidle_balance(). Pulling from shards is an O(# shards) - * operation. + * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard in + * __enqueue_entity(), and are opportunistically pulled from the shared_runq in + * newidle_balance(). Pulling from shards is an O(# shards) operation. * * There is currently no task-stealing between shared_runqs in different LLCs, * which means that shared_runq is not fully work conserving. This could be @@ -253,22 +252,15 @@ static void shared_runq_push_task(struct shared_runq_shard *shard, spin_unlock_irqrestore(&shard->lock, flags); } -static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p, - int enq_flags) +static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p) { - bool task_migrated = enq_flags & ENQUEUE_MIGRATED; - bool task_wakeup = enq_flags & ENQUEUE_WAKEUP; - /* * Only enqueue the task in the shared runqueue if: * * - SWQUEUE is enabled - * - The task is on the wakeup path - * - The task wasn't purposefully migrated to the current rq by - * select_task_rq() * - The task isn't pinned to a specific CPU */ - if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1) + if (p->nr_cpus_allowed == 1) return; shared_runq_push_task(rq_shared_runq_shard(rq), p); @@ -357,8 +349,7 @@ int __weak arch_asym_cpu_priority(int cpu) */ #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078) #else -static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p, - int enq_flags) +static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p) {} static void shared_runq_dequeue_task(struct task_struct *p) @@ -843,11 +834,15 @@ static inline bool __entity_less(struct rb_node *a, const struct rb_node *b) */ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { + if (sched_feat(SHARED_RUNQ) && entity_is_task(se)) + shared_runq_enqueue_task(rq_of(cfs_rq), task_of(se)); rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less); } static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) { + if (sched_feat(SHARED_RUNQ) && entity_is_task(se)) + shared_runq_dequeue_task(task_of(se)); rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline); } @@ -6587,9 +6582,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (!task_new) 
update_overutilized_status(rq); - if (sched_feat(SHARED_RUNQ)) - shared_runq_enqueue_task(rq, p, flags); - enqueue_throttle: assert_list_leaf_cfs_rq(rq); @@ -6671,9 +6663,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) dequeue_throttle: util_est_update(&rq->cfs, p, task_sleep); hrtick_update(rq); - - if (sched_feat(SHARED_RUNQ)) - shared_runq_dequeue_task(p); } #ifdef CONFIG_SMP @@ -8380,9 +8369,6 @@ done: __maybe_unused; update_misfit_status(p, rq); - if (sched_feat(SHARED_RUNQ)) - shared_runq_dequeue_task(p); - return p; idle: