From patchwork Wed Aug 9 22:12:12 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 133603
From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: peterz@infradead.org, mingo@redhat.com, juri.lelli@redhat.com,
    vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
    bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
    tj@kernel.org, roman.gushchin@linux.dev, gautham.shenoy@amd.com,
    kprateek.nayak@amd.com, aaron.lu@intel.com, wuyun.abel@bytedance.com,
    kernel-team@meta.com
Subject: [PATCH v3 1/7] sched: Expose move_queued_task() from core.c
Date: Wed, 9 Aug 2023 17:12:12 -0500
Message-ID: <20230809221218.163894-2-void@manifault.com>
In-Reply-To: <20230809221218.163894-1-void@manifault.com>
References: <20230809221218.163894-1-void@manifault.com>
The migrate_task_to() function exposed from kernel/sched/core.c migrates the
current task, which is silently assumed to also be its first argument, to the
specified CPU. The function uses stop_one_cpu() to migrate the task to the
target CPU, which won't work if @p is not the current task, as the
stop_one_cpu() callback isn't invoked on remote CPUs.

While this operation is sufficient for task_numa_migrate() in fair.c, it would
be useful if move_queued_task() in core.c were given external linkage, as it
can be used to migrate any task to a CPU. A follow-on patch will call
move_queued_task() from fair.c when migrating a task in a shared runqueue to a
remote CPU.

Suggested-by: Peter Zijlstra
Signed-off-by: David Vernet
---
 kernel/sched/core.c  | 4 ++--
 kernel/sched/sched.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 614271a75525..394e216b9d37 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2519,8 +2519,8 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
  *
  * Returns (locked) new rq. Old rq's lock is released.
  */
-static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
-				   struct task_struct *p, int new_cpu)
+struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+			    struct task_struct *p, int new_cpu)
 {
 	lockdep_assert_rq_held(rq);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 19af1766df2d..69b100267fd0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1763,6 +1763,9 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)

 #ifdef CONFIG_SMP
+
+extern struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+				   struct task_struct *p, int new_cpu);
 static inline void queue_balance_callback(struct rq *rq,
 					  struct balance_callback *head,
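For illustration, a minimal, hypothetical sketch of how a fair.c caller might
use the newly-exported helper. The wrapper name below is made up and is not
part of this patch; the locking contract (the old rq's lock is dropped and the
new CPU's rq is returned locked) is taken from the comment above
move_queued_task() in the hunk above.

	static struct rq *pull_task_to_cpu(struct rq *src_rq, struct rq_flags *rf,
					   struct task_struct *p, int dst_cpu)
	{
		/* Caller must hold src_rq's lock, and @p must be queued. */
		lockdep_assert_rq_held(src_rq);
		WARN_ON_ONCE(!task_on_rq_queued(p));

		/* Drops src_rq's lock; returns dst_cpu's rq, locked. */
		return move_queued_task(src_rq, rf, p, dst_cpu);
	}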
From patchwork Wed Aug 9 22:12:13 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 133604
From: David Vernet
To: linux-kernel@vger.kernel.org
Subject: [PATCH v3 2/7] sched: Move is_cpu_allowed() into sched.h
Date: Wed, 9 Aug 2023 17:12:13 -0500
Message-ID: <20230809221218.163894-3-void@manifault.com>
In-Reply-To: <20230809221218.163894-1-void@manifault.com>
References: <20230809221218.163894-1-void@manifault.com>
is_cpu_allowed() exists as a static inline function in core.c. The
functionality it offers is useful to scheduling policies as well, e.g. to
determine whether a runnable task can be migrated to another core that would
otherwise go idle. Let's move it to sched.h.

Signed-off-by: David Vernet
---
 kernel/sched/core.c  | 31 -------------------------------
 kernel/sched/sched.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 394e216b9d37..dd6412a49263 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -48,7 +48,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -2470,36 +2469,6 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
 	return rq->nr_pinned;
 }

-/*
- * Per-CPU kthreads are allowed to run on !active && online CPUs, see
- * __set_cpus_allowed_ptr() and select_fallback_rq().
- */
-static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
-{
-	/* When not in the task's cpumask, no point in looking further. */
-	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
-		return false;
-
-	/* migrate_disabled() must be allowed to finish. */
-	if (is_migration_disabled(p))
-		return cpu_online(cpu);
-
-	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu) && task_cpu_possible(cpu, p);
-
-	/* KTHREAD_IS_PER_CPU is always allowed. */
-	if (kthread_is_per_cpu(p))
-		return cpu_online(cpu);
-
-	/* Regular kernel threads don't get to stay during offline. */
-	if (cpu_dying(cpu))
-		return false;
-
-	/* But are allowed during online. */
-	return cpu_online(cpu);
-}
-
 /*
  * This is how migration works:
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 69b100267fd0..88cca7cc00cf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1203,6 +1204,36 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }

+/*
+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
+ * __set_cpus_allowed_ptr() and select_fallback_rq().
+ */
+static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
+{
+	/* When not in the task's cpumask, no point in looking further. */
+	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+		return false;
+
+	/* migrate_disabled() must be allowed to finish. */
+	if (is_migration_disabled(p))
+		return cpu_online(cpu);
+
+	/* Non kernel threads are not allowed during either online or offline. */
+	if (!(p->flags & PF_KTHREAD))
+		return cpu_active(cpu) && task_cpu_possible(cpu, p);
+
+	/* KTHREAD_IS_PER_CPU is always allowed. */
+	if (kthread_is_per_cpu(p))
+		return cpu_online(cpu);
+
+	/* Regular kernel threads don't get to stay during offline. */
+	if (cpu_dying(cpu))
+		return false;
+
+	/* But are allowed during online. */
+	return cpu_online(cpu);
+}
+
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
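For illustration, with is_cpu_allowed() now visible in sched.h, a scheduling
policy could gate a pull decision on it directly. The helper below is a
hypothetical sketch, not part of this patch.

	static bool policy_can_migrate(struct task_struct *p, int dst_cpu)
	{
		/*
		 * is_cpu_allowed() already handles the cpumask, hotplug and
		 * kthread constraints spelled out above.
		 */
		return is_cpu_allowed(p, dst_cpu);
	}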
From patchwork Wed Aug 9 22:12:14 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 133618
From: David Vernet
To: linux-kernel@vger.kernel.org
Subject: [PATCH v3 3/7] sched: Check cpu_active() earlier in newidle_balance()
Date: Wed, 9 Aug 2023 17:12:14 -0500
Message-ID: <20230809221218.163894-4-void@manifault.com>
In-Reply-To: <20230809221218.163894-1-void@manifault.com>
References: <20230809221218.163894-1-void@manifault.com>
In newidle_balance(), we check whether the current CPU is inactive, and if so
decline to pull any remote tasks to the core. Before this check, however,
we're currently updating rq->idle_stamp. If a core is offline, setting its
idle stamp is not useful: the core won't be chosen by any task in
select_task_rq_fair(), and setting rq->idle_stamp is misleading anyway, given
that an inactive core should be expected to have a very cold cache.

Let's set rq->idle_stamp in newidle_balance() only if the CPU is active.

Signed-off-by: David Vernet
---
 kernel/sched/fair.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28206499a3d..eb15d6f46479 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12037,18 +12037,18 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (this_rq->ttwu_pending)
 		return 0;

-	/*
-	 * We must set idle_stamp _before_ calling idle_balance(), such that we
-	 * measure the duration of idle_balance() as idle time.
-	 */
-	this_rq->idle_stamp = rq_clock(this_rq);
-
 	/*
 	 * Do not pull tasks towards !active CPUs...
 	 */
 	if (!cpu_active(this_cpu))
 		return 0;

+	/*
+	 * We must set idle_stamp _before_ calling idle_balance(), such that we
+	 * measure the duration of idle_balance() as idle time.
+	 */
+	this_rq->idle_stamp = rq_clock(this_rq);
+
 	/*
 	 * This is OK, because current is on_cpu, which avoids it being picked
 	 * for load-balance and preemption/IRQs are still disabled avoiding
From patchwork Wed Aug 9 22:12:15 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 133605
From: David Vernet
To: linux-kernel@vger.kernel.org
Subject: [PATCH v3 4/7] sched: Enable sched_feat callbacks on enable/disable
Date: Wed, 9 Aug 2023 17:12:15 -0500
Message-ID: <20230809221218.163894-5-void@manifault.com>
In-Reply-To: <20230809221218.163894-1-void@manifault.com>
References: <20230809221218.163894-1-void@manifault.com>
When a scheduler feature is enabled or disabled, the sched_feat_enable() and
sched_feat_disable() functions are invoked respectively for that feature. For
features that don't require resetting any state, this works fine. However,
there will be an upcoming feature called shared_runq which needs to drain all
tasks from a set of global shared runqueues so that stale tasks don't linger
in the queues after the feature has been disabled.

This patch therefore defines a new SCHED_FEAT_CALLBACK macro which allows
scheduler features to specify a callback that should be invoked when a feature
is enabled or disabled. The SCHED_FEAT macro assumes a NULL callback.

Signed-off-by: David Vernet
---
 kernel/sched/core.c  |  4 ++--
 kernel/sched/debug.c | 18 ++++++++++++++----
 kernel/sched/sched.h | 16 ++++++++++------
 3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dd6412a49263..385c565da87f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -124,12 +124,12 @@ DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
  * sysctl_sched_features, defined in sched.h, to allow constants propagation
  * at compile time and compiler optimization based on features default.
  */
-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	(1UL << __SCHED_FEAT_##name) * enabled |
 const_debug unsigned int sysctl_sched_features =
 #include "features.h"
 	0;
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 /*
  * Print a warning if need_resched is set for the given duration (if
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index aeeba46a096b..803dff75c56f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -44,14 +44,14 @@ static unsigned long nsec_low(unsigned long long nsec)

 #define SPLIT_NS(x) nsec_high(x), nsec_low(x)

-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	#name ,

 static const char * const sched_feat_names[] = {
 #include "features.h"
 };

-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 static int sched_feat_show(struct seq_file *m, void *v)
 {
@@ -72,22 +72,32 @@ static int sched_feat_show(struct seq_file *m, void *v)
 #define jump_label_key__true  STATIC_KEY_INIT_TRUE
 #define jump_label_key__false STATIC_KEY_INIT_FALSE

-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	jump_label_key__##enabled ,

 struct static_key sched_feat_keys[__SCHED_FEAT_NR] = {
 #include "features.h"
 };

-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK
+
+#define SCHED_FEAT_CALLBACK(name, enabled, cb) cb,
+static const sched_feat_change_f sched_feat_cbs[__SCHED_FEAT_NR] = {
+#include "features.h"
+};
+#undef SCHED_FEAT_CALLBACK

 static void sched_feat_disable(int i)
 {
+	if (sched_feat_cbs[i])
+		sched_feat_cbs[i](false);
 	static_key_disable_cpuslocked(&sched_feat_keys[i]);
 }

 static void sched_feat_enable(int i)
 {
+	if (sched_feat_cbs[i])
+		sched_feat_cbs[i](true);
 	static_key_enable_cpuslocked(&sched_feat_keys[i]);
 }
 #else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 88cca7cc00cf..2631da3c8a4d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2065,6 +2065,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 #endif
 }

+#define SCHED_FEAT(name, enabled) SCHED_FEAT_CALLBACK(name, enabled, NULL)
+
 /*
  * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
  */
@@ -2074,7 +2076,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 # define const_debug const
 #endif

-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	__SCHED_FEAT_##name ,

 enum {
@@ -2082,7 +2084,7 @@ enum {
 	__SCHED_FEAT_NR,
 };

-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 #ifdef CONFIG_SCHED_DEBUG

@@ -2093,14 +2095,14 @@ enum {
 extern const_debug unsigned int sysctl_sched_features;

 #ifdef CONFIG_JUMP_LABEL

-#define SCHED_FEAT(name, enabled)					\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)				\
 static __always_inline bool static_branch_##name(struct static_key *key) \
 {									\
 	return static_key_##enabled(key);				\
 }

 #include "features.h"

-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
@@ -2118,17 +2120,19 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
  * constants propagation at compile time and compiler optimization based on
  * features default.
  */
-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	(1UL << __SCHED_FEAT_##name) * enabled |
 static const_debug __maybe_unused unsigned int sysctl_sched_features =
 #include "features.h"
 	0;
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK

 #define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x))

 #endif /* SCHED_DEBUG */

+typedef void (*sched_feat_change_f)(bool enabling);
+
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;
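To illustrate the intended use of the new macro, here is a hedged sketch of
how a feature with teardown requirements might hook it. The feature and
callback names below are placeholders, not part of this patch; the real
SHARED_RUNQ hookup arrives later in this series.

	/* Invoked with true when the feature is enabled, false when disabled. */
	void my_feature_toggle(bool enabling)
	{
		if (!enabling) {
			/* e.g. drain any per-feature queues so no stale state remains */
		}
	}

	/*
	 * In kernel/sched/features.h; features without a callback keep using
	 * SCHED_FEAT(), which now expands to SCHED_FEAT_CALLBACK(name, enabled, NULL).
	 */
	SCHED_FEAT_CALLBACK(MY_FEATURE, false, my_feature_toggle)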
From patchwork Wed Aug 9 22:12:16 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 133619
From: David Vernet
To: linux-kernel@vger.kernel.org
Subject: [PATCH v3 5/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
Date: Wed, 9 Aug 2023 17:12:16 -0500
Message-ID: <20230809221218.163894-6-void@manifault.com>
In-Reply-To: <20230809221218.163894-1-void@manifault.com>
References: <20230809221218.163894-1-void@manifault.com>
For certain workloads in CFS, CPU utilization is of the utmost importance. For
example, at Meta, our main web workload benefits from a 1 - 1.5% improvement
in RPS, and a 1 - 2% improvement in p99 latency, when CPU utilization is
pushed as high as possible. This is likely something that would be useful for
any workload with long slices, or for which avoiding migration is unlikely to
result in improved cache locality.

We will soon be enabling more aggressive load balancing via a new feature
called shared_runq, which places tasks into a FIFO queue on the wakeup path
(in __enqueue_entity()), and then opportunistically dequeues them in
newidle_balance(). We don't want to enable the feature by default, so this
patch defines and declares a new scheduler feature called SHARED_RUNQ which is
disabled by default. A set of future patches will implement these functions,
and enable shared_runq for both single and multi socket / CCX architectures.

Originally-by: Roman Gushchin
Signed-off-by: David Vernet
---
 kernel/sched/fair.c     | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  2 ++
 kernel/sched/sched.h    |  1 +
 3 files changed, 37 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb15d6f46479..9c23e3b948fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -140,6 +140,20 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

 #ifdef CONFIG_SMP
+void shared_runq_toggle(bool enabling)
+{}
+
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
+{}
+
+static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
+{
+	return 0;
+}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
+
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
  */
@@ -162,6 +176,15 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: ~5%)
  */
 #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
+#else
+void shared_runq_toggle(bool enabling)
+{}
+
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
+{}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
 #endif

 #ifdef CONFIG_CFS_BANDWIDTH
@@ -642,11 +665,15 @@ static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (sched_feat(SHARED_RUNQ) && entity_is_task(se))
+		shared_runq_enqueue_task(rq_of(cfs_rq), task_of(se));
 	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
 }

 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (sched_feat(SHARED_RUNQ) && entity_is_task(se))
+		shared_runq_dequeue_task(task_of(se));
 	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
 }

@@ -12043,6 +12070,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;

+	if (sched_feat(SHARED_RUNQ)) {
+		pulled_task = shared_runq_pick_next_task(this_rq, rf);
+		if (pulled_task)
+			return pulled_task;
+	}
+
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
@@ -12543,6 +12576,7 @@ static void attach_task_cfs_rq(struct task_struct *p)

 static void switched_from_fair(struct rq *rq, struct task_struct *p)
 {
+	shared_runq_dequeue_task(p);
 	detach_task_cfs_rq(p);
 }

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index e10074cb4be4..6d7c93fc1a8f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,3 +103,5 @@ SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)

 SCHED_FEAT(HZ_BW, true)
+
+SCHED_FEAT_CALLBACK(SHARED_RUNQ, false, shared_runq_toggle)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2631da3c8a4d..a484bb527ee4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2132,6 +2132,7 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 #endif /* SCHED_DEBUG */

 typedef void (*sched_feat_change_f)(bool enabling);
+extern void shared_runq_toggle(bool enabling);

 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;
From patchwork Wed Aug 9 22:12:17 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 133611
From: David Vernet
To: linux-kernel@vger.kernel.org
Subject: [PATCH v3 6/7] sched: Implement shared runqueue in CFS
Date: Wed, 9 Aug 2023 17:12:17 -0500
Message-ID: <20230809221218.163894-7-void@manifault.com>
In-Reply-To: <20230809221218.163894-1-void@manifault.com>
References: <20230809221218.163894-1-void@manifault.com>
Overview
========

The scheduler must constantly strike a balance between work conservation, and
avoiding costly migrations which harm performance due to e.g. decreased cache
locality. The matter is further complicated by the topology of the system.
Migrating a task between cores on the same LLC may be more optimal than
keeping a task local to the CPU, whereas migrating a task between LLCs or NUMA
nodes may tip the balance in the other direction.

With that in mind, while CFS is by and large mostly a work conserving
scheduler, there are certain instances where the scheduler will choose to keep
a task local to a CPU, when it would have been more optimal to migrate it to
an idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM is a VM
that JITs Hack and PHP code in service of web requests. Like other JIT /
compilation workloads, it tends to be heavily CPU bound, and exhibits
generally poor cache locality. To try and address this, we set several debugfs
(/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of context
switching (the other knobs). Collectively, these knobs provide a substantial
performance win, resulting in roughly a 20% improvement in throughput. Worth
noting, however, is that this improvement is _not_ at full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going idle
even when the host was overcommitted. In response, we wrote the "shared
runqueue" (SHARED_RUNQ) feature proposed in this patch set. The idea behind
SHARED_RUNQ is simple: it enables the scheduler to be more aggressively work
conserving by placing a waking task into a sharded per-LLC FIFO queue which
another core in the LLC can then pull from before it goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement in
throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to the
wins from the debugfs knobs mentioned above, and to other benchmarks outlined
below in the Results section.

Design
======

Note that the design described here reflects sharding, which will be added in
a subsequent patch.
The design is described that way in this commit summary as the benchmarks
described in the results section below all include sharded SHARED_RUNQ. The
patches are not combined into one to ease the burden of review.

The design of SHARED_RUNQ is quite simple. A shared_runq is simply a list of
struct shared_runq_shard objects, each of which is simply a struct list_head
of tasks and a spinlock:

struct shared_runq_shard {
	struct list_head list;
	raw_spinlock_t lock;
} ____cacheline_aligned;

struct shared_runq {
	u32 num_shards;
	struct shared_runq_shard shards[];
} ____cacheline_aligned;

We create a struct shared_runq per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs, and we
create a number of struct shared_runq_shard objects that are housed there.

When a task first wakes up, it enqueues itself in the shared_runq_shard of its
current LLC at the end of enqueue_task_fair(). Enqueues only happen if the
task was not manually migrated to the current core by select_task_rq(), and is
not pinned to a specific CPU.

A core will pull a task from the shards in its LLC's shared_runq at the
beginning of newidle_balance().
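To make that lifecycle concrete, here is a minimal sketch, under stated
assumptions, of what pushing to and popping from a shard could look like given
the structures above. The shared_runq_node list_head embedded in task_struct
and the helper names are assumptions for this sketch, not the patch's actual
code.

	static void shared_runq_push(struct shared_runq_shard *shard,
				     struct task_struct *p)
	{
		raw_spin_lock(&shard->lock);
		list_add_tail(&p->shared_runq_node, &shard->list);
		raw_spin_unlock(&shard->lock);
	}

	static struct task_struct *shared_runq_pop(struct shared_runq_shard *shard)
	{
		struct task_struct *p;

		raw_spin_lock(&shard->lock);
		p = list_first_entry_or_null(&shard->list, struct task_struct,
					     shared_runq_node);
		if (p)
			list_del_init(&p->shared_runq_node);
		raw_spin_unlock(&shard->lock);

		/* FIFO: oldest waking task in this shard, or NULL if empty. */
		return p;
	}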
Difference between SHARED_RUNQ and SIS_NODE
===========================================

In [0] Peter proposed a patch that addresses Tejun's observations that when
workqueues are targeted towards a specific LLC on his Zen2 machine with small
CCXs, there would be significant idle time due to select_idle_sibling() not
considering anything outside of the current LLC.

This patch (SIS_NODE) is essentially the complement to the proposal here.
SIS_NODE causes waking tasks to look for idle cores in neighboring LLCs on the
same die, whereas SHARED_RUNQ causes cores about to go idle to look for
enqueued tasks. That said, in its current form, the two features are at a
different scope, as SIS_NODE searches for idle cores between LLCs, while
SHARED_RUNQ enqueues tasks within a single LLC.

The patch was since removed in [1], and we compared the results to
shared_runq (previously called "swqueue") in [2]. SIS_NODE did not outperform
SHARED_RUNQ on any of the benchmarks, so we elect to not compare against it
again for this v2 patch set.

[0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
[1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
[2]: https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/

Worth noting as well is that it was pointed out in [3] that the logic behind
including SIS_NODE in the first place should apply to SHARED_RUNQ (meaning
that e.g. very small Zen2 CPUs with only 3/4 cores per LLC should benefit from
having a single shared_runq stretch across multiple LLCs). I drafted a patch
that implements this by having a minimum LLC size for creating a shard, and
stretches a shared_runq across multiple LLCs if they're smaller than that
size, and sent it to Tejun to test on his Zen2. Tejun reported back that
SIS_NODE did not seem to make a difference:

[3]: https://lore.kernel.org/lkml/20230711114207.GK3062772@hirez.programming.kicks-ass.net/

                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
Vanilla:                    |  108.84s   |  0.0057  |
NO_SHARED_RUNQ:             |  108.82s   |  0.119s  |
SHARED_RUNQ:                |  108.17s   |  0.038s  |
SHARED_RUNQ w/ SIS_NODE:    |  108.87s   |  0.111s  |
                            o------------o----------o

I similarly tried running kcompile on SHARED_RUNQ with SIS_NODE on my 7950X
Zen3, but didn't see any gain relative to plain SHARED_RUNQ (though a gain was
observed relative to NO_SHARED_RUNQ, as described below).

Results
=======

Note that the motivation for the shared runqueue feature was originally
arrived at using experiments in the sched_ext framework that's currently being
proposed upstream. The ~1 - 1.6% improvement in HHVM throughput is similarly
visible using work-conserving sched_ext schedulers (even very simple ones like
global FIFO).

In both single and multi socket / CCX hosts, this can measurably improve
performance. In addition to the performance gains observed on our internal web
workloads, we also observed an improvement in common workloads such as kernel
compile and hackbench, when running shared runqueue.

On the other hand, some workloads suffer from SHARED_RUNQ: namely, workloads
that hammer the runqueue hard, such as netperf UDP_RR, or schbench -L -m 52
-p 512 -r 10 -t 1. This can be mitigated somewhat by sharding the shared
datastructures within a CCX, but it doesn't seem to eliminate all contention
in every scenario. On the positive side, it seems that sharding does not
materially harm the benchmarks run for this patch series; and in fact seems to
improve some workloads such as kernel compile.

Note that for the kernel compile workloads below, the compilation was done by
running make -j$(nproc) built-in.a on several different types of hosts
configured with make allyesconfig on commit a27648c74210 ("afs: Fix setting of
mtime when creating a file/dir/symlink") on Linus' tree (boost and turbo were
disabled on all of these hosts when the experiments were performed).

Finally, note that these results were from the patch set built off of commit
ebb83d84e49b ("sched/core: Avoid multiple calling update_rq_clock() in
__cfsb_csd_unthrottle()") on the sched/core branch of tip.git for easy
comparison with the v2 patch set results. The patches in their final form from
this set were rebased onto commit 88c56cfeaec4 ("sched/fair: Block nohz
tick_stop when cfs bandwidth in use").

=== Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000

Command: make -j$(nproc) built-in.a
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             |  581.95s   |  2.639s  |
SHARED_RUNQ:                |  577.02s   |  0.084s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ results in a statistically significant ~.85% improvement
over NO_SHARED_RUNQ. This suggests that enqueuing tasks in the shared runqueue
on every enqueue improves work conservation, and thanks to sharding, does not
result in contention.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             |  2.2492s   | .00001s  |
SHARED_RUNQ:                |  2.0217s   | .00065s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ in both forms performs exceptionally well compared to
NO_SHARED_RUNQ here, beating it by over 10%.
This was a surprising result, given that it seems advantageous to err on
the side of avoiding migration in hackbench: its tasks are short lived,
sending only 10k bytes worth of messages. The results of the benchmark
would seem to suggest, however, that minimizing runqueue delays is
preferable.

Command:
for i in `seq 128`; do
        netperf -6 -t UDP_RR -c -C -l $runtime &
done

                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             |  25037.45  | 2243.44  |
SHARED_RUNQ:                |  24952.50  | 1268.06  |
                            o-----------------------o

Takeaway: No statistical significance, though it is worth noting that
there is no regression for shared runqueue on the 7950X, while there is
a small regression on the Skylake and Milan hosts for SHARED_RUNQ as
described below.

=== Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===

CPU max MHz: 1601.0000
CPU min MHz: 800.0000

Command: make -j$(nproc) built-in.a

                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             |  1517.44s  |  2.8322s |
SHARED_RUNQ:                |  1516.51s  |  2.9450s |
                            o------------o----------o

Takeaway: There's no statistically significant gain here. I observed
what I claimed was a .23% win in v2, but it appears that this is not
actually statistically significant.

Command: hackbench --loops 10000

                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             |  5.3370s   |  .0012s  |
SHARED_RUNQ:                |  5.2668s   |  .0033s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ results in a ~1.3% improvement over
NO_SHARED_RUNQ. Also statistically significant, but smaller than the
10+% improvement observed on the 7950X.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
for i in `seq 128`; do
        netperf -6 -t UDP_RR -c -C -l $runtime &
done

                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             |  15699.32  |  377.01  |
SHARED_RUNQ:                |  14966.42  |  714.13  |
                            o-----------------------o

Takeaway: NO_SHARED_RUNQ beats SHARED_RUNQ by ~4.6%. This result makes
sense -- the workload is very heavy on the runqueue, so enqueuing tasks
in the shared runqueue in __enqueue_entity() would intuitively result in
increased contention on the shard lock.

=== Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===

CPU max MHz: 700.0000
CPU min MHz: 700.0000

Command: make -j$(nproc) built-in.a

                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             |  1568.55s  |  0.1568s |
SHARED_RUNQ:                |  1568.26s  |  1.2168s |
                            o------------o----------o

Takeaway: No statistically significant difference here. It might be
worth experimenting with work stealing in a follow-on patch set.

Command: hackbench --loops 10000

                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             |  5.2716s   | .00143s  |
SHARED_RUNQ:                |  5.1716s   | .00289s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ again wins, by about 2%.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
for i in `seq 128`; do
        netperf -6 -t UDP_RR -c -C -l $runtime &
done

                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             |  17482.03  | 4675.99  |
SHARED_RUNQ:                |  16697.25  | 9812.23  |
                            o-----------------------o

Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
SHARED_RUNQ, this time by ~4.5%. It's worth noting that in v2, the
NO_SHARED_RUNQ was only ~1.8% faster.
The variance is very high here, so the results of this benchmark should
be taken with a large grain of salt (noting that we do consistently see
NO_SHARED_RUNQ on top due to not contending on the shard lock).

Finally, let's look at how sharding affects the following schbench
incantation suggested by Chris in [4]:

schbench -L -m 52 -p 512 -r 10 -t 1

[4]: https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

The TL;DR is that sharding improves things a lot, but doesn't completely
fix the problem. Here are the results from running the schbench command
on the 18 core / 36 thread single CCX, single-socket Skylake:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name       con-bounces   contentions   waittime-min   waittime-max   waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max   holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
&shard->lock:       31510503      31510711           0.08          19.98     168932319.64           5.36      31700383       31843851           0.03          17.50      10273968.33           0.32
------------
&shard->lock        15731657  [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock        15756516  [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock           21766  [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock             772  [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock           23458  [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock        16505108  [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock        14981310  [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock             835  [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

These results are when we create only 3 shards (16 logical cores per
shard), so the contention may be a result of overly-coarse sharding. If
we run the schbench incantation with no sharding whatsoever, we see the
following significantly worse lock stats contention:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name       con-bounces   contentions   waittime-min   waittime-max   waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max   holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
&shard->lock:      117868635     118361486           0.09         393.01    1250954097.25          10.57     119345882      119780601           0.05         343.35      38313419.51           0.32
------------
&shard->lock        59169196  [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock        59084239  [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock          108051  [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock        60028355  [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock          119882  [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock        58213249  [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all. This roughly
matches the fact that we had 3 shards on the first workload run above.
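To make the shard-granularity comparisons here and below a bit more
concrete, the following is a minimal user-space sketch (not kernel code)
of how the final patch in this series sizes and indexes shards.
SHARED_RUNQ_SHARD_SZ is the compile-time shard size from that patch; the
36-CPU LLC value is just an example:

#include <stdio.h>

#define SHARED_RUNQ_SHARD_SZ 6

/* An LLC gets DIV_ROUND_UP(llc_size, SHARED_RUNQ_SHARD_SZ) shards. */
static unsigned int num_shards(unsigned int llc_size)
{
	return llc_size / SHARED_RUNQ_SHARD_SZ +
	       (llc_size % SHARED_RUNQ_SHARD_SZ != 0);
}

/*
 * A CPU indexes into its LLC's shards with (cpu >> 1) % num_shards; the
 * >> 1 keeps SMT siblings in the same shard when sibling CPU IDs are
 * adjacent.
 */
static unsigned int shard_idx(unsigned int cpu, unsigned int shards)
{
	return (cpu >> 1) % shards;
}

int main(void)
{
	unsigned int llc_size = 36;	/* e.g. an 18 core / 36 thread LLC */
	unsigned int shards = num_shards(llc_size);
	unsigned int cpu;

	for (cpu = 0; cpu < llc_size; cpu++)
		printf("cpu %2u -> shard %u of %u\n",
		       cpu, shard_idx(cpu, shards), shards);
	return 0;
}

Under the shard size chosen in the final patch, a 36-CPU LLC ends up
with 6 shards; the 3-shard and unsharded data above come from coarser
configurations of the same scheme.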
If we make the shards even smaller, the contention is comparably much
lower:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name       con-bounces   contentions   waittime-min   waittime-max   waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max   holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
&shard->lock:       13839849      13877596           0.08          13.23       5389564.95           0.39      46910241       48069307           0.06          16.40      16534469.35           0.34
------------
&shard->lock            3559  [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock         6992418  [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock         6881619  [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock         6640140  [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock            3523  [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock         7233933  [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the
schbench benchmark on Milan as well, but we contend more on the rq lock
than the shard lock:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name       con-bounces   contentions   waittime-min   waittime-max   waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max   holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
&rq->__lock:         9617614       9656091           0.10          79.64      69665812.00           7.21      18092700       67652829           0.11          82.38     344524858.87           5.09
-----------
&rq->__lock          6301611  [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock          2530807  [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock           109360  [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock           178218  [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock          3245506  [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock          1294355  [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock          2837804  [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock          1627866  [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
...................................................................................................................................................................................................
&shard->lock:        7338558       7343244           0.10          35.97       7173949.14           0.98      30200858       32679623           0.08          35.59      16270584.52           0.50
------------
&shard->lock         2004142  [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock         2611264  [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock         2727838  [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock         2737232  [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock         1693341  [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock         2912671  [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
...................................................................................................................................................................................................
If we look at the lock stats with SHARED_RUNQ disabled, the rq lock
still contends the most, but it's significantly less than with it
enabled:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name       con-bounces   contentions   waittime-min   waittime-max   waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max   holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
&rq->__lock:          791277        791690           0.12         110.54       4889787.63           6.18       1575996       62390275           0.13         112.66     316262440.56           5.07
-----------
&rq->__lock           263343  [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock            19394  [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock             4143  [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock            51094  [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock            23756  [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock           379048  [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock              677  [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock            47962  [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170

Worth noting is that increasing the granularity of the shards in general
improves very runqueue-heavy workloads such as netperf UDP_RR and this
schbench command, but it doesn't necessarily make a big difference for
every workload, or for sufficiently small CCXs such as the 7950X. It may
make sense to eventually allow users to control this with a debugfs
knob, but for now we'll elect to choose a default that resulted in good
performance for the benchmarks run for this patch series.

Conclusion
==========

SHARED_RUNQ in this form provides statistically significant wins for
several types of workloads and various CPU topologies. The reason for
this is roughly the same for all workloads: SHARED_RUNQ encourages work
conservation inside of a CCX by having a CPU do an O(# per-LLC shards)
iteration over the shared_runq shards in an LLC. We could similarly do
an O(n) iteration over all of the runqueues in the current LLC when a
core is going idle, but that's quite costly (especially for larger
LLCs), and sharded SHARED_RUNQ seems to provide a performant middle
ground between doing an O(n) walk, and doing an O(1) pull from a single
per-LLC shared runq.

For the workloads above, kernel compile and hackbench were clear winners
for SHARED_RUNQ (especially in __enqueue_entity()). The reason for the
improvement in kernel compile is of course that we have a heavily
CPU-bound workload where cache locality doesn't mean much; getting a CPU
is the #1 goal. As mentioned above, while I didn't expect to see an
improvement in hackbench, the results of the benchmark suggest that
minimizing runqueue delays is preferable to optimizing for L1/L2
locality.

Not all workloads benefit from SHARED_RUNQ, however. Workloads that
hammer the runqueue hard, such as netperf UDP_RR, or schbench -L -m 52
-p 512 -r 10 -t 1, tend to run into contention on the shard locks;
especially when enqueuing tasks in __enqueue_entity(). This can be
mitigated significantly by sharding the shared data structures within a
CCX, but it doesn't eliminate all contention, as described above.

Finally, while SHARED_RUNQ in this form encourages work conservation, it
of course does not guarantee it given that we don't implement any kind
of work stealing between shared_runq's.
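To make the O(# per-LLC shards) behavior concrete, here is a small,
self-contained toy model (plain user-space C, not kernel code) of what a
single newidle CPU does today within its own LLC: start at its own
shard, then walk the LLC's remaining shards and pull the first task
found, in the spirit of shared_runq_pick_next_task() in the final patch.
The shard representation and names below are made up purely for
illustration; all locking, task bookkeeping and migration are elided:

#include <stdio.h>

#define NUM_SHARDS 3

/* Each "shard" is modeled as nothing more than a count of queued tasks. */
static int shard_tasks[NUM_SHARDS] = { 0, 0, 2 };

/*
 * Start at this CPU's own shard, then walk the rest of the LLC's shards;
 * at most NUM_SHARDS pop attempts per newidle event.
 */
static int pull_one(unsigned int my_shard_idx)
{
	unsigned int i, idx;

	for (i = 0; i < NUM_SHARDS; i++) {
		idx = (my_shard_idx + i) % NUM_SHARDS;
		if (shard_tasks[idx] > 0) {
			shard_tasks[idx]--;
			return (int)idx;	/* pulled a task from this shard */
		}
	}
	return -1;				/* nothing to pull; go idle */
}

int main(void)
{
	printf("CPU in shard 0 pulled from shard %d\n", pull_one(0));
	printf("CPU in shard 1 pulled from shard %d\n", pull_one(1));
	printf("CPU in shard 0 pulled from shard %d\n", pull_one(0));
	return 0;
}

A CPU going newidle in any shard still finds the work queued in shard 2,
but nothing here ever looks outside the LLC, which is why cross-LLC
stealing remains future work.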
In the future, we could potentially push CPU utilization even higher by enabling work stealing between shared_runq's, likely between CCXs on the same NUMA node. Originally-by: Roman Gushchin Signed-off-by: David Vernet --- include/linux/sched.h | 2 + kernel/sched/core.c | 13 +++ kernel/sched/fair.c | 238 +++++++++++++++++++++++++++++++++++++++- kernel/sched/sched.h | 4 + kernel/sched/topology.c | 4 +- 5 files changed, 256 insertions(+), 5 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 2aab7be46f7e..8238069fd852 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -769,6 +769,8 @@ struct task_struct { unsigned long wakee_flip_decay_ts; struct task_struct *last_wakee; + struct list_head shared_runq_node; + /* * recent_used_cpu is initially set as the last CPU used by a task * that wakes affine another task. Waker/wakee relationships can diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 385c565da87f..fb7e71d3dc0a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4529,6 +4529,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) #ifdef CONFIG_SMP p->wake_entry.u_flags = CSD_TYPE_TTWU; p->migration_pending = NULL; + INIT_LIST_HEAD(&p->shared_runq_node); #endif init_sched_mm_cid(p); } @@ -9764,6 +9765,18 @@ int sched_cpu_deactivate(unsigned int cpu) return 0; } +void sched_update_domains(void) +{ + const struct sched_class *class; + + update_sched_domain_debugfs(); + + for_each_class(class) { + if (class->update_domains) + class->update_domains(); + } +} + static void sched_rq_cpu_starting(unsigned int cpu) { struct rq *rq = cpu_rq(cpu); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9c23e3b948fc..6e740f8da578 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -139,20 +139,235 @@ static int __init setup_sched_thermal_decay_shift(char *str) } __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); +/** + * struct shared_runq - Per-LLC queue structure for enqueuing and migrating + * runnable tasks within an LLC. + * + * WHAT + * ==== + * + * This structure enables the scheduler to be more aggressively work + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be + * pulled from when another core in the LLC is going to go idle. + * + * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq. + * Waking tasks are enqueued in the calling CPU's struct shared_runq in + * __enqueue_entity(), and are opportunistically pulled from the shared_runq + * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior + * to being pulled from the shared_runq, in which case they're simply dequeued + * from the shared_runq in __dequeue_entity(). + * + * There is currently no task-stealing between shared_runqs in different LLCs, + * which means that shared_runq is not fully work conserving. This could be + * added at a later time, with tasks likely only being stolen across + * shared_runqs on the same NUMA node to avoid violating NUMA affinities. + * + * HOW + * === + * + * A shared_runq is comprised of a list, and a spinlock for synchronization. + * Given that the critical section for a shared_runq is typically a fast list + * operation, and that the shared_runq is localized to a single LLC, the + * spinlock will typically only be contended on workloads that do little else + * other than hammer the runqueue. 
+ * + * WHY + * === + * + * As mentioned above, the main benefit of shared_runq is that it enables more + * aggressive work conservation in the scheduler. This can benefit workloads + * that benefit more from CPU utilization than from L1/L2 cache locality. + * + * shared_runqs are segmented across LLCs both to avoid contention on the + * shared_runq spinlock by minimizing the number of CPUs that could contend on + * it, as well as to strike a balance between work conservation, and L3 cache + * locality. + */ +struct shared_runq { + struct list_head list; + raw_spinlock_t lock; +} ____cacheline_aligned; + #ifdef CONFIG_SMP + +static DEFINE_PER_CPU(struct shared_runq, shared_runqs); + +static struct shared_runq *rq_shared_runq(struct rq *rq) +{ + return rq->cfs.shared_runq; +} + +static void shared_runq_reassign_domains(void) +{ + int i; + struct shared_runq *shared_runq; + struct rq *rq; + struct rq_flags rf; + + for_each_possible_cpu(i) { + rq = cpu_rq(i); + shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i)); + + rq_lock(rq, &rf); + rq->cfs.shared_runq = shared_runq; + rq_unlock(rq, &rf); + } +} + +static void __shared_runq_drain(struct shared_runq *shared_runq) +{ + struct task_struct *p, *tmp; + + raw_spin_lock(&shared_runq->lock); + list_for_each_entry_safe(p, tmp, &shared_runq->list, shared_runq_node) + list_del_init(&p->shared_runq_node); + raw_spin_unlock(&shared_runq->lock); +} + +static void update_domains_fair(void) +{ + int i; + struct shared_runq *shared_runq; + + /* Avoid racing with SHARED_RUNQ enable / disable. */ + lockdep_assert_cpus_held(); + + shared_runq_reassign_domains(); + + /* Ensure every core sees its updated shared_runq pointers. */ + synchronize_rcu(); + + /* + * Drain all tasks from all shared_runq's to ensure there are no stale + * tasks in any prior domain runq. This can cause us to drain live + * tasks that would otherwise have been safe to schedule, but this + * isn't a practical problem given how infrequently domains are + * rebuilt. + */ + for_each_possible_cpu(i) { + shared_runq = &per_cpu(shared_runqs, i); + __shared_runq_drain(shared_runq); + } +} + void shared_runq_toggle(bool enabling) -{} +{ + int cpu; + + if (enabling) + return; + + /* Avoid racing with hotplug. */ + lockdep_assert_cpus_held(); + + /* Ensure all cores have stopped enqueueing / dequeuing tasks. 
*/ + synchronize_rcu(); + + for_each_possible_cpu(cpu) { + int sd_id; + + sd_id = per_cpu(sd_llc_id, cpu); + if (cpu == sd_id) + __shared_runq_drain(rq_shared_runq(cpu_rq(cpu))); + } +} + +static struct task_struct *shared_runq_pop_task(struct rq *rq) +{ + struct task_struct *p; + struct shared_runq *shared_runq; + + shared_runq = rq_shared_runq(rq); + if (list_empty(&shared_runq->list)) + return NULL; + + raw_spin_lock(&shared_runq->lock); + p = list_first_entry_or_null(&shared_runq->list, struct task_struct, + shared_runq_node); + if (p && is_cpu_allowed(p, cpu_of(rq))) + list_del_init(&p->shared_runq_node); + else + p = NULL; + raw_spin_unlock(&shared_runq->lock); + + return p; +} + +static void shared_runq_push_task(struct rq *rq, struct task_struct *p) +{ + struct shared_runq *shared_runq; + + shared_runq = rq_shared_runq(rq); + raw_spin_lock(&shared_runq->lock); + list_add_tail(&p->shared_runq_node, &shared_runq->list); + raw_spin_unlock(&shared_runq->lock); +} static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p) -{} +{ + /* + * Only enqueue the task in the shared runqueue if: + * + * - SHARED_RUNQ is enabled + * - The task isn't pinned to a specific CPU + */ + if (p->nr_cpus_allowed == 1) + return; + + shared_runq_push_task(rq, p); +} static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) { - return 0; + struct task_struct *p = NULL; + struct rq *src_rq; + struct rq_flags src_rf; + int ret = -1; + + p = shared_runq_pop_task(rq); + if (!p) + return 0; + + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + + src_rq = task_rq_lock(p, &src_rf); + + if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) { + update_rq_clock(src_rq); + src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq)); + ret = 1; + } + + if (src_rq != rq) { + task_rq_unlock(src_rq, p, &src_rf); + raw_spin_rq_lock(rq); + } else { + rq_unpin_lock(rq, &src_rf); + raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags); + } + rq_repin_lock(rq, rf); + + return ret; } static void shared_runq_dequeue_task(struct task_struct *p) -{} +{ + struct shared_runq *shared_runq; + + if (!list_empty(&p->shared_runq_node)) { + shared_runq = rq_shared_runq(task_rq(p)); + raw_spin_lock(&shared_runq->lock); + /* + * Need to double-check for the list being empty to avoid + * racing with the list being drained on the domain recreation + * or SHARED_RUNQ feature enable / disable path. + */ + if (likely(!list_empty(&p->shared_runq_node))) + list_del_init(&p->shared_runq_node); + raw_spin_unlock(&shared_runq->lock); + } +} /* * For asym packing, by default the lower numbered CPU has higher priority. @@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) rcu_read_lock(); sd = rcu_dereference_check_sched_domain(this_rq->sd); + /* + * Skip <= LLC domains as they likely won't have any tasks if the + * shared runq is empty. 
+ */ + if (sched_feat(SHARED_RUNQ)) { + sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); + if (likely(sd)) + sd = sd->parent; + } + if (!READ_ONCE(this_rq->rd->overload) || (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) { @@ -12969,6 +13194,7 @@ DEFINE_SCHED_CLASS(fair) = { .task_dead = task_dead_fair, .set_cpus_allowed = set_cpus_allowed_common, + .update_domains = update_domains_fair, #endif .task_tick = task_tick_fair, @@ -13035,6 +13261,7 @@ __init void init_sched_fair_class(void) { #ifdef CONFIG_SMP int i; + struct shared_runq *shared_runq; for_each_possible_cpu(i) { zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i)); @@ -13044,6 +13271,9 @@ __init void init_sched_fair_class(void) INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i)); INIT_LIST_HEAD(&cpu_rq(i)->cfsb_csd_list); #endif + shared_runq = &per_cpu(shared_runqs, i); + INIT_LIST_HEAD(&shared_runq->list); + raw_spin_lock_init(&shared_runq->lock); } open_softirq(SCHED_SOFTIRQ, run_rebalance_domains); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a484bb527ee4..3665dd935649 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -487,9 +487,11 @@ extern int sched_group_set_idle(struct task_group *tg, long idle); #ifdef CONFIG_SMP extern void set_task_rq_fair(struct sched_entity *se, struct cfs_rq *prev, struct cfs_rq *next); +extern void sched_update_domains(void); #else /* !CONFIG_SMP */ static inline void set_task_rq_fair(struct sched_entity *se, struct cfs_rq *prev, struct cfs_rq *next) { } +static inline void sched_update_domains(void) {} #endif /* CONFIG_SMP */ #endif /* CONFIG_FAIR_GROUP_SCHED */ @@ -578,6 +580,7 @@ struct cfs_rq { #endif #ifdef CONFIG_SMP + struct shared_runq *shared_runq; /* * CFS load tracking */ @@ -2282,6 +2285,7 @@ struct sched_class { void (*rq_offline)(struct rq *rq); struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq); + void (*update_domains)(void); #endif void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 05a5bc678c08..8aaf644d4f2c 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2576,6 +2576,8 @@ int __init sched_init_domains(const struct cpumask *cpu_map) doms_cur = &fallback_doms; cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN)); err = build_sched_domains(doms_cur[0], NULL); + if (!err) + sched_update_domains(); return err; } @@ -2741,7 +2743,7 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[], dattr_cur = dattr_new; ndoms_cur = ndoms_new; - update_sched_domain_debugfs(); + sched_update_domains(); } /* From patchwork Wed Aug 9 22:12:18 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Vernet X-Patchwork-Id: 133608 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b824:0:b0:3f2:4152:657d with SMTP id z4csp57042vqi; Wed, 9 Aug 2023 15:59:20 -0700 (PDT) X-Google-Smtp-Source: AGHT+IE1Vo6Jg26NQYhCzef2JvjWrVJQlhQvBzALakgtl882o6lRgm++gW5OJQzTjwix1/o1U16k X-Received: by 2002:a9d:7d85:0:b0:6b8:90cd:47b5 with SMTP id j5-20020a9d7d85000000b006b890cd47b5mr586257otn.7.1691621959739; Wed, 09 Aug 2023 15:59:19 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1691621959; cv=none; d=google.com; s=arc-20160816; b=d3n8jWiE/3nTHAXsIbB1veYIn2qJMA5zKQuWDyZhACmU/HcNf9rQZF831qAO9Evi0i +TCYq7WsiOFXJI7juURd6PUXAIBhNor3t4CkBWJYBQzPuZQC6gKQgff2X1N/wYogiegs 
Eqs/ApEoGiL89+EFmgWQPzPSkqGYFhDQjaC23VIppAWG9OvR6TNf7fH2golLRzU6eq6B ouK+xVA845t1CkdKyLuG6KCTF8mBxA1xK/gWi9f9OOLtj1GpwT3w5btrSU7v93EDLWpB bnFqRT9mjOGLMTUlEO57V5/PLD6Zn5i86dCnWFgh1nvAz2e9fM619t0VvGjdaAGYO9M1 QQ4A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=289uBjwPOn362GkpAcGTsnLiJb6Nd9Y0Sebv5Fsn+ec=; fh=20RY08rsve2lwti6V2FQSjrHvxbvm4llznoKqUOZSeY=; b=a4RdG5snyjumSDDjnWC9dN/CvG4+2aEo7GGFaH5BOQgzZ7EGQlylAldokcew4MSgV3 KhZ3Bicf7PF9IZtWOm7UL5miCAPqsrJiSsDqRsWcbFi503TXZFLX83mBrkS1z5+se2/9 Nns7XL+1nS+m8ifJb1f3Qpb/JkMkAjjiToCNPbycwfLFUmlGj4v4zU3eW2umGuYXCH5U pOiTQqzbnSkbmV/F12M24NqYACI3shtJqQgfsEPCPJWtDh/mDWrX0KrpSsYlm33Pb8YX cPUsWl6wrMLw34IQ2lookla4ow4WIBlh2P462lZ8DHASRK2v4RYEXzTD8OYKWvvoq786 0Cvg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id o23-20020a656157000000b005652febd4b7si213006pgv.356.2023.08.09.15.59.04; Wed, 09 Aug 2023 15:59:19 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232558AbjHIWNR (ORCPT + 99 others); Wed, 9 Aug 2023 18:13:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54436 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232937AbjHIWNK (ORCPT ); Wed, 9 Aug 2023 18:13:10 -0400 Received: from mail-qv1-f43.google.com (mail-qv1-f43.google.com [209.85.219.43]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 70CC0213F for ; Wed, 9 Aug 2023 15:12:58 -0700 (PDT) Received: by mail-qv1-f43.google.com with SMTP id 6a1803df08f44-635f293884cso1864656d6.3 for ; Wed, 09 Aug 2023 15:12:58 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691619177; x=1692223977; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=289uBjwPOn362GkpAcGTsnLiJb6Nd9Y0Sebv5Fsn+ec=; b=hYxF0oshW+I07f/BqJomcfSnAi0dfzOmxXwrU0oGL/haYwc069LjqsZ7nWSawAmRMQ snkaeoPT9aS56UwqliNtYg9x/n/nEASuNzOQK63JDJpYfIEZzyNmKOPjbN5EfJ7Pdo1v +4SfGhvudoHvmOtei1tVnQF9XLW7xP8kCl2RPAMRx1FtMdTfray3n3DYmAFM9zcm8AyF 8Lq1LE8HBbIuY+QGBg/Qau63OXUWP6Bsy3HVo7TQVSBdFuw7vzPLmoEl08/QzR7t1cjs FrBB/koAbqGPx6zS/+2P6Qo/qQP9q1ab39w3cJ8rt5U2BmhM9ADdZj42RxerC2H46xSs sLwA== X-Gm-Message-State: AOJu0Yy8fRuQye0VmgJuEjVPaokxsDb7h0h/qalGEj4IdQBtEpVFNw0f uUp2DJjOXd7LogcJKgI5e1SiMISVlqgXNPed X-Received: by 2002:a0c:db06:0:b0:626:1906:bcac with SMTP id d6-20020a0cdb06000000b006261906bcacmr627824qvk.0.1691619177105; Wed, 09 Aug 2023 15:12:57 -0700 (PDT) Received: from localhost ([2620:10d:c091:400::5:ed08]) by smtp.gmail.com with ESMTPSA id p6-20020a0ce186000000b00631fea4d5bcsm4752789qvl.95.2023.08.09.15.12.56 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 09 Aug 2023 15:12:56 
-0700 (PDT) From: David Vernet To: linux-kernel@vger.kernel.org Cc: peterz@infradead.org, mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, tj@kernel.org, roman.gushchin@linux.dev, gautham.shenoy@amd.com, kprateek.nayak@amd.com, aaron.lu@intel.com, wuyun.abel@bytedance.com, kernel-team@meta.com Subject: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues Date: Wed, 9 Aug 2023 17:12:18 -0500 Message-ID: <20230809221218.163894-8-void@manifault.com> X-Mailer: git-send-email 2.41.0 In-Reply-To: <20230809221218.163894-1-void@manifault.com> References: <20230809221218.163894-1-void@manifault.com> MIME-Version: 1.0 X-Spam-Status: No, score=-1.4 required=5.0 tests=BAYES_00, FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,HEADER_FROM_DIFFERENT_DOMAINS, RCVD_IN_DNSWL_BLOCKED,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1773794188135816179 X-GMAIL-MSGID: 1773794188135816179 The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that tasks are put into on enqueue, and pulled from when a core in that LLC would otherwise go idle. For CPUs with large LLCs, this can sometimes cause significant contention, as illustrated in [0]. [0]: https://lore.kernel.org/all/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/ So as to try and mitigate this contention, we can instead shard the per-LLC runqueue into multiple per-LLC shards. While this doesn't outright prevent all contention, it does somewhat mitigate it. For example, if we run the following schbench command which does almost nothing other than pound the runqueue: schbench -L -m 52 -p 512 -r 10 -t 1 we observe with lockstats that sharding significantly decreases contention. 
3 shards: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &shard->lock: 31510503 31510711 0.08 19.98 168932319.64 5.36 31700383 31843851 0.03 17.50 10273968.33 0.32 ------------ &shard->lock 15731657 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510 &shard->lock 15756516 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530 &shard->lock 21766 [<00000000126ec6ab>] newidle_balance+0x45a/0x650 &shard->lock 772 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540 ------------ &shard->lock 23458 [<00000000126ec6ab>] newidle_balance+0x45a/0x650 &shard->lock 16505108 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530 &shard->lock 14981310 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510 &shard->lock 835 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540 No sharding: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &shard->lock: 117868635 118361486 0.09 393.01 1250954097.25 10.57 119345882 119780601 0.05 343.35 38313419.51 0.32 ------------ &shard->lock 59169196 [<0000000060507011>] __enqueue_entity+0xdc/0x110 &shard->lock 59084239 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0 &shard->lock 108051 [<00000000084a6193>] newidle_balance+0x45a/0x650 ------------ &shard->lock 60028355 [<0000000060507011>] __enqueue_entity+0xdc/0x110 &shard->lock 119882 [<00000000084a6193>] newidle_balance+0x45a/0x650 &shard->lock 58213249 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0 The contention is ~3-4x worse if we don't shard at all. This roughly matches the fact that we had 3 shards on the host where this was collected. This could be addressed in future patch sets by adding a debugfs knob to control the sharding granularity. If we make the shards even smaller (what's in this patch, i.e. 
a size of 6), the contention goes away almost entirely: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ &shard->lock: 13839849 13877596 0.08 13.23 5389564.95 0.39 46910241 48069307 0.06 16.40 16534469.35 0.34 ------------ &shard->lock 3559 [<00000000ea455dcc>] newidle_balance+0x45a/0x650 &shard->lock 6992418 [<000000002266f400>] __dequeue_entity+0x78/0xa0 &shard->lock 6881619 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110 ------------ &shard->lock 6640140 [<000000002266f400>] __dequeue_entity+0x78/0xa0 &shard->lock 3523 [<00000000ea455dcc>] newidle_balance+0x45a/0x650 &shard->lock 7233933 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110 Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench benchmark on Milan, but we contend even more on the rq lock: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &rq->__lock: 9617614 9656091 0.10 79.64 69665812.00 7.21 18092700 67652829 0.11 82.38 344524858.87 5.09 ----------- &rq->__lock 6301611 [<000000003e63bf26>] task_rq_lock+0x43/0xe0 &rq->__lock 2530807 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 109360 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 &rq->__lock 178218 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 ----------- &rq->__lock 3245506 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 1294355 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 &rq->__lock 2837804 [<000000003e63bf26>] task_rq_lock+0x43/0xe0 &rq->__lock 1627866 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 .................................................................................................................................................................................................. &shard->lock: 7338558 7343244 0.10 35.97 7173949.14 0.98 30200858 32679623 0.08 35.59 16270584.52 0.50 ------------ &shard->lock 2004142 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0 &shard->lock 2611264 [<00000000473978cc>] newidle_balance+0x45a/0x650 &shard->lock 2727838 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110 ------------ &shard->lock 2737232 [<00000000473978cc>] newidle_balance+0x45a/0x650 &shard->lock 1693341 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0 &shard->lock 2912671 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110 ................................................................................................................................................................................................... 
If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still contends the most, but it's significantly less than with it enabled: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- &rq->__lock: 791277 791690 0.12 110.54 4889787.63 6.18 1575996 62390275 0.13 112.66 316262440.56 5.07 ----------- &rq->__lock 263343 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 19394 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 &rq->__lock 4143 [<000000003b542e83>] __task_rq_lock+0x51/0xf0 &rq->__lock 51094 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 ----------- &rq->__lock 23756 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10 &rq->__lock 379048 [<00000000516703f0>] __schedule+0x72/0xaa0 &rq->__lock 677 [<000000003b542e83>] __task_rq_lock+0x51/0xf0 &rq->__lock 47962 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170 In general, the takeaway here is that sharding does help with contention, but it's not necessarily one size fits all, and it's workload dependent. For now, let's include sharding to try and avoid contention, and because it doesn't seem to regress CPUs that don't need it such as the AMD 7950X. Suggested-by: Peter Zijlstra Signed-off-by: David Vernet --- kernel/sched/fair.c | 149 ++++++++++++++++++++++++++++++------------- kernel/sched/sched.h | 3 +- 2 files changed, 108 insertions(+), 44 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6e740f8da578..d67d86d3bfdf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -143,19 +143,27 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); * struct shared_runq - Per-LLC queue structure for enqueuing and migrating * runnable tasks within an LLC. * + * struct shared_runq_shard - A structure containing a task list and a spinlock + * for a subset of cores in a struct shared_runq. + * * WHAT * ==== * * This structure enables the scheduler to be more aggressively work - * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be - * pulled from when another core in the LLC is going to go idle. + * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can + * then be pulled from when another core in the LLC is going to go idle. + * + * struct rq stores two pointers in its struct cfs_rq: + * + * 1. The per-LLC struct shared_runq which contains one or more shards of + * enqueued tasks. * - * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq. - * Waking tasks are enqueued in the calling CPU's struct shared_runq in - * __enqueue_entity(), and are opportunistically pulled from the shared_runq - * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior - * to being pulled from the shared_runq, in which case they're simply dequeued - * from the shared_runq in __dequeue_entity(). + * 2. The shard inside of the per-LLC struct shared_runq which contains the + * list of runnable tasks for that shard. 
+ * + * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard in + * __enqueue_entity(), and are opportunistically pulled from the shared_runq in + * newidle_balance(). Pulling from shards is an O(# shards) operation. * * There is currently no task-stealing between shared_runqs in different LLCs, * which means that shared_runq is not fully work conserving. This could be @@ -165,11 +173,12 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); * HOW * === * - * A shared_runq is comprised of a list, and a spinlock for synchronization. - * Given that the critical section for a shared_runq is typically a fast list - * operation, and that the shared_runq is localized to a single LLC, the - * spinlock will typically only be contended on workloads that do little else - * other than hammer the runqueue. + * A struct shared_runq_shard is comprised of a list, and a spinlock for + * synchronization. Given that the critical section for a shared_runq is + * typically a fast list operation, and that the shared_runq_shard is localized + * to a subset of cores on a single LLC (plus other cores in the LLC that pull + * from the shard in newidle_balance()), the spinlock will typically only be + * contended on workloads that do little else other than hammer the runqueue. * * WHY * === @@ -183,11 +192,21 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); * it, as well as to strike a balance between work conservation, and L3 cache * locality. */ -struct shared_runq { +struct shared_runq_shard { struct list_head list; raw_spinlock_t lock; } ____cacheline_aligned; +/* This would likely work better as a configurable knob via debugfs */ +#define SHARED_RUNQ_SHARD_SZ 6 +#define SHARED_RUNQ_MAX_SHARDS \ + ((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0)) + +struct shared_runq { + unsigned int num_shards; + struct shared_runq_shard shards[SHARED_RUNQ_MAX_SHARDS]; +} ____cacheline_aligned; + #ifdef CONFIG_SMP static DEFINE_PER_CPU(struct shared_runq, shared_runqs); @@ -197,31 +216,61 @@ static struct shared_runq *rq_shared_runq(struct rq *rq) return rq->cfs.shared_runq; } +static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq) +{ + return rq->cfs.shard; +} + +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu) +{ + return (cpu >> 1) % runq->num_shards; +} + static void shared_runq_reassign_domains(void) { int i; struct shared_runq *shared_runq; struct rq *rq; struct rq_flags rf; + unsigned int num_shards, shard_idx; + + for_each_possible_cpu(i) { + if (per_cpu(sd_llc_id, i) == i) { + shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i)); + + num_shards = per_cpu(sd_llc_size, i) / SHARED_RUNQ_SHARD_SZ; + if (per_cpu(sd_llc_size, i) % SHARED_RUNQ_SHARD_SZ) + num_shards++; + shared_runq->num_shards = num_shards; + } + } for_each_possible_cpu(i) { rq = cpu_rq(i); shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i)); + shard_idx = shared_runq_shard_idx(shared_runq, i); rq_lock(rq, &rf); rq->cfs.shared_runq = shared_runq; + rq->cfs.shard = &shared_runq->shards[shard_idx]; rq_unlock(rq, &rf); } } static void __shared_runq_drain(struct shared_runq *shared_runq) { - struct task_struct *p, *tmp; + unsigned int i; - raw_spin_lock(&shared_runq->lock); - list_for_each_entry_safe(p, tmp, &shared_runq->list, shared_runq_node) - list_del_init(&p->shared_runq_node); - raw_spin_unlock(&shared_runq->lock); + for (i = 0; i < shared_runq->num_shards; i++) { + struct shared_runq_shard *shard; + struct 
task_struct *p, *tmp; + + shard = &shared_runq->shards[i]; + raw_spin_lock(&shard->lock); + list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node) + list_del_init(&p->shared_runq_node); + raw_spin_unlock(&shard->lock); + } } static void update_domains_fair(void) @@ -272,35 +321,32 @@ void shared_runq_toggle(bool enabling) } } -static struct task_struct *shared_runq_pop_task(struct rq *rq) +static struct task_struct * +shared_runq_pop_task(struct shared_runq_shard *shard, int target) { struct task_struct *p; - struct shared_runq *shared_runq; - shared_runq = rq_shared_runq(rq); - if (list_empty(&shared_runq->list)) + if (list_empty(&shard->list)) return NULL; - raw_spin_lock(&shared_runq->lock); - p = list_first_entry_or_null(&shared_runq->list, struct task_struct, + raw_spin_lock(&shard->lock); + p = list_first_entry_or_null(&shard->list, struct task_struct, shared_runq_node); - if (p && is_cpu_allowed(p, cpu_of(rq))) + if (p && is_cpu_allowed(p, target)) list_del_init(&p->shared_runq_node); else p = NULL; - raw_spin_unlock(&shared_runq->lock); + raw_spin_unlock(&shard->lock); return p; } -static void shared_runq_push_task(struct rq *rq, struct task_struct *p) +static void shared_runq_push_task(struct shared_runq_shard *shard, + struct task_struct *p) { - struct shared_runq *shared_runq; - - shared_runq = rq_shared_runq(rq); - raw_spin_lock(&shared_runq->lock); - list_add_tail(&p->shared_runq_node, &shared_runq->list); - raw_spin_unlock(&shared_runq->lock); + raw_spin_lock(&shard->lock); + list_add_tail(&p->shared_runq_node, &shard->list); + raw_spin_unlock(&shard->lock); } static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p) @@ -314,7 +360,7 @@ static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p) if (p->nr_cpus_allowed == 1) return; - shared_runq_push_task(rq, p); + shared_runq_push_task(rq_shared_runq_shard(rq), p); } static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) @@ -322,9 +368,22 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) struct task_struct *p = NULL; struct rq *src_rq; struct rq_flags src_rf; + struct shared_runq *shared_runq; + struct shared_runq_shard *shard; + u32 i, starting_idx, curr_idx, num_shards; int ret = -1; - p = shared_runq_pop_task(rq); + shared_runq = rq_shared_runq(rq); + num_shards = shared_runq->num_shards; + starting_idx = shared_runq_shard_idx(shared_runq, cpu_of(rq)); + for (i = 0; i < num_shards; i++) { + curr_idx = (starting_idx + i) % num_shards; + shard = &shared_runq->shards[curr_idx]; + + p = shared_runq_pop_task(shard, cpu_of(rq)); + if (p) + break; + } if (!p) return 0; @@ -353,11 +412,11 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf) static void shared_runq_dequeue_task(struct task_struct *p) { - struct shared_runq *shared_runq; + struct shared_runq_shard *shard; if (!list_empty(&p->shared_runq_node)) { - shared_runq = rq_shared_runq(task_rq(p)); - raw_spin_lock(&shared_runq->lock); + shard = rq_shared_runq_shard(task_rq(p)); + raw_spin_lock(&shard->lock); /* * Need to double-check for the list being empty to avoid * racing with the list being drained on the domain recreation @@ -365,7 +424,7 @@ static void shared_runq_dequeue_task(struct task_struct *p) */ if (likely(!list_empty(&p->shared_runq_node))) list_del_init(&p->shared_runq_node); - raw_spin_unlock(&shared_runq->lock); + raw_spin_unlock(&shard->lock); } } @@ -13260,8 +13319,9 @@ void show_numa_stats(struct task_struct *p, struct seq_file *m) __init void 
init_sched_fair_class(void) { #ifdef CONFIG_SMP - int i; + int i, j; struct shared_runq *shared_runq; + struct shared_runq_shard *shard; for_each_possible_cpu(i) { zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i)); @@ -13272,8 +13332,11 @@ __init void init_sched_fair_class(void) INIT_LIST_HEAD(&cpu_rq(i)->cfsb_csd_list); #endif shared_runq = &per_cpu(shared_runqs, i); - INIT_LIST_HEAD(&shared_runq->list); - raw_spin_lock_init(&shared_runq->lock); + for (j = 0; j < SHARED_RUNQ_MAX_SHARDS; j++) { + shard = &shared_runq->shards[j]; + INIT_LIST_HEAD(&shard->list); + raw_spin_lock_init(&shard->lock); + } } open_softirq(SCHED_SOFTIRQ, run_rebalance_domains); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 3665dd935649..b504f8f4416b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -580,7 +580,8 @@ struct cfs_rq { #endif #ifdef CONFIG_SMP - struct shared_runq *shared_runq; + struct shared_runq *shared_runq; + struct shared_runq_shard *shard; /* * CFS load tracking */