From patchwork Fri Feb 2 08:09:16 2024
X-Patchwork-Submitter: Valentin Schneider
X-Patchwork-Id: 195686
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Benjamin Segall, Ingo Molnar, Peter Zijlstra, Juri Lelli,
 Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
 Daniel Bristot de Oliveira, Phil Auld, Clark Williams, Tomas Glozar
Subject: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return
 to userspace
Date: Fri, 2 Feb 2024 09:09:16 +0100
Message-ID: <20240202080920.3337862-2-vschneid@redhat.com>
In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com>
References: <20240202080920.3337862-1-vschneid@redhat.com>

From: Benjamin Segall

The basic idea of this implementation is to maintain duplicate runqueues
in each cfs_rq that contain duplicate pointers to sched_entities which
should bypass throttling. Then we can skip throttling cfs_rqs that have
any such children, and when we pick inside any not-actually-throttled
cfs_rq, we only look at this duplicated list.

"Which tasks should bypass throttling" here is "all schedule() calls
that don't set a special flag", but could instead involve the lockdep
markers (except for the problem of percpu-rwsem and similar) or explicit
flags around syscalls and faults, or something else.

This approach avoids any O(tasks) loops, but leaves partially-throttled
cfs_rqs still contributing their full h_nr_running to their parents,
which might result in worse balancing. Also it adds more (generally
still small) overhead to the common enqueue/dequeue/pick paths.

The very basic debug test added is to run a cpusoaker and "cat
/sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
cgroup with a quota < 1 cpu.
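As a concrete illustration of that test, both halves can be as small as
the sketch below (not part of the patch; "soaker" and "spinner" are
hypothetical names, and both programs are assumed to be pinned to the
same CPU in the same cgroup with quota < 1 CPU):

	/* soaker.c: pure-userspace busy loop, i.e. the always-throttleable half */
	int main(void)
	{
		for (;;)
			;
	}

	/* spinner.c: C equivalent of "cat /sys/kernel/debug/sched_locked_spin";
	 * the read() spins in kernelspace for ~1000 jiffies under a mutex
	 * before returning EOF.
	 */
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[64];
		int fd = open("/sys/kernel/debug/sched_locked_spin", O_RDONLY);

		if (fd < 0)
			return 1;
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		return close(fd);
	}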
Not-signed-off-by: Benjamin Segall
[Slight comment / naming changes]
Signed-off-by: Valentin Schneider
---
 include/linux/sched.h |   7 ++
 kernel/entry/common.c |   2 +-
 kernel/entry/kvm.c    |   2 +-
 kernel/sched/core.c   |  20 ++++
 kernel/sched/debug.c  |  28 +++++
 kernel/sched/fair.c   | 232 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  |   3 +
 7 files changed, 281 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03bfe9ab29511..4a0105d1eaa21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -303,6 +303,7 @@ extern long schedule_timeout_killable(long timeout);
 extern long schedule_timeout_uninterruptible(long timeout);
 extern long schedule_timeout_idle(long timeout);
 asmlinkage void schedule(void);
+asmlinkage void schedule_usermode(void);
 extern void schedule_preempt_disabled(void);
 asmlinkage void preempt_schedule_irq(void);
 #ifdef CONFIG_PREEMPT_RT
@@ -553,6 +554,9 @@ struct sched_entity {
 	struct cfs_rq		*my_q;
 	/* cached value of my_q->h_nr_running */
 	unsigned long		runnable_weight;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct list_head	kernel_node;
+#endif
 #endif

 #ifdef CONFIG_SMP
@@ -1539,6 +1543,9 @@ struct task_struct {
 	struct user_event_mm	*user_event_mm;
 #endif

+#ifdef CONFIG_CFS_BANDWIDTH
+	atomic_t		in_return_to_user;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d7ee4bc3f2ba3..16b5432a62c6f 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -156,7 +156,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		local_irq_enable_exit_to_user(ti_work);

 		if (ti_work & _TIF_NEED_RESCHED)
-			schedule();
+			schedule_usermode(); /* TODO: also all of the arch/ loops that don't use this yet */

 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd1..fc4b73de07539 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -14,7 +14,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 		}

 		if (ti_work & _TIF_NEED_RESCHED)
-			schedule();
+			schedule_usermode();

 		if (ti_work & _TIF_NOTIFY_RESUME)
 			resume_user_mode_work(NULL);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f0..a7c028fad5a89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4529,6 +4529,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	INIT_LIST_HEAD(&p->se.kernel_node);
+	atomic_set(&p->in_return_to_user, 0);
+#endif

 #ifdef CONFIG_SCHEDSTATS
 	/* Even if schedstat is disabled, there should not be garbage */
@@ -6818,6 +6822,22 @@ asmlinkage __visible void __sched schedule(void)
 }
 EXPORT_SYMBOL(schedule);

+asmlinkage __visible void __sched schedule_usermode(void)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+	/*
+	 * This is only atomic because of this simple implementation. We could
+	 * do something with an SM_USER to avoid other-cpu scheduler operations
+	 * racing against these writes.
+	 */
+	atomic_set(&current->in_return_to_user, true);
+	schedule();
+	atomic_set(&current->in_return_to_user, false);
+#else
+	schedule();
+#endif
+}
+
 /*
  * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
  * state (have scheduled out non-voluntarily) by making sure that all
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834df..4a89dbc3ddfcd 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -319,6 +319,32 @@ static const struct file_operations sched_verbose_fops = {
 	.llseek		= default_llseek,
 };

+static DEFINE_MUTEX(sched_debug_spin_mutex);
+static int sched_debug_spin_show(struct seq_file *m, void *v) {
+	int count;
+	mutex_lock(&sched_debug_spin_mutex);
+	for (count = 0; count < 1000; count++) {
+		u64 start2;
+		start2 = jiffies;
+		while (jiffies == start2)
+			cpu_relax();
+		schedule();
+	}
+	mutex_unlock(&sched_debug_spin_mutex);
+	return 0;
+}
+static int sched_debug_spin_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_debug_spin_show, NULL);
+}
+
+static const struct file_operations sched_debug_spin_fops = {
+	.open		= sched_debug_spin_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static const struct seq_operations sched_debug_sops;

 static int sched_debug_open(struct inode *inode, struct file *filp)
@@ -374,6 +400,8 @@ static __init int sched_init_debug(void)

 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);

+	debugfs_create_file("sched_locked_spin", 0444, NULL, NULL,
+			    &sched_debug_spin_fops);
 	return 0;
 }
 late_initcall(sched_init_debug);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a037..a1808459a5acc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -128,6 +128,7 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: 5 msec, units: microseconds)
  */
 static unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
+static unsigned int sysctl_sched_cfs_bandwidth_kernel_bypass	= 1;
 #endif

 #ifdef CONFIG_NUMA_BALANCING
@@ -146,6 +147,15 @@ static struct ctl_table sched_fair_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "sched_cfs_bandwidth_kernel_bypass",
+		.data		= &sysctl_sched_cfs_bandwidth_kernel_bypass,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	{
@@ -5445,14 +5455,34 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

 /*
  * Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
+ * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel
+ * 2) keep things fair between processes/task groups
+ * 3) pick the "next" process, since someone really wants that to run
+ * 4) pick the "last" process, for cache locality
+ * 5) do not run the "skip" process, if something else is available
  */
 static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq)
+pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
 {
+#ifdef CONFIG_CFS_BANDWIDTH
+	/*
+	 * TODO: This might trigger, I'm not sure/don't remember. Regardless,
+	 * while we do not explicitly handle the case where h_kernel_running
+	 * goes to 0, we will call account/check_cfs_rq_runtime at worst in
+	 * entity_tick and notice that we can now properly do the full
+	 * throttle_cfs_rq.
+	 */
+	WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
+	if (throttled && !list_empty(&cfs_rq->kernel_children)) {
+		/*
+		 * TODO: you'd want to factor out pick_eevdf to just take
+		 * tasks_timeline, and replace this list with a second rbtree
+		 * and a call to pick_eevdf.
+		 */
+		return list_first_entry(&cfs_rq->kernel_children,
+					struct sched_entity, kernel_node);
+	}
+#endif
 	/*
 	 * Enabling NEXT_BUDDY will affect latency but not fairness.
 	 */
@@ -5651,8 +5681,14 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	/*
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
+	 *
+	 * Don't resched_curr() if curr is in the kernel. We won't throttle the
+	 * cfs_rq if any task is in the kernel, and if curr in particular is we
+	 * don't need to preempt it in favor of whatever other task is in the
+	 * kernel.
 	 */
-	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr) &&
+	    list_empty(&rq_of(cfs_rq)->curr->se.kernel_node))
 		resched_curr(rq_of(cfs_rq));
 }

@@ -5741,12 +5777,22 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	return 0;
 }

+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
+
 static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta, dequeue = 1;
+	long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
+
+	/*
+	 * We don't actually throttle, though account() will have made sure to
+	 * resched us so that we pick into a kernel task.
+	 */
+	if (cfs_rq->h_kernel_running)
+		return false;

 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5778,6 +5824,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)

 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
+	kernel_delta = cfs_rq->h_kernel_running;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 		/* throttled entity or throttle-on-deactivate */
@@ -5791,6 +5838,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)

 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
+		dequeue_kernel(qcfs_rq, se, kernel_delta);

 		if (qcfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5813,6 +5861,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)

 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
+		dequeue_kernel(qcfs_rq, se, kernel_delta);
 	}

 	/* At this point se is NULL and we are at root level*/
@@ -5835,7 +5884,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta;
+	long task_delta, idle_task_delta, kernel_delta;

 	se = cfs_rq->tg->se[cpu_of(rq)];

@@ -5870,6 +5919,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
+	kernel_delta = cfs_rq->h_kernel_running;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);

@@ -5882,6 +5932,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
+		enqueue_kernel(qcfs_rq, se, kernel_delta);

 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -5899,6 +5950,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)

 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
+		enqueue_kernel(qcfs_rq, se, kernel_delta);

 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -6557,6 +6609,86 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
 }
 #endif

+/*
+ * We keep track of all children that are runnable in the kernel with a count of
+ * all descendants. The state is checked on enqueue and put_prev (and hard
+ * cleared on dequeue), and is stored just as the filled/empty state of the
+ * kernel_node list entry.
+ *
+ * These are simple helpers that do both parts, and should be called bottom-up
+ * until hitting a throttled cfs_rq whenever a task changes state (or a cfs_rq
+ * is (un)throttled).
+ */
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+	if (count == 0)
+		return;
+
+	if (list_empty(&se->kernel_node))
+		list_add(&se->kernel_node, &cfs_rq->kernel_children);
+	cfs_rq->h_kernel_running += count;
+}
+
+static bool is_kernel_task(struct task_struct *p)
+{
+	return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_return_to_user);
+}
+
+/*
+ * When called on a task this always transitions it to a !kernel state.
+ *
+ * When called on a group it is just synchronizing the state with the new
+ * h_kernel_waiters, unless it has been throttled and is !on_rq
+ */
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count)
+{
+	if (count == 0)
+		return;
+
+	if (!se->on_rq || entity_is_task(se) ||
+	    !group_cfs_rq(se)->h_kernel_running)
+		list_del_init(&se->kernel_node);
+	cfs_rq->h_kernel_running -= count;
+}
+
+/*
+ * Returns if the cfs_rq "should" be throttled but might not be because of
+ * kernel threads bypassing throttle.
+ */
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_bandwidth_used())
+		return false;
+
+	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+		return false;
+	return true;
+}
+
+static void unthrottle_on_enqueue(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+
+	if (!cfs_bandwidth_used() || !sysctl_sched_cfs_bandwidth_kernel_bypass)
+		return;
+	if (!cfs_rq_of(&p->se)->throttle_count)
+		return;
+
+	/*
+	 * MAYBE TODO: doing it this simple way is O(throttle_count *
+	 * cgroup_depth). We could optimize that into a single pass, but making
+	 * a mostly-copy of unthrottle_cfs_rq that does that is a pain and easy
+	 * to get wrong. (And even without unthrottle_on_enqueue it's O(nm),
+	 * just not while holding rq->lock the whole time)
+	 */
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq->throttled)
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+
 #else /* CONFIG_CFS_BANDWIDTH */

 static inline bool cfs_bandwidth_used(void)
@@ -6604,6 +6736,16 @@ bool cfs_task_bw_constrained(struct task_struct *p)
 	return false;
 }
 #endif
+static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count) {}
+static inline bool is_kernel_task(struct task_struct *p)
+{
+	return false;
+}
+static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
+{
+	return false;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */

 #if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
@@ -6707,6 +6849,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	int task_new = !(flags & ENQUEUE_WAKEUP);
+	bool kernel_task = is_kernel_task(p);

 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -6735,6 +6878,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
+		if (kernel_task)
+			enqueue_kernel(cfs_rq, se, 1);

 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6755,6 +6900,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
+		if (kernel_task)
+			enqueue_kernel(cfs_rq, se, 1);

 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6785,6 +6932,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	assert_list_leaf_cfs_rq(rq);

 	hrtick_update(rq);
+
+	if (kernel_task)
+		unthrottle_on_enqueue(p);
 }

 static void set_next_buddy(struct sched_entity *se);
@@ -6801,6 +6951,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int task_sleep = flags & DEQUEUE_SLEEP;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	bool was_sched_idle = sched_idle_rq(rq);
+	bool kernel_task = !list_empty(&p->se.kernel_node);

 	util_est_dequeue(&rq->cfs, p);

@@ -6813,6 +6964,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
+		if (kernel_task)
+			dequeue_kernel(cfs_rq, se, 1);

 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6845,6 +6998,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;
+		if (kernel_task)
+			dequeue_kernel(cfs_rq, se, 1);

 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -8343,11 +8498,40 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 		resched_curr(rq);
 }

+static void handle_kernel_task_prev(struct task_struct *prev)
+{
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct sched_entity *se = &prev->se;
+	bool p_in_kernel = is_kernel_task(prev);
+	bool p_in_kernel_tree = !list_empty(&se->kernel_node);
+	/*
+	 * These extra loops are bad and against the whole point of the merged
+	 * PNT, but it's a pain to merge, particularly since we want it to occur
+	 * before check_cfs_runtime().
+	 */
+	if (p_in_kernel_tree && !p_in_kernel) {
+		WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
+		for_each_sched_entity(se) {
+			dequeue_kernel(cfs_rq_of(se), se, 1);
+			if (cfs_rq_throttled(cfs_rq_of(se)))
+				break;
+		}
+	} else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) {
+		for_each_sched_entity(se) {
+			enqueue_kernel(cfs_rq_of(se), se, 1);
+			if (cfs_rq_throttled(cfs_rq_of(se)))
+				break;
+		}
+	}
+#endif
+}
+
 #ifdef CONFIG_SMP
 static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
+	bool throttled = false;

 again:
 	cfs_rq = &rq->cfs;
@@ -8368,7 +8552,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 			goto again;
 		}

-		se = pick_next_entity(cfs_rq);
+		if (cfs_rq_throttled_loose(cfs_rq))
+			throttled = true;
+
+		se = pick_next_entity(cfs_rq, throttled);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);

@@ -8383,6 +8570,14 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	struct sched_entity *se;
 	struct task_struct *p;
 	int new_tasks;
+	bool throttled;
+
+	/*
+	 * We want to handle this before check_cfs_runtime(prev). We'll
+	 * duplicate a little work in the goto simple case, but that's fine
+	 */
+	if (prev)
+		handle_kernel_task_prev(prev);

 again:
 	if (!sched_fair_runnable(rq))
@@ -8400,6 +8595,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	 * hierarchy, only change the part that actually changes.
 	 */
+	throttled = false;
 	do {
 		struct sched_entity *curr = cfs_rq->curr;

@@ -8431,7 +8627,10 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 			}
 		}

-		se = pick_next_entity(cfs_rq);
+		if (cfs_rq_throttled_loose(cfs_rq))
+			throttled = true;
+
+		se = pick_next_entity(cfs_rq, throttled);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);

@@ -8469,8 +8668,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	if (prev)
 		put_prev_task(rq, prev);

+	throttled = false;
 	do {
-		se = pick_next_entity(cfs_rq);
+		if (cfs_rq_throttled_loose(cfs_rq))
+			throttled = true;
+		se = pick_next_entity(cfs_rq, throttled);
 		set_next_entity(cfs_rq, se);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);

@@ -8534,6 +8736,8 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;

+	handle_kernel_task_prev(prev);
+
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
@@ -12818,6 +13022,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	INIT_LIST_HEAD(&cfs_rq->kernel_children);
+#endif
 }

 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -12970,6 +13177,9 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 	/* guarantee group entities always have weight */
 	update_load_set(&se->load, NICE_0_LOAD);
 	se->parent = parent;
+#ifdef CONFIG_CFS_BANDWIDTH
+	INIT_LIST_HEAD(&se->kernel_node);
+#endif
 }

 static DEFINE_MUTEX(shares_mutex);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77de..0b33ce2e60555 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -580,6 +580,7 @@ struct cfs_rq {

 	struct rb_root_cached	tasks_timeline;

+
 	/*
 	 * 'curr' points to currently running entity on this cfs_rq.
 	 * It is set to NULL otherwise (i.e when none are currently running).
@@ -658,8 +659,10 @@ struct cfs_rq {
 	u64			throttled_clock_self_time;
 	int			throttled;
 	int			throttle_count;
+	int			h_kernel_running;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
+	struct list_head	kernel_children;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };

From patchwork Fri Feb 2 08:09:17 2024
X-Patchwork-Submitter: Valentin Schneider
X-Patchwork-Id: 195685
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Benjamin Segall, Ingo Molnar, Peter Zijlstra, Juri Lelli,
 Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
 Daniel Bristot de Oliveira, Phil Auld, Clark Williams, Tomas Glozar
Subject: [RFC PATCH v2 2/5] sched: Note schedule() invocations at
 return-to-user with SM_USER
Date: Fri, 2 Feb 2024 09:09:17 +0100
Message-ID: <20240202080920.3337862-3-vschneid@redhat.com>
In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com>
References: <20240202080920.3337862-1-vschneid@redhat.com>

task_struct.in_return_to_user is currently updated via atomic operations
in schedule_usermode(). However, one can note:

o .in_return_to_user is only updated for the current task
o There are no remote (smp_processor_id() != task_cpu(p)) accesses to
  .in_return_to_user

Add schedule_with_mode() to factorize schedule() with different flags to
pass down to __schedule_loop(). Add SM_USER to denote schedule() calls
from return-to-userspace points.

Update .in_return_to_user from within the preemption-disabled,
rq_lock-held part of __schedule().
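Condensed to its moving parts, the scheme the diff below implements looks
like this (a sketch only, eliding the rest of __schedule()):

	/* Return-to-user resched points now go through: */
	schedule_usermode();		/* -> schedule_with_mode(SM_USER) */

	/* __schedule(sched_mode), preemption disabled, rq->lock held: */
	if (sched_mode & SM_USER) {
		/* Only ever written by the owning CPU, under rq->lock */
		prev->in_return_to_user = true;
		sched_mode &= ~SM_USER;	/* don't defeat SM_MASK_PREEMPT */
	}

	/* ...and finish_lock_switch() clears the flag when the task is
	 * switched back in:
	 */
	current->in_return_to_user = false;

Readers such as is_kernel_task() then only need to hold the relevant rq
lock for the plain int to be stable.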
Suggested-by: Benjamin Segall
Signed-off-by: Valentin Schneider
---
 include/linux/sched.h |  2 +-
 kernel/sched/core.c   | 43 ++++++++++++++++++++++++++++++-----------
 kernel/sched/fair.c   | 17 ++++++++++++++++-
 3 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4a0105d1eaa21..1b6f17b2150a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1544,7 +1544,7 @@ struct task_struct {
 #endif

 #ifdef CONFIG_CFS_BANDWIDTH
-	atomic_t		in_return_to_user;
+	int			in_return_to_user;
 #endif
 	/*
 	 * New fields for task_struct should be added above here, so that
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7c028fad5a89..54e6690626b13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4531,7 +4531,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	INIT_LIST_HEAD(&p->se.kernel_node);
-	atomic_set(&p->in_return_to_user, 0);
+	p->in_return_to_user = false;
 #endif

 #ifdef CONFIG_SCHEDSTATS
@@ -5147,6 +5147,9 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf

 static inline void finish_lock_switch(struct rq *rq)
 {
+#ifdef CONFIG_CFS_BANDWIDTH
+	current->in_return_to_user = false;
+#endif
 	/*
 	 * If we are tracking spinlock dependencies then we have to
 	 * fix up the runqueue lock - which gets 'carried over' from
@@ -6562,6 +6565,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 #define SM_PREEMPT		0x1
 #define SM_RTLOCK_WAIT		0x2

+/*
+ * Special case for CFS_BANDWIDTH where we need to know if the call to
+ * __schedule() is directly preceding an entry into userspace.
+ * It is removed from the mode argument as soon as it is used to not go against
+ * the SM_MASK_PREEMPT optimisation below.
+ */
+#ifdef CONFIG_CFS_BANDWIDTH
+# define SM_USER		0x4
+#else
+# define SM_USER		SM_NONE
+#endif
+
 #ifndef CONFIG_PREEMPT_RT
 # define SM_MASK_PREEMPT	(~0U)
 #else
@@ -6646,6 +6661,14 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	rq_lock(rq, &rf);
 	smp_mb__after_spinlock();

+#ifdef CONFIG_CFS_BANDWIDTH
+	if (sched_mode & SM_USER) {
+		prev->in_return_to_user = true;
+		sched_mode &= ~SM_USER;
+	}
+#endif
+	SCHED_WARN_ON(sched_mode & SM_USER);
+
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;
 	update_rq_clock(rq);
@@ -6807,7 +6830,7 @@ static __always_inline void __schedule_loop(unsigned int sched_mode)
 	} while (need_resched());
 }

-asmlinkage __visible void __sched schedule(void)
+static __always_inline void schedule_with_mode(unsigned int sched_mode)
 {
 	struct task_struct *tsk = current;

@@ -6817,22 +6840,20 @@ asmlinkage __visible void __sched schedule(void)
 	if (!task_is_running(tsk))
 		sched_submit_work(tsk);

-	__schedule_loop(SM_NONE);
+	__schedule_loop(sched_mode);
 	sched_update_worker(tsk);
 }
+
+asmlinkage __visible void __sched schedule(void)
+{
+	schedule_with_mode(SM_NONE);
+}
 EXPORT_SYMBOL(schedule);

 asmlinkage __visible void __sched schedule_usermode(void)
 {
 #ifdef CONFIG_CFS_BANDWIDTH
-	/*
-	 * This is only atomic because of this simple implementation. We could
-	 * do something with an SM_USER to avoid other-cpu scheduler operations
-	 * racing against these writes.
-	 */
-	atomic_set(&current->in_return_to_user, true);
-	schedule();
-	atomic_set(&current->in_return_to_user, false);
+	schedule_with_mode(SM_USER);
 #else
 	schedule();
 #endif
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1808459a5acc..96504be6ee14a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6631,7 +6631,22 @@ static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int c

 static bool is_kernel_task(struct task_struct *p)
 {
-	return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_return_to_user);
+	/*
+	 * The flag is updated within __schedule() with preemption disabled,
+	 * under the rq lock, and only when the task is current.
+	 *
+	 * Holding the rq lock for that task's CPU is thus sufficient for the
+	 * value to be stable, if the task is enqueued.
+	 *
+	 * If the task is dequeued, then task_cpu(p) *can* change, but this
+	 * so far only happens in enqueue_task_fair() which means either:
+	 * - the task is being activated, its CPU has been set previously in ttwu()
+	 * - the task is going through a "change" cycle (e.g. sched_move_task()),
+	 *   the pi_lock is also held so the CPU is stable.
+	 */
+	lockdep_assert_rq_held(cpu_rq(task_cpu(p)));
+
+	return sysctl_sched_cfs_bandwidth_kernel_bypass && !p->in_return_to_user;
 }

 /*

From patchwork Fri Feb 2 08:09:18 2024
X-Patchwork-Submitter: Valentin Schneider
X-Patchwork-Id: 195692
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Phil Auld, Clark Williams, Tomas Glozar
Subject: [RFC PATCH v2 3/5] sched/fair: Delete cfs_rq_throttled_loose(), use
 cfs_rq->throttle_pending instead
Date: Fri, 2 Feb 2024 09:09:18 +0100
Message-ID: <20240202080920.3337862-4-vschneid@redhat.com>
In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com>
References: <20240202080920.3337862-1-vschneid@redhat.com>

cfs_rq_throttled_loose() does not check if there is runtime remaining in
the cfs_b, and thus relies on check_cfs_rq_runtime() having been run
previously for that to be checked.

Cache the throttle attempt in throttle_cfs_rq and reuse that where
needed.

Signed-off-by: Valentin Schneider
---
 kernel/sched/fair.c | 44 ++++++++++----------------------------------
 1 file changed, 10 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96504be6ee14a..60778afbff207 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * 5) do not run the "skip" process, if something else is available
  */
 static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
+pick_next_entity(struct cfs_rq *cfs_rq)
 {
 #ifdef CONFIG_CFS_BANDWIDTH
 	/*
@@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
 	 * throttle_cfs_rq.
 	 */
 	WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
-	if (throttled && !list_empty(&cfs_rq->kernel_children)) {
+	if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) {
 		/*
 		 * TODO: you'd want to factor out pick_eevdf to just take
 		 * tasks_timeline, and replace this list with a second rbtree
@@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * We don't actually throttle, though account() will have made sure to
 	 * resched us so that we pick into a kernel task.
 	 */
-	if (cfs_rq->h_kernel_running)
+	if (cfs_rq->h_kernel_running) {
+		cfs_rq->throttle_pending = true;
 		return false;
+	}
+
+	cfs_rq->throttle_pending = false;

 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -6666,20 +6670,6 @@ static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int c
 	cfs_rq->h_kernel_running -= count;
 }

-/*
- * Returns if the cfs_rq "should" be throttled but might not be because of
- * kernel threads bypassing throttle.
- */
-static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq)
-{
-	if (!cfs_bandwidth_used())
-		return false;
-
-	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return false;
-	return true;
-}
-
 static void unthrottle_on_enqueue(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
@@ -8546,7 +8536,6 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
-	bool throttled = false;

 again:
 	cfs_rq = &rq->cfs;
@@ -8567,10 +8556,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 			goto again;
 		}

-		if (cfs_rq_throttled_loose(cfs_rq))
-			throttled = true;
-
-		se = pick_next_entity(cfs_rq, throttled);
+		se = pick_next_entity(cfs_rq);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);

@@ -8585,7 +8571,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	struct sched_entity *se;
 	struct task_struct *p;
 	int new_tasks;
-	bool throttled;

 	/*
 	 * We want to handle this before check_cfs_runtime(prev). We'll
@@ -8609,8 +8594,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	 * Therefore attempt to avoid putting and setting the entire cgroup
 	 * hierarchy, only change the part that actually changes.
 	 */
-
-	throttled = false;
 	do {
 		struct sched_entity *curr = cfs_rq->curr;

@@ -8641,11 +8624,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 				goto simple;
 			}
 		}
-
-		if (cfs_rq_throttled_loose(cfs_rq))
-			throttled = true;
-
-		se = pick_next_entity(cfs_rq, throttled);
+		se = pick_next_entity(cfs_rq);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);

@@ -8683,11 +8662,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	if (prev)
 		put_prev_task(rq, prev);

-	throttled = false;
 	do {
-		if (cfs_rq_throttled_loose(cfs_rq))
-			throttled = true;
-		se = pick_next_entity(cfs_rq, throttled);
+		se = pick_next_entity(cfs_rq);
 		set_next_entity(cfs_rq, se);
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
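In short, throttle_pending caches the outcome of the last throttle
attempt so the pick path no longer recomputes it; a condensed sketch of
the resulting state machine, pieced together from the hunks above:

	/* throttle_cfs_rq(): runtime has run out */
	if (cfs_rq->h_kernel_running) {
		cfs_rq->throttle_pending = true;  /* kernel tasks still pickable */
		return false;                     /* defer the actual throttle */
	}
	cfs_rq->throttle_pending = false;         /* really throttling now */

	/* pick_next_entity(): the cached flag replaces cfs_rq_throttled_loose() */
	if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children))
		return list_first_entry(&cfs_rq->kernel_children,
					struct sched_entity, kernel_node);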
From patchwork Fri Feb 2 08:09:19 2024
X-Patchwork-Submitter: Valentin Schneider
X-Patchwork-Id: 195688
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Phil Auld, Clark Williams, Tomas Glozar
Subject: [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace
Date: Fri, 2 Feb 2024 09:09:19 +0100
Message-ID: <20240202080920.3337862-5-vschneid@redhat.com>
In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com>
References: <20240202080920.3337862-1-vschneid@redhat.com>

While having a second tree to pick from solves the throttling aspect of
things, it also requires modifying the task counts at the cfs_rq level.
h_nr_running is used throughout load_balance() and needs to accurately
reflect the number of pickable tasks: a cfs_rq with .throttle_pending=1 may
have many tasks in userspace (thus effectively throttled), and this "excess"
of tasks shouldn't cause find_busiest_group() / find_busiest_queue() to pick
that cfs_rq's CPU to pull load from when other CPUs have more pickable tasks.

The approach taken here is to track both the count of tasks in kernelspace
and the count of tasks in userspace (technically, tasks just about to enter
userspace). When a cfs_rq runs out of runtime, it gets marked as
.throttle_pending=1. From that point on, only tasks executing in kernelspace
are pickable, and this is reflected up the hierarchy by removing that
cfs_rq's .h_user_running from its parents' .h_nr_running.

To aid in validating the proper behaviour of the implementation, the
following invariants are asserted:
o For any cfs_rq with .throttle_pending == 0:
    .h_kernel_running + .h_user_running == .h_nr_running
o For any cfs_rq with .throttle_pending == 1:
    .h_kernel_running == .h_nr_running

This means .h_user_running also needs to be updated as cfs_rq's become
.throttle_pending=1. When a cfs_rq becomes .throttle_pending=1, its own
.h_user_running remains untouched, but it is subtracted from its parents'
.h_user_running. Another way to look at it is that the .h_user_running is
"stored" at the level of the .throttle_pending cfs_rq, and restored to the
upper part of the hierarchy at unthrottle.
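As a quick illustration, the two invariants above can be folded into a
single predicate over the per-cfs_rq counts. The following is a standalone
sketch in plain C; the struct is a made-up stand-in for the few cfs_rq
fields involved here, not the kernel's struct cfs_rq:

  #include <assert.h>
  #include <stdbool.h>

  /* Hypothetical mirror of the relevant cfs_rq fields (illustrative only) */
  struct counts {
          int h_nr_running;       /* pickable tasks in the hierarchy */
          int h_kernel_running;   /* tasks executing in kernelspace */
          int h_user_running;     /* tasks (about to be) in userspace */
          bool throttle_pending;
  };

  static bool counts_consistent(const struct counts *c)
  {
          if (c->throttle_pending)
                  /* Only kernel tasks remain pickable */
                  return c->h_kernel_running == c->h_nr_running;
          /* Normal operation: every enqueued task is pickable */
          return c->h_kernel_running + c->h_user_running == c->h_nr_running;
  }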
An overview of the count logic is:

Consider:
  cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
  cfs_rq.user   := count of user *tasks* enqueued on this cfs_rq

Then, the following logic is implemented:
  cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
  cfs_rq.h_user_running   = Sum(child.user) for all child cfs_rq with !child.throttle_pending
  cfs_rq.h_nr_running     = Sum(child.kernel) for all child cfs_rq
                          + Sum(child.user) for all child cfs_rq with !child.throttle_pending

An application of that logic to an A/B/C cgroup hierarchy:

Initial condition, no throttling:

  +------+ .h_kernel_running = C.kernel + B.kernel + A.kernel
A |cfs_rq| .h_user_running   = C.user + B.user + A.user
  +------+ .h_nr_running     = C.{kernel+user} + B.{kernel+user} + A.{kernel+user}
     ^      .throttle_pending = 0
     |
     | parent
     |
  +------+ .h_kernel_running = C.kernel + B.kernel
B |cfs_rq| .h_user_running   = C.user + B.user
  +------+ .h_nr_running     = C.{kernel+user} + B.{kernel+user}
     ^      .throttle_pending = 0
     |
     | parent
     |
  +------+ .h_kernel_running = C.kernel
C |cfs_rq| .h_user_running   = C.user
  +------+ .h_nr_running     = C.{kernel+user}
            .throttle_pending = 0

C becomes .throttle_pending:

  +------+ .h_kernel_running = C.kernel + B.kernel + A.kernel <- Untouched
A |cfs_rq| .h_user_running   = B.user + A.user                <- Decremented by C.user
  +------+ .h_nr_running     = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
     ^      .throttle_pending = 0
     |
     | parent
     |
  +------+ .h_kernel_running = C.kernel + B.kernel <- Untouched
B |cfs_rq| .h_user_running   = B.user              <- Decremented by C.user
  +------+ .h_nr_running     = C.kernel + B.{kernel+user} <- Decremented by C.user
     ^      .throttle_pending = 0
     |
     | parent
     |
  +------+ .h_kernel_running = C.kernel
C |cfs_rq| .h_user_running   = C.user   <- Untouched, the count is "stored" at this level
  +------+ .h_nr_running     = C.kernel <- Decremented by C.user
            .throttle_pending = 1

C becomes throttled:

  +------+ .h_kernel_running = B.kernel + A.kernel <- Decremented by C.kernel
A |cfs_rq| .h_user_running   = B.user + A.user
  +------+ .h_nr_running     = B.{kernel+user} + A.{kernel+user} <- Decremented by C.kernel
     ^      .throttle_pending = 0
     |
     | parent
     |
  +------+ .h_kernel_running = B.kernel <- Decremented by C.kernel
B |cfs_rq| .h_user_running   = B.user
  +------+ .h_nr_running     = B.{kernel+user} <- Decremented by C.kernel
     ^      .throttle_pending = 0
     |
     | parent
     |
  +------+ .h_kernel_running = C.kernel
C |cfs_rq| .h_user_running   = C.user
  +------+ .h_nr_running     = C.{kernel+user} <- Incremented by C.user
            .throttle_pending = 0

Could we get away with just one count, e.g. the user count and not the
kernel count? Technically yes, we could follow this scheme:

  if (throttle_pending) => kernel count := h_nr_running
  else                  => kernel count := h_nr_running - h_user_running

This however prevents any sort of assertion or sanity checking on the
counts, which I am not the biggest fan of - CFS group scheduling is enough
of a headache as it is.
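To make the walkthrough above concrete, here is a small userspace toy model
of the same three-level hierarchy. All names and numbers are made up for
illustration; only the count manipulation is modelled, none of the actual
scheduler machinery:

  #include <assert.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* Toy model of the A/B/C walkthrough: index 0 is C (leaf), 2 is A (root) */
  struct toy_rq {
          int h_nr_running, h_kernel_running, h_user_running;
          bool throttle_pending;
  };

  /*
   * Mark the leaf throttle_pending and discount its user tasks, mirroring
   * what the changelog describes for throttle_cfs_rq().
   */
  static void mark_throttle_pending(struct toy_rq *h, int levels)
  {
          int user = h[0].h_user_running;

          h[0].throttle_pending = true;
          h[0].h_nr_running -= user;         /* leaf: user count "stored" here */
          for (int i = 1; i < levels; i++) { /* parents lose both counts */
                  h[i].h_nr_running -= user;
                  h[i].h_user_running -= user;
          }
  }

  int main(void)
  {
          /* C has 2 kernel + 3 user tasks; B adds 1 user task; A adds 1 kernel task */
          struct toy_rq h[3] = {
                  { .h_nr_running = 5, .h_kernel_running = 2, .h_user_running = 3 },
                  { .h_nr_running = 6, .h_kernel_running = 2, .h_user_running = 4 },
                  { .h_nr_running = 7, .h_kernel_running = 3, .h_user_running = 4 },
          };

          mark_throttle_pending(h, 3);

          /* The changelog invariants hold at every level */
          for (int i = 0; i < 3; i++) {
                  if (h[i].throttle_pending)
                          assert(h[i].h_kernel_running == h[i].h_nr_running);
                  else
                          assert(h[i].h_kernel_running + h[i].h_user_running ==
                                 h[i].h_nr_running);
          }
          printf("invariants hold\n");
          return 0;
  }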
Signed-off-by: Valentin Schneider
---
 kernel/sched/fair.c  | 174 ++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |   2 +
 2 files changed, 151 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60778afbff207..2b54d3813d18d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5785,17 +5785,48 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
+	long task_delta, idle_task_delta, kernel_delta, user_delta, dequeue = 1;
+	bool was_pending;

 	/*
-	 * We don't actually throttle, though account() will have made sure to
-	 * resched us so that we pick into a kernel task.
+	 * We don't actually throttle just yet, though account_cfs_rq_runtime()
+	 * will have made sure to resched us so that we pick into a kernel task.
 	 */
 	if (cfs_rq->h_kernel_running) {
+		if (cfs_rq->throttle_pending)
+			return false;
+
+		/*
+		 * From now on we're only going to pick tasks that are in the
+		 * second tree. Reflect this by discounting tasks that aren't going
+		 * to be pickable from the ->h_nr_running counts.
+		 */
 		cfs_rq->throttle_pending = true;
+
+		se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+		user_delta = cfs_rq->h_user_running;
+		cfs_rq->h_nr_running -= user_delta;
+
+		for_each_sched_entity(se) {
+			struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+
+			if (!se->on_rq)
+				goto done;
+
+			qcfs_rq->h_nr_running -= user_delta;
+			qcfs_rq->h_user_running -= user_delta;
+
+			assert_cfs_rq_counts(qcfs_rq);
+		}
 		return false;
 	}

+	/*
+	 * Unlikely as it may be, we may only have user tasks as we hit the
+	 * throttle, in which case we won't have discounted them from
+	 * h_nr_running, and we need to be aware of that.
+	 */
+	was_pending = cfs_rq->throttle_pending;
 	cfs_rq->throttle_pending = false;

 	raw_spin_lock(&cfs_b->lock);
@@ -5826,9 +5857,27 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();

-	task_delta = cfs_rq->h_nr_running;
+	/*
+	 * At this point, h_nr_running == h_kernel_running. We add back the
+	 * h_user_running to the throttled cfs_rq, and only remove the difference
+	 * from the upper cfs_rq's.
+	 */
+	if (was_pending) {
+		WARN_ON_ONCE(cfs_rq->h_nr_running != cfs_rq->h_kernel_running);
+		cfs_rq->h_nr_running += cfs_rq->h_user_running;
+	} else {
+		WARN_ON_ONCE(cfs_rq->h_nr_running != cfs_rq->h_user_running);
+	}
+
+	/*
+	 * We always discount user tasks from h_nr_running when throttle_pending,
+	 * so only h_kernel_running remains to be removed
+	 */
+	task_delta = was_pending ? cfs_rq->h_kernel_running : cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
 	kernel_delta = cfs_rq->h_kernel_running;
+	user_delta = was_pending ? 0 : cfs_rq->h_user_running;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 		/* throttled entity or throttle-on-deactivate */
@@ -5843,6 +5892,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 		dequeue_kernel(qcfs_rq, se, kernel_delta);
+		qcfs_rq->h_user_running -= user_delta;
+
 		if (qcfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5866,6 +5917,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 		dequeue_kernel(qcfs_rq, se, kernel_delta);
+		qcfs_rq->h_user_running -= user_delta;
 	}

 	/* At this point se is NULL and we are at root level */
@@ -5888,7 +5940,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
-	long task_delta, idle_task_delta, kernel_delta;
+	long task_delta, idle_task_delta, kernel_delta, user_delta;

 	se = cfs_rq->tg->se[cpu_of(rq)];
@@ -5924,6 +5976,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
 	kernel_delta = cfs_rq->h_kernel_running;
+	user_delta = cfs_rq->h_user_running;

 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
@@ -5937,6 +5990,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
 		enqueue_kernel(qcfs_rq, se, kernel_delta);
+		qcfs_rq->h_user_running += user_delta;
+
+		assert_cfs_rq_counts(qcfs_rq);

 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -5955,6 +6011,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->h_nr_running += task_delta;
 		qcfs_rq->idle_h_nr_running += idle_task_delta;
 		enqueue_kernel(qcfs_rq, se, kernel_delta);
+		qcfs_rq->h_user_running += user_delta;

 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
@@ -6855,6 +6912,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int idle_h_nr_running = task_has_idle_policy(p);
 	int task_new = !(flags & ENQUEUE_WAKEUP);
 	bool kernel_task = is_kernel_task(p);
+	bool throttle_pending = false;

 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -6878,13 +6936,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);

-		cfs_rq->h_nr_running++;
-		cfs_rq->idle_h_nr_running += idle_h_nr_running;
-		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+			cfs_rq->h_nr_running++;

 		if (kernel_task)
 			enqueue_kernel(cfs_rq, se, 1);
+		else if (!throttle_pending)
+			cfs_rq->h_user_running++;
+
+		throttle_pending |= cfs_rq->throttle_pending;
+
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6900,13 +6965,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		se_update_runnable(se);
 		update_cfs_group(se);

-		cfs_rq->h_nr_running++;
-		cfs_rq->idle_h_nr_running += idle_h_nr_running;
-		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+			cfs_rq->h_nr_running++;

 		if (kernel_task)
 			enqueue_kernel(cfs_rq, se, 1);
+		else if (!throttle_pending)
+			cfs_rq->h_user_running++;
+
+		throttle_pending |= cfs_rq->throttle_pending;
+
+		cfs_rq->idle_h_nr_running += idle_h_nr_running;
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6957,6 +7029,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int idle_h_nr_running = task_has_idle_policy(p);
 	bool was_sched_idle = sched_idle_rq(rq);
 	bool kernel_task = !list_empty(&p->se.kernel_node);
+	bool throttle_pending = false;

 	util_est_dequeue(&rq->cfs, p);

@@ -6964,13 +7037,20 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);

-		cfs_rq->h_nr_running--;
-		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
-		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+			cfs_rq->h_nr_running--;

 		if (kernel_task)
 			dequeue_kernel(cfs_rq, se, 1);
+		else if (!throttle_pending)
+			cfs_rq->h_user_running--;
+
+		throttle_pending |= cfs_rq->throttle_pending;
+
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6998,13 +7078,20 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		se_update_runnable(se);
 		update_cfs_group(se);

-		cfs_rq->h_nr_running--;
-		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
-		if (cfs_rq_is_idle(cfs_rq))
-			idle_h_nr_running = 1;
+		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
+			cfs_rq->h_nr_running--;

 		if (kernel_task)
 			dequeue_kernel(cfs_rq, se, 1);
+		else if (!throttle_pending)
+			cfs_rq->h_user_running--;
+
+		throttle_pending |= cfs_rq->throttle_pending;
+
+		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -8503,28 +8590,65 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 		resched_curr(rq);
 }

+/*
+ * Consider:
+ *   cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
+ *   cfs_rq.user   := count of user *tasks* enqueued on this cfs_rq
+ *
+ * Then, the following logic is implemented:
+ *   cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
+ *   cfs_rq.h_user_running   = Sum(child.user) for all child cfs_rq with !child.throttle_pending
+ *   cfs_rq.h_nr_running     = Sum(child.kernel) for all child cfs_rq
+ *                           + Sum(child.user) for all child cfs_rq with !child.throttle_pending
+ *
+ * IOW, count of kernel tasks is always propagated up the hierarchy, and count
+ * of user tasks is only propagated up if the cfs_rq isn't .throttle_pending.
+ */
 static void handle_kernel_task_prev(struct task_struct *prev)
 {
 #ifdef CONFIG_CFS_BANDWIDTH
 	struct sched_entity *se = &prev->se;
 	bool p_in_kernel = is_kernel_task(prev);
 	bool p_in_kernel_tree = !list_empty(&se->kernel_node);
+	bool throttle_pending = false;

 	/*
 	 * These extra loops are bad and against the whole point of the merged
 	 * PNT, but it's a pain to merge, particularly since we want it to occur
 	 * before check_cfs_runtime().
 	 */
 	if (p_in_kernel_tree && !p_in_kernel) {
+		/* Switch from KERNEL -> USER */
 		WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
+
 		for_each_sched_entity(se) {
-			dequeue_kernel(cfs_rq_of(se), se, 1);
-			if (cfs_rq_throttled(cfs_rq_of(se)))
+			struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+			if (throttle_pending || cfs_rq->throttle_pending)
+				cfs_rq->h_nr_running--;
+			dequeue_kernel(cfs_rq, se, 1);
+			if (!throttle_pending)
+				cfs_rq->h_user_running++;
+
+			throttle_pending |= cfs_rq->throttle_pending;
+
+			if (cfs_rq_throttled(cfs_rq))
 				break;
 		}
 	} else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) {
+		/* Switch from USER -> KERNEL */
+
 		for_each_sched_entity(se) {
-			enqueue_kernel(cfs_rq_of(se), se, 1);
-			if (cfs_rq_throttled(cfs_rq_of(se)))
+			struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+			if (throttle_pending || cfs_rq->throttle_pending)
+				cfs_rq->h_nr_running++;
+			enqueue_kernel(cfs_rq, se, 1);
+			if (!throttle_pending)
+				cfs_rq->h_user_running--;
+
+			throttle_pending |= cfs_rq->throttle_pending;
+
+			if (cfs_rq_throttled(cfs_rq))
 				break;
 		}
 	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b33ce2e60555..e8860e0d6fbc7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -660,6 +660,8 @@ struct cfs_rq {
 	int			throttled;
 	int			throttle_count;
 	int			h_kernel_running;
+	int			h_user_running;
+	int			throttle_pending;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
 	struct list_head	kernel_children;

From patchwork Fri Feb 2 08:09:20 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Valentin Schneider
X-Patchwork-Id: 195687
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Phil Auld, Clark Williams, Tomas Glozar
Subject: [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants
Date: Fri, 2 Feb 2024 09:09:20 +0100
Message-ID: <20240202080920.3337862-6-vschneid@redhat.com>
In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com>
References: <20240202080920.3337862-1-vschneid@redhat.com>

Previous commits have added .h_kernel_running and .h_user_running to
struct cfs_rq, and are using them to play games with the hierarchical
h_nr_running. Assert some count invariants under SCHED_DEBUG to improve
debugging.
Signed-off-by: Valentin Schneider
---
 kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b54d3813d18d..52d0ee0e4d47c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5780,6 +5780,30 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
 static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);

+#ifdef CONFIG_CFS_BANDWIDTH
+static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq)
+{
+	lockdep_assert_rq_held(rq_of(cfs_rq));
+
+	/*
+	 * When !throttle_pending, this is the normal operating mode, all tasks
+	 * are pickable, so:
+	 *   nr_kernel_tasks + nr_user_tasks == nr_pickable_tasks
+	 */
+	SCHED_WARN_ON(!cfs_rq->throttle_pending &&
+		      (cfs_rq->h_kernel_running + cfs_rq->h_user_running !=
+		       cfs_rq->h_nr_running));
+	/*
+	 * When throttle_pending, only kernel tasks are pickable, so:
+	 *   nr_kernel_tasks == nr_pickable_tasks
+	 */
+	SCHED_WARN_ON(cfs_rq->throttle_pending &&
+		      (cfs_rq->h_kernel_running != cfs_rq->h_nr_running));
+}
+#else
+static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq) { }
+#endif
+
 static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -5894,6 +5918,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		dequeue_kernel(qcfs_rq, se, kernel_delta);
 		qcfs_rq->h_user_running -= user_delta;

+		assert_cfs_rq_counts(qcfs_rq);
 		if (qcfs_rq->load.weight) {
 			/* Avoid re-evaluating load for this entity: */
@@ -5918,6 +5943,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 		dequeue_kernel(qcfs_rq, se, kernel_delta);
 		qcfs_rq->h_user_running -= user_delta;
+
+		assert_cfs_rq_counts(qcfs_rq);
 	}

 	/* At this point se is NULL and we are at root level */
@@ -6013,6 +6040,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		enqueue_kernel(qcfs_rq, se, kernel_delta);
 		qcfs_rq->h_user_running += user_delta;

+		assert_cfs_rq_counts(qcfs_rq);
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(qcfs_rq))
 			goto unthrottle_throttle;
@@ -6950,6 +6979,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;

+		assert_cfs_rq_counts(cfs_rq);
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -6965,6 +6995,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		se_update_runnable(se);
 		update_cfs_group(se);

+		assert_cfs_rq_counts(cfs_rq);
 		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
 			cfs_rq->h_nr_running++;
@@ -6979,6 +7010,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;

+		assert_cfs_rq_counts(cfs_rq);
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -7051,6 +7083,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;

+		assert_cfs_rq_counts(cfs_rq);
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -7092,6 +7125,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			idle_h_nr_running = 1;

+		assert_cfs_rq_counts(cfs_rq);
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
@@ -8631,6 +8665,8 @@ static void handle_kernel_task_prev(struct task_struct *prev)
 			throttle_pending |= cfs_rq->throttle_pending;

+			assert_cfs_rq_counts(cfs_rq);
+
 			if (cfs_rq_throttled(cfs_rq))
 				break;
 		}
@@ -8648,6 +8684,8 @@ static void handle_kernel_task_prev(struct task_struct *prev)
 			throttle_pending |= cfs_rq->throttle_pending;

+			assert_cfs_rq_counts(cfs_rq);
+
 			if (cfs_rq_throttled(cfs_rq))
 				break;
 		}