From patchwork Thu Nov 30 16:12:43 2023
X-Patchwork-Submitter: Valentin Schneider
X-Patchwork-Id: 171974
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Phil Auld, Clark Williams, Tomas Glozar
Subject: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace
Date: Thu, 30 Nov 2023 17:12:43 +0100
Message-ID: <20231130161245.3894682-2-vschneid@redhat.com>
In-Reply-To: <20231130161245.3894682-1-vschneid@redhat.com>
References: <20231130161245.3894682-1-vschneid@redhat.com>

As reported in [1], CFS bandwidth throttling is a source of headaches in
PREEMPT_RT - generally speaking, a throttled CFS task can hold locks that
prevent ksoftirqd from running, which prevents replenishing & unthrottling
the cfs_rq of said CFS task.

Peter mentioned that there have been discussions on changing /when/ the
throttling happens: rather than have it be done immediately upon updating
the runtime statistics and realizing the cfs_rq has depleted its quota, we
wait for the task to be about to return to userspace.

I'm not aware of the arguments in favour of this for !PREEMPT_RT, but given
[1] I had a look into it. Using this approach means no task can be
throttled while holding a kernel lock, which solves the locking dependency
issue.

Concept & headaches
===================

The original idea was:
o cfs_rq runs out of runtime, flag it
o Take the tasks out of the cfs_rq one by one as they exit the kernel
o Dequeue the cfs_rq once it is empty of tasks (as if doing "normal"
  throttling)
o Put the tasks back onto it
o At unthrottle, simply re-enqueue the cfs_rq sched_entity and all is well

Cgroup migration kinda breaks all of that. Consider a task moving to a
now-fully-throttled cfs_rq: if we just enqueue it onto the throttled se,
then it's the same as throttling it, and if it is holding a problematic
lock then we're back to square one. This means we can never fully throttle
a cfs_rq, which is making me not so confident in this approach.

Should/can we consider deferring such migrations to the unthrottle (with
something like in affine_move_task())? I think failing the migration is not
an option due to e.g. autogroups, and IIUC returning from writing to
cgroup/tasks means the migration has happened (like updating a task's
affinity), which means this would need to be a blocking wait.

CPU migrations within the same taskgroup are similarly problematic (e.g.
moving from a cfs_rq that still has a bit of ->runtime left to one that has
run out).

Throttle clocks are also kinda busted in this approach, but they aren't my
greatest concern at this point :-)

Implementation
==============

cfs_rq's that deplete their ->runtime get flagged for later throttling via
->in_throttle_limbo, and this flag is propagated down the hierarchy.

Tasks picked for execution from a cfs_rq with ->in_throttle_limbo set get a
task_work added, which lets the actual throttling happen in
exit_to_user_mode().

Upon unthrottling, tasks are enqueued back onto their respective cfs_rq.
Unlike the previous throttling implementation, cfs_rq's can be unthrottled
while in a half-throttled state (i.e. some tasks have been removed from
them while others are still enqueued and runnable), so the unthrottling
process is a bit more involved.
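For illustration only, a minimal sketch of that deferral pattern built on
the generic task_work API; the helpers defer_throttle_to_user() and
throttle_work_fn() are hypothetical and this is not the patch code (the
patch embeds the callback_head in task_struct as ->sched_throttle_work and
queues it from pick_next_task_fair() / throttle_cfs_rq()):

#include <linux/sched.h>
#include <linux/task_work.h>

/* Runs from task_work_run() on the task's way back to userspace. */
static void throttle_work_fn(struct callback_head *work)
{
	struct task_struct *p =
		container_of(work, struct task_struct, sched_throttle_work);

	/*
	 * The task is about to return to userspace, so it holds no kernel
	 * locks: parking it on a throttled-limbo list here cannot create
	 * the ksoftirqd lock dependency described above.
	 */
}

/* Called when a task is picked while its cfs_rq is in throttle limbo. */
static void defer_throttle_to_user(struct task_struct *p)
{
	init_task_work(&p->sched_throttle_work, throttle_work_fn);
	/* TWA_RESUME: run the callback before the task re-enters userspace. */
	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}

The actual patch additionally marks "work not queued" with a self-pointing
->next (see init_cfs_throttle_work()) so the work is never added twice.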
[1]: https://lore.kernel.org/all/20231031160120.GE15024@noisy.programming.kicks-ass.net/ Suggested-by: Peter Zijlstra Signed-off-by: Valentin Schneider --- include/linux/sched.h | 2 + kernel/sched/core.c | 6 +- kernel/sched/debug.c | 4 +- kernel/sched/fair.c | 257 +++++++++++++++++++++++++++++++----------- kernel/sched/sched.h | 7 ++ 5 files changed, 206 insertions(+), 70 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index e4235bbfad776..20194e4c8b0b5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -805,6 +805,8 @@ struct task_struct { #ifdef CONFIG_CGROUP_SCHED struct task_group *sched_task_group; + struct callback_head sched_throttle_work; + struct list_head throttle_node; #endif #ifdef CONFIG_UCLAMP_TASK diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7a0c16115b79f..a63152072554b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4504,6 +4504,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->se.cfs_rq = NULL; #endif +#ifdef CONFIG_CFS_BANDWIDTH + init_cfs_throttle_work(p); +#endif + #ifdef CONFIG_SCHEDSTATS /* Even if schedstat is disabled, there should not be garbage */ memset(&p->stats, 0, sizeof(p->stats)); @@ -10823,7 +10827,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota, cfs_rq->runtime_enabled = runtime_enabled; cfs_rq->runtime_remaining = 0; - if (cfs_rq->throttled) + if (cfs_rq->in_throttle_limbo) unthrottle_cfs_rq(cfs_rq); } diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 4580a450700ec..8867904cf7eee 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -695,8 +695,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) #endif #endif #ifdef CONFIG_CFS_BANDWIDTH - SEQ_printf(m, " .%-30s: %d\n", "throttled", - cfs_rq->throttled); + SEQ_printf(m, " .%-30s: %d\n", "in_throttle_limbo", + cfs_rq->in_throttle_limbo); SEQ_printf(m, " .%-30s: %d\n", "throttle_count", cfs_rq->throttle_count); #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8767988242ee3..27aee13e7ccd9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5497,7 +5497,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) if (likely(cfs_rq->runtime_remaining > 0)) return; - if (cfs_rq->throttled) + if (cfs_rq->throttled || cfs_rq->in_throttle_limbo) return; /* * if we're unable to extend our runtime we resched so that the active @@ -5548,7 +5548,61 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) { struct rq *rq = data; struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; + struct sched_entity *se = tg->se[cpu_of(rq)]; + struct cfs_rq *pcfs_rq = cfs_rq_of(se); + long task_delta = 0, idle_task_delta = 0; + struct task_struct *p, *tmp; + /* + * Re-enqueue the tasks that have been throttled at this level. + * + * The task count is up-propagated via ->unthrottled_*h_nr_running, + * as we can't purely rely on h_nr_running post-enqueue: the unthrottle + * might happen when a cfs_rq still has some tasks enqueued, either still + * making their way to userspace, or freshly migrated to it. + */ + list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) { + struct sched_entity *pse = &p->se; + + list_del_init(&p->throttle_node); + enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP); + task_delta++; + idle_task_delta += task_has_idle_policy(p); + } + + /* + * Account tasks woken up in children; by this point all direct children + * have been visited. 
+ */ + task_delta += cfs_rq->unthrottled_h_nr_running; + idle_task_delta += cfs_rq->unthrottled_idle_h_nr_running; + + /* + * unthrottle_cfs_rq() needs a value to up-propagate above the + * freshly unthrottled cfs_rq. + */ + cfs_rq->unthrottled_h_nr_running = task_delta; + cfs_rq->unthrottled_idle_h_nr_running = idle_task_delta; + + cfs_rq->h_nr_running += task_delta; + cfs_rq->idle_h_nr_running += idle_task_delta; + + /* Accumulate the delta in the parent's stash. Once all its children + * (i.e. all of this cfs_rq's siblings) have been visited, this value + * will be stable and used for its own count update. + */ + pcfs_rq->unthrottled_h_nr_running += task_delta; + pcfs_rq->unthrottled_idle_h_nr_running += idle_task_delta; + + /* + * If the cfs_rq became empty during throttling, then we dequeued + * it. It needs to be put back in the hierarchy if it or any of + * its children have now-unthrottled tasks. + */ + if (!se->on_rq && (task_delta || idle_task_delta)) + enqueue_entity(pcfs_rq, se, ENQUEUE_WAKEUP); + + cfs_rq->in_throttle_limbo = false; cfs_rq->throttle_count--; if (!cfs_rq->throttle_count) { cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) - @@ -5573,6 +5627,17 @@ static int tg_unthrottle_up(struct task_group *tg, void *data) return 0; } +static int tg_unthrottle_clear_up(struct task_group *tg, void *data) +{ + struct rq *rq = data; + struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; + + cfs_rq->unthrottled_h_nr_running = 0; + cfs_rq->unthrottled_idle_h_nr_running = 0; + + return 0; +} + static int tg_throttle_down(struct task_group *tg, void *data) { struct rq *rq = data; @@ -5588,61 +5653,59 @@ static int tg_throttle_down(struct task_group *tg, void *data) cfs_rq->throttled_clock_self = rq_clock(rq); } cfs_rq->throttle_count++; + cfs_rq->in_throttle_limbo = true; return 0; } -static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) +static void throttle_cfs_rq_work(struct callback_head *work) { - struct rq *rq = rq_of(cfs_rq); - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); + struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work); struct sched_entity *se; - long task_delta, idle_task_delta, dequeue = 1; + struct rq *rq; + struct cfs_rq *qcfs_rq, *top_cfs_rq; + long task_delta, idle_task_delta; - raw_spin_lock(&cfs_b->lock); - /* This will start the period timer if necessary */ - if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)) { - /* - * We have raced with bandwidth becoming available, and if we - * actually throttled the timer might not unthrottle us for an - * entire period. We additionally needed to make sure that any - * subsequent check_cfs_rq_runtime calls agree not to throttle - * us, as we may commit to do cfs put_prev+pick_next, so we ask - * for 1ns of runtime rather than just check cfs_b. - */ - dequeue = 0; - } else { - list_add_tail_rcu(&cfs_rq->throttled_list, - &cfs_b->throttled_cfs_rq); - } - raw_spin_unlock(&cfs_b->lock); + WARN_ON_ONCE(p != current); - if (!dequeue) - return false; /* Throttle no longer required. */ + /* + * If task is exiting, then there won't be a return to userspace, so we + * don't have to bother with any of this. 
+ */ + if ((p->flags & PF_EXITING)) + return; - se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))]; + CLASS(task_rq_lock, rq_guard)(p); + rq = rq_guard.rq; + se = &p->se; + qcfs_rq = cfs_rq_of(se); - /* freeze hierarchy runnable averages while throttled */ - rcu_read_lock(); - walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq); - rcu_read_unlock(); + /* + * If not in limbo, then either replenish has happened or this task got + * migrated out of the throttled cfs_rq, move along + */ + if (!qcfs_rq->in_throttle_limbo) + goto done; + + update_rq_clock(rq); + + list_add(&p->throttle_node, &qcfs_rq->throttled_limbo_list); + task_delta = 1; + idle_task_delta = cfs_rq_is_idle(qcfs_rq) ? 1 : 0; - task_delta = cfs_rq->h_nr_running; - idle_task_delta = cfs_rq->idle_h_nr_running; for_each_sched_entity(se) { - struct cfs_rq *qcfs_rq = cfs_rq_of(se); - /* throttled entity or throttle-on-deactivate */ + qcfs_rq = cfs_rq_of(se); + if (!se->on_rq) goto done; dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP); - - if (cfs_rq_is_idle(group_cfs_rq(se))) - idle_task_delta = cfs_rq->h_nr_running; - qcfs_rq->h_nr_running -= task_delta; qcfs_rq->idle_h_nr_running -= idle_task_delta; + if (qcfs_rq->in_throttle_limbo) + top_cfs_rq = qcfs_rq; + if (qcfs_rq->load.weight) { /* Avoid re-evaluating load for this entity: */ se = parent_entity(se); @@ -5651,34 +5714,84 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) } for_each_sched_entity(se) { - struct cfs_rq *qcfs_rq = cfs_rq_of(se); + qcfs_rq = cfs_rq_of(se); /* throttled entity or throttle-on-deactivate */ if (!se->on_rq) - goto done; + goto throttle_done; update_load_avg(qcfs_rq, se, 0); se_update_runnable(se); - - if (cfs_rq_is_idle(group_cfs_rq(se))) - idle_task_delta = cfs_rq->h_nr_running; - qcfs_rq->h_nr_running -= task_delta; - qcfs_rq->idle_h_nr_running -= idle_task_delta; + qcfs_rq->h_nr_running -= idle_task_delta; } +throttle_done: /* At this point se is NULL and we are at root level*/ - sub_nr_running(rq, task_delta); - + sub_nr_running(rq, 1); done: + p->sched_throttle_work.next = &p->sched_throttle_work; /* - * Note: distribution will already see us throttled via the - * throttled-list. rq->lock protects completion. + * XXX: this assumes the highest in_throttle_limbo cfs_rq in the + * hierarchy is the one which got throttled, which might not always + * be the case. */ - cfs_rq->throttled = 1; - SCHED_WARN_ON(cfs_rq->throttled_clock); - if (cfs_rq->nr_running) - cfs_rq->throttled_clock = rq_clock(rq); - return true; + if (top_cfs_rq && !top_cfs_rq->throttled_clock) + top_cfs_rq->throttled_clock = rq_clock(rq); + + resched_curr(rq); +} + +void init_cfs_throttle_work(struct task_struct *p) +{ + /* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */ + p->sched_throttle_work.next = &p->sched_throttle_work; + init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work); +} + +static inline bool task_has_throttle_work(struct task_struct *p) +{ + return p->sched_throttle_work.next != &p->sched_throttle_work; +} + +static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) +{ + struct rq *rq = rq_of(cfs_rq); + struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); + long dequeue = 1; + + if (cfs_rq->in_throttle_limbo) + return false; + + raw_spin_lock(&cfs_b->lock); + /* This will start the period timer if necessary */ + if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)) { + /* + * We have raced with bandwidth becoming available, and if we + * actually throttled the timer might not unthrottle us for an + * entire period. 
We additionally needed to make sure that any + * subsequent check_cfs_rq_runtime calls agree not to throttle + * us, as we may commit to do cfs put_prev+pick_next, so we ask + * for 1ns of runtime rather than just check cfs_b. + */ + dequeue = 0; + } else { + list_add_tail_rcu(&cfs_rq->throttled_list, + &cfs_b->throttled_cfs_rq); + } + + raw_spin_unlock(&cfs_b->lock); + + if (!dequeue) + return false; /* Throttle no longer required. */ + + rcu_read_lock(); + walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq); + rcu_read_unlock(); + + if (!task_has_throttle_work(current)) + task_work_add(current, ¤t->sched_throttle_work, TWA_RESUME); + + return false; } void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) @@ -5690,8 +5803,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) se = cfs_rq->tg->se[cpu_of(rq)]; - cfs_rq->throttled = 0; - update_rq_clock(rq); raw_spin_lock(&cfs_b->lock); @@ -5719,8 +5830,15 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) goto unthrottle_throttle; } - task_delta = cfs_rq->h_nr_running; - idle_task_delta = cfs_rq->idle_h_nr_running; + /* + * cfs_rq's below us may not have been fully emptied out, so we can't rely + * directly on ->h_nr_running. Use the stash instead. + */ + task_delta = cfs_rq->unthrottled_h_nr_running; + idle_task_delta = cfs_rq->unthrottled_idle_h_nr_running; + + walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_clear_up, (void *)rq); + for_each_sched_entity(se) { struct cfs_rq *qcfs_rq = cfs_rq_of(se); @@ -5728,9 +5846,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) break; enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP); - if (cfs_rq_is_idle(group_cfs_rq(se))) - idle_task_delta = cfs_rq->h_nr_running; - qcfs_rq->h_nr_running += task_delta; qcfs_rq->idle_h_nr_running += idle_task_delta; @@ -5745,9 +5860,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) update_load_avg(qcfs_rq, se, UPDATE_TG); se_update_runnable(se); - if (cfs_rq_is_idle(group_cfs_rq(se))) - idle_task_delta = cfs_rq->h_nr_running; - qcfs_rq->h_nr_running += task_delta; qcfs_rq->idle_h_nr_running += idle_task_delta; @@ -5797,7 +5909,7 @@ static void __cfsb_csd_unthrottle(void *arg) throttled_csd_list) { list_del_init(&cursor->throttled_csd_list); - if (cfs_rq_throttled(cursor)) + if (cfs_rq_throttled(cursor) || cursor->in_throttle_limbo) unthrottle_cfs_rq(cursor); } @@ -5837,7 +5949,7 @@ static void unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq) { lockdep_assert_rq_held(rq_of(cfs_rq)); - if (SCHED_WARN_ON(!cfs_rq_throttled(cfs_rq) || + if (SCHED_WARN_ON(!(cfs_rq_throttled(cfs_rq) || cfs_rq->in_throttle_limbo) || cfs_rq->runtime_remaining <= 0)) return; @@ -5865,7 +5977,12 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b) } rq_lock_irqsave(rq, &rf); - if (!cfs_rq_throttled(cfs_rq)) + /* + * A cfs_rq can be in_throttle_limbo but not throttled if it is + * waiting for tasks to exit the kernel. In this case we still + * want to replenish. 
+ */ + if (!cfs_rq_throttled(cfs_rq) && !cfs_rq->in_throttle_limbo) goto next; /* Already queued for async unthrottle */ @@ -5914,7 +6031,7 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b) list_del_init(&cfs_rq->throttled_csd_list); - if (cfs_rq_throttled(cfs_rq)) + if (cfs_rq_throttled(cfs_rq) || cfs_rq->in_throttle_limbo) unthrottle_cfs_rq(cfs_rq); rq_unlock_irqrestore(rq, &rf); @@ -6252,6 +6369,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_rq->runtime_enabled = 0; INIT_LIST_HEAD(&cfs_rq->throttled_list); INIT_LIST_HEAD(&cfs_rq->throttled_csd_list); + INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list); } void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) @@ -8274,6 +8392,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf p = task_of(se); + if (unlikely(cfs_rq_of(se)->in_throttle_limbo && !task_has_throttle_work(p))) + task_work_add(p, &p->sched_throttle_work, TWA_RESUME); /* * Since we haven't yet done put_prev_entity and if the selected task * is a different task than we started out with, try and touch the @@ -8314,6 +8434,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf p = task_of(se); + if (unlikely(cfs_rq_of(se)->in_throttle_limbo && !task_has_throttle_work(p))) + task_work_add(p, &p->sched_throttle_work, TWA_RESUME); + done: __maybe_unused; #ifdef CONFIG_SMP /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2e5a95486a422..be29154d93898 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -633,8 +633,13 @@ struct cfs_rq { u64 throttled_clock_self_time; int throttled; int throttle_count; + int in_throttle_limbo; + /* Temp storage for updating the counts during unthrottling */ + unsigned int unthrottled_h_nr_running; + unsigned int unthrottled_idle_h_nr_running; struct list_head throttled_list; struct list_head throttled_csd_list; + struct list_head throttled_limbo_list; #endif /* CONFIG_CFS_BANDWIDTH */ #endif /* CONFIG_FAIR_GROUP_SCHED */ }; @@ -2416,6 +2421,8 @@ extern void init_sched_dl_class(void); extern void init_sched_rt_class(void); extern void init_sched_fair_class(void); +extern void init_cfs_throttle_work(struct task_struct *p); + extern void reweight_task(struct task_struct *p, int prio); extern void resched_curr(struct rq *rq);
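For readers following the unthrottle accounting above, the stash-and-propagate
scheme of tg_unthrottle_up() can be pictured with this standalone model
(plain userspace C with hypothetical types, not kernel code): each level
re-enqueues its own limbo tasks, adds whatever its already-visited children
accumulated into its stash, and pushes the total into its parent's stash;
the patch later clears the stash with a second tree walk
(tg_unthrottle_clear_up()) once the totals have been consumed.

#include <stdio.h>

/* Hypothetical stand-ins for a cfs_rq and its parent link. */
struct fake_rq {
	struct fake_rq *parent;
	int limbo_tasks;          /* tasks parked on this level's limbo list */
	int unthrottled_stash;    /* accumulated by already-visited children */
};

/* Post-order visit of one level, mirroring the spirit of tg_unthrottle_up(). */
static int unthrottle_visit(struct fake_rq *rq)
{
	int delta = rq->limbo_tasks + rq->unthrottled_stash;

	if (rq->parent)
		rq->parent->unthrottled_stash += delta;
	return delta;
}

int main(void)
{
	struct fake_rq root = { 0 };
	struct fake_rq mid  = { .parent = &root, .limbo_tasks = 2 };
	struct fake_rq leaf = { .parent = &mid,  .limbo_tasks = 3 };

	/* Children are visited before their parents: leaf -> mid -> root. */
	unthrottle_visit(&leaf);
	unthrottle_visit(&mid);
	printf("tasks to add at the root: %d\n", unthrottle_visit(&root));
	return 0;
}

Compiled and run, this prints "tasks to add at the root: 5" for three tasks
parked at the leaf and two at the middle level.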