Message ID | 20221026224449.214839-1-joshdon@google.com |
---|---|
State | New |
Series | [v2] sched: async unthrottling for cfs bandwidth |
Commit Message
Josh Don
Oct. 26, 2022, 10:44 p.m. UTC
CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's
inline in an hrtimer callback. Runtime distribution is a per-cpu
operation, and unthrottling is a per-cgroup operation, since a tg walk
is required. On machines with a large number of cpus and large cgroup
hierarchies, this cpus*cgroups work can be too much to do in a single
hrtimer callback: since IRQs are disabled, hard lockups may easily occur.
Specifically, we've found this scalability issue on configurations with
256 cpus, O(1000) cgroups in the hierarchy being throttled, and high
memory bandwidth usage.
To fix this, we can instead unthrottle cfs_rq's asynchronously via a
CSD. Each cpu is responsible for unthrottling itself, thus sharding the
total work more fairly across the system, and avoiding hard lockups.
Signed-off-by: Josh Don <joshdon@google.com>
---
v2: Fixed !CONFIG_SMP build errors
kernel/sched/fair.c | 123 +++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 9 ++++
2 files changed, 123 insertions(+), 9 deletions(-)
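
In outline, the patch replaces the inline unthrottle in the distribution loop with a small per-cpu deferral. The sketch below is condensed from the full diff at the bottom of this page; the throttled_csd_count accounting and the SCHED_WARN_ON sanity checks are elided here:

/*
 * Condensed from the patch below: the period timer's distribution pass no
 * longer unthrottles remote cfs_rq's itself; it queues them on a per-rq list
 * and kicks that CPU with a CSD, which then unthrottles under its own rq lock.
 */
static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);

	if (rq == this_rq()) {
		/* Local rq: nothing to defer, unthrottle inline. */
		unthrottle_cfs_rq(cfs_rq);
		return;
	}

	/* Remote rq: queue the cfs_rq and send the call_single_data (CSD). */
	list_add_tail(&cfs_rq->throttled_csd_list, &rq->cfsb_csd_list);
	smp_call_function_single_async(cpu_of(rq), &rq->cfsb_csd);
}

/* CSD callback: runs on the target CPU and drains that CPU's own queue. */
static void __cfsb_csd_unthrottle(void *arg)
{
	struct rq *rq = arg;
	struct cfs_rq *cursor, *tmp;
	struct rq_flags rf;

	rq_lock(rq, &rf);
	list_for_each_entry_safe(cursor, tmp, &rq->cfsb_csd_list,
				 throttled_csd_list) {
		list_del_init(&cursor->throttled_csd_list);
		if (cfs_rq_throttled(cursor))
			unthrottle_cfs_rq(cursor);
	}
	rq_unlock(rq, &rf);
}

With this, distribute_cfs_runtime() only queues remote cfs_rq's (or unthrottles local ones inline), so the hrtimer callback no longer walks every throttled cgroup for every cpu with IRQs disabled.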
Comments
On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote:
> CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's
> inline in an hrtimer callback. Runtime distribution is a per-cpu
> operation, and unthrottling is a per-cgroup operation, since a tg walk
> is required. On machines with a large number of cpus and large cgroup
> hierarchies, this cpus*cgroups work can be too much to do in a single
> hrtimer callback: since IRQ are disabled, hard lockups may easily occur.
> Specifically, we've found this scalability issue on configurations with
> 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high
> memory bandwidth usage.
>
> To fix this, we can instead unthrottle cfs_rq's asynchronously via a
> CSD. Each cpu is responsible for unthrottling itself, thus sharding the
> total work more fairly across the system, and avoiding hard lockups.

So, TJ has been complaining about us throttling in kernel-space, causing
grief when we also happen to hold a mutex or some other resource and has
been prodding us to only throttle at the return-to-user boundary.

Would this be an opportune moment to do this? That is, what if we
replace this CSD with a task_work that's ran on the return-to-user path
instead?
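For readers unfamiliar with the alternative Peter is floating, a rough sketch of the shape it could take is below. This is not code from this thread: the callback body, the per-task callback_head field (cfs_bw_work) and the exact hook point are assumptions; only init_task_work(), task_work_add() and TWA_RESUME are the real task_work API.

#include <linux/task_work.h>

/* Deferred bandwidth work, run by task_work on the return-to-user path. */
static void cfs_bandwidth_task_work(struct callback_head *head)
{
	/*
	 * Hypothetical: throttle (or unthrottle) current's cfs_rq here, at a
	 * point where no kernel locks are held.
	 */
}

static void defer_cfs_bandwidth_work(struct task_struct *p)
{
	/* 'cfs_bw_work' is an assumed per-task callback_head, not a real field. */
	init_task_work(&p->cfs_bw_work, cfs_bandwidth_task_work);

	/* TWA_RESUME runs the callback before p returns to userspace. */
	if (task_work_add(p, &p->cfs_bw_work, TWA_RESUME))
		; /* task is exiting; would need a fallback path here */
}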
Hey Peter, On Mon, Oct 31, 2022 at 6:04 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote: > > CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's > > inline in an hrtimer callback. Runtime distribution is a per-cpu > > operation, and unthrottling is a per-cgroup operation, since a tg walk > > is required. On machines with a large number of cpus and large cgroup > > hierarchies, this cpus*cgroups work can be too much to do in a single > > hrtimer callback: since IRQ are disabled, hard lockups may easily occur. > > Specifically, we've found this scalability issue on configurations with > > 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high > > memory bandwidth usage. > > > > To fix this, we can instead unthrottle cfs_rq's asynchronously via a > > CSD. Each cpu is responsible for unthrottling itself, thus sharding the > > total work more fairly across the system, and avoiding hard lockups. > > So, TJ has been complaining about us throttling in kernel-space, causing > grief when we also happen to hold a mutex or some other resource and has > been prodding us to only throttle at the return-to-user boundary. Yea, we've been having similar priority inversion issues. It isn't limited to CFS bandwidth though, such problems are also pretty easy to hit with configurations of shares, cpumasks, and SCHED_IDLE. I've chatted with the folks working on the proxy execution patch series, and it seems like that could be a better generic solution to these types of issues. Throttle at return-to-user seems only mildly beneficial, and then only really with preemptive kernels. Still pretty easy to get inversion issues, e.g. a thread holding a kernel mutex wake back up into a hierarchy that is currently throttled, or a thread holding a kernel mutex exists in the hierarchy being throttled but is currently waiting to run. > Would this be an opportune moment to do this? That is, what if we > replace this CSD with a task_work that's ran on the return-to-user path > instead? The above comment is about when we throttle, whereas this patch is about the unthrottle case. I think you're asking why don't we unthrottle using e.g. task_work assigned to whatever the current task is? That would work around the issue of keeping IRQ disabled for long periods, but still forces one cpu to process everything, which can take quite a while. Thanks, Josh
Hello, On Mon, Oct 31, 2022 at 02:22:42PM -0700, Josh Don wrote: > > So, TJ has been complaining about us throttling in kernel-space, causing > > grief when we also happen to hold a mutex or some other resource and has > > been prodding us to only throttle at the return-to-user boundary. > > Yea, we've been having similar priority inversion issues. It isn't > limited to CFS bandwidth though, such problems are also pretty easy to > hit with configurations of shares, cpumasks, and SCHED_IDLE. I've We need to distinguish between work-conserving and non-work-conserving control schemes. Work-conserving ones - such as shares and idle - shouldn't affect the aggregate amount of work the system can perform. There may be local and temporary priority inversions but they shouldn't affect the throughput of the system and the scheduler should be able to make the eventual resource distribution conform to the configured targtes. CPU affinity and bw control are not work conserving and thus cause a different class of problems. While it is possible to slow down a system with overly restrictive CPU affinities, it's a lot harder to do so severely compared to BW control because no matter what you do, there's still at least one CPU which can make full forward progress. BW control, it's really easy to stall the entire system almost completely because we're giving userspace the ability to stall tasks for an arbitrary amount of time at random places in the kernel. This is what cgroup1 freezer did which had exactly the same problems. > chatted with the folks working on the proxy execution patch series, > and it seems like that could be a better generic solution to these > types of issues. Care to elaborate? > Throttle at return-to-user seems only mildly beneficial, and then only > really with preemptive kernels. Still pretty easy to get inversion > issues, e.g. a thread holding a kernel mutex wake back up into a > hierarchy that is currently throttled, or a thread holding a kernel > mutex exists in the hierarchy being throttled but is currently waiting > to run. I don't follow. If you only throttle at predefined safe spots, the easiest place being the kernel-user boundary, you cannot get system-wide stalls from BW restrictions, which is something the kernel shouldn't allow userspace to cause. In your example, a thread holding a kernel mutex waking back up into a hierarchy that is currently throttled should keep running in the kernel until it encounters such safe throttling point where it would have released the kernel mutex and then throttle. Thanks.
Peter Zijlstra <peterz@infradead.org> writes: > On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote: >> CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's >> inline in an hrtimer callback. Runtime distribution is a per-cpu >> operation, and unthrottling is a per-cgroup operation, since a tg walk >> is required. On machines with a large number of cpus and large cgroup >> hierarchies, this cpus*cgroups work can be too much to do in a single >> hrtimer callback: since IRQ are disabled, hard lockups may easily occur. >> Specifically, we've found this scalability issue on configurations with >> 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high >> memory bandwidth usage. >> >> To fix this, we can instead unthrottle cfs_rq's asynchronously via a >> CSD. Each cpu is responsible for unthrottling itself, thus sharding the >> total work more fairly across the system, and avoiding hard lockups. > > So, TJ has been complaining about us throttling in kernel-space, causing > grief when we also happen to hold a mutex or some other resource and has > been prodding us to only throttle at the return-to-user boundary. > > Would this be an opportune moment to do this? That is, what if we > replace this CSD with a task_work that's ran on the return-to-user path > instead? This is unthrottle, not throttle, but it would probably be straightfoward enough to do what you said for throttle. I'd expect this to not help all that much though, because throttle hits the entire cfs_rq, not individual threads. I'm currently trying something more invasive, which doesn't throttle a cfs_rq while it has any kernel tasks, and prioritizes kernel tasks / ses containing kernel tasks when a cfs_rq "should" be throttled. "Invasive" is a key word though, as it needs to do the sort of h_nr_kernel_tasks tracking on put_prev/set_next in ways we currently only need to do on enqueue/dequeue.
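No code for that approach is posted in this thread; purely to make the shape of the idea concrete, a sketch under assumed names (h_nr_kernel_tasks and task_in_kernel() are hypothetical, not existing kernel symbols), with the counter maintained from the set_next/put_prev and enqueue/dequeue paths Ben mentions:

/* Sketch only: per-cfs_rq count of queued tasks currently in kernel mode. */
static void account_h_nr_kernel_tasks(struct cfs_rq *cfs_rq,
				      struct task_struct *p, int delta)
{
	if (task_in_kernel(p))			/* hypothetical predicate */
		cfs_rq->h_nr_kernel_tasks += delta;	/* hypothetical field */
}

/* Defer throttling while any queued task is still inside the kernel. */
static bool cfs_rq_may_throttle(struct cfs_rq *cfs_rq)
{
	return cfs_rq->h_nr_kernel_tasks == 0;
}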
On Mon, Oct 31, 2022 at 2:50 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Mon, Oct 31, 2022 at 02:22:42PM -0700, Josh Don wrote: > > > So, TJ has been complaining about us throttling in kernel-space, causing > > > grief when we also happen to hold a mutex or some other resource and has > > > been prodding us to only throttle at the return-to-user boundary. > > > > Yea, we've been having similar priority inversion issues. It isn't > > limited to CFS bandwidth though, such problems are also pretty easy to > > hit with configurations of shares, cpumasks, and SCHED_IDLE. I've > > We need to distinguish between work-conserving and non-work-conserving > control schemes. Work-conserving ones - such as shares and idle - shouldn't > affect the aggregate amount of work the system can perform. There may be > local and temporary priority inversions but they shouldn't affect the > throughput of the system and the scheduler should be able to make the > eventual resource distribution conform to the configured targtes. > > CPU affinity and bw control are not work conserving and thus cause a > different class of problems. While it is possible to slow down a system with > overly restrictive CPU affinities, it's a lot harder to do so severely > compared to BW control because no matter what you do, there's still at least > one CPU which can make full forward progress. BW control, it's really easy > to stall the entire system almost completely because we're giving userspace > the ability to stall tasks for an arbitrary amount of time at random places > in the kernel. This is what cgroup1 freezer did which had exactly the same > problems. Yes, but schemes such as shares and idle can still end up creating some severe inversions. For example, a SCHED_IDLE thread on a cpu with many other threads. Eventually the SCHED_IDLE thread will get run, but the round robin times can easily get pushes out to several hundred ms (or even into the seconds range), due to min granularity. cpusets combined with the load balancer's struggle to find low weight tasks exacerbates such situations. > > chatted with the folks working on the proxy execution patch series, > > and it seems like that could be a better generic solution to these > > types of issues. > > Care to elaborate? https://lwn.net/Articles/793502/ gives some historical context, see also https://lwn.net/Articles/910302/. > > Throttle at return-to-user seems only mildly beneficial, and then only > > really with preemptive kernels. Still pretty easy to get inversion > > issues, e.g. a thread holding a kernel mutex wake back up into a > > hierarchy that is currently throttled, or a thread holding a kernel > > mutex exists in the hierarchy being throttled but is currently waiting > > to run. > > I don't follow. If you only throttle at predefined safe spots, the easiest > place being the kernel-user boundary, you cannot get system-wide stalls from > BW restrictions, which is something the kernel shouldn't allow userspace to > cause. In your example, a thread holding a kernel mutex waking back up into > a hierarchy that is currently throttled should keep running in the kernel > until it encounters such safe throttling point where it would have released > the kernel mutex and then throttle. Agree except that for the task waking back up, it isn't on cpu, so there is no "wait to throttle it until it returns to user", since throttling happens in the context of the entire cfs_rq. We'd have to treat threads in a bandwidth hierarchy that are also in kernel mode specially. 
Mechanically, it is more straightforward to implement the mechanism to wait to throttle until the cfs_rq has no more threads in kernel mode, than it is to exclude a woken task from the currently throttled period of its cfs_rq, though this is incomplete. What you're suggesting would also require that we find a way to preempt the current thread to start running the thread that woke up in kernel (and this becomes more complex when the current thread is also in kernel, or if there are n other waiting threads that are also in kernel). > > Thanks. > > -- > tejun Best, Josh
Hello, On Mon, Oct 31, 2022 at 04:15:54PM -0700, Josh Don wrote: > On Mon, Oct 31, 2022 at 2:50 PM Tejun Heo <tj@kernel.org> wrote: > Yes, but schemes such as shares and idle can still end up creating > some severe inversions. For example, a SCHED_IDLE thread on a cpu with > many other threads. Eventually the SCHED_IDLE thread will get run, but > the round robin times can easily get pushes out to several hundred ms > (or even into the seconds range), due to min granularity. cpusets > combined with the load balancer's struggle to find low weight tasks > exacerbates such situations. Yeah, especially with narrow cpuset (or task cpu affinity) configurations, it can get pretty bad. Outside that tho, at least I haven't seen a lot of problematic cases as long as the low priority one isn't tightly entangled with high priority tasks, mostly because 1. if the resource the low pri one is holding affects large part of the system, the problem is self-solving as the system quickly runs out of other things to do 2. if the resource isn't affecting large part of the system, their blast radius is usually reasonably confined to things tightly coupled with it. I'm sure there are exceptions and we definitely wanna improve the situation where it makes sense. > > > chatted with the folks working on the proxy execution patch series, > > > and it seems like that could be a better generic solution to these > > > types of issues. > > > > Care to elaborate? > > https://lwn.net/Articles/793502/ gives some historical context, see > also https://lwn.net/Articles/910302/. Ah, full blown priority inheritance. They're great to pursue but I think we wanna fix cpu bw control regardless. It's such an obvious and basic issue and given how much problem we have with actually understanding resource and control dependencies with all the custom synchronization contstructs in the kernel, fixing it will be useful even in the future where we have a better priority inheritance mechanism. > > I don't follow. If you only throttle at predefined safe spots, the easiest > > place being the kernel-user boundary, you cannot get system-wide stalls from > > BW restrictions, which is something the kernel shouldn't allow userspace to > > cause. In your example, a thread holding a kernel mutex waking back up into > > a hierarchy that is currently throttled should keep running in the kernel > > until it encounters such safe throttling point where it would have released > > the kernel mutex and then throttle. > > Agree except that for the task waking back up, it isn't on cpu, so > there is no "wait to throttle it until it returns to user", since > throttling happens in the context of the entire cfs_rq. We'd have to Oh yeah, we'd have to be able to allow threads running in kernel regardless of cfq_rq throttled state and then force charge the cpu cycles to be paid later. It would definitely require quite a bit of work. > treat threads in a bandwidth hierarchy that are also in kernel mode > specially. Mechanically, it is more straightforward to implement the > mechanism to wait to throttle until the cfs_rq has no more threads in > kernel mode, than it is to exclude a woken task from the currently > throttled period of its cfs_rq, though this is incomplete. My hunch is that bunching them together is likely gonna create too many escape scenarios and control artifacts and it'd be better to always push throttling decisions to the leaves (tasks) so that each task can be controlled separately. 
That'd involve architectural changes but the eventual behavior would be a lot better. > What you're suggesting would also require that we find a way to > preempt the current thread to start running the thread that woke up in > kernel (and this becomes more complex when the current thread is also > in kernel, or if there are n other waiting threads that are also in > kernel). I don't think it needs that. What allows userspace to easily trigger pathological scenarios is the ability to force the machine idle when there's something which is ready to run in the kernel. If you take that away, most of the problems disappear. It's not perfect but reasonable enough and not worse than a system without cpu bw control. Thanks.
On Mon, Oct 31, 2022 at 4:53 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Mon, Oct 31, 2022 at 04:15:54PM -0700, Josh Don wrote: > > On Mon, Oct 31, 2022 at 2:50 PM Tejun Heo <tj@kernel.org> wrote: > > Yes, but schemes such as shares and idle can still end up creating > > some severe inversions. For example, a SCHED_IDLE thread on a cpu with > > many other threads. Eventually the SCHED_IDLE thread will get run, but > > the round robin times can easily get pushes out to several hundred ms > > (or even into the seconds range), due to min granularity. cpusets > > combined with the load balancer's struggle to find low weight tasks > > exacerbates such situations. > > Yeah, especially with narrow cpuset (or task cpu affinity) configurations, > it can get pretty bad. Outside that tho, at least I haven't seen a lot of > problematic cases as long as the low priority one isn't tightly entangled > with high priority tasks, mostly because 1. if the resource the low pri one > is holding affects large part of the system, the problem is self-solving as > the system quickly runs out of other things to do 2. if the resource isn't > affecting large part of the system, their blast radius is usually reasonably > confined to things tightly coupled with it. I'm sure there are exceptions > and we definitely wanna improve the situation where it makes sense. cgroup_mutex and kernfs rwsem beg to differ :) These are shared with control plane threads, so it is pretty easy to starve those out even while the system has plenty of work to do. > > > > chatted with the folks working on the proxy execution patch series, > > > > and it seems like that could be a better generic solution to these > > > > types of issues. > > > > > > Care to elaborate? > > > > https://lwn.net/Articles/793502/ gives some historical context, see > > also https://lwn.net/Articles/910302/. > > Ah, full blown priority inheritance. They're great to pursue but I think we > wanna fix cpu bw control regardless. It's such an obvious and basic issue > and given how much problem we have with actually understanding resource and > control dependencies with all the custom synchronization contstructs in the > kernel, fixing it will be useful even in the future where we have a better > priority inheritance mechanism. Sure, even something like not throttling when there exist threads in kernel mode (while not a complete solution), helps get some of the way towards improving that case. > > > I don't follow. If you only throttle at predefined safe spots, the easiest > > > place being the kernel-user boundary, you cannot get system-wide stalls from > > > BW restrictions, which is something the kernel shouldn't allow userspace to > > > cause. In your example, a thread holding a kernel mutex waking back up into > > > a hierarchy that is currently throttled should keep running in the kernel > > > until it encounters such safe throttling point where it would have released > > > the kernel mutex and then throttle. > > > > Agree except that for the task waking back up, it isn't on cpu, so > > there is no "wait to throttle it until it returns to user", since > > throttling happens in the context of the entire cfs_rq. We'd have to > > Oh yeah, we'd have to be able to allow threads running in kernel regardless > of cfq_rq throttled state and then force charge the cpu cycles to be paid > later. It would definitely require quite a bit of work. > > > treat threads in a bandwidth hierarchy that are also in kernel mode > > specially. 
Mechanically, it is more straightforward to implement the > > mechanism to wait to throttle until the cfs_rq has no more threads in > > kernel mode, than it is to exclude a woken task from the currently > > throttled period of its cfs_rq, though this is incomplete. > > My hunch is that bunching them together is likely gonna create too many > escape scenarios and control artifacts and it'd be better to always push > throttling decisions to the leaves (tasks) so that each task can be > controlled separately. That'd involve architectural changes but the eventual > behavior would be a lot better. Also a tradeoff, since it is extra overhead to handle individually at the leaf level vs dequeuing a single cfs_rq. > > What you're suggesting would also require that we find a way to > > preempt the current thread to start running the thread that woke up in > > kernel (and this becomes more complex when the current thread is also > > in kernel, or if there are n other waiting threads that are also in > > kernel). > > I don't think it needs that. What allows userspace to easily trigger > pathological scenarios is the ability to force the machine idle when there's > something which is ready to run in the kernel. If you take that away, most > of the problems disappear. It's not perfect but reasonable enough and not > worse than a system without cpu bw control. > > Thanks. > > -- > tejun
On Mon, Oct 31, 2022 at 06:01:19PM -0700, Josh Don wrote: > > Yeah, especially with narrow cpuset (or task cpu affinity) configurations, > > it can get pretty bad. Outside that tho, at least I haven't seen a lot of > > problematic cases as long as the low priority one isn't tightly entangled > > with high priority tasks, mostly because 1. if the resource the low pri one > > is holding affects large part of the system, the problem is self-solving as > > the system quickly runs out of other things to do 2. if the resource isn't > > affecting large part of the system, their blast radius is usually reasonably > > confined to things tightly coupled with it. I'm sure there are exceptions > > and we definitely wanna improve the situation where it makes sense. > > cgroup_mutex and kernfs rwsem beg to differ :) These are shared with > control plane threads, so it is pretty easy to starve those out even > while the system has plenty of work to do. Hahaha yeah, good point. We definitely wanna improve them. There were some efforts to improve kernfs locking granularity earlier this year. It was promising but didn't get to the finish line. cgroup_mutex, w/ cgroup2 and especially with the optimizations around CLONE_INTO_CGROUP, we avoid that in most hot paths and hopefully that should help quite a bit. If it continues to be a problem, we definitely wanna further improve it. Just to better understand the situation, can you give some more details on the scenarios where cgroup_mutex was in the middle of a shitshow? Thanks.
On Mon, Oct 31, 2022 at 6:46 PM Tejun Heo <tj@kernel.org> wrote: > > On Mon, Oct 31, 2022 at 06:01:19PM -0700, Josh Don wrote: > > > Yeah, especially with narrow cpuset (or task cpu affinity) configurations, > > > it can get pretty bad. Outside that tho, at least I haven't seen a lot of > > > problematic cases as long as the low priority one isn't tightly entangled > > > with high priority tasks, mostly because 1. if the resource the low pri one > > > is holding affects large part of the system, the problem is self-solving as > > > the system quickly runs out of other things to do 2. if the resource isn't > > > affecting large part of the system, their blast radius is usually reasonably > > > confined to things tightly coupled with it. I'm sure there are exceptions > > > and we definitely wanna improve the situation where it makes sense. > > > > cgroup_mutex and kernfs rwsem beg to differ :) These are shared with > > control plane threads, so it is pretty easy to starve those out even > > while the system has plenty of work to do. > > Hahaha yeah, good point. We definitely wanna improve them. There were some > efforts to improve kernfs locking granularity earlier this year. It was > promising but didn't get to the finish line. cgroup_mutex, w/ cgroup2 and > especially with the optimizations around CLONE_INTO_CGROUP, we avoid that in > most hot paths and hopefully that should help quite a bit. If it continues > to be a problem, we definitely wanna further improve it. > > Just to better understand the situation, can you give some more details on > the scenarios where cgroup_mutex was in the middle of a shitshow? There have been a couple, I think one of the main ones has been writes to cgroup.procs. cpuset modifications also show up since there's a mutex there. > > Thanks. > > -- > tejun
Hello, On Tue, Nov 01, 2022 at 12:11:30PM -0700, Josh Don wrote: > > Just to better understand the situation, can you give some more details on > > the scenarios where cgroup_mutex was in the middle of a shitshow? > > There have been a couple, I think one of the main ones has been writes > to cgroup.procs. cpuset modifications also show up since there's a > mutex there. If you can, I'd really like to learn more about the details. We've had some issues with the threadgroup_rwsem because it's such a big hammer but not necessarily with cgroup_mutex because they are only used in maintenance operations and never from any hot paths. Regarding threadgroup_rwsem, w/ CLONE_INTO_CGROUP (userspace support is still missing unfortunately), the usual worfklow of creating a cgroup, seeding it with a process and then later shutting it down doesn't involve threadgroup_rwsem at all, so most of the problems should go away in the hopefully near future. Thanks.
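For context on the CLONE_INTO_CGROUP workflow Tejun refers to (not something spelled out further in this thread), a minimal userspace sketch: the child is created directly inside the target cgroup via clone3(), so no follow-up write to cgroup.procs, and hence no migration under threadgroup_rwsem, is needed. Paths and the helper name are illustrative.

#define _GNU_SOURCE
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */
#include <sys/syscall.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Fork a child directly into the cgroup whose directory fd is cgroup_fd. */
static pid_t fork_into_cgroup(int cgroup_fd)
{
	struct clone_args args;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_INTO_CGROUP;
	args.exit_signal = SIGCHLD;
	args.cgroup = cgroup_fd;	/* e.g. open("/sys/fs/cgroup/batch", O_RDONLY) */

	/* returns 0 in the child, the child's pid in the parent, -1 on error */
	return syscall(__NR_clone3, &args, sizeof(args));
}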
On Tue, Nov 1, 2022 at 12:15 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Tue, Nov 01, 2022 at 12:11:30PM -0700, Josh Don wrote: > > > Just to better understand the situation, can you give some more details on > > > the scenarios where cgroup_mutex was in the middle of a shitshow? > > > > There have been a couple, I think one of the main ones has been writes > > to cgroup.procs. cpuset modifications also show up since there's a > > mutex there. > > If you can, I'd really like to learn more about the details. We've had some > issues with the threadgroup_rwsem because it's such a big hammer but not > necessarily with cgroup_mutex because they are only used in maintenance > operations and never from any hot paths. > > Regarding threadgroup_rwsem, w/ CLONE_INTO_CGROUP (userspace support is > still missing unfortunately), the usual worfklow of creating a cgroup, > seeding it with a process and then later shutting it down doesn't involve > threadgroup_rwsem at all, so most of the problems should go away in the > hopefully near future. Maybe walking through an example would be helpful? I don't know if there's anything super specific. For cgroup_mutex for example, the same global mutex is being taken for things like cgroup mkdir and cgroup proc attach, regardless of which part of the hierarchy is being modified. So, we end up sharing that mutex between random job threads (ie. that may be manipulating their own cgroup sub-hierarchy), and control plane threads, which are attempting to manage root-level cgroups. Bad things happen when the cgroup_mutex (or similar) is held by a random thread which blocks and is of low scheduling priority, since when it wakes back up it may take quite a while for it to run again (whether that low priority be due to CFS bandwidth, sched_idle, or even just O(hundreds) of threads on a cpu). Starving out the control plane causes us significant issues, since that affects machine health. cgroup manipulation is not a hot path operation, but the control plane tends to hit it fairly often, and so those things combine at our scale to produce this rare problem. > > Thanks. > > -- > tejun
Hello, On Tue, Nov 01, 2022 at 01:56:29PM -0700, Josh Don wrote: > Maybe walking through an example would be helpful? I don't know if > there's anything super specific. For cgroup_mutex for example, the > same global mutex is being taken for things like cgroup mkdir and > cgroup proc attach, regardless of which part of the hierarchy is being > modified. So, we end up sharing that mutex between random job threads > (ie. that may be manipulating their own cgroup sub-hierarchy), and > control plane threads, which are attempting to manage root-level > cgroups. Bad things happen when the cgroup_mutex (or similar) is held > by a random thread which blocks and is of low scheduling priority, > since when it wakes back up it may take quite a while for it to run > again (whether that low priority be due to CFS bandwidth, sched_idle, > or even just O(hundreds) of threads on a cpu). Starving out the > control plane causes us significant issues, since that affects machine > health. cgroup manipulation is not a hot path operation, but the > control plane tends to hit it fairly often, and so those things > combine at our scale to produce this rare problem. I keep asking because I'm curious about the specific details of the contentions. Control plane locking up is obviously bad but they can usually tolerate some latencies - stalling out multiple seconds (or longer) can be catastrophic but tens or hundreds or millisecs occasionally usually isn't. The only times we've seen latency spikes from CPU side which is enough to cause system-level failures were when there were severe restrictions through bw control. Other cases sure are possible but unless you grab these mutexes while IDLE inside a heavily contended cgroup (which is a bit silly) you gotta push *really* hard. If most of the problems were with cpu bw control, fixing that should do for the time being. Otherwise, we'll have to think about finishing kernfs locking granularity improvements and doing something similar to cgroup locking too. Thanks.
On Tue, Nov 1, 2022 at 2:50 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Tue, Nov 01, 2022 at 01:56:29PM -0700, Josh Don wrote: > > Maybe walking through an example would be helpful? I don't know if > > there's anything super specific. For cgroup_mutex for example, the > > same global mutex is being taken for things like cgroup mkdir and > > cgroup proc attach, regardless of which part of the hierarchy is being > > modified. So, we end up sharing that mutex between random job threads > > (ie. that may be manipulating their own cgroup sub-hierarchy), and > > control plane threads, which are attempting to manage root-level > > cgroups. Bad things happen when the cgroup_mutex (or similar) is held > > by a random thread which blocks and is of low scheduling priority, > > since when it wakes back up it may take quite a while for it to run > > again (whether that low priority be due to CFS bandwidth, sched_idle, > > or even just O(hundreds) of threads on a cpu). Starving out the > > control plane causes us significant issues, since that affects machine > > health. cgroup manipulation is not a hot path operation, but the > > control plane tends to hit it fairly often, and so those things > > combine at our scale to produce this rare problem. > > I keep asking because I'm curious about the specific details of the > contentions. Control plane locking up is obviously bad but they can usually > tolerate some latencies - stalling out multiple seconds (or longer) can be > catastrophic but tens or hundreds or millisecs occasionally usually isn't. > > The only times we've seen latency spikes from CPU side which is enough to > cause system-level failures were when there were severe restrictions through > bw control. Other cases sure are possible but unless you grab these mutexes > while IDLE inside a heavily contended cgroup (which is a bit silly) you > gotta push *really* hard. > > If most of the problems were with cpu bw control, fixing that should do for > the time being. Otherwise, we'll have to think about finishing kernfs > locking granularity improvements and doing something similar to cgroup > locking too. Oh we've easily hit stalls measured in multiple seconds. We extensively use cpu.idle to group batch tasks. One of the memory bandwidth mitigations implemented in userspace is cpu jailing, which can end up pushing lots and lots of these batch threads onto a small number of cpus. 5ms min gran * 200 threads is already one second :) We're in the process of transitioning to using bw instead for this instead in order to maintain parallelism. Fixing bw is definitely going to be useful, but I'm afraid we'll still likely have some issues from low throughput for non-bw reasons (some of which we can't directly control, since arbitrary jobs can spin up and configure their hierarchy/threads in antagonistic ways, in effect pushing out the latency of some of their threads). > > Thanks. > > -- > tejun
Hello, (cc'ing Michal, Christian and Li for context) On Tue, Nov 01, 2022 at 02:59:56PM -0700, Josh Don wrote: > > If most of the problems were with cpu bw control, fixing that should do for > > the time being. Otherwise, we'll have to think about finishing kernfs > > locking granularity improvements and doing something similar to cgroup > > locking too. > > Oh we've easily hit stalls measured in multiple seconds. We > extensively use cpu.idle to group batch tasks. One of the memory > bandwidth mitigations implemented in userspace is cpu jailing, which > can end up pushing lots and lots of these batch threads onto a small > number of cpus. 5ms min gran * 200 threads is already one second :) Ah, I see. > We're in the process of transitioning to using bw instead for this > instead in order to maintain parallelism. Fixing bw is definitely > going to be useful, but I'm afraid we'll still likely have some issues > from low throughput for non-bw reasons (some of which we can't > directly control, since arbitrary jobs can spin up and configure their > hierarchy/threads in antagonistic ways, in effect pushing out the > latency of some of their threads). Yeah, thanks for the explanation. Making the lock more granular is tedious but definitely doable. I don't think I can work on it in the near future but will keep it on mind. If anyone's interested in attacking it, please be my guest. Thanks.
On Mon, Oct 31, 2022 at 02:56:13PM -0700, Benjamin Segall wrote: > Peter Zijlstra <peterz@infradead.org> writes: > > > On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote: > >> CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's > >> inline in an hrtimer callback. Runtime distribution is a per-cpu > >> operation, and unthrottling is a per-cgroup operation, since a tg walk > >> is required. On machines with a large number of cpus and large cgroup > >> hierarchies, this cpus*cgroups work can be too much to do in a single > >> hrtimer callback: since IRQ are disabled, hard lockups may easily occur. > >> Specifically, we've found this scalability issue on configurations with > >> 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high > >> memory bandwidth usage. > >> > >> To fix this, we can instead unthrottle cfs_rq's asynchronously via a > >> CSD. Each cpu is responsible for unthrottling itself, thus sharding the > >> total work more fairly across the system, and avoiding hard lockups. > > > > So, TJ has been complaining about us throttling in kernel-space, causing > > grief when we also happen to hold a mutex or some other resource and has > > been prodding us to only throttle at the return-to-user boundary. > > > > Would this be an opportune moment to do this? That is, what if we > > replace this CSD with a task_work that's ran on the return-to-user path > > instead? > > This is unthrottle, not throttle, but it would probably be Duh..
Hello.

On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don <joshdon@google.com> wrote:
> To fix this, we can instead unthrottle cfs_rq's asynchronously via a
> CSD. Each cpu is responsible for unthrottling itself, thus sharding the
> total work more fairly across the system, and avoiding hard lockups.

FIFO behavior of the cfs_b->throttled_cfs_rq is quite important to
ensure fairness of throttling (historically, when the FIFO wasn't
honored, it caused some cfs_rq starving issues).

Despite its name, distribute_cfs_runtime() doesn't distribute the
runtime, the time is pulled inside assign_cfs_rq_runtime() (but that's
already on target cpu).
Currently, it's all synchronized under cfs_b->lock but with your change,
throttled cfs_rq would be dissolved among cpus that'd run concurrently
(assign_cfs_rq_runtime() still takes cfs_b->lock but it won't be
necessarily in the unthrottling order).

Have you observed any such fairness issues? [1][2]

> +static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
> [...]
> +	if (rq == this_rq()) {
> +		unthrottle_cfs_rq(cfs_rq);
> +		return;
> +	}

It was pointed out to me that generic_exec_single() does something
similar.
Wouldn't the flow bandwidth control code be simpler relying on that?

Also, can a particular cfs_rq be on both cfs_b->throttled_csd_list and
cfs_b->throttled_cfs_rq lists at any moment?
I wonder if having a single list_head node in cfs_rq would be feasible
(and hence enforcing this constraint in data).

Regards,
Michal

[1] I'm not familiar with IPIs, just to illustrate the concurrency: the
    fairness could be skewed towards CPUs that are on same "NUMA" node as
    the timer callback if closer CPUs received them sooner.

[2] Currently, I don't think it's a prohibitive issue because with my
    reasoning even the current code relies on cfs_b->lock being a queued
    spinlock to ensure the FIFO of cfs_b->throttled_cfs_rq.
On Tue, Nov 01, 2022 at 12:38:23PM -1000, Tejun Heo <tj@kernel.org> wrote: > (cc'ing Michal, Christian and Li for context) Thanks. > > We're in the process of transitioning to using bw instead for this > > instead in order to maintain parallelism. Fixing bw is definitely > > going to be useful, but I'm afraid we'll still likely have some issues > > from low throughput for non-bw reasons (some of which we can't > > directly control, since arbitrary jobs can spin up and configure their > > hierarchy/threads in antagonistic ways, in effect pushing out the > > latency of some of their threads). > > Yeah, thanks for the explanation. Making the lock more granular is tedious > but definitely doable. I don't think I can work on it in the near future but > will keep it on mind. If anyone's interested in attacking it, please be my > guest. From my experience, throttling while holding kernel locks (not just cgroup_mutex) causes more trouble than plain cgroup_mutex scalability currently. But I acknowledge the latter issue too. Michal
On Wed, Nov 02, 2022 at 06:10:49PM +0100, Michal Koutný wrote: > On Tue, Nov 01, 2022 at 12:38:23PM -1000, Tejun Heo <tj@kernel.org> wrote: > > > We're in the process of transitioning to using bw instead for this > > > instead in order to maintain parallelism. Fixing bw is definitely > > > going to be useful, but I'm afraid we'll still likely have some issues > > > from low throughput for non-bw reasons (some of which we can't > > > directly control, since arbitrary jobs can spin up and configure their > > > hierarchy/threads in antagonistic ways, in effect pushing out the > > > latency of some of their threads). > > > > Yeah, thanks for the explanation. Making the lock more granular is tedious > > but definitely doable. I don't think I can work on it in the near future but > > will keep it on mind. If anyone's interested in attacking it, please be my > > guest. > > From my experience, throttling while holding kernel locks (not just > cgroup_mutex) causes more trouble than plain cgroup_mutex scalability > currently. Oh yeah, absolutely. Low cpu bw config + any shared kernel resource is a nightmare and this thread was originally about addressing that. Thanks.
Hi Michal, Thanks for taking a look. On Wed, Nov 2, 2022 at 9:59 AM Michal Koutný <mkoutny@suse.com> wrote: > > Hello. > > On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don <joshdon@google.com> wrote: > > To fix this, we can instead unthrottle cfs_rq's asynchronously via a > > CSD. Each cpu is responsible for unthrottling itself, thus sharding the > > total work more fairly across the system, and avoiding hard lockups. > > FIFO behavior of the cfs_b->throttled_cfs_rq is quite important to > ensure fairness of throttling (historically when it FIFO wasn't honored, > it caused some cfs_rq starving issues). > > Despite its name, distribute_cfs_runtime() doesn't distribute the > runtime, the time is pulled inside assign_cfs_rq_runtime() (but that's > already on target cpu). > Currently, it's all synchronized under cfs_b->lock but with your change, > throttled cfs_rq would be dissolved among cpus that'd run concurrently > (assign_cfs_rq_runtime() still takes cfs_b->lock but it won't be > necessarily in the unthrottling order). I don't think my patch meaningfully regresses this; the prior state was also very potentially unfair in a similar way. Without my patch, distribute_cfs_runtime() will unthrottle the cfs_rq's, and as you point out, it doesn't actually give them any real quota, it lets assign_cfs_rq_runtime() take care of that. But this happens asynchronously on those cpus. If they are idle, they wait for an IPI from the resched_curr() in unthrottled_cfs_rq(), otherwise they simply wait until potentially the next rescheduling point. So we are currently far from ever being guaranteed that the order the cpus pull actual quota via assign_cfs_rq_runtime() matches the order they were unthrottled from the list. > > +static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq) > > [...] > > + if (rq == this_rq()) { > > + unthrottle_cfs_rq(cfs_rq); > > + return; > > + } > > It was pointed out to me that generic_exec_single() does something > similar. > Wouldn't the flow bandwidth control code be simpler relying on that? We already hold rq lock so we couldn't rely on the generic_exec_single() special case since that would double lock. > Also, can a particular cfs_rq be on both cfs_b->throttled_csd_list and > cfs_b->throttled_cfs_rq lists at any moment? > I wonder if having a single list_head node in cfs_rq would be feasible > (and hence enforcing this constraint in data). That's an interesting idea, this could be rewritten so that distribute() pulls the entity off this list and moves it to the throttled_csd_list; we never have an actual need to have entities on both lists at the same time. I'll wait to see if Peter has any comments, but that could be made in a v3 for this patch. Best, Josh
On Wed, Nov 02, 2022 at 05:10:08PM -0700, Josh Don <joshdon@google.com> wrote:
> Without my patch, distribute_cfs_runtime() will unthrottle the
> cfs_rq's, and as you point out, it doesn't actually give them any real
> quota, it lets assign_cfs_rq_runtime() take care of that. But this
> happens asynchronously on those cpus. If they are idle, they wait for
> an IPI from the resched_curr() in unthrottled_cfs_rq(), otherwise they
> simply wait until potentially the next rescheduling point. So we are
> currently far from ever being guaranteed that the order the cpus pull
> actual quota via assign_cfs_rq_runtime() matches the order they were
> unthrottled from the list.

Thanks for breaking it down, I agree your change won't affect the
fairness substantially and my other points are clarified too.

Michal
Hey Peter,

Any other thoughts on this patch? Right now the only thing I have to
change is to eliminate the redundant list_head, per Michal's suggestion.
If the general idea here looks good to you, I can go ahead and do that
and send the v3.

Thanks,
Josh
On Wed, Nov 2, 2022 at 9:59 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Also, can a particular cfs_rq be on both cfs_b->throttled_csd_list and
> cfs_b->throttled_cfs_rq lists at any moment?
> I wonder if having a single list_head node in cfs_rq would be feasible
> (and hence enforcing this constraint in data).

After more thought, I realized that we can't reuse the throttled_list
list_head, since that would potentially break the lockless traversal
of a concurrent list_for_each_entry_rcu() (ie. if we removed the
element from the throttled list and then added it to the CSD list).

- Josh
On Tue, Nov 15, 2022 at 07:01:31PM -0800, Josh Don <joshdon@google.com> wrote:
> After more thought, I realized that we can't reuse the throttled_list
> list_head, since that would potentially break the lockless traversal
> of a concurrent list_for_each_entry_rcu() (ie. if we removed the
> element from the throttled list and then added it to the CSD list).

I see, the concurrent RCU traversal is a valid point for the two heads.

What does it mean for SCHED_WARN_ON in __unthrottle_cfs_rq_async()?

IIUC, if the concurrency of cfs_b->throttled_cfs_rq list is
expected (hence I'm not sure about the SCHED_WARN_ON), then it may
happen that __unthrottle_cfs_rq_async is called on cfs_rq that's already
on rq->cfsb_csd_list (there's still rq lock but it's only held inside
cfs_b->throttled_cfs_rq iteration).

Thanks,
Michal
On Wed, Nov 16, 2022 at 1:57 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> What does it mean for SCHED_WARN_ON in __unthrottle_cfs_rq_async()?
>
> IIUC, if the concurrency of cfs_b->throttled_cfs_rq list is
> expected (hence I'm not sure about the SCHED_WARN_ON), then it may
> happen that __unthrottle_cfs_rq_async is called on cfs_rq that's already
> on rq->cfsb_csd_list (there's still rq lock but it's only held inside
> cfs_b->throttled_cfs_rq iteration).

It catches a case where we call unthrottle_cfs_rq_async() on a given
cfs_rq again before we have a chance to process the previous call. This
should never happen, because currently we only call this from the
distribution handler, and we skip entities already queued for unthrottle
(this is the check for if (!list_empty(&cfs_rq->throttled_csd_list))).

> Thanks,
> Michal
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..ff5548013979 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5318,10 +5318,73 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
+#ifdef CONFIG_SMP
+static void __cfsb_csd_unthrottle(void *arg)
+{
+	struct rq *rq = arg;
+	struct rq_flags rf;
+	struct cfs_rq *cursor, *tmp;
+
+	rq_lock(rq, &rf);
+
+	list_for_each_entry_safe(cursor, tmp, &rq->cfsb_csd_list,
+				 throttled_csd_list) {
+		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cursor->tg);
+
+		list_del_init(&cursor->throttled_csd_list);
+		atomic_dec(&cfs_b->throttled_csd_count);
+
+		if (cfs_rq_throttled(cursor))
+			unthrottle_cfs_rq(cursor);
+	}
+
+	rq_unlock(rq, &rf);
+}
+
+static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b;
+
+	if (rq == this_rq()) {
+		unthrottle_cfs_rq(cfs_rq);
+		return;
+	}
+
+	/* Already enqueued */
+	if (SCHED_WARN_ON(!list_empty(&cfs_rq->throttled_csd_list)))
+		return;
+
+	cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+	list_add_tail(&cfs_rq->throttled_csd_list, &rq->cfsb_csd_list);
+	atomic_inc(&cfs_b->throttled_csd_count);
+
+	smp_call_function_single_async(cpu_of(rq), &rq->cfsb_csd);
+}
+#else
+static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
+{
+	unthrottle_cfs_rq(cfs_rq);
+}
+#endif
+
+static void unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
+{
+	lockdep_assert_rq_held(rq_of(cfs_rq));
+
+	if (SCHED_WARN_ON(!cfs_rq_throttled(cfs_rq) ||
+			  cfs_rq->runtime_remaining <= 0))
+		return;
+
+	__unthrottle_cfs_rq_async(cfs_rq);
+}
+
+static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime, remaining = 1;
+	bool throttled = false;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
@@ -5329,11 +5392,22 @@ static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 		struct rq *rq = rq_of(cfs_rq);
 		struct rq_flags rf;
 
+		if (!remaining) {
+			throttled = true;
+			break;
+		}
+
 		rq_lock_irqsave(rq, &rf);
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
-		/* By the above check, this should never be true */
+#ifdef CONFIG_SMP
+		/* Already queued for async unthrottle */
+		if (!list_empty(&cfs_rq->throttled_csd_list))
+			goto next;
+#endif
+
+		/* By the above checks, this should never be true */
 		SCHED_WARN_ON(cfs_rq->runtime_remaining > 0);
 
 		raw_spin_lock(&cfs_b->lock);
@@ -5348,15 +5422,14 @@ static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
-			unthrottle_cfs_rq(cfs_rq);
+			unthrottle_cfs_rq_async(cfs_rq);
 
 next:
 		rq_unlock_irqrestore(rq, &rf);
-
-		if (!remaining)
-			break;
 	}
 	rcu_read_unlock();
+
+	return throttled;
 }
 
 /*
@@ -5401,10 +5474,8 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	while (throttled && cfs_b->runtime > 0) {
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		distribute_cfs_runtime(cfs_b);
+		throttled = distribute_cfs_runtime(cfs_b);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
-
-		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	}
 
 	/*
@@ -5675,12 +5746,16 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 	cfs_b->slack_started = false;
+	atomic_set(&cfs_b->throttled_csd_count, 0);
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
+#ifdef CONFIG_SMP
+	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+#endif
 }
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
@@ -5703,6 +5778,31 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	hrtimer_cancel(&cfs_b->period_timer);
 	hrtimer_cancel(&cfs_b->slack_timer);
+
+	/*
+	 * It is possible that we still have some cfs_rq's pending on the CSD
+	 * list, but this race is very rare. In order for this to occur, we must
+	 * have raced with the last task leaving the group while there exist throttled
+	 * cfs_rq(s), and the period_timer must have queued the CSD item but the remote
+	 * cpu has not yet processed it. In this case, we can simply process all CSD
+	 * work inline here.
+	 */
+#ifdef CONFIG_SMP
+	if (unlikely(atomic_read(&cfs_b->throttled_csd_count) > 0)) {
+		unsigned long flags;
+		int i;
+
+		for_each_possible_cpu(i) {
+			struct rq *rq = cpu_rq(i);
+
+			local_irq_save(flags);
+			__cfsb_csd_unthrottle(rq);
+			local_irq_restore(flags);
+		}
+
+		SCHED_WARN_ON(atomic_read(&cfs_b->throttled_csd_count) > 0);
+	}
+#endif
 }
 
 /*
@@ -12237,6 +12337,11 @@ __init void init_sched_fair_class(void)
 	for_each_possible_cpu(i) {
 		zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i));
 		zalloc_cpumask_var_node(&per_cpu(select_rq_mask, i), GFP_KERNEL, cpu_to_node(i));
+
+#ifdef CONFIG_CFS_BANDWIDTH
+		INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i));
+		INIT_LIST_HEAD(&cpu_rq(i)->cfsb_csd_list);
+#endif
 	}
 
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1644242ecd11..e6f505f8c351 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -355,6 +355,7 @@ struct cfs_bandwidth {
 	struct hrtimer		period_timer;
 	struct hrtimer		slack_timer;
 	struct list_head	throttled_cfs_rq;
+	atomic_t		throttled_csd_count;
 
 	/* Statistics: */
 	int			nr_periods;
@@ -645,6 +646,9 @@ struct cfs_rq {
 	int			throttled;
 	int			throttle_count;
 	struct list_head	throttled_list;
+#ifdef CONFIG_SMP
+	struct list_head	throttled_csd_list;
+#endif
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
@@ -1144,6 +1148,11 @@ struct rq {
 	unsigned int		core_forceidle_occupation;
 	u64			core_forceidle_start;
 #endif
+
+#if defined(CONFIG_CFS_BANDWIDTH) && defined(CONFIG_SMP)
+	call_single_data_t	cfsb_csd;
+	struct list_head	cfsb_csd_list;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED