Message ID: 20221019031551.24312-1-zhouchuyi@bytedance.com
State: New
Headers:
From: Chuyi Zhou <zhouchuyi@bytedance.com>
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org
Cc: linux-kernel@vger.kernel.org, htejun@gmail.com, lizefan.x@bytedance.com, vschneid@redhat.com, bsegall@google.com, Chuyi Zhou <zhouchuyi@bytedance.com>, Abel Wu <wuyun.abel@bytedance.com>
Subject: [RESEND] sched/fair: Add min_ratio for cfs bandwidth_control
Date: Wed, 19 Oct 2022 11:15:51 +0800
Message-Id: <20221019031551.24312-1-zhouchuyi@bytedance.com>
Series: [RESEND] sched/fair: Add min_ratio for cfs bandwidth_control
Commit Message
Chuyi Zhou
Oct. 19, 2022, 3:15 a.m. UTC
Tasks may be throttled while holding locks for a long time under the current
cfs bandwidth control mechanism once users set too small a quota/period
ratio, which can result in the whole system getting stuck[1].

In order to prevent the above situation from happening, this patch adds
sysctl_sched_cfs_bandwidth_min_ratio in /proc/sys/kernel, which indicates
the minimum quota/period percentage users can set. The default value is
zero, in which case users can set quota and period without triggering this
constraint.
Link[1]: https://lore.kernel.org/lkml/5987be34-b527-4ff5-a17d-5f6f0dc94d6d@huawei.com/T/
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Suggested-by: Abel Wu <wuyun.abel@bytedance.com>
---
include/linux/sched/sysctl.h | 4 ++++
kernel/sched/core.c | 23 +++++++++++++++++++++++
kernel/sysctl.c | 10 ++++++++++
3 files changed, 37 insertions(+)
Comments
Chuyi Zhou <zhouchuyi@bytedance.com> writes:

> Tasks may be throttled when holding locks for a long time by current
> cfs bandwidth control mechanism once users set a too small quota/period
> ratio, which can result whole system get stuck[1].
>
> In order to prevent the above situation from happening, this patch adds
> sysctl_sched_cfs_bandwidth_min_ratio in /proc/sys/kernel, which indicates
> the minimum percentage of quota/period users can set. The default value is
> zero and users can set quota and period without triggering this
> constraint.

There's so many other sorts of bad inputs that can get you stuck here
that I'm not sure it's ever safe against lockups to provide direct write
access to an untrusted user. I'm not totally opposed but it seems like
an incomplete fix to a broken (non-default) configuration.

> [...]
Hello,

On Wed, Oct 19, 2022 at 11:15:51AM +0800, Chuyi Zhou wrote:
> [...]

This is a bit of a bandaid. I think what we really need to do is only
throttling when running in userspace. In kernel space, it should just keep
accumulating used cycles as debt which should be paid back before userspace
code can run again so that we don't throttle at random places in the kernel.

Thanks.
On 2022/10/20 05:21, Tejun Heo wrote:
> Hello,
>
> On Wed, Oct 19, 2022 at 11:15:51AM +0800, Chuyi Zhou wrote:
>> [...]
>
> This is a bit of a bandaid. I think what we really need to do is only
> throttling when running in userspace. In kernel space, it should just keep
> accumulating used cycles as debt which should be paid back before userspace
> code can run again so that we don't throttle at random places in the kernel.
>
> Thanks.

Got it. Thanks for your advice.

Chuyi Zhou
On 2022/10/20 05:01, Benjamin Segall wrote:
> Chuyi Zhou <zhouchuyi@bytedance.com> writes:
>
>> [...]
>
> There's so many other sorts of bad inputs that can get you stuck here
> that I'm not sure it's ever safe against lockups to provide direct write
> access to an untrusted user. I'm not totally opposed but it seems like
> an incomplete fix to a broken (non-default) configuration.

Thanks for your advice.

Chuyi Zhou
On Wed, Oct 19, 2022 at 11:21:19AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Oct 19, 2022 at 11:15:51AM +0800, Chuyi Zhou wrote:
> > [...]
>
> This is a bit of a bandaid. I think what we really need to do is only
> throttling when running in userspace. In kernel space, it should just keep
> accumulating used cycles as debt which should be paid back before userspace
> code can run again so that we don't throttle at random places in the kernel.

That's just moving the problem. But yeah; perhaps. Starving random
userspace is less of a problem I suppose.
Hello,

On Thu, Oct 20, 2022 at 07:08:13PM +0200, Peter Zijlstra wrote:
> > This is a bit of a bandaid. I think what we really need to do is only
> > throttling when running in userspace. In kernel space, it should just keep
> > accumulating used cycles as debt which should be paid back before userspace
> > code can run again so that we don't throttle at random places in the kernel.
>
> That's just moving the problem. But yeah; perhaps. Starving random
> userspace is less of a problem I suppose.

Given that our primary means of guaranteeing forward progress is the fact
that the system runs out of other things to do when there are severe
priority inversions, I don't think we can safely give control of throttling
something running in the kernel to userspace.

IO control takes a similar approach with shared IOs which can have
system-wide impacts, and it's been working out pretty well. While some may
go over the limit briefly, it's not that difficult to remain true to the
intended configuration over time. The only problem is the cases where
userspace can cause a large amount of forced consumption (e.g. for IOs,
creating a lot of metadata updates without doing anything else), but even in
the unlikely case a similar problem exists for CPU, it's pretty easy to add
specific control mechanisms around those (e.g. sth along the style of
might_resched()). So, yeah, I think this is the actual solution.

Thanks.
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 303ee7dd0c7e..dedb18648f0e 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -21,6 +21,10 @@ enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_END,
 };
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_min_ratio;
+#endif
+
 #define NUMA_BALANCING_DISABLED		0x0
 #define NUMA_BALANCING_NORMAL		0x1
 #define NUMA_BALANCING_MEMORY_TIERING	0x2
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5800b0623ff3..8f6cfd889e37 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10504,6 +10504,12 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * The minimum of quota/period ratio users can set, default is zero and users can set
+ * quota and period without triggering this constraint.
+ */
+unsigned int sysctl_sched_cfs_bandwidth_min_ratio;
+
 static DEFINE_MUTEX(cfs_constraints_mutex);
 
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
@@ -10513,6 +10519,20 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
+static int check_cfs_bandwidth_min_ratio(u64 period, u64 quota)
+{
+	u64 ratio;
+
+	if (!sysctl_sched_cfs_bandwidth_min_ratio)
+		return 0;
+
+	ratio = div64_u64(quota * 100, period);
+	if (ratio < sysctl_sched_cfs_bandwidth_min_ratio)
+		return -1;
+
+	return 0;
+}
+
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 				u64 burst)
 {
@@ -10548,6 +10568,9 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 	    burst + quota > max_cfs_runtime))
 		return -EINVAL;
 
+	if (quota != RUNTIME_INF && check_cfs_bandwidth_min_ratio(period, quota))
+		return -EINVAL;
+
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 188c305aeb8b..7d9743e8e514 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1652,6 +1652,16 @@ static struct ctl_table kern_table[] = {
 		.extra1		= SYSCTL_ZERO,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_min_ratio",
+		.data		= &sysctl_sched_cfs_bandwidth_min_ratio,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
+#endif /* CONFIG_CFS_BANDWIDTH */
 	{
 		.procname	= "panic",
 		.data		= &panic_timeout,