Message ID | 26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot@kernel.org |
---|---|
State | New |
Series | SCHED_DEADLINE server infrastructure |
Commit Message
Daniel Bristot de Oliveira
Nov. 4, 2023, 10:59 a.m. UTC
Add an interface for fair server setup on debugfs.
Each rq has three files under /sys/kernel/debug/sched/rq/CPU{ID}:
- fair_server_runtime: set runtime in ns
- fair_server_period: set period in ns
- fair_server_defer: on/off for the defer mechanism
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/deadline.c |  89 +++++++++++++++---
 kernel/sched/debug.c    | 202 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |   6 --
 kernel/sched/sched.h    |   2 +
4 files changed, 279 insertions(+), 20 deletions(-)
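As an illustration of the interface above, here is a minimal userspace sketch (hypothetical values; it assumes debugfs is mounted at /sys/kernel/debug and that the per-rq directories are named cpu0, cpu1, ...) that gives CPU 0's fair server 500 ms of runtime per 1 s period with the defer mechanism enabled:

#include <stdio.h>

/* Write one decimal value to a fair server debugfs file. */
static int write_value(const char *path, unsigned long long val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%llu\n", val);
	return fclose(f);
}

int main(void)
{
	const char *dir = "/sys/kernel/debug/sched/rq/cpu0";
	char path[128];

	/* Set the period before the runtime so runtime never exceeds period. */
	snprintf(path, sizeof(path), "%s/fair_server_period", dir);
	write_value(path, 1000000000ULL);	/* period: 1 s, in ns */

	snprintf(path, sizeof(path), "%s/fair_server_runtime", dir);
	write_value(path, 500000000ULL);	/* runtime: 500 ms, in ns */

	snprintf(path, sizeof(path), "%s/fair_server_defer", dir);
	write_value(path, 1);			/* defer mechanism on */

	return 0;
}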
Comments
Hi Daniel,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linus/master next-20231103]
[cannot apply to tip/auto-latest v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot%40kernel.org
patch subject: [PATCH v5 7/7] sched/fair: Fair server interface
config: i386-buildonly-randconfig-001-20231104 (https://download.01.org/0day-ci/archive/20231104/202311042329.PB1gTIL4-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231104/202311042329.PB1gTIL4-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311042329.PB1gTIL4-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/build_utility.c:72:
>> kernel/sched/debug.c:342:57: warning: integer overflow in expression of type 'long int' results in '-100663296' [-Woverflow]
     342 | static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
         |                                                         ^

vim +342 kernel/sched/debug.c

   341	
 > 342	static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
   343	static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */
   344	
Hi Daniel,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linus/master next-20231103]
[cannot apply to tip/auto-latest v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot%40kernel.org
patch subject: [PATCH v5 7/7] sched/fair: Fair server interface
config: i386-randconfig-013-20231105 (https://download.01.org/0day-ci/archive/20231105/202311050844.d45KQ8sK-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231105/202311050844.d45KQ8sK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311050844.d45KQ8sK-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/build_utility.c:72:0:
>> kernel/sched/debug.c:342:57: warning: integer overflow in expression [-Woverflow]
    static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
                                                            ^

vim +342 kernel/sched/debug.c

   341	
 > 342	static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
   343	static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */
   344	
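The warning only fires on 32-bit configs: (1 << 22) is an int, NSEC_PER_USEC is 1000L, and their product (4,194,304,000) does not fit in a signed 32-bit long, wrapping to the reported -100663296. One possible fix, sketched here rather than taken from this thread, is to do the multiplication in unsigned long, where the result still fits on 32-bit builds:

	static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
	static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;       /* 100 us */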
On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
>
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>
> - fair_server_runtime: set runtime in ns
> - fair_server_period: set period in ns
> - fair_server_defer: on/off for the defer mechanism
>

This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
total available bandwidth control, right?

But then shouldn've we also rip out the throttle thingy right quick?
On 11/6/23 16:40, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>> - fair_server_runtime: set runtime in ns
>> - fair_server_period: set period in ns
>> - fair_server_defer: on/off for the defer mechanism
>>
>
> This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
> total available bandwidth control, right?

right, but thinking aloud... given that the per-cpu files are already allocating the
bandwidth on the dl_rq, the spare time for fair scheduler is granted.

Still, we can have them there as a safeguard to not overloading the deadline
scheduler... (thinking aloud 2) as long as global is a thing... as we get away
from it, that global limitation will make less sense - still better to have a form
of limitation so people are aware of bandwidth until there.

> But then shouldn've we also rip out the throttle thingy right quick?
>

I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
for now, because GROUP_SCHED depends on it, no?

With the next step on moving the dl server as the base for the
hierarchical scheduling... That will rip out the
CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.

Does it make sense?
On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:
> On 11/6/23 16:40, Peter Zijlstra wrote:
> > On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> >> Add an interface for fair server setup on debugfs.
> >>
> >> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> >>
> >> - fair_server_runtime: set runtime in ns
> >> - fair_server_period: set period in ns
> >> - fair_server_defer: on/off for the defer mechanism
> >>
> >
> > This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
> > total available bandwidth control, right?
>
> right, but thinking aloud... given that the per-cpu files are already allocating the
> bandwidth on the dl_rq, the spare time for fair scheduler is granted.
>
> Still, we can have them there as a safeguard to not overloading the deadline
> scheduler... (thinking aloud 2) as long as global is a thing... as we get away
> from it, that global limitation will make less sense - still better to have a form
> of limitation so people are aware of bandwidth until there.

Yeah, so having a limit on the deadline thing seems prudent as a way to
model system overhead. I mean 100% sounds nice, but then all the models
also assume no interrupts, no scheduler or migration overhead etc.. So
setting a slightly lower max seems far more realistic to me.

That said, the period/bandwidth thing is now slightly odd, as we really
only care about the utilization. But whatever. One thing at a time.

> > But then shouldn've we also rip out the throttle thingy right quick?
> >
>
> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
> for now, because GROUP_SCHED depends on it, no?

Yes. Until we can delete all that code we'll have to keep some of that.

> With the next step on moving the dl server as the base for the
> hierarchical scheduling... That will rip out the
> CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.
>
> Does it make sense?

I'm still not sure how to deal with affinities and deadline servers for
RT.

There's a bunch of issues and I thing we've only got some of them solved.

The semi-partitioned thing (someone was working on that, I think you
know the guy), solves DL 'entities' having affinities.

But the problem of FIFO is that they don't have inherent bandwidth. This
in turn means that any server for FIFO needs to be minimally concurrent,
otherwise you hand out bandwidth to lower priority tasks that the higher
priority task might want etc.. (Andersson's group has papers here).

Specifically, imagine a server with U=1.5 and 3 tasks, a high prio task
that requires .8 a medium prio task that requires .6 and a low prio task
that soaks up whatever it can get its little grubby paws on.

Then with minimal concurreny this works out nicely, high gets .8, mid
gets .6 and low gets the remaining .1.

If OTOH you don't limit concurrency and let them all run concurrently,
you can end up with the situation where they each get .5. Which is
obviously fail.

Add affinities here though and you're up a creek, how do you distribute
utilization between the slices, what slices, etc..

You say given them a
per-cpu cgroup interface, and have them configure it themselves, but
that's a god-aweful thing to ask userspace to do.

Ideally, I'd delete all of FIFO, it's such a horrid trainwreck, a total
and abysmal failure of a model -- thank you POSIX :-(
On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
>
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>
> - fair_server_runtime: set runtime in ns
> - fair_server_period: set period in ns
> - fair_server_defer: on/off for the defer mechanism
>
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> ---

I did the below, that gives us:

/debug/sched/fair_server/cpuX/{runtime,period,defer}

I wanted to also add:

/debug/sched/fair_server/{runtime,period,defer}

to more easily set all CPUs to the same value, but then figured
userspace will just have to loop.

---
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -380,10 +380,10 @@ static ssize_t sched_fair_server_write(s
 			break;
 		}
 
-		if (runtime > period
-		    || period > fair_server_period_max
-		    || period < fair_server_period_min
-		    || zerolax > 1) {
+		if (runtime > period ||
+		    period > fair_server_period_max ||
+		    period < fair_server_period_min ||
+		    zerolax > 1) {
 			cnt = -EINVAL;
 			goto out;
 		}
@@ -515,11 +515,11 @@ static struct dentry *debugfs_sched;
 
 static void debugfs_fair_server_init(void)
 {
-	long cpu;
-	struct dentry *rq_dentry;
+	struct dentry *d_fair;
+	unsigned int cpu;
 
-	rq_dentry = debugfs_create_dir("rq", debugfs_sched);
-	if (!rq_dentry)
+	d_fair = debugfs_create_dir("fair_server", debugfs_sched);
+	if (!d_fair)
 		return;
 
 	for_each_possible_cpu(cpu) {
@@ -527,11 +527,11 @@ static void debugfs_fair_server_init(voi
 		char buf[32];
 
 		snprintf(buf, sizeof(buf), "cpu%ld", cpu);
-		d_cpu = debugfs_create_dir(buf, rq_dentry);
+		d_cpu = debugfs_create_dir(buf, d_fair);
 
-		debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
-		debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
-		debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
+		debugfs_create_file("defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
 	}
 }
On 11/7/23 13:38, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>> - fair_server_runtime: set runtime in ns
>> - fair_server_period: set period in ns
>> - fair_server_defer: on/off for the defer mechanism
>>
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
>> ---
>
> I did the below, that gives us:
>
> /debug/sched/fair_server/cpuX/{runtime,period,defer}

right!

> I wanted to also add:
>
> /debug/sched/fair_server/{runtime,period,defer}
>
> to more easily set all CPUs to the same value, but then figured
> userspace will just have to loop.

I thought about that too... and had the same conclusion... let 'em loop.

-- Daniel
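For what it's worth, a hypothetical sketch of that userspace loop (it assumes the relayouted paths from the diff above, i.e. /sys/kernel/debug/sched/fair_server/cpuN/runtime, and that debugfs is mounted at /sys/kernel/debug), applying the same runtime to every CPU:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_CONF);
	long cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		char path[128];
		FILE *f;

		/* One directory per CPU, as created by debugfs_fair_server_init() above. */
		snprintf(path, sizeof(path),
			 "/sys/kernel/debug/sched/fair_server/cpu%ld/runtime", cpu);
		f = fopen(path, "w");
		if (!f)
			continue;	/* skip CPUs without a directory */
		fprintf(f, "%llu\n", 50000000ULL);	/* 50 ms, in ns */
		fclose(f);
	}
	return 0;
}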
On 11/7/23 09:16, Peter Zijlstra wrote: > On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote: >> On 11/6/23 16:40, Peter Zijlstra wrote: >>> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote: >>>> Add an interface for fair server setup on debugfs. >>>> >>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: >>>> >>>> - fair_server_runtime: set runtime in ns >>>> - fair_server_period: set period in ns >>>> - fair_server_defer: on/off for the defer mechanism >>>> >>> >>> This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the >>> total available bandwidth control, right? >> >> right, but thinking aloud... given that the per-cpu files are already allocating the >> bandwidth on the dl_rq, the spare time for fair scheduler is granted. >> >> Still, we can have them there as a safeguard to not overloading the deadline >> scheduler... (thinking aloud 2) as long as global is a thing... as we get away >> from it, that global limitation will make less sense - still better to have a form >> of limitation so people are aware of bandwidth until there. > > Yeah, so having a limit on the deadline thing seems prudent as a way to > model system overhead. I mean 100% sounds nice, but then all the models > also assume no interrupts, no scheduler or migration overhead etc.. So > setting a slightly lower max seems far more realistic to me. > > That said, the period/bandwidth thing is now slightly odd, as we really > only care about the utilization. But whatever. One thing at a time. Yep, that is why I am mentioning the generalization as a second phase, it is a harder problem... But having the rt throttling out of the default way is already a good step. > >>> But then shouldn've we also rip out the throttle thingy right quick? >>> >> >> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED >> for now, because GROUP_SCHED depends on it, no? > > Yes. Until we can delete all that code we'll have to keep some of that. > >> With the next step on moving the dl server as the base for the >> hierarchical scheduling... That will rip out the >> CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface. >> >> Does it make sense? > > I'm still not sure how to deal with affinities and deadline servers for > RT. > > There's a bunch of issues and I thing we've only got some of them solved. > > The semi-partitioned thing (someone was working on that, I think you > know the guy), solves DL 'entities' having affinities. Yep, then having arbitrari affinities is another step towards mode flexible models... > But the problem of FIFO is that they don't have inherent bandwidth. This > in turn means that any server for FIFO needs to be minimally concurrent, > otherwise you hand out bandwidth to lower priority tasks that the higher > priority task might want etc.. (Andersson's group has papers here). > > Specifically, imagine a server with U=1.5 and 3 tasks, a high prio task > that requires .8 a medium prio task that requires .6 and a low prio task > that soaks up whatever it can get its little grubby paws on. > > Then with minimal concurreny this works out nicely, high gets .8, mid > gets .6 and low gets the remaining .1. > > If OTOH you don't limit concurrency and let them all run concurrently, > you can end up with the situation where they each get .5. Which is > obviously fail. > > Add affinities here though and you're up a creek, how do you distribute > utilization between the slices, what slices, etc.. 
You say given them a > per-cpu cgroup interface, and have them configure it themselves, but > that's a god-aweful thing to ask userspace to do. and yep again... It is definitely a harder topic... but it gets simpler as we do those other moves... > Ideally, I'd delete all of FIFO, it's such a horrid trainwreck, a total > and abysmal failure of a model -- thank you POSIX :-( -- Daniel
On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote: > I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED > for now, because GROUP_SCHED depends on it, no? This builds and boots.. --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9930,8 +9930,6 @@ void __init sched_init(void) #endif /* CONFIG_RT_GROUP_SCHED */ } - init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime()); - #ifdef CONFIG_SMP init_defrootdomain(); #endif @@ -9986,7 +9984,6 @@ void __init sched_init(void) init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL); #endif /* CONFIG_FAIR_GROUP_SCHED */ - rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime; #ifdef CONFIG_RT_GROUP_SCHED init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL); #endif --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1480,6 +1480,7 @@ static void update_curr_dl_se(struct rq if (dl_se == &rq->fair_server) return; +#ifdef CONFIG_RT_GROUP_SCHED /* * Because -- for now -- we share the rt bandwidth, we need to * account our runtime there too, otherwise actual rt tasks @@ -1504,6 +1505,7 @@ static void update_curr_dl_se(struct rq rt_rq->rt_time += delta_exec; raw_spin_unlock(&rt_rq->rt_runtime_lock); } +#endif } void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec) --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -8,10 +8,6 @@ int sched_rr_timeslice = RR_TIMESLICE; /* More than 4 hours if BW_SHIFT equals 20. */ static const u64 max_rt_runtime = MAX_BW; -static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun); - -struct rt_bandwidth def_rt_bandwidth; - /* * period over which we measure -rt task CPU usage in us. * default: 1s @@ -67,6 +63,40 @@ static int __init sched_rt_sysctl_init(v late_initcall(sched_rt_sysctl_init); #endif +void init_rt_rq(struct rt_rq *rt_rq) +{ + struct rt_prio_array *array; + int i; + + array = &rt_rq->active; + for (i = 0; i < MAX_RT_PRIO; i++) { + INIT_LIST_HEAD(array->queue + i); + __clear_bit(i, array->bitmap); + } + /* delimiter for bitsearch: */ + __set_bit(MAX_RT_PRIO, array->bitmap); + +#if defined CONFIG_SMP + rt_rq->highest_prio.curr = MAX_RT_PRIO-1; + rt_rq->highest_prio.next = MAX_RT_PRIO-1; + rt_rq->overloaded = 0; + plist_head_init(&rt_rq->pushable_tasks); +#endif /* CONFIG_SMP */ + /* We start is dequeued state, because no RT tasks are queued */ + rt_rq->rt_queued = 0; + +#ifdef CONFIG_RT_GROUP_SCHED + rt_rq->rt_time = 0; + rt_rq->rt_throttled = 0; + rt_rq->rt_runtime = 0; + raw_spin_lock_init(&rt_rq->rt_runtime_lock); +#endif +} + +#ifdef CONFIG_RT_GROUP_SCHED + +static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun); + static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer) { struct rt_bandwidth *rt_b = @@ -131,35 +161,6 @@ static void start_rt_bandwidth(struct rt do_start_rt_bandwidth(rt_b); } -void init_rt_rq(struct rt_rq *rt_rq) -{ - struct rt_prio_array *array; - int i; - - array = &rt_rq->active; - for (i = 0; i < MAX_RT_PRIO; i++) { - INIT_LIST_HEAD(array->queue + i); - __clear_bit(i, array->bitmap); - } - /* delimiter for bitsearch: */ - __set_bit(MAX_RT_PRIO, array->bitmap); - -#if defined CONFIG_SMP - rt_rq->highest_prio.curr = MAX_RT_PRIO-1; - rt_rq->highest_prio.next = MAX_RT_PRIO-1; - rt_rq->overloaded = 0; - plist_head_init(&rt_rq->pushable_tasks); -#endif /* CONFIG_SMP */ - /* We start is dequeued state, because no RT tasks are queued */ - rt_rq->rt_queued = 0; - - rt_rq->rt_time = 0; - rt_rq->rt_throttled = 0; - 
rt_rq->rt_runtime = 0; - raw_spin_lock_init(&rt_rq->rt_runtime_lock); -} - -#ifdef CONFIG_RT_GROUP_SCHED static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b) { hrtimer_cancel(&rt_b->rt_period_timer); @@ -254,9 +255,6 @@ int alloc_rt_sched_group(struct task_gro if (!tg->rt_se) goto err; - init_rt_bandwidth(&tg->rt_bandwidth, - ktime_to_ns(def_rt_bandwidth.rt_period), 0); - for_each_possible_cpu(i) { rt_rq = kzalloc_node(sizeof(struct rt_rq), GFP_KERNEL, cpu_to_node(i)); @@ -605,70 +603,6 @@ static inline struct rt_bandwidth *sched return &rt_rq->tg->rt_bandwidth; } -#else /* !CONFIG_RT_GROUP_SCHED */ - -static inline u64 sched_rt_runtime(struct rt_rq *rt_rq) -{ - return rt_rq->rt_runtime; -} - -static inline u64 sched_rt_period(struct rt_rq *rt_rq) -{ - return ktime_to_ns(def_rt_bandwidth.rt_period); -} - -typedef struct rt_rq *rt_rq_iter_t; - -#define for_each_rt_rq(rt_rq, iter, rq) \ - for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL) - -#define for_each_sched_rt_entity(rt_se) \ - for (; rt_se; rt_se = NULL) - -static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se) -{ - return NULL; -} - -static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq) -{ - struct rq *rq = rq_of_rt_rq(rt_rq); - - if (!rt_rq->rt_nr_running) - return; - - enqueue_top_rt_rq(rt_rq); - resched_curr(rq); -} - -static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq) -{ - dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running); -} - -static inline int rt_rq_throttled(struct rt_rq *rt_rq) -{ - return rt_rq->rt_throttled; -} - -static inline const struct cpumask *sched_rt_period_mask(void) -{ - return cpu_online_mask; -} - -static inline -struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu) -{ - return &cpu_rq(cpu)->rt; -} - -static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq) -{ - return &def_rt_bandwidth; -} - -#endif /* CONFIG_RT_GROUP_SCHED */ - bool sched_rt_bandwidth_account(struct rt_rq *rt_rq) { struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq); @@ -860,7 +794,7 @@ static int do_sched_rt_period_timer(stru const struct cpumask *span; span = sched_rt_period_mask(); -#ifdef CONFIG_RT_GROUP_SCHED + /* * FIXME: isolated CPUs should really leave the root task group, * whether they are isolcpus or were isolated via cpusets, lest @@ -872,7 +806,7 @@ static int do_sched_rt_period_timer(stru */ if (rt_b == &root_task_group.rt_bandwidth) span = cpu_online_mask; -#endif + for_each_cpu(i, span) { int enqueue = 0; struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i); @@ -939,18 +873,6 @@ static int do_sched_rt_period_timer(stru return idle; } -static inline int rt_se_prio(struct sched_rt_entity *rt_se) -{ -#ifdef CONFIG_RT_GROUP_SCHED - struct rt_rq *rt_rq = group_rt_rq(rt_se); - - if (rt_rq) - return rt_rq->highest_prio.curr; -#endif - - return rt_task_of(rt_se)->prio; -} - static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq) { u64 runtime = sched_rt_runtime(rt_rq); @@ -994,6 +916,70 @@ static int sched_rt_runtime_exceeded(str return 0; } +#else /* !CONFIG_RT_GROUP_SCHED */ + +typedef struct rt_rq *rt_rq_iter_t; + +#define for_each_rt_rq(rt_rq, iter, rq) \ + for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL) + +#define for_each_sched_rt_entity(rt_se) \ + for (; rt_se; rt_se = NULL) + +static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se) +{ + return NULL; +} + +static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq) +{ + struct rq *rq = rq_of_rt_rq(rt_rq); + + if (!rt_rq->rt_nr_running) + return; + + enqueue_top_rt_rq(rt_rq); 
+ resched_curr(rq); +} + +static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq) +{ + dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running); +} + +static inline int rt_rq_throttled(struct rt_rq *rt_rq) +{ + return false; +} + +static inline const struct cpumask *sched_rt_period_mask(void) +{ + return cpu_online_mask; +} + +static inline +struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu) +{ + return &cpu_rq(cpu)->rt; +} + +static void __enable_runtime(struct rq *rq) { } +static void __disable_runtime(struct rq *rq) { } + +#endif /* CONFIG_RT_GROUP_SCHED */ + +static inline int rt_se_prio(struct sched_rt_entity *rt_se) +{ +#ifdef CONFIG_RT_GROUP_SCHED + struct rt_rq *rt_rq = group_rt_rq(rt_se); + + if (rt_rq) + return rt_rq->highest_prio.curr; +#endif + + return rt_task_of(rt_se)->prio; +} + /* * Update the current task's runtime statistics. Skip current tasks that * are not in our scheduling class. @@ -1001,7 +987,6 @@ static int sched_rt_runtime_exceeded(str static void update_curr_rt(struct rq *rq) { struct task_struct *curr = rq->curr; - struct sched_rt_entity *rt_se = &curr->rt; s64 delta_exec; if (curr->sched_class != &rt_sched_class) @@ -1011,6 +996,9 @@ static void update_curr_rt(struct rq *rq if (unlikely(delta_exec <= 0)) return; +#ifdef CONFIG_RT_GROUP_SCHED + struct sched_rt_entity *rt_se = &curr->rt; + if (!rt_bandwidth_enabled()) return; @@ -1029,6 +1017,7 @@ static void update_curr_rt(struct rq *rq do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq)); } } +#endif } static void @@ -1185,7 +1174,6 @@ dec_rt_group(struct sched_rt_entity *rt_ static void inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) { - start_rt_bandwidth(&def_rt_bandwidth); } static inline @@ -2913,19 +2901,6 @@ int sched_rt_can_attach(struct task_grou #ifdef CONFIG_SYSCTL static int sched_rt_global_constraints(void) { - unsigned long flags; - int i; - - raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags); - for_each_possible_cpu(i) { - struct rt_rq *rt_rq = &cpu_rq(i)->rt; - - raw_spin_lock(&rt_rq->rt_runtime_lock); - rt_rq->rt_runtime = global_rt_runtime(); - raw_spin_unlock(&rt_rq->rt_runtime_lock); - } - raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags); - return 0; } #endif /* CONFIG_SYSCTL */ @@ -2945,12 +2920,6 @@ static int sched_rt_global_validate(void static void sched_rt_do_global(void) { - unsigned long flags; - - raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags); - def_rt_bandwidth.rt_runtime = global_rt_runtime(); - def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period()); - raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags); } static int sched_rt_handler(struct ctl_table *table, int write, void *buffer, --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -698,13 +698,13 @@ struct rt_rq { #endif /* CONFIG_SMP */ int rt_queued; +#ifdef CONFIG_RT_GROUP_SCHED int rt_throttled; u64 rt_time; u64 rt_runtime; /* Nests inside the rq lock: */ raw_spinlock_t rt_runtime_lock; -#ifdef CONFIG_RT_GROUP_SCHED unsigned int rt_nr_boosted; struct rq *rq; @@ -2460,7 +2460,6 @@ extern void reweight_task(struct task_st extern void resched_curr(struct rq *rq); extern void resched_cpu(int cpu); -extern struct rt_bandwidth def_rt_bandwidth; extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime); extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote: > Add an interface for fair server setup on debugfs. > > Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: > > - fair_server_runtime: set runtime in ns > - fair_server_period: set period in ns > - fair_server_defer: on/off for the defer mechanism > > Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Hi Daniel, Peter, I am writing on behalf of the ChromeOS scheduler team. We had to revert the last 3 patches in this series because of a syzkaller reported bug, this happens on the sched/more branch in Peter's tree: WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220 place_entity+0x240/0x290 kernel/sched/fair.c:5147 Call Trace: <TASK> enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283 enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717 enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117 activate_task+0x60/0xc0 kernel/sched/core.c:2147 ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794 ttwu_queue kernel/sched/core.c:4047 [inline] try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368 kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142 __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800 queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986 queue_delayed_work include/linux/workqueue.h:577 [inline] srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline] which is basically this warning in place_entity: if (WARN_ON_ONCE(!load)) load = 1; Full log (scroll to the bottom as there is console/lockdep side effects which are likely not relevant to this issue): https://paste.debian.net/1304579/ Side note, we are also looking into a KASAN nullptr deref but this happens only on our backport of the patches to a 5.15 kernel, as far as we know. KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445 [...] Call Trace: <TASK> set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728 set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943 set_next_task kernel/sched/sched.h:2241 [inline] pick_next_task kernel/sched/core.c:6014 [inline] __schedule+0x36fb/0x402d kernel/sched/core.c:6378 preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590 preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615 Full splat: https://paste.debian.net/1304573/ Investigation is on going but could you also please take a look at these? It is hard to reproduce and only syzbot has luck reproducing these. Also I had a comment below: > +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init) > +{ > + u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime); > + u64 new_bw = to_ratio(period, runtime); > + struct rq *rq = dl_se->rq; > + int cpu = cpu_of(rq); > + struct dl_bw *dl_b; > + unsigned long cap; > + int retval = 0; > + int cpus; > + > + dl_b = dl_bw_of(cpu); > + raw_spin_lock(&dl_b->lock); > + cpus = dl_bw_cpus(cpu); > + cap = dl_bw_capacity(cpu); > + > + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) { The dl_overflow() call here seems introducing an issue with our conceptual understanding of how the dl server is supposed to work. Suppose we have a 4 CPU system. Also suppose RT throttling is disabled. 
Suppose the DL server params are 50ms runtime in 100ms period (basically we want to dedicate 50% of the bandwidth of each CPU to CFS). In such a situation, __dl_overflow() will return an error right? Because total bandwidth will exceed 100% (4 times 50% is 200%). Further, this complicates the setting of the parameters since it means we have to check the number of CPUs in advance and then set the parameters to prevent dl_overflow(). As an example of this, 30% (runtime / period) for each CPU will work fine if we have 2 CPUs. But if we have 4 CPUs, it will not work because __dl_overflow() will fail. How do you suggest we remedy this? Can we make the dlserver calculate how much bandwidth is allowed on a per-CPU basis? My understanding is the fake dl_se are all pinned to their respective CPUs, so we don't have the same requirement as real DL tasks which may migrate freely within the root domain. thanks, - Joel
On 1/19/24 02:55, Joel Fernandes wrote: > On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote: >> Add an interface for fair server setup on debugfs. >> >> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: >> >> - fair_server_runtime: set runtime in ns >> - fair_server_period: set period in ns >> - fair_server_defer: on/off for the defer mechanism >> >> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> > > Hi Daniel, Peter, > I am writing on behalf of the ChromeOS scheduler team. > > We had to revert the last 3 patches in this series because of a syzkaller > reported bug, this happens on the sched/more branch in Peter's tree: > > WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220 > place_entity+0x240/0x290 kernel/sched/fair.c:5147 > Call Trace: > <TASK> > enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283 > enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717 > enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117 > activate_task+0x60/0xc0 kernel/sched/core.c:2147 > ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794 > ttwu_queue kernel/sched/core.c:4047 [inline] > try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368 > kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142 > __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800 > queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986 > queue_delayed_work include/linux/workqueue.h:577 [inline] > srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline] > > which is basically this warning in place_entity: > if (WARN_ON_ONCE(!load)) > load = 1; > > Full log (scroll to the bottom as there is console/lockdep side effects which > are likely not relevant to this issue): https://paste.debian.net/1304579/ > > Side note, we are also looking into a KASAN nullptr deref but this happens > only on our backport of the patches to a 5.15 kernel, as far as we know. > > KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] > CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...] > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 > RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] > RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445 > [...] > Call Trace: > <TASK> > set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728 > set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943 > set_next_task kernel/sched/sched.h:2241 [inline] > pick_next_task kernel/sched/core.c:6014 [inline] > __schedule+0x36fb/0x402d kernel/sched/core.c:6378 > preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590 > preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615 > > Full splat: https://paste.debian.net/1304573/ Interesting, does it keep any task hung? I am having a case where I see a hung task, but I do not get the splat because the system freezes (printk with rq_lock I guess)... It might be the same problem. > Investigation is on going but could you also please take a look at these? It > is hard to reproduce and only syzbot has luck reproducing these. > > Also I had a comment below: > >> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init) >> +{ >> + u64 old_bw = init ? 
0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime); >> + u64 new_bw = to_ratio(period, runtime); >> + struct rq *rq = dl_se->rq; >> + int cpu = cpu_of(rq); >> + struct dl_bw *dl_b; >> + unsigned long cap; >> + int retval = 0; >> + int cpus; >> + >> + dl_b = dl_bw_of(cpu); >> + raw_spin_lock(&dl_b->lock); >> + cpus = dl_bw_cpus(cpu); >> + cap = dl_bw_capacity(cpu); >> + >> + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) { > > The dl_overflow() call here seems introducing an issue with our conceptual > understanding of how the dl server is supposed to work. > > Suppose we have a 4 CPU system. Also suppose RT throttling is disabled. > Suppose the DL server params are 50ms runtime in 100ms period (basically we > want to dedicate 50% of the bandwidth of each CPU to CFS). > > In such a situation, __dl_overflow() will return an error right? Because > total bandwidth will exceed 100% (4 times 50% is 200%). I might be missing something in your case, but, it accepts: root@fedora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \; 1 1000000000 500000000 1 1000000000 500000000 1 1000000000 500000000 1 1000000000 500000000 your system accepts 400%... the percentage is "global". is it failing in your system? -- Daniel
Hi Daniel, On 1/22/2024 9:14 AM, Daniel Bristot de Oliveira wrote: > On 1/19/24 02:55, Joel Fernandes wrote: >> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote: >>> Add an interface for fair server setup on debugfs. >>> >>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: >>> >>> - fair_server_runtime: set runtime in ns >>> - fair_server_period: set period in ns >>> - fair_server_defer: on/off for the defer mechanism >>> >>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> >> >> Hi Daniel, Peter, >> I am writing on behalf of the ChromeOS scheduler team. >> >> We had to revert the last 3 patches in this series because of a syzkaller >> reported bug, this happens on the sched/more branch in Peter's tree: >> >> WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220 >> place_entity+0x240/0x290 kernel/sched/fair.c:5147 >> Call Trace: >> <TASK> >> enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283 >> enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717 >> enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117 >> activate_task+0x60/0xc0 kernel/sched/core.c:2147 >> ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794 >> ttwu_queue kernel/sched/core.c:4047 [inline] >> try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368 >> kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142 >> __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800 >> queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986 >> queue_delayed_work include/linux/workqueue.h:577 [inline] >> srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline] >> >> which is basically this warning in place_entity: >> if (WARN_ON_ONCE(!load)) >> load = 1; >> >> Full log (scroll to the bottom as there is console/lockdep side effects which >> are likely not relevant to this issue): https://paste.debian.net/1304579/ >> >> Side note, we are also looking into a KASAN nullptr deref but this happens >> only on our backport of the patches to a 5.15 kernel, as far as we know. >> >> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] >> CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...] >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 >> RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] >> RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445 >> [...] >> Call Trace: >> <TASK> >> set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728 >> set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943 >> set_next_task kernel/sched/sched.h:2241 [inline] >> pick_next_task kernel/sched/core.c:6014 [inline] >> __schedule+0x36fb/0x402d kernel/sched/core.c:6378 >> preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590 >> preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615 >> >> Full splat: https://paste.debian.net/1304573/ > > Interesting, does it keep any task hung? I am having a case where I see > a hung task, but I do not get the splat because the system freezes (printk > with rq_lock I guess)... > > It might be the same problem. Ah, we have an update. Suleiman found this is happening because of core scheduling's pick logic. I have some patches to fix it, there's also more fixes we have on other issues. Will coordinate with the team to send these out soon. We are currently testing them more. >> Investigation is on going but could you also please take a look at these? It >> is hard to reproduce and only syzbot has luck reproducing these. 
>> >> Also I had a comment below: >> >>> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init) >>> +{ >>> + u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime); >>> + u64 new_bw = to_ratio(period, runtime); >>> + struct rq *rq = dl_se->rq; >>> + int cpu = cpu_of(rq); >>> + struct dl_bw *dl_b; >>> + unsigned long cap; >>> + int retval = 0; >>> + int cpus; >>> + >>> + dl_b = dl_bw_of(cpu); >>> + raw_spin_lock(&dl_b->lock); >>> + cpus = dl_bw_cpus(cpu); >>> + cap = dl_bw_capacity(cpu); >>> + >>> + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) { >> >> The dl_overflow() call here seems introducing an issue with our conceptual >> understanding of how the dl server is supposed to work. >> >> Suppose we have a 4 CPU system. Also suppose RT throttling is disabled. >> Suppose the DL server params are 50ms runtime in 100ms period (basically we >> want to dedicate 50% of the bandwidth of each CPU to CFS). >> >> In such a situation, __dl_overflow() will return an error right? Because >> total bandwidth will exceed 100% (4 times 50% is 200%). > > I might be missing something in your case, but, it accepts: > > root@fedora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \; > 1 > 1000000000 > 500000000 > 1 > 1000000000 > 500000000 > 1 > 1000000000 > 500000000 > 1 > 1000000000 > 500000000 > > your system accepts 400%... the percentage is "global". > > is it failing in your system? You are right, I was actually trying to change it manually in my kernel in dl_server_start(). In this case dlserver_apply_server_params() gets init=1 and old_bw is 0. I tried using the debugfs, and that works. So I think we will just use the debugfs. I was being lazy and setting it in my kernel manually for testing like this: @@ -1475,7 +1475,7 @@ void dl_server_start(struct sched_dl_entity *dl_se) * this before getting generic. */ if (!dl_server(dl_se)) { u64 runtime = 12 * NSEC_PER_MSEC; u64 period = 15 * NSEC_PER_MSEC; That doesn't work but I tried debugfs and it works. But for production, we will set it from userspace so it should not be an issue. I feel so much better now :) Thanks Daniel. By the way, what's the plan on remaining patches in sched/more branch, are you planning to resend those later? If so, we can just post our fixes on top of that, and if you don't mind you could include it in your next series posting (sched/more + our fixes + your fixes). Thanks! - Joel
On 1/22/2024 9:14 AM, Daniel Bristot de Oliveira wrote: > Interesting, does it keep any task hung? I am having a case where I see > a hung task, but I do not get the splat because the system freezes (printk > with rq_lock I guess)... I missed replying to this part of your email. We see it as a syzkaller report with splats, so it is not clear if a task was hung at the time of a splat. syzkaller runs in its own separate VM instance. The fun part is though, I found a way to do ftrace dump on it without having to pass any boot parameters (that's another issue that we can't pass it boot parameters). So we have that tool in our debug arsenal for these issues, thankfully. > It might be the same problem. True, possibly. thanks, - Joel
On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote: > Add an interface for fair server setup on debugfs. > > Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: > > - fair_server_runtime: set runtime in ns > - fair_server_period: set period in ns > - fair_server_defer: on/off for the defer mechanism Btw Daniel, there is an interesting side-effect of this interface having runtime and period in 2 separate files :) Say I want to set a CPU to 5ms / 10ms. I cannot set either period or runtime to 5ms or 10ms directly. I have to first set period to 100ms, then set runtime to 50ms, then set period to 50ms, then set runtime to 5ms, then finally set period to 10ms. The reason seems to be because otherwise runtime / period will not be accomodated and will cause dl_overflow issues. I'd suggest providing both runtime and period in the same interface to make it more easier to use. However, for the testing I am going with what we have. Also a request: I was wondering if a new version of the last 3 patches could be posted to LKML or shared in a tree somewhere. I am trying to sync to mainline and rebase our latest fixes on top of that, however it is difficult to do because these 3 patches are in bit of a flux (example the discussion between you and Peter about update_curr()). What's the best way to move forward with rebasing our fix contributions? I am going with the sched/more in Peter's queue.git unless you/Peter prefer something else. And I added your update_curr() suggestion onto that, let me know if you disagree with it: @@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq) if (entity_is_task(curr)) update_curr_task(task_of(curr), delta_exec); + else + dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec); account_cfs_rq_runtime(cfs_rq, delta_exec); } thanks, - Joel > > Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> > --- > kernel/sched/deadline.c | 89 +++++++++++++++--- > kernel/sched/debug.c | 202 ++++++++++++++++++++++++++++++++++++++++ > kernel/sched/fair.c | 6 -- > kernel/sched/sched.h | 2 + > 4 files changed, 279 insertions(+), 20 deletions(-) > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index 69ee1fbd60e4..1092ca8892e0 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > @@ -321,19 +321,12 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq) > __sub_running_bw(dl_se->dl_bw, dl_rq); > } > > -static void dl_change_utilization(struct task_struct *p, u64 new_bw) > +static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw) > { > - struct rq *rq; > - > - WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV); > - > - if (task_on_rq_queued(p)) > - return; > + if (dl_se->dl_non_contending) { > + sub_running_bw(dl_se, &rq->dl); > + dl_se->dl_non_contending = 0; > > - rq = task_rq(p); > - if (p->dl.dl_non_contending) { > - sub_running_bw(&p->dl, &rq->dl); > - p->dl.dl_non_contending = 0; > /* > * If the timer handler is currently running and the > * timer cannot be canceled, inactive_task_timer() > @@ -341,13 +334,25 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw) > * will not touch the rq's active utilization, > * so we are still safe. 
> */ > - if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1) > - put_task_struct(p); > + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) { > + if (!dl_server(dl_se)) > + put_task_struct(dl_task_of(dl_se)); > + } > } > - __sub_rq_bw(p->dl.dl_bw, &rq->dl); > + __sub_rq_bw(dl_se->dl_bw, &rq->dl); > __add_rq_bw(new_bw, &rq->dl); > } > > +static void dl_change_utilization(struct task_struct *p, u64 new_bw) > +{ > + WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV); > + > + if (task_on_rq_queued(p)) > + return; > + > + dl_rq_change_utilization(task_rq(p), &p->dl, new_bw); > +} > + > static void __dl_clear_params(struct sched_dl_entity *dl_se); > > /* > @@ -1508,10 +1513,22 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec) > > void dl_server_start(struct sched_dl_entity *dl_se) > { > + /* > + * XXX: the apply do not work fine at the init phase for the > + * fair server because things are not yet set. We need to improve > + * this before getting generic. > + */ > if (!dl_server(dl_se)) { > + u64 runtime = 50 * NSEC_PER_MSEC; > + u64 period = 1000 * NSEC_PER_MSEC; > + > + dl_server_apply_params(dl_se, runtime, period, 1); > + > + dl_se->dl_zerolax = 1; > dl_se->dl_server = 1; > setup_new_dl_entity(dl_se); > } > + > enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP); > } > > @@ -1532,6 +1549,50 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq, > dl_se->server_pick = pick; > } > > +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init) > +{ > + u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime); > + u64 new_bw = to_ratio(period, runtime); > + struct rq *rq = dl_se->rq; > + int cpu = cpu_of(rq); > + struct dl_bw *dl_b; > + unsigned long cap; > + int retval = 0; > + int cpus; > + > + dl_b = dl_bw_of(cpu); > + raw_spin_lock(&dl_b->lock); > + cpus = dl_bw_cpus(cpu); > + cap = dl_bw_capacity(cpu); > + > + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) { > + retval = -EBUSY; > + goto out; > + } > + > + if (init) { > + __add_rq_bw(new_bw, &rq->dl); > + __dl_add(dl_b, new_bw, cpus); > + } else { > + __dl_sub(dl_b, dl_se->dl_bw, cpus); > + __dl_add(dl_b, new_bw, cpus); > + > + dl_rq_change_utilization(rq, dl_se, new_bw); > + } > + > + rq->fair_server.dl_runtime = runtime; > + rq->fair_server.dl_deadline = period; > + rq->fair_server.dl_period = period; > + > + dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime); > + dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime); > + > +out: > + raw_spin_unlock(&dl_b->lock); > + > + return retval; > +} > + > /* > * Update the current task's runtime statistics (provided it is still > * a -deadline task and has not been removed from the dl_rq). 
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c > index 4580a450700e..bd7ad6b8d3de 100644 > --- a/kernel/sched/debug.c > +++ b/kernel/sched/debug.c > @@ -333,8 +333,208 @@ static const struct file_operations sched_debug_fops = { > .release = seq_release, > }; > > +enum dl_param { > + DL_RUNTIME = 0, > + DL_PERIOD, > + DL_ZEROLAX > +}; > + > +static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */ > +static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */ > + > +static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf, > + size_t cnt, loff_t *ppos, enum dl_param param) > +{ > + long cpu = (long) ((struct seq_file *) filp->private_data)->private; > + u64 runtime, period, zerolax; > + struct rq *rq = cpu_rq(cpu); > + size_t err; > + int retval; > + u64 value; > + > + err = kstrtoull_from_user(ubuf, cnt, 10, &value); > + if (err) > + return err; > + > + scoped_guard (rq_lock_irqsave, rq) { > + > + runtime = rq->fair_server.dl_runtime; > + period = rq->fair_server.dl_period; > + zerolax = rq->fair_server.dl_zerolax; > + > + switch (param) { > + case DL_RUNTIME: > + if (runtime == value) > + goto out; > + runtime = value; > + break; > + case DL_PERIOD: > + if (value == period) > + goto out; > + period = value; > + break; > + case DL_ZEROLAX: > + if (zerolax == value) > + goto out; > + zerolax = value; > + break; > + } > + > + if (runtime > period > + || period > fair_server_period_max > + || period < fair_server_period_min > + || zerolax > 1) { > + cnt = -EINVAL; > + goto out; > + } > + > + if (rq->cfs.h_nr_running) { > + update_rq_clock(rq); > + dl_server_stop(&rq->fair_server); > + } > + > + /* > + * The zerolax does not change utilization, so just > + * setting it is enough. 
> + */ > + if (rq->fair_server.dl_zerolax != zerolax) { > + rq->fair_server.dl_zerolax = zerolax; > + } else { > + retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0); > + if (retval) > + cnt = retval; > + } > + > + if (rq->cfs.h_nr_running) > + dl_server_start(&rq->fair_server); > + } > + > +out: > + *ppos += cnt; > + return cnt; > +} > + > +static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param) > +{ > + unsigned long cpu = (unsigned long) m->private; > + struct rq *rq = cpu_rq(cpu); > + u64 value; > + > + switch (param) { > + case DL_RUNTIME: > + value = rq->fair_server.dl_runtime; > + break; > + case DL_PERIOD: > + value = rq->fair_server.dl_period; > + break; > + case DL_ZEROLAX: > + value = rq->fair_server.dl_zerolax; > + } > + > + seq_printf(m, "%llu\n", value); > + return 0; > + > +} > + > +static ssize_t > +sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf, > + size_t cnt, loff_t *ppos) > +{ > + return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME); > +} > + > +static int sched_fair_server_runtime_show(struct seq_file *m, void *v) > +{ > + return sched_fair_server_show(m, v, DL_RUNTIME); > +} > + > +static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp) > +{ > + return single_open(filp, sched_fair_server_runtime_show, inode->i_private); > +} > + > +static const struct file_operations fair_server_runtime_fops = { > + .open = sched_fair_server_runtime_open, > + .write = sched_fair_server_runtime_write, > + .read = seq_read, > + .llseek = seq_lseek, > + .release = single_release, > +}; > + > +static ssize_t > +sched_fair_server_period_write(struct file *filp, const char __user *ubuf, > + size_t cnt, loff_t *ppos) > +{ > + return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD); > +} > + > +static int sched_fair_server_period_show(struct seq_file *m, void *v) > +{ > + return sched_fair_server_show(m, v, DL_PERIOD); > +} > + > +static int sched_fair_server_period_open(struct inode *inode, struct file *filp) > +{ > + return single_open(filp, sched_fair_server_period_show, inode->i_private); > +} > + > +static const struct file_operations fair_server_period_fops = { > + .open = sched_fair_server_period_open, > + .write = sched_fair_server_period_write, > + .read = seq_read, > + .llseek = seq_lseek, > + .release = single_release, > +}; > + > +static ssize_t > +sched_fair_server_defer_write(struct file *filp, const char __user *ubuf, > + size_t cnt, loff_t *ppos) > +{ > + return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_ZEROLAX); > +} > + > +static int sched_fair_server_defer_show(struct seq_file *m, void *v) > +{ > + return sched_fair_server_show(m, v, DL_ZEROLAX); > +} > + > +static int sched_fair_server_defer_open(struct inode *inode, struct file *filp) > +{ > + return single_open(filp, sched_fair_server_defer_show, inode->i_private); > +} > + > +static const struct file_operations fair_server_defer_fops = { > + .open = sched_fair_server_defer_open, > + .write = sched_fair_server_defer_write, > + .read = seq_read, > + .llseek = seq_lseek, > + .release = single_release, > +}; > + > static struct dentry *debugfs_sched; > > +static void debugfs_fair_server_init(void) > +{ > + long cpu; > + struct dentry *rq_dentry; > + > + rq_dentry = debugfs_create_dir("rq", debugfs_sched); > + if (!rq_dentry) > + return; > + > + for_each_possible_cpu(cpu) { > + struct dentry *d_cpu; > + char buf[32]; > + > + snprintf(buf, sizeof(buf), "cpu%ld", cpu); > + d_cpu = 
debugfs_create_dir(buf, rq_dentry); > + > + debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops); > + debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops); > + debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops); > + } > +} > + > static __init int sched_init_debug(void) > { > struct dentry __maybe_unused *numa; > @@ -374,6 +574,8 @@ static __init int sched_init_debug(void) > > debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops); > > + debugfs_fair_server_init(); > + > return 0; > } > late_initcall(sched_init_debug); > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 399237cd9f59..5434c52f470d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -8419,12 +8419,6 @@ void fair_server_init(struct rq *rq) > struct sched_dl_entity *dl_se = &rq->fair_server; > > init_dl_entity(dl_se); > - > - dl_se->dl_runtime = 50 * NSEC_PER_MSEC; > - dl_se->dl_deadline = 1000 * NSEC_PER_MSEC; > - dl_se->dl_period = 1000 * NSEC_PER_MSEC; > - dl_se->dl_zerolax = 1; > - > dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick); > } > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index ec0e288c8e06..312b31df5860 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -341,6 +341,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq, > dl_server_pick_f pick); > > extern void fair_server_init(struct rq *); > +extern int dl_server_apply_params(struct sched_dl_entity *dl_se, > + u64 runtime, u64 period, bool init); > > #ifdef CONFIG_CGROUP_SCHED >
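To make Joel's suggestion of a single combined interface concrete, here is a rough, untested sketch of what such a handler could look like. It reuses helpers and limits from the patch above (dl_server_apply_params(), dl_server_stop()/dl_server_start(), fair_server_period_min/max, the scoped rq lock guard); the name sched_fair_server_params_write() and the "<runtime> <period>" input format are hypothetical and not part of the series:

------------ %< -----------
/*
 * Hypothetical combined debugfs handler: one write of "<runtime> <period>"
 * (both in ns) applies the two parameters under a single admission check,
 * so no intermediate configuration has to pass the overflow test on its own.
 */
static ssize_t
sched_fair_server_params_write(struct file *filp, const char __user *ubuf,
			       size_t cnt, loff_t *ppos)
{
	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
	struct rq *rq = cpu_rq(cpu);
	u64 runtime, period;
	char buf[64];
	int retval;

	if (cnt >= sizeof(buf))
		return -EINVAL;
	if (copy_from_user(buf, ubuf, cnt))
		return -EFAULT;
	buf[cnt] = '\0';

	if (sscanf(buf, "%llu %llu", &runtime, &period) != 2)
		return -EINVAL;

	if (runtime > period ||
	    period > fair_server_period_max ||
	    period < fair_server_period_min)
		return -EINVAL;

	scoped_guard (rq_lock_irqsave, rq) {
		if (rq->cfs.h_nr_running) {
			update_rq_clock(rq);
			dl_server_stop(&rq->fair_server);
		}

		/* Validate and apply runtime and period in one step. */
		retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);

		if (rq->cfs.h_nr_running)
			dl_server_start(&rq->fair_server);
	}

	if (retval)
		return retval;

	*ppos += cnt;
	return cnt;
}
------------ >% -----------

Accepting both values in one write lets a single admission-control check validate the final configuration instead of every intermediate step.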
On 2/12/2024 9:13 PM, Joel Fernandes wrote: > > > On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote: >> Add an interface for fair server setup on debugfs. >> >> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: >> >> - fair_server_runtime: set runtime in ns >> - fair_server_period: set period in ns >> - fair_server_defer: on/off for the defer mechanism > > Btw Daniel, there is an interesting side-effect of this interface having runtime > and period in 2 separate files :) > > Say I want to set a CPU to 5ms / 10ms. >

Sorry, let me try again: say I want to set 500us / 1ms. Then I have to do the following to get it to work:

# echo 100000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
# echo 5000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
# echo 10000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
# echo 500000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
# echo 1000000 > /sys/kernel/debug/sched/fair_server/cpu0/period

IOW, if I boot and do the following, it fails:

# echo 500000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
bash: echo: write error: Device or resource busy

- Joel
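Each of the writes above is checked on its own by dl_server_apply_params(), so the intermediate (runtime, period) pairs matter, not just the final one (whatever the root cause of the -EBUSY turns out to be; see Daniel's later diagnosis in this thread). A small userspace helper makes the intermediate utilizations concrete. It assumes the boot defaults of 50ms/1s set in dl_server_start() in the patch and mirrors the kernel's to_ratio()/BW_SHIFT fixed-point convention in simplified form (the RUNTIME_INF special case is omitted):

------------ %< -----------
/* gcc -O2 -o fair_server_bw fair_server_bw.c && ./fair_server_bw */
#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20	/* same fixed-point shift the kernel uses for dl bandwidth */
#define BW_UNIT		(1ULL << BW_SHIFT)

/* Simplified mirror of the kernel's to_ratio(): bandwidth as a <<20 fraction. */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	/* (runtime, period) in ns after each write in the sequence above. */
	const struct { const char *step; uint64_t runtime, period; } seq[] = {
		{ "boot default (50ms/1s)", 50000000ULL, 1000000000ULL },
		{ "period  = 100ms       ", 50000000ULL,  100000000ULL },
		{ "runtime = 5ms         ",  5000000ULL,  100000000ULL },
		{ "period  = 10ms        ",  5000000ULL,   10000000ULL },
		{ "runtime = 500us       ",   500000ULL,   10000000ULL },
		{ "period  = 1ms         ",   500000ULL,    1000000ULL },
	};

	for (size_t i = 0; i < sizeof(seq) / sizeof(seq[0]); i++) {
		uint64_t bw = to_ratio(seq[i].period, seq[i].runtime);

		printf("%s -> bw = %6llu / %llu (%4.1f%%)\n", seq[i].step,
		       (unsigned long long)bw, (unsigned long long)BW_UNIT,
		       100.0 * (double)bw / (double)BW_UNIT);
	}
	return 0;
}
------------ >% -----------

Every intermediate pair requests either 5% or 50% of a CPU and keeps runtime <= period, which is what each individual write is validated against.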
On 2/13/24 03:13, Joel Fernandes wrote: > > > On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote: >> Add an interface for fair server setup on debugfs. >> >> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: >> >> - fair_server_runtime: set runtime in ns >> - fair_server_period: set period in ns >> - fair_server_defer: on/off for the defer mechanism > > Btw Daniel, there is an interesting side-effect of this interface having runtime > and period in 2 separate files :) > > Say I want to set a CPU to 5ms / 10ms. > > I cannot set either period or runtime to 5ms or 10ms directly. > > I have to first set period to 100ms, then set runtime to 50ms, then set period > to 50ms, then set runtime to 5ms, then finally set period to 10ms.

Hummm, yeah, I could reproduce that. It seems that it is not even a problem of having two files, but a bug in the logic; I will have a look.

> The reason seems to be because otherwise runtime / period will not be > accomodated and will cause dl_overflow issues. > > I'd suggest providing both runtime and period in the same interface to make it > more easier to use. However, for the testing I am going with what we have. > > Also a request: > > I was wondering if a new version of the last 3 patches could be posted to > LKML or shared in a tree somewhere. I am trying to sync to mainline and > rebase our latest fixes on top of that, however it is difficult to do because > these 3 patches are in bit of a flux (example the discussion between you and > Peter about update_curr()). What's the best way to move forward with rebasing > our fix contributions?

Juri and I chatted about it, and we think it is a good thing to re-send this patch set, including a fix I have for it (to avoid a regression wrt rt throttling), explaining these things in the mailing list so Peter will be able to follow the discussion.

I still need to finish testing and to make a proper cover page with all updates; the latest thing is here (tm):

https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6

It is based on Peter's sched/more. I will probably re-send it today or tomorrow, but at least you can have a look at it. Another reason to send it is to get the regression test machinery running....

> I am going with the sched/more in Peter's queue.git > unless you/Peter prefer something else. > And I added your update_curr() > suggestion onto that, let me know if you disagree with it: > > @@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq) > > if (entity_is_task(curr)) > update_curr_task(task_of(curr), delta_exec); > + else > + dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec); > > account_cfs_rq_runtime(cfs_rq, delta_exec); > }

That part of the code was optimized by Peter during the last round of discussions. It is like this now:

------------ %< -----------
-	if (entity_is_task(curr))
-		update_curr_task(task_of(curr), delta_exec);
+	if (entity_is_task(curr)) {
+		struct task_struct *p = task_of(curr);
+		update_curr_task(p, delta_exec);
+		/*
+		 * Any fair task that runs outside of fair_server should
+		 * account against fair_server such that it can account for
+		 * this time and possibly avoid running this period.
+		 */
+		if (p->dl_server != &rq->fair_server)
+			dl_server_update(&rq->fair_server, delta_exec);
+	}
------------ >% -----------

It is not straightforward to understand... but the idea is: if it is a task, and its server is not the fair server, discount time directly from the fair server. 
This also means that if dl_server is NULL (the server is not enabled), it will discount time from the fair server.

-- Daniel

> thanks, > > - Joel
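For readers following along, the cases that check covers can be spelled out as comments. This is only one reading of the intent of Peter's fragment quoted above, not standalone code (rq, curr and delta_exec come from the surrounding update_curr()):

------------ %< -----------
	if (entity_is_task(curr)) {
		struct task_struct *p = task_of(curr);

		update_curr_task(p, delta_exec);

		/*
		 * 1) p->dl_server == &rq->fair_server: the task is running
		 *    on the fair server's budget, which is consumed through
		 *    the server itself, so it is not debited again here.
		 *
		 * 2) p->dl_server == NULL: the task was picked by CFS
		 *    directly (no server involved); debit the fair server so
		 *    the time still counts against its budget this period.
		 *
		 * 3) p->dl_server points to some other server: same as 2)
		 *    from the fair server's point of view.
		 */
		if (p->dl_server != &rq->fair_server)
			dl_server_update(&rq->fair_server, delta_exec);
	}
------------ >% -----------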
Hello, Daniel,

On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote: > On 2/13/24 03:13, Joel Fernandes wrote: >> >> >> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote: >>> Add an interface for fair server setup on debugfs. >>> >>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: >>> >>> - fair_server_runtime: set runtime in ns >>> - fair_server_period: set period in ns >>> - fair_server_defer: on/off for the defer mechanism >> >> Btw Daniel, there is an interesting side-effect of this interface having runtime >> and period in 2 separate files :) >> >> Say I want to set a CPU to 5ms / 10ms. >> >> I cannot set either period or runtime to 5ms or 10ms directly. >> >> I have to first set period to 100ms, then set runtime to 50ms, then set period >> to 50ms, then set runtime to 5ms, then finally set period to 10ms. > > Hummm yeah I could reproduce that, it seems that it is not even a problem of having > two files, but a bug in the logic, I will have a look.

Thanks for taking a look. My colleague Suleiman hit the issue too. He's not able to set 45ms/50ms, for instance.

Also, a request: could you please CC my colleagues Suleiman and Youssef, who are also working on / reviewing these, on the patches:

Suleiman Souhlal <suleiman@google.com>
Youssef Esmat <youssefesmat@google.com>

>> The reason seems to be because otherwise runtime / period will not be >> accomodated and will cause dl_overflow issues. >> >> I'd suggest providing both runtime and period in the same interface to make it >> more easier to use. However, for the testing I am going with what we have. >> >> Also a request: >> >> I was wondering if a new version of the last 3 patches could be posted to >> LKML or shared in a tree somewhere. I am trying to sync to mainline and >> rebase our latest fixes on top of that, however it is difficult to do because >> these 3 patches are in bit of a flux (example the discussion between you and >> Peter about update_curr()). What's the best way to move forward with rebasing >> our fix contributions? > > Juri and I chat about, and we think it is a good thing to re-send this patch set, > including a fix I have to it (to avoid regression wrt rt throttling), explaining > these things in the mailing list so peter will be able to follow the discussion. > > I still need to finish testing, and to make a proper cover page with all updates, the > latest thing is here (tm): > > https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6 > > It is based on peter's sched/more. I will probably re-send it today or tomorrow, > but at least you can have a look at it. >> Another reason to send it is to get the regression test machinery running....

Sure, looking forward to it. I rebased on the above tree and it applied cleanly. What I'll do is send our patches today (not those in sched/more) after a bit more testing and tweaks.

There are 2 reasons for this:
1. We can get the build robot to do its thing.
2. Our internal system checks whether backported patches were posted upstream to the list.

Hope that sounds good to you and we can start reviewing as well.

> I am going with the sched/more in Peter's queue.git >> unless you/Peter prefer something else. 
And I added your update_curr() >> suggestion onto that, let me know if you disagree with it: >> >> @@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq) >> >> if (entity_is_task(curr)) >> update_curr_task(task_of(curr), delta_exec); >> + else >> + dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec); >> >> account_cfs_rq_runtime(cfs_rq, delta_exec); >> } > > That part of the code was optimized by peter during the last round of discussions. > > It is like this now: > > ------------ %< ----------- > - if (entity_is_task(curr)) > - update_curr_task(task_of(curr), delta_exec); > + if (entity_is_task(curr)) { > + struct task_struct *p = task_of(curr); > + update_curr_task(p, delta_exec); > + /* > + * Any fair task that runs outside of fair_server should > + * account against fair_server such that it can account for > + * this time and possibly avoid running this period. > + */ > + if (p->dl_server != &rq->fair_server) > + dl_server_update(&rq->fair_server, delta_exec); > + } > ------------ >% ----------- > > It is not straightforward to understand... but the ideia is: > > if it is a task, and the server is ! of the fair server, discount time > directly from the fair server. This also means that if dl_server is NULL > (the server is not enabled) it will discount time from the fair server.

Yes, that makes sense. We certainly want to debit the server even when the task is not being picked through DL. I guess Peter's optimization also handles the case where multiple servers are in play. That will help us when/if we make RT a server as well, right?

thanks,

- Joel
On 2/15/24 14:57, Joel Fernandes wrote: > Hello, Daniel, > > On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote: >> On 2/13/24 03:13, Joel Fernandes wrote: >>> >>> >>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote: >>>> Add an interface for fair server setup on debugfs. >>>> >>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: >>>> >>>> - fair_server_runtime: set runtime in ns >>>> - fair_server_period: set period in ns >>>> - fair_server_defer: on/off for the defer mechanism >>> >>> Btw Daniel, there is an interesting side-effect of this interface having runtime >>> and period in 2 separate files :) >>> >>> Say I want to set a CPU to 5ms / 10ms. >>> >>> I cannot set either period or runtime to 5ms or 10ms directly. >>> >>> I have to first set period to 100ms, then set runtime to 50ms, then set period >>> to 50ms, then set runtime to 5ms, then finally set period to 10ms. >> >> Hummm yeah I could reproduce that, it seems that it is not even a problem of having >> two files, but a bug in the logic, I will have a look. > > Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to > not set 45ms/50ms for instance.

I isolated the problem. It is not an interface problem.

Long story short, the servers are initialized at the defrootdomain, but the dl_bw info is not being carried over to the new domain because the servers are not a task.

I am discussing this with Valentin (topology) & Juri. We will try to find a solution, or at least a presentable XXX: solution... in the next days.

You can work around it by disabling the admission control via:

# sysctl kernel.sched_rt_runtime_us=-1

then the values will be accepted. To the best of my knowledge, you guys are only using SCHED_RR/FIFO, so the admission control for DL is not an issue.

>> I still need to finish testing, and to make a proper cover page with all updates, the >> latest thing is here (tm): >> >> https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6 >> >> It is based on peter's sched/more. I will probably re-send it today or tomorrow, >> but at least you can have a look at it. >>> Another reason to send it is to get the regression test machinery running.... > > Sure, looking forward to it. I rebased on above tree and it applied cleanly. > What I'll do is I will send our patches today (not those in sched/more) after a > bit more testing and tweaks. > > There are 2 reasons for this: > 1. Can get the build robot do its thing. > 2. Our internal system checks whether patches backported were posted upstream to > list. > > Hope that sounds good to you and we can start reviewing as well.

If it helps downstream for you guys, it is not a problem for me. Still, Peter is the person who has more comments to give, so...

-- Daniel
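For context on why that sysctl is an effective workaround: dl_server_apply_params() in the patch returns -EBUSY when __dl_overflow() reports an overflow, and the existing check (paraphrased below from the kernel sources of this era; worth double-checking against the exact tree, it is not part of this patch) is short-circuited entirely once the global deadline bandwidth is set to infinity:

------------ %< -----------
/*
 * Existing helper from kernel/sched/sched.h (paraphrased): the admission
 * check that dl_server_apply_params() relies on.
 */
static inline bool
__dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
{
	/*
	 * dl_b->bw is set to -1 when global_rt_runtime() == RUNTIME_INF,
	 * i.e. after "sysctl kernel.sched_rt_runtime_us=-1", so the whole
	 * check is skipped and the fair server writes can no longer fail
	 * with -EBUSY here.
	 */
	return dl_b->bw != -1 &&
	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}
------------ >% -----------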
On 2/15/2024 12:27 PM, Daniel Bristot de Oliveira wrote: > On 2/15/24 14:57, Joel Fernandes wrote: >> Hello, Daniel, >> >> On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote: >>> On 2/13/24 03:13, Joel Fernandes wrote: >>>> >>>> >>>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote: >>>>> Add an interface for fair server setup on debugfs. >>>>> >>>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}: >>>>> >>>>> - fair_server_runtime: set runtime in ns >>>>> - fair_server_period: set period in ns >>>>> - fair_server_defer: on/off for the defer mechanism >>>> >>>> Btw Daniel, there is an interesting side-effect of this interface having runtime >>>> and period in 2 separate files :) >>>> >>>> Say I want to set a CPU to 5ms / 10ms. >>>> >>>> I cannot set either period or runtime to 5ms or 10ms directly. >>>> >>>> I have to first set period to 100ms, then set runtime to 50ms, then set period >>>> to 50ms, then set runtime to 5ms, then finally set period to 10ms. >>> >>> Hummm yeah I could reproduce that, it seems that it is not even a problem of having >>> two files, but a bug in the logic, I will have a look. >> >> Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to >> not set 45ms/50ms for instance. > > I isolated the problem. It is not an interface problem. > > Long story short, the servers are initialized at the defrootdomain, but > the dl_bw info is not being carried over to the new domain because the > servers are not a task. Nice work on nailing the issue. > I am discussing this with Valentin (topology) & Juri. We will try to find a > solution, or at least an presentable XXX: solution... in the next days. > > You can work around it by disabling the admission control via: > > # sysctl kernel.sched_rt_runtime_us=-1 > > the the values will be accepted. For the best of my knowledge, you guys are > only using SCHED_RR/FIFO so the admission control for DL is not an issue. That's right, we only use deadline for the server. However, on some devices, schedutil is used and AFAIR its kthread uses SCHED_DEADLINE. I don't anticipate problems related to admission control and that kthread, so I think your proposed workaround sounds good to me. thanks, - Joel
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 69ee1fbd60e4..1092ca8892e0 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -321,19 +321,12 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq) __sub_running_bw(dl_se->dl_bw, dl_rq); } -static void dl_change_utilization(struct task_struct *p, u64 new_bw) +static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw) { - struct rq *rq; - - WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV); - - if (task_on_rq_queued(p)) - return; + if (dl_se->dl_non_contending) { + sub_running_bw(dl_se, &rq->dl); + dl_se->dl_non_contending = 0; - rq = task_rq(p); - if (p->dl.dl_non_contending) { - sub_running_bw(&p->dl, &rq->dl); - p->dl.dl_non_contending = 0; /* * If the timer handler is currently running and the * timer cannot be canceled, inactive_task_timer() @@ -341,13 +334,25 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw) * will not touch the rq's active utilization, * so we are still safe. */ - if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1) - put_task_struct(p); + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) { + if (!dl_server(dl_se)) + put_task_struct(dl_task_of(dl_se)); + } } - __sub_rq_bw(p->dl.dl_bw, &rq->dl); + __sub_rq_bw(dl_se->dl_bw, &rq->dl); __add_rq_bw(new_bw, &rq->dl); } +static void dl_change_utilization(struct task_struct *p, u64 new_bw) +{ + WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV); + + if (task_on_rq_queued(p)) + return; + + dl_rq_change_utilization(task_rq(p), &p->dl, new_bw); +} + static void __dl_clear_params(struct sched_dl_entity *dl_se); /* @@ -1508,10 +1513,22 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec) void dl_server_start(struct sched_dl_entity *dl_se) { + /* + * XXX: the apply do not work fine at the init phase for the + * fair server because things are not yet set. We need to improve + * this before getting generic. + */ if (!dl_server(dl_se)) { + u64 runtime = 50 * NSEC_PER_MSEC; + u64 period = 1000 * NSEC_PER_MSEC; + + dl_server_apply_params(dl_se, runtime, period, 1); + + dl_se->dl_zerolax = 1; dl_se->dl_server = 1; setup_new_dl_entity(dl_se); } + enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP); } @@ -1532,6 +1549,50 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq, dl_se->server_pick = pick; } +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init) +{ + u64 old_bw = init ? 
0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime); + u64 new_bw = to_ratio(period, runtime); + struct rq *rq = dl_se->rq; + int cpu = cpu_of(rq); + struct dl_bw *dl_b; + unsigned long cap; + int retval = 0; + int cpus; + + dl_b = dl_bw_of(cpu); + raw_spin_lock(&dl_b->lock); + cpus = dl_bw_cpus(cpu); + cap = dl_bw_capacity(cpu); + + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) { + retval = -EBUSY; + goto out; + } + + if (init) { + __add_rq_bw(new_bw, &rq->dl); + __dl_add(dl_b, new_bw, cpus); + } else { + __dl_sub(dl_b, dl_se->dl_bw, cpus); + __dl_add(dl_b, new_bw, cpus); + + dl_rq_change_utilization(rq, dl_se, new_bw); + } + + rq->fair_server.dl_runtime = runtime; + rq->fair_server.dl_deadline = period; + rq->fair_server.dl_period = period; + + dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime); + dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime); + +out: + raw_spin_unlock(&dl_b->lock); + + return retval; +} + /* * Update the current task's runtime statistics (provided it is still * a -deadline task and has not been removed from the dl_rq). diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 4580a450700e..bd7ad6b8d3de 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -333,8 +333,208 @@ static const struct file_operations sched_debug_fops = { .release = seq_release, }; +enum dl_param { + DL_RUNTIME = 0, + DL_PERIOD, + DL_ZEROLAX +}; + +static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */ +static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */ + +static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos, enum dl_param param) +{ + long cpu = (long) ((struct seq_file *) filp->private_data)->private; + u64 runtime, period, zerolax; + struct rq *rq = cpu_rq(cpu); + size_t err; + int retval; + u64 value; + + err = kstrtoull_from_user(ubuf, cnt, 10, &value); + if (err) + return err; + + scoped_guard (rq_lock_irqsave, rq) { + + runtime = rq->fair_server.dl_runtime; + period = rq->fair_server.dl_period; + zerolax = rq->fair_server.dl_zerolax; + + switch (param) { + case DL_RUNTIME: + if (runtime == value) + goto out; + runtime = value; + break; + case DL_PERIOD: + if (value == period) + goto out; + period = value; + break; + case DL_ZEROLAX: + if (zerolax == value) + goto out; + zerolax = value; + break; + } + + if (runtime > period + || period > fair_server_period_max + || period < fair_server_period_min + || zerolax > 1) { + cnt = -EINVAL; + goto out; + } + + if (rq->cfs.h_nr_running) { + update_rq_clock(rq); + dl_server_stop(&rq->fair_server); + } + + /* + * The zerolax does not change utilization, so just + * setting it is enough. 
+ */ + if (rq->fair_server.dl_zerolax != zerolax) { + rq->fair_server.dl_zerolax = zerolax; + } else { + retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0); + if (retval) + cnt = retval; + } + + if (rq->cfs.h_nr_running) + dl_server_start(&rq->fair_server); + } + +out: + *ppos += cnt; + return cnt; +} + +static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param) +{ + unsigned long cpu = (unsigned long) m->private; + struct rq *rq = cpu_rq(cpu); + u64 value; + + switch (param) { + case DL_RUNTIME: + value = rq->fair_server.dl_runtime; + break; + case DL_PERIOD: + value = rq->fair_server.dl_period; + break; + case DL_ZEROLAX: + value = rq->fair_server.dl_zerolax; + } + + seq_printf(m, "%llu\n", value); + return 0; + +} + +static ssize_t +sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME); +} + +static int sched_fair_server_runtime_show(struct seq_file *m, void *v) +{ + return sched_fair_server_show(m, v, DL_RUNTIME); +} + +static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_fair_server_runtime_show, inode->i_private); +} + +static const struct file_operations fair_server_runtime_fops = { + .open = sched_fair_server_runtime_open, + .write = sched_fair_server_runtime_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static ssize_t +sched_fair_server_period_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD); +} + +static int sched_fair_server_period_show(struct seq_file *m, void *v) +{ + return sched_fair_server_show(m, v, DL_PERIOD); +} + +static int sched_fair_server_period_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_fair_server_period_show, inode->i_private); +} + +static const struct file_operations fair_server_period_fops = { + .open = sched_fair_server_period_open, + .write = sched_fair_server_period_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static ssize_t +sched_fair_server_defer_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_ZEROLAX); +} + +static int sched_fair_server_defer_show(struct seq_file *m, void *v) +{ + return sched_fair_server_show(m, v, DL_ZEROLAX); +} + +static int sched_fair_server_defer_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_fair_server_defer_show, inode->i_private); +} + +static const struct file_operations fair_server_defer_fops = { + .open = sched_fair_server_defer_open, + .write = sched_fair_server_defer_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + static struct dentry *debugfs_sched; +static void debugfs_fair_server_init(void) +{ + long cpu; + struct dentry *rq_dentry; + + rq_dentry = debugfs_create_dir("rq", debugfs_sched); + if (!rq_dentry) + return; + + for_each_possible_cpu(cpu) { + struct dentry *d_cpu; + char buf[32]; + + snprintf(buf, sizeof(buf), "cpu%ld", cpu); + d_cpu = debugfs_create_dir(buf, rq_dentry); + + debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops); + debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops); + debugfs_create_file("fair_server_defer", 
0644, d_cpu, (void *) cpu, &fair_server_defer_fops); + } +} + static __init int sched_init_debug(void) { struct dentry __maybe_unused *numa; @@ -374,6 +574,8 @@ static __init int sched_init_debug(void) debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops); + debugfs_fair_server_init(); + return 0; } late_initcall(sched_init_debug); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 399237cd9f59..5434c52f470d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8419,12 +8419,6 @@ void fair_server_init(struct rq *rq) struct sched_dl_entity *dl_se = &rq->fair_server; init_dl_entity(dl_se); - - dl_se->dl_runtime = 50 * NSEC_PER_MSEC; - dl_se->dl_deadline = 1000 * NSEC_PER_MSEC; - dl_se->dl_period = 1000 * NSEC_PER_MSEC; - dl_se->dl_zerolax = 1; - dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ec0e288c8e06..312b31df5860 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -341,6 +341,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq, dl_server_pick_f pick); extern void fair_server_init(struct rq *); +extern int dl_server_apply_params(struct sched_dl_entity *dl_se, + u64 runtime, u64 period, bool init); #ifdef CONFIG_CGROUP_SCHED