[v5,7/7] sched/fair: Fair server interface

Message ID 26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot@kernel.org
State New
Series SCHED_DEADLINE server infrastructure

Commit Message

Daniel Bristot de Oliveira Nov. 4, 2023, 10:59 a.m. UTC
  Add an interface for fair server setup on debugfs.

Each rq has three files under /sys/kernel/debug/sched/rq/CPU{ID}:

 - fair_server_runtime: set runtime in ns
 - fair_server_period: set period in ns
 - fair_server_defer: on/off for the defer mechanism
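
For illustration, usage would look roughly like this (CPU 0, arbitrary values;
the per-rq directory follows the "cpu%ld" naming used by the debugfs setup
code, and runtime must stay within the period):

  # read the current settings
  cat /sys/kernel/debug/sched/rq/cpu0/fair_server_runtime
  cat /sys/kernel/debug/sched/rq/cpu0/fair_server_period
  cat /sys/kernel/debug/sched/rq/cpu0/fair_server_defer

  # example: reserve 50ms of every 100ms for fair tasks, deferral enabled
  echo 100000000 > /sys/kernel/debug/sched/rq/cpu0/fair_server_period
  echo  50000000 > /sys/kernel/debug/sched/rq/cpu0/fair_server_runtime
  echo 1         > /sys/kernel/debug/sched/rq/cpu0/fair_server_defer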

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/deadline.c |  89 +++++++++++++++---
 kernel/sched/debug.c    | 202 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |   6 --
 kernel/sched/sched.h    |   2 +
 4 files changed, 279 insertions(+), 20 deletions(-)
  

Comments

kernel test robot Nov. 4, 2023, 3:18 p.m. UTC | #1
Hi Daniel,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linus/master next-20231103]
[cannot apply to tip/auto-latest v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot%40kernel.org
patch subject: [PATCH v5 7/7] sched/fair: Fair server interface
config: i386-buildonly-randconfig-001-20231104 (https://download.01.org/0day-ci/archive/20231104/202311042329.PB1gTIL4-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231104/202311042329.PB1gTIL4-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311042329.PB1gTIL4-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/build_utility.c:72:
>> kernel/sched/debug.c:342:57: warning: integer overflow in expression of type 'long int' results in '-100663296' [-Woverflow]
     342 | static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
         |                                                         ^


vim +342 kernel/sched/debug.c

   341	
 > 342	static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
   343	static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
   344
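
For context: NSEC_PER_USEC is 1000L, and on i386 both int and long are 32 bits,
so (1 << 22) * NSEC_PER_USEC is computed in 32-bit signed arithmetic and wraps
to the -100663296 shown in the warning. A stand-alone sketch of the problem and
one way around it (forcing unsigned/64-bit arithmetic; not necessarily the exact
fix picked up for the patch):

	#include <stdio.h>

	#define NSEC_PER_USEC 1000L

	int main(void)
	{
		/*
		 * (1 << 22) is an int; on an ILP32 target the int * long
		 * product is evaluated in 32-bit signed arithmetic:
		 * 4194304 * 1000 = 4194304000 > LONG_MAX (2147483647),
		 * which wraps to -100663296.
		 */
		long broken = (1 << 22) * NSEC_PER_USEC;

		/* Promoting to 64-bit keeps the intended ~4.19e9 ns value. */
		unsigned long long ok = (1ULL << 22) * NSEC_PER_USEC;

		printf("broken=%ld ok=%llu\n", broken, ok);
		return 0;
	}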
  
kernel test robot Nov. 5, 2023, 12:55 a.m. UTC | #2
Hi Daniel,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linus/master next-20231103]
[cannot apply to tip/auto-latest v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot%40kernel.org
patch subject: [PATCH v5 7/7] sched/fair: Fair server interface
config: i386-randconfig-013-20231105 (https://download.01.org/0day-ci/archive/20231105/202311050844.d45KQ8sK-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231105/202311050844.d45KQ8sK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311050844.d45KQ8sK-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/build_utility.c:72:0:
>> kernel/sched/debug.c:342:57: warning: integer overflow in expression [-Woverflow]
    static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
                                                            ^


vim +342 kernel/sched/debug.c

   341	
 > 342	static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
   343	static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
   344
  
Peter Zijlstra Nov. 6, 2023, 3:40 p.m. UTC | #3
On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
> 
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> 
>  - fair_server_runtime: set runtime in ns
>  - fair_server_period: set period in ns
>  - fair_server_defer: on/off for the defer mechanism
> 

This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
total available bandwidth control, right?

But then shouldn't we also rip out the throttle thingy right quick?
  
Daniel Bristot de Oliveira Nov. 6, 2023, 4:29 p.m. UTC | #4
On 11/6/23 16:40, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
>>
> 
> This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
> total available bandwidth control, right?

right, but thinking aloud... given that the per-cpu files are already allocating the
bandwidth on the dl_rq, the spare time for the fair scheduler is granted.

Still, we can keep them there as a safeguard against overloading the deadline
scheduler... (thinking aloud 2) as long as global is a thing... as we get away
from it, that global limitation will make less sense - still better to have some form
of limitation so people are aware of how much bandwidth is allocated there.

> But then shouldn've we also rip out the throttle thingy right quick?
> 

I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
for now, because GROUP_SCHED depends on it, no?

With the next step of moving the dl server to be the base for hierarchical scheduling...
That will rip out CONFIG_RT_GROUP_SCHED... replacing it with something that has a per-cpu interface.

Does it make sense?
  
Peter Zijlstra Nov. 7, 2023, 8:16 a.m. UTC | #5
On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:
> On 11/6/23 16:40, Peter Zijlstra wrote:
> > On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> >> Add an interface for fair server setup on debugfs.
> >>
> >> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> >>
> >>  - fair_server_runtime: set runtime in ns
> >>  - fair_server_period: set period in ns
> >>  - fair_server_defer: on/off for the defer mechanism
> >>
> > 
> > This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
> > total available bandwidth control, right?
> 
> right, but thinking aloud... given that the per-cpu files are already allocating the
> bandwidth on the dl_rq, the spare time for fair scheduler is granted.
> 
> Still, we can have them there as a safeguard to not overloading the deadline
> scheduler... (thinking aloud 2) as long as global is a thing... as we get away
> from it, that global limitation will make less sense - still better to have a form
> of limitation so people are aware of bandwidth until there.

Yeah, so having a limit on the deadline thing seems prudent as a way to
model system overhead. I mean 100% sounds nice, but then all the models
also assume no interrupts, no scheduler or migration overhead etc.. So
setting a slightly lower max seems far more realistic to me.

That said, the period/bandwidth thing is now slightly odd, as we really
only care about the utilization. But whatever. One thing at a time.

> > But then shouldn've we also rip out the throttle thingy right quick?
> > 
> 
> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
> for now, because GROUP_SCHED depends on it, no?

Yes. Until we can delete all that code we'll have to keep some of that.

> With the next step on moving the dl server as the base for the
> hierarchical scheduling...  That will rip out the
> CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.
> 
> Does it make sense?

I'm still not sure how to deal with affinities and deadline servers for
RT.

There's a bunch of issues and I think we've only got some of them solved.

The semi-partitioned thing (someone was working on that, I think you
know the guy) solves DL 'entities' having affinities.

But the problem of FIFO is that they don't have inherent bandwidth. This
in turn means that any server for FIFO needs to be minimally concurrent,
otherwise you hand out bandwidth to lower priority tasks that the higher
priority task might want etc.. (Andersson's group has papers here).

Specifically, imagine a server with U=1.5 and 3 tasks: a high prio task
that requires .8, a medium prio task that requires .6, and a low prio task
that soaks up whatever it can get its little grubby paws on.

Then with minimal concurrency this works out nicely: high gets .8, mid
gets .6 and low gets the remaining .1.

If OTOH you don't limit concurrency and let them all run concurrently,
you can end up with the situation where they each get .5. Which is
obviously fail.

Add affinities here though and you're up a creek: how do you distribute
utilization between the slices, what slices, etc.. You say give them a
per-cpu cgroup interface, and have them configure it themselves, but
that's a god-awful thing to ask userspace to do.

Ideally, I'd delete all of FIFO, it's such a horrid trainwreck, a total
and abysmal failure of a model -- thank you POSIX :-(
  
Peter Zijlstra Nov. 7, 2023, 12:38 p.m. UTC | #6
On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
> 
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> 
>  - fair_server_runtime: set runtime in ns
>  - fair_server_period: set period in ns
>  - fair_server_defer: on/off for the defer mechanism
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> ---

I did the below, that gives us:

/debug/sched/fair_server/cpuX/{runtime,period,defer}

I wanted to also add:

/debug/sched/fair_server/{runtime,period,defer}

to more easily set all CPUs to the same value, but then figured
userspace will just have to loop.

---

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -380,10 +380,10 @@ static ssize_t sched_fair_server_write(s
 			break;
 		}
 
-		if (runtime > period
-				|| period > fair_server_period_max
-				|| period < fair_server_period_min
-				|| zerolax > 1) {
+		if (runtime > period ||
+		    period > fair_server_period_max ||
+		    period < fair_server_period_min ||
+		    zerolax > 1) {
 			cnt = -EINVAL;
 			goto out;
 		}
@@ -515,11 +515,11 @@ static struct dentry *debugfs_sched;
 
 static void debugfs_fair_server_init(void)
 {
-	long cpu;
-	struct dentry *rq_dentry;
+	struct dentry *d_fair;
+	unsigned int cpu;
 
-	rq_dentry = debugfs_create_dir("rq", debugfs_sched);
-	if (!rq_dentry)
+	d_fair = debugfs_create_dir("fair_server", debugfs_sched);
+	if (!d_fair)
 		return;
 
 	for_each_possible_cpu(cpu) {
@@ -527,11 +527,11 @@ static void debugfs_fair_server_init(voi
 		char buf[32];
 
 		snprintf(buf, sizeof(buf), "cpu%ld", cpu);
-		d_cpu = debugfs_create_dir(buf, rq_dentry);
+		d_cpu = debugfs_create_dir(buf, d_fair);
 
-		debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
-		debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
-		debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
+		debugfs_create_file("defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
 	}
 }
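
For completeness, the per-CPU loop mentioned above is trivial from userspace;
something like this (assuming debugfs is mounted at /sys/kernel/debug, values
are only an example, and each write has to keep runtime within the period so
it passes the admission checks):

	for d in /sys/kernel/debug/sched/fair_server/cpu*/; do
		echo 100000000 > ${d}period	# 100ms period
		echo  50000000 > ${d}runtime	#  50ms runtime
		echo 1         > ${d}defer	# deferred ("zerolax") mode on
	done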
  
Daniel Bristot de Oliveira Nov. 7, 2023, 1:24 p.m. UTC | #7
On 11/7/23 13:38, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
>>
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
>> ---
> 
> I did the below, that gives us:
> 
> /debug/sched/fair_server/cpuX/{runtime,period,defer}

right!

> I wanted to also add:
> 
> /debug/sched/fair_server/{runtime,period,defer}
> 
> to more easily set all CPUs to the same value, but then figured
> userspace will just have to loop.

I thought about that too... and had the same conclusion... let 'em loop.

-- Daniel
  
Daniel Bristot de Oliveira Nov. 7, 2023, 2:06 p.m. UTC | #8
On 11/7/23 09:16, Peter Zijlstra wrote:
> On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:
>> On 11/6/23 16:40, Peter Zijlstra wrote:
>>> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>>>> Add an interface for fair server setup on debugfs.
>>>>
>>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>>
>>>>  - fair_server_runtime: set runtime in ns
>>>>  - fair_server_period: set period in ns
>>>>  - fair_server_defer: on/off for the defer mechanism
>>>>
>>>
>>> This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
>>> total available bandwidth control, right?
>>
>> right, but thinking aloud... given that the per-cpu files are already allocating the
>> bandwidth on the dl_rq, the spare time for fair scheduler is granted.
>>
>> Still, we can have them there as a safeguard to not overloading the deadline
>> scheduler... (thinking aloud 2) as long as global is a thing... as we get away
>> from it, that global limitation will make less sense - still better to have a form
>> of limitation so people are aware of bandwidth until there.
> 
> Yeah, so having a limit on the deadline thing seems prudent as a way to
> model system overhead. I mean 100% sounds nice, but then all the models
> also assume no interrupts, no scheduler or migration overhead etc.. So
> setting a slightly lower max seems far more realistic to me.
> 
> That said, the period/bandwidth thing is now slightly odd, as we really
> only care about the utilization. But whatever. One thing at a time.

Yep, that is why I am mentioning the generalization as a second phase; it is
a harder problem... But getting the rt throttling out of the default path is
already a good step.

> 
>>> But then shouldn've we also rip out the throttle thingy right quick?
>>>
>>
>> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
>> for now, because GROUP_SCHED depends on it, no?
> 
> Yes. Until we can delete all that code we'll have to keep some of that.
> 
>> With the next step on moving the dl server as the base for the
>> hierarchical scheduling...  That will rip out the
>> CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.
>>
>> Does it make sense?
> 
> I'm still not sure how to deal with affinities and deadline servers for
> RT.
> 
> There's a bunch of issues and I thing we've only got some of them solved.
> 
> The semi-partitioned thing (someone was working on that, I think you
> know the guy), solves DL 'entities' having affinities.

Yep, then having arbitrary affinities is another step towards more flexible models...

> But the problem of FIFO is that they don't have inherent bandwidth. This
> in turn means that any server for FIFO needs to be minimally concurrent,
> otherwise you hand out bandwidth to lower priority tasks that the higher
> priority task might want etc.. (Andersson's group has papers here).
> 
> Specifically, imagine a server with U=1.5 and 3 tasks, a high prio task
> that requires .8 a medium prio task that requires .6 and a low prio task
> that soaks up whatever it can get its little grubby paws on.
> 
> Then with minimal concurreny this works out nicely, high gets .8, mid
> gets .6 and low gets the remaining .1.
> 
> If OTOH you don't limit concurrency and let them all run concurrently,
> you can end up with the situation where they each get .5. Which is
> obviously fail.
> 
> Add affinities here though and you're up a creek, how do you distribute
> utilization between the slices, what slices, etc.. You say given them a
> per-cpu cgroup interface, and have them configure it themselves, but
> that's a god-aweful thing to ask userspace to do.

and yep again... It is definitely a harder topic... but it gets simpler as we do
those other moves...

> Ideally, I'd delete all of FIFO, it's such a horrid trainwreck, a total
> and abysmal failure of a model -- thank you POSIX :-(

-- Daniel
  
Peter Zijlstra Nov. 7, 2023, 2:44 p.m. UTC | #9
On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:

> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
> for now, because GROUP_SCHED depends on it, no?

This builds and boots..

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9930,8 +9930,6 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 	}
 
-	init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime());
-
 #ifdef CONFIG_SMP
 	init_defrootdomain();
 #endif
@@ -9986,7 +9984,6 @@ void __init sched_init(void)
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
 #ifdef CONFIG_RT_GROUP_SCHED
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1480,6 +1480,7 @@ static void update_curr_dl_se(struct rq
 	if (dl_se == &rq->fair_server)
 		return;
 
+#ifdef CONFIG_RT_GROUP_SCHED
 	/*
 	 * Because -- for now -- we share the rt bandwidth, we need to
 	 * account our runtime there too, otherwise actual rt tasks
@@ -1504,6 +1505,7 @@ static void update_curr_dl_se(struct rq
 			rt_rq->rt_time += delta_exec;
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+#endif
 }
 
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -8,10 +8,6 @@ int sched_rr_timeslice = RR_TIMESLICE;
 /* More than 4 hours if BW_SHIFT equals 20. */
 static const u64 max_rt_runtime = MAX_BW;
 
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
-
-struct rt_bandwidth def_rt_bandwidth;
-
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
@@ -67,6 +63,40 @@ static int __init sched_rt_sysctl_init(v
 late_initcall(sched_rt_sysctl_init);
 #endif
 
+void init_rt_rq(struct rt_rq *rt_rq)
+{
+	struct rt_prio_array *array;
+	int i;
+
+	array = &rt_rq->active;
+	for (i = 0; i < MAX_RT_PRIO; i++) {
+		INIT_LIST_HEAD(array->queue + i);
+		__clear_bit(i, array->bitmap);
+	}
+	/* delimiter for bitsearch: */
+	__set_bit(MAX_RT_PRIO, array->bitmap);
+
+#if defined CONFIG_SMP
+	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
+	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
+	rt_rq->overloaded = 0;
+	plist_head_init(&rt_rq->pushable_tasks);
+#endif /* CONFIG_SMP */
+	/* We start is dequeued state, because no RT tasks are queued */
+	rt_rq->rt_queued = 0;
+
+#ifdef CONFIG_RT_GROUP_SCHED
+	rt_rq->rt_time = 0;
+	rt_rq->rt_throttled = 0;
+	rt_rq->rt_runtime = 0;
+	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
+#endif
+}
+
+#ifdef CONFIG_RT_GROUP_SCHED
+
+static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
+
 static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
 {
 	struct rt_bandwidth *rt_b =
@@ -131,35 +161,6 @@ static void start_rt_bandwidth(struct rt
 	do_start_rt_bandwidth(rt_b);
 }
 
-void init_rt_rq(struct rt_rq *rt_rq)
-{
-	struct rt_prio_array *array;
-	int i;
-
-	array = &rt_rq->active;
-	for (i = 0; i < MAX_RT_PRIO; i++) {
-		INIT_LIST_HEAD(array->queue + i);
-		__clear_bit(i, array->bitmap);
-	}
-	/* delimiter for bitsearch: */
-	__set_bit(MAX_RT_PRIO, array->bitmap);
-
-#if defined CONFIG_SMP
-	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
-	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
-	rt_rq->overloaded = 0;
-	plist_head_init(&rt_rq->pushable_tasks);
-#endif /* CONFIG_SMP */
-	/* We start is dequeued state, because no RT tasks are queued */
-	rt_rq->rt_queued = 0;
-
-	rt_rq->rt_time = 0;
-	rt_rq->rt_throttled = 0;
-	rt_rq->rt_runtime = 0;
-	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
-}
-
-#ifdef CONFIG_RT_GROUP_SCHED
 static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 {
 	hrtimer_cancel(&rt_b->rt_period_timer);
@@ -254,9 +255,6 @@ int alloc_rt_sched_group(struct task_gro
 	if (!tg->rt_se)
 		goto err;
 
-	init_rt_bandwidth(&tg->rt_bandwidth,
-			ktime_to_ns(def_rt_bandwidth.rt_period), 0);
-
 	for_each_possible_cpu(i) {
 		rt_rq = kzalloc_node(sizeof(struct rt_rq),
 				     GFP_KERNEL, cpu_to_node(i));
@@ -605,70 +603,6 @@ static inline struct rt_bandwidth *sched
 	return &rt_rq->tg->rt_bandwidth;
 }
 
-#else /* !CONFIG_RT_GROUP_SCHED */
-
-static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_runtime;
-}
-
-static inline u64 sched_rt_period(struct rt_rq *rt_rq)
-{
-	return ktime_to_ns(def_rt_bandwidth.rt_period);
-}
-
-typedef struct rt_rq *rt_rq_iter_t;
-
-#define for_each_rt_rq(rt_rq, iter, rq) \
-	for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
-
-#define for_each_sched_rt_entity(rt_se) \
-	for (; rt_se; rt_se = NULL)
-
-static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
-{
-	return NULL;
-}
-
-static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
-{
-	struct rq *rq = rq_of_rt_rq(rt_rq);
-
-	if (!rt_rq->rt_nr_running)
-		return;
-
-	enqueue_top_rt_rq(rt_rq);
-	resched_curr(rq);
-}
-
-static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
-{
-	dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
-}
-
-static inline int rt_rq_throttled(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_throttled;
-}
-
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
-{
-	return &cpu_rq(cpu)->rt;
-}
-
-static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
-{
-	return &def_rt_bandwidth;
-}
-
-#endif /* CONFIG_RT_GROUP_SCHED */
-
 bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
 {
 	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
@@ -860,7 +794,7 @@ static int do_sched_rt_period_timer(stru
 	const struct cpumask *span;
 
 	span = sched_rt_period_mask();
-#ifdef CONFIG_RT_GROUP_SCHED
+
 	/*
 	 * FIXME: isolated CPUs should really leave the root task group,
 	 * whether they are isolcpus or were isolated via cpusets, lest
@@ -872,7 +806,7 @@ static int do_sched_rt_period_timer(stru
 	 */
 	if (rt_b == &root_task_group.rt_bandwidth)
 		span = cpu_online_mask;
-#endif
+
 	for_each_cpu(i, span) {
 		int enqueue = 0;
 		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
@@ -939,18 +873,6 @@ static int do_sched_rt_period_timer(stru
 	return idle;
 }
 
-static inline int rt_se_prio(struct sched_rt_entity *rt_se)
-{
-#ifdef CONFIG_RT_GROUP_SCHED
-	struct rt_rq *rt_rq = group_rt_rq(rt_se);
-
-	if (rt_rq)
-		return rt_rq->highest_prio.curr;
-#endif
-
-	return rt_task_of(rt_se)->prio;
-}
-
 static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 {
 	u64 runtime = sched_rt_runtime(rt_rq);
@@ -994,6 +916,70 @@ static int sched_rt_runtime_exceeded(str
 	return 0;
 }
 
+#else /* !CONFIG_RT_GROUP_SCHED */
+
+typedef struct rt_rq *rt_rq_iter_t;
+
+#define for_each_rt_rq(rt_rq, iter, rq) \
+	for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
+
+#define for_each_sched_rt_entity(rt_se) \
+	for (; rt_se; rt_se = NULL)
+
+static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
+{
+	return NULL;
+}
+
+static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
+{
+	struct rq *rq = rq_of_rt_rq(rt_rq);
+
+	if (!rt_rq->rt_nr_running)
+		return;
+
+	enqueue_top_rt_rq(rt_rq);
+	resched_curr(rq);
+}
+
+static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
+{
+	dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
+}
+
+static inline int rt_rq_throttled(struct rt_rq *rt_rq)
+{
+	return false;
+}
+
+static inline const struct cpumask *sched_rt_period_mask(void)
+{
+	return cpu_online_mask;
+}
+
+static inline
+struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
+{
+	return &cpu_rq(cpu)->rt;
+}
+
+static void __enable_runtime(struct rq *rq) { }
+static void __disable_runtime(struct rq *rq) { }
+
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+static inline int rt_se_prio(struct sched_rt_entity *rt_se)
+{
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct rt_rq *rt_rq = group_rt_rq(rt_se);
+
+	if (rt_rq)
+		return rt_rq->highest_prio.curr;
+#endif
+
+	return rt_task_of(rt_se)->prio;
+}
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -1001,7 +987,6 @@ static int sched_rt_runtime_exceeded(str
 static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
-	struct sched_rt_entity *rt_se = &curr->rt;
 	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
@@ -1011,6 +996,9 @@ static void update_curr_rt(struct rq *rq
 	if (unlikely(delta_exec <= 0))
 		return;
 
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct sched_rt_entity *rt_se = &curr->rt;
+
 	if (!rt_bandwidth_enabled())
 		return;
 
@@ -1029,6 +1017,7 @@ static void update_curr_rt(struct rq *rq
 				do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
 		}
 	}
+#endif
 }
 
 static void
@@ -1185,7 +1174,6 @@ dec_rt_group(struct sched_rt_entity *rt_
 static void
 inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
-	start_rt_bandwidth(&def_rt_bandwidth);
 }
 
 static inline
@@ -2913,19 +2901,6 @@ int sched_rt_can_attach(struct task_grou
 #ifdef CONFIG_SYSCTL
 static int sched_rt_global_constraints(void)
 {
-	unsigned long flags;
-	int i;
-
-	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	for_each_possible_cpu(i) {
-		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
-
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_runtime = global_rt_runtime();
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-	}
-	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
-
 	return 0;
 }
 #endif /* CONFIG_SYSCTL */
@@ -2945,12 +2920,6 @@ static int sched_rt_global_validate(void
 
 static void sched_rt_do_global(void)
 {
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	def_rt_bandwidth.rt_runtime = global_rt_runtime();
-	def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period());
-	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 }
 
 static int sched_rt_handler(struct ctl_table *table, int write, void *buffer,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -698,13 +698,13 @@ struct rt_rq {
 #endif /* CONFIG_SMP */
 	int			rt_queued;
 
+#ifdef CONFIG_RT_GROUP_SCHED
 	int			rt_throttled;
 	u64			rt_time;
 	u64			rt_runtime;
 	/* Nests inside the rq lock: */
 	raw_spinlock_t		rt_runtime_lock;
 
-#ifdef CONFIG_RT_GROUP_SCHED
 	unsigned int		rt_nr_boosted;
 
 	struct rq		*rq;
@@ -2460,7 +2460,6 @@ extern void reweight_task(struct task_st
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
 
-extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
  
Joel Fernandes Jan. 19, 2024, 1:55 a.m. UTC | #11
On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
> 
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> 
>  - fair_server_runtime: set runtime in ns
>  - fair_server_period: set period in ns
>  - fair_server_defer: on/off for the defer mechanism
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>

Hi Daniel, Peter,
I am writing on behalf of the ChromeOS scheduler team.

We had to revert the last 3 patches in this series because of a syzkaller-reported
bug; this happens on the sched/more branch in Peter's tree:

 WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
 place_entity+0x240/0x290 kernel/sched/fair.c:5147
 Call Trace:
 <TASK>
  enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
  enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
  enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
  activate_task+0x60/0xc0 kernel/sched/core.c:2147
  ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
  ttwu_queue kernel/sched/core.c:4047 [inline]
  try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
  kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
  __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
  queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
  queue_delayed_work include/linux/workqueue.h:577 [inline]
  srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]

which is basically this warning in place_entity:
		if (WARN_ON_ONCE(!load))
			load = 1;

Full log (scroll to the bottom as there is console/lockdep side effects which
are likely not relevant to this issue): https://paste.debian.net/1304579/

Side note, we are also looking into a KASAN nullptr deref but this happens
only on our backport of the patches to a 5.15 kernel, as far as we know.

KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
 RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] 
 RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
 [...]
Call Trace:
 <TASK>
  set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
  set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
  set_next_task kernel/sched/sched.h:2241 [inline] 
  pick_next_task kernel/sched/core.c:6014 [inline] 
  __schedule+0x36fb/0x402d kernel/sched/core.c:6378
  preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
  preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615

Full splat: https://paste.debian.net/1304573/

Investigation is ongoing but could you also please take a look at these? It
is hard to reproduce and only syzbot has luck reproducing these.

Also I had a comment below:

> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> +{
> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> +	u64 new_bw = to_ratio(period, runtime);
> +	struct rq *rq = dl_se->rq;
> +	int cpu = cpu_of(rq);
> +	struct dl_bw *dl_b;
> +	unsigned long cap;
> +	int retval = 0;
> +	int cpus;
> +
> +	dl_b = dl_bw_of(cpu);
> +	raw_spin_lock(&dl_b->lock);
> +	cpus = dl_bw_cpus(cpu);
> +	cap = dl_bw_capacity(cpu);
> +
> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {

The dl_overflow() call here seems to introduce an issue with our conceptual
understanding of how the dl server is supposed to work.

Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
Suppose the DL server params are 50ms runtime in 100ms period (basically we
want to dedicate 50% of the bandwidth of each CPU to CFS).

In such a situation, __dl_overflow() will return an error right? Because
total bandwidth will exceed 100% (4 times 50% is 200%).

Further, this complicates the setting of the parameters since it means we
have to check the number of CPUs in advance and then set the parameters to
prevent dl_overflow(). As an example of this, 30% (runtime / period) for each
CPU will work fine if we have 2 CPUs. But if we have 4 CPUs, it will not work
because __dl_overflow() will fail.

How do you suggest we remedy this? Can we make the dlserver calculate how
much bandwidth is allowed on a per-CPU basis? My understanding is the fake
dl_se are all pinned to their respective CPUs, so we don't have the same
requirement as real DL tasks which may migrate freely within the root domain.

thanks,

 - Joel
  
Daniel Bristot de Oliveira Jan. 22, 2024, 2:14 p.m. UTC | #12
On 1/19/24 02:55, Joel Fernandes wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
>>
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> 
> Hi Daniel, Peter,
> I am writing on behalf of the ChromeOS scheduler team.
> 
> We had to revert the last 3 patches in this series because of a syzkaller
> reported bug, this happens on the sched/more branch in Peter's tree:
> 
>  WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
>  place_entity+0x240/0x290 kernel/sched/fair.c:5147
>  Call Trace:
>  <TASK>
>   enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
>   enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
>   enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
>   activate_task+0x60/0xc0 kernel/sched/core.c:2147
>   ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
>   ttwu_queue kernel/sched/core.c:4047 [inline]
>   try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
>   kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
>   __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
>   queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
>   queue_delayed_work include/linux/workqueue.h:577 [inline]
>   srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
> 
> which is basically this warning in place_entity:
> 		if (WARN_ON_ONCE(!load))
> 			load = 1;
> 
> Full log (scroll to the bottom as there is console/lockdep side effects which
> are likely not relevant to this issue): https://paste.debian.net/1304579/
> 
> Side note, we are also looking into a KASAN nullptr deref but this happens
> only on our backport of the patches to a 5.15 kernel, as far as we know.
> 
> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
>  RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] 
>  RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
>  [...]
> Call Trace:
>  <TASK>
>   set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
>   set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
>   set_next_task kernel/sched/sched.h:2241 [inline] 
>   pick_next_task kernel/sched/core.c:6014 [inline] 
>   __schedule+0x36fb/0x402d kernel/sched/core.c:6378
>   preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
>   preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
> 
> Full splat: https://paste.debian.net/1304573/

Interesting, does it keep any task hung? I am having a case where I see
a hung task, but I do not get the splat because the system freezes (printk
with rq_lock I guess)...

It might be the same problem.

> Investigation is on going but could you also please take a look at these? It
> is hard to reproduce and only syzbot has luck reproducing these.
> 
> Also I had a comment below:
> 
>> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
>> +{
>> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>> +	u64 new_bw = to_ratio(period, runtime);
>> +	struct rq *rq = dl_se->rq;
>> +	int cpu = cpu_of(rq);
>> +	struct dl_bw *dl_b;
>> +	unsigned long cap;
>> +	int retval = 0;
>> +	int cpus;
>> +
>> +	dl_b = dl_bw_of(cpu);
>> +	raw_spin_lock(&dl_b->lock);
>> +	cpus = dl_bw_cpus(cpu);
>> +	cap = dl_bw_capacity(cpu);
>> +
>> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
> 
> The dl_overflow() call here seems introducing an issue with our conceptual
> understanding of how the dl server is supposed to work.
> 
> Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
> Suppose the DL server params are 50ms runtime in 100ms period (basically we
> want to dedicate 50% of the bandwidth of each CPU to CFS).
> 
> In such a situation, __dl_overflow() will return an error right? Because
> total bandwidth will exceed 100% (4 times 50% is 200%).

I might be missing something in your case, but it accepts:

root@fedora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \;
1
1000000000
500000000
1
1000000000
500000000
1
1000000000
500000000
1
1000000000
500000000

your system accepts 400%... the percentage is "global".
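
(Spelling out the arithmetic: the admission test is against the aggregate
capacity of the root domain, so with 4 CPUs there is roughly 4 x 95% = 380%
to hand out under the default sched_rt_runtime_us/sched_rt_period_us limit,
and four 500000000/1000000000 servers only ask for 4 x 50% = 200%.)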

is it failing in your system?

-- Daniel
  
Joel Fernandes Jan. 23, 2024, 3:39 p.m. UTC | #13
Hi Daniel,

On 1/22/2024 9:14 AM, Daniel Bristot de Oliveira wrote:
> On 1/19/24 02:55, Joel Fernandes wrote:
>> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>>> Add an interface for fair server setup on debugfs.
>>>
>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>
>>>  - fair_server_runtime: set runtime in ns
>>>  - fair_server_period: set period in ns
>>>  - fair_server_defer: on/off for the defer mechanism
>>>
>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
>>
>> Hi Daniel, Peter,
>> I am writing on behalf of the ChromeOS scheduler team.
>>
>> We had to revert the last 3 patches in this series because of a syzkaller
>> reported bug, this happens on the sched/more branch in Peter's tree:
>>
>>  WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
>>  place_entity+0x240/0x290 kernel/sched/fair.c:5147
>>  Call Trace:
>>  <TASK>
>>   enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
>>   enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
>>   enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
>>   activate_task+0x60/0xc0 kernel/sched/core.c:2147
>>   ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
>>   ttwu_queue kernel/sched/core.c:4047 [inline]
>>   try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
>>   kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
>>   __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
>>   queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
>>   queue_delayed_work include/linux/workqueue.h:577 [inline]
>>   srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
>>
>> which is basically this warning in place_entity:
>> 		if (WARN_ON_ONCE(!load))
>> 			load = 1;
>>
>> Full log (scroll to the bottom as there is console/lockdep side effects which
>> are likely not relevant to this issue): https://paste.debian.net/1304579/
>>
>> Side note, we are also looking into a KASAN nullptr deref but this happens
>> only on our backport of the patches to a 5.15 kernel, as far as we know.
>>
>> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
>> CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
>>  RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] 
>>  RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
>>  [...]
>> Call Trace:
>>  <TASK>
>>   set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
>>   set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
>>   set_next_task kernel/sched/sched.h:2241 [inline] 
>>   pick_next_task kernel/sched/core.c:6014 [inline] 
>>   __schedule+0x36fb/0x402d kernel/sched/core.c:6378
>>   preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
>>   preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
>>
>> Full splat: https://paste.debian.net/1304573/
> 
> Interesting, does it keep any task hung? I am having a case where I see
> a hung task, but I do not get the splat because the system freezes (printk
> with rq_lock I guess)...
> 
> It might be the same problem.

Ah, we have an update. Suleiman found this is happening because of core
scheduling's pick logic. I have some patches to fix it; there are also more fixes
we have for other issues. Will coordinate with the team to send these out soon.
We are currently testing them more.

>> Investigation is on going but could you also please take a look at these? It
>> is hard to reproduce and only syzbot has luck reproducing these.
>>
>> Also I had a comment below:
>>
>>> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
>>> +{
>>> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>>> +	u64 new_bw = to_ratio(period, runtime);
>>> +	struct rq *rq = dl_se->rq;
>>> +	int cpu = cpu_of(rq);
>>> +	struct dl_bw *dl_b;
>>> +	unsigned long cap;
>>> +	int retval = 0;
>>> +	int cpus;
>>> +
>>> +	dl_b = dl_bw_of(cpu);
>>> +	raw_spin_lock(&dl_b->lock);
>>> +	cpus = dl_bw_cpus(cpu);
>>> +	cap = dl_bw_capacity(cpu);
>>> +
>>> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
>>
>> The dl_overflow() call here seems introducing an issue with our conceptual
>> understanding of how the dl server is supposed to work.
>>
>> Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
>> Suppose the DL server params are 50ms runtime in 100ms period (basically we
>> want to dedicate 50% of the bandwidth of each CPU to CFS).
>>
>> In such a situation, __dl_overflow() will return an error right? Because
>> total bandwidth will exceed 100% (4 times 50% is 200%).
> 
> I might be missing something in your case, but, it accepts:
> 
> root@fedora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \;
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 
> your system accepts 400%... the percentage is "global".
> 
> is it failing in your system?

You are right; I was actually trying to change it manually in my kernel in
dl_server_start(). In this case dl_server_apply_params() gets init=1 and
old_bw is 0.

I tried using the debugfs, and that works. So I think we will just use the
debugfs. I was being lazy and setting it in my kernel manually for testing like
this:

@@ -1475,7 +1475,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
         * this before getting generic.
         */
        if (!dl_server(dl_se)) {
               u64 runtime = 12 * NSEC_PER_MSEC;
                u64 period = 15 * NSEC_PER_MSEC;

That doesn't work but I tried debugfs and it works. But for production, we will
set it from userspace so it should not be an issue.

I feel so much better now :) Thanks Daniel.

By the way, what's the plan for the remaining patches in the sched/more branch? Are you
planning to resend those later? If so, we can just post our fixes on top of
that, and if you don't mind you could include it in your next series posting
(sched/more + our fixes + your fixes).

Thanks!

 - Joel
  
Joel Fernandes Jan. 23, 2024, 3:44 p.m. UTC | #14
On 1/22/2024 9:14 AM, Daniel Bristot de Oliveira wrote:
> Interesting, does it keep any task hung? I am having a case where I see
> a hung task, but I do not get the splat because the system freezes (printk
> with rq_lock I guess)...

I missed replying to this part of your email. We see it as a syzkaller report
with splats, so it is not clear if a task was hung at the time of a splat.
syzkaller runs in its own separate VM instance.

The fun part though is that I found a way to do an ftrace dump on it without having to
pass any boot parameters (that's another issue: we can't pass it boot
parameters). So we have that tool in our debug arsenal for these issues, thankfully.

> It might be the same problem.

True, possibly.

thanks,

 - Joel
  
Joel Fernandes Feb. 13, 2024, 2:13 a.m. UTC | #15
On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
> 
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> 
>  - fair_server_runtime: set runtime in ns
>  - fair_server_period: set period in ns
>  - fair_server_defer: on/off for the defer mechanism

Btw Daniel, there is an interesting side-effect of this interface having runtime
and period in 2 separate files :)

Say I want to set a CPU to 5ms / 10ms.

I cannot set either period or runtime to 5ms or 10ms directly.

I have to first set period to 100ms, then set runtime to 50ms, then set period
to 50ms, then set runtime to 5ms, then finally set period to 10ms.
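
In shell form, that sequence is (path shown for the fair_server debugfs layout
discussed earlier in the thread, debugfs mounted at /sys/kernel/debug; adjust
for the rq/CPU{ID} naming if using the files as named in this patch):

	F=/sys/kernel/debug/sched/fair_server/cpu0
	echo 100000000 > $F/period	# 1. period  -> 100ms
	echo  50000000 > $F/runtime	# 2. runtime ->  50ms
	echo  50000000 > $F/period	# 3. period  ->  50ms
	echo   5000000 > $F/runtime	# 4. runtime ->   5ms
	echo  10000000 > $F/period	# 5. period  ->  10ms (the 5ms/10ms target)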

The reason seems to be that otherwise runtime / period will not be
accommodated and will cause dl_overflow issues.

I'd suggest providing both runtime and period in the same interface to make it
easier to use. However, for the testing I am going with what we have.

Also a request:

I was wondering if a new version of the last 3 patches could be posted to
LKML or shared in a tree somewhere. I am trying to sync to mainline and
rebase our latest fixes on top of that; however, it is difficult to do because
these 3 patches are in a bit of flux (for example, the discussion between you and
Peter about update_curr()). What's the best way to move forward with rebasing
our fix contributions? I am going with sched/more in Peter's queue.git
unless you/Peter prefer something else. And I added your update_curr()
suggestion onto that; let me know if you disagree with it:

@@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq)

        if (entity_is_task(curr))
                update_curr_task(task_of(curr), delta_exec);
+       else
+               dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);

        account_cfs_rq_runtime(cfs_rq, delta_exec);
 }

thanks,

 - Joel

> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> ---
>  kernel/sched/deadline.c |  89 +++++++++++++++---
>  kernel/sched/debug.c    | 202 ++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c     |   6 --
>  kernel/sched/sched.h    |   2 +
>  4 files changed, 279 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 69ee1fbd60e4..1092ca8892e0 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -321,19 +321,12 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>  		__sub_running_bw(dl_se->dl_bw, dl_rq);
>  }
>  
> -static void dl_change_utilization(struct task_struct *p, u64 new_bw)
> +static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw)
>  {
> -	struct rq *rq;
> -
> -	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
> -
> -	if (task_on_rq_queued(p))
> -		return;
> +	if (dl_se->dl_non_contending) {
> +		sub_running_bw(dl_se, &rq->dl);
> +		dl_se->dl_non_contending = 0;
>  
> -	rq = task_rq(p);
> -	if (p->dl.dl_non_contending) {
> -		sub_running_bw(&p->dl, &rq->dl);
> -		p->dl.dl_non_contending = 0;
>  		/*
>  		 * If the timer handler is currently running and the
>  		 * timer cannot be canceled, inactive_task_timer()
> @@ -341,13 +334,25 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
>  		 * will not touch the rq's active utilization,
>  		 * so we are still safe.
>  		 */
> -		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> -			put_task_struct(p);
> +		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
> +			if (!dl_server(dl_se))
> +				put_task_struct(dl_task_of(dl_se));
> +		}
>  	}
> -	__sub_rq_bw(p->dl.dl_bw, &rq->dl);
> +	__sub_rq_bw(dl_se->dl_bw, &rq->dl);
>  	__add_rq_bw(new_bw, &rq->dl);
>  }
>  
> +static void dl_change_utilization(struct task_struct *p, u64 new_bw)
> +{
> +	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
> +
> +	if (task_on_rq_queued(p))
> +		return;
> +
> +	dl_rq_change_utilization(task_rq(p), &p->dl, new_bw);
> +}
> +
>  static void __dl_clear_params(struct sched_dl_entity *dl_se);
>  
>  /*
> @@ -1508,10 +1513,22 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
>  
>  void dl_server_start(struct sched_dl_entity *dl_se)
>  {
> +	/*
> +	 * XXX: the apply do not work fine at the init phase for the
> +	 * fair server because things are not yet set. We need to improve
> +	 * this before getting generic.
> +	 */
>  	if (!dl_server(dl_se)) {
> +		u64 runtime = 50 * NSEC_PER_MSEC;
> +		u64 period = 1000 * NSEC_PER_MSEC;
> +
> +		dl_server_apply_params(dl_se, runtime, period, 1);
> +
> +		dl_se->dl_zerolax = 1;
>  		dl_se->dl_server = 1;
>  		setup_new_dl_entity(dl_se);
>  	}
> +
>  	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
>  }
>  
> @@ -1532,6 +1549,50 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
>  	dl_se->server_pick = pick;
>  }
>  
> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> +{
> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> +	u64 new_bw = to_ratio(period, runtime);
> +	struct rq *rq = dl_se->rq;
> +	int cpu = cpu_of(rq);
> +	struct dl_bw *dl_b;
> +	unsigned long cap;
> +	int retval = 0;
> +	int cpus;
> +
> +	dl_b = dl_bw_of(cpu);
> +	raw_spin_lock(&dl_b->lock);
> +	cpus = dl_bw_cpus(cpu);
> +	cap = dl_bw_capacity(cpu);
> +
> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
> +		retval = -EBUSY;
> +		goto out;
> +	}
> +
> +	if (init) {
> +		__add_rq_bw(new_bw, &rq->dl);
> +		__dl_add(dl_b, new_bw, cpus);
> +	} else {
> +		__dl_sub(dl_b, dl_se->dl_bw, cpus);
> +		__dl_add(dl_b, new_bw, cpus);
> +
> +		dl_rq_change_utilization(rq, dl_se, new_bw);
> +	}
> +
> +	rq->fair_server.dl_runtime = runtime;
> +	rq->fair_server.dl_deadline  = period;
> +	rq->fair_server.dl_period  = period;
> +
> +	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> +	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
> +
> +out:
> +	raw_spin_unlock(&dl_b->lock);
> +
> +	return retval;
> +}
> +
>  /*
>   * Update the current task's runtime statistics (provided it is still
>   * a -deadline task and has not been removed from the dl_rq).
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 4580a450700e..bd7ad6b8d3de 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -333,8 +333,208 @@ static const struct file_operations sched_debug_fops = {
>  	.release	= seq_release,
>  };
>  
> +enum dl_param {
> +	DL_RUNTIME = 0,
> +	DL_PERIOD,
> +	DL_ZEROLAX
> +};
> +
> +static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
> +static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
> +
> +static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
> +				       size_t cnt, loff_t *ppos, enum dl_param param)
> +{
> +	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
> +	u64 runtime, period, zerolax;
> +	struct rq *rq = cpu_rq(cpu);
> +	size_t err;
> +	int retval;
> +	u64 value;
> +
> +	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
> +	if (err)
> +		return err;
> +
> +	scoped_guard (rq_lock_irqsave, rq) {
> +
> +		runtime  = rq->fair_server.dl_runtime;
> +		period = rq->fair_server.dl_period;
> +		zerolax = rq->fair_server.dl_zerolax;
> +
> +		switch (param) {
> +		case DL_RUNTIME:
> +			if (runtime == value)
> +				goto out;
> +			runtime = value;
> +			break;
> +		case DL_PERIOD:
> +			if (value == period)
> +				goto out;
> +			period = value;
> +			break;
> +		case DL_ZEROLAX:
> +			if (zerolax == value)
> +				goto out;
> +			zerolax = value;
> +			break;
> +		}
> +
> +		if (runtime > period
> +				|| period > fair_server_period_max
> +				|| period < fair_server_period_min
> +				|| zerolax > 1) {
> +			cnt = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (rq->cfs.h_nr_running) {
> +			update_rq_clock(rq);
> +			dl_server_stop(&rq->fair_server);
> +		}
> +
> +		/*
> +		 * The zerolax does not change utilization, so just
> +		 * setting it is enough.
> +		 */
> +		if (rq->fair_server.dl_zerolax != zerolax) {
> +			rq->fair_server.dl_zerolax = zerolax;
> +		} else {
> +			retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
> +			if (retval)
> +				cnt = retval;
> +		}
> +
> +		if (rq->cfs.h_nr_running)
> +			dl_server_start(&rq->fair_server);
> +	}
> +
> +out:
> +	*ppos += cnt;
> +	return cnt;
> +}
> +
> +static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
> +{
> +	unsigned long cpu = (unsigned long) m->private;
> +	struct rq *rq = cpu_rq(cpu);
> +	u64 value;
> +
> +	switch (param) {
> +	case DL_RUNTIME:
> +		value = rq->fair_server.dl_runtime;
> +		break;
> +	case DL_PERIOD:
> +		value = rq->fair_server.dl_period;
> +		break;
> +	case DL_ZEROLAX:
> +		value = rq->fair_server.dl_zerolax;
> +	}
> +
> +	seq_printf(m, "%llu\n", value);
> +	return 0;
> +
> +}
> +
> +static ssize_t
> +sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
> +				size_t cnt, loff_t *ppos)
> +{
> +	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
> +}
> +
> +static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
> +{
> +	return sched_fair_server_show(m, v, DL_RUNTIME);
> +}
> +
> +static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, sched_fair_server_runtime_show, inode->i_private);
> +}
> +
> +static const struct file_operations fair_server_runtime_fops = {
> +	.open		= sched_fair_server_runtime_open,
> +	.write		= sched_fair_server_runtime_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static ssize_t
> +sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
> +			       size_t cnt, loff_t *ppos)
> +{
> +	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
> +}
> +
> +static int sched_fair_server_period_show(struct seq_file *m, void *v)
> +{
> +	return sched_fair_server_show(m, v, DL_PERIOD);
> +}
> +
> +static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, sched_fair_server_period_show, inode->i_private);
> +}
> +
> +static const struct file_operations fair_server_period_fops = {
> +	.open		= sched_fair_server_period_open,
> +	.write		= sched_fair_server_period_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static ssize_t
> +sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
> +			      size_t cnt, loff_t *ppos)
> +{
> +	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_ZEROLAX);
> +}
> +
> +static int sched_fair_server_defer_show(struct seq_file *m, void *v)
> +{
> +	return sched_fair_server_show(m, v, DL_ZEROLAX);
> +}
> +
> +static int sched_fair_server_defer_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, sched_fair_server_defer_show, inode->i_private);
> +}
> +
> +static const struct file_operations fair_server_defer_fops = {
> +	.open		= sched_fair_server_defer_open,
> +	.write		= sched_fair_server_defer_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
>  static struct dentry *debugfs_sched;
>  
> +static void debugfs_fair_server_init(void)
> +{
> +	long cpu;
> +	struct dentry *rq_dentry;
> +
> +	rq_dentry = debugfs_create_dir("rq", debugfs_sched);
> +	if (!rq_dentry)
> +		return;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct dentry *d_cpu;
> +		char buf[32];
> +
> +		snprintf(buf, sizeof(buf), "cpu%ld", cpu);
> +		d_cpu = debugfs_create_dir(buf, rq_dentry);
> +
> +		debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
> +		debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
> +		debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
> +	}
> +}
> +
>  static __init int sched_init_debug(void)
>  {
>  	struct dentry __maybe_unused *numa;
> @@ -374,6 +574,8 @@ static __init int sched_init_debug(void)
>  
>  	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>  
> +	debugfs_fair_server_init();
> +
>  	return 0;
>  }
>  late_initcall(sched_init_debug);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 399237cd9f59..5434c52f470d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8419,12 +8419,6 @@ void fair_server_init(struct rq *rq)
>  	struct sched_dl_entity *dl_se = &rq->fair_server;
>  
>  	init_dl_entity(dl_se);
> -
> -	dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
> -	dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
> -	dl_se->dl_period = 1000 * NSEC_PER_MSEC;
> -	dl_se->dl_zerolax = 1;
> -
>  	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec0e288c8e06..312b31df5860 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,6 +341,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
>  		    dl_server_pick_f pick);
>  
>  extern void fair_server_init(struct rq *);
> +extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
> +		    u64 runtime, u64 period, bool init);
>  
>  #ifdef CONFIG_CGROUP_SCHED
>
  
Joel Fernandes Feb. 13, 2024, 2:21 a.m. UTC | #16
On 2/12/2024 9:13 PM, Joel Fernandes wrote:
> 
> 
> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
> 
> Btw Daniel, there is an interesting side-effect of this interface having runtime
> and period in 2 separate files :)
> 
> Say I want to set a CPU to 5ms / 10ms.
> 

Sorry, let me try again: say I want to set 500us / 1ms.

Then I have to do the following to get it to work:

# echo 100000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
# echo 5000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
# echo 10000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
# echo 500000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
# echo 1000000 > /sys/kernel/debug/sched/fair_server/cpu0/period

IOW, if I boot and do the following, it fails:

# echo 500000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
bash: echo: write error: Device or resource busy
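
Just so the constraints are clear, each write is validated on its own by the
new debugfs handler, so every intermediate runtime/period pair has to be
acceptable by itself. Condensed from sched_fair_server_write() and
dl_server_apply_params() in the patch below (not the full code path):

	/* per-write sanity checks, -EINVAL on failure */
	if (runtime > period ||
	    period > fair_server_period_max ||	/* ~4 seconds */
	    period < fair_server_period_min ||	/* 100 us */
	    zerolax > 1)
		return -EINVAL;

	/* bandwidth admission, -EBUSY if the new runtime/period ratio
	 * does not fit in the root domain's dl_bw */
	retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
	if (retval)
		return retval;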

 - Joel
  
Daniel Bristot de Oliveira Feb. 14, 2024, 2:23 p.m. UTC | #17
On 2/13/24 03:13, Joel Fernandes wrote:
> 
> 
> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
> 
> Btw Daniel, there is an interesting side-effect of this interface having runtime
> and period in 2 separate files :)
> 
> Say I want to set a CPU to 5ms / 10ms.
> 
> I cannot set either period or runtime to 5ms or 10ms directly.
> 
> I have to first set period to 100ms, then set runtime to 50ms, then set period
> to 50ms, then set runtime to 5ms, then finally set period to 10ms.

Hummm, yeah, I could reproduce that. It seems that it is not even a problem of having
two files, but a bug in the logic; I will have a look.

> The reason seems to be because otherwise runtime / period will not be
> accomodated and will cause dl_overflow issues.
> 
> I'd suggest providing both runtime and period in the same interface to make it
> more easier to use. However, for the testing I am going with what we have.
> 
> Also a request:
> 
> I was wondering if a new version of the last 3 patches could be posted to
> LKML or shared in a tree somewhere. I am trying to sync to mainline and
> rebase our latest fixes on top of that, however it is difficult to do because
> these 3 patches are in bit of a flux (example the discussion between you and
> Peter about update_curr()). What's the best way to move forward with rebasing
> our fix contributions?

Juri and I chatted about it, and we think it is a good thing to re-send this patch
set, including a fix I have for it (to avoid a regression wrt RT throttling),
explaining these things on the mailing list so Peter will be able to follow the
discussion.

I still need to finish testing and to make a proper cover page with all the
updates; the latest thing is here (tm):

https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6

It is based on Peter's sched/more. I will probably re-send it today or tomorrow,
but at least you can have a look at it.

Another reason to send it is to get the regression test machinery running....

> I am going with the sched/more in Peter's queue.git
> unless you/Peter prefer something else. And I added your update_curr()
> suggestion onto that, let me know if you disagree with it:
> 
> @@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
> 
>         if (entity_is_task(curr))
>                 update_curr_task(task_of(curr), delta_exec);
> +       else
> +               dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
> 
>         account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }

That part of the code was optimized by Peter during the last round of discussions.

It is like this now:

------------ %< -----------
-       if (entity_is_task(curr))
-               update_curr_task(task_of(curr), delta_exec);
+       if (entity_is_task(curr)) {
+               struct task_struct *p = task_of(curr);
+               update_curr_task(p, delta_exec);
+               /*
+                * Any fair task that runs outside of fair_server should
+                * account against fair_server such that it can account for
+                * this time and possibly avoid running this period.
+                */
+               if (p->dl_server != &rq->fair_server)
+                       dl_server_update(&rq->fair_server, delta_exec);
+       }
------------ >% -----------

It is not straightforward to understand... but the idea is:

if it is a task, and its dl_server is not the fair server, discount the time
directly from the fair server. This also means that if dl_server is NULL
(the server is not enabled), it will discount time from the fair server.
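
To spell out the cases as I read this version (a sketch of the intent, not a
spec):

	if (entity_is_task(curr)) {
		struct task_struct *p = task_of(curr);

		update_curr_task(p, delta_exec);

		/*
		 * p->dl_server == &rq->fair_server: the task was picked
		 * through the fair server, which is already charged via
		 * p->dl_server in update_curr_task(), so it is not charged
		 * a second time here.
		 *
		 * p->dl_server == NULL (no server active) or some other
		 * server: the task runs outside of the fair server, so the
		 * fair server is charged for this time as well.
		 */
		if (p->dl_server != &rq->fair_server)
			dl_server_update(&rq->fair_server, delta_exec);
	}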

-- Daniel


> thanks,
> 
>  - Joel
  
Joel Fernandes Feb. 15, 2024, 1:57 p.m. UTC | #18
Hello, Daniel,

On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote:
> On 2/13/24 03:13, Joel Fernandes wrote:
>>
>>
>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>> Add an interface for fair server setup on debugfs.
>>>
>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>
>>>  - fair_server_runtime: set runtime in ns
>>>  - fair_server_period: set period in ns
>>>  - fair_server_defer: on/off for the defer mechanism
>>
>> Btw Daniel, there is an interesting side-effect of this interface having runtime
>> and period in 2 separate files :)
>>
>> Say I want to set a CPU to 5ms / 10ms.
>>
>> I cannot set either period or runtime to 5ms or 10ms directly.
>>
>> I have to first set period to 100ms, then set runtime to 50ms, then set period
>> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
> 
> Hummm yeah I could reproduce that, it seems that it is not even a problem of having
> two files, but a bug in the logic, I will have a look.

Thanks for taking a look. My colleague Suleiman hit the issue too. He's unable
to set 45ms/50ms, for instance.

Also, a quick request: could you please CC my colleagues Suleiman and Youssef,
who are also working on / reviewing these, on the patches:

Suleiman Souhlal <suleiman@google.com>
Youssef Esmat <youssefesmat@google.com>

>> The reason seems to be because otherwise runtime / period will not be
>> accomodated and will cause dl_overflow issues.
>>
>> I'd suggest providing both runtime and period in the same interface to make it
>> more easier to use. However, for the testing I am going with what we have.
>>
>> Also a request:
>>
>> I was wondering if a new version of the last 3 patches could be posted to
>> LKML or shared in a tree somewhere. I am trying to sync to mainline and
>> rebase our latest fixes on top of that, however it is difficult to do because
>> these 3 patches are in bit of a flux (example the discussion between you and
>> Peter about update_curr()). What's the best way to move forward with rebasing
>> our fix contributions?
> 
> Juri and I chat about, and we think it is a good thing to re-send this patch set,
> including a fix I have to it (to avoid regression wrt rt throttling), explaining
> these things in the mailing list so peter will be able to follow the discussion.
> 
> I still need to finish testing, and to make a proper cover page with all updates, the
> latest thing is here (tm):
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6
> 
> It is based on peter's sched/more. I will probably re-send it today or tomorrow,
> but at least you can have a look at it.
>> Another reason to send it is to get the regression test machinery running....

Sure, looking forward to it. I rebased on the above tree and it applied cleanly.
What I'll do is send our patches today (not those in sched/more) after a bit
more testing and tweaks.

There are 2 reasons for this:
1. We can get the build robot to do its thing.
2. Our internal system checks whether backported patches were posted upstream to
the list.

Hope that sounds good to you and we can start reviewing as well.

>  I am going with the sched/more in Peter's queue.git
>> unless you/Peter prefer something else. And I added your update_curr()
>> suggestion onto that, let me know if you disagree with it:
>>
>> @@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>
>>         if (entity_is_task(curr))
>>                 update_curr_task(task_of(curr), delta_exec);
>> +       else
>> +               dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
>>
>>         account_cfs_rq_runtime(cfs_rq, delta_exec);
>>  }
> 
> That part of the code was optimized by peter during the last round of discussions.
> 
> It is like this now:
> 
> ------------ %< -----------
> -       if (entity_is_task(curr))
> -               update_curr_task(task_of(curr), delta_exec);
> +       if (entity_is_task(curr)) {
> +               struct task_struct *p = task_of(curr);
> +               update_curr_task(p, delta_exec);
> +               /*
> +                * Any fair task that runs outside of fair_server should
> +                * account against fair_server such that it can account for
> +                * this time and possibly avoid running this period.
> +                */
> +               if (p->dl_server != &rq->fair_server)
> +                       dl_server_update(&rq->fair_server, delta_exec);
> +       }
> ------------ >% -----------
> 
> It is not straightforward to understand... but the ideia is:
> 
> if it is a task, and the server is ! of the fair server, discount time
> directly from the fair server. This also means that if dl_server is NULL
> (the server is not enabled) it will discount time from the fair server.

Yes, that makes sense. We certainly want to debit the server even when the task
is not being picked indirectly through DL. I guess Peter's optimization also
handles the case where multiple servers are in play. That will help us when/if
we make RT a server as well, right?

thanks,

 - Joel
  
Daniel Bristot de Oliveira Feb. 15, 2024, 5:27 p.m. UTC | #19
On 2/15/24 14:57, Joel Fernandes wrote:
> Hello, Daniel,
> 
> On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote:
>> On 2/13/24 03:13, Joel Fernandes wrote:
>>>
>>>
>>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>>> Add an interface for fair server setup on debugfs.
>>>>
>>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>>
>>>>  - fair_server_runtime: set runtime in ns
>>>>  - fair_server_period: set period in ns
>>>>  - fair_server_defer: on/off for the defer mechanism
>>>
>>> Btw Daniel, there is an interesting side-effect of this interface having runtime
>>> and period in 2 separate files :)
>>>
>>> Say I want to set a CPU to 5ms / 10ms.
>>>
>>> I cannot set either period or runtime to 5ms or 10ms directly.
>>>
>>> I have to first set period to 100ms, then set runtime to 50ms, then set period
>>> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
>>
>> Hummm yeah I could reproduce that, it seems that it is not even a problem of having
>> two files, but a bug in the logic, I will have a look.
> 
> Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to
> not set 45ms/50ms for instance.

I isolated the problem. It is not an interface problem.

Long story short, the servers are initialized at the defrootdomain, but
the dl_bw info is not being carried over to the new domain because the
servers are not tasks.

I am discussing this with Valentin (topology) & Juri. We will try to find a
solution, or at least a presentable XXX: solution... in the next few days.

You can work around it by disabling the admission control via:

# sysctl kernel.sched_rt_runtime_us=-1

then the values will be accepted. To the best of my knowledge, you guys are
only using SCHED_RR/FIFO, so the admission control for DL is not an issue.
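
To see why that knob helps: the -EBUSY comes from the __dl_overflow() check in
dl_server_apply_params(), and with kernel.sched_rt_runtime_us=-1 the
per-root-domain bandwidth is set to -1, which short-circuits the admission
test. From memory (the helper in sched.h may differ slightly in your tree):

	static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
					 u64 old_bw, u64 new_bw)
	{
		/* bw == -1 means deadline admission control is disabled */
		return dl_b->bw != -1 &&
		       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
	}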

>> I still need to finish testing, and to make a proper cover page with all updates, the
>> latest thing is here (tm):
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6
>>
>> It is based on peter's sched/more. I will probably re-send it today or tomorrow,
>> but at least you can have a look at it.
>>> Another reason to send it is to get the regression test machinery running....
> 
> Sure, looking forward to it. I rebased on above tree and it applied cleanly.
> What I'll do is I will send our patches today (not those in sched/more) after a
> bit more testing and tweaks.
> 
> There are 2 reasons for this:
> 1. Can get the build robot do its thing.
> 2. Our internal system checks whether patches backported were posted upstream to
> list.
> 
> Hope that sounds good to you and we can start reviewing as well.

If it helps downstream for you guys, it is not a problem for me. Still, Peter is
the person who has more comments to give, so...

-- Daniel
  
Joel Fernandes Feb. 15, 2024, 5:41 p.m. UTC | #20
On 2/15/2024 12:27 PM, Daniel Bristot de Oliveira wrote:
> On 2/15/24 14:57, Joel Fernandes wrote:
>> Hello, Daniel,
>>
>> On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote:
>>> On 2/13/24 03:13, Joel Fernandes wrote:
>>>>
>>>>
>>>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>>>> Add an interface for fair server setup on debugfs.
>>>>>
>>>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>>>
>>>>>  - fair_server_runtime: set runtime in ns
>>>>>  - fair_server_period: set period in ns
>>>>>  - fair_server_defer: on/off for the defer mechanism
>>>>
>>>> Btw Daniel, there is an interesting side-effect of this interface having runtime
>>>> and period in 2 separate files :)
>>>>
>>>> Say I want to set a CPU to 5ms / 10ms.
>>>>
>>>> I cannot set either period or runtime to 5ms or 10ms directly.
>>>>
>>>> I have to first set period to 100ms, then set runtime to 50ms, then set period
>>>> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
>>>
>>> Hummm yeah I could reproduce that, it seems that it is not even a problem of having
>>> two files, but a bug in the logic, I will have a look.
>>
>> Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to
>> not set 45ms/50ms for instance.
> 
> I isolated the problem. It is not an interface problem.
> 
> Long story short, the servers are initialized at the defrootdomain, but
> the dl_bw info is not being carried over to the new domain because the
> servers are not a task.

Nice work on nailing the issue.

> I am discussing this with Valentin (topology) & Juri. We will try to find a
> solution, or at least an presentable XXX: solution... in the next days.
> 
> You can work around it by disabling the admission control via:
> 
> # sysctl kernel.sched_rt_runtime_us=-1
> 
> the the values will be accepted. For the best of my knowledge, you guys are
> only using SCHED_RR/FIFO so the admission control for DL is not an issue.

That's right, we only use deadline for the server. However, on some devices,
schedutil is used and AFAIR its kthread uses SCHED_DEADLINE. I don't anticipate
problems related to admission control and that kthread, so I think your proposed
workaround sounds good to me.

thanks,

 - Joel
  

Patch

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 69ee1fbd60e4..1092ca8892e0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -321,19 +321,12 @@  void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 		__sub_running_bw(dl_se->dl_bw, dl_rq);
 }
 
-static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw)
 {
-	struct rq *rq;
-
-	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
-
-	if (task_on_rq_queued(p))
-		return;
+	if (dl_se->dl_non_contending) {
+		sub_running_bw(dl_se, &rq->dl);
+		dl_se->dl_non_contending = 0;
 
-	rq = task_rq(p);
-	if (p->dl.dl_non_contending) {
-		sub_running_bw(&p->dl, &rq->dl);
-		p->dl.dl_non_contending = 0;
 		/*
 		 * If the timer handler is currently running and the
 		 * timer cannot be canceled, inactive_task_timer()
@@ -341,13 +334,25 @@  static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 		 * will not touch the rq's active utilization,
 		 * so we are still safe.
 		 */
-		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
-			put_task_struct(p);
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+			if (!dl_server(dl_se))
+				put_task_struct(dl_task_of(dl_se));
+		}
 	}
-	__sub_rq_bw(p->dl.dl_bw, &rq->dl);
+	__sub_rq_bw(dl_se->dl_bw, &rq->dl);
 	__add_rq_bw(new_bw, &rq->dl);
 }
 
+static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+{
+	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
+
+	if (task_on_rq_queued(p))
+		return;
+
+	dl_rq_change_utilization(task_rq(p), &p->dl, new_bw);
+}
+
 static void __dl_clear_params(struct sched_dl_entity *dl_se);
 
 /*
@@ -1508,10 +1513,22 @@  void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 
 void dl_server_start(struct sched_dl_entity *dl_se)
 {
+	/*
+	 * XXX: the apply do not work fine at the init phase for the
+	 * fair server because things are not yet set. We need to improve
+	 * this before getting generic.
+	 */
 	if (!dl_server(dl_se)) {
+		u64 runtime = 50 * NSEC_PER_MSEC;
+		u64 period = 1000 * NSEC_PER_MSEC;
+
+		dl_server_apply_params(dl_se, runtime, period, 1);
+
+		dl_se->dl_zerolax = 1;
 		dl_se->dl_server = 1;
 		setup_new_dl_entity(dl_se);
 	}
+
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
 }
 
@@ -1532,6 +1549,50 @@  void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 	dl_se->server_pick = pick;
 }
 
+int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
+{
+	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	u64 new_bw = to_ratio(period, runtime);
+	struct rq *rq = dl_se->rq;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+	unsigned long cap;
+	int retval = 0;
+	int cpus;
+
+	dl_b = dl_bw_of(cpu);
+	raw_spin_lock(&dl_b->lock);
+	cpus = dl_bw_cpus(cpu);
+	cap = dl_bw_capacity(cpu);
+
+	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
+		retval = -EBUSY;
+		goto out;
+	}
+
+	if (init) {
+		__add_rq_bw(new_bw, &rq->dl);
+		__dl_add(dl_b, new_bw, cpus);
+	} else {
+		__dl_sub(dl_b, dl_se->dl_bw, cpus);
+		__dl_add(dl_b, new_bw, cpus);
+
+		dl_rq_change_utilization(rq, dl_se, new_bw);
+	}
+
+	rq->fair_server.dl_runtime = runtime;
+	rq->fair_server.dl_deadline  = period;
+	rq->fair_server.dl_period  = period;
+
+	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
+
+out:
+	raw_spin_unlock(&dl_b->lock);
+
+	return retval;
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4580a450700e..bd7ad6b8d3de 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -333,8 +333,208 @@  static const struct file_operations sched_debug_fops = {
 	.release	= seq_release,
 };
 
+enum dl_param {
+	DL_RUNTIME = 0,
+	DL_PERIOD,
+	DL_ZEROLAX
+};
+
+static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
+
+static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
+				       size_t cnt, loff_t *ppos, enum dl_param param)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	u64 runtime, period, zerolax;
+	struct rq *rq = cpu_rq(cpu);
+	size_t err;
+	int retval;
+	u64 value;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
+	if (err)
+		return err;
+
+	scoped_guard (rq_lock_irqsave, rq) {
+
+		runtime  = rq->fair_server.dl_runtime;
+		period = rq->fair_server.dl_period;
+		zerolax = rq->fair_server.dl_zerolax;
+
+		switch (param) {
+		case DL_RUNTIME:
+			if (runtime == value)
+				goto out;
+			runtime = value;
+			break;
+		case DL_PERIOD:
+			if (value == period)
+				goto out;
+			period = value;
+			break;
+		case DL_ZEROLAX:
+			if (zerolax == value)
+				goto out;
+			zerolax = value;
+			break;
+		}
+
+		if (runtime > period
+				|| period > fair_server_period_max
+				|| period < fair_server_period_min
+				|| zerolax > 1) {
+			cnt = -EINVAL;
+			goto out;
+		}
+
+		if (rq->cfs.h_nr_running) {
+			update_rq_clock(rq);
+			dl_server_stop(&rq->fair_server);
+		}
+
+		/*
+		 * The zerolax does not change utilization, so just
+		 * setting it is enough.
+		 */
+		if (rq->fair_server.dl_zerolax != zerolax) {
+			rq->fair_server.dl_zerolax = zerolax;
+		} else {
+			retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
+			if (retval)
+				cnt = retval;
+		}
+
+		if (rq->cfs.h_nr_running)
+			dl_server_start(&rq->fair_server);
+	}
+
+out:
+	*ppos += cnt;
+	return cnt;
+}
+
+static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+	u64 value;
+
+	switch (param) {
+	case DL_RUNTIME:
+		value = rq->fair_server.dl_runtime;
+		break;
+	case DL_PERIOD:
+		value = rq->fair_server.dl_period;
+		break;
+	case DL_ZEROLAX:
+		value = rq->fair_server.dl_zerolax;
+	}
+
+	seq_printf(m, "%llu\n", value);
+	return 0;
+
+}
+
+static ssize_t
+sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
+				size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_runtime_fops = {
+	.open		= sched_fair_server_runtime_open,
+	.write		= sched_fair_server_runtime_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static ssize_t
+sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
+}
+
+static int sched_fair_server_period_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_PERIOD);
+}
+
+static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_period_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_period_fops = {
+	.open		= sched_fair_server_period_open,
+	.write		= sched_fair_server_period_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static ssize_t
+sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_ZEROLAX);
+}
+
+static int sched_fair_server_defer_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_ZEROLAX);
+}
+
+static int sched_fair_server_defer_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_defer_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_defer_fops = {
+	.open		= sched_fair_server_defer_open,
+	.write		= sched_fair_server_defer_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static struct dentry *debugfs_sched;
 
+static void debugfs_fair_server_init(void)
+{
+	long cpu;
+	struct dentry *rq_dentry;
+
+	rq_dentry = debugfs_create_dir("rq", debugfs_sched);
+	if (!rq_dentry)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%ld", cpu);
+		d_cpu = debugfs_create_dir(buf, rq_dentry);
+
+		debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
+		debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
+		debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
+	}
+}
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -374,6 +574,8 @@  static __init int sched_init_debug(void)
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
+	debugfs_fair_server_init();
+
 	return 0;
 }
 late_initcall(sched_init_debug);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 399237cd9f59..5434c52f470d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8419,12 +8419,6 @@  void fair_server_init(struct rq *rq)
 	struct sched_dl_entity *dl_se = &rq->fair_server;
 
 	init_dl_entity(dl_se);
-
-	dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
-	dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
-	dl_se->dl_period = 1000 * NSEC_PER_MSEC;
-	dl_se->dl_zerolax = 1;
-
 	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec0e288c8e06..312b31df5860 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,6 +341,8 @@  extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_pick_f pick);
 
 extern void fair_server_init(struct rq *);
+extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
+		    u64 runtime, u64 period, bool init);
 
 #ifdef CONFIG_CGROUP_SCHED