[RFC] sched/eevdf: Use tunable knob sysctl_sched_base_slice as explicit time quanta

Message ID 20240111115745.62813-2-zegao@tencent.com
State New
Series [RFC] sched/eevdf: Use tunable knob sysctl_sched_base_slice as explicit time quanta

Commit Message

Ze Gao Jan. 11, 2024, 11:57 a.m. UTC
  AFAICS, we've overlooked the role that the concept of a time quantum
plays in EEVDF. According to Theorem 1 in [1], we have

	-r_max < lag_k(t) < max(r_max, q)

so clearly we don't want either r_max (the maximum user request) or q
(the time quantum) to be too big.

To trade for throughput, [2] chooses to do tick preemption at
per-request boundaries (i.e., once a given request is fulfilled), which
means we literally have no concept of a time quantum defined anymore.
Obviously this is no problem if we make

	q = r_i = sysctl_sched_base_slice

exactly as we do now, which effectively creates an implicit quantum
for us and works well.

However, with custom slices being possible, the lag bound depends only
on the distribution of user-requested slices, since no time quantum is
available anymore, and because of [2] we would lose many of the
scheduling opportunities needed to maintain fairness and
responsiveness. What's worse, we may suffer unexpected unfairness and
latency.

For example, take two CPU-bound processes with the same weight, bind
them to the same CPU, and let process A request 100ms whereas B
requests 0.1ms each time (with HZ=1000, sysctl_sched_base_slice=3ms,
nr_cpu=42). We can clearly see that playing with custom slices can
actually incur unfair CPU bandwidth allocation: 10706, whose request
length is 0.1ms, gets more CPU time as well as better latency than
10705. (Note you might see it the other way around on different
machines, but the allocation inaccuracy remains, and even top can show
you the noticeable difference in CPU utilization with its per-second
reporting.) This is obviously not what we want, because it would mess
up the nice system and fairness would not hold.

			stress-ng-cpu:10705	stress-ng-cpu:10706
---------------------------------------------------------------------
Slices(ms)		100			0.1
Runtime(ms)		4934.206		5025.048
Switches		58			67
Average delay(ms)	87.074			73.863
Maximum delay(ms)	101.998			101.010

In contrast, using sysctl_sched_base_slice as the size of a 'quantum'
in this patch gives us better control of the allocation accuracy and
the average latency:

			stress-ng-cpu:10584	stress-ng-cpu:10583
---------------------------------------------------------------------
Slices(ms)		100			0.1
Runtime(ms)		4980.309		4981.356
Switches		1253			1254
Average delay(ms)	3.990			3.990
Maximum delay(ms)	5.001			4.014

Furthermore, with sysctl_sched_base_slice = 10ms, we might benefit from
fewer switches at the cost of worse delay:

			stress-ng-cpu:11208	stress-ng-cpu:11207
---------------------------------------------------------------------
Slices(ms)		100			0.1
Runtime(ms)		4983.722		4977.035
Switches		456			456
Average delay(ms)	10.963			10.939
Maximum delay(ms)	19.002			21.001

By tuning the sysctl_sched_base_slice knob, we can strike a good
balance between throughput and latency by adjusting the frequency of
context switches, and the conclusions are much closer to what's covered
in [1] with its explicit definition of a time quantum. It also gives
more freedom in choosing the eligible request length range (either
through a nice value or a raw value) without worrying too much about
overscheduling or underscheduling.

Note this change should introduce no obvious regression, because all
processes still have the same request length, sysctl_sched_base_slice,
as in the status quo. The results of the benchmarks show this as well.

schbench -m2 -F128 -n10 -r90       w/patch    tip/6.7-rc7
Wakeup  (usec): 99.0th:            3028       95
Request (usec): 99.0th:            14992      21984
RPS    (count): 50.0th:            5864       5848

hackbench -s 512 -l 200 -f 25 -P   w/patch    tip/6.7-rc7
-g 10                              0.212      0.223
-g 20                              0.415      0.432
-g 30                              0.625      0.639
-g 40                              0.852      0.858

[1]: https://dl.acm.org/doi/10.5555/890606
[2]: https://lore.kernel.org/all/20230420150537.GC4253@hirez.programming.kicks-ass.net/T/#u

Signed-off-by: Ze Gao <zegao@tencent.com>
---

Hi Peter,

I've been trying to figure out how EEVDF works and how the idea of
latency-nice would fit into it in the future.

After reading [1], the code, and all the discussions you have had, I
found that the current implementation deliberately does not embrace
the concept of 'time quanta' mentioned in the paper (see [2]), and I
see some likely risks (or not?) if we bring in custom slice support
(raw value or latency-nice) without one.

Getting my hands dirty gave me some experimental results, which show
that user-specified slices can actually hurt fairness.

So I decided to get involved and propose this patch, which explicitly
uses the tunable knob sysctl_sched_base_slice as the time quantum. The
benchmarks show no regression, as expected.

Still, this is just an immature idea and there are probably things I am
blind to or have overlooked. IOW, I'm unsure whether it is a real
problem at all. I hope to get some sage insights from you.

Regards,
                Ze
 kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++------------
 1 file changed, 35 insertions(+), 12 deletions(-)
  

Comments

Vishal Chourasia Jan. 23, 2024, 12:42 p.m. UTC | #1
On Thu, Jan 11, 2024 at 06:57:46AM -0500, Ze Gao wrote:
> For example, take two CPU-bound processes with the same weight, bind
> them to the same CPU, and let process A request 100ms whereas B
> requests 0.1ms each time (with HZ=1000, sysctl_sched_base_slice=3ms,
> nr_cpu=42). We can clearly see that playing with custom slices can
> actually incur unfair CPU bandwidth allocation: 10706, whose request
> length is 0.1ms, gets more CPU time as well as better latency than
> 10705. (Note you might see it the other way around on different
> machines, but the allocation inaccuracy remains, and even top can show
> you the noticeable difference in CPU utilization with its per-second
> reporting.) This is obviously not what we want, because it would mess
> up the nice system and fairness would not hold.

Hi, how are you setting the custom request values for processes A and B?

  
Ze Gao Jan. 24, 2024, 2:32 a.m. UTC | #2
On Tue, Jan 23, 2024 at 8:42 PM Vishal Chourasia <vishalc@linux.ibm.com> wrote:
> Hi, How are you setting custom request values for process A and B?

I cherry-picked Peter's commit [1] and added a SCHED_QUANTA feature control
for testing with and without my patch. You can check out [2] to see how it works.

The userspace part that sets/gets the slice per process looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <sched.h>            /* Definition of SCHED_* constants */
#include <sys/syscall.h>      /* Definition of SYS_* constants */
#include <unistd.h>
#include <linux/sched/types.h>
/*
int syscall(SYS_sched_setattr, pid_t pid, struct sched_attr *attr,
                unsigned int flags);
int syscall(SYS_sched_getattr, pid_t pid, struct sched_attr *attr,
                unsigned int size, unsigned int flags);
*/

/* Usage: set_slice pid [slice], with slice given in microseconds. */
int main(int argc, char *argv[])
{
        int pid, slice = 0;
        int ecode = 0;
        struct sched_attr attr = {0};

        if (argc < 2) {
                printf("please specify pid [slice]\n");
                ecode = -1;
                goto out;
        }
        pid = atoi(argv[1]);
        if (!pid || pid == 1) {
                printf("pid %d is not valid\n", pid);
                ecode = -1;
                goto out;
        }

        if (argc >= 3)
                slice = atoi(argv[2]);

        if (slice) {
                /* Accept slices between 100us and 100ms only. */
                if (slice < 100 || slice > 100000) {
                        printf("slice %d[us] is not valid\n", slice);
                        ecode = -1;
                        goto out;
                }
                /* Convert microseconds to nanoseconds for sched_runtime. */
                attr.sched_runtime = (__u64)slice * 1000;
                ecode = syscall(SYS_sched_setattr, pid, &attr, 0);
                if (ecode) {
                        printf("change pid %d failed\n", pid);
                } else {
                        printf("change pid %d succeed\n", pid);
                }
        }

        /* Read the slice back and report it in microseconds. */
        ecode = syscall(SYS_sched_getattr, pid, &attr,
                        sizeof(struct sched_attr), 0);
        if (!ecode) {
                /* sched_runtime is a __u64, so print it as unsigned long long. */
                printf("pid: %d slice: %llu\n", pid,
                       (unsigned long long)attr.sched_runtime / 1000);
        } else {
                printf("pid: %d getattr failed\n", pid);
        }
out:
        return ecode;
}

Note: here I use microseconds as my time units for convenience.

And the tests run like this:


#!/bin/bash

test() {

        echo -e "-----------------------------------------\n"
        pkill stress-ng

        sleep 1

        taskset -c 1 stress-ng -c 1  &
        ./set_slice $! 100
        taskset -c 1 stress-ng -c 1  &
        ./set_slice $! 100000

        perf sched record -- sleep 10
        perf sched latency -p -C 1
        echo -e "-----------------------------------------\n"

}

echo NO_SCHED_QUANTA > /sys/kernel/debug/sched/features
test
sleep 2
echo SCHED_QUANTA > /sys/kernel/debug/sched/features
test


[1]: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/kernel/sched?h=sched/eevdf&id=98866150f92f268a2f08eb1d884de9677eb4ec8f
[2]: https://github.com/zegao96/linux/tree/sched-eevdf


Regards,
        -- Ze

  
Vishal Chourasia Feb. 2, 2024, 11:50 a.m. UTC | #3
On Wed, Jan 24, 2024 at 10:32:08AM +0800, Ze Gao wrote:
> > Hi, How are you setting custom request values for process A and B?
> 
> I cherry-picked peter's commit[1], and adds a SCHED_QUANTA feature control
> for testing w/o my patch.  You can check out [2] to see how it works.
> 
Thank you for sharing your setup.

I built the kernel according to [2], keeping v6.8.0-rc1 as the base.

// NO_SCHED_QUANTA
# perf script -i perf.data.old  -s perf-latency.py
PID 355045: Average Delta = 87.72726154385964 ms, Max Delta = 110.015044 ms, Count = 57
PID 355044: Average Delta = 92.2655679245283 ms, Max Delta = 110.017182 ms, Count = 53

// SCHED_QUANTA
# perf script -i perf.data  -s perf-latency.py
PID 355065: Average Delta = 10.00 ms, Max Delta = 10.012708 ms, Count = 500
PID 355064: Average Delta = 9.959 ms, Max Delta = 10.023588 ms, Count = 501

#  cat /sys/kernel/debug/sched/base_slice_ns
3000000

The base slice is not being enforced.

Next, looking closely at the perf.data file:

# perf script -i perf.data -C 1 | grep switch
..
 stress-ng-cpu 355064 [001] 776706.003222:       sched:sched_switch: stress-ng-cpu:355064 [120] R ==> stress-ng-cpu:355065 [120]
 stress-ng-cpu 355065 [001] 776706.013218:       sched:sched_switch: stress-ng-cpu:355065 [120] R ==> stress-ng-cpu:355064 [120]
 stress-ng-cpu 355064 [001] 776706.023218:       sched:sched_switch: stress-ng-cpu:355064 [120] R ==> stress-ng-cpu:355065 [120]
 stress-ng-cpu 355065 [001] 776706.033218:       sched:sched_switch: stress-ng-cpu:355065 [120] R ==> stress-ng-cpu:355064 [120]
..

The delta wait time is approx. 0.01s, or 10ms.
So the switch is not happening at the base_slice_ns boundary.

But why? Is it possible that base_slice_ns is not properly used on
arch != x86?

  
Ze Gao Feb. 4, 2024, 3:05 a.m. UTC | #4
On Fri, Feb 2, 2024 at 7:50 PM Vishal Chourasia <vishalc@linux.ibm.com> wrote:
>
> Delta wait time is approx 0.01s or 10ms

You can check your HZ, which, at my best guess, should be 100 in your
configuration. That would explain your results.

> So, switch is not happening at base_slice_ns boundary.
>
> But why? is it possible base_slice_ns is not properly used in
> arch != x86 ?

The thing is, in my RFC the effective quantum is actually

   max_t(u64, TICK_NSEC, sysctl_sched_base_slice)

where sysctl_sched_base_slice is simply a handy tunable knob
for users (maybe I should make that more loud and clear).

See what I do in update_entity_lag() and you will understand.

Note we have three time-related concepts here:
1. TIME TICK: the (scheduler) accounting time unit.
2. TIME QUANTA (not necessarily the effective one): the scheduling time unit.
3. USER SLICE: the time slice per request.

To implement latency-nice while being as fair as possible, we must
carefully consider the size relationship between them, especially
the value range of USER SLICE, due to the cold fact that the lag
(unfairness) is literally subject to both the time quantum and the
user-requested slices.


Regards,
        -- Ze

  
Vishal Chourasia Feb. 5, 2024, 7:37 a.m. UTC | #5
On Sun, Feb 04, 2024 at 11:05:22AM +0800, Ze Gao wrote:
> On Fri, Feb 2, 2024 at 7:50 PM Vishal Chourasia <vishalc@linux.ibm.com> wrote:
> >
> > Delta wait time is approx 0.01s or 10ms
> 
> You can check out your HZ, which should be 100 in your settings
> in my best guess.That explains your results.
Yes. How much is it in your case, if I may ask?
> 
> > So, switch is not happening at base_slice_ns boundary.
> >
> > But why? is it possible base_slice_ns is not properly used in
> > arch != x86 ?
> 
> The thing is  in my RFC the effective quanta is actually
> 
>    max_t(u64, TICK_NSEC, sysctl_sched_base_slice)
> 
> where sysctl_sched_base_slice is precisely a handy tunable knob
> for users ( maybe i should make it loud and clear more ).
> 
> See what I do in update_entity_lag(), you will understand.
Thanks. I will look into it.
> 
> Note we have 3 time related concepts here:
> 1. TIME TICK: (schedule) accounting time unit.
> 2. TIME QUANTA (not necessarily the effective one): scheduling time unit
> 3. USER SLICE: time slice per request
To double check,
User slice is the request size submitted by a competing task for the time-shared resource (here,
processor) against other competing tasks.

Scheduler allocates time-shared resource (here, processor) in `q` quantum
which is our TIME QUANTA

TIME TICK is time period between two scheduler ticks.

Thanks, 
 -- vishal.c
> 
> To implement latency-nice while being as fair as possible, we must
> carefully consider the size relationship between them, and especially
> the value range of USER SLICE, because the lag (unfairness) is
> subject to both the time quantum and the user-requested slices.
> 
> 
> Regards,
>         -- Ze
> 
> > >
> > > echo NO_SCHED_QUANTA > /sys/kernel/debug/sched/features
> > > test
> > > sleep 2
> > > echo SCHED_QUANTA > /sys/kernel/debug/sched/features
> > > test
> > >
> > >
> > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/kernel/sched?h=sched/eevdf&id=98866150f92f268a2f08eb1d884de9677eb4ec8f
> > > [2]: https://github.com/zegao96/linux/tree/sched-eevdf
> > >
> > >
> > > Regards,
> > >         -- Ze
> > >
> > > > >
> > > > >                       stress-ng-cpu:10705     stress-ng-cpu:10706
> > > > > ---------------------------------------------------------------------
> > > > > Slices(ms)            100                     0.1
> > > > > Runtime(ms)           4934.206                5025.048
> > > > > Switches              58                      67
> > > > > Average delay(ms)     87.074                  73.863
> > > > > Maximum delay(ms)     101.998                 101.010
> > > > >
> > > > > In contrast, using sysctl_sched_base_slice as the size of a 'quantum'
> > > > > in this patch gives us a better control of the allocation accuracy and
> > > > > the avg latency:
> > > > >
> > > > >                       stress-ng-cpu:10584     stress-ng-cpu:10583
> > > > > ---------------------------------------------------------------------
> > > > > Slices(ms)            100                     0.1
> > > > > Runtime(ms)           4980.309                4981.356
> > > > > Switches              1253                    1254
> > > > > Average delay(ms)     3.990                   3.990
> > > > > Maximum delay(ms)     5.001                   4.014
> > > > >
> > > > > > Furthermore, with sysctl_sched_base_slice = 10ms, we might benefit from
> > > > > > fewer switches at the cost of worse delay:
> > > > >
> > > > >                       stress-ng-cpu:11208     stress-ng-cpu:11207
> > > > > ---------------------------------------------------------------------
> > > > > Slices(ms)            100                     0.1
> > > > > Runtime(ms)           4983.722                4977.035
> > > > > Switches              456                     456
> > > > > Average delay(ms)     10.963                  10.939
> > > > > Maximum delay(ms)     19.002                  21.001
> > > > >
> > > > > > By being able to tune the sysctl_sched_base_slice knob, we can
> > > > > > strike a good balance between throughput and latency by
> > > > > > adjusting the frequency of context switches, and the conclusions are
> > > > > > much closer to what's covered in [1] with the explicit definition of
> > > > > > a time quantum. It also gives more freedom to choose the eligible
> > > > > > request length range (either through nice value or raw value)
> > > > > > without worrying about overscheduling or underscheduling too much.
> > > > >
> > > > > Note this change should introduce no obvious regression because all
> > > > > processes have the same request length as sysctl_sched_base_slice as
> > > > > in the status quo. And the result of benchmarks proves this as well.
> > > > >
> > > > > > schbench -m2 -F128 -n10 -r90          w/patch   tip/6.7-rc7
> > > > > > Wakeup  (usec): 99.0th:               3028      95
> > > > > > Request (usec): 99.0th:               14992     21984
> > > > > > RPS    (count): 50.0th:               5864      5848
> > > > > >
> > > > > > hackbench -s 512 -l 200 -f 25 -P      w/patch   tip/6.7-rc7
> > > > > > -g 10                                 0.212     0.223
> > > > > > -g 20                                 0.415     0.432
> > > > > > -g 30                                 0.625     0.639
> > > > > > -g 40                                 0.852     0.858
> > > > >
> > > > > [1]: https://dl.acm.org/doi/10.5555/890606
> > > > > [2]: https://lore.kernel.org/all/20230420150537.GC4253@hirez.programming.kicks-ass.net/T/#u
> > > > >
> > > > > Signed-off-by: Ze Gao <zegao@tencent.com>
> > > > > ---
> > > >
> > >
  
Ze Gao Feb. 6, 2024, 7:50 a.m. UTC | #6
On Mon, Feb 5, 2024 at 3:37 PM Vishal Chourasia <vishalc@linux.ibm.com> wrote:
>
> On Sun, Feb 04, 2024 at 11:05:22AM +0800, Ze Gao wrote:
> > On Fri, Feb 2, 2024 at 7:50 PM Vishal Chourasia <vishalc@linux.ibm.com> wrote:
> > >
> > > On Wed, Jan 24, 2024 at 10:32:08AM +0800, Ze Gao wrote:
> > > > > Hi, How are you setting custom request values for process A and B?
> > > >
> > > > I cherry-picked Peter's commit [1], and added a SCHED_QUANTA feature control
> > > > for testing w/o my patch.  You can check out [2] to see how it works.
> > > >
> > > Thank you for sharing your setup.
> > >
> > > Built the kernel according to [2] keeping v6.8.0-rc1 as base
> > >
> > > // NO_SCHED_QUANTA
> > > # perf script -i perf.data.old  -s perf-latency.py
> > > PID 355045: Average Delta = 87.72726154385964 ms, Max Delta = 110.015044 ms, Count = 57
> > > PID 355044: Average Delta = 92.2655679245283 ms, Max Delta = 110.017182 ms, Count = 53
> > >
> > > // SCHED_QUANTA
> > > # perf script -i perf.data  -s perf-latency.py
> > > PID 355065: Average Delta = 10.00 ms, Max Delta = 10.012708 ms, Count = 500
> > > PID 355064: Average Delta = 9.959 ms, Max Delta = 10.023588 ms, Count = 501
> > >
> > > #  cat /sys/kernel/debug/sched/base_slice_ns
> > > 3000000
> > >
> > > base slice is not being enforced.
> > >
> > > Next, looking closely at the perf.data file
> > >
> > > # perf script -i perf.data -C 1 | grep switch
> > > ...
> > >  stress-ng-cpu 355064 [001] 776706.003222:       sched:sched_switch: stress-ng-cpu:355064 [120] R ==> stress-ng-cpu:355065 [120]
> > >  stress-ng-cpu 355065 [001] 776706.013218:       sched:sched_switch: stress-ng-cpu:355065 [120] R ==> stress-ng-cpu:355064 [120]
> > >  stress-ng-cpu 355064 [001] 776706.023218:       sched:sched_switch: stress-ng-cpu:355064 [120] R ==> stress-ng-cpu:355065 [120]
> > >  stress-ng-cpu 355065 [001] 776706.033218:       sched:sched_switch: stress-ng-cpu:355065 [120] R ==> stress-ng-cpu:355064 [120]
> > > ...
> > >
> > > Delta wait time is approx 0.01s or 10ms
> >
> > You can check your HZ, which I would guess is 100 in your
> > setup. That explains your results.
> Yes. How much is it in your case? If I may ask.

Like I mentioned in the changelog: with HZ=1000, sysctl_sched_base_slice=3ms,
nr_cpu=42.
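
To make the HZ dependence concrete, here is a rough user-space sketch (not kernel code) of how the effective quantum follows from max_t(u64, TICK_NSEC, sysctl_sched_base_slice). The helper names tick_nsec() and effective_quantum_ns() are made up for illustration, and the tick length is simplified to NSEC_PER_SEC / HZ, which is an assumption rather than the exact kernel definition.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for TICK_NSEC: length of one scheduler tick in ns. */
static uint64_t tick_nsec(unsigned int hz)
{
	assert(hz > 0);
	return NSEC_PER_SEC / hz;
}

/*
 * Hypothetical helper mirroring
 *   max_t(u64, TICK_NSEC, sysctl_sched_base_slice)
 * i.e. the larger of the tick length and the base slice.
 */
static uint64_t effective_quantum_ns(unsigned int hz, uint64_t base_slice_ns)
{
	uint64_t tick = tick_nsec(hz);

	return tick > base_slice_ns ? tick : base_slice_ns;
}
```

Under these assumptions, effective_quantum_ns(100, 3000000) gives 10000000 (the 10ms tick wins), while effective_quantum_ns(1000, 3000000) gives 3000000 (the 3ms base slice wins), which would match a 10ms switch interval on an HZ=100 build.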

> > > So, switch is not happening at base_slice_ns boundary.
> > >
> > > But why? is it possible base_slice_ns is not properly used in
> > > arch != x86 ?
> >
> > The thing is, in my RFC the effective quantum is actually
> >
> >    max_t(u64, TICK_NSEC, sysctl_sched_base_slice)
> >
> > where sysctl_sched_base_slice is simply a handy tunable knob
> > for users (maybe I should state this more clearly).
> >
> > See what I do in update_entity_lag(), you will understand.
> Thanks. I will look into it.
> >
> > Note we have 3 time related concepts here:
> > 1. TIME TICK: (schedule) accounting time unit.
> > 2. TIME QUANTA (not necessarily the effective one): scheduling time unit
> > 3. USER SLICE: time slice per request
> To double check,
> User slice is the request size submitted by a competing task for the time-shared resource (here,
> processor) against other competing tasks.
> Scheduler allocates time-shared resource (here, processor) in `q` quantum
> which is our TIME QUANTA
> TIME TICK is time period between two scheduler ticks.

Yeah, that is how I see them.

Note we don't necessarily allocate time quanta continuously to fulfil a user's
request.

To quote from the paper, "by decoupling the request size from the size of a time
quantum, ... gives a client possibility of trading between allocation accuracy and
scheduling overhead". This is the very reason why this patch proposes to bring
the concept of a time quantum into existence.
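
As a rough illustration of that decoupling, the sketch below counts how many quanta a single request spans. It is a simplified model only: it ignores weights, tick granularity and wakeups, and quanta_per_request() is a hypothetical helper, not something in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of quanta of length quantum_ns needed to
 * cover one request of length request_ns (ceiling division).
 */
static uint64_t quanta_per_request(uint64_t request_ns, uint64_t quantum_ns)
{
	assert(quantum_ns > 0);
	return (request_ns + quantum_ns - 1) / quantum_ns;
}
```

With a 3ms quantum, a 100ms request spans 34 quanta, and the scheduler gets a chance to re-evaluate after each of them, whereas a 0.1ms request fits inside a single quantum.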

Cheers,
        -- Ze

> Thanks,
>  -- vishal.c
> >
> > To implement latency-nice while being as fair as possible, we must
> > carefully consider the size relationship between them, and especially
> > the value range of USER SLICE, because the lag (unfairness) is
> > subject to both the time quantum and the user-requested slices.
> >
> >
> > Regards,
> >         -- Ze
> >
> > > >
> > > > echo NO_SCHED_QUANTA > /sys/kernel/debug/sched/features
> > > > test
> > > > sleep 2
> > > > echo SCHED_QUANTA > /sys/kernel/debug/sched/features
> > > > test
> > > >
> > > >
> > > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/kernel/sched?h=sched/eevdf&id=98866150f92f268a2f08eb1d884de9677eb4ec8f
> > > > [2]: https://github.com/zegao96/linux/tree/sched-eevdf
> > > >
> > > >
> > > > Regards,
> > > >         -- Ze
> > > >
> > > > > >
> > > > > >                       stress-ng-cpu:10705     stress-ng-cpu:10706
> > > > > > ---------------------------------------------------------------------
> > > > > > Slices(ms)            100                     0.1
> > > > > > Runtime(ms)           4934.206                5025.048
> > > > > > Switches              58                      67
> > > > > > Average delay(ms)     87.074                  73.863
> > > > > > Maximum delay(ms)     101.998                 101.010
> > > > > >
> > > > > > In contrast, using sysctl_sched_base_slice as the size of a 'quantum'
> > > > > > in this patch gives us a better control of the allocation accuracy and
> > > > > > the avg latency:
> > > > > >
> > > > > >                       stress-ng-cpu:10584     stress-ng-cpu:10583
> > > > > > ---------------------------------------------------------------------
> > > > > > Slices(ms)            100                     0.1
> > > > > > Runtime(ms)           4980.309                4981.356
> > > > > > Switches              1253                    1254
> > > > > > Average delay(ms)     3.990                   3.990
> > > > > > Maximum delay(ms)     5.001                   4.014
> > > > > >
> > > > > > > Furthermore, with sysctl_sched_base_slice = 10ms, we might benefit from
> > > > > > > fewer switches at the cost of worse delay:
> > > > > >
> > > > > >                       stress-ng-cpu:11208     stress-ng-cpu:11207
> > > > > > ---------------------------------------------------------------------
> > > > > > Slices(ms)            100                     0.1
> > > > > > Runtime(ms)           4983.722                4977.035
> > > > > > Switches              456                     456
> > > > > > Average delay(ms)     10.963                  10.939
> > > > > > Maximum delay(ms)     19.002                  21.001
> > > > > >
> > > > > > > By being able to tune the sysctl_sched_base_slice knob, we can
> > > > > > > strike a good balance between throughput and latency by
> > > > > > > adjusting the frequency of context switches, and the conclusions are
> > > > > > > much closer to what's covered in [1] with the explicit definition of
> > > > > > > a time quantum. It also gives more freedom to choose the eligible
> > > > > > > request length range (either through nice value or raw value)
> > > > > > > without worrying about overscheduling or underscheduling too much.
> > > > > >
> > > > > > Note this change should introduce no obvious regression because all
> > > > > > processes have the same request length as sysctl_sched_base_slice as
> > > > > > in the status quo. And the result of benchmarks proves this as well.
> > > > > >
> > > > > > > schbench -m2 -F128 -n10 -r90          w/patch   tip/6.7-rc7
> > > > > > > Wakeup  (usec): 99.0th:               3028      95
> > > > > > > Request (usec): 99.0th:               14992     21984
> > > > > > > RPS    (count): 50.0th:               5864      5848
> > > > > > >
> > > > > > > hackbench -s 512 -l 200 -f 25 -P      w/patch   tip/6.7-rc7
> > > > > > > -g 10                                 0.212     0.223
> > > > > > > -g 20                                 0.415     0.432
> > > > > > > -g 30                                 0.625     0.639
> > > > > > > -g 40                                 0.852     0.858
> > > > > >
> > > > > > [1]: https://dl.acm.org/doi/10.5555/890606
> > > > > > [2]: https://lore.kernel.org/all/20230420150537.GC4253@hirez.programming.kicks-ass.net/T/#u
> > > > > >
> > > > > > Signed-off-by: Ze Gao <zegao@tencent.com>
> > > > > > ---
> > > > >
> > > >
  
Luis Machado Feb. 6, 2024, 1:09 p.m. UTC | #7
Hi,

On 1/11/24 11:57, Ze Gao wrote:
> AFAIS, we've overlooked the role that the concept of a time quantum
> plays in EEVDF. According to Theorem 1 in [1], we have
>
>       -r_max < lag_k(t) < max(r_max, q)
>
> Clearly we don't want either r_max (the maximum user request) or q (the
> time quantum) to be too big.
>
> To trade for throughput, [2] chooses to do tick preemption at
> per-request boundaries (i.e., once a certain request is fulfilled), which
> means we literally have no concept of a time quantum defined anymore.
> Obviously this is no problem if we make
>
>       q = r_i = sysctl_sched_base_slice
>
> exactly as we have now, which actually creates an implicit
> quantum for us and works well.
>
> However, with custom slices being possible, the lag bound is subject
> only to the distribution of user-requested slices, given that no
> time quantum is available now, and we would pay the cost of losing
> many scheduling opportunities to maintain fairness and responsiveness
> due to [2]. What's worse, we may suffer unexpected unfairness and
> latency.
>
> For example, take two CPU-bound processes with the same weight and bind
> them to the same CPU, and let process A request 100ms whereas B
> requests 0.1ms each time (with HZ=1000, sysctl_sched_base_slice=3ms,
> nr_cpu=42). We can clearly see that playing with custom slices can
> actually incur unfair CPU bandwidth allocation (10706, whose request
> length is 0.1ms, gets more CPU time as well as better latency than
> 10705. Note you might see the other way around on different machines, but
> the allocation inaccuracy remains, and even top can show you the
> noticeable difference in CPU utilization in its per-second reporting), which
> is obviously not what we want because that would mess up the nice system
> and fairness would not hold.
>
>                       stress-ng-cpu:10705     stress-ng-cpu:10706
> ---------------------------------------------------------------------
> Slices(ms)            100                     0.1
> Runtime(ms)           4934.206                5025.048
> Switches              58                      67
> Average delay(ms)     87.074                  73.863
> Maximum delay(ms)     101.998                 101.010
>
> In contrast, using sysctl_sched_base_slice as the size of a 'quantum'
> in this patch gives us a better control of the allocation accuracy and
> the avg latency:
>
>                       stress-ng-cpu:10584     stress-ng-cpu:10583
> ---------------------------------------------------------------------
> Slices(ms)            100                     0.1
> Runtime(ms)           4980.309                4981.356
> Switches              1253                    1254
> Average delay(ms)     3.990                   3.990
> Maximum delay(ms)     5.001                   4.014
>
> Furthermore, with sysctl_sched_base_slice = 10ms, we might benefit from
> fewer switches at the cost of worse delay:
>
>                       stress-ng-cpu:11208     stress-ng-cpu:11207
> ---------------------------------------------------------------------
> Slices(ms)            100                     0.1
> Runtime(ms)           4983.722                4977.035
> Switches              456                     456
> Average delay(ms)     10.963                  10.939
> Maximum delay(ms)     19.002                  21.001

Thanks for the write-up, those are interesting results.

While fairness is re-established (important, no doubt), I'm wondering whether the much larger number of switches is of any concern.

I'm planning on giving this patch a try as well.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  
Ze Gao Feb. 7, 2024, 3:05 a.m. UTC | #8
On Tue, Feb 6, 2024 at 9:09 PM Luis Machado <luis.machado@arm.com> wrote:
>
> Hi,
>
> On 1/11/24 11:57, Ze Gao wrote:
> > AFAIS, we've overlooked the role that the concept of a time quantum
> > plays in EEVDF. According to Theorem 1 in [1], we have
> >
> >       -r_max < lag_k(t) < max(r_max, q)
> >
> > Clearly we don't want either r_max (the maximum user request) or q (the
> > time quantum) to be too big.
> >
> > To trade for throughput, [2] chooses to do tick preemption at
> > per-request boundaries (i.e., once a certain request is fulfilled), which
> > means we literally have no concept of a time quantum defined anymore.
> > Obviously this is no problem if we make
> >
> >       q = r_i = sysctl_sched_base_slice
> >
> > exactly as we have now, which actually creates an implicit
> > quantum for us and works well.
> >
> > However, with custom slices being possible, the lag bound is subject
> > only to the distribution of user-requested slices, given that no
> > time quantum is available now, and we would pay the cost of losing
> > many scheduling opportunities to maintain fairness and responsiveness
> > due to [2]. What's worse, we may suffer unexpected unfairness and
> > latency.
> >
> > For example, take two CPU-bound processes with the same weight and bind
> > them to the same CPU, and let process A request 100ms whereas B
> > requests 0.1ms each time (with HZ=1000, sysctl_sched_base_slice=3ms,
> > nr_cpu=42). We can clearly see that playing with custom slices can
> > actually incur unfair CPU bandwidth allocation (10706, whose request
> > length is 0.1ms, gets more CPU time as well as better latency than
> > 10705. Note you might see the other way around on different machines, but
> > the allocation inaccuracy remains, and even top can show you the
> > noticeable difference in CPU utilization in its per-second reporting), which
> > is obviously not what we want because that would mess up the nice system
> > and fairness would not hold.
> >
> >                       stress-ng-cpu:10705     stress-ng-cpu:10706
> > ---------------------------------------------------------------------
> > Slices(ms)            100                     0.1
> > Runtime(ms)           4934.206                5025.048
> > Switches              58                      67
> > Average delay(ms)     87.074                  73.863
> > Maximum delay(ms)     101.998                 101.010
> >
> > In contrast, using sysctl_sched_base_slice as the size of a 'quantum'
> > in this patch gives us a better control of the allocation accuracy and
> > the avg latency:
> >
> >                       stress-ng-cpu:10584     stress-ng-cpu:10583
> > ---------------------------------------------------------------------
> > Slices(ms)            100                     0.1
> > Runtime(ms)           4980.309                4981.356
> > Switches              1253                    1254
> > Average delay(ms)     3.990                   3.990
> > Maximum delay(ms)     5.001                   4.014
> >
> > Furthermore, with sysctl_sched_base_slice = 10ms, we might benefit from
> > fewer switches at the cost of worse delay:
> >
> >                       stress-ng-cpu:11208     stress-ng-cpu:11207
> > ---------------------------------------------------------------------
> > Slices(ms)            100                     0.1
> > Runtime(ms)           4983.722                4977.035
> > Switches              456                     456
> > Average delay(ms)     10.963                  10.939
> > Maximum delay(ms)     19.002                  21.001
>
> Thanks for the write-up, those are interesting results.
>
> While fairness is re-established (important, no doubt), I'm wondering whether the much larger number of switches is of any concern.

This patch should introduce no change against the status quo (assuming,
of course, that I understand and implement it correctly), as I said in
the changelog, since custom slices are not supported right now.

If we run the same experiment without setting custom slices
(for 10 secs with HZ=1000 and sysctl_sched_base_slice=3ms),
the number of switches is likely to be close to 1253. So we can
conclude that if no regressions are spotted w/o this patch,
there should be none w/ the patch either, if your concern was about
the throughput it might affect.
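
As a back-of-the-envelope check on that switch count, the sketch below divides a task's runtime by the length of one uninterrupted run stretch. It assumes two CPU-bound tasks strictly alternating on one CPU, so that one task's average delay approximates the other's run stretch; estimate_switches() is a hypothetical helper for illustration only.

```c
#include <assert.h>

/*
 * Hypothetical helper: if a task accumulates runtime_ms of CPU time in
 * stretches of about stretch_ms each, it is switched out roughly
 * runtime_ms / stretch_ms times.
 */
static double estimate_switches(double runtime_ms, double stretch_ms)
{
	assert(stretch_ms > 0.0);
	return runtime_ms / stretch_ms;
}
```

Plugging in the numbers from the changelog tables (4981.356 ms of runtime against a 3.990 ms average delay, and 4977.035 ms against 10.963 ms) gives roughly 1248 and 454, close to the reported 1254 and 456 switches.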

> I'm planning on giving this patch a try as well.

Cheers!
        -- Ze


  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..1746b224595b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -694,12 +694,13 @@  u64 avg_vruntime(struct cfs_rq *cfs_rq)
  */
 static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	s64 lag, limit;
+	s64 lag, limit, quanta;
 
 	SCHED_WARN_ON(!se->on_rq);
 	lag = avg_vruntime(cfs_rq) - se->vruntime;
 
-	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+	quanta = max_t(u64, TICK_NSEC, sysctl_sched_base_slice);
+	limit = calc_delta_fair(max_t(u64, 2*se->slice, quanta), se);
 	se->vlag = clamp(lag, -limit, limit);
 }
 
@@ -1003,25 +1004,47 @@  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
  */
 static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if ((s64)(se->vruntime - se->deadline) < 0)
-		return;
+	u64 delta_exec;
 
 	/*
-	 * For EEVDF the virtual time slope is determined by w_i (iow.
-	 * nice) while the request time r_i is determined by
-	 * sysctl_sched_base_slice.
+	 * To allow wakeup preemption to happen in time, we check to
+	 * push deadlines forward by each call.
 	 */
-	se->slice = sysctl_sched_base_slice;
+	if ((s64)(se->vruntime - se->deadline) >= 0) {
+		/*
+		 * For EEVDF the virtual time slope is determined by w_i (iow.
+		 * nice) while the request time r_i is determined by
+		 * sysctl_sched_base_slice.
+		 */
+		se->slice = sysctl_sched_base_slice;
+		/*
+		 * EEVDF: vd_i = ve_i + r_i / w_i
+		 */
+		se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+	}
+	/*
+	 * Make sysctl_sched_base_slice as the size of a 'quantum' in EEVDF
+	 * so as to avoid overscheduling or underscheduling with arbitrary
+	 * request lengths users specify.
+	 *
+	 * IOW, we now change to make scheduling decisions at per
+	 * max(TICK, sysctl_sched_base_slice) boundary.
+	 */
+	delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+	if (delta_exec < sysctl_sched_base_slice)
+		return;
 
 	/*
-	 * EEVDF: vd_i = ve_i + r_i / w_i
+	 * We can come here with TIF_NEED_RESCHED already set from wakeup path.
+	 * Check to see if we can save a call to pick_eevdf if it's set already.
 	 */
-	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+	if (entity_is_task(se) && test_tsk_need_resched(task_of(se)))
+		return;
 
 	/*
-	 * The task has consumed its request, reschedule.
+	 * The task has consumed a quantum, check and reschedule.
 	 */
-	if (cfs_rq->nr_running > 1) {
+	if (cfs_rq->nr_running > 1 && pick_eevdf(cfs_rq) != se) {
 		resched_curr(rq_of(cfs_rq));
 		clear_buddies(cfs_rq, se);
 	}