[v7,0/9] Add latency priority for CFS class

Message ID 20221028093403.6673-1-vincent.guittot@linaro.org
Series: Add latency priority for CFS class

Vincent Guittot Oct. 28, 2022, 9:33 a.m. UTC
This patchset restarts the work on adding a latency priority to describe
the latency tolerance of CFS tasks.

Patch [1] is new and was added in v6. It fixes unfairness for
low-priority tasks caused by wakeup_gran() being larger than the maximum
vruntime credit that a waking task can keep after sleeping.

The patches [2-4] have been done by Parth:
https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com/

I have just rebased them and moved the setting of the latency priority
outside the priority update. I have removed the Reviewed-by tags because
the patches are 2 years old.

This aims to be a generic interface, and the following patches are one use
of it to improve the scheduling latency of CFS tasks.

Patch [5] uses the latency nice priority to define a latency offset
and then to decide whether a CFS task can or should preempt the currently
running task. The patch gives some test results with cyclictest and
hackbench to highlight the benefit of latency priority for short
interactive tasks or long CPU-intensive tasks.

Patch [6] adds support for the latency nice priority in task groups by
adding a cpu.latency.nice field. The range is [-20:19], as for setting a
task's latency priority.

Patch [7] makes sched core take the latency offset into account.

Patch [8] adds an rb tree to cover some corner cases where a
latency-sensitive task (priority < 0) is preempted by a high-priority task
(RT/DL) or fails to preempt other tasks. This patch ensures that such
tasks run first for at least a slice of sched_min_granularity at wakeup.

Patch [9] removes a check made useless by the latency rb tree.

I have also backported the patchset on a dragonboard RB3 with an android
mainline kernel based on v5.18 for a quick test. I have used the
TouchLatency app, which is part of AOSP and described as a very good
test for highlighting jitter and jank frame sources in a system [1].
In addition to the app, I added some short-running tasks waking up
regularly (using the 8 CPUs for 4 ms every 37777 us) to stress the system
without overloading it (and with EAS disabled). The first results show that
the patchset helps to reduce missed-deadline frames from 5% to less than
0.1% when the cpu.latency.nice of the task groups is set. I haven't rerun
the test with the latest version.

I have also tested the patchset with the modified version of the alsa
latency test that was shared by Tim. The test quickly hits xruns with the
default latency nice priority 0 but is able to run without underruns with
latency nice -20 while hackbench runs simultaneously.

[1] https://source.android.com/docs/core/debug/eval_perf#touchlatency

Change since v6:
- Fix compilation error for !CONFIG_SCHED_DEBUG

Change since v5:
- Add patch 1 to fix unfairness for low-priority tasks. This was
  discovered while studying Youssef's test results with latency nice,
  which were hitting the same problem.
- Fixed the latency_offset computation to take into account
  GENTLE_FAIR_SLEEPERS. This had disappeared in v2 and was raised by
  Youssef's tests.
- Reworked and optimized how latency_offset is used to check for
  preempting the current task at wakeup and tick. This covers more cases
  too.
- Add patch 9 to remove check_preempt_from_others(), which is not needed
  anymore with the rb tree.

Change since v4:
- Removed permission checks for setting the latency priority. This enables
  users without elevated privileges, like audio applications, to set their
  latency priority, as requested by Tim.
- Removed cpu.latency and replaced it with cpu.latency.nice so we keep a
  generic interface not tied to latency_offset, which can be used to
  implement other latency features.
- Added an entry in Documentation/admin-guide/cgroup-v2.rst to describe
  cpu.latency.nice.
- Fixed some typos.

Change since v3:
- Fix 2 compilation warnings raised by kernel test robot <lkp@intel.com>

Change since v2:
- Set a latency_offset field instead of saving a weight and computing it
  on the fly.
- Make latency_offset available for task groups: cpu.latency
- Fix some corner cases to make latency-sensitive tasks schedule first and
  add an rb tree for latency-sensitive tasks.

Change since v1:
- fixed typos
- moved some code into the right patch to make bisect happy
- simplified and fixed how the weight is computed
- added support of sched core (patch 7)

Parth Shah (3):
  sched: Introduce latency-nice as a per-task attribute
  sched/core: Propagate parent task's latency requirements to the child
    task
  sched: Allow sched_{get,set}attr to change latency_nice of the task

Vincent Guittot (6):
  sched/fair: fix unfairness at wakeup
  sched/fair: Take into account latency priority at wakeup
  sched/fair: Add sched group latency support
  sched/core: Support latency priority with sched core
  sched/fair: Add latency list
  sched/fair: remove check_preempt_from_others

 Documentation/admin-guide/cgroup-v2.rst |   8 +
 include/linux/sched.h                   |   5 +
 include/uapi/linux/sched.h              |   4 +-
 include/uapi/linux/sched/types.h        |  19 +++
 init/init_task.c                        |   1 +
 kernel/sched/core.c                     | 105 ++++++++++++
 kernel/sched/debug.c                    |   1 +
 kernel/sched/fair.c                     | 210 ++++++++++++++++++++----
 kernel/sched/sched.h                    |  65 +++++++-
 tools/include/uapi/linux/sched.h        |   4 +-
 10 files changed, 386 insertions(+), 36 deletions(-)
  

Comments

Shrikanth Hegde Nov. 13, 2022, 7:56 a.m. UTC | #1
> This patchset restarts the work about adding a latency priority to describe
> the latency tolerance of cfs tasks.
>
>
Hi Vincent.
  
Tested the patches on a power10 machine. It is an 80-core system with SMT=8,
i.e. a total of 640 CPUs. On a large workload which mainly interacts with a
database, there is a minor improvement of 3-5%.

The method followed is creating a cgroup, assigning a latency nice value of
-20, -10 or 0, and adding the tasks to the cgroup's procs. Outside the
cgroup, a stress-ng load is running with no latency value set:
stress-ng --cpu=768 -l 50

With micro-benchmarks such as hackbench, the values are more or less the
same; for a large process pool of 60, there is a 10% improvement. schbench
tail latencies show significant improvement at low and medium load up to
256 groups; only 512 groups show a slight decline.

Hackbench (Iterations or N=50)
Process             6.1_Base        6.1_Latency_Nice
10                      0.13            0.14
20                      0.18            0.18
30                      0.24            0.25
40                      0.34            0.33
50                      0.40            0.41
60                      0.53            0.49

schbench (Iterations or N=5)

Groups: 1
                     6.1_Base        6.1_Latency_Nice
50.0th:                 10.8             9.8
75.0th:                 12.4            11.4
90.0th:                 14.2            13.2
95.0th:                 15.6            14.6
99.0th:                 27.8            19.0
99.5th:                 38.0            21.6
99.9th:                 66.2            25.4

Groups: 2
                     6.1_Base        6.1_Latency_Nice
50.0th:                 11.2            10.8
75.0th:                 13.2            12.4
90.0th:                 15.0            15.0
95.0th:                 16.6            16.6
99.0th:                 22.4            22.8
99.5th:                 23.8            27.8
99.9th:                 30.2            45.6

Groups: 4
                     6.1_Base        6.1_Latency_Nice
50.0th:                 13.8            11.2
75.0th:                 16.0            13.2
90.0th:                 18.6            15.2
95.0th:                 20.4            16.6
99.0th:                 28.8            21.6
99.5th:                 48.8            25.2
99.9th:                900.2            47.0

Groups: 8
                     6.1_Base        6.1_Latency_Nice
50.0th:                 17.8            14.4
75.0th:                 21.8            17.2
90.0th:                 25.4            20.4
95.0th:                 28.0            22.4
99.0th:                 52.8            28.4
99.5th:                156.4            32.6
99.9th:               1990.2            52.0

Groups: 16
                     6.1_Base        6.1_Latency_Nice
50.0th:                 26.0            21.0
75.0th:                 33.0            27.8
90.0th:                 39.6            34.4
95.0th:                 43.4            38.6
99.0th:                 66.8            48.8
99.5th:                170.6            60.6
99.9th:               3308.8           201.6

Groups: 32
                     6.1_Base        6.1_Latency_Nice
50.0th:                 40.8            38.6
75.0th:                 55.4            52.8
90.0th:                 67.0            64.2
95.0th:                 74.2            71.6
99.0th:                106.0            90.0
99.5th:                323.8           133.0
99.9th:               4789.6           459.2

Groups: 64
                     6.1_Base        6.1_Latency_Nice
50.0th:                 72.6            68.2
75.0th:                103.4            97.8
90.0th:                127.6           120.0
95.0th:                141.2           132.0
99.0th:                343.4           158.4
99.5th:               1609.0           180.8
99.9th:               6571.2           686.6

Groups: 128
                     6.1_Base        6.1_Latency_Nice
50.0th:                147.2           147.2
75.0th:                216.4           217.2
90.0th:                268.4           268.2
95.0th:                300.6           294.8
99.0th:               3500.0           638.6
99.5th:               5995.2          2522.8
99.9th:              10390.4          9451.2

Groups: 256
                     6.1_Base        6.1_Latency_Nice
50.0th:                340.8           333.2
75.0th:                551.8           530.2
90.0th:               3528.4          1919.2
95.0th:               7312.8          5558.4
99.0th:              14630.4         12912.0
99.5th:              17955.2         14950.4
99.9th:              23059.2         20230.4

Groups: 512
                     6.1_Base        6.1_Latency_Nice
50.0th:               1021.8           990.6
75.0th:               9545.6         10044.8
90.0th:              20972.8         21638.4
95.0th:              29971.2         30291.2
99.0th:              42355.2         46707.2
99.5th:              48550.4         52057.6
99.9th:              58867.2         60147.2

Tested-by: shrikanth Hegde <sshegde@linux.vnet.ibm.com>
  
Vincent Guittot Nov. 14, 2022, 10:40 a.m. UTC | #3
On Sun, 13 Nov 2022 at 09:51, shrikanth suresh hegde
<sshegde@linux.vnet.ibm.com> wrote:
>
>
> > This patchset restarts the work about adding a latency priority to describe
> > the latency tolerance of cfs tasks.
>
> Hi Vincent.
>
> Tested the patches on the power10 machine. It is 80 core system with SMT=8. i.e
> total of 640 cpus. on the large workload which mainly interacts with the
> database there is minor improvement of 3-5%.
>
> the method followed is creating a cgroup, assigning a latency nice value of -20,
> -10, 0 and adding the tasks to procs of the cgroup. outside of cgroup, stress-ng
> load is running and it is not set any latency value. stress-ng --cpu=768 -l 50
>
> with microbenchmarks, hackbench the values are more or less the same. for large
> process pool of 60, there is 10% improvement. schbench tail latencies show
> significant improvement with low and medium load upto 256 groups. only 512
> groups shows a slight decline.
>
> [...]
>
> Tested-by: shrikanth Hegde <sshegde@linux.vnet.ibm.com>

Thanks for the tests and the results


>