[v9,0/9] Add latency priority for CFS class

Message ID: 20221115171851.835-1-vincent.guittot@linaro.org
Series: Add latency priority for CFS class

Vincent Guittot Nov. 15, 2022, 5:18 p.m. UTC
This patchset restarts the work on adding a latency priority to describe
the latency tolerance of cfs tasks.

Patch [1] is a new one that was added in v6. It fixes an unfairness
for low-prio tasks caused by wakeup_gran() being bigger than the
maximum vruntime credit that a waking task can keep after sleeping.
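
For context, here is a simplified sketch of the pre-patch mainline logic
(paraphrased from kernel/sched/fair.c, not the patch itself): wakeup_gran()
scales the granularity by the waking task's weight, so for a light
(low-prio) task the granularity expressed in vruntime can exceed the
sleeper credit granted by place_entity(), and such a task can then never
preempt at wakeup.

static unsigned long wakeup_gran(struct sched_entity *se)
{
	unsigned long gran = sysctl_sched_wakeup_granularity;

	/*
	 * Convert the wall-time granularity into vruntime for se's
	 * weight: the lighter the task, the larger the result.
	 */
	return calc_delta_fair(gran, se);
}

static int wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(se);
	if (vdiff > gran)	/* a low-prio waker may never pass this test */
		return 1;

	return 0;
}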

Patches [2-4] were originally written by Parth:
https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com/

I have just rebased them and moved the setting of the latency priority
outside of the priority update. I have removed the Reviewed-by tags
because the patches are 2 years old.

This aims to be a generic interface, and the following patches are one
use of it to improve the scheduling latency of cfs tasks.

Patch [5] uses the latency nice priority to define a latency offset and
then to decide whether a cfs task can or should preempt the currently
running task. The patch includes test results with cyclictest and
hackbench that highlight the benefit of latency priority for short
interactive tasks and long CPU-intensive tasks.
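
A minimal sketch of the idea (the helper name wakeup_latency_gran() and
the exact arithmetic below are illustrative; see patch [5] for the real
code): the vruntime comparison at wakeup is biased by the difference of
the tasks' latency offsets.

static int wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	/*
	 * Bias the comparison with the latency offsets so that a
	 * latency-sensitive waking task (negative latency nice)
	 * preempts more easily and a latency-tolerant one (positive
	 * latency nice) preempts less easily.
	 */
	vdiff += wakeup_latency_gran(curr, se);

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(se);
	if (vdiff > gran)
		return 1;

	return 0;
}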

Patch [6] adds support for the latency nice priority in task groups by
adding a cpu.latency.nice field. The range is [-20:19], as when setting a
task's latency priority.
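
Setting it is a plain cgroup-v2 file write; below is a minimal example
(the "games" group path is only illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* "/sys/fs/cgroup/games" is an illustrative cgroup-v2 path */
	int fd = open("/sys/fs/cgroup/games/cpu.latency.nice", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* -20 = most latency sensitive, 19 = most latency tolerant */
	dprintf(fd, "%d\n", -20);
	close(fd);
	return 0;
}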

Patch [7] makes sched_core take the latency offset into account.

Patch [8] adds an rb tree to cover some corner cases where a latency
sensitive task (latency priority < 0) is preempted by a high-priority
task (RT/DL) or fails to preempt one. This patch ensures that such tasks
will get at least a slice of sched_min_granularity in priority at wakeup.
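
Roughly, the shape is as follows (the field and helper names here are
made up for the sketch, and the ordering key is an assumption; patch [8]
has the real code): a second, usually empty, rb tree tracks only the
latency-sensitive entities so that the pick path can check whether one
of them is due to run.

/* Sketch only: latency_node, latency_timeline, latency_less() and
 * entity_is_latency_sensitive() are assumed names. */
static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* only latency-sensitive entities (latency nice < 0) are tracked */
	if (!entity_is_latency_sensitive(se))
		return;

	rb_add_cached(&se->latency_node, &cfs_rq->latency_timeline, latency_less);
}

static void __dequeue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* membership tracked with RB_EMPTY_NODE()/RB_CLEAR_NODE(), as in v8 */
	if (!RB_EMPTY_NODE(&se->latency_node)) {
		rb_erase_cached(&se->latency_node, &cfs_rq->latency_timeline);
		RB_CLEAR_NODE(&se->latency_node);
	}
}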

Patch [9] removes a check made useless by the addition of the latency rb tree.

I have also backported the patchset onto a dragonboard RB3 with an
android mainline kernel based on v5.18 for a quick test. I have used the
TouchLatency app, which is part of AOSP and described as a very good test
for highlighting the jitter and jank frame sources of a system [1].
In addition to the app, I have added some short running tasks waking up
regularly (using the 8 cpus for 4 ms every 37777 us) to stress the system
without overloading it (and with EAS disabled). The first results show
that the patchset helps to reduce the missed-deadline frames from 5% to
less than 0.1% when the cpu.latency.nice of the task groups is set. I
haven't rerun the test with the latest version.

I have also tested the patchset with the modified version of the alsa
latency test that was shared by Tim. The test quickly xruns with the
default latency nice priority of 0 but is able to run without underruns
with a latency nice of -20 while hackbench runs simultaneously.
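
For reference, a minimal sketch of how an application can request this
from userspace through the extended sched_setattr() (patches [2-4]); the
struct layout and the SCHED_FLAG_LATENCY_NICE value below reflect my
reading of the series' uapi changes and should be checked against the
patched headers:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* SCHED_DEADLINE fields, unused here */
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;	/* util clamp fields, unused here */
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* added by this series */
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* from the series' uapi change */

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;				/* SCHED_OTHER (cfs) */
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = -20;			/* most latency sensitive */

	/* pid 0 == calling thread; since v5, no elevated privileges needed */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}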

While preparing version 8, I evaluated the benefit of using an augmented
rbtree instead of adding a dedicated rbtree for latency-sensitive
entities, which was a relevant suggestion from PeterZ. Although the
augmented rbtree makes it possible to sort additional information in the
tree with limited overhead, it has more impact on legacy use cases
(latency_nice >= 0) because the augmented callbacks are always called to
maintain this additional information, even when there are no
latency-sensitive tasks. In that case, the dedicated rbtree remains empty
and its overhead is reduced to loading a cached null node pointer.
Nevertheless, we might want to reconsider the augmented rbtree once the
use of negative latency_nice is more widely deployed. For now, the
various tests that I have run have not shown improvements with the
augmented rbtree.

Below are some hackbench results:
        2 rbtrees               augmented rbtree        augmented rbtree
                                sorted by vruntime      sorted by wakeup_vruntime
sched pipe
avg     26311.000               25976.667               25839.556
stdev   0.15 %                  0.28 %                  0.24 %
vs tip  0.50 %                  -0.78 %                 -1.31 %
hackbench 1 group
avg     1.315                   1.344                   1.359
stdev   0.88 %                  1.55 %                  1.82 %
vs tip  -0.47 %                 -2.68 %                 -3.87 %
hackbench 4 groups
avg     1.339                   1.365                   1.367
stdev   2.39 %                  2.26 %                  3.58 %
vs tip  -0.08 %                 -2.01 %                 -2.22 %
hackbench 8 groups
avg     1.233                   1.286                   1.301
stdev   0.74 %                  1.09 %                  1.52 %
vs tip  0.29 %                  -4.05 %                 -5.27 %
hackbench 16 groups
avg     1.268                   1.313                   1.319
stdev   0.85 %                  1.60 %                  0.68 %
vs tip  -0.02 %                 -3.56 %                 -4.01 %

[1] https://source.android.com/docs/core/debug/eval_perf#touchlatency

Changes since v8:
- Renamed get_sched_latency to get_sleep_latency
- Moved the latency nice defines into sched/prio.h and fixed the
  latency_prio init value
- Fixed typos and comments

Changes since v7:
- Replaced se->on_latency by using RB_CLEAR_NODE() and RB_EMPTY_NODE()
- Clarified the limit behavior of the cgroup cpu.latency.nice

Changes since v6:
- Fixed a compilation error for !CONFIG_SCHED_DEBUG

Changes since v5:
- Added patch 1 to fix unfairness for low-prio tasks. This was
  discovered while studying Youssef's test results with latency nice,
  which were hitting the same problem.
- Fixed the latency_offset computation to take GENTLE_FAIR_SLEEPERS
  into account. This had disappeared with v2 and was raised by
  Youssef's tests.
- Reworked and optimized how latency_offset is used to check for
  preempting the current task at wakeup and at tick. This covers more
  cases too.
- Added patch 9 to remove check_preempt_from_others(), which is not
  needed anymore with the rb tree.

Changes since v4:
- Removed the permission checks for setting the latency priority. This
  enables users without elevated privileges, such as audio
  applications, to set their latency priority, as requested by Tim.
- Removed cpu.latency and replaced it with cpu.latency.nice so that we
  keep a generic interface not tied to latency_offset, which can be
  used to implement other latency features.
- Added an entry in Documentation/admin-guide/cgroup-v2.rst to
  describe cpu.latency.nice.
- Fixed some typos.

Changes since v3:
- Fixed 2 compilation warnings raised by the kernel test robot <lkp@intel.com>

Changes since v2:
- Set a latency_offset field instead of saving a weight and computing
  it on the fly.
- Made latency_offset available for task groups: cpu.latency
- Fixed some corner cases to make latency-sensitive tasks schedule
  first, and added an rb tree for latency-sensitive tasks.

Changes since v1:
- Fixed typos
- Moved some code into the right patch to make bisect happy
- Simplified and fixed how the weight is computed
- Added support for sched core (patch 7)

Parth Shah (3):
  sched: Introduce latency-nice as a per-task attribute
  sched/core: Propagate parent task's latency requirements to the child
    task
  sched: Allow sched_{get,set}attr to change latency_nice of the task

Vincent Guittot (6):
  sched/fair: fix unfairness at wakeup
  sched/fair: Take into account latency priority at wakeup
  sched/fair: Add sched group latency support
  sched/core: Support latency priority with sched core
  sched/fair: Add latency list
  sched/fair: remove check_preempt_from_others

 Documentation/admin-guide/cgroup-v2.rst |  10 ++
 include/linux/sched.h                   |   4 +
 include/linux/sched/prio.h              |  27 +++
 include/uapi/linux/sched.h              |   4 +-
 include/uapi/linux/sched/types.h        |  19 +++
 init/init_task.c                        |   1 +
 kernel/sched/core.c                     | 106 ++++++++++++
 kernel/sched/debug.c                    |   1 +
 kernel/sched/fair.c                     | 209 ++++++++++++++++++++----
 kernel/sched/sched.h                    |  45 ++++-
 tools/include/uapi/linux/sched.h        |   4 +-
 11 files changed, 394 insertions(+), 36 deletions(-)
  

Comments

K Prateek Nayak Nov. 28, 2022, 11:51 a.m. UTC | #1
Hello Vincent,

Following are the test results on a dual socket Zen3 machine (2 x 64C/128T).

tl;dr

o All benchmarks with the DEFAULT_LATENCY_NICE value are comparable to tip.
  There is, however, a noticeable dip for the unixbench-spawn test case.

o With the 2-rbtree approach, I do not see much difference in the
  hackbench results with varying latency nice values. Tests on v5 did
  yield noticeable improvements for hackbench.
  (https://lore.kernel.org/lkml/cd48ebbb-9724-985f-28e3-e558dea07827@amd.com/)

o For hackbench + cyclictest and hackbench + schbench, I see the
  expected behavior with different latency nice values.

o There are a few cases with hackbench and hackbench + cyclictest where
  the results are non-monotonic with different latency nice values.
  (Marked with "^").

I'll leave the detailed results below:

On 11/15/2022 10:48 PM, Vincent Guittot wrote:
> This patchset restarts the work on adding a latency priority to describe
> the latency tolerance of cfs tasks.
> 
> [..snip..]

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS modes are used to logically divide a single socket into multiple
NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over the 2 sockets.

    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over the 2 sockets.

    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip:          6.1.0 tip sched/core
- latency_nice: 6.1.0 tip sched/core + this series

When we started testing, the tip was at:
commit d6962c4fe8f9 "sched: Clear ttwu_pending after enqueue_task()"


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ hackbench - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Test:			tip			latency_nice
 1-groups:	   4.25 (0.00 pct)	   4.14 (2.58 pct)
 2-groups:	   4.95 (0.00 pct)	   4.92 (0.60 pct)
 4-groups:	   5.19 (0.00 pct)	   5.18 (0.19 pct)
 8-groups:	   5.45 (0.00 pct)	   5.44 (0.18 pct)
16-groups:	   7.33 (0.00 pct)	   7.32 (0.13 pct)

NPS2

Test:			tip			latency_nice
 1-groups:	   4.09 (0.00 pct)	   4.08 (0.24 pct)
 2-groups:	   4.68 (0.00 pct)	   4.72 (-0.85 pct)
 4-groups:	   5.05 (0.00 pct)	   4.97 (1.58 pct)
 8-groups:	   5.37 (0.00 pct)	   5.34 (0.55 pct)
16-groups:	   6.69 (0.00 pct)	   6.74 (-0.74 pct)

NPS4

Test:			tip			latency_nice
 1-groups:	   4.28 (0.00 pct)	   4.35 (-1.63 pct)
 2-groups:	   4.78 (0.00 pct)	   4.76 (0.41 pct)
 4-groups:	   5.11 (0.00 pct)	   5.06 (0.97 pct)
 8-groups:	   5.48 (0.00 pct)	   5.40 (1.45 pct)
16-groups:	   7.07 (0.00 pct)	   6.70 (5.23 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ schbench - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

#workers:	tip			latency_nice
  1:	  31.00 (0.00 pct)	  32.00 (-3.22 pct)
  2:	  33.00 (0.00 pct)	  34.00 (-3.03 pct)
  4:	  39.00 (0.00 pct)	  38.00 (2.56 pct)
  8:	  45.00 (0.00 pct)	  46.00 (-2.22 pct)
 16:	  61.00 (0.00 pct)	  66.00 (-8.19 pct)
 32:	 108.00 (0.00 pct)	 110.00 (-1.85 pct)
 64:	 212.00 (0.00 pct)	 216.00 (-1.88 pct)
128:	 475.00 (0.00 pct)	 701.00 (-47.57 pct)    *
128:     429.00 (0.00 pct)       441.00 (-2.79 pct)      [Verification Run]
256:	 44736.00 (0.00 pct)	 45632.00 (-2.00 pct)
512:	 77184.00 (0.00 pct)	 78720.00 (-1.99 pct)

NPS2

#workers:	tip			latency_nice
  1:	  28.00 (0.00 pct)	  33.00 (-17.85 pct)
  2:	  34.00 (0.00 pct)	  31.00 (8.82 pct)
  4:	  36.00 (0.00 pct)	  36.00 (0.00 pct)
  8:	  51.00 (0.00 pct)	  49.00 (3.92 pct)
 16:	  68.00 (0.00 pct)	  64.00 (5.88 pct)
 32:	 113.00 (0.00 pct)	 115.00 (-1.76 pct)
 64:	 221.00 (0.00 pct)	 219.00 (0.90 pct)
128:	 553.00 (0.00 pct)	 531.00 (3.97 pct)
256:	 43840.00 (0.00 pct)	 48192.00 (-9.92 pct)   *
256:	 50427.00 (0.00 pct)	 48351.00 (4.11 pct)    [Verification Run]
512:	 76672.00 (0.00 pct)	 81024.00 (-5.67 pct)

NPS4

#workers:	tip			latency_nice
  1:	  33.00 (0.00 pct)	  28.00 (15.15 pct)
  2:	  29.00 (0.00 pct)	  34.00 (-17.24 pct)
  4:	  39.00 (0.00 pct)	  36.00 (7.69 pct)
  8:	  58.00 (0.00 pct)	  55.00 (5.17 pct)
 16:	  66.00 (0.00 pct)	  67.00 (-1.51 pct)
 32:	 112.00 (0.00 pct)	 116.00 (-3.57 pct)
 64:	 215.00 (0.00 pct)	 213.00 (0.93 pct)
128:	 689.00 (0.00 pct)	 571.00 (17.12 pct)
256:	 45120.00 (0.00 pct)	 46400.00 (-2.83 pct)
512:	 77440.00 (0.00 pct)	 76160.00 (1.65 pct)


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ tbench - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Clients:	tip			latency_nice
    1	 581.75 (0.00 pct)	 586.52 (0.81 pct)
    2	 1145.75 (0.00 pct)	 1160.69 (1.30 pct)
    4	 2127.94 (0.00 pct)	 2141.49 (0.63 pct)
    8	 3838.27 (0.00 pct)	 3721.10 (-3.05 pct)
   16	 6272.71 (0.00 pct)	 6539.82 (4.25 pct)
   32	 11400.12 (0.00 pct)	 12079.49 (5.95 pct)
   64	 21605.96 (0.00 pct)	 22908.83 (6.03 pct)
  128	 30715.43 (0.00 pct)	 31736.95 (3.32 pct)
  256	 55580.78 (0.00 pct)	 54786.29 (-1.42 pct)
  512	 56528.79 (0.00 pct)	 56453.54 (-0.13 pct)
 1024	 56520.40 (0.00 pct)	 56369.93 (-0.26 pct)

NPS2

Clients:	tip			latency_nice
    1	 584.13 (0.00 pct)	 582.53 (-0.27 pct)
    2	 1153.63 (0.00 pct)	 1140.27 (-1.15 pct)
    4	 2212.89 (0.00 pct)	 2159.49 (-2.41 pct)
    8	 3871.35 (0.00 pct)	 3840.77 (-0.78 pct)
   16	 6216.72 (0.00 pct)	 6437.98 (3.55 pct)
   32	 11766.98 (0.00 pct)	 11663.53 (-0.87 pct)
   64	 22000.93 (0.00 pct)	 21882.88 (-0.53 pct)
  128	 31520.53 (0.00 pct)	 31147.05 (-1.18 pct)
  256	 51420.11 (0.00 pct)	 55216.39 (7.38 pct)
  512	 53935.90 (0.00 pct)	 55407.60 (2.72 pct)
 1024	 55239.73 (0.00 pct)	 55997.25 (1.37 pct)

NPS4

Clients:	tip			latency_nice
    1	 585.83 (0.00 pct)	 578.17 (-1.30 pct)
    2	 1141.59 (0.00 pct)	 1131.14 (-0.91 pct)
    4	 2174.79 (0.00 pct)	 2086.52 (-4.05 pct)
    8	 3887.56 (0.00 pct)	 3778.47 (-2.80 pct)
   16	 6441.59 (0.00 pct)	 6364.30 (-1.19 pct)
   32	 12133.60 (0.00 pct)	 11465.26 (-5.50 pct)   *
   32    11677.16 (0.00 pct)     12662.09 (8.43 pct)    [Verification Run]
   64	 21769.15 (0.00 pct)	 19488.45 (-10.47 pct)  *
   64    20305.64 (0.00 pct)     21002.90 (3.43 pct)    [Verification Run]
  128	 31396.31 (0.00 pct)	 31177.37 (-0.69 pct)
  256	 52792.39 (0.00 pct)	 52890.41 (0.18 pct)
  512	 55315.44 (0.00 pct)	 53572.65 (-3.15 pct)
 1024	 52150.27 (0.00 pct)	 54079.48 (3.69 pct)


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ stream - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

10 Runs:

Test:		tip			latency_nice
 Copy:   307827.79 (0.00 pct)    330524.48 (7.37 pct)
Scale:   208872.28 (0.00 pct)    215002.06 (2.93 pct)
  Add:   239404.64 (0.00 pct)    230334.74 (-3.78 pct)
Triad:   247258.30 (0.00 pct)    238505.06 (-3.54 pct)

100 Runs:

Test:		tip			latency_nice
 Copy:   317217.55 (0.00 pct)    314467.62 (-0.86 pct)
Scale:   208740.82 (0.00 pct)    210452.00 (0.81 pct)
  Add:   240550.63 (0.00 pct)    232376.03 (-3.39 pct)
Triad:   249594.21 (0.00 pct)    242460.83 (-2.85 pct)

NPS2

10 Runs:

Test:		tip			latency_nice
 Copy:   340877.18 (0.00 pct)    339441.26 (-0.42 pct)
Scale:   217318.16 (0.00 pct)    216905.49 (-0.18 pct)
  Add:   259078.93 (0.00 pct)    261686.67 (1.00 pct)
Triad:   274500.78 (0.00 pct)    271699.83 (-1.02 pct)

100 Runs:

Test:		tip			latency_nice
 Copy:   341860.73 (0.00 pct)    335826.36 (-1.76 pct)
Scale:   218043.00 (0.00 pct)    216451.84 (-0.72 pct)
  Add:   253698.22 (0.00 pct)    257317.72 (1.42 pct)
Triad:   265011.84 (0.00 pct)    267769.93 (1.04 pct)

NPS4

10 Runs:

Test:		tip			latency_nice
 Copy:   340877.18 (0.00 pct)    365921.51 (7.34 pct)
Scale:   217318.16 (0.00 pct)    239408.65 (10.16 pct)
  Add:   259078.93 (0.00 pct)    264859.31 (2.23 pct)
Triad:   274500.78 (0.00 pct)    281543.65 (2.56 pct)

100 Runs:

Test:		tip			latency_nice
 Copy:   341860.73 (0.00 pct)    359255.16 (5.08 pct)
Scale:   218043.00 (0.00 pct)    238154.15 (9.22 pct)
  Add:   253698.22 (0.00 pct)    269223.49 (6.11 pct)
Triad:   265011.84 (0.00 pct)    278473.85 (5.07 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ ycsb-mongodb - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

o NPS1

tip:                    131244.00 (var: 2.67%)
latency_nice:           132118.00 (var: 3.62%) (+0.66%)

o NPS2

tip:                    127663.33 (var: 2.08%)
latency_nice:           129148.00 (var: 4.29%) (+1.16%)

o NPS4

tip:                    133295.00 (var: 1.58%)
latency_nice:           129975.33 (var: 1.10%) (-2.49%)


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Unixbench - DEFAULT_LATENCY_NICE ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

o NPS1

Test			Metric	  Parallelism			tip		      latency_nice
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48929419.48 (   0.00%)    49137039.06 (   0.42%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6275526953.25 (   0.00%)  6265580479.15 (  -0.16%)
unixbench-syscall       Amean     unixbench-syscall-1        2994319.73 (   0.00%)     3008596.83 *  -0.48%*
unixbench-syscall       Amean     unixbench-syscall-512      7349715.87 (   0.00%)     7420994.50 *  -0.97%*
unixbench-pipe          Hmean     unixbench-pipe-1           2830206.03 (   0.00%)     2854405.99 *   0.86%*
unixbench-pipe          Hmean     unixbench-pipe-512       326207828.01 (   0.00%)   328997804.52 *   0.86%*
unixbench-spawn         Hmean     unixbench-spawn-1             6394.21 (   0.00%)        6367.75 (  -0.41%)
unixbench-spawn         Hmean     unixbench-spawn-512          72700.64 (   0.00%)       71454.19 *  -1.71%*
unixbench-execl         Hmean     unixbench-execl-1             4723.61 (   0.00%)        4750.59 (   0.57%)
unixbench-execl         Hmean     unixbench-execl-512          11212.05 (   0.00%)       11262.13 (   0.45%)

o NPS2

Test			Metric	  Parallelism			tip		      latency_nice
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49271512.85 (   0.00%)    49245260.43 (  -0.05%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6267992483.03 (   0.00%)  6264951100.67 (  -0.05%)
unixbench-syscall       Amean     unixbench-syscall-1        2995885.93 (   0.00%)     3005975.10 *  -0.34%*
unixbench-syscall       Amean     unixbench-syscall-512      7388865.77 (   0.00%)     7276275.63 *   1.52%*
unixbench-pipe          Hmean     unixbench-pipe-1           2828971.95 (   0.00%)     2856578.72 *   0.98%*
unixbench-pipe          Hmean     unixbench-pipe-512       326225385.37 (   0.00%)   328941270.81 *   0.83%*
unixbench-spawn         Hmean     unixbench-spawn-1             6958.71 (   0.00%)        6954.21 (  -0.06%)
unixbench-spawn         Hmean     unixbench-spawn-512          85443.56 (   0.00%)       70536.42 * -17.45%* (0.67% vs 0.93% - CoEff var)
unixbench-execl         Hmean     unixbench-execl-1             4767.99 (   0.00%)        4752.63 *  -0.32%*
unixbench-execl         Hmean     unixbench-execl-512          11250.72 (   0.00%)       11320.97 (   0.62%)

o NPS4

Test			Metric	  Parallelism			tip		      latency_nice
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49041932.68 (   0.00%)    49156671.05 (   0.23%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6286981589.85 (   0.00%)  6285248711.40 (  -0.03%)
unixbench-syscall       Amean     unixbench-syscall-1        2992405.60 (   0.00%)     3008933.03 *  -0.55%*
unixbench-syscall       Amean     unixbench-syscall-512      7971789.70 (   0.00%)     7814622.23 *   1.97%*
unixbench-pipe          Hmean     unixbench-pipe-1           2822892.54 (   0.00%)     2852615.11 *   1.05%*
unixbench-pipe          Hmean     unixbench-pipe-512       326408309.83 (   0.00%)   329617202.56 *   0.98%*
unixbench-spawn         Hmean     unixbench-spawn-1             7685.31 (   0.00%)        7243.54 (  -5.75%)
unixbench-spawn         Hmean     unixbench-spawn-512          72245.56 (   0.00%)       77000.81 *   6.58%*
unixbench-execl         Hmean     unixbench-execl-1             4761.42 (   0.00%)        4733.12 *  -0.59%*
unixbench-execl         Hmean     unixbench-execl-512          11533.53 (   0.00%)       11660.17 (   1.10%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Hackbench - Various Latency Nice Values ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

o 100000 loops

- pipe (process)

Test:                   LN: 0                   LN: 19                  LN: -20
 1-groups:         3.91 (0.00 pct)         3.91 (0.00 pct)         3.81 (2.55 pct)
 2-groups:         4.48 (0.00 pct)         4.52 (-0.89 pct)        4.53 (-1.11 pct)
 4-groups:         4.83 (0.00 pct)         4.83 (0.00 pct)         4.87 (-0.82 pct)
 8-groups:         5.09 (0.00 pct)         5.00 (1.76 pct)         5.07 (0.39 pct)
16-groups:         6.92 (0.00 pct)         6.79 (1.87 pct)         6.96 (-0.57 pct)

- pipe (thread)

 1-groups:         4.13 (0.00 pct)         4.08 (1.21 pct)         4.11 (0.48 pct)
 2-groups:         4.78 (0.00 pct)         4.90 (-2.51 pct)        4.79 (-0.20 pct)
 4-groups:         5.12 (0.00 pct)         5.08 (0.78 pct)         5.16 (-0.78 pct)
 8-groups:         5.31 (0.00 pct)         5.28 (0.56 pct)         5.33 (-0.37 pct)
16-groups:         7.34 (0.00 pct)         7.27 (0.95 pct)         7.33 (0.13 pct)

- socket (process)

Test:                   LN: 0                   LN: 19                  LN: -20
 1-groups:         6.61 (0.00 pct)         6.38 (3.47 pct)         6.54 (1.05 pct)
 2-groups:         6.59 (0.00 pct)         6.67 (-1.21 pct)        6.11 (7.28 pct)
 4-groups:         6.77 (0.00 pct)         6.78 (-0.14 pct)        6.79 (-0.29 pct)
 8-groups:         8.29 (0.00 pct)         8.39 (-1.20 pct)        8.36 (-0.84 pct)
16-groups:        12.21 (0.00 pct)        12.03 (1.47 pct)        12.35 (-1.14 pct)

- socket (thread)

Test:                   LN: 0                   LN: 19                  LN: -20
 1-groups:         6.50 (0.00 pct)         5.99 (7.84 pct)         6.02 (7.38 pct)	^
 2-groups:         6.07 (0.00 pct)         6.20 (-2.14 pct)        6.23 (-2.63 pct)
 4-groups:         6.61 (0.00 pct)         6.64 (-0.45 pct)        6.63 (-0.30 pct)
 8-groups:         8.87 (0.00 pct)         8.67 (2.25 pct)         8.78 (1.01 pct)
16-groups:        12.63 (0.00 pct)        12.54 (0.71 pct)        12.59 (0.31 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Hackbench + Cyclictest - Various Latency Nice Values ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Hackbench: 32 Groups

perf bench sched messaging -p -l 100000 -g 32&
cyclictest --policy other -D 5 -q -n -h 2000

o NPS1

----------------------------------------------------------------------------------------------------------
| Hackbench   |      Cyclictest LN = 19      |      Cyclictest LN = 0        |    Cyclictest LN = -20    |
| LN          |------------------------------|-------------------------------|---------------------------|
|             |   Min  |   Avg   |  Max      |   Min  |   Avg   |   Max      |   Min  |  Avg  |   Max    |
|-------------|--------|---------|-----------|--------|---------|------------|--------|-------|----------|
| 19          | 52.00  | 71.00   | 5191.00   | 29.00  | 68.00   |  4477.00   | 53.00  | 60.00 |  753.00  |
| 0           | 53.00  | 150.00  | 7300.00   | 53.00  | 105.00  |  7730.00   | 53.00  | 64.00 |  2067.00 |
| -20         | 33.00  | 159.00  | 98492.00  | 53.00  | 149.00  |  9608.00   | 53.00  | 91.00 |  5349.00 |
----------------------------------------------------------------------------------------------------------

o NPS4

----------------------------------------------------------------------------------------------------------
| Hackbench   |      Cyclictest LN = 19      |      Cyclictest LN = 0        |    Cyclictest LN = -20    |
| LN          |------------------------------|-------------------------------|---------------------------|
|             |   Min  |   Avg   |  Max      |   Min  |   Avg   |   Max      |   Min  |  Avg  |   Max    |
|-------------|--------|---------|-----------|--------|---------|------------|--------|-------|----------|
| 19          | 53.00  |  84.00  |  4790.00  | 53.00  |  72.00  |  3456.00   | 53.00  | 58.00 |  1271.00 |
| 0           | 53.00  |  99.00  |  5494.00  | 52.00  |  74.00  |  5813.00   | 53.00  | 59.00 |  1004.00 |
| -20         | 45.00  |  84.00  |  3592.00  | 53.00  |  91.00  |  15222.00  | 53.00  | 74.00 |  5232.00 |	^
----------------------------------------------------------------------------------------------------------

- Hackbench: 128 Groups

perf bench sched messaging -p -l 500000 -g 128&
cyclictest --policy other -D 5 -q -n -h 2000

o NPS1

----------------------------------------------------------------------------------------------------------
| Hackbench   |      Cyclictest LN = 19      |      Cyclictest LN = 0        |    Cyclictest LN = -20    |
| LN          |------------------------------|-------------------------------|---------------------------|
|             |   Min  |   Avg   |  Max      |   Min  |   Avg   |   Max      |   Min  |  Avg  |   Max    |
|-------------|--------|---------|-----------|--------|---------|------------|--------|-------|----------|
| 19          | 53.00  | 274.00  | 11294.00  | 33.00  | 130.00  |  20071.00  | 53.00  | 56.00 |  244.00  |	^
| 0           | 53.00  | 125.00  | 10014.00  | 53.00  | 113.00  |  15857.00  | 53.00  | 57.00 |  250.00  |
| -20         | 53.00  | 187.00  | 49565.00  | 53.00  | 230.00  |  73353.00  | 53.00  | 118.00|  8816.00 |
----------------------------------------------------------------------------------------------------------

o NPS4

----------------------------------------------------------------------------------------------------------
| Hackbench   |      Cyclictest LN = 19      |      Cyclictest LN = 0        |    Cyclictest LN = -20    |
| LN          |------------------------------|-------------------------------|---------------------------|
|             |   Min  |   Avg   |  Max      |   Min  |   Avg   |   Max      |   Min  |  Avg  |   Max    |
|-------------|--------|---------|-----------|--------|---------|------------|--------|-------|----------|
| 19          | 53.00  | 271.00  | 11411.00  | 53.00  | 82.00   |  5486.00   | 25.00  | 57.00 | 1256.00  |
| 0           | 53.00  | 148.00  | 8374.00   | 52.00  | 109.00  |  11074.00  | 52.00  | 59.00 | 1068.00  |
| -20         | 53.00  | 202.00  | 52537.00  | 53.00  | 205.00  |  22265.00  | 52.00  | 87.00 | 14151.00 |
----------------------------------------------------------------------------------------------------------

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Hackbench + schbench - Various Latency Nice Values ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

perf bench sched messaging -p -l 400000 -g 128
schbench -m 2 -t 1 -s 30

o NPS1

-------------------------------------------------------------------------------------------------
| Hackbench |     schbench LN = 19       |      schbench LN = 0      |     schbench LN = -20    |
| LN        |----------------------------|---------------------------|--------------------------|
|           |  90th  |  95th  |  99th    |  90th  |  95th  |  99th   |  90th  |  95th  |  99th  |
|-----------|--------|--------|----------|--------|--------|---------|--------|--------|--------|
| 19        |   38   |   131  |   1458   |   46   |   151  |  2636   |   11   |   19   |  410   |	^
| 0         |   45   |   98   |   1758   |   25   |   50   |  1670   |   16   |   30   |  1042  |
| -20       |   47   |   348  |   29280  |   40   |   109  |  16144  |   35   |   63   |  9104  |
-------------------------------------------------------------------------------------------------

o NPS4

-------------------------------------------------------------------------------------------------
| Hackbench |     schbench LN = 19       |      schbench LN = 0      |     schbench LN = -20    |
| LN        |----------------------------|---------------------------|--------------------------|
|           |  90th  |  95th  |  99th    |  90th  |  95th  |  99th   |  90th  |  95th  |  99th  |
|-----------|--------|--------|----------|--------|--------|---------|--------|--------|--------|
| 19        |   19   |  60    |  1886    |   17   |  29    |  621    |   10   |   18   |  227   |
| 0         |   51   |  141   |  8120    |   37   |  78    |  8880   |   33   |   55   |  474   |	^
| -20       |   48   |  1494  |  27296   |   51   |  469   |  40384  |   31   |   64   |  4092  |	^
-------------------------------------------------------------------------------------------------

^ Note: There are cases where the Max / 99th percentile latency is
non-monotonic, but I've also seen a good amount of run-to-run variation
there, with a single bad sample polluting the results. In such cases,
the averages are more representative.

> 
> [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
> 
> [..snip..]
> 

Apart from a couple of anomalies, latency nice reduces wait time,
especially when the system is heavily loaded. If there is any data or
any specific workload you would like me to run on the test system,
please do let me know. Meanwhile, I'll try to get some numbers for
larger workloads like SpecJBB that did see improvements with latency
nice on v5.
--
Thanks and Regards,
Prateek
  
Vincent Guittot Nov. 28, 2022, 5:19 p.m. UTC | #2
Hi Prateek,

On Mon, 28 Nov 2022 at 12:52, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> Following are the test results on dual socket Zen3 machine (2 x 64C/128T)
>
> tl;dr
>
> o All benchmarks with the DEFAULT_LATENCY_NICE value are comparable to tip.
>   There is, however, a noticeable dip for the unixbench-spawn test case.
>
> o With the 2-rbtree approach, I do not see much difference in the
>   hackbench results with varying latency nice values. Tests on v5 did
>   yield noticeable improvements for hackbench.
>   (https://lore.kernel.org/lkml/cd48ebbb-9724-985f-28e3-e558dea07827@amd.com/)

The 2-rbtree approach is the one that was already used in v5. I have
just rerun the hackbench tests with the latest tip and v6.2-rc7, and I
can see a large performance improvement for the pipe tests on my 8-core
system. Could you try with a larger number of groups, like 64, 128 and
256?

>
> o For hackbench + cyclictest and hackbench + schbench, I see the
>   expected behavior with different latency nice values.
>
> o There are a few cases with hackbench and hackbench + cyclictest where
>   the results are non-monotonic with different latency nice values.
>   (Marked with "^").
>
> I'll leave the detailed results below:
>
> On 11/15/2022 10:48 PM, Vincent Guittot wrote:
> > This patchset restarts the work on adding a latency priority to describe
> > the latency tolerance of cfs tasks.
> >
> > [..snip..]
>
> [..snip..]
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ Unixbench - DEFAULT_LATENCY_NICE ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS2
>
> Test                    Metric    Parallelism                   tip                   latency_nice
> unixbench-spawn         Hmean     unixbench-spawn-1             6958.71 (   0.00%)        6954.21 (  -0.06%)
> unixbench-spawn         Hmean     unixbench-spawn-512          85443.56 (   0.00%)       70536.42 * -17.45%* (0.67% vs 0.93% - CoEff var)

I don't expect any performance improvement or regression when the
latency nice is not changed.

> [..snip..]
> | -20         | 53.00  | 202.00  | 52537.00  | 53.00  | 205.00  |  22265.00  | 52.00  | 87.00 | 14151.00 |
> ----------------------------------------------------------------------------------------------------------
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ Hackbench + schbench - Various Latency Nice Values ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> perf bench sched messaging -p -l 400000 -g 128
> schbench -m 2 -t 1 -s 30
>
> o NPS1
>
> -------------------------------------------------------------------------------------------------
> | Hackbench |     schbench LN = 19       |      schbench LN = 0      |     schbench LN = -20    |
> | LN        |----------------------------|---------------------------|--------------------------|
> |           |  90th  |  95th  |  99th    |  90th  |  95th  |  99th   |  90th  |  95th  |  99th  |
> |-----------|--------|--------|----------|--------|--------|---------|--------|--------|--------|
> | 19        |   38   |   131  |   1458   |   46   |   151  |  2636   |   11   |   19   |  410   |       ^
> | 0         |   45   |   98   |   1758   |   25   |   50   |  1670   |   16   |   30   |  1042  |
> | -20       |   47   |   348  |   29280  |   40   |   109  |  16144  |   35   |   63   |  9104  |
> -------------------------------------------------------------------------------------------------
>
> o NPS4
>
> -------------------------------------------------------------------------------------------------
> | Hackbench |     schbench LN = 19       |      schbench LN = 0      |     schbench LN = -20    |
> | LN        |----------------------------|---------------------------|--------------------------|
> |           |  90th  |  95th  |  99th    |  90th  |  95th  |  99th   |  90th  |  95th  |  99th  |
> |-----------|--------|--------|----------|--------|--------|---------|--------|--------|--------|
> | 19        |   19   |  60    |  1886    |   17   |  29    |  621    |   10   |   18   |  227   |
> | 0         |   51   |  141   |  8120    |   37   |  78    |  8880   |   33   |   55   |  474   |       ^
> | -20       |   48   |  1494  |  27296   |   51   |  469   |  40384  |   31   |   64   |  4092  |       ^
> -------------------------------------------------------------------------------------------------
>
> ^ Note: There are cases where the Max / 99th percentile latency is
> non-monotonic, but I've also seen a good amount of run-to-run variation
> there, with a single bad sample polluting the results. In such cases,
> the averages are more representative.
>
> Apart from a couple of anomalies, latency nice reduces wait time, especially
> when the system is heavily loaded. If there is any data, or any specific
> workload you would like me to run on the test system, please do let me know.
> Meanwhile, I'll try to get some numbers for larger workloads like SpecJBB
> that did see improvements with latency nice on v5.

Thanks for your tests

Vincent

> --
> Thanks and Regards,
> Prateek
  
K Prateek Nayak Dec. 7, 2022, 4:26 p.m. UTC | #3
Hello Vincent,

Thank you for taking a look at the report.

On 11/28/2022 10:49 PM, Vincent Guittot wrote:
> Hi Prateek,
> 
> On Mon, 28 Nov 2022 at 12:52, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello Vincent,
>>
>> Following are the test results on a dual-socket Zen3 machine (2 x 64C/128T)
>>
>> tl;dr
>>
>> o All benchmarks with the DEFAULT_LATENCY_NICE value are comparable to tip.
>>   There is, however, a noticeable dip for the unixbench-spawn test case.
>>
>> o With the 2 rbtree approach, I do not see much difference in the
>>   hackbench results with varying latency nice values. Tests on v5 did
>>   yield noticeable improvements for hackbench.
>>   (https://lore.kernel.org/lkml/cd48ebbb-9724-985f-28e3-e558dea07827@amd.com/)
> 
> The 2 rbtree approach is the one that was already used in v5. I just
> reran the hackbench tests with the latest tip and v6.2-rc7, and I can
> see a large performance improvement for the pipe tests on my system
> (an 8-core system). Could you try with a larger number of groups,
> like 64, 128 and 256?

Ah! My bad. I've rerun hackbench with a larger number of groups and I
see a clear win for pipes with latency nice 19. Hackbench with sockets
sees a small win too.

o pipes

$ perf bench sched messaging -p -l 50000 -g <groups>

latency_nice:           0                       19                      -20
32-groups:         9.43 (0.00 pct)         6.42 (31.91 pct)        9.75 (-3.39 pct)
64-groups:        21.55 (0.00 pct)        12.97 (39.81 pct)       21.48 (0.32 pct)
128-groups:       41.15 (0.00 pct)        24.18 (41.23 pct)       46.69 (-13.46 pct)
256-groups:       78.87 (0.00 pct)        43.65 (44.65 pct)       78.84 (0.03 pct)
512-groups:      125.48 (0.00 pct)        78.91 (37.11 pct)      136.21 (-8.55 pct)
1024-groups:     292.81 (0.00 pct)       151.36 (48.30 pct)      323.57 (-10.50 pct)

o sockets

$ perf bench sched messaging  -l 100000 -g <groups>

latency_nice:           0                       19                      -20
32-groups:        27.23 (0.00 pct)        27.00 (0.84 pct)        26.92 (1.13 pct)
64-groups:        45.71 (0.00 pct)        44.58 (2.47 pct)        45.86 (-0.32 pct)
128-groups:       79.55 (0.00 pct)        78.22 (1.67 pct)        80.01 (-0.57 pct)
256-groups:      161.41 (0.00 pct)       164.04 (-1.62 pct)      169.57 (-5.05 pct)
512-groups:      326.41 (0.00 pct)       310.00 (5.02 pct)       342.17 (-4.82 pct)
1024-groups:     634.36 (0.00 pct)       633.59 (0.12 pct)       640.05 (-0.89 pct)

Note: All tests were done in NPS1 mode.
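
For anyone trying to reproduce these runs: the latency nice value can be
set through the extended sched_attr that this series introduces. Below is
a minimal wrapper sketch for that; the sched_latency_nice field, the
SCHED_FLAG_LATENCY_NICE flag and its 0x80 value are taken from the series
as posted and are not in mainline uapi headers, so treat the exact layout
as an assumption. The wrapper name latnice is made up for this example.

/* latnice.c - run a command with a given latency nice value. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

#define SCHED_FLAG_LATENCY_NICE 0x80    /* from this series, not mainline */

/* sched_attr layout as extended by this series */
struct sched_attr {
        __u32 size;
        __u32 sched_policy;
        __u64 sched_flags;
        __s32 sched_nice;
        __u32 sched_priority;
        __u64 sched_runtime;
        __u64 sched_deadline;
        __u64 sched_period;
        __u32 sched_util_min;
        __u32 sched_util_max;
        __s32 sched_latency_nice;       /* new field added by the series */
};

int main(int argc, char **argv)
{
        struct sched_attr attr;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <latency_nice> <cmd> [args...]\n",
                        argv[0]);
                return 1;
        }

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
        attr.sched_latency_nice = atoi(argv[1]);

        /* no glibc wrapper for sched_setattr(); call it directly */
        if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
                perror("sched_setattr");
                return 1;
        }

        execvp(argv[2], &argv[2]);
        perror("execvp");
        return 1;
}

With that, one point of the sweep above becomes, e.g.:

$ gcc -o latnice latnice.c
$ ./latnice 19 perf bench sched messaging -p -l 50000 -g 64

assuming that, as with regular nice, the latency nice value is inherited
across fork() by the hackbench workers.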

> 
>>
>> [..snip..]
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ~ Unixbench - DEFAULT_LATENCY_NICE ~
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> Test                    Metric    Parallelism                   tip                   latency_nice
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48929419.48 (   0.00%)    49137039.06 (   0.42%)
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6275526953.25 (   0.00%)  6265580479.15 (  -0.16%)
>> unixbench-syscall       Amean     unixbench-syscall-1        2994319.73 (   0.00%)     3008596.83 *  -0.48%*
>> unixbench-syscall       Amean     unixbench-syscall-512      7349715.87 (   0.00%)     7420994.50 *  -0.97%*
>> unixbench-pipe          Hmean     unixbench-pipe-1           2830206.03 (   0.00%)     2854405.99 *   0.86%*
>> unixbench-pipe          Hmean     unixbench-pipe-512       326207828.01 (   0.00%)   328997804.52 *   0.86%*
>> unixbench-spawn         Hmean     unixbench-spawn-1             6394.21 (   0.00%)        6367.75 (  -0.41%)
>> unixbench-spawn         Hmean     unixbench-spawn-512          72700.64 (   0.00%)       71454.19 *  -1.71%*
>> unixbench-execl         Hmean     unixbench-execl-1             4723.61 (   0.00%)        4750.59 (   0.57%)
>> unixbench-execl         Hmean     unixbench-execl-512          11212.05 (   0.00%)       11262.13 (   0.45%)
>>
>> o NPS2
>>
>> Test                    Metric    Parallelism                   tip                   latency_nice
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49271512.85 (   0.00%)    49245260.43 (  -0.05%)
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6267992483.03 (   0.00%)  6264951100.67 (  -0.05%)
>> unixbench-syscall       Amean     unixbench-syscall-1        2995885.93 (   0.00%)     3005975.10 *  -0.34%*
>> unixbench-syscall       Amean     unixbench-syscall-512      7388865.77 (   0.00%)     7276275.63 *   1.52%*
>> unixbench-pipe          Hmean     unixbench-pipe-1           2828971.95 (   0.00%)     2856578.72 *   0.98%*
>> unixbench-pipe          Hmean     unixbench-pipe-512       326225385.37 (   0.00%)   328941270.81 *   0.83%*
>> unixbench-spawn         Hmean     unixbench-spawn-1             6958.71 (   0.00%)        6954.21 (  -0.06%)
>> unixbench-spawn         Hmean     unixbench-spawn-512          85443.56 (   0.00%)       70536.42 * -17.45%* (coefficient of variation: 0.67% vs 0.93%)
> 
> I don't expect any perf improvement or regression when the latency
> nice is not changed

This regression can be ignored. Although the results from back-to-back
runs are very stable, I see the results vary when I rebuild the
unixbench binaries on my test setup.

			  tip	      latency_nice
unixbench-spawn-512	73489.0		78260.4		(kexec)
unixbench-spawn-512 	73332.7		77821.2		(reboot)
unixbench-spawn-512	86207.4		82281.2		(rebuilt + reboot)

I'll go back and look more into the spawn test because there is
something else at play there, but the other Unixbench results look
stable in the rerun.
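
As a side note on the coefficient-of-variation figures quoted above
(0.67% vs 0.93%): a minimal sketch of how such a number can be computed
from per-run scores, assuming the usual definition (population standard
deviation over the mean). The sample values are just the three tip-side
rerun scores from the table above, and the file name cov.c is
hypothetical.

/* cov.c - coefficient of variation of benchmark scores */
#include <math.h>
#include <stdio.h>

static double cov(const double *v, int n)
{
        double mean = 0.0, var = 0.0;
        int i;

        for (i = 0; i < n; i++)
                mean += v[i];
        mean /= n;
        for (i = 0; i < n; i++)
                var += (v[i] - mean) * (v[i] - mean);
        var /= n;                       /* population variance */
        return sqrt(var) / mean;        /* stddev relative to mean */
}

int main(void)
{
        /* unixbench-spawn-512 tip scores from the rerun table above */
        double tip[] = { 73489.0, 73332.7, 86207.4 };

        printf("CoV: %.2f%%\n", 100.0 * cov(tip, 3));
        return 0;
}

Build with "gcc -o cov cov.c -lm".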

> 
>> unixbench-execl         Hmean     unixbench-execl-1             4767.99 (   0.00%)        4752.63 *  -0.32%*
>> unixbench-execl         Hmean     unixbench-execl-512          11250.72 (   0.00%)       11320.97 (   0.62%)
>>
>> o NPS4
>>
>> Test                    Metric    Parallelism                   tip                   latency_nice
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49041932.68 (   0.00%)    49156671.05 (   0.23%)
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6286981589.85 (   0.00%)  6285248711.40 (  -0.03%)
>> unixbench-syscall       Amean     unixbench-syscall-1        2992405.60 (   0.00%)     3008933.03 *  -0.55%*
>> unixbench-syscall       Amean     unixbench-syscall-512      7971789.70 (   0.00%)     7814622.23 *   1.97%*
>> unixbench-pipe          Hmean     unixbench-pipe-1           2822892.54 (   0.00%)     2852615.11 *   1.05%*
>> unixbench-pipe          Hmean     unixbench-pipe-512       326408309.83 (   0.00%)   329617202.56 *   0.98%*
>> unixbench-spawn         Hmean     unixbench-spawn-1             7685.31 (   0.00%)        7243.54 (  -5.75%)
>> unixbench-spawn         Hmean     unixbench-spawn-512          72245.56 (   0.00%)       77000.81 *   6.58%*
>> unixbench-execl         Hmean     unixbench-execl-1             4761.42 (   0.00%)        4733.12 *  -0.59%*
>> unixbench-execl         Hmean     unixbench-execl-512          11533.53 (   0.00%)       11660.17 (   1.10%)
>>
>> [..snip..]
>>
>>> [..snip..]
>>>
>>
>> Apart from a couple of anomalies, latency nice reduces wait time, especially
>> when the system is heavily loaded. If there is any data, or any specific
>> workload you would like me to run on the test system, please do let me know.
>> Meanwhile, I'll try to get some numbers for larger workloads like SpecJBB
>> that did see improvements with latency nice on v5.

Following are the results for SpecJBB in NPS1 mode:

+----------------------------------------------+
|                |   Latency Nice    |         |
|     Metric     |-------------------|   tip   |
|                |    0    |    19   |         |
|----------------|-------------------|---------|
|    Max jOPS    | 100.00% | 102.19% | 101.02% |
| Critical jOPS  | 100.00% | 122.41% | 100.41% |
+----------------------------------------------+

SpecJBB throughput for Max-jOPS is similar across the board, but
Critical-jOPS throughput again sees a good uplift with latency nice 19.

> 
> [..snip..]
>

If there is any specific workload you would like me to test, please do
let me know. I'll try to test more workloads I come across with
different latency nice values and will update this thread with the
results.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek