Message ID | 20230328092622.062917921@infradead.org |
---|---|
Headers |
From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org, vincent.guittot@linaro.org
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, juri.lelli@redhat.com, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, corbet@lwn.net, qyousef@layalina.io, chris.hyser@oracle.com, patrick.bellasi@matbug.net, pjt@google.com, pavel@ucw.cz, qperret@google.com, tim.c.chen@linux.intel.com, joshdon@google.com, timj@gnu.org, kprateek.nayak@amd.com, yu.c.chen@intel.com, youssefesmat@chromium.org, joel@joelfernandes.org, efault@gmx.de
Date: Tue, 28 Mar 2023 11:26:22 +0200
Message-ID: <20230328092622.062917921@infradead.org>
Subject: [PATCH 00/17] sched: EEVDF using latency-nice |
Series |
sched: EEVDF using latency-nice
Message
Peter Zijlstra
March 28, 2023, 9:26 a.m. UTC
Hi!

Latest version of the EEVDF [1] patches.

Many changes since last time; most notably it now fully replaces CFS and uses lag-based placement for migrations. Smaller changes include:

 - use scale_load_down() for avg_vruntime; I measured the max delta to be ~44 bits on a system/cgroup based kernel build.
 - fixed a bunch of reweight / cgroup placement issues
 - adaptive placement strategy for smaller slices
 - renamed se->lag to se->vlag

There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33% split (stress-futex, stress-nanosleep, starve, etc.) instead of the 50%/50% split that the sleeper bonus achieves. Mostly I think these benchmarks are somewhat artificial/daft, but who knows.

The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice because it places things too far to the left in the tree. Basically it messes with the whole 'when': by placing a task back in history you're putting a burden on the now to accommodate catching up. More tinkering required.

But overall the thing seems to be fairly usable and could do with more extensive testing.
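For readers new to EEVDF, the pick rule the series implements can be sketched in miniature: a task is eligible when its vruntime has not run past the load-weighted average V (i.e. it has non-negative lag), and among eligible tasks the one with the earliest virtual deadline runs. This is a toy illustration under those definitions, not the kernel code; all names and numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    weight: int      # load weight; nice-0 is 1024 in the kernel
    vruntime: float  # virtual runtime consumed so far
    slice: float     # requested slice; latency-nice shrinks or grows this

def avg_vruntime(tasks):
    # V = sum(w_i * v_i) / sum(w_i), the zero-lag point
    total = sum(t.weight for t in tasks)
    return sum(t.weight * t.vruntime for t in tasks) / total

def pick_eevdf(tasks):
    # Eligible: v_i <= V, i.e. the task is still owed CPU time (lag >= 0).
    # Among eligible tasks, pick the earliest virtual deadline,
    # where the deadline is v_i + slice_i / w_i.
    V = avg_vruntime(tasks)
    eligible = [t for t in tasks if t.vruntime <= V]
    return min(eligible, key=lambda t: t.vruntime + t.slice / t.weight)
```

A task that has run ahead of V (a hog competing with a freshly woken sleeper) is simply ineligible until the average catches up, which is where the fairness behaviour discussed above comes from; a smaller slice only moves the deadline earlier, it does not buy more total time.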
[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564

Results:

hackbench -g $nr_cpu + cyclictest --policy other results:

                        EEVDF                    CFS

                # Min Latencies: 00054
  LNICE(19)    # Avg Latencies: 00660
                # Max Latencies: 23103

                # Min Latencies: 00052          00053
  LNICE(0)     # Avg Latencies: 00318          00687
                # Max Latencies: 08593          13913

                # Min Latencies: 00054
  LNICE(-19)   # Avg Latencies: 00055
                # Max Latencies: 00061

Some preliminary results from Chen Yu on a slightly older version:

schbench (95% tail latency, lower is better)
=================================================================================
case            nr_instance     baseline (std%)    compare% ( std%)
normal          25%             1.00 (2.49%)       -81.2% (4.27%)
normal          50%             1.00 (2.47%)       -84.5% (0.47%)
normal          75%             1.00 (2.5%)        -81.3% (1.27%)
normal          100%            1.00 (3.14%)       -79.2% (0.72%)
normal          125%            1.00 (3.07%)       -77.5% (0.85%)
normal          150%            1.00 (3.35%)       -76.4% (0.10%)
normal          175%            1.00 (3.06%)       -76.2% (0.56%)
normal          200%            1.00 (3.11%)       -76.3% (0.39%)
==================================================================================

hackbench (throughput, higher is better)
==============================================================================
case            nr_instance     baseline(std%)     compare%( std%)
threads-pipe    25%             1.00 (<2%)         -17.5 (<2%)
threads-socket  25%             1.00 (<2%)         -1.9 (<2%)
threads-pipe    50%             1.00 (<2%)         +6.7 (<2%)
threads-socket  50%             1.00 (<2%)         -6.3 (<2%)
threads-pipe    100%            1.00 (3%)          +110.1 (3%)
threads-socket  100%            1.00 (<2%)         -40.2 (<2%)
threads-pipe    150%            1.00 (<2%)         +125.4 (<2%)
threads-socket  150%            1.00 (<2%)         -24.7 (<2%)
threads-pipe    200%            1.00 (<2%)         -89.5 (<2%)
threads-socket  200%            1.00 (<2%)         -27.4 (<2%)
process-pipe    25%             1.00 (<2%)         -15.0 (<2%)
process-socket  25%             1.00 (<2%)         -3.9 (<2%)
process-pipe    50%             1.00 (<2%)         -0.4 (<2%)
process-socket  50%             1.00 (<2%)         -5.3 (<2%)
process-pipe    100%            1.00 (<2%)         +62.0 (<2%)
process-socket  100%            1.00 (<2%)         -39.5 (<2%)
process-pipe    150%            1.00 (<2%)         +70.0 (<2%)
process-socket  150%            1.00 (<2%)         -20.3 (<2%)
process-pipe    200%            1.00 (<2%)         +79.2 (<2%)
process-socket  200%            1.00 (<2%)         -22.4 (<2%)
==============================================================================

stress-ng (throughput, higher is better)
==============================================================================
case            nr_instance     baseline(std%)     compare%( std%)
switch          25%             1.00 (<2%)         -6.5 (<2%)
switch          50%             1.00 (<2%)         -9.2 (<2%)
switch          75%             1.00 (<2%)         -1.2 (<2%)
switch          100%            1.00 (<2%)         +11.1 (<2%)
switch          125%            1.00 (<2%)         -16.7% (9%)
switch          150%            1.00 (<2%)         -13.6 (<2%)
switch          175%            1.00 (<2%)         -16.2 (<2%)
switch          200%            1.00 (<2%)         -19.4% (<2%)
fork            50%             1.00 (<2%)         -0.1 (<2%)
fork            75%             1.00 (<2%)         -0.3 (<2%)
fork            100%            1.00 (<2%)         -0.1 (<2%)
fork            125%            1.00 (<2%)         -6.9 (<2%)
fork            150%            1.00 (<2%)         -8.8 (<2%)
fork            200%            1.00 (<2%)         -3.3 (<2%)
futex           25%             1.00 (<2%)         -3.2 (<2%)
futex           50%             1.00 (3%)          -19.9 (5%)
futex           75%             1.00 (6%)          -19.1 (2%)
futex           100%            1.00 (16%)         -30.5 (10%)
futex           125%            1.00 (25%)         -39.3 (11%)
futex           150%            1.00 (20%)         -27.2% (17%)
futex           175%            1.00 (<2%)         -18.6 (<2%)
futex           200%            1.00 (<2%)         -47.5 (<2%)
nanosleep       25%             1.00 (<2%)         -0.1 (<2%)
nanosleep       50%             1.00 (<2%)         -0.0% (<2%)
nanosleep       75%             1.00 (<2%)         +15.2% (<2%)
nanosleep       100%            1.00 (<2%)         -26.4 (<2%)
nanosleep       125%            1.00 (<2%)         -1.3 (<2%)
nanosleep       150%            1.00 (<2%)         +2.1 (<2%)
nanosleep       175%            1.00 (<2%)         +8.3 (<2%)
nanosleep       200%            1.00 (<2%)         +2.0% (<2%)
===============================================================================

unixbench (throughput, higher is better)
==============================================================================
case            nr_instance     baseline(std%)     compare%( std%)
spawn           125%            1.00 (<2%)         +8.1 (<2%)
context1        100%            1.00 (6%)          +17.4 (6%)
context1        75%             1.00 (13%)         +18.8 (8%)
=================================================================================

netperf (throughput, higher is better)
===========================================================================
case            nr_instance     baseline(std%)     compare%( std%)
UDP_RR          25%             1.00 (<2%)         -1.5% (<2%)
UDP_RR          50%             1.00 (<2%)         -0.3% (<2%)
UDP_RR          75%             1.00 (<2%)         +12.5% (<2%)
UDP_RR          100%            1.00 (<2%)         -4.3% (<2%)
UDP_RR          125%            1.00 (<2%)         -4.9% (<2%)
UDP_RR          150%            1.00 (<2%)         -4.7% (<2%)
UDP_RR          175%            1.00 (<2%)         -6.1% (<2%)
UDP_RR          200%            1.00 (<2%)         -6.6% (<2%)
TCP_RR          25%             1.00 (<2%)         -1.4% (<2%)
TCP_RR          50%             1.00 (<2%)         -0.2% (<2%)
TCP_RR          75%             1.00 (<2%)         -3.9% (<2%)
TCP_RR          100%            1.00 (2%)          +3.6% (5%)
TCP_RR          125%            1.00 (<2%)         -4.2% (<2%)
TCP_RR          150%            1.00 (<2%)         -6.0% (<2%)
TCP_RR          175%            1.00 (<2%)         -7.4% (<2%)
TCP_RR          200%            1.00 (<2%)         -8.4% (<2%)
==========================================================================

---
Also available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/eevdf

---
Parth Shah (1):
      sched: Introduce latency-nice as a per-task attribute

Peter Zijlstra (14):
      sched/fair: Add avg_vruntime
      sched/fair: Remove START_DEBIT
      sched/fair: Add lag based placement
      rbtree: Add rb_add_augmented_cached() helper
      sched/fair: Implement an EEVDF like policy
      sched: Commit to lag based placement
      sched/smp: Use lag to simplify cross-runqueue placement
      sched: Commit to EEVDF
      sched/debug: Rename min_granularity to base_slice
      sched: Merge latency_offset into slice
      sched/eevdf: Better handle mixed slice length
      sched/eevdf: Sleeper bonus
      sched/eevdf: Minimal vavg option
      sched/eevdf: Debug / validation crud

Vincent Guittot (2):
      sched/fair: Add latency_offset
      sched/fair: Add sched group latency support

 Documentation/admin-guide/cgroup-v2.rst |   10 +
 include/linux/rbtree_augmented.h        |   26 +
 include/linux/sched.h                   |    6 +
 include/uapi/linux/sched.h              |    4 +-
 include/uapi/linux/sched/types.h        |   19 +
 init/init_task.c                        |    3 +-
 kernel/sched/core.c                     |   65 +-
 kernel/sched/debug.c                    |   49 +-
 kernel/sched/fair.c                     | 1199 ++++++++++++++++---------------
 kernel/sched/features.h                 |   29 +-
 kernel/sched/sched.h                    |   23 +-
 tools/include/uapi/linux/sched.h        |    4 +-
 12 files changed, 794 insertions(+), 643 deletions(-)
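The 67%/33% vs 50%/50% point in the cover letter is plain weight-proportional arithmetic: a strictly fair scheduler gives each runnable entity weight/total of the CPU, nothing more. A trivial sketch (the weights here are illustrative nice-0 and half-weight values, not taken from the patches):

```python
def fair_shares(weights):
    # Under strict weighted fairness each runnable entity gets
    # CPU time proportional to its weight.
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# A full-weight parent competing with a half-weight child:
shares = fair_shares({"parent": 1024, "child": 512})
# parent gets ~67% and child ~33%, the split EEVDF produces;
# a sleeper bonus is what pushes this back toward 50%/50%.
```

This is why benchmarks tuned against CFS's sleeper-bonus behaviour can regress under EEVDF even though the new split is the fair one.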
Comments
On 3/28/23 2:56 PM, Peter Zijlstra wrote:
> Hi!
>
> Latest version of the EEVDF [1] patches.
>
> [... full cover letter and benchmark results quoted above; trimmed ...]

Tested the patch on a Power system with 60 cores with SMT=8, a total of 480 CPUs. The system has four NUMA nodes.

TL;DR: A real-life workload like daytrader shows improvement in different cases, while microbenchmarks show both gains and regressions.

Tested with microbenchmarks (hackbench, schbench, unixbench, STREAM and lmbench) and a DB workload called daytrader. daytrader simulates real-life trading activity and reports total transactions/s. It uses around 70% CPU.

Comparison is between tip/master vs tip/master + this patch set. tip/master was at 4b7aa0abddff.

Small nit: the series applies cleanly to tip/master, but fails to apply cleanly to sched/core. sched/core is at 05bfb338fa8d.

===============================================================================
Summary of methods and observations.
===============================================================================

Method 1: Ran microbenchmarks on an idle system without any cgroups.
Observation: hackbench and unixbench show gains. schbench shows regression. STREAM and lmbench values are the same.

Method 2: Ran microbenchmarks on an idle system, from within a cgroup that has latency values assigned to it. This is almost the same as Method 1.
Observation: hackbench pipe shows improvement. schbench shows regression. lmbench and STREAM are more or less the same.

Method 3: Ran microbenchmarks in a cgroup while stress-ng ran at 50% utilization in another terminal. Here, different latency-nice values were also tried for the cgroup.
Observation: hackbench shows gains in latency values. schbench shows good gains in latency values except in the 1-thread case. lmbench and STREAM regress slightly. unixbench is mixed. One concerning throughput result is 4 X Shell Scripts (8 concurrent), which shows a 50% regression.
This was verified with an additional run. The same holds true for 25% utilization as well.

Method 4: Ran daytrader with no cgroups on an idle system.
Observation: around 7% gain in throughput.

Method 5: Ran daytrader in a cgroup with stress-ng running at 50% utilization.
Observation: around 9% gain in throughput.

===============================================================================

Note: positive values show improvement and negative values show regression.
hackbench has 50 iterations. schbench has 10 iterations. unixbench has 10 iterations. lmbench has 50 iterations.

# lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                480
On-line CPU(s) list:   0-479
Thread(s) per core:    8
Core(s) per socket:    15
Socket(s):             4
Physical sockets:      4
Physical chips:        1
Physical cores/chip:   15
NUMA:
  NUMA node(s):        4
  NUMA node0 CPU(s):   0-119
  NUMA node1 CPU(s):   120-239
  NUMA node2 CPU(s):   240-359
  NUMA node3 CPU(s):   360-479

===============================================================================
Detailed logs from each method.

================
Method 1:
================
This is to compare the out-of-box performance of the two: no load on the system, benchmarks run without any cgroup. hackbench shows improvement. schbench results are mixed, but schbench runs have variance. STREAM and lmbench are the same.
-------------------------------------------------------------------------------
lmbench                      tip/master  eevdf
-------------------------------------------------------------------------------
latency process fork       : 120.56, 120.32(+0.20)
latency process Exec       : 176.70, 177.22(-0.30)
latency process Exec       : 5.59, 5.89(-5.47)
latency syscall fstat      : 0.26, 0.26( 0.00)
latency syscall open       : 2.27, 2.29(-0.88)
AF_UNIX sock stream latency: 9.13, 9.34(-2.30)
Select on 200 fd's         : 2.16, 2.15(-0.46)
semaphore latency          : 0.85, 0.85(+0.00)
-------------------------------------------------------------------------------
Stream                       tip/master  eevdf
-------------------------------------------------------------------------------
copy latency    : 0.58, 0.59(-1.72)
copy bandwidth  : 27357.05, 27009.15(-1.27)
scale latency   : 0.61, 0.61(0.00)
scale bandwidth : 26268.65, 26057.07(-0.81)
add latency     : 1.25, 1.25(0.00)
add bandwidth   : 19176.21, 19177.24(0.01)
triad latency   : 0.74, 0.74(0.00)
triad bandwidth : 32591.51, 32506.32(-0.26)
-------------------------------------------------------------------------------
Unixbench                    tip/master  eevdf
-------------------------------------------------------------------------------
1 X Execl Throughput             : 5158.07, 5228.97(1.37)
4 X Execl Throughput             : 12745.19, 12927.75(1.43)
1 X Pipe-based Context Switching : 178280.42, 170140.15(-4.57)
4 X Pipe-based Context Switching : 594414.36, 560509.01(-5.70)
1 X Process Creation             : 8657.10, 8659.28(0.03)
4 X Process Creation             : 16476.56, 17007.43(3.22)
1 X Shell Scripts (1 concurrent) : 10179.24, 10307.21(1.26)
4 X Shell Scripts (1 concurrent) : 32990.17, 33251.73(0.79)
1 X Shell Scripts (8 concurrent) : 4878.56, 4940.22(1.26)
4 X Shell Scripts (8 concurrent) : 14001.89, 13568.88(-3.09)
-------------------------------------------------------------------------------
Schbench                     tip/master  eevdf
-------------------------------------------------------------------------------
1 Threads
  50.0th: 7.20, 7.00(2.78)
  75.0th: 8.20, 7.90(3.66)
  90.0th: 10.10, 8.30(17.82)
  95.0th: 11.40, 9.30(18.42)
  99.0th: 13.30, 11.00(17.29)
  99.5th: 13.60, 11.70(13.97)
  99.9th: 15.40, 13.40(12.99)
2 Threads
  50.0th: 8.60, 8.00(6.98)
  75.0th: 9.80, 8.80(10.20)
  90.0th: 11.50, 9.90(13.91)
  95.0th: 12.40, 10.70(13.71)
  99.0th: 13.50, 13.70(-1.48)
  99.5th: 14.90, 15.00(-0.67)
  99.9th: 27.60, 23.60(14.49)
4 Threads
  50.0th: 10.00, 9.90(1.00)
  75.0th: 11.70, 12.00(-2.56)
  90.0th: 13.60, 14.30(-5.15)
  95.0th: 14.90, 15.40(-3.36)
  99.0th: 17.80, 18.50(-3.93)
  99.5th: 19.00, 19.30(-1.58)
  99.9th: 27.60, 32.10(-16.30)
8 Threads
  50.0th: 12.20, 13.30(-9.02)
  75.0th: 15.20, 17.50(-15.13)
  90.0th: 18.40, 21.60(-17.39)
  95.0th: 20.70, 24.10(-16.43)
  99.0th: 26.30, 30.20(-14.83)
  99.5th: 30.50, 37.90(-24.26)
  99.9th: 53.10, 92.10(-73.45)
16 Threads
  50.0th: 20.70, 19.70(4.83)
  75.0th: 28.20, 27.20(3.55)
  90.0th: 36.20, 33.80(6.63)
  95.0th: 40.70, 37.50(7.86)
  99.0th: 51.50, 45.30(12.04)
  99.5th: 62.70, 49.40(21.21)
  99.9th: 120.70, 88.40(26.76)
32 Threads
  50.0th: 39.50, 38.60(2.28)
  75.0th: 58.30, 56.10(3.77)
  90.0th: 76.40, 72.60(4.97)
  95.0th: 86.30, 82.20(4.75)
  99.0th: 102.20, 98.90(3.23)
  99.5th: 108.00, 105.30(2.50)
  99.9th: 179.30, 188.80(-5.30)
-------------------------------------------------------------------------------
Hackbench                    tip/master  eevdf
-------------------------------------------------------------------------------
Process 10 groups       : 0.19, 0.19(0.00)
Process 20 groups       : 0.24, 0.26(-8.33)
Process 30 groups       : 0.30, 0.31(-3.33)
Process 40 groups       : 0.35, 0.37(-5.71)
Process 50 groups       : 0.41, 0.44(-7.32)
Process 60 groups       : 0.47, 0.50(-6.38)
thread 10 groups        : 0.22, 0.23(-4.55)
thread 20 groups        : 0.28, 0.27(3.57)
Process(Pipe) 10 groups : 0.16, 0.16(0.00)
Process(Pipe) 20 groups : 0.26, 0.24(7.69)
Process(Pipe) 30 groups : 0.36, 0.30(16.67)
Process(Pipe) 40 groups : 0.40, 0.35(12.50)
Process(Pipe) 50 groups : 0.48, 0.40(16.67)
Process(Pipe) 60 groups : 0.55, 0.44(20.00)
thread (Pipe) 10 groups : 0.16, 0.14(12.50)
thread (Pipe) 20 groups : 0.24, 0.22(8.33)

================
Method 2:
================
This compares baseline performance with eevdf at different latency-nice values. A cgroup was created, latency-nice values were assigned to it, and the microbenchmarks were run from that cgroup. hackbench pipe shows improvement. schbench shows regression. lmbench and STREAM are more or less the same.

-------------------------------------------------------------------------------
lmbench          tip/master  eevdf(LN=0)  eevdf(LN=-20)  eevdf(LN=19)
-------------------------------------------------------------------------------
latency process fork       : 121.20, 121.35(-0.12), 121.75(-0.45), 120.61(0.49)
latency process Exec       : 177.60, 180.84(-1.82), 177.93(-0.18), 177.44(0.09)
latency process Exec       : 5.80, 6.16(-6.27), 6.14(-5.89), 6.14(-5.91)
latency syscall fstat      : 0.26, 0.26(0.00), 0.26(0.00), 0.26(0.00)
latency syscall open       : 2.27, 2.29(-0.88), 2.29(-0.88), 2.29(-0.88)
AF_UNIX sock_stream latency: 9.31, 9.61(-3.22), 9.61(-3.22), 9.53(-2.36)
Select on 200 fd's         : 2.17, 2.15(0.92), 2.15(0.92), 2.15(0.92)
semaphore latency          : 0.88, 0.89(-1.14), 0.88(0.00), 0.88(0.00)
-------------------------------------------------------------------------------
Stream           tip/master  eevdf(LN=0)  eevdf(LN=-20)  eevdf(LN=19)
-------------------------------------------------------------------------------
copy latency   : 0.56, 0.58(-3.57), 0.58(-3.57), 0.58(-3.57)
copy bandwidth : 28767.80, 27520.04(-4.34), 27506.95(-4.38), 27381.61(-4.82)
scale latency  : 0.60, 0.61(-1.67), 0.61(-1.67), 0.61(-1.67)
scale bandwidth: 26875.58, 26385.22(-1.82), 26339.94(-1.99), 26302.86(-2.13)
add latency    : 1.25, 1.25(0.00), 1.25(0.00), 1.25(0.00)
add bandwidth  : 19175.76, 19177.48(0.01), 19177.60(0.01), 19176.32(0.00)
triad latency  : 0.74, 0.73(1.35), 0.74(0.00), 0.74(0.00)
triad bandwidth: 32545.70, 32658.95(0.35), 32581.78(0.11), 32561.74(0.05)
-------------------------------------------------------------------------------------------------- Unixbench tip/master eevdf(LN=0) eevdf(LN=-20) eevdf(LN=19) -------------------------------------------------------------------------------------------------- 1 X Execl Throughput : 5147.23, 5184.87(0.73), 5217.16(1.36), 5218.21(1.38) 4 X Execl Throughput : 13225.55, 13638.36(3.12), 13643.07(3.16), 13636.50(3.11) 1 X Pipe-based Context Switching:171413.56, 162720.69(-5.07), 163420.54(-4.66), 163446.67(-4.65) 4 X Pipe-based Context Switching:564887.90, 554545.01(-1.83), 555561.24(-1.65), 547421.20(-3.09) 1 X Process Creation : 8555.73, 8503.18(-0.61), 8556.39(0.01), 8621.36(0.77) 4 X Process Creation : 17007.47, 16372.44(-3.73), 17002.88(-0.03), 16611.47(-2.33) 1 X Shell Scripts (1 concurrent): 10104.23, 10235.09(1.30), 10171.44(0.67), 10275.76(1.70) 4 X Shell Scripts (1 concurrent): 33752.14, 32278.50(-4.37), 32885.92(-2.57), 32256.58(-4.43) 1 X Shell Scripts (8 concurrent): 4864.71, 4909.30(0.92), 4914.62(1.03), 4896.45(0.65) 4 X Shell Scripts (8 concurrent): 14237.17, 13395.20(-5.91), 13599.52(-4.48), 12923.93(-9.22) ------------------------------------------------------------------------------- schbench tip/master eevdf(LN=0) eevdf(LN=-20) eevdf(LN=19) ------------------------------------------------------------------------------- 1 Threads 50.0th: 6.90, 7.30(-5.80), 7.30(-5.80), 7.10(-2.90) 75.0th: 7.90, 8.40(-6.33), 8.60(-8.86), 8.00(-1.27) 90.0th: 10.10, 9.60(4.95), 10.50(-3.96), 8.90(11.88) 95.0th: 11.20, 10.60(5.36), 11.10(0.89), 9.40(16.07) 99.0th: 13.30, 12.70(4.51), 12.80(3.76), 11.80(11.28) 99.5th: 13.90, 13.50(2.88), 13.60(2.16), 12.40(10.79) 99.9th: 15.00, 15.40(-2.67), 15.20(-1.33), 13.70(8.67) 2 Threads 50.0th: 7.20, 8.10(-12.50), 8.00(-11.11), 8.40(-16.67) 75.0th: 8.30, 9.20(-10.84), 9.00(-8.43), 9.70(-16.87) 90.0th: 10.10, 11.00(-8.91), 10.00(0.99), 11.00(-8.91) 95.0th: 11.30, 12.60(-11.50), 10.60(6.19), 11.60(-2.65) 99.0th: 14.40, 15.40(-6.94), 
                11.90(17.36), 13.70(4.86)
        99.5th: 15.20, 16.10(-5.92), 13.20(13.16), 14.60(3.95)
        99.9th: 16.40, 17.30(-5.49), 14.70(10.37), 16.20(1.22)
4 Threads
        50.0th: 8.90, 10.30(-15.73), 10.00(-12.36), 10.10(-13.48)
        75.0th: 10.80, 12.10(-12.04), 11.80(-9.26), 12.00(-11.11)
        90.0th: 13.00, 14.00(-7.69), 13.70(-5.38), 14.30(-10.00)
        95.0th: 14.40, 15.20(-5.56), 14.90(-3.47), 15.80(-9.72)
        99.0th: 16.90, 17.50(-3.55), 18.70(-10.65), 19.80(-17.16)
        99.5th: 17.40, 18.50(-6.32), 19.80(-13.79), 22.10(-27.01)
        99.9th: 18.70, 22.30(-19.25), 22.70(-21.39), 37.50(-100.53)
8 Threads
        50.0th: 11.50, 12.80(-11.30), 13.30(-15.65), 12.80(-11.30)
        75.0th: 15.00, 16.30(-8.67), 16.90(-12.67), 16.20(-8.00)
        90.0th: 18.80, 19.50(-3.72), 20.30(-7.98), 19.90(-5.85)
        95.0th: 21.40, 21.80(-1.87), 22.30(-4.21), 22.10(-3.27)
        99.0th: 27.60, 26.30(4.71), 27.60(0.00), 27.30(1.09)
        99.5th: 30.40, 32.40(-6.58), 36.40(-19.74), 30.00(1.32)
        99.9th: 56.90, 59.10(-3.87), 66.70(-17.22), 60.90(-7.03)
16 Threads
        50.0th: 19.20, 20.90(-8.85), 20.60(-7.29), 21.00(-9.38)
        75.0th: 25.30, 27.50(-8.70), 27.80(-9.88), 28.30(-11.86)
        90.0th: 31.20, 34.60(-10.90), 35.10(-12.50), 35.20(-12.82)
        95.0th: 35.40, 38.90(-9.89), 39.50(-11.58), 39.20(-10.73)
        99.0th: 44.90, 47.60(-6.01), 47.50(-5.79), 47.60(-6.01)
        99.5th: 48.50, 50.50(-4.12), 50.20(-3.51), 55.60(-14.64)
        99.9th: 70.80, 84.70(-19.63), 81.40(-14.97), 103.50(-46.19)
32 Threads
        50.0th: 39.10, 38.60(1.28), 36.10(7.67), 39.50(-1.02)
        75.0th: 57.20, 56.10(1.92), 52.00(9.09), 57.70(-0.87)
        90.0th: 74.00, 73.70(0.41), 65.70(11.22), 74.40(-0.54)
        95.0th: 82.30, 83.50(-1.46), 74.20(9.84), 84.50(-2.67)
        99.0th: 95.80, 98.60(-2.92), 92.10(3.86), 100.50(-4.91)
        99.5th: 101.50, 104.10(-2.56), 98.90(2.56), 108.20(-6.60)
        99.9th: 185.70, 179.90(3.12), 163.50(11.95), 193.00(-3.93)
-------------------------------------------------------------------------------
Hackbench          tip/master    eevdf(LN=0)   eevdf(LN=-20)   eevdf(LN=19)
-------------------------------------------------------------------------------
Process       10 groups : 0.19, 0.19(0.00),   0.19(0.00),   0.19(0.00)
Process       20 groups : 0.24, 0.25(-4.17),  0.26(-8.33),  0.25(-4.17)
Process       30 groups : 0.30, 0.31(-3.33),  0.31(-3.33),  0.30(0.00)
Process       40 groups : 0.35, 0.37(-5.71),  0.38(-8.57),  0.38(-8.57)
Process       50 groups : 0.43, 0.44(-2.33),  0.44(-2.33),  0.44(-2.33)
Process       60 groups : 0.49, 0.52(-6.12),  0.51(-4.08),  0.51(-4.08)
thread        10 groups : 0.23, 0.22(4.35),   0.23(0.00),   0.23(0.00)
thread        20 groups : 0.28, 0.28(0.00),   0.27(3.57),   0.28(0.00)
Process(Pipe) 10 groups : 0.17, 0.16(5.88),   0.16(5.88),   0.16(5.88)
Process(Pipe) 20 groups : 0.25, 0.24(4.00),   0.24(4.00),   0.24(4.00)
Process(Pipe) 30 groups : 0.32, 0.29(9.38),   0.29(9.38),   0.29(9.38)
Process(Pipe) 40 groups : 0.39, 0.34(12.82),  0.34(12.82),  0.34(12.82)
Process(Pipe) 50 groups : 0.45, 0.39(13.33),  0.39(13.33),  0.38(15.56)
Process(Pipe) 60 groups : 0.51, 0.43(15.69),  0.43(15.69),  0.43(15.69)
thread(Pipe)  10 groups : 0.16, 0.15(6.25),   0.15(6.25),   0.15(6.25)
thread(Pipe)  20 groups : 0.24, 0.22(8.33),   0.22(8.33),   0.22(8.33)

================
Method 3:
================
Comparing baseline vs eevdf when the system utilization is 50%. A cpu cgroup
is created and different latency nice values are assigned to it. On another
bash terminal, stress-ng is running at 50% utilization (stress-ng --cpu=480
-l 50).

Hackbench shows a gain in latency values. Schbench shows a good gain in
latency values except in the 1-thread case. lmbench and stream regress
slightly. unixbench is mixed. One concerning throughput result is 4 X Shell
Scripts (8 concurrent), which shows a 50% regression; this was verified with
an additional run. The same holds true for 25% utilization as well.
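For reference, the setup described above can be sketched roughly as below.
This is my own reconstruction, not the poster's script: the cgroup v2 mount
point and the per-cgroup "cpu.latency.nice" file name come from the proposed
latency-nice series and may differ between patch revisions; the hackbench
invocation is just an example of a benchmark run inside the cgroup.

```shell
# Background load at 50% utilization (the tested system has 480 CPUs):
stress-ng --cpu=480 -l 50 &

# Create a cpu cgroup and assign it a latency nice value (assumed knob
# name from the latency-nice series; may differ per revision):
mkdir /sys/fs/cgroup/bench
echo -20 > /sys/fs/cgroup/bench/cpu.latency.nice    # the LN=-20 column

# Move the current shell into the cgroup and run the benchmark there:
echo $$ > /sys/fs/cgroup/bench/cgroup.procs
hackbench -g 10
```

This is a root-only setup fragment; it assumes cgroup v2 is mounted at
/sys/fs/cgroup.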
-------------------------------------------------------------------------------
lmbench                      tip/master   eevdf(LN=0)    eevdf(LN=-20)  eevdf(LN=19)
-------------------------------------------------------------------------------
latency process fork       : 152.98, 158.34(-3.50),  155.07(-1.36), 157.57(-3.00)
latency process Exec       : 214.30, 214.08(0.10),   214.41(-0.05), 215.16(-0.40)
latency process Exec       :  12.44,  11.86(4.66),    10.60(14.79),  10.58(14.94)
latency syscall fstat      :   0.44,   0.45(-2.27),    0.43(2.27),    0.45(-2.27)
latency syscall open       :   3.71,   3.68(0.81),     3.70(0.27),    3.74(-0.81)
AF_UNIX sock stream latency:  14.07,  13.44(4.48),    14.69(-4.41),  13.65(2.99)
Select on 200 fd's         :   3.97,   4.16(-4.79),    4.02(-1.26),   4.21(-6.05)
semaphore latency          :   1.83,   1.82(0.55),     1.77(3.28),    1.75(4.37)
-------------------------------------------------------------------------------
Stream             tip/master    eevdf(LN=0)      eevdf(LN=-20)    eevdf(LN=19)
-------------------------------------------------------------------------------
copy latency   :     0.69,     0.69(0.00),       0.76(-10.14),     0.72(-4.35)
copy bandwidth : 23947.02, 24275.24(1.37),   22032.30(-8.00),  23487.29(-1.92)
scale latency  :     0.71,     0.74(-4.23),      0.75(-5.63),      0.77(-8.45)
scale bandwidth: 23490.27, 22713.99(-3.30),  22168.98(-5.62),  21782.47(-7.27)
add latency    :     1.34,     1.36(-1.49),      1.39(-3.73),      1.42(-5.97)
add bandwidth  : 17986.34, 17771.92(-1.19),  17461.59(-2.92),  17276.34(-3.95)
triad latency  :     0.91,     0.93(-2.20),      0.91(0.00),       0.94(-3.30)
triad bandwidth: 27948.13, 27652.98(-1.06),  28134.58(0.67),   27269.73(-2.43)
-------------------------------------------------------------------------------------------------
Unixbench                          tip/master    eevdf(LN=0)       eevdf(LN=-20)     eevdf(LN=19)
-------------------------------------------------------------------------------------------------
1 X Execl Throughput            :   4940.56,   4944.30(0.08),     4991.69(1.03),    4982.80(0.85)
4 X Execl Throughput            :  10737.13,  10885.69(1.38),    10615.75(-1.13),  10803.82(0.62)
1 X Pipe-based Context Switching:  91313.57, 103426.11(13.26), 102985.91(12.78), 104614.22(14.57)
4 X Pipe-based Context Switching: 370430.07, 408075.33(10.16), 409273.07(10.49), 431360.88(16.45)
1 X Process Creation            :   6844.45,   6854.06(0.14),     6887.63(0.63),    6894.30(0.73)
4 X Process Creation            :  18690.31,  19307.50(3.30),    19425.39(3.93),   19128.43(2.34)
1 X Shell Scripts (1 concurrent):   8184.52,   8135.30(-0.60),    8185.53(0.01),    8163.10(-0.26)
4 X Shell Scripts (1 concurrent):  25737.71,  22583.29(-12.26),  22470.35(-12.69), 22615.13(-12.13)
1 X Shell Scripts (8 concurrent):   3653.71,   3115.03(-14.74),   3156.26(-13.61),  3106.63(-14.97)  <<<<< This may be of concern.
4 X Shell Scripts (8 concurrent):   9625.38,   4505.63(-53.19),   4484.03(-53.41),  4468.70(-53.57)  <<<<< This is a concerning one.
-------------------------------------------------------------------------------
schbench           tip/master    eevdf(LN=0)     eevdf(LN=-20)    eevdf(LN=19)
-------------------------------------------------------------------------------
1 Threads
        50.0th: 15.10, 15.20(-0.66), 15.10(0.00), 15.10(0.00)
        75.0th: 17.20, 17.70(-2.91), 17.20(0.00), 17.40(-1.16)
        90.0th: 20.10, 20.70(-2.99), 20.40(-1.49), 20.70(-2.99)
        95.0th: 22.20, 22.80(-2.70), 22.60(-1.80), 23.10(-4.05)
        99.0th: 45.10, 51.50(-14.19), 37.20(17.52), 44.50(1.33)
        99.5th: 79.80, 106.20(-33.08), 103.10(-29.20), 101.00(-26.57)
        99.9th: 206.60, 771.40(-273.38), 1003.50(-385.72), 905.50(-338.29)
2 Threads
        50.0th: 16.50, 17.00(-3.03), 16.70(-1.21), 16.20(1.82)
        75.0th: 19.20, 19.90(-3.65), 19.40(-1.04), 18.90(1.56)
        90.0th: 22.20, 23.10(-4.05), 22.80(-2.70), 22.00(0.90)
        95.0th: 24.30, 25.40(-4.53), 25.20(-3.70), 24.50(-0.82)
        99.0th: 97.00, 41.70(57.01), 43.00(55.67), 45.10(53.51)
        99.5th: 367.10, 96.70(73.66), 98.80(73.09), 104.60(71.51)
        99.9th: 3770.80, 811.40(78.48), 1414.70(62.48), 886.90(76.48)
4 Threads
        50.0th: 20.00, 20.10(-0.50), 19.70(1.50), 19.50(2.50)
        75.0th: 23.50, 23.40(0.43), 22.80(2.98), 23.00(2.13)
        90.0th: 28.00, 27.00(3.57), 26.50(5.36), 26.60(5.00)
        95.0th: 37.20, 29.50(20.70), 28.90(22.31), 28.80(22.58)
        99.0th: 2792.50, 42.80(98.47), 38.30(98.63), 37.00(98.68)
        99.5th: 4964.00, 101.50(97.96), 85.00(98.29), 70.20(98.59)
        99.9th: 7864.80, 1722.20(78.10), 755.40(90.40), 817.10(89.61)
8 Threads
        50.0th: 25.30, 24.50(3.16), 24.30(3.95), 23.60(6.72)
        75.0th: 31.80, 30.00(5.66), 29.90(5.97), 29.30(7.86)
        90.0th: 39.30, 35.00(10.94), 35.00(10.94), 34.20(12.98)
        95.0th: 198.00, 38.20(80.71), 38.20(80.71), 37.40(81.11)
        99.0th: 4601.20, 56.30(98.78), 85.90(98.13), 65.30(98.58)
        99.5th: 6422.40, 162.70(97.47), 195.30(96.96), 153.40(97.61)
        99.9th: 9684.00, 3237.60(66.57), 3726.40(61.52), 3965.60(59.05)
16 Threads
        50.0th: 37.00, 35.20(4.86), 33.90(8.38), 34.00(8.11)
        75.0th: 49.20, 46.00(6.50), 44.20(10.16), 44.40(9.76)
        90.0th: 64.20, 54.80(14.64), 52.80(17.76), 53.20(17.13)
        95.0th: 890.20, 59.70(93.29), 58.20(93.46), 58.60(93.42)
        99.0th: 5369.60, 85.30(98.41), 124.90(97.67), 116.90(97.82)
        99.5th: 6952.00, 228.00(96.72), 680.20(90.22), 339.40(95.12)
        99.9th: 9222.40, 4896.80(46.90), 4648.40(49.60), 4365.20(52.67)
32 Threads
        50.0th: 59.60, 56.80(4.70), 55.30(7.21), 56.00(6.04)
        75.0th: 83.70, 78.70(5.97), 75.90(9.32), 77.50(7.41)
        90.0th: 122.70, 95.50(22.17), 92.40(24.69), 93.80(23.55)
        95.0th: 1680.40, 105.00(93.75), 102.20(93.92), 103.70(93.83)
        99.0th: 6540.80, 382.10(94.16), 321.10(95.09), 489.30(92.52)
        99.5th: 8094.40, 2144.20(73.51), 2172.70(73.16), 1990.70(75.41)
        99.9th: 11417.60, 6672.80(41.56), 6903.20(39.54), 6268.80(45.10)
-------------------------------------------------------------------------------
Hackbench          tip/master    eevdf(LN=0)   eevdf(LN=-20)   eevdf(LN=19)
-------------------------------------------------------------------------------
Process       10 groups : 0.18, 0.18(0.00),   0.18(0.00),   0.18(0.00)
Process       20 groups : 0.32, 0.33(-3.13),  0.33(-3.13),  0.33(-3.13)
Process       30 groups : 0.42, 0.43(-2.38),  0.43(-2.38),  0.43(-2.38)
Process       40 groups : 0.51, 0.53(-3.92),  0.53(-3.92),  0.53(-3.92)
Process       50 groups : 0.62, 0.64(-3.23),  0.65(-4.84),  0.64(-3.23)
Process       60 groups : 0.72, 0.73(-1.39),  0.74(-2.78),  0.74(-2.78)
thread        10 groups : 0.19, 0.19(0.00),   0.19(0.00),   0.19(0.00)
thread        20 groups : 0.33, 0.34(-3.03),  0.34(-3.03),  0.34(-3.03)
Process(Pipe) 10 groups : 0.17, 0.16(5.88),   0.16(5.88),   0.16(5.88)
Process(Pipe) 20 groups : 0.25, 0.23(8.00),   0.23(8.00),   0.23(8.00)
Process(Pipe) 30 groups : 0.36, 0.31(13.89),  0.31(13.89),  0.31(13.89)
Process(Pipe) 40 groups : 0.42, 0.36(14.29),  0.36(14.29),  0.36(14.29)
Process(Pipe) 50 groups : 0.49, 0.42(14.29),  0.41(16.33),  0.42(14.29)
Process(Pipe) 60 groups : 0.53, 0.44(16.98),  0.44(16.98),  0.44(16.98)
thread(Pipe)  10 groups : 0.14, 0.14(0.00),   0.14(0.00),   0.14(0.00)
thread(Pipe)  20 groups : 0.24, 0.24(0.00),   0.22(8.33),   0.23(4.17)

================
Method 4:
================
Running daytrader on an idle system without any cgroup. daytrader is a
trading simulator application which does buy/sell, intraday trading, etc.
It is a throughput-oriented workload driven by JMeter.
reference: https://www.ibm.com/docs/en/linux-on-systems?topic=bad-daytrader

We see around 7% improvement in throughput with eevdf.
--------------------------------------------------------------------------
daytrader                tip/master        eevdf
--------------------------------------------------------------------------
Total throughputs            1x        1.0717x(7.17)

================
Method 5:
================
Running daytrader on a system where utilization is 50%. Created a cgroup,
ran the workload in it, and assigned different latency nice values to it.
On another bash terminal, stress-ng is running at 50% utilization.

At LN=0, we see a 9% improvement with eevdf compared to baseline.
-------------------------------------------------------------------------
daytrader          tip/master      eevdf           eevdf          eevdf
                                   (LN=0)          (LN=-20)       (LN=19)
-------------------------------------------------------------------------
Total throughputs      1x      1.0923x(9.2%)   1.0759x(7.6)   1.111x(11.1)

Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
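A note on reading the tables above: the values in parentheses are the
relative change versus the tip/master baseline, with positive meaning an
improvement. For latency/runtime rows the sign convention is (base - new) /
base, while for throughput/bandwidth rows it is (new - base) / base. A small
sketch of that convention (my reconstruction, not the poster's script; the
function name is hypothetical):

```python
def delta_pct(base: float, new: float, higher_is_better: bool = False) -> float:
    """Percent gain (+) or regression (-) vs. baseline, as in the tables."""
    if higher_is_better:
        # Throughput / bandwidth rows: bigger numbers are better.
        return round((new - base) / base * 100, 2)
    # Latency / runtime rows: smaller numbers are better.
    return round((base - new) / base * 100, 2)

# Examples reproduced from the tables above:
print(delta_pct(0.69, 0.76))                # stream copy latency   -> -10.14
print(delta_pct(23947.02, 24275.24, True))  # stream copy bandwidth -> 1.37
print(delta_pct(9625.38, 4505.63, True))    # 4 X Shell (8 conc.)   -> -53.19
```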
On Tue, Mar 28, 2023 at 11:26:22AM +0200, Peter Zijlstra wrote:
> Hi!
>
> Latest version of the EEVDF [1] patches.
>
> Many changes since last time; most notably it now fully replaces CFS and uses
> lag based placement for migrations. Smaller changes include:
>
>  - uses scale_load_down() for avg_vruntime; I measured the max delta to be ~44
>    bits on a system/cgroup based kernel build.
>  - fixed a bunch of reweight / cgroup placement issues
>  - adaptive placement strategy for smaller slices
>  - rename se->lag to se->vlag
>
> There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the
> PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress
> because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33%
> split (stress-futex, stress-nanosleep, starve, etc..) instead of a 50%/50%
> split that sleeper bonus achieves. Mostly I think these benchmarks are somewhat
> artificial/daft but who knows.
>
> The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice
> because it places things too far to the left in the tree. Basically it messes
> with the whole 'when', by placing a task back in history you're putting a
> burden on the now to accommodate catching up. More tinkering required.
>
> But over-all the thing seems to be fairly usable and could do with more
> extensive testing.

Hi Peter,

I used the EEVDF scheduler to run workloads on one of Meta's largest
services (our main HHVM web server), and I wanted to share my observations
with you.

The TL;DR is that, unfortunately, it appears as though EEVDF regresses
these workloads rather substantially. Running with "vanilla" EEVDF (i.e.
on this patch set up to [0], with no changes to latency nice for any task)
compared to vanilla CFS results in the following outcomes for our major
KPIs for servicing web requests:

- .5 - 2.5% drop in throughput
- 1 - 5% increase in p95 latencies
- 1.75 - 6% increase in p99 latencies
- .5 - 4% drop in 50th percentile throughput

[0]: https://lore.kernel.org/lkml/20230328110354.562078801@infradead.org/

Decreasing latency nice for our critical web workers unfortunately did not
help either. For example, here are the numbers for a latency nice value of
-10:

- .8 - 2.5% drop in throughput
- 2 - 4% increase in p95 latencies
- 1 - 4.5% increase in p99 latencies
- 0 - 4.5% increase in 50th percentile throughput

Other latency nice values resulted in similar metrics. Some metrics may get
slightly better, and others slightly worse, but the end result was always a
relatively significant regression from vanilla CFS. Throughout the rest of
this write-up, the remaining figures quoted will be from vanilla EEVDF runs
(modulo some numbers towards the end of this write-up which describe the
outcome of increasing the default base slice with sysctl_sched_base_slice).

With that out of the way, let me outline some of the reasons for these
regressions:

1. Improved runqueue delays, but costly faults and involuntary context
   switches

EEVDF substantially increased the number of context switches on the system,
by 15 - 35%. On its own, this doesn't necessarily imply a problem. For
example, we observed that EEVDF resulted in a 20 - 40% reduction in the
time that tasks spent waiting on the runqueue before being placed on a CPU.

There were, however, other metrics which were less encouraging. We observed
a 400 - 550% increase in involuntary context switches (which are also
presumably a reason for the improved runqueue delays), as well as a 10 -
70% increase in major page faults per minute. Along these lines, we also
saw an erratic but often significant decrease in CPU utilization.
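As an aside, the per-task voluntary/involuntary context switch counters
cited here can be sampled from procfs on Linux. A minimal sketch (my own
illustration, not the tooling used for these measurements):

```python
def ctxt_switches(pid="self"):
    """Read the voluntary/nonvoluntary context switch counters for a task
    from /proc/<pid>/status (Linux-only)."""
    counts = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("voluntary_ctxt_switches",
                                "nonvoluntary_ctxt_switches")):
                key, val = line.split(":")
                counts[key] = int(val)
    return counts

# e.g. {'voluntary_ctxt_switches': 3, 'nonvoluntary_ctxt_switches': 1}
print(ctxt_switches())
```

Sampling these counters twice over an interval and differencing gives the
per-minute involuntary switch rates discussed in this thread.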
It's hard to say exactly what kinds of issues such faults / involuntary
context switches could introduce, but it does seem clear that in general,
less time is being spent doing useful work, and more time is spent
thrashing on resources between tasks.

2. Front-end CPU pipeline bottlenecks

Like many (any?) other JIT engines / compilers, HHVM tends to be heavily
front-end bound in the CPU pipeline, and to have very poor IPC
(Instructions Per Cycle). For HHVM, this is due to high branch resteers,
poor icache / iTLB locality, and poor uop caching / decoding (many uops are
being serviced through the MITE instead of the DSB). While using BOLT [1]
to improve the layout of the HHVM binary does help to minimize these costs,
they're still the main bottleneck for the application.

[1]: https://github.com/llvm/llvm-project/blob/main/bolt/docs/OptimizingClang.md

An implication of this is that any time a task runs on a CPU after one of
these web worker tasks, it is essentially guaranteed to have poor front-end
locality, and its IPC will similarly suffer. In other words, more context
switches generally means fewer instructions being run across the whole
system. When profiling vanilla CFS vs. vanilla EEVDF (that is, with no
changes to latency nice for any task), we found that EEVDF resulted in a
1 - 2% drop in IPC across the whole system.

Note that simply partitioning the system by cpuset won't necessarily work
either, as CPU utilization will drop even further, and we want to keep the
system as busy as possible. There's another internal patch set we have
(which we're planning to try and upstream soon) where waking tasks are
placed in a global shared "wakequeue", which is then always pulled from in
newidle_balance(). The discrepancy in performance between CFS and EEVDF is
even worse in this case, with throughput dropping by 2 - 4%, p95 tail
latencies increasing by 3 - 5%, and p99 tail latencies increasing by
6 - 11%.

3. Low latency + long slice are not mutually exclusive for us

An interesting quality of web workloads running JIT engines is that they
require both low latency and long slices on the CPU. The reason we need the
tasks to be low latency is that they're on the critical path for servicing
web requests (for most of their runtime, at least), and the reasons we need
them to have long slices are enumerated above -- they thrash the icache /
DSB / iTLB, more aggressive context switching causes us to thrash on paging
from disk, and in general, these tasks are on the critical path for
servicing web requests and we want to encourage them to run to completion.

This causes EEVDF to perform poorly for workloads with these
characteristics. If we decrease latency nice for our web workers then
they'll have lower latency, but only because their slices are smaller. This
in turn causes the increase in context switches, which causes the thrashing
described above.

Worth noting -- I did try to increase the default base slice length by
setting sysctl_sched_base_slice to 35ms, and these were the results:

With EEVDF slice 35ms and latency_nice 0
----------------------------------------
- .5 - 2.25% drop in throughput
- 2.5 - 4.5% increase in p95 latencies
- 2.5 - 5.25% increase in p99 latencies
- Context switch per minute increase: 9.5 - 12.4%
- Involuntary context switch increase: ~320 - 330%
- Major fault delta: -3.6% to 37.6%
- IPC decrease .5 - .9%

With EEVDF slice 35ms and latency_nice -8 for web workers
---------------------------------------------------------
- .5 - 2.5% drop in throughput
- 1.7 - 4.75% increase in p95 latencies
- 2.5 - 5% increase in p99 latencies
- Context switch per minute increase: 10.5 - 15%
- Involuntary context switch increase: ~327 - 350%
- Major fault delta: -1% to 45%
- IPC decrease .4 - 1.1%

I was expecting the increase in context switches and involuntary context
switches to be lower than what they ended up being with the increased
default slice length. Regardless, it still seems to tell a relatively
consistent story with the numbers from above. The improvement in IPC is
expected, though also less improved than I was anticipating (presumably due
to the still-high context switch rate). There were also fewer major faults
per minute compared to runs with a shorter default slice.

Note that even if increasing the slice length did cause fewer context
switches and major faults, I still expect that it would hurt throughput and
latency for HHVM given that when latency-nicer tasks are eventually given
the CPU, the web workers will have to wait around for longer than we'd like
for those tasks to burn through their longer slices.

In summary, I must admit that this patch set makes me a bit nervous.
Speaking for Meta at least, the patch set in its current form exceeds the
performance regressions (generally < .5% at the very most) that we're able
to tolerate in production. More broadly, it will certainly cause us to have
to carefully consider how it affects our model for server capacity.

Thanks,
David
On Sun, Apr 09, 2023 at 10:13:50PM -0500, David Vernet wrote:
> On Tue, Mar 28, 2023 at 11:26:22AM +0200, Peter Zijlstra wrote:
> > Hi!
> >
> > Latest version of the EEVDF [1] patches.
> >
> > Many changes since last time; most notably it now fully replaces CFS and uses
> > lag based placement for migrations. Smaller changes include:
> >
> >  - uses scale_load_down() for avg_vruntime; I measured the max delta to be ~44
> >    bits on a system/cgroup based kernel build.
> >  - fixed a bunch of reweight / cgroup placement issues
> >  - adaptive placement strategy for smaller slices
> >  - rename se->lag to se->vlag
> >
> > There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the
> > PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress
> > because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33%
> > split (stress-futex, stress-nanosleep, starve, etc..) instead of a 50%/50%
> > split that sleeper bonus achieves. Mostly I think these benchmarks are somewhat
> > artificial/daft but who knows.
> >
> > The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice
> > because it places things too far to the left in the tree. Basically it messes
> > with the whole 'when', by placing a task back in history you're putting a
> > burden on the now to accommodate catching up. More tinkering required.
> >
> > But over-all the thing seems to be fairly usable and could do with more
> > extensive testing.
>
> Hi Peter,
>
> I used the EEVDF scheduler to run workloads on one of Meta's largest
> services (our main HHVM web server), and I wanted to share my
> observations with you.
>
> The TL;DR is that, unfortunately, it appears as though EEVDF regresses
> these workloads rather substantially. Running with "vanilla" EEVDF (i.e.
> on this patch set up to [0], with no changes to latency nice for any
> task) compared to vanilla CFS results in the following outcomes for our
> major KPIs for servicing web requests:
>
> - .5 - 2.5% drop in throughput
> - 1 - 5% increase in p95 latencies
> - 1.75 - 6% increase in p99 latencies
> - .5 - 4% drop in 50th percentile throughput
>
> [0]: https://lore.kernel.org/lkml/20230328110354.562078801@infradead.org/
>
> Decreasing latency nice for our critical web workers unfortunately did
> not help either. For example, here are the numbers for a latency nice
> value of -10:
>
> - .8 - 2.5% drop in throughput
> - 2 - 4% increase in p95 latencies
> - 1 - 4.5% increase in p99 latencies
> - 0 - 4.5% increase in 50th percentile throughput
>
> Other latency nice values resulted in similar metrics. Some metrics may
> get slightly better, and others slightly worse, but the end result was
> always a relatively significant regression from vanilla CFS. Throughout
> the rest of this write-up, the remaining figures quoted will be from
> vanilla EEVDF runs (modulo some numbers towards the end of this write-up
> which describe the outcome of increasing the default base slice with
> sysctl_sched_base_slice).
>
> With that out of the way, let me outline some of the reasons for these
> regressions:
>
> 1. Improved runqueue delays, but costly faults and involuntary context
>    switches
>
> EEVDF substantially increased the number of context switches on the
> system, by 15 - 35%. On its own, this doesn't necessarily imply a
> problem. For example, we observed that EEVDF resulted in a 20 - 40%
> reduction in the time that tasks spent waiting on the runqueue before
> being placed on a CPU.
>
> There were, however, other metrics which were less encouraging. We
> observed a 400 - 550% increase in involuntary context switches (which
> are also presumably a reason for the improved runqueue delays), as well
> as a 10 - 70% increase in major page faults per minute. Along these
> lines, we also saw an erratic but often significant decrease in CPU
> utilization.
>
> It's hard to say exactly what kinds of issues such faults / involuntary
> context switches could introduce, but it does seem clear that in
> general, less time is being spent doing useful work, and more time is
> spent thrashing on resources between tasks.
>
> 2. Front-end CPU pipeline bottlenecks
>
> Like many (any?) other JIT engines / compilers, HHVM tends to be heavily
> front-end bound in the CPU pipeline, and to have very poor IPC
> (Instructions Per Cycle). For HHVM, this is due to high branch resteers,
> poor icache / iTLB locality, and poor uop caching / decoding (many uops
> are being serviced through the MITE instead of the DSB). While using
> BOLT [1] to improve the layout of the HHVM binary does help to minimize
> these costs, they're still the main bottleneck for the application.
>
> [1]: https://github.com/llvm/llvm-project/blob/main/bolt/docs/OptimizingClang.md
>
> An implication of this is that any time a task runs on a CPU after one
> of these web worker tasks, it is essentially guaranteed to have poor
> front-end locality, and its IPC will similarly suffer. In other words,
> more context switches generally means fewer instructions being run
> across the whole system. When profiling vanilla CFS vs. vanilla EEVDF
> (that is, with no changes to latency nice for any task), we found that
> EEVDF resulted in a 1 - 2% drop in IPC across the whole system.
>
> Note that simply partitioning the system by cpuset won't necessarily
> work either, as CPU utilization will drop even further, and we want to
> keep the system as busy as possible. There's another internal patch set
> we have (which we're planning to try and upstream soon) where waking
> tasks are placed in a global shared "wakequeue", which is then always
> pulled from in newidle_balance(). The discrepancy in performance between
> CFS and EEVDF is even worse in this case, with throughput dropping by
> 2 - 4%, p95 tail latencies increasing by 3 - 5%, and p99 tail latencies
> increasing by 6 - 11%.
>
> 3. Low latency + long slice are not mutually exclusive for us
>
> An interesting quality of web workloads running JIT engines is that they
> require both low latency and long slices on the CPU. The reason we need
> the tasks to be low latency is that they're on the critical path for
> servicing web requests (for most of their runtime, at least), and the
> reasons we need them to have long slices are enumerated above -- they
> thrash the icache / DSB / iTLB, more aggressive context switching causes
> us to thrash on paging from disk, and in general, these tasks are on the
> critical path for servicing web requests and we want to encourage them
> to run to completion.
>
> This causes EEVDF to perform poorly for workloads with these
> characteristics. If we decrease latency nice for our web workers then
> they'll have lower latency, but only because their slices are smaller.
> This in turn causes the increase in context switches, which causes the
> thrashing described above.
>
> Worth noting -- I did try to increase the default base slice length by
> setting sysctl_sched_base_slice to 35ms, and these were the results:
>
> With EEVDF slice 35ms and latency_nice 0
> ----------------------------------------
> - .5 - 2.25% drop in throughput
> - 2.5 - 4.5% increase in p95 latencies
> - 2.5 - 5.25% increase in p99 latencies
> - Context switch per minute increase: 9.5 - 12.4%
> - Involuntary context switch increase: ~320 - 330%
> - Major fault delta: -3.6% to 37.6%
> - IPC decrease .5 - .9%
>
> With EEVDF slice 35ms and latency_nice -8 for web workers
> ---------------------------------------------------------
> - .5 - 2.5% drop in throughput
> - 1.7 - 4.75% increase in p95 latencies
> - 2.5 - 5% increase in p99 latencies
> - Context switch per minute increase: 10.5 - 15%
> - Involuntary context switch increase: ~327 - 350%
> - Major fault delta: -1% to 45%
> - IPC decrease .4 - 1.1%
>
> I was expecting the increase in context switches and involuntary context
> switches to be lower than what they ended up being with the increased
> default slice length. Regardless, it still seems to tell a relatively
> consistent story with the numbers from above. The improvement in IPC is
> expected, though also less improved than I was anticipating (presumably
> due to the still-high context switch rate). There were also fewer major
> faults per minute compared to runs with a shorter default slice.

Ah, these numbers with a larger slice are inaccurate. I was being careless
and accidentally changed the wrong sysctl (sysctl_sched_cfs_bandwidth_slice
rather than sysctl_sched_base_slice) yesterday when testing how a longer
slice affects performance.

Increasing sysctl_sched_base_slice to 30ms actually does improve things a
bit, but we're still losing to CFS (note that higher is better for
throughput, and lower is better for p95 and p99 latency):

lat_nice | throughput     | p95 latency  | p99 latency   | total swtch    | invol swtch | mjr faults | IPC
-----------------------------------------------------------------------------------------------------------------------
       0 | -1.2% to 0%    | 1 to 2.5%    | 1.8 to 3.75%  | 0%             | 200 to 210% | -3% to 14% | -0.4% to -0.25%
-----------------------------------------------------------------------------------------------------------------------
      -8 | -1.8% to -1.3% | 1 to 2.3%    | .75% to 2.5%  | -1.9% to -1.3% | 185 to 193% | -3% to 20% | -0.6% to -0.25%
-----------------------------------------------------------------------------------------------------------------------
     -12 | -.9% to 0.25%  | -0.1 to 2.4% | -0.4% to 2.8% | -2% to -1.5%   | 180 to 185% | -5% to 30% | -0.8% to -0.25%
-----------------------------------------------------------------------------------------------------------------------
     -19 | -1.3% to 0%    | 0.3 to 3.5%  | -1% to 2.1%   | -2% to -1.4%   | 175 to 190% | 4% to 27%  | -0.6% to -0.27%
-----------------------------------------------------------------------------------------------------------------------

I'm sure experimenting with various slice lengths, etc. would yield
slightly different results, but the common theme seems to be that EEVDF
causes more involuntary context switches and more major faults, which
regress throughput and latency for our web workloads.

> Note that even if increasing the slice length did cause fewer context
> switches and major faults, I still expect that it would hurt throughput
> and latency for HHVM given that when latency-nicer tasks are eventually
> given the CPU, the web workers will have to wait around for longer than
> we'd like for those tasks to burn through their longer slices.
>
> In summary, I must admit that this patch set makes me a bit nervous.
> Speaking for Meta at least, the patch set in its current form exceeds
> the performance regressions (generally < .5% at the very most) that
> we're able to tolerate in production. More broadly, it will certainly
> cause us to have to carefully consider how it affects our model for
> server capacity.
>
> Thanks,
> David
On Mon, 2023-04-10 at 16:23 +0800, Hillf Danton wrote:
>
> In order to only narrow down the poor performance reported, make a tradeoff
> between runtime and latency simply by restoring sysctl_sched_min_granularity
> at tick preempt, given the known order on the runqueue.

Tick preemption isn't the primary contributor to the scheduling delta, it's
wakeup preemption. If you look at the perf summaries of 5 minute recordings
on my little 8 rq box below, you'll see that the delta is more than twice
what a 250Hz tick could inflict. You could also just turn off
WAKEUP_PREEMPTION and watch the delta instantly peg negative.

Anyway...

Given we know preemption is markedly up, and as always a source of pain (as
well as gain), perhaps we can try to tamp it down a little without
inserting old constraints into the shiny new scheduler. The dirt simple
tweak below puts a dent in the sting by merely sticking with whatever
decision EEVDF last made until it itself invalidates that decision. It
still selects via the same math, just does so the tiniest bit less
frenetically.

---
 kernel/sched/fair.c     | 3 +++
 kernel/sched/features.h | 6 ++++++
 2 files changed, 9 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -950,6 +950,9 @@ static struct sched_entity *pick_eevdf(s
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
+	if (sched_feat(GENTLE_EEVDF) && curr)
+		return curr;
+
 	while (node) {
 		struct sched_entity *se = __node_2_se(node);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -14,6 +14,12 @@ SCHED_FEAT(MINIMAL_VA, false)
 SCHED_FEAT(VALIDATE_QUEUE, false)
 
 /*
+ * Don't be quite so damn twitchy, once you select a champion let the
+ * poor bastard carry the baton until no longer eligible to do so.
+ */
+SCHED_FEAT(GENTLE_EEVDF, true)
+
+/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
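For anyone wanting to reproduce the comparison: scheduler feature bits like
WAKEUP_PREEMPTION (and GENTLE_EEVDF once the patch above is applied) can be
toggled at runtime through debugfs. This is a root-only setup fragment and
assumes a kernel built with CONFIG_SCHED_DEBUG and debugfs mounted at the
usual location; on older kernels the file is /sys/kernel/debug/sched_features
rather than sched/features.

```shell
# List the current scheduler feature flags:
cat /sys/kernel/debug/sched/features

# Turn off wakeup preemption (prefix a feature with NO_ to disable it):
echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features

# Toggle the GENTLE_EEVDF feature added by the patch above:
echo NO_GENTLE_EEVDF > /sys/kernel/debug/sched/features   # disable
echo GENTLE_EEVDF    > /sys/kernel/debug/sched/features   # re-enable
```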
perf.data.cfs ---------------------------------------------------------------------------------------------------------- Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms | ---------------------------------------------------------------------------------------------------------- massive_intr:(8) |1665786.092 ms | 529819 | avg: 1.046 ms | max: 33.639 ms | sum:554226.960 ms | dav1d-worker:(8) | 187982.593 ms | 448022 | avg: 0.881 ms | max: 35.806 ms | sum:394546.442 ms | X:2503 | 102533.714 ms | 89729 | avg: 0.071 ms | max: 9.448 ms | sum: 6372.383 ms | VizCompositorTh:5235 | 38717.241 ms | 76743 | avg: 0.632 ms | max: 24.308 ms | sum:48502.097 ms | llvmpipe-0:(2) | 32520.412 ms | 42390 | avg: 1.041 ms | max: 19.804 ms | sum:44116.653 ms | llvmpipe-1:(2) | 32374.548 ms | 35557 | avg: 1.247 ms | max: 17.439 ms | sum:44347.573 ms | llvmpipe-2:(2) | 31579.168 ms | 34292 | avg: 1.312 ms | max: 16.775 ms | sum:45005.225 ms | llvmpipe-3:(2) | 30478.664 ms | 33659 | avg: 1.375 ms | max: 16.863 ms | sum:46268.417 ms | llvmpipe-7:(2) | 29778.002 ms | 30684 | avg: 1.543 ms | max: 17.384 ms | sum:47338.420 ms | llvmpipe-4:(2) | 29741.774 ms | 32832 | avg: 1.433 ms | max: 18.571 ms | sum:47062.280 ms | llvmpipe-5:(2) | 29462.794 ms | 32641 | avg: 1.455 ms | max: 19.802 ms | sum:47497.195 ms | llvmpipe-6:(2) | 28367.114 ms | 32132 | avg: 1.514 ms | max: 16.562 ms | sum:48646.738 ms | ThreadPoolForeg:(16) | 22238.667 ms | 66355 | avg: 0.353 ms | max: 46.477 ms | sum:23409.474 ms | VideoFrameCompo:5243 | 17071.755 ms | 75223 | avg: 0.288 ms | max: 33.358 ms | sum:21650.918 ms | chrome:(8) | 6478.351 ms | 47110 | avg: 0.486 ms | max: 28.018 ms | sum:22910.980 ms | ---------------------------------------------------------------------------------------------------------- TOTAL: |2317066.420 ms | 2221889 | | 46.477 ms | 1629736.515 ms | ---------------------------------------------------------------------------------------------------------- perf.data.eevdf 
 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms      |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(8)      |1673379.930 ms |   743590 | avg:   0.745 ms | max:  28.003 ms | sum:554041.093 ms |
  dav1d-worker:(8)      | 197647.514 ms |  1139053 | avg:   0.434 ms | max:  22.357 ms | sum:494377.980 ms |
  X:2495                | 100741.946 ms |   114808 | avg:   0.191 ms | max:   8.583 ms | sum: 21945.360 ms |
  VizCompositorTh:6571  |  37705.863 ms |    74900 | avg:   0.479 ms | max:  16.464 ms | sum: 35843.010 ms |
  llvmpipe-6:(2)        |  30757.126 ms |    38941 | avg:   1.448 ms | max:  18.529 ms | sum: 56371.507 ms |
  llvmpipe-3:(2)        |  30658.127 ms |    40296 | avg:   1.405 ms | max:  24.791 ms | sum: 56601.212 ms |
  llvmpipe-4:(2)        |  30456.388 ms |    40011 | avg:   1.419 ms | max:  23.840 ms | sum: 56793.272 ms |
  llvmpipe-2:(2)        |  30395.971 ms |    40828 | avg:   1.394 ms | max:  19.195 ms | sum: 56897.961 ms |
  llvmpipe-5:(2)        |  30346.432 ms |    39393 | avg:   1.445 ms | max:  21.747 ms | sum: 56917.495 ms |
  llvmpipe-1:(2)        |  30275.694 ms |    41349 | avg:   1.378 ms | max:  20.765 ms | sum: 56989.923 ms |
  llvmpipe-7:(2)        |  29768.515 ms |    37626 | avg:   1.532 ms | max:  20.649 ms | sum: 57639.337 ms |
  llvmpipe-0:(2)        |  28931.905 ms |    42568 | avg:   1.378 ms | max:  20.942 ms | sum: 58667.379 ms |
  ThreadPoolForeg:(60)  |  22598.216 ms |   131514 | avg:   0.342 ms | max:  36.105 ms | sum: 44927.149 ms |
  VideoFrameCompo:6587  |  16966.649 ms |    90751 | avg:   0.357 ms | max:  18.199 ms | sum: 32379.045 ms |
  chrome:(25)           |   8862.695 ms |    75923 | avg:   0.308 ms | max:  30.821 ms | sum: 23347.992 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2331946.838 ms |  3471615 |                 |       36.105 ms |    1808071.407 ms |
 ----------------------------------------------------------------------------------------------------------

perf.data.eevdf+tweak
 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms      |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(8)      |1687121.317 ms |   695518 | avg:   0.760 ms | max:  24.098 ms | sum:528302.626 ms |
  dav1d-worker:(8)      | 183514.008 ms |   922884 | avg:   0.489 ms | max:  32.093 ms | sum:451319.787 ms |
  X:2489                |  99164.486 ms |   101585 | avg:   0.239 ms | max:   8.896 ms | sum: 24295.253 ms |
  VizCompositorTh:17881 |  37911.007 ms |    71122 | avg:   0.499 ms | max:  16.743 ms | sum: 35460.994 ms |
  llvmpipe-1:(2)        |  29946.625 ms |    40320 | avg:   1.394 ms | max:  23.036 ms | sum: 56222.367 ms |
  llvmpipe-2:(2)        |  29910.414 ms |    39677 | avg:   1.412 ms | max:  24.187 ms | sum: 56011.791 ms |
  llvmpipe-6:(2)        |  29742.389 ms |    37822 | avg:   1.484 ms | max:  18.228 ms | sum: 56109.947 ms |
  llvmpipe-3:(2)        |  29644.994 ms |    39155 | avg:   1.435 ms | max:  21.191 ms | sum: 56202.636 ms |
  llvmpipe-5:(2)        |  29520.006 ms |    38037 | avg:   1.482 ms | max:  21.698 ms | sum: 56373.679 ms |
  llvmpipe-4:(2)        |  29460.485 ms |    38562 | avg:   1.462 ms | max:  26.308 ms | sum: 56389.022 ms |
  llvmpipe-7:(2)        |  29449.959 ms |    36308 | avg:   1.557 ms | max:  21.617 ms | sum: 56547.129 ms |
  llvmpipe-0:(2)        |  29041.903 ms |    41207 | avg:   1.389 ms | max:  26.322 ms | sum: 57239.666 ms |
  ThreadPoolForeg:(16)  |  22490.094 ms |   112591 | avg:   0.377 ms | max:  27.027 ms | sum: 42414.618 ms |
  VideoFrameCompo:17888 |  17385.895 ms |    86651 | avg:   0.367 ms | max:  19.350 ms | sum: 31767.043 ms |
  chrome:(8)            |   6826.127 ms |    61487 | avg:   0.306 ms | max:  20.000 ms | sum: 18835.879 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2326181.115 ms |  3081183 |                 |       32.093 ms |    1737425.434 ms |
 ----------------------------------------------------------------------------------------------------------
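[Editor's note] The "~400k ctx switches" figure quoted in the follow-up discussion can be re-derived from the TOTAL rows of the three recordings above; a quick check (numbers copied verbatim from the tables):

```python
# TOTAL-row figures from the three 5-minute perf sched recordings above.
recordings = {
    "cfs":         {"switches": 2221889, "sum_delay_ms": 1629736.515},
    "eevdf":       {"switches": 3471615, "sum_delay_ms": 1808071.407},
    "eevdf+tweak": {"switches": 3081183, "sum_delay_ms": 1737425.434},
}

# Context switches eliminated by the GENTLE_EEVDF tweak vs stock EEVDF.
saved = recordings["eevdf"]["switches"] - recordings["eevdf+tweak"]["switches"]

# Sum-delay reduction over the same interval.
delay_cut_ms = (recordings["eevdf"]["sum_delay_ms"]
                - recordings["eevdf+tweak"]["sum_delay_ms"])
```

`saved` comes out to 390432 switches (the "~400k" in the reply below rounds this), alongside roughly 70 seconds less accumulated delay, while EEVDF still switches ~860k more than CFS did.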
On Tue, 2023-04-11 at 21:33 +0800, Hillf Danton wrote:
> On Tue, 11 Apr 2023 12:15:41 +0200 Mike Galbraith <efault@gmx.de>
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -950,6 +950,9 @@ static struct sched_entity *pick_eevdf(s
> >  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> >  		curr = NULL;
> >
> > +	if (sched_feat(GENTLE_EEVDF) && curr)
> > +		return curr;
> > +
>
> This is rather aggressive, given latency-10 curr and latency-0 candidate
> at tick hit for instance.

The numbers seem to indicate that the ~400k ctx switches eliminated were
meaningless to the load being measured.  I recorded everything for 5
minutes, and the recording wide max actually went down.. but one-off
hits happen regularly in a noisy GUI regardless of scheduler, and are
difficult to assign meaning to.

Now I'm not saying there is no cost; if you change anything that's
converted to instructions, there is a price tag somewhere, whether you
notice immediately or not.  Nor am I saying that patchlet is golden.  I
am saying that some of the ctx switch delta looks very much like useless
overhead that can and should be made to go away.  From my POV, patchlet
actually looks kinda viable, but to Peter and the regression reporter,
it and the associated data are presented as a datapoint.

> And along your direction a mild change is
> postpone the preempt wakeup to the next tick.
>
> +++ b/kernel/sched/fair.c
> @@ -7932,8 +7932,6 @@ static void check_preempt_wakeup(struct
>  		return;
>
>  	cfs_rq = cfs_rq_of(se);
> -	update_curr(cfs_rq);
> -
>  	/*
>  	 * XXX pick_eevdf(cfs_rq) != se ?
>  	 */

Mmmm, stopping time is a bad idea methinks.

	-Mike
On Wed, 2023-04-12 at 10:50 +0800, Hillf Danton wrote:
> On Tue, 11 Apr 2023 16:56:24 +0200 Mike Galbraith <efault@gmx.de>
> >
> The data from you and David (lat_nice: -12 throughput: -.9% to 0.25%) is
> supporting eevdf, given an optimization <5% could be safely ignored in
> general (while 10% good and 20% standing ovation).

There's nothing pro or con here; David's testing seems to agree with my
own testing that a bit of adjustment may be necessary, and that's it.
Cold hard numbers to developer, completely optional mitigation tweak to
fellow tester.. and we're done.

	-Mike
Hi Peter,

On Tue, Mar 28, 2023 at 11:26:22AM +0200 Peter Zijlstra wrote:
> Hi!
>
> Latest version of the EEVDF [1] patches.
>
> Many changes since last time; most notably it now fully replaces CFS and uses
> lag based placement for migrations. Smaller changes include:
>
>  - uses scale_load_down() for avg_vruntime; I measured the max delta to be ~44
>    bits on a system/cgroup based kernel build.
>  - fixed a bunch of reweight / cgroup placement issues
>  - adaptive placement strategy for smaller slices
>  - rename se->lag to se->vlag
>
> There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the
> PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress
> because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33%
> split (stress-futex, stress-nanosleep, starve, etc..) instead of the 50%/50%
> split that sleeper bonus achieves. Mostly I think these benchmarks are somewhat
> artificial/daft but who knows.
>
> The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice
> because it places things too far to the left in the tree. Basically it messes
> with the whole 'when', by placing a task back in history you're putting a
> burden on the now to accommodate catching up. More tinkering required.
>
> But over-all the thing seems to be fairly usable and could do with more
> extensive testing.

I had Jirka run his suite of perf workloads on this. These are macro
benchmarks on baremetal (NAS, SPECjbb etc). I can't share specific
results because they come out in nice html reports on an internal
website.

There was no noticeable performance change, which is a good thing.
Overall performance was comparable to CFS.  There was a win in
stability, though: a number of the error boxes across the board were
smaller, so less variance.  These are mostly performance/throughput
tests. We're going to run some more latency sensitive tests now.

So, fwiw, EEVDF is performing well on macro workloads here.
Cheers,
Phil

> [1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
>
> Results:
>
>   hackbench -g $nr_cpu + cyclictest --policy other results:
>
>                         EEVDF                    CFS
>
>              # Min Latencies: 00054
>  LNICE(19)   # Avg Latencies: 00660
>              # Max Latencies: 23103
>
>              # Min Latencies: 00052          00053
>  LNICE(0)    # Avg Latencies: 00318          00687
>              # Max Latencies: 08593          13913
>
>              # Min Latencies: 00054
>  LNICE(-19)  # Avg Latencies: 00055
>              # Max Latencies: 00061
>
> Some preliminary results from Chen Yu on a slightly older version:
>
>   schbench  (95% tail latency, lower is better)
>   =================================================================================
>   case                    nr_instance       baseline (std%)    compare%  ( std%)
>   normal                   25%               1.00  (2.49%)      -81.2%   (4.27%)
>   normal                   50%               1.00  (2.47%)      -84.5%   (0.47%)
>   normal                   75%               1.00  (2.5%)       -81.3%   (1.27%)
>   normal                  100%               1.00  (3.14%)      -79.2%   (0.72%)
>   normal                  125%               1.00  (3.07%)      -77.5%   (0.85%)
>   normal                  150%               1.00  (3.35%)      -76.4%   (0.10%)
>   normal                  175%               1.00  (3.06%)      -76.2%   (0.56%)
>   normal                  200%               1.00  (3.11%)      -76.3%   (0.39%)
>   ==================================================================================
>
>   hackbench (throughput, higher is better)
>   ==============================================================================
>   case                    nr_instance       baseline (std%)    compare%  ( std%)
>   threads-pipe             25%               1.00  (<2%)        -17.5    (<2%)
>   threads-socket           25%               1.00  (<2%)         -1.9    (<2%)
>   threads-pipe             50%               1.00  (<2%)         +6.7    (<2%)
>   threads-socket           50%               1.00  (<2%)         -6.3    (<2%)
>   threads-pipe            100%               1.00  (3%)        +110.1    (3%)
>   threads-socket          100%               1.00  (<2%)        -40.2    (<2%)
>   threads-pipe            150%               1.00  (<2%)       +125.4    (<2%)
>   threads-socket          150%               1.00  (<2%)        -24.7    (<2%)
>   threads-pipe            200%               1.00  (<2%)        -89.5    (<2%)
>   threads-socket          200%               1.00  (<2%)        -27.4    (<2%)
>   process-pipe             25%               1.00  (<2%)        -15.0    (<2%)
>   process-socket           25%               1.00  (<2%)         -3.9    (<2%)
>   process-pipe             50%               1.00  (<2%)         -0.4    (<2%)
>   process-socket           50%               1.00  (<2%)         -5.3    (<2%)
>   process-pipe            100%               1.00  (<2%)        +62.0    (<2%)
>   process-socket          100%               1.00  (<2%)        -39.5    (<2%)
>   process-pipe            150%               1.00  (<2%)        +70.0    (<2%)
>   process-socket          150%               1.00  (<2%)        -20.3    (<2%)
>   process-pipe            200%               1.00  (<2%)        +79.2    (<2%)
>   process-socket          200%               1.00  (<2%)        -22.4    (<2%)
>   ==============================================================================
>
>   stress-ng (throughput, higher is better)
>   ==============================================================================
>   case                    nr_instance       baseline (std%)    compare%  ( std%)
>   switch                   25%               1.00  (<2%)         -6.5    (<2%)
>   switch                   50%               1.00  (<2%)         -9.2    (<2%)
>   switch                   75%               1.00  (<2%)         -1.2    (<2%)
>   switch                  100%               1.00  (<2%)        +11.1    (<2%)
>   switch                  125%               1.00  (<2%)        -16.7%   (9%)
>   switch                  150%               1.00  (<2%)        -13.6    (<2%)
>   switch                  175%               1.00  (<2%)        -16.2    (<2%)
>   switch                  200%               1.00  (<2%)        -19.4%   (<2%)
>   fork                     50%               1.00  (<2%)         -0.1    (<2%)
>   fork                     75%               1.00  (<2%)         -0.3    (<2%)
>   fork                    100%               1.00  (<2%)         -0.1    (<2%)
>   fork                    125%               1.00  (<2%)         -6.9    (<2%)
>   fork                    150%               1.00  (<2%)         -8.8    (<2%)
>   fork                    200%               1.00  (<2%)         -3.3    (<2%)
>   futex                    25%               1.00  (<2%)         -3.2    (<2%)
>   futex                    50%               1.00  (3%)         -19.9    (5%)
>   futex                    75%               1.00  (6%)         -19.1    (2%)
>   futex                   100%               1.00  (16%)        -30.5    (10%)
>   futex                   125%               1.00  (25%)        -39.3    (11%)
>   futex                   150%               1.00  (20%)        -27.2%   (17%)
>   futex                   175%               1.00  (<2%)        -18.6    (<2%)
>   futex                   200%               1.00  (<2%)        -47.5    (<2%)
>   nanosleep                25%               1.00  (<2%)         -0.1    (<2%)
>   nanosleep                50%               1.00  (<2%)         -0.0%   (<2%)
>   nanosleep                75%               1.00  (<2%)        +15.2%   (<2%)
>   nanosleep               100%               1.00  (<2%)        -26.4    (<2%)
>   nanosleep               125%               1.00  (<2%)         -1.3    (<2%)
>   nanosleep               150%               1.00  (<2%)         +2.1    (<2%)
>   nanosleep               175%               1.00  (<2%)         +8.3    (<2%)
>   nanosleep               200%               1.00  (<2%)         +2.0%   (<2%)
>   ===============================================================================
>
>   unixbench (throughput, higher is better)
>   ==============================================================================
>   case                    nr_instance       baseline (std%)    compare%  ( std%)
>   spawn                   125%               1.00  (<2%)         +8.1    (<2%)
>   context1                100%               1.00  (6%)         +17.4    (6%)
>   context1                 75%               1.00  (13%)        +18.8    (8%)
>   =================================================================================
>
>   netperf  (throughput, higher is better)
>   ===========================================================================
>   case                    nr_instance       baseline (std%)    compare%  ( std%)
>   UDP_RR                   25%               1.00  (<2%)         -1.5%   (<2%)
>   UDP_RR                   50%               1.00  (<2%)         -0.3%   (<2%)
>   UDP_RR                   75%               1.00  (<2%)        +12.5%   (<2%)
>   UDP_RR                  100%               1.00  (<2%)         -4.3%   (<2%)
>   UDP_RR                  125%               1.00  (<2%)         -4.9%   (<2%)
>   UDP_RR                  150%               1.00  (<2%)         -4.7%   (<2%)
>   UDP_RR                  175%               1.00  (<2%)         -6.1%   (<2%)
>   UDP_RR                  200%               1.00  (<2%)         -6.6%   (<2%)
>   TCP_RR                   25%               1.00  (<2%)         -1.4%   (<2%)
>   TCP_RR                   50%               1.00  (<2%)         -0.2%   (<2%)
>   TCP_RR                   75%               1.00  (<2%)         -3.9%   (<2%)
>   TCP_RR                  100%               1.00  (2%)          +3.6%   (5%)
>   TCP_RR                  125%               1.00  (<2%)         -4.2%   (<2%)
>   TCP_RR                  150%               1.00  (<2%)         -6.0%   (<2%)
>   TCP_RR                  175%               1.00  (<2%)         -7.4%   (<2%)
>   TCP_RR                  200%               1.00  (<2%)         -8.4%   (<2%)
>   ==========================================================================
>
>
> ---
> Also available at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/eevdf
>
> ---
> Parth Shah (1):
>       sched: Introduce latency-nice as a per-task attribute
>
> Peter Zijlstra (14):
>       sched/fair: Add avg_vruntime
>       sched/fair: Remove START_DEBIT
>       sched/fair: Add lag based placement
>       rbtree: Add rb_add_augmented_cached() helper
>       sched/fair: Implement an EEVDF like policy
>       sched: Commit to lag based placement
>       sched/smp: Use lag to simplify cross-runqueue placement
>       sched: Commit to EEVDF
>       sched/debug: Rename min_granularity to base_slice
>       sched: Merge latency_offset into slice
>       sched/eevdf: Better handle mixed slice length
>       sched/eevdf: Sleeper bonus
>       sched/eevdf: Minimal vavg option
>       sched/eevdf: Debug / validation crud
>
> Vincent Guittot (2):
>       sched/fair: Add latency_offset
>       sched/fair: Add sched group latency support
>
>  Documentation/admin-guide/cgroup-v2.rst |   10 +
>  include/linux/rbtree_augmented.h        |   26 +
>  include/linux/sched.h                   |    6 +
>  include/uapi/linux/sched.h              |    4 +-
>  include/uapi/linux/sched/types.h        |   19 +
>  init/init_task.c                        |    3 +-
>  kernel/sched/core.c                     |   65 +-
>  kernel/sched/debug.c                    |   49 +-
>  kernel/sched/fair.c                     | 1199 ++++++++++++++++---------------
>  kernel/sched/features.h                 |   29 +-
>  kernel/sched/sched.h                    |   23 +-
>  tools/include/uapi/linux/sched.h        |    4 +-
>  12 files changed, 794 insertions(+), 643 deletions(-)
>
> --
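[Editor's note] The "100% parent vs a 50% child a 67%/33% split" in the cover letter above is just weight-proportional sharing; a minimal sketch of the arithmetic, with the two entities mapped to weights 1.0 and 0.5 (illustrative values, not kernel load weights):

```python
# A fair scheduler gives each runnable entity
#   cpu_share = weight / sum(weights)
def shares(weights):
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# "100%" parent vs "50%" child from the cover letter.
s = shares({"parent": 1.0, "child": 0.5})
# parent -> 2/3 (~67%), child -> 1/3 (~33%): the split EEVDF produces,
# versus the 50%/50% the old sleeper bonus handed out.
```

This is why a strictly fair policy "regresses" benchmarks that were tuned against the sleeper-bonus behavior: the 67%/33% outcome is the fair one.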