[v6,0/4] sched/fair: Improve scan efficiency of SIS

Message ID 20221019122859.18399-1-wuyun.abel@bytedance.com
Series: sched/fair: Improve scan efficiency of SIS

Message

Abel Wu Oct. 19, 2022, 12:28 p.m. UTC
  This patchset tries to improve SIS scan efficiency by recording idle
cpus in a cpumask for each LLC, which is then used as the target cpuset
in the domain scan. The cpus are recorded at CORE granule to avoid
tasks being stacked on the same core.
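
To make the idea concrete, here is a small user-space C sketch of the
scan-with-a-filter approach. The mask, helper names and the re-check
below are illustrative only and do not correspond one-to-one to what
the series adds (the series keeps the mask per LLC and updates it at
core granularity); the point is that the wakeup path walks only the
recorded cpus instead of every cpu in the LLC, and re-checks each
candidate because the recorded state can be stale.

#include <stdio.h>

/* Per-LLC hint: one bit per cpu, set when its core was seen idle. */
static unsigned long llc_idle_mask;
/* Ground truth for the demo's re-check (stands in for idle_cpu()). */
static unsigned long really_idle_mask;

static void record_idle(int cpu) { llc_idle_mask |= 1UL << cpu; }

static int still_idle(int cpu) { return !!(really_idle_mask & (1UL << cpu)); }

/* Scan only the recorded cpus instead of every cpu in the LLC. */
static int select_idle_cpu_filtered(void)
{
	unsigned long mask = llc_idle_mask;

	while (mask) {
		int cpu = __builtin_ctzl(mask);	/* lowest set bit */

		if (still_idle(cpu))
			return cpu;
		/* Stale hint: clear it in place and try the next candidate. */
		llc_idle_mask &= ~(1UL << cpu);
		mask &= mask - 1;
	}
	return -1;			/* nothing usable, caller falls back */
}

int main(void)
{
	record_idle(3);			/* will turn out to be stale */
	record_idle(9);
	really_idle_mask = 1UL << 9;

	printf("picked cpu %d\n", select_idle_cpu_filtered());	/* -> 9 */
	return 0;
}

Clearing stale bits in place while scanning is the same spirit as the
v6 change that replaced the filter generation mechanism with in-place
updates during the domain scan.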

v5 -> v6:
 - Rename SIS_FILTER to SIS_CORE as it can only be activated when
   SMT is enabled and better describes the behavior of CORE granule
   update & load delivery.
 - Removed the part of limited scan for idle cores since it might be
   better to open another thread to discuss the strategies such as
   limited or scaled depth. But keep the part of full scan for idle
   cores when LLC is overloaded because SIS_CORE can greatly reduce
   the overhead of full scan in such case.
 - Removed the state of sd_is_busy which indicates an LLC is fully
   busy and we can safely skip the SIS domain scan. I would prefer
   to leave this to SIS_UTIL.
 - The filter generation mechanism is replaced by in-place updates
   during domain scan to better deal with partial scan failures.
 - Collect Reviewed-bys from Tim Chen

v4 -> v5:
 - Add limited scan for idle cores when overloaded, suggested by Mel
 - Split out several patches since they are irrelevant to this scope
 - Add quick check on ttwu_pending before core update (see the
   sketch after this list)
 - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
 - Move the main filter logic to the idle path, because the newidle
   balance can bail out early if rq->avg_idle is small enough and
   lose chances to update the filter.
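
A rough illustration of that ttwu_pending quick check, using toy
user-space types rather than struct rq (purely a sketch of the idea,
not the code in the patches): a cpu that already has a remote wakeup
queued, or tasks running, should not be advertised as an idle core.

#include <stdio.h>

/* Toy stand-ins for the rq fields consulted by the quick check. */
struct toy_rq {
	int ttwu_pending;	/* a remote wakeup is already queued here */
	int nr_running;		/* tasks currently runnable on this cpu */
};

/* Only advertise the core as idle when nothing runs and nothing is queued. */
static int core_can_be_marked_idle(const struct toy_rq *rq)
{
	return !rq->ttwu_pending && !rq->nr_running;
}

int main(void)
{
	struct toy_rq woken = { .ttwu_pending = 1, .nr_running = 0 };
	struct toy_rq idle  = { .ttwu_pending = 0, .nr_running = 0 };

	printf("%d %d\n", core_can_be_marked_idle(&woken),	/* 0 */
			  core_can_be_marked_idle(&idle));	/* 1 */
	return 0;
}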

v3 -> v4:
 - Update filter in load_balance rather than in the tick
 - Now the filter contains unoccupied cpus rather than overloaded ones
 - Added mechanisms to deal with the false positive cases

v2 -> v3:
 - Removed sched-idle balance feature and focus on SIS
 - Take non-CFS tasks into consideration
 - Several fixes/improvement suggested by Josh Don

v1 -> v2:
 - Several optimizations on sched-idle balancing
 - Ignore asym topos in can_migrate_task
 - Add more benchmarks including SIS efficiency
 - Re-organize patch as suggested by Mel Gorman

Abel Wu (4):
  sched/fair: Skip core update if task pending
  sched/fair: Ignore SIS_UTIL when has_idle_core
  sched/fair: Introduce SIS_CORE
  sched/fair: Deal with SIS scan failures

 include/linux/sched/topology.h |  15 ++++
 kernel/sched/fair.c            | 122 +++++++++++++++++++++++++++++----
 kernel/sched/features.h        |   7 ++
 kernel/sched/sched.h           |   3 +
 kernel/sched/topology.c        |   8 ++-
 5 files changed, 141 insertions(+), 14 deletions(-)
  

Comments

Abel Wu Nov. 4, 2022, 7:29 a.m. UTC | #1
Ping :)

On 10/19/22 8:28 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stack on same core.
> 
> v5 -> v6:
>   - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>     SMT is enabled and better describes the behavior of CORE granule
>     update & load delivery.
>   - Removed the part of limited scan for idle cores since it might be
>     better to open another thread to discuss the strategies such as
>     limited or scaled depth. But keep the part of full scan for idle
>     cores when LLC is overloaded because SIS_CORE can greatly reduce
>     the overhead of full scan in such case.
>   - Removed the state of sd_is_busy which indicates an LLC is fully
>     busy and we can safely skip the SIS domain scan. I would prefer
>     leave this to SIS_UTIL.
>   - The filter generation mechanism is replaced by in-place updates
>     during domain scan to better deal with partial scan failures.
>   - Collect Reviewed-bys from Tim Chen
> 
> ...
>
  
K Prateek Nayak Nov. 14, 2022, 5:45 a.m. UTC | #2
Hello Abel,

Sorry for the delay. I've tested the patch on a dual socket Zen3 system
(2 x 64C/128T).

tl;dr

o I do not notice any regressions with the standard benchmarks.
o schbench sees a nice improvement in tail latency when the number of
  workers is equal to the number of cores in the system in NPS1 and
  NPS2 mode. (Marked with "^")
o A few data points show improvements in tbench in NPS1 and NPS2 mode.
  (Marked with "^")

I'm still in the process of running larger workloads. If there is any
specific workload you would like me to run on the test system, please
do let me know. Below is the detailed report:

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS Modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip:          5.19.0 tip sched/core
- sis_core: 	5.19.0 tip sched/core + this series

When we started testing, the tip was at:
commit fdf756f71271 ("sched: Fix more TASK_state comparisons")

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			tip			sis_core
 1-groups:	   4.06 (0.00 pct)	   4.26 (-4.92 pct)	*
 1-groups:	   4.14 (0.00 pct)	   4.09 (1.20 pct)	[Verification Run]
 2-groups:	   4.76 (0.00 pct)	   4.71 (1.05 pct)
 4-groups:	   5.22 (0.00 pct)	   5.11 (2.10 pct)
 8-groups:	   5.35 (0.00 pct)	   5.31 (0.74 pct)
16-groups:	   7.21 (0.00 pct)	   6.80 (5.68 pct)

o NPS2

Test:			tip			sis_core
 1-groups:	   4.09 (0.00 pct)	   4.08 (0.24 pct)
 2-groups:	   4.70 (0.00 pct)	   4.69 (0.21 pct)
 4-groups:	   5.05 (0.00 pct)	   4.92 (2.57 pct)
 8-groups:	   5.35 (0.00 pct)	   5.26 (1.68 pct)
16-groups:	   6.37 (0.00 pct)	   6.34 (0.47 pct)

o NPS4

Test:			tip			sis_core
 1-groups:	   4.07 (0.00 pct)	   3.99 (1.96 pct)
 2-groups:	   4.65 (0.00 pct)	   4.59 (1.29 pct)
 4-groups:	   5.13 (0.00 pct)	   5.00 (2.53 pct)
 8-groups:	   5.47 (0.00 pct)	   5.43 (0.73 pct)
16-groups:	   6.82 (0.00 pct)	   6.56 (3.81 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers:	tip			sis_core
  1:	  33.00 (0.00 pct)	  33.00 (0.00 pct)
  2:	  35.00 (0.00 pct)	  35.00 (0.00 pct)
  4:	  39.00 (0.00 pct)	  38.00 (2.56 pct)
  8:	  49.00 (0.00 pct)	  48.00 (2.04 pct)
 16:	  63.00 (0.00 pct)	  66.00 (-4.76 pct)
 32:	 109.00 (0.00 pct)	 107.00 (1.83 pct)
 64:	 208.00 (0.00 pct)	 216.00 (-3.84 pct)
128:	 559.00 (0.00 pct)	 469.00 (16.10 pct)     ^
256:	 45888.00 (0.00 pct)	 47552.00 (-3.62 pct)
512:	 80000.00 (0.00 pct)	 79744.00 (0.32 pct)

o NPS2

#workers:	tip			sis_core
  1:	  30.00 (0.00 pct)	  32.00 (-6.66 pct)
  2:	  37.00 (0.00 pct)	  34.00 (8.10 pct)
  4:	  39.00 (0.00 pct)	  36.00 (7.69 pct)
  8:	  51.00 (0.00 pct)	  49.00 (3.92 pct)
 16:	  67.00 (0.00 pct)	  66.00 (1.49 pct)
 32:	 117.00 (0.00 pct)	 109.00 (6.83 pct)
 64:	 216.00 (0.00 pct)	 213.00 (1.38 pct)
128:	 529.00 (0.00 pct)	 465.00 (12.09 pct)     ^
256:	 47040.00 (0.00 pct)	 46528.00 (1.08 pct)
512:	 84864.00 (0.00 pct)	 83584.00 (1.50 pct)

o NPS4

#workers:	tip			sis_core
  1:	  23.00 (0.00 pct)	  28.00 (-21.73 pct)
  2:	  28.00 (0.00 pct)	  36.00 (-28.57 pct)
  4:	  41.00 (0.00 pct)	  43.00 (-4.87 pct)
  8:	  60.00 (0.00 pct)	  48.00 (20.00 pct)
 16:	  71.00 (0.00 pct)	  69.00 (2.81 pct)
 32:	 117.00 (0.00 pct)	 115.00 (1.70 pct)
 64:	 227.00 (0.00 pct)	 228.00 (-0.44 pct)
128:	 545.00 (0.00 pct)	 545.00 (0.00 pct)
256:	 45632.00 (0.00 pct)	 47680.00 (-4.48 pct)
512:	 81024.00 (0.00 pct)	 76416.00 (5.68 pct)

Note: For lower worker counts, schbench can show run-to-run
variation depending on external factors. Regressions at lower
worker counts can be ignored. The results are included to spot
any large blow-up in the tail latency at larger worker counts.

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	tip			sis_core
    1	 578.37 (0.00 pct)	 582.09 (0.64 pct)
    2	 1062.09 (0.00 pct)	 1063.95 (0.17 pct)
    4	 1800.62 (0.00 pct)	 1879.18 (4.36 pct)
    8	 3211.02 (0.00 pct)	 3220.44 (0.29 pct)
   16	 4848.92 (0.00 pct)	 4890.08 (0.84 pct)
   32	 9091.36 (0.00 pct)	 9721.13 (6.92 pct)     ^
   64	 15454.01 (0.00 pct)	 15124.42 (-2.13 pct)
  128	 3511.33 (0.00 pct)	 14314.79 (307.67 pct)
  128    19910.99 (0.00pct)      19935.61 (0.12 pct)   [Verification Run]
  256	 50019.32 (0.00 pct)	 50708.24 (1.37 pct)
  512	 44317.68 (0.00 pct)	 44787.48 (1.06 pct)
 1024	 41200.85 (0.00 pct)	 42079.29 (2.13 pct)

o NPS2

Clients:	tip			sis_core
    1	 576.05 (0.00 pct)	 579.18 (0.54 pct)
    2	 1037.68 (0.00 pct)	 1070.49 (3.16 pct)
    4	 1818.13 (0.00 pct)	 1860.22 (2.31 pct)
    8	 3004.16 (0.00 pct)	 3087.09 (2.76 pct)
   16	 4520.11 (0.00 pct)	 4789.53 (5.96 pct)
   32	 8624.23 (0.00 pct)	 9439.50 (9.45 pct)     ^
   64	 14886.75 (0.00 pct)	 15004.96 (0.79 pct)
  128	 20602.00 (0.00 pct)	 17730.31 (-13.93 pct) *
  128    20602.00 (0.00 pct)     19585.20 (-4.93 pct)   [Verification Run]
  256	 45566.83 (0.00 pct)	 47922.70 (5.17 pct)
  512	 42717.49 (0.00 pct)	 43809.68 (2.55 pct)
 1024	 40936.61 (0.00 pct)	 40787.71 (-0.36 pct)

o NPS4

Clients:	tip			sis_core
    1	 576.36 (0.00 pct)	 580.83 (0.77 pct)
    2	 1044.26 (0.00 pct)	 1066.50 (2.12 pct)
    4	 1839.77 (0.00 pct)	 1867.56 (1.51 pct)
    8	 3043.53 (0.00 pct)	 3115.17 (2.35 pct)
   16	 5207.54 (0.00 pct)	 4847.53 (-6.91 pct)	*
   16	 4722.56 (0.00 pct)	 4811.29 (1.87 pct)	[Verification Run]
   32	 9263.86 (0.00 pct)	 9478.68 (2.31 pct)
   64	 14959.66 (0.00 pct)	 15267.39 (2.05 pct)
  128	 20698.65 (0.00 pct)	 20432.19 (-1.28 pct)
  256	 46666.21 (0.00 pct)	 46664.81 (0.00 pct)
  512	 41532.80 (0.00 pct)	 44241.12 (6.52 pct)
 1024	 39459.49 (0.00 pct)	 41043.22 (4.01 pct)

Note: On the tested kernel, with 128 clients, tbench can
run into a bottleneck during C2 exit. More details can be
found at:
https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/
This issue has been fixed in v6.0 but was not part of the
tip kernel when I started testing. This data point has
been rerun with C2 disabled to get representative results.

~~~~~~~~~~
~ Stream ~
~~~~~~~~~~

o NPS1

-> 10 Runs:

Test:		tip			sis_core
 Copy:	 328419.14 (0.00 pct)	 337857.83 (2.87 pct)
Scale:	 206071.21 (0.00 pct)	 212133.82 (2.94 pct)
  Add:	 235271.48 (0.00 pct)	 243811.97 (3.63 pct)
Triad:	 253175.80 (0.00 pct)	 252333.43 (-0.33 pct)

-> 100 Runs:

Test:		tip			sis_core
 Copy:	 328209.61 (0.00 pct)	 339817.27 (3.53 pct)
Scale:	 216310.13 (0.00 pct)	 218635.16 (1.07 pct)
  Add:	 244417.83 (0.00 pct)	 245641.47 (0.50 pct)
Triad:	 237508.83 (0.00 pct)	 255387.28 (7.52 pct)

o NPS2

-> 10 Runs:

Test:		tip			sis_core
 Copy:	 336503.88 (0.00 pct)	 339684.21 (0.94 pct)
Scale:	 218035.23 (0.00 pct)	 217601.11 (-0.19 pct)
  Add:	 257677.42 (0.00 pct)	 258608.34 (0.36 pct)
Triad:	 268872.37 (0.00 pct)	 272548.09 (1.36 pct)

-> 100 Runs:

Test:		tip			sis_core
 Copy:	 332304.34 (0.00 pct)	 341565.75 (2.78 pct)
Scale:	 223421.60 (0.00 pct)	 224267.40 (0.37 pct)
  Add:	 252363.56 (0.00 pct)	 254926.98 (1.01 pct)
Triad:	 266687.56 (0.00 pct)	 270782.81 (1.53 pct)

o NPS4

-> 10 Runs:

Test:		tip			sis_core
 Copy:	 353515.62 (0.00 pct)	 342060.85 (-3.24 pct)
Scale:	 228854.37 (0.00 pct)	 218262.41 (-4.62 pct)
  Add:	 254942.12 (0.00 pct)	 241975.90 (-5.08 pct)
Triad:	 270521.87 (0.00 pct)	 257686.71 (-4.74 pct)

-> 100 Runs:

Test:		tip			sis_core
 Copy:	 374520.81 (0.00 pct)	 369353.13 (-1.37 pct)
Scale:	 246280.23 (0.00 pct)	 253881.69 (3.08 pct)
  Add:	 262772.72 (0.00 pct)	 266484.58 (1.41 pct)
Triad:	 283740.92 (0.00 pct)	 279981.18 (-1.32 pct)

On 10/19/2022 5:58 PM, Abel Wu wrote:
> [..snip..]

I ran pgbench from mmtests but realised there is too much run-to-run
variation on the system. I'm planning on running the MongoDB benchmark,
which is more stable on the system, and a couple more workloads, but
the initial results look good. I'll get back with results later this
week or by early next week. Meanwhile, if you need data for any
specific workload on the test system, please do let me know.

--
Thanks and Regards,
Prateek
  
Abel Wu Nov. 15, 2022, 8:31 a.m. UTC | #3
Hi Prateek, thanks very much for your detailed testing!

On 11/14/22 1:45 PM, K Prateek Nayak wrote:
> Hello Abel,
> 
> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
> (2 x 64C/128T)
> 
> tl;dr
> 
> o I do not notice any regressions with the standard benchmarks.
> o schbench sees a nice improvement to the tail latency when the number
>    of worker are equal to the number of cores in the system in NPS1 and
>    NPS2 mode. (Marked with "^")
> o Few data points show improvements in tbench in NPS1 and NPS2 mode.
>    (Marked with "^")
> 
> I'm still in the process of running larger workloads. If there is any
> specific workload you would like me to run on the test system, please
> do let me know. Below is the detailed report:

Nothing particular comes to mind, and I think testing larger workloads
is great. Thanks!

> 
> Following are the results from running standard benchmarks on a
> dual socket Zen3 (2 x 64C/128T) machine configured in different
> NPS modes.
> 
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>      Total 2 NUMA nodes in the dual socket machine.
> 
>      Node 0: 0-63,   128-191
>      Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>      Total 4 NUMA nodes exist over 2 socket.
>     
>      Node 0: 0-31,   128-159
>      Node 1: 32-63,  160-191
>      Node 2: 64-95,  192-223
>      Node 3: 96-127, 223-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>      Total 8 NUMA nodes exist over 2 socket.
>     
>      Node 0: 0-15,    128-143
>      Node 1: 16-31,   144-159
>      Node 2: 32-47,   160-175
>      Node 3: 48-63,   176-191
>      Node 4: 64-79,   192-207
>      Node 5: 80-95,   208-223
>      Node 6: 96-111,  223-231
>      Node 7: 112-127, 232-255
> 
> Benchmark Results:
> 
> Kernel versions:
> - tip:          5.19.0 tip sched/core
> - sis_core: 	5.19.0 tip sched/core + this series
> 
> When we started testing, the tip was at:
> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
> 
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
> 
> o NPS1
> 
> Test:			tip			sis_core
>   1-groups:	   4.06 (0.00 pct)	   4.26 (-4.92 pct)	*
>   1-groups:	   4.14 (0.00 pct)	   4.09 (1.20 pct)	[Verification Run]
>   2-groups:	   4.76 (0.00 pct)	   4.71 (1.05 pct)
>   4-groups:	   5.22 (0.00 pct)	   5.11 (2.10 pct)
>   8-groups:	   5.35 (0.00 pct)	   5.31 (0.74 pct)
> 16-groups:	   7.21 (0.00 pct)	   6.80 (5.68 pct)
> 
> o NPS2
> 
> Test:			tip			sis_core
>   1-groups:	   4.09 (0.00 pct)	   4.08 (0.24 pct)
>   2-groups:	   4.70 (0.00 pct)	   4.69 (0.21 pct)
>   4-groups:	   5.05 (0.00 pct)	   4.92 (2.57 pct)
>   8-groups:	   5.35 (0.00 pct)	   5.26 (1.68 pct)
> 16-groups:	   6.37 (0.00 pct)	   6.34 (0.47 pct)
> 
> o NPS4
> 
> Test:			tip			sis_core
>   1-groups:	   4.07 (0.00 pct)	   3.99 (1.96 pct)
>   2-groups:	   4.65 (0.00 pct)	   4.59 (1.29 pct)
>   4-groups:	   5.13 (0.00 pct)	   5.00 (2.53 pct)
>   8-groups:	   5.47 (0.00 pct)	   5.43 (0.73 pct)
> 16-groups:	   6.82 (0.00 pct)	   6.56 (3.81 pct)

Although each cpu will get ~2.5 tasks with 16 groups (16 groups x
40 tasks = 640 tasks over 256 cpus), which can be considered
overloaded, I tested on an AMD EPYC 7Y83 machine and the total cpu
usage was ~82% (with some older kernel version), so there is still
lots of idle time.

I guess the cut-off at 16 groups is because it is loaded enough
compared to real workloads, so testing more groups might just be
a waste of time?

Thanks & Best Regards,
	Abel

> [..snip..]
  
K Prateek Nayak Nov. 15, 2022, 11:28 a.m. UTC | #4
Hello Abel,

Thank you for taking a look at the report.

On 11/15/2022 2:01 PM, Abel Wu wrote:
> Hi Prateek, thanks very much for your detailed testing!
> 
> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>> Hello Abel,
>>
>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>> (2 x 64C/128T)
>>
>> tl;dr
>>
>> o I do not notice any regressions with the standard benchmarks.
>> o schbench sees a nice improvement to the tail latency when the number
>>    of worker are equal to the number of cores in the system in NPS1 and
>>    NPS2 mode. (Marked with "^")
>> o Few data points show improvements in tbench in NPS1 and NPS2 mode.
>>    (Marked with "^")
>>
>> I'm still in the process of running larger workloads. If there is any
>> specific workload you would like me to run on the test system, please
>> do let me know. Below is the detailed report:
> 
> Not particularly in my mind, and I think testing larger workloads is
> great. Thanks!
>
>>
>> [..snip..]
>>
>> ~~~~~~~~~~~~~
>> ~ hackbench ~
>> ~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> Test:            tip            sis_core
>>   1-groups:       4.06 (0.00 pct)       4.26 (-4.92 pct)    *
>>   1-groups:       4.14 (0.00 pct)       4.09 (1.20 pct)    [Verification Run]
>>   2-groups:       4.76 (0.00 pct)       4.71 (1.05 pct)
>>   4-groups:       5.22 (0.00 pct)       5.11 (2.10 pct)
>>   8-groups:       5.35 (0.00 pct)       5.31 (0.74 pct)
>> 16-groups:       7.21 (0.00 pct)       6.80 (5.68 pct)
>>
>> o NPS2
>>
>> Test:            tip            sis_core
>>   1-groups:       4.09 (0.00 pct)       4.08 (0.24 pct)
>>   2-groups:       4.70 (0.00 pct)       4.69 (0.21 pct)
>>   4-groups:       5.05 (0.00 pct)       4.92 (2.57 pct)
>>   8-groups:       5.35 (0.00 pct)       5.26 (1.68 pct)
>> 16-groups:       6.37 (0.00 pct)       6.34 (0.47 pct)
>>
>> o NPS4
>>
>> Test:            tip            sis_core
>>   1-groups:       4.07 (0.00 pct)       3.99 (1.96 pct)
>>   2-groups:       4.65 (0.00 pct)       4.59 (1.29 pct)
>>   4-groups:       5.13 (0.00 pct)       5.00 (2.53 pct)
>>   8-groups:       5.47 (0.00 pct)       5.43 (0.73 pct)
>> 16-groups:       6.82 (0.00 pct)       6.56 (3.81 pct)
> 
> Although each cpu will get 2.5 tasks when 16-groups, which can
> be considered overloaded, I tested in AMD EPYC 7Y83 machine and
> the total cpu usage was ~82% (with some older kernel version),
> so there is still lots of idle time.
> 
> I guess cutting off at 16-groups is because it's enough loaded
> compared to the real workloads, so testing more groups might just
> be a waste of time?

The machine has 16 LLCs so I capped the results at 16 groups.
Previously I had seen some run-to-run variance with larger group counts,
so I limited the reports to 16 groups. I'll run hackbench with a larger
number of groups (32, 64, 128, 256) and get back to you with the
results, along with results for a couple of long-running workloads.

> 
> Thanks & Best Regards,
>     Abel
> 
> [..snip..]
>


--
Thanks and Regards,
Prateek
  
K Prateek Nayak Nov. 22, 2022, 11:28 a.m. UTC | #5
Hello Abel,

Following are the results for hackbench with a larger number of
groups, ycsb-mongodb, Spec-JBB, and unixbench. Apart from
a regression in unixbench spawn in NPS2 and NPS4 mode and in
unixbench syscall in NPS4 mode, everything looks good.

Detailed results are below:

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip:            131696.33 (var: 2.03%)
sis_core:       129519.00 (var: 1.46%)  (-1.65%)

o NPS2:

tip:            129895.33 (var: 2.34%)
sis_core:       130774.33 (var: 2.57%)  (+0.67%)

o NPS4:

tip:            131165.00 (var: 1.06%)
sis_core:       133547.33 (var: 3.90%)  (+1.81%)

~~~~~~~~~~~~~~~~~
~ Spec-JBB NPS1 ~
~~~~~~~~~~~~~~~~~

Max-jOPS and Critical-jOPS are the same as on the tip kernel.

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

-> unixbench-dhry2reg

o NPS1

kernel:                                        tip                          sis_core
Min       unixbench-dhry2reg-1            48876615.50 (   0.00%)         48891544.00 (   0.03%)
Min       unixbench-dhry2reg-512        6260344658.90 (   0.00%)       6282967594.10 (   0.36%)
Hmean     unixbench-dhry2reg-1            49299721.81 (   0.00%)         49233828.70 (  -0.13%)
Hmean     unixbench-dhry2reg-512        6267459427.19 (   0.00%)       6288772961.79 *   0.34%*
CoeffVar  unixbench-dhry2reg-1                   0.90 (   0.00%)                0.68 (  24.66%)
CoeffVar  unixbench-dhry2reg-512                 0.10 (   0.00%)                0.10 (   7.54%)

o NPS2

kernel:                                        tip                          sis_core
Min       unixbench-dhry2reg-1            48828251.70 (   0.00%)         48856709.20 (   0.06%)
Min       unixbench-dhry2reg-512        6244987739.10 (   0.00%)       6271229549.10 (   0.42%)
Hmean     unixbench-dhry2reg-1            48869882.65 (   0.00%)         49302481.81 (   0.89%)
Hmean     unixbench-dhry2reg-512        6261073948.84 (   0.00%)       6272564898.35 (   0.18%)
CoeffVar  unixbench-dhry2reg-1                   0.08 (   0.00%)                0.87 (-945.28%)
CoeffVar  unixbench-dhry2reg-512                 0.23 (   0.00%)                0.03 (  85.94%)

o NPS4

kernel:                                        tip                          sis_core
Min       unixbench-dhry2reg-1            48523981.30 (   0.00%)         49083957.50 (   1.15%)
Min       unixbench-dhry2reg-512        6253738837.10 (   0.00%)       6271747119.10 (   0.29%)
Hmean     unixbench-dhry2reg-1            48781044.09 (   0.00%)         49232218.87 *   0.92%*
Hmean     unixbench-dhry2reg-512        6264428474.90 (   0.00%)       6280484789.64 (   0.26%)
CoeffVar  unixbench-dhry2reg-1                   0.46 (   0.00%)                0.26 (  42.63%)
CoeffVar  unixbench-dhry2reg-512                 0.17 (   0.00%)                0.21 ( -26.72%)

-> unixbench-syscall

o NPS1

kernel:                             tip                  sis_core
Min       unixbench-syscall-1    2975654.80 (   0.00%)  2978489.40 (  -0.10%)
Min       unixbench-syscall-512  7840226.50 (   0.00%)  7822133.40 (   0.23%)
Amean     unixbench-syscall-1    2976326.47 (   0.00%)  2980985.27 *  -0.16%*
Amean     unixbench-syscall-512  7850493.90 (   0.00%)  7844527.50 (   0.08%)
CoeffVar  unixbench-syscall-1          0.03 (   0.00%)        0.07 (-154.43%)
CoeffVar  unixbench-syscall-512        0.13 (   0.00%)        0.34 (-158.96%)

o NPS2

kernel:                             tip                  sis_core
Min       unixbench-syscall-1    2969863.60 (   0.00%)  2977936.50 (  -0.27%)
Min       unixbench-syscall-512  8053157.60 (   0.00%)  8072239.00 (  -0.24%)
Amean     unixbench-syscall-1    2970462.30 (   0.00%)  2981732.50 *  -0.38%*
Amean     unixbench-syscall-512  8061454.50 (   0.00%)  8079287.73 *  -0.22%*
CoeffVar  unixbench-syscall-1          0.02 (   0.00%)        0.11 (-527.26%)
CoeffVar  unixbench-syscall-512        0.12 (   0.00%)        0.08 (  37.30%)

o NPS4

kernel:                             tip                  sis_core
Min       unixbench-syscall-1    2971799.80 (   0.00%)  2979335.60 (  -0.25%)
Min       unixbench-syscall-512  7824196.90 (   0.00%)  8155610.20 (  -4.24%)
Amean     unixbench-syscall-1    2973045.43 (   0.00%)  2982036.13 *  -0.30%*
Amean     unixbench-syscall-512  7826302.17 (   0.00%)  8173026.57 *  -4.43%*   <-- Regression in syscall for larger worker count
CoeffVar  unixbench-syscall-1          0.04 (   0.00%)        0.09 (-139.63%)
CoeffVar  unixbench-syscall-512        0.03 (   0.00%)        0.20 (-701.13%)


-> unixbench-pipe

o NPS1

kernel:                               tip                  sis_core
Min       unixbench-pipe-1        2894765.30 (   0.00%)   2891505.30 (  -0.11%)
Min       unixbench-pipe-512    329818573.50 (   0.00%) 325610257.80 (  -1.28%)
Hmean     unixbench-pipe-1        2898803.38 (   0.00%)   2896940.25 (  -0.06%)
Hmean     unixbench-pipe-512    330226401.69 (   0.00%) 326311984.29 *  -1.19%*
CoeffVar  unixbench-pipe-1              0.14 (   0.00%)         0.17 ( -21.99%)
CoeffVar  unixbench-pipe-512            0.11 (   0.00%)         0.20 ( -88.38%)

o NPS2

kernel:                               tip                   sis_core
Min       unixbench-pipe-1        2895327.90 (   0.00%)    2894798.20 (  -0.02%)
Min       unixbench-pipe-512    328350065.60 (   0.00%)  325681163.10 (  -0.81%)
Hmean     unixbench-pipe-1        2899129.86 (   0.00%)    2897067.80 (  -0.07%)
Hmean     unixbench-pipe-512    329436096.80 (   0.00%)  326023030.94 *  -1.04%*
CoeffVar  unixbench-pipe-1              0.12 (   0.00%)          0.09 (  21.96%)
CoeffVar  unixbench-pipe-512            0.30 (   0.00%)          0.12 (  60.80%)

o NPS4

kernel:                               tip                   sis_core
Min       unixbench-pipe-1        2901525.60 (   0.00%)    2885730.80 (  -0.54%)
Min       unixbench-pipe-512    330265873.90 (   0.00%)  326730770.60 (  -1.07%)
Hmean     unixbench-pipe-1        2906184.70 (   0.00%)    2891616.18 *  -0.50%*
Hmean     unixbench-pipe-512    330854683.27 (   0.00%)  327113296.63 *  -1.13%*
CoeffVar  unixbench-pipe-1              0.14 (   0.00%)          0.19 ( -33.74%)
CoeffVar  unixbench-pipe-512            0.16 (   0.00%)          0.11 (  31.75%)

-> unixbench-spawn

o NPS1

kernel:                             tip                  sis_core
Min       unixbench-spawn-1       6536.50 (   0.00%)     6000.30 (  -8.20%)
Min       unixbench-spawn-512    72571.40 (   0.00%)    70829.60 (  -2.40%)
Hmean     unixbench-spawn-1       6811.16 (   0.00%)     7016.11 (   3.01%)
Hmean     unixbench-spawn-512    72801.77 (   0.00%)    71012.03 *  -2.46%*
CoeffVar  unixbench-spawn-1          3.69 (   0.00%)       13.52 (-266.69%)
CoeffVar  unixbench-spawn-512        0.27 (   0.00%)        0.22 (  18.25%)

o NPS2

kernel:                             tip                  sis_core
Min       unixbench-spawn-1       7042.20 (   0.00%)     7078.70 (   0.52%)
Min       unixbench-spawn-512    85571.60 (   0.00%)    77362.60 (  -9.59%)
Hmean     unixbench-spawn-1       7199.01 (   0.00%)     7276.55 (   1.08%)
Hmean     unixbench-spawn-512    85717.77 (   0.00%)    77923.73 *  -9.09%*     <-- Regression in spawn test for larger worker count
CoeffVar  unixbench-spawn-1          3.50 (   0.00%)        3.30 (   5.70%)
CoeffVar  unixbench-spawn-512        0.20 (   0.00%)        0.82 (-304.88%)

o NPS4

kernel:                             tip                  sis_core
Min       unixbench-spawn-1       7521.90 (   0.00%)     8102.80 (   7.72%)
Min       unixbench-spawn-512    84245.70 (   0.00%)    73074.50 ( -13.26%)
Hmean     unixbench-spawn-1       7659.12 (   0.00%)     8645.19 *  12.87%*
Hmean     unixbench-spawn-512    84908.77 (   0.00%)    73409.49 * -13.54%*     <-- Regression in spawn test for larger worker count
CoeffVar  unixbench-spawn-1          1.92 (   0.00%)        5.78 (-200.56%)
CoeffVar  unixbench-spawn-512        0.76 (   0.00%)        0.41 (  46.58%)

-> unixbench-execl

o NPS1

kernel:                             tip                  sis_core
Min       unixbench-execl-1       5421.50 (   0.00%)     5471.50 (   0.92%)
Min       unixbench-execl-512    11213.50 (   0.00%)    11677.20 (   4.14%)
Hmean     unixbench-execl-1       5443.75 (   0.00%)     5475.36 *   0.58%*
Hmean     unixbench-execl-512    11311.94 (   0.00%)    11804.52 *   4.35%*
CoeffVar  unixbench-execl-1          0.38 (   0.00%)        0.12 (  69.22%)
CoeffVar  unixbench-execl-512        1.03 (   0.00%)        1.73 ( -68.91%)

o NPS2

kernel:                             tip                  sis_core
Min       unixbench-execl-1       5089.10 (   0.00%)     5405.40 (   6.22%)
Min       unixbench-execl-512    11772.70 (   0.00%)    11917.20 (   1.23%)
Hmean     unixbench-execl-1       5321.65 (   0.00%)     5421.41 (   1.87%)
Hmean     unixbench-execl-512    12201.73 (   0.00%)    12327.95 (   1.03%)
CoeffVar  unixbench-execl-1          3.87 (   0.00%)        0.28 (  92.88%)
CoeffVar  unixbench-execl-512        6.23 (   0.00%)        5.78 (   7.21%)

o NPS4

kernel:                             tip                  sis_core
Min       unixbench-execl-1       5099.40 (   0.00%)     5479.60 (   7.46%)
Min       unixbench-execl-512    11692.80 (   0.00%)    12205.50 (   4.38%)
Hmean     unixbench-execl-1       5136.86 (   0.00%)     5487.93 *   6.83%*
Hmean     unixbench-execl-512    12053.71 (   0.00%)    12712.96 (   5.47%)
CoeffVar  unixbench-execl-1          1.05 (   0.00%)        0.14 (  86.57%)
CoeffVar  unixbench-execl-512        3.85 (   0.00%)        5.86 ( -52.14%)

For the unixbench regressions, I do not see anything obvious jump out
in the perf traces captured with IBS. top shows over 99% utilization,
which would ideally mean there are not many updates to the mask.
I'll take a closer look at the spawn test case and get back to you.

On 11/15/2022 4:58 PM, K Prateek Nayak wrote:
> Hello Abel,
> 
> Thank you for taking a look at the report.
> 
> On 11/15/2022 2:01 PM, Abel Wu wrote:
>> Hi Prateek, thanks very much for your detailed testing!
>>
>> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>>> Hello Abel,
>>>
>>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>>> (2 x 64C/128T)
>>>
>>> tl;dr
>>>
>>> o I do not notice any regressions with the standard benchmarks.
>>> o schbench sees a nice improvement to the tail latency when the number
>>>    of worker are equal to the number of cores in the system in NPS1 and
>>>    NPS2 mode. (Marked with "^")
>>> o Few data points show improvements in tbench in NPS1 and NPS2 mode.
>>>    (Marked with "^")
>>>
>>> I'm still in the process of running larger workloads. If there is any
>>> specific workload you would like me to run on the test system, please
>>> do let me know. Below is the detailed report:
>>
>> Not particularly in my mind, and I think testing larger workloads is
>> great. Thanks!
>>
>>>
>>> [..snip..]
>>>
>>> ~~~~~~~~~~~~~
>>> ~ hackbench ~
>>> ~~~~~~~~~~~~~
>>>
>>> o NPS1
>>>
>>> Test:            tip            sis_core
>>>   1-groups:       4.06 (0.00 pct)       4.26 (-4.92 pct)    *
>>>   1-groups:       4.14 (0.00 pct)       4.09 (1.20 pct)    [Verification Run]
>>>   2-groups:       4.76 (0.00 pct)       4.71 (1.05 pct)
>>>   4-groups:       5.22 (0.00 pct)       5.11 (2.10 pct)
>>>   8-groups:       5.35 (0.00 pct)       5.31 (0.74 pct)
>>> 16-groups:       7.21 (0.00 pct)       6.80 (5.68 pct)
>>>
>>> o NPS2
>>>
>>> Test:            tip            sis_core
>>>   1-groups:       4.09 (0.00 pct)       4.08 (0.24 pct)
>>>   2-groups:       4.70 (0.00 pct)       4.69 (0.21 pct)
>>>   4-groups:       5.05 (0.00 pct)       4.92 (2.57 pct)
>>>   8-groups:       5.35 (0.00 pct)       5.26 (1.68 pct)
>>> 16-groups:       6.37 (0.00 pct)       6.34 (0.47 pct)
>>>
>>> o NPS4
>>>
>>> Test:            tip            sis_core
>>>   1-groups:       4.07 (0.00 pct)       3.99 (1.96 pct)
>>>   2-groups:       4.65 (0.00 pct)       4.59 (1.29 pct)
>>>   4-groups:       5.13 (0.00 pct)       5.00 (2.53 pct)
>>>   8-groups:       5.47 (0.00 pct)       5.43 (0.73 pct)
>>> 16-groups:       6.82 (0.00 pct)       6.56 (3.81 pct)
>>
>> Although each cpu will get 2.5 tasks when 16-groups, which can
>> be considered overloaded, I tested in AMD EPYC 7Y83 machine and
>> the total cpu usage was ~82% (with some older kernel version),
>> so there is still lots of idle time.
>>
>> I guess cutting off at 16-groups is because it's enough loaded
>> compared to the real workloads, so testing more groups might just
>> be a waste of time?
> 
> The machine has 16 LLCs so I capped the results at 16-groups.
> Previously I had seen some run-to-run variance with larger group counts
> so I limited the reports to 16-groups. I'll run hackbench with more
> number of groups (32, 64, 128, 256) and get back to you with the
> results along with results for a couple of long running workloads. 

~~~~~~~~~~~~~
~ Hackbench ~
~~~~~~~~~~~~~

$ perf bench sched messaging -p -l 50000 -g <groups>

o NPS1

kernel:               tip                     sis_core
32-groups:         6.20 (0.00 pct)         5.86 (5.48 pct)
64-groups:        16.55 (0.00 pct)        15.21 (8.09 pct)
128-groups:       42.57 (0.00 pct)        34.63 (18.65 pct)
256-groups:       71.69 (0.00 pct)        67.11 (6.38 pct)
512-groups:      108.48 (0.00 pct)       110.23 (-1.61 pct)

o NPS2

kernel:                tip                     sis_core
32-groups:         6.56 (0.00 pct)         5.60 (14.63 pct)
64-groups:        15.74 (0.00 pct)        14.45 (8.19 pct)
128-groups:       39.93 (0.00 pct)        35.33 (11.52 pct)
256-groups:       74.49 (0.00 pct)        69.65 (6.49 pct)
512-groups:      112.22 (0.00 pct)       113.75 (-1.36 pct)

o NPS4:

kernel:               tip                     sis_core
32-groups:         9.48 (0.00 pct)         5.64 (40.50 pct)
64-groups:        15.38 (0.00 pct)        14.13 (8.12 pct)
128-groups:       39.93 (0.00 pct)        34.47 (13.67 pct)
256-groups:       75.31 (0.00 pct)        67.98 (9.73 pct)
512-groups:      115.37 (0.00 pct)       111.15 (3.65 pct)

Note: Hackbench with 32 groups shows run-to-run variation
on tip but is more stable with sis_core. Hackbench with
64 groups and beyond is stable on both kernels.

> 
>>
>> Thanks & Best Regards,
>>     Abel
>>
>> [..snip..]
>>
> 
> 
> --
> Thanks and Regards,
> Prateek

Apart from the couple of regressions in Unixbench, everything looks good.
If you would like me to get any more data for any workload on the test
system, please do let me know.
--
Thanks and Regards,
Prateek
  
Abel Wu Nov. 24, 2022, 3:50 a.m. UTC | #6
Hi Prateek, thanks again for your detailed test!

On 11/22/22 7:28 PM, K Prateek Nayak wrote:
> Hello Abel,
> 
> Following are the results for hackbench with larger number of
> groups, ycsb-mongodb, Spec-JBB, and unixbench. Apart for
> a regression in unixbench spawn in NPS2 and NPS4 mode and
> unixbench syscall in NPs2 mode, everything looks good.
> 
> ...
> 
> -> unixbench-syscall
> 
> o NPS4
> 
> kernel:                             tip                  sis_core
> Min       unixbench-syscall-1    2971799.80 (   0.00%)  2979335.60 (  -0.25%)
> Min       unixbench-syscall-512  7824196.90 (   0.00%)  8155610.20 (  -4.24%)
> Amean     unixbench-syscall-1    2973045.43 (   0.00%)  2982036.13 *  -0.30%*
> Amean     unixbench-syscall-512  7826302.17 (   0.00%)  8173026.57 *  -4.43%*   <-- Regression in syscall for larger worker count
> CoeffVar  unixbench-syscall-1          0.04 (   0.00%)        0.09 (-139.63%)
> CoeffVar  unixbench-syscall-512        0.03 (   0.00%)        0.20 (-701.13%)
> 
> 
> -> unixbench-spawn
> 
> o NPS1
> 
> kernel:                             tip                  sis_core
> Min       unixbench-spawn-1       6536.50 (   0.00%)     6000.30 (  -8.20%)
> Min       unixbench-spawn-512    72571.40 (   0.00%)    70829.60 (  -2.40%)
> Hmean     unixbench-spawn-1       6811.16 (   0.00%)     7016.11 (   3.01%)
> Hmean     unixbench-spawn-512    72801.77 (   0.00%)    71012.03 *  -2.46%*
> CoeffVar  unixbench-spawn-1          3.69 (   0.00%)       13.52 (-266.69%)
> CoeffVar  unixbench-spawn-512        0.27 (   0.00%)        0.22 (  18.25%)
> 
> o NPS2
> 
> kernel:                             tip                  sis_core
> Min       unixbench-spawn-1       7042.20 (   0.00%)     7078.70 (   0.52%)
> Min       unixbench-spawn-512    85571.60 (   0.00%)    77362.60 (  -9.59%)
> Hmean     unixbench-spawn-1       7199.01 (   0.00%)     7276.55 (   1.08%)
> Hmean     unixbench-spawn-512    85717.77 (   0.00%)    77923.73 *  -9.09%*     <-- Regression in spawn test for larger worker count
> CoeffVar  unixbench-spawn-1          3.50 (   0.00%)        3.30 (   5.70%)
> CoeffVar  unixbench-spawn-512        0.20 (   0.00%)        0.82 (-304.88%)
> 
> o NPS4
> 
> kernel:                             tip                  sis_core
> Min       unixbench-spawn-1       7521.90 (   0.00%)     8102.80 (   7.72%)
> Min       unixbench-spawn-512    84245.70 (   0.00%)    73074.50 ( -13.26%)
> Hmean     unixbench-spawn-1       7659.12 (   0.00%)     8645.19 *  12.87%*
> Hmean     unixbench-spawn-512    84908.77 (   0.00%)    73409.49 * -13.54%*     <-- Regression in spawn test for larger worker count
> CoeffVar  unixbench-spawn-1          1.92 (   0.00%)        5.78 (-200.56%)
> CoeffVar  unixbench-spawn-512        0.76 (   0.00%)        0.41 (  46.58%)
> 
> ...
> 
> For unixbench regressions, I do not see anything obvious jump up
> in perf traces captureed with IBS. top shows over 99% utilization
> which would ideally mean there are not many updates to the mask.
> I'll take some more look at the spawn test case and get back to you.

These regressions seem to be common in fully parallel tests. I
guess it might be due to over-updating the idle cpumask when the LLC
is overloaded, which is not necessary if SIS_UTIL is enabled, but I
need to dig further. Maybe the rq's avg_idle or nr_idle_scan needs
to be taken into consideration as well. Thanks for providing this
important information.
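
If it helps the discussion, here is a hypothetical sketch of that kind
of gating, with toy user-space types. nr_idle_scan follows the naming
of the SIS_UTIL scan-depth state and avg_idle the rq field, and
migration_cost_ns stands in for something like
sysctl_sched_migration_cost, but the helper itself is purely
illustrative and not part of this series:

#include <stdio.h>

/* Toy stand-ins for the signals mentioned above. */
struct toy_rq {
	unsigned long long avg_idle;	/* how long this rq tends to stay idle (ns) */
};

struct toy_sd_shared {
	int nr_idle_scan;		/* scan depth computed by SIS_UTIL */
};

/*
 * Skip updating the per-LLC idle cpumask when the LLC looks saturated:
 * SIS_UTIL has already given up on scanning, or this rq is expected to
 * be busy again before anyone could make use of the hint.
 */
static int should_update_idle_mask(const struct toy_rq *rq,
				   const struct toy_sd_shared *sds,
				   unsigned long long migration_cost_ns)
{
	if (sds->nr_idle_scan == 0)
		return 0;
	if (rq->avg_idle < migration_cost_ns)
		return 0;
	return 1;
}

int main(void)
{
	struct toy_rq rq = { .avg_idle = 100000ULL };		/* 100us */
	struct toy_sd_shared sds = { .nr_idle_scan = 0 };

	/* LLC saturated per SIS_UTIL: skip the update (prints 0). */
	printf("%d\n", should_update_idle_mask(&rq, &sds, 500000ULL));
	return 0;
}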

> 
> ~~~~~~~~~~~~~
> ~ Hackbench ~
> ~~~~~~~~~~~~~
> 
> $ perf bench sched messaging -p -l 50000 -g <groups>
> 
> o NPS1
> 
> kernel:               tip                     sis_core
> 32-groups:         6.20 (0.00 pct)         5.86 (5.48 pct)
> 64-groups:        16.55 (0.00 pct)        15.21 (8.09 pct)
> 128-groups:       42.57 (0.00 pct)        34.63 (18.65 pct)
> 256-groups:       71.69 (0.00 pct)        67.11 (6.38 pct)
> 512-groups:      108.48 (0.00 pct)       110.23 (-1.61 pct)
> 
> o NPS2
> 
> kernel:                tip                     sis_core
> 32-groups:         6.56 (0.00 pct)         5.60 (14.63 pct)
> 64-groups:        15.74 (0.00 pct)        14.45 (8.19 pct)
> 128-groups:       39.93 (0.00 pct)        35.33 (11.52 pct)
> 256-groups:       74.49 (0.00 pct)        69.65 (6.49 pct)
> 512-groups:      112.22 (0.00 pct)       113.75 (-1.36 pct)
> 
> o NPS4:
> 
> kernel:               tip                     sis_core
> 32-groups:         9.48 (0.00 pct)         5.64 (40.50 pct)
> 64-groups:        15.38 (0.00 pct)        14.13 (8.12 pct)
> 128-groups:       39.93 (0.00 pct)        34.47 (13.67 pct)
> 256-groups:       75.31 (0.00 pct)        67.98 (9.73 pct)
> 512-groups:      115.37 (0.00 pct)       111.15 (3.65 pct)
> 
> Note: Hackbench with 32-groups show run to run variation
> on tip but is more stable with sis_core. Hackbench for
> 64-groups and beyond is stable on both kernels.
> 
The results are consistent with mine except for 512 groups, which I
didn't test. The 512-groups test may have the same aforementioned
problem.

Thanks & Regards,
	Abel
  
K Prateek Nayak Feb. 7, 2023, 3:42 a.m. UTC | #7
Hello Abel,

I've retested the patches on the updated tip and the results
are still promising.

tl;dr

o Hackbench sees improvements when the machine is overloaded.
o tbench shows improvements when the machine is overloaded.
o The unixbench regression seen previously seems to be unrelated
  to the patch as the spawn test scores are vastly different
  after a reboot/kexec for the same kernel.
o Other benchmarks show slight improvements or are comparable to
  the numbers on tip.

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS Modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255

Following are the Kernel versions:

tip:            6.2.0-rc2 tip:sched/core at
                commit: bbd0b031509b "sched/rseq: Fix concurrency ID handling of usermodehelper kthreads"
sis_core:       tip + series

The series applied cleanly on the tip.

Benchmark Results:

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

NPS1

Test:			tip			sis_core
 1-groups:	   4.36 (0.00 pct)	   4.17 (4.35 pct)
 2-groups:	   5.17 (0.00 pct)	   5.03 (2.70 pct)
 4-groups:	   4.17 (0.00 pct)	   4.14 (0.71 pct)
 8-groups:	   4.64 (0.00 pct)	   4.63 (0.21 pct)
16-groups:	   5.43 (0.00 pct)	   5.32 (2.02 pct)

NPS2

Test:			tip			sis_core
 1-groups:	   4.43 (0.00 pct)	   4.27 (3.61 pct)
 2-groups:	   4.61 (0.00 pct)	   4.92 (-6.72 pct)	*
 2-groups:	   4.52 (0.00 pct)	   4.55 (-0.66 pct)	[Verification Run]
 4-groups:	   4.25 (0.00 pct)	   4.10 (3.52 pct)
 8-groups:	   4.91 (0.00 pct)	   4.53 (7.73 pct)
16-groups:	   5.84 (0.00 pct)	   5.54 (5.13 pct)

NPS4

Test:			tip			sis_core
 1-groups:	   4.34 (0.00 pct)	   4.23 (2.53 pct)
 2-groups:	   4.64 (0.00 pct)	   4.84 (-4.31 pct)
 4-groups:	   4.20 (0.00 pct)	   4.17 (0.71 pct)
 8-groups:	   5.21 (0.00 pct)	   5.06 (2.87 pct)
16-groups:	   6.24 (0.00 pct)	   5.60 (10.25 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

NPS1

#workers:	tip			sis_core
  1:	  36.00 (0.00 pct)	  23.00 (36.11 pct)
  2:	  37.00 (0.00 pct)	  37.00 (0.00 pct)
  4:	  37.00 (0.00 pct)	  38.00 (-2.70 pct)
  8:	  47.00 (0.00 pct)	  52.00 (-10.63 pct)
 16:	  64.00 (0.00 pct)	  65.00 (-1.56 pct)
 32:	 109.00 (0.00 pct)	 111.00 (-1.83 pct)
 64:	 222.00 (0.00 pct)	 215.00 (3.15 pct)
128:	 515.00 (0.00 pct)	 486.00 (5.63 pct)
256:	 39744.00 (0.00 pct)	 47808.00 (-20.28 pct)	* (Machine Overloaded ~ 2 tasks per rq)
256:	 43242.00 (0.00 pct)	 42293.00 (2.19 pct)	[Verification Run]
512:	 81280.00 (0.00 pct)	 76416.00 (5.98 pct)

NPS2

#workers:	tip			sis_core
  1:	  27.00 (0.00 pct)	  27.00 (0.00 pct)
  2:	  31.00 (0.00 pct)	  30.00 (3.22 pct)
  4:	  38.00 (0.00 pct)	  37.00 (2.63 pct)
  8:	  50.00 (0.00 pct)	  46.00 (8.00 pct)
 16:	  66.00 (0.00 pct)	  68.00 (-3.03 pct)
 32:	 116.00 (0.00 pct)	 113.00 (2.58 pct)
 64:	 210.00 (0.00 pct)	 228.00 (-8.57 pct)	*
 64:	 206.00 (0.00 pct)	 219.00 (-6.31 pct)	[Verification Run]
128:	 523.00 (0.00 pct)	 559.00 (-6.88 pct)	*
128:	 474.00 (0.00 pct)	 497.00 (-4.85 pct)	[Verification Run]
256:	 44864.00 (0.00 pct)	 47040.00 (-4.85 pct)
512:	 78464.00 (0.00 pct)	 81280.00 (-3.58 pct)

NPS4

#workers:	tip			sis_core
  1:	  32.00 (0.00 pct)	  27.00 (15.62 pct)
  2:	  32.00 (0.00 pct)	  35.00 (-9.37 pct)
  4:	  34.00 (0.00 pct)	  41.00 (-20.58 pct)
  8:	  58.00 (0.00 pct)	  58.00 (0.00 pct)
 16:	  67.00 (0.00 pct)	  69.00 (-2.98 pct)
 32:	 118.00 (0.00 pct)	 112.00 (5.08 pct)
 64:	 224.00 (0.00 pct)	 209.00 (6.69 pct)
128:	 533.00 (0.00 pct)	 519.00 (2.62 pct)
256:	 43456.00 (0.00 pct)	 45248.00 (-4.12 pct)
512:	 78976.00 (0.00 pct)	 76160.00 (3.56 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

NPS1

Clients:	tip			sis_core
    1	 539.96 (0.00 pct)	 538.19 (-0.32 pct)
    2	 1068.21 (0.00 pct)	 1063.04 (-0.48 pct)
    4	 1994.76 (0.00 pct)	 1990.47 (-0.21 pct)
    8	 3602.30 (0.00 pct)	 3496.07 (-2.94 pct)
   16	 6075.49 (0.00 pct)	 6061.74 (-0.22 pct)
   32	 11641.07 (0.00 pct)	 11904.58 (2.26 pct)
   64	 21529.16 (0.00 pct)	 22124.81 (2.76 pct)
  128	 30852.92 (0.00 pct)	 31258.56 (1.31 pct)
  256	 51901.20 (0.00 pct)	 53249.69 (2.59 pct)
  512	 46797.40 (0.00 pct)	 54477.79 (16.41 pct)
 1024	 46057.28 (0.00 pct)	 53676.58 (16.54 pct)

NPS2

Clients:	tip			sis_core
    1	 536.11 (0.00 pct)	 541.18 (0.94 pct)
    2	 1044.58 (0.00 pct)	 1064.16 (1.87 pct)
    4	 2043.92 (0.00 pct)	 2017.84 (-1.27 pct)
    8	 3572.50 (0.00 pct)	 3494.83 (-2.17 pct)
   16	 6040.97 (0.00 pct)	 5530.10 (-8.45 pct)	*
   16	 5814.03 (0.00 pct)	 6012.33 (3.41 pct)	[Verification Run]
   32	 10794.10 (0.00 pct)	 10841.68 (0.44 pct)
   64	 20905.89 (0.00 pct)	 21438.82 (2.54 pct)
  128	 30885.39 (0.00 pct)	 30064.78 (-2.65 pct)
  256	 48901.25 (0.00 pct)	 51395.08 (5.09 pct)
  512	 49673.91 (0.00 pct)	 51725.89 (4.13 pct)
 1024	 47626.34 (0.00 pct)	 52662.01 (10.57 pct)

NPS4

Clients:	tip			sis_core
    1	 544.91 (0.00 pct)	 544.66 (-0.04 pct)
    2	 1046.49 (0.00 pct)	 1072.42 (2.47 pct)
    4	 2007.11 (0.00 pct)	 1970.05 (-1.84 pct)
    8	 3590.66 (0.00 pct)	 3670.45 (2.22 pct)
   16	 5956.60 (0.00 pct)	 6045.07 (1.48 pct)
   32	 10431.73 (0.00 pct)	 10439.40 (0.07 pct)
   64	 21563.37 (0.00 pct)	 19344.05 (-10.29 pct)	*
   64	 19387.71 (0.00 pct)	 19050.47 (-1.73 pct)	[Verification Run]
  128	 30352.16 (0.00 pct)	 26998.85 (-11.04 pct)	*
  128	 29110.99 (0.00 pct)	 29690.37 (1.99 pct)	[Verification Run]
  256	 49504.51 (0.00 pct)	 50921.66 (2.86 pct)
  512	 44916.61 (0.00 pct)	 52176.11 (16.16 pct)
 1024	 49986.21 (0.00 pct)	 51639.91 (3.30 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

NPS1

10 Runs:

Test:		tip			sis_core
 Copy:	 339390.30 (0.00 pct)	 324656.88 (-4.34 pct)
Scale:	 212472.78 (0.00 pct)	 210641.39 (-0.86 pct)
  Add:	 247598.48 (0.00 pct)	 241669.10 (-2.39 pct)
Triad:	 261852.07 (0.00 pct)	 252088.55 (-3.72 pct)

100 Runs:

Test:		tip			sis_core
 Copy:	 335938.02 (0.00 pct)	 331491.32 (-1.32 pct)
Scale:	 212597.92 (0.00 pct)	 218705.84 (2.87 pct)
  Add:	 248294.62 (0.00 pct)	 243830.42 (-1.79 pct)
Triad:	 258400.88 (0.00 pct)	 248178.42 (-3.95 pct)

NPS2

10 Runs:

Test:		tip			sis_core
 Copy:	 334500.32 (0.00 pct)	 335317.70 (0.24 pct)
Scale:	 216804.76 (0.00 pct)	 217862.71 (0.48 pct)
  Add:	 250787.33 (0.00 pct)	 258839.00 (3.21 pct)
Triad:	 259451.40 (0.00 pct)	 264847.88 (2.07 pct)

100 Runs:

Test:		tip			sis_core
 Copy:	 326385.13 (0.00 pct)	 338030.70 (3.56 pct)
Scale:	 216440.37 (0.00 pct)	 230053.24 (6.28 pct)
  Add:	 255062.22 (0.00 pct)	 259197.23 (1.62 pct)
Triad:	 265442.03 (0.00 pct)	 271365.65 (2.23 pct)

NPS4

10 Runs:

Test:		tip			sis_core
 Copy:   363927.86 (0.00 pct)    361014.15 (-0.80 pct)
Scale:   238190.49 (0.00 pct)    242176.02 (1.67 pct)
  Add:   262806.49 (0.00 pct)    266348.50 (1.34 pct)
Triad:   276492.33 (0.00 pct)    276769.10 (0.10 pct)

100 Runs:

Test:		tip			sis_core
 Copy:   365041.37 (0.00 pct)    349299.35 (-4.31 pct)
Scale:   239295.27 (0.00 pct)    229944.85 (-3.90 pct)
  Add:   264085.21 (0.00 pct)    252651.56 (-4.32 pct)
Triad:   279664.56 (0.00 pct)    274254.22 (-1.93 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip:                    131328.67 (var: 2.97%)
sis_core:               131702.33 (var: 3.61%)	(0.28%)

o NPS2:

tip:			132482.33 (var: 2.06%)
sis_core:		132338.33 (var: 0.97%)  (-0.11%)

o NPS4:

tip:                    134130.00 (var: 4.12%)
sis_core:               133224.33 (var: 4.13%)	(-0.67%)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test			Metric	  Parallelism			tip		      sis_core
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48770555.20 (   0.00%)    49025161.73 (   0.52%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6268185467.60 (   0.00%)  6266351964.20 (  -0.03%)
unixbench-syscall       Amean     unixbench-syscall-1        2685321.17 (   0.00%)     2694468.30 *  -0.34%*
unixbench-syscall       Amean     unixbench-syscall-512      7291476.20 (   0.00%)     7295087.67 (  -0.05%)
unixbench-pipe          Hmean     unixbench-pipe-1           2480858.53 (   0.00%)     2536923.44 *   2.26%*
unixbench-pipe          Hmean     unixbench-pipe-512       300739256.62 (   0.00%)   303470605.93 *   0.91%*
unixbench-spawn         Hmean     unixbench-spawn-1             4358.14 (   0.00%)        4104.88 (  -5.81%)	* (Known to be unstable)
unixbench-spawn         Hmean     unixbench-spawn-1             4711.00 (   0.00%)        4006.20 ( -14.96%)	[Verification Run]
unixbench-spawn         Hmean     unixbench-spawn-512          76497.32 (   0.00%)       75555.94 *  -1.23%*
unixbench-execl         Hmean     unixbench-execl-1             4147.12 (   0.00%)        4157.33 (   0.25%)
unixbench-execl         Hmean     unixbench-execl-512          12435.26 (   0.00%)       11992.43 (  -3.56%)

o NPS2

Test			Metric	  Parallelism			tip		      sis_core
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48872335.50 (   0.00%)    48902553.70 (   0.06%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6264134378.20 (   0.00%)  6260631689.40 (  -0.06%)
unixbench-syscall       Amean     unixbench-syscall-1        2683903.13 (   0.00%)     2694829.17 *  -0.41%*
unixbench-syscall       Amean     unixbench-syscall-512      7746773.60 (   0.00%)     7493782.67 *   3.27%*
unixbench-pipe          Hmean     unixbench-pipe-1           2476724.23 (   0.00%)     2537127.96 *   2.44%*
unixbench-pipe          Hmean     unixbench-pipe-512       300277350.41 (   0.00%)   302979776.19 *   0.90%*
unixbench-spawn         Hmean     unixbench-spawn-1             5026.50 (   0.00%)        4680.63 (  -6.88%)	*
unixbench-spawn         Hmean     unixbench-spawn-1             5421.70 (   0.00%)        5311.50 (  -2.03%)	[Verification Run]
unixbench-spawn         Hmean     unixbench-spawn-512          80549.70 (   0.00%)       78888.60 (  -2.06%)
unixbench-execl         Hmean     unixbench-execl-1             4151.70 (   0.00%)        3913.76 *  -5.73%*	*
unixbench-execl         Hmean     unixbench-execl-1             4304.30 (   0.00%)        4303.20 (  -0.02%)	[Verification run]
unixbench-execl         Hmean     unixbench-execl-512          13605.15 (   0.00%)       13129.23 (  -3.50%)

o NPS4

Test			Metric	  Parallelism			tip		      sis_core
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48506771.20 (   0.00%)    48894866.70 (   0.80%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6280954362.50 (   0.00%)  6282759876.40 (   0.03%)
unixbench-syscall       Amean     unixbench-syscall-1        2687259.30 (   0.00%)     2695379.93 *  -0.30%*
unixbench-syscall       Amean     unixbench-syscall-512      7350275.67 (   0.00%)     7366923.73 (  -0.23%)
unixbench-pipe          Hmean     unixbench-pipe-1           2478893.01 (   0.00%)     2540015.88 *   2.47%*
unixbench-pipe          Hmean     unixbench-pipe-512       301830155.61 (   0.00%)   304305539.27 *   0.82%*
unixbench-spawn         Hmean     unixbench-spawn-1             5208.55 (   0.00%)        5273.11 (   1.24%)
unixbench-spawn         Hmean     unixbench-spawn-512          80745.79 (   0.00%)       81940.71 *   1.48%*
unixbench-execl         Hmean     unixbench-execl-1             4072.72 (   0.00%)        4126.13 *   1.31%*
unixbench-execl         Hmean     unixbench-execl-512          13746.56 (   0.00%)       12848.77 (  -6.53%)	*
unixbench-execl         Hmean     unixbench-execl-512          13898.30 (   0.00%)       13959.70 (   0.44%)	[Verification Run]

On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stack on same core.
> 
> v5 -> v6:
>  - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>    SMT is enabled and better describes the behavior of CORE granule
>    update & load delivery.
>  - Removed the part of limited scan for idle cores since it might be
>    better to open another thread to discuss the strategies such as
>    limited or scaled depth. But keep the part of full scan for idle
>    cores when LLC is overloaded because SIS_CORE can greatly reduce
>    the overhead of full scan in such case.
>  - Removed the state of sd_is_busy which indicates an LLC is fully
>    busy and we can safely skip the SIS domain scan. I would prefer
>    leave this to SIS_UTIL.
>  - The filter generation mechanism is replaced by in-place updates
>    during domain scan to better deal with partial scan failures.
>  - Collect Reviewed-bys from Tim Chen
> 
> v4 -> v5:
>  - Add limited scan for idle cores when overloaded, suggested by Mel
>  - Split out several patches since they are irrelevant to this scope
>  - Add quick check on ttwu_pending before core update
>  - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
>  - Move the main filter logic to the idle path, because the newidle
>    balance can bail out early if rq->avg_idle is small enough and
>    lose chances to update the filter.
> 
> v3 -> v4:
>  - Update filter in load_balance rather than in the tick
>  - Now the filter contains unoccupied cpus rather than overloaded ones
>  - Added mechanisms to deal with the false positive cases
> 
> v2 -> v3:
>  - Removed sched-idle balance feature and focus on SIS
>  - Take non-CFS tasks into consideration
>  - Several fixes/improvement suggested by Josh Don
> 
> v1 -> v2:
>  - Several optimizations on sched-idle balancing
>  - Ignore asym topos in can_migrate_task
>  - Add more benchmarks including SIS efficiency
>  - Re-organize patch as suggested by Mel Gorman
> 
> Abel Wu (4):
>   sched/fair: Skip core update if task pending
>   sched/fair: Ignore SIS_UTIL when has_idle_core
>   sched/fair: Introduce SIS_CORE
>   sched/fair: Deal with SIS scan failures
> 
>  include/linux/sched/topology.h |  15 ++++
>  kernel/sched/fair.c            | 122 +++++++++++++++++++++++++++++----
>  kernel/sched/features.h        |   7 ++
>  kernel/sched/sched.h           |   3 +
>  kernel/sched/topology.c        |   8 ++-
>  5 files changed, 141 insertions(+), 14 deletions(-)
> 

Testing with a couple of larger workloads like SpecJBB is still underway.
I'll update the thread with the results once they are done. The idea
is promising. I'll also try to run schbench / hackbench pinned in a
manner such that all wakeups happen on an external LLC to spot any
impact of rapid changes to the idle cpu mask of an external LLC.
Please let me know if you would like me to test or get data for any
particular benchmark from my test setup.
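
For reference, the pinning could be arranged with a small wrapper that
sets the CPU affinity of one side of the benchmark to a chosen LLC's
CPUs before exec'ing it, while the other side is started under a
different CPU list (or via the benchmark's own binding options) so that
wakeups cross LLC boundaries. A rough sketch using sched_setaffinity();
the CPU list below is a placeholder, not the actual layout of this
machine:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* Placeholder: CPUs of one LLC on the test machine. */
	static const int llc_cpus[] = { 0, 1, 2, 3, 4, 5, 6, 7 };
	cpu_set_t set;
	unsigned int i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <benchmark> [args...]\n", argv[0]);
		return 1;
	}

	CPU_ZERO(&set);
	for (i = 0; i < sizeof(llc_cpus) / sizeof(llc_cpus[0]); i++)
		CPU_SET(llc_cpus[i], &set);

	/* Children inherit the affinity, so the benchmark stays put. */
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

How the waker/wakee split maps onto schbench or hackbench depends on
how each benchmark spawns its message and worker threads, so take this
only as a starting point.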

--
Thanks and Regards,
Prateek
  
Abel Wu Feb. 16, 2023, 1:18 p.m. UTC | #8
Hi Prateek, thanks very much for your solid testing!

On 2/7/23 11:42 AM, K Prateek Nayak wrote:
> Hello Abel,
> 
> I've retested the patches on the updated tip and the results
> are still promising.
> 
> tl;dr
> 
> o Hackbench sees improvements when the machine is overloaded.
> o tbench shows improvements when the machine is overloaded.
> o The unixbench regression seen previously seems to be unrelated
>    to the patch as the spawn test scores are vastly different
>    after a reboot/kexec for the same kernel.
> o Other benchmarks show slight improvements or are comparable to
>    the numbers on tip.

Cheers! Yet I still see some minor regressions in the report
below. As we discussed last time, reducing unnecessary updates
on the idle cpumask when LLC is overloaded should help.
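
Roughly, the idea is to gate the shared-mask update on the LLC load,
e.g. something like the sketch below. The struct and helper names
(llc_shared, llc_is_overloaded) are made up for illustration and are
not the actual code in this series:

/*
 * Illustration only: skip publishing a CPU as idle in the per-LLC
 * mask when the LLC is overloaded, since the CPU will almost
 * certainly be picked up again right away and the update would
 * just bounce the shared cacheline.
 */
static void update_llc_idle_mask(struct llc_shared *llc, int cpu, bool idle)
{
	if (idle) {
		if (!llc_is_overloaded(llc))
			cpumask_set_cpu(cpu, llc->idle_cpus);
	} else {
		cpumask_clear_cpu(cpu, llc->idle_cpus);
	}
}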

Thanks & Best regards,
	Abel