[RFC,v2,0/2] sched/fair: Choose the CPU where short task is running during wake up


Message ID

cover.1666531576.git.yu.c.chen@intel.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Chen Yu <yu.c.chen@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
        Vincent Guittot <vincent.guittot@linaro.org>,
        Tim Chen <tim.c.chen@intel.com>,
        Mel Gorman <mgorman@techsingularity.net>
Cc: Juri Lelli <juri.lelli@redhat.com>,
        Rik van Riel <riel@surriel.com>,
        Aaron Lu <aaron.lu@intel.com>,
        Abel Wu <wuyun.abel@bytedance.com>,
        K Prateek Nayak <kprateek.nayak@amd.com>,
        Yicong Yang <yangyicong@hisilicon.com>,
        "Gautham R . Shenoy" <gautham.shenoy@amd.com>,
        Ingo Molnar <mingo@redhat.com>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        Ben Segall <bsegall@google.com>,
        Daniel Bristot de Oliveira <bristot@redhat.com>,
        Valentin Schneider <vschneid@redhat.com>,
        Hillf Danton <hdanton@sina.com>,
        Honglei Wang <wanghonglei@didichuxing.com>,
        Len Brown <len.brown@intel.com>,
        Chen Yu <yu.chen.surf@gmail.com>, linux-kernel@vger.kernel.org,
        Chen Yu <yu.c.chen@intel.com>
Subject: [RFC PATCH v2 0/2] sched/fair: Choose the CPU where short task is
 running during wake up
Date: Sun, 23 Oct 2022 23:31:50 +0800
Message-Id: <cover.1666531576.git.yu.c.chen@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

sched/fair: Choose the CPU where short task is running during wake up

Message

Chen Yu Oct. 23, 2022, 3:31 p.m. UTC

At the LPC 2022 Real-time and Scheduling Micro Conference we presented
the cross-CPU wakeup issue. This patch set is a text version of that
talk; we hope it clarifies the problem, and any feedback is
appreciated.

The main purpose of this change is to avoid excessive cross-CPU
wakeups when the system is busy. Please refer to the commit log
of [PATCH 2/2] for details.

This patch set is composed of two parts. The first part introduces
the definition of a short-duration task. The second part uses that
definition to find a CPU on which only one short-duration task is
running, and picks that CPU as the candidate for placing a woken task.

This version is modified based on the following feedback on v1:
1. Tim suggested raising the bar to choose a CPU with a short-duration
   task, by checking if the short-duration task is the only runnable
   task on the target CPU.
2. To address Peter's concern that this patch might inhibit spreading
   the workload when there are idle CPUs around: the patch only takes
   effect when the system is relatively busy, and it only chooses a CPU
   where a single short-duration task is running.
3. Prateek, Honglei and Hillf suggested preferring the idle previous CPU
   over a CPU with a short-duration task running.
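
The selection order implied by the three points above can be modelled roughly as follows. This is a toy sketch in Python, not the kernel code: the `Task` and `Cpu` classes, the `SHORT_TASK_NS` threshold and the `pick_wakeup_cpu` helper are hypothetical names invented for illustration. The cover letter does not define what counts as "short", so an average-runtime threshold is assumed, and "the system is relatively busy" is reduced to "no idle candidate CPU was found".

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed threshold; the cover letter does not give the real definition.
SHORT_TASK_NS = 500_000


@dataclass
class Task:
    avg_runtime_ns: int  # hypothetical per-task average run duration

    def is_short(self) -> bool:
        return self.avg_runtime_ns < SHORT_TASK_NS


@dataclass
class Cpu:
    cpu_id: int
    runnable: List[Task] = field(default_factory=list)

    def is_idle(self) -> bool:
        return not self.runnable

    def runs_single_short_task(self) -> bool:
        # Point 1: the short task must be the only runnable task.
        return len(self.runnable) == 1 and self.runnable[0].is_short()


def pick_wakeup_cpu(prev: Cpu, candidates: List[Cpu]) -> Optional[Cpu]:
    # Point 3: an idle previous CPU is preferred over everything else.
    if prev.is_idle():
        return prev
    # Point 2: keep spreading the workload while idle CPUs exist.
    for cpu in candidates:
        if cpu.is_idle():
            return cpu
    # Only on a busy system: accept a CPU whose sole task is short.
    for cpu in candidates:
        if cpu.runs_single_short_task():
            return cpu
    return None  # caller falls back to its default choice
```

In this model a CPU running two short tasks is never picked, which mirrors the "raise the bar" suggestion in point 1.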

v1 link: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/

Chen Yu (2):
  sched/fair: Introduce short duration task check
  sched/fair: Choose the CPU where short task is running during wake up

 include/linux/sched.h   |  8 ++++
 kernel/sched/core.c     |  2 +
 kernel/sched/fair.c     | 99 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 4 files changed, 110 insertions(+)

Comments

K Prateek Nayak Nov. 22, 2022, 10:31 a.m. UTC | #1

Hello Chenyu,

I've tested the v2 series on a dual-socket Zen3 system (2 x 64C/128T) and
the results are largely positive.

tl;dr

o Hackbench results are mostly similar to tip.
o schbench sees improvements to tail latency when the system is
  loaded in the NPS1 case, but I do see one small regression for
  128 workers in NPS4 mode.
o tbench sees small gains in NPS2 and NPS4 modes.
o Stream and Spec-JBB results remain the same as tip.
o ycsb-mongodb sees small gains in NPS2 and NPS4 modes.
o unixbench results see small to moderate gains overall.

I'll leave the detailed results below:

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist across 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist across 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255
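
The per-node CPU ranges follow a regular pattern, which can be reproduced with a short sketch. It assumes (inferred from the listing, not stated in the mail) that cores are split into equal contiguous blocks per node and that each core's SMT sibling is numbered 128 above its first thread. The helper name `nps_node_cpus` is hypothetical.

```python
def nps_node_cpus(nps, sockets=2, cores_per_socket=64):
    """Map each NUMA node to its two inclusive CPU ranges.

    Returns {node: ((first_lo, first_hi), (sibling_lo, sibling_hi))}.
    Assumes equal contiguous core blocks and SMT siblings offset by
    the total core count (128 on the machine described here).
    """
    total_cores = sockets * cores_per_socket
    nodes = sockets * nps
    cores_per_node = total_cores // nodes
    layout = {}
    for node in range(nodes):
        lo = node * cores_per_node
        hi = lo + cores_per_node - 1
        layout[node] = ((lo, hi), (lo + total_cores, hi + total_cores))
    return layout


if __name__ == "__main__":
    for mode in (1, 2, 4):
        print(f"NPS{mode}")
        for node, (first, sibling) in nps_node_cpus(mode).items():
            print(f"    Node {node}: {first[0]}-{first[1]}, "
                  f"{sibling[0]}-{sibling[1]}")
```

Under these assumptions every node in a given mode holds the same number of CPUs (128, 64 and 32 threads per node for NPS1, NPS2 and NPS4 respectively).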

Benchmark Results:

Kernel versions:
- tip:          5.19.0 tip sched/core
- sis_short:    5.19.0 tip sched/core + this series

When we started testing, the tip was at:
commit fdf756f71271 ("sched: Fix more TASK_state comparisons")

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			tip			sis_short
 1-groups:	   4.06 (0.00 pct)	   4.02 (0.98 pct)
 2-groups:	   4.76 (0.00 pct)	   4.71 (1.05 pct)
 4-groups:	   5.22 (0.00 pct)	   5.07 (2.87 pct)
 8-groups:	   5.35 (0.00 pct)	   5.31 (0.74 pct)
16-groups:	   7.21 (0.00 pct)	   7.22 (-0.13 pct)

o NPS2

Test:			tip			sis_short
 1-groups:	   4.09 (0.00 pct)	   4.05 (0.97 pct)
 2-groups:	   4.70 (0.00 pct)	   4.69 (0.21 pct)
 4-groups:	   5.05 (0.00 pct)	   4.95 (1.98 pct)
 8-groups:	   5.35 (0.00 pct)	   5.27 (1.49 pct)
16-groups:	   6.37 (0.00 pct)	   6.60 (-3.61 pct)

o NPS4

Test:			tip			sis_short
 1-groups:	   4.07 (0.00 pct)	   4.13 (-1.47 pct)
 2-groups:	   4.65 (0.00 pct)	   4.71 (-1.29 pct)
 4-groups:	   5.13 (0.00 pct)	   5.05 (1.55 pct)
 8-groups:	   5.47 (0.00 pct)	   5.44 (0.54 pct)
16-groups:	   6.82 (0.00 pct)	   6.72 (1.46 pct)
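
The "pct" columns in these tables appear to be the relative change against tip, signed so that a positive value means sis_short did better. The sketch below assumes that convention, which is inferred from the numbers in this mail rather than stated by the author; `pct_change` is a hypothetical helper name.

```python
def pct_change(tip, new, lower_is_better):
    """Relative change of `new` against `tip`, in percent.

    The sign is flipped for lower-is-better metrics (e.g. hackbench
    run time, schbench latency) so that positive always means the
    patched kernel improved on the baseline.
    """
    delta = (tip - new) if lower_is_better else (new - tip)
    return 100.0 * delta / tip


if __name__ == "__main__":
    # hackbench NPS1, 1-groups: lower is better
    print(pct_change(4.06, 4.02, lower_is_better=True))
    # tbench NPS1, 1 client: higher is better
    print(pct_change(578.37, 568.72, lower_is_better=False))
```

For the hackbench NPS1 1-groups row this evaluates to roughly 0.985, which the table lists as 0.98, so the reported figures seem to be truncated rather than rounded.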

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers:	tip			sis_short
  1:	  33.00 (0.00 pct)	  34.00 (-3.03 pct)
  2:	  35.00 (0.00 pct)	  36.00 (-2.85 pct)
  4:	  39.00 (0.00 pct)	  40.00 (-2.56 pct)
  8:	  49.00 (0.00 pct)	  47.00 (4.08 pct)
 16:	  63.00 (0.00 pct)	  64.00 (-1.58 pct)
 32:	 109.00 (0.00 pct)	 106.00 (2.75 pct)
 64:	 208.00 (0.00 pct)	 214.00 (-2.88 pct)
128:	 559.00 (0.00 pct)	 497.00 (11.09 pct)
256:	 45888.00 (0.00 pct)	 47424.00 (-3.34 pct)
512:	 80000.00 (0.00 pct)	 77952.00 (2.56 pct)

o NPS2

#workers:	tip			sis_short
  1:	  30.00 (0.00 pct)	  31.00 (-3.33 pct)
  2:	  37.00 (0.00 pct)	  36.00 (2.70 pct)
  4:	  39.00 (0.00 pct)	  40.00 (-2.56 pct)
  8:	  51.00 (0.00 pct)	  50.00 (1.96 pct)
 16:	  67.00 (0.00 pct)	  67.00 (0.00 pct)
 32:	 117.00 (0.00 pct)	 114.00 (2.56 pct)
 64:	 216.00 (0.00 pct)	 214.00 (0.92 pct)
128:	 529.00 (0.00 pct)	 597.00 (-12.85 pct)    *
128:     388.00 (0.00 pct)       382.00 (1.54 pct)      [Verification Run]
256:	 47040.00 (0.00 pct)	 47424.00 (-0.81 pct)
512:	 84864.00 (0.00 pct)	 81792.00 (3.61 pct)

o NPS4

#workers:	tip			sis_short
  1:	  23.00 (0.00 pct)	  33.00 (-43.47 pct)
  2:	  28.00 (0.00 pct)	  27.00 (3.57 pct)
  4:	  41.00 (0.00 pct)	  37.00 (9.75 pct)
  8:	  60.00 (0.00 pct)	  56.00 (6.66 pct)
 16:	  71.00 (0.00 pct)	  71.00 (0.00 pct)
 32:	 117.00 (0.00 pct)	 114.00 (2.56 pct)
 64:	 227.00 (0.00 pct)	 218.00 (3.96 pct)
128:	 545.00 (0.00 pct)	 747.00 (-37.06 pct)    *
128:	 383.00 (0.00 pct)	 412.00 (-7.85 pct)    [Verification Run]
256:	 45632.00 (0.00 pct)	 47296.00 (-3.64 pct)
512:	 81024.00 (0.00 pct)	 78720.00 (2.84 pct)

Note: At lower worker counts, schbench can show run-to-run
variation depending on external factors, so regressions at
lower worker counts can be ignored. The results are
included to spot any large blow-up in tail latency
at larger worker counts.

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	tip			sis_short
    1	 578.37 (0.00 pct)	 568.72 (-1.66 pct)
    2	 1062.09 (0.00 pct)	 1055.45 (-0.62 pct)
    4	 1800.62 (0.00 pct)	 1833.37 (1.81 pct)
    8	 3211.02 (0.00 pct)	 3124.95 (-2.68 pct)
   16	 4848.92 (0.00 pct)	 4823.27 (-0.52 pct)
   32	 9091.36 (0.00 pct)	 9301.80 (2.31 pct)
   64	 15454.01 (0.00 pct)	 14639.52 (-5.27 pct)   *
   64	 14890.79 (0.00 pct)	 14314.95 (-3.86 pct)   [Verification Run]
  128	 3511.33 (0.00 pct)	 2740.46 (-21.95 pct)   *
  128	 19750.19 (0.00 pct)	 20006.42 (1.29 pct)    [Verification Run]
  256	 50019.32 (0.00 pct)	 50384.18 (0.72 pct)
  512	 44317.68 (0.00 pct)	 44155.90 (-0.36 pct)
 1024	 41200.85 (0.00 pct)	 41242.49 (0.10 pct)

o NPS2

Clients:	tip			sis_short
    1	 576.05 (0.00 pct)	 578.08 (0.35 pct)
    2	 1037.68 (0.00 pct)	 1098.68 (5.87 pct)
    4	 1818.13 (0.00 pct)	 1838.79 (1.13 pct)
    8	 3004.16 (0.00 pct)	 3071.73 (2.24 pct)
   16	 4520.11 (0.00 pct)	 4820.67 (6.64 pct)
   32	 8624.23 (0.00 pct)	 9264.14 (7.41 pct)
   64	 14886.75 (0.00 pct)	 14976.91 (0.60 pct)
  128	 20602.00 (0.00 pct)	 20247.46 (-1.72 pct)
  256	 45566.83 (0.00 pct)	 48786.00 (7.06 pct)
  512	 42717.49 (0.00 pct)	 44678.97 (4.59 pct)
 1024	 40936.61 (0.00 pct)	 40866.32 (-0.17 pct)

o NPS4

Clients:	tip			sis_short
    1	 576.36 (0.00 pct)	 588.43 (2.09 pct)
    2	 1044.26 (0.00 pct)	 1074.47 (2.89 pct)
    4	 1839.77 (0.00 pct)	 1852.10 (0.67 pct)
    8	 3043.53 (0.00 pct)	 3235.32 (6.30 pct)
   16	 5207.54 (0.00 pct)	 4804.41 (-7.74 pct)    *
   16    4620.29 (0.00 pct)      4714.69 (2.04 pct)     [Verification Run]
   32	 9263.86 (0.00 pct)	 8238.55 (-11.06 pct)   *
   32	 9263.86 (0.00 pct)	 9443.77 (1.94 pct)     [Verification Run]
   64	 14959.66 (0.00 pct)	 15321.44 (2.41 pct)
  128	 20698.65 (0.00 pct)	 16806.27 (-18.80 pct)  *
  128    20698.65 (0.00 pct)     20978.42 (1.35 pct)    [Verification Run]
  256	 46666.21 (0.00 pct)	 49787.15 (6.68 pct)
  512	 41532.80 (0.00 pct)	 44738.18 (7.71 pct)
 1024	 39459.49 (0.00 pct)	 41473.96 (5.10 pct)

Note: On the tested kernel, with 128 clients, tbench can
run into a bottleneck during C2 exit. More details can be
found at:
https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/
This issue has been fixed in v6.0 but was not part of the
tip kernel when I started testing. This data point has
been rerun with C2 disabled to get representative results.

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test:		tip			sis_short
 Copy:	 328419.14 (0.00 pct)	 336740.00 (2.53 pct)
Scale:	 206071.21 (0.00 pct)	 212682.17 (3.20 pct)
  Add:	 235271.48 (0.00 pct)	 244104.35 (3.75 pct)
Triad:	 253175.80 (0.00 pct)	 251776.26 (-0.55 pct)

- 100 Runs:

Test:		tip			sis_short
 Copy:	 328209.61 (0.00 pct)	 340132.12 (3.63 pct)
Scale:	 216310.13 (0.00 pct)	 218811.70 (1.15 pct)
  Add:	 244417.83 (0.00 pct)	 246349.22 (0.79 pct)
Triad:	 237508.83 (0.00 pct)	 260160.20 (9.53 pct)

o NPS2

- 10 Runs:

Test:		tip			sis_short
 Copy:	 336503.88 (0.00 pct)	 319171.80 (-5.15 pct)
Scale:	 218035.23 (0.00 pct)	 219061.13 (0.47 pct)
  Add:	 257677.42 (0.00 pct)	 256776.22 (-0.34 pct)
Triad:	 268872.37 (0.00 pct)	 263751.14 (-1.90 pct)

- 100 Runs:

Test:		tip			sis_short
 Copy:	 332304.34 (0.00 pct)	 320547.46 (-3.53 pct)
Scale:	 223421.60 (0.00 pct)	 220418.63 (-1.34 pct)
  Add:	 252363.56 (0.00 pct)	 254553.30 (0.86 pct)
Triad:	 266687.56 (0.00 pct)	 260009.00 (-2.50 pct)

o NPS4

- 10 Runs:

Test:		tip			sis_short
 Copy:	 353515.62 (0.00 pct)	 338973.78 (-4.11 pct)
Scale:	 228854.37 (0.00 pct)	 230319.08 (0.64 pct)
  Add:	 254942.12 (0.00 pct)	 247794.21 (-2.80 pct)
Triad:	 270521.87 (0.00 pct)	 261432.32 (-3.36 pct)

- 100 Runs:

Test:		tip			sis_short
 Copy:	 374520.81 (0.00 pct)	 363272.21 (-3.00 pct)
Scale:	 246280.23 (0.00 pct)	 241457.83 (-1.95 pct)
  Add:	 262772.72 (0.00 pct)	 261924.44 (-0.32 pct)
Triad:	 283740.92 (0.00 pct)	 274791.15 (-3.15 pct)

~~~~~~~~~~~~~~~~~
~ Spec-JBB NPS1 ~
~~~~~~~~~~~~~~~~~

--------------------------------------------------
|   Throughput  |     tip     |     sis_short    |
--------------------------------------------------
|    Max-jOPS   |     100%    |      98.84%      |
| Critical-jOPS |     100%    |      100.31%     |
--------------------------------------------------

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip:                    131696.33 (var: 2.03%)
sis_short:              130844.67 (var: 2.55%)  (-0.64%)

o NPS2

tip:                    129895.33 (var: 2.34%)
sis_short:              133104.33 (var: 1.65%)  (+2.47%)

o NPS4

tip:                    131165.00 (var: 1.06%)
sis_short:              138180.67 (var: 0.83%)  (+5.34%)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

-> unixbench-dhry2reg

o NPS1

kernel:                                        tip                          sis_short
Min       unixbench-dhry2reg-1            48876615.50 (   0.00%)          48489507.40 (  -0.79%)
Min       unixbench-dhry2reg-512        6260344658.90 (   0.00%)        6253084311.60 (  -0.12%)
Hmean     unixbench-dhry2reg-1            49299721.81 (   0.00%)          49014780.04 (  -0.58%)
Hmean     unixbench-dhry2reg-512        6267459427.19 (   0.00%)        6261978461.64 (  -0.09%)
CoeffVar  unixbench-dhry2reg-1                   0.90 (   0.00%)                 0.98 (  -9.38%)
CoeffVar  unixbench-dhry2reg-512                 0.10 (   0.00%)                 0.17 ( -61.99%)
Max       unixbench-dhry2reg-1            49758806.60 (   0.00%)          49428847.90 (  -0.66%)
Max       unixbench-dhry2reg-512        6273024869.70 (   0.00%)        6273555460.00 (   0.01%)

o NPS2

kernel:                                        tip                          sis_short
Min       unixbench-dhry2reg-1            48828251.70 (   0.00%)          48591509.40 (  -0.48%)
Min       unixbench-dhry2reg-512        6244987739.10 (   0.00%)        6254966248.00 (   0.16%)
Hmean     unixbench-dhry2reg-1            48869882.65 (   0.00%)          49230596.10 (   0.74%)
Hmean     unixbench-dhry2reg-512        6261073948.84 (   0.00%)        6260685008.60 (  -0.01%)
CoeffVar  unixbench-dhry2reg-1                   0.08 (   0.00%)                 1.20 (-1347.66%)
CoeffVar  unixbench-dhry2reg-512                 0.23 (   0.00%)                 0.09 (  59.12%)
Max       unixbench-dhry2reg-1            48909163.40 (   0.00%)          49752650.10 (   1.72%)
Max       unixbench-dhry2reg-512        6271411453.90 (   0.00%)        6266517108.00 (  -0.08%)

o NPS4

kernel:                                        tip                          sis_short
Min       unixbench-dhry2reg-1            48523981.30 (   0.00%)          48728886.20 (   0.42%)
Min       unixbench-dhry2reg-512        6253738837.10 (   0.00%)        6260870171.70 (   0.11%)
Hmean     unixbench-dhry2reg-1            48781044.09 (   0.00%)          48969711.29 (   0.39%)
Hmean     unixbench-dhry2reg-512        6264428474.90 (   0.00%)        6277327761.28 (   0.21%)
CoeffVar  unixbench-dhry2reg-1                   0.46 (   0.00%)                 0.43 (   6.91%)
CoeffVar  unixbench-dhry2reg-512                 0.17 (   0.00%)                 0.29 ( -70.82%)
Max       unixbench-dhry2reg-1            48925665.20 (   0.00%)          49091708.50 (   0.34%)
Max       unixbench-dhry2reg-512        6274958506.80 (   0.00%)        6296828879.20 (   0.35%)
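
The Hmean, Amean and CoeffVar rows are summary statistics over repeated runs. A minimal sketch of how such figures are commonly computed is given below; it assumes Hmean is the harmonic mean, Amean the arithmetic mean and CoeffVar the standard deviation expressed as a percentage of the mean. Whether the reporting tool uses the population or the sample standard deviation is not stated in this mail, so the population form used here is an assumption, and `summarize` is a hypothetical helper name.

```python
import statistics


def summarize(samples):
    """Return min, arithmetic mean, harmonic mean, CoeffVar (%), max."""
    amean = statistics.mean(samples)
    return {
        "Min": min(samples),
        "Amean": amean,
        "Hmean": statistics.harmonic_mean(samples),
        # Assumed definition: population stddev as a percentage of the mean.
        "CoeffVar": 100.0 * statistics.pstdev(samples) / amean,
        "Max": max(samples),
    }


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    for name, value in summarize([2, 4, 4, 4, 5, 5, 7, 9]).items():
        print(f"{name:<9} {value:.2f}")
```

The harmonic mean is the usual choice for rate-like metrics such as operations per second, since it is less influenced by a single unusually fast run than the arithmetic mean.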

-> unixbench-syscall

o NPS1

kernel:                             tip                  sis_short
Min       unixbench-syscall-1    2975654.80 (   0.00%)  2971008.50 (   0.16%)
Min       unixbench-syscall-512  7840226.50 (   0.00%)  6586485.10 (  15.99%)
Amean     unixbench-syscall-1    2976326.47 (   0.00%)  2971920.50 *   0.15%*
Amean     unixbench-syscall-512  7850493.90 (   0.00%)  6597210.63 *  15.96%*
CoeffVar  unixbench-syscall-1          0.03 (   0.00%)        0.03 ( -14.26%)
CoeffVar  unixbench-syscall-512        0.13 (   0.00%)        0.27 (-103.14%)
Max       unixbench-syscall-1    2977279.70 (   0.00%)  2972935.80 (   0.15%)
Max       unixbench-syscall-512  7860838.90 (   0.00%)  6617515.40 (  15.82%)

o NPS2

kernel:                             tip                  sis_short
Min       unixbench-syscall-1    2969863.60 (   0.00%)  2974771.70 (  -0.17%)
Min       unixbench-syscall-512  8053157.60 (   0.00%)  7411223.90 (   7.97%)
Amean     unixbench-syscall-1    2970462.30 (   0.00%)  2975278.63 *  -0.16%*
Amean     unixbench-syscall-512  8061454.50 (   0.00%)  7437679.30 *   7.74%*
CoeffVar  unixbench-syscall-1          0.02 (   0.00%)        0.02 ( -17.72%)
CoeffVar  unixbench-syscall-512        0.12 (   0.00%)        0.34 (-179.38%)
Max       unixbench-syscall-1    2970859.30 (   0.00%)  2975972.90 (  -0.17%)
Max       unixbench-syscall-512  8072312.30 (   0.00%)  7461732.50 (   7.56%)

o NPS4

kernel:                             tip                  sis_short
Min       unixbench-syscall-1    2971799.80 (   0.00%)  2974601.20 (  -0.09%)
Min       unixbench-syscall-512  7824196.90 (   0.00%)  8242480.10 (  -5.35%)
Amean     unixbench-syscall-1    2973045.43 (   0.00%)  2974739.93 *  -0.06%*
Amean     unixbench-syscall-512  7826302.17 (   0.00%)  8261295.03 *  -5.56%*
CoeffVar  unixbench-syscall-1          0.04 (   0.00%)        0.00 (  86.39%)
CoeffVar  unixbench-syscall-512        0.03 (   0.00%)        0.37 (-1376.49%)
Max       unixbench-syscall-1    2973786.50 (   0.00%)  2974895.30 (  -0.04%)
Max       unixbench-syscall-512  7828115.90 (   0.00%)  8296830.40 (  -5.99%)


-> unixbench-pipe

o NPS1

kernel:                               tip                  sis_short
Min       unixbench-pipe-1        2894765.30 (   0.00%)    2904821.00 (   0.35%)
Min       unixbench-pipe-512    329818573.50 (   0.00%)  329565756.00 (  -0.08%)
Hmean     unixbench-pipe-1        2898803.38 (   0.00%)    2911189.71 *   0.43%*
Hmean     unixbench-pipe-512    330226401.69 (   0.00%)  330389884.94 (   0.05%)
CoeffVar  unixbench-pipe-1              0.14 (   0.00%)          0.22 ( -62.25%)
CoeffVar  unixbench-pipe-512            0.11 (   0.00%)          0.24 (-126.10%)
Max       unixbench-pipe-1        2902691.20 (   0.00%)    2917740.00 (   0.52%)
Max       unixbench-pipe-512    330440132.10 (   0.00%)  331162497.90 (   0.22%)

o NPS2

kernel:                               tip                   sis_short
Min       unixbench-pipe-1        2895327.90 (   0.00%)    2905421.90 (   0.35%)
Min       unixbench-pipe-512    328350065.60 (   0.00%)  330137916.90 (   0.54%)
Hmean     unixbench-pipe-1        2899129.86 (   0.00%)    2910562.69 *   0.39%*
Hmean     unixbench-pipe-512    329436096.80 (   0.00%)  330509036.17 (   0.33%)
CoeffVar  unixbench-pipe-1              0.12 (   0.00%)          0.20 ( -70.84%)
CoeffVar  unixbench-pipe-512            0.30 (   0.00%)          0.10 (  65.00%)
Max       unixbench-pipe-1        2901619.40 (   0.00%)    2916758.50 (   0.52%)
Max       unixbench-pipe-512    330239044.10 (   0.00%)  330814020.50 (   0.17%)

o NPS4

kernel:                               tip                   sis_short
Min       unixbench-pipe-1        2901525.60 (   0.00%)    2909864.00 (   0.29%)
Min       unixbench-pipe-512    330265873.90 (   0.00%)  330543034.40 (   0.08%)
Hmean     unixbench-pipe-1        2906184.70 (   0.00%)    2912725.52 *   0.23%*
Hmean     unixbench-pipe-512    330854683.27 (   0.00%)  331540275.79 (   0.21%)
CoeffVar  unixbench-pipe-1              0.14 (   0.00%)          0.09 (  39.44%)
CoeffVar  unixbench-pipe-512            0.16 (   0.00%)          0.27 ( -73.84%)
Max       unixbench-pipe-1        2909154.50 (   0.00%)    2914249.80 (   0.18%)
Max       unixbench-pipe-512    331245477.30 (   0.00%)  332305755.00 (   0.32%)

-> unixbench-spawn

o NPS1

kernel:                             tip                  sis_short
Min       unixbench-spawn-1       6536.50 (   0.00%)     6458.00 (  -1.20%)
Min       unixbench-spawn-512    72571.40 (   0.00%)    91525.90 (  26.12%)
Hmean     unixbench-spawn-1       6811.16 (   0.00%)     6510.74 (  -4.41%)
Hmean     unixbench-spawn-512    72801.77 (   0.00%)    91829.95 *  26.14%*
CoeffVar  unixbench-spawn-1          3.69 (   0.00%)        1.00 (  72.93%)
CoeffVar  unixbench-spawn-512        0.27 (   0.00%)        0.41 ( -50.84%)
Max       unixbench-spawn-1       7021.00 (   0.00%)     6583.60 (  -6.23%)
Max       unixbench-spawn-512    72927.00 (   0.00%)    92257.50 (  26.51%)

o NPS2

kernel:                             tip                  sis_short
Min       unixbench-spawn-1       7042.20 (   0.00%)     7411.00 (   5.24%)
Min       unixbench-spawn-512    85571.60 (   0.00%)    89549.50 (   4.65%)
Hmean     unixbench-spawn-1       7199.01 (   0.00%)     7553.53 *   4.92%*
Hmean     unixbench-spawn-512    85717.77 (   0.00%)    89751.68 *   4.71%*
CoeffVar  unixbench-spawn-1          3.50 (   0.00%)        1.68 (  51.98%)
CoeffVar  unixbench-spawn-512        0.20 (   0.00%)        0.28 ( -36.60%)
Max       unixbench-spawn-1       7495.00 (   0.00%)     7650.40 (   2.07%)
Max       unixbench-spawn-512    85909.20 (   0.00%)    90028.30 (   4.79%)

o NPS4

kernel:                             tip                  sis_short
Min       unixbench-spawn-1       7521.90 (   0.00%)     8404.10 (  11.73%)
Min       unixbench-spawn-512    84245.70 (   0.00%)    91260.20 (   8.33%)
Hmean     unixbench-spawn-1       7659.12 (   0.00%)     8526.01 *  11.32%*
Hmean     unixbench-spawn-512    84908.77 (   0.00%)    91365.07 *   7.60%*
CoeffVar  unixbench-spawn-1          1.92 (   0.00%)        2.06 (  -7.21%)
CoeffVar  unixbench-spawn-512        0.76 (   0.00%)        0.10 (  86.60%)
Max       unixbench-spawn-1       7815.40 (   0.00%)     8729.60 (  11.70%)
Max       unixbench-spawn-512    85532.90 (   0.00%)    91437.30 (   6.90%)

-> unixbench-execl

o NPS1

kernel:                             tip                  sis_short
Min       unixbench-execl-1       5421.50 (   0.00%)     5466.40 (   0.83%)
Min       unixbench-execl-512    11213.50 (   0.00%)    11720.30 (   4.52%)
Hmean     unixbench-execl-1       5443.75 (   0.00%)     5468.53 (   0.46%)
Hmean     unixbench-execl-512    11311.94 (   0.00%)    11809.97 *   4.40%*
CoeffVar  unixbench-execl-1          0.38 (   0.00%)        0.04 (  89.57%)
CoeffVar  unixbench-execl-512        1.03 (   0.00%)        0.74 (  27.60%)
Max       unixbench-execl-1       5461.90 (   0.00%)     5470.70 (   0.16%)
Max       unixbench-execl-512    11440.40 (   0.00%)    11895.60 (   3.98%)

o NPS2

kernel:                             tip                  sis_short
Min       unixbench-execl-1       5089.10 (   0.00%)     5119.50 (   0.60%)
Min       unixbench-execl-512    11772.70 (   0.00%)    11591.40 (  -1.54%)
Hmean     unixbench-execl-1       5321.65 (   0.00%)     5251.49 (  -1.32%)
Hmean     unixbench-execl-512    12201.73 (   0.00%)    11665.67 (  -4.39%)
CoeffVar  unixbench-execl-1          3.87 (   0.00%)        2.33 (  39.91%)
CoeffVar  unixbench-execl-512        6.23 (   0.00%)        1.04 (  83.38%)
Max       unixbench-execl-1       5453.90 (   0.00%)     5359.00 (  -1.74%)
Max       unixbench-execl-512    13111.60 (   0.00%)    11805.80 (  -9.96%)

o NPS4

kernel:                             tip                  sis_short
Min       unixbench-execl-1       5099.40 (   0.00%)     5352.70 (   4.97%)
Min       unixbench-execl-512    11692.80 (   0.00%)    13368.20 (  14.33%)
Hmean     unixbench-execl-1       5136.86 (   0.00%)     5404.31 *   5.21%*
Hmean     unixbench-execl-512    12053.71 (   0.00%)    14018.53 *  16.30%*
CoeffVar  unixbench-execl-1          1.05 (   0.00%)        0.84 (  20.12%)
CoeffVar  unixbench-execl-512        3.85 (   0.00%)        5.29 ( -37.45%)
Max       unixbench-execl-1       5198.70 (   0.00%)     5434.90 (   4.54%)
Max       unixbench-execl-512    12585.70 (   0.00%)    14839.80 (  17.91%)

On 10/23/2022 9:01 PM, Chen Yu wrote:
> At LPC 2022 Real-time and Scheduling Micro Conference we presented
> the cross CPU wakeup issue. This patch is a text version of the
> talk, and hopefully, we can clarify the problem and appreciate any
> feedback.
> 
> The main purpose of this change is to avoid excessive cross-CPU
> wakeups when the system is busy. Please refer to the commit log
> of [PATCH 2/2] for details.
> 
> This patch set is composed of two parts. The first part is to introduce
> the definition of a short-duration task. The second part leverages the
> first part to choose a CPU where only one short-duration task is running
> on. This CPU is chosen as the candidate to place a woken task.
> 
> This version is modified based on the following feedback on v1:
> 1. Tim suggested raising the bar to choose a CPU with a short-duration
>    task, by checking if the short-duration task is the only runnable
>    task on the target CPU.
> 2. To address Peter's concern: would this patch inhibit spreading the
>    workload when there are idle CPUs around? The patch would only take
>    effect when the system is relatively busy, and only choose the CPU
>    where only one short-duration task is running.
> 3. Prateek, Honglei and Hillf suggested preferring the idle previous CPU
>    over a CPU with a short-duration task running.
> 
> v1 link: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/
> 
> Chen Yu (2):
>   sched/fair: Introduce short duration task check
>   sched/fair: Choose the CPU where short task is running during wake up
> 
>  include/linux/sched.h   |  8 ++++
>  kernel/sched/core.c     |  2 +
>  kernel/sched/fair.c     | 99 +++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h |  1 +
>  4 files changed, 110 insertions(+)
> 

Except for schbench with 128 workers in NPS4 mode, I do not
see any large regressions for the above workloads, and I do
see small to moderate gains overall for most workloads, even
the larger ones. I'll try to get data for more workloads, but
overall the idea seems promising. I'll also get some numbers
with the changes Peter suggested on Patch 1.

If there is any specific workload you would like me to run
on the test machine, please do let me know.
--
Thanks and Regards,
Prateek

Chen Yu Nov. 30, 2022, 4:03 a.m. UTC | #2

On 2022-11-22 at 16:01:42 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> I've tested the v2 series on a dual-socket Zen3 system (2 x 64C/128T) and
> the results are largely positive.
>
Thank you Prateek, and sorry for the late response.
> tl;dr
> 
> o Hackbench results are mostly similar to tip.
> o schbench sees improvements to tail latency when the system is
>   loaded in the NPS1 case, but I do see one small regression for
>   128 workers in NPS4 mode.
> o tbench sees small gains in NPS2 and NPS4 modes.
> o Stream and Spec-JBB results remain the same as tip.
> o ycsb-mongodb sees small gains in NPS2 and NPS4 modes.
> o unixbench results see small to moderate gains overall.
> 
> I'll leave the detailed results below:
> 
> Following are the results from running standard benchmarks on a
> dual socket Zen3 (2 x 64C/128T) machine configured in different
> NPS modes.
> 
> NPS modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist across 2 sockets.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 224-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist across 2 sockets.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  224-239
>     Node 7: 112-127, 240-255
> 
> Benchmark Results:
> 
> Kernel versions:
> - tip:          5.19.0 tip sched/core
> - sis_short: 	5.19.0 tip sched/core + this series
> 
> When we started testing, the tip was at:
> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
> 
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
> 
> o NPS1
> 
> Test:			tip			sis_short
>  1-groups:	   4.06 (0.00 pct)	   4.02 (0.98 pct)
>  2-groups:	   4.76 (0.00 pct)	   4.71 (1.05 pct)
>  4-groups:	   5.22 (0.00 pct)	   5.07 (2.87 pct)
>  8-groups:	   5.35 (0.00 pct)	   5.31 (0.74 pct)
> 16-groups:	   7.21 (0.00 pct)	   7.22 (-0.13 pct)
> 
> o NPS2
> 
> Test:			tip			sis_short
>  1-groups:	   4.09 (0.00 pct)	   4.05 (0.97 pct)
>  2-groups:	   4.70 (0.00 pct)	   4.69 (0.21 pct)
>  4-groups:	   5.05 (0.00 pct)	   4.95 (1.98 pct)
>  8-groups:	   5.35 (0.00 pct)	   5.27 (1.49 pct)
> 16-groups:	   6.37 (0.00 pct)	   6.60 (-3.61 pct)
> 
> o NPS4
> 
> Test:			tip			sis_short
>  1-groups:	   4.07 (0.00 pct)	   4.13 (-1.47 pct)
>  2-groups:	   4.65 (0.00 pct)	   4.71 (-1.29 pct)
>  4-groups:	   5.13 (0.00 pct)	   5.05 (1.55 pct)
>  8-groups:	   5.47 (0.00 pct)	   5.44 (0.54 pct)
> 16-groups:	   6.82 (0.00 pct)	   6.72 (1.46 pct)
> 
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
> 
> o NPS1
> 
> #workers:	tip			sis_short
>   1:	  33.00 (0.00 pct)	  34.00 (-3.03 pct)
>   2:	  35.00 (0.00 pct)	  36.00 (-2.85 pct)
>   4:	  39.00 (0.00 pct)	  40.00 (-2.56 pct)
>   8:	  49.00 (0.00 pct)	  47.00 (4.08 pct)
>  16:	  63.00 (0.00 pct)	  64.00 (-1.58 pct)
>  32:	 109.00 (0.00 pct)	 106.00 (2.75 pct)
>  64:	 208.00 (0.00 pct)	 214.00 (-2.88 pct)
> 128:	 559.00 (0.00 pct)	 497.00 (11.09 pct)
> 256:	 45888.00 (0.00 pct)	 47424.00 (-3.34 pct)
> 512:	 80000.00 (0.00 pct)	 77952.00 (2.56 pct)
> 
> o NPS2
> 
> #workers:	tip			sis_short
>   1:	  30.00 (0.00 pct)	  31.00 (-3.33 pct)
>   2:	  37.00 (0.00 pct)	  36.00 (2.70 pct)
>   4:	  39.00 (0.00 pct)	  40.00 (-2.56 pct)
>   8:	  51.00 (0.00 pct)	  50.00 (1.96 pct)
>  16:	  67.00 (0.00 pct)	  67.00 (0.00 pct)
>  32:	 117.00 (0.00 pct)	 114.00 (2.56 pct)
>  64:	 216.00 (0.00 pct)	 214.00 (0.92 pct)
> 128:	 529.00 (0.00 pct)	 597.00 (-12.85 pct)    *
> 128:     388.00 (0.00 pct)       382.00 (1.54 pct)      [Verification Run]
> 256:	 47040.00 (0.00 pct)	 47424.00 (-0.81 pct)
> 512:	 84864.00 (0.00 pct)	 81792.00 (3.61 pct)
> 
> o NPS4
> 
> #workers:	tip			sis_short
>   1:	  23.00 (0.00 pct)	  33.00 (-43.47 pct)
>   2:	  28.00 (0.00 pct)	  27.00 (3.57 pct)
>   4:	  41.00 (0.00 pct)	  37.00 (9.75 pct)
>   8:	  60.00 (0.00 pct)	  56.00 (6.66 pct)
>  16:	  71.00 (0.00 pct)	  71.00 (0.00 pct)
>  32:	 117.00 (0.00 pct)	 114.00 (2.56 pct)
>  64:	 227.00 (0.00 pct)	 218.00 (3.96 pct)
> 128:	 545.00 (0.00 pct)	 747.00 (-37.06 pct)    *
> 128:	 383.00 (0.00 pct)	 412.00 (-7.85 pct)    [Verification Run]
> 256:	 45632.00 (0.00 pct)	 47296.00 (-3.64 pct)
> 512:	 81024.00 (0.00 pct)	 78720.00 (2.84 pct)
> 
> Note: For lower worker count, schbench can show run to
> run variation depending on external factors. Regression
> for lower worker count can be ignored. The results are
> included to spot any large blow up in the tail latency
> for larger worker count.
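
For readers checking the tables: the "pct" columns appear to be the signed change relative to tip, oriented so that a positive number means sis_short did better, and truncated (not rounded) to two decimals. A minimal sketch of that convention follows; the function name `pct_change` and the `lower_is_better` flag are hypothetical, and the truncation rule is inferred from the quoted rows rather than stated in the report.

```python
import math


def pct_change(baseline: float, patched: float, lower_is_better: bool) -> float:
    """Signed percent change vs. baseline; positive means the patched kernel did better."""
    # For time/latency metrics (hackbench, schbench) a smaller value is a win,
    # for throughput metrics (tbench) a larger value is a win.
    delta = (baseline - patched) if lower_is_better else (patched - baseline)
    pct = delta / baseline * 100.0
    # Assumption: the quoted figures look truncated toward zero at two decimals.
    return math.trunc(pct * 100.0) / 100.0


if __name__ == "__main__":
    # schbench NPS2, 128 workers: tip 529.00, sis_short 597.00 (latency)
    print(pct_change(529.00, 597.00, lower_is_better=True))
    # hackbench NPS1, 1 group: tip 4.06, sis_short 4.02 (time)
    print(pct_change(4.06, 4.02, lower_is_better=True))
    # tbench NPS1, 1 client: tip 578.37, sis_short 568.72 (throughput)
    print(pct_change(578.37, 568.72, lower_is_better=False))
```

Read this way, a negative entry in the schbench table is a latency increase, while a negative entry in the tbench table is a throughput drop.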
> 
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
> 
> o NPS1
> 
> Clients:	tip			sis_short
>     1	 578.37 (0.00 pct)	 568.72 (-1.66 pct)
>     2	 1062.09 (0.00 pct)	 1055.45 (-0.62 pct)
>     4	 1800.62 (0.00 pct)	 1833.37 (1.81 pct)
>     8	 3211.02 (0.00 pct)	 3124.95 (-2.68 pct)
>    16	 4848.92 (0.00 pct)	 4823.27 (-0.52 pct)
>    32	 9091.36 (0.00 pct)	 9301.80 (2.31 pct)
>    64	 15454.01 (0.00 pct)	 14639.52 (-5.27 pct)   *
>    64	 14890.79 (0.00 pct)	 14314.95 (-3.86 pct)   [Verification Run]
>   128	 3511.33 (0.00 pct)	 2740.46 (-21.95 pct)   *
>   128	 19750.19 (0.00 pct)	 20006.42 (1.29 pct)    [Verification Run]
>   256	 50019.32 (0.00 pct)	 50384.18 (0.72 pct)
>   512	 44317.68 (0.00 pct)	 44155.90 (-0.36 pct)
>  1024	 41200.85 (0.00 pct)	 41242.49 (0.10 pct)
> 
> o NPS2
> 
> Clients:	tip			sis_short
>     1	 576.05 (0.00 pct)	 578.08 (0.35 pct)
>     2	 1037.68 (0.00 pct)	 1098.68 (5.87 pct)
>     4	 1818.13 (0.00 pct)	 1838.79 (1.13 pct)
>     8	 3004.16 (0.00 pct)	 3071.73 (2.24 pct)
>    16	 4520.11 (0.00 pct)	 4820.67 (6.64 pct)
>    32	 8624.23 (0.00 pct)	 9264.14 (7.41 pct)
>    64	 14886.75 (0.00 pct)	 14976.91 (0.60 pct)
>   128	 20602.00 (0.00 pct)	 20247.46 (-1.72 pct)
>   256	 45566.83 (0.00 pct)	 48786.00 (7.06 pct)
>   512	 42717.49 (0.00 pct)	 44678.97 (4.59 pct)
>  1024	 40936.61 (0.00 pct)	 40866.32 (-0.17 pct)
> 
> o NPS4
> 
> Clients:	tip			sis_short
>     1	 576.36 (0.00 pct)	 588.43 (2.09 pct)
>     2	 1044.26 (0.00 pct)	 1074.47 (2.89 pct)
>     4	 1839.77 (0.00 pct)	 1852.10 (0.67 pct)
>     8	 3043.53 (0.00 pct)	 3235.32 (6.30 pct)
>    16	 5207.54 (0.00 pct)	 4804.41 (-7.74 pct)    *
>    16    4620.29 (0.00 pct)      4714.69 (2.04 pct)     [Verification Run]
>    32	 9263.86 (0.00 pct)	 8238.55 (-11.06 pct)   *
>    32	 9263.86 (0.00 pct)	 9443.77 (1.94 pct)     [Verification Run]
>    64	 14959.66 (0.00 pct)	 15321.44 (2.41 pct)
>   128	 20698.65 (0.00 pct)	 16806.27 (-18.80 pct)  *
>   128    20698.65 (0.00 pct)     20978.42 (1.35 pct)    [Verification Run]
>   256	 46666.21 (0.00 pct)	 49787.15 (6.68 pct)
>   512	 41532.80 (0.00 pct)	 44738.18 (7.71 pct)
>  1024	 39459.49 (0.00 pct)	 41473.96 (5.10 pct)
> 
> Note: On the tested kernel, with 128 clients, tbench can
> run into a bottleneck during C2 exit. More details can be
> found at:
> https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/
> This issue has been fixed in v6.0 but was not part of the
> tip kernel when I started testing. This data point has
> been rerun with C2 disabled to get representative results.
This reminds me that I previously tested with C-states deeper than C1
disabled and with turbo disabled, to reduce run-to-run deviation. May I know
whether all C-states and turbo were enabled in your tests other than tbench?
> 
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
> 
> o NPS1
> 
> - 10 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 328419.14 (0.00 pct)	 336740.00 (2.53 pct)
> Scale:	 206071.21 (0.00 pct)	 212682.17 (3.20 pct)
>   Add:	 235271.48 (0.00 pct)	 244104.35 (3.75 pct)
> Triad:	 253175.80 (0.00 pct)	 251776.26 (-0.55 pct)
> 
> - 100 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 328209.61 (0.00 pct)	 340132.12 (3.63 pct)
> Scale:	 216310.13 (0.00 pct)	 218811.70 (1.15 pct)
>   Add:	 244417.83 (0.00 pct)	 246349.22 (0.79 pct)
> Triad:	 237508.83 (0.00 pct)	 260160.20 (9.53 pct)
> 
> o NPS2
> 
> - 10 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 336503.88 (0.00 pct)	 319171.80 (-5.15 pct)
> Scale:	 218035.23 (0.00 pct)	 219061.13 (0.47 pct)
>   Add:	 257677.42 (0.00 pct)	 256776.22 (-0.34 pct)
> Triad:	 268872.37 (0.00 pct)	 263751.14 (-1.90 pct)
> 
> - 100 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 332304.34 (0.00 pct)	 320547.46 (-3.53 pct)
> Scale:	 223421.60 (0.00 pct)	 220418.63 (-1.34 pct)
>   Add:	 252363.56 (0.00 pct)	 254553.30 (0.86 pct)
> Triad:	 266687.56 (0.00 pct)	 260009.00 (-2.50 pct)
> 
> o NPS4
> 
> - 10 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 353515.62 (0.00 pct)	 338973.78 (-4.11 pct)
> Scale:	 228854.37 (0.00 pct)	 230319.08 (0.64 pct)
>   Add:	 254942.12 (0.00 pct)	 247794.21 (-2.80 pct)
> Triad:	 270521.87 (0.00 pct)	 261432.32 (-3.36 pct)
> 
> - 100 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 374520.81 (0.00 pct)	 363272.21 (-3.00 pct)
> Scale:	 246280.23 (0.00 pct)	 241457.83 (-1.95 pct)
>   Add:	 262772.72 (0.00 pct)	 261924.44 (-0.32 pct)
> Triad:	 283740.92 (0.00 pct)	 274791.15 (-3.15 pct)
> 
> ~~~~~~~~~~~~~~~~~
> ~ Spec-JBB NPS1 ~
> ~~~~~~~~~~~~~~~~~
> 
> --------------------------------------------------
> |   Throughput  |     tip     |     sis_short    |
> --------------------------------------------------
> |    Max-jOPS   |     100%    |      98.84%      |
> | Critical-jOPS |     100%    |      100.31%     |
> --------------------------------------------------
> 
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
> 
> o NPS1
> 
> tip:                    131696.33 (var: 2.03%)
> sis_short:              130844.67 (var: 2.55%)  (-0.64%)
> 
> o NPS2:
> 
> tip:                    129895.33 (var: 2.34%)
> sis_short:              133104.33 (var: 1.65%)  (+2.47%)
> 
> o NPS4:
> 
> tip:                    131165.00 (var: 1.06%)
> sis_short:              138180.67 (var: 0.83%)  (+5.34%)
> 
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
> 
> -> unixbench-dhry2reg
> 
> o NPS1
> 
> kernel:                                        tip                          sis_short
> Min       unixbench-dhry2reg-1            48876615.50 (   0.00%)          48489507.40 (  -0.79%)
> Min       unixbench-dhry2reg-512        6260344658.90 (   0.00%)        6253084311.60 (  -0.12%)
> Hmean     unixbench-dhry2reg-1            49299721.81 (   0.00%)          49014780.04 (  -0.58%)
> Hmean     unixbench-dhry2reg-512        6267459427.19 (   0.00%)        6261978461.64 (  -0.09%)
> CoeffVar  unixbench-dhry2reg-1                   0.90 (   0.00%)                 0.98 (  -9.38%)
> CoeffVar  unixbench-dhry2reg-512                 0.10 (   0.00%)                 0.17 ( -61.99%)
> Max       unixbench-dhry2reg-1            49758806.60 (   0.00%)          49428847.90 (  -0.66%)
> Max       unixbench-dhry2reg-512        6273024869.70 (   0.00%)        6273555460.00 (   0.01%)
> 
> o NPS2
> 
> kernel:                                        tip                          sis_short
> Min       unixbench-dhry2reg-1            48828251.70 (   0.00%)          48591509.40 (  -0.48%)
> Min       unixbench-dhry2reg-512        6244987739.10 (   0.00%)        6254966248.00 (   0.16%)
> Hmean     unixbench-dhry2reg-1            48869882.65 (   0.00%)          49230596.10 (   0.74%)
> Hmean     unixbench-dhry2reg-512        6261073948.84 (   0.00%)        6260685008.60 (  -0.01%)
> CoeffVar  unixbench-dhry2reg-1                   0.08 (   0.00%)                 1.20 (-1347.66%)
> CoeffVar  unixbench-dhry2reg-512                 0.23 (   0.00%)                 0.09 (  59.12%)
> Max       unixbench-dhry2reg-1            48909163.40 (   0.00%)          49752650.10 (   1.72%)
> Max       unixbench-dhry2reg-512        6271411453.90 (   0.00%)        6266517108.00 (  -0.08%)
> 
> o NPS4
> 
> kernel:                                        tip                          sis_short
> Min       unixbench-dhry2reg-1            48523981.30 (   0.00%)          48728886.20 (   0.42%)
> Min       unixbench-dhry2reg-512        6253738837.10 (   0.00%)        6260870171.70 (   0.11%)
> Hmean     unixbench-dhry2reg-1            48781044.09 (   0.00%)          48969711.29 (   0.39%)
> Hmean     unixbench-dhry2reg-512        6264428474.90 (   0.00%)        6277327761.28 (   0.21%)
> CoeffVar  unixbench-dhry2reg-1                   0.46 (   0.00%)                 0.43 (   6.91%)
> CoeffVar  unixbench-dhry2reg-512                 0.17 (   0.00%)                 0.29 ( -70.82%)
> Max       unixbench-dhry2reg-1            48925665.20 (   0.00%)          49091708.50 (   0.34%)
> Max       unixbench-dhry2reg-512        6274958506.80 (   0.00%)        6296828879.20 (   0.35%)
> 
> -> unixbench-syscall
> 
> o NPS1
> 
> kernel:                             tip                  sis_short
> Min       unixbench-syscall-1    2975654.80 (   0.00%)  2971008.50 (   0.16%)
> Min       unixbench-syscall-512  7840226.50 (   0.00%)  6586485.10 (  15.99%)
> Amean     unixbench-syscall-1    2976326.47 (   0.00%)  2971920.50 *   0.15%*
> Amean     unixbench-syscall-512  7850493.90 (   0.00%)  6597210.63 *  15.96%*
> CoeffVar  unixbench-syscall-1          0.03 (   0.00%)        0.03 ( -14.26%)
> CoeffVar  unixbench-syscall-512        0.13 (   0.00%)        0.27 (-103.14%)
> Max       unixbench-syscall-1    2977279.70 (   0.00%)  2972935.80 (   0.15%)
> Max       unixbench-syscall-512  7860838.90 (   0.00%)  6617515.40 (  15.82%)
> 
> o NPS2
> 
> kernel:                             tip                  sis_short
> Min       unixbench-syscall-1    2969863.60 (   0.00%)  2974771.70 (  -0.17%)
> Min       unixbench-syscall-512  8053157.60 (   0.00%)  7411223.90 (   7.97%)
> Amean     unixbench-syscall-1    2970462.30 (   0.00%)  2975278.63 *  -0.16%*
> Amean     unixbench-syscall-512  8061454.50 (   0.00%)  7437679.30 *   7.74%*
> CoeffVar  unixbench-syscall-1          0.02 (   0.00%)        0.02 ( -17.72%)
> CoeffVar  unixbench-syscall-512        0.12 (   0.00%)        0.34 (-179.38%)
> Max       unixbench-syscall-1    2970859.30 (   0.00%)  2975972.90 (  -0.17%)
> Max       unixbench-syscall-512  8072312.30 (   0.00%)  7461732.50 (   7.56%)
> 
> o NPS4
> 
> kernel:                             tip                  sis_short
> Min       unixbench-syscall-1    2971799.80 (   0.00%)  2974601.20 (  -0.09%)
> Min       unixbench-syscall-512  7824196.90 (   0.00%)  8242480.10 (  -5.35%)
> Amean     unixbench-syscall-1    2973045.43 (   0.00%)  2974739.93 *  -0.06%*
> Amean     unixbench-syscall-512  7826302.17 (   0.00%)  8261295.03 *  -5.56%*
> CoeffVar  unixbench-syscall-1          0.04 (   0.00%)        0.00 (  86.39%)
> CoeffVar  unixbench-syscall-512        0.03 (   0.00%)        0.37 (-1376.49%)
> Max       unixbench-syscall-1    2973786.50 (   0.00%)  2974895.30 (  -0.04%)
> Max       unixbench-syscall-512  7828115.90 (   0.00%)  8296830.40 (  -5.99%)
> 
> 
> -> unixbench-pipe
> 
> o NPS1
> 
> kernel:                               tip                  sis_short
> Min       unixbench-pipe-1        2894765.30 (   0.00%)    2904821.00 (   0.35%)
> Min       unixbench-pipe-512    329818573.50 (   0.00%)  329565756.00 (  -0.08%)
> Hmean     unixbench-pipe-1        2898803.38 (   0.00%)    2911189.71 *   0.43%*
> Hmean     unixbench-pipe-512    330226401.69 (   0.00%)  330389884.94 (   0.05%)
> CoeffVar  unixbench-pipe-1              0.14 (   0.00%)          0.22 ( -62.25%)
> CoeffVar  unixbench-pipe-512            0.11 (   0.00%)          0.24 (-126.10%)
> Max       unixbench-pipe-1        2902691.20 (   0.00%)    2917740.00 (   0.52%)
> Max       unixbench-pipe-512    330440132.10 (   0.00%)  331162497.90 (   0.22%)
> 
> o NPS2
> 
> kernel:                               tip                   sis_short
> Min       unixbench-pipe-1        2895327.90 (   0.00%)    2905421.90 (   0.35%)
> Min       unixbench-pipe-512    328350065.60 (   0.00%)  330137916.90 (   0.54%)
> Hmean     unixbench-pipe-1        2899129.86 (   0.00%)    2910562.69 *   0.39%*
> Hmean     unixbench-pipe-512    329436096.80 (   0.00%)  330509036.17 (   0.33%)
> CoeffVar  unixbench-pipe-1              0.12 (   0.00%)          0.20 ( -70.84%)
> CoeffVar  unixbench-pipe-512            0.30 (   0.00%)          0.10 (  65.00%)
> Max       unixbench-pipe-1        2901619.40 (   0.00%)    2916758.50 (   0.52%)
> Max       unixbench-pipe-512    330239044.10 (   0.00%)  330814020.50 (   0.17%)
> 
> o NPS4
> 
> kernel:                               tip                   sis_short
> Min       unixbench-pipe-1        2901525.60 (   0.00%)    2909864.00 (   0.29%)
> Min       unixbench-pipe-512    330265873.90 (   0.00%)  330543034.40 (   0.08%)
> Hmean     unixbench-pipe-1        2906184.70 (   0.00%)    2912725.52 *   0.23%*
> Hmean     unixbench-pipe-512    330854683.27 (   0.00%)  331540275.79 (   0.21%)
> CoeffVar  unixbench-pipe-1              0.14 (   0.00%)          0.09 (  39.44%)
> CoeffVar  unixbench-pipe-512            0.16 (   0.00%)          0.27 ( -73.84%)
> Max       unixbench-pipe-1        2909154.50 (   0.00%)    2914249.80 (   0.18%)
> Max       unixbench-pipe-512    331245477.30 (   0.00%)  332305755.00 (   0.32%)
> 
> -> unixbench-spawn
> 
> o NPS1
> 
> kernel:                             tip                  sis_short
> Min       unixbench-spawn-1       6536.50 (   0.00%)     6458.00 (  -1.20%)
> Min       unixbench-spawn-512    72571.40 (   0.00%)    91525.90 (  26.12%)
> Hmean     unixbench-spawn-1       6811.16 (   0.00%)     6510.74 (  -4.41%)
> Hmean     unixbench-spawn-512    72801.77 (   0.00%)    91829.95 *  26.14%*
> CoeffVar  unixbench-spawn-1          3.69 (   0.00%)        1.00 (  72.93%)
> CoeffVar  unixbench-spawn-512        0.27 (   0.00%)        0.41 ( -50.84%)
> Max       unixbench-spawn-1       7021.00 (   0.00%)     6583.60 (  -6.23%)
> Max       unixbench-spawn-512    72927.00 (   0.00%)    92257.50 (  26.51%)
> 
> o NPS2
> 
> kernel:                             tip                  sis_short
> Min       unixbench-spawn-1       7042.20 (   0.00%)     7411.00 (   5.24%)
> Min       unixbench-spawn-512    85571.60 (   0.00%)    89549.50 (   4.65%)
> Hmean     unixbench-spawn-1       7199.01 (   0.00%)     7553.53 *   4.92%*
> Hmean     unixbench-spawn-512    85717.77 (   0.00%)    89751.68 *   4.71%*
> CoeffVar  unixbench-spawn-1          3.50 (   0.00%)        1.68 (  51.98%)
> CoeffVar  unixbench-spawn-512        0.20 (   0.00%)        0.28 ( -36.60%)
> Max       unixbench-spawn-1       7495.00 (   0.00%)     7650.40 (   2.07%)
> Max       unixbench-spawn-512    85909.20 (   0.00%)    90028.30 (   4.79%)
> 
> o NPS4
> 
> kernel:                             tip                  sis_short
> Min       unixbench-spawn-1       7521.90 (   0.00%)     8404.10 (  11.73%)
> Min       unixbench-spawn-512    84245.70 (   0.00%)    91260.20 (   8.33%)
> Hmean     unixbench-spawn-1       7659.12 (   0.00%)     8526.01 *  11.32%*
> Hmean     unixbench-spawn-512    84908.77 (   0.00%)    91365.07 *   7.60%*
> CoeffVar  unixbench-spawn-1          1.92 (   0.00%)        2.06 (  -7.21%)
> CoeffVar  unixbench-spawn-512        0.76 (   0.00%)        0.10 (  86.60%)
> Max       unixbench-spawn-1       7815.40 (   0.00%)     8729.60 (  11.70%)
> Max       unixbench-spawn-512    85532.90 (   0.00%)    91437.30 (   6.90%)
> 
> -> unixbench-execl
> 
> o NPS1
> 
> kernel:                             tip                  sis_short
> Min       unixbench-execl-1       5421.50 (   0.00%)     5466.40 (   0.83%)
> Min       unixbench-execl-512    11213.50 (   0.00%)    11720.30 (   4.52%)
> Hmean     unixbench-execl-1       5443.75 (   0.00%)     5468.53 (   0.46%)
> Hmean     unixbench-execl-512    11311.94 (   0.00%)    11809.97 *   4.40%*
> CoeffVar  unixbench-execl-1          0.38 (   0.00%)        0.04 (  89.57%)
> CoeffVar  unixbench-execl-512        1.03 (   0.00%)        0.74 (  27.60%)
> Max       unixbench-execl-1       5461.90 (   0.00%)     5470.70 (   0.16%)
> Max       unixbench-execl-512    11440.40 (   0.00%)    11895.60 (   3.98%)
> 
> o NPS2
> 
> kernel:                             tip                  sis_short
> Min       unixbench-execl-1       5089.10 (   0.00%)     5119.50 (   0.60%)
> Min       unixbench-execl-512    11772.70 (   0.00%)    11591.40 (  -1.54%)
> Hmean     unixbench-execl-1       5321.65 (   0.00%)     5251.49 (  -1.32%)
> Hmean     unixbench-execl-512    12201.73 (   0.00%)    11665.67 (  -4.39%)
> CoeffVar  unixbench-execl-1          3.87 (   0.00%)        2.33 (  39.91%)
> CoeffVar  unixbench-execl-512        6.23 (   0.00%)        1.04 (  83.38%)
> Max       unixbench-execl-1       5453.90 (   0.00%)     5359.00 (  -1.74%)
> Max       unixbench-execl-512    13111.60 (   0.00%)    11805.80 (  -9.96%)
> 
> o NPS4
> 
> kernel:                             tip                  sis_short
> Min       unixbench-execl-1       5099.40 (   0.00%)     5352.70 (   4.97%)
> Min       unixbench-execl-512    11692.80 (   0.00%)    13368.20 (  14.33%)
> Hmean     unixbench-execl-1       5136.86 (   0.00%)     5404.31 *   5.21%*
> Hmean     unixbench-execl-512    12053.71 (   0.00%)    14018.53 *  16.30%*
> CoeffVar  unixbench-execl-1          1.05 (   0.00%)        0.84 (  20.12%)
> CoeffVar  unixbench-execl-512        3.85 (   0.00%)        5.29 ( -37.45%)
> Max       unixbench-execl-1       5198.70 (   0.00%)     5434.90 (   4.54%)
> Max       unixbench-execl-512    12585.70 (   0.00%)    14839.80 (  17.91%)
> 
> 
> Except for schbench with 128 workers in NPS4 mode, I do not
> see any large regressions for the above workloads and I do
> see small to moderate gains overall for most workloads, even
> the larger ones. I'll try to get data for more workloads but
> overall the idea seems promising. I'll also get some numbers
> with the changes Peter suggested on Patch 1.
I spent some time digging into the issue that motivated me to propose this
solution. It turns out that the issue cannot easily be solved directly,
because there seems to be an unavoidable race window, and as the number of
CPUs increases this race is hit more easily. So the current patch is an
indirect solution that avoids it. I'll send the details in v3.
> 
> If there is any specific workload you would like me to run
> on the test machine, please do let me know.
Thanks for always helping us test the patch. I'll send v3 once I get the
results, and we can discuss it then.

thanks,
Chenyu

K Prateek Nayak Dec. 2, 2022, 3:21 a.m. UTC | #3

Hello Chenyu,

On 11/30/2022 9:33 AM, Chen Yu wrote:
> On 2022-11-22 at 16:01:42 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> I've tested v2 series on a dual socket Zen3 system (2 x 64C/128T) and
>> the results are largely positive.
>>
> Thank you Prateek, and sorry for late response. 

Thank you for taking a look at the report.

>> tl;dr
>>
>> o Hackbench results are mostly similar with tip.
>> o schbench sees improvements to tail latency when the system is
>>   loaded in NPS1 case but I do see one small regression for
>>   128 workers in NPS4 mode.
>> o tbench sees small gains in NPS2 and NPS4 mode
>> o Stream and Spec-JBB results remain same as the tip.
>> o ycsb-mongodb sees small gains in NPS2 and NPS4 mode.
>> o unixbench results see small to moderate gains overall.
>>
>> I'll leave the detailed results below:
>>
>> [..snip..]
>>
>> Note: On the tested kernel, with 128 clients, tbench can
>> run into a bottleneck during C2 exit. More details can be
>> found at:
>> https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/
>> This issue has been fixed in v6.0 but was not part of the
>> tip kernel when I started testing. This data point has
>> been rerun with C2 disabled to get representative results.
> This reminds me that I previously tested with C-states deeper than C1
> disabled and with turbo disabled, to reduce run-to-run deviation. May I know
> whether all C-states and turbo were enabled in your tests other than tbench?

I do run with all C-states and turbo enabled, using the performance
governor. I can do a parallel run with C2 and turbo disabled to reduce
the chance of external factors affecting the results. In the past,
we've seen some issues come to light when running with C2 and turbo
enabled, so I have stuck with that setup. Thank you for pointing this out.
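
For anyone reproducing either configuration, a minimal read-only sketch for checking which idle states are currently enabled on a CPU. It assumes the standard Linux cpuidle sysfs layout (`cpuN/cpuidle/stateM/name` and `.../disable`); the function name `idle_state_report` and its `cpu_dir` parameter are hypothetical, and this is not the tooling used for the runs in this thread.

```python
from pathlib import Path


def idle_state_report(cpu_dir: str = "/sys/devices/system/cpu/cpu0"):
    """Return a list of (state name, is_disabled) for one CPU's cpuidle states."""
    base = Path(cpu_dir) / "cpuidle"
    if not base.is_dir():
        # No cpuidle driver exposed (or not running on Linux): nothing to report.
        return []
    report = []
    # Sort numerically so that state10 comes after state9.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        # The "disable" attribute reads as 0 when the state is allowed.
        disabled = (state / "disable").read_text().strip() != "0"
        report.append((name, disabled))
    return report


if __name__ == "__main__":
    for name, disabled in idle_state_report():
        print(f"{name}: {'disabled' if disabled else 'enabled'}")
```

The sketch only reads the attributes; the same `disable` files are what one would write to (as root, on every CPU) to turn a state off.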

>>
>> [..snip..]
>>
>> Except for schbench with 128 workers in NPS4 mode, I do not
>> see any large regressions for the above workloads and I do
>> see small to moderate gains overall for most workloads, even
>> the larger ones. I'll try to get data for more workloads but
>> overall the idea seems promising. I'll also get some numbers
>> with the changes Peter suggested on Patch 1.
> I spent some time digging into the issue that motivated me to propose this
> solution. It turns out that the issue cannot easily be solved directly,
> because there seems to be an unavoidable race window, and as the number of
> CPUs increases this race is hit more easily. So the current patch is an
> indirect solution that avoids it. I'll send the details in v3.

I see that v3 is out. Thank you for the detailed explanation and the
visualization of the bottleneck in v3 Patch 2.

>>
>> If there is any specific workload you would like me to run
>> on the test machine, please do let me know.
> Thanks for always helping us test the patch. I'll send v3 once I get the
> results, and we can discuss it then.

I've queued up runs for v3 with the same set of benchmarks reported
above. I will make a point to include results with C2 and turbo disabled
to reduce external variables.
I'll share the results on v3 in the coming week.
--
Thanks and Regards,
Prateek