[0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Message ID: cover.1695704179.git.yu.c.chen@intel.com
Series: Introduce SIS_CACHE to choose previous CPU during task wakeup

Chen Yu Sept. 26, 2023, 5:10 a.m. UTC
  RFC -> v1:
- drop RFC
- Only record the short sleeping time for each task, to better honor the
  burst sleeping tasks. (Mathieu Desnoyers)
- Keep the forward movement monotonic for runqueue's cache-hot timeout value.
  (Mathieu Desnoyers, Aaron Lu)
- Introduce a new helper function cache_hot_cpu() that considers
  rq->cache_hot_timeout. (Aaron Lu)
- Add analysis of why inhibiting task migration could bring better throughput
  for some benchmarks. (Gautham R. Shenoy)
- Choose the first cache-hot CPU if all idle CPUs in select_idle_cpu()
  are cache-hot, to avoid possible task stacking on the waker's CPU.
  (K Prateek Nayak)

Thanks for your comments and review!

----------------------------------------------------------------------

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually preferred
because it can improve cache locality. However, in many cases the
previous CPU has already been taken by other wakees, so p has to
find another idle CPU.

Inhibiting task migration while preserving the scheduler's work
conservation could benefit many workloads. Inspired by Mathieu's
proposal to limit the task migration ratio[1], this patch series
considers the task's average sleep duration. If the task is a
short-sleeping one, its previous CPU is tagged as cache-hot for a
short while. During this reservation period, other wakees are not
allowed to pick this idle CPU until a timeout. Later, when the task
is woken up again, it can find its previous CPU still idle and choose
it in select_idle_sibling().

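To illustrate the idea, below is a minimal userspace model (not the
kernel code itself; apart from rq->cache_hot_timeout and cache_hot_cpu()
mentioned in the changelog above, the names are invented for the
sketch). When a short-sleeping task goes to sleep, its previous CPU is
tagged cache-hot until "now + the task's average short sleep duration",
and the deadline only ever moves forward:

#include <stdio.h>
#include <stdint.h>

struct rq_model {
	uint64_t cache_hot_timeout;	/* ns; 0 means not reserved */
};

struct task_model {
	uint64_t avg_short_sleep_ns;	/* average of recent short sleeps */
};

/* Called when the task goes to sleep on its previous CPU. */
static void reserve_prev_cpu(struct rq_model *rq, const struct task_model *p,
			     uint64_t now_ns)
{
	uint64_t deadline = now_ns + p->avg_short_sleep_ns;

	/* Keep the timeout's forward movement monotonic. */
	if (deadline > rq->cache_hot_timeout)
		rq->cache_hot_timeout = deadline;
}

/* Other wakees consult this before picking the idle CPU. */
static int cpu_is_cache_hot(const struct rq_model *rq, uint64_t now_ns)
{
	return now_ns < rq->cache_hot_timeout;
}

int main(void)
{
	struct rq_model rq = { 0 };
	struct task_model p = { .avg_short_sleep_ns = 50000 }; /* 50us sleeper */

	reserve_prev_cpu(&rq, &p, 1000000);
	printf("hot after 40us: %d\n", cpu_is_cache_hot(&rq, 1040000)); /* 1 */
	printf("hot after 60us: %d\n", cpu_is_cache_hot(&rq, 1060000)); /* 0 */
	return 0;
}
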
This test is based on tip/sched/core, on top of
commit afc1996859a2
("sched/fair: Ratelimit update to tg->load_avg").

Commit afc1996859a2 has significantly reduced the cost of task migration,
and SIS_CACHE further reduces that cost. SIS_CACHE shows a noticeable
throughput improvement for netperf/tbench at around 100% load.

[patch 1/2] records the task's average short sleeping time in
its sched_entity structure.
[patch 2/2] introduces SIS_CACHE to skip cache-hot
idle CPUs during wakeup.

Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1

Chen Yu (2):
  sched/fair: Record the short sleeping time of a task
  sched/fair: skip the cache hot CPU in select_idle_cpu()

 include/linux/sched.h   |  3 ++
 kernel/sched/fair.c     | 86 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 4 files changed, 87 insertions(+), 4 deletions(-)
  

Comments

Ingo Molnar Sept. 27, 2023, 8 a.m. UTC | #1
* Chen Yu <yu.c.chen@intel.com> wrote:

> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually preferred
> because it can improve cache locality. However, in many cases the
> previous CPU has already been taken by other wakees, so p has to
> find another idle CPU.
> 
> Inhibiting task migration while preserving the scheduler's work
> conservation could benefit many workloads. Inspired by Mathieu's
> proposal to limit the task migration ratio[1], this patch series
> considers the task's average sleep duration. If the task is a
> short-sleeping one, its previous CPU is tagged as cache-hot for a
> short while. During this reservation period, other wakees are not
> allowed to pick this idle CPU until a timeout. Later, when the task
> is woken up again, it can find its previous CPU still idle and choose
> it in select_idle_sibling().

Yeah, so I'm not convinced about this at this stage.

By allowing a task to basically hog a CPU after it has gone idle already,
however briefly, we reduce resource utilization efficiency for the sake
of singular benchmark workloads.

In a mixed environment the cost of leaving CPUs idle longer than necessary
will show up - and none of these benchmarks show that kind of side effect
and indirect overhead.

This feature would be a lot more convincing if it tried to measure overhead
in the pathological case, not the case it's been written for.

Thanks,

	Ingo
  
Tim Chen Sept. 27, 2023, 9:34 p.m. UTC | #2
On Wed, 2023-09-27 at 10:00 +0200, Ingo Molnar wrote:
> * Chen Yu <yu.c.chen@intel.com> wrote:
> 
> > [..snip..]
> 
> Yeah, so I'm not convinced about this at this stage.
> 
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
> 
> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
> 
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.
> 

Ingo,

Mathieu's patches, which detect overly high task migration rates and
then rate-limit migration, are a way to detect that tasks are
frantically playing CPU musical chairs and are in a pathological state.

Would the migration rate be a reasonable indicator that we need to do
something like the SIS_CACHE proposal to reduce pathological migrations,
so the tasks don't get jerked all over? Or do you have some other,
better indicators in mind?

We did some experiments on the OLTP workload on a 112 core 2 socket
SPR machine. The OLTP workload has a mixture of threads
handling database updates on disks and handling transaction
queries over the network.

For Mathieu's original task migration rate limit patches,
we saw a 1.2% improvement, and for Chen Yu's SIS_CACHE proposal, we
saw a 0.7% improvement. The system runs at ~94% busy, so it is under
high utilization. The variation of this workload is less than 0.2%.
There are improvements for such a mixed workload, though not as large
as for the microbenchmarks. These data are preliminary and we are
still doing more experiments.

For the OLTP experiments, each socket's 64 cores are divided into
sub-NUMA clusters of 4 nodes with 16 cores each, so the scheduling
overhead in the idle CPU search is much less than if SNC were off.

Thanks.

Tim
  
Chen Yu Sept. 28, 2023, 8:23 a.m. UTC | #3
Hi Ingo,

On 2023-09-27 at 10:00:11 +0200, Ingo Molnar wrote:
> 
> * Chen Yu <yu.c.chen@intel.com> wrote:
> 
> > [..snip..]
> 
> Yeah, so I'm not convinced about this at this stage.
> 
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
>

Currently in the code we do not really reserve the idle CPU or force it
to be idle. We just give other wakees a search-sequence suggestion for
finding an idle CPU. If all idle CPUs are in the reserved state, the first
reserved idle CPU will be picked rather than left idle, so the idle CPU
resource is still fully utilized. The main impact is on wakeup latency, if
I understand correctly. Let me run the latest schbench and monitor these
latency statistics in detail.
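
As a rough sketch of that search order (again a userspace model with
invented names, not the actual select_idle_cpu() change): the scan
skips reserved idle CPUs but remembers the first one it sees, and
falls back to it when nothing else is idle, so no CPU is left unused
while a wakee waits:

#include <stdio.h>
#include <stdint.h>

#define NR_CPUS 4

struct cpu_model {
	int idle;
	uint64_t cache_hot_timeout;	/* reservation deadline, ns */
};

/*
 * Prefer idle CPUs whose reservation has expired; if every idle CPU
 * is still reserved, pick the first reserved one rather than leaving
 * it idle (work conservation).  Returns -1 if no CPU is idle at all.
 */
static int pick_idle_cpu(const struct cpu_model cpus[], uint64_t now_ns)
{
	int first_hot = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpus[cpu].idle)
			continue;
		if (now_ns < cpus[cpu].cache_hot_timeout) {
			if (first_hot < 0)
				first_hot = cpu;	/* remember, keep scanning */
			continue;
		}
		return cpu;	/* idle and past its reservation */
	}
	return first_hot;
}

int main(void)
{
	/* CPUs 1 and 3 are idle; CPU 1 is reserved until t=200ns. */
	struct cpu_model cpus[NR_CPUS] = {
		{ 0, 0 }, { 1, 200 }, { 0, 0 }, { 1, 0 },
	};

	printf("picked: %d\n", pick_idle_cpu(cpus, 100)); /* 3 */
	cpus[3].idle = 0;
	printf("picked: %d\n", pick_idle_cpu(cpus, 100)); /* falls back to 1 */
	return 0;
}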
 
> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
> 
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.
>

Thanks for the suggestion, Ingo. Yes, we should run more tests to evaluate
this proposal. As Tim mentioned, we have previously tested it using an OLTP
benchmark, as described in PATCH [2/2]. I'm thinking of running more
benchmarks to get a wider understanding of how this change would impact
them, covering both the positive and the negative parts.

thanks,
Chenyu
  
K Prateek Nayak Oct. 5, 2023, 6:22 a.m. UTC | #4
Hello Chenyu,

On 9/26/2023 10:40 AM, Chen Yu wrote:
> [..snip..]
> 
> Thanks for your comments and review!

Sorry for the delay! I'll leave the test results from a 3rd Generation
EPYC system below.

tl;dr

- Small regression in tbench and netperf, possibly due to more searching
  for an idle CPU.

- Small regression in schbench (old) at 256 workers, albeit with large
  run-to-run variance.

- Other benchmarks are more or less the same.

I'll leave the full results below

o System details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)


o Kernel Details

- tip:	tip:sched/core at commit 5fe7765997b1 (sched/deadline: Make
	dl_rq->pushable_dl_tasks update drive dl_rq->overloaded)

- SIS_CACHE: tip + this series


o Benchmark results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)     SIS_CACHE[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.36)     1.01 [ -1.47]( 3.02)
 2-groups     1.00 [ -0.00]( 2.35)     0.99 [  0.92]( 1.01)
 4-groups     1.00 [ -0.00]( 1.79)     0.98 [  2.34]( 0.63)
 8-groups     1.00 [ -0.00]( 0.84)     0.98 [  1.73]( 1.02)
16-groups     1.00 [ -0.00]( 2.39)     0.97 [  2.76]( 2.33)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:    tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
    1     1.00 [  0.00]( 0.86)     0.97 [ -2.68]( 0.74)
    2     1.00 [  0.00]( 0.99)     0.98 [ -2.18]( 0.17)
    4     1.00 [  0.00]( 0.49)     0.98 [ -2.47]( 1.15)
    8     1.00 [  0.00]( 0.96)     0.96 [ -3.81]( 0.24)
   16     1.00 [  0.00]( 1.38)     0.96 [ -4.33]( 1.31)
   32     1.00 [  0.00]( 1.64)     0.95 [ -4.70]( 1.59)
   64     1.00 [  0.00]( 0.92)     0.97 [ -2.97]( 0.49)
  128     1.00 [  0.00]( 0.57)     0.99 [ -1.15]( 0.57)
  256     1.00 [  0.00]( 0.38)     1.00 [  0.03]( 0.79)
  512     1.00 [  0.00]( 0.04)     1.00 [  0.43]( 0.34)
 1024     1.00 [  0.00]( 0.20)     1.00 [  0.41]( 0.13)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 2.52)     0.93 [ -6.90]( 6.75)
Scale     1.00 [  0.00]( 6.38)     0.99 [ -1.18]( 7.45)
  Add     1.00 [  0.00]( 6.54)     0.97 [ -2.55]( 7.34)
Triad     1.00 [  0.00]( 5.18)     0.95 [ -4.64]( 6.81)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 0.74)     1.00 [ -0.20]( 1.69)
Scale     1.00 [  0.00]( 6.25)     1.03 [  3.46]( 0.55)
  Add     1.00 [  0.00]( 6.53)     1.05 [  4.58]( 0.43)
Triad     1.00 [  0.00]( 5.14)     0.98 [ -1.78]( 6.24)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:         tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.27)     0.98 [ -1.50]( 0.14)
 2-clients     1.00 [  0.00]( 1.32)     0.98 [ -2.35]( 0.54)
 4-clients     1.00 [  0.00]( 0.40)     0.98 [ -2.35]( 0.56)
 8-clients     1.00 [  0.00]( 0.97)     0.97 [ -2.72]( 0.50)
16-clients     1.00 [  0.00]( 0.54)     0.96 [ -3.92]( 0.86)
32-clients     1.00 [  0.00]( 1.38)     0.97 [ -3.10]( 0.44)
64-clients     1.00 [  0.00]( 1.78)     0.97 [ -3.44]( 1.70)
128-clients    1.00 [  0.00]( 1.09)     0.94 [ -5.75]( 2.67)
256-clients    1.00 [  0.00]( 4.45)     0.97 [ -2.61]( 4.93)
512-clients    1.00 [  0.00](54.70)     0.98 [ -1.64](55.09)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:  tip[pct imp](CV)     SIS_CACHE[pct imp](CV)
  1     1.00 [ -0.00]( 3.95)     0.97 [  2.56](10.42)
  2     1.00 [ -0.00]( 5.89)     0.83 [ 16.67](22.56)
  4     1.00 [ -0.00](14.28)     1.00 [ -0.00](14.75)
  8     1.00 [ -0.00]( 4.90)     0.84 [ 15.69]( 6.01)
 16     1.00 [ -0.00]( 4.15)     1.00 [ -0.00]( 4.41)
 32     1.00 [ -0.00]( 5.10)     1.01 [ -1.10]( 3.44)
 64     1.00 [ -0.00]( 2.69)     1.04 [ -3.72]( 2.57)
128     1.00 [ -0.00]( 2.63)     0.94 [  6.29]( 2.55)
256     1.00 [ -0.00](26.75)     1.51 [-50.57](11.40)
512     1.00 [ -0.00]( 2.93)     0.96 [  3.52]( 3.56)

==================================================================
Test          : ycsb-cassandra
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric          tip     SIS_CACHE(pct imp)
Throughput      1.00    1.00 (%diff: 0.27%)


==================================================================
Test          : ycsb-mongodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric          tip      SIS_CACHE(pct imp)
Throughput      1.00    1.00 (%diff: -0.45%)


==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning      scaling     tip                SIS_CACHE(pct imp)
 1CCD           1        1.00              1.00 (%diff: -0.47%)
 2CCD           2        1.00              0.98 (%diff: -2.34%)
 4CCD           4        1.00              1.00 (%diff: -0.29%)
 8CCD           8        1.00              1.01 (%diff: 0.54%)

> 
> ----------------------------------------------------------------------
> 
> [..snip..]
> 

--
Thanks and Regards,
Prateek
  
Chen Yu Oct. 7, 2023, 3:23 a.m. UTC | #5
Hi Prateek,

On 2023-10-05 at 11:52:13 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> [..snip..]
> 
> Sorry for the delay! I'll leave the test results from a 3rd Generation
> EPYC system below.
> 
> tl;dr
> 
> - Small regression in tbench and netperf, possibly due to more searching
>   for an idle CPU.
> 
> - Small regression in schbench (old) at 256 workers, albeit with large
>   run-to-run variance.
> 
> - Other benchmarks are more or less the same.
> 
> Test          : schbench
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:  tip[pct imp](CV)     SIS_CACHE[pct imp](CV)
>   1     1.00 [ -0.00]( 3.95)     0.97 [  2.56](10.42)
>   2     1.00 [ -0.00]( 5.89)     0.83 [ 16.67](22.56)
>   4     1.00 [ -0.00](14.28)     1.00 [ -0.00](14.75)
>   8     1.00 [ -0.00]( 4.90)     0.84 [ 15.69]( 6.01)
>  16     1.00 [ -0.00]( 4.15)     1.00 [ -0.00]( 4.41)
>  32     1.00 [ -0.00]( 5.10)     1.01 [ -1.10]( 3.44)
>  64     1.00 [ -0.00]( 2.69)     1.04 [ -3.72]( 2.57)
> 128     1.00 [ -0.00]( 2.63)     0.94 [  6.29]( 2.55)
> 256     1.00 [ -0.00](26.75)     1.51 [-50.57](11.40)

Thanks for the testing. The latency regression from schbench is
quite obvious, and as you mentioned, it is possibly due to the longer
scan time during select_idle_cpu(). I'll run the same test with a split
LLC to see whether I can reproduce the issue.
I'm also working with Mathieu on another direction that chooses the
previous CPU over the current CPU when the system is overloaded; that
should be more moderate, and I'll post the test results later.

thanks,
Chenyu
  
Madadi Vineeth Reddy Oct. 17, 2023, 9:49 a.m. UTC | #6
Hi Chen Yu,

On 26/09/23 10:40, Chen Yu wrote:
> [..snip..]

Regarding making the scan for an idle CPU longer vs. the cache benefits,
I ran some benchmarks.

Tested the patch on a Power system with 12 cores, 96 CPUs in total.
The system has two NUMA nodes.

Below are some of the benchmark results

schbench 99.0th latency (lower is better)
========
case            load        	baseline[pct imp](std%)       SIS_CACHE[pct imp]( std%)
normal          1-mthreads      1.00 [ 0.00]( 3.66)            1.00 [  0.00]( 1.71)
normal          2-mthreads      1.00 [ 0.00]( 4.55)            1.02 [ -2.00]( 3.00)
normal          4-mthreads      1.00 [ 0.00]( 4.77)            0.96 [ +4.00]( 4.27)
normal          6-mthreads      1.00 [ 0.00]( 60.37)           2.66 [ -166.00]( 23.67)


schbench results show that there is not much impact on wakeup latencies from the extra iterations
in the search for an idle CPU in the select_idle_cpu() code path; interestingly, the numbers are
slightly better for SIS_CACHE in the 4-mthreads case. I think we can ignore the last case due to
huge run-to-run variations.

producer_consumer avg time/access (lower is better)
========
loads per consumer iteration   baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
5                  		1.00 [ 0.00]( 0.00)            0.87 [ +13.0]( 1.92)
20                   		1.00 [ 0.00]( 0.00)            0.92 [ +8.00]( 0.00)
50                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)
100                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)

The patch's main goal of improving cache locality is reflected here, as this is the only workload
where SIS_CACHE improves, mainly when the loads per consumer iteration are lower.

hackbench normalized time in seconds (lower is better)
========
case            load        baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
process-pipe    1-groups     1.00 [ 0.00]( 1.50)            1.02 [ -2.00]( 3.36)
process-pipe    2-groups     1.00 [ 0.00]( 4.76)            0.99 [ +1.00]( 5.68)
process-sockets 1-groups     1.00 [ 0.00]( 2.56)            1.00 [  0.00]( 0.86)
process-sockets 2-groups     1.00 [ 0.00]( 0.50)            0.99 [ +1.00]( 0.96)
threads-pipe    1-groups     1.00 [ 0.00]( 3.87)            0.71 [ +29.0]( 3.56)
threads-pipe    2-groups     1.00 [ 0.00]( 1.60)            0.97 [ +3.00]( 3.44)
threads-sockets 1-groups     1.00 [ 0.00]( 7.65)            0.99 [ +1.00]( 1.05)
threads-sockets 2-groups     1.00 [ 0.00]( 3.12)            1.03 [ -3.00]( 1.70)

hackbench results are similar on both kernels, except for a 29% improvement
in the threads-pipe case with 1 group.

Daytrader throughput (higher is better)
========

As per Ingo's suggestion, I ran a real-life workload, daytrader

baseline:
=================================================================================== 
 Instance      1
     Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
  ================       ===============   ===============   ===============
       10124.5                       2               0              3970

SIS_CACHE:
===================================================================================
 Instance      1
     Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
  ================       ===============   ===============   ===============
       10319.5                       2               0              5771

In the above run, daytrader performance was 2% better with SIS_CACHE.

Thanks and Regards 
Madadi Vineeth Reddy
  
Chen Yu Oct. 17, 2023, 11:09 a.m. UTC | #7
Hi Madadi,

On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> 
> [..snip..]
> 
> Regarding making the scan for an idle CPU longer vs. the cache benefits,
> I ran some benchmarks.
> 

Thanks very much for your interest and your time on the patch.

> Tested the patch on a Power system with 12 cores, 96 CPUs in total.
> The system has two NUMA nodes.
>
> Below are some of the benchmark results
> 
> schbench 99.0th latency (lower is better)
> ========
> case            load        	baseline[pct imp](std%)       SIS_CACHE[pct imp]( std%)
> normal          1-mthreads      1.00 [ 0.00]( 3.66)            1.00 [  0.00]( 1.71)
> normal          2-mthreads      1.00 [ 0.00]( 4.55)            1.02 [ -2.00]( 3.00)
> normal          4-mthreads      1.00 [ 0.00]( 4.77)            0.96 [ +4.00]( 4.27)
> normal          6-mthreads      1.00 [ 0.00]( 60.37)           2.66 [ -166.00]( 23.67)
> 
> 
> schbench results show that there is not much impact on wakeup latencies from the extra iterations
> in the search for an idle CPU in the select_idle_cpu() code path; interestingly, the numbers are
> slightly better for SIS_CACHE in the 4-mthreads case.

The 4% improvement is within std%, so I suppose we did not see much difference in the 4-mthreads case.

> I think we can ignore the last case due to huge run-to-run variations.

Although the run-to-run variation is large, it seems that the decrease is within that range.
Prateek has also reported that when the system is overloaded there could be some regression
from schbench:
https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
Could you also post the raw data printed by schbench? Maybe using the latest schbench
could show the latency in more detail.
 
> [..snip..]
> 
> Daytrader throughput (higher is better)
> ========
> 
> As per Ingo's suggestion, I ran a real-life workload, daytrader
> 
> baseline:
> =================================================================================== 
>  Instance      1
>      Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>   ================       ===============   ===============   ===============
>        10124.5                       2               0              3970
> 
> SIS_CACHE:
> ===================================================================================
>  Instance      1
>      Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>   ================       ===============   ===============   ===============
>        10319.5                       2               0              5771
> 
> In the above run, daytrader performance was 2% better with SIS_CACHE.
>

Thanks for bringing this good news; a real-life workload benefits from this change.
I'll tune this patch a little to address the regression from schbench. I should also
mention that I'm working with Mathieu on his proposal to make it easier for the wakee
to choose its previous CPU (similar to SIS_CACHE, but a little simpler), and we'll
check how to make more platforms benefit from this change.
https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/

thanks,
Chenyu
  
Madadi Vineeth Reddy Oct. 18, 2023, 7:32 p.m. UTC | #8
Hi Chen Yu,
On 17/10/23 16:39, Chen Yu wrote:
> Hi Madadi,
> 
> On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> [..snip..]
>>
>> Regarding making the scan for an idle CPU longer vs. the cache benefits,
>> I ran some benchmarks.
>>
> 
> Thanks very much for your interest and your time on the patch.
> 
>> Tested the patch on a Power system with 12 cores, 96 CPUs in total.
>> The system has two NUMA nodes.
>>
>> Below are some of the benchmark results
>>
>> schbench 99.0th latency (lower is better)
>> ========
>> case            load        	baseline[pct imp](std%)       SIS_CACHE[pct imp]( std%)
>> normal          1-mthreads      1.00 [ 0.00]( 3.66)            1.00 [  0.00]( 1.71)
>> normal          2-mthreads      1.00 [ 0.00]( 4.55)            1.02 [ -2.00]( 3.00)
>> normal          4-mthreads      1.00 [ 0.00]( 4.77)            0.96 [ +4.00]( 4.27)
>> normal          6-mthreads      1.00 [ 0.00]( 60.37)           2.66 [ -166.00]( 23.67)
>>
>>
>> schbench results show that there is not much impact on wakeup latencies from the extra iterations
>> in the search for an idle CPU in the select_idle_cpu() code path; interestingly, the numbers are
>> slightly better for SIS_CACHE in the 4-mthreads case.
> 
> The 4% improvement is within std%, so I suppose we did not see much difference in the 4-mthreads case.
> 
>> I think we can ignore the last case due to huge run-to-run variations.
> 
> Although the run-to-run variation is large, it seems that the decrease is within that range.
> Prateek has also reported that when the system is overloaded there could be some regression
> from schbench:
> https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
> Could you also post the raw data printed by schbench? Maybe using the latest schbench
> could show the latency in more detail.
>  

raw data by schbench(old) with 6-mthreads
======================

Baseline (5 runs)
========
Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 981 
        99.5000th: 4424
        99.9000th: 9200
        min=0, max=29497

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 29
        90.0000th: 35
        95.0000th: 38
        *99.0000th: 495 
        99.5000th: 3924
        99.9000th: 9872
        min=0, max=29997

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 30
        90.0000th: 36
        95.0000th: 39
        *99.0000th: 1326
        99.5000th: 4744
        99.9000th: 10000
        min=0, max=23394

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 55
        99.5000th: 3292
        99.9000th: 9104
        min=0, max=25196

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 711 
        99.5000th: 4600
        99.9000th: 9424
        min=0, max=19997

SIS_CACHE (5 runs)
=========
Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 30
        90.0000th: 35
        95.0000th: 38
        *99.0000th: 1894
        99.5000th: 5464
        99.9000th: 10000
        min=0, max=19157

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 2396
        99.5000th: 6664
        99.9000th: 10000
        min=0, max=24029

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 2132
        99.5000th: 6296
        99.9000th: 10000
        min=0, max=25313

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 1090
        99.5000th: 6232
        99.9000th: 9744
        min=0, max=27264

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 38
        *99.0000th: 1786
        99.5000th: 5240
        99.9000th: 9968
        min=0, max=24754

As indicated, the above data has large run-to-run variation, and in general the latency
is higher with SIS_CACHE at the 99th %ile.


schbench(new) with 6-mthreads
=============

Baseline
========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
	  50.0th: 8          (43672 samples)
	  90.0th: 13         (83908 samples)
	* 99.0th: 20         (18323 samples)
	  99.9th: 775        (1785 samples)
	  min=1, max=8400
Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
	  50.0th: 13648      (59873 samples)
	  90.0th: 14000      (82767 samples)
	* 99.0th: 14320      (16342 samples)
	  99.9th: 18720      (1670 samples)
	  min=5130, max=38334
RPS percentiles (requests) runtime 30 (s) (31 total samples)
	  20.0th: 6968       (8 samples)
	* 50.0th: 6984       (23 samples)
	  90.0th: 6984       (0 samples)
	  min=6835, max=6991
average rps: 6984.77


SIS_CACHE
=========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
	  50.0th: 9          (49267 samples)
	  90.0th: 14         (86522 samples)
	* 99.0th: 21         (14091 samples)
	  99.9th: 1146       (1722 samples)
	  min=1, max=10427
Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
	  50.0th: 13616      (62838 samples)
	  90.0th: 14000      (85301 samples)
	* 99.0th: 14352      (16149 samples)
	  99.9th: 21408      (1660 samples)
	  min=5070, max=41866
RPS percentiles (requests) runtime 30 (s) (31 total samples)
	  20.0th: 6968       (7 samples)
	* 50.0th: 6984       (21 samples)
	  90.0th: 6984       (0 samples)
	  min=6672, max=6996
average rps: 6981.07

With the new schbench, I didn't observe run-to-run variation, and there was no regression
with SIS_CACHE at the 99th %ile.


>> [..snip..]
> 
> Thanks for bringing this good news; a real-life workload benefits from this change.
> I'll tune this patch a little to address the regression from schbench. I should also
> mention that I'm working with Mathieu on his proposal to make it easier for the wakee
> to choose its previous CPU (similar to SIS_CACHE, but a little simpler), and we'll
> check how to make more platforms benefit from this change.
> https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/

Oh..ok. Thanks for the pointer!

> 
> thanks,
> Chenyu
>  

Thanks and Regards
Madadi Vineeth Reddy
  
Chen Yu Oct. 19, 2023, 10:57 a.m. UTC | #9
On 2023-10-19 at 01:02:16 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> [..snip..]
> 
> In new schbench, I didn't observe run to run variation and also there was no regression
> in case of SIS_CACHE for the 99th %ile.
>

Thanks for the test, Madadi. In my opinion we can stick with the new schbench
in the future. I'll double-check on my test machine.

thanks,
Chenyu