[PATCHSET,0/7] perf lock contention: Improve performance if map is full (v1)

Message ID 20230406210611.1622492-1-namhyung@kernel.org
Series perf lock contention: Improve performance if map is full (v1)

Message

Namhyung Kim April 6, 2023, 9:06 p.m. UTC
  Hello,

I got a report that the overhead of perf lock contention is too big in
some cases.  It was running in the task aggregation mode (-t) at the time,
and there were lots of tasks contending with each other.

It turned out that the hash map update is the problem.  The result is
saved in the lock_stat hash map, which is pre-allocated.  The BPF program
never deletes data from the map, it only adds entries.  But once the map
is full, trying to update it becomes a very heavy operation, since it has
to check every CPU's freelist to get a new node to save the result.  And
we already know the update will fail when the map is full, so there is no
need to even try.
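
Roughly, the last patch just remembers that the map is full and skips
further inserts.  Below is a simplified sketch of that idea, not the
actual patch; the map layout and the names lock_stat_full, data_fail and
account_contention are only for illustration:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct contention_key {
        __u32 pid;
    };

    struct contention_data {
        __u64 total_time;
        __u64 count;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);   /* pre-allocated by default */
        __uint(max_entries, 16384);
        __type(key, struct contention_key);
        __type(value, struct contention_data);
    } lock_stat SEC(".maps");

    int lock_stat_full;   /* set once an insert fails because the map is full */
    __u64 data_fail;      /* reported as the "data" failure reason */

    static int account_contention(struct contention_key *key,
                                  struct contention_data *init)
    {
        struct contention_data *data;
        int err;

        data = bpf_map_lookup_elem(&lock_stat, key);
        if (data) {
            /* existing entry: aggregate in place, no allocation needed */
            __sync_fetch_and_add(&data->total_time, init->total_time);
            __sync_fetch_and_add(&data->count, 1);
            return 0;
        }

        if (lock_stat_full) {
            /* already full: don't walk every CPU's freelist just to fail */
            __sync_fetch_and_add(&data_fail, 1);
            return 0;
        }

        err = bpf_map_update_elem(&lock_stat, key, init, BPF_NOEXIST);
        if (err < 0) {
            /* typically -E2BIG once the pre-allocated entries run out */
            lock_stat_full = 1;
            __sync_fetch_and_add(&data_fail, 1);
        }
        return err;
    }

    char LICENSE[] SEC("license") = "Dual BSD/GPL";

The lookup-and-aggregate path for existing entries is untouched; only the
insert path short-circuits once one insert has already failed.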

I've checked it on my 64-CPU machine with this workload:

    $ perf bench sched messaging -g 1000
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver processes per group
    # 1000 groups == 40000 processes run

         Total time: 2.825 [sec]

And I used the task mode so that the map is guaranteed to be full: the
default map size is 16K entries and this workload has 40K tasks.

Before:
    $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver processes per group
    # 1000 groups == 40000 processes run

         Total time: 11.299 [sec]
     contended   total wait     max wait     avg wait          pid   comm

         19284      3.51 s       3.70 ms    181.91 us      1305863   sched-messaging
           243     84.09 ms    466.67 us    346.04 us      1336608   sched-messaging
           177     66.35 ms     12.08 ms    374.88 us      1220416   node

After:
    $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver processes per group
    # 1000 groups == 40000 processes run

         Total time: 3.044 [sec]
     contended   total wait     max wait     avg wait          pid   comm

         18743    591.92 ms    442.96 us     31.58 us      1431454   sched-messaging
            51    210.64 ms    207.45 ms      4.13 ms      1468724   sched-messaging
            81     68.61 ms     65.79 ms    847.07 us      1463183   sched-messaging

    === output for debug ===

    bad: 1164137, total: 2253341
    bad rate: 51.66 %
    histogram of failure reasons
           task: 0
          stack: 0
           time: 0
           data: 1164137
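
All of the failures are in 'data', i.e. new entries that could not be
inserted because the lock_stat map was full.  If those entries actually
matter, the map can be sized for the workload with --map-nr-entries
(or -M after this series); the value below is only an example:

    $ sudo ./perf lock con -abt -M 65536 -E3 -- perf bench sched messaging -g 1000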

The first few patches are small cleanups and fixes.  You can get the code
from the 'perf/lock-map-v1' branch in

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
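
For example (any of the usual ways to fetch a branch works, this is just one):

    $ git clone -b perf/lock-map-v1 \
        git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git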

Thanks,
Namhyung

Namhyung Kim (7):
  perf lock contention: Simplify parse_lock_type()
  perf lock contention: Use -M for --map-nr-entries
  perf lock contention: Update default map size to 16384
  perf lock contention: Add data failure stat
  perf lock contention: Update total/bad stats for hidden entries
  perf lock contention: Revise needs_callstack() condition
  perf lock contention: Do not try to update if hash map is full

 tools/perf/Documentation/perf-lock.txt        |  4 +-
 tools/perf/builtin-lock.c                     | 64 ++++++++-----------
 tools/perf/util/bpf_lock_contention.c         |  7 +-
 .../perf/util/bpf_skel/lock_contention.bpf.c  | 29 +++++++--
 tools/perf/util/bpf_skel/lock_data.h          |  3 +
 tools/perf/util/lock-contention.h             |  2 +
 6 files changed, 60 insertions(+), 49 deletions(-)


base-commit: e5116f46d44b72ede59a6923829f68a8b8f84e76
  

Comments

Ian Rogers April 7, 2023, 12:35 a.m. UTC | #1
On Thu, Apr 6, 2023 at 2:06 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> I got a report that the overhead of perf lock contention is too big in
> some cases.  It was running in the task aggregation mode (-t) at the time,
> and there were lots of tasks contending with each other.
>
> It turned out that the hash map update is the problem.  The result is
> saved in the lock_stat hash map, which is pre-allocated.  The BPF program
> never deletes data from the map, it only adds entries.  But once the map
> is full, trying to update it becomes a very heavy operation, since it has
> to check every CPU's freelist to get a new node to save the result.  And
> we already know the update will fail when the map is full, so there is no
> need to even try.
>
> I've checked it on my 64-CPU machine with this workload:
>
>     $ perf bench sched messaging -g 1000
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver processes per group
>     # 1000 groups == 40000 processes run
>
>          Total time: 2.825 [sec]
>
> And I used the task mode so that the map is guaranteed to be full: the
> default map size is 16K entries and this workload has 40K tasks.
>
> Before:
>     $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver processes per group
>     # 1000 groups == 40000 processes run
>
>          Total time: 11.299 [sec]
>      contended   total wait     max wait     avg wait          pid   comm
>
>          19284      3.51 s       3.70 ms    181.91 us      1305863   sched-messaging
>            243     84.09 ms    466.67 us    346.04 us      1336608   sched-messaging
>            177     66.35 ms     12.08 ms    374.88 us      1220416   node
>
> After:
>     $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver processes per group
>     # 1000 groups == 40000 processes run
>
>          Total time: 3.044 [sec]
>      contended   total wait     max wait     avg wait          pid   comm
>
>          18743    591.92 ms    442.96 us     31.58 us      1431454   sched-messaging
>             51    210.64 ms    207.45 ms      4.13 ms      1468724   sched-messaging
>             81     68.61 ms     65.79 ms    847.07 us      1463183   sched-messaging
>
>     === output for debug ===
>
>     bad: 1164137, total: 2253341
>     bad rate: 51.66 %
>     histogram of failure reasons
>            task: 0
>           stack: 0
>            time: 0
>            data: 1164137
>
> The first few patches are small cleanups and fixes.  You can get the code
> from the 'perf/lock-map-v1' branch in
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
> Thanks,
> Namhyung
>
> Namhyung Kim (7):
>   perf lock contention: Simplify parse_lock_type()
>   perf lock contention: Use -M for --map-nr-entries
>   perf lock contention: Update default map size to 16384
>   perf lock contention: Add data failure stat
>   perf lock contention: Update total/bad stats for hidden entries
>   perf lock contention: Revise needs_callstack() condition
>   perf lock contention: Do not try to update if hash map is full

Series:
Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

  
Arnaldo Carvalho de Melo April 7, 2023, 12:41 a.m. UTC | #2
Em Thu, Apr 06, 2023 at 02:06:04PM -0700, Namhyung Kim escreveu:
> Hello,
> 
> I got a report that the overhead of perf lock contention is too big in
> some cases.  It was running in the task aggregation mode (-t) at the time,
> and there were lots of tasks contending with each other.
>
> It turned out that the hash map update is the problem.  The result is
> saved in the lock_stat hash map, which is pre-allocated.  The BPF program
> never deletes data from the map, it only adds entries.  But once the map
> is full, trying to update it becomes a very heavy operation, since it has
> to check every CPU's freelist to get a new node to save the result.  And
> we already know the update will fail when the map is full, so there is no
> need to even try.

Thanks, applied.

- Arnaldo

 