[PATCHSET,0/9] perf report: Improve srcline sort performance (v1)

Message ID 20221215192817.2734573-1-namhyung@kernel.org
Headers
Series perf report: Improve srcline sort performance (v1) |

Message

Namhyung Kim Dec. 15, 2022, 7:28 p.m. UTC
  Hello,

I noticed a performance problem in the srcline/srcfile processing during
perf report when it's using an external addr2line process.  I guess it's
also helpful even if it uses the libbfd to get the srcline info.

Also note that it's mostly from large (static) binaries, but smaller
binaries should also benefit from the fix if they have a lot of samples.

The first 5 patches are general fixes and updates.  The latter 4 patches
implemented the actual speed-up.

Let's test it with the perf tools itself.  Build a static binary like below.

  $ cd tools/perf
  $ make NO_JVMTI=1 LDFLAGS=-static

Then run the perf test workload.

  $ ./perf record -- ./perf test -w noploop

And run the perf report with srcline sort key like this.

  $ ./perf report -n -s srcline --stdio
  # To display the perf.data header info, please use --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 4K of event 'cycles:u'
  # Event count (approx.): 3572938596
  #
  # Overhead       Samples  Source:Line
  # ........  ............  ............
  #
      99.94%          4010  noploop.c:26
       0.03%            14  ??:0
       0.03%             1  perf.c:330
       0.00%             1  wcscpy.o:0

The problem is that it runs the addr2line when it processes each sample.
But as you can see many samples can have same result.  IOW, if the samples
have same address, we don't need to run the addr2line each time.

So I changed the sort_key->cmp() to compare the addresses only and moved
the addr2line from sort_key->collapse() so that they can be run after
merging the samples with the same address.

With the change, I can get a huge speed-up in processing srcline info
while they generate the same output.

Before:

  $ ./perf stat -- ./perf report -s srcline > /dev/null

   Performance counter stats for './perf report -s srcline':

           15,397.13 msec task-clock:u                     #    0.993 CPUs utilized
                   0      context-switches:u               #    0.000 /sec
                   0      cpu-migrations:u                 #    0.000 /sec
               3,810      page-faults:u                    #  247.449 /sec
      54,516,351,820      cycles:u                         #    3.541 GHz
      31,494,118,293      instructions:u                   #    0.58  insn per cycle
       8,577,271,187      branches:u                       #  557.069 M/sec
       1,216,165,520      branch-misses:u                  #   14.18% of all branches

        15.505066606 seconds time elapsed

        15.094122000 seconds user
         0.396962000 seconds sys

After:

  $ ./perf stat -- ./perf report -s srcline > /dev/null

   Performance counter stats for './perf report -s srcline':

              105.66 msec task-clock:u                     #    0.994 CPUs utilized
                   0      context-switches:u               #    0.000 /sec
                   0      cpu-migrations:u                 #    0.000 /sec
               3,275      page-faults:u                    #   30.995 K/sec
         185,063,407      cycles:u                         #    1.751 GHz
         142,470,215      instructions:u                   #    0.77  insn per cycle
          34,584,038      branches:u                       #  327.311 M/sec
           3,226,005      branch-misses:u                  #    9.33% of all branches

         0.106270464 seconds time elapsed

         0.074254000 seconds user
         0.032871000 seconds sys

The code is available at 'perf/srcline-v1' branch in

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Thanks,
Namhyung


Namhyung Kim (9):
  perf srcline: Do not return NULL for srcline
  perf report: Ignore SIGPIPE for srcline
  perf symbol: Add filename__has_section()
  perf srcline: Skip srcline if .debug_line is missing
  perf srcline: Conditionally suppress addr2line warnings
  perf hist: Add perf_hpp_fmt->init() callback
  perf hist: Improve srcline sort key performance
  perf hist: Improve srcfile sort key performance
  perf hist: Improve srcline_{from,to} sort key performance

 tools/perf/builtin-report.c      |   1 +
 tools/perf/util/hist.c           |  10 +--
 tools/perf/util/hist.h           |   1 +
 tools/perf/util/sort.c           | 129 ++++++++++++++++++++++++++++---
 tools/perf/util/sort.h           |   1 +
 tools/perf/util/srcline.c        |  20 +++--
 tools/perf/util/symbol-elf.c     |  28 +++++++
 tools/perf/util/symbol-minimal.c |   5 ++
 tools/perf/util/symbol.h         |   1 +
 9 files changed, 176 insertions(+), 20 deletions(-)


base-commit: 818448e9cf92e5c6b3c10320372eefcbe4174e4f
  

Comments

Ian Rogers Dec. 15, 2022, 8:28 p.m. UTC | #1
On Thu, Dec 15, 2022 at 11:28 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> I noticed a performance problem in the srcline/srcfile processing during
> perf report when it's using an external addr2line process.  I guess it's
> also helpful even if it uses the libbfd to get the srcline info.
>
> Also note that it's mostly from large (static) binaries, but smaller
> binaries should also benefit from the fix if they have a lot of samples.
>
> The first 5 patches are general fixes and updates.  The latter 4 patches
> implemented the actual speed-up.
>
> Let's test it with the perf tools itself.  Build a static binary like below.
>
>   $ cd tools/perf
>   $ make NO_JVMTI=1 LDFLAGS=-static
>
> Then run the perf test workload.
>
>   $ ./perf record -- ./perf test -w noploop
>
> And run the perf report with srcline sort key like this.
>
>   $ ./perf report -n -s srcline --stdio
>   # To display the perf.data header info, please use --header/--header-only options.
>   #
>   #
>   # Total Lost Samples: 0
>   #
>   # Samples: 4K of event 'cycles:u'
>   # Event count (approx.): 3572938596
>   #
>   # Overhead       Samples  Source:Line
>   # ........  ............  ............
>   #
>       99.94%          4010  noploop.c:26
>        0.03%            14  ??:0
>        0.03%             1  perf.c:330
>        0.00%             1  wcscpy.o:0
>
> The problem is that it runs the addr2line when it processes each sample.
> But as you can see many samples can have same result.  IOW, if the samples
> have same address, we don't need to run the addr2line each time.
>
> So I changed the sort_key->cmp() to compare the addresses only and moved
> the addr2line from sort_key->collapse() so that they can be run after
> merging the samples with the same address.
>
> With the change, I can get a huge speed-up in processing srcline info
> while they generate the same output.
>
> Before:
>
>   $ ./perf stat -- ./perf report -s srcline > /dev/null
>
>    Performance counter stats for './perf report -s srcline':
>
>            15,397.13 msec task-clock:u                     #    0.993 CPUs utilized
>                    0      context-switches:u               #    0.000 /sec
>                    0      cpu-migrations:u                 #    0.000 /sec
>                3,810      page-faults:u                    #  247.449 /sec
>       54,516,351,820      cycles:u                         #    3.541 GHz
>       31,494,118,293      instructions:u                   #    0.58  insn per cycle
>        8,577,271,187      branches:u                       #  557.069 M/sec
>        1,216,165,520      branch-misses:u                  #   14.18% of all branches
>
>         15.505066606 seconds time elapsed
>
>         15.094122000 seconds user
>          0.396962000 seconds sys
>
> After:
>
>   $ ./perf stat -- ./perf report -s srcline > /dev/null
>
>    Performance counter stats for './perf report -s srcline':
>
>               105.66 msec task-clock:u                     #    0.994 CPUs utilized
>                    0      context-switches:u               #    0.000 /sec
>                    0      cpu-migrations:u                 #    0.000 /sec
>                3,275      page-faults:u                    #   30.995 K/sec
>          185,063,407      cycles:u                         #    1.751 GHz
>          142,470,215      instructions:u                   #    0.77  insn per cycle
>           34,584,038      branches:u                       #  327.311 M/sec
>            3,226,005      branch-misses:u                  #    9.33% of all branches
>
>          0.106270464 seconds time elapsed
>
>          0.074254000 seconds user
>          0.032871000 seconds sys
>
> The code is available at 'perf/srcline-v1' branch in
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
> Thanks,
> Namhyung
>
>
> Namhyung Kim (9):
>   perf srcline: Do not return NULL for srcline
>   perf report: Ignore SIGPIPE for srcline
>   perf symbol: Add filename__has_section()
>   perf srcline: Skip srcline if .debug_line is missing
>   perf srcline: Conditionally suppress addr2line warnings
>   perf hist: Add perf_hpp_fmt->init() callback
>   perf hist: Improve srcline sort key performance
>   perf hist: Improve srcfile sort key performance
>   perf hist: Improve srcline_{from,to} sort key performance

Nice 145x performance win!

Series: Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

>  tools/perf/builtin-report.c      |   1 +
>  tools/perf/util/hist.c           |  10 +--
>  tools/perf/util/hist.h           |   1 +
>  tools/perf/util/sort.c           | 129 ++++++++++++++++++++++++++++---
>  tools/perf/util/sort.h           |   1 +
>  tools/perf/util/srcline.c        |  20 +++--
>  tools/perf/util/symbol-elf.c     |  28 +++++++
>  tools/perf/util/symbol-minimal.c |   5 ++
>  tools/perf/util/symbol.h         |   1 +
>  9 files changed, 176 insertions(+), 20 deletions(-)
>
>
> base-commit: 818448e9cf92e5c6b3c10320372eefcbe4174e4f
> --
> 2.39.0.314.g84b9a713c41-goog
>