[RFC,0/5] perf: Add ioctl to emit sideband events

Message ID 20230414082300.34798-1-adrian.hunter@intel.com
Series: perf: Add ioctl to emit sideband events

Message

Adrian Hunter April 14, 2023, 8:22 a.m. UTC
  Hi

Here is a stab at adding an ioctl for sideband events.

This is to overcome races when reading the same information
from /proc.

To keep it simple, the ioctl is limited to emitting existing
sideband events (fork, namespaces, comm, mmap) to an already
active context.

There are not yet any perf tools patches at this stage.
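
For illustration only, below is a minimal sketch of how a tool might drive
such an ioctl. The request name PERF_EVENT_IOC_EMIT_SIDEBAND, its number and
the flags argument are hypothetical placeholders, not the UAPI this series
actually adds:

/* Hypothetical usage sketch: the request code and its argument are
 * placeholders, not the UAPI added by this series. */
#include <linux/perf_event.h>
#include <linux/types.h>
#include <sys/ioctl.h>

#ifndef PERF_EVENT_IOC_EMIT_SIDEBAND
#define PERF_EVENT_IOC_EMIT_SIDEBAND	_IOW('$', 12, __u32)	/* placeholder value */
#endif

/* Illustrative selector bits for which sideband records to emit */
#define EMIT_FORK		(1U << 0)
#define EMIT_NAMESPACES		(1U << 1)
#define EMIT_COMM		(1U << 2)
#define EMIT_MMAP		(1U << 3)

/*
 * Ask the kernel to write the current fork/namespaces/comm/mmap state
 * into the event's already-active ring buffer, instead of the tool
 * synthesizing the same records from /proc.
 */
static int emit_sideband(int perf_fd, __u32 what)
{
	return ioctl(perf_fd, PERF_EVENT_IOC_EMIT_SIDEBAND, what);
}

A tool could then call something like emit_sideband(fd, EMIT_COMM | EMIT_MMAP)
once after opening its events, in place of the /proc scan.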


Adrian Hunter (5):
      perf: Add ioctl to emit sideband events
      perf: Add fork to the sideband ioctl
      perf: Add namespaces to the sideband ioctl
      perf: Add comm to the sideband ioctl
      perf: Add mmap to the sideband ioctl

 include/uapi/linux/perf_event.h |  19 ++-
 kernel/events/core.c            | 315 +++++++++++++++++++++++++++++++++-------
 2 files changed, 280 insertions(+), 54 deletions(-)


Regards
Adrian
  

Comments

Peter Zijlstra April 17, 2023, 11:02 a.m. UTC | #1
On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
> Hi
> 
> Here is a stab at adding an ioctl for sideband events.
> 
> This is to overcome races when reading the same information
> from /proc.

What races? Are you talking about reading old state in /proc, the kernel
delivering a sideband event for the new state, and then you writing the
old state out?

Surely that's something perf tool can fix without kernel changes?
  
Ian Rogers April 17, 2023, 4:37 p.m. UTC | #2
On Mon, Apr 17, 2023 at 4:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
> > Hi
> >
> > Here is a stab at adding an ioctl for sideband events.
> >
> > This is to overcome races when reading the same information
> > from /proc.
>
> What races? Are you talking about reading old state in /proc the kernel
> delivering a sideband event for the new state, and then you writing the
> old state out?
>
> Surely that's something perf tool can fix without kernel changes?

So my reading is that during event synthesis there are races between
reading the different /proc files. There is still, I believe, a race
with perf record/top with uid filtering, which reminds me of this.
The uid filtering race is that we scan /proc to find processes (pids)
for a uid, then synthesize the maps for each of those pids, but if a
pid starts or exits we either error out or don't sample that pid. I
believe the error-out behavior is easy to hit 100% of the time, making
uid mode of limited use.
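
For illustration only (this is not perf's actual synthesis code), the racy
pattern looks roughly like the sketch below; tolerating pids that vanish
between the /proc scan and the maps read, instead of erroring out, is the
tool-side fix:

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

static void synthesize_maps_for_uid(uid_t uid)
{
	DIR *proc = opendir("/proc");
	struct dirent *ent;
	char path[64], line[4096];
	struct stat st;
	FILE *maps;

	if (!proc)
		return;
	while ((ent = readdir(proc)) != NULL) {
		pid_t pid = atoi(ent->d_name);

		if (pid <= 0)
			continue;	/* not a pid directory */
		snprintf(path, sizeof(path), "/proc/%d", pid);
		if (stat(path, &st) || st.st_uid != uid)
			continue;	/* wrong uid, or pid already gone */
		snprintf(path, sizeof(path), "/proc/%d/maps", pid);
		maps = fopen(path, "r");
		if (!maps)
			continue;	/* pid exited after the scan: skip, don't error out */
		while (fgets(line, sizeof(line), maps))
			;		/* synthesize an MMAP record per line here */
		fclose(maps);
	}
	closedir(proc);
}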

This may be for something other than synthesis, but for synthesis a
few points are:
 - as servers get bigger and consequently more jobs get consolidated
on them, synthesis is slow (hence --num-thread-synthesize) and also
the events dominate the perf.data file - perhaps >90% of the file
size, and a lot of that will be for processes with no samples in them.
Another issue here is that all those file descriptors don't come for
free in the kernel.
 - BPF has buildid+offset stack traces that remove the need for
synthesis by having more expensive stack generation. I believe this is
unpopular as adding this as a variant for every kind of event would be
hard, but perhaps we can do some low-hanging fruit like instructions
and cycles.
 - I believe Jiri looked at doing synthesis with BPF. Perhaps we could
do something similar to the off-cpu and tail-synthesize, where more
things happen at the tail end of perf. Off-cpu records data in maps
that it then synthesizes into samples.

There is also a long-standing issue around not having munmap (or
mremap) events, which causes plenty of problems. Perhaps if we had less
mmap data in the perf.data file we could add these.

Thanks,
Ian
  
Adrian Hunter April 18, 2023, 6:18 a.m. UTC | #3
On 17/04/23 14:02, Peter Zijlstra wrote:
> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
>> Hi
>>
>> Here is a stab at adding an ioctl for sideband events.
>>
>> This is to overcome races when reading the same information
>> from /proc.
> 
> What races? Are you talking about reading old state in /proc the kernel
> delivering a sideband event for the new state, and then you writing the
> old state out?
> 
> Surely that's something perf tool can fix without kernel changes?

Yes, and it was a bit of a brain fart not to realise that.

There may still be corner cases where different kinds of events are
interdependent; for example, NAMESPACES events vs MMAP events could
have ordering issues.

Putting that aside, the ioctl may be quicker than reading from
/proc.  I could get some numbers and see what people think.
  
Adrian Hunter April 18, 2023, 7:03 a.m. UTC | #4
On 17/04/23 19:37, Ian Rogers wrote:
> On Mon, Apr 17, 2023 at 4:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
>>> Hi
>>>
>>> Here is a stab at adding an ioctl for sideband events.
>>>
>>> This is to overcome races when reading the same information
>>> from /proc.
>>
>> What races? Are you talking about reading old state in /proc the kernel
>> delivering a sideband event for the new state, and then you writing the
>> old state out?
>>
>> Surely that's something perf tool can fix without kernel changes?
> 
> So my reading is that during event synthesis there are races between
> reading the different /proc files. There is still, I believe, a race
> in with perf record/top with uid filtering which reminds me of this.
> The uid filtering race is that we scan /proc to find processes (pids)
> for a uid, we then synthesize the maps for each of these pids but if a
> pid starts or exits we either error out or don't sample that pid. I
> believe the error out behavior is easy to hit 100% of the time making
> uid mode of limited use.
> 
> This may be for something other than synthesis, but for synthesis a
> few points are:
>  - as servers get bigger and consequently more jobs get consolidated
> on them, synthesis is slow (hence --num-thread-synthesize) and also
> the events dominate the perf.data file - perhaps >90% of the file
> size, and a lot of that will be for processes with no samples in them.

Note also, for hardware tracing, it isn't generally possible to know
during tracing which processes will have samples, and figuring it out
afterwards and working backwards may not be feasible.

> Another issue here is that all those file descriptors don't come for
> free in the kernel.
>  - BPF has buildid+offset stack traces that remove the need for
> synthesis by having more expensive stack generation. I believe this is
> unpopular as adding this as a variant for every kind of event would be
> hard, but perhaps we can do some low-hanging fruit like instructions
> and cycles.
>  - I believe Jiri looked at doing synthesis with BPF. Perhaps we could
> do something similar to the off-cpu and tail-synthesize, where more
> things happen at the tail end of perf. Off-cpu records data in maps
> that it then synthesizes into samples.
> 
> There is also a long standing issue around not sampling munmap (or
> mremap) that causes plenty of issues. Perhaps if we had less mmap in
> the perf.data file we could add these.
> 
> Thanks,
> Ian
  
Adrian Hunter April 18, 2023, 1:36 p.m. UTC | #5
On 18/04/23 09:18, Adrian Hunter wrote:
> On 17/04/23 14:02, Peter Zijlstra wrote:
>> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
>>> Hi
>>>
>>> Here is a stab at adding an ioctl for sideband events.
>>>
>>> This is to overcome races when reading the same information
>>> from /proc.
>>
>> What races? Are you talking about reading old state in /proc the kernel
>> delivering a sideband event for the new state, and then you writing the
>> old state out?
>>
>> Surely that's something perf tool can fix without kernel changes?
> 
> Yes, and it was a bit of a brain fart not to realise that.
> 
> There may still be corner cases, where different kinds of events are
> interdependent, perhaps NAMESPACES events vs MMAP events could
> have ordering issues.
> 
> Putting that aside, the ioctl may be quicker than reading from
> /proc.  I could get some numbers and see what people think.
> 

Here's a result with a quick hack that uses the ioctl but does not
handle the buffer becoming full (hence the -m4M):

# ps -e | wc -l
1171
# perf.old stat -- perf.old record -o old.data --namespaces -a true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.095 MB old.data (100 samples) ]

 Performance counter stats for 'perf.old record -o old.data --namespaces -a true':

            498.15 msec task-clock                       #    0.987 CPUs utilized             
               126      context-switches                 #  252.935 /sec                      
                64      cpu-migrations                   #  128.475 /sec                      
              4396      page-faults                      #    8.825 K/sec                     
        1927096347      cycles                           #    3.868 GHz                       
        4563059399      instructions                     #    2.37  insn per cycle            
         914232559      branches                         #    1.835 G/sec                     
           6618052      branch-misses                    #    0.72% of all branches           
        9633787105      slots                            #   19.339 G/sec                     
        4394300990      topdown-retiring                 #     38.8% Retiring                 
        3693815286      topdown-bad-spec                 #     32.6% Bad Speculation          
        1692356927      topdown-fe-bound                 #     14.9% Frontend Bound           
        1544151518      topdown-be-bound                 #     13.6% Backend Bound            

       0.504636742 seconds time elapsed

       0.158237000 seconds user
       0.340625000 seconds sys

# perf.old stat -- perf.new record -o new.data -m4M --namespaces -a true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.095 MB new.data (103 samples) ]

 Performance counter stats for 'perf.new record -o new.data -m4M --namespaces -a true':

            386.61 msec task-clock                       #    0.988 CPUs utilized             
               100      context-switches                 #  258.658 /sec                      
                65      cpu-migrations                   #  168.128 /sec                      
              4935      page-faults                      #   12.765 K/sec                     
        1495905137      cycles                           #    3.869 GHz                       
        3647660473      instructions                     #    2.44  insn per cycle            
         735822370      branches                         #    1.903 G/sec                     
           5765668      branch-misses                    #    0.78% of all branches           
        7477722620      slots                            #   19.342 G/sec                     
        3415835954      topdown-retiring                 #     39.5% Retiring                 
        2748625759      topdown-bad-spec                 #     31.8% Bad Speculation          
        1221594670      topdown-fe-bound                 #     14.1% Frontend Bound           
        1256150733      topdown-be-bound                 #     14.5% Backend Bound            

       0.391472763 seconds time elapsed

       0.141207000 seconds user
       0.246277000 seconds sys

# ls -lh old.data
-rw------- 1 root root 1.2M Apr 18 13:19 old.data
# ls -lh new.data
-rw------- 1 root root 1.2M Apr 18 13:19 new.data
#
  
Ian Rogers April 18, 2023, 3:51 p.m. UTC | #6
On Tue, Apr 18, 2023 at 6:36 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 18/04/23 09:18, Adrian Hunter wrote:
> > On 17/04/23 14:02, Peter Zijlstra wrote:
> >> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
> >>> Hi
> >>>
> >>> Here is a stab at adding an ioctl for sideband events.
> >>>
> >>> This is to overcome races when reading the same information
> >>> from /proc.
> >>
> >> What races? Are you talking about reading old state in /proc the kernel
> >> delivering a sideband event for the new state, and then you writing the
> >> old state out?
> >>
> >> Surely that's something perf tool can fix without kernel changes?
> >
> > Yes, and it was a bit of a brain fart not to realise that.
> >
> > There may still be corner cases, where different kinds of events are
> > interdependent, perhaps NAMESPACES events vs MMAP events could
> > have ordering issues.
> >
> > Putting that aside, the ioctl may be quicker than reading from
> > /proc.  I could get some numbers and see what people think.
> >
>
> Here's a result with a quick hack to use the ioctl but without
> handling the buffer becoming full (hence the -m4M)
>
> # ps -e | wc -l
> 1171
> # perf.old stat -- perf.old record -o old.data --namespaces -a true
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.095 MB old.data (100 samples) ]
>
>  Performance counter stats for 'perf.old record -o old.data --namespaces -a true':
>
>             498.15 msec task-clock                       #    0.987 CPUs utilized
>                126      context-switches                 #  252.935 /sec
>                 64      cpu-migrations                   #  128.475 /sec
>               4396      page-faults                      #    8.825 K/sec
>         1927096347      cycles                           #    3.868 GHz
>         4563059399      instructions                     #    2.37  insn per cycle
>          914232559      branches                         #    1.835 G/sec
>            6618052      branch-misses                    #    0.72% of all branches
>         9633787105      slots                            #   19.339 G/sec
>         4394300990      topdown-retiring                 #     38.8% Retiring
>         3693815286      topdown-bad-spec                 #     32.6% Bad Speculation
>         1692356927      topdown-fe-bound                 #     14.9% Frontend Bound
>         1544151518      topdown-be-bound                 #     13.6% Backend Bound
>
>        0.504636742 seconds time elapsed
>
>        0.158237000 seconds user
>        0.340625000 seconds sys
>
> # perf.old stat -- perf.new record -o new.data -m4M --namespaces -a true
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.095 MB new.data (103 samples) ]
>
>  Performance counter stats for 'perf.new record -o new.data -m4M --namespaces -a true':
>
>             386.61 msec task-clock                       #    0.988 CPUs utilized
>                100      context-switches                 #  258.658 /sec
>                 65      cpu-migrations                   #  168.128 /sec
>               4935      page-faults                      #   12.765 K/sec
>         1495905137      cycles                           #    3.869 GHz
>         3647660473      instructions                     #    2.44  insn per cycle
>          735822370      branches                         #    1.903 G/sec
>            5765668      branch-misses                    #    0.78% of all branches
>         7477722620      slots                            #   19.342 G/sec
>         3415835954      topdown-retiring                 #     39.5% Retiring
>         2748625759      topdown-bad-spec                 #     31.8% Bad Speculation
>         1221594670      topdown-fe-bound                 #     14.1% Frontend Bound
>         1256150733      topdown-be-bound                 #     14.5% Backend Bound
>
>        0.391472763 seconds time elapsed
>
>        0.141207000 seconds user
>        0.246277000 seconds sys
>
> # ls -lh old.data
> -rw------- 1 root root 1.2M Apr 18 13:19 old.data
> # ls -lh new.data
> -rw------- 1 root root 1.2M Apr 18 13:19 new.data
> #

Cool, so the headline is a ~20% (roughly 1 billion instruction) reduction
in perf startup overhead?

Thanks,
Ian