diff mbox series

tracing: Add disable-filter-buf option

Message ID	20231215102633.7a24cb77@rorschach.local.home
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel+bounces-1213-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:40f1:3f00::1 as permitted sender) client-ip=2604:1380:40f1:3f00::1; Date: Fri, 15 Dec 2023 10:26:33 -0500 From: Steven Rostedt <rostedt@goodmis.org> To: LKML <linux-kernel@vger.kernel.org>, Linux trace kernel <linux-trace-kernel@vger.kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org>, Mark Rutland <mark.rutland@arm.com>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Subject: [PATCH] tracing: Add disable-filter-buf option Message-ID: <20231215102633.7a24cb77@rorschach.local.home> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-getmail-retrieved-from-mailbox: INBOX
Series	tracing: Add disable-filter-buf option \| tracing: Add disable-filter-buf option

Commit Message

Steven Rostedt Dec. 15, 2023, 3:26 p.m. UTC

  From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Normally, when the filter is enabled, a temporary buffer is created to
copy the event data into it to perform the filtering logic. If the filter
passes and the event should be recorded, then the event is copied from the
temporary buffer into the ring buffer. If the event is to be discarded
then it is simply dropped. If another event comes in via an interrupt, it
will not use the temporary buffer as it is busy and will write directly
into the ring buffer.

The disable-filter-buf option will disable the temporary buffer and always
write into the ring buffer. This will avoid the copy when the event is to
be recorded, but also adds a bit more overhead on the discard, and if
another event were to interrupt the event that is to be discarded, then
the event will not be removed from the ring buffer but instead converted
to padding that will not be read by the reader. Padding will still take up
space on the ring buffer.

This option can be beneficial if most events are recorded and not
discarded, or simply for debugging the discard functionality of the ring
buffer.

Also fix some whitespace (that was fixed by editing this in vscode).

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 Documentation/trace/ftrace.rst | 23 +++++++++++++++++++++
 kernel/trace/trace.c           | 37 ++++++++++++++++++++--------------
 kernel/trace/trace.h           |  1 +
 3 files changed, 46 insertions(+), 15 deletions(-)

Comments

Mathieu Desnoyers Dec. 15, 2023, 3:53 p.m. UTC | #1

On 2023-12-15 10:26, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> Normally, when the filter is enabled, a temporary buffer is created to
> copy the event data into it to perform the filtering logic. If the filter
> passes and the event should be recorded, then the event is copied from the
> temporary buffer into the ring buffer. If the event is to be discarded
> then it is simply dropped. If another event comes in via an interrupt, it
> will not use the temporary buffer as it is busy and will write directly
> into the ring buffer.
> 
> The disable-filter-buf option will disable the temporary buffer and always
> write into the ring buffer. This will avoid the copy when the event is to
> be recorded, but also adds a bit more overhead on the discard, and if
> another event were to interrupt the event that is to be discarded, then
> the event will not be removed from the ring buffer but instead converted
> to padding that will not be read by the reader. Padding will still take up
> space on the ring buffer.
> 
> This option can be beneficial if most events are recorded and not
> discarded, or simply for debugging the discard functionality of the ring
> buffer.
> 
> Also fix some whitespace (that was fixed by editing this in vscode).

I'm not convinced that a boolean state is what you need here.

Yes, today you are in a position where you have two options:

a) use the filter buffer, which falls back on copy to ring buffer
    if nested,

b) disable the filter buffer, and thus always copy to ring buffer
    before filtering,

But I think it would not be far-fetched to improve the implementation
of the filter-buffer to have one filter-buffer per nesting level
(per execution context), or just implement the filter buffer as a
per-cpu stack, which would remove the need to fall back on copy
to ring buffer when nested. Or you could even do like LTTng and
filter on the input arguments directly.

But each of these changes would add yet another boolean tunable,
which can quickly make the mix hard to understand from a user
perspective.

So rather than stacking tons of "on/off" switches for filter
features, how about you let users express the mechanism they
want to use for filtering with a string instead ? e.g.

filter-method="temp-buffer"
filter-method="ring-buffer"
filter-method="input-arguments"

?

Thanks,

Mathieu

Steven Rostedt Dec. 15, 2023, 5:04 p.m. UTC | #2

On Fri, 15 Dec 2023 10:53:39 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> 
> I'm not convinced that a boolean state is what you need here.

I will admit the biggest motivation for this was to allow for debugging ;-)

> 
> Yes, today you are in a position where you have two options:
> 
> a) use the filter buffer, which falls back on copy to ring buffer
>     if nested,
> 
> b) disable the filter buffer, and thus always copy to ring buffer
>     before filtering,
> 
> But I think it would not be far-fetched to improve the implementation
> of the filter-buffer to have one filter-buffer per nesting level
> (per execution context), or just implement the filter buffer as a
> per-cpu stack, which would remove the need to fall back on copy
> to ring buffer when nested. Or you could even do like LTTng and
> filter on the input arguments directly.

The filtering on the input arguments is much more difficult as they are
not exposed to user space. I plan on adding that feature, but it will
be more tied to the probe infrastructure, and be separate from this
type of filtering.

When looking at removing the discard, I found that the histogram logic
currently depends on writing to the ring buffer directly. That's
because it needs to know the timestamp, and may or may not discard it.
I'll have to look at this code more and see if I can get rid of that.
In fact, this patch taps into that logic (it's what added the
tracing_set_filter_buffering() function.

> 
> But each of these changes would add yet another boolean tunable,
> which can quickly make the mix hard to understand from a user
> perspective.

Honestly, if I do the stack filtering (it's already per_cpu), it will
replace this altogether, and this option will still be available. That
is, it will switch off buffering an event regardless of the
implementation.

> 
> So rather than stacking tons of "on/off" switches for filter
> features, how about you let users express the mechanism they
> want to use for filtering with a string instead ? e.g.
> 
> filter-method="temp-buffer"
> filter-method="ring-buffer"
> filter-method="input-arguments"

If I add other ways to filter, it will be a separate file to control
that, but all options are on/off switches. Even if I add other
functionality to the way buffers are created, this will still have the
same functionality to turn the entire thing on or off.

Thanks for the review.

-- Steve

Mathieu Desnoyers Dec. 15, 2023, 5:24 p.m. UTC | #3

On 2023-12-15 12:04, Steven Rostedt wrote:
> On Fri, 15 Dec 2023 10:53:39 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
[...]
>>
>> So rather than stacking tons of "on/off" switches for filter
>> features, how about you let users express the mechanism they
>> want to use for filtering with a string instead ? e.g.
>>
>> filter-method="temp-buffer"
>> filter-method="ring-buffer"
>> filter-method="input-arguments"
> 
> If I add other ways to filter, it will be a separate file to control
> that, but all options are on/off switches. Even if I add other
> functionality to the way buffers are created, this will still have the
> same functionality to turn the entire thing on or off.

I'll be clearer then: I think this is a bad ABI. In your reply, you justify
this bad ABI by implementation motivations.

I don't care about the implementation, I care about the ABI, and
I feel that your reply is not addressing my concern at all.

Moreover, double-negative boolean options (disable-X=false) are confusing.

Thanks,

Mathieu

Steven Rostedt Dec. 15, 2023, 5:34 p.m. UTC | #4

On Fri, 15 Dec 2023 12:24:14 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2023-12-15 12:04, Steven Rostedt wrote:
> > On Fri, 15 Dec 2023 10:53:39 -0500
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:  
> [...]
> >>
> >> So rather than stacking tons of "on/off" switches for filter
> >> features, how about you let users express the mechanism they
> >> want to use for filtering with a string instead ? e.g.
> >>
> >> filter-method="temp-buffer"
> >> filter-method="ring-buffer"
> >> filter-method="input-arguments"  
> > 
> > If I add other ways to filter, it will be a separate file to control
> > that, but all options are on/off switches. Even if I add other
> > functionality to the way buffers are created, this will still have the
> > same functionality to turn the entire thing on or off.  
> 
> I'll be clearer then: I think this is a bad ABI. In your reply, you justify
> this bad ABI by implementation motivations.

What's wrong with a way to stop the copying ?

The option is just a way to say "I don't want to do the copy into the
buffer, I want to go directly into it"

> 
> I don't care about the implementation, I care about the ABI, and
> I feel that your reply is not addressing my concern at all.

Maybe I don't understand your concern.

It's an on/off switch (like all options are). This is just a way to say
"I want to indirect copying of the event before filtering or not".

The "input-argument" part above may never happen. What's different
between tracefs and LTTng, is that all events are created by the
subsystem not by me. You don't use the TRACE_EVENT() macro, but you
need to manually create each event you care about yourself. It's more
of a burden but you also then have the luxury of hooking to the input
portion. That is not exposed, and that is by design. As that could
possibly make all tracepoints an ABI, and you'll have people like
peterz nacking any new tracepoint in the scheduler. He doesn't allow
trace events anymore because of that exposure.

> 
> Moreover, double-negative boolean options (disable-X=false) are confusing.

I agree with this, and will rename it to "filter-buffer" instead. I
blame getting up at 3am for my 5:15am flight for not thinking clearly
on that ;-)

Thanks!

-- Steve

Mathieu Desnoyers Dec. 15, 2023, 6:25 p.m. UTC | #5

On 2023-12-15 12:34, Steven Rostedt wrote:
> On Fri, 15 Dec 2023 12:24:14 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> On 2023-12-15 12:04, Steven Rostedt wrote:
>>> On Fri, 15 Dec 2023 10:53:39 -0500
>>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>> [...]
>>>>
>>>> So rather than stacking tons of "on/off" switches for filter
>>>> features, how about you let users express the mechanism they
>>>> want to use for filtering with a string instead ? e.g.
>>>>
>>>> filter-method="temp-buffer"
>>>> filter-method="ring-buffer"
>>>> filter-method="input-arguments"
>>>
>>> If I add other ways to filter, it will be a separate file to control
>>> that, but all options are on/off switches. Even if I add other
>>> functionality to the way buffers are created, this will still have the
>>> same functionality to turn the entire thing on or off.
>>
>> I'll be clearer then: I think this is a bad ABI. In your reply, you justify
>> this bad ABI by implementation motivations.
> 
> What's wrong with a way to stop the copying ?

I am not against exposing an ABI that allows userspace to alter the
filter behavior. I disagree on the way you plan to expose the ABI.

Exposing this option as an ABI in this way exposes too much internal
ring buffer implementation details to userspace.

It exposes the following details which IMHO should be hidden or
configurable in a way that allows moving to a whole new mechanism
which will have significantly different characteristics in the
future:

It exposes that:

- filtering uses a copy to a temporary buffer, and
- that this copy is enabled by default.

Once exposed, those design constraints become immutable due to ABI.

> 
> The option is just a way to say "I don't want to do the copy into the
> buffer, I want to go directly into it"

My main concern is how this concept, once exposed to userspace,
becomes not only an internal implementation detail, but a fundamental
part of the design which cannot ever go away without breaking the ABI
or making parts of the ABI irrelevant.

I can make a parallel with the scheduler: this is as if the sched
feature knobs (which are there only for development/debugging purposes)
would all be exposed as userspace ABI. This would seriously
limit the evolution of the scheduler design in the future. I see this
as the same problem at the ring buffer level.

> 
>>
>> I don't care about the implementation, I care about the ABI, and
>> I feel that your reply is not addressing my concern at all.
> 
> Maybe I don't understand your concern.
> 
> It's an on/off switch (like all options are). This is just a way to say
> "I want to indirect copying of the event before filtering or not".

Not all tracefs options are booleans. The "current_tracer" file ABI
exposes a string input/output parameter. I would recommend the same
for the equivalent of a "current_filter" file.

> 
> The "input-argument" part above may never happen. What's different
> between tracefs and LTTng, is that all events are created by the
> subsystem not by me. You don't use the TRACE_EVENT() macro, but you
> need to manually create each event you care about yourself. It's more
> of a burden but you also then have the luxury of hooking to the input
> portion. That is not exposed, and that is by design. As that could
> possibly make all tracepoints an ABI, and you'll have people like
> peterz nacking any new tracepoint in the scheduler. He doesn't allow
> trace events anymore because of that exposure.

I'm not arguing for moving to the input-argument scheme, I just used
this as an hypothetical example to show why we should not expose
internal implementation details to userspace which will prevent future
evolution only for the sake of having debugging knobs.

Thanks,

Mathieu

Steven Rostedt Dec. 15, 2023, 6:43 p.m. UTC | #6

On Fri, 15 Dec 2023 13:25:07 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> I am not against exposing an ABI that allows userspace to alter the
> filter behavior. I disagree on the way you plan to expose the ABI.

These are no different than the knobs for sched_debug

> 
> Exposing this option as an ABI in this way exposes too much internal
> ring buffer implementation details to userspace.

There's already lots of exposure using options. The options are just
knobs, nothing more.

> 
> It exposes the following details which IMHO should be hidden or
> configurable in a way that allows moving to a whole new mechanism
> which will have significantly different characteristics in the
> future:
> 
> It exposes that:
> 
> - filtering uses a copy to a temporary buffer, and
> - that this copy is enabled by default.
> 
> Once exposed, those design constraints become immutable due to ABI.

No it is not. There is no such thing as "immutable ABI". The rule is
"don't break user space" If this functionality in the kernel goes away,
the knob could become a nop, and I doubt any user space will break
because of it.

That is, the only burden is keeping this option exposed. But it could
be just like that light switch that has nothing connected to it. It's
still there, but does nothing if you switch it. This knob can act the
same way. This does not in anyway prevent future innovation.

> 
> > 
> > The option is just a way to say "I don't want to do the copy into the
> > buffer, I want to go directly into it"  
> 
> My main concern is how this concept, once exposed to userspace,
> becomes not only an internal implementation detail, but a fundamental
> part of the design which cannot ever go away without breaking the ABI
> or making parts of the ABI irrelevant.

Again, it's not about breaking ABI. Linus himself stated that Linux
constantly breaks ABI. As long as breaking ABI doesn't break user
space, it's OK. I've broken tracing ABI several times over the last
decade, and nobody complained. It's *how" you break it. It's similar to
the tree falling in the forest. If you break ABI and no user space
application notices, did you really break ABI?

> 
> I can make a parallel with the scheduler: this is as if the sched
> feature knobs (which are there only for development/debugging purposes)
> would all be exposed as userspace ABI. This would seriously
> limit the evolution of the scheduler design in the future. I see this
> as the same problem at the ring buffer level.

Heh, I mentioned sched before reading this. Again, it's whether or not
userspace can notice if behavior changes or not. If it can't notice, it
didn't break.

> 
> >   
> >>
> >> I don't care about the implementation, I care about the ABI, and
> >> I feel that your reply is not addressing my concern at all.  
> > 
> > Maybe I don't understand your concern.
> > 
> > It's an on/off switch (like all options are). This is just a way to say
> > "I want to indirect copying of the event before filtering or not".  
> 
> Not all tracefs options are booleans. The "current_tracer" file ABI

"current_trace" is not an option. Look in /sys/kernel/tracing/options.
They are all boolean.

> exposes a string input/output parameter. I would recommend the same
> for the equivalent of a "current_filter" file.

I still do not see the benefit. The "current_filter" would expose just
as much implementation that I can see. This option is just an knob to
turn it on or off. Most options, nobody knows about, and are seldom
used by anyone but me ;-)

Once you put a file in the main directory, it become much more likely
to be used.

> 
> > 
> > The "input-argument" part above may never happen. What's different
> > between tracefs and LTTng, is that all events are created by the
> > subsystem not by me. You don't use the TRACE_EVENT() macro, but you
> > need to manually create each event you care about yourself. It's more
> > of a burden but you also then have the luxury of hooking to the input
> > portion. That is not exposed, and that is by design. As that could
> > possibly make all tracepoints an ABI, and you'll have people like
> > peterz nacking any new tracepoint in the scheduler. He doesn't allow
> > trace events anymore because of that exposure.  
> 
> I'm not arguing for moving to the input-argument scheme, I just used
> this as an hypothetical example to show why we should not expose
> internal implementation details to userspace which will prevent future
> evolution only for the sake of having debugging knobs.

Again, this is just a knob in the options directory, and not a full
fledge feature. Options are not always guaranteed to be there,
depending on the config. There's some options that are even created by
what tracer is in "current_tracer".

I added this mainly to be able to add a kselftest to stress test the
discard code. If it would make you feel better, I could even add a
config around it to have it only exposed if the config is enabled.

-- Steve

Mathieu Desnoyers Dec. 15, 2023, 7:08 p.m. UTC | #7

On 2023-12-15 13:43, Steven Rostedt wrote:
> On Fri, 15 Dec 2023 13:25:07 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>> I am not against exposing an ABI that allows userspace to alter the
>> filter behavior. I disagree on the way you plan to expose the ABI.
> 
> These are no different than the knobs for sched_debug

These are very much different. The sched features are enabled at
build-time by modifying kernel/sched/features.h. There is no userspace
ABI involved.

> 
>>
>> Exposing this option as an ABI in this way exposes too much internal
>> ring buffer implementation details to userspace.
> 
> There's already lots of exposure using options. The options are just
> knobs, nothing more.
> 
>>
>> It exposes the following details which IMHO should be hidden or
>> configurable in a way that allows moving to a whole new mechanism
>> which will have significantly different characteristics in the
>> future:
>>
>> It exposes that:
>>
>> - filtering uses a copy to a temporary buffer, and
>> - that this copy is enabled by default.
>>
>> Once exposed, those design constraints become immutable due to ABI.
> 
> No it is not. There is no such thing as "immutable ABI". The rule is
> "don't break user space" If this functionality in the kernel goes away,
> the knob could become a nop, and I doubt any user space will break
> because of it.
> 
> That is, the only burden is keeping this option exposed. But it could
> be just like that light switch that has nothing connected to it. It's
> still there, but does nothing if you switch it. This knob can act the
> same way. This does not in anyway prevent future innovation.

I am not comfortable with exposing internal ring buffer implementation
details to userspace which may or may not be deprecated as no-ops
in the future. This will lead to useless clutter.

One approach that might be somewhat better would be to only expose
those files when a new CONFIG_TRACING_DEBUG option is enabled. This
way userspace cannot rely on having those files present as part
of the ABI, but you would still be free to use them for selftests
by skipping the specific selftests if the config option is not
enabled.

Thanks,

Mathieu

Masami Hiramatsu (Google) Dec. 17, 2023, 8:10 a.m. UTC | #8

On Fri, 15 Dec 2023 14:08:40 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2023-12-15 13:43, Steven Rostedt wrote:
> > On Fri, 15 Dec 2023 13:25:07 -0500
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> I am not against exposing an ABI that allows userspace to alter the
> >> filter behavior. I disagree on the way you plan to expose the ABI.
> > 
> > These are no different than the knobs for sched_debug
> 
> These are very much different. The sched features are enabled at
> build-time by modifying kernel/sched/features.h. There is no userspace
> ABI involved.
> 
> > 
> >>
> >> Exposing this option as an ABI in this way exposes too much internal
> >> ring buffer implementation details to userspace.
> > 
> > There's already lots of exposure using options. The options are just
> > knobs, nothing more.
> > 
> >>
> >> It exposes the following details which IMHO should be hidden or
> >> configurable in a way that allows moving to a whole new mechanism
> >> which will have significantly different characteristics in the
> >> future:
> >>
> >> It exposes that:
> >>
> >> - filtering uses a copy to a temporary buffer, and
> >> - that this copy is enabled by default.
> >>
> >> Once exposed, those design constraints become immutable due to ABI.
> > 
> > No it is not. There is no such thing as "immutable ABI". The rule is
> > "don't break user space" If this functionality in the kernel goes away,
> > the knob could become a nop, and I doubt any user space will break
> > because of it.
> > 
> > That is, the only burden is keeping this option exposed. But it could
> > be just like that light switch that has nothing connected to it. It's
> > still there, but does nothing if you switch it. This knob can act the
> > same way. This does not in anyway prevent future innovation.
> 
> I am not comfortable with exposing internal ring buffer implementation
> details to userspace which may or may not be deprecated as no-ops
> in the future. This will lead to useless clutter.

Hmm, but this may change the ring buffer consumption rate if the filter
is enabled. The ring buffer may be filled quickly than the user expected.
Thus if user specifies the rare condition, most of the events on the ring
buffer is filled with garbage. And user will know the buffer size *seems*
smaller than the setting.
I think copying overhead will be a secondary effect, the biggest noticable
difference is how many events are recorded in the ring buffer. Thus, what
about naming the option as "filter-on-buffer"?

If we introduce filtering on input directly, at that point we will use
it if "filter-on-buffer = no", because this is also not noticable from
users.

Thank you,

Steven Rostedt Dec. 17, 2023, 11:56 p.m. UTC | #9

On Sun, 17 Dec 2023 17:10:45 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > >> It exposes the following details which IMHO should be hidden or
> > >> configurable in a way that allows moving to a whole new mechanism
> > >> which will have significantly different characteristics in the
> > >> future:
> > >>
> > >> It exposes that:
> > >>
> > >> - filtering uses a copy to a temporary buffer, and
> > >> - that this copy is enabled by default.
> > >>
> > >> Once exposed, those design constraints become immutable due to ABI.  
> > > 
> > > No it is not. There is no such thing as "immutable ABI". The rule is
> > > "don't break user space" If this functionality in the kernel goes away,
> > > the knob could become a nop, and I doubt any user space will break
> > > because of it.
> > > 
> > > That is, the only burden is keeping this option exposed. But it could
> > > be just like that light switch that has nothing connected to it. It's
> > > still there, but does nothing if you switch it. This knob can act the
> > > same way. This does not in anyway prevent future innovation.  
> > 
> > I am not comfortable with exposing internal ring buffer implementation
> > details to userspace which may or may not be deprecated as no-ops
> > in the future. This will lead to useless clutter.  
> 
> Hmm, but this may change the ring buffer consumption rate if the filter
> is enabled. The ring buffer may be filled quickly than the user expected.

WHich it has been since 0fc1b09ff1ff4 ("tracing: Use temp buffer when
filtering events"), and before that commit, things were a bit slower.
That commit sped things up. But, even with that commit, there's no
guarantee that you will get to use the temp buffer and just write
directly into the ring buffer.

And I found that currently histograms (and sythetic events!) also write
directly into the buffer :-p

> Thus if user specifies the rare condition, most of the events on the ring
> buffer is filled with garbage. And user will know the buffer size *seems*
> smaller than the setting.

Not sure what you mean by that. The event on the buffer is removed
unless another event sneaks in and makes it impossible to reset the
next write location before the discarded event.

> I think copying overhead will be a secondary effect, the biggest noticable
> difference is how many events are recorded in the ring buffer. Thus, what
> about naming the option as "filter-on-buffer"?

I'm not sure I understand that. How about just call it "filter_direct",
which means to write directly on the buffer, and default that off.

> 
> If we introduce filtering on input directly, at that point we will use
> it if "filter-on-buffer = no", because this is also not noticable from
> users.

The "filter on input" will be a different interface, as the current
filter is only on the output of TRACE_EVENT() fields. The input
parameters isn't exposed at all, and may never be, as that would make
peterz and others keep all tracepoints from their subsystems.

As my main motivation for this was to create a kselftest that can
stress test the filtering directly into the ring buffer (like it use
to, and like it does for interrupting events and histograms and
synthetic events), we can still add tests to make sure that part works.

I'm fine slapping a Kconfig of CONFIG_TRACE_FORCE_DIRECT_KNOB and
place the documentation in the help content saying it adds a knob to
allow kselftest stress test the direct to ring buffer filtering.

-- Steve

diff mbox series

Patch

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 23572f6697c0..7fe96da34962 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1239,6 +1239,29 @@  Here are the available options:
 	When the free_buffer is closed, tracing will
 	stop (tracing_on set to 0).
 
+  disable-filter-buf
+	Normally, when the filter is enabled, a temporary buffer is
+	created to copy the event data into it to perform the
+	filtering logic. If the filter passes and the event should
+	be recorded, then the event is copied from the temporary
+	buffer into the ring buffer. If the event is to be discarded
+	then it is simply dropped. If another event comes in via
+	an interrupt, it will not use the temporary buffer as it is
+	busy and will write directly into the ring buffer.
+
+	This option will disable the temporary buffer and always
+	write into the ring buffer. This will avoid the copy when
+	the event is to be recorded, but also adds a bit more
+	overhead on the discard, and if another event were to interrupt
+	the event that is to be discarded, then the event will not
+	be removed from the ring buffer but instead converted to
+	padding that will not be read by the reader. Padding will
+	still take up space on the ring buffer.
+
+	This option can be beneficial if most events are recorded and
+	not discarded, or simply for debugging the discard functionality
+	of the ring buffer.
+
   irq-info
 	Shows the interrupt, preempt count, need resched data.
 	When disabled, the trace looks like::
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 199df497db07..41b674b1b809 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5398,6 +5398,8 @@  int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set)
 	return 0;
 }
 
+static int __tracing_set_filter_buffering(struct trace_array *tr, bool set);
+
 int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
 {
 	int *map;
@@ -5451,6 +5453,9 @@  int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
 	if (mask == TRACE_ITER_FUNC_FORK)
 		ftrace_pid_follow_fork(tr, enabled);
 
+	if (mask == TRACE_ITER_DISABLE_FILTER_BUF)
+		__tracing_set_filter_buffering(tr, enabled);
+
 	if (mask == TRACE_ITER_OVERWRITE) {
 		ring_buffer_change_overwrite(tr->array_buffer.buffer, enabled);
 #ifdef CONFIG_TRACER_MAX_TRACE
@@ -6464,7 +6469,7 @@  static void tracing_set_nop(struct trace_array *tr)
 {
 	if (tr->current_trace == &nop_trace)
 		return;
-	
+
 	tr->current_trace->enabled--;
 
 	if (tr->current_trace->reset)
@@ -7534,27 +7539,29 @@  u64 tracing_event_time_stamp(struct trace_buffer *buffer, struct ring_buffer_eve
 	return ring_buffer_event_time_stamp(buffer, rbe);
 }
 
-/*
- * Set or disable using the per CPU trace_buffer_event when possible.
- */
-int tracing_set_filter_buffering(struct trace_array *tr, bool set)
+static int __tracing_set_filter_buffering(struct trace_array *tr, bool set)
 {
-	int ret = 0;
-
-	mutex_lock(&trace_types_lock);
-
 	if (set && tr->no_filter_buffering_ref++)
-		goto out;
+		return 0;
 
 	if (!set) {
-		if (WARN_ON_ONCE(!tr->no_filter_buffering_ref)) {
-			ret = -EINVAL;
-			goto out;
-		}
+		if (WARN_ON_ONCE(!tr->no_filter_buffering_ref))
+			return -EINVAL;
 
 		--tr->no_filter_buffering_ref;
 	}
- out:
+	return 0;
+}
+
+/*
+ * Set or disable using the per CPU trace_buffer_event when possible.
+ */
+int tracing_set_filter_buffering(struct trace_array *tr, bool set)
+{
+	int ret;
+
+	mutex_lock(&trace_types_lock);
+	ret = __tracing_set_filter_buffering(tr, set);
 	mutex_unlock(&trace_types_lock);
 
 	return ret;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 0489e72c8169..a9529943cee2 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1250,6 +1250,7 @@  extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		C(EVENT_FORK,		"event-fork"),		\
 		C(PAUSE_ON_TRACE,	"pause-on-trace"),	\
 		C(HASH_PTR,		"hash-ptr"),	/* Print hashed pointer */ \
+		C(DISABLE_FILTER_BUF,	"disable-filter-buffer"),\
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\