tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()

Message ID 20240205065340.2848065-1-svens@linux.ibm.com
State New
Series tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()

Commit Message

Sven Schnelle Feb. 5, 2024, 6:53 a.m. UTC
  tracer_tracing_is_on() checks whether record_disabled is not zero. This
checks both the record_disabled counter and the RB_BUFFER_OFF flag.
Reading the source, it looks like this function should only check the
RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on().
This fixes spurious failures in the 'test for function traceon/off triggers'
test from the ftrace testsuite when the system is under load.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Sven Schnelle Feb. 5, 2024, 1:16 p.m. UTC | #1
Hi Steven,

Steven Rostedt <rostedt@goodmis.org> writes:

> On Mon,  5 Feb 2024 07:53:40 +0100
> Sven Schnelle <svens@linux.ibm.com> wrote:
>
>> tracer_tracing_is_on() checks whether record_disabled is not zero. This
>> checks both the record_disabled counter and the RB_BUFFER_OFF flag.
>> Reading the source it looks like this function should only check for
>> the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on().
>> This fixes spurious fails in the 'test for function traceon/off triggers'
>> test from the ftrace testsuite when the system is under load.
>> 
>
> I've seen these spurious failures too, but haven't looked deeper into
> it. Thanks,

Another issue I'm hitting sometimes is this part:

csum1=`md5sum trace`
sleep $SLEEP_TIME
csum2=`md5sum trace`

if [ "$csum1" != "$csum2" ]; then
    fail "Tracing file is still changing"
fi

This is because the command line was replaced in the
saved_cmdlines_buffer; an example diff between the two files
is:

       ftracetest-17950   [005] ..... 344507.002490: sched_process_wait: comm=ftracetest pid=0 prio=120
       ftracetest-17950   [005] ..... 344507.002492: sched_process_wait: comm=ftracetest pid=0 prio=120
- stress-ng-fanot-17820   [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=10000054 [ns]
+           <...>-17820   [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=10000054 [ns]
       ftracetest-17950   [005] d.h.. 344507.009901: sched_stat_runtime: comm=ftracetest pid=17950 runtime=7417915 [ns]
  stress-ng-fanot-17819   [003] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=9983473 [ns]
- stress-ng-fanot-17820   [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=9999865 [ns]
+           <...>-17820   [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=9999865 [ns]
  stress-ng-fanot-17819   [004] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=8388039 [ns]

This can be improved by:

echo 32768 > /sys/kernel/tracing/saved_cmdlines_size

But this is of course not a fix - should we maybe replace the program
name with <...> before comparing, remove the check completely, or do
anything else? What do you think?

Thanks,
Sven
  
Sven Schnelle Feb. 5, 2024, 3:09 p.m. UTC | #2
Hi Steven,

Steven Rostedt <rostedt@goodmis.org> writes:

> On Mon, 05 Feb 2024 14:16:30 +0100
> Sven Schnelle <svens@linux.ibm.com> wrote:
>> 
>> Another issue i'm hitting sometimes is this part:
>> 
>> csum1=`md5sum trace`
>> sleep $SLEEP_TIME
>> csum2=`md5sum trace`
>> 
>> if [ "$csum1" != "$csum2" ]; then
>>     fail "Tracing file is still changing"
>> fi
>> 
>> This is because the command line was replaced in the
>> saved_cmdlines_buffer, an example diff between both files
>> is:
>
> [..]
>
>> 
>> This can be improved by:
>> 
>> echo 32768 > /sys/kernel/tracing/saved_cmdlines_size
>> 
>> But this is of course not a fix - should we maybe replace the program
>> name with <...> before comparing, remove the check completely, or do
>> anything else? What do you think?
>
> Hmm, actually I would say that this exposes a real bug. Not a major
> one, but one that I find annoying. The saved commandlines should only
> be updated when a trace event occurs. But really, it should only be
> updated if one is added to the ring buffer. If the ring buffer isn't
> being updated, we shouldn't be adding new command lines.
>
> There may be a location that has tracing off but still updating the
> cmdlines which will break the saved cache.

Ok, my understanding is that it will overwrite the entry in the list if
another process comes up with the same PID. But I haven't read the code
carefully - let me do that now.
  
Sven Schnelle Feb. 6, 2024, 6:32 a.m. UTC | #3
Steven Rostedt <rostedt@goodmis.org> writes:

> On Mon, 05 Feb 2024 14:16:30 +0100
> Sven Schnelle <svens@linux.ibm.com> wrote:
>>
>> Another issue i'm hitting sometimes is this part:
>>
>> csum1=`md5sum trace`
>> sleep $SLEEP_TIME
>> csum2=`md5sum trace`
>>
>> if [ "$csum1" != "$csum2" ]; then
>>     fail "Tracing file is still changing"
>> fi
>>
>> This is because the command line was replaced in the
>> saved_cmdlines_buffer, an example diff between both files
>> is:
>
> [..]
>
>>
>> This can be improved by:
>>
>> echo 32768 > /sys/kernel/tracing/saved_cmdlines_size
>>
>> But this is of course not a fix - should we maybe replace the program
>> name with <...> before comparing, remove the check completely, or do
>> anything else? What do you think?
>
> Hmm, actually I would say that this exposes a real bug. Not a major
> one, but one that I find annoying. The saved commandlines should only
> be updated when a trace event occurs. But really, it should only be
> updated if one is added to the ring buffer. If the ring buffer isn't
> being updated, we shouldn't be adding new command lines.
>
> There may be a location that has tracing off but still updating the
> cmdlines which will break the saved cache.

Looking at trace_save_cmdline():

tpid = tsk->pid & (PID_MAX_DEFAULT - 1); where PID_MAX_DEFAULT = 0x8000

so this is basically

tpid = tsk->pid & 0x7fff;

further on:

        // might clash with other pid if (otherpid & 0x7fff) == (tsk->pid & 0x7fff)
	idx = savedcmd->map_pid_to_cmdline[tpid];
	if (idx == NO_CMDLINE_MAP) {
                // This will pick an existing entry if there are
                // more than cmdline_num entries present
		idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num;      
		savedcmd->map_pid_to_cmdline[tpid] = idx;
		savedcmd->cmdline_idx = idx;
	}

So I think the problem that sometimes '<...>' is logged instead of the
correct comm is just expected behaviour, given the code above.
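
For completeness, the reader side is where the '<...>' actually comes from:
when the slot for a pid was recycled by another task, the lookup cannot trust
the stored comm and falls back to the placeholder. A rough sketch of that
lookup, paraphrased from __trace_find_cmdline() in kernel/trace/trace.c
(field and helper names from memory; details vary by kernel version):

static void __trace_find_cmdline(int pid, char comm[])
{
	unsigned int map, tpid;

	if (!pid) {
		strcpy(comm, "<idle>");
		return;
	}

	tpid = pid & (PID_MAX_DEFAULT - 1);
	map = savedcmd->map_pid_to_cmdline[tpid];
	if (map != NO_CMDLINE_MAP) {
		/*
		 * The slot may meanwhile belong to another task whose pid
		 * hashes to the same tpid, so verify the stored pid.
		 */
		if (savedcmd->map_cmdline_to_pid[map] == pid) {
			strscpy(comm, get_saved_cmdlines(map), TASK_COMM_LEN);
			return;
		}
	}
	/* Slot missing or recycled: this is what shows up as <...>. */
	strcpy(comm, "<...>");
}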
  
Mete Durlu Feb. 6, 2024, 7:05 a.m. UTC | #4
On 2/5/24 07:53, Sven Schnelle wrote:
> tracer_tracing_is_on() checks whether record_disabled is not zero. This
> checks both the record_disabled counter and the RB_BUFFER_OFF flag.
> Reading the source it looks like this function should only check for
> the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on().
> This fixes spurious fails in the 'test for function traceon/off triggers'
> test from the ftrace testsuite when the system is under load.
> 
> Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
> ---
>   kernel/trace/trace.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 2a7c6fd934e9..47e221e1e720 100644

Tested-By: Mete Durlu <meted@linux.ibm.com>
  
Sven Schnelle Feb. 6, 2024, 8:48 a.m. UTC | #5
Sven Schnelle <svens@linux.ibm.com> writes:

> Looking at trace_save_cmdline():
>
> tpid = tsk->pid & (PID_MAX_DEFAULT - 1); where PID_MAX_DEFAULT = 0x8000
>
> so this is basically
>
> tpid = tsk->pid & 0x7fff;
>
> further on:
>
>         // might clash with other pid if (otherpid & 0x7fff) == (tsk->pid & 0x7fff)
> 	idx = savedcmd->map_pid_to_cmdline[tpid];
> 	if (idx == NO_CMDLINE_MAP) {
>                 // This will pick an existing entry if there are
>                 // more than cmdline_num entries present
> 		idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num;      
> 		savedcmd->map_pid_to_cmdline[tpid] = idx;
> 		savedcmd->cmdline_idx = idx;
> 	}
>
> So i think the problem that sometimes '<...>' instead of the correct
> comm is logged is just expected behaviour given the code above. 

I added some logging, and the test is not triggering this issue. So I
assume the default of 128 cmdline entries is just too small. Sorry for
the noise. Let's see whether we're still triggering some failures with
the other fix applied in CI. If we do, maybe we want to increase
saved_cmdlines_size for the ftrace test suite.
  
Sven Schnelle Feb. 7, 2024, 5:50 a.m. UTC | #6
Hi Steven,

Steven Rostedt <rostedt@goodmis.org> writes:

> On Tue, 06 Feb 2024 09:48:16 +0100
> Sven Schnelle <svens@linux.ibm.com> wrote:
>
>> I added some logging, and the test is not triggering this issue. So i
>> assume the default of 128 cmdline entries is just to small. Sorry for
>> the noise. Lets see whether we're still triggering some failures with
>> the other fix applied in CI. If we do, maybe we want to increase the
>> saved_cmdline_size for the ftrace test suite.
>
> I wonder if it is a good idea to increase the size when tracing starts,
> like the ring buffer expanding code. Maybe to 1024? Or is that still
> too small?

Not sure whether that is enough, have to test. However, it's not really
a fix; it would just require a bit more load and the test would fail
again. The fundamental problem is that, even after disabling tracing,
a trace line might still be added due to the lockless nature of the
ring buffer. This might then replace an existing cmdline entry.
So maybe we should change the test to ignore the program name when
calculating the checksums.
  
Mete Durlu Feb. 7, 2024, 12:07 p.m. UTC | #7
On 2/7/24 12:09, Steven Rostedt wrote:
> On Wed, 07 Feb 2024 06:50:59 +0100
> Sven Schnelle <svens@linux.ibm.com> wrote:
> 
>> Hi Steven,
> 
>> Not sure whether that is enough, have to test. However, it's not really
>> a fix, it would just require a bit more load and the test would fail
>> again. The fundamental problem is that even after disabling tracing
>> there might be some tracing line added due to the lockless nature of
>> the ringbuffer. This might then replace some existing cmdline entry.
>> So maybe we should change the test to ignore the program name when
>> calculating the checksums.
> 
> Have you figured out what caused the cmdlines to change when tracing is
> off. It shouldn't be updated even with just 128 entries.
> 
> I'm also looking into a way to keep more of a LRU command lines around,
> but nothing concrete yet.
> 
> -- Steve

Hi,

Wouldn't the following scenario explain the behavior we are seeing?
When using event triggers, tracing uses the lockless ring buffer control
paths. If the cmdline update and the reading of the trace output happen
on different cpus, the ordering can get messed up.

1. The event happens and the trace trigger tells the ring buffer to stop
    writes.
2. (on cpu#1) The test calculates a checksum of the current state of the
    trace output.
3. (on cpu#2) Not yet knowing about the trace buffer's status, the writer
    adds one last entry, which collides with a pid in the cmdline map,
    before actually stopping. While the checksum is being calculated
    (on cpu#1), the new saved_cmdlines entry waits for the spinlock to be
    released and then gets added.
4. The test calculates the checksum again and finds that the trace output
    has changed; <...> is shown for the collided pid.
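
As far as I can tell, the "stop writes" in step 1 ends up in
tracer_tracing_off(), which only sets the permanent off bit and returns; it
does not wait for writers that have already passed their record_disabled
check. A rough sketch, paraphrased from kernel/trace/trace.c and
kernel/trace/ring_buffer.c (details vary by kernel version):

void tracer_tracing_off(struct trace_array *tr)
{
	if (tr->array_buffer.buffer)
		ring_buffer_record_off(tr->array_buffer.buffer);
	tr->buffer_disabled = 1;
	/* Make the flag visible, but don't wait for in-flight writers. */
	smp_wmb();
}

void ring_buffer_record_off(struct trace_buffer *buffer)
{
	unsigned int rd, new_rd;

	/* Atomically OR in the permanent off bit and return immediately. */
	rd = atomic_read(&buffer->record_disabled);
	do {
		new_rd = rd | RB_BUFFER_OFF;
	} while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
}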
  
Sven Schnelle Feb. 7, 2024, 1:33 p.m. UTC | #8
Steven Rostedt <rostedt@goodmis.org> writes:

> On Wed, 7 Feb 2024 13:07:36 +0100
> Mete Durlu <meted@linux.ibm.com> wrote:
>
>> wouldn't the following scenario explain the behavior we are seeing.
>> When using event triggers, trace uses lockless ringbuffer control paths.
>> If cmdline update and trace output reading is happening on different
>> cpus, the ordering can get messed up.
>>
>> 1. event happens and trace trigger tells ring buffer to stop writes
>> 2. (on cpu#1)test calculates checksum on current state of trace
>>     output.
>> 3. (on cpu#2)not knowing about the trace buffers status yet, writer adds
>>     a one last entry which would collide with a pid in cmdline map before
>>     actually stopping. While (on cpu#1) checksum is being calculated, new
>>     saved cmdlines entry is waiting for spinlocks to be unlocked and then
>>     gets added.
>> 4. test calculates checksum again and finds that the trace output has
>>     changed. <...> is put on collided pid.
>
> But the failure is here:
>
> on=`cat tracing_on`
> if [ $on != "0" ]; then
>     fail "Tracing is not off"
> fi

It might be misleading because we're discussing two issues in one
thread. The failure above was one problem, which the initial fix
is for. The other issue we're still seeing is the test below:

> csum1=`md5sum trace`
> sleep $SLEEP_TIME
> csum2=`md5sum trace`
>
> if [ "$csum1" != "$csum2" ]; then
>     fail "Tracing file is still changing"
> fi
>
> 1. tracing is off
> 2. do checksum of trace
> 3. sleep
> 4. do another checksum of trace
> 5. compare the two checksums
>
> Now how did they come up differently in that amount of time? The
> saved_cmdlines really should not have been updated.

My assumption without reading the code is that something like this
happens:

CPU0                             CPU1
[ringbuffer enabled]
                                 ring_buffer_write()
                                     if (atomic_read(&buffer->record_disabled))
                                            goto out;
echo 0 > tracing_on
record_disabled |= RB_BUFFER_OFF
csum1=`md5sum trace`
                                 [adds trace entry to ring buffer,
                                  overwriting a saved_cmdlines entry because
                                  it thinks the ring buffer is enabled]
csum2=`md5sum trace`
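
The writer side gate being raced against is just a plain atomic_read() done
with preemption disabled; a writer that passed it before RB_BUFFER_OFF was
set still reserves and commits one more event (and with it a new
saved_cmdlines entry). Simplified sketch, paraphrased from
ring_buffer_lock_reserve() in kernel/trace/ring_buffer.c (several checks
trimmed; details vary by kernel version):

struct ring_buffer_event *
ring_buffer_lock_reserve(struct trace_buffer *buffer, unsigned long length)
{
	struct ring_buffer_per_cpu *cpu_buffer;
	struct ring_buffer_event *event;
	int cpu;

	preempt_disable_notrace();

	/*
	 * A writer that reads record_disabled here before tracing_off()
	 * sets RB_BUFFER_OFF happily reserves an event; nothing waits for
	 * it when the flag is flipped on another CPU.
	 */
	if (unlikely(atomic_read(&buffer->record_disabled)))
		goto out;

	cpu = raw_smp_processor_id();
	cpu_buffer = buffer->buffers[cpu];

	if (unlikely(atomic_read(&cpu_buffer->record_disabled)))
		goto out;

	event = rb_reserve_next_event(buffer, cpu_buffer, length);
	if (!event)
		goto out;

	return event;

 out:
	preempt_enable_notrace();
	return NULL;
}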
  
Mete Durlu Feb. 8, 2024, 10:25 a.m. UTC | #9
On 2/7/24 16:47, Steven Rostedt wrote:
> On Wed, 07 Feb 2024 14:33:21 +0100
> Sven Schnelle <svens@linux.ibm.com> wrote:
> 
>> My assumption without reading the code is that something like this
>> happens:
>>
>> CPU0                             CPU1
>> [ringbuffer enabled]
>>                                   ring_buffer_write()
>>                                       if (atomic_read(&buffer->record_disabled))
>>                                              goto out;
>> echo 0 > tracing_on
>> record_disabled |= RB_BUFFER_OFF
>> csum1=`md5sum trace`
> 
> Note, the CPU1 is performing with preemption disabled, so for this to
> happen, something really bad happened on CPU0 to delay preempt disabled
> section so long to allow the trace to be read. Perhaps we should have
> the return of the echo 0 > tracing_on require a synchronize_rcu() to
> make sure all ring buffers see it disabled before it returns.
> 
> But unless your system is doing something really stressed to cause the
> preempt disabled section to take so long, I highly doubt this was the
> race.
> 

I have only been able to reliably reproduce this issue when the system
is under load from stressors. But I am not sure whether that can be
considered *really stressed*.

system : 8 cpus (4 physical cores)
load   : stress-ng --fanotify 1 (or --context 2)
result : ~5/10 test fails

Of course the test starts to fail more often as the load increases, but a
single stressor doesn't seem like much to me for a 4 core machine.

After adding synchronize_rcu() plus the patch from Sven, I am no longer
seeing failures with the setup above. So it seems like synchronize_rcu()
did the trick (or at least it helps a lot) for the case described in the
previous mail. I couldn't trigger the failure yet, not even with
increased load (but now the test case takes > 5 mins to finish :) ).

Here is the diff:

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
@@ -9328,10 +9328,12 @@ rb_simple_write(struct file *filp, const char __user *ubuf,
                         val = 0; /* do nothing */
                 } else if (val) {
                         tracer_tracing_on(tr);
+                       synchronize_rcu();
                         if (tr->current_trace->start)
                                 tr->current_trace->start(tr);
                 } else {
                         tracer_tracing_off(tr);
+                       synchronize_rcu();
                         if (tr->current_trace->stop)
                                 tr->current_trace->stop(tr);

Not 100% sure if these were the correct places to add them.
  
Mete Durlu Feb. 12, 2024, 10:54 p.m. UTC | #10
On 2/12/24 19:53, Steven Rostedt wrote:
> 
> Right, it will definitely force the race window to go away.
> 
> Can you still trigger this issue with just Sven's patch and not this change?

Sven's patch makes the test cases much more resilient. I needed to ramp
the load up to 4 stressors on a 4 core system before the test started to
fail again. At this point it is fair to say that the system is under
significant load.

The failing condition is now different, which tells me that
Sven's patch has already solved (or improved) the previous issue.
What is failing now:

   echo '** ENABLE TRACING'
   enable_tracing

   cnt=`cnt_trace`
   if [ $cnt -eq 0 ]; then
      fail "Nothing found in trace"
   fi

Adding a sleep between enable_tracing and cnt_trace seems to improve the
situation a lot (my guess is that the trace writer isn't getting any
runtime before the test checks the trace output):

   echo '** ENABLE TRACING'
   enable_tracing

  +sleep $SLEEP_TIME
   cnt=`cnt_trace`
   if [ $cnt -eq 0 ]; then
      fail "Nothing found in trace"
   fi

I didn't see any failures while increasing the system load until reaching
8 stressors (this might have been luck). When the system was under load
with 8 stressors, I saw the test fail 2 out of 10 times. Definitely an
improvement over the current situation.
  

Patch

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..47e221e1e720 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1532,7 +1532,7 @@  void disable_trace_on_warning(void)
 bool tracer_tracing_is_on(struct trace_array *tr)
 {
 	if (tr->array_buffer.buffer)
-		return ring_buffer_record_is_on(tr->array_buffer.buffer);
+		return ring_buffer_record_is_set_on(tr->array_buffer.buffer);
 	return !tr->buffer_disabled;
 }