[rcu/dev,1/3] net: Use call_rcu_flush() for qdisc_free_cb

Message ID 20221117031551.1142289-1-joel@joelfernandes.org
State New

Commit Message

Joel Fernandes Nov. 17, 2022, 3:15 a.m. UTC
In a networking test on ChromeOS, we find that enabling the new CONFIG_RCU_LAZY
causes the test to fail in the teardown phase.

The failure happens during: ip netns del <name>

Using ftrace, I found the callbacks it was queuing; this series fixes them. Use
call_rcu_flush() to revert to the old behavior. With that, the test passes.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 net/sched/sch_generic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Eric Dumazet Nov. 17, 2022, 9:44 p.m. UTC | #1
On Wed, Nov 16, 2022 at 7:16 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> In a networking test on ChromeOS, we find that using the new CONFIG_RCU_LAZY
> causes a networking test to fail in the teardown phase.
>
> The failure happens during: ip netns del <name>
>
> Using ftrace, I found the callbacks it was queuing which this series fixes. Use
> call_rcu_flush() to revert to the old behavior. With that, the test passes.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  net/sched/sch_generic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index a9aadc4e6858..63fbf640d3b2 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1067,7 +1067,7 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>
>         trace_qdisc_destroy(qdisc);
>
> -       call_rcu(&qdisc->rcu, qdisc_free_cb);
> +       call_rcu_flush(&qdisc->rcu, qdisc_free_cb);
>  }

I took a look at this one.

qdisc_free_cb() essentially frees some per-cpu memory and the
'struct Qdisc'.

I do not see why we need to force a flush for this (small?) piece of memory.
  
Joel Fernandes Nov. 17, 2022, 9:58 p.m. UTC | #2
> On Nov 17, 2022, at 4:44 PM, Eric Dumazet <edumazet@google.com> wrote:
> 
> On Wed, Nov 16, 2022 at 7:16 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
>> 
>> In a networking test on ChromeOS, we find that using the new CONFIG_RCU_LAZY
>> causes a networking test to fail in the teardown phase.
>> 
>> The failure happens during: ip netns del <name>
>> 
>> Using ftrace, I found the callbacks it was queuing which this series fixes. Use
>> call_rcu_flush() to revert to the old behavior. With that, the test passes.
>> 
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>> net/sched/sch_generic.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index a9aadc4e6858..63fbf640d3b2 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -1067,7 +1067,7 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>> 
>>        trace_qdisc_destroy(qdisc);
>> 
>> -       call_rcu(&qdisc->rcu, qdisc_free_cb);
>> +       call_rcu_flush(&qdisc->rcu, qdisc_free_cb);
>> }
> 
> I took a look at this one.
> 
> qdisc_free_cb() is essentially freeing : Some per-cpu memory, and the
> 'struct Qdisc'
> 
> I do not see why we need to force a flush for this (small ?) piece of memory.

I’ll try to drop that and rerun the test, and get back to you. It could be that this flush() is compensating for a different callback. I am pretty sure that at one point, dropping this patch made the test fail most of the time. Now it passes 100% of the time.

I’ll also attempt to collect a complete trace; maybe I’ll learn some networking code in the process.

Thanks!
  
Joel Fernandes Nov. 18, 2022, 12:23 a.m. UTC | #3
On Thu, Nov 17, 2022 at 01:44:12PM -0800, Eric Dumazet wrote:
> On Wed, Nov 16, 2022 at 7:16 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > In a networking test on ChromeOS, we find that using the new CONFIG_RCU_LAZY
> > causes a networking test to fail in the teardown phase.
> >
> > The failure happens during: ip netns del <name>
> >
> > Using ftrace, I found the callbacks it was queuing which this series fixes. Use
> > call_rcu_flush() to revert to the old behavior. With that, the test passes.
> >
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  net/sched/sch_generic.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index a9aadc4e6858..63fbf640d3b2 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -1067,7 +1067,7 @@ static void qdisc_destroy(struct Qdisc *qdisc)
> >
> >         trace_qdisc_destroy(qdisc);
> >
> > -       call_rcu(&qdisc->rcu, qdisc_free_cb);
> > +       call_rcu_flush(&qdisc->rcu, qdisc_free_cb);
> >  }
> 
> I took a look at this one.
> 
> qdisc_free_cb() is essentially freeing : Some per-cpu memory, and the
> 'struct Qdisc'
> 
> I do not see why we need to force a flush for this (small ?) piece of memory.

Indeed! Just tested and dropping this one still makes the test pass.

I believe this patch was papering over the issues fixed by the other
patches, so it stuck.

I will drop this one and move over to trying your suggestions for 2/3.

Thanks for taking a look,

 - Joel
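
Editorial note: for readers unfamiliar with the distinction debated above, the following is a minimal userspace sketch of the idea, not kernel code. With CONFIG_RCU_LAZY, a plain call_rcu() callback may sit in a queue for some time so that callbacks can be batched, while call_rcu_flush() asks that the queue not be left waiting. All toy_* names are hypothetical and exist only for illustration. The model deliberately ignores grace periods, per-CPU queues, and timers, so "flush" here simply runs everything pending at once, which the real kernel does not do.

```c
#include <stddef.h>

#define TOY_MAX_CBS 16

typedef void (*toy_cb_fn)(void *arg);

/* One queued callback and its argument (hypothetical, not a kernel type). */
struct toy_cb {
	toy_cb_fn fn;
	void *arg;
};

/* A single pending-callback queue; the real kernel keeps per-CPU lists. */
struct toy_queue {
	struct toy_cb pending[TOY_MAX_CBS];
	size_t nr_pending;
};

/* Run every pending callback in queue order, then empty the queue. */
static void toy_flush(struct toy_queue *q)
{
	for (size_t i = 0; i < q->nr_pending; i++)
		q->pending[i].fn(q->pending[i].arg);
	q->nr_pending = 0;
}

/*
 * Lazy enqueue: only record the callback; nothing runs yet.
 * Returns 0 on success, -1 if the toy queue is full.
 */
static int toy_call_lazy(struct toy_queue *q, toy_cb_fn fn, void *arg)
{
	if (q->nr_pending == TOY_MAX_CBS)
		return -1;
	q->pending[q->nr_pending].fn = fn;
	q->pending[q->nr_pending].arg = arg;
	q->nr_pending++;
	return 0;
}

/*
 * Flushing enqueue: record the callback, then drain the whole queue,
 * including callbacks that were queued lazily earlier.
 */
static int toy_call_flush(struct toy_queue *q, toy_cb_fn fn, void *arg)
{
	if (toy_call_lazy(q, fn, arg))
		return -1;
	toy_flush(q);
	return 0;
}

/* Example callback: counts how many "objects" have been freed. */
static void toy_free_cb(void *arg)
{
	int *freed = arg;

	(*freed)++;
}
```

In this model, queuing toy_free_cb with toy_call_lazy() leaves the counter untouched until something later calls toy_call_flush() or toy_flush(). That is the trade-off raised in the review: forcing a flush only pays off when something is actually waiting on the callback, which, per the follow-up test above, is not the case for qdisc_free_cb().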
  

Patch

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a9aadc4e6858..63fbf640d3b2 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1067,7 +1067,7 @@  static void qdisc_destroy(struct Qdisc *qdisc)
 
 	trace_qdisc_destroy(qdisc);
 
-	call_rcu(&qdisc->rcu, qdisc_free_cb);
+	call_rcu_flush(&qdisc->rcu, qdisc_free_cb);
 }
 
 void qdisc_put(struct Qdisc *qdisc)