[v7,2/3] sched/task: Add the put_task_struct_atomic_safe() function

Message ID 20230425114307.36889-3-wander@redhat.com
State New
Series Introduce put_task_struct_atomic_sleep()

Commit Message

Wander Lairson Costa April 25, 2023, 11:43 a.m. UTC
  Due to the possibility of indirectly acquiring sleeping locks, it is
unsafe to call put_task_struct() in atomic contexts when the kernel is
compiled with PREEMPT_RT.

To mitigate this issue, this commit introduces
put_task_struct_atomic_safe(), which schedules __put_task_struct()
through call_rcu() when PREEMPT_RT is enabled. While a workqueue would
be a more natural approach, we cannot allocate dynamic memory (such as
a work item) from atomic context under PREEMPT_RT, which would make
the code more complex.

This makes the final put safe to execute in atomic contexts and avoids
the sleeping-lock acquisitions that plain put_task_struct() can perform
when it drops the last reference.

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reported-by: Hu Chunyu <chuhu@redhat.com>
Reviewed-by: Paul McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/sched/task.h | 35 +++++++++++++++++++++++++++++++++++
 kernel/fork.c              |  8 ++++++++
 2 files changed, 43 insertions(+)
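
For illustration, a minimal sketch of the workqueue variant the changelog
rules out (hypothetical helper names, not part of this series). The blocker
is the allocation: under PREEMPT_RT even kmalloc(GFP_ATOMIC) may acquire
sleeping spinlock_t locks inside the allocator, so the work item cannot be
allocated from atomic context:

struct put_task_work {
	struct work_struct work;
	struct task_struct *task;
};

static void put_task_work_fn(struct work_struct *work)
{
	struct put_task_work *pw = container_of(work, struct put_task_work, work);

	__put_task_struct(pw->task);
	kfree(pw);
}

static inline void put_task_struct_via_wq(struct task_struct *task)
{
	if (refcount_dec_and_test(&task->usage)) {
		/* Not viable: kmalloc() may sleep here on PREEMPT_RT. */
		struct put_task_work *pw = kmalloc(sizeof(*pw), GFP_ATOMIC);

		if (WARN_ON(!pw))
			return; /* sketch only: would leak the task */

		pw->task = task;
		INIT_WORK(&pw->work, put_task_work_fn);
		schedule_work(&pw->work);
	}
}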
  

Comments

Peter Zijlstra May 4, 2023, 8:42 a.m. UTC | #1
On Tue, Apr 25, 2023 at 08:43:02AM -0300, Wander Lairson Costa wrote:
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index b597b97b1f8f..cf774b83b2ec 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -141,6 +141,41 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
>  
>  void put_task_struct_rcu_user(struct task_struct *task);
>  
> +extern void __delayed_put_task_struct(struct rcu_head *rhp);
> +
> +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> +{
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +		/*
> +		 * Decrement the refcount explicitly to avoid unnecessarily
> +		 * calling call_rcu.
> +		 */
> +		if (refcount_dec_and_test(&task->usage))
> +			/*
> +			 * under PREEMPT_RT, we can't call put_task_struct
> +			 * in atomic context because it will indirectly
> +			 * acquire sleeping locks.
> +			 * call_rcu() will schedule __delayed_put_task_struct()
> +			 * to be called in process context.
> +			 *
> +			 * __put_task_struct() is called when
> +			 * refcount_dec_and_test(&t->usage) succeeds.
> +			 *
> +			 * This means that it can't conflict with
> +			 * put_task_struct_rcu_user() which abuses ->rcu the same
> +			 * way; rcu_users has a reference so task->usage can't be
> +			 * zero after rcu_users 1 -> 0 transition.
> +			 *
> +			 * delayed_free_task() also uses ->rcu, but it is only called
> +			 * when it fails to fork a process. Therefore, there is no
> +			 * way it can conflict with put_task_struct().
> +			 */
> +			call_rcu(&task->rcu, __delayed_put_task_struct);
> +	} else {
> +		put_task_struct(task);
> +	}
> +}

Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
has already asked why can't we just rcu free the thing unconditionally.

Google only found me an earlier version of this same patch set, but I'm
sure we've had that discussion many times over the past several years.
The above and your follow up patch is just horrible.

It requires users to know the RT-specific context and gives them no help
whatsoever for !RT builds.
  
Valentin Schneider May 4, 2023, 9:32 a.m. UTC | #2
On 04/05/23 10:42, Peter Zijlstra wrote:
> On Tue, Apr 25, 2023 at 08:43:02AM -0300, Wander Lairson Costa wrote:
>> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
>> index b597b97b1f8f..cf774b83b2ec 100644
>> --- a/include/linux/sched/task.h
>> +++ b/include/linux/sched/task.h
>> @@ -141,6 +141,41 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
>>
>>  void put_task_struct_rcu_user(struct task_struct *task);
>>
>> +extern void __delayed_put_task_struct(struct rcu_head *rhp);
>> +
>> +static inline void put_task_struct_atomic_safe(struct task_struct *task)
>> +{
>> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>> +		/*
>> +		 * Decrement the refcount explicitly to avoid unnecessarily
>> +		 * calling call_rcu.
>> +		 */
>> +		if (refcount_dec_and_test(&task->usage))
>> +			/*
>> +			 * under PREEMPT_RT, we can't call put_task_struct
>> +			 * in atomic context because it will indirectly
>> +			 * acquire sleeping locks.
>> +			 * call_rcu() will schedule __delayed_put_task_struct()
>> +			 * to be called in process context.
>> +			 *
>> +			 * __put_task_struct() is called when
>> +			 * refcount_dec_and_test(&t->usage) succeeds.
>> +			 *
>> +			 * This means that it can't conflict with
>> +			 * put_task_struct_rcu_user() which abuses ->rcu the same
>> +			 * way; rcu_users has a reference so task->usage can't be
>> +			 * zero after rcu_users 1 -> 0 transition.
>> +			 *
>> +			 * delayed_free_task() also uses ->rcu, but it is only called
>> +			 * when it fails to fork a process. Therefore, there is no
>> +			 * way it can conflict with put_task_struct().
>> +			 */
>> +			call_rcu(&task->rcu, __delayed_put_task_struct);
>> +	} else {
>> +		put_task_struct(task);
>> +	}
>> +}
>
> Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> has already asked why can't we just rcu free the thing unconditionally.
>
> Google only found me an earlier version of this same patch set, but I'm
> sure we've had that discussion many times over the past several years.
> The above and your follow up patch is just horrible.
>

So on v3/v4 we got to doing that unconditionally for PREEMPT_RT, but per
[1] Wander went back to hand-fixing the problematic callsites.

Now that I'm looking at it again, I couldn't find a concrete argument from
Oleg against doing this unconditionally - as Wander is pointing out in the
changelog and comments, reusing task_struct.rcu for that purpose is safe
(although not necessarily obviously so).

Is this just miscommunication, or is there a genuine issue with doing this
unconditionally? As argued before, I'd also much rather have this be an
unconditional call_rcu() (regardless of context or PREEMPT_RT).
  
Wander Lairson Costa May 4, 2023, 12:24 p.m. UTC | #3
On Thu, May 04, 2023 at 10:42:29AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 25, 2023 at 08:43:02AM -0300, Wander Lairson Costa wrote:
> > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > index b597b97b1f8f..cf774b83b2ec 100644
> > --- a/include/linux/sched/task.h
> > +++ b/include/linux/sched/task.h
> > @@ -141,6 +141,41 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
> >  
> >  void put_task_struct_rcu_user(struct task_struct *task);
> >  
> > +extern void __delayed_put_task_struct(struct rcu_head *rhp);
> > +
> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> > +{
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > +		/*
> > +		 * Decrement the refcount explicitly to avoid unnecessarily
> > +		 * calling call_rcu.
> > +		 */
> > +		if (refcount_dec_and_test(&task->usage))
> > +			/*
> > +			 * under PREEMPT_RT, we can't call put_task_struct
> > +			 * in atomic context because it will indirectly
> > +			 * acquire sleeping locks.
> > +			 * call_rcu() will schedule __delayed_put_task_struct()
> > +			 * to be called in process context.
> > +			 *
> > +			 * __put_task_struct() is called when
> > +			 * refcount_dec_and_test(&t->usage) succeeds.
> > +			 *
> > +			 * This means that it can't conflict with
> > +			 * put_task_struct_rcu_user() which abuses ->rcu the same
> > +			 * way; rcu_users has a reference so task->usage can't be
> > +			 * zero after rcu_users 1 -> 0 transition.
> > +			 *
> > +			 * delayed_free_task() also uses ->rcu, but it is only called
> > +			 * when it fails to fork a process. Therefore, there is no
> > +			 * way it can conflict with put_task_struct().
> > +			 */
> > +			call_rcu(&task->rcu, __delayed_put_task_struct);
> > +	} else {
> > +		put_task_struct(task);
> > +	}
> > +}
> 
> Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> has already asked why can't we just rcu free the thing unconditionally.
> 
> Google only found me an earlier version of this same patch set, but I'm
> sure we've had that discussion many times over the past several years.
> The above and your follow up patch is just horrible.
> 
> It requires users to know the RT-specific context and gives them no help
> whatsoever for !RT builds.
> 
> 

No problem, I will send a new version doing it unconditionally.
  
Wander Lairson Costa May 4, 2023, 12:24 p.m. UTC | #4
On Thu, May 04, 2023 at 10:32:31AM +0100, Valentin Schneider wrote:
> On 04/05/23 10:42, Peter Zijlstra wrote:
> > On Tue, Apr 25, 2023 at 08:43:02AM -0300, Wander Lairson Costa wrote:
> >> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> >> index b597b97b1f8f..cf774b83b2ec 100644
> >> --- a/include/linux/sched/task.h
> >> +++ b/include/linux/sched/task.h
> >> @@ -141,6 +141,41 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
> >>
> >>  void put_task_struct_rcu_user(struct task_struct *task);
> >>
> >> +extern void __delayed_put_task_struct(struct rcu_head *rhp);
> >> +
> >> +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> >> +{
> >> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> >> +		/*
> >> +		 * Decrement the refcount explicitly to avoid unnecessarily
> >> +		 * calling call_rcu.
> >> +		 */
> >> +		if (refcount_dec_and_test(&task->usage))
> >> +			/*
> >> +			 * under PREEMPT_RT, we can't call put_task_struct
> >> +			 * in atomic context because it will indirectly
> >> +			 * acquire sleeping locks.
> >> +			 * call_rcu() will schedule __delayed_put_task_struct()
> >> +			 * to be called in process context.
> >> +			 *
> >> +			 * __put_task_struct() is called when
> >> +			 * refcount_dec_and_test(&t->usage) succeeds.
> >> +			 *
> >> +			 * This means that it can't conflict with
> >> +			 * put_task_struct_rcu_user() which abuses ->rcu the same
> >> +			 * way; rcu_users has a reference so task->usage can't be
> >> +			 * zero after rcu_users 1 -> 0 transition.
> >> +			 *
> >> +			 * delayed_free_task() also uses ->rcu, but it is only called
> >> +			 * when it fails to fork a process. Therefore, there is no
> >> +			 * way it can conflict with put_task_struct().
> >> +			 */
> >> +			call_rcu(&task->rcu, __delayed_put_task_struct);
> >> +	} else {
> >> +		put_task_struct(task);
> >> +	}
> >> +}
> >
> > Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> > has already asked why can't we just rcu free the thing unconditionally.
> >
> > Google only found me an earlier version of this same patch set, but I'm
> > sure we've had that discussion many times over the past several years.
> > The above and your follow up patch is just horrible.
> >
> 
> So on v3/v4 we got to doing that unconditionally for PREEMPT_RT, but per
> [1] Wander went back to hand-fixing the problematic callsites.
> 
> Now that I'm looking at it again, I couldn't find a concrete argument from
> Oleg against doing this unconditionally - as Wander is pointing out in the
> changelog and comments, reusing task_struct.rcu for that purpose is safe
> (although not necessarily obviously so).
> 
> Is this just miscommunication, or is there a genuine issue with doing this
> unconditionally? As argued before, I'd also much rather have this be an
> unconditional call_rcu() (regardless of context or PREEMPT_RT).
> 

Yeah, I think it was a misunderstanding of mine.
  
Oleg Nesterov May 4, 2023, 12:29 p.m. UTC | #5
On 05/04, Peter Zijlstra wrote:
>
> Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> has already asked why can't we just rcu free the thing unconditionally.
>
> Google only found me an earlier version of this same patch set, but I'm
> sure we've had that discussion many times over the past several years.

Yes... see for example

https://lore.kernel.org/lkml/CAHk-=whtj+aSYftniMRG4xvFE8dmmYyrqcJyPmzStsfj5w9r=w@mail.gmail.com/

We already have an rcu pass before put_task_struct(zombie), see
put_task_struct_rcu_user(), another one looks unfortunate.

Oleg.
  
Peter Zijlstra May 4, 2023, 2:33 p.m. UTC | #6
On Thu, May 04, 2023 at 02:29:45PM +0200, Oleg Nesterov wrote:
> On 05/04, Peter Zijlstra wrote:
> >
> > Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> > has already asked why can't we just rcu free the thing unconditionally.
> >
> > Google only found me an earlier version of this same patch set, but I'm
> > sure we've had that discussion many times over the past several years.
> 
> Yes... see for example
> 
> https://lore.kernel.org/lkml/CAHk-=whtj+aSYftniMRG4xvFE8dmmYyrqcJyPmzStsfj5w9r=w@mail.gmail.com/
> 
> We already have an rcu pass before put_task_struct(zombie), see
> put_task_struct_rcu_user(), another one looks unfortunate.

Ah indeed, it got mentioned there as well. And Linus seems to be arguing
against doing an rcu free there. So humm..

Then I'm thinking something trivial like so:

static inline void put_task_struct(struct task_struct *t)
{
	if (!refcount_dec_and_test(&t->usage))
		return;

	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
		call_rcu(&t->rcu, __put_task_struct_rcu);
		return;
	}

	__put_task_struct(t);
}

should do, or alternatively use irq_work, which has a much lower
latency, but meh..
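
For reference, a rough sketch of the irq_work alternative mentioned above
(hypothetical: it assumes a struct irq_work field, here called put_work,
were added to task_struct). On PREEMPT_RT, irq_work items not flagged
IRQ_WORK_HARD_IRQ are run from a per-CPU kthread rather than from hard
interrupt context, so the callback may take sleeping locks:

static void __put_task_struct_irq_work(struct irq_work *work)
{
	/* On PREEMPT_RT this runs from the irq_work kthread, not hardirq. */
	struct task_struct *t = container_of(work, struct task_struct, put_work);

	__put_task_struct(t);
}

static inline void put_task_struct_deferred(struct task_struct *t)
{
	if (!refcount_dec_and_test(&t->usage))
		return;

	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
		/* put_work is the hypothetical task_struct field noted above. */
		init_irq_work(&t->put_work, __put_task_struct_irq_work);
		irq_work_queue(&t->put_work);
		return;
	}

	__put_task_struct(t);
}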
  
Wander Lairson Costa May 4, 2023, 2:55 p.m. UTC | #7
On Thu, May 4, 2023 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 04, 2023 at 02:29:45PM +0200, Oleg Nesterov wrote:
> > On 05/04, Peter Zijlstra wrote:
> > >
> > > Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> > > has already asked why can't we just rcu free the thing unconditionally.
> > >
> > > Google only found me an earlier version of this same patch set, but I'm
> > > sure we've had that discussion many times over the past several years.
> >
> > Yes... see for example
> >
> > https://lore.kernel.org/lkml/CAHk-=whtj+aSYftniMRG4xvFE8dmmYyrqcJyPmzStsfj5w9r=w@mail.gmail.com/
> >
> > We already have an rcu pass before put_task_struct(zombie), see
> > put_task_struct_rcu_user(), another one looks unfortunate.
>
> Ah indeed, it got mentioned there as well. And Linus seems to be arguing
> against doing an rcu free there. So humm..
>
> Then I'm thinking something trivial like so:
>
> static inline void put_task_struct(struct task_struct *t)
> {
>         if (!refcount_dec_and_test(&t->usage))
>                 return;
>
>         if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
>                 call_rcu(&t->rcu, __put_task_struct_rcu);
>                 return;
>         }
>
>         __put_task_struct(t);
> }
>

That's what v5 [1] does. What would be the path in this case? Should I
resend it as v8?

> should do, or alternatively use irq_work, which has a much lower
> latency, but meh..
>

Initially, I did that. I switched to call_rcu because the code got much simpler.

[1] https://lore.kernel.org/all/20230210161323.37400-1-wander@redhat.com/
  
Oleg Nesterov May 4, 2023, 3:23 p.m. UTC | #8
On 05/04, Wander Lairson Costa wrote:
>
> On Thu, May 4, 2023 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > static inline void put_task_struct(struct task_struct *t)
> > {
> >         if (!refcount_dec_and_test(&t->usage))
> >                 return;
> >
> >         if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
> >                 call_rcu(&t->rcu, __put_task_struct_rcu);
> >                 return;
> >         }
> >
> >         __put_task_struct(t);
> > }
> >
>
> That's what v5 [1] does.

Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.

	https://lore.kernel.org/all/Y+zFNrCjBn53%2F+Q2@linutronix.de/

Oleg.
  
Peter Zijlstra May 4, 2023, 3:24 p.m. UTC | #9
On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:

> > Then I'm thinking something trivial like so:
> >
> > static inline void put_task_struct(struct task_struct *t)
> > {
> >         if (!refcount_dec_and_test(&t->usage))
> >                 return;
> >
> >         if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
> >                 call_rcu(&t->rcu, __put_task_struct_rcu);
> >                 return;
> >         }
> >
> >         __put_task_struct(t);
> > }
> >
> 
> That's what v5 [1] does. What would be the path in this case? Should I
> resend it as v8?

It's almost what v5 does. v5 also has a !in_task() thing. v5 also
violates codingstyle :-)

But yeah.. something like that.
  
Peter Zijlstra May 4, 2023, 3:30 p.m. UTC | #10
On Thu, May 04, 2023 at 05:23:07PM +0200, Oleg Nesterov wrote:
> On 05/04, Wander Lairson Costa wrote:
> >
> > On Thu, May 4, 2023 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > static inline void put_task_struct(struct task_struct *t)
> > > {
> > >         if (!refcount_dec_and_test(&t->usage))
> > >                 return;
> > >
> > >         if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
> > >                 call_rcu(&t->rcu, __put_task_struct_rcu);
> > >                 return;
> > >         }
> > >
> > >         __put_task_struct(t);
> > > }
> > >
> >
> > That's what v5 [1] does.
> 
> Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.

This can help:

  https://lkml.kernel.org/r/168303194177.404.8610123576035502891.tip-bot2@tip-bot2
  
Wander Lairson Costa May 4, 2023, 6:21 p.m. UTC | #11
On Thu, May 04, 2023 at 05:24:24PM +0200, Peter Zijlstra wrote:
> On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:
> 
> > > Then I'm thinking something trivial like so:
> > >
> > > static inline void put_task_struct(struct task_struct *t)
> > > {
> > >         if (!refcount_dec_and_test(&t->usage))
> > >                 return;
> > >
> > >         if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
> > >                 call_rcu(&t->rcu, __put_task_struct_rcu);
> > >                 return;
> > >         }
> > >
> > >         __put_task_struct(t);
> > > }
> > >
> > 
> > That's what v5 [1] does. What would be the path in this case? Should I
> > resend it as v8?
> 
> It's almost what v5 does. v5 also has a !in_task() thing. v5 also
> violates codingstyle :-)

IIRC, the in_task() is there because preemptible() doesn't check if it
is running in interrupt context.

> 
> But yeah.. something like that.
>
  
Wander Lairson Costa May 4, 2023, 6:29 p.m. UTC | #12
On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 05/04, Wander Lairson Costa wrote:
> >
> > On Thu, May 4, 2023 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > static inline void put_task_struct(struct task_struct *t)
> > > {
> > >         if (!refcount_dec_and_test(&t->usage))
> > >                 return;
> > >
> > >         if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
> > >                 call_rcu(&t->rcu, __put_task_struct_rcu);
> > >                 return;
> > >         }
> > >
> > >         __put_task_struct(t);
> > > }
> > >
> >
> > That's what v5 [1] does.
>
> Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
>
>         https://lore.kernel.org/all/Y+zFNrCjBn53%2F+Q2@linutronix.de/
>

I think that was my confusion in that thread. My understanding is that
CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
context. I feel my ignorance is blocking me from seeing something that
is obvious to everyone else :/

> Oleg.
>
  
Oleg Nesterov May 4, 2023, 7:22 p.m. UTC | #13
On 05/04, Wander Lairson Costa wrote:
>
> On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
> >
> >         https://lore.kernel.org/all/Y+zFNrCjBn53%2F+Q2@linutronix.de/
> >
> 
> I think that was my confusion in that thread. My understanding is that
> CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
> context.

Sorry, I don't understand... perhaps I missed something. But iiuc
the problem is simple.

So, this code

	raw_spin_lock(one);
	spin_lock(two);

is obviously wrong if CONFIG_PREEMPT_RT.

Without PREEMPT_RT this code is fine because raw_spinlock_t and spinlock_t
are the same thing. Except they have different lockdep annotations if
CONFIG_PROVE_RAW_LOCK_NESTING is true, LD_WAIT_SPIN and LD_WAIT_CONFIG.

So if CONFIG_PROVE_RAW_LOCK_NESTING is set, lockdep will complain even
on the !PREEMPT_RT kernel, iow it checks the nesting as if the code runs
on PREEMPT_RT.

Cough... not sure my explanation can help ;) It looks very confusing when
I read it.

Oleg.
  
Wander Lairson Costa May 4, 2023, 7:38 p.m. UTC | #14
On Thu, May 4, 2023 at 4:23 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 05/04, Wander Lairson Costa wrote:
> >
> > On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
> > >
> > >         https://lore.kernel.org/all/Y+zFNrCjBn53%2F+Q2@linutronix.de/
> > >
> >
> > I think that was my confusion in that thread. My understanding is that
> > CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
> > context.
>
> Sorry, I don't understand... perhaps I missed something. But iiuc
> the problem is simple.
>
> So, this code
>
>         raw_spin_lock(one);
>         spin_lock(two);
>
> is obviously wrong if CONFIG_PREEMPT_RT.
>
> Without PREEMPT_RT this code is fine because raw_spinlock_t and spinlock_t
> are the same thing. Except they have different lockdep annotations if
> CONFIG_PROVE_RAW_LOCK_NESTING is true, LD_WAIT_SPIN and LD_WAIT_CONFIG.
>
> So if CONFIG_PROVE_RAW_LOCK_NESTING is set, lockdep will complain even
> on the !PREEMPT_RT kernel, iow it checks the nesting as if the code runs
> on PREEMPT_RT.
>
> Cough... not sure my explanation can help ;) It looks very confusing when
> I read it.
>

Thanks for the explanation. That's my understanding too. The part I
don't get is why this would fail with a call_rcu() inside
put_task_struct().

> Oleg.
>
  
Oleg Nesterov May 4, 2023, 8:16 p.m. UTC | #15
Hi Wander,

I certainly missed something ;) plus I am already sleeping. but let me try to
reply anyway.

On 05/04, Wander Lairson Costa wrote:
> On Thu, May 4, 2023 at 4:23 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 05/04, Wander Lairson Costa wrote:
> > >
> > > On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
> > > >
> > > >         https://lore.kernel.org/all/Y+zFNrCjBn53%2F+Q2@linutronix.de/
> > > >
> > >
> > > I think that was my confusion in that thread. My understanding is that
> > > CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
> > > context.
> >
> > Sorry, I don't understand... perhaps I missed something. But iiuc
> > the problem is simple.
> >
> > So, this code
> >
> >         raw_spin_lock(one);
> >         spin_lock(two);
> >
> > is obviously wrong if CONFIG_PREEMPT_RT.
> >
> > Without PREEMPT_RT this code is fine because raw_spinlock_t and spinlock_t
> > are the same thing. Except they have different lockdep annotations if
> > CONFIG_PROVE_RAW_LOCK_NESTING is true, LD_WAIT_SPIN and LD_WAIT_CONFIG.
> >
> > So if CONFIG_PROVE_RAW_LOCK_NESTING is set, lockdep will complain even
> > on the !PREEMPT_RT kernel, iow it checks the nesting as if the code runs
> > on PREEMPT_RT.
> >
> > Cough... not sure my explanation can help ;) It looks very confusing when
> > I read it.
> >
>
> Thanks for the explanation. That's my understanding too. The part I
> don't get is why this would fail with a call_rcu() inside
> put_task_struct().

the problem is that call_rcu() won't be called if !IS_ENABLED(PREEMPT_RT);
__put_task_struct() will be called directly.

CONFIG_PROVE_RAW_LOCK_NESTING can't know this can't happen if PREEMPT_RT
is set.

IOW. To simplify, suppose we have

	// can be called in atomic context, e.g. under
	// raw_spin_lock() so it is wrong with PREEMPT_RT
	void __put_task_struct(struct task_struct *tsk)
	{
		spin_lock(some_lock);
	}

lets "fix" the code above, lets change __put_task_struct,

	void __put_task_struct(struct task_struct *tsk)
	{
		if (IS_ENABLED(CONFIG_PREEMPT_RT))
			return;

		spin_lock(some_lock);
	}

Now, if CONFIG_PREEMPT_RT is true then __put_task_struct() is fine
wrt lock nesting.

But, if CONFIG_PREEMPT_RT is not set, then __put_task_struct() still
does the same:

	void __put_task_struct(struct task_struct *tsk)
	{
		spin_lock(some_lock);
	}

and CONFIG_PROVE_RAW_LOCK_NESTING will complain. Because, once again,
it checks the nesting as if CONFIG_PREEMPT_RT is true, and in this case
__put_task_struct() is wrong if it is called under raw_spin_lock().

Oleg.
  
Peter Zijlstra May 5, 2023, 1:32 p.m. UTC | #16
On Thu, May 04, 2023 at 03:21:11PM -0300, Wander Lairson Costa wrote:
> On Thu, May 04, 2023 at 05:24:24PM +0200, Peter Zijlstra wrote:
> > On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:
> > 
> > > > Then I'm thinking something trivial like so:
> > > >
> > > > static inline void put_task_struct(struct task_struct *t)
> > > > {
> > > >         if (!refcount_dec_and_test(&t->usage))
> > > >                 return;
> > > >
> > > >         if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
> > > >                 call_rcu(&t->rcu, __put_task_struct_rcu);
> > > >                 return;
> > > >         }
> > > >
> > > >         __put_task_struct(t);
> > > > }
> > > >
> > > 
> > > That's what v5 [1] does. What would be the path in this case? Should I
> > > resend it as v8?
> > 
> > It's almost what v5 does. v5 also has a !in_task() thing. v5 also
> > violates codingstyle :-)
> 
> IIRC, the in_task() is there because preemptible() doesn't check if it
> is running in interrupt context.

#define preemptible()   (preempt_count() == 0 && !irqs_disabled())

When in interrupt context preempt_count() will have a non-zero value in
HARDIRQ_MASK and IRQs must be disabled, so preemptible() evaluates to
(false && false), last time I checked that ends up being false.
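
For reference, the two predicates being compared, as defined in
include/linux/preempt.h around this time (quoted from memory, so treat as
approximate):

#define in_task()	(!(in_nmi() | in_hardirq() | in_serving_softirq()))
#define preemptible()	(preempt_count() == 0 && !irqs_disabled())

in_task() only excludes NMI, hard IRQ and softirq-serving context;
preemptible() is the stricter test, additionally requiring that preemption
is enabled and IRQs are on.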
  
Peter Zijlstra May 5, 2023, 1:39 p.m. UTC | #17
On Thu, May 04, 2023 at 05:30:57PM +0200, Peter Zijlstra wrote:
> On Thu, May 04, 2023 at 05:23:07PM +0200, Oleg Nesterov wrote:

> > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
> 
> This can help:
> 
>   https://lkml.kernel.org/r/168303194177.404.8610123576035502891.tip-bot2@tip-bot2

Explicitly:

static inline void put_task_struct(struct task_struct *t)
{
	if (!refcount_dec_and_test(&t->usage))
		return;

	if (!IS_ENABLED(CONFIG_PREEMPT_RT) || preemptible()) {
		/*
		 * ... same comment as the other patch ...
		 */
		static DEFINE_WAIT_OVERRIDE_MAP(put_task_map, LD_WAIT_SLEEP);
		lock_map_acquire_try(&put_task_map);
		__put_task_struct(t);
		lock_map_release(&put_task_map);
		return;
	}

	call_rcu(&t->rcu, __put_task_struct_rcu);
}

Should not complain since we tell it to STFU :-)
  
Steven Rostedt May 5, 2023, 2:26 p.m. UTC | #18
On Fri, 5 May 2023 15:32:35 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, May 04, 2023 at 03:21:11PM -0300, Wander Lairson Costa wrote:
> > On Thu, May 04, 2023 at 05:24:24PM +0200, Peter Zijlstra wrote:  
> > > On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:
> > >   
> > > > > Then I'm thinking something trivial like so:
> > > > >
> > > > > static inline void put_task_struct(struct task_struct *t)
> > > > > {
> > > > >         if (!refcount_dec_and_test(&t->usage))
> > > > >                 return;
> > > > >
> > > > >         if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
> > > > >                 call_rcu(&t->rcu, __put_task_struct_rcu);
> > > > >                 return;
> > > > >         }
> > > > >
> > > > >         __put_task_struct(t);
> > > > > }
> > > > >  
> > > > 
> > > > That's what v5 [1] does. What would be the path in this case? Should I
> > > > resend it as v8?  
> > > 
> > > It's almost what v5 does. v5 also has a !in_task() thing. v5 also
> > > violates codingstyle :-)  
> > 
> > IIRC, the in_task() is there because preemptible() doesn't check if it
> > is running in interrupt context.  
> 
> #define preemptible()   (preempt_count() == 0 && !irqs_disabled())
> 
> When in interrupt context preempt_count() will have a non-zero value in
> HARDIRQ_MASK and IRQs must be disabled, so preemptible() evaluates to
> (false && false), last time I checked that ends up being false.

Interesting, I can't find v5 anywhere in my mail folders (but I have
v4 and v6!). Anyway, from just the context of this email, and seeing
IS_ENABLED(CONFIG_PREEMPT_RT), I'm guessing that in_task() returns false if
it's running in an interrupt thread, where preemptible() does not.

-- Steve
  
Steven Rostedt May 5, 2023, 2:29 p.m. UTC | #19
On Fri, 5 May 2023 10:26:02 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > > IIRC, the in_task() is there because preemptible() doesn't check if it
> > > is running in interrupt context.    
> > 
> > #define preemptible()   (preempt_count() == 0 && !irqs_disabled())
> > 
> > When in interrupt context preempt_count() will have a non-zero value in
> > HARDIRQ_MASK and IRQs must be disabled, so preemptible() evaluates to
> > (false && false), last time I checked that ends up being false.  
> 
> Interesting, I can't find v5 anywhere in my mail folders (but I have
> v4 and v6!). Anyway, from just the context of this email, and seeing
> IS_ENABLED(CONFIG_PREEMPT_RT), I'm guessing that in_task() returns false if
> it's running in an interrupt thread, where preemptible() does not.

But then I question, does it matter if it is running in an interrupt thread
or not for put_task_struct()?

-- Steve
  
Wander Lairson Costa May 8, 2023, 12:28 p.m. UTC | #20
On Fri, May 5, 2023 at 10:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 04, 2023 at 03:21:11PM -0300, Wander Lairson Costa wrote:
> > On Thu, May 04, 2023 at 05:24:24PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:
> > >
> > > > > Then I'm thinking something trivial like so:
> > > > >
> > > > > static inline void put_task_struct(struct task_struct *t)
> > > > > {
> > > > >         if (!refcount_dec_and_test(&t->usage))
> > > > >                 return;
> > > > >
> > > > >         if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
> > > > >                 call_rcu(&t->rcu, __put_task_struct_rcu);
> > > > >                 return;
> > > > >         }
> > > > >
> > > > >         __put_task_struct(t);
> > > > > }
> > > > >
> > > >
> > > > That's what v5 [1] does. What would be the path in this case? Should I
> > > > resend it as v8?
> > >
> > > It's almost what v5 does. v5 also has a !in_task() thing. v5 also
> > > violates codingstyle :-)
> >
> > IIRC, the in_task() is there because preemptible() doesn't check if it
> > is running in interrupt context.
>
> #define preemptible()   (preempt_count() == 0 && !irqs_disabled())
>
> When in interrupt context preempt_count() will have a non-zero value in
> HARDIRQ_MASK and IRQs must be disabled, so preemptible() evaluates to
> (false && false), last time I checked that ends up being false.
>

When I first tested, I started with in_irq(), then I saw some warnings
and added preemptible().
Never tested with preemptible() alone. Thanks, I will change that and test it.
  
Wander Lairson Costa May 8, 2023, 12:30 p.m. UTC | #21
On Thu, May 4, 2023 at 5:16 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Hi Wander,
>
> I certainly missed something ;) plus I am already sleeping. but let me try to
> reply anyway.
>
> On 05/04, Wander Lairson Costa wrote:
> > On Thu, May 4, 2023 at 4:23 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > On 05/04, Wander Lairson Costa wrote:
> > > >
> > > > On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > > > >
> > > > > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
> > > > >
> > > > >         https://lore.kernel.org/all/Y+zFNrCjBn53%2F+Q2@linutronix.de/
> > > > >
> > > >
> > > > I think that was my confusion in that thread. My understanding is that
> > > > CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
> > > > context.
> > >
> > > Sorry, I don't understand... perhaps I missed something. But iiuc
> > > the problem is simple.
> > >
> > > So, this code
> > >
> > >         raw_spin_lock(one);
> > >         spin_lock(two);
> > >
> > > is obviously wrong if CONFIG_PREEMPT_RT.
> > >
> > > Without PREEMPT_RT this code is fine because raw_spinlock_t and spinlock_t
> > > are the same thing. Except they have different lockdep annotations if
> > > CONFIG_PROVE_RAW_LOCK_NESTING is true, LD_WAIT_SPIN and LD_WAIT_CONFIG.
> > >
> > > So if CONFIG_PROVE_RAW_LOCK_NESTING is set, lockdep will complain even
> > > on the !PREEMPT_RT kernel, iow it checks the nesting as if the code runs
> > > on PREEMPT_RT.
> > >
> > > Cough... not sure my explanation can help ;) It looks very confusing when
> > > I read it.
> > >
> >
> > Thanks for the explanation. That's my understanding too. The part I
> > don't get is why this would fail with a call_rcu() inside
> > put_task_struct().
>
> the problem is that call_rcu() won't be called if !IS_ENABLED(PREEMPT_RT);
> __put_task_struct() will be called directly.
>
> CONFIG_PROVE_RAW_LOCK_NESTING can't know this can't happen if PREEMPT_RT
> is set.
>
> IOW. To simplify, suppose we have
>
>         // can be called in atomic context, e.g. under
>         // raw_spin_lock() so it is wrong with PREEMPT_RT
>         void __put_task_struct(struct task_struct *tsk)
>         {
>                 spin_lock(some_lock);
>         }
>
> lets "fix" the code above, lets change __put_task_struct,
>
>         void __put_task_struct(struct task_struct *tsk)
>         {
>                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
>                         return;
>
>                 spin_lock(some_lock);
>         }
>
> Now, if CONFIG_PREEMPT_RT is true then __put_task_struct() is fine
> wrt lock nesting.
>
> But, if CONFIG_PREEMPT_RT is not set, then __put_task_struct() still
> does the same:
>
>         void __put_task_struct(struct task_struct *tsk)
>         {
>                 spin_lock(some_lock);
>         }
>
> and CONFIG_PROVE_RAW_LOCK_NESTING will complain. Because, once again,
> it checks the nesting as if CONFIG_PREEMPT_RT is true, and in this case
> __put_task_struct() is wrong if it is called under raw_spin_lock().
>

IIUC, this is a problem with the current implementation, isn't it?
Because the holder of raw_spin_lock is the put_task_struct() caller.

> Oleg.
>
  

Patch

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index b597b97b1f8f..cf774b83b2ec 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -141,6 +141,41 @@  static inline void put_task_struct_many(struct task_struct *t, int nr)
 
 void put_task_struct_rcu_user(struct task_struct *task);
 
+extern void __delayed_put_task_struct(struct rcu_head *rhp);
+
+static inline void put_task_struct_atomic_safe(struct task_struct *task)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		/*
+		 * Decrement the refcount explicitly to avoid unnecessarily
+		 * calling call_rcu.
+		 */
+		if (refcount_dec_and_test(&task->usage))
+			/*
+			 * under PREEMPT_RT, we can't call put_task_struct
+			 * in atomic context because it will indirectly
+			 * acquire sleeping locks.
+			 * call_rcu() will schedule __delayed_put_task_struct()
+			 * to be called in process context.
+			 *
+			 * __put_task_struct() is called when
+			 * refcount_dec_and_test(&t->usage) succeeds.
+			 *
+			 * This means that it can't conflict with
+			 * put_task_struct_rcu_user() which abuses ->rcu the same
+			 * way; rcu_users has a reference so task->usage can't be
+			 * zero after rcu_users 1 -> 0 transition.
+			 *
+			 * delayed_free_task() also uses ->rcu, but it is only called
+			 * when it fails to fork a process. Therefore, there is no
+			 * way it can conflict with put_task_struct().
+			 */
+			call_rcu(&task->rcu, __delayed_put_task_struct);
+	} else {
+		put_task_struct(task);
+	}
+}
+
 /* Free all architecture-specific resources held by a thread. */
 void release_thread(struct task_struct *dead_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index ea332319dffe..7f016b691b1d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -854,6 +854,14 @@  void __put_task_struct(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(__put_task_struct);
 
+void __delayed_put_task_struct(struct rcu_head *rhp)
+{
+	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
+
+	__put_task_struct(task);
+}
+EXPORT_SYMBOL_GPL(__delayed_put_task_struct);
+
 void __init __weak arch_task_cache_init(void) { }
 
 /*
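
For completeness, a usage sketch (not part of the patch; the lock and
function names are illustrative) of the situation the new helper targets:
dropping what may be the last task reference while atomic on PREEMPT_RT:

static void example_drop_ref(struct task_struct *p, raw_spinlock_t *lock)
{
	raw_spin_lock(lock);
	/*
	 * Atomic context: if this were the final reference, plain
	 * put_task_struct() could acquire sleeping locks on PREEMPT_RT,
	 * so use the deferring variant.
	 */
	put_task_struct_atomic_safe(p);
	raw_spin_unlock(lock);
}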