[v4] kernel/fork: beware of __put_task_struct calling context

Message ID 20230206130449.41360-1-wander@redhat.com
State New
Headers
Series [v4] kernel/fork: beware of __put_task_struct calling context |

Commit Message

Wander Lairson Costa Feb. 6, 2023, 1:04 p.m. UTC
  Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
locks. Therefore, it can't be called from an non-preemptible context.

One practical example is splat inside inactive_task_timer(), which is
called in a interrupt context:

CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
 Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
 Call Trace:
 dump_stack_lvl+0x57/0x7d
 mark_lock_irq.cold+0x33/0xba
 ? stack_trace_save+0x4b/0x70
 ? save_trace+0x55/0x150
 mark_lock+0x1e7/0x400
 mark_usage+0x11d/0x140
 __lock_acquire+0x30d/0x930
 lock_acquire.part.0+0x9c/0x210
 ? refill_obj_stock+0x3d/0x3a0
 ? rcu_read_lock_sched_held+0x3f/0x70
 ? trace_lock_acquire+0x38/0x140
 ? lock_acquire+0x30/0x80
 ? refill_obj_stock+0x3d/0x3a0
 rt_spin_lock+0x27/0xe0
 ? refill_obj_stock+0x3d/0x3a0
 refill_obj_stock+0x3d/0x3a0
 ? inactive_task_timer+0x1ad/0x340
 kmem_cache_free+0x357/0x560
 inactive_task_timer+0x1ad/0x340
 ? switched_from_dl+0x2d0/0x2d0
 __run_hrtimer+0x8a/0x1a0
 __hrtimer_run_queues+0x91/0x130
 hrtimer_interrupt+0x10f/0x220
 __sysvec_apic_timer_interrupt+0x7b/0xd0
 sysvec_apic_timer_interrupt+0x4f/0xd0
 ? asm_sysvec_apic_timer_interrupt+0xa/0x20
 asm_sysvec_apic_timer_interrupt+0x12/0x20
 RIP: 0033:0x7fff196bf6f5

Instead of calling __put_task_struct() directly, we defer it using
call_rcu(). A more natural approach would use a workqueue, but since
in PREEMPT_RT, we can't allocate dynamic memory from atomic context,
the code would become more complex because we would need to put the
work_struct instance in the task_struct and initialize it when we
allocate a new task_struct.

Changelog
=========

v1:
* Initial implementation fixing the splat.

v2:
* Isolate the logic in its own function.
* Fix two more cases caught in review.

v3:
* Change __put_task_struct() to handle the issue internally.

v4:
* Explain why call_rcu() is safe to call from interrupt context.

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reported-by: Hu Chunyu <chuhu@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Cc: Paul McKenney <paulmck@kernel.org>
---
 kernel/fork.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)
  

Comments

Sebastian Andrzej Siewior Feb. 6, 2023, 2:56 p.m. UTC | #1
On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> locks. Therefore, it can't be called from an non-preemptible context.
> 
> One practical example is splat inside inactive_task_timer(), which is
> called in a interrupt context:

Do you have more?
The inactive_task_timer() is marked as HRTIMER_MODE_REL_HARD which means
in runs in hardirq-context. The author of commit
   850377a875a48 ("sched/deadline: Ensure inactive_timer runs in hardirq context")

should have been aware of that.
We have on workaround of that put_task() in sched-switch. I wasn't aware
of this shortcoming. So either we have more problems or potential
problems or this is the only finding so far.

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9f7fe3541897..532dd2ceb6a3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)> +void __put_task_struct(struct task_struct *tsk)
> +{
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))

Is it safe to use the rcu member in any case? If so why not use it
unconditionally?

> +		/*
> +		 * under PREEMPT_RT, we can't call put_task_struct
> +		 * in atomic context because it will indirectly
> +		 * acquire sleeping locks.
> +		 *
> +		 * call_rcu() will schedule delayed_put_task_struct_rcu()
> +		 * to be called in process context.
> +		 */
> +		call_rcu(&tsk->rcu, delayed_put_task_struct_rcu);
> +	else
> +		___put_task_struct(tsk);
> +}
>  EXPORT_SYMBOL_GPL(__put_task_struct);
>  
>  void __init __weak arch_task_cache_init(void) { }

Sebastian
  
Oleg Nesterov Feb. 6, 2023, 3:27 p.m. UTC | #2
On 02/06, Sebastian Andrzej Siewior wrote:
>
> On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
>
> > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> …
> > +void __put_task_struct(struct task_struct *tsk)
> > +{
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
>
> Is it safe to use the rcu member in any case?

I thinks it is safe but deserves a comment. I guess Wander misunderstood
me when I asked him to do this...

__put_task_struct() is called when refcount_dec_and_test(&t->usage) succeeds.

This means that it can't "conflict" with put_task_struct_rcu_user() which
abuses ->rcu the same way; rcu_users has a reference so task->usage can't
be zero after rcu_users 1 -> 0 transition.

> If so why not use it
> unconditionally?

performance ?


And... I still don't like the name of delayed_put_task_struct_rcu() to me
___put_task_struct_rcu() looks a bit less confusing, note that we already
have delayed_put_task_struct(). But this is minor.

Oleg.
  
Sebastian Andrzej Siewior Feb. 6, 2023, 4:04 p.m. UTC | #3
On 2023-02-06 16:27:12 [+0100], Oleg Nesterov wrote:
> On 02/06, Sebastian Andrzej Siewior wrote:
> >
> > On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> >
> > > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> > …
> > > +void __put_task_struct(struct task_struct *tsk)
> > > +{
> > > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
> >
> > Is it safe to use the rcu member in any case?
> 
> I thinks it is safe but deserves a comment. I guess Wander misunderstood
> me when I asked him to do this...
> 
> __put_task_struct() is called when refcount_dec_and_test(&t->usage) succeeds.
> 
> This means that it can't "conflict" with put_task_struct_rcu_user() which
> abuses ->rcu the same way; rcu_users has a reference so task->usage can't
> be zero after rcu_users 1 -> 0 transition.

Sounds good.

> > If so why not use it
> > unconditionally?
> 
> performance ?

All the free() part is moved from the caller into rcu.

> 
> And... I still don't like the name of delayed_put_task_struct_rcu() to me
> ___put_task_struct_rcu() looks a bit less confusing, note that we already
> have delayed_put_task_struct(). But this is minor.

So if we do it unconditionally then we could get rid of
put_task_struct_rcu_user().
Otherwise we could use put_task_struct_rcu_user() in that timer
callback because it will lead to lockdep warnings once printk is fixed.

> Oleg.
Sebastian
  
Oleg Nesterov Feb. 6, 2023, 4:27 p.m. UTC | #4
On 02/06, Sebastian Andrzej Siewior wrote:
>
> On 2023-02-06 16:27:12 [+0100], Oleg Nesterov wrote:
>
> > > If so why not use it
> > > unconditionally?
> >
> > performance ?
>
> All the free() part is moved from the caller into rcu.

sorry, I don't understand,

>
> >
> > And... I still don't like the name of delayed_put_task_struct_rcu() to me
> > ___put_task_struct_rcu() looks a bit less confusing, note that we already
> > have delayed_put_task_struct(). But this is minor.
>
> So if we do it unconditionally then we could get rid of
> put_task_struct_rcu_user().

Yes. But the whole purpose of rcu_users is that we want to avoid the unconditional
rcu grace period before free_task() ?

Just in case... please note that delayed_put_task_struct() delays
refcount_sub(t->usage), not free_task().

Why do we need this? Consider

	rcu_read_lock();

	task = find-task-in-rcu-protected-list;

	// Safe, task->usage can't be zero
	get_task_struct(task);

	rcu_read_unlock();


> Otherwise we could use put_task_struct_rcu_user() in that timer
> callback because it will lead to lockdep warnings once printk is fixed.

IIUC there are more in-atomic callers of put_task_struct(). But perhaps
I misunderstood you...

Oleg.
  
Wander Lairson Costa Feb. 6, 2023, 6:32 p.m. UTC | #5
On Mon, Feb 6, 2023 at 11:57 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> > Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> > locks. Therefore, it can't be called from an non-preemptible context.
> >
> > One practical example is splat inside inactive_task_timer(), which is
> > called in a interrupt context:
>
> Do you have more?
> The inactive_task_timer() is marked as HRTIMER_MODE_REL_HARD which means
> in runs in hardirq-context. The author of commit
>    850377a875a48 ("sched/deadline: Ensure inactive_timer runs in hardirq context")
>
> should have been aware of that.
> We have on workaround of that put_task() in sched-switch. I wasn't aware
> of this shortcoming. So either we have more problems or potential
> problems or this is the only finding so far.
>

Valentin spotted two other potential issues, I fixed them in v2[1].
Also there is a discussion there that led to this implementation.

> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9f7fe3541897..532dd2ceb6a3 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> …
> > +void __put_task_struct(struct task_struct *tsk)
> > +{
> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
>
> Is it safe to use the rcu member in any case? If so why not use it
> unconditionally?
>

I am unsure what would be the consequences of moving every call to
RCU, so I thought it would be better to play safe and do it only when
necessary.

> > +             /*
> > +              * under PREEMPT_RT, we can't call put_task_struct
> > +              * in atomic context because it will indirectly
> > +              * acquire sleeping locks.
> > +              *
> > +              * call_rcu() will schedule delayed_put_task_struct_rcu()
> > +              * to be called in process context.
> > +              */
> > +             call_rcu(&tsk->rcu, delayed_put_task_struct_rcu);
> > +     else
> > +             ___put_task_struct(tsk);
> > +}
> >  EXPORT_SYMBOL_GPL(__put_task_struct);
> >
> >  void __init __weak arch_task_cache_init(void) { }
>
> Sebastian
>

[1] https://lore.kernel.org/all/20230120150246.20797-1-wander@redhat.com/
  
Wander Lairson Costa Feb. 6, 2023, 6:34 p.m. UTC | #6
On Mon, Feb 6, 2023 at 12:27 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 02/06, Sebastian Andrzej Siewior wrote:
> >
> > On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> >
> > > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> > …
> > > +void __put_task_struct(struct task_struct *tsk)
> > > +{
> > > +   if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
> >
> > Is it safe to use the rcu member in any case?
>
> I thinks it is safe but deserves a comment. I guess Wander misunderstood
> me when I asked him to do this...
>

Oops, sorry. Next version, I will include this description.

> __put_task_struct() is called when refcount_dec_and_test(&t->usage) succeeds.
>
> This means that it can't "conflict" with put_task_struct_rcu_user() which
> abuses ->rcu the same way; rcu_users has a reference so task->usage can't
> be zero after rcu_users 1 -> 0 transition.
>
> > If so why not use it
> > unconditionally?
>
> performance ?
>
>
> And... I still don't like the name of delayed_put_task_struct_rcu() to me
> ___put_task_struct_rcu() looks a bit less confusing, note that we already
> have delayed_put_task_struct(). But this is minor.
>

Ok, I will change it.

> Oleg.
>
  
Wander Lairson Costa Feb. 6, 2023, 6:36 p.m. UTC | #7
On Mon, Feb 6, 2023 at 1:04 PM, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-02-06 16:27:12 [+0100], Oleg Nesterov wrote:
> > On 02/06, Sebastian Andrzej Siewior wrote:
> > >
> > > On 2023-02-06 10:04:47 [-0300], Wander Lairson Costa wrote:
> > >
> > > > @@ -857,6 +857,29 @@ void __put_task_struct(struct task_struct *tsk)
> > > …
> > > > +void __put_task_struct(struct task_struct *tsk)
> > > > +{
> > > > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
> > >
> > > Is it safe to use the rcu member in any case?
> >
> > I thinks it is safe but deserves a comment. I guess Wander misunderstood
> > me when I asked him to do this...
> >
> > __put_task_struct() is called when refcount_dec_and_test(&t->usage) succeeds.
> >
> > This means that it can't "conflict" with put_task_struct_rcu_user() which
> > abuses ->rcu the same way; rcu_users has a reference so task->usage can't
> > be zero after rcu_users 1 -> 0 transition.
>
> Sounds good.
>
> > > If so why not use it
> > > unconditionally?
> >
> > performance ?
>
> All the free() part is moved from the caller into rcu.
>
> >
> > And... I still don't like the name of delayed_put_task_struct_rcu() to me
> > ___put_task_struct_rcu() looks a bit less confusing, note that we already
> > have delayed_put_task_struct(). But this is minor.
>
> So if we do it unconditionally then we could get rid of
> put_task_struct_rcu_user().
> Otherwise we could use put_task_struct_rcu_user() in that timer
> callback because it will lead to lockdep warnings once printk is fixed.

put_task_struct_rcu_user() calls delayed_put_task_struct(), which does
more than just call __put_task_struct(). I tried this approach at the
beginning, but I got another splat (unfortunately, I don't remember
where).
  
Andrew Morton Feb. 7, 2023, 1:09 a.m. UTC | #8
On Mon,  6 Feb 2023 10:04:47 -0300 Wander Lairson Costa <wander@redhat.com> wrote:

> Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> locks. Therefore, it can't be called from an non-preemptible context.

Well that's regrettable.  Especially if non-preempt kernels don't do
this.

Why does PREEMPT_RT do this and can it be fixed?

If it cannot be fixed then we should have a might_sleep() in
__put_task_struct() for all kernel configurations, along with an
apologetic comment explaining why.
  
Wander Lairson Costa Feb. 7, 2023, 3:26 p.m. UTC | #9
On Mon, Feb 6, 2023 at 10:09 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon,  6 Feb 2023 10:04:47 -0300 Wander Lairson Costa <wander@redhat.com> wrote:
>
> > Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> > locks. Therefore, it can't be cala from a non-preemptible context.
>
> Well that's regrettable. Especially if non-preempt kernels don't do
> this.
>
> Why does PREEMPT_RT do this and can it be fixed?
>

Under PREEMPT_RT, spin_lock becomes a wrapper around rtmutex, which is
a sleeping lock. This is necessary because of the deterministic
scheduling requirements of the RT kernel. Most of the places that run
in atomic context in the stock kernel, become process context in the
RT kernel, so the change spin_lock -> rtmutex is safe. However, there
are always exceptions.

In this particular case, __put_task_struct calls kcache_mem_free,
which refill_obj_stock. refill_obj_stock acquires a local_lock, that
is implemented using a spin_lock.

> If it cannot be fixed then we should have a might_sleep() in
> __put_task_struct() for all kernel configurations, along with an
> apologetic comment explaining why.
>

This patch is supposed to be the fix.
  
Sebastian Andrzej Siewior Feb. 10, 2023, 4:48 p.m. UTC | #10
On 2023-02-06 17:27:58 [+0100], Oleg Nesterov wrote:
> On 02/06, Sebastian Andrzej Siewior wrote:
> >
> > On 2023-02-06 16:27:12 [+0100], Oleg Nesterov wrote:
> >
> > > > If so why not use it
> > > > unconditionally?
> > >
> > > performance ?
> >
> > All the free() part is moved from the caller into rcu.
> 
> sorry, I don't understand,

That callback does mostly free() and it is batched with other free()
invocations. This also is moved away from the caller which _might_
benefit.

> > > And... I still don't like the name of delayed_put_task_struct_rcu() to me
> > > ___put_task_struct_rcu() looks a bit less confusing, note that we already
> > > have delayed_put_task_struct(). But this is minor.
> >
> > So if we do it unconditionally then we could get rid of
> > put_task_struct_rcu_user().
> 
> Yes. But the whole purpose of rcu_users is that we want to avoid the unconditional
> rcu grace period before free_task() ?

Oh, this is usage vs rcu_users. Okay, mixed that up.

> Just in case... please note that delayed_put_task_struct() delays
> refcount_sub(t->usage), not free_task().

Just noticed ;)

> Why do we need this? Consider
> 
> 	rcu_read_lock();
> 
> 	task = find-task-in-rcu-protected-list;
> 
> 	// Safe, task->usage can't be zero
> 	get_task_struct(task);
> 
> 	rcu_read_unlock();
> 
> 
> > Otherwise we could use put_task_struct_rcu_user() in that timer
> > callback because it will lead to lockdep warnings once printk is fixed.
> 
> IIUC there are more in-atomic callers of put_task_struct(). But perhaps
> I misunderstood you...

That is true. So you are saying that we don't what to use RCU for
put_task_struct() unconditionally?

> Oleg.

Sebastian
  
Sebastian Andrzej Siewior Feb. 10, 2023, 5:08 p.m. UTC | #11
On 2023-02-06 17:09:27 [-0800], Andrew Morton wrote:
> On Mon,  6 Feb 2023 10:04:47 -0300 Wander Lairson Costa <wander@redhat.com> wrote:
> 
> > Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> > locks. Therefore, it can't be called from an non-preemptible context.
> 
> Well that's regrettable.  Especially if non-preempt kernels don't do
> this.

Non-preemptible context on PREEMPT_RT. Interrupts handler and timers
don't count as non-preemptible because interrupt handler are threaded
and hrtimers are invoked in softirq context (which is preemptible on
PREEMPT_RT).

This here is different because the hrtimer in question was marked as
HRTIMER_MODE_REL_HARD. In this case it is invoked in hardirq context as
requested with all the problems that follow.

> Why does PREEMPT_RT do this and can it be fixed?

PREEMPT_RT tries to move as much as it can out of hardirq context into
preemptible context. A spinlock_t is preemptible on PREEMPT_RT while
it is not in other preemption models. The scheduler needs to use
raw_spinlock_t in order to be able to schedule a task from
hardirq-context without a deadlock.
For memory allocation only sleeping locks (spinlock_t) is used since
there are no memory allocation/ deallocation on PREEMPT_RT in hardirq
context. These two need to be separated.

> If it cannot be fixed then we should have a might_sleep() in
> __put_task_struct() for all kernel configurations, along with an
> apologetic comment explaining why.

__put_task_struct() should not be invoked in atomic context on
PREEMPT_RT. It is fine however in a regular timer hrtimer. Adding
might_sleep() will trigger a lot of false positives on a preemptible
kernel and RT.

A might_lock() on a spinlock_t should do the trick from LOCKDEP
perspective if CONFIG_PROVE_RAW_LOCK_NESTING is enabled.
In this case it should be visible due to rq-lock or due to hrtimer.

Sebastian
  

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..532dd2ceb6a3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -840,7 +840,7 @@  static inline void put_signal_struct(struct signal_struct *sig)
 		free_signal_struct(sig);
 }
 
-void __put_task_struct(struct task_struct *tsk)
+static void ___put_task_struct(struct task_struct *tsk)
 {
 	WARN_ON(!tsk->exit_state);
 	WARN_ON(refcount_read(&tsk->usage));
@@ -857,6 +857,29 @@  void __put_task_struct(struct task_struct *tsk)
 	sched_core_free(tsk);
 	free_task(tsk);
 }
+
+static void delayed_put_task_struct_rcu(struct rcu_head *rhp)
+{
+	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
+
+	___put_task_struct(task);
+}
+
+void __put_task_struct(struct task_struct *tsk)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
+		/*
+		 * under PREEMPT_RT, we can't call put_task_struct
+		 * in atomic context because it will indirectly
+		 * acquire sleeping locks.
+		 *
+		 * call_rcu() will schedule delayed_put_task_struct_rcu()
+		 * to be called in process context.
+		 */
+		call_rcu(&tsk->rcu, delayed_put_task_struct_rcu);
+	else
+		___put_task_struct(tsk);
+}
 EXPORT_SYMBOL_GPL(__put_task_struct);
 
 void __init __weak arch_task_cache_init(void) { }