[v9] kernel/fork: beware of __put_task_struct calling context

Message ID 20230516191441.34377-1-wander@redhat.com
State New
Series [v9] kernel/fork: beware of __put_task_struct calling context

Commit Message

Wander Lairson Costa May 16, 2023, 7:14 p.m. UTC
Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
locks. Therefore, it can't be called from a non-preemptible context.

One practical example is the splat inside inactive_task_timer(), which is
called in an interrupt context:

CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
 Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
 Call Trace:
 dump_stack_lvl+0x57/0x7d
 mark_lock_irq.cold+0x33/0xba
 ? stack_trace_save+0x4b/0x70
 ? save_trace+0x55/0x150
 mark_lock+0x1e7/0x400
 mark_usage+0x11d/0x140
 __lock_acquire+0x30d/0x930
 lock_acquire.part.0+0x9c/0x210
 ? refill_obj_stock+0x3d/0x3a0
 ? rcu_read_lock_sched_held+0x3f/0x70
 ? trace_lock_acquire+0x38/0x140
 ? lock_acquire+0x30/0x80
 ? refill_obj_stock+0x3d/0x3a0
 rt_spin_lock+0x27/0xe0
 ? refill_obj_stock+0x3d/0x3a0
 refill_obj_stock+0x3d/0x3a0
 ? inactive_task_timer+0x1ad/0x340
 kmem_cache_free+0x357/0x560
 inactive_task_timer+0x1ad/0x340
 ? switched_from_dl+0x2d0/0x2d0
 __run_hrtimer+0x8a/0x1a0
 __hrtimer_run_queues+0x91/0x130
 hrtimer_interrupt+0x10f/0x220
 __sysvec_apic_timer_interrupt+0x7b/0xd0
 sysvec_apic_timer_interrupt+0x4f/0xd0
 ? asm_sysvec_apic_timer_interrupt+0xa/0x20
 asm_sysvec_apic_timer_interrupt+0x12/0x20
 RIP: 0033:0x7fff196bf6f5

Instead of calling __put_task_struct() directly, we defer it using
call_rcu(). A more natural approach would use a workqueue, but since
in PREEMPT_RT, we can't allocate dynamic memory from atomic context,
the code would become more complex because we would need to put the
work_struct instance in the task_struct and initialize it when we
allocate a new task_struct.
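
The deferral described above can be modelled in userspace. This is only a
sketch under stated assumptions, not the kernel code: struct cb_head,
defer_cb(), run_deferred(), free_task_model() and the in_atomic_ctx flag are
hypothetical stand-ins for rcu_head, call_rcu(), the point where callbacks
later run, __put_task_struct() and !preemptible(). It shows why an embedded
callback head needs no allocation at the point where the last reference is
dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical model of struct rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Hypothetical model of the relevant task_struct fields. */
struct task {
	int usage;		/* models task_struct::usage */
	struct cb_head rcu;	/* models task_struct::rcu */
	bool freed;
};

static struct cb_head *pending;	/* callbacks waiting for a safe context */
static bool in_atomic_ctx;	/* models !preemptible() */

/* Models call_rcu(): only links the embedded head, so nothing is allocated. */
static void defer_cb(struct cb_head *head, void (*func)(struct cb_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Models the later point where deferred callbacks run. */
static void run_deferred(void)
{
	while (pending) {
		struct cb_head *head = pending;

		pending = head->next;
		head->func(head);
	}
}

/* Models __put_task_struct(): must not run in atomic context. */
static void free_task_model(struct task *t)
{
	assert(!in_atomic_ctx);
	t->freed = true;
}

/* Models __put_task_struct_rcu_cb(): recover the task from its head. */
static void put_task_cb(struct cb_head *head)
{
	free_task_model(container_of(head, struct task, rcu));
}

/* Models put_task_struct(): free directly or defer on the last reference. */
static void put_task(struct task *t)
{
	if (--t->usage != 0)
		return;

	if (in_atomic_ctx)
		defer_cb(&t->rcu, put_task_cb);
	else
		free_task_model(t);
}
```

Dropping the last reference with in_atomic_ctx set leaves the task queued on
pending and not yet freed; clearing the flag and calling run_deferred() then
frees it. A workqueue-based variant would need an extra work item embedded in
the task in the same way, which is the added complexity mentioned above.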

Changelog
=========

v1:
* Initial implementation fixing the splat.

v2:
* Isolate the logic in its own function.
* Fix two more cases caught in review.

v3:
* Change __put_task_struct() to handle the issue internally.

v4:
* Explain why call_rcu() is safe to call from interrupt context.

v5:
* Explain why __put_task_struct() doesn't conflict with
  put_task_struct_rcu_user().

v6:
* As per Sebastian's review, revert to the v2 implementation
  with a distinct function.
* Add a check in put_task_struct() to warn when called from a
  non-sleepable context.
* Address more call sites.

v7:
* Fix typos.
* Add an explanation why the new function doesn't conflict with
  delayed_free_task().

v8:
* Bring back v5.
* Fix coding style.

v9:
* Reorganize to not need ___put_task_struct(), per Oleg's suggestion.

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reported-by: Hu Chunyu <chuhu@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Paul McKenney <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Luis Goncalves <lgoncalv@redhat.com>
---
 include/linux/sched/task.h | 28 +++++++++++++++++++++++++++-
 kernel/fork.c              |  8 ++++++++
 2 files changed, 35 insertions(+), 1 deletion(-)
  

Comments

Matthew Wilcox May 16, 2023, 7:24 p.m. UTC | #1
On Tue, May 16, 2023 at 04:14:41PM -0300, Wander Lairson Costa wrote:
> +void __put_task_struct_rcu_cb(struct rcu_head *rhp)
> +{
> +	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
> +
> +	__put_task_struct(task);
> +}
> +EXPORT_SYMBOL_GPL(__put_task_struct_rcu_cb);

Why does this need to be exported when its only caller is within the
main kernel and cannot possibly be built as a module?
  
Andrew Morton May 16, 2023, 9:05 p.m. UTC | #2
On Tue, 16 May 2023 20:24:04 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Tue, May 16, 2023 at 04:14:41PM -0300, Wander Lairson Costa wrote:
> > +void __put_task_struct_rcu_cb(struct rcu_head *rhp)
> > +{
> > +	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
> > +
> > +	__put_task_struct(task);
> > +}
> > +EXPORT_SYMBOL_GPL(__put_task_struct_rcu_cb);
> 
> Why does this need to be exported when its only caller is within the
> main kernel and cannot possibly be built as a module?

It's referenced by inlined put_task_struct(), which is called from all
over.

However I believe the above definition could be inside #ifdef
CONFIG_PREEMPT_RT, to save a scrap of resources?
  
Matthew Wilcox May 16, 2023, 9:41 p.m. UTC | #3
On Tue, May 16, 2023 at 02:05:55PM -0700, Andrew Morton wrote:
> On Tue, 16 May 2023 20:24:04 +0100 Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Tue, May 16, 2023 at 04:14:41PM -0300, Wander Lairson Costa wrote:
> > > +void __put_task_struct_rcu_cb(struct rcu_head *rhp)
> > > +{
> > > +	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
> > > +
> > > +	__put_task_struct(task);
> > > +}
> > > +EXPORT_SYMBOL_GPL(__put_task_struct_rcu_cb);
> > 
> > Why does this need to be exported when its only caller is within the
> > main kernel and cannot possibly be built as a module?
> 
> It's referenced by inlined put_task_struct(), which is called from all
> over.

Oh, I missed that put_task_struct() was still inlined.  Should it be?
It seems quite large now.

> However I believe the above definition could be inside #ifdef
> CONFIG_PREEMPT_RT, to save a scrap of resources?
  
Andrew Morton May 16, 2023, 10:50 p.m. UTC | #4
On Tue, 16 May 2023 22:41:18 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> 
> Oh, I missed that put_task_struct() was still inlined.  Should it be?
> It seems quite large now.

It's not significantly worse because of this patch.  In fact, it's
unchanged for non-RT kernels.

Possibly put_task_struct() *should* be uninlined, because it made the
mistake of using the dang refcount stuff, which never saw a byte which
it couldn't consume :(


I mean...

--- a/fs/open.c~a
+++ a/fs/open.c
@@ -1572,3 +1572,9 @@ int stream_open(struct inode *inode, str
 }
 
 EXPORT_SYMBOL(stream_open);
+
+#include <linux/refcount.h>
+bool foo(refcount_t *r)
+{
+	return refcount_dec_and_test(r);
+}
  
Oleg Nesterov May 17, 2023, 3:26 p.m. UTC | #5
On 05/16, Wander Lairson Costa wrote:
>
>  static inline void put_task_struct(struct task_struct *t)
>  {
> -	if (refcount_dec_and_test(&t->usage))
> +	if (!refcount_dec_and_test(&t->usage))
> +		return;
> +
> +	/*
> +	 * under PREEMPT_RT, we can't call put_task_struct
> +	 * in atomic context because it will indirectly
> +	 * acquire sleeping locks.
> +	 *
> +	 * call_rcu() will schedule delayed_put_task_struct_rcu()
> +	 * to be called in process context.
> +	 *
> +	 * __put_task_struct() is called when
> +	 * refcount_dec_and_test(&t->usage) succeeds.
> +	 *
> +	 * This means that it can't "conflict" with
> +	 * put_task_struct_rcu_user() which abuses ->rcu the same
> +	 * way; rcu_users has a reference so task->usage can't be
> +	 * zero after rcu_users 1 -> 0 transition.
> +	 *
> +	 * delayed_free_task() also uses ->rcu, but it is only called
> +	 * when it fails to fork a process. Therefore, there is no
> +	 * way it can conflict with put_task_struct().
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> +		call_rcu(&t->rcu, __put_task_struct_rcu_cb);
> +	else
>  		__put_task_struct(t);
>  }

LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ...

Again, I'll try to investigate when I have time although I am not sure I can really help.

Perhaps you too can try to do this ? ;)

Oleg.
  
Wander Lairson Costa May 17, 2023, 4:57 p.m. UTC | #6
On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 05/16, Wander Lairson Costa wrote:
> >
> >  static inline void put_task_struct(struct task_struct *t)
> >  {
> > -     if (refcount_dec_and_test(&t->usage))
> > +     if (!refcount_dec_and_test(&t->usage))
> > +             return;
> > +
> > +     /*
> > +      * under PREEMPT_RT, we can't call put_task_struct
> > +      * in atomic context because it will indirectly
> > +      * acquire sleeping locks.
> > +      *
> > +      * call_rcu() will schedule delayed_put_task_struct_rcu()
> > +      * to be called in process context.
> > +      *
> > +      * __put_task_struct() is called when
> > +      * refcount_dec_and_test(&t->usage) succeeds.
> > +      *
> > +      * This means that it can't "conflict" with
> > +      * put_task_struct_rcu_user() which abuses ->rcu the same
> > +      * way; rcu_users has a reference so task->usage can't be
> > +      * zero after rcu_users 1 -> 0 transition.
> > +      *
> > +      * delayed_free_task() also uses ->rcu, but it is only called
> > +      * when it fails to fork a process. Therefore, there is no
> > +      * way it can conflict with put_task_struct().
> > +      */
> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> > +             call_rcu(&t->rcu, __put_task_struct_rcu_cb);
> > +     else
> >               __put_task_struct(t);
> >  }
>
> LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ...
>
> Again, I'll try to investigate when I have time although I am not sure I can really help.
>
> Perhaps you too can try to do this ? ;)
>

FWIW, I tested this patch with CONFIG_PROVE_RAW_LOCK_NESTING in RT and
stock kernels. No splat happened.

> Oleg.
>
  
Oleg Nesterov May 29, 2023, 12:22 p.m. UTC | #7
On 05/17, Wander Lairson Costa wrote:
>
> On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ...
> >
> > Again, I'll try to investigate when I have time although I am not sure I can really help.
> >
> > Perhaps you too can try to do this ? ;)
> >
>
> FWIW, I tested this patch with CONFIG_PROVE_LOCK_NESTING in RT and
> stock kernels. No splat happened.

Strange... FYI, I am running the kernel with this patch

	diff --git a/kernel/sys.c b/kernel/sys.c
	index 339fee3eff6a..3169cceddf3b 100644
	--- a/kernel/sys.c
	+++ b/kernel/sys.c
	@@ -2412,6 +2412,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
	 
		error = 0;
		switch (option) {
	+	case 666: {
	+		static DEFINE_SPINLOCK(l);
	+		static DEFINE_RAW_SPINLOCK(r);
	+
	+		raw_spin_lock(&r);
	+		spin_lock(&l);
	+		spin_unlock(&l);
	+		raw_spin_unlock(&r);
	+
	+		break;
	+	}
		case PR_SET_PDEATHSIG:
			if (!valid_signal(arg2)) {
				error = -EINVAL;

applied (because I am too lazy to compile a module ;) and

	# perl -e 'syscall 157,666'

triggers the lockdep bug

	=============================
	[ BUG: Invalid wait context ]
	6.4.0-rc2-00018-g4d6d4c7f541d-dirty #1176 Not tainted
	-----------------------------
	perl/35 is trying to lock:
	ffffffff81c4cc18 (l){....}-{3:3}, at: __do_sys_prctl+0x21b/0x87b
	other info that might help us debug this:
	context-{5:5}
	...

as expected.

Looks like your testing was wrong... Or maybe you missed another lockdep problem ?
Did you check dmesg? Perhaps lockdep detected another bug,say, even at boot time ?
In this case debug_locks_off() sets debug_locks = 0 and this disables lockdep.

Oleg.
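
One way to check the debug_locks state mentioned above is to look for it in
the lockdep statistics. The helper below is a minimal sketch, assuming the
running kernel exposes /proc/lockdep_stats and that the file contains a
"debug_locks:" line followed by 0 or 1; the function names are hypothetical
and the parsing is deliberately simple.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the value of the "debug_locks:" line from lockdep_stats-style text.
 * Returns 1 if the value is non-zero, 0 if it is zero, and -1 if the line
 * is missing or malformed.
 */
static int parse_debug_locks(const char *text)
{
	const char *key = "debug_locks:";
	const char *p = strstr(text, key);
	unsigned int val;

	if (!p)
		return -1;
	/* %u skips the padding between the key and the number. */
	if (sscanf(p + strlen(key), "%u", &val) != 1)
		return -1;
	return val != 0;
}

/*
 * Read a file (assumed here to be /proc/lockdep_stats) and parse it.
 * Returns -1 if the file cannot be opened.
 */
static int read_debug_locks(const char *path)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_debug_locks(buf);
}
```

A result of 0 from read_debug_locks("/proc/lockdep_stats") would suggest that
an earlier report already turned lockdep off, so a later test that stays
silent proves nothing; dmesg remains the place to find that earlier report.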
  
Wander Lairson Costa June 1, 2023, 5:45 p.m. UTC | #8
On Mon, May 29, 2023 at 9:23 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 05/17, Wander Lairson Costa wrote:
> >
> > On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ...
> > >
> > > Again, I'll try to investigate when I have time although I am not sure I can really help.
> > >
> > > Perhaps you too can try to do this ? ;)
> > >
> >
> > FWIW, I tested this patch with CONFIG_PROVE_LOCK_NESTING in RT and
> > stock kernels. No splat happened.
>
> Strange... FYI, I am running the kernel with this patch
>
>         diff --git a/kernel/sys.c b/kernel/sys.c
>         index 339fee3eff6a..3169cceddf3b 100644
>         --- a/kernel/sys.c
>         +++ b/kernel/sys.c
>         @@ -2412,6 +2412,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>
>                 error = 0;
>                 switch (option) {
>         +       case 666: {
>         +               static DEFINE_SPINLOCK(l);
>         +               static DEFINE_RAW_SPINLOCK(r);
>         +
>         +               raw_spin_lock(&r);
>         +               spin_lock(&l);
>         +               spin_unlock(&l);
>         +               raw_spin_unlock(&r);
>         +
>         +               break;
>         +       }
>                 case PR_SET_PDEATHSIG:
>                         if (!valid_signal(arg2)) {
>                                 error = -EINVAL;
>
> applied (because I am too lazy to compile a module ;) and
>

FWIW, I converted it to a module [1]

>         # perl -e 'syscall 157,666'
>
> triggers the lockdep bug
>
>         =============================
>         [ BUG: Invalid wait context ]
>         6.4.0-rc2-00018-g4d6d4c7f541d-dirty #1176 Not tainted
>         -----------------------------
>         perl/35 is trying to lock:
>         ffffffff81c4cc18 (l){....}-{3:3}, at: __do_sys_prctl+0x21b/0x87b
>         other info that might help us debug this:
>         context-{5:5}
>         ...
>
> as expected.
>

Yeah, I tried it here and I had the same results, but only in the RT
kernel. But running the reproducer for put_task_struct(), works fine.

> Looks like your testing was wrong... Or maybe you missed another lockdep problem ?
> Did you check dmesg? Perhaps lockdep detected another bug,say, even at boot time ?
> In this case debug_locks_off() sets debug_locks = 0 and this disables lockdep.
>
> Oleg.
>
  
Oleg Nesterov June 1, 2023, 6:13 p.m. UTC | #9
On 06/01, Wander Lairson Costa wrote:
>
> On Mon, May 29, 2023 at 9:23 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 05/17, Wander Lairson Costa wrote:
> > >
> > > On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ...
> > > >
> > > > Again, I'll try to investigate when I have time although I am not sure I can really help.
> > > >
> > > > Perhaps you too can try to do this ? ;)
> > > >
> > >
> > > FWIW, I tested this patch with CONFIG_PROVE_LOCK_NESTING in RT and
> > > stock kernels. No splat happened.
> >
> > Strange... FYI, I am running the kernel with this patch
> >
> >         diff --git a/kernel/sys.c b/kernel/sys.c
> >         index 339fee3eff6a..3169cceddf3b 100644
> >         --- a/kernel/sys.c
> >         +++ b/kernel/sys.c
> >         @@ -2412,6 +2412,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >
> >                 error = 0;
> >                 switch (option) {
> >         +       case 666: {
> >         +               static DEFINE_SPINLOCK(l);
> >         +               static DEFINE_RAW_SPINLOCK(r);
> >         +
> >         +               raw_spin_lock(&r);
> >         +               spin_lock(&l);
> >         +               spin_unlock(&l);
> >         +               raw_spin_unlock(&r);
> >         +
> >         +               break;
> >         +       }
> >                 case PR_SET_PDEATHSIG:
> >                         if (!valid_signal(arg2)) {
> >                                 error = -EINVAL;
> >
> > applied (because I am too lazy to compile a module ;) and
> >
>
> FWIW, I converted it to a module [1]

where is [1] ?  not that I think this matters though...

> >         # perl -e 'syscall 157,666'
> >
> > triggers the lockdep bug
> >
> >         =============================
> >         [ BUG: Invalid wait context ]
> >         6.4.0-rc2-00018-g4d6d4c7f541d-dirty #1176 Not tainted
> >         -----------------------------
> >         perl/35 is trying to lock:
> >         ffffffff81c4cc18 (l){....}-{3:3}, at: __do_sys_prctl+0x21b/0x87b
> >         other info that might help us debug this:
> >         context-{5:5}
> >         ...
> >
> > as expected.
> >
>
> Yeah, I tried it here and I had the same results,

OK,

> but only in the RT kernel

this again suggests that your testing was wrong or I am totally confused (quite
possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT.

> But running the reproducer for put_task_struct(), works fine.

which reproducer ?

Oleg.
  
Wander Lairson Costa June 1, 2023, 6:23 p.m. UTC | #10
On Thu, Jun 1, 2023 at 3:14 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 06/01, Wander Lairson Costa wrote:
> >
> > On Mon, May 29, 2023 at 9:23 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > On 05/17, Wander Lairson Costa wrote:
> > > >
> > > > On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > > > >
> > > > > LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ...
> > > > >
> > > > > Again, I'll try to investigate when I have time although I am not sure I can really help.
> > > > >
> > > > > Perhaps you too can try to do this ? ;)
> > > > >
> > > >
> > > > FWIW, I tested this patch with CONFIG_PROVE_LOCK_NESTING in RT and
> > > > stock kernels. No splat happened.
> > >
> > > Strange... FYI, I am running the kernel with this patch
> > >
> > >         diff --git a/kernel/sys.c b/kernel/sys.c
> > >         index 339fee3eff6a..3169cceddf3b 100644
> > >         --- a/kernel/sys.c
> > >         +++ b/kernel/sys.c
> > >         @@ -2412,6 +2412,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > >
> > >                 error = 0;
> > >                 switch (option) {
> > >         +       case 666: {
> > >         +               static DEFINE_SPINLOCK(l);
> > >         +               static DEFINE_RAW_SPINLOCK(r);
> > >         +
> > >         +               raw_spin_lock(&r);
> > >         +               spin_lock(&l);
> > >         +               spin_unlock(&l);
> > >         +               raw_spin_unlock(&r);
> > >         +
> > >         +               break;
> > >         +       }
> > >                 case PR_SET_PDEATHSIG:
> > >                         if (!valid_signal(arg2)) {
> > >                                 error = -EINVAL;
> > >
> > > applied (because I am too lazy to compile a module ;) and
> > >
> >
> > FWIW, I converted it to a module [1]
>
> where is [1] ?  not that I think this matters though...
>
> > >         # perl -e 'syscall 157,666'
> > >
> > > triggers the lockdep bug
> > >
> > >         =============================
> > >         [ BUG: Invalid wait context ]
> > >         6.4.0-rc2-00018-g4d6d4c7f541d-dirty #1176 Not tainted
> > >         -----------------------------
> > >         perl/35 is trying to lock:
> > >         ffffffff81c4cc18 (l){....}-{3:3}, at: __do_sys_prctl+0x21b/0x87b
> > >         other info that might help us debug this:
> > >         context-{5:5}
> > >         ...
> > >
> > > as expected.
> > >
> >
> > Yeah, I tried it here and I had the same results,
>
> OK,
>
> > but only in the RT kernel
>
> this again suggests that your testing was wrong or I am totally confused (quite
> possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT.
>

Hrm, could you please share your .config?

> > But running the reproducer for put_task_struct(), works fine.
>
> which reproducer ?
>

Only now I noticed I didn't add the reproducer to the commit message:

while true; do
    stress-ng --sched deadline --sched-period 1000000000 \
        --sched-runtime 800000000 --sched-deadline 1000000000 \
        --mmapfork 23 -t 20
done
  
Oleg Nesterov June 2, 2023, 5:34 p.m. UTC | #11
On 06/01, Wander Lairson Costa wrote:
>
> On Thu, Jun 1, 2023 at 3:14 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > but only in the RT kernel
> >
> > this again suggests that your testing was wrong or I am totally confused (quite
> > possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT.
> >
>
> Hrm, could you please share your .config?

Sure. I do not want to spam the list, I'll send you a private email.

Can you share your kernel module code?

Did you verify that debug_locks != 0 as I asked in my previous email ?

> > > But running the reproducer for put_task_struct(), works fine.
> >
> > which reproducer ?
> >
>
> Only now I noticed I didn't add the reproducer to the commit message:
>
> while true; do
>     stress-ng --sched deadline --sched-period 1000000000
> --sched-runtime 800000000 --sched-deadline 1000000000 --mmapfork 23 -t
> 20
> done

Cough ;) I think we need a simpler one to ensure that
refcount_sub_and_test(nr, &t->usage) returns true under raw_spin_lock()
and then __put_task_struct() actually takes spin_lock().

Oleg.
  
Oleg Nesterov June 2, 2023, 5:39 p.m. UTC | #12
On 06/01, Wander Lairson Costa wrote:
>
> On Thu, Jun 1, 2023 at 3:14 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > but only in the RT kernel
> >
> > this again suggests that your testing was wrong or I am totally confused (quite
> > possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT.
> >
>
> Hrm, could you please share your .config?

Sure. I do not want to spam the list, I'll send you a private email.

Can you share your kernel module code?

Did you verify that debug_locks != 0 as I asked in my previous email ?

> > > But running the reproducer for put_task_struct(), works fine.
> >
> > which reproducer ?
> >
>
> Only now I noticed I didn't add the reproducer to the commit message:
>
> while true; do
>     stress-ng --sched deadline --sched-period 1000000000
> --sched-runtime 800000000 --sched-deadline 1000000000 --mmapfork 23 -t
> 20
> done

Cough ;) I think we need something more simple to ensure that
refcount_sub_and_test(nr, &t->usage) returns true under raw_spin_lock()
and then __put_task_struct() actually takes spin_lock().

Oleg.
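
The nesting rule behind the raw_spin_lock() then spin_lock() case can be
written down as a small userspace model. This is a simplified sketch and not
lockdep's implementation: the enum, struct ctx, model_lock() and
model_unlock() are hypothetical, and the numeric values are chosen so that
the spinlock and task-context entries line up with the {3:3} and
context-{5:5} annotations in the report quoted earlier; the rest of the
numbering is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative ordering only: lower values are more restrictive. */
enum wait_type {
	WAIT_RAW_SPIN = 2,	/* raw_spinlock_t: never sleeps */
	WAIT_SPIN = 3,		/* spinlock_t: may sleep on PREEMPT_RT */
	WAIT_SLEEP = 4,		/* mutex-like sleeping locks */
	WAIT_TASK = 5,		/* plain task context, nothing held */
};

#define MAX_HELD 8

struct ctx {
	enum wait_type held[MAX_HELD];
	size_t nr_held;
};

/* Most restrictive wait type currently in effect for this context. */
static enum wait_type ctx_limit(const struct ctx *c)
{
	enum wait_type limit = WAIT_TASK;

	for (size_t i = 0; i < c->nr_held; i++)
		if (c->held[i] < limit)
			limit = c->held[i];
	return limit;
}

/*
 * Try to take a lock of wait type t. Returns false for what the model
 * treats as an invalid wait context (or a full table), true otherwise.
 */
static bool model_lock(struct ctx *c, enum wait_type t)
{
	if (t > ctx_limit(c) || c->nr_held == MAX_HELD)
		return false;
	c->held[c->nr_held++] = t;
	return true;
}

/* Release the most recently taken lock. */
static void model_unlock(struct ctx *c)
{
	if (c->nr_held)
		c->nr_held--;
}
```

In this model, taking WAIT_SPIN while WAIT_RAW_SPIN is held is rejected,
whereas the reverse order is accepted. A reproducer for the patch would need
the final reference drop, and therefore the spin_lock() inside
__put_task_struct(), to happen while such a raw lock is held.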
  
Wander Lairson Costa June 5, 2023, 11:24 a.m. UTC | #13
On Fri, Jun 2, 2023 at 2:34 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 06/01, Wander Lairson Costa wrote:
> >
> > On Thu, Jun 1, 2023 at 3:14 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > > but only in the RT kernel
> > >
> > > this again suggests that your testing was wrong or I am totally confused (quite
> > > possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT.
> > >
> >
> > Hrm, could you please share your .config?
>
> Sure. I do not want to spam the list, I'll send you a private email.
>

Thanks. I found an unrelated earlier splat in the console code. That's
why I couldn't reproduce it in the stock kernel.

> Can you share your kernel module code?
>

*facepalm* I forgot to post the link: https://github.com/walac/test-prove-lock/

> Did you verify that debug_locks != 0 as I asked in my previous email ?
>
> > > > But running the reproducer for put_task_struct(), works fine.
> > >
> > > which reproducer ?
> > >
> >
> > Only now I noticed I didn't add the reproducer to the commit message:
> >
> > while true; do
> >     stress-ng --sched deadline --sched-period 1000000000
> > --sched-runtime 800000000 --sched-deadline 1000000000 --mmapfork 23 -t
> > 20
> > done
>
> Cough ;) I think we need a more simple one to enssure that
> refcount_sub_and_test(nr, &t->usage) returns true under raw_spin_lock()
> and then __put_task_struct() actually takes spin_lock().
>
> Oleg.
>
  
Oleg Nesterov June 6, 2023, 8:39 p.m. UTC | #14
On 06/05, Wander Lairson Costa wrote:
>
> Thanks. I found an unrelated earlier splat in the console code. That's
> why I couldn't reproduce it in the stock kernel.

As expected...

So... Not sure what can I say ;) can you verify that this patch doesn't solve
the issues with CONFIG_PROVE_RAW_LOCK_NESTING pointed out by Sebastian? Using
stress-ng or anything else.

This is not that bad, unless I am totally confused the current code (without
your patch) has the same problem (otherwise we wouldn't need this fix).

But perhaps you can make 2/2 which adds the DEFINE_WAIT_OVERRIDE_MAP() hack
as Peter suggested?

Oleg.
  
Wander Lairson Costa June 9, 2023, 7:07 p.m. UTC | #15
On Tue, Jun 6, 2023 at 5:40 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 06/05, Wander Lairson Costa wrote:
> >
> > Thanks. I found an unrelated earlier splat in the console code. That's
> > why I couldn't reproduce it in the stock kernel.
>
> As expected...
>
> So... Not sure what can I say ;) can you verify that this patch doesn't solve
> the issues with CONFIG_PROVE_RAW_LOCK_NESTING pointed out by Sebastian? Using
> stress-ng or anything else.
>

I managed to test it without a console. No issues happened in the stock kernel.

> This is not that bad, unless I am totally confused the current code (without
> your patch) has the same problem (otherwise we wouldn't need this fix).
>

That's my understanding as well.

> But perhaps you can make 2/2 which adds the DEFINE_WAIT_OVERRIDE_MAP() hack
> as Peter suggested?
>

Yes, sure. I would like to get the issue reproduced in practice to
make sure I am really fixing the problem. But I can live with that.

> Oleg.
>
  

Patch

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index d6c48163c6de..9bcb9535d4e1 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -112,10 +112,36 @@  static inline struct task_struct *get_task_struct(struct task_struct *t)
 }
 
 extern void __put_task_struct(struct task_struct *t);
+extern void __put_task_struct_rcu_cb(struct rcu_head *rhp);
 
 static inline void put_task_struct(struct task_struct *t)
 {
-	if (refcount_dec_and_test(&t->usage))
+	if (!refcount_dec_and_test(&t->usage))
+		return;
+
+	/*
+	 * under PREEMPT_RT, we can't call put_task_struct
+	 * in atomic context because it will indirectly
+	 * acquire sleeping locks.
+	 *
+	 * call_rcu() will schedule __put_task_struct_rcu_cb()
+	 * to be called in process context.
+	 *
+	 * __put_task_struct() is called when
+	 * refcount_dec_and_test(&t->usage) succeeds.
+	 *
+	 * This means that it can't "conflict" with
+	 * put_task_struct_rcu_user() which abuses ->rcu the same
+	 * way; rcu_users has a reference so task->usage can't be
+	 * zero after rcu_users 1 -> 0 transition.
+	 *
+	 * delayed_free_task() also uses ->rcu, but it is only called
+	 * when it fails to fork a process. Therefore, there is no
+	 * way it can conflict with put_task_struct().
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
+		call_rcu(&t->rcu, __put_task_struct_rcu_cb);
+	else
 		__put_task_struct(t);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..fd3bb4a554c4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -846,6 +846,14 @@  void __put_task_struct(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(__put_task_struct);
 
+void __put_task_struct_rcu_cb(struct rcu_head *rhp)
+{
+	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
+
+	__put_task_struct(task);
+}
+EXPORT_SYMBOL_GPL(__put_task_struct_rcu_cb);
+
 void __init __weak arch_task_cache_init(void) { }
 
 /*