[v2,1/4] sched/task: Add the put_task_struct_atomic_safe function

Message ID 20230120150246.20797-2-wander@redhat.com
State New
Series Fix put_task_struct() calls under PREEMPT_RT

Commit Message

Wander Lairson Costa Jan. 20, 2023, 3:02 p.m. UTC
  With PREEMPT_RT, it is unsafe to call put_task_struct() in atomic
contexts because it indirectly acquires sleeping locks.

Introduce put_task_struct_atomic_safe(), which schedules
__put_task_struct() through call_rcu() when the kernel is compiled with
PREEMPT_RT.

A more natural approach would be to use a workqueue, but because
PREEMPT_RT does not allow dynamic memory allocation from atomic
context, the code would become more complex: we would need to embed a
work_struct instance in task_struct and initialize it whenever a new
task_struct is allocated.
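
For illustration, that rejected variant would look roughly like this
(the ->put_work member, its init site, and put_task_struct_work() are
hypothetical, sketched only to show the extra plumbing):

	/*
	 * Hypothetical sketch: task_struct would grow a work_struct
	 * member, initialized once per task at allocation time, e.g.
	 * in dup_task_struct():
	 *
	 *	INIT_WORK(&tsk->put_work, put_task_struct_work);
	 */
	static void put_task_struct_work(struct work_struct *work)
	{
		struct task_struct *task =
			container_of(work, struct task_struct, put_work);

		__put_task_struct(task);
	}

	static inline void put_task_struct_atomic_safe(struct task_struct *task)
	{
		/* queue_work() is callable from atomic context. */
		if (refcount_dec_and_test(&task->usage))
			queue_work(system_wq, &task->put_work);
	}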

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Cc: Paul McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/sched/task.h | 21 +++++++++++++++++++++
 kernel/fork.c              |  8 ++++++++
 2 files changed, 29 insertions(+)
  

Comments

Oleg Nesterov Jan. 23, 2023, 4:30 p.m. UTC | #1
On 01/20, Wander Lairson Costa wrote:
>
> +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> +{
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +		/*
> +		 * Decrement the refcount explicitly to avoid unnecessarily
> +		 * calling call_rcu.
> +		 */
> +		if (refcount_dec_and_test(&task->usage))
> +			/*
> +			 * under PREEMPT_RT, we can't call put_task_struct
> +			 * in atomic context because it will indirectly
> +			 * acquire sleeping locks.
> +			 */
> +			call_rcu(&task->rcu, __delayed_put_task_struct);
                                  ^^^^^^^^^
I am not sure the usage of task->rcu is safe...

Suppose that, before __delayed_put_task_struct() is called by RCU, this task
does the last schedule and calls put_task_struct_rcu_user().

And, can't we simply turn put_task_struct() into something like

	put_task_struct(struct task_struct *t)
	{
		if (refcount_dec_and_test(&t->usage)) {
			if (IS_ENABLED(CONFIG_PREEMPT_RT)
			    && (in_atomic() || irqs_disabled()))
				call_rcu(...);
			else
				__put_task_struct(t);
		}
	}

?

Oleg.
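
For reference, a filled-in version of that sketch; the elided
call_rcu(...) is assumed to take the callback this patch introduces:

	static inline void put_task_struct(struct task_struct *t)
	{
		if (refcount_dec_and_test(&t->usage)) {
			if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
			    (in_atomic() || irqs_disabled()))
				/* Assumed completion of the elided call_rcu(...). */
				call_rcu(&t->rcu, __delayed_put_task_struct);
			else
				__put_task_struct(t);
		}
	}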
  
Oleg Nesterov Jan. 23, 2023, 4:49 p.m. UTC | #2
On 01/23, Oleg Nesterov wrote:
>
> On 01/20, Wander Lairson Costa wrote:
> >
> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> > +{
> > +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > +		/*
> > +		 * Decrement the refcount explicitly to avoid unnecessarily
> > +		 * calling call_rcu.
> > +		 */
> > +		if (refcount_dec_and_test(&task->usage))
> > +			/*
> > +			 * under PREEMPT_RT, we can't call put_task_struct
> > +			 * in atomic context because it will indirectly
> > +			 * acquire sleeping locks.
> > +			 */
> > +			call_rcu(&task->rcu, __delayed_put_task_struct);
>                                   ^^^^^^^^^
> I am not sure the usage of task->rcu is safe...
>
> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
> does the last schedule and calls put_task_struct_rcu_user().

Ah, sorry, please forget, rcu_users != 0 implies task->usage != 0.


> And, can't we simply turn put_task_struct() into something like
> 
> 	put_task_struct(struct task_struct *t)
> 	{
> 		if (refcount_dec_and_test(&t->usage)) {
> 			if (IS_ENABLED(CONFIG_PREEMPT_RT)
> 			    && (in_atomic() || irqs_disabled()))
> 				call_rcu(...);
> 			else
> 				__put_task_struct(t);
> 		}
> 	}
> 
> ?
> 
> Oleg.
  
Wander Lairson Costa Jan. 23, 2023, 5:24 p.m. UTC | #3
On Mon, Jan 23, 2023 at 1:30 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/20, Wander Lairson Costa wrote:
> >
> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> > +{
> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > +             /*
> > +              * Decrement the refcount explicitly to avoid unnecessarily
> > +              * calling call_rcu.
> > +              */
> > +             if (refcount_dec_and_test(&task->usage))
> > +                     /*
> > +                      * under PREEMPT_RT, we can't call put_task_struct
> > +                      * in atomic context because it will indirectly
> > +                      * acquire sleeping locks.
> > +                      */
> > +                     call_rcu(&task->rcu, __delayed_put_task_struct);
>                                   ^^^^^^^^^
> I am not sure the usage of task->rcu is safe...
>
> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
> does the last schedule and calls put_task_struct_rcu_user().
>
> And, can't we simply turn put_task_struct() into something like
>
>         put_task_struct(struct task_struct *t)
>         {
>                 if (refcount_dec_and_test(&t->usage)) {
>                         if (IS_ENABLED(CONFIG_PREEMPT_RT)
>                             && (in_atomic() || irqs_disabled()))
>                                 call_rcu(...);
>                         else
>                                 __put_task_struct(t);
>                 }
>         }
>
> ?

Yeah, that was one approach I thought about. I chose to use an
explicit function because I assumed calling __put_task_struct() from a
non-preemptable context should be the exception, not the rule.
Therefore (if I am correct in my assumption), it would make sense for
only some call sites to pay the overhead price for it. But this is
just a guess, and I have no evidence to support my claim.
  
Valentin Schneider Jan. 27, 2023, 3:55 p.m. UTC | #4
On 23/01/23 14:24, Wander Lairson Costa wrote:
> On Mon, Jan 23, 2023 at 1:30 PM Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> On 01/20, Wander Lairson Costa wrote:
>> >
>> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
>> > +{
>> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>> > +             /*
>> > +              * Decrement the refcount explicitly to avoid unnecessarily
>> > +              * calling call_rcu.
>> > +              */
>> > +             if (refcount_dec_and_test(&task->usage))
>> > +                     /*
>> > +                      * under PREEMPT_RT, we can't call put_task_struct
>> > +                      * in atomic context because it will indirectly
>> > +                      * acquire sleeping locks.
>> > +                      */
>> > +                     call_rcu(&task->rcu, __delayed_put_task_struct);
>>                                   ^^^^^^^^^
>> I am not sure the usage of task->rcu is safe...
>>
>> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
>> does the last schedule and calls put_task_struct_rcu_user().
>>
>> And, can't we simply turn put_task_struct() into something like
>>
>>         put_task_struct(struct task_struct *t)
>>         {
>>                 if (refcount_dec_and_test(&t->usage)) {
>>                         if (IS_ENABLED(CONFIG_PREEMPT_RT)
>>                             && (in_atomic() || irqs_disabled()))
>>                                 call_rcu(...);
>>                         else
>>                                 __put_task_struct(t);
>>                 }
>>         }
>>
>> ?
>
> Yeah, that was one approach I thought about. I chose to use an
> explicit function because I assumed calling __put_task_struct() from a
> non-preemptable context should be the exception, not the rule.

I'd tend to agree.

> Therefore (if I am correct in my assumption), it would make sense for
> only some call sites to pay the overhead price for it. But this is
> just a guess, and I have no evidence to support my claim.

My worry here is that it's easy to miss problematic callgraphs, and it's
potentially easy for new ones to creep in. Having a solution within
put_task_struct() itself would prevent that.

Another thing, if you look at release_task_stack(), it either caches the
outgoing stack for later use, or frees it via RCU (regardless of
PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
of the task struct to RCU?
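
A minimal sketch of that suggestion, assuming __delayed_put_task_struct()
from this patch is reused and the free is deferred unconditionally:

	static inline void put_task_struct(struct task_struct *t)
	{
		/* Mirror release_task_stack(): always free via RCU. */
		if (refcount_dec_and_test(&t->usage))
			call_rcu(&t->rcu, __delayed_put_task_struct);
	}

Whether the extra RCU callback traffic is acceptable for !PREEMPT_RT is
exactly the performance question discussed below.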
  
Wander Lairson Costa Jan. 30, 2023, 11:49 a.m. UTC | #5
On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 23/01/23 14:24, Wander Lairson Costa wrote:
> > On Mon, Jan 23, 2023 at 1:30 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >>
> >> On 01/20, Wander Lairson Costa wrote:
> >> >
> >> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> >> > +{
> >> > +     if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> >> > +             /*
> >> > +              * Decrement the refcount explicitly to avoid unnecessarily
> >> > +              * calling call_rcu.
> >> > +              */
> >> > +             if (refcount_dec_and_test(&task->usage))
> >> > +                     /*
> >> > +                      * under PREEMPT_RT, we can't call put_task_struct
> >> > +                      * in atomic context because it will indirectly
> >> > +                      * acquire sleeping locks.
> >> > +                      */
> >> > +                     call_rcu(&task->rcu, __delayed_put_task_struct);
> >>                                   ^^^^^^^^^
> >> I am not sure the usage of task->rcu is safe...
> >>
> >> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
> >> does the last schedule and calls put_task_struct_rcu_user().
> >>
> >> And, can't we simply turn put_task_struct() into something like
> >>
> >>         put_task_struct(struct task_struct *t)
> >>         {
> >>                 if (refcount_dec_and_test(&t->usage)) {
> >>                         if (IS_ENABLED(CONFIG_PREEMPT_RT)
> >>                             && (in_atomic() || irqs_disabled()))
> >>                                 call_rcu(...);
> >>                         else
> >>                                 __put_task_struct(t);
> >>                 }
> >>         }
> >>
> >> ?
> >
> > Yeah, that was one approach I thought about. I chose to use an
> > explicit function because I assumed calling __put_task_struct() from a
> > non-preemptable context should be the exception, not the rule.
>
> I'd tend to agree.
>
> > Therefore (if I am correct in my assumption), it would make sense for
> > only some call sites to pay the overhead price for it. But this is
> > just a guess, and I have no evidence to support my claim.
>
> My worry here is that it's easy to miss problematic callgraphs, and it's
> potentially easy for new ones to creep in. Having a solution within
> put_task_struct() itself would prevent that.
>

We could add a WARN_ON statement in put_task_struct() to detect such cases.

> Another thing, if you look at release_task_stack(), it either caches the
> outgoing stack for later use, or frees it via RCU (regardless of
> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
> of the task struct to RCU?
>

That's a point. Do you mean doing that even for !PREEMPT_RT?
  
Valentin Schneider Jan. 30, 2023, 2:46 p.m. UTC | #6
On 30/01/23 08:49, Wander Lairson Costa wrote:
> On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> On 23/01/23 14:24, Wander Lairson Costa wrote:
>> > Therefore (if I am correct in my assumption), it would make sense for
>> > only some call sites to pay the overhead price for it. But this is
>> > just a guess, and I have no evidence to support my claim.
>>
>> My worry here is that it's easy to miss problematic callgraphs, and it's
>> potentially easy for new ones to creep in. Having a solution within
>> put_task_struct() itself would prevent that.
>>
>
> We could add a WARN_ON statement in put_task_struct() to detect such cases.
>

Anyone running their kernel with DEBUG_ATOMIC_SLEEP should be able to
detect misuse, but it doesn't change that some callgraphs will only
materialize under certain hardware/configuration combos.

>> Another thing, if you look at release_task_stack(), it either caches the
>> outgoing stack for later use, or frees it via RCU (regardless of
>> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
>> of the task struct to RCU?
>>
>
> That's a point. Do you mean doing that even for !PREEMPT_RT?

Could be worth a try? I think because of the cache thing the task stack is
a bit less aggressive wrt RCU callback processing, but at a quick glance I
don't see any fundamental reason why the task_struct itself can't be given
the same treatment.
  
Wander Lairson Costa Jan. 30, 2023, 2:58 p.m. UTC | #7
On Mon, Jan 30, 2023 at 11:47 AM Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 30/01/23 08:49, Wander Lairson Costa wrote:
> > On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >>
> >> On 23/01/23 14:24, Wander Lairson Costa wrote:
> >> > Therefore (if I am correct in my assumption), it would make sense for
> >> > only some call sites to pay the overhead price for it. But this is
> >> > just a guess, and I have no evidence to support my claim.
> >>
> >> My worry here is that it's easy to miss problematic callgraphs, and it's
> >> potentially easy for new ones to creep in. Having a solution within
> >> put_task_struct() itself would prevent that.
> >>
> >
> > We could add a WARN_ON statement in put_task_struct() to detect such cases.
> >
>
> Anyone running their kernel with DEBUG_ATOMIC_SLEEP should be able to
> detect misuse, but it doesn't change that some callgraphs will only
> materialize under certain hardware/configuration combos.
>

If we put a WARN_ON in put_task_struct(), we also catch the cases where
the reference count doesn't reach zero.
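
A minimal sketch of such a check, assuming it sits at the top of
put_task_struct() (the exact form was left open in the thread):

	static inline void put_task_struct(struct task_struct *t)
	{
		/*
		 * Unlike DEBUG_ATOMIC_SLEEP, which only fires once
		 * __put_task_struct() actually takes a sleeping lock,
		 * this warns on every atomic-context call on PREEMPT_RT,
		 * whether or not the refcount drops to zero.
		 */
		WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT) &&
			     (in_atomic() || irqs_disabled()));

		if (refcount_dec_and_test(&t->usage))
			__put_task_struct(t);
	}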

> >> Another thing, if you look at release_task_stack(), it either caches the
> >> outgoing stack for later use, or frees it via RCU (regardless of
> >> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
> >> of the task struct to RCU?
> >>
> >
> > That's a point. Do you mean doing that even for !PREEMPT_RT?
>
> Could be worth a try?

Sure. But I would do it only for PREEMPT_RT.

> I think because of the cache thing the task stack is
> a bit less aggressive wrt RCU callback processing, but at a quick glance I
> don't see any fundamental reason why the task_struct itself can't be given
> the same treatment.
>

Any idea about tests to catch performance regressions?

  
Valentin Schneider Jan. 30, 2023, 3:20 p.m. UTC | #8
On 30/01/23 11:58, Wander Lairson Costa wrote:
> On Mon, Jan 30, 2023 at 11:47 AM Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> On 30/01/23 08:49, Wander Lairson Costa wrote:
>> > On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >>
>> >> On 23/01/23 14:24, Wander Lairson Costa wrote:
>> >> > Therefore (if I am correct in my assumption), it would make sense for
>> >> > only some call sites to pay the overhead price for it. But this is
>> >> > just a guess, and I have no evidence to support my claim.
>> >>
>> >> My worry here is that it's easy to miss problematic callgraphs, and it's
>> >> potentially easy for new ones to creep in. Having a solution within
>> >> put_task_struct() itself would prevent that.
>> >>
>> >
>> > We could add a WARN_ON statement in put_task_struct() to detect such cases.
>> >
>>
>> Anyone running their kernel with DEBUG_ATOMIC_SLEEP should be able to
>> detect misuse, but it doesn't change that some callgraphs will only
>> materialize under certain hardware/configuration combos.
>>
>
> If we put a WARN_ON in put_task_struct(), we catch cases where the
> reference count didn't reach zero.
>

True, that'd be an improvement.

>> >> Another thing, if you look at release_task_stack(), it either caches the
>> >> outgoing stack for later use, or frees it via RCU (regardless of
>> >> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
>> >> of the task struct to RCU?
>> >>
>> >
>> > That's a point. Do you mean doing that even for !PREEMPT_RT?
>>
>> Could be worth a try?
>
> Sure. But I would do it only for PREEMPT_RT.
>
>> I think because of the cache thing the task stack is
>> a bit less aggressive wrt RCU callback processing, but at a quick glance I
>> don't see any fundamental reason why the task_struct itself can't be given
>> the same treatment.
>>
>
> Any idea about tests to catch performance regressions?
>

I would wager anything fork-heavy with short-lived tasks, say loops of
short hackbench runs; I believe stress-ng also has a fork test case.
  
Daniel Bristot de Oliveira Feb. 17, 2023, 5:35 p.m. UTC | #9
On 1/30/23 08:49, Wander Lairson Costa wrote:
>> Another thing, if you look at release_task_stack(), it either caches the
>> outgoing stack for later use, or frees it via RCU (regardless of
>> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
>> of the task struct to RCU?
>>
> That's a point. Do you mean doing that even for !PREEMPT_RT?
> 

It seems that Luca Abeni (in Cc) is hitting the bug in the non-rt kernel
with SCHED_DEADLINE...

adding him to the thread.

-- Daniel
  
luca abeni Feb. 17, 2023, 7:04 p.m. UTC | #10
Hi all,

On Fri, 17 Feb 2023 14:35:22 -0300
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> On 1/30/23 08:49, Wander Lairson Costa wrote:
> >> Another thing, if you look at release_task_stack(), it either
> >> caches the outgoing stack for later use, or frees it via RCU
> >> (regardless of PREEMPT_RT). Perhaps we could follow that and just
> >> always punt the freeing of the task struct to RCU?
> >>  
> > That's a point. Do you mean doing that even for !PREEMPT_RT?
> >   
> 
> It seems that Luca Abeni (in Cc) is hitting the bug in the non-rt
> kernel with SCHED_DEADLINE...
> 
> adding him to the thread.

Thanks Daniel for adding me.

I triggered this "BUG: Invalid wait context" with a 5.15.76 kernel,
without PREEMPT_RT. I can reproduce it easily on a KVM-based VM; if
more information or tests are needed, let me know.


			Luca
  
Wander Lairson Costa Feb. 22, 2023, 6:42 p.m. UTC | #11
On Fri, Feb 17, 2023 at 08:04:37PM +0100, luca abeni wrote:
> Hi all,
> 
> On Fri, 17 Feb 2023 14:35:22 -0300
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
> > On 1/30/23 08:49, Wander Lairson Costa wrote:
> > >> Another thing, if you look at release_task_stack(), it either
> > >> caches the outgoing stack for later use, or frees it via RCU
> > >> (regardless of PREEMPT_RT). Perhaps we could follow that and just
> > >> always punt the freeing of the task struct to RCU?
> > >>  
> > > That's a point. Do you mean doing that even for !PREEMPT_RT?
> > >   
> > 
> > It seems that Luca Abeni (in Cc) is hitting the bug in the non-rt
> > kernel with SCHED_DEADLINE...
> > 
> > adding him to the thread.
> 
> Thanks Daniel for adding me.
> 
> I triggered this "BUG: Invalid wait context" with a 5.15.76 kernel,
> without PREEMPT_RT. I can reproduce it easily on a KVM-based VM; if
> more information or tests are needed, let me know.
> 

Does it happen in linux-6.1 or linux-6.2?

> 
> 			Luca
>
  
luca abeni Feb. 22, 2023, 9 p.m. UTC | #12
On Wed, 22 Feb 2023 15:42:37 -0300
Wander Lairson Costa <wander@redhat.com> wrote:
[...]
> > I triggered this "BUG: Invalid wait context" with a 5.15.76 kernel,
> > without PREEMPT_RT. I can reproduce it easily on a KVM-based VM; if
> > more information or tests are needed, let me know.
> >   
> 
> Does it happen in linux-6.1 or linux-6.2?

I only tried with 5.15.76... I am going to test 6.2 and I'll let you
know ASAP.


			Luca
  
luca abeni Feb. 24, 2023, 8:46 a.m. UTC | #13
On Wed, 22 Feb 2023 22:00:34 +0100
luca abeni <luca.abeni@santannapisa.it> wrote:

> On Wed, 22 Feb 2023 15:42:37 -0300
> Wander Lairson Costa <wander@redhat.com> wrote:
> [...]
> > > I triggered this "BUG: Invalid wait context" with a 5.15.76
> > > kernel, without PREEMPT_RT. I can reproduce it easily on a
> > > KVM-based VM; if more information or tests are needed, let me
> > > know. 
> > 
> > Does it happen in linux-6.1 or linux-6.2?  
> 
> I only tried with 5.15.76... I am going to test 6.2 and I'll let you
> know ASAP.

For various reasons it took more time than expected, but I managed to
reproduce the bug with 6.2... Here are the relevant kernel messages:

[ 1246.556100] =============================
[ 1246.559104] [ BUG: Invalid wait context ]
[ 1246.562270] 6.2.0 #4 Not tainted
[ 1246.564854] -----------------------------
[ 1246.567260] swapper/3/0 is trying to lock:
[ 1246.568665] ffff8c2c7ebb2c10 (&c->lock){..-.}-{3:3}, at: put_cpu_partial+0x24/0x1c0
[ 1246.571325] other info that might help us debug this:
[ 1246.573045] context-{2:2}
[ 1246.574166] no locks held by swapper/3/0.
[ 1246.575434] stack backtrace:
[ 1246.576207] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.2.0 #4
[ 1246.578184] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 1246.580815] Call Trace:
[ 1246.581723]  <IRQ>
[ 1246.582570]  dump_stack_lvl+0x49/0x61
[ 1246.583860]  __lock_acquire.cold+0xc8/0x31c
[ 1246.584923]  ? __lock_acquire+0x3be/0x1df0
[ 1246.585915]  lock_acquire+0xce/0x2f0
[ 1246.586819]  ? put_cpu_partial+0x24/0x1c0
[ 1246.588177]  ? lock_is_held_type+0xdb/0x130
[ 1246.589519]  put_cpu_partial+0x5b/0x1c0
[ 1246.590996]  ? put_cpu_partial+0x24/0x1c0
[ 1246.592212]  inactive_task_timer+0x263/0x4c0
[ 1246.593509]  ? __pfx_inactive_task_timer+0x10/0x10
[ 1246.594953]  __hrtimer_run_queues+0x1bf/0x470
[ 1246.596297]  hrtimer_interrupt+0x117/0x250
[ 1246.597528]  __sysvec_apic_timer_interrupt+0x99/0x270
[ 1246.599015]  sysvec_apic_timer_interrupt+0x8d/0xc0
[ 1246.600416]  </IRQ>
[ 1246.601170]  <TASK>
[ 1246.601918]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 1246.603377] RIP: 0010:default_idle+0xf/0x20
[ 1246.604640] Code: f6 5d 41 5c e9 72 4a 6e ff cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 03 52 2a 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[ 1246.609718] RSP: 0018:ffffa1a2c009bed0 EFLAGS: 00000202
[ 1246.611259] RAX: ffffffffa4961a60 RBX: ffff8c2c4126b000 RCX: 0000000000000000
[ 1246.613230] RDX: 0000000000000000 RSI: ffffffffa510271b RDI: ffffffffa50d5b15
[ 1246.615266] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000001
[ 1246.617275] R10: 0000000000000000 R11: ffff8c2c4126b000 R12: ffff8c2c4126b000
[ 1246.619318] R13: ffff8c2c4126b000 R14: 0000000000000000 R15: 0000000000000000
[ 1246.621293]  ? __pfx_default_idle+0x10/0x10
[ 1246.622581]  default_idle_call+0x71/0x220
[ 1246.623790]  do_idle+0x210/0x290
[ 1246.624827]  cpu_startup_entry+0x18/0x20
[ 1246.626016]  start_secondary+0xf1/0x100
[ 1246.627200]  secondary_startup_64_no_verify+0xe0/0xeb
[ 1246.628707]  </TASK>


Let me know if you need more information, or
I should run other tests/experiments.


				Luca
  
Wander Lairson Costa Feb. 24, 2023, 1:02 p.m. UTC | #14
On Fri, Feb 24, 2023 at 09:46:48AM +0100, luca abeni wrote:
> On Wed, 22 Feb 2023 22:00:34 +0100
> luca abeni <luca.abeni@santannapisa.it> wrote:
> 
> > On Wed, 22 Feb 2023 15:42:37 -0300
> > Wander Lairson Costa <wander@redhat.com> wrote:
> > [...]
> > > > I triggered this "BUG: Invalid wait context" with a 5.15.76
> > > > kernel, without PREEMPT_RT. I can reproduce it easily on a
> > > > KVM-based VM; if more information or tests are needed, let me
> > > > know. 
> > > 
> > > Does it happen in linux-6.1 or linux-6.2?  
> > 
> > I only tried with 5.15.76... I am going to test 6.2 and I'll let you
> > know ASAP.
> 
> For various reasons it took more time than expected, but I managed to
> reproduce the bug with 6.2... Here are the relevant kernel messages:
> 
> [ 1246.556100] =============================
> [ 1246.559104] [ BUG: Invalid wait context ]
> [ 1246.562270] 6.2.0 #4 Not tainted
> [ 1246.564854] -----------------------------
> [ 1246.567260] swapper/3/0 is trying to lock:
> [ 1246.568665] ffff8c2c7ebb2c10 (&c->lock){..-.}-{3:3}, at: put_cpu_partial+0x24/0x1c0
> [ 1246.571325] other info that might help us debug this:
> [ 1246.573045] context-{2:2}
> [ 1246.574166] no locks held by swapper/3/0.
> [ 1246.575434] stack backtrace:
> [ 1246.576207] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.2.0 #4
> [ 1246.578184] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> [ 1246.580815] Call Trace:
> [ 1246.581723]  <IRQ>
> [ 1246.582570]  dump_stack_lvl+0x49/0x61
> [ 1246.583860]  __lock_acquire.cold+0xc8/0x31c
> [ 1246.584923]  ? __lock_acquire+0x3be/0x1df0
> [ 1246.585915]  lock_acquire+0xce/0x2f0
> [ 1246.586819]  ? put_cpu_partial+0x24/0x1c0
> [ 1246.588177]  ? lock_is_held_type+0xdb/0x130
> [ 1246.589519]  put_cpu_partial+0x5b/0x1c0
> [ 1246.590996]  ? put_cpu_partial+0x24/0x1c0
> [ 1246.592212]  inactive_task_timer+0x263/0x4c0
> [ 1246.593509]  ? __pfx_inactive_task_timer+0x10/0x10
> [ 1246.594953]  __hrtimer_run_queues+0x1bf/0x470
> [ 1246.596297]  hrtimer_interrupt+0x117/0x250
> [ 1246.597528]  __sysvec_apic_timer_interrupt+0x99/0x270
> [ 1246.599015]  sysvec_apic_timer_interrupt+0x8d/0xc0
> [ 1246.600416]  </IRQ>
> [ 1246.601170]  <TASK>
> [ 1246.601918]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 1246.603377] RIP: 0010:default_idle+0xf/0x20
> [ 1246.604640] Code: f6 5d 41 5c e9 72 4a 6e ff cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 03 52 2a 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> [ 1246.609718] RSP: 0018:ffffa1a2c009bed0 EFLAGS: 00000202
> [ 1246.611259] RAX: ffffffffa4961a60 RBX: ffff8c2c4126b000 RCX: 0000000000000000
> [ 1246.613230] RDX: 0000000000000000 RSI: ffffffffa510271b RDI: ffffffffa50d5b15
> [ 1246.615266] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000001
> [ 1246.617275] R10: 0000000000000000 R11: ffff8c2c4126b000 R12: ffff8c2c4126b000
> [ 1246.619318] R13: ffff8c2c4126b000 R14: 0000000000000000 R15: 0000000000000000
> [ 1246.621293]  ? __pfx_default_idle+0x10/0x10
> [ 1246.622581]  default_idle_call+0x71/0x220
> [ 1246.623790]  do_idle+0x210/0x290
> [ 1246.624827]  cpu_startup_entry+0x18/0x20
> [ 1246.626016]  start_secondary+0xf1/0x100
> [ 1246.627200]  secondary_startup_64_no_verify+0xe0/0xeb
> [ 1246.628707]  </TASK>
> 
> 
> Let me know if you need more information, or
> I should run other tests/experiments.
> 

This seems to be a different (maybe related?) issue. Would you mind
sharing your .config and steps to reproduce it?

> 
> 				Luca
>
  
luca abeni Feb. 24, 2023, 4:01 p.m. UTC | #15
On Fri, 24 Feb 2023 10:02:40 -0300
Wander Lairson Costa <wander@redhat.com> wrote:
[...]
> > [... lockdep report snipped; it is quoted in full in the previous message ...]
> > 
> > 
> > Let me know if you need more information, or
> > I should run other tests/experiments.
> >   
> 
> This seems to be a different (maybe related?) issue. Would you mind
> sharing your .config and steps to reproduce it?

Ah, sorry then... I probably misunderstood the kernel messages
(in my understanding, this is lockdep complaining because
put_task_struct() - which can take a sleeping lock - is invoked
from a timer callback).


Anyway, I attach the config (it is basically a "make defconfig;
make kvm_guest.config" with some debug options manually enabled - I
think the relevant one is CONFIG_PROVE_RAW_LOCK_NESTING)


			Luca
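
The backtrace above points at put_task_struct() being reached from
inactive_task_timer(), an hrtimer callback that runs with interrupts
disabled. A sketch of the kind of call-site conversion the rest of this
series implies (an illustrative hunk, not quoted from the later patches):

	--- a/kernel/sched/deadline.c
	+++ b/kernel/sched/deadline.c
	@@ ... @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
	-	put_task_struct(p);
	+	put_task_struct_atomic_safe(p);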
  

Patch

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 357e0068497c..80b4c5812563 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -127,6 +127,27 @@  static inline void put_task_struct_many(struct task_struct *t, int nr)
 
 void put_task_struct_rcu_user(struct task_struct *task);
 
+extern void __delayed_put_task_struct(struct rcu_head *rhp);
+
+static inline void put_task_struct_atomic_safe(struct task_struct *task)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		/*
+		 * Decrement the refcount explicitly to avoid unnecessarily
+		 * calling call_rcu.
+		 */
+		if (refcount_dec_and_test(&task->usage))
+			/*
+			 * under PREEMPT_RT, we can't call put_task_struct
+			 * in atomic context because it will indirectly
+			 * acquire sleeping locks.
+			 */
+			call_rcu(&task->rcu, __delayed_put_task_struct);
+	} else {
+		put_task_struct(task);
+	}
+}
+
 /* Free all architecture-specific resources held by a thread. */
 void release_thread(struct task_struct *dead_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..3d7a4e9311b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -859,6 +859,14 @@  void __put_task_struct(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(__put_task_struct);
 
+void __delayed_put_task_struct(struct rcu_head *rhp)
+{
+	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
+
+	__put_task_struct(task);
+}
+EXPORT_SYMBOL_GPL(__delayed_put_task_struct);
+
 void __init __weak arch_task_cache_init(void) { }
 
 /*