[2/2] rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks

Message ID 20240129225730.3168681-3-boqun.feng@gmail.com
State New
Series RCU tasks fixes for v6.9

Commit Message

Boqun Feng Jan. 29, 2024, 10:57 p.m. UTC
  From: "Paul E. McKenney" <paulmck@kernel.org>

Holding a mutex across synchronize_rcu_tasks() and acquiring
that same mutex in code called from do_exit() after its call to
exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
results in deadlock.  This is by design, because tasks that are far
enough into do_exit() are no longer present on the tasks list, making
it a bit difficult for RCU Tasks to find them, let alone wait on them
to do a voluntary context switch.  However, such deadlocks are becoming
more frequent.  In addition, lockdep currently does not detect such
deadlocks and they can be difficult to reproduce.

Moreover, if a task voluntarily context switches between its calls to
exit_tasks_rcu_start() and exit_tasks_rcu_stop() (for example, if it
blocks acquiring a mutex), then this task is in an RCU Tasks quiescent
state.  And with some adjustments, RCU Tasks could just as well take
advantage of that fact.

This commit therefore eliminates these deadlocks by replacing the
SRCU-based wait for do_exit() completion with per-CPU lists of tasks
currently exiting.  A given task will be on one of these per-CPU lists for
the same period of time that it would previously have spent in the
corresponding SRCU read-side critical section.  These lists enable RCU Tasks
to find the tasks that have already been removed from the tasks list,
but that must nevertheless be waited upon.

The RCU Tasks grace period gathers any of these do_exit() tasks that it
must wait on, and adds them to the list of holdouts.  Per-CPU locking
and get_task_struct() are used to synchronize addition to and removal
from these lists.

Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/

Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/sched.h |  2 +
 init/init_task.c      |  1 +
 kernel/fork.c         |  1 +
 kernel/rcu/tasks.h    | 89 ++++++++++++++++++++++++++++++-------------
 4 files changed, 67 insertions(+), 26 deletions(-)
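
The per-CPU exit-list scheme described in the commit message can be
sketched in user space roughly as follows.  This is only a model, under
several assumptions: pthread mutexes stand in for the kernel's raw
spinlocks, NR_CPUS is fixed at 4, and every model_* name is hypothetical
rather than taken from the patch.  It covers list addition, removal, and
gathering into a holdout array, and ignores reference counting and memory
ordering.

```c
/* User-space model of per-CPU "exiting task" lists; this is not kernel code. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4		/* hypothetical CPU count for the model */
#define MAX_HOLDOUTS 16		/* hypothetical bound on the holdout array */

struct model_task {
	int id;
	int exit_cpu;			/* which per-CPU list holds the task */
	bool on_holdout_list;		/* already gathered as a holdout? */
	struct model_task *prev, *next;
};

struct model_percpu {
	pthread_mutex_t lock;		/* stands in for the per-CPU raw spinlock */
	struct model_task head;		/* sentinel of a circular list */
};

static struct model_percpu percpu[NR_CPUS];

/* Initialize each per-CPU lock and make each list empty. */
void model_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		pthread_mutex_init(&percpu[cpu].lock, NULL);
		percpu[cpu].head.prev = percpu[cpu].head.next = &percpu[cpu].head;
	}
}

/* Analogue of exit_tasks_rcu_start(): add the task to one CPU's list. */
void model_exit_start(struct model_task *t, int cpu)
{
	struct model_percpu *pc = &percpu[cpu];

	t->exit_cpu = cpu;		/* remembered so that removal finds the same lock */
	pthread_mutex_lock(&pc->lock);
	t->next = pc->head.next;
	t->prev = &pc->head;
	pc->head.next->prev = t;
	pc->head.next = t;
	pthread_mutex_unlock(&pc->lock);
}

/* Analogue of exit_tasks_rcu_stop(): remove the task from its list. */
void model_exit_stop(struct model_task *t)
{
	struct model_percpu *pc = &percpu[t->exit_cpu];

	pthread_mutex_lock(&pc->lock);
	t->prev->next = t->next;
	t->next->prev = t->prev;
	t->prev = t->next = t;
	pthread_mutex_unlock(&pc->lock);
}

/* Analogue of the new loop in rcu_tasks_postscan(): gather exiting tasks. */
int model_postscan(struct model_task **holdouts, int max)
{
	int n = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct model_percpu *pc = &percpu[cpu];

		pthread_mutex_lock(&pc->lock);
		for (struct model_task *t = pc->head.next; t != &pc->head; t = t->next) {
			/* Skip tasks that an earlier scan already gathered. */
			if (!t->on_holdout_list && n < max) {
				t->on_holdout_list = true;
				holdouts[n++] = t;
			}
		}
		pthread_mutex_unlock(&pc->lock);
	}
	return n;
}
```

A caller would invoke model_init() once, bracket each exiting task with
model_exit_start() and model_exit_stop(), and call model_postscan() from
the grace-period side.  Recording exit_cpu in the task mirrors the patch's
rcu_tasks_exit_cpu field, since the task may be running on a different
CPU by the time it removes itself.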
  

Comments

Frederic Weisbecker Feb. 7, 2024, 10:53 p.m. UTC | #1
On Mon, Jan 29, 2024 at 02:57:27PM -0800, Boqun Feng wrote:

With that, I think we can now revert 28319d6dc5e2 (rcu-tasks: Fix
synchronize_rcu_tasks() VS zap_pid_ns_processes()), because if the task
is in rcu_tasks_exit_list, it's treated just like the others and must go
through check_holdout_task(). Therefore, unlike with the previous SRCU
approach, a task sleeping between exit_tasks_rcu_start() and
exit_tasks_rcu_finish() is now in a quiescent state. And that kills the
possible deadlock.
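
To illustrate why a sleeping task now counts as quiescent, here is a
simplified user-space model of a holdout check.  It assumes that the check
compares a snapshot of the task's voluntary-context-switch count against
the current count, which is only one of the conditions the kernel's
check_holdout_task() evaluates.  The holdout_* names and the struct are
hypothetical.

```c
/* Simplified model of a holdout check; this is not the kernel implementation. */
#include <assert.h>
#include <stdbool.h>

struct holdout_task {
	unsigned long nvcsw;		/* voluntary context switches so far */
	unsigned long nvcsw_snap;	/* snapshot taken when queued as a holdout */
	bool holdout;			/* still blocking the grace period? */
};

/* Queue the task as a holdout, remembering its context-switch count. */
void holdout_mark(struct holdout_task *t)
{
	t->nvcsw_snap = t->nvcsw;
	t->holdout = true;
}

/* The task blocks voluntarily, for example while waiting for a mutex. */
void holdout_voluntary_sleep(struct holdout_task *t)
{
	t->nvcsw++;
}

/* Return true once the task has passed through a quiescent state. */
bool holdout_check(struct holdout_task *t)
{
	if (t->holdout && t->nvcsw != t->nvcsw_snap)
		t->holdout = false;
	return !t->holdout;
}
```

In this model, a task that blocks on the mutex held across
synchronize_rcu_tasks() bumps its count, so the next holdout_check()
releases it and the grace period can end without waiting for do_exit()
to finish.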

> -void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
> +void exit_tasks_rcu_start(void)
>  {
> -	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
> +	unsigned long flags;
> +	struct rcu_tasks_percpu *rtpcp;
> +	struct task_struct *t = current;
> +
> +	WARN_ON_ONCE(!list_empty(&t->rcu_tasks_exit_list));
> +	get_task_struct(t);

Is this get_task_struct() necessary?

> +	preempt_disable();
> +	rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
> +	t->rcu_tasks_exit_cpu = smp_processor_id();
> +	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);

Do we really need smp_mb__after_unlock_lock() ?

> +	if (!rtpcp->rtp_exit_list.next)
> +		INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
> +	list_add(&t->rcu_tasks_exit_list, &rtpcp->rtp_exit_list);
> +	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
> +	preempt_enable();
>  }
>  
>  /*
> - * Contribute to protect against tasklist scan blind spot while the
> - * task is exiting and may be removed from the tasklist. See
> - * corresponding synchronize_srcu() for further details.
> + * Remove the task from the "yet another list" because do_exit() is now
> + * non-preemptible, allowing synchronize_rcu() to wait beyond this point.
>   */
> -void exit_tasks_rcu_stop(void) __releases(&tasks_rcu_exit_srcu)
> +void exit_tasks_rcu_stop(void)
>  {
> +	unsigned long flags;
> +	struct rcu_tasks_percpu *rtpcp;
>  	struct task_struct *t = current;
>  
> -	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
> +	WARN_ON_ONCE(list_empty(&t->rcu_tasks_exit_list));
> +	rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, t->rcu_tasks_exit_cpu);
> +	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
> +	list_del_init(&t->rcu_tasks_exit_list);
> +	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
> +	put_task_struct(t);

And conversely this put_task_struct()?

Thanks.

>  }
>  
>  /*
> -- 
> 2.43.0
>
  
Frederic Weisbecker Feb. 8, 2024, 1:52 a.m. UTC | #2
On Wed, Feb 07, 2024 at 11:53:13PM +0100, Frederic Weisbecker wrote:
> > +	preempt_disable();
> > +	rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
> > +	t->rcu_tasks_exit_cpu = smp_processor_id();
> > +	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
> 
> Do we really need smp_mb__after_unlock_lock() ?

Or maybe it orders the addition to rtpcp->rtp_exit_list against the
removal from the main tasklist? Such that:

synchronize_rcu_tasks()                       do_exit()
----------------------                        ---------
//for_each_process_thread()
READ tasklist                                 WRITE rtpcp->rtp_exit_list
LOCK rtpcp->lock                              UNLOCK rtpcp->lock
smp_mb__after_unlock_lock()                   WRITE tasklist //unhash_process()
READ rtpcp->rtp_exit_list

Does this work? Hmm, I'll play with litmus once I have a fresh brain...

Thanks.
  
Frederic Weisbecker Feb. 8, 2024, 2:10 a.m. UTC | #3
On Thu, Feb 08, 2024 at 02:52:10AM +0100, Frederic Weisbecker wrote:
> > > +	preempt_disable();
> > > +	rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
> > > +	t->rcu_tasks_exit_cpu = smp_processor_id();
> > > +	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
> > 
> > Do we really need smp_mb__after_unlock_lock() ?
> 
> Or maybe it orders add into rtpcp->rtp_exit_list VS
> main tasklist's removal? Such that:
> 
> synchronize_rcu_tasks()                       do_exit()
> ----------------------                        ---------
> //for_each_process_thread()
> READ tasklist                                 WRITE rtpcp->rtp_exit_list
> LOCK rtpcp->lock                              UNLOCK rtpcp->lock
> smp_mb__after_unlock_lock()                   WRITE tasklist //unhash_process()
> READ rtpcp->rtp_exit_list
> 
> Does this work? Hmm, I'll play with litmus once I have a fresh brain...

I.e., does smp_mb__after_unlock_lock() order only what precedes the UNLOCK
with the UNLOCK itself (in which case the UNLOCK itself can be reordered with
anything that follows)? Or does it also order what follows the UNLOCK with
the UNLOCK itself? If both, then it looks ok, otherwise...

Also on the other end, does LOCK/smp_mb__after_unlock_lock() order against what
precedes the LOCK? That also is necessary for the above to work.

Of course by the time I'm writing this email, litmus would have told me
already...

Thanks.
  
Paul E. McKenney Feb. 8, 2024, 9:56 a.m. UTC | #4
On Thu, Feb 08, 2024 at 03:10:32AM +0100, Frederic Weisbecker wrote:
> On Thu, Feb 08, 2024 at 02:52:10AM +0100, Frederic Weisbecker wrote:
> > > > +	get_task_struct(t);
> > > 
> > > Is this get_task_struct() necessary?

Now that you mention it, I think not!

Each task will remove itself from this list before going away, and if
we put it on the holdout list (where it might stay for longer), there
will be a get_task_struct() there.

Good catch, thank you!

> > > > +	preempt_disable();
> > > > +	rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
> > > > +	t->rcu_tasks_exit_cpu = smp_processor_id();
> > > > +	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
> > > 
> > > Do we really need smp_mb__after_unlock_lock() ?

Not on this acquisition we don't.  But each lock must be all one way
or the other.  Yes, extra overhead on PowerPC, but this is nowhere near
a fastpath.

The needed ordering is provided by simple locking.

> > Or maybe it orders add into rtpcp->rtp_exit_list VS
> > main tasklist's removal? Such that:

This ordering is not needed.  The lock orders addition to this
list against removal from tasklist.  If we hold this lock, either
the task is already on this list or our holding this lock prevents
it from removing itself from the tasklist.

We have already scanned the task list, and we have already done
whatever update we are worried about.

So, if the task was on the tasklist when we scanned, well and
good.  If the task was created after we scanned the tasklist,
then it cannot possibly access whatever we removed.

But please double-check!!!

> > synchronize_rcu_tasks()                       do_exit()
> > ----------------------                        ---------
> > //for_each_process_thread()
> > READ tasklist                                 WRITE rtpcp->rtp_exit_list
> > LOCK rtpcp->lock                              UNLOCK rtpcp->lock
> > smp_mb__after_unlock_lock()                   WRITE tasklist //unhash_process()
> > READ rtpcp->rtp_exit_list
> > 
> > Does this work? Hmm, I'll play with litmus once I have a fresh brain...

First, thank you very much for the review!!!

> ie: does smp_mb__after_unlock_lock() order only what precedes the UNLOCK with
> the UNLOCK itself? (but then the UNLOCK itself can be reordered with anything
> that follows)? Or does it also order what follows the UNLOCK with the UNLOCK
> itself? If both, then it looks ok, otherwise...

If you have this:

	earlier_accesses();
	spin_lock(...);
	ill_considered_memory_accesses();
	smp_mb__after_unlock_lock();
	later_accesses();

Then earlier_accesses() will be ordered against later_accesses(), but
ill_considered_memory_accesses() won't necessarily be ordered.  Also,
any accesses before any prior release of that same lock will be ordered
against later_accesses().

(In real life, ill_considered_memory_accesses() will be fully ordered
against either spin_lock() on the one hand or smp_mb__after_unlock_lock()
on the other, with x86 doing the first and PowerPC doing the second.
So please try to avoid any ill_considered_memory_accesses().)

> Also on the other end, does LOCK/smp_mb__after_unlock_lock() order against what
> precedes the LOCK? That also is necessary for the above to work.

It looks like an smp_mb__after_spinlock() would also be needed, for
example, on ARMv8.

> Of course by the time I'm writing this email, litmus would have told me
> already...

;-) ;-) ;-)

But I believe that simple locking covers this case.  Famous last words...

							Thanx, Paul
  
Frederic Weisbecker Feb. 8, 2024, 10:43 a.m. UTC | #5
On Thu, Feb 08, 2024 at 01:56:10AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 08, 2024 at 03:10:32AM +0100, Frederic Weisbecker wrote:
> This ordering is not needed.  The lock orders addition to this
> list against removal from tasklist.  If we hold this lock, either
> the task is already on this list or our holding this lock prevents
> it from removing itself from the tasklist.
> 
> We have already scanned the task list, and we have already done
> whatever update we are worried about.
> 
> So, if the task was on the tasklist when we scanned, well and
> good.  If the task was created after we scanned the tasklist,
> then it cannot possibly access whatever we removed.
> 
> But please double-check!!!

Heh, right, another new pattern for me to discover :-/

C r-LOCK

{
}

P0(spinlock_t *LOCK, int *X, int *Y)
{
	int r1;
	int r2;
	
	r1 = READ_ONCE(*X);

	spin_lock(LOCK);
	r2 = READ_ONCE(*Y);
	spin_unlock(LOCK);
}

P1(spinlock_t *LOCK, int *X, int *Y)
{
	spin_lock(LOCK);
	WRITE_ONCE(*Y, 1);
	spin_unlock(LOCK);
	WRITE_ONCE(*X, 1);
}

exists (0:r1=1 /\ 0:r2=0) (* never *)
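
For readers without herd7 at hand, the same shape can be approximated in
user space.  The sketch below assumes a pthread mutex in place of the
spinlock and relaxed C11 atomics in place of READ_ONCE()/WRITE_ONCE().
The name run_r_lock() and the iteration count are arbitrary choices.  A
stress run like this can only fail to observe the forbidden outcome; it
cannot prove it impossible the way the litmus test does.

```c
/* User-space analogue of the r-LOCK litmus test above; a stress run, not a proof. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int x, y;
static int r1, r2;

/* P0: read X outside the lock, then read Y inside the critical section. */
static void *p0(void *arg)
{
	(void)arg;
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	pthread_mutex_lock(&lock);
	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* P1: write Y inside the critical section, then write X after unlocking. */
static void *p1(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	pthread_mutex_unlock(&lock);
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	return NULL;
}

/*
 * Run the two-thread test iters times.  Returns the number of runs that
 * observed r1 == 1 && r2 == 0, or -1 if a thread could not be created.
 */
int run_r_lock(int iters)
{
	int forbidden = 0;

	for (int i = 0; i < iters; i++) {
		pthread_t t0, t1;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		if (pthread_create(&t0, NULL, p0, NULL) != 0)
			return -1;
		if (pthread_create(&t1, NULL, p1, NULL) != 0) {
			pthread_join(t0, NULL);
			return -1;
		}
		/* Joining both threads makes r1 and r2 safe to read here. */
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		if (r1 == 1 && r2 == 0)
			forbidden++;
	}
	return forbidden;
}
```

Calling run_r_lock(2000) is expected to return 0: if P0 sees X == 1, then
P1's critical section must have completed before P0's began, so P0's read
of Y under the lock must see 1.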


> 
> > > synchronize_rcu_tasks()                       do_exit()
> > > ----------------------                        ---------
> > > //for_each_process_thread()
> > > READ tasklist                                 WRITE rtpcp->rtp_exit_list
> > > LOCK rtpcp->lock                              UNLOCK rtpcp->lock
> > > smp_mb__after_unlock_lock()                   WRITE tasklist //unhash_process()
> > > READ rtpcp->rtp_exit_list
> > > 
> > > Does this work? Hmm, I'll play with litmus once I have a fresh brain...
> 
> First, thank you very much for the review!!!
> 
> > ie: does smp_mb__after_unlock_lock() order only what precedes the UNLOCK with
> > the UNLOCK itself? (but then the UNLOCK itself can be reordered with anything
> > that follows)? Or does it also order what follows the UNLOCK with the UNLOCK
> > itself? If both, then it looks ok, otherwise...
> 
> If you have this:
> 
> 	earlier_accesses();
> 	spin_lock(...);
> 	ill_considered_memory_accesses();
> 	smp_mb__after_unlock_lock();
> 	later_accesses();
> 
> Then earlier_accesses() will be ordered against later_accesses(), but
> ill_considered_memory_accesses() won't necessarily be ordered.  Also,
> any accesses before any prior release of that same lock will be ordered
> against later_accesses().
> 
> (In real life, ill_considered_memory_accesses() will be fully ordered
> against either spin_lock() on the one hand or smp_mb__after_unlock_lock()
> on the other, with x86 doing the first and PowerPC doing the second.
> So please try to avoid any ill_considered_memory_accesses().)

Thanks a lot for that explanation!


> 
> > Also on the other end, does LOCK/smp_mb__after_unlock_lock() order against what
> > precedes the LOCK? That also is necessary for the above to work.
> 
> It looks like an smp_mb__after_spinlock() would also be needed, for
> example, on ARMv8.
> 
> > Of course by the time I'm writing this email, litmus would have told me
> > already...
> 
> ;-) ;-) ;-)
> 
> But I believe that simple locking covers this case.  Famous last words...

Indeed, looks right!

Thanks!
> 							Thanx, Paul
  

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cdb8ea53c365..4f0e9274da2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -858,6 +858,8 @@  struct task_struct {
 	u8				rcu_tasks_idx;
 	int				rcu_tasks_idle_cpu;
 	struct list_head		rcu_tasks_holdout_list;
+	int				rcu_tasks_exit_cpu;
+	struct list_head		rcu_tasks_exit_list;
 #endif /* #ifdef CONFIG_TASKS_RCU */
 
 #ifdef CONFIG_TASKS_TRACE_RCU
diff --git a/init/init_task.c b/init/init_task.c
index 7ecb458eb3da..4daee6d761c8 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -147,6 +147,7 @@  struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.rcu_tasks_holdout = false,
 	.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
 	.rcu_tasks_idle_cpu = -1,
+	.rcu_tasks_exit_list = LIST_HEAD_INIT(init_task.rcu_tasks_exit_list),
 #endif
 #ifdef CONFIG_TASKS_TRACE_RCU
 	.trc_reader_nesting = 0,
diff --git a/kernel/fork.c b/kernel/fork.c
index 47ff3b35352e..3eb86f30e664 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1975,6 +1975,7 @@  static inline void rcu_copy_process(struct task_struct *p)
 	p->rcu_tasks_holdout = false;
 	INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
 	p->rcu_tasks_idle_cpu = -1;
+	INIT_LIST_HEAD(&p->rcu_tasks_exit_list);
 #endif /* #ifdef CONFIG_TASKS_RCU */
 #ifdef CONFIG_TASKS_TRACE_RCU
 	p->trc_reader_nesting = 0;
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 732ad5b39946..bd4a51fd5b1f 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -32,6 +32,7 @@  typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
  * @rtp_irq_work: IRQ work queue for deferred wakeups.
  * @barrier_q_head: RCU callback for barrier operation.
  * @rtp_blkd_tasks: List of tasks blocked as readers.
+ * @rtp_exit_list: List of tasks in the latter portion of do_exit().
  * @cpu: CPU number corresponding to this entry.
  * @rtpp: Pointer to the rcu_tasks structure.
  */
@@ -46,6 +47,7 @@  struct rcu_tasks_percpu {
 	struct irq_work rtp_irq_work;
 	struct rcu_head barrier_q_head;
 	struct list_head rtp_blkd_tasks;
+	struct list_head rtp_exit_list;
 	int cpu;
 	struct rcu_tasks *rtpp;
 };
@@ -144,8 +146,6 @@  static struct rcu_tasks rt_name =							\
 }
 
 #ifdef CONFIG_TASKS_RCU
-/* Track exiting tasks in order to allow them to be waited for. */
-DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 
 /* Report delay in synchronize_srcu() completion in rcu_tasks_postscan(). */
 static void tasks_rcu_exit_srcu_stall(struct timer_list *unused);
@@ -275,6 +275,8 @@  static void cblist_init_generic(struct rcu_tasks *rtp)
 		rtpcp->rtpp = rtp;
 		if (!rtpcp->rtp_blkd_tasks.next)
 			INIT_LIST_HEAD(&rtpcp->rtp_blkd_tasks);
+		if (!rtpcp->rtp_exit_list.next)
+			INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
 	}
 
 	pr_info("%s: Setting shift to %d and lim to %d rcu_task_cb_adjust=%d.\n", rtp->name,
@@ -851,10 +853,12 @@  static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 //	number of voluntary context switches, and add that task to the
 //	holdout list.
 // rcu_tasks_postscan():
-//	Invoke synchronize_srcu() to ensure that all tasks that were
-//	in the process of exiting (and which thus might not know to
-//	synchronize with this RCU Tasks grace period) have completed
-//	exiting.
+//	Gather per-CPU lists of tasks in do_exit() to ensure that all
+//	tasks that were in the process of exiting (and which thus might
+//	not know to synchronize with this RCU Tasks grace period) have
+//	completed exiting.  The synchronize_rcu() in rcu_tasks_postgp()
+//	will take care of any tasks stuck in the non-preemptible region
+//	of do_exit() following its call to exit_tasks_rcu_stop().
 // check_all_holdout_tasks(), repeatedly until holdout list is empty:
 //	Scans the holdout list, attempting to identify a quiescent state
 //	for each task on the list.  If there is a quiescent state, the
@@ -867,8 +871,10 @@  static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 //	with interrupts disabled.
 //
 // For each exiting task, the exit_tasks_rcu_start() and
-// exit_tasks_rcu_finish() functions begin and end, respectively, the SRCU
-// read-side critical sections waited for by rcu_tasks_postscan().
+// exit_tasks_rcu_finish() functions add and remove, respectively, the
+// current task to a per-CPU list of tasks that rcu_tasks_postscan() must
+// wait on.  This is necessary because rcu_tasks_postscan() must wait on
+// tasks that have already been removed from the global list of tasks.
 //
 // Pre-grace-period update-side code is ordered before the grace
 // via the raw_spin_lock.*rcu_node().  Pre-grace-period read-side code
@@ -932,9 +938,13 @@  static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
 	}
 }
 
+void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
+DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
+
 /* Processing between scanning taskslist and draining the holdout list. */
 static void rcu_tasks_postscan(struct list_head *hop)
 {
+	int cpu;
 	int rtsi = READ_ONCE(rcu_task_stall_info);
 
 	if (!IS_ENABLED(CONFIG_TINY_RCU)) {
@@ -948,9 +958,9 @@  static void rcu_tasks_postscan(struct list_head *hop)
 	 * this, divide the fragile exit path part in two intersecting
 	 * read side critical sections:
 	 *
-	 * 1) An _SRCU_ read side starting before calling exit_notify(),
-	 *    which may remove the task from the tasklist, and ending after
-	 *    the final preempt_disable() call in do_exit().
+	 * 1) A task_struct list addition before calling exit_notify(),
+	 *    which may remove the task from the tasklist, with the
+	 *    removal after the final preempt_disable() call in do_exit().
 	 *
 	 * 2) An _RCU_ read side starting with the final preempt_disable()
 	 *    call in do_exit() and ending with the final call to schedule()
@@ -959,7 +969,18 @@  static void rcu_tasks_postscan(struct list_head *hop)
 	 * This handles the part 1). And postgp will handle part 2) with a
 	 * call to synchronize_rcu().
 	 */
-	synchronize_srcu(&tasks_rcu_exit_srcu);
+
+	for_each_possible_cpu(cpu) {
+		unsigned long flags;
+		struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, cpu);
+		struct task_struct *t;
+
+		raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+		list_for_each_entry(t, &rtpcp->rtp_exit_list, rcu_tasks_exit_list)
+			if (list_empty(&t->rcu_tasks_holdout_list))
+				rcu_tasks_pertask(t, hop);
+		raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
+	}
 
 	if (!IS_ENABLED(CONFIG_TINY_RCU))
 		del_timer_sync(&tasks_rcu_exit_srcu_stall_timer);
@@ -1027,7 +1048,6 @@  static void rcu_tasks_postgp(struct rcu_tasks *rtp)
 	 *
 	 * In addition, this synchronize_rcu() waits for exiting tasks
 	 * to complete their final preempt_disable() region of execution,
-	 * cleaning up after synchronize_srcu(&tasks_rcu_exit_srcu),
 	 * enforcing the whole region before tasklist removal until
 	 * the final schedule() with TASK_DEAD state to be an RCU TASKS
 	 * read side critical section.
@@ -1035,9 +1055,6 @@  static void rcu_tasks_postgp(struct rcu_tasks *rtp)
 	synchronize_rcu();
 }
 
-void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func);
-DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks");
-
 static void tasks_rcu_exit_srcu_stall(struct timer_list *unused)
 {
 #ifndef CONFIG_TINY_RCU
@@ -1147,25 +1164,45 @@  struct task_struct *get_rcu_tasks_gp_kthread(void)
 EXPORT_SYMBOL_GPL(get_rcu_tasks_gp_kthread);
 
 /*
- * Contribute to protect against tasklist scan blind spot while the
- * task is exiting and may be removed from the tasklist. See
- * corresponding synchronize_srcu() for further details.
+ * Protect against tasklist scan blind spot while the task is exiting and
+ * may be removed from the tasklist.  Do this by adding the task to yet
+ * another list.
  */
-void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
+void exit_tasks_rcu_start(void)
 {
-	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
+	unsigned long flags;
+	struct rcu_tasks_percpu *rtpcp;
+	struct task_struct *t = current;
+
+	WARN_ON_ONCE(!list_empty(&t->rcu_tasks_exit_list));
+	get_task_struct(t);
+	preempt_disable();
+	rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
+	t->rcu_tasks_exit_cpu = smp_processor_id();
+	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+	if (!rtpcp->rtp_exit_list.next)
+		INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
+	list_add(&t->rcu_tasks_exit_list, &rtpcp->rtp_exit_list);
+	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
+	preempt_enable();
 }
 
 /*
- * Contribute to protect against tasklist scan blind spot while the
- * task is exiting and may be removed from the tasklist. See
- * corresponding synchronize_srcu() for further details.
+ * Remove the task from the "yet another list" because do_exit() is now
+ * non-preemptible, allowing synchronize_rcu() to wait beyond this point.
  */
-void exit_tasks_rcu_stop(void) __releases(&tasks_rcu_exit_srcu)
+void exit_tasks_rcu_stop(void)
 {
+	unsigned long flags;
+	struct rcu_tasks_percpu *rtpcp;
 	struct task_struct *t = current;
 
-	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
+	WARN_ON_ONCE(list_empty(&t->rcu_tasks_exit_list));
+	rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, t->rcu_tasks_exit_cpu);
+	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+	list_del_init(&t->rcu_tasks_exit_list);
+	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
+	put_task_struct(t);
 }
 
 /*