[V4,resend] lib/group_cpus.c: avoid to acquire cpu hotplug lock in group_cpus_evenly

Message ID 20231120083559.285174-1-ming.lei@redhat.com
State New
Headers
Series [V4,resend] lib/group_cpus.c: avoid to acquire cpu hotplug lock in group_cpus_evenly |

Commit Message

Ming Lei Nov. 20, 2023, 8:35 a.m. UTC
  group_cpus_evenly() could be part of storage driver's error handler,
such as nvme driver, when may happen during CPU hotplug, in which
storage queue has to drain its pending IOs because all CPUs associated
with the queue are offline and the queue is becoming inactive. And
handling IO needs error handler to provide forward progress.

Then dead lock is caused:

1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
handler is waiting for inflight IO

2) error handler is waiting for CPU hotplug lock

3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
error handling can't provide forward progress.

Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
in which two stage spreads are taken: 1) the 1st stage is over all present
CPUs; 2) the end stage is over all other CPUs.

Turns out the two stage spread just needs consistent 'cpu_present_mask', and
remove the CPU hotplug lock by storing it into one local cache. This way
doesn't change correctness, because all CPUs are still covered.

Cc: Keith Busch <kbusch@kernel.org>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Tested-by: Guangwu Zhang <guazhang@redhat.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 lib/group_cpus.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)
  

Comments

Andrew Morton Nov. 20, 2023, 8 p.m. UTC | #1
On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@redhat.com> wrote:

> group_cpus_evenly() could be part of storage driver's error handler,
> such as nvme driver, when may happen during CPU hotplug, in which
> storage queue has to drain its pending IOs because all CPUs associated
> with the queue are offline and the queue is becoming inactive. And
> handling IO needs error handler to provide forward progress.
> 
> Then dead lock is caused:
> 
> 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> handler is waiting for inflight IO
> 
> 2) error handler is waiting for CPU hotplug lock
> 
> 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> error handling can't provide forward progress.
> 
> Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
> in which two stage spreads are taken: 1) the 1st stage is over all present
> CPUs; 2) the end stage is over all other CPUs.
> 
> Turns out the two stage spread just needs consistent 'cpu_present_mask', and
> remove the CPU hotplug lock by storing it into one local cache. This way
> doesn't change correctness, because all CPUs are still covered.

I'm not sure what is the intended merge path for this, but I can do lib/.

Do you think that a -stable backport is needed?  It sounds that way.

If so, are we able to identify a suitable Fixes: target?  That would
predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into
lib/").
  
Andrew Morton Dec. 6, 2023, 11:12 p.m. UTC | #2
On Mon, 20 Nov 2023 12:00:59 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@redhat.com> wrote:
> 
> > group_cpus_evenly() could be part of storage driver's error handler,
> > such as nvme driver, when may happen during CPU hotplug, in which
> > storage queue has to drain its pending IOs because all CPUs associated
> > with the queue are offline and the queue is becoming inactive. And
> > handling IO needs error handler to provide forward progress.
> > 
> > Then dead lock is caused:
> > 
> > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> > handler is waiting for inflight IO
> > 
> > 2) error handler is waiting for CPU hotplug lock
> > 
> > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> > error handling can't provide forward progress.
> > 
> > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
> > in which two stage spreads are taken: 1) the 1st stage is over all present
> > CPUs; 2) the end stage is over all other CPUs.
> > 
> > Turns out the two stage spread just needs consistent 'cpu_present_mask', and
> > remove the CPU hotplug lock by storing it into one local cache. This way
> > doesn't change correctness, because all CPUs are still covered.
> 
> I'm not sure what is the intended merge path for this, but I can do lib/.
> 
> Do you think that a -stable backport is needed?  It sounds that way.
> 
> If so, are we able to identify a suitable Fixes: target?  That would
> predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into
> lib/").  

No?  I think this predates 428e211641ed8 ("genirq/affinity: Replace
deprecated CPU-hotplug functions." also.

I'll slap a cc:stable on it and I'll let you and the -stable
maintainers figure it out.
  
Yury Norov Dec. 7, 2023, 12:41 a.m. UTC | #3
Hi Ming,

On Mon, Nov 20, 2023 at 04:35:59PM +0800, Ming Lei wrote:
> group_cpus_evenly() could be part of storage driver's error handler,
> such as nvme driver, when may happen during CPU hotplug, in which
> storage queue has to drain its pending IOs because all CPUs associated
> with the queue are offline and the queue is becoming inactive. And
> handling IO needs error handler to provide forward progress.
> 
> Then dead lock is caused:
> 
> 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> handler is waiting for inflight IO
> 
> 2) error handler is waiting for CPU hotplug lock
> 
> 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> error handling can't provide forward progress.
> 
> Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
> in which two stage spreads are taken: 1) the 1st stage is over all present
> CPUs; 2) the end stage is over all other CPUs.
> 
> Turns out the two stage spread just needs consistent 'cpu_present_mask', and
> remove the CPU hotplug lock by storing it into one local cache. This way
> doesn't change correctness, because all CPUs are still covered.
> 
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: linux-nvme@lists.infradead.org
> Cc: linux-block@vger.kernel.org
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Reported-by: Guangwu Zhang <guazhang@redhat.com>
> Tested-by: Guangwu Zhang <guazhang@redhat.com>
> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  lib/group_cpus.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index aa3f6815bb12..ee272c4cefcc 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -366,13 +366,25 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  	if (!masks)
>  		goto fail_node_to_cpumask;
>  
> -	/* Stabilize the cpumasks */
> -	cpus_read_lock();
>  	build_node_to_cpumask(node_to_cpumask);
>  
> +	/*
> +	 * Make a local cache of 'cpu_present_mask', so the two stages
> +	 * spread can observe consistent 'cpu_present_mask' without holding
> +	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
> +	 * hotplug code.
> +	 *
> +	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
> +	 * we can live with the case because it only affects that hotplug
> +	 * CPU is handled in the 1st or 2nd stage, and either way is correct
> +	 * from API user viewpoint since 2-stage spread is sort of
> +	 * optimization.
> +	 */
> +	cpumask_copy(npresmsk, data_race(cpu_present_mask));

Now that you initialize the npresmsk explicitly, you can allocate it
using alloc_cpumask_var().

The same actually holds for nmsk too, and even before this patch. Maybe
fix it in a separate prepending patch?

> +
>  	/* grouping present CPUs first */
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> -				  cpu_present_mask, nmsk, masks);
> +				  npresmsk, nmsk, masks);
>  	if (ret < 0)
>  		goto fail_build_affinity;
>  	nr_present = ret;
> @@ -387,15 +399,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  		curgrp = 0;
>  	else
>  		curgrp = nr_present;
> -	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
> +	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
>  				  npresmsk, nmsk, masks);

The first thing the helper does is checking if nprepmask is empty.
cpumask_andnot() returns false in that case. So, assuming that present
cpumask in the previous call can't be empty, we can save few cycles if
drop corresponding check in the helper and do like this:
        
	if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk) == 0) {
                nr_others = 0;
                goto fail_build_affinity;
        }

  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
  				  npresmsk, nmsk, masks);

Although, it's not related to this patch directly. So, if you fix
zalloc_cpumask_var(), the patch looks good to me.

Reviewed-by: Yury Norov <yury.norov@gmail.com>

>  	if (ret >= 0)
>  		nr_others = ret;
>  
>   fail_build_affinity:
> -	cpus_read_unlock();
> -
>  	if (ret >= 0)
>  		WARN_ON(nr_present + nr_others < numgrps);
>  
> -- 
> 2.41.0
>
  
Andrew Morton Dec. 7, 2023, 12:54 a.m. UTC | #4
On Wed, 6 Dec 2023 16:41:44 -0800 Yury Norov <yury.norov@gmail.com> wrote:

> Although, it's not related to this patch directly. So, if you fix
> zalloc_cpumask_var(), the patch looks good to me.
> 
> Reviewed-by: Yury Norov <yury.norov@gmail.com>

Thanks.

I just moved this into mm.git's non-rebasing mm-hotfixes-stable branch,
so I'd prefer not to have to redo it at this stage.

Let's do these things (with which I agree) as a followup patch, please.
  
Yury Norov Dec. 7, 2023, 1:01 a.m. UTC | #5
On Wed, Dec 6, 2023 at 4:54 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 6 Dec 2023 16:41:44 -0800 Yury Norov <yury.norov@gmail.com> wrote:
>
> > Although, it's not related to this patch directly. So, if you fix
> > zalloc_cpumask_var(), the patch looks good to me.
> >
> > Reviewed-by: Yury Norov <yury.norov@gmail.com>
>
> Thanks.
>
> I just moved this into mm.git's non-rebasing mm-hotfixes-stable branch,
> so I'd prefer not to have to redo it at this stage.
>
> Let's do these things (with which I agree) as a followup patch, please.

Sure, I can do that, and append to my series.
  
Ming Lei Dec. 7, 2023, 1:11 a.m. UTC | #6
On Wed, Dec 06, 2023 at 04:41:44PM -0800, Yury Norov wrote:
> Hi Ming,
> 
> On Mon, Nov 20, 2023 at 04:35:59PM +0800, Ming Lei wrote:
> > group_cpus_evenly() could be part of storage driver's error handler,
> > such as nvme driver, when may happen during CPU hotplug, in which
> > storage queue has to drain its pending IOs because all CPUs associated
> > with the queue are offline and the queue is becoming inactive. And
> > handling IO needs error handler to provide forward progress.
> > 
> > Then dead lock is caused:
> > 
> > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> > handler is waiting for inflight IO
> > 
> > 2) error handler is waiting for CPU hotplug lock
> > 
> > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> > error handling can't provide forward progress.
> > 
> > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
> > in which two stage spreads are taken: 1) the 1st stage is over all present
> > CPUs; 2) the end stage is over all other CPUs.
> > 
> > Turns out the two stage spread just needs consistent 'cpu_present_mask', and
> > remove the CPU hotplug lock by storing it into one local cache. This way
> > doesn't change correctness, because all CPUs are still covered.
> > 
> > Cc: Keith Busch <kbusch@kernel.org>
> > Cc: linux-nvme@lists.infradead.org
> > Cc: linux-block@vger.kernel.org
> > Reported-by: Yi Zhang <yi.zhang@redhat.com>
> > Reported-by: Guangwu Zhang <guazhang@redhat.com>
> > Tested-by: Guangwu Zhang <guazhang@redhat.com>
> > Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
> > Reviewed-by: Jens Axboe <axboe@kernel.dk>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  lib/group_cpus.c | 22 ++++++++++++++++------
> >  1 file changed, 16 insertions(+), 6 deletions(-)
> > 
> > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > index aa3f6815bb12..ee272c4cefcc 100644
> > --- a/lib/group_cpus.c
> > +++ b/lib/group_cpus.c
> > @@ -366,13 +366,25 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> >  	if (!masks)
> >  		goto fail_node_to_cpumask;
> >  
> > -	/* Stabilize the cpumasks */
> > -	cpus_read_lock();
> >  	build_node_to_cpumask(node_to_cpumask);
> >  
> > +	/*
> > +	 * Make a local cache of 'cpu_present_mask', so the two stages
> > +	 * spread can observe consistent 'cpu_present_mask' without holding
> > +	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
> > +	 * hotplug code.
> > +	 *
> > +	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
> > +	 * we can live with the case because it only affects that hotplug
> > +	 * CPU is handled in the 1st or 2nd stage, and either way is correct
> > +	 * from API user viewpoint since 2-stage spread is sort of
> > +	 * optimization.
> > +	 */
> > +	cpumask_copy(npresmsk, data_race(cpu_present_mask));
> 
> Now that you initialize the npresmsk explicitly, you can allocate it
> using alloc_cpumask_var().

Indeed, but this way is actually before this patch, and not related with
this fix.

> 
> The same actually holds for nmsk too, and even before this patch. Maybe
> fix it in a separate prepending patch?

Yeah, 'nmsk' is similar with 'npresmsk', and it is not fix, just one
optimization.

group_cpus_evenly() is only run in slow path, so this kind of
micro-optimization is not urgent and should be done in standalone
patch, and even we can live with it.

> 
> > +
> >  	/* grouping present CPUs first */
> >  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> > -				  cpu_present_mask, nmsk, masks);
> > +				  npresmsk, nmsk, masks);
> >  	if (ret < 0)
> >  		goto fail_build_affinity;
> >  	nr_present = ret;
> > @@ -387,15 +399,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> >  		curgrp = 0;
> >  	else
> >  		curgrp = nr_present;
> > -	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
> > +	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
> >  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> >  				  npresmsk, nmsk, masks);
> 
> The first thing the helper does is checking if nprepmask is empty.
> cpumask_andnot() returns false in that case. So, assuming that present
> cpumask in the previous call can't be empty, we can save few cycles if
> drop corresponding check in the helper and do like this:
>         
> 	if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk) == 0) {
>                 nr_others = 0;
>                 goto fail_build_affinity;
>         }
> 
>   	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
>   				  npresmsk, nmsk, masks);
> 
> Although, it's not related to this patch directly. So, if you fix
> zalloc_cpumask_var(), the patch looks good to me.

I'd rather not make things complicated, as mentioned this API is only
run in slow path.

> 
> Reviewed-by: Yury Norov <yury.norov@gmail.com>

Thanks for the review!


Thanks, 
Ming
  
Ming Lei Dec. 9, 2023, 2:27 a.m. UTC | #7
On Wed, Dec 06, 2023 at 03:12:46PM -0800, Andrew Morton wrote:
> On Mon, 20 Nov 2023 12:00:59 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > group_cpus_evenly() could be part of storage driver's error handler,
> > > such as nvme driver, when may happen during CPU hotplug, in which
> > > storage queue has to drain its pending IOs because all CPUs associated
> > > with the queue are offline and the queue is becoming inactive. And
> > > handling IO needs error handler to provide forward progress.
> > > 
> > > Then dead lock is caused:
> > > 
> > > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> > > handler is waiting for inflight IO
> > > 
> > > 2) error handler is waiting for CPU hotplug lock
> > > 
> > > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> > > error handling can't provide forward progress.
> > > 
> > > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
> > > in which two stage spreads are taken: 1) the 1st stage is over all present
> > > CPUs; 2) the end stage is over all other CPUs.
> > > 
> > > Turns out the two stage spread just needs consistent 'cpu_present_mask', and
> > > remove the CPU hotplug lock by storing it into one local cache. This way
> > > doesn't change correctness, because all CPUs are still covered.
> > 
> > I'm not sure what is the intended merge path for this, but I can do lib/.
> > 
> > Do you think that a -stable backport is needed?  It sounds that way.
> > 
> > If so, are we able to identify a suitable Fixes: target?  That would
> > predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into
> > lib/").  
> 
> No?  I think this predates 428e211641ed8 ("genirq/affinity: Replace
> deprecated CPU-hotplug functions." also.
> 
> I'll slap a cc:stable on it and I'll let you and the -stable
> maintainers figure it out.

The issue should be introduced since 3ee0ce2a54df ("genirq/affinity: Use get/put_online_cpus
around cpumask operations") in v4.8, but the logic has been changed a lot, so
may take some effort to backport to longterm stables.

The issue is reported from RH QA test, in which both cpu hotplug and
nvme error recovering are triggered at same time, and easy to duplicate
in QE lab, but may be hard to trigger in production environment.

Thanks, 
Ming
  

Patch

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index aa3f6815bb12..ee272c4cefcc 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -366,13 +366,25 @@  struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	if (!masks)
 		goto fail_node_to_cpumask;
 
-	/* Stabilize the cpumasks */
-	cpus_read_lock();
 	build_node_to_cpumask(node_to_cpumask);
 
+	/*
+	 * Make a local cache of 'cpu_present_mask', so the two stages
+	 * spread can observe consistent 'cpu_present_mask' without holding
+	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
+	 * hotplug code.
+	 *
+	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
+	 * we can live with the case because it only affects that hotplug
+	 * CPU is handled in the 1st or 2nd stage, and either way is correct
+	 * from API user viewpoint since 2-stage spread is sort of
+	 * optimization.
+	 */
+	cpumask_copy(npresmsk, data_race(cpu_present_mask));
+
 	/* grouping present CPUs first */
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  cpu_present_mask, nmsk, masks);
+				  npresmsk, nmsk, masks);
 	if (ret < 0)
 		goto fail_build_affinity;
 	nr_present = ret;
@@ -387,15 +399,13 @@  struct cpumask *group_cpus_evenly(unsigned int numgrps)
 		curgrp = 0;
 	else
 		curgrp = nr_present;
-	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
+	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
 				  npresmsk, nmsk, masks);
 	if (ret >= 0)
 		nr_others = ret;
 
  fail_build_affinity:
-	cpus_read_unlock();
-
 	if (ret >= 0)
 		WARN_ON(nr_present + nr_others < numgrps);