[V4,resend] lib/group_cpus.c: avoid to acquire cpu hotplug lock in group_cpus_evenly
Message ID | 20231120083559.285174-1-ming.lei@redhat.com
---|---
State | New
From | Ming Lei <ming.lei@redhat.com>
To | Thomas Gleixner <tglx@linutronix.de>, Andrew Morton <akpm@linux-foundation.org>
Cc | linux-kernel@vger.kernel.org, Keith Busch <kbusch@kernel.org>, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, Yi Zhang <yi.zhang@redhat.com>, Guangwu Zhang <guazhang@redhat.com>, Chengming Zhou <zhouchengming@bytedance.com>, Jens Axboe <axboe@kernel.dk>
Date | Mon, 20 Nov 2023 16:35:59 +0800
Series | [V4,resend] lib/group_cpus.c: avoid to acquire cpu hotplug lock in group_cpus_evenly
Commit Message
Ming Lei
Nov. 20, 2023, 8:35 a.m. UTC
group_cpus_evenly() could be called from a storage driver's error handler, such as the nvme driver's, which may run during CPU hotplug, when a storage queue has to drain its pending IOs because all CPUs associated with the queue are offline and the queue is becoming inactive. Handling that IO needs the error handler to provide forward progress.

A deadlock is then caused:

1) inside the CPU hotplug handler, the CPU hotplug lock is held, and blk-mq's hotplug handler is waiting for inflight IO

2) the error handler is waiting for the CPU hotplug lock

3) the inflight IO can't be completed in blk-mq's CPU hotplug handler because error handling can't provide forward progress.

Solve the deadlock by not holding the CPU hotplug lock in group_cpus_evenly(), which performs a two-stage spread: 1) the 1st stage is over all present CPUs; 2) the 2nd stage is over all other CPUs.

It turns out the two-stage spread only needs a consistent 'cpu_present_mask', so drop the CPU hotplug lock and store the mask in a local cache instead. This doesn't change correctness, because all CPUs are still covered.

Cc: Keith Busch <kbusch@kernel.org>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Tested-by: Guangwu Zhang <guazhang@redhat.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 lib/group_cpus.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)
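Why the one-time snapshot is enough: whatever value of cpu_present_mask the copy happens to observe, the first stage covers exactly the CPUs in the snapshot and the second stage covers the remaining possible CPUs, so the two stages always partition cpu_possible_mask. Below is a minimal userspace C analogy of that argument (not kernel code; the bitmask widths and names are purely illustrative):

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Model CPU masks as plain 64-bit words, one bit per CPU. */
	static void two_stage_spread(uint64_t possible, uint64_t present_snapshot)
	{
		uint64_t stage1 = possible & present_snapshot;   /* "present" CPUs  */
		uint64_t stage2 = possible & ~present_snapshot;  /* everything else */

		/* The stages never overlap and together cover all possible CPUs,
		 * no matter which snapshot value was captured. */
		assert((stage1 & stage2) == 0);
		assert((stage1 | stage2) == possible);

		printf("stage1=%#llx stage2=%#llx\n",
		       (unsigned long long)stage1, (unsigned long long)stage2);
	}

	int main(void)
	{
		uint64_t possible = 0xffff;        /* CPUs 0-15 are possible       */
		uint64_t before_hotplug = 0x00ff;  /* snapshot taken before event  */
		uint64_t after_hotplug  = 0x0fff;  /* snapshot taken after event   */

		/* Either snapshot yields a correct (if different) grouping. */
		two_stage_spread(possible, before_hotplug);
		two_stage_spread(possible, after_hotplug);
		return 0;
	}

A racing hotplug event can therefore only move a CPU between the two stages, which matches the comment added by the patch: the 2-stage spread is an optimization, so either placement is acceptable to API users.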
Comments
On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@redhat.com> wrote:

> group_cpus_evenly() could be part of storage driver's error handler,
> such as nvme driver, when may happen during CPU hotplug, in which
> storage queue has to drain its pending IOs because all CPUs associated
> with the queue are offline and the queue is becoming inactive. And
> handling IO needs error handler to provide forward progress.
>
> Then dead lock is caused:
>
> 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> handler is waiting for inflight IO
>
> 2) error handler is waiting for CPU hotplug lock
>
> 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> error handling can't provide forward progress.
>
> Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
> in which two stage spreads are taken: 1) the 1st stage is over all present
> CPUs; 2) the end stage is over all other CPUs.
>
> Turns out the two stage spread just needs consistent 'cpu_present_mask', and
> remove the CPU hotplug lock by storing it into one local cache. This way
> doesn't change correctness, because all CPUs are still covered.

I'm not sure what is the intended merge path for this, but I can do lib/.

Do you think that a -stable backport is needed? It sounds that way.

If so, are we able to identify a suitable Fixes: target? That would
predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into
lib/").
On Mon, 20 Nov 2023 12:00:59 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@redhat.com> wrote:
>
> > [...]
>
> I'm not sure what is the intended merge path for this, but I can do lib/.
>
> Do you think that a -stable backport is needed? It sounds that way.
>
> If so, are we able to identify a suitable Fixes: target? That would
> predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into
> lib/").

No? I think this predates 428e211641ed8 ("genirq/affinity: Replace
deprecated CPU-hotplug functions.") also.

I'll slap a cc:stable on it and I'll let you and the -stable
maintainers figure it out.
Hi Ming,

On Mon, Nov 20, 2023 at 04:35:59PM +0800, Ming Lei wrote:
> [...]
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index aa3f6815bb12..ee272c4cefcc 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -366,13 +366,25 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  	if (!masks)
>  		goto fail_node_to_cpumask;
>  
> -	/* Stabilize the cpumasks */
> -	cpus_read_lock();
>  	build_node_to_cpumask(node_to_cpumask);
>  
> +	/*
> +	 * Make a local cache of 'cpu_present_mask', so the two stages
> +	 * spread can observe consistent 'cpu_present_mask' without holding
> +	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
> +	 * hotplug code.
> +	 *
> +	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
> +	 * we can live with the case because it only affects that hotplug
> +	 * CPU is handled in the 1st or 2nd stage, and either way is correct
> +	 * from API user viewpoint since 2-stage spread is sort of
> +	 * optimization.
> +	 */
> +	cpumask_copy(npresmsk, data_race(cpu_present_mask));

Now that you initialize the npresmsk explicitly, you can allocate it
using alloc_cpumask_var().

The same actually holds for nmsk too, and even before this patch. Maybe
fix it in a separate prepending patch?

> +
>  	/* grouping present CPUs first */
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> -				  cpu_present_mask, nmsk, masks);
> +				  npresmsk, nmsk, masks);
>  	if (ret < 0)
>  		goto fail_build_affinity;
>  	nr_present = ret;
> @@ -387,15 +399,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  		curgrp = 0;
>  	else
>  		curgrp = nr_present;
> -	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
> +	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
>  				  npresmsk, nmsk, masks);

The first thing the helper does is checking if npresmsk is empty.
cpumask_andnot() returns false in that case. So, assuming that the present
cpumask in the previous call can't be empty, we can save a few cycles if we
drop the corresponding check in the helper and do like this:

	if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk) == 0) {
		nr_others = 0;
		goto fail_build_affinity;
	}

	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
				  npresmsk, nmsk, masks);

Although, it's not related to this patch directly. So, if you fix
zalloc_cpumask_var(), the patch looks good to me.

Reviewed-by: Yury Norov <yury.norov@gmail.com>

>  	if (ret >= 0)
>  		nr_others = ret;
>  
>  fail_build_affinity:
> -	cpus_read_unlock();
> -
>  	if (ret >= 0)
>  		WARN_ON(nr_present + nr_others < numgrps);
>
> --
> 2.41.0
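For reference, a rough sketch of what the follow-up Yury is suggesting could look like inside group_cpus_evenly() (illustrative only, untested, and not part of this patch; it just combines the alloc_cpumask_var() point with using cpumask_andnot()'s boolean return value):

	/*
	 * npresmsk is fully overwritten before it is ever read, so plain
	 * alloc_cpumask_var() would be enough instead of zalloc_cpumask_var().
	 */
	cpumask_copy(npresmsk, data_race(cpu_present_mask));

	/* grouping present CPUs first */
	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
				  npresmsk, nmsk, masks);
	if (ret < 0)
		goto fail_build_affinity;
	nr_present = ret;

	/* ... curgrp selection exactly as in the patch ... */

	/*
	 * cpumask_andnot() returns false when the result is empty, i.e.
	 * every possible CPU was already handled in the first stage, so
	 * the second spread can be skipped without calling the helper.
	 */
	if (!cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk)) {
		nr_others = 0;
		goto fail_build_affinity;
	}

	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
				  npresmsk, nmsk, masks);
	if (ret >= 0)
		nr_others = ret;

As Ming notes further down the thread, group_cpus_evenly() only runs in a slow path, so this stays a micro-optimization for a separate follow-up rather than part of this fix.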
On Wed, 6 Dec 2023 16:41:44 -0800 Yury Norov <yury.norov@gmail.com> wrote:

> Although, it's not related to this patch directly. So, if you fix
> zalloc_cpumask_var(), the patch looks good to me.
>
> Reviewed-by: Yury Norov <yury.norov@gmail.com>

Thanks.

I just moved this into mm.git's non-rebasing mm-hotfixes-stable branch,
so I'd prefer not to have to redo it at this stage.

Let's do these things (with which I agree) as a followup patch, please.
On Wed, Dec 6, 2023 at 4:54 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 6 Dec 2023 16:41:44 -0800 Yury Norov <yury.norov@gmail.com> wrote:
>
> > Although, it's not related to this patch directly. So, if you fix
> > zalloc_cpumask_var(), the patch looks good to me.
> >
> > Reviewed-by: Yury Norov <yury.norov@gmail.com>
>
> Thanks.
>
> I just moved this into mm.git's non-rebasing mm-hotfixes-stable branch,
> so I'd prefer not to have to redo it at this stage.
>
> Let's do these things (with which I agree) as a followup patch, please.

Sure, I can do that, and append to my series.
On Wed, Dec 06, 2023 at 04:41:44PM -0800, Yury Norov wrote:
> Hi Ming,
>
> On Mon, Nov 20, 2023 at 04:35:59PM +0800, Ming Lei wrote:
> > [...]
> > +	cpumask_copy(npresmsk, data_race(cpu_present_mask));
>
> Now that you initialize the npresmsk explicitly, you can allocate it
> using alloc_cpumask_var().

Indeed, but that is how it already was before this patch, and it isn't
related to this fix.

> The same actually holds for nmsk too, and even before this patch. Maybe
> fix it in a separate prepending patch?

Yeah, 'nmsk' is similar to 'npresmsk', and it is not a fix, just an
optimization. group_cpus_evenly() only runs in a slow path, so this kind of
micro-optimization is not urgent and should be done in a standalone patch;
we could even live without it.

> > -	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
> > +	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
> >  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> >  				  npresmsk, nmsk, masks);
>
> The first thing the helper does is checking if npresmsk is empty.
> cpumask_andnot() returns false in that case. So, assuming that the present
> cpumask in the previous call can't be empty, we can save a few cycles if we
> drop the corresponding check in the helper and do like this:
>
> 	if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk) == 0) {
> 		nr_others = 0;
> 		goto fail_build_affinity;
> 	}
>
> 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> 				  npresmsk, nmsk, masks);
>
> Although, it's not related to this patch directly. So, if you fix
> zalloc_cpumask_var(), the patch looks good to me.

I'd rather not make things complicated; as mentioned, this API only runs in
a slow path.

> Reviewed-by: Yury Norov <yury.norov@gmail.com>

Thanks for the review!

Thanks,
Ming
On Wed, Dec 06, 2023 at 03:12:46PM -0800, Andrew Morton wrote:
> On Mon, 20 Nov 2023 12:00:59 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei <ming.lei@redhat.com> wrote:
> >
> > > [...]
> >
> > I'm not sure what is the intended merge path for this, but I can do lib/.
> >
> > Do you think that a -stable backport is needed? It sounds that way.
> >
> > If so, are we able to identify a suitable Fixes: target? That would
> > predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into
> > lib/").
>
> No? I think this predates 428e211641ed8 ("genirq/affinity: Replace
> deprecated CPU-hotplug functions.") also.
>
> I'll slap a cc:stable on it and I'll let you and the -stable
> maintainers figure it out.

The issue should have been introduced by 3ee0ce2a54df ("genirq/affinity: Use
get/put_online_cpus around cpumask operations") in v4.8, but the logic has
changed a lot since then, so it may take some effort to backport to the
longterm stable kernels.

The issue was reported from RH QA testing, in which both CPU hotplug and nvme
error recovery are triggered at the same time. It is easy to reproduce in the
QE lab, but may be hard to trigger in production environments.

Thanks,
Ming
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index aa3f6815bb12..ee272c4cefcc 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -366,13 +366,25 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	if (!masks)
 		goto fail_node_to_cpumask;
 
-	/* Stabilize the cpumasks */
-	cpus_read_lock();
 	build_node_to_cpumask(node_to_cpumask);
 
+	/*
+	 * Make a local cache of 'cpu_present_mask', so the two stages
+	 * spread can observe consistent 'cpu_present_mask' without holding
+	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
+	 * hotplug code.
+	 *
+	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
+	 * we can live with the case because it only affects that hotplug
+	 * CPU is handled in the 1st or 2nd stage, and either way is correct
+	 * from API user viewpoint since 2-stage spread is sort of
+	 * optimization.
+	 */
+	cpumask_copy(npresmsk, data_race(cpu_present_mask));
+
 	/* grouping present CPUs first */
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  cpu_present_mask, nmsk, masks);
+				  npresmsk, nmsk, masks);
 	if (ret < 0)
 		goto fail_build_affinity;
 	nr_present = ret;
@@ -387,15 +399,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 		curgrp = 0;
 	else
 		curgrp = nr_present;
-	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
+	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
 				  npresmsk, nmsk, masks);
 	if (ret >= 0)
 		nr_others = ret;
 
 fail_build_affinity:
-	cpus_read_unlock();
-
 	if (ret >= 0)
 		WARN_ON(nr_present + nr_others < numgrps);
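For context on how the function is consumed, here is an illustrative caller sketch (not part of the patch; the helper name assign_queues_to_cpus() is hypothetical). group_cpus_evenly() returns a kcalloc()'d array of numgrps cpumasks whose union covers all possible CPUs, and after this change the caller no longer relies on the CPU hotplug lock being taken inside it:

	#include <linux/cpumask.h>
	#include <linux/group_cpus.h>
	#include <linux/printk.h>
	#include <linux/slab.h>

	static int assign_queues_to_cpus(unsigned int nr_queues)
	{
		struct cpumask *groups;
		unsigned int i;

		/* One evenly-spread CPU group per queue. */
		groups = group_cpus_evenly(nr_queues);
		if (!groups)
			return -ENOMEM;

		for (i = 0; i < nr_queues; i++) {
			/* e.g. copy groups[i] into the queue's affinity mask */
			pr_debug("queue %u -> CPUs %*pbl\n",
				 i, cpumask_pr_args(&groups[i]));
		}

		/* The array is owned by the caller and must be freed. */
		kfree(groups);
		return 0;
	}

This mirrors how kernel/irq/affinity.c consumes the returned groups when building IRQ affinity masks.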