Message ID | 20231025025737.358756-1-ming.lei@redhat.com |
State | New |
Series | [V3] blk-mq: don't schedule block kworker on isolated CPUs |
Commit Message
Ming Lei
Oct. 25, 2023, 2:57 a.m. UTC
The kernel parameters `isolcpus=` and `nohz_full=` are used to isolate
CPUs for specific tasks. Users often do not want block IO to disturb
these CPUs, and long IO latency may be caused if a blk-mq kworker is
scheduled on one of these isolated CPUs.
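
For illustration, such isolation might be configured on the kernel
command line as below (the CPU range is made up for this example):

    # keep CPUs 2-5 out of the scheduler domains and run them tickless
    isolcpus=2-5 nohz_full=2-5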
The kernel workqueue subsystem only respects this isolation for
WQ_UNBOUND workqueues; for a bound workqueue, the responsibility falls
on the workqueue user.
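
To illustrate that split of responsibility, a minimal sketch (not part
of this patch):

	/*
	 * WQ_UNBOUND: the workqueue core picks a housekeeping CPU for
	 * the work item, so CPU isolation is honored automatically.
	 */
	queue_work(system_unbound_wq, &work);

	/*
	 * Bound workqueue: the caller names the CPU, so the caller has
	 * to steer clear of isolated CPUs itself.
	 */
	if (!cpu_is_isolated(cpu))
		queue_work_on(cpu, system_wq, &work);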
So don't run the block kworker on isolated CPUs, by ruling isolated
CPUs out of hctx->cpumask. Meanwhile, in the cpuhp handler, use the
queue map to check if all CPUs mapped to this hw queue are offline;
this avoids any cost in the fast IO code path.
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Andrew Theurer <atheurer@redhat.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Sebastian Jug <sejug@redhat.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V3:
- avoid checking an invalid CPU, as reported by Bart
- take the current CPU (to be offlined, but not offline yet) into account
- simplify blk_mq_hctx_has_online_cpu()

V2:
- remove the module parameter; meantime use the queue map to check if
  all CPUs in one hctx are offline
block/blk-mq.c | 51 ++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 41 insertions(+), 10 deletions(-)
Comments
On Wed, Oct 25, 2023 at 10:57:37AM +0800, Ming Lei wrote:
> Kernel parameter of `isolcpus=` or 'nohz_full=' are used for isolating CPUs
> for specific task, and user often won't want block IO to disturb these CPUs,
> also long IO latency may be caused if blk-mq kworker is scheduled on these
> isolated CPUs.
[...]

Hello Jens and Guys,

Ping...

thanks,
Ming
Tested-by: Joe Mario <jmario@redhat.com>

Running fio on an nvme, I verified that block kworker threads no longer
ran on isolated cpus. I used:

# timeout -s INT 60 bpftrace -e 'kprobe:nvme_queue_rq{ @["CPU=",cpu] = count(); }'

The kernel with this V3 patch showed no hits on isolated cpus.
An unpatched kernel showed 40587 hits on isolated cpus.

Joe

On 10/24/23 10:57 PM, Ming Lei wrote:
[...]
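
For reference, an fio run of the sort described above might look like
this (device path and job parameters are illustrative, not Joe's exact
command):

    fio --name=isol-test --filename=/dev/nvme0n1 --direct=1 \
        --ioengine=io_uring --rw=randread --iodepth=32 \
        --runtime=60 --time_based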
On Wed, 2023-10-25 at 10:57 +0800, Ming Lei wrote:
> Kernel parameter of `isolcpus=` or 'nohz_full=' are used for isolating CPUs
> for specific task, and user often won't want block IO to disturb these CPUs,

Suggest breaking up this long sentence to make reading easier.

	for specific tasks. Users do not want block I/O operations to
	disturb these CPUS,

> also long IO latency may be caused if blk-mq kworker is scheduled on these

	as long I/O latency could delay intended tasks if blk-mq kworker
	is scheduled on these

> isolated CPUs.
>
> Kernel workqueue only respects this limit for WQ_UNBOUND, for bound wq,
> the responsibility should be on wq user.
>
> So don't not run block kworker on isolated CPUs by ruling out isolated CPUs

	So don't run block kworker on isolated CPUs by removing isolated
	CPUs

> from hctx->cpumask. Meantime in cpuhp handler, use queue map to check if
> all CPUs in this hw queue are offline, this way can avoid any cost in fast
> IO code path.
[...]
> +		/*
> +		 * rule out isolated CPUs from hctx->cpumask for avoiding to

s/for avoiding to run/to avoid running/

> +		 * run wq worker on isolated CPU
> +		 */
> +		for_each_cpu(cpu, hctx->cpumask) {
> +			if (cpu_is_isolated(cpu))
> +				cpumask_clear_cpu(cpu, hctx->cpumask);
> +		}

Thanks.

Tim
On Wed, Dec 13, 2023 at 04:43:26PM -0500, Joe Mario wrote:
> Tested-by: Joe Mario <jmario@redhat.com>
>
> Running fio on an nvme, I verified that block kworker threads no longer ran on isolated cpus.
> I used:
> # timeout -s INT 60 bpftrace -e 'kprobe:nvme_queue_rq{ @["CPU=",cpu] = count(); }'
>
> The kernel with this V3 patch showed no hits on isolated cpus.
> An unpatched kernel showed 40587 hits on isolated cpus.

Hi Joe,

Thanks for the test!

Jens, any chance to merge this patch? It does address the CPU isolation
use case, and meantime it doesn't affect normal, non-isolated cases.

Thanks,
Ming
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2d11183f62e..4556978ce71b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -29,6 +29,7 @@
 #include <linux/prefetch.h>
 #include <linux/blk-crypto.h>
 #include <linux/part_stat.h>
+#include <linux/sched/isolation.h>
 
 #include <trace/events/block.h>
 
@@ -2158,7 +2159,11 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 	bool tried = false;
 	int next_cpu = hctx->next_cpu;
 
-	if (hctx->queue->nr_hw_queues == 1)
+	/*
+	 * In case of a single queue, or no allowed CPU for scheduling
+	 * the worker, don't bind the worker to any particular CPU
+	 */
+	if (hctx->queue->nr_hw_queues == 1 || next_cpu >= nr_cpu_ids)
 		return WORK_CPU_UNBOUND;
 
 	if (--hctx->next_cpu_batch <= 0) {
@@ -3459,14 +3464,30 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
 	return data.has_rq;
 }
 
-static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
-		struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_hctx_has_online_cpu(struct blk_mq_hw_ctx *hctx,
+		unsigned int this_cpu)
 {
-	if (cpumask_first_and(hctx->cpumask, cpu_online_mask) != cpu)
-		return false;
-	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
-		return false;
-	return true;
+	enum hctx_type type = hctx->type;
+	int cpu;
+
+	/*
+	 * hctx->cpumask has ruled out isolated CPUs, but userspace still
+	 * might submit IOs on these isolated CPUs, so use the queue map to
+	 * check if all CPUs mapped to this hctx are offline
+	 */
+	for_each_online_cpu(cpu) {
+		struct blk_mq_hw_ctx *h = blk_mq_map_queue_type(hctx->queue,
+				type, cpu);
+
+		if (h != hctx)
+			continue;
+
+		/* this current CPU isn't offlined yet */
+		if (this_cpu != cpu)
+			return true;
+	}
+
+	return false;
 }
 
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
@@ -3474,8 +3495,7 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_online);
 
-	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
-	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
+	if (blk_mq_hctx_has_online_cpu(hctx, cpu))
 		return 0;
 
 	/*
@@ -3883,6 +3903,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
+		int cpu;
+
 		/*
 		 * If no software queues are mapped to this hardware queue,
 		 * disable it and free the request entries.
@@ -3909,6 +3931,15 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		 */
 		sbitmap_resize(&hctx->ctx_map, hctx->nr_ctx);
 
+		/*
+		 * rule out isolated CPUs from hctx->cpumask to avoid
+		 * running the wq worker on an isolated CPU
+		 */
+		for_each_cpu(cpu, hctx->cpumask) {
+			if (cpu_is_isolated(cpu))
+				cpumask_clear_cpu(cpu, hctx->cpumask);
+		}
+
 		/*
 		 * Initialize batch roundrobin counts
 		 */