Message ID | 20230216071148.2060-1-zhangpeng.00@bytedance.com |
---|---|
State | New |
Headers | From: Peng Zhang <zhangpeng.00@bytedance.com>; To: robin.murphy@arm.com, joro@8bytes.org, will@kernel.org; Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org, Li Bin <huawei.libin@huawei.com>, Xie XiuQi <xiexiuqi@huawei.com>, Yang Yingliang <yangyingliang@huawei.com>; Date: Thu, 16 Feb 2023 15:11:48 +0800 |
Series | iommu: Avoid softlockup and rcu stall in fq_flush_timeout(). |
Commit Message
Peng Zhang
Feb. 16, 2023, 7:11 a.m. UTC
There is a softlockup under fio pressure testing with SMMU enabled:

watchdog: BUG: soft lockup - CPU#81 stuck for 22s! [swapper/81:0]
...
Call trace:
 fq_flush_timeout+0xc0/0x110
 call_timer_fn+0x34/0x178
 expire_timers+0xec/0x158
 run_timer_softirq+0xc0/0x1f8
 __do_softirq+0x120/0x324
 irq_exit+0x11c/0x140
 __handle_domain_irq+0x6c/0xc0
 gic_handle_irq+0x6c/0x170
 el1_irq+0xb8/0x140
 arch_cpu_idle+0x38/0x1c0
 default_idle_call+0x24/0x44
 do_idle+0x1f4/0x2d8
 cpu_startup_entry+0x2c/0x30
 secondary_start_kernel+0x17c/0x1c8

An RCU stall may also be triggered:

rcu: INFO: rcu_sched self-detected stall on CPU
NMI backtrace for cpu 21
CPU: 21 PID: 118 Comm: ksoftirqd/21
...
Call trace:
 fq_flush_timeout+0x6d/0x90
 ? fq_ring_free+0xc0/0xc0
 call_timer_fn+0x2b/0x120
 run_timer_softirq+0x1a6/0x420
 ? finish_task_switch+0x80/0x280
 __do_softirq+0xda/0x2da
 ? sort_range+0x20/0x20
 run_ksoftirqd+0x26/0x40
 smpboot_thread_fn+0xb8/0x150
 kthread+0x110/0x130
 ? __kthread_cancel_work+0x40/0x40
 ret_from_fork+0x1f/0x30

This happens because the timer callback fq_flush_timeout() may run for more
than 10ms, and the timer may be processed back to back in softirq context,
which triggers the softlockup and RCU stall. Move the per-CPU fq_ring_free()
processing, which may take a long time, into a workqueue to avoid triggering
the softlockup and RCU stall.

This patch is modified from the patch [1] from openEuler.

[1] https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/C3KYS4BXTDMVVBQNEQYNAOGOQWFFINHJ/

Signed-off-by: Li Bin <huawei.libin@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
 drivers/iommu/dma-iommu.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)
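The pattern of the fix can also be seen outside of kernel context. Below is a small, self-contained userspace model in plain C with pthreads; all names and sizes are invented for illustration and none of this is the kernel API used by the patch. The periodic "timer" callback no longer drains every per-CPU queue itself; it only wakes a worker, and the worker does the slow per-queue draining.

------
/*
 * Userspace model of the deferral pattern: the "timer" callback only
 * signals a worker; the worker walks and drains the per-CPU queues.
 * Illustrative only -- invented names, not kernel code.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NR_CPUS 4

struct fq {
	pthread_mutex_t lock;
	int entries;               /* stand-in for queued IOVA ranges */
};

static struct fq fqs[NR_CPUS];
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_cond = PTHREAD_COND_INITIALIZER;
static bool work_pending;
static bool stop;

/* Slow path: walk all per-CPU queues; runs in the worker, not the timer. */
static void *free_iova_worker(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&work_lock);
		while (!work_pending && !stop)
			pthread_cond_wait(&work_cond, &work_lock);
		if (stop && !work_pending) {
			pthread_mutex_unlock(&work_lock);
			return NULL;
		}
		work_pending = false;
		pthread_mutex_unlock(&work_lock);

		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			pthread_mutex_lock(&fqs[cpu].lock);
			printf("draining cpu %d: %d entries\n", cpu, fqs[cpu].entries);
			fqs[cpu].entries = 0;      /* fq_ring_free() stand-in */
			pthread_mutex_unlock(&fqs[cpu].lock);
		}
	}
}

/* "Timer callback": cheap -- just schedule the worker and return. */
static void fq_flush_timeout_model(void)
{
	pthread_mutex_lock(&work_lock);
	work_pending = true;
	pthread_cond_signal(&work_cond);
	pthread_mutex_unlock(&work_lock);
}

int main(void)
{
	pthread_t worker;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		pthread_mutex_init(&fqs[cpu].lock, NULL);
		fqs[cpu].entries = 100 * (cpu + 1);
	}
	pthread_create(&worker, NULL, free_iova_worker, NULL);

	fq_flush_timeout_model();  /* what the timer path now does */
	usleep(100 * 1000);

	pthread_mutex_lock(&work_lock);
	stop = true;
	pthread_cond_signal(&work_cond);
	pthread_mutex_unlock(&work_lock);
	pthread_join(worker, NULL);
	return 0;
}
------

Build and run with: cc -pthread fq_model.c && ./a.out (file name chosen arbitrarily).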
Comments
On Thu, Feb 16, 2023 at 03:11:48PM +0800, Peng Zhang wrote:
> There is softlockup under fio pressure test with smmu enabled:
> [...]
>
> This patch is modified from the patch [1] of openEuler.

Hi Robin,

I was looking at something similar to this recently, where in this case
they were beating the heck out of the system with the Hazard I/O stress
test, and someone else with some Medusa test tool. In one case they had
them force a dump on the soft lockup. On the 384-core Genoa, 90 cores
were spinning on the iovad rb-tree lock for one domain, one had it, and
the poor flush-queue timer handler was having to fight everyone for the
lock. I'm not sure what would be considered a realistic workload compared
to these stressors, but could this become an issue over time as systems
continue to get more cores, since the timer handler potentially grabs and
releases the iova domain rb-tree lock for each CPU? The only cases I know
of use I/O stressors, so I don't know how big a deal it is.

I think soft lockups could still be produced with this patch, since
there would still be the lock contention.

Regards,
Jerry

> [1] https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/C3KYS4BXTDMVVBQNEQYNAOGOQWFFINHJ/
>
> [...]
Hi All,

On 4/11/2023 3:52 AM, Jerry Snitselaar wrote:
> On Thu, Feb 16, 2023 at 03:11:48PM +0800, Peng Zhang wrote:
>> There is softlockup under fio pressure test with smmu enabled:
>> [...]
>
> [...]
>
> I think soft lockups could still be produced with this patch, since
> there would still be the lock contention.

Right. We hit this issue when running stress-ng on a 384-CPU system with
an NVMe disk. I have tried this patch. It improved the situation, but we
still hit a soft lockup after an hour or so.

Backtrace with this patch:

[ 4551.470797] watchdog: BUG: soft lockup - CPU#328 stuck for 30s! [kworker/328:1:2710]
[ 4551.479441] Modules linked in: nvme_fabrics(E) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) kvm(E) crct10dif_pclmul(E) ghash_clmulni_intel(E) sha512_ssse3(E) aesni_intel(E) crypto_simd(E) cryptd(E) rapl(E) wmi_bmof(E) binfmt_misc(E) ipmi_ssif(E) nls_iso8859_1(E) ast(E) drm_shmem_helper(E) drm_kms_helper(E) i2c_algo_bit(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) acpi_ipmi(E) ccp(E) k10temp(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) mac_hid(E) sch_fq_codel(E) msr(E) parport_pc(E) ppdev(E) lp(E) parport(E) ramoops(E) reed_solomon(E) pstore_blk(E) pstore_zone(E) efi_pstore(E) drm(E) ip_tables(E) x_tables(E) autofs4(E) crc32_pclmul(E) nvme(E) ahci(E) xhci_pci(E) tg3(E) nvme_core(E) libahci(E) i2c_piix4(E) xhci_pci_renesas(E) wmi(E)
[ 4551.479475] CPU: 328 PID: 2710 Comm: kworker/328:1 Kdump: loaded Tainted: G E 6.2.9schd_work+ #1
[ 4551.479477] Hardware name: AMD Corporation Genoa, BIOS RTI 03/20/2023
[ 4551.479479] Workqueue: events free_iova_work_func
[ 4551.479484] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x60
[ 4551.479488] Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 49 89 f0 48 89 e5 c6 07 00 0f 1f 00 41 f7 c0 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d 50 f8 fc 7d 74 13 5d 31 c0 89 c2 89 c1 89 c6 89 c7 41 89
[ 4551.479489] RSP: 0018:ffa0000092e07e30 EFLAGS: 00000206
[ 4551.479491] RAX: 0000000000000000 RBX: 00000000000000f2 RCX: 0000000000000000
[ 4551.479492] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffd1ff3cf82be060
[ 4551.479492] RBP: ffa0000092e07e30 R08: 0000000000000282 R09: 0000000000000000
[ 4551.479493] R10: 0000000000000000 R11: 0000000000000000 R12: ff1100010d2af6d0
[ 4551.479493] R13: ffd1ff3cf82be060 R14: ffd1ff3cf82bb858 R15: 0000000000000282
[ 4551.479494] FS: 0000000000000000(0000) GS:ff11018056200000(0000) knlGS:0000000000000000
[ 4551.479495] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4551.479496] CR2: 00007ffd392f5f28 CR3: 0000000152f06004 CR4: 0000000000771ee0
[ 4551.479497] PKRU: 55555554
[ 4551.479497] Call Trace:
[ 4551.479498] <TASK>
[ 4551.479502] free_iova_work_func+0x6a/0xc0
[ 4551.479505] process_one_work+0x21f/0x440
[ 4551.479507] worker_thread+0x50/0x3f0
[ 4551.479509] ? __pfx_worker_thread+0x10/0x10
[ 4551.479510] kthread+0xee/0x120
[ 4551.479513] ? __pfx_kthread+0x10/0x10
[ 4551.479514] ret_from_fork+0x2c/0x50
[ 4551.479518] </TASK>

I did a slight modification on top of this patch (schedule the work on
each CPU) and it worked for me.

@Robin, Joerg,
Does something like the below (schedule work on each CPU to free IOVAs
via fq_ring_free()) look acceptable?

-Vasant

------

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7aed15e53f10..eb52814e67e5 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -157,10 +157,12 @@ static void fq_flush_iotlb(struct iommu_dma_cookie *cookie)
 static void fq_flush_timeout(struct timer_list *t)
 {
         struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
+        int cpu;
 
         atomic_set(&cookie->fq_timer_on, 0);
         fq_flush_iotlb(cookie);
-        schedule_work(&cookie->free_iova_work);
+        for_each_possible_cpu(cpu)
+                schedule_work_on(cpu, &cookie->free_iova_work);
 }
 
 static void queue_iova(struct iommu_dma_cookie *cookie,
@@ -235,17 +237,15 @@ static void free_iova_work_func(struct work_struct *work)
 {
         struct iommu_dma_cookie *cookie;
         int cpu;
+        unsigned long flags;
+        struct iova_fq *fq;
 
         cookie = container_of(work, struct iommu_dma_cookie, free_iova_work);
-        for_each_possible_cpu(cpu) {
-                unsigned long flags;
-                struct iova_fq *fq;
+        fq = this_cpu_ptr(cookie->fq);
 
-                fq = per_cpu_ptr(cookie->fq, cpu);
-                spin_lock_irqsave(&fq->lock, flags);
-                fq_ring_free(cookie, fq);
-                spin_unlock_irqrestore(&fq->lock, flags);
-        }
+        spin_lock_irqsave(&fq->lock, flags);
+        fq_ring_free(cookie, fq);
+        spin_unlock_irqrestore(&fq->lock, flags);
 }
 
 /* sysfs updates are serialised by the mutex of the group owning @domain */

---------
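For comparison with the two diffs above, here is a hedged, compile-untested sketch of how a fully per-CPU deferral is often structured in kernel code: one work item per CPU, so each CPU drains only its own queue. It is not part of either posted patch; the fq_drain percpu member, the fq_drain_work type, and the function names are invented for illustration (their INIT_WORK setup at init time is omitted), while the workqueue helpers themselves (schedule_work_on(), per_cpu_ptr(), this_cpu_ptr()) are standard kernel APIs.

------
/* Illustrative sketch only -- not from the posted patches. */
struct fq_drain_work {
        struct work_struct work;
        struct iommu_dma_cookie *cookie;        /* invented back-pointer */
};

static void fq_drain_work_func(struct work_struct *work)
{
        struct fq_drain_work *dw = container_of(work, struct fq_drain_work, work);
        struct iova_fq *fq = this_cpu_ptr(dw->cookie->fq);
        unsigned long flags;

        /* Runs on the CPU it was scheduled on, so it touches only that queue. */
        spin_lock_irqsave(&fq->lock, flags);
        fq_ring_free(dw->cookie, fq);
        spin_unlock_irqrestore(&fq->lock, flags);
}

static void fq_flush_timeout_sketch(struct timer_list *t)
{
        struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
        int cpu;

        atomic_set(&cookie->fq_timer_on, 0);
        fq_flush_iotlb(cookie);
        /* cookie->fq_drain is assumed to be a percpu struct fq_drain_work. */
        for_each_possible_cpu(cpu)
                schedule_work_on(cpu, &per_cpu_ptr(cookie->fq_drain, cpu)->work);
}
------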
Hello Robin, Joerg,

Ping. Any suggestion on the below proposal (schedule work on each CPU to
free IOVAs)?

-Vasant

On 4/18/2023 4:11 PM, Vasant Hegde wrote:
> Hi All,
>
> On 4/11/2023 3:52 AM, Jerry Snitselaar wrote:
>> On Thu, Feb 16, 2023 at 03:11:48PM +0800, Peng Zhang wrote:
>>> There is softlockup under fio pressure test with smmu enabled:
>>> [...]
>
> Right. We hit this issue when running stress-ng on a 384-CPU system with
> an NVMe disk. I have tried this patch. It improved the situation, but we
> still hit a soft lockup after an hour or so.
>
> [...]
>
> @Robin, Joerg,
> Does something like the below (schedule work on each CPU to free IOVAs
> via fq_ring_free()) look acceptable?
>
> [...]
Hi,
On Fri, Apr 28, 2023 at 11:14:54AM +0530, Vasant Hegde wrote:
> Ping. Any suggestion on below proposal (schedule work on each CPU to free iova)?
Optimizing the flush-timeout path seems to be working on the symptoms
rather than the cause. The main question to look into first is why so
many CPUs are competing for the IOVA allocator lock.
That is a situation which the flush-queue code is there to avoid;
obviously it does not scale to the workloads tested here. Any chance to
check why?
My guess is that the allocations are too big and not covered by the
allocation sizes supported by the flush-queue code. But maybe this is
something that can be fixed. Or the flush-queue code could even be
changed to auto-adapt to allocation patterns of the device driver?
Regards,
Joerg
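To make the point about allocation sizes concrete: per-CPU caching in front of the IOVA allocator only helps for allocations up to some maximum cached size; anything larger falls back to the shared, lock-protected allocator, which is where many CPUs can end up serialising. The toy userspace model below illustrates only that split; the threshold value and data structures are invented and this is not the kernel's iova code.

------
/*
 * Toy model: requests at or below a "cached" size stay on a per-CPU fast
 * path with no shared lock; larger requests serialise on one global lock.
 * Threshold and structures are invented for illustration.
 */
#include <pthread.h>
#include <stdio.h>

#define CACHED_MAX_PAGES 32          /* assumed threshold, not the real one */
#define NR_CPUS 4

static unsigned long percpu_cache_hits[NR_CPUS];
static unsigned long global_allocs;
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

static void alloc_iova_model(int cpu, unsigned long pages)
{
	if (pages <= CACHED_MAX_PAGES) {
		/* fast path: per-CPU, no shared lock taken */
		percpu_cache_hits[cpu]++;
		return;
	}
	/* slow path: every CPU doing large mappings serialises here */
	pthread_mutex_lock(&global_lock);
	global_allocs++;
	pthread_mutex_unlock(&global_lock);
}

int main(void)
{
	/* with 4 KiB pages: a 64 KiB mapping stays cached, a 1 MiB one does not */
	alloc_iova_model(0, 16);
	alloc_iova_model(1, 256);
	printf("cached: %lu+%lu, global: %lu\n",
	       percpu_cache_hits[0], percpu_cache_hits[1], global_allocs);
	return 0;
}
------

In the model, the 64 KiB mapping stays on the lock-free fast path while the 1 MiB mapping takes the global lock, so the mapping sizes a driver issues determine which path dominates under load.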
On Mon, May 22, 2023 at 04:58:33PM +0200, Joerg Roedel wrote:
> Hi,
>
> [...]
>
> My guess is that the allocations are too big and not covered by the
> allocation sizes supported by the flush-queue code. But maybe this is
> something that can be fixed. Or the flush-queue code could even be
> changed to auto-adapt to allocation patterns of the device driver?
>
> Regards,
>
> Joerg

In the case I know of it involved some proprietary test suites (Hazard
I/O, and Medusa?) and the lpfc driver. I was able to force the condition
using fio with a number of jobs running. I'll play around and see if I
can figure out a point where it starts to become an issue.

I mentioned to the Broadcom folks what the nvme driver did for the max
DMA size, but I haven't had a chance to go looking at it myself yet to
see if there is somewhere in the lpfc code to fix it up.

Regards,
Jerry
On 22/05/2023 16:18, Jerry Snitselaar wrote:
>> My guess is that the allocations are too big and not covered by the
>> allocation sizes supported by the flush-queue code. But maybe this is
>> something that can be fixed. Or the flush-queue code could even be
>> changed to auto-adapt to allocation patterns of the device driver?
>>
>> Regards,
>>
>> Joerg
> In the case I know of it involved some proprietary test suites (Hazard
> I/O, and Medusa?) and the lpfc driver. I was able to force the condition
> using fio with a number of jobs running. I'll play around and see if I
> can figure out a point where it starts to become an issue.
>
> I mentioned to the Broadcom folks what the nvme driver did for the max
> DMA size, but I haven't had a chance to go looking at it myself yet to
> see if there is somewhere in the lpfc code to fix it up.

JFYI, the SCSI core already supports setting this in shost->opt_sectors;
see the example in sas_host_setup().

This issue may continue to pop up, so we may need a better way to turn
it on/off for all drivers or classes of drivers.

John
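For reference, the kind of cap being pointed to looks roughly like the snippet below, modelled on how sas_host_setup() clamps the optimal transfer size so that streaming DMA mappings stay within what dma_opt_mapping_size() reports for the device. This is a from-memory sketch rather than a copy of the libsas code: the helper name is invented, the exact fields and call site should be checked against the kernel tree, and lpfc would need its own equivalent hook.

------
/* Hedged sketch modelled on sas_host_setup(); verify against the tree. */
static void cap_opt_sectors_sketch(struct Scsi_Host *shost, struct device *dma_dev)
{
	if (dma_dev->dma_mask)
		shost->opt_sectors = min_t(unsigned int, shost->max_sectors,
					   dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
}
------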
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 99b2646cb5c7..bc4c979d7091 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -59,6 +59,8 @@ struct iommu_dma_cookie {
                 struct timer_list fq_timer;
                 /* 1 when timer is active, 0 when not */
                 atomic_t fq_timer_on;
+                /* The work to free iova */
+                struct work_struct free_iova_work;
         };
         /* Trivial linear page allocator for IOMMU_DMA_MSI_COOKIE */
         dma_addr_t msi_iova;
@@ -155,20 +157,10 @@ static void fq_flush_iotlb(struct iommu_dma_cookie *cookie)
 static void fq_flush_timeout(struct timer_list *t)
 {
         struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
-        int cpu;
 
         atomic_set(&cookie->fq_timer_on, 0);
         fq_flush_iotlb(cookie);
-
-        for_each_possible_cpu(cpu) {
-                unsigned long flags;
-                struct iova_fq *fq;
-
-                fq = per_cpu_ptr(cookie->fq, cpu);
-                spin_lock_irqsave(&fq->lock, flags);
-                fq_ring_free(cookie, fq);
-                spin_unlock_irqrestore(&fq->lock, flags);
-        }
+        schedule_work(&cookie->free_iova_work);
 }
 
 static void queue_iova(struct iommu_dma_cookie *cookie,
@@ -227,6 +219,7 @@ static void iommu_dma_free_fq(struct iommu_dma_cookie *cookie)
                 return;
 
         del_timer_sync(&cookie->fq_timer);
+        cancel_work_sync(&cookie->free_iova_work);
         /* The IOVAs will be torn down separately, so just free our queued pages */
         for_each_possible_cpu(cpu) {
                 struct iova_fq *fq = per_cpu_ptr(cookie->fq, cpu);
@@ -238,6 +231,23 @@ static void iommu_dma_free_fq(struct iommu_dma_cookie *cookie)
         free_percpu(cookie->fq);
 }
 
+static void free_iova_work_func(struct work_struct *work)
+{
+        struct iommu_dma_cookie *cookie;
+        int cpu;
+
+        cookie = container_of(work, struct iommu_dma_cookie, free_iova_work);
+        for_each_possible_cpu(cpu) {
+                unsigned long flags;
+                struct iova_fq *fq;
+
+                fq = per_cpu_ptr(cookie->fq, cpu);
+                spin_lock_irqsave(&fq->lock, flags);
+                fq_ring_free(cookie, fq);
+                spin_unlock_irqrestore(&fq->lock, flags);
+        }
+}
+
 /* sysfs updates are serialised by the mutex of the group owning @domain */
 int iommu_dma_init_fq(struct iommu_domain *domain)
 {
@@ -271,6 +281,7 @@ int iommu_dma_init_fq(struct iommu_domain *domain)
 
         cookie->fq = queue;
 
+        INIT_WORK(&cookie->free_iova_work, free_iova_work_func);
         timer_setup(&cookie->fq_timer, fq_flush_timeout, 0);
         atomic_set(&cookie->fq_timer_on, 0);
         /*