Message ID | Y1eZbXKdJDoS8loC@wantstofly.org |
---|---|
State | New |
Headers |
Date: Tue, 25 Oct 2022 11:08:13 +0300
From: Lennert Buytenhek <buytenh@wantstofly.org>
To: David Woodhouse <dwmw2@infradead.org>, Lu Baolu <baolu.lu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>, Robin Murphy <robin.murphy@arm.com>, iommu@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: [PATCH,RFC] iommu/vt-d: Convert dmar_fault IRQ to a threaded IRQ
Message-ID: <Y1eZbXKdJDoS8loC@wantstofly.org> |
Series | [RFC] iommu/vt-d: Convert dmar_fault IRQ to a threaded IRQ |
Commit Message
Lennert Buytenhek
Oct. 25, 2022, 8:08 a.m. UTC
Under a high enough I/O page fault load, the dmar_fault hardirq handler
can end up starving other tasks that wanted to run on the CPU that the
IRQ is being routed to. On an i7-6700 CPU this seems to happen at
around 2.5 million I/O page faults per second, and at a fraction of
that rate on some of the lower-end CPUs that we use.
An I/O page fault rate of 2.5 million per second may seem like a very
high number, but when we get an I/O page fault for every cache line
touched by a DMA operation, this I/O page fault rate can be the result
of a confused PCIe device DMAing to RAM at 2.5 * 64 = 160 MB/sec, which
is not an unlikely rate to be DMAing things to RAM at. And, in fact,
when we do see PCIe devices getting confused like this, this sort of
I/O page fault rate is not uncommon.
A peripheral device continuously DMAing to RAM at 160 MB/s is
inarguably a bug, either in the kernel driver for the device or in the
firmware for the device, and should be fixed there, but it's the sort
of bug that iommu/vt-d could be handling better than it currently does,
and there is a fairly simple way to achieve that.
This patch changes the dmar_fault IRQ handler to be a threaded IRQ
handler. This is a pretty minimal code change, and comes with the
advantage that Intel IOMMU I/O page fault handling work is now subject
to RT throttling, which allows it to be kept under control using the
sched_rt_period_us / sched_rt_runtime_us parameters.
iommu/amd already uses a threaded IRQ handler for its I/O page fault
reporting, and so it already has this advantage.
When IRQ remapping is enabled, iommu/vt-d will try to set up its
dmar_fault IRQ handler from start_kernel() -> x86_late_time_init()
-> apic_intr_mode_init() -> apic_bsp_setup() ->
irq_remap_enable_fault_handling() -> enable_drhd_fault_handling(),
which happens before kthreadd is started, and trying to set up a
threaded IRQ handler this early on will oops. However, there
doesn't seem to be a reason why iommu/vt-d needs to set up its fault
reporting IRQ handler this early, and if we remove the IRQ setup code
from enable_drhd_fault_handling(), the IRQ will be registered instead
from pci_iommu_init() -> intel_iommu_init() -> init_dmars(), which
seems to work just fine.
Suggested-by: Scarlett Gourley <scarlett@arista.com>
Suggested-by: James Sewart <jamessewart@arista.com>
Suggested-by: Jack O'Sullivan <jack@arista.com>
Signed-off-by: Lennert Buytenhek <buytenh@arista.com>
---
drivers/iommu/intel/dmar.c | 27 ++-------------------------
1 file changed, 2 insertions(+), 25 deletions(-)
Comments
On 10/25/22 4:08 PM, Lennert Buytenhek wrote:
> Under a high enough I/O page fault load, the dmar_fault hardirq handler
> can end up starving other tasks that wanted to run on the CPU that the
> IRQ is being routed to. On an i7-6700 CPU this seems to happen at
> around 2.5 million I/O page faults per second, and at a fraction of
> that rate on some of the lower-end CPUs that we use.
>
> An I/O page fault rate of 2.5 million per second may seem like a very
> high number, but when we get an I/O page fault for every cache line
> touched by a DMA operation, this I/O page fault rate can be the result
> of a confused PCIe device DMAing to RAM at 2.5 * 64 = 160 MB/sec, which
> is not an unlikely rate to be DMAing things to RAM at. And, in fact,
> when we do see PCIe devices getting confused like this, this sort of
> I/O page fault rate is not uncommon.
>
> A peripheral device continuously DMAing to RAM at 160 MB/s is
> inarguably a bug, either in the kernel driver for the device or in the
> firmware for the device, and should be fixed there, but it's the sort
> of bug that iommu/vt-d could be handling better than it currently does,
> and there is a fairly simple way to achieve that.
>
> This patch changes the dmar_fault IRQ handler to be a threaded IRQ
> handler. This is a pretty minimal code change, and comes with the
> advantage that Intel IOMMU I/O page fault handling work is now subject
> to RT throttling, which allows it to be kept under control using the
> sched_rt_period_us / sched_rt_runtime_us parameters.

Thanks for the patch! I like it, but also have some concerns.

If you look at the commit history, you will find that the opposite
change took place 10+ years ago.

commit 477694e71113fd0694b6bb0bcc2d006b8ac62691
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Jul 19 16:25:42 2011 +0200

    x86, iommu: Mark DMAR IRQ as non-threaded

    Mark this lowlevel IRQ handler as non-threaded. This prevents a boot
    crash when "threadirqs" is on the kernel commandline. Also the
    interrupt handler is handling hardware critical events which should
    not be delayed into a thread.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: stable@kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

I am not sure whether the "boot crash" mentioned above is due to that
"trying to setup a threaded IRQ handler before kthreadd is started".

> iommu/amd already uses a threaded IRQ handler for its I/O page fault
> reporting, and so it already has this advantage.
>
> When IRQ remapping is enabled, iommu/vt-d will try to set up its
> dmar_fault IRQ handler from start_kernel() -> x86_late_time_init()
> -> apic_intr_mode_init() -> apic_bsp_setup() ->
> irq_remap_enable_fault_handling() -> enable_drhd_fault_handling(),
> which happens before kthreadd is started, and trying to set up a
> threaded IRQ handler this early on will oops. However, there
> doesn't seem to be a reason why iommu/vt-d needs to set up its fault
> reporting IRQ handler this early, and if we remove the IRQ setup code
> from enable_drhd_fault_handling(), the IRQ will be registered instead
> from pci_iommu_init() -> intel_iommu_init() -> init_dmars(), which
> seems to work just fine.

At present, we cannot do so, because VT-d interrupt remapping and DMA
remapping can be independently enabled. In other words, it is possible
for interrupt remapping to be enabled while DMA remapping is not.

> Suggested-by: Scarlett Gourley <scarlett@arista.com>
> Suggested-by: James Sewart <jamessewart@arista.com>
> Suggested-by: Jack O'Sullivan <jack@arista.com>
> Signed-off-by: Lennert Buytenhek <buytenh@arista.com>
> ---
>  drivers/iommu/intel/dmar.c | 27 ++-------------------------
>  1 file changed, 2 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> index 5a8f780e7ffd..d0871fe9d04d 100644
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -2043,7 +2043,8 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
>  		return -EINVAL;
>  	}
>
> -	ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name, iommu);
> +	ret = request_threaded_irq(irq, NULL, dmar_fault, IRQF_ONESHOT,
> +				   iommu->name, iommu);
>  	if (ret)
>  		pr_err("Can't request irq\n");
>  	return ret;
> @@ -2051,30 +2052,6 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
>
>  int __init enable_drhd_fault_handling(void)
>  {
> -	struct dmar_drhd_unit *drhd;
> -	struct intel_iommu *iommu;
> -
> -	/*
> -	 * Enable fault control interrupt.
> -	 */
> -	for_each_iommu(iommu, drhd) {
> -		u32 fault_status;
> -		int ret = dmar_set_interrupt(iommu);
> -
> -		if (ret) {
> -			pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
> -			       (unsigned long long)drhd->reg_base_addr, ret);
> -			return -1;
> -		}
> -
> -		/*
> -		 * Clear any previous faults.
> -		 */
> -		dmar_fault(iommu->irq, iommu);
> -		fault_status = readl(iommu->reg + DMAR_FSTS_REG);
> -		writel(fault_status, iommu->reg + DMAR_FSTS_REG);
> -	}
> -
>  	return 0;
>  }

Best regards,
baolu
On Wed, Oct 26, 2022 at 10:10:29AM +0800, Baolu Lu wrote:
> > Under a high enough I/O page fault load, the dmar_fault hardirq handler
> > can end up starving other tasks that wanted to run on the CPU that the
> > IRQ is being routed to. On an i7-6700 CPU this seems to happen at
> > around 2.5 million I/O page faults per second, and at a fraction of
> > that rate on some of the lower-end CPUs that we use.
> >
> > An I/O page fault rate of 2.5 million per second may seem like a very
> > high number, but when we get an I/O page fault for every cache line
> > touched by a DMA operation, this I/O page fault rate can be the result
> > of a confused PCIe device DMAing to RAM at 2.5 * 64 = 160 MB/sec, which
> > is not an unlikely rate to be DMAing things to RAM at. And, in fact,
> > when we do see PCIe devices getting confused like this, this sort of
> > I/O page fault rate is not uncommon.
> >
> > A peripheral device continuously DMAing to RAM at 160 MB/s is
> > inarguably a bug, either in the kernel driver for the device or in the
> > firmware for the device, and should be fixed there, but it's the sort
> > of bug that iommu/vt-d could be handling better than it currently does,
> > and there is a fairly simple way to achieve that.
> >
> > This patch changes the dmar_fault IRQ handler to be a threaded IRQ
> > handler. This is a pretty minimal code change, and comes with the
> > advantage that Intel IOMMU I/O page fault handling work is now subject
> > to RT throttling, which allows it to be kept under control using the
> > sched_rt_period_us / sched_rt_runtime_us parameters.
>
> Thanks for the patch! I like it, but also have some concerns.

Thanks for having a look!

> If you look at the commit history, you will find that the opposite
> change took place 10+ years ago.
>
> commit 477694e71113fd0694b6bb0bcc2d006b8ac62691
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Tue Jul 19 16:25:42 2011 +0200
>
>     x86, iommu: Mark DMAR IRQ as non-threaded
>
>     Mark this lowlevel IRQ handler as non-threaded. This prevents a boot
>     crash when "threadirqs" is on the kernel commandline. Also the
>     interrupt handler is handling hardware critical events which should
>     not be delayed into a thread.
>
>     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>     Cc: stable@kernel.org
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>
>
> I am not sure whether the "boot crash" mentioned above is due to that
> "trying to setup a threaded IRQ handler before kthreadd is started".

On v6.1-rc you also get a boot crash if you force the dmar_fault IRQ to
be a threaded IRQ without moving the IRQ registration out of the
start_kernel() -> x86_late_time_init() -> apic_intr_mode_init() ->
apic_bsp_setup() -> irq_remap_enable_fault_handling() ->
enable_drhd_fault_handling() path.

The crash seen on v3.0 when forcing the dmar_fault IRQ to be a threaded
IRQ may have been due to the same reason, but I'm not sure how this may
have worked in 2011. :-)

I'm not sure I agree with the "the interrupt handler is handling
hardware critical events which should not be delayed into a thread"
part of this commit message. All that dmar_fault does is log
translation faults to the console, and I don't think that anything will
break if that gets delayed for a while.

> > iommu/amd already uses a threaded IRQ handler for its I/O page fault
> > reporting, and so it already has this advantage.
> >
> > When IRQ remapping is enabled, iommu/vt-d will try to set up its
> > dmar_fault IRQ handler from start_kernel() -> x86_late_time_init()
> > -> apic_intr_mode_init() -> apic_bsp_setup() ->
> > irq_remap_enable_fault_handling() -> enable_drhd_fault_handling(),
> > which happens before kthreadd is started, and trying to set up a
> > threaded IRQ handler this early on will oops. However, there
> > doesn't seem to be a reason why iommu/vt-d needs to set up its fault
> > reporting IRQ handler this early, and if we remove the IRQ setup code
> > from enable_drhd_fault_handling(), the IRQ will be registered instead
> > from pci_iommu_init() -> intel_iommu_init() -> init_dmars(), which
> > seems to work just fine.
>
> At present, we cannot do so, because VT-d interrupt remapping and DMA
> remapping can be independently enabled. In other words, it is possible
> for interrupt remapping to be enabled while DMA remapping is not.

Is there a way I can test this easily?

I think we should be able to handle the "interrupt remapping enabled
but DMA remapping disabled" case in the same way, by registering the
dmar_fault IRQ sometime after kthreadd has been started. I don't think
the dmar_fault handler performs any function that is critical for the
operation of the IOMMU, and I think that we can defer setting it up
until whenever is convenient.

Thank you!

> > Suggested-by: Scarlett Gourley <scarlett@arista.com>
> > Suggested-by: James Sewart <jamessewart@arista.com>
> > Suggested-by: Jack O'Sullivan <jack@arista.com>
> > Signed-off-by: Lennert Buytenhek <buytenh@arista.com>
> > ---
> >  drivers/iommu/intel/dmar.c | 27 ++-------------------------
> >  1 file changed, 2 insertions(+), 25 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> > index 5a8f780e7ffd..d0871fe9d04d 100644
> > --- a/drivers/iommu/intel/dmar.c
> > +++ b/drivers/iommu/intel/dmar.c
> > @@ -2043,7 +2043,8 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
> >  		return -EINVAL;
> >  	}
> >
> > -	ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name, iommu);
> > +	ret = request_threaded_irq(irq, NULL, dmar_fault, IRQF_ONESHOT,
> > +				   iommu->name, iommu);
> >  	if (ret)
> >  		pr_err("Can't request irq\n");
> >  	return ret;
> > @@ -2051,30 +2052,6 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
> >
> >  int __init enable_drhd_fault_handling(void)
> >  {
> > -	struct dmar_drhd_unit *drhd;
> > -	struct intel_iommu *iommu;
> > -
> > -	/*
> > -	 * Enable fault control interrupt.
> > -	 */
> > -	for_each_iommu(iommu, drhd) {
> > -		u32 fault_status;
> > -		int ret = dmar_set_interrupt(iommu);
> > -
> > -		if (ret) {
> > -			pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
> > -			       (unsigned long long)drhd->reg_base_addr, ret);
> > -			return -1;
> > -		}
> > -
> > -		/*
> > -		 * Clear any previous faults.
> > -		 */
> > -		dmar_fault(iommu->irq, iommu);
> > -		fault_status = readl(iommu->reg + DMAR_FSTS_REG);
> > -		writel(fault_status, iommu->reg + DMAR_FSTS_REG);
> > -	}
> > -
> >  	return 0;
> >  }
On 2022/10/27 16:19, Lennert Buytenhek wrote:
>>> iommu/amd already uses a threaded IRQ handler for its I/O page fault
>>> reporting, and so it already has this advantage.
>>>
>>> When IRQ remapping is enabled, iommu/vt-d will try to set up its
>>> dmar_fault IRQ handler from start_kernel() -> x86_late_time_init()
>>> -> apic_intr_mode_init() -> apic_bsp_setup() ->
>>> irq_remap_enable_fault_handling() -> enable_drhd_fault_handling(),
>>> which happens before kthreadd is started, and trying to set up a
>>> threaded IRQ handler this early on will oops. However, there
>>> doesn't seem to be a reason why iommu/vt-d needs to set up its fault
>>> reporting IRQ handler this early, and if we remove the IRQ setup code
>>> from enable_drhd_fault_handling(), the IRQ will be registered instead
>>> from pci_iommu_init() -> intel_iommu_init() -> init_dmars(), which
>>> seems to work just fine.
>>
>> At present, we cannot do so, because VT-d interrupt remapping and DMA
>> remapping can be independently enabled. In other words, it is possible
>> for interrupt remapping to be enabled while DMA remapping is not.
>
> Is there a way I can test this easily?
>
> I think we should be able to handle the "interrupt remapping enabled
> but DMA remapping disabled" case in the same way, by registering the
> dmar_fault IRQ sometime after kthreadd has been started. I don't think
> the dmar_fault handler performs any function that is critical for the
> operation of the IOMMU, and I think that we can defer setting it up
> until whenever is convenient.

Another possible way is not to split VT-d DMA remapping and interrupt
remapping. The possible case of "intr remapping enabled but DMA
remapping not" that I can imagine is that the guest VM doesn't want DMA
translation because of poor efficiency. If so, the overhead of DMA
translation can be eliminated through "iommu=pt" or kernel build
configuration. Of course, there may also be some special needs that I
did not think of.

Best regards,
baolu
diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 5a8f780e7ffd..d0871fe9d04d 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -2043,7 +2043,8 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
 		return -EINVAL;
 	}
 
-	ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name, iommu);
+	ret = request_threaded_irq(irq, NULL, dmar_fault, IRQF_ONESHOT,
+				   iommu->name, iommu);
 	if (ret)
 		pr_err("Can't request irq\n");
 	return ret;
@@ -2051,30 +2052,6 @@ int dmar_set_interrupt(struct intel_iommu *iommu)
 
 int __init enable_drhd_fault_handling(void)
 {
-	struct dmar_drhd_unit *drhd;
-	struct intel_iommu *iommu;
-
-	/*
-	 * Enable fault control interrupt.
-	 */
-	for_each_iommu(iommu, drhd) {
-		u32 fault_status;
-		int ret = dmar_set_interrupt(iommu);
-
-		if (ret) {
-			pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
-			       (unsigned long long)drhd->reg_base_addr, ret);
-			return -1;
-		}
-
-		/*
-		 * Clear any previous faults.
-		 */
-		dmar_fault(iommu->irq, iommu);
-		fault_status = readl(iommu->reg + DMAR_FSTS_REG);
-		writel(fault_status, iommu->reg + DMAR_FSTS_REG);
-	}
-
 	return 0;
 }