Message ID | 20240115114040.6279-1-zhangzekun11@huawei.com |
---|---|
State | New |
Series | iommu/arm-smmu-v3: Add a threshold to avoid potential soft lockup
Commit Message
zhangzekun (A)
Jan. 15, 2024, 11:40 a.m. UTC
Commit d5afb4b47e13 ("iommu/arm-smmu-v3: Fix soft lockup triggered
by arm_smmu_mm_invalidate_range") fixed a soft lockup problem in the
SVA case, but the code paths reached from iommu_unmap and the DMA APIs
remain unfixed and can also cause a soft lockup.
When the cmdq is quite busy and does not have much space for batch-submitting
commands, and the size passed to __arm_smmu_tlb_inv_range() is large (1G in
this case), the following soft lockup is triggered.
WARN: soft lockup - CPU#71 stuck for 12s! [qemu-kvm:1303]
..
Call trace:
dump_backtrace+0x0/0x200
show_stack+0x20/0x30
dump_stack+0xf0/0x138
watchdog_print_info+0x48/0x54
watchdog_process_before_softlockup+0x9c/0xa0
watchdog_timer_fn+0x1ac/0x2f0
__run_hrtimer+0x98/0x2b4
__hrtimer_run_queues+0xc0/0x13c
hrtimer_interrupt+0x150/0x3e4
arch_timer_handler_phys+0x3c/0x50
handle_percpu_devid_irq+0x90/0x1f4
__handle_domain_irq+0x84/0xfc
gic_handle_irq+0x88/0x2b0
el1_irq+0xb8/0x140
arm_smmu_cmdq_issue_cmdlist+0x184/0x5f4
__arm_smmu_tlb_inv_range+0x114/0x22c
arm_smmu_tlb_inv_walk+0x88/0x120
__arm_lpae_unmap+0x188/0x2c0
__arm_lpae_unmap+0x104/0x2c0
arm_lpae_unmap+0x68/0x80
arm_smmu_unmap+0x24/0x40
__iommu_unmap+0xd8/0x210
iommu_unmap+0x44/0x9c
..
The basic idea is to use the actual granule size, instead of the PAGE_SIZE
used in the SVA scenario, to calculate the threshold. When an SMMU without
ARM_SMMU_FEAT_RANGE_INV needs to invalidate a TLB range larger than the
threshold, invalidate by ASID or VMID granularity instead. The calculation
logic is similar to the 'bits_per_level' calculation done when allocating the
io-pgtable, and it can also be applied to the existing threshold in the SVA
code. Besides, change the comment reference from "MAX_TLBI_OPS" to
"MAX_DVM_OPS", because the macro has been renamed in commit ec1c3b9ff160
("arm64: tlbflush: Rename MAX_TLBI_OPS").
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
---
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 11 +--------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 23 +++++++++++++++----
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 10 ++++++++
3 files changed, 30 insertions(+), 14 deletions(-)
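
[Editor's illustration, not part of the patch] The threshold arithmetic the
patch introduces via CMDQ_MAX_TLBI_OPS(granule) (see the diff at the bottom of
the page) works out as in the userspace sketch below. The ilog2() stand-in is
an assumption made only so the snippet compiles on its own; the macro itself is
copied from the patch.

#include <stdio.h>

/* Stand-in for the kernel's ilog2() so this sketch builds in userspace */
#define ilog2(x)			(31 - __builtin_clz(x))
/* Same formula the patch adds to arm-smmu-v3.h */
#define CMDQ_MAX_TLBI_OPS(granule)	(1 << (ilog2(granule) - 3))

int main(void)
{
	unsigned int granules[] = { 4096, 16384, 65536 };	/* 4K, 16K, 64K */
	int i;

	for (i = 0; i < 3; i++) {
		unsigned int g = granules[i];
		unsigned long ops = CMDQ_MAX_TLBI_OPS(g);

		/* Ranges >= ops * granule fall back to ASID/VMID invalidation */
		printf("granule %uK: up to %lu per-page TLBIs, threshold %lu MiB\n",
		       g / 1024, ops, ops * (unsigned long)g / (1024 * 1024));
	}
	return 0;
}

So with a 4K granule any invalidation of 2 MiB or more collapses into a single
ASID/VMID invalidation; the 1G unmap in the trace above would otherwise emit
262,144 per-page TLBI commands.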
Comments
On Mon, Jan 15, 2024 at 07:40:40PM +0800, Zhang Zekun wrote: > The commit d5afb4b47e13 ("iommu/arm-smmu-v3: Fix soft lockup triggered > by arm_smmu_mm_invalidate_range") has fix a soft lockup problem when > running SVA case, but code paths from iommu_unmap and dma APIs still > remain unfixed which could also cause potential soft lockup. > > When cmdq is quite busy and don't have much space for batch submitting > cmds, and size passed to __arm_smmu_tlb_inv_range() is large (1G in this > case), the following softlockup is triggered. > > WARN: soft lockup - CPU#71 stuck for 12s! [qemu-kvm:1303] > ... > Call trace: > dump_backtrace+0x0/0x200 > show_stack+0x20/0x30 > dump_stack+0xf0/0x138 > watchdog_print_info+0x48/0x54 > watchdog_process_before_softlockup+0x9c/0xa0 > watchdog_timer_fn+0x1ac/0x2f0 > __run_hrtimer+0x98/0x2b4 > __hrtimer_run_queues+0xc0/0x13c > hrtimer_interrupt+0x150/0x3e4 > arch_timer_handler_phys+0x3c/0x50 > handle_percpu_devid_irq+0x90/0x1f4 > __handle_domain_irq+0x84/0xfc > gic_handle_irq+0x88/0x2b0 > el1_irq+0xb8/0x140 > arm_smmu_cmdq_issue_cmdlist+0x184/0x5f4 > __arm_smmu_tlb_inv_range+0x114/0x22c > arm_smmu_tlb_inv_walk+0x88/0x120 > __arm_lpae_unmap+0x188/0x2c0 > __arm_lpae_unmap+0x104/0x2c0 > arm_lpae_unmap+0x68/0x80 > arm_smmu_unmap+0x24/0x40 > __iommu_unmap+0xd8/0x210 > iommu_unmap+0x44/0x9c > ... What is the rest of the call chain? How did you get into such a large invalidation? > @@ -228,7 +219,7 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn, > */ > size = end - start; > if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV)) { > - if (size >= CMDQ_MAX_TLBI_OPS * PAGE_SIZE) > + if (size >= CMDQ_MAX_TLBI_OPS(PAGE_SIZE) * PAGE_SIZE) > size = 0; > } else { > if (size == ULONG_MAX) I would like to see the SVA code rely on the common invalidation infrastructure, this should be pushed down into the invalidation logic not be in the SVA code. With some adjustments we can get common functions for all this and push the decision making into the actual function. Something like this, on top of my "part 3" branch https://github.com/jgunthorpe/linux/commits/smmuv3_newapi/ split into more patches: diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c index 5ab976f6b108dd..b30cce97b01d06 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c @@ -30,7 +30,6 @@ static int arm_smmu_realloc_s1_domain_asid(struct arm_smmu_device *smmu, struct arm_smmu_domain *smmu_domain) { struct arm_smmu_master_domain *master_domain; - u32 old_asid = smmu_domain->asid; struct arm_smmu_cd target_cd; unsigned long flags; int ret; @@ -68,9 +67,6 @@ static int arm_smmu_realloc_s1_domain_asid(struct arm_smmu_device *smmu, &target_cd); } spin_unlock_irqrestore(&smmu_domain->devices_lock, flags); - - /* Clean the ASID we are about to assign to a new translation */ - arm_smmu_tlb_inv_asid(smmu, old_asid); return 0; } @@ -175,24 +171,12 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn, * range. So do a simple translation here by calculating size correctly. 
*/ size = end - start; - if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV)) { - if (size >= CMDQ_MAX_TLBI_OPS * PAGE_SIZE) - size = 0; - } else { - if (size == ULONG_MAX) - size = 0; - } + if (size == ULONG_MAX) + size = 0; - if (!smmu_domain->btm_invalidation) { - ioasid_t asid = READ_ONCE(smmu_domain->asid); - - if (!size) - arm_smmu_tlb_inv_asid(smmu_domain->smmu, asid); - else - arm_smmu_tlb_inv_range_asid(start, size, asid, - PAGE_SIZE, false, - smmu_domain); - } + if (!smmu_domain->btm_invalidation) + arm_smmu_tlb_inv_range_s1(start, size, PAGE_SIZE, false, + smmu_domain); arm_smmu_atc_inv_domain(smmu_domain, start, size); } @@ -228,7 +212,7 @@ static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm) } spin_unlock_irqrestore(&smmu_domain->devices_lock, flags); - arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->asid); + arm_smmu_tlb_inv_all_s1(smmu_domain); arm_smmu_atc_inv_domain(smmu_domain, 0, 0); } @@ -506,6 +490,8 @@ static int arm_smmu_share_asid(struct arm_smmu_device *smmu, ret = arm_smmu_realloc_s1_domain_asid(smmu, old_s1_domain); if (ret) goto out_restore_s1; + /* Clean the ASID since it was just recovered */ + arm_smmu_tlb_inv_all_s1(smmu_domain); } smmu_domain->btm_invalidation = true; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index b07e0692eb78ea..79c85791b0c3d5 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -90,6 +90,7 @@ static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu); static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain, struct arm_smmu_device *smmu); static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master); +static void arm_smmu_tlb_inv_all_s2(struct arm_smmu_domain *smmu_domain); static void parse_driver_options(struct arm_smmu_device *smmu) { @@ -965,17 +966,6 @@ static int arm_smmu_page_response(struct device *dev, } /* Context descriptor manipulation functions */ -void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid) -{ - struct arm_smmu_cmdq_ent cmd = { - .opcode = smmu->features & ARM_SMMU_FEAT_E2H ? - CMDQ_OP_TLBI_EL2_ASID : CMDQ_OP_TLBI_NH_ASID, - .tlbi.asid = asid, - }; - - arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd); -} - #define NUM_ENTRY_QWORDS \ (max(sizeof(struct arm_smmu_ste), sizeof(struct arm_smmu_cd)) / \ sizeof(u64)) @@ -2106,8 +2096,6 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, static void arm_smmu_tlb_inv_context(void *cookie) { struct arm_smmu_domain *smmu_domain = cookie; - struct arm_smmu_device *smmu = smmu_domain->smmu; - struct arm_smmu_cmdq_ent cmd; /* * NOTE: when io-pgtable is in non-strict mode, we may get here with @@ -2116,13 +2104,10 @@ static void arm_smmu_tlb_inv_context(void *cookie) * insertion to guarantee those are observed before the TLBI. Do be * careful, 007. 
*/ - if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { - arm_smmu_tlb_inv_asid(smmu, READ_ONCE(smmu_domain->asid)); - } else { - cmd.opcode = CMDQ_OP_TLBI_S12_VMALL; - cmd.tlbi.vmid = smmu_domain->vmid; - arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd); - } + if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) + arm_smmu_tlb_inv_all_s1(smmu_domain); + else + arm_smmu_tlb_inv_all_s2(smmu_domain); arm_smmu_atc_inv_domain(smmu_domain, 0, 0); } @@ -2197,47 +2182,69 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd, arm_smmu_cmdq_batch_submit(smmu, &cmds); } -static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size, - size_t granule, bool leaf, - struct arm_smmu_domain *smmu_domain) +static bool arm_smmu_inv_range_too_big(struct arm_smmu_device *smmu, + size_t size, size_t granule) { - struct arm_smmu_cmdq_ent cmd = { - .tlbi = { - .leaf = leaf, - }, - }; + unsigned max_ops; - if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) { - cmd.opcode = smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ? - CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA; - cmd.tlbi.asid = smmu_domain->asid; - } else { - cmd.opcode = CMDQ_OP_TLBI_S2_IPA; - cmd.tlbi.vmid = smmu_domain->vmid; - } - __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain); + /* 0 size means invalidate all */ + if (!size || size == SIZE_MAX) + return true; - /* - * Unfortunately, this can't be leaf-only since we may have - * zapped an entire table. - */ - arm_smmu_atc_inv_domain(smmu_domain, iova, size); + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) + return false; + + max_ops = 1 << (ilog2(granule) - 3); + return size >= max_ops * granule; } -void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid, - size_t granule, bool leaf, - struct arm_smmu_domain *smmu_domain) +void arm_smmu_tlb_inv_range_s1(unsigned long iova, size_t size, size_t granule, + bool leaf, struct arm_smmu_domain *smmu_domain) { struct arm_smmu_cmdq_ent cmd = { .opcode = smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ? CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA, .tlbi = { - .asid = asid, + .asid = READ_ONCE(smmu_domain->asid), .leaf = leaf, }, }; - __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain); + if (arm_smmu_inv_range_too_big(smmu_domain->smmu, size, granule)) { + cmd.opcode = smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ? 
+ CMDQ_OP_TLBI_EL2_ASID : + CMDQ_OP_TLBI_NH_ASID, + arm_smmu_cmdq_issue_cmd_with_sync(smmu_domain->smmu, &cmd); + } else{ + __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, + smmu_domain); + } +} + +static void arm_smmu_tlb_inv_range_s2(unsigned long iova, size_t size, + size_t granule, bool leaf, + struct arm_smmu_domain *smmu_domain) +{ + struct arm_smmu_cmdq_ent cmd = { + .opcode = CMDQ_OP_TLBI_S2_IPA, + .tlbi = { + .vmid = smmu_domain->vmid, + .leaf = leaf, + }, + }; + + if (arm_smmu_inv_range_too_big(smmu_domain->smmu, size, granule)) { + cmd.opcode = CMDQ_OP_TLBI_S12_VMALL; + arm_smmu_cmdq_issue_cmd_with_sync(smmu_domain->smmu, &cmd); + } else { + __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, + smmu_domain); + } +} + +static void arm_smmu_tlb_inv_all_s2(struct arm_smmu_domain *smmu_domain) +{ + arm_smmu_tlb_inv_range_s2(0, 0, PAGE_SIZE, false, smmu_domain); } static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather, @@ -2253,7 +2260,15 @@ static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather, static void arm_smmu_tlb_inv_walk(unsigned long iova, size_t size, size_t granule, void *cookie) { - arm_smmu_tlb_inv_range_domain(iova, size, granule, false, cookie); + struct arm_smmu_domain *smmu_domain = cookie; + + if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) + arm_smmu_tlb_inv_range_s1(iova, size, granule, false, + smmu_domain); + else + arm_smmu_tlb_inv_range_s2(iova, size, granule, false, + smmu_domain); + arm_smmu_atc_inv_domain(smmu_domain, iova, size); } static const struct iommu_flush_ops arm_smmu_flush_ops = { @@ -2379,7 +2394,7 @@ void arm_smmu_domain_free_id(struct arm_smmu_domain *smmu_domain) if ((smmu_domain->stage == ARM_SMMU_DOMAIN_S1 || smmu_domain->domain.type == IOMMU_DOMAIN_SVA) && smmu_domain->asid) { - arm_smmu_tlb_inv_asid(smmu, smmu_domain->asid); + arm_smmu_tlb_inv_all_s1(smmu_domain); /* Prevent SVA from touching the CD while we're freeing it */ mutex_lock(&smmu->asid_lock); @@ -3183,13 +3198,23 @@ static void arm_smmu_iotlb_sync(struct iommu_domain *domain, struct iommu_iotlb_gather *gather) { struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); + size_t size = gather->end - gather->start + 1; if (!gather->pgsize) return; - arm_smmu_tlb_inv_range_domain(gather->start, - gather->end - gather->start + 1, - gather->pgsize, true, smmu_domain); + if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) + arm_smmu_tlb_inv_range_s1(gather->start, size, gather->pgsize, + true, smmu_domain); + else + arm_smmu_tlb_inv_range_s2(gather->start, size, gather->pgsize, + true, smmu_domain); + + /* + * Unfortunately, this can't be leaf-only since we may have + * zapped an entire table. 
+ */ + arm_smmu_atc_inv_domain(smmu_domain, gather->start, size); } static phys_addr_t diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index d65c716d4fc04f..9f5537ef97caff 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -834,10 +834,13 @@ void arm_smmu_remove_pasid(struct arm_smmu_master *master, int arm_smmu_domain_alloc_id(struct arm_smmu_device *smmu, struct arm_smmu_domain *smmu_domain); void arm_smmu_domain_free_id(struct arm_smmu_domain *smmu_domain); -void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid); -void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid, - size_t granule, bool leaf, - struct arm_smmu_domain *smmu_domain); +void arm_smmu_tlb_inv_range_s1(unsigned long iova, size_t size, size_t granule, + bool leaf, struct arm_smmu_domain *smmu_domain); +static inline void arm_smmu_tlb_inv_all_s1(struct arm_smmu_domain *smmu_domain) +{ + arm_smmu_tlb_inv_range_s1(0, 0, PAGE_SIZE, false, smmu_domain); +} + int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, unsigned long iova, size_t size);
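
[Editor's note] To make the shape of the proposal above easier to follow, here
is a minimal, self-contained sketch of the helper it revolves around,
arm_smmu_inv_range_too_big(), with the driver structure reduced to the single
field the check needs. The feature-bit value and the userspace-friendly ilog2
substitute are assumptions for illustration; the logic mirrors the quoted diff.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARM_SMMU_FEAT_RANGE_INV	(1 << 0)	/* bit position illustrative only */

struct arm_smmu_device {
	uint32_t features;
};

/* Decide between per-granule TLBI commands and one ASID/VMID invalidation */
static bool arm_smmu_inv_range_too_big(struct arm_smmu_device *smmu,
				       size_t size, size_t granule)
{
	unsigned int max_ops;

	/* 0 size means invalidate all */
	if (!size || size == SIZE_MAX)
		return true;

	/* Range-capable hardware can cover any size with few commands */
	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV)
		return false;

	/* granule is a power of two, so ctz == ilog2 here */
	max_ops = 1u << (__builtin_ctzl(granule) - 3);
	return size >= (size_t)max_ops * granule;
}

In the proposal the stage-1 and stage-2 range helpers call this first: if it
returns true they issue a single CMDQ_OP_TLBI_NH_ASID/CMDQ_OP_TLBI_EL2_ASID or
CMDQ_OP_TLBI_S12_VMALL, otherwise they fall through to
__arm_smmu_tlb_inv_range() for the per-granule walk.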
On 2024/1/15 23:31, Jason Gunthorpe wrote:
> On Mon, Jan 15, 2024 at 07:40:40PM +0800, Zhang Zekun wrote:
>> [original commit message and call trace quoted in full - snipped]
> What is the rest of the call chain? How did you get into such a large
> invalidation?

We are doing some tests with qemu virtual machines. The testing platform has
96 cores and 960GB of memory. It starts 64 virtual machines, and every
virtual machine uses 8GB of memory. When batch-closing these 64 virtual
machines concurrently, we get the following soft lockup.

WARN: soft lockup - CPU#18 stuck for 11s! [worker:3219]
...
Call trace:
dump_backtrace+0x0/0x200
show_stack+0x20/0x30
dump_stack+0xf0/0x138
watchdog_print_info+0x48/0x54
watchdog_process_before_softlockup+0x9c/0xa0
watchdog_timer_fn+0x1ac/0x2f0
__run_hrtimer+0x98/0x2b4
__hrtimer_run_queues+0xc0/0x13c
hrtimer_interrupt+0x150/0x3e4
arch_timer_handler_phys+0x3c/0x50
handle_percpu_devid_irq+0x90/0x1f4
__handle_domain_irq+0x84/0xfc
gic_handle_irq+0x88/0x2b0
el1_irq+0xb8/0x140
ktime_get+0x3c/0xac
arm_smmu_cmdq_poll_until_not_full+0xb0/0x1a0
arm_smmu_cmdq_issue_cmdlist+0x194/0x5f4
__arm_smmu_tlb_inv_range+0x114/0x22c
arm_smmu_tlb_inv_walk+0x88/0x120
__arm_lpae_unmap+0x188/0x2c0
__arm_lpae_unmap+0x104/0x2c0
arm_lpae_unmap+0x68/0x80
arm_smmu_unmap+0x24/0x40
__iommu_unmap+0xd8/0x210
iommu_unmap_fast+0x18/0x24
unmap_unpin_fast+0x7c/0x140 [vfio_iommu_type1]
vfio_unmap_unpin+0x14c/0x3ac [vfio_iommu_type1]
vfio_remove_dma+0x38/0x124 [vfio_iommu_type1]
vfio_iommu_type1_detach_group+0x4b8/0x4e0 [vfio_iommu_type1]
__vfio_group_unset_container+0x58/0x18c [vfio]
vfio_group_try_dissolve_container+0x80/0x94 [vfio]
vfio_group_put_external_user+0x20/0x54 [vfio]
kvm_vfio_destroy+0xa8/0x12c
kvm_destroy_vm+0x20c/0x300
kvm_put_kvm+0x74/0xa0
kvm_vcpu_release+0x20/0x30
__fput+0xc4/0x270
____fput+0x18/0x24
task_work_run+0xd0/0x1a0
do_exit+0x1d8/0x480
do_group_exit+0x40/0x130
get_signal+0x1f8/0x744
do_signal+0x98/0x204
do_notify_resume+0x15c/0x1e0
work_pending+0xc/0xa4

>> @@ -228,7 +219,7 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
>>  	 */
>>  	size = end - start;
>>  	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
>> -		if (size >= CMDQ_MAX_TLBI_OPS * PAGE_SIZE)
>> +		if (size >= CMDQ_MAX_TLBI_OPS(PAGE_SIZE) * PAGE_SIZE)
>>  			size = 0;
>>  	} else {
>>  		if (size == ULONG_MAX)
> I would like to see the SVA code rely on the common invalidation
> infrastructure, this should be pushed down into the invalidation logic
> not be in the SVA code. With some adjustments we can get common
> functions for all this and push the decision making into the actual
> function.

Thanks for your patch, the patch described below makes sense to me. After
splitting the TLB invalidate operations according to the domain stage,
arm_smmu_tlb_inv_*_s1() can be used both in the SVA code and the original
smmu code, and arm_smmu_tlb_inv_all_s1() can be implemented just by calling
arm_smmu_tlb_inv_range_s1() with a special size. But the current smmu-v3
driver has not moved "asid" into "struct smmu_domain", so we still need to
pass the exact asid from SVA.
> [Jason's proposed diff quoted in full - snipped]
On Thu, Jan 18, 2024 at 07:21:10PM +0800, zhangzekun (A) wrote:
>> What is the rest of the call chain? How did you get into such a large
>> invalidation?
> We are doing some tests with qemu virtual machines. The testing platform has
> 96 cores and 960GB of memory. It starts 64 virtual machines, and every
> virtual machine uses 8GB of memory. When batch-closing these 64 virtual
> machines concurrently, we get the following soft lockup.
>
> WARN: soft lockup - CPU#18 stuck for 11s! [worker:3219]
> [call trace quoted in full - snipped]

I see, that is interesting.

I think iommufd is close to substantially improving this. It already
has the domain destruction ordering:
- Remove all domain attachments
- Read back phys addrs, unmap and unpin
- iommu_domain_free()

Currently smmu continues to issue ASID invalidations for the domain
even though there are no references to it. However this is pretty
pointless here as we are going to free the ASID right away.

I'm going to make a series to allow smmu to support multiple instances on
a single domain. In this case when the domain loses all its attachments it
loses all its instances too and it won't have to do *anything* for
invalidation in this workflow.

Ie your flow will boil down to a single ASID invalidation once the
domain attachments are all removed, then on invalidations during unmap
and no invalidation at free. This would be a significantly faster
teardown I suspect.

> Thanks for your patch, the patch described below makes sense to me. After
> splitting the TLB invalidate operations according to the domain stage,
> arm_smmu_tlb_inv_*_s1() can be used both in the SVA code and the original
> smmu code, and arm_smmu_tlb_inv_all_s1() can be implemented just by calling
> arm_smmu_tlb_inv_range_s1() with a special size. But the current smmu-v3
> driver has not moved "asid" into "struct smmu_domain", so we still need to
> pass the exact asid from SVA.

Yes, it is not intended to be applied on the current kernel, there are
a lot of structural problems with how SVA works there.

Jason
On 18/01/2024 2:22 pm, Jason Gunthorpe wrote:
> On Thu, Jan 18, 2024 at 07:21:10PM +0800, zhangzekun (A) wrote:
>> [test description and call trace quoted in full - snipped]
>
> I see, that is interesting.
>
> I think iommufd is close to substantially improving this. It already
> has the domain destruction ordering:
> - Remove all domain attachments
> - Read back phys addrs, unmap and unpin
> - iommu_domain_free()

I don't think the story here has changed from when I first remember this
being discussed probably 7 or 8 years ago - what VFIO has always really
wanted for this situation is to know when it's doing a complete teardown and
just detach and free the domain, then unpin the pages afterwards, without
wasting any time at all on frivolous unmapping. However so far it's worked
out that minor low-level tweaks like the unmap_fast API have kept the
overhead tolerable enough that nobody's been inspired to attempt the bigger
task.

However for SMMU in particular it's almost certainly confounded by
io_pgtable_tlb_flush_walk() being a bit rubbish for the situation that I bet
is happening here - if we don't have range invalidate, then pessimistically
invalidating a 1GB table at 4KB granularity without even looking isn't
likely to save much time over doing a more accurate recursive invalidation
even in the worst case that the whole thing *does* turn out to be mapped as
pages. But in the likely case that there are at least some
intermediate-level block mappings in there, it's an instant win.
I recall I had a long-standing to-do item to get rid of tlb_flush_walk from
io-pgtable altogether, although that may have been tied in to the freelist
stuff; I'd need to find time to page it all back in to remember exactly
why...

Thanks,
Robin.

> [remainder of quoted message snipped]
On Wed, Jan 24, 2024 at 12:17:43PM +0000, Robin Murphy wrote:
> I don't think the story here has changed from when I first remember this
> being discussed probably 7 or 8 years ago - what VFIO has always really
> wanted for this situation is to know when it's doing a complete teardown
> and just detach and free the domain, then unpin the pages afterwards,
> without wasting any time at all on frivolous unmapping.

What VFIO/iommufd really wants is a fast way to read back and unmap the
PFNs from the domain. Today it has to call iommu_iova_to_phys() and then
unmap which ends up walking the radix many times. Ideally we'd batch read
and unmap in a single call with a single radix walk.

This is because the domain is used to store the 'struct page *' which is
needed to unpin the memory.

Combined with the no-attachments optimization I described it could be
pretty close to optimal.

Jason
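
[Editor's note] For context, the pattern being criticised is roughly the
following. This is a heavily simplified sketch of the vfio_iommu_type1
teardown loop (the real code batches the unmaps and unpins, and the helper
name here is made up); it shows why every page costs several walks of the
page-table radix tree.

#include <linux/iommu.h>
#include <linux/mm.h>

/* Hypothetical helper: one pass over a pinned, mapped IOVA range */
static void unmap_and_unpin_range(struct iommu_domain *domain,
				  unsigned long iova, size_t size)
{
	unsigned long end = iova + size;

	while (iova < end) {
		/* walk #1: read back the physical address to find the page */
		phys_addr_t phys = iommu_iova_to_phys(domain, iova);

		/* walk #2 (plus TLBI traffic): remove the mapping */
		iommu_unmap(domain, iova, PAGE_SIZE);

		if (phys)
			unpin_user_page(phys_to_page(phys));
		iova += PAGE_SIZE;
	}
}

A combined read-back-and-unmap call, as suggested above, would let the driver
return the physical addresses while tearing the mappings down in a single walk.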
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 05722121f00e..164a218a4d41 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -203,15 +203,6 @@ static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd)
 	}
 }
 
-/*
- * Cloned from the MAX_TLBI_OPS in arch/arm64/include/asm/tlbflush.h, this
- * is used as a threshold to replace per-page TLBI commands to issue in the
- * command queue with an address-space TLBI command, when SMMU w/o a range
- * invalidation feature handles too many per-page TLBI commands, which will
- * otherwise result in a soft lockup.
- */
-#define CMDQ_MAX_TLBI_OPS		(1 << (PAGE_SHIFT - 3))
-
 static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
 						struct mm_struct *mm,
 						unsigned long start,
@@ -228,7 +219,7 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
 	 */
 	size = end - start;
 	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
-		if (size >= CMDQ_MAX_TLBI_OPS * PAGE_SIZE)
+		if (size >= CMDQ_MAX_TLBI_OPS(PAGE_SIZE) * PAGE_SIZE)
 			size = 0;
 	} else {
 		if (size == ULONG_MAX)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0ffb1cf17e0b..cecccba17511 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1997,6 +1997,14 @@ static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
 static void arm_smmu_tlb_inv_walk(unsigned long iova, size_t size,
 				  size_t granule, void *cookie)
 {
+	struct arm_smmu_domain *smmu_domain = cookie;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+	    size >= CMDQ_MAX_TLBI_OPS(granule) * granule) {
+		arm_smmu_tlb_inv_context(cookie);
+		return;
+	}
 	arm_smmu_tlb_inv_range_domain(iova, size, granule, false, cookie);
 }
 
@@ -2502,13 +2510,20 @@ static void arm_smmu_iotlb_sync(struct iommu_domain *domain,
 				struct iommu_iotlb_gather *gather)
 {
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	size_t size = gather->end - gather->start + 1;
+	size_t granule = gather->pgsize;
 
-	if (!gather->pgsize)
+	if (!granule)
 		return;
 
-	arm_smmu_tlb_inv_range_domain(gather->start,
-				      gather->end - gather->start + 1,
-				      gather->pgsize, true, smmu_domain);
+	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+	    size >= CMDQ_MAX_TLBI_OPS(granule) * granule) {
+		arm_smmu_tlb_inv_context(smmu_domain);
+		return;
+	}
+	arm_smmu_tlb_inv_range_domain(gather->start, size,
+				      granule, true, smmu_domain);
 }
 
 static phys_addr_t
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 65fb388d5173..a9a7376c0437 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -431,6 +431,16 @@ struct arm_smmu_ste {
 #define MSI_IOVA_BASE			0x8000000
 #define MSI_IOVA_LENGTH			0x100000
 
+/*
+ * Similar to MAX_DVM_OPS in arch/arm64/include/asm/tlbflush.h, this is used
+ * as a threshold to replace per-page TLBI commands to issue in the command
+ * queue with an address-space TLBI command, when SMMU w/o a range invalidation
+ * feature handles too many per-page TLBI commands, which will otherwise result
+ * in a soft lockup.
+ */
+
+#define CMDQ_MAX_TLBI_OPS(granule)	(1 << (ilog2(granule) - 3))
+
 enum pri_resp {
 	PRI_RESP_DENY = 0,
 	PRI_RESP_FAIL = 1,