Message ID | 20230625022808.42942-1-Jingqi.liu@intel.com |
---|---|
State | New |
Series | iommu/vt-d: debugfs: Increment the reference count of page table page |
Commit Message
Liu, Jingqi
June 25, 2023, 2:28 a.m. UTC
There may be a race with the iommu_unmap() interface when traversing a
page table.
When debugfs traverses an IOMMU page table, iommu_unmap() may clear
entries and free the page table pages pointed to by the entries.
So debugfs may read invalid or freed pages.
To avoid this, increment the refcount of a page table page before
traversing the page, and decrement its refcount after traversing it.
Signed-off-by: Jingqi Liu <Jingqi.liu@intel.com>
---
drivers/iommu/intel/debugfs.c | 36 +++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)
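
For context, below is a minimal sketch of the pin -> revalidate -> unpin
pattern the commit message describes. It is not code from the patch:
try_walk_child() is a hypothetical helper, and struct dma_pte and
dma_pte_addr() are assumed from drivers/iommu/intel/iommu.h.

	/*
	 * Sketch only: pin the child page-table page, reread the entry to
	 * confirm it still points at that page, and unpin on any mismatch.
	 * 'snapshot' is the entry value read before the call.
	 */
	static bool try_walk_child(struct dma_pte *pde, u64 snapshot)
	{
		struct page *pg = pfn_to_page(dma_pte_addr(pde) >> PAGE_SHIFT);

		if (!get_page_unless_zero(pg))
			return false;		/* refcount was zero: page freed */

		if (READ_ONCE(pde->val) != snapshot) {
			put_page(pg);		/* entry changed under us */
			return false;
		}

		/* the child table at phys_to_virt(dma_pte_addr(pde)) is now pinned */
		put_page(pg);
		return true;
	}

get_page_unless_zero() only takes a reference if the refcount is currently
non-zero, so a page that has already been freed is skipped rather than
dereferenced.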
Comments
> From: Liu, Jingqi <jingqi.liu@intel.com>
> Sent: Sunday, June 25, 2023 10:28 AM
>
> There may be a race with iommu_unmap() interface when traversing a page
> table.
>
> When debugfs traverses an IOMMU page table, iommu_unmap() may clear
> entries and free the page table pages pointed to by the entries.
> So debugfs may read invalid or freed pages.
>
> To avoid this, increment the refcount of a page table page before
> traversing the page, and decrement its refcount after traversing it.

I'm not sure how this race can be fully avoided w/o cooperation in the
unmap path...

> Signed-off-by: Jingqi Liu <Jingqi.liu@intel.com>
> ---
>  drivers/iommu/intel/debugfs.c | 36 +++++++++++++++++++++++++++++++++--
>  1 file changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
> index 1f925285104e..d228e1580aec 100644
> --- a/drivers/iommu/intel/debugfs.c
> +++ b/drivers/iommu/intel/debugfs.c
> @@ -333,9 +333,41 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,
>  		path[level] = pde->val;
>  		if (dma_pte_superpage(pde) || level == 1)
>  			dump_page_info(m, start, path);
> -		else
> -			pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)),
> +		else {
> +			struct page *pg;
> +			u64 pte_addr;
> +
> +			/*
> +			 * The entry references a Page-Directory Table
> +			 * or a Page Table.
> +			 */
> +retry:
> +			pte_addr = dma_pte_addr(pde);
> +			pg = pfn_to_page(pte_addr >> PAGE_SHIFT);
> +			if (!get_page_unless_zero(pg))
> +				/*
> +				 * If this page has a refcount of zero,
> +				 * it has been freed, or will be freed.
> +				 */
> +				continue;
> +
> +			/* Check if the value of the entry is changed. */
> +			if (pde->val != path[level]) {
> +				put_page(pg);
> +
> +				if (!dma_pte_present(pde))
> +					/* The entry is invalid. Skip it. */
> +					continue;
> +
> +				/* The entry has been updated. */
> +				path[level] = pde->val;
> +				goto retry;
> +			}
> +
> +			pgtable_walk_level(m, phys_to_virt(pte_addr),
>  					   level - 1, start, path);

What about pde->val getting cleared after phys_to_virt(pte_addr) leading
to all the levels below 'pg' being freed? In that case this code still walks
the stale 'pg' content which however all point to invalid pages then.

It's probably simpler to just check the format of each PTE (e.g. whether
any reserved bit is set) and if abnormal then break the current level of
walk.

> +			put_page(pg);
> +		}
>  		path[level] = 0;
>  	}
> }
> --
> 2.21.3
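
For reference, a check along the lines Kevin suggests might look like the
sketch below. The mask is purely illustrative — an assumption, not the
VT-d spec's actual reserved-bit layout, which varies with entry level and
reported capabilities:

	/*
	 * Hypothetical sketch: bail out of the current level of the walk
	 * if an entry carries bits a well-formed non-leaf entry never has.
	 * SLPTE_RSVD_MASK is illustrative only.
	 */
	#define SLPTE_RSVD_MASK		GENMASK_ULL(62, 52)

	static bool dma_pte_format_ok(const struct dma_pte *pte)
	{
		return !(pte->val & SLPTE_RSVD_MASK);
	}

As the follow-ups note, the catch is that a stale entry may still be
perfectly well-formed, so a format check alone cannot distinguish valid
from stale.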
On 7/3/2023 3:00 PM, Tian, Kevin wrote:
>> From: Liu, Jingqi <jingqi.liu@intel.com>
>> Sent: Sunday, June 25, 2023 10:28 AM
>>
>> There may be a race with iommu_unmap() interface when traversing a page
>> table.
>>
>> When debugfs traverses an IOMMU page table, iommu_unmap() may clear
>> entries and free the page table pages pointed to by the entries.
>> So debugfs may read invalid or freed pages.
>>
>> To avoid this, increment the refcount of a page table page before
>> traversing the page, and decrement its refcount after traversing it.
> I'm not sure how this race can be fully avoided w/o cooperation in the
> unmap path...
Thanks. Indeed, fully avoiding the race needs cooperation from the unmap
path. :)

>> Signed-off-by: Jingqi Liu <Jingqi.liu@intel.com>
>> ---
>>  drivers/iommu/intel/debugfs.c | 36 +++++++++++++++++++++++++++++++++--
>>  1 file changed, 34 insertions(+), 2 deletions(-)
>>
>> [... full patch hunk snipped; quoted in Kevin's message above ...]
>>
>> +			pgtable_walk_level(m, phys_to_virt(pte_addr),
>>  					   level - 1, start, path);
> What about pde->val getting cleared after phys_to_virt(pte_addr) leading
> to all the levels below 'pg' being freed? In that case this code still walks
> the stale 'pg' content which however all point to invalid pages then.
There are 2 cases for a page pointed to by a PTE below 'pg':

1) The page has been freed.
   It is skipped after the following check, and debugfs won't walk any
   further:

	if (!get_page_unless_zero(pg))
		/*
		 * If this page has a refcount of zero,
		 * it has been freed, or will be freed.
		 */
		continue;

2) The page has not been freed, but its content is stale.
   Dumping this stale content seems acceptable for debugfs.

If all the PTEs below 'pg' were cleared before being freed in the unmap
path, the following check could avoid walking these stale pages:

	if (!dma_pte_present(pde))
		continue;

> It's probably simpler to just check the format of each PTE (e.g. whether
> any reserved bit is set) and if abnormal then break the current level of
> walk.
Good point. The bit used for such a check would have to distinguish a
valid entry from a stale one, but the PTE value may not change at all
between the two states.

Thanks,
Jingqi
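
As context for the exchange above: dma_pte_present() only tests the
read/write permission bits of the entry (definition reproduced from memory
from drivers/iommu/intel/iommu.h), which is why a stale but unmodified
entry still reads as present:

	static inline bool dma_pte_present(struct dma_pte *pte)
	{
		/* bits 0 and 1 are the read/write permissions */
		return (pte->val & 3) != 0;
	}

So this check only helps if the unmap path actually clears entries before
freeing the pages they point to — the cooperation discussed above.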
Hi Kevin,

On 7/3/2023 10:48 PM, Liu, Jingqi wrote:
> On 7/3/2023 3:00 PM, Tian, Kevin wrote:
> [... earlier quoted discussion snipped; see the messages above ...]
>> It's probably simpler to just check the format of each PTE (e.g. whether
>> any reserved bit is set) and if abnormal then break the current level of
>> walk.

Thanks for your suggestion.

If the PTE references a page directory/table, bit 3 is ignored by the
hardware according to the spec. In the IOMMU driver, bit 3 is set to 0 by
default.

How about setting bit 3 of the corresponding PTE to 1 in the unmap path to
indicate that the page pointed to by the PTE is stale?

The code modified in the unmap path is as follows:

static void dma_pte_list_pagetables(struct dmar_domain *domain,
				    int level, struct dma_pte *pte,
				    struct list_head *freelist)
{
	struct page *pg;

	pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
	list_add_tail(&pg->lru, freelist);
+
+	pte->val |= BIT_ULL(3);
	......

Then during the debugfs traversal, check bit 3 of the PTE before calling
pgtable_walk_level(). If this bit is 1, debugfs stops traversing. The
related code in debugfs is like below:

+retry:
+	pte_addr = dma_pte_addr(pde);
+	pg = pfn_to_page(pte_addr >> PAGE_SHIFT);
+	if (!get_page_unless_zero(pg))
+		/*
+		 * If this page has a refcount of zero,
+		 * it has been freed, or will be freed.
+		 */
+		continue;
+
+	/*
+	 * Check if the page pointed to
+	 * by the PTE is stale.
+	 */
+	if (pde->val & BIT_ULL(3)) {
+		put_page(pg);
+		continue;
+	}
+
+	/* Check if the value of this entry is changed. */
+	if (pde->val != path[level]) {
+		put_page(pg);
+
+		if (!dma_pte_present(pde))
+			/* This entry is invalid. Skip it. */
+			continue;
+
+		/* The entry has been updated. */
+		path[level] = pde->val;
+		goto retry;
+	}
+
+	pgtable_walk_level(m, phys_to_virt(pte_addr), level - 1, start, path);

Thanks,
Jingqi
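
If this proposal were adopted, the convention could be wrapped in small
helpers so the unmap path and debugfs agree on the encoding. A sketch with
hypothetical names (neither helper exists in the driver):

	/*
	 * Hypothetical wrappers for the proposed convention: bit 3 of a
	 * non-leaf entry is ignored by hardware, so software can reuse it
	 * as a "stale" marker.  Names are illustrative only.
	 */
	#define DMA_PTE_SW_STALE	BIT_ULL(3)

	static inline void dma_pte_mark_stale(struct dma_pte *pte)
	{
		pte->val |= DMA_PTE_SW_STALE;
	}

	static inline bool dma_pte_is_stale(struct dma_pte *pte)
	{
		return !!(READ_ONCE(pte->val) & DMA_PTE_SW_STALE);
	}

One caveat, consistent with the discussion above: the marker only covers
pages queued for freeing through dma_pte_list_pagetables(); any other free
path would need the same treatment.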
diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index 1f925285104e..d228e1580aec 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -333,9 +333,41 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,
 		path[level] = pde->val;
 		if (dma_pte_superpage(pde) || level == 1)
 			dump_page_info(m, start, path);
-		else
-			pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)),
+		else {
+			struct page *pg;
+			u64 pte_addr;
+
+			/*
+			 * The entry references a Page-Directory Table
+			 * or a Page Table.
+			 */
+retry:
+			pte_addr = dma_pte_addr(pde);
+			pg = pfn_to_page(pte_addr >> PAGE_SHIFT);
+			if (!get_page_unless_zero(pg))
+				/*
+				 * If this page has a refcount of zero,
+				 * it has been freed, or will be freed.
+				 */
+				continue;
+
+			/* Check if the value of the entry is changed. */
+			if (pde->val != path[level]) {
+				put_page(pg);
+
+				if (!dma_pte_present(pde))
+					/* The entry is invalid. Skip it. */
+					continue;
+
+				/* The entry has been updated. */
+				path[level] = pde->val;
+				goto retry;
+			}
+
+			pgtable_walk_level(m, phys_to_virt(pte_addr),
 					   level - 1, start, path);
+			put_page(pg);
+		}
 		path[level] = 0;
 	}
 }