mm/hugetlb: Fix pgtable lock on pmd sharing

Message ID 20230612160420.809818-1-peterx@redhat.com
State New
Headers
Series mm/hugetlb: Fix pgtable lock on pmd sharing |

Commit Message

Peter Xu June 12, 2023, 4:04 p.m. UTC
  Huge pmd sharing operates on PUD not PMD, huge_pte_lock() is not suitable
in this case because it should only work for last level pte changes, while
pmd sharing is always one level higher.

Meanwhile, here we're locking over the spte pgtable lock which is even not
a lock for current mm but someone else's.

It seems even racy on operating on the lock, as after put_page() of the
spte pgtable page logically the page can be released, so at least the
spin_unlock() needs to be done after the put_page().

No report I am aware, I'm not even sure whether it'll just work on taking
the spte pmd lock, because while we're holding i_mmap read lock it probably
means the vma interval tree is frozen, all pte allocators over this pud
entry could always find the specific svma and spte page, so maybe they'll
serialize on this spte page lock?  Even so, doesn't seem to be expected.
It just seems to be an accident of cb900f412154.

Fix it with the proper pud lock (which is the mm's page_table_lock).

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Fixes: cb900f412154 ("mm, hugetlb: convert hugetlbfs to use split pmd lock")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
  

Comments

Mike Kravetz June 12, 2023, 8:44 p.m. UTC | #1
On 06/12/23 12:04, Peter Xu wrote:
> Huge pmd sharing operates on PUD not PMD, huge_pte_lock() is not suitable
> in this case because it should only work for last level pte changes, while
> pmd sharing is always one level higher.

Right!  That lock does not prevent someone else from concurrently modifying
the PUD.

> Meanwhile, here we're locking over the spte pgtable lock which is even not
> a lock for current mm but someone else's.
> 
> It seems even racy on operating on the lock, as after put_page() of the
> spte pgtable page logically the page can be released, so at least the
> spin_unlock() needs to be done after the put_page().

Agree.

> No report I am aware, I'm not even sure whether it'll just work on taking
> the spte pmd lock, because while we're holding i_mmap read lock it probably
> means the vma interval tree is frozen, all pte allocators over this pud
> entry could always find the specific svma and spte page, so maybe they'll
> serialize on this spte page lock?

It seems they would serialize IF they were trying to instantiate the same
shared page.  However, I suppose another thread could be trying to
instantiate something totally different in the VA range represented by that
PUD.  In this case, it seems like there would be no synchronization.

>                                    Even so, doesn't seem to be expected.
> It just seems to be an accident of cb900f412154.
> 
> Fix it with the proper pud lock (which is the mm's page_table_lock).
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Fixes: cb900f412154 ("mm, hugetlb: convert hugetlbfs to use split pmd lock")
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

Agree with this change.  But, it does make one wonder if the pud_clear()
in huge_pmd_unshare is safe.  Like here, we really should be holding the
higher level lock but are holding the PMD lock.
  
Peter Xu June 12, 2023, 9:35 p.m. UTC | #2
Hi, Mike!

On Mon, Jun 12, 2023 at 01:44:18PM -0700, Mike Kravetz wrote:
> Agree with this change.  But, it does make one wonder if the pud_clear()
> in huge_pmd_unshare is safe.  Like here, we really should be holding the
> higher level lock but are holding the PMD lock.

The vma write lock?  My memory tells me that you worked on the vma lock and
one major reason was for that, but maybe I missed something?

I did check again and all call sites should have that lock held, which
looks fine here.  One thing worth mention is pmd unshare should also always
be gup-fast-safe because it never really releases the pgtable, unlike thp
collapse.
  
Mike Kravetz June 12, 2023, 9:44 p.m. UTC | #3
On 06/12/23 17:35, Peter Xu wrote:
> Hi, Mike!
> 
> On Mon, Jun 12, 2023 at 01:44:18PM -0700, Mike Kravetz wrote:
> > Agree with this change.  But, it does make one wonder if the pud_clear()
> > in huge_pmd_unshare is safe.  Like here, we really should be holding the
> > higher level lock but are holding the PMD lock.
> 
> The vma write lock?  My memory tells me that you worked on the vma lock and
> one major reason was for that, but maybe I missed something?
> 
> I did check again and all call sites should have that lock held, which
> looks fine here.  One thing worth mention is pmd unshare should also always
> be gup-fast-safe because it never really releases the pgtable, unlike thp
> collapse.

Yes, Duh!

In the unshare case we KNOW what is mapped (the vma) via the PUD.  So,
we certainly am synchronized with that access.  I was just trying to
imagine some other situation where we may be walking the page table
without the vma lock held.  However, you already took care of those call
sites.

Thanks!

You or Andrew can add,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
  

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfa412d8cb30..270ec0ecd5a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7133,7 +7133,6 @@  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long saddr;
 	pte_t *spte = NULL;
 	pte_t *pte;
-	spinlock_t *ptl;
 
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
@@ -7154,7 +7153,7 @@  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!spte)
 		goto out;
 
-	ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
+	spin_lock(&mm->page_table_lock);
 	if (pud_none(*pud)) {
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
@@ -7162,7 +7161,7 @@  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	} else {
 		put_page(virt_to_page(spte));
 	}
-	spin_unlock(ptl);
+	spin_unlock(&mm->page_table_lock);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	i_mmap_unlock_read(mapping);