[04/31] mm/pgtable: allow pte_offset_map[_lock]() to fail

Message ID 8218ffdc-8be-54e5-0a8-83f5542af283@google.com
State New
Headers
Series mm: allow pte_offset_map[_lock]() to fail |

Commit Message

Hugh Dickins May 22, 2023, 4:53 a.m. UTC
  Make pte_offset_map() a wrapper for __pte_offset_map() (optionally
outputs pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.

__pte_offset_map() do pmdval validation (including pmd_clear_bad()
when pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the
lock, trying again if it changed.

No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done
to cover the imminent case, but we expect to generalize it later, and
it makes a mess of where to do the pmd_bad() clearing.

Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock.  This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.

Update corresponding Documentation.

Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier).  But comment
where they will go, so that it's easy to add them for experiments.  And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 Documentation/mm/split_page_table_lock.rst | 17 ++++---
 include/linux/mm.h                         | 27 +++++++----
 include/linux/pgtable.h                    | 22 ++++++---
 mm/pgtable-generic.c                       | 56 ++++++++++++++++++++++
 4 files changed, 101 insertions(+), 21 deletions(-)
  

Comments

Qi Zheng May 22, 2023, 11:17 a.m. UTC | #1
Hi Hugh,

On 2023/5/22 12:53, Hugh Dickins wrote:

[...]

>   
> @@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
>   }
>   #endif
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> +{
> +	pmd_t pmdval;
> +
> +	/* rcu_read_lock() to be added later */
> +	pmdval = pmdp_get_lockless(pmd);
> +	if (pmdvalp)
> +		*pmdvalp = pmdval;
> +	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> +		goto nomap;
> +	if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
> +		goto nomap;

Will the follow-up patch deal with the above situation specially?
Otherwise, maybe it can be changed to the following check method?

	if (unlikely(pmd_none(pmdval) || pmd_leaf(pmdval)))
		goto nomap;

> +	if (unlikely(pmd_bad(pmdval))) {
> +		pmd_clear_bad(pmd);
> +		goto nomap;
> +	}
> +	return __pte_map(&pmdval, addr);
> +nomap:
> +	/* rcu_read_unlock() to be added later */
> +	return NULL;
> +}
> +
> +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> +			     unsigned long addr, spinlock_t **ptlp)
> +{
> +	pmd_t pmdval;
> +	pte_t *pte;
> +
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (likely(pte))
> +		*ptlp = pte_lockptr(mm, &pmdval);
> +	return pte;
> +}
> +
> +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> +			     unsigned long addr, spinlock_t **ptlp)
> +{
> +	spinlock_t *ptl;
> +	pmd_t pmdval;
> +	pte_t *pte;
> +again:
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (unlikely(!pte))
> +		return pte;
> +	ptl = pte_lockptr(mm, &pmdval);
> +	spin_lock(ptl);
> +	if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> +		*ptlp = ptl;
> +		return pte;
> +	}
> +	pte_unmap_unlock(pte, ptl);
> +	goto again;
> +}
  
Hugh Dickins May 24, 2023, 2:22 a.m. UTC | #2
On Mon, 22 May 2023, Qi Zheng wrote:
> On 2023/5/22 12:53, Hugh Dickins wrote:
> 
> [...]
> 
> >   @@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
> > unsigned long address,
> >   }
> >   #endif
> >   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +
> > +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> > +{
> > +	pmd_t pmdval;
> > +
> > +	/* rcu_read_lock() to be added later */
> > +	pmdval = pmdp_get_lockless(pmd);
> > +	if (pmdvalp)
> > +		*pmdvalp = pmdval;
> > +	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> > +		goto nomap;
> > +	if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
> > +		goto nomap;
> 
> Will the follow-up patch deal with the above situation specially?

No, the follow-up patch will only insert the rcu_read_lock() and unlock();
and do something (something!) about the PAE mismatched halves case.

> Otherwise, maybe it can be changed to the following check method?
> 
> 	if (unlikely(pmd_none(pmdval) || pmd_leaf(pmdval)))
> 		goto nomap;

Maybe, but I'm not keen.  Partly just because pmd_leaf() is quite a
(good) new invention (I came across a few instances in updating to
the current tree), whereas here I'm just following the old examples,
from zap_pmd_range() etc.  I'd have to spend a while getting to know
pmd_leaf(), and its interaction with strange gotchas like pmd_present().

And partly because I do want as many corrupt cases as possible to
reach the pmd_bad() check below, so generating a warning (and clear).
I might be wrong, I haven't checked through the architectures and how
pmd_leaf() is implemented in each, but my guess is that pmd_leaf()
will tend to miss the pmd_bad() check.

But if you can demonstrate a performance improvement from using
pmd_leaf() there, I expect many people would prefer that improvement
to badness catching: send a patch later to make that change if it's
justified.

Thanks a lot for all your attention to these.

Hugh

> 
> > +	if (unlikely(pmd_bad(pmdval))) {
> > +		pmd_clear_bad(pmd);
> > +		goto nomap;
> > +	}
> > +	return __pte_map(&pmdval, addr);
> > +nomap:
> > +	/* rcu_read_unlock() to be added later */
> > +	return NULL;
> > +}
  
Qi Zheng May 24, 2023, 3:11 a.m. UTC | #3
On 2023/5/24 10:22, Hugh Dickins wrote:
> On Mon, 22 May 2023, Qi Zheng wrote:
>> On 2023/5/22 12:53, Hugh Dickins wrote:
>>
>> [...]
>>
>>>    @@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
>>> unsigned long address,
>>>    }
>>>    #endif
>>>    #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>> +
>>> +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>> +{
>>> +	pmd_t pmdval;
>>> +
>>> +	/* rcu_read_lock() to be added later */
>>> +	pmdval = pmdp_get_lockless(pmd);
>>> +	if (pmdvalp)
>>> +		*pmdvalp = pmdval;
>>> +	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>> +		goto nomap;
>>> +	if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
>>> +		goto nomap;
>>
>> Will the follow-up patch deal with the above situation specially?
> 
> No, the follow-up patch will only insert the rcu_read_lock() and unlock();
> and do something (something!) about the PAE mismatched halves case.
> 
>> Otherwise, maybe it can be changed to the following check method?
>>
>> 	if (unlikely(pmd_none(pmdval) || pmd_leaf(pmdval)))
>> 		goto nomap;
> 
> Maybe, but I'm not keen.  Partly just because pmd_leaf() is quite a
> (good) new invention (I came across a few instances in updating to
> the current tree), whereas here I'm just following the old examples,
> from zap_pmd_range() etc.  I'd have to spend a while getting to know
> pmd_leaf(), and its interaction with strange gotchas like pmd_present().
> 
> And partly because I do want as many corrupt cases as possible to
> reach the pmd_bad() check below, so generating a warning (and clear).
> I might be wrong, I haven't checked through the architectures and how
> pmd_leaf() is implemented in each, but my guess is that pmd_leaf()
> will tend to miss the pmd_bad() check.

IIUC, pmd_leaf() is just for checking a leaf mapped PMD, and will
not cover pmd_bad() case. Can see the examples in vmalloc_to_page()
and apply_to_pmd_range().

> 
> But if you can demonstrate a performance improvement from using
> pmd_leaf() there, I expect many people would prefer that improvement
> to badness catching: send a patch later to make that change if it's
> justified.

Probably not a lot of performance gain, just makes the check more
concise.

Thanks,
Qi

> 
> Thanks a lot for all your attention to these.
> 
> Hugh
> 
>>
>>> +	if (unlikely(pmd_bad(pmdval))) {
>>> +		pmd_clear_bad(pmd);
>>> +		goto nomap;
>>> +	}
>>> +	return __pte_map(&pmdval, addr);
>>> +nomap:
>>> +	/* rcu_read_unlock() to be added later */
>>> +	return NULL;
>>> +}
  
Aneesh Kumar K.V July 5, 2023, 2:48 p.m. UTC | #4
Hi Hugh,

Sorry for not checking about this before. I am looking at a kernel
crash (BUG_ON()) on ppc64 with 4K page size. The reason we hit
BUG_ON() is beause we have pmd_same calling BUG_ON on 4K with hash
translation. We don't support THP with 4k page size and hash
translation.

Hugh Dickins <hughd@google.com> writes:

....

 +
> +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> +			     unsigned long addr, spinlock_t **ptlp)
> +{
> +	pmd_t pmdval;
> +	pte_t *pte;
> +
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (likely(pte))
> +		*ptlp = pte_lockptr(mm, &pmdval);
> +	return pte;
> +}
> +
> +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> +			     unsigned long addr, spinlock_t **ptlp)
> +{
> +	spinlock_t *ptl;
> +	pmd_t pmdval;
> +	pte_t *pte;
> +again:
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (unlikely(!pte))
> +		return pte;
> +	ptl = pte_lockptr(mm, &pmdval);
> +	spin_lock(ptl);
> +	if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> +		*ptlp = ptl;
> +		return pte;
> +	}
> +	pte_unmap_unlock(pte, ptl);
> +	goto again;
> +}

What is expected by that pmd_same check? We are holding pte lock
and not pmd lock. So contents of pmd can change.

-aneesh
  
Hugh Dickins July 5, 2023, 10:26 p.m. UTC | #5
Hi Aneesh,

On Wed, 5 Jul 2023, Aneesh Kumar K.V wrote:
> 
> Hi Hugh,
> 
> Sorry for not checking about this before.

I was about to say "No problem" - but it appears I would be lying!

> I am looking at a kernel
> crash (BUG_ON()) on ppc64 with 4K page size. The reason we hit
> BUG_ON() is beause we have pmd_same calling BUG_ON on 4K with hash
> translation. We don't support THP with 4k page size and hash
> translation.

I misunderstood you at first.  I was trying to work out what in that
context might lead to *pmd changing suddenly, was going to ask for
stack backtrace (in faulting? or in copying mm? or?), whether you
have PER_VMA_LOCK enabled, etc. etc.

Then I looked at the source: oh, that is gross, and not something
I had expected at all.

> > +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> > +			     unsigned long addr, spinlock_t **ptlp)
> > +{
> > +	spinlock_t *ptl;
> > +	pmd_t pmdval;
> > +	pte_t *pte;
> > +again:
> > +	pte = __pte_offset_map(pmd, addr, &pmdval);
> > +	if (unlikely(!pte))
> > +		return pte;
> > +	ptl = pte_lockptr(mm, &pmdval);
> > +	spin_lock(ptl);
> > +	if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> > +		*ptlp = ptl;
> > +		return pte;
> > +	}
> > +	pte_unmap_unlock(pte, ptl);
> > +	goto again;
> > +}
> 
> What is expected by that pmd_same check? We are holding pte lock
> and not pmd lock. So contents of pmd can change.

And you don't need me to answer that question: the answer is in the
"likely".  We do not expect *pmd to change there (though maybe some
ancillary bits of it, like "accessed"), unless the page table is on
its way to being freed; and other locking higher up (mmap_lock or
rmap lock) prevent all possibilities of that at present.  Later, we
arrange to hold pte lock as well as pmd lock when removing page table.

So the obvious quick fix is:

--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -138,8 +138,7 @@ static inline int hash__pmd_trans_huge(p
 
 static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
-	BUG();
-	return 0;
+	return 1;
 }
 
 static inline pmd_t hash__pmd_mkhuge(pmd_t pmd)

But I hope you will reject that as almost as gross, and instead commit
a patch which makes hash__pmd_same() ... check whether the pmd_ts are the
same - as in ppc64's other implementations.  That will save having to change
it again, when/if someone extends the current replace-page-table-by-THP work
by non-THP work to remove empty page tables without mmap_lock or rmap lock.

Thanks for finding this, Aneesh, and I'm sorry I didn't notice it before,
and I'm sorry to have given you trouble... but really, that BUG(),

(H)ugh!
  

Patch

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index 50ee0dfc95be..a834fad9de12 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -14,15 +14,20 @@  tables. Access to higher level tables protected by mm->page_table_lock.
 There are helpers to lock/unlock a table and other accessor functions:
 
  - pte_offset_map_lock()
-	maps pte and takes PTE table lock, returns pointer to the taken
-	lock;
+	maps PTE and takes PTE table lock, returns pointer to PTE with
+	pointer to its PTE table lock, or returns NULL if no PTE table;
+ - pte_offset_map_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map()
+	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
+ - pte_unmap()
+	unmaps PTE table;
  - pte_unmap_unlock()
 	unlocks and unmaps PTE table;
  - pte_alloc_map_lock()
-	allocates PTE table if needed and take the lock, returns pointer
-	to taken lock or NULL if allocation failed;
- - pte_lockptr()
-	returns pointer to PTE table lock;
+	allocates PTE table if needed and takes its lock, returns pointer to
+	PTE with pointer to its lock, or returns NULL if allocation failed;
  - pmd_lock()
 	takes PMD table lock, returns pointer to taken lock;
  - pmd_lockptr()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..3c2e56980853 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2787,14 +2787,25 @@  static inline void pgtable_pte_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
-({							\
-	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
-	pte_t *__pte = pte_offset_map(pmd, address);	\
-	*(ptlp) = __ptl;				\
-	spin_lock(__ptl);				\
-	__pte;						\
-})
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
+static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
+{
+	return __pte_offset_map(pmd, addr, NULL);
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, spinlock_t **ptlp);
+static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, spinlock_t **ptlp)
+{
+	pte_t *pte;
+
+	__cond_lock(*ptlp, pte = __pte_offset_map_lock(mm, pmd, addr, ptlp));
+	return pte;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94235ff2706e..3fabbb018557 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -94,14 +94,22 @@  static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_offset_kernel pte_offset_kernel
 #endif
 
-#if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address)				\
-	((pte_t *)kmap_local_page(pmd_page(*(dir))) +		\
-	 pte_index((address)))
-#define pte_unmap(pte) kunmap_local((pte))
+#ifdef CONFIG_HIGHPTE
+#define __pte_map(pmd, address) \
+	((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
+#define pte_unmap(pte)	do {	\
+	kunmap_local((pte));	\
+	/* rcu_read_unlock() to be added later */	\
+} while (0)
 #else
-#define pte_offset_map(dir, address)	pte_offset_kernel((dir), (address))
-#define pte_unmap(pte) ((void)(pte))	/* NOP */
+static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
+{
+	return pte_offset_kernel(pmd, address);
+}
+static inline void pte_unmap(pte_t *pte)
+{
+	/* rcu_read_unlock() to be added later */
+}
 #endif
 
 /* Find an entry in the second-level page table.. */
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d2fc52bffafc..c7ab18a5fb77 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -10,6 +10,8 @@ 
 #include <linux/pagemap.h>
 #include <linux/hugetlb.h>
 #include <linux/pgtable.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/mm_inline.h>
 #include <asm/tlb.h>
 
@@ -229,3 +231,57 @@  pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
+{
+	pmd_t pmdval;
+
+	/* rcu_read_lock() to be added later */
+	pmdval = pmdp_get_lockless(pmd);
+	if (pmdvalp)
+		*pmdvalp = pmdval;
+	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
+		goto nomap;
+	if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+		goto nomap;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		goto nomap;
+	}
+	return __pte_map(&pmdval, addr);
+nomap:
+	/* rcu_read_unlock() to be added later */
+	return NULL;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+			     unsigned long addr, spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	return pte;
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+			     unsigned long addr, spinlock_t **ptlp)
+{
+	spinlock_t *ptl;
+	pmd_t pmdval;
+	pte_t *pte;
+again:
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (unlikely(!pte))
+		return pte;
+	ptl = pte_lockptr(mm, &pmdval);
+	spin_lock(ptl);
+	if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+		*ptlp = ptl;
+		return pte;
+	}
+	pte_unmap_unlock(pte, ptl);
+	goto again;
+}