[02/12] mm/pgtable: add PAE safety to __pte_offset_map()

Message ID 923480d5-35ab-7cac-79d0-343d16e29318@google.com
State New
Headers
Series mm: free retracted page table by RCU |

Commit Message

Hugh Dickins May 29, 2023, 6:16 a.m. UTC
  There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
a pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests.  It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that
Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.
It is not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/pgtable-generic.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
  

Comments

Matthew Wilcox May 29, 2023, 1:56 p.m. UTC | #1
On Sun, May 28, 2023 at 11:16:16PM -0700, Hugh Dickins wrote:
> +#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
> +	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
> +/*
> + * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
> + * the barriers in pmdp_get_lockless() cannot guarantee that the value in
> + * pmd_high actually belongs with the value in pmd_low; but holding interrupts
> + * off blocks the TLB flush between present updates, which guarantees that a
> + * successful __pte_offset_map() points to a page from matched halves.
> + */
> +#define config_might_irq_save(flags)	local_irq_save(flags)
> +#define config_might_irq_restore(flags)	local_irq_restore(flags)
> +#else
> +#define config_might_irq_save(flags)
> +#define config_might_irq_restore(flags)

I don't like the name.  It should indicate that it's PMD-related, so
pmd_read_start(flags) / pmd_read_end(flags)?
  
Jason Gunthorpe May 31, 2023, 7:32 p.m. UTC | #2
On Sun, May 28, 2023 at 11:16:16PM -0700, Hugh Dickins wrote:
> There is a faint risk that __pte_offset_map(), on a 32-bit architecture
> with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
> a pmdval assembled from a pmd_low and a pmd_high which never belonged
> together: their combination not pointing to a page table at all, perhaps
> not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
> 
> Guard against that (on such configs) by local_irq_save() blocking TLB
> flush between present updates, as linux/pgtable.h suggests.  It's only
> needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
> __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
> lock, would just send it back to __pte_offset_map() again.

What about the other places calling pmdp_get_lockless ? It seems like
this is quietly making it part of the API that the caller must hold
the IPIs off.

And Jann had a note that this approach used by the lockless functions
doesn't work anyhow:

https://lore.kernel.org/linux-mm/CAG48ez3h-mnp9ZFC10v+-BW_8NQvxbwBsMYJFP8JX31o0B17Pg@mail.gmail.com/

Though we never fixed it, AFAIK..

Jason
  
Hugh Dickins June 2, 2023, 5:35 a.m. UTC | #3
On Wed, 31 May 2023, Jason Gunthorpe wrote:
> On Sun, May 28, 2023 at 11:16:16PM -0700, Hugh Dickins wrote:
> > There is a faint risk that __pte_offset_map(), on a 32-bit architecture
> > with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
> > a pmdval assembled from a pmd_low and a pmd_high which never belonged
> > together: their combination not pointing to a page table at all, perhaps
> > not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
> > 
> > Guard against that (on such configs) by local_irq_save() blocking TLB
> > flush between present updates, as linux/pgtable.h suggests.  It's only
> > needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
> > __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
> > lock, would just send it back to __pte_offset_map() again.
> 
> What about the other places calling pmdp_get_lockless ? It seems like
> this is quietly making it part of the API that the caller must hold
> the IPIs off.

No, I'm making no judgment of other places where pmdp_get_lockless() is
used: examination might show that some need more care, but I'll just
assume that each is taking as much care as it needs.

But here where I'm making changes, I do see that we need this extra care.

> 
> And Jann had a note that this approach used by the lockless functions
> doesn't work anyhow:
> 
> https://lore.kernel.org/linux-mm/CAG48ez3h-mnp9ZFC10v+-BW_8NQvxbwBsMYJFP8JX31o0B17Pg@mail.gmail.com/

Thanks a lot for the link: I don't know why, but I never saw that mail
thread at all before.  I have not fully digested it yet, to be honest:
MADV_DONTNEED, doesn't flush TLB yet, etc - I'll have to get into the
right frame of mind for that.

> 
> Though we never fixed it, AFAIK..

I'm certainly depending very much on pmdp_get_lockless(): and hoping to
find its case is easier to defend than at the ptep_get_lockless() level.

Thanks,
Hugh
  

Patch

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 674671835631..d28b63386cef 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,32 @@  pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+#define config_might_irq_save(flags)	local_irq_save(flags)
+#define config_might_irq_restore(flags)	local_irq_restore(flags)
+#else
+#define config_might_irq_save(flags)
+#define config_might_irq_restore(flags)
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+	unsigned long __maybe_unused flags;
 	pmd_t pmdval;
 
 	rcu_read_lock();
+	config_might_irq_save(flags);
 	pmdval = pmdp_get_lockless(pmd);
+	config_might_irq_restore(flags);
+
 	if (pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))