[28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

Message ID 20230109205336.3665937-29-surenb@google.com
State: New
Series: Per-VMA locks

Commit Message

Suren Baghdasaryan Jan. 9, 2023, 8:53 p.m. UTC
  Introduce the lock_vma_under_rcu() function to look up and lock a VMA
during page fault handling. When the VMA is not found, can't be locked,
or changes after being locked, the function returns NULL. The lookup is performed
under RCU protection to prevent the found VMA from being destroyed before
the VMA lock is acquired. VMA lock statistics are updated according to
the results.
For now only anonymous VMAs can be searched this way. In other cases the
function returns NULL.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)
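
For context, the intended caller is the arch page fault path, which later
patches in this series wire up. A minimal sketch of that usage, assuming
the FAULT_FLAG_VMA_LOCK flag added elsewhere in the series and a
hypothetical helper name:

	/* Hypothetical arch-side helper; VM_FAULT_RETRY tells the caller
	 * to fall back to the usual mmap_lock-protected path. */
	static vm_fault_t try_vma_locked_fault(struct mm_struct *mm,
					       unsigned long address,
					       unsigned int flags,
					       struct pt_regs *regs)
	{
		struct vm_area_struct *vma;
		vm_fault_t fault;

		vma = lock_vma_under_rcu(mm, address);
		if (!vma)
			return VM_FAULT_RETRY;

		fault = handle_mm_fault(vma, address,
					flags | FAULT_FLAG_VMA_LOCK, regs);
		vma_read_unlock(vma);
		return fault;
	}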
  

Comments

Michal Hocko Jan. 17, 2023, 3:47 p.m. UTC | #1
On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> Introduce the lock_vma_under_rcu() function to look up and lock a VMA
> during page fault handling. When the VMA is not found, can't be locked,
> or changes after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.

Could you describe why only anonymous vmas are handled at this stage and
what (roughly) has to be done to support other vmas? lock_vma_under_rcu
doesn't seem to have any anonymous vma specific requirements AFAICS.

Also isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
the naming is really the most important part but the rcu locking is
internal to the function so why should we spread this implementation
detail to the world...

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c464fc8a514c..d0fddf6a1de9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
>  		      vma);
>  }
>  
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +					  unsigned long address);
> +
>  #else /* CONFIG_PER_VMA_LOCK */
>  
>  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> diff --git a/mm/memory.c b/mm/memory.c
> index 9ece18548db1..a658e26d965d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +/*
> + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> + * stable and not isolated. If the VMA is not found or is being modified the
> + * function returns NULL.
> + */
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +					  unsigned long address)
> +{
> +	MA_STATE(mas, &mm->mm_mt, address, address);
> +	struct vm_area_struct *vma, *validate;
> +
> +	rcu_read_lock();
> +	vma = mas_walk(&mas);
> +retry:
> +	if (!vma)
> +		goto inval;
> +
> +	/* Only anonymous vmas are supported for now */
> +	if (!vma_is_anonymous(vma))
> +		goto inval;
> +
> +	if (!vma_read_trylock(vma))
> +		goto inval;
> +
> +	/* Check since vm_start/vm_end might change before we lock the VMA */
> +	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> +		vma_read_unlock(vma);
> +		goto inval;
> +	}
> +
> +	/* Check if the VMA got isolated after we found it */
> +	mas.index = address;
> +	validate = mas_walk(&mas);
> +	if (validate != vma) {
> +		vma_read_unlock(vma);
> +		count_vm_vma_lock_event(VMA_LOCK_MISS);
> +		/* The area was replaced with another one. */
> +		vma = validate;
> +		goto retry;
> +	}
> +
> +	rcu_read_unlock();
> +	return vma;
> +inval:
> +	rcu_read_unlock();
> +	count_vm_vma_lock_event(VMA_LOCK_ABORT);
> +	return NULL;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  #ifndef __PAGETABLE_P4D_FOLDED
>  /*
>   * Allocate p4d page table.
> -- 
> 2.39.0
  
Jann Horn Jan. 17, 2023, 9:03 p.m. UTC | #2
On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> Introduce the lock_vma_under_rcu() function to look up and lock a VMA
> during page fault handling. When the VMA is not found, can't be locked,
> or changes after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.
[...]
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +                                         unsigned long address)
> +{
> +       MA_STATE(mas, &mm->mm_mt, address, address);
> +       struct vm_area_struct *vma, *validate;
> +
> +       rcu_read_lock();
> +       vma = mas_walk(&mas);
> +retry:
> +       if (!vma)
> +               goto inval;
> +
> +       /* Only anonymous vmas are supported for now */
> +       if (!vma_is_anonymous(vma))
> +               goto inval;
> +
> +       if (!vma_read_trylock(vma))
> +               goto inval;
> +
> +       /* Check since vm_start/vm_end might change before we lock the VMA */
> +       if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> +               vma_read_unlock(vma);
> +               goto inval;
> +       }
> +
> +       /* Check if the VMA got isolated after we found it */
> +       mas.index = address;
> +       validate = mas_walk(&mas);

Question for Maple Tree experts:

Are you allowed to use mas_walk() like this? If the first mas_walk()
call encountered a single-entry tree, it would store mas->node =
MAS_ROOT, right? And then the second call would go into
mas_state_walk(), mas_start() would return NULL, mas_is_ptr() would be
true, and then mas_state_walk() would return the result of
mas_start(), which is NULL? And we'd end up with mas_walk() returning
NULL on the second run even though the tree hasn't changed?
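
To make the scenario concrete, a rough sketch using the maple tree
test-style API (single-entry tree holding a hypothetical some_ptr at the
0-0 range):

	DEFINE_MTREE(mt);
	MA_STATE(mas, &mt, 0, 0);
	void *entry;

	/* An entry spanning only index 0 is stored as a pointer at the
	 * root instead of in a node. */
	mtree_store_range(&mt, 0, 0, some_ptr, GFP_KERNEL);

	rcu_read_lock();
	entry = mas_walk(&mas); /* returns some_ptr, mas.node = MAS_ROOT */
	entry = mas_walk(&mas); /* the worry: NULL, though nothing changed */
	rcu_read_unlock();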

> +       if (validate != vma) {
> +               vma_read_unlock(vma);
> +               count_vm_vma_lock_event(VMA_LOCK_MISS);
> +               /* The area was replaced with another one. */
> +               vma = validate;
> +               goto retry;
> +       }
> +
> +       rcu_read_unlock();
> +       return vma;
> +inval:
> +       rcu_read_unlock();
> +       count_vm_vma_lock_event(VMA_LOCK_ABORT);
> +       return NULL;
> +}
  
Liam R. Howlett Jan. 17, 2023, 11:18 p.m. UTC | #3
* Jann Horn <jannh@google.com> [230117 16:04]:
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > Introduce the lock_vma_under_rcu() function to look up and lock a VMA
> > during page fault handling. When the VMA is not found, can't be locked,
> > or changes after being locked, the function returns NULL. The lookup is performed
> > under RCU protection to prevent the found VMA from being destroyed before
> > the VMA lock is acquired. VMA lock statistics are updated according to
> > the results.
> > For now only anonymous VMAs can be searched this way. In other cases the
> > function returns NULL.
> [...]
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +                                         unsigned long address)
> > +{
> > +       MA_STATE(mas, &mm->mm_mt, address, address);
> > +       struct vm_area_struct *vma, *validate;
> > +
> > +       rcu_read_lock();
> > +       vma = mas_walk(&mas);
> > +retry:
> > +       if (!vma)
> > +               goto inval;
> > +
> > +       /* Only anonymous vmas are supported for now */
> > +       if (!vma_is_anonymous(vma))
> > +               goto inval;
> > +
> > +       if (!vma_read_trylock(vma))
> > +               goto inval;
> > +
> > +       /* Check since vm_start/vm_end might change before we lock the VMA */
> > +       if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > +               vma_read_unlock(vma);
> > +               goto inval;
> > +       }
> > +
> > +       /* Check if the VMA got isolated after we found it */
> > +       mas.index = address;
> > +       validate = mas_walk(&mas);
> 
> Question for Maple Tree experts:
> 
> Are you allowed to use mas_walk() like this? If the first mas_walk()
> call encountered a single-entry tree, it would store mas->node =
> MAS_ROOT, right? And then the second call would go into
> mas_state_walk(), mas_start() would return NULL, mas_is_ptr() would be
> true, and then mas_state_walk() would return the result of
> mas_start(), which is NULL? And we'd end up with mas_walk() returning
> NULL on the second run even though the tree hasn't changed?

This is safe for VMAs.  There might be a bug in the tree regarding
re-walking with a pointer, but it won't matter here.

A single-entry tree will be a pointer if the entry is of the range 0-0
(mas.index == 0, mas.last == 0).  This would be a zero-sized VMA, which
is not valid.

The second walk will check if the maple node is dead and restart the
walk if it is dead.  If the node isn't dead (almost always the case),
then it will be a very quick walk.

After a mas_walk(), the maple state has mas.index = vma->vm_start
and mas.last = (vma->vm_end - 1). The address is set prior to the second
walk in case of a VMA split where mas.index from the first walk ends up
on the other side of the split from the address.
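
As a sketch with hypothetical ranges:

	/* The first walk found a VMA spanning 0x1000-0x5fff, so now
	 * mas.index == 0x1000 and mas.last == 0x5fff.  Suppose the
	 * faulting address is 0x4000 and the VMA gets split at 0x3000:
	 * re-walking with the stale mas.index would land on the
	 * 0x1000-0x2fff half, not the half containing the address.
	 * Restoring the address first makes the second walk validate
	 * the correct range. */
	mas.index = address;
	validate = mas_walk(&mas);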

> 
> > +       if (validate != vma) {
> > +               vma_read_unlock(vma);
> > +               count_vm_vma_lock_event(VMA_LOCK_MISS);
> > +               /* The area was replaced with another one. */
> > +               vma = validate;
> > +               goto retry;
> > +       }
> > +
> > +       rcu_read_unlock();
> > +       return vma;
> > +inval:
> > +       rcu_read_unlock();
> > +       count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > +       return NULL;
> > +}
  
Suren Baghdasaryan Jan. 18, 2023, 1:06 a.m. UTC | #4
On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > Introduce the lock_vma_under_rcu() function to look up and lock a VMA
> > during page fault handling. When the VMA is not found, can't be locked,
> > or changes after being locked, the function returns NULL. The lookup is performed
> > under RCU protection to prevent the found VMA from being destroyed before
> > the VMA lock is acquired. VMA lock statistics are updated according to
> > the results.
> > For now only anonymous VMAs can be searched this way. In other cases the
> > function returns NULL.
>
> Could you describe why only anonymous vmas are handled at this stage and
> what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> doesn't seem to have any anonymous vma specific requirements AFAICS.

TBH I haven't spent much time looking into file-backed page faults yet,
but a couple of tasks I can think of are:
- Ensure that all vma->vm_ops->fault() handlers do not rely on
mmap_lock being read-locked;
- Freeing vma->vm_file, like freeing the VMA itself, will need to happen
after an RCU grace period, since page fault handlers use it. This will
require some caution because simply adding it into __vm_area_free(),
called via call_rcu(), would cause the corresponding fops->release() to
be called asynchronously. I had to solve this issue in the out-of-tree
SPF implementation, where an asynchronously called snd_pcm_release() was
problematic (see the sketch below).
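
A sketch of the naive pattern that would misbehave, assuming vm_rcu is
the rcu_head this series adds to vm_area_struct:

	static void __vm_area_free(struct rcu_head *head)
	{
		struct vm_area_struct *vma = container_of(head,
				struct vm_area_struct, vm_rcu);

		/* Dropping the file reference here would run fput() and
		 * thus fops->release() asynchronously from an RCU
		 * callback, long after the munmap() has returned. */
		if (vma->vm_file)
			fput(vma->vm_file);
		kmem_cache_free(vm_area_cachep, vma);
	}

	/* freeing side: */
	call_rcu(&vma->vm_rcu, __vm_area_free);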

I'm sure I'm missing more potential issues and maybe Matthew and
Michel can pinpoint more things to resolve here?

>
> Also isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
> the naming is really the most important part but the rcu locking is
> internal to the function so why should we spread this implementation
> detail to the world...

I wanted the name to indicate that the lookup is done with no locks
held. But I'm open to suggestions.

>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/mm.h |  3 +++
> >  mm/memory.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 54 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c464fc8a514c..d0fddf6a1de9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> >                     vma);
> >  }
> >
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +                                       unsigned long address);
> > +
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> >  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9ece18548db1..a658e26d965d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >  }
> >  EXPORT_SYMBOL_GPL(handle_mm_fault);
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +/*
> > + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> > + * stable and not isolated. If the VMA is not found or is being modified the
> > + * function returns NULL.
> > + */
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +                                       unsigned long address)
> > +{
> > +     MA_STATE(mas, &mm->mm_mt, address, address);
> > +     struct vm_area_struct *vma, *validate;
> > +
> > +     rcu_read_lock();
> > +     vma = mas_walk(&mas);
> > +retry:
> > +     if (!vma)
> > +             goto inval;
> > +
> > +     /* Only anonymous vmas are supported for now */
> > +     if (!vma_is_anonymous(vma))
> > +             goto inval;
> > +
> > +     if (!vma_read_trylock(vma))
> > +             goto inval;
> > +
> > +     /* Check since vm_start/vm_end might change before we lock the VMA */
> > +     if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > +             vma_read_unlock(vma);
> > +             goto inval;
> > +     }
> > +
> > +     /* Check if the VMA got isolated after we found it */
> > +     mas.index = address;
> > +     validate = mas_walk(&mas);
> > +     if (validate != vma) {
> > +             vma_read_unlock(vma);
> > +             count_vm_vma_lock_event(VMA_LOCK_MISS);
> > +             /* The area was replaced with another one. */
> > +             vma = validate;
> > +             goto retry;
> > +     }
> > +
> > +     rcu_read_unlock();
> > +     return vma;
> > +inval:
> > +     rcu_read_unlock();
> > +     count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > +     return NULL;
> > +}
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >  #ifndef __PAGETABLE_P4D_FOLDED
> >  /*
> >   * Allocate p4d page table.
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs
  
Matthew Wilcox Jan. 18, 2023, 2:44 a.m. UTC | #5
On Tue, Jan 17, 2023 at 05:06:57PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > > Introduce the lock_vma_under_rcu() function to look up and lock a VMA
> > > during page fault handling. When the VMA is not found, can't be locked,
> > > or changes after being locked, the function returns NULL. The lookup is performed
> > > under RCU protection to prevent the found VMA from being destroyed before
> > > the VMA lock is acquired. VMA lock statistics are updated according to
> > > the results.
> > > For now only anonymous VMAs can be searched this way. In other cases the
> > > function returns NULL.
> >
> > Could you describe why only anonymous vmas are handled at this stage and
> > what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> > doesn't seem to have any anonymous vma specific requirements AFAICS.
> 
> TBH I haven't spent too much time looking into file-backed page faults
> yet but a couple of tasks I can think of are:
> - Ensure that all vma->vm_ops->fault() handlers do not rely on
> mmap_lock being read-locked;

I think this way lies madness.  There are just too many device drivers
that implement ->fault.  My plan is to call the ->map_pages() method
under RCU without even read-locking the VMA.  If that doesn't satisfy
the fault, then drop all the way back to taking the mmap_sem for read
before calling into ->fault.
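
Roughly, with hypothetical helper names rather than a worked-out design:

	rcu_read_lock();
	vma = find_vma_under_rcu(mm, address); /* lookup only, no VMA lock */
	if (vma && vma->vm_ops && vma->vm_ops->map_pages &&
	    fault_around_under_rcu(vma, address)) { /* hypothetical */
		rcu_read_unlock();
		return VM_FAULT_NOPAGE;
	}
	rcu_read_unlock();

	/* Anything that actually needs ->fault() drops back to here. */
	mmap_read_lock(mm);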
  
Suren Baghdasaryan Jan. 18, 2023, 9:33 p.m. UTC | #6
On Tue, Jan 17, 2023 at 6:44 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 05:06:57PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > > > Introduce the lock_vma_under_rcu() function to look up and lock a VMA
> > > > during page fault handling. When the VMA is not found, can't be locked,
> > > > or changes after being locked, the function returns NULL. The lookup is performed
> > > > under RCU protection to prevent the found VMA from being destroyed before
> > > > the VMA lock is acquired. VMA lock statistics are updated according to
> > > > the results.
> > > > For now only anonymous VMAs can be searched this way. In other cases the
> > > > function returns NULL.
> > >
> > > Could you describe why only anonymous vmas are handled at this stage and
> > > what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> > > doesn't seem to have any anonymous vma specific requirements AFAICS.
> >
> > TBH I haven't spent too much time looking into file-backed page faults
> > yet but a couple of tasks I can think of are:
> > - Ensure that all vma->vm_ops->fault() handlers do not rely on
> > mmap_lock being read-locked;
>
> I think this way lies madness.  There are just too many device drivers
> that implement ->fault.  My plan is to call the ->map_pages() method
> under RCU without even read-locking the VMA.  If that doesn't satisfy
> the fault, then drop all the way back to taking the mmap_sem for read
> before calling into ->fault.

Sounds reasonable to me but I guess the devil is in the details...

>
  

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c464fc8a514c..d0fddf6a1de9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
 		      vma);
 }
 
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+					  unsigned long address);
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_init_lock(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index 9ece18548db1..a658e26d965d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
+ * stable and not isolated. If the VMA is not found or is being modified the
+ * function returns NULL.
+ */
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+					  unsigned long address)
+{
+	MA_STATE(mas, &mm->mm_mt, address, address);
+	struct vm_area_struct *vma, *validate;
+
+	rcu_read_lock();
+	vma = mas_walk(&mas);
+retry:
+	if (!vma)
+		goto inval;
+
+	/* Only anonymous vmas are supported for now */
+	if (!vma_is_anonymous(vma))
+		goto inval;
+
+	if (!vma_read_trylock(vma))
+		goto inval;
+
+	/* Check since vm_start/vm_end might change before we lock the VMA */
+	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+		vma_read_unlock(vma);
+		goto inval;
+	}
+
+	/* Check if the VMA got isolated after we found it */
+	mas.index = address;
+	validate = mas_walk(&mas);
+	if (validate != vma) {
+		vma_read_unlock(vma);
+		count_vm_vma_lock_event(VMA_LOCK_MISS);
+		/* The area was replaced with another one. */
+		vma = validate;
+		goto retry;
+	}
+
+	rcu_read_unlock();
+	return vma;
+inval:
+	rcu_read_unlock();
+	count_vm_vma_lock_event(VMA_LOCK_ABORT);
+	return NULL;
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+
 #ifndef __PAGETABLE_P4D_FOLDED
 /*
  * Allocate p4d page table.