Message ID | 20230627042321.1763765-7-surenb@google.com |
---|---|
State | New |
Headers |
Subject: [PATCH v3 6/8] mm: handle swap page faults under per-VMA lock
From: Suren Baghdasaryan <surenb@google.com>
To: akpm@linux-foundation.org
Cc: willy@infradead.org, hannes@cmpxchg.org, mhocko@suse.com, josef@toxicpanda.com, jack@suse.cz, ldufour@linux.ibm.com, laurent.dufour@fr.ibm.com, michel@lespinasse.org, liam.howlett@oracle.com, jglisse@google.com, vbabka@suse.cz, minchan@google.com, dave@stgolabs.net, punit.agrawal@bytedance.com, lstoakes@gmail.com, hdanton@sina.com, apopple@nvidia.com, peterx@redhat.com, ying.huang@intel.com, david@redhat.com, yuzhao@google.com, dhowells@redhat.com, hughd@google.com, viro@zeniv.linux.org.uk, brauner@kernel.org, pasha.tatashin@soleen.com, surenb@google.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@android.com
Date: Mon, 26 Jun 2023 21:23:19 -0700
Message-ID: <20230627042321.1763765-7-surenb@google.com>
In-Reply-To: <20230627042321.1763765-1-surenb@google.com>
References: <20230627042321.1763765-1-surenb@google.com> |
Series | Per-VMA lock support for swap and userfaults |
Commit Message
Suren Baghdasaryan
June 27, 2023, 4:23 a.m. UTC
When a page fault is handled under per-VMA lock protection, all swap page
faults are retried with mmap_lock because folio_lock_fault (formerly
known as folio_lock_or_retry) had to drop and reacquire mmap_lock
if the folio could not be immediately locked.
Follow the same pattern as mmap_lock: drop the per-VMA lock when waiting
for the folio in folio_lock_fault and retry once the folio is available.
With this obstacle removed, enable do_swap_page to operate under
per-VMA lock protection. Drivers implementing ops->migrate_to_ram might
still rely on mmap_lock, therefore we have to fall back to mmap_lock in
that particular case.
Note that the only time do_swap_page calls synchronous swap_readpage
is when SWP_SYNCHRONOUS_IO is set, which is only set for
QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and
pmem). Therefore we don't sleep in this path, and there's no need to
drop the mmap or per-VMA lock.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
mm/filemap.c | 24 ++++++++++++++++--------
mm/memory.c | 21 ++++++++++++++-------
2 files changed, 30 insertions(+), 15 deletions(-)
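
[Editor's note] The heart of the mm/filemap.c change (full diff below) is a single release-then-wait pattern: before __folio_lock_fault sleeps on the folio, it drops whichever lock the fault path holds, chosen by FAULT_FLAG_VMA_LOCK, and returns VM_FAULT_RETRY. A minimal C sketch of that pattern follows; the helper name fault_release_lock is hypothetical, not part of this series, and only condenses the two branches the patch open-codes at both call sites:

/*
 * Sketch only: drop whichever lock the fault handler took, then record
 * that fact so callers do not unlock a second time.
 */
static void fault_release_lock(struct vm_fault *vmf)
{
        if (vmf->flags & FAULT_FLAG_VMA_LOCK)
                vma_end_read(vmf->vma);                 /* per-VMA read lock */
        else
                mmap_read_unlock(vmf->vma->vm_mm);      /* classic mmap_lock */
        vmf->flags |= FAULT_FLAG_LOCK_DROPPED;          /* flag added earlier in this series */
}

With this in place, waiting for the folio is safe: the faulting task sleeps without holding any address-space lock, and the fault is simply retried once the folio becomes available.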
Comments
On Mon, Jun 26, 2023 at 09:23:19PM -0700, Suren Baghdasaryan wrote:
> When page fault is handled under per-VMA lock protection, all swap page
> faults are retried with mmap_lock because folio_lock_fault (formerly
> known as folio_lock_or_retry) had to drop and reacquire mmap_lock
> if folio could not be immediately locked.

[... rest of commit message and diffstat quoted in full, trimmed here; the full diff follows the comments ...]

> @@ -1703,12 +1703,14 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
>   * Return values:
>   * 0 - folio is locked.
>   * VM_FAULT_RETRY - folio is not locked.
> - *     mmap_lock has been released (mmap_read_unlock(), unless flags had both
> - *     FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
> - *     which case mmap_lock is still held.
> + *     FAULT_FLAG_LOCK_DROPPED bit in vmf flags will be set if mmap_lock or

This "FAULT_FLAG_LOCK_DROPPED" should belong to that patch when introduced.
But again I still think this flag as a whole with that patch is not needed
and should be dropped, unless I miss something important..

[... remaining mm/filemap.c hunks quoted without further comment ...]

> @@ -3712,11 +3712,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (!pte_unmap_same(vmf))
>                 goto out;
>
> -       if (vmf->flags & FAULT_FLAG_VMA_LOCK) {

So if with my imagination, here we'll already have the vma_read_end() and
this patch will remove it, which makes sense. Then...

> -               ret = VM_FAULT_RETRY;
> -               goto out;
> -       }
> -

[...]

>                 } else if (is_device_private_entry(entry)) {
> +                       if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> +                               /*
> +                                * migrate_to_ram is not yet ready to operate
> +                                * under VMA lock.
> +                                */

... here we probably just do vma_read_end(), then...

> +                               ret |= VM_FAULT_RETRY;
> +                               goto out;
> +                       }

[...]

> @@ -5089,9 +5093,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>                 /*
>                  * In case of VM_FAULT_RETRY or VM_FAULT_COMPLETED we might
>                  * be still holding per-VMA lock to keep the vma stable as long
> -                * as possible. Drop it before returning.
> +                * as possible. In this situation vmf.flags has
> +                * FAULT_FLAG_VMA_LOCK set and FAULT_FLAG_LOCK_DROPPED unset.
> +                * Drop the lock before returning when this happens.
>                  */
> -               if (vmf.flags & FAULT_FLAG_VMA_LOCK)
> +               if ((vmf.flags & (FAULT_FLAG_VMA_LOCK | FAULT_FLAG_LOCK_DROPPED)) ==
> +                   FAULT_FLAG_VMA_LOCK)
>                         vma_end_read(vma);

This whole chunk should have been dropped altogether with my comment in
previous patch, iiuc, and it should be no-op anyway for swap case. For the
real "waiting for page lock during swapin" phase we should always 100%
release the vma lock in folio_lock_or_retry() - just like mmap lock.

Thanks,

--
Peter Xu
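
[Editor's note] Peter's suggestion amounts to releasing the per-VMA lock at the point of failure instead of recording FAULT_FLAG_LOCK_DROPPED for the caller. A hedged sketch of what the do_swap_page device-private bail-out might then look like; the helper name and shape are assumptions for illustration, not code from this series:

/*
 * Sketch of the reviewer's proposal: drop the per-VMA lock right where
 * the fault path gives up, so the caller never has to ask who unlocked.
 */
static vm_fault_t sketch_migrate_to_ram_fallback(struct vm_fault *vmf)
{
        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
                /* migrate_to_ram handlers may still depend on mmap_lock */
                vma_end_read(vmf->vma); /* released here, not by the caller */
                return VM_FAULT_RETRY; /* retried under mmap_lock */
        }
        return 0;                      /* mmap_lock held: continue normally */
}

Note the caveat this creates: nothing after vma_end_read() may touch vmf->vma, because the VMA is no longer stable.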
On Tue, Jun 27, 2023 at 8:41 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Jun 26, 2023 at 09:23:19PM -0700, Suren Baghdasaryan wrote:

[... quoted patch and earlier review comments trimmed; see the message above ...]

> > @@ -5089,9 +5093,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> > -               if (vmf.flags & FAULT_FLAG_VMA_LOCK)
> > +               if ((vmf.flags & (FAULT_FLAG_VMA_LOCK | FAULT_FLAG_LOCK_DROPPED)) ==
> > +                   FAULT_FLAG_VMA_LOCK)
> >                         vma_end_read(vma);
>
> This whole chunk should have been dropped altogether with my comment in
> previous patch, iiuc, and it should be no-op anyway for swap case. For the
> real "waiting for page lock during swapin" phase we should always 100%
> release the vma lock in folio_lock_or_retry() - just like mmap lock.

Yep, we drop FAULT_FLAG_LOCK_DROPPED, release vma lock when we return
RETRY and that makes all this unnecessary. I just need to make sure we
do not access VMA after we drop its lock since we will be releasing it
now earlier than before.

> Thanks,

[... remaining quote trimmed ...]
On Tue, Jun 27, 2023 at 09:05:32AM -0700, Suren Baghdasaryan wrote:
> On Tue, Jun 27, 2023 at 8:41 AM Peter Xu <peterx@redhat.com> wrote:

[... quoted patch and earlier exchange trimmed; see the messages above ...]

> > This whole chunk should have been dropped altogether with my comment in
> > previous patch, iiuc, and it should be no-op anyway for swap case. For the
> > real "waiting for page lock during swapin" phase we should always 100%
> > release the vma lock in folio_lock_or_retry() - just like mmap lock.
>
> Yep, we drop FAULT_FLAG_LOCK_DROPPED, release vma lock when we return
> RETRY and that makes all this unnecessary. I just need to make sure we
> do not access VMA after we drop its lock since we will be releasing it
> now earlier than before.

Yes. Hopefully that's not a problem, because mmap lock has it the same
way. And that's also the good thing if we can always keep both locks
behaving the same if possible, because the problem is really shared, imho.
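
[Editor's note] Suren's closing point is the one sharp edge of the converged scheme: once a retry path has dropped the per-VMA lock, the VMA may be freed or modified, so nothing after that point may dereference it. A small illustrative sketch of the rule, assumed code rather than an excerpt from the series, ahead of the v3 diff below:

        ret = __folio_lock_fault(folio, vmf);
        if (ret & VM_FAULT_RETRY) {
                /*
                 * The mmap_lock or per-VMA lock was already released inside
                 * __folio_lock_fault(); vmf->vma is no longer stable.
                 * Dereferencing vmf->vma->vm_flags at this point would be a
                 * use-after-unlock bug.
                 */
                return ret;     /* bail out without touching the VMA */
        }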
diff --git a/mm/filemap.c b/mm/filemap.c
index 8ad06d69895b..683f11f244cd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1703,12 +1703,14 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  * Return values:
  * 0 - folio is locked.
  * VM_FAULT_RETRY - folio is not locked.
- *     mmap_lock has been released (mmap_read_unlock(), unless flags had both
- *     FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
- *     which case mmap_lock is still held.
+ *     FAULT_FLAG_LOCK_DROPPED bit in vmf flags will be set if mmap_lock or
+ *     per-VMA lock got dropped. mmap_lock/per-VMA lock is dropped when
+ *     function fails to lock the folio, unless flags had both
+ *     FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in which case
+ *     the lock is still held.
  *
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return 0
- * with the folio locked and the mmap_lock unperturbed.
+ * with the folio locked and the mmap_lock/per-VMA lock unperturbed.
  */
 vm_fault_t __folio_lock_fault(struct folio *folio, struct vm_fault *vmf)
 {
@@ -1716,13 +1718,16 @@ vm_fault_t __folio_lock_fault(struct folio *folio, struct vm_fault *vmf)
 
        if (fault_flag_allow_retry_first(vmf->flags)) {
                /*
-                * CAUTION! In this case, mmap_lock is not released
-                * even though return VM_FAULT_RETRY.
+                * CAUTION! In this case, mmap_lock/per-VMA lock is not
+                * released even though returning VM_FAULT_RETRY.
                 */
                if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
                        return VM_FAULT_RETRY;
 
-               mmap_read_unlock(mm);
+               if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+                       vma_end_read(vmf->vma);
+               else
+                       mmap_read_unlock(mm);
                vmf->flags |= FAULT_FLAG_LOCK_DROPPED;
                if (vmf->flags & FAULT_FLAG_KILLABLE)
                        folio_wait_locked_killable(folio);
@@ -1735,7 +1740,10 @@ vm_fault_t __folio_lock_fault(struct folio *folio, struct vm_fault *vmf)
 
        ret = __folio_lock_killable(folio);
        if (ret) {
-               mmap_read_unlock(mm);
+               if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+                       vma_end_read(vmf->vma);
+               else
+                       mmap_read_unlock(mm);
                vmf->flags |= FAULT_FLAG_LOCK_DROPPED;
                return VM_FAULT_RETRY;
        }
diff --git a/mm/memory.c b/mm/memory.c
index 3c2acafcd7b6..5caaa4c66ea2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3712,11 +3712,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        if (!pte_unmap_same(vmf))
                goto out;
 
-       if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-               ret = VM_FAULT_RETRY;
-               goto out;
-       }
-
        entry = pte_to_swp_entry(vmf->orig_pte);
        if (unlikely(non_swap_entry(entry))) {
                if (is_migration_entry(entry)) {
@@ -3726,6 +3721,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                        vmf->page = pfn_swap_entry_to_page(entry);
                        ret = remove_device_exclusive_entry(vmf);
                } else if (is_device_private_entry(entry)) {
+                       if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+                               /*
+                                * migrate_to_ram is not yet ready to operate
+                                * under VMA lock.
+                                */
+                               ret |= VM_FAULT_RETRY;
+                               goto out;
+                       }
+
                        vmf->page = pfn_swap_entry_to_page(entry);
                        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                        vmf->address, &vmf->ptl);
@@ -5089,9 +5093,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
                /*
                 * In case of VM_FAULT_RETRY or VM_FAULT_COMPLETED we might
                 * be still holding per-VMA lock to keep the vma stable as long
-                * as possible. Drop it before returning.
+                * as possible. In this situation vmf.flags has
+                * FAULT_FLAG_VMA_LOCK set and FAULT_FLAG_LOCK_DROPPED unset.
+                * Drop the lock before returning when this happens.
                 */
-               if (vmf.flags & FAULT_FLAG_VMA_LOCK)
+               if ((vmf.flags & (FAULT_FLAG_VMA_LOCK | FAULT_FLAG_LOCK_DROPPED)) ==
+                   FAULT_FLAG_VMA_LOCK)
                        vma_end_read(vma);
        }
        return ret;
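
[Editor's note] For context, this is roughly how the whole contract looks from an architecture fault handler once retry paths release the per-VMA lock themselves, as the thread above converges on. A condensed sketch, modeled loosely on arch fault paths of this kernel generation with access checks omitted; an assumption for illustration, not part of this patch:

#include <linux/mm.h>

static vm_fault_t fault_under_vma_lock(struct mm_struct *mm,
                                       unsigned long addr,
                                       unsigned int flags,
                                       struct pt_regs *regs)
{
        struct vm_area_struct *vma;
        vm_fault_t fault;

        /* Find and read-lock the VMA without taking mmap_lock. */
        vma = lock_vma_under_rcu(mm, addr);
        if (!vma)
                return VM_FAULT_RETRY; /* caller retries under mmap_lock */

        fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs);

        /*
         * Retry paths (folio-lock waits, the migrate_to_ram fallback)
         * release the per-VMA lock themselves; only a fault that ran to
         * completion still holds it here.
         */
        if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
                vma_end_read(vma);

        return fault;   /* VM_FAULT_RETRY => fall back to the mmap_lock path */
}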