[41/41] mm: replace rw_semaphore with atomic_t in vma_lock

Message ID 20230109205336.3665937-42-surenb@google.com
State: New
Series: Per-VMA locks

Commit Message

Suren Baghdasaryan Jan. 9, 2023, 8:53 p.m. UTC
  rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space for each vm_area_struct. However vma_lock has
two important specifics which can be used to replace rw_semaphore
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode.
Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore with an atomic variable.
When a reader takes the read lock, it increments the atomic unless the
value is negative. If that fails, read-locking is aborted and mmap_lock
is used instead.
When a writer takes the write lock, it sets the atomic to -1 if the
current value is 0 (no readers). Since all writers take mmap_lock in
write mode first, there can be only one writer at a time. If there
are readers, the writer places itself on a wait queue using the new
mm_struct.vma_writer_wait waitqueue head. The last reader to release
the vma_lock will signal the writer to wake up.
vm_lock_seq is also moved into vma_lock; together with the atomic_t it
packs nicely into 8 bytes, bringing the overhead of vma_lock from 44 to
16 bytes (8 for the structure plus the 8-byte pointer to it):

    slabinfo before the changes:
     <name>            ... <objsize> <objperslab> <pagesperslab> : ...
    vm_area_struct    ...    152   53    2 : ...

    slabinfo with vma_lock:
     <name>            ... <objsize> <objperslab> <pagesperslab> : ...
    rw_semaphore      ...      8  512    1 : ...
    vm_area_struct    ...    160   51    2 : ...

Assuming 40000 vm_area_structs, memory consumption would be:
baseline: 6040kB
vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
Total increase: 556kB
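
(For reference, these figures follow from the slab geometry above, assuming
4kB pages:
    baseline vm_area_struct: ceil(40000/53) slabs * 2 pages * 4kB = 755 * 8kB = 6040kB
    new vm_area_struct:      ceil(40000/51) slabs * 2 pages * 4kB = 785 * 8kB = 6280kB
    vma_lock:                ceil(40000/512) slabs * 1 page * 4kB =  79 * 4kB =  316kB)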

atomic_t might overflow if there are many competing readers, therefore
vma_read_trylock() implements an overflow check; if an overflow occurs,
it restores the previous value and fails to take the lock.
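
For orientation, a condensed sketch of the resulting locking scheme (field
names follow the hunks quoted in the replies below; the struct layout is
inferred from the description above and may differ in detail from the patch):

struct vma_lock {
	int		lock_seq;	/* was vma->vm_lock_seq */
	atomic_t	count;		/* 0: unlocked, >0: readers, <0: write-locked */
};

/* Reader side (page fault path): trylock or fall back to mmap_lock. */
static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
	/* Already write-locked during this mmap_lock write cycle. */
	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;
	/* Fails if a writer holds the lock (count is negative). */
	if (!atomic_inc_unless_negative(&vma->vm_lock->count))
		return false;
	/* overflow and lock_seq re-checks omitted; see the hunks quoted below */
	return true;
}

/* Writer side: mmap_lock is already held for write, so at most one writer. */
static inline void vma_write_lock(struct vm_area_struct *vma)
{
	int mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);

	if (vma->vm_lock->lock_seq == mm_lock_seq)
		return;
	/* If readers are present, sleep until the last one wakes us. */
	if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
		wait_event(vma->vm_mm->vma_writer_wait,
			   atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
	vma->vm_lock->lock_seq = mm_lock_seq;
	smp_wmb();	/* make lock_seq visible before releasing count */
	atomic_set(&vma->vm_lock->count, 0);
}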

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h       | 37 +++++++++++++++++++++++++------------
 include/linux/mm_types.h | 10 ++++++++--
 kernel/fork.c            |  6 +++---
 mm/init-mm.c             |  2 ++
 4 files changed, 38 insertions(+), 17 deletions(-)
  

Comments

Vlastimil Babka Jan. 10, 2023, 8:04 a.m. UTC | #1
On 1/9/23 21:53, Suren Baghdasaryan wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
> mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock
> because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore with an atomic variable.
> When a reader takes read lock, it increments the atomic unless the
> value is negative. If that fails read-locking is aborted and mmap_lock
> is used instead.
> When writer takes write lock, it resets atomic value to -1 if the
> current value is 0 (no readers). Since all writers take mmap_lock in
> write mode first, there can be only one writer at a time. If there
> are readers, writer will place itself into a wait queue using new
> mm_struct.vma_writer_wait waitqueue head. The last reader to release
> the vma_lock will signal the writer to wake up.
> vm_lock_seq is also moved into vma_lock and along with atomic_t they
> are nicely packed and consume 8 bytes, bringing the overhead from
> vma_lock from 44 to 16 bytes:
> 
>     slabinfo before the changes:
>      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
>     vm_area_struct    ...    152   53    2 : ...
> 
>     slabinfo with vma_lock:
>      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
>     rw_semaphore      ...      8  512    1 : ...

I guess the cache is called vma_lock, not rw_semaphore?

>     vm_area_struct    ...    160   51    2 : ...
> 
> Assuming 40000 vm_area_structs, memory consumption would be:
> baseline: 6040kB
> vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
> Total increase: 556kB
> 
> atomic_t might overflow if there are many competing readers, therefore
> vma_read_trylock() implements an overflow check and if that occurs it
> restores the previous value and exits with a failure to lock.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

This patch is indeed an interesting addition, but I can't help but
think it obsoletes the previous one :) We allocate an extra 8-byte slab
object for the lock, and the pointer to it is also 8 bytes, and requires an
indirection. The vma_lock cache is not cacheline aligned (otherwise it would
be a major waste), so we have potential false sharing with up to 7 other
vma_locks.
I'd expect that if the vma_lock were placed with the relatively cold fields of
vm_area_struct, it shouldn't cause much cache ping-pong when working with
that vma. Even if we don't cache-align the vma to save memory (it would be 192
bytes instead of 160 when aligned) and place the vma_lock and the cold
fields at the end of the vma, it may be false sharing the cacheline with the
next vma in the slab. But that's a single vma, not up to 7, so it shouldn't
be worse?
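
For concreteness, the slab-creation trade-off being weighed here (hypothetical
calls: the actual kernel/fork.c hunk is not quoted in this thread, so the
cache variable, struct name and flag set are assumptions):

/* No alignment flag, matching the 8-byte object size in the slabinfo above:
 * up to 8 vma_lock objects can share one 64-byte cache line. */
vma_lock_cachep = kmem_cache_create("vma_lock", sizeof(struct vma_lock), 0,
				    SLAB_PANIC|SLAB_ACCOUNT, NULL);

/* Cache-line aligning the objects would avoid that false sharing, but would
 * waste 56 of every 64 bytes per object: */
vma_lock_cachep = kmem_cache_create("vma_lock", sizeof(struct vma_lock), 0,
				    SLAB_PANIC|SLAB_ACCOUNT|SLAB_HWCACHE_ALIGN,
				    NULL);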
  
Suren Baghdasaryan Jan. 10, 2023, 5:05 p.m. UTC | #2
On Tue, Jan 10, 2023 at 12:04 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/9/23 21:53, Suren Baghdasaryan wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> > 1. Readers never wait. They try to take the vma_lock and fall back to
> > mmap_lock if that fails.
> > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > because writers first take mmap_lock in write mode.
> > Because of these requirements, full rw_semaphore functionality is not
> > needed and we can replace rw_semaphore with an atomic variable.
> > When a reader takes read lock, it increments the atomic unless the
> > value is negative. If that fails read-locking is aborted and mmap_lock
> > is used instead.
> > When writer takes write lock, it resets atomic value to -1 if the
> > current value is 0 (no readers). Since all writers take mmap_lock in
> > write mode first, there can be only one writer at a time. If there
> > are readers, writer will place itself into a wait queue using new
> > mm_struct.vma_writer_wait waitqueue head. The last reader to release
> > the vma_lock will signal the writer to wake up.
> > vm_lock_seq is also moved into vma_lock and along with atomic_t they
> > are nicely packed and consume 8 bytes, bringing the overhead from
> > vma_lock from 44 to 16 bytes:
> >
> >     slabinfo before the changes:
> >      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
> >     vm_area_struct    ...    152   53    2 : ...
> >
> >     slabinfo with vma_lock:
> >      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
> >     rw_semaphore      ...      8  512    1 : ...
>
> I guess the cache is called vma_lock, not rw_semaphore?

Yes, sorry. Copy/paste error when combining the results. The numbers
themselves look correct though, so I did not screw up that part :)

>
> >     vm_area_struct    ...    160   51    2 : ...
> >
> > Assuming 40000 vm_area_structs, memory consumption would be:
> > baseline: 6040kB
> > vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
> > Total increase: 556kB
> >
> > atomic_t might overflow if there are many competing readers, therefore
> > vma_read_trylock() implements an overflow check and if that occurs it
> > restores the previous value and exits with a failure to lock.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> This patch is indeed an interesting addition indeed, but I can't help but
> think it obsoletes the previous one :) We allocate an extra 8 bytes slab
> object for the lock, and the pointer to it is also 8 bytes, and requires an
> indirection. The vma_lock cache is not cacheline aligned (otherwise it would
> be a major waste), so we have potential false sharing with up to 7 other
> vma_lock's.

True, I thought long and hard about combining the last two patches but
decided to keep them separate to document the intent. The previous
patch splits the lock for performance reasons and this one is focused
on memory consumption. I'm open to changing this if it's confusing.

> I'd expect if the vma_lock was placed with the relatively cold fields of
> vm_area_struct, it shouldn't cause much cache ping pong when working with
> that vma. Even if we don't cache align the vma to save memory (would be 192
> bytes instead of 160 when aligned) and place the vma_lock and the cold
> fields at the end of the vma, it may be false sharing the cacheline with the
> next vma in the slab.

I would love to combine the vma_lock with vm_area_struct, and I spent
several days trying different combinations to achieve decent
performance. My best result was when I placed the vm_lock
into the third cache line at offset 192 and allocated vm_area_structs
from a cache-aligned slab (a horrible memory waste, with each vma consuming
256 bytes). Even then I see a regression in the pft-threads test on a NUMA
machine (where the cache-bouncing problem is most pronounced):

These are the results with split vma locks (current version). The higher
the number, the better:

BASE                                PVL
Hmean     faults/sec-1    469201.7282 (   0.00%)   464453.3976 *  -1.01%*
Hmean     faults/sec-4   1754465.6221 (   0.00%)  1660688.0452 *  -5.35%*
Hmean     faults/sec-7   2808141.6711 (   0.00%)  2688910.6458 *  -4.25%*
Hmean     faults/sec-12  3750307.7553 (   0.00%)  3863490.2057 *   3.02%*
Hmean     faults/sec-21  4145672.4677 (   0.00%)  3904532.7241 *  -5.82%*
Hmean     faults/sec-30  3775722.5726 (   0.00%)  3923225.3734 *   3.91%*
Hmean     faults/sec-48  4152563.5864 (   0.00%)  4783720.6811 *  15.20%*
Hmean     faults/sec-56  4163868.7111 (   0.00%)  4851473.7241 *  16.51%*

Here are results with the vma locks integrated into cache-aligned
vm_area_struct:

BASE               PVM_MERGED
Hmean     faults/sec-1    469201.7282 (   0.00%)   465268.1274 *  -0.84%*
Hmean     faults/sec-4   1754465.6221 (   0.00%)  1658538.0217 *  -5.47%*
Hmean     faults/sec-7   2808141.6711 (   0.00%)  2645016.1598 *  -5.81%*
Hmean     faults/sec-12  3750307.7553 (   0.00%)  3664676.6956 *  -2.28%*
Hmean     faults/sec-21  4145672.4677 (   0.00%)  3722203.7950 * -10.21%*
Hmean     faults/sec-30  3775722.5726 (   0.00%)  3821025.6963 *   1.20%*
Hmean     faults/sec-48  4152563.5864 (   0.00%)  4561016.1604 *   9.84%*
Hmean     faults/sec-56  4163868.7111 (   0.00%)  4528123.3737 *   8.75%*

Both comparisons use the same baseline test results; I just split the
results in two to keep the email formatting readable.
It's also hard to find 56 bytes' worth of fields in vm_area_struct
which are not used during page faults. So, in the end I decided to
keep the vma_lock separate to preserve performance. If you have an idea
of how to combine vm_area_struct fields in a better way, I would
love to try it out.

> But that's a single vma, not up to 7, so it shouldn't be worse?

Yes, I expected that too, but mmtests show only a very small improvement
when I cache-align the vma_lock slab. My spf_test does show about a 10%
regression due to vma_lock cache-line bouncing; however, considering
that it also shows a 90% improvement over baseline, losing 10% of that
improvement to save 56 bytes per vma sounds like a good deal.
I think the lack of a considerable regression here is due to the vma_lock
being touched only twice in the page fault path - when we take it and
when we release it - while vm_area_struct fields are used much more
heavily. So, invalidating the vma_lock cache line does not hit us as hard
as invalidating a part of vm_area_struct.

Looking forward to suggestions and thanks for the review, Vlastimil!




  
Hyeonggon Yoo Jan. 16, 2023, 11:14 a.m. UTC | #3
On Mon, Jan 09, 2023 at 12:53:36PM -0800, Suren Baghdasaryan wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d40bf8a5e19e..294dd44b2198 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
>  	 * mm->mm_lock_seq can't be concurrently modified.
>  	 */
>  	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> -	if (vma->vm_lock_seq == mm_lock_seq)
> +	if (vma->vm_lock->lock_seq == mm_lock_seq)
>  		return;
>  
> -	down_write(&vma->vm_lock->lock);
> -	vma->vm_lock_seq = mm_lock_seq;
> -	up_write(&vma->vm_lock->lock);
> +	if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> +		wait_event(vma->vm_mm->vma_writer_wait,
> +			   atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> +	vma->vm_lock->lock_seq = mm_lock_seq;
> +	/* Write barrier to ensure lock_seq change is visible before count */
> +	smp_wmb();
> +	atomic_set(&vma->vm_lock->count, 0);
>  }
>  
>  /*
> @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
>  static inline bool vma_read_trylock(struct vm_area_struct *vma)
>  {
>  	/* Check before locking. A race might cause false locked result. */
> -	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> +	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
>  		return false;
>  
> -	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> +	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
>  		return false;
>  
> +	/* If atomic_t overflows, restore and fail to lock. */
> +	if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> +		if (atomic_dec_and_test(&vma->vm_lock->count))
> +			wake_up(&vma->vm_mm->vma_writer_wait);
> +		return false;
> +	}
> +
>  	/*
>  	 * Overflow might produce false locked result.
>  	 * False unlocked result is impossible because we modify and check
>  	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
>  	 * modification invalidates all existing locks.
>  	 */
> -	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> -		up_read(&vma->vm_lock->lock);
> +	if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> +		if (atomic_dec_and_test(&vma->vm_lock->count))
> +			wake_up(&vma->vm_mm->vma_writer_wait);
>  		return false;
>  	}

With this change readers can cause writers to starve.
What about checking waitqueue_active() before or after increasing
vma->vm_lock->count?
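
For illustration, the suggested check might look roughly like this (a sketch
only, untested; note that waitqueue_active() needs careful ordering against
the waiter, and placing it before vs. after the increment is exactly the open
question above):

	/* Check before locking. A race might cause false locked result. */
	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	/* Back off if a writer is already queued, so a steady stream of
	 * page-fault readers cannot keep vm_lock->count above zero forever. */
	if (waitqueue_active(&vma->vm_mm->vma_writer_wait))
		return false;

	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
		return false;
	/* ... overflow and lock_seq re-checks unchanged ... */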

--
Thanks,
Hyeonggon
  
Suren Baghdasaryan Jan. 16, 2023, 10:36 p.m. UTC | #4
On Mon, Jan 16, 2023 at 3:15 AM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
>
> On Mon, Jan 09, 2023 at 12:53:36PM -0800, Suren Baghdasaryan wrote:
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index d40bf8a5e19e..294dd44b2198 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >        * mm->mm_lock_seq can't be concurrently modified.
> >        */
> >       mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > -     if (vma->vm_lock_seq == mm_lock_seq)
> > +     if (vma->vm_lock->lock_seq == mm_lock_seq)
> >               return;
> >
> > -     down_write(&vma->vm_lock->lock);
> > -     vma->vm_lock_seq = mm_lock_seq;
> > -     up_write(&vma->vm_lock->lock);
> > +     if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > +             wait_event(vma->vm_mm->vma_writer_wait,
> > +                        atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > +     vma->vm_lock->lock_seq = mm_lock_seq;
> > +     /* Write barrier to ensure lock_seq change is visible before count */
> > +     smp_wmb();
> > +     atomic_set(&vma->vm_lock->count, 0);
> >  }
> >
> >  /*
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >  {
> >       /* Check before locking. A race might cause false locked result. */
> > -     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > +     if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> >               return false;
> >
> > -     if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > +     if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> >               return false;
> >
> > +     /* If atomic_t overflows, restore and fail to lock. */
> > +     if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> > +             return false;
> > +     }
> > +
> >       /*
> >        * Overflow might produce false locked result.
> >        * False unlocked result is impossible because we modify and check
> >        * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >        * modification invalidates all existing locks.
> >        */
> > -     if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > -             up_read(&vma->vm_lock->lock);
> > +     if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> >               return false;
> >       }
>
> With this change readers can cause writers to starve.
> What about checking waitqueue_active() before or after increasing
> vma->vm_lock->count?

The readers are in the page fault path, which is the fast path, while
writers performing updates are in the slow path. Therefore I *think*
starving writers should not be a big issue. So far I haven't seen issues
with that in benchmarks, but maybe there is such a case?

>
> --
> Thanks,
> Hyeonggon
>
  
Suren Baghdasaryan Jan. 16, 2023, 11:08 p.m. UTC | #5
On Mon, Jan 16, 2023 at 6:07 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 9 Jan 2023 12:53:36 -0800 Suren Baghdasaryan <surenb@google.com>
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >        * mm->mm_lock_seq can't be concurrently modified.
> >        */
> >       mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > -     if (vma->vm_lock_seq == mm_lock_seq)
> > +     if (vma->vm_lock->lock_seq == mm_lock_seq)
> >               return;
>
>         lock acquire for write to info lockdep.

Thanks for the review Hillf!

Good idea. Will add in the next version.

> >
> > -     down_write(&vma->vm_lock->lock);
> > -     vma->vm_lock_seq = mm_lock_seq;
> > -     up_write(&vma->vm_lock->lock);
> > +     if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > +             wait_event(vma->vm_mm->vma_writer_wait,
> > +                        atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > +     vma->vm_lock->lock_seq = mm_lock_seq;
> > +     /* Write barrier to ensure lock_seq change is visible before count */
> > +     smp_wmb();
> > +     atomic_set(&vma->vm_lock->count, 0);
> >  }
> >
> >  /*
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >  {
> >       /* Check before locking. A race might cause false locked result. */
> > -     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > +     if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> >               return false;
>
> Add mb to pair with the above wmb like

The wmb above is to ensure the ordering between updates of lock_seq
and vm_lock->count (lock_seq is updated first and vm_lock->count only
after that). The first access to vm_lock->count in this function is
atomic_inc_unless_negative(), which is an atomic RMW operation with a
return value. According to the documentation, such functions are fully
ordered, therefore I think we already have an implicit full memory
barrier between the reads of lock_seq and vm_lock->count here. Am I wrong?
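
Condensed from the hunks quoted in this thread, the pairing being argued
about looks like this (annotation only; whether the RMW ordering argument
actually holds is the point under discussion):

/* Writer side (vma_write_lock), per the hunk quoted earlier: */
vma->vm_lock->lock_seq = mm_lock_seq;
smp_wmb();				/* order the lock_seq store ... */
atomic_set(&vma->vm_lock->count, 0);	/* ... before the count store   */

/* Reader side (vma_read_trylock): */
if (!atomic_inc_unless_negative(&vma->vm_lock->count))
	return false;			/* fully ordered RMW on success ... */
/* ... so the later lock_seq re-check cannot be reordered before it
 * (the claim under discussion). */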

>
>         if (READ_ONCE(vma->vm_lock->lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
>                 smp_acquire__after_ctrl_dep();
>                 return false;
>         }
> >
> > -     if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > +     if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> >               return false;
> >
> > +     /* If atomic_t overflows, restore and fail to lock. */
> > +     if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> > +             return false;
> > +     }
> > +
> >       /*
> >        * Overflow might produce false locked result.
> >        * False unlocked result is impossible because we modify and check
> >        * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >        * modification invalidates all existing locks.
> >        */
> > -     if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > -             up_read(&vma->vm_lock->lock);
> > +     if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> >               return false;
> >       }
>
> Simpler way to detect write lock owner and count overflow like
>
>         int count = atomic_read(&vma->vm_lock->count);
>         for (;;) {
>                 int new = count + 1;
>
>                 if (count < 0 || new < 0)
>                         return false;
>
>                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
>                 if (new == count)
>                         break;
>                 count = new;
>                 cpu_relax();
>         }
>
>         (wake up waiting readers after taking the lock;
>         but the write lock owner before this read trylock should be
>         responsible for waking waiters up.)
>
>         lock acquire for read.

This scheme might cause readers to wait, which is not an exact
replacement for down_read_trylock(). The requirement to wake up
waiting readers also complicates things, and since we can always fall
back to mmap_lock, that complication is unnecessary IMHO. I could use
part of your suggestion like this:

                 int new = count + 1;

                 if (count < 0 || new < 0)
                         return false;

                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
                 if (new == count)
                         return false;

Compared to doing atomic_inc_unless_negative() first, like I did
originally, this scheme opens a slightly wider window for the writer to
get in the middle and cause the reader to fail locking, but I don't think
it would result in any visible regression.

>
> >       return true;
> > @@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >
> >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> >  {
>         lock release for read.

Ack.

>
> > -     up_read(&vma->vm_lock->lock);
> > +     if (atomic_dec_and_test(&vma->vm_lock->count))
> > +             wake_up(&vma->vm_mm->vma_writer_wait);
> >  }
>
  
Suren Baghdasaryan Jan. 16, 2023, 11:11 p.m. UTC | #6
On Mon, Jan 16, 2023 at 3:08 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jan 16, 2023 at 6:07 AM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Mon, 9 Jan 2023 12:53:36 -0800 Suren Baghdasaryan <surenb@google.com>
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > >        * mm->mm_lock_seq can't be concurrently modified.
> > >        */
> > >       mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > > -     if (vma->vm_lock_seq == mm_lock_seq)
> > > +     if (vma->vm_lock->lock_seq == mm_lock_seq)
> > >               return;
> >
> >         lock acquire for write to info lockdep.
>
> Thanks for the review Hillf!
>
> Good idea. Will add in the next version.
>
> > >
> > > -     down_write(&vma->vm_lock->lock);
> > > -     vma->vm_lock_seq = mm_lock_seq;
> > > -     up_write(&vma->vm_lock->lock);
> > > +     if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > > +             wait_event(vma->vm_mm->vma_writer_wait,
> > > +                        atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > > +     vma->vm_lock->lock_seq = mm_lock_seq;
> > > +     /* Write barrier to ensure lock_seq change is visible before count */
> > > +     smp_wmb();
> > > +     atomic_set(&vma->vm_lock->count, 0);
> > >  }
> > >
> > >  /*
> > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > >  {
> > >       /* Check before locking. A race might cause false locked result. */
> > > -     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > +     if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > >               return false;
> >
> > Add mb to pair with the above wmb like
>
> The wmb above is to ensure the ordering between updates of lock_seq
> and vm_lock->count (lock_seq is updated first and vm_lock->count only
> after that). The first access to vm_lock->count in this function is
> atomic_inc_unless_negative() and it's an atomic RMW operation with a
> return value. According to documentation such functions are fully
> ordered, therefore I think we already have an implicit full memory
> barrier between reads of lock_seq and vm_lock->count here. Am I wrong?
>
> >
> >         if (READ_ONCE(vma->vm_lock->lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
> >                 smp_acquire__after_ctrl_dep();
> >                 return false;
> >         }
> > >
> > > -     if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > +     if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > >               return false;
> > >
> > > +     /* If atomic_t overflows, restore and fail to lock. */
> > > +     if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> > > +             return false;
> > > +     }
> > > +
> > >       /*
> > >        * Overflow might produce false locked result.
> > >        * False unlocked result is impossible because we modify and check
> > >        * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > >        * modification invalidates all existing locks.
> > >        */
> > > -     if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > -             up_read(&vma->vm_lock->lock);
> > > +     if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > +             if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +                     wake_up(&vma->vm_mm->vma_writer_wait);
> > >               return false;
> > >       }
> >
> > Simpler way to detect write lock owner and count overflow like
> >
> >         int count = atomic_read(&vma->vm_lock->count);
> >         for (;;) {
> >                 int new = count + 1;
> >
> >                 if (count < 0 || new < 0)
> >                         return false;
> >
> >                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> >                 if (new == count)
> >                         break;
> >                 count = new;
> >                 cpu_relax();
> >         }
> >
> >         (wake up waiting readers after taking the lock;
> >         but the write lock owner before this read trylock should be
> >         responsible for waking waiters up.)
> >
> >         lock acquire for read.
>
> This schema might cause readers to wait, which is not an exact
> replacement for down_read_trylock(). The requirement to wake up
> waiting readers also complicates things and since we can always fall
> back to mmap_lock, that complication is unnecessary IMHO. I could use
> part of your suggestion like this:
>
>                  int new = count + 1;
>
>                  if (count < 0 || new < 0)
>                          return false;
>
>                  new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
>                  if (new == count)
>                          return false;

Made a mistake above. It should have been:
                  if (new != count)
                          return false;


>
> Compared to doing atomic_inc_unless_negative() first, like I did
> originally, this schema opens a bit wider window for the writer to get
> in the middle and cause the reader to fail locking but I don't think
> it would result in any visible regression.
>
> >
> > >       return true;
> > > @@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > >
> > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > >  {
> >         lock release for read.
>
> Ack.
>
> >
> > > -     up_read(&vma->vm_lock->lock);
> > > +     if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +             wake_up(&vma->vm_mm->vma_writer_wait);
> > >  }
> >
  
Matthew Wilcox Jan. 17, 2023, 4:14 a.m. UTC | #7
On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >  {
> >  	/* Check before locking. A race might cause false locked result. */
> > -	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > +	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> >  		return false;
> >  
> > -	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > +	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> >  		return false;
> >  
> > +	/* If atomic_t overflows, restore and fail to lock. */
> > +	if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > +		if (atomic_dec_and_test(&vma->vm_lock->count))
> > +			wake_up(&vma->vm_mm->vma_writer_wait);
> > +		return false;
> > +	}
> > +
> >  	/*
> >  	 * Overflow might produce false locked result.
> >  	 * False unlocked result is impossible because we modify and check
> >  	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> >  	 * modification invalidates all existing locks.
> >  	 */
> > -	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > -		up_read(&vma->vm_lock->lock);
> > +	if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > +		if (atomic_dec_and_test(&vma->vm_lock->count))
> > +			wake_up(&vma->vm_mm->vma_writer_wait);
> >  		return false;
> >  	}
> 
> With this change readers can cause writers to starve.
> What about checking waitqueue_active() before or after increasing
> vma->vm_lock->count?

I don't understand how readers can starve a writer.  Readers do
atomic_inc_unless_negative() so a writer can always force readers
to fail.
  
Suren Baghdasaryan Jan. 17, 2023, 4:34 a.m. UTC | #8
On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > >  {
> > >     /* Check before locking. A race might cause false locked result. */
> > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > >             return false;
> > >
> > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > >             return false;
> > >
> > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > +           return false;
> > > +   }
> > > +
> > >     /*
> > >      * Overflow might produce false locked result.
> > >      * False unlocked result is impossible because we modify and check
> > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > >      * modification invalidates all existing locks.
> > >      */
> > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > -           up_read(&vma->vm_lock->lock);
> > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > >             return false;
> > >     }
> >
> > With this change readers can cause writers to starve.
> > What about checking waitqueue_active() before or after increasing
> > vma->vm_lock->count?
>
> I don't understand how readers can starve a writer.  Readers do
> atomic_inc_unless_negative() so a writer can always force readers
> to fail.

I think the point here was that if page faults keep occurring and they
prevent vm_lock->count from reaching 0, then a writer will be blocked
and there is no reader throttling mechanism (no maximum time that the
writer will be waiting).
  
Suren Baghdasaryan Jan. 17, 2023, 4:52 a.m. UTC | #9
On Mon, Jan 16, 2023 at 7:16 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 16 Jan 2023 15:08:48 -0800 Suren Baghdasaryan <surenb@google.com>
> > On Mon, Jan 16, 2023 at 6:07 AM Hillf Danton <hdanton@sina.com> wrote:
> > > On Mon, 9 Jan 2023 12:53:36 -0800 Suren Baghdasaryan <surenb@google.com>
> > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > >  {
> > > >       /* Check before locking. A race might cause false locked result. */
> > > > -     if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > +     if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > >               return false;
> > >
> > > Add mb to pair with the above wmb like
> >
> > The wmb above is to ensure the ordering between updates of lock_seq
> > and vm_lock->count (lock_seq is updated first and vm_lock->count only
> > after that). The first access to vm_lock->count in this function is
> > atomic_inc_unless_negative() and it's an atomic RMW operation with a
> > return value. According to documentation such functions are fully
> > ordered, therefore I think we already have an implicit full memory
> > barrier between reads of lock_seq and vm_lock->count here. Am I wrong?
>
> No you are not.

I'm not wrong or the other way around? Please expand a bit.

> Revisit it in case of full mb not ensured.
>
> #ifndef arch_atomic_inc_unless_negative
> static __always_inline bool
> arch_atomic_inc_unless_negative(atomic_t *v)
> {
>         int c = arch_atomic_read(v);
>
>         do {
>                 if (unlikely(c < 0))
>                         return false;
>         } while (!arch_atomic_try_cmpxchg(v, &c, c + 1));
>
>         return true;
> }
> #define arch_atomic_inc_unless_negative arch_atomic_inc_unless_negative
> #endif

I think your point is that the full mb is not ensured here, specifically if c<0?

>
> > >         int count = atomic_read(&vma->vm_lock->count);
> > >         for (;;) {
> > >                 int new = count + 1;
> > >
> > >                 if (count < 0 || new < 0)
> > >                         return false;
> > >
> > >                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> > >                 if (new == count)
> > >                         break;
> > >                 count = new;
> > >                 cpu_relax();
> > >         }
> > >
> > >         (wake up waiting readers after taking the lock;
> > >         but the write lock owner before this read trylock should be
> > >         responsible for waking waiters up.)
> > >
> > >         lock acquire for read.
> >
> > This schema might cause readers to wait, which is not an exact
>
> Could you specify a bit more on wait?

Oh, I misunderstood your intent with that for() loop. Indeed, if a
writer took the lock, the count will be negative and the loop terminates
with a failure. Yeah, I think that would work.

>
> > replacement for down_read_trylock(). The requirement to wake up
> > waiting readers also complicates things
>
> If the writer lock owner is preempted by a reader while releasing lock,
>
>         set count to zero
>                           <-- preempt
>         wake up waiters
>
> then lock is owned by reader but with read waiters.
>
> That is buggy if write waiter starvation is allowed in this patchset.

I don't quite understand your point here. Readers don't wait, so there
can't be "read waiters". Could you please expand with a race diagram
maybe?

>
> > and since we can always fall
> > back to mmap_lock, that complication is unnecessary IMHO. I could use
> > part of your suggestion like this:
> >
> >                  int new = count + 1;
> >
> >                  if (count < 0 || new < 0)
> >                          return false;
> >
> >                  new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> >                  if (new == count)
> >                          return false;
> >
> > Compared to doing atomic_inc_unless_negative() first, like I did
> > originally, this schema opens a bit wider window for the writer to get
> > in the middle and cause the reader to fail locking but I don't think
> > it would result in any visible regression.
>
> It is definitely fine for writer to acquire the lock while reader is
> doing trylock, no?

Yes.

>
  
Matthew Wilcox Jan. 17, 2023, 5:46 a.m. UTC | #10
On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > >  {
> > > >     /* Check before locking. A race might cause false locked result. */
> > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > >             return false;
> > > >
> > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > >             return false;
> > > >
> > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > +           return false;
> > > > +   }
> > > > +
> > > >     /*
> > > >      * Overflow might produce false locked result.
> > > >      * False unlocked result is impossible because we modify and check
> > > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > >      * modification invalidates all existing locks.
> > > >      */
> > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > -           up_read(&vma->vm_lock->lock);
> > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > >             return false;
> > > >     }
> > >
> > > With this change readers can cause writers to starve.
> > > What about checking waitqueue_active() before or after increasing
> > > vma->vm_lock->count?
> >
> > I don't understand how readers can starve a writer.  Readers do
> > atomic_inc_unless_negative() so a writer can always force readers
> > to fail.
> 
> > I think the point here was that if page faults keep occurring and they
> prevent vm_lock->count from reaching 0 then a writer will be blocked
> and there is no reader throttling mechanism (no max time that writer
> will be waiting).

Perhaps I misunderstood your description; I thought that a _waiting_
writer would make the count negative, not a successfully acquiring
writer.
  
Suren Baghdasaryan Jan. 17, 2023, 5:58 a.m. UTC | #11
On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > >  {
> > > > >     /* Check before locking. A race might cause false locked result. */
> > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > >             return false;
> > > > >
> > > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > >             return false;
> > > > >
> > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > +           return false;
> > > > > +   }
> > > > > +
> > > > >     /*
> > > > >      * Overflow might produce false locked result.
> > > > >      * False unlocked result is impossible because we modify and check
> > > > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > >      * modification invalidates all existing locks.
> > > > >      */
> > > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > -           up_read(&vma->vm_lock->lock);
> > > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >             return false;
> > > > >     }
> > > >
> > > > With this change readers can cause writers to starve.
> > > > What about checking waitqueue_active() before or after increasing
> > > > vma->vm_lock->count?
> > >
> > > I don't understand how readers can starve a writer.  Readers do
> > > atomic_inc_unless_negative() so a writer can always force readers
> > > to fail.
> >
> > I think the point here was that if page faults keep occurring and they
> > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > and there is no reader throttling mechanism (no max time that writer
> > will be waiting).
>
> Perhaps I misunderstood your description; I thought that a _waiting_
> writer would make the count negative, not a successfully acquiring
> writer.

A waiting writer does not modify the counter; instead it's placed on
the wait queue, and the last reader, which sets the count to 0 while
releasing its read lock, will wake it up. Once the writer is woken it
will try to set the count to a negative value; if successful it will own
the lock, otherwise it goes back to sleep.
  
Jann Horn Jan. 17, 2023, 6:11 p.m. UTC | #12
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
[...]
>  static inline void vma_read_unlock(struct vm_area_struct *vma)
>  {
> -       up_read(&vma->vm_lock->lock);
> +       if (atomic_dec_and_test(&vma->vm_lock->count))
> +               wake_up(&vma->vm_mm->vma_writer_wait);
>  }

I haven't properly reviewed this, but this bit looks like a
use-after-free because you're accessing the vma after dropping your
reference on it. You'd have to first look up the vma->vm_mm, then do
the atomic_dec_and_test(), and afterwards do the wake_up() without
touching the vma. Or alternatively wrap the whole thing in an RCU
read-side critical section if the VMA is freed with RCU delay.
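
One way to restructure it along those lines (a sketch of the suggestion
above, not necessarily what a later revision of the patch does):

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;	/* read before dropping our count */

	if (atomic_dec_and_test(&vma->vm_lock->count))
		wake_up(&mm->vma_writer_wait);	/* vma is not touched after the dec */
}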
  
Suren Baghdasaryan Jan. 17, 2023, 6:21 p.m. UTC | #13
On Tue, Jan 17, 2023 at 12:34 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 16 Jan 2023 20:52:45 -0800 Suren Baghdasaryan <surenb@google.com>
> > On Mon, Jan 16, 2023 at 7:16 PM Hillf Danton <hdanton@sina.com> wrote:
> > > No you are not.
> >
> > I'm not wrong or the other way around? Please expand a bit.
>
> You are not wrong.

Ok, I think we should be fine if I rewrite vma_read_trylock() like this:

static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
	int count, new;

	/* Check before locking. A race might cause false locked result. */
	if (READ_ONCE(vma->vm_lock->lock_seq) ==
	    READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	count = atomic_read(&vma->vm_lock->count);
	for (;;) {
		/*
		 * Is the VMA write-locked? Overflow might produce false
		 * locked result.
		 * False unlocked result is impossible because we modify and
		 * check vma->vm_lock_seq under vma->vm_lock protection and
		 * mm->mm_lock_seq modification invalidates all existing locks.
		 */
		if (count < 0)
			return false;

		new = count + 1;
		/* If atomic_t overflows, fail to lock. */
		if (new < 0)
			return false;

		/*
		 * Atomic RMW will provide an implicit mb on success to pair
		 * with the smp_wmb in vma_write_lock; on failure we retry.
		 */
		new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
		if (new == count)
			break;
		count = new;
		cpu_relax();
	}
	if (unlikely(READ_ONCE(vma->vm_lock->lock_seq) ==
		     READ_ONCE(vma->vm_mm->mm_lock_seq))) {
		if (atomic_dec_and_test(&vma->vm_lock->count))
			wake_up(&vma->vm_mm->vma_writer_wait);
		return false;
	}
	return true;
}
> > >
> > > If the writer lock owner is preempted by a reader while releasing lock,
> > >
> > >         set count to zero
> > >                           <-- preempt
> > >         wake up waiters
> > >
> > > then lock is owned by reader but with read waiters.
> > >
> > > That is buggy if write waiter starvation is allowed in this patchset.
> >
> > I don't quite understand your point here. Readers don't wait, so there
> > can't be "read waiters". Could you please expand with a race diagram
> > maybe?
>
>         cpu3                    cpu2
>         ---                     ---
>         taskA bond to cpu3
>         down_write(&mm->mmap_lock);
>         vma_write_lock L
>                                 taskB fail to take L for read
>                                 taskC fail to take mmap_lock for write
>                                 taskD fail to take L for read
>         vma_write_unlock_mm(mm);
>
>         preempted by taskE
>            taskE take L for read and
>            read waiters of L, taskB and taskD,
>            should be woken up
>
>         up_write(&mm->mmap_lock);

Readers never wait for the vma lock; that's why we have only
vma_read_trylock() and no vma_read_lock(). In your scenario taskB and
taskD will fall back to taking mmap_lock for read after vma_read_trylock()
fails for them. Once taskA does up_write(mmap_lock) they will be
woken up, since they are blocked on taking mmap_lock for read.

>
  
Matthew Wilcox Jan. 17, 2023, 6:23 p.m. UTC | #14
On Mon, Jan 16, 2023 at 09:58:35PM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > >  {
> > > > > >     /* Check before locking. A race might cause false locked result. */
> > > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > >             return false;
> > > > > >
> > > > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > >             return false;
> > > > > >
> > > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > +           return false;
> > > > > > +   }
> > > > > > +
> > > > > >     /*
> > > > > >      * Overflow might produce false locked result.
> > > > > >      * False unlocked result is impossible because we modify and check
> > > > > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > > >      * modification invalidates all existing locks.
> > > > > >      */
> > > > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > -           up_read(&vma->vm_lock->lock);
> > > > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > >             return false;
> > > > > >     }
> > > > >
> > > > > With this change readers can cause writers to starve.
> > > > > What about checking waitqueue_active() before or after increasing
> > > > > vma->vm_lock->count?
> > > >
> > > > I don't understand how readers can starve a writer.  Readers do
> > > > atomic_inc_unless_negative() so a writer can always force readers
> > > > to fail.
> > >
> > > > I think the point here was that if page faults keep occurring and they
> > > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > > and there is no reader throttling mechanism (no max time that writer
> > > will be waiting).
> >
> > Perhaps I misunderstood your description; I thought that a _waiting_
> > writer would make the count negative, not a successfully acquiring
> > writer.
> 
> A waiting writer does not modify the counter, instead it's placed on
> the wait queue and the last reader which sets the count to 0 while
> releasing its read lock will wake it up. Once the writer is woken it
> will try to set the count to negative and if successful will own the
> lock, otherwise it goes back to sleep.

Then yes, that's a starvable lock.  Preventing starvation on the mmap
sem was the original motivation for making rwsems non-starvable, so
changing that behaviour now seems like a bad idea.  For efficiency, I'd
suggest that a waiting writer set the top bit of the counter.  That way,
all new readers will back off without needing to check a second variable
and old readers will know that they *may* need to do the wakeup when
atomic_sub_return_release() is negative.

(rwsem.c has a more complex bitfield, but I don't think we need to go
that far; the important point is that the waiting writer indicates its
presence in the count field so that readers can modify their behaviour)
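
A rough sketch of that suggestion (the bit name and exact helpers here are
invented for illustration; this is not code from the series):

#define VMA_LOCK_WRITER_WAITING	INT_MIN		/* top bit of vm_lock->count */

static inline void vma_write_lock(struct vm_area_struct *vma)
{
	/* Only one writer at a time (mmap_lock held for write): announce
	 * ourselves, so new readers see a negative count and back off. */
	atomic_or(VMA_LOCK_WRITER_WAITING, &vma->vm_lock->count);

	/* Wait for the readers that got in before the bit was set to drain;
	 * spurious wakeups are fine, wait_event() re-checks the condition. */
	wait_event(vma->vm_mm->vma_writer_wait,
		   atomic_read(&vma->vm_lock->count) == VMA_LOCK_WRITER_WAITING);
	/* ... update lock_seq and release as before ... */
}

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;	/* per Jann's comment above */

	/* A negative result means the writer bit is set; this reader may have
	 * been the last one in, so it is responsible for the wakeup. */
	if (atomic_sub_return_release(1, &vma->vm_lock->count) < 0)
		wake_up(&mm->vma_writer_wait);
}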
  
Suren Baghdasaryan Jan. 17, 2023, 6:26 p.m. UTC | #15
On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> [...]
> >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> >  {
> > -       up_read(&vma->vm_lock->lock);
> > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > +               wake_up(&vma->vm_mm->vma_writer_wait);
> >  }
>
> I haven't properly reviewed this, but this bit looks like a
> use-after-free because you're accessing the vma after dropping your
> reference on it. You'd have to first look up the vma->vm_mm, then do
> the atomic_dec_and_test(), and afterwards do the wake_up() without
> touching the vma. Or alternatively wrap the whole thing in an RCU
> read-side critical section if the VMA is freed with RCU delay.

vm_lock->count does not control the lifetime of the VMA; it's a
count of how many readers took the lock, or a negative value if the lock
is write-locked.
  
Matthew Wilcox Jan. 17, 2023, 6:27 p.m. UTC | #16
On Tue, Jan 17, 2023 at 10:21:28AM -0800, Suren Baghdasaryan wrote:
> static inline bool vma_read_trylock(struct vm_area_struct *vma)
> {
>        int count, new;
> 
>         /* Check before locking. A race might cause false locked result. */
>        if (READ_ONCE(vma->vm_lock->lock_seq) ==
>            READ_ONCE(vma->vm_mm->mm_lock_seq))
>                 return false;
> 
>         count = atomic_read(&vma->vm_lock->count);
>         for (;;) {
>               /*
>                * Is VMA is write-locked? Overflow might produce false
> locked result.
>                * False unlocked result is impossible because we modify and check
>                * vma->vm_lock_seq under vma->vm_lock protection and
> mm->mm_lock_seq
>                * modification invalidates all existing locks.
>                */
>               if (count < 0)
>                         return false;
> 
>              new = count + 1;
>              /* If atomic_t overflows, fail to lock. */
>              if (new < 0)
>                         return false;
> 
>              /*
>               * Atomic RMW will provide implicit mb on success to pair
> with smp_wmb in
>               * vma_write_lock, on failure we retry.
>               */
>               new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
>                 if (new == count)
>                         break;
>                 count = new;
>                 cpu_relax();

The cpu_relax() is exactly the wrong thing to do here.  See this thread:
https://lore.kernel.org/linux-fsdevel/20230113184447.1707316-1-mjguzik@gmail.com/
  
Suren Baghdasaryan Jan. 17, 2023, 6:28 p.m. UTC | #17
On Tue, Jan 17, 2023 at 10:23 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jan 16, 2023 at 09:58:35PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > > >  {
> > > > > > >     /* Check before locking. A race might cause false locked result. */
> > > > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > > +   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > >             return false;
> > > > > > >
> > > > > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > > >             return false;
> > > > > > >
> > > > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > +           return false;
> > > > > > > +   }
> > > > > > > +
> > > > > > >     /*
> > > > > > >      * Overflow might produce false locked result.
> > > > > > >      * False unlocked result is impossible because we modify and check
> > > > > > >      * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > > > >      * modification invalidates all existing locks.
> > > > > > >      */
> > > > > > > -   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > -           up_read(&vma->vm_lock->lock);
> > > > > > > +   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > +           if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > +                   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > >             return false;
> > > > > > >     }
> > > > > >
> > > > > > With this change readers can cause writers to starve.
> > > > > > What about checking waitqueue_active() before or after increasing
> > > > > > vma->vm_lock->count?
> > > > >
> > > > > I don't understand how readers can starve a writer.  Readers do
> > > > > atomic_inc_unless_negative() so a writer can always force readers
> > > > > to fail.
> > > >
> > > > I think the point here was that if page faults keep occurring and they
> > > > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > > > and there is no reader throttling mechanism (no max time that writer
> > > > will be waiting).
> > >
> > > Perhaps I misunderstood your description; I thought that a _waiting_
> > > writer would make the count negative, not a successfully acquiring
> > > writer.
> >
> > A waiting writer does not modify the counter, instead it's placed on
> > the wait queue and the last reader which sets the count to 0 while
> > releasing its read lock will wake it up. Once the writer is woken it
> > will try to set the count to negative and if successful will own the
> > lock, otherwise it goes back to sleep.
>
> Then yes, that's a starvable lock.  Preventing starvation on the mmap
> sem was the original motivation for making rwsems non-starvable, so
> changing that behaviour now seems like a bad idea.  For efficiency, I'd
> suggest that a waiting writer set the top bit of the counter.  That way,
> all new readers will back off without needing to check a second variable
> and old readers will know that they *may* need to do the wakeup when
> atomic_sub_return_release() is negative.
>
> (rwsem.c has a more complex bitfield, but I don't think we need to go
> that far; the important point is that the waiting writer indicates its
> presence in the count field so that readers can modify their behaviour)

Got it. Ok, I think we can figure something out to check if there are
waiting write-lockers and prevent new readers from taking the lock.
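
For concreteness, a minimal sketch of the reader side of the scheme Matthew
outlines above, on top of the vma_lock/vma_writer_wait fields added by this
patch. The VMA_WRITER_WAITING bit, the _sketch names and the writer side
(which would set the bit before sleeping) are illustrative assumptions, not
part of the posted code:

#include <linux/atomic.h>
#include <linux/limits.h>
#include <linux/mm_types.h>
#include <linux/wait.h>

/* Illustrative: a waiting writer sets the sign bit of vm_lock->count. */
#define VMA_WRITER_WAITING	INT_MIN

static inline bool vma_read_trylock_sketch(struct vm_area_struct *vma)
{
	int count = atomic_read(&vma->vm_lock->count);

	do {
		/* Negative covers both "write-locked" and "writer waiting". */
		if (count < 0)
			return false;
	} while (!atomic_try_cmpxchg(&vma->vm_lock->count, &count, count + 1));

	return true;
}

static inline void vma_read_unlock_sketch(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;	/* don't touch vma after the dec */
	int count = atomic_sub_return_release(1, &vma->vm_lock->count);

	/*
	 * If a writer set VMA_WRITER_WAITING, the reader that brings the
	 * count back down to exactly that value is the last one out and
	 * wakes the writer.
	 */
	if (count == VMA_WRITER_WAITING)
		wake_up(&mm->vma_writer_wait);
}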
  
Matthew Wilcox Jan. 17, 2023, 6:31 p.m. UTC | #18
On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > considerable space for each vm_area_struct. However vma_lock has
> > > two important specifics which can be used to replace rw_semaphore
> > > with a simpler structure:
> > [...]
> > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > >  {
> > > -       up_read(&vma->vm_lock->lock);
> > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > >  }
> >
> > I haven't properly reviewed this, but this bit looks like a
> > use-after-free because you're accessing the vma after dropping your
> > reference on it. You'd have to first look up the vma->vm_mm, then do
> > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > touching the vma. Or alternatively wrap the whole thing in an RCU
> > read-side critical section if the VMA is freed with RCU delay.
> 
> vm_lock->count does not control the lifetime of the VMA, it's a
> counter of how many readers took the lock or it's negative if the lock
> is write-locked.

Yes, but ...
	
	Task A:
	atomic_dec_and_test(&vma->vm_lock->count)
			Task B:
			munmap()
			write lock
			free VMA
			synchronize_rcu()
			VMA is really freed
        wake_up(&vma->vm_mm->vma_writer_wait);

... vma is freed.

Now, I think this doesn't occur.  I'm pretty sure that every caller of
vma_read_unlock() is holding the RCU read lock.  But maybe we should
have that assertion?
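
A minimal sketch of that assertion; it assumes, as Matthew does here (and as
Jann questions just below), that every caller really does hold the RCU read
lock:

#include <linux/rcupdate.h>

/* Sketch only; builds on the vma_read_unlock() from the patch below. */
static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
			 "vma_read_unlock() outside RCU read-side critical section");
	if (atomic_dec_and_test(&vma->vm_lock->count))
		wake_up(&vma->vm_mm->vma_writer_wait);
}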
  
Suren Baghdasaryan Jan. 17, 2023, 6:31 p.m. UTC | #19
On Tue, Jan 17, 2023 at 10:27 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 10:21:28AM -0800, Suren Baghdasaryan wrote:
> > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > {
> >         int count, new;
> >
> >         /* Check before locking. A race might cause false locked result. */
> >         if (READ_ONCE(vma->vm_lock->lock_seq) ==
> >             READ_ONCE(vma->vm_mm->mm_lock_seq))
> >                 return false;
> >
> >         count = atomic_read(&vma->vm_lock->count);
> >         for (;;) {
> >                 /*
> >                  * Is the VMA write-locked? Overflow might produce false
> >                  * locked result.
> >                  * False unlocked result is impossible because we modify
> >                  * and check vma->vm_lock_seq under vma->vm_lock protection
> >                  * and mm->mm_lock_seq modification invalidates all
> >                  * existing locks.
> >                  */
> >                 if (count < 0)
> >                         return false;
> >
> >                 new = count + 1;
> >                 /* If atomic_t overflows, fail to lock. */
> >                 if (new < 0)
> >                         return false;
> >
> >                 /*
> >                  * Atomic RMW will provide implicit mb on success to pair
> >                  * with smp_wmb in vma_write_lock, on failure we retry.
> >                  */
> >                 new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> >                 if (new == count)
> >                         break;
> >                 count = new;
> >                 cpu_relax();
>
> The cpu_relax() is exactly the wrong thing to do here.  See this thread:
> https://lore.kernel.org/linux-fsdevel/20230113184447.1707316-1-mjguzik@gmail.com/

Thanks for the pointer, Matthew. I think we can safely remove
cpu_relax() since it's unlikely the count is constantly changing under
a reader.

>
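
A hedged sketch of that trylock loop with the cpu_relax() dropped, using
atomic_try_cmpxchg() so a failed attempt simply retries with the value it just
observed; this follows the snippet quoted above, not the code that ends up
being posted:

#include <linux/atomic.h>
#include <linux/limits.h>
#include <linux/mm_types.h>

static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
	int count;

	/* Check before locking. A race might cause a false locked result. */
	if (READ_ONCE(vma->vm_lock->lock_seq) ==
	    READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	count = atomic_read(&vma->vm_lock->count);
	do {
		/* Write-locked, or incrementing would overflow the atomic_t. */
		if (count < 0 || count == INT_MAX)
			return false;
		/*
		 * On failure atomic_try_cmpxchg() refreshes 'count' with the
		 * value it observed, so there is nothing to wait for and no
		 * cpu_relax().  On success the full RMW barrier pairs with
		 * the smp_wmb() in vma_write_lock(), as in the version above.
		 */
	} while (!atomic_try_cmpxchg(&vma->vm_lock->count, &count, count + 1));

	return true;
}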
  
Jann Horn Jan. 17, 2023, 6:36 p.m. UTC | #20
On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > >  {
> > > > -       up_read(&vma->vm_lock->lock);
> > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > >  }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
>         Task A:
>         atomic_dec_and_test(&vma->vm_lock->count)
>                         Task B:
>                         munmap()
>                         write lock
>                         free VMA
>                         synchronize_rcu()
>                         VMA is really freed
>         wake_up(&vma->vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur.  I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock.  But maybe we should
> have that assertion?

I don't see that. When do_user_addr_fault() is calling
vma_read_unlock(), there's no RCU read lock held, right?
  
Suren Baghdasaryan Jan. 17, 2023, 6:36 p.m. UTC | #21
On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > >  {
> > > > -       up_read(&vma->vm_lock->lock);
> > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > >  }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
>         Task A:
>         atomic_dec_and_test(&vma->vm_lock->count)
>                         Task B:
>                         munmap()
>                         write lock
>                         free VMA
>                         synchronize_rcu()
>                         VMA is really freed
>         wake_up(&vma->vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur.  I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock.  But maybe we should
> have that assertion?

Yep, that's what this patch is doing
https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
by calling vma_assert_no_reader() from __vm_area_free().

>
  
Matthew Wilcox Jan. 17, 2023, 6:48 p.m. UTC | #22
On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > two important specifics which can be used to replace rw_semaphore
> > > > > with a simpler structure:
> > > > [...]
> > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > >  {
> > > > > -       up_read(&vma->vm_lock->lock);
> > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >  }
> > > >
> > > > I haven't properly reviewed this, but this bit looks like a
> > > > use-after-free because you're accessing the vma after dropping your
> > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > read-side critical section if the VMA is freed with RCU delay.
> > >
> > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > counter of how many readers took the lock or it's negative if the lock
> > > is write-locked.
> >
> > Yes, but ...
> >
> >         Task A:
> >         atomic_dec_and_test(&vma->vm_lock->count)
> >                         Task B:
> >                         munmap()
> >                         write lock
> >                         free VMA
> >                         synchronize_rcu()
> >                         VMA is really freed
> >         wake_up(&vma->vm_mm->vma_writer_wait);
> >
> > ... vma is freed.
> >
> > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > have that assertion?
> 
> Yep, that's what this patch is doing
> https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
> by calling vma_assert_no_reader() from __vm_area_free().

That's not enough though.  Task A still has a pointer to vma after it
has called atomic_dec_and_test(), even after vma has been freed by
Task B, and before Task A dereferences vma->vm_mm.
  
Suren Baghdasaryan Jan. 17, 2023, 6:49 p.m. UTC | #23
On Tue, Jan 17, 2023 at 10:36 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > two important specifics which can be used to replace rw_semaphore
> > > > > with a simpler structure:
> > > > [...]
> > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > >  {
> > > > > -       up_read(&vma->vm_lock->lock);
> > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >  }
> > > >
> > > > I haven't properly reviewed this, but this bit looks like a
> > > > use-after-free because you're accessing the vma after dropping your
> > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > read-side critical section if the VMA is freed with RCU delay.
> > >
> > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > counter of how many readers took the lock or it's negative if the lock
> > > is write-locked.
> >
> > Yes, but ...
> >
> >         Task A:
> >         atomic_dec_and_test(&vma->vm_lock->count)
> >                         Task B:
> >                         munmap()
> >                         write lock
> >                         free VMA
> >                         synchronize_rcu()
> >                         VMA is really freed
> >         wake_up(&vma->vm_mm->vma_writer_wait);
> >
> > ... vma is freed.
> >
> > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > have that assertion?
>
> I don't see that. When do_user_addr_fault() is calling
> vma_read_unlock(), there's no RCU read lock held, right?

We free VMAs using call_rcu() after removing them from the VMA tree. OTOH
page fault handlers search for VMAs from inside an RCU read section and
call vma_read_unlock() from there, see
https://lore.kernel.org/all/20230109205336.3665937-29-surenb@google.com/.
Once we take the VMA read-lock, it ensures that the VMA can't be
write-locked, and anyone destroying or isolating the VMA needs to
write-lock it first.
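
Schematically, the pattern being described looks like the sketch below.
find_vma_under_rcu() stands in for the lookup helper added earlier in the
series and the function names are illustrative; the read lock taken inside
the RCU section is what keeps the VMA from being isolated or freed afterwards:

static struct vm_area_struct *lock_vma_for_fault(struct mm_struct *mm,
						 unsigned long address)
{
	struct vm_area_struct *vma;

	rcu_read_lock();
	/* Hypothetical stand-in for the series' RCU-safe VMA lookup. */
	vma = find_vma_under_rcu(mm, address);
	if (vma && !vma_read_trylock(vma))
		vma = NULL;		/* caller falls back to mmap_lock */
	rcu_read_unlock();

	/*
	 * A non-NULL result is read-locked: a writer must write-lock the
	 * VMA before isolating or freeing it, so it stays valid for the
	 * fault handler even after the RCU section ends.
	 */
	return vma;
}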
  
Suren Baghdasaryan Jan. 17, 2023, 6:55 p.m. UTC | #24
On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > with a simpler structure:
> > > > > [...]
> > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > >  {
> > > > > > -       up_read(&vma->vm_lock->lock);
> > > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > >  }
> > > > >
> > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > use-after-free because you're accessing the vma after dropping your
> > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > read-side critical section if the VMA is freed with RCU delay.
> > > >
> > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > counter of how many readers took the lock or it's negative if the lock
> > > > is write-locked.
> > >
> > > Yes, but ...
> > >
> > >         Task A:
> > >         atomic_dec_and_test(&vma->vm_lock->count)
> > >                         Task B:
> > >                         munmap()
> > >                         write lock
> > >                         free VMA
> > >                         synchronize_rcu()
> > >                         VMA is really freed
> > >         wake_up(&vma->vm_mm->vma_writer_wait);
> > >
> > > ... vma is freed.
> > >
> > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > have that assertion?
> >
> > Yep, that's what this patch is doing
> > https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
> > by calling vma_assert_no_reader() from __vm_area_free().
>
> That's not enough though.  Task A still has a pointer to vma after it
> has called atomic_dec_and_test(), even after vma has been freed by
> Task B, and before Task A dereferences vma->vm_mm.

Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
local variable and call mmgrab() before atomic_dec_and_test(), then
use it in wake_up() and call mmdrop(). Is that what you are thinking?
  
Jann Horn Jan. 17, 2023, 6:59 p.m. UTC | #25
On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > > > >
> > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > > with a simpler structure:
> > > > > > [...]
> > > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > >  {
> > > > > > > -       up_read(&vma->vm_lock->lock);
> > > > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > >  }
> > > > > >
> > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > use-after-free because you're accessing the vma after dropping your
> > > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > >
> > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > counter of how many readers took the lock or it's negative if the lock
> > > > > is write-locked.
> > > >
> > > > Yes, but ...
> > > >
> > > >         Task A:
> > > >         atomic_dec_and_test(&vma->vm_lock->count)
> > > >                         Task B:
> > > >                         munmap()
> > > >                         write lock
> > > >                         free VMA
> > > >                         synchronize_rcu()
> > > >                         VMA is really freed
> > > >         wake_up(&vma->vm_mm->vma_writer_wait);
> > > >
> > > > ... vma is freed.
> > > >
> > > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > > have that assertion?
> > >
> > > Yep, that's what this patch is doing
> > > https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
> > > by calling vma_assert_no_reader() from __vm_area_free().
> >
> > That's not enough though.  Task A still has a pointer to vma after it
> > has called atomic_dec_and_test(), even after vma has been freed by
> > Task B, and before Task A dereferences vma->vm_mm.
>
> Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> local variable and call mmgrab() before atomic_dec_and_test(), then
> use it in wake_up() and call mmdrop(). Is that what you are thinking?

You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
for page fault handling must be keeping the mm_struct alive.
  
Suren Baghdasaryan Jan. 17, 2023, 7:06 p.m. UTC | #26
On Tue, Jan 17, 2023 at 11:00 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <jannh@google.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > > > with a simpler structure:
> > > > > > > [...]
> > > > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > >  {
> > > > > > > > -       up_read(&vma->vm_lock->lock);
> > > > > > > > +       if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > > +               wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > >  }
> > > > > > >
> > > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > > use-after-free because you're accessing the vma after dropping your
> > > > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > > >
> > > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > > counter of how many readers took the lock or it's negative if the lock
> > > > > > is write-locked.
> > > > >
> > > > > Yes, but ...
> > > > >
> > > > >         Task A:
> > > > >         atomic_dec_and_test(&vma->vm_lock->count)
> > > > >                         Task B:
> > > > >                         munmap()
> > > > >                         write lock
> > > > >                         free VMA
> > > > >                         synchronize_rcu()
> > > > >                         VMA is really freed
> > > > >         wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >
> > > > > ... vma is freed.
> > > > >
> > > > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > > > have that assertion?
> > > >
> > > > Yep, that's what this patch is doing
> > > > https://lore.kernel.org/all/20230109205336.3665937-27-surenb@google.com/
> > > > by calling vma_assert_no_reader() from __vm_area_free().
> > >
> > > That's not enough though.  Task A still has a pointer to vma after it
> > > has called atomic_dec_and_test(), even after vma has been freed by
> > > Task B, and before Task A dereferences vma->vm_mm.
> >
> > Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> > local variable and call mmgrab() before atomic_dec_and_test(), then
> > use it in wake_up() and call mmdrop(). Is that what you are thinking?
>
> You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
> for page fault handling must be keeping the mm_struct alive.

Good point. Will update in the next revision to store mm before
dropping the count. Thanks for all the comments folks!
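
A minimal sketch of the agreed direction, reading vm_mm before the decrement
so nothing dereferences the VMA once the count has been dropped (the next
revision may of course differ):

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	/* The page-fault caller keeps this mm alive; no mmgrab()/mmdrop(). */
	struct mm_struct *mm = vma->vm_mm;

	/* After this decrement the VMA may be freed; do not touch it again. */
	if (atomic_dec_and_test(&vma->vm_lock->count))
		wake_up(&mm->vma_writer_wait);
}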
  
Michal Hocko Jan. 17, 2023, 8:31 p.m. UTC | #27
On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
[...]
> > Then yes, that's a starvable lock.  Preventing starvation on the mmap
> > sem was the original motivation for making rwsems non-starvable, so
> > changing that behaviour now seems like a bad idea.  For efficiency, I'd
> > suggest that a waiting writer set the top bit of the counter.  That way,
> > all new readers will back off without needing to check a second variable
> > and old readers will know that they *may* need to do the wakeup when
> > atomic_sub_return_release() is negative.
> >
> > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > that far; the important point is that the waiting writer indicates its
> > presence in the count field so that readers can modify their behaviour)
> 
> Got it. Ok, I think we can figure something out to check if there are
> waiting write-lockers and prevent new readers from taking the lock.

Reinventing locking primitives is a ticket to weird bugs. I would stick
with the rwsem and deal with the performance fallout after it is clear that
the core idea is generally acceptable and backed by actual real-life
numbers. This whole thing is quite big enough that we do not have to go
through an "is this new synchronization primitive correct and behaving
reasonably" exercise as well.
  
Suren Baghdasaryan Jan. 17, 2023, 9 p.m. UTC | #28
On Tue, Jan 17, 2023 at 12:31 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
> [...]
> > > Then yes, that's a starvable lock.  Preventing starvation on the mmap
> > > sem was the original motivation for making rwsems non-starvable, so
> > > changing that behaviour now seems like a bad idea.  For efficiency, I'd
> > > suggest that a waiting writer set the top bit of the counter.  That way,
> > > all new readers will back off without needing to check a second variable
> > > and old readers will know that they *may* need to do the wakeup when
> > > atomic_sub_return_release() is negative.
> > >
> > > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > > that far; the important point is that the waiting writer indicates its
> > > presence in the count field so that readers can modify their behaviour)
> >
> > Got it. Ok, I think we can figure something out to check if there are
> > waiting write-lockers and prevent new readers from taking the lock.
>
> Reinventing locking primitives is a ticket to weird bugs. I would stick
> with the rwsem and deal with the performance fallout after it is clear that
> the core idea is generally acceptable and backed by actual real-life
> numbers. This whole thing is quite big enough that we do not have to go
> through an "is this new synchronization primitive correct and behaving
> reasonably" exercise as well.

Point taken. That's one of the reasons I kept this patch separate.
I'll drop this last patch from the series for now. One correction
though: this will not be a performance fallout but a memory-consumption
one.

>
> --
> Michal Hocko
> SUSE Labs
  
Matthew Wilcox Jan. 18, 2023, 6:35 p.m. UTC | #29
On Wed, Jan 18, 2023 at 02:26:39PM +0800, Hillf Danton wrote:
> On Tue, Jan 17, 2023 at 10:27 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > The cpu_relax() is exactly the wrong thing to do here.  See this thread:
> > https://lore.kernel.org/linux-fsdevel/20230113184447.1707316-1-mjguzik@gmail.com/
> 
> If you are right, feel free to go and remove every cpu_relax() under the
> kernel/locking directory.

I see you didn't read the whole thread where Linus points out that a
cmpxchg() loop is fundamentally different from a spinlock.
  

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d40bf8a5e19e..294dd44b2198 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -627,12 +627,16 @@  static inline void vma_write_lock(struct vm_area_struct *vma)
 	 * mm->mm_lock_seq can't be concurrently modified.
 	 */
 	mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
-	if (vma->vm_lock_seq == mm_lock_seq)
+	if (vma->vm_lock->lock_seq == mm_lock_seq)
 		return;
 
-	down_write(&vma->vm_lock->lock);
-	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->vm_lock->lock);
+	if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
+		wait_event(vma->vm_mm->vma_writer_wait,
+			   atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
+	vma->vm_lock->lock_seq = mm_lock_seq;
+	/* Write barrier to ensure lock_seq change is visible before count */
+	smp_wmb();
+	atomic_set(&vma->vm_lock->count, 0);
 }
 
 /*
@@ -643,20 +647,28 @@  static inline void vma_write_lock(struct vm_area_struct *vma)
 static inline bool vma_read_trylock(struct vm_area_struct *vma)
 {
 	/* Check before locking. A race might cause false locked result. */
-	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
 		return false;
 
-	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
 		return false;
 
+	/* If atomic_t overflows, restore and fail to lock. */
+	if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
+		if (atomic_dec_and_test(&vma->vm_lock->count))
+			wake_up(&vma->vm_mm->vma_writer_wait);
+		return false;
+	}
+
 	/*
 	 * Overflow might produce false locked result.
 	 * False unlocked result is impossible because we modify and check
 	 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 	 * modification invalidates all existing locks.
 	 */
-	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
-		up_read(&vma->vm_lock->lock);
+	if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+		if (atomic_dec_and_test(&vma->vm_lock->count))
+			wake_up(&vma->vm_mm->vma_writer_wait);
 		return false;
 	}
 	return true;
@@ -664,7 +676,8 @@  static inline bool vma_read_trylock(struct vm_area_struct *vma)
 
 static inline void vma_read_unlock(struct vm_area_struct *vma)
 {
-	up_read(&vma->vm_lock->lock);
+	if (atomic_dec_and_test(&vma->vm_lock->count))
+		wake_up(&vma->vm_mm->vma_writer_wait);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -674,13 +687,13 @@  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 	 * current task is holding mmap_write_lock, both vma->vm_lock_seq and
 	 * mm->mm_lock_seq can't be concurrently modified.
 	 */
-	VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
+	VM_BUG_ON_VMA(vma->vm_lock->lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
 }
 
 static inline void vma_assert_no_reader(struct vm_area_struct *vma)
 {
-	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock) &&
-		      vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
+	VM_BUG_ON_VMA(atomic_read(&vma->vm_lock->count) > 0 &&
+		      vma->vm_lock->lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
 		      vma);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index faa61b400f9b..a6050c38ca2e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -527,7 +527,13 @@  struct anon_vma_name {
 };
 
 struct vma_lock {
-	struct rw_semaphore lock;
+	/*
+	 * count > 0 ==> read-locked with 'count' number of readers
+	 * count < 0 ==> write-locked
+	 * count = 0 ==> unlocked
+	 */
+	atomic_t count;
+	int lock_seq;
 };
 
 /*
@@ -566,7 +572,6 @@  struct vm_area_struct {
 	unsigned long vm_flags;
 
 #ifdef CONFIG_PER_VMA_LOCK
-	int vm_lock_seq;
 	struct vma_lock *vm_lock;
 #endif
 
@@ -706,6 +711,7 @@  struct mm_struct {
 					  * by mmlist_lock
 					  */
 #ifdef CONFIG_PER_VMA_LOCK
+		struct wait_queue_head vma_writer_wait;
 		int mm_lock_seq;
 		struct {
 			struct list_head head;
diff --git a/kernel/fork.c b/kernel/fork.c
index 95db6a521cf1..b221ad182d98 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -461,9 +461,8 @@  static bool vma_init_lock(struct vm_area_struct *vma)
 	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
 	if (!vma->vm_lock)
 		return false;
-
-	init_rwsem(&vma->vm_lock->lock);
-	vma->vm_lock_seq = -1;
+	atomic_set(&vma->vm_lock->count, 0);
+	vma->vm_lock->lock_seq = -1;
 
 	return true;
 }
@@ -1229,6 +1228,7 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
 #ifdef CONFIG_PER_VMA_LOCK
+	init_waitqueue_head(&mm->vma_writer_wait);
 	WRITE_ONCE(mm->mm_lock_seq, 0);
 	INIT_LIST_HEAD(&mm->vma_free_list.head);
 	spin_lock_init(&mm->vma_free_list.lock);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index b53d23c2d7a3..0088e31e5f7e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -38,6 +38,8 @@  struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
+	.vma_writer_wait =
+		__WAIT_QUEUE_HEAD_INITIALIZER(init_mm.vma_writer_wait),
 	.mm_lock_seq	= 0,
 	.vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
 	.vma_free_list.lock =  __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),