[v2,05/10] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare

Message ID 20221207203034.650899-6-peterx@redhat.com
State New
Headers
Series [v2,01/10] mm/hugetlb: Let vma_offset_start() to return start |

Commit Message

Peter Xu Dec. 7, 2022, 8:30 p.m. UTC
  We can take the hugetlb walker lock, here taking vma lock directly.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
  

Comments

John Hubbard Dec. 7, 2022, 11:19 p.m. UTC | #1
On 12/7/22 12:30, Peter Xu wrote:
> We can take the hugetlb walker lock, here taking vma lock directly.
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   fs/userfaultfd.c | 18 ++++++++++++++----
>   1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 07c81ab3fd4d..a602f008dde5 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -376,7 +376,8 @@ static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
>    */
>   vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>   {
> -	struct mm_struct *mm = vmf->vma->vm_mm;
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct mm_struct *mm = vma->vm_mm;
>   	struct userfaultfd_ctx *ctx;
>   	struct userfaultfd_wait_queue uwq;
>   	vm_fault_t ret = VM_FAULT_SIGBUS;
> @@ -403,7 +404,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>   	 */
>   	mmap_assert_locked(mm);
>   
> -	ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
> +	ctx = vma->vm_userfaultfd_ctx.ctx;
>   	if (!ctx)
>   		goto out;
>   
> @@ -493,6 +494,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>   
>   	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>   
> +	/*
> +	 * This stablizes pgtable for hugetlb on e.g. pmd unsharing.  Need
> +	 * to be before setting current state.
> +	 */

Looking at this code, I am not able to come up with a reason for why the
vma lock/unlock placement is exactly where it is. It looks quite arbitrary.

Why not, for example, take and drop the vma lock within
userfaultfd_huge_must_wait()? That makes more sense to me, but I'm not familiar
with userfaultfd so of course I'm missing something.

But the comment above certainly doesn't supply that something.


thanks,
  
Peter Xu Dec. 7, 2022, 11:44 p.m. UTC | #2
On Wed, Dec 07, 2022 at 03:19:55PM -0800, John Hubbard wrote:
> On 12/7/22 12:30, Peter Xu wrote:
> > We can take the hugetlb walker lock, here taking vma lock directly.
> > 
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   fs/userfaultfd.c | 18 ++++++++++++++----
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 07c81ab3fd4d..a602f008dde5 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -376,7 +376,8 @@ static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
> >    */
> >   vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >   {
> > -	struct mm_struct *mm = vmf->vma->vm_mm;
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct mm_struct *mm = vma->vm_mm;
> >   	struct userfaultfd_ctx *ctx;
> >   	struct userfaultfd_wait_queue uwq;
> >   	vm_fault_t ret = VM_FAULT_SIGBUS;
> > @@ -403,7 +404,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >   	 */
> >   	mmap_assert_locked(mm);
> > -	ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
> > +	ctx = vma->vm_userfaultfd_ctx.ctx;
> >   	if (!ctx)
> >   		goto out;
> > @@ -493,6 +494,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >   	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> > +	/*
> > +	 * This stablizes pgtable for hugetlb on e.g. pmd unsharing.  Need
> > +	 * to be before setting current state.
> > +	 */
> 
> Looking at this code, I am not able to come up with a reason for why the
> vma lock/unlock placement is exactly where it is. It looks quite arbitrary.
> 
> Why not, for example, take and drop the vma lock within
> userfaultfd_huge_must_wait()? That makes more sense to me, but I'm not familiar
> with userfaultfd so of course I'm missing something.
> 
> But the comment above certainly doesn't supply that something.

The part that matters in the comment is "need to be before setting current
state".

	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
	if (is_vm_hugetlb_page(vma))
		hugetlb_vma_lock_read(vma);
	set_current_state(blocking_state);

down_read() can sleep and also modify the task state, we cannot take the
lock after that point because otherwise the task state will be messed up.
  
John Hubbard Dec. 7, 2022, 11:54 p.m. UTC | #3
On 12/7/22 15:44, Peter Xu wrote:
> The part that matters in the comment is "need to be before setting current
> state".
> 
> 	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> 	if (is_vm_hugetlb_page(vma))
> 		hugetlb_vma_lock_read(vma);
> 	set_current_state(blocking_state);
> 
> down_read() can sleep and also modify the task state, we cannot take the
> lock after that point because otherwise the task state will be messed up.
> 

OK, how about:

	/*
	 * Take the vma lock now, in order to safely call
	 * userfaultfd_huge_must_wait() a little bit later. Because acquiring
	 * the (sleepable) vma lock potentially modifies the current task state,
	 * that must be before explicitly calling set_current_state().
	 */

Other than that, the patch looks good, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
  

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..a602f008dde5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -376,7 +376,8 @@  static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
  */
 vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 {
-	struct mm_struct *mm = vmf->vma->vm_mm;
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
 	vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -403,7 +404,7 @@  vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 */
 	mmap_assert_locked(mm);
 
-	ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
+	ctx = vma->vm_userfaultfd_ctx.ctx;
 	if (!ctx)
 		goto out;
 
@@ -493,6 +494,13 @@  vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
 
+	/*
+	 * This stablizes pgtable for hugetlb on e.g. pmd unsharing.  Need
+	 * to be before setting current state.
+	 */
+	if (is_vm_hugetlb_page(vma))
+		hugetlb_vma_lock_read(vma);
+
 	spin_lock_irq(&ctx->fault_pending_wqh.lock);
 	/*
 	 * After the __add_wait_queue the uwq is visible to userland
@@ -507,13 +515,15 @@  vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	set_current_state(blocking_state);
 	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
 
-	if (!is_vm_hugetlb_page(vmf->vma))
+	if (!is_vm_hugetlb_page(vma))
 		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
 						  reason);
 	else
-		must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
+		must_wait = userfaultfd_huge_must_wait(ctx, vma,
 						       vmf->address,
 						       vmf->flags, reason);
+	if (is_vm_hugetlb_page(vma))
+		hugetlb_vma_unlock_read(vma);
 	mmap_read_unlock(mm);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released))) {