[1/4] mm/mprotect: Retry on pmd_trans_unstable()

Message ID: 20230602230552.350731-2-peterx@redhat.com
State: New
Series: mm: Fix pmd_trans_unstable() call sites on retry

Commit Message

Peter Xu June 2, 2023, 11:05 p.m. UTC
When we hit an unstable pmd, we should retry the pmd once more, because it
means we probably raced with a THP insertion.

Skipping it might be a problem, as no error is reported to the caller. I
assume the user will expect the protection change (e.g. mprotect or
userfaultfd write-protection) to have been applied when it actually was not.

To achieve this, move the pmd_trans_unstable() call out of
change_pte_range(). That makes the retry easier, as we can keep the return
value of change_pte_range() untouched.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)
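
The difference between skipping and retrying can be sketched in plain
userspace C. This is only an illustrative model, not kernel code: struct
fake_pmd, fake_pmd_unstable(), fake_change_pte_range(), change_skip() and
change_retry() are hypothetical names invented for this sketch, and the
"race" is simulated by a counter rather than by a real concurrent THP
insertion. It assumes the race window is finite, so the retry loop
terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one pmd entry; not a kernel type. */
struct fake_pmd {
	int unstable_reads;	/* upcoming checks that will observe a race */
	long nr_pages;		/* pages mapped below this entry */
	bool changed;		/* set once the protection change is applied */
};

/* Simulated check: reports "unstable" while the race window is open. */
static bool fake_pmd_unstable(struct fake_pmd *pmd)
{
	assert(pmd);
	if (pmd->unstable_reads > 0) {
		pmd->unstable_reads--;
		return true;
	}
	return false;
}

/* Applies the change and returns the number of pages touched. */
static long fake_change_pte_range(struct fake_pmd *pmd)
{
	assert(pmd);
	pmd->changed = true;
	return pmd->nr_pages;
}

/* Old behaviour in this model: an unstable entry is silently skipped. */
static long change_skip(struct fake_pmd *pmds, int n)
{
	long pages = 0;

	for (int i = 0; i < n; i++) {
		if (fake_pmd_unstable(&pmds[i]))
			continue;	/* caller never learns about the skip */
		pages += fake_change_pte_range(&pmds[i]);
	}
	return pages;
}

/* New behaviour in this model: re-check the same entry until it is stable. */
static long change_retry(struct fake_pmd *pmds, int n)
{
	long pages = 0;

	for (int i = 0; i < n; i++) {
again:
		if (fake_pmd_unstable(&pmds[i]))
			goto again;
		pages += fake_change_pte_range(&pmds[i]);
	}
	return pages;
}
```

In this model, an entry that is unstable on its first check is left
unchanged by change_skip() while the function still returns normally, whereas
change_retry() applies the change once the entry settles. Because the check
lives in the caller, fake_change_pte_range() keeps a single meaning for its
return value: the number of pages it actually changed.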
  

Comments

Yang Shi June 3, 2023, 2:04 a.m. UTC | #1
On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
>
> When hit unstable pmd, we should retry the pmd once more because it means
> we probably raced with a thp insertion.
>
> Skipping it might be a problem as no error will be reported to the caller.
> I assume it means the user will expect prot changed (e.g. mprotect or
> userfaultfd wr-protections) applied but it's actually not.

IIRC, mprotect() holds mmap_lock for write, so it should not matter there.
prot_numa holds mmap_lock for read, but returning 0 also doesn't matter (of
course, retrying is fine too); it just skips that 2M area. The
userfaultfd-wp case is your call :-)

>
> To achieve it, move the pmd_trans_unstable() call out of change_pte_range()
> which will make the retry easier, as we can keep the retval of
> change_pte_range() untouched.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/mprotect.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 92d3d3ca390a..e4756899d40c 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -94,15 +94,6 @@ static long change_pte_range(struct mmu_gather *tlb,
>
>         tlb_change_page_size(tlb, PAGE_SIZE);
>
> -       /*
> -        * Can be called with only the mmap_lock for reading by
> -        * prot_numa so we must check the pmd isn't constantly
> -        * changing from under us from pmd_none to pmd_trans_huge
> -        * and/or the other way around.
> -        */
> -       if (pmd_trans_unstable(pmd))
> -               return 0;
> -
>         /*
>          * The pmd points to a regular pte so the pmd can't change
>          * from under us even if the mmap_lock is only hold for
> @@ -411,6 +402,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>                         pages = ret;
>                         break;
>                 }
> +again:
>                 /*
>                  * Automatic NUMA balancing walks the tables with mmap_lock
>                  * held for read. It's possible a parallel update to occur
> @@ -465,6 +457,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>                         }
>                         /* fall through, the trans huge pmd just split */
>                 }
> +
> +               /*
> +                * Can be called with only the mmap_lock for reading by
> +                * prot_numa or userfaultfd-wp, so we must check the pmd
> +                * isn't constantly changing from under us from pmd_none to
> +                * pmd_trans_huge and/or the other way around.
> +                */
> +               if (pmd_trans_unstable(pmd))
> +                       goto again;
> +
>                 pages += change_pte_range(tlb, vma, pmd, addr, next,
>                                           newprot, cp_flags);
>  next:
> --
> 2.40.1
>
>
  
Peter Xu June 4, 2023, 11:58 p.m. UTC | #2
On Fri, Jun 02, 2023 at 07:04:48PM -0700, Yang Shi wrote:
> On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > When hit unstable pmd, we should retry the pmd once more because it means
> > we probably raced with a thp insertion.
> >
> > Skipping it might be a problem as no error will be reported to the caller.
> > I assume it means the user will expect prot changed (e.g. mprotect or
> > userfaultfd wr-protections) applied but it's actually not.
> 
> IIRC, mprotect() holds write mmap_lock, so it should not matter. PROT
> NUMA holds read mmap_lock, but returning 0 also doesn't matter (of
> course retry is fine too). just skip that 2M area.

True.

> The userfaultfd-wp is your call :-)

Yeah, I think uffd should still be a problem.  I'll reword the commit
message (by dropping the mprotect example) in the new version.

If you have time, feel free to have a look at patch 4, where I think it's a
bug for pagemap too (I didn't check it as closely as the rest; the memcg one
might be suspicious, and that's also in patch 4).

Thanks!
  

Patch

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..e4756899d40c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -94,15 +94,6 @@  static long change_pte_range(struct mmu_gather *tlb,
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 
-	/*
-	 * Can be called with only the mmap_lock for reading by
-	 * prot_numa so we must check the pmd isn't constantly
-	 * changing from under us from pmd_none to pmd_trans_huge
-	 * and/or the other way around.
-	 */
-	if (pmd_trans_unstable(pmd))
-		return 0;
-
 	/*
 	 * The pmd points to a regular pte so the pmd can't change
 	 * from under us even if the mmap_lock is only hold for
@@ -411,6 +402,7 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 			pages = ret;
 			break;
 		}
+again:
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
@@ -465,6 +457,16 @@  static inline long change_pmd_range(struct mmu_gather *tlb,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
+
+		/*
+		 * Can be called with only the mmap_lock for reading by
+		 * prot_numa or userfaultfd-wp, so we must check the pmd
+		 * isn't constantly changing from under us from pmd_none to
+		 * pmd_trans_huge and/or the other way around.
+		 */
+		if (pmd_trans_unstable(pmd))
+			goto again;
+
 		pages += change_pte_range(tlb, vma, pmd, addr, next,
 					  newprot, cp_flags);
 next: