[25/31] mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()

Message ID b9da41bb-b7b6-2fc6-caac-b01b6719334@google.com
State New
Series mm: allow pte_offset_map[_lock]() to fail

Commit Message

Hugh Dickins May 22, 2023, 5:22 a.m. UTC
  There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
distinguish huge_zero_page from a normal THP: follow_page_pte() handles
any instability, and here it's a good idea to replace any pmd_none(*pmd)
by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
(Hmm, couldn't the normal THP case have hit an unstably refaulted THP
before?  But there are only two, exceptional, users of FOLL_SPLIT_PMD.)

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/gup.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)
  

Comments

Yang Shi May 23, 2023, 2:26 a.m. UTC | #1
On Sun, May 21, 2023 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
>
> There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> any instability, and here it's a good idea to replace any pmd_none(*pmd)
> by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
> (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> before?  But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/gup.c | 19 ++++---------------
>  1 file changed, 4 insertions(+), 15 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index bb67193c5460..4ad50a59897f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>                 return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>         }
>         if (flags & FOLL_SPLIT_PMD) {
> -               int ret;
> -               page = pmd_page(*pmd);
> -               if (is_huge_zero_page(page)) {
> -                       spin_unlock(ptl);
> -                       ret = 0;
> -                       split_huge_pmd(vma, pmd, address);
> -                       if (pmd_trans_unstable(pmd))
> -                               ret = -EBUSY;

IIUC the pmd_trans_unstable() check was transferred to the implicit
pmd_none() in pte_alloc(). But it will return -ENOMEM instead of
-EBUSY. Won't that break some userspace? Or is pmd_trans_unstable()
never true? If so, it seems worth mentioning this return value change
in the commit log.

> -               } else {
> -                       spin_unlock(ptl);
> -                       split_huge_pmd(vma, pmd, address);
> -                       ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> -               }
> -
> -               return ret ? ERR_PTR(ret) :
> +               spin_unlock(ptl);
> +               split_huge_pmd(vma, pmd, address);
> +               /* If pmd was left empty, stuff a page table in there quickly */
> +               return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
>                         follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>         }
>         page = follow_trans_huge_pmd(vma, address, pmd, flags);
> --
> 2.35.3
>
  
Yang Shi May 23, 2023, 2:44 a.m. UTC | #2
On Mon, May 22, 2023 at 7:26 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Sun, May 21, 2023 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> > distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> > any instability, and here it's a good idea to replace any pmd_none(*pmd)
> > by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
> > (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> > before?  But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  mm/gup.c | 19 ++++---------------
> >  1 file changed, 4 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index bb67193c5460..4ad50a59897f 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> >                 return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> >         }
> >         if (flags & FOLL_SPLIT_PMD) {
> > -               int ret;
> > -               page = pmd_page(*pmd);
> > -               if (is_huge_zero_page(page)) {
> > -                       spin_unlock(ptl);
> > -                       ret = 0;
> > -                       split_huge_pmd(vma, pmd, address);
> > -                       if (pmd_trans_unstable(pmd))
> > -                               ret = -EBUSY;
>
> IIUC the pmd_trans_unstable() check was transferred to the implicit
> pmd_none() in pte_alloc(). But it will return -ENOMEM instead of
> -EBUSY. Won't it break some userspace? Or the pmd_trans_unstable() is
> never true? If so it seems worth mentioning in the commit log about
> this return value change.

Oops, the above comment is not accurate. It will call
follow_page_pte() instead of returning -EBUSY if pmd is none. For
other unstable cases, it will return -ENOMEM instead of -EBUSY.

>
> > -               } else {
> > -                       spin_unlock(ptl);
> > -                       split_huge_pmd(vma, pmd, address);
> > -                       ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> > -               }
> > -
> > -               return ret ? ERR_PTR(ret) :
> > +               spin_unlock(ptl);
> > +               split_huge_pmd(vma, pmd, address);
> > +               /* If pmd was left empty, stuff a page table in there quickly */
> > +               return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> >                         follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> >         }
> >         page = follow_trans_huge_pmd(vma, address, pmd, flags);
> > --
> > 2.35.3
> >
  
Hugh Dickins May 24, 2023, 4:26 a.m. UTC | #3
On Mon, 22 May 2023, Yang Shi wrote:
> On Mon, May 22, 2023 at 7:26 PM Yang Shi <shy828301@gmail.com> wrote:
> > On Sun, May 21, 2023 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> > > distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> > > any instability, and here it's a good idea to replace any pmd_none(*pmd)
> > > by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
> > > (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> > > before?  But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
> > >
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > ---
> > >  mm/gup.c | 19 ++++---------------
> > >  1 file changed, 4 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index bb67193c5460..4ad50a59897f 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> > >                 return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > >         }
> > >         if (flags & FOLL_SPLIT_PMD) {
> > > -               int ret;
> > > -               page = pmd_page(*pmd);
> > > -               if (is_huge_zero_page(page)) {
> > > -                       spin_unlock(ptl);
> > > -                       ret = 0;
> > > -                       split_huge_pmd(vma, pmd, address);
> > > -                       if (pmd_trans_unstable(pmd))
> > > -                               ret = -EBUSY;
> >
> > IIUC the pmd_trans_unstable() check was transferred to the implicit
> > pmd_none() in pte_alloc(). But it will return -ENOMEM instead of
> > -EBUSY. Won't it break some userspace? Or the pmd_trans_unstable() is
> > never true? If so it seems worth mentioning in the commit log about
> > this return value change.

Thanks a lot for looking at these, but I disagree here.

> 
> Oops, the above comment is not accurate. It will call
> follow_page_pte() instead of returning -EBUSY if pmd is none.

Yes.  Ignoring secondary races, if pmd is none, pte_alloc() will allocate
an empty page table there, and follow_page_pte() will find !pte_present and
return NULL; or if pmd is not none, follow_page_pte() will return
no_page_table(), i.e. NULL.  And page NULL ends up with __get_user_pages()
having another go round, instead of failing with -EBUSY.

Which I'd say is better handling for such a transient case - remember,
it's split_huge_pmd() (which should always succeed, but might be raced)
in use there, not split_huge_page() (which might take years for pins to
be removed before it can succeed).
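
As a rough userspace illustration of the difference described above (not kernel code): the sketch below models a caller that retries on a transient NULL lookup but gives up on a hard error. The names model_gup_one() and enum lookup_result are hypothetical, invented for this sketch, and the real __get_user_pages() does considerably more between attempts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical results of one page-table lookup attempt. */
enum lookup_result { LOOKUP_PAGE, LOOKUP_NULL, LOOKUP_EBUSY };

/*
 * Toy caller loop: walks a scripted list of lookup results.
 * Returns 0 once a page is found, -1 on a hard error or when the
 * script runs out.  *attempts counts how many lookups were made.
 */
static int model_gup_one(const enum lookup_result *script, size_t len,
			 int *attempts)
{
	assert(script != NULL && attempts != NULL);
	for (size_t i = 0; i < len; i++) {
		(*attempts)++;
		if (script[i] == LOOKUP_PAGE)
			return 0;	/* success */
		if (script[i] == LOOKUP_EBUSY)
			return -1;	/* hard error: give up at once */
		/* LOOKUP_NULL: treated as transient, go round again */
	}
	return -1;
}
```

With a script of { LOOKUP_EBUSY, LOOKUP_PAGE } the model fails on the first attempt, whereas { LOOKUP_NULL, LOOKUP_PAGE } succeeds on the second, which is the sense in which a NULL return is gentler on a transient race.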

> For other unstable cases, it will return -ENOMEM instead of -EBUSY.

I don't think so: the possibly-failing __pte_alloc() only gets called
in the pmd_none() case.
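
A minimal userspace sketch of that short-circuit, assuming pte_alloc() behaves as "pmd_none() && __pte_alloc()": struct model_pmd, model_slow_pte_alloc() and model_pte_alloc() are hypothetical stand-ins written for this illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pmd entry: only "is it empty?" matters. */
struct model_pmd {
	bool none;		/* models pmd_none(*pmd) */
};

/* Counts how often the allocating slow path ran (illustration only). */
static int slow_path_calls;

/* Models __pte_alloc(): may fail; on success the pmd is no longer none. */
static int model_slow_pte_alloc(struct model_pmd *pmd, bool alloc_fails)
{
	slow_path_calls++;
	if (alloc_fails)
		return -1;	/* nonzero means failure */
	pmd->none = false;	/* a page table now sits in the pmd */
	return 0;
}

/* Models pte_alloc(): the slow path is reached only for an empty pmd. */
static int model_pte_alloc(struct model_pmd *pmd, bool alloc_fails)
{
	assert(pmd != NULL);
	return pmd->none && model_slow_pte_alloc(pmd, alloc_fails);
}
```

In this model a populated pmd always yields 0 without touching the allocator, so an allocation failure can only be reported when the pmd was empty.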

Hugh

> 
> >
> > > -               } else {
> > > -                       spin_unlock(ptl);
> > > -                       split_huge_pmd(vma, pmd, address);
> > > -                       ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> > > -               }
> > > -
> > > -               return ret ? ERR_PTR(ret) :
> > > +               spin_unlock(ptl);
> > > +               split_huge_pmd(vma, pmd, address);
> > > +               /* If pmd was left empty, stuff a page table in there quickly */
> > > +               return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> > >                         follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > >         }
> > >         page = follow_trans_huge_pmd(vma, address, pmd, flags);
> > > --
> > > 2.35.3
  
Yang Shi May 24, 2023, 10:45 p.m. UTC | #4
On Tue, May 23, 2023 at 9:26 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Mon, 22 May 2023, Yang Shi wrote:
> > On Mon, May 22, 2023 at 7:26 PM Yang Shi <shy828301@gmail.com> wrote:
> > > On Sun, May 21, 2023 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
> > > >
> > > > There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> > > > distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> > > > any instability, and here it's a good idea to replace any pmd_none(*pmd)
> > > > by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
> > > > (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> > > > before?  But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
> > > >
> > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > ---
> > > >  mm/gup.c | 19 ++++---------------
> > > >  1 file changed, 4 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index bb67193c5460..4ad50a59897f 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> > > >                 return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > >         }
> > > >         if (flags & FOLL_SPLIT_PMD) {
> > > > -               int ret;
> > > > -               page = pmd_page(*pmd);
> > > > -               if (is_huge_zero_page(page)) {
> > > > -                       spin_unlock(ptl);
> > > > -                       ret = 0;
> > > > -                       split_huge_pmd(vma, pmd, address);
> > > > -                       if (pmd_trans_unstable(pmd))
> > > > -                               ret = -EBUSY;
> > >
> > > IIUC the pmd_trans_unstable() check was transferred to the implicit
> > > pmd_none() in pte_alloc(). But it will return -ENOMEM instead of
> > > -EBUSY. Won't it break some userspace? Or the pmd_trans_unstable() is
> > > never true? If so it seems worth mentioning in the commit log about
> > > this return value change.
>
> Thanks a lot for looking at these, but I disagree here.
>
> >
> > Oops, the above comment is not accurate. It will call
> > follow_page_pte() instead of returning -EBUSY if pmd is none.
>
> Yes.  Ignoring secondary races, if pmd is none, pte_alloc() will allocate
> an empty page table there, follow_page_pte() find !pte_present and return
> NULL; or if pmd is not none, follow_page_pte() will return no_page_table()
> i.e. NULL.  And page NULL ends up with __get_user_pages() having another
> go round, instead of failing with -EBUSY.
>
> Which I'd say is better handling for such a transient case - remember,
> it's split_huge_pmd() (which should always succeed, but might be raced)
> in use there, not split_huge_page() (which might take years for pins to
> be removed before it can succeed).

It sounds like an improvement.

>
> > For other unstable cases, it will return -ENOMEM instead of -EBUSY.
>
> I don't think so: the possibly-failing __pte_alloc() only gets called
> in the pmd_none() case.

I mean: what if the pmd is not none for the huge zero page? If it is not
pmd_none, pte_alloc() just returns 0, and then it returns -ENOMEM instead
of -EBUSY. Or is it impossible for the pmd to end up being pmd_trans_huge
or !pmd_present? It should be very unlikely (for example, migration does
skip the huge zero page), but I'm not sure whether there is any corner
case that I missed.

>
> Hugh
>
> >
> > >
> > > > -               } else {
> > > > -                       spin_unlock(ptl);
> > > > -                       split_huge_pmd(vma, pmd, address);
> > > > -                       ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> > > > -               }
> > > > -
> > > > -               return ret ? ERR_PTR(ret) :
> > > > +               spin_unlock(ptl);
> > > > +               split_huge_pmd(vma, pmd, address);
> > > > +               /* If pmd was left empty, stuff a page table in there quickly */
> > > > +               return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> > > >                         follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > >         }
> > > >         page = follow_trans_huge_pmd(vma, address, pmd, flags);
> > > > --
> > > > 2.35.3
  
Hugh Dickins May 25, 2023, 9:16 p.m. UTC | #5
On Wed, 24 May 2023, Yang Shi wrote:
> On Tue, May 23, 2023 at 9:26 PM Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 22 May 2023, Yang Shi wrote:
> >
> > > For other unstable cases, it will return -ENOMEM instead of -EBUSY.
> >
> > I don't think so: the possibly-failing __pte_alloc() only gets called
> > in the pmd_none() case.
> 
> I mean what if pmd is not none for huge zero page. If it is not
> pmd_none pte_alloc() just returns 0,

Yes, I agree with you on that.

> then returns -ENOMEM instead of -EBUSY.

But disagree with you on that.

		return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

Doesn't that say that if pte_alloc() returns 0, then follow_pmd_mask()
will call follow_page_pte() and return whatever that returns?

> Or it is impossible that pmd end up being pmd_trans_huge or
> !pmd_present? It should be very unlikely, for example, migration does
> skip huge zero page, but I'm not sure whether there is any corner case
> that I missed.

I'm assuming both are possible there (but not asserting that they are).

Hugh
  
Yang Shi May 25, 2023, 10:33 p.m. UTC | #6
On Thu, May 25, 2023 at 2:16 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Wed, 24 May 2023, Yang Shi wrote:
> > On Tue, May 23, 2023 at 9:26 PM Hugh Dickins <hughd@google.com> wrote:
> > > On Mon, 22 May 2023, Yang Shi wrote:
> > >
> > > > For other unstable cases, it will return -ENOMEM instead of -EBUSY.
> > >
> > > I don't think so: the possibly-failing __pte_alloc() only gets called
> > > in the pmd_none() case.
> >
> > I mean what if pmd is not none for huge zero page. If it is not
> > pmd_none pte_alloc() just returns 0,
>
> Yes, I agree with you on that.
>
> > then returns -ENOMEM instead of -EBUSY.
>
> But disagree with you on that.
>
>                 return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
>                         follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>
> Doesn't that say that if pte_alloc() returns 0, then follow_pmd_mask()
> will call follow_page_pte() and return whatever that returns?

Err... you are right. I misread the code. Anyway it returns -ENOMEM
instead of -EBUSY when pmd is none and pte alloc fails. Returning
-ENOMEM does make sense for this case. Is it worth some words in the
commit log for the slight behavior change?
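
For reference, the outcomes agreed on in this thread can be tabulated with a small userspace model. This is only a sketch under stated assumptions: enum pmd_state, enum outcome, old_zero_page_path() and new_path() are hypothetical names made up here, PMD_UNSTABLE lumps together the "trans huge or not present" cases, and secondary races are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pmd states observed after split_huge_pmd() returns. */
enum pmd_state { PMD_NONE, PMD_TABLE, PMD_UNSTABLE };

/* Hypothetical labels for what the FOLL_SPLIT_PMD block does next. */
enum outcome { OUT_FOLLOW_PTE, OUT_EBUSY, OUT_ENOMEM };

/* Old huge_zero_page branch: pmd_trans_unstable() => -EBUSY. */
static enum outcome old_zero_page_path(enum pmd_state s)
{
	bool unstable = (s == PMD_NONE || s == PMD_UNSTABLE);

	return unstable ? OUT_EBUSY : OUT_FOLLOW_PTE;
}

/* New unified block: only a failed allocation into an empty pmd errors. */
static enum outcome new_path(enum pmd_state s, bool alloc_fails)
{
	bool pte_alloc_failed = (s == PMD_NONE) && alloc_fails;

	assert(s == PMD_NONE || s == PMD_TABLE || s == PMD_UNSTABLE);
	return pte_alloc_failed ? OUT_ENOMEM : OUT_FOLLOW_PTE;
}
```

In this model the only error left in the new code is -ENOMEM for an empty pmd whose page table allocation fails; every other state falls through to follow_page_pte(), where the old huge_zero_page branch would have returned -EBUSY for the empty and unstable cases.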

>
> > Or it is impossible that pmd end up being pmd_trans_huge or
> > !pmd_present? It should be very unlikely, for example, migration does
> > skip huge zero page, but I'm not sure whether there is any corner case
> > that I missed.
>
> I'm assuming both are possible there (but not asserting that they are).
>
> Hugh
  

Patch

diff --git a/mm/gup.c b/mm/gup.c
index bb67193c5460..4ad50a59897f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -681,21 +681,10 @@  static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
 	if (flags & FOLL_SPLIT_PMD) {
-		int ret;
-		page = pmd_page(*pmd);
-		if (is_huge_zero_page(page)) {
-			spin_unlock(ptl);
-			ret = 0;
-			split_huge_pmd(vma, pmd, address);
-			if (pmd_trans_unstable(pmd))
-				ret = -EBUSY;
-		} else {
-			spin_unlock(ptl);
-			split_huge_pmd(vma, pmd, address);
-			ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
-		}
-
-		return ret ? ERR_PTR(ret) :
+		spin_unlock(ptl);
+		split_huge_pmd(vma, pmd, address);
+		/* If pmd was left empty, stuff a page table in there quickly */
+		return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
 			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
 	page = follow_trans_huge_pmd(vma, address, pmd, flags);