hugetlb: unshare some PMDs when splitting VMAs

Message ID: 20230101230042.244286-1-jthoughton@google.com
State: New
Series: hugetlb: unshare some PMDs when splitting VMAs

Commit Message

James Houghton Jan. 1, 2023, 11 p.m. UTC
  PMD sharing can only be done in PUD_SIZE-aligned pieces of VMAs;
however, it is possible that HugeTLB VMAs are split without unsharing
the PMDs first.

In some (most?) cases, this is a non-issue, like userfaultfd_register
and mprotect, where PMDs are unshared before anything is done. However,
mbind() and madvise() (like MADV_DONTDUMP) can cause a split without
unsharing first.

It might seem ideal to unshare in hugetlb_vm_op_open, but that would
only unshare PMDs in the new VMA.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 42 +++++++++++++++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 9 deletions(-)
  

Comments

Mike Kravetz Jan. 3, 2023, 7:24 p.m. UTC | #1
On 01/01/23 23:00, James Houghton wrote:
> PMD sharing can only be done in PUD_SIZE-aligned pieces of VMAs;
> however, it is possible that HugeTLB VMAs are split without unsharing
> the PMDs first.
> 
> In some (most?) cases, this is a non-issue, like userfaultfd_register
> and mprotect, where PMDs are unshared before anything is done. However,
> mbind() and madvise() (like MADV_DONTDUMP) can cause a split without
> unsharing first.

Thanks James.  I am just trying to determine if we may have any issues/bugs/
undesired behavior based on this today.  Consider the cases mentioned above:
mbind - I do not think this would cause any user visible issues.  mbind is
	only dealing with newly allocated pages.  We do not unshare as the
	result of a mbind call today.
madvise(MADV_DONTDUMP) - It looks like this results in a flag (VM_DONTDUMP)
	being set on the vma.  So, I do not believe sharing page tables
	would cause any user visible issue.

One somewhat strange thing about two vmas sharing a PMD after a split is
that operations on one VMA can impact the other.  For example, suppose
a VMA split via mbind happens.  Then later, mprotect is done on one of
the VMAs in the range that is shared.  That would result in the area being
unshared in both VMAs.  So, the 'other' vma could see minor faults after
the mprotect.

Just curious if you (or anyone) knows of a user visible issue caused by this
today.  Trying to determine if we need a Fixes: tag.

Code changes look fine to me.
  
James Houghton Jan. 3, 2023, 8:26 p.m. UTC | #2
> Thanks James.  I am just trying to determine if we may have any issues/bugs/
> undesired behavior based on this today.  Consider the cases mentioned above:
> mbind - I do not think this would cause any user visible issues.  mbind is
>         only dealing with newly allocated pages.  We do not unshare as the
>         result of a mbind call today.
> madvise(MADV_DONTDUMP) - It looks like this results in a flag (VM_DONTDUMP)
>         being set on the vma.  So, I do not believe sharing page tables
>         would cause any user visible issue.
>
> One somewhat strange thing about two vmas sharing a PMD after a split is
> that operations on one VMA can impact the other.  For example, suppose
> a VMA split via mbind happens.  Then later, mprotect is done on one of
> the VMAs in the range that is shared.  That would result in the area being
> unshared in both VMAs.  So, the 'other' vma could see minor faults after
> the mprotect.
>
> Just curious if you (or anyone) knows of a user visible issue caused by this
> today.  Trying to determine if we need a Fixes: tag.

I think I've come up with one... :) It only took many many hours of
staring at code to come up with:

1. Fault in PUD_SIZE-aligned hugetlb mapping
2. fork() (to actually share the PMDs)
3. Split VMA with MADV_DONTDUMP
4. Register the lower piece of the newly split VMA with
UFFDIO_REGISTER_MODE_WRITEPROTECT (this will call
hugetlb_unshare_all_pmds, but it will not attempt to unshare in the
unaligned bits now)
5. Now calling UFFDIO_WRITEPROTECT will drop into
hugetlb_change_protection and succeed in unsharing. That will hit the
WARN_ON_ONCE and *not write-protect anything*.

I'll see if I can confirm that this is indeed possible and send a
repro if it is.
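
Something like this, roughly (a hypothetical sketch of the steps above, not
the eventual repro; the hugetlbfs path, the x86-64 PUD_SIZE value, and the
fixed mapping address are my assumptions, and error handling is omitted):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PUD_SIZE	(1UL << 30)	/* PMD-sharing granule on x86-64 */

int main(void)
{
	int fd = open("/mnt/huge/file", O_CREAT | O_RDWR, 0644);
	char *m;

	ftruncate(fd, PUD_SIZE);

	/* 1. PUD_SIZE-aligned MAP_SHARED hugetlb mapping */
	m = mmap((void *)(1UL << 40), PUD_SIZE, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_FIXED, fd, 0);

	/* 2. fork(), then fault in both processes, to actually share PMDs */
	if (fork() == 0) {
		memset(m, 1, PUD_SIZE);
		pause();
	}
	memset(m, 1, PUD_SIZE);

	/*
	 * 3+4. Register only the lower, non-PUD_SIZE-aligned piece for
	 * uffd-wp.  UFFDIO_REGISTER itself splits the VMA, and
	 * hugetlb_unshare_all_pmds() then skips the unaligned remainder.
	 */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)m, .len = PUD_SIZE / 2 },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};

	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/*
	 * 5. The write-protect unshares inside hugetlb_change_protection()
	 * and trips the WARN_ON_ONCE without write-protecting anything.
	 */
	struct uffdio_writeprotect wp = {
		.range = reg.range,
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	return 0;
}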

60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes") is the
commit that introduced the WARN_ON_ONCE; perhaps it's a good choice
for a Fixes: tag (if above is indeed true).

>
> Code changes look fine to me.

Thanks Mike!
- James
  
James Houghton Jan. 3, 2023, 8:27 p.m. UTC | #3
> I think I've come up with one... :) It only took many many hours of
> staring at code to come up with:
>
> 1. Fault in PUD_SIZE-aligned hugetlb mapping
> 2. fork() (to actually share the PMDs)

Erm, I mean: mmap(), then fork(), then fault in both processes.
  
Mike Kravetz Jan. 3, 2023, 10:23 p.m. UTC | #4
On 01/03/23 20:26, James Houghton wrote:
> > Thanks James.  I am just trying to determine if we may have any issues/bugs/
> > undesired behavior based on this today.  Consider the cases mentioned above:
> > mbind - I do not think this would cause any user visible issues.  mbind is
> >         only dealing with newly allocated pages.  We do not unshare as the
> >         result of a mbind call today.
> > madvise(MADV_DONTDUMP) - It looks like this results in a flag (VM_DONTDUMP)
> >         being set on the vma.  So, I do not believe sharing page tables
> >         would cause any user visible issue.
> >
> > One somewhat strange thing about two vmas sharing a PMD after a split is
> > that operations on one VMA can impact the other.  For example, suppose
> > a VMA split via mbind happens.  Then later, mprotect is done on one of
> > the VMAs in the range that is shared.  That would result in the area being
> > unshared in both VMAs.  So, the 'other' vma could see minor faults after
> > the mprotect.
> >
> > Just curious if you (or anyone) knows of a user visible issue caused by this
> > today.  Trying to determine if we need a Fixes: tag.
> 
> I think I've come up with one... :) It only took many many hours of
> staring at code to come up with:
> 
> 1. Fault in PUD_SIZE-aligned hugetlb mapping
> 2. fork() (to actually share the PMDs)
> Erm, I mean: mmap(), then fork(), then fault in both processes.
> 3. Split VMA with MADV_DONTDUMP
> 4. Register the lower piece of the newly split VMA with
> UFFDIO_REGISTER_MODE_WRITEPROTECT (this will call
> hugetlb_unshare_all_pmds, but it will not attempt to unshare in the
> unaligned bits now)
> 5. Now calling UFFDIO_WRITEPROTECT will drop into
> hugetlb_change_protection and succeed in unsharing. That will hit the
> WARN_ON_ONCE and *not write-protect anything*.
> 
> I'll see if I can confirm that this is indeed possible and send a
> repro if it is.

I think your analysis above is correct.  The key being the failure to unshare
in the non-PUD_SIZE vma after the split.

To me, the fact that it was somewhat difficult to come up with this scenario
is an argument that we should just unshare at split time as you propose.  Who
knows what other issues may exist.

> 60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes") is the
> commit that introduced the WARN_ON_ONCE; perhaps it's a good choice
> for a Fixes: tag (if above is indeed true).

If the key issue in your above scenario is indeed the failure of
hugetlb_unshare_all_pmds in the non-PUD_SIZE vma, then perhaps we tag?

6dfeaff93be1 ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when
register wp")
  
Peter Xu Jan. 3, 2023, 11:04 p.m. UTC | #5
On Sun, Jan 01, 2023 at 11:00:42PM +0000, James Houghton wrote:
> PMD sharing can only be done in PUD_SIZE-aligned pieces of VMAs;
> however, it is possible that HugeTLB VMAs are split without unsharing
> the PMDs first.
> 
> In some (most?) cases, this is a non-issue, like userfaultfd_register
> and mprotect, where PMDs are unshared before anything is done. However,
> mbind() and madvise() (like MADV_DONTDUMP) can cause a split without
> unsharing first.
> 
> It might seem ideal to unshare in hugetlb_vm_op_open, but that would
> only unshare PMDs in the new VMA.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 42 +++++++++++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b39b74e0591a..bf7a1f628357 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -94,6 +94,8 @@ static int hugetlb_acct_memory(struct hstate *h, long delta);
>  static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
>  static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
>  static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
> +static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> +		unsigned long start, unsigned long end);
>  
>  static inline bool subpool_is_free(struct hugepage_subpool *spool)
>  {
> @@ -4828,6 +4830,23 @@ static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	if (addr & ~(huge_page_mask(hstate_vma(vma))))
>  		return -EINVAL;
> +
> +	/* We require PUD_SIZE VMA alignment for PMD sharing. */

I can get the point, but it reads slightly awkward.  How about:

        /*
         * If the address to split can be in the middle of a shared pmd
         * range, unshare before split the vma.
         */

I remember you had a helper to check pmd sharing possibility.  It could be
used here, depending on whether it exists in the code base or in your HGM
series (or just pick it up with this one?).

> +	if (addr & ~PUD_MASK) {
> +		/*
> +		 * hugetlb_vm_op_split is called right before we attempt to
> +		 * split the VMA. We will need to unshare PMDs in the old and
> +		 * new VMAs, so let's unshare before we split.
> +		 */
> +		unsigned long floor = addr & PUD_MASK;
> +		unsigned long ceil = floor + PUD_SIZE;
> +
> +		if (floor < vma->vm_start || ceil >= vma->vm_end)

s/>=/>/?

> +			/* PMD sharing is already impossible. */
> +			return 0;

IMHO slightly cleaner to write in the reversed way and let it fall through:

  if (floor >= vma->vm_start && ceil <= vma->vm_end)
      hugetlb_unshare_pmds(vma, floor, ceil);

Thanks,

> +		hugetlb_unshare_pmds(vma, floor, ceil);
> +	}
> +
>  	return 0;
>  }
>  
> @@ -7313,26 +7332,21 @@ void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int re
>  	}
>  }
>  
> -/*
> - * This function will unconditionally remove all the shared pmd pgtable entries
> - * within the specific vma for a hugetlbfs memory range.
> - */
> -void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
> +static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> +				   unsigned long start,
> +				   unsigned long end)
>  {
>  	struct hstate *h = hstate_vma(vma);
>  	unsigned long sz = huge_page_size(h);
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_notifier_range range;
> -	unsigned long address, start, end;
> +	unsigned long address;
>  	spinlock_t *ptl;
>  	pte_t *ptep;
>  
>  	if (!(vma->vm_flags & VM_MAYSHARE))
>  		return;
>  
> -	start = ALIGN(vma->vm_start, PUD_SIZE);
> -	end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);
> -
>  	if (start >= end)
>  		return;
>  
> @@ -7364,6 +7378,16 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>  	mmu_notifier_invalidate_range_end(&range);
>  }
>  
> +/*
> + * This function will unconditionally remove all the shared pmd pgtable entries
> + * within the specific vma for a hugetlbfs memory range.
> + */
> +void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
> +{
> +	hugetlb_unshare_pmds(vma, ALIGN(vma->vm_start, PUD_SIZE),
> +			ALIGN_DOWN(vma->vm_end, PUD_SIZE));
> +}
> +
>  #ifdef CONFIG_CMA
>  static bool cma_reserve_called __initdata;
>  
> -- 
> 2.39.0.314.g84b9a713c41-goog
>
  
James Houghton Jan. 4, 2023, 7:10 p.m. UTC | #6
> > I'll see if I can confirm that this is indeed possible and send a
> > repro if it is.
>
> I think your analysis above is correct.  The key being the failure to unshare
> in the non-PUD_SIZE vma after the split.

I do indeed hit the WARN_ON_ONCE (repro attached), and the MADV wasn't
even needed (the UFFDIO_REGISTER does the VMA split before "unsharing
all PMDs"). With the fix, we avoid the WARN_ON_ONCE, but the behavior
is still incorrect: I expect the address range to be write-protected,
but it isn't.

The reason why is that hugetlb_change_protection uses huge_pte_offset,
even if it's being called for a UFFDIO_WRITEPROTECT with
UFFDIO_WRITEPROTECT_MODE_WP. In that particular case, I'm pretty sure
we should be using huge_pte_alloc, but even so, it's not trivial to
get an allocation failure back up to userspace. The non-hugetlb
implementation of UFFDIO_WRITEPROTECT seems to also have this problem.
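
Roughly, the distinction is the following (a paraphrased sketch of the
lookup, not an exact excerpt):

	/* Today: a never-populated range has no PTE to offset to, so it is
	 * silently skipped and no uffd-wp marker gets installed there. */
	ptep = huge_pte_offset(mm, address, psize);
	if (!ptep)
		continue;

	/* What UFFDIO_WRITEPROTECT_MODE_WP arguably wants: allocate so
	 * that a marker can always be installed... */
	ptep = huge_pte_alloc(mm, vma, address, psize);
	if (!ptep)
		break;	/* ...but -ENOMEM has no clean path to userspace */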

Peter, what do you think?

>
> > To me, the fact that it was somewhat difficult to come up with this scenario
> > is an argument that we should just unshare at split time as you propose.  Who
> knows what other issues may exist.
>
> > 60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes") is the
> > commit that introduced the WARN_ON_ONCE; perhaps it's a good choice
> > for a Fixes: tag (if above is indeed true).
>
> If the key issue in your above scenario is indeed the failure of
> hugetlb_unshare_all_pmds in the non-PUD_SIZE vma, then perhaps we tag?
>
> 6dfeaff93be1 ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when
> register wp")

SGTM. Thanks Mike.
  
James Houghton Jan. 4, 2023, 7:34 p.m. UTC | #7
> > @@ -4828,6 +4830,23 @@ static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >       if (addr & ~(huge_page_mask(hstate_vma(vma))))
> >               return -EINVAL;
> > +
> > +     /* We require PUD_SIZE VMA alignment for PMD sharing. */
>
> I can get the point, but it reads slightly awkward.  How about:
>
>         /*
>          * If the address to split can be in the middle of a shared pmd
>          * range, unshare before split the vma.
>          */
>

How about:

/*
 * PMD sharing is only possible for PUD_SIZE-aligned address ranges
 * in HugeTLB VMAs. If we will lose PUD_SIZE alignment due to this split,
 * unshare PMDs in the PUD_SIZE interval surrounding addr now.
 */

> I remember you had a helper to check pmd sharing possibility.  It could be
> used here, depending on whether it exists in the code base or in your HGM
> series (or just pick it up with this one?).

Right, it introduces `pmd_sharing_possible` but I don't think it helps here.

>
> > +     if (addr & ~PUD_MASK) {
> > +             /*
> > +              * hugetlb_vm_op_split is called right before we attempt to
> > +              * split the VMA. We will need to unshare PMDs in the old and
> > +              * new VMAs, so let's unshare before we split.
> > +              */
> > +             unsigned long floor = addr & PUD_MASK;
> > +             unsigned long ceil = floor + PUD_SIZE;
> > +
> > +             if (floor < vma->vm_start || ceil >= vma->vm_end)
>
> s/>=/>/?

Indeed, thanks.

>
> > +                     /* PMD sharing is already impossible. */
> > +                     return 0;
>
> IMHO slightly cleaner to write in the reversed way and let it fall through:
>
>   if (floor >= vma->vm_start && ceil <= vma->vm_end)
>       hugetlb_unshare_pmds(vma, floor, ceil);

Will do.
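
So the split hook would then look something like this (a sketch with both
of your suggestions applied, not the final patch):

	if (addr & ~PUD_MASK) {
		unsigned long floor = addr & PUD_MASK;
		unsigned long ceil = floor + PUD_SIZE;

		/*
		 * PMD sharing is only possible in PUD_SIZE-aligned pieces
		 * that lie fully inside the VMA, so only unshare then.
		 */
		if (floor >= vma->vm_start && ceil <= vma->vm_end)
			hugetlb_unshare_pmds(vma, floor, ceil);
	}

	return 0;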

>
> Thanks,
>

Thanks Peter!
  
Peter Xu Jan. 4, 2023, 8:03 p.m. UTC | #8
On Wed, Jan 04, 2023 at 07:10:11PM +0000, James Houghton wrote:
> > > I'll see if I can confirm that this is indeed possible and send a
> > > repro if it is.
> >
> > I think your analysis above is correct.  The key being the failure to unshare
> > in the non-PUD_SIZE vma after the split.
> 
> I do indeed hit the WARN_ON_ONCE (repro attached), and the MADV wasn't
> even needed (the UFFDIO_REGISTER does the VMA split before "unsharing
> all PMDs"). With the fix, we avoid the WARN_ON_ONCE, but the behavior
> is still incorrect: I expect the address range to be write-protected,
> but it isn't.
> 
> The reason why is that hugetlb_change_protection uses huge_pte_offset,
> even if it's being called for a UFFDIO_WRITEPROTECT with
> UFFDIO_WRITEPROTECT_MODE_WP. In that particular case, I'm pretty sure
> we should be using huge_pte_alloc, but even so, it's not trivial to
> get an allocation failure back up to userspace. The non-hugetlb
> implementation of UFFDIO_WRITEPROTECT seems to also have this problem.
> 
> Peter, what do you think?

Indeed.  Thanks for spotting that, James.

Non-hugetlb should be fine with having empty pgtable entries. Anon doesn't
need to care about ranges with unpopulated page tables so far. Shmem handles
it with a few change_prepare() calls to populate the entries so the markers
can be installed later on.

However, I think the failure handling is still not done well even for shmem,
as you pointed out: that's a path I have probably never triggered myself, and
the code has stayed there since a very early version:

#define  change_pmd_prepare(vma, pmd, cp_flags)				\
	do {								\
		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
			if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)))	\
				break;					\
		}							\
	} while (0)

I think a better thing we can do here (instead of warning and stopping the
UFFDIO_WRITEPROTECT at the current stage) is to return -ENOMEM properly so
the user can know about the error.  We'll need to touch the stack up to
uffd_wp_range(), as it's the only caller that can trigger the -ENOMEM so
far, so as to not ignore the retval from change_protection().
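
Maybe something along these lines (just a sketch of the direction,
untested), with change_protection() and uffd_wp_range() updated to check
the return value instead of ignoring it:

static inline long change_pmd_prepare(struct vm_area_struct *vma, pmd_t *pmd,
				      unsigned long cp_flags)
{
	if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {
		/* Propagate the failure instead of warning and continuing */
		if (pte_alloc(vma->vm_mm, pmd))
			return -ENOMEM;
	}
	return 0;
}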

Meanwhile, I also wonder whether we should call pagefault_out_of_memory(),
because this is the same as when a pgtable allocation failure happens in
page faults; we may want to OOM already.  I can take care of the hugetlb
part too along the way.

Man page of UFFDIO_WRITEPROTECT may need a fixup too to introduce -ENOMEM.

I can quickly prepare some patches for this, and hopefully they don't need
to block the current split fix.

Any thoughts?

> 
> >
> > To me, the fact that it was somewhat difficult to come up with this scenario
> > is an argument that we should just unshare at split time as you propose.  Who
> > knows what other issues may exist.
> >
> > > 60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes") is the
> > > commit that introduced the WARN_ON_ONCE; perhaps it's a good choice
> > > for a Fixes: tag (if above is indeed true).
> >
> > If the key issue in your above scenario is indeed the failure of
> > hugetlb_unshare_all_pmds in the non-PUD_SIZE vma, then perhaps we tag?
> >
> > 6dfeaff93be1 ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when
> > register wp")
> 
> SGTM. Thanks Mike.

Looks good here too.

Thanks,
  
Peter Xu Jan. 4, 2023, 8:04 p.m. UTC | #9
On Wed, Jan 04, 2023 at 07:34:00PM +0000, James Houghton wrote:
> How about:
> 
> /*
>  * PMD sharing is only possible for PUD_SIZE-aligned address ranges
>  * in HugeTLB VMAs. If we will lose PUD_SIZE alignment due to this split,
>  * unshare PMDs in the PUD_SIZE interval surrounding addr now.
>  */

Even better, thanks.
  
James Houghton Jan. 4, 2023, 11:12 p.m. UTC | #10
On Wed, Jan 4, 2023 at 8:03 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jan 04, 2023 at 07:10:11PM +0000, James Houghton wrote:
> > > > I'll see if I can confirm that this is indeed possible and send a
> > > > repro if it is.
> > >
> > > I think your analysis above is correct.  The key being the failure to unshare
> > > in the non-PUD_SIZE vma after the split.
> >
> > I do indeed hit the WARN_ON_ONCE (repro attached), and the MADV wasn't
> > even needed (the UFFDIO_REGISTER does the VMA split before "unsharing
> > all PMDs"). With the fix, we avoid the WARN_ON_ONCE, but the behavior
> > is still incorrect: I expect the address range to be write-protected,
> > but it isn't.
> >
> > The reason why is that hugetlb_change_protection uses huge_pte_offset,
> > even if it's being called for a UFFDIO_WRITEPROTECT with
> > UFFDIO_WRITEPROTECT_MODE_WP. In that particular case, I'm pretty sure
> > we should be using huge_pte_alloc, but even so, it's not trivial to
> > get an allocation failure back up to userspace. The non-hugetlb
> > implementation of UFFDIO_WRITEPROTECT seems to also have this problem.
> >
> > Peter, what do you think?
>
> Indeed.  Thanks for spotting that, James.
>
> Non-hugetlb should be fine with having empty pgtable entries. Anon doesn't
> need to care about ranges with unpopulated page tables so far. Shmem handles
> it with a few change_prepare() calls to populate the entries so the markers
> can be installed later on.

Ah ok! :)

>
> However, I think the failure handling is still not done well even for shmem,
> as you pointed out: that's a path I have probably never triggered myself, and
> the code has stayed there since a very early version:
>
> #define  change_pmd_prepare(vma, pmd, cp_flags)                         \
>         do {                                                            \
>                 if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {    \
>                         if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)))   \
>                                 break;                                  \
>                 }                                                       \
>         } while (0)
>
> I think a better thing we can do here (instead of warning and stopping the
> UFFDIO_WRITEPROTECT at the current stage) is to return -ENOMEM properly so
> the user can know about the error.  We'll need to touch the stack up to
> uffd_wp_range(), as it's the only caller that can trigger the -ENOMEM so
> far, so as to not ignore the retval from change_protection().
>
> Meanwhile, I also wonder whether we should call pagefault_out_of_memory(),
> because this is the same as when a pgtable allocation failure happens in
> page faults; we may want to OOM already.  I can take care of the hugetlb
> part too along the way.

I might be misunderstanding, but the only case where
hugetlb_change_protection() would *need* to allocate is when it is
called from UFFDIO_WRITEPROTECT, not while handling a #pf. So I don't
think any calls to pagefault_out_of_memory() need to be added.

>
> Man page of UFFDIO_WRITEPROTECT may need a fixup too to introduce -ENOMEM.
>
> I can quickly prepare some patches for this, and hopefully they don't need
> to block the current split fix.

I don't think it should block this splitting fix. I'll send another
version of this fix soon.

>
> Any thoughts?
>
> >
> > >
> > > To me, the fact that it was somewhat difficult to come up with this scenario
> > > is an argument that we should just unshare at split time as you propose.  Who
> > > knows what other issues may exist.
> > >
> > > > 60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes") is the
> > > > commit that introduced the WARN_ON_ONCE; perhaps it's a good choice
> > > > for a Fixes: tag (if above is indeed true).
> > >
> > > If the key issue in your above scenario is indeed the failure of
> > > hugetlb_unshare_all_pmds in the non-PUD_SIZE vma, then perhaps we tag?
> > >
> > > 6dfeaff93be1 ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when
> > > register wp")
> >
> > SGTM. Thanks Mike.
>
> Looks good here too.

Thanks, Peter!
  

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b39b74e0591a..bf7a1f628357 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -94,6 +94,8 @@  static int hugetlb_acct_memory(struct hstate *h, long delta);
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
 static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
 static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
+static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
+		unsigned long start, unsigned long end);
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
@@ -4828,6 +4830,23 @@  static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
 {
 	if (addr & ~(huge_page_mask(hstate_vma(vma))))
 		return -EINVAL;
+
+	/* We require PUD_SIZE VMA alignment for PMD sharing. */
+	if (addr & ~PUD_MASK) {
+		/*
+		 * hugetlb_vm_op_split is called right before we attempt to
+		 * split the VMA. We will need to unshare PMDs in the old and
+		 * new VMAs, so let's unshare before we split.
+		 */
+		unsigned long floor = addr & PUD_MASK;
+		unsigned long ceil = floor + PUD_SIZE;
+
+		if (floor < vma->vm_start || ceil >= vma->vm_end)
+			/* PMD sharing is already impossible. */
+			return 0;
+		hugetlb_unshare_pmds(vma, floor, ceil);
+	}
+
 	return 0;
 }
 
@@ -7313,26 +7332,21 @@  void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int re
 	}
 }
 
-/*
- * This function will unconditionally remove all the shared pmd pgtable entries
- * within the specific vma for a hugetlbfs memory range.
- */
-void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
+static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
+				   unsigned long start,
+				   unsigned long end)
 {
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_notifier_range range;
-	unsigned long address, start, end;
+	unsigned long address;
 	spinlock_t *ptl;
 	pte_t *ptep;
 
 	if (!(vma->vm_flags & VM_MAYSHARE))
 		return;
 
-	start = ALIGN(vma->vm_start, PUD_SIZE);
-	end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);
-
 	if (start >= end)
 		return;
 
@@ -7364,6 +7378,16 @@  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 	mmu_notifier_invalidate_range_end(&range);
 }
 
+/*
+ * This function will unconditionally remove all the shared pmd pgtable entries
+ * within the specific vma for a hugetlbfs memory range.
+ */
+void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
+{
+	hugetlb_unshare_pmds(vma, ALIGN(vma->vm_start, PUD_SIZE),
+			ALIGN_DOWN(vma->vm_end, PUD_SIZE));
+}
+
 #ifdef CONFIG_CMA
 static bool cma_reserve_called __initdata;