mm/khugepaged: fix collapse_pte_mapped_thp() to allow anon_vma

Message ID b740c9fb-edba-92ba-59fb-7a5592e5dfc@google.com
State New
Headers
Series mm/khugepaged: fix collapse_pte_mapped_thp() to allow anon_vma |

Commit Message

Hugh Dickins Dec. 22, 2022, 8:41 p.m. UTC
  uprobe_write_opcode() uses collapse_pte_mapped_thp() to restore huge pmd,
when removing a breakpoint from hugepage text: vma->anon_vma is always
set in that case, so undo the prohibition.  And MADV_COLLAPSE ought to be
able to collapse some page tables in a vma which happens to have anon_vma
set from CoWing elsewhere.

Is anon_vma lock required?  Almost not: if any page other than expected
subpage of the non-anon huge page is found in the page table, collapse is
aborted without making any change.  However, it is possible that an anon
page was CoWed from this extent in another mm or vma, in which case a
concurrent lookup might look here: so keep it away while clearing pmd
(but perhaps we shall go back to using pmd_lock() there in future).

Note that collapse_pte_mapped_thp() is exceptional in freeing a page table
without having cleared its ptes: I'm uneasy about that, and had thought
pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem,
and we would have to move the mmu_notification if clearing those ptes.

Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: <stable@vger.kernel.org>    [5.4+]
---
What this fixes is not a dangerous instability!  But I suggest Cc stable
because uprobes "healing" has regressed in that way, so this should follow
8d3c106e19e8 into those stable releases where it was backported (and may
want adjustment there - I'll supply backports as needed).

 mm/khugepaged.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)
  

Comments

David Hildenbrand Jan. 2, 2023, 12:19 p.m. UTC | #1
On 22.12.22 21:41, Hugh Dickins wrote:
> uprobe_write_opcode() uses collapse_pte_mapped_thp() to restore huge pmd,
> when removing a breakpoint from hugepage text: vma->anon_vma is always
> set in that case, so undo the prohibition.  And MADV_COLLAPSE ought to be
> able to collapse some page tables in a vma which happens to have anon_vma
> set from CoWing elsewhere.
> 

Just so I get this correctly: the degradation is that we won't 
re-collapse a THP after removing a breakpoint. Certainly "sub-optimal", 
but I guess nothing that particularly matters in practice.

Or am I wrong?

> Is anon_vma lock required?  Almost not: if any page other than expected
> subpage of the non-anon huge page is found in the page table, collapse is
> aborted without making any change.  However, it is possible that an anon
> page was CoWed from this extent in another mm or vma, in which case a
> concurrent lookup might look here: so keep it away while clearing pmd
> (but perhaps we shall go back to using pmd_lock() there in future).
> 
> Note that collapse_pte_mapped_thp() is exceptional in freeing a page table
> without having cleared its ptes: I'm uneasy about that, and had thought
> pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem,
> and we would have to move the mmu_notification if clearing those ptes.
> 
> Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zach O'Keefe <zokeefe@google.com>
> Cc: Song Liu <songliubraving@fb.com>
> Cc: <stable@vger.kernel.org>    [5.4+]
> ---
> What this fixes is not a dangerous instability!  But I suggest Cc stable
> because uprobes "healing" has regressed in that way, so this should follow
> 8d3c106e19e8 into those stable releases where it was backported (and may
> want adjustment there - I'll supply backports as needed).

If it's really something that doesn't matter in practice (e.g., -1% 
performance while debugging :) ), I guess no CC is needed. If there are 
real production workloads that suffer, I guess ccing stable is fine.

> 
>   mm/khugepaged.c | 14 ++++++--------
>   1 file changed, 6 insertions(+), 8 deletions(-)
> 
> --- 6.2-rc/mm/khugepaged.c
> +++ linux/mm/khugepaged.c
> @@ -1460,14 +1460,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
>   		return SCAN_VMA_CHECK;
>   
> -	/*
> -	 * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
> -	 * that got written to. Without this, we'd have to also lock the
> -	 * anon_vma if one exists.
> -	 */
> -	if (vma->anon_vma)
> -		return SCAN_VMA_CHECK;
> -
>   	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
>   	if (userfaultfd_wp(vma))
>   		return SCAN_PTE_UFFD_WP;
> @@ -1567,8 +1559,14 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   	}
>   
>   	/* step 4: remove pte entries */
> +	/* we make no change to anon, but protect concurrent anon page lookup */
> +	if (vma->anon_vma)
> +		anon_vma_lock_write(vma->anon_vma);
> +
>   	collapse_and_free_pmd(mm, vma, haddr, pmd);
>   
> +	if (vma->anon_vma)
> +		anon_vma_unlock_write(vma->anon_vma);
>   	i_mmap_unlock_write(vma->vm_file->f_mapping);
>   
>   maybe_install_pmd:
> 

That code is 99% black magic to me, but staring at the original fix and 
at collapse_and_free_pmd() makes me assume that grabbing that lock most 
certainly won't hurt and should be the right thing to do

Acked-by: David Hildenbrand <david@redhat.com>


Side note: set_huge_pmd() wins the award of "ugliest mm function of 
early 2023". I was briefly concerned how do_set_pmd() decides whether 
the PMD can be writable or not. Turns out it's communicated via 
vm_fault->flags. Just horrible.
  
Hugh Dickins Jan. 3, 2023, 8:40 p.m. UTC | #2
On Mon, 2 Jan 2023, David Hildenbrand wrote:
> On 22.12.22 21:41, Hugh Dickins wrote:
> > uprobe_write_opcode() uses collapse_pte_mapped_thp() to restore huge pmd,
> > when removing a breakpoint from hugepage text: vma->anon_vma is always
> > set in that case, so undo the prohibition.  And MADV_COLLAPSE ought to be
> > able to collapse some page tables in a vma which happens to have anon_vma
> > set from CoWing elsewhere.
> > 
> 
> Just so I get this correctly: the degradation is that we won't re-collapse a
> THP after removing a breakpoint. Certainly "sub-optimal", but I guess nothing
> that particularly matters in practice.

Yes, it is not a correctness issue: see what I wrote "below the dashes":

What this fixes is not a dangerous instability!  But I suggest Cc stable
because uprobes "healing" has regressed in that way, so this should follow
8d3c106e19e8 into those stable releases where it was backported (and may
want adjustment there - I'll supply backports as needed).

Which Andrew moved "above the dashes": I expect he shares, or thinks
others may share, your scepticism about it - and deserve explanation.

> 
> Or am I wrong?
> 
> > Is anon_vma lock required?  Almost not: if any page other than expected
> > subpage of the non-anon huge page is found in the page table, collapse is
> > aborted without making any change.  However, it is possible that an anon
> > page was CoWed from this extent in another mm or vma, in which case a
> > concurrent lookup might look here: so keep it away while clearing pmd
> > (but perhaps we shall go back to using pmd_lock() there in future).
> > 
> > Note that collapse_pte_mapped_thp() is exceptional in freeing a page table
> > without having cleared its ptes: I'm uneasy about that, and had thought
> > pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem,
> > and we would have to move the mmu_notification if clearing those ptes.
> > 
> > Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table
> > retraction")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Yang Shi <shy828301@gmail.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Zach O'Keefe <zokeefe@google.com>
> > Cc: Song Liu <songliubraving@fb.com>
> > Cc: <stable@vger.kernel.org>    [5.4+]
> > ---
> > What this fixes is not a dangerous instability!  But I suggest Cc stable
> > because uprobes "healing" has regressed in that way, so this should follow
> > 8d3c106e19e8 into those stable releases where it was backported (and may
> > want adjustment there - I'll supply backports as needed).
> 
> If it's really something that doesn't matter in practice (e.g., -1%
> performance while debugging :) ), I guess no CC is needed. If there are real
> production workloads that suffer, I guess ccing stable is fine.

It's about recovering performance *after* debugging.  It is not something
that is of any value to me personally, nor (so far as I know) to anyone
whom I work with.  But it is something which Song Liu went to the trouble
to make possible in his "THP aware uprobe" series three years ago, and it
is something which Jann unintentionally regressed in his recent commit:
so I thought it proper to reinstate where regressed.

(What I do have more of an investment in, is for MADV_COLLAPSE to be able
to collapse some extents in a large vma where some other extent got CoWed,
so giving the whole vma an anon_vma.  But that's not an issue for -stable,
and I cannot tell you offhand whether undoing this anon_vma exclusion is
enough to enable that or not - I suspect not, I suspect a result code or
switch statement needs to be adjusted too.)

> 
> > 
> >   mm/khugepaged.c | 14 ++++++--------
> >   1 file changed, 6 insertions(+), 8 deletions(-)
> > 
> > --- 6.2-rc/mm/khugepaged.c
> > +++ linux/mm/khugepaged.c
> > @@ -1460,14 +1460,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm,
> > unsigned long addr,
> >    if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
> >     return SCAN_VMA_CHECK;
> >   -	/*
> > -	 * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
> > -	 * that got written to. Without this, we'd have to also lock the
> > -	 * anon_vma if one exists.
> > -	 */
> > -	if (vma->anon_vma)
> > -		return SCAN_VMA_CHECK;
> > -
> >    /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> >    if (userfaultfd_wp(vma))
> >   		return SCAN_PTE_UFFD_WP;
> > @@ -1567,8 +1559,14 @@ int collapse_pte_mapped_thp(struct mm_struct *mm,
> > unsigned long addr,
> >    }
> >   
> >   	/* step 4: remove pte entries */
> > +	/* we make no change to anon, but protect concurrent anon page lookup
> > */
> > +	if (vma->anon_vma)
> > +		anon_vma_lock_write(vma->anon_vma);
> > +
> >    collapse_and_free_pmd(mm, vma, haddr, pmd);
> >   +	if (vma->anon_vma)
> > +		anon_vma_unlock_write(vma->anon_vma);
> >    i_mmap_unlock_write(vma->vm_file->f_mapping);
> >   
> >   maybe_install_pmd:
> > 
> 
> That code is 99% black magic to me, but staring at the original fix and at
> collapse_and_free_pmd() makes me assume that grabbing that lock most certainly
> won't hurt and should be the right thing to do
> 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

> 
> 
> Side note: set_huge_pmd() wins the award of "ugliest mm function of early
> 2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
> writable or not. Turns out it's communicated via vm_fault->flags. Just
> horrible.

I firmly disagree - it's from 2022! and much too small to be ugliest;
but I haven't thought about the aspect that is bothering you there.

What's bothered me most about it, is the way its name, and the naming of
the do_set_pmd() it interfaces with, give no hint that they are entirely
about file (or shmem) vmas, and would not work right on anon vmas
(I forget whether it's just a matter of which stats updated, or more).

Hugh

> 
> -- 
> Thanks,
> 
> David / dhildenb
  
David Hildenbrand Jan. 4, 2023, 9:20 a.m. UTC | #3
>> Or am I wrong?
>>
>>> Is anon_vma lock required?  Almost not: if any page other than expected
>>> subpage of the non-anon huge page is found in the page table, collapse is
>>> aborted without making any change.  However, it is possible that an anon
>>> page was CoWed from this extent in another mm or vma, in which case a
>>> concurrent lookup might look here: so keep it away while clearing pmd
>>> (but perhaps we shall go back to using pmd_lock() there in future).
>>>
>>> Note that collapse_pte_mapped_thp() is exceptional in freeing a page table
>>> without having cleared its ptes: I'm uneasy about that, and had thought
>>> pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem,
>>> and we would have to move the mmu_notification if clearing those ptes.
>>>
>>> Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table
>>> retraction")
>>> Signed-off-by: Hugh Dickins <hughd@google.com>
>>> Cc: Jann Horn <jannh@google.com>
>>> Cc: Yang Shi <shy828301@gmail.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Zach O'Keefe <zokeefe@google.com>
>>> Cc: Song Liu <songliubraving@fb.com>
>>> Cc: <stable@vger.kernel.org>    [5.4+]
>>> ---
>>> What this fixes is not a dangerous instability!  But I suggest Cc stable
>>> because uprobes "healing" has regressed in that way, so this should follow
>>> 8d3c106e19e8 into those stable releases where it was backported (and may
>>> want adjustment there - I'll supply backports as needed).
>>
>> If it's really something that doesn't matter in practice (e.g., -1%
>> performance while debugging :) ), I guess no CC is needed. If there are real
>> production workloads that suffer, I guess ccing stable is fine.
> 
> It's about recovering performance *after* debugging.  It is not something
> that is of any value to me personally, nor (so far as I know) to anyone
> whom I work with.  But it is something which Song Liu went to the trouble
> to make possible in his "THP aware uprobe" series three years ago, and it
> is something which Jann unintentionally regressed in his recent commit:
> so I thought it proper to reinstate where regressed.

Right, although I wonder if that original series fixed a real 
performance issue or was more a "this makes sense, let's just optimize 
this corner case by some serious complexity". I hope it's not the latter :)

> 
> (What I do have more of an investment in, is for MADV_COLLAPSE to be able
> to collapse some extents in a large vma where some other extent got CoWed,
> so giving the whole vma an anon_vma.  But that's not an issue for -stable,
> and I cannot tell you offhand whether undoing this anon_vma exclusion is
> enough to enable that or not - I suspect not, I suspect a result code or
> switch statement needs to be adjusted too.)

Yeah, having a single COWed page in a large MAP_PRIVATE file/shmem 
mapping would disable collapse, so it's the right thing to do.

Thinking about it some more, and the effective code change, stable 
doesn't sound wrong.

>>
>>
>> Side note: set_huge_pmd() wins the award of "ugliest mm function of early
>> 2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
>> writable or not. Turns out it's communicated via vm_fault->flags. Just
>> horrible.
> 
> I firmly disagree - it's from 2022! and much too small to be ugliest;
> but I haven't thought about the aspect that is bothering you there.

The ugliest I stumbled over in early 2023 -- until January 2nd :D

> 
> What's bothered me most about it, is the way its name, and the naming of
> the do_set_pmd() it interfaces with, give no hint that they are entirely
> about file (or shmem) vmas, and would not work right on anon vmas
> (I forget whether it's just a matter of which stats updated, or more).

Yes. I dug very deep into in-place collapse yesterday because I was 
briefly concerned about anon THP, and it took me longer to understand 
that whole machinery than it should (and that anon THP never ever 
collapse in-place).

Some of that khugepaged stuff needs some *serious* cleanups and 
refactoring. do_set_pmd() is not an exception.


Some more examples:

if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
	...
	hpage_collapse_scan_file()
} else {
	hpage_collapse_scan_pmd()
	...
}


1) hpage_collapse_scan_pmd() is only for anon memory. Totally obvious
    from the name. But why are we potentially calling it for VMAs that
    are not applicable? For maximum David confusion?

2) "IS_ENABLED(CONFIG_SHMEM) && vma->vm_file" is also supposed to cover
    ordinary file-thp. Totally obvious from the IS_ENABLED(CONFIG_SHMEM)
    ... I probably spent 30minutes understanding what's happening here.
    Just misleading and wrong without CONFIG_SHMEM.


... and what's easier to get than this magic set of boolean flags:

	hugepage_vma_check(vma, vma->vm_flags, false, false, true)

... and obviously
	hugepage_vma_revalidate()
is supposed to be a follow up to a previous
	hugepage_vma_check()
and totally different from
	transhuge_vma_suitable()

Hard to make it even less consistent.

Also, it's very clear from the code that SCAN_PTE_MAPPED_HUGEPAGE only 
applies to file-thp, right? No.
  
Yang Shi Jan. 5, 2023, 12:03 a.m. UTC | #4
On Wed, Jan 4, 2023 at 1:20 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> Or am I wrong?
> >>
> >>> Is anon_vma lock required?  Almost not: if any page other than expected
> >>> subpage of the non-anon huge page is found in the page table, collapse is
> >>> aborted without making any change.  However, it is possible that an anon
> >>> page was CoWed from this extent in another mm or vma, in which case a
> >>> concurrent lookup might look here: so keep it away while clearing pmd
> >>> (but perhaps we shall go back to using pmd_lock() there in future).
> >>>
> >>> Note that collapse_pte_mapped_thp() is exceptional in freeing a page table
> >>> without having cleared its ptes: I'm uneasy about that, and had thought
> >>> pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem,
> >>> and we would have to move the mmu_notification if clearing those ptes.
> >>>
> >>> Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table
> >>> retraction")
> >>> Signed-off-by: Hugh Dickins <hughd@google.com>
> >>> Cc: Jann Horn <jannh@google.com>
> >>> Cc: Yang Shi <shy828301@gmail.com>
> >>> Cc: David Hildenbrand <david@redhat.com>
> >>> Cc: Zach O'Keefe <zokeefe@google.com>
> >>> Cc: Song Liu <songliubraving@fb.com>
> >>> Cc: <stable@vger.kernel.org>    [5.4+]
> >>> ---
> >>> What this fixes is not a dangerous instability!  But I suggest Cc stable
> >>> because uprobes "healing" has regressed in that way, so this should follow
> >>> 8d3c106e19e8 into those stable releases where it was backported (and may
> >>> want adjustment there - I'll supply backports as needed).
> >>
> >> If it's really something that doesn't matter in practice (e.g., -1%
> >> performance while debugging :) ), I guess no CC is needed. If there are real
> >> production workloads that suffer, I guess ccing stable is fine.
> >
> > It's about recovering performance *after* debugging.  It is not something
> > that is of any value to me personally, nor (so far as I know) to anyone
> > whom I work with.  But it is something which Song Liu went to the trouble
> > to make possible in his "THP aware uprobe" series three years ago, and it
> > is something which Jann unintentionally regressed in his recent commit:
> > so I thought it proper to reinstate where regressed.
>
> Right, although I wonder if that original series fixed a real
> performance issue or was more a "this makes sense, let's just optimize
> this corner case by some serious complexity". I hope it's not the latter :)
>
> >
> > (What I do have more of an investment in, is for MADV_COLLAPSE to be able
> > to collapse some extents in a large vma where some other extent got CoWed,
> > so giving the whole vma an anon_vma.  But that's not an issue for -stable,
> > and I cannot tell you offhand whether undoing this anon_vma exclusion is
> > enough to enable that or not - I suspect not, I suspect a result code or
> > switch statement needs to be adjusted too.)
>
> Yeah, having a single COWed page in a large MAP_PRIVATE file/shmem
> mapping would disable collapse, so it's the right thing to do.
>
> Thinking about it some more, and the effective code change, stable
> doesn't sound wrong.
>
> >>
> >>
> >> Side note: set_huge_pmd() wins the award of "ugliest mm function of early
> >> 2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
> >> writable or not. Turns out it's communicated via vm_fault->flags. Just
> >> horrible.
> >
> > I firmly disagree - it's from 2022! and much too small to be ugliest;
> > but I haven't thought about the aspect that is bothering you there.
>
> The ugliest I stumbled over in early 2023 -- until January 2nd :D
>
> >
> > What's bothered me most about it, is the way its name, and the naming of
> > the do_set_pmd() it interfaces with, give no hint that they are entirely
> > about file (or shmem) vmas, and would not work right on anon vmas
> > (I forget whether it's just a matter of which stats updated, or more).
>
> Yes. I dug very deep into in-place collapse yesterday because I was
> briefly concerned about anon THP, and it took me longer to understand
> that whole machinery than it should (and that anon THP never ever
> collapse in-place).
>
> Some of that khugepaged stuff needs some *serious* cleanups and
> refactoring. do_set_pmd() is not an exception.
>
>
> Some more examples:
>
> if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
>         ...
>         hpage_collapse_scan_file()
> } else {
>         hpage_collapse_scan_pmd()
>         ...
> }
>
>
> 1) hpage_collapse_scan_pmd() is only for anon memory. Totally obvious
>     from the name. But why are we potentially calling it for VMAs that
>     are not applicable? For maximum David confusion?

IIRC the VMAs are checked before, what do you mean about "not
applicable"? But anyway khugepaged/MADV_COLLAPSE does release and
reacquire mmap_lock multiple times, so there are multiple places to
check VMAs validity.

>
> 2) "IS_ENABLED(CONFIG_SHMEM) && vma->vm_file" is also supposed to cover
>     ordinary file-thp. Totally obvious from the IS_ENABLED(CONFIG_SHMEM)
>     ... I probably spent 30minutes understanding what's happening here.
>     Just misleading and wrong without CONFIG_SHMEM.
>
>
> ... and what's easier to get than this magic set of boolean flags:
>
>         hugepage_vma_check(vma, vma->vm_flags, false, false, true)

This is not perfect. I was thinking about changing them to one flag,
just like TTU_ flags used by try_to_unmap(). That may make things
cleaner.

>
> ... and obviously
>         hugepage_vma_revalidate()
> is supposed to be a follow up to a previous
>         hugepage_vma_check()
> and totally different from
>         transhuge_vma_suitable()
>
> Hard to make it even less consistent.

This was after my cleanup, it was much messier before. And I did add
comments to make them more understandable, but anyway better naming is
definitely welcome.

>
> Also, it's very clear from the code that SCAN_PTE_MAPPED_HUGEPAGE only
> applies to file-thp, right? No.
>
> --
> Thanks,
>
> David / dhildenb
>
  
David Hildenbrand Jan. 5, 2023, 10:44 a.m. UTC | #5
On 05.01.23 01:03, Yang Shi wrote:
> On Wed, Jan 4, 2023 at 1:20 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> Or am I wrong?
>>>>
>>>>> Is anon_vma lock required?  Almost not: if any page other than expected
>>>>> subpage of the non-anon huge page is found in the page table, collapse is
>>>>> aborted without making any change.  However, it is possible that an anon
>>>>> page was CoWed from this extent in another mm or vma, in which case a
>>>>> concurrent lookup might look here: so keep it away while clearing pmd
>>>>> (but perhaps we shall go back to using pmd_lock() there in future).
>>>>>
>>>>> Note that collapse_pte_mapped_thp() is exceptional in freeing a page table
>>>>> without having cleared its ptes: I'm uneasy about that, and had thought
>>>>> pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem,
>>>>> and we would have to move the mmu_notification if clearing those ptes.
>>>>>
>>>>> Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table
>>>>> retraction")
>>>>> Signed-off-by: Hugh Dickins <hughd@google.com>
>>>>> Cc: Jann Horn <jannh@google.com>
>>>>> Cc: Yang Shi <shy828301@gmail.com>
>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>> Cc: Zach O'Keefe <zokeefe@google.com>
>>>>> Cc: Song Liu <songliubraving@fb.com>
>>>>> Cc: <stable@vger.kernel.org>    [5.4+]
>>>>> ---
>>>>> What this fixes is not a dangerous instability!  But I suggest Cc stable
>>>>> because uprobes "healing" has regressed in that way, so this should follow
>>>>> 8d3c106e19e8 into those stable releases where it was backported (and may
>>>>> want adjustment there - I'll supply backports as needed).
>>>>
>>>> If it's really something that doesn't matter in practice (e.g., -1%
>>>> performance while debugging :) ), I guess no CC is needed. If there are real
>>>> production workloads that suffer, I guess ccing stable is fine.
>>>
>>> It's about recovering performance *after* debugging.  It is not something
>>> that is of any value to me personally, nor (so far as I know) to anyone
>>> whom I work with.  But it is something which Song Liu went to the trouble
>>> to make possible in his "THP aware uprobe" series three years ago, and it
>>> is something which Jann unintentionally regressed in his recent commit:
>>> so I thought it proper to reinstate where regressed.
>>
>> Right, although I wonder if that original series fixed a real
>> performance issue or was more a "this makes sense, let's just optimize
>> this corner case by some serious complexity". I hope it's not the latter :)
>>
>>>
>>> (What I do have more of an investment in, is for MADV_COLLAPSE to be able
>>> to collapse some extents in a large vma where some other extent got CoWed,
>>> so giving the whole vma an anon_vma.  But that's not an issue for -stable,
>>> and I cannot tell you offhand whether undoing this anon_vma exclusion is
>>> enough to enable that or not - I suspect not, I suspect a result code or
>>> switch statement needs to be adjusted too.)
>>
>> Yeah, having a single COWed page in a large MAP_PRIVATE file/shmem
>> mapping would disable collapse, so it's the right thing to do.
>>
>> Thinking about it some more, and the effective code change, stable
>> doesn't sound wrong.
>>
>>>>
>>>>
>>>> Side note: set_huge_pmd() wins the award of "ugliest mm function of early
>>>> 2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
>>>> writable or not. Turns out it's communicated via vm_fault->flags. Just
>>>> horrible.
>>>
>>> I firmly disagree - it's from 2022! and much too small to be ugliest;
>>> but I haven't thought about the aspect that is bothering you there.
>>
>> The ugliest I stumbled over in early 2023 -- until January 2nd :D
>>
>>>
>>> What's bothered me most about it, is the way its name, and the naming of
>>> the do_set_pmd() it interfaces with, give no hint that they are entirely
>>> about file (or shmem) vmas, and would not work right on anon vmas
>>> (I forget whether it's just a matter of which stats updated, or more).
>>
>> Yes. I dug very deep into in-place collapse yesterday because I was
>> briefly concerned about anon THP, and it took me longer to understand
>> that whole machinery than it should (and that anon THP never ever
>> collapse in-place).
>>
>> Some of that khugepaged stuff needs some *serious* cleanups and
>> refactoring. do_set_pmd() is not an exception.
>>
>>
>> Some more examples:
>>
>> if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
>>          ...
>>          hpage_collapse_scan_file()
>> } else {
>>          hpage_collapse_scan_pmd()
>>          ...
>> }
>>
>>
>> 1) hpage_collapse_scan_pmd() is only for anon memory. Totally obvious
>>      from the name. But why are we potentially calling it for VMAs that
>>      are not applicable? For maximum David confusion?
> 
> IIRC the VMAs are checked before, what do you mean about "not
> applicable"? But anyway khugepaged/MADV_COLLAPSE does release and

I assume when CONFIG_SHMEM=n with ordinary file-thp we'll end up calling it.

> reacquire mmap_lock multiple times, so there are multiple places to
> check VMAs validity.
> 

hpage_collapse_scan_pmd() should be renamed to something like 
hpage_collapse_scan_an/on() and the duplicate code in 
khugepaged_scan_mm_slot() and madvise_collapse() should be factored out 
into something like:

hpage_collapse_scan(vma, addr, cc)
{
	if (vma->vm_file) {
		...
		hpage_collapse_scan_file()
		...
	} else if (vma_is_anonymous(vma)) {
		hpage_collapse_scan_anon()
	} else {
		WARN_ON_ONCE();
	}
}

Any CONFIG_SHMEM etc. optimizations to compile that code out should go 
into hpage_collapse_scan_file() IMHO. ... also properly checking for 
ordinary file THP support.

... and we'd really decide on a terminology "transhuge", "hugepage", 
"hpage", it's a mess. hpage might be easiest, or simply "thp". We just 
need a way to distinguish all that stuff from hugetlb.

>>
>> 2) "IS_ENABLED(CONFIG_SHMEM) && vma->vm_file" is also supposed to cover
>>      ordinary file-thp. Totally obvious from the IS_ENABLED(CONFIG_SHMEM)
>>      ... I probably spent 30minutes understanding what's happening here.
>>      Just misleading and wrong without CONFIG_SHMEM.
>>
>>
>> ... and what's easier to get than this magic set of boolean flags:
>>
>>          hugepage_vma_check(vma, vma->vm_flags, false, false, true)
> 
> This is not perfect. I was thinking about changing them to one flag,
> just like TTU_ flags used by try_to_unmap(). That may make things
> cleaner.
> 

We should provide similar flags to hugepage_vma_revalidate() and just 
replace the "cc" parameter by a way to indicate is_khugepaged. Passing 
in cc is just overkill.

We'd name the functions thp_vma_validate() and thp_vma_revalidate() or 
sth. like that.

>>
>> ... and obviously
>>          hugepage_vma_revalidate()
>> is supposed to be a follow up to a previous
>>          hugepage_vma_check()
>> and totally different from
>>          transhuge_vma_suitable()
>>
>> Hard to make it even less consistent.
> 
> This was after my cleanup, it was much messier before. And I did add
> comments to make them more understandable, but anyway better naming is
> definitely welcome.

Yeah, I appreciate any previous and any future cleanups in that area.

For example: why even *care* about the complexity of installing a PMD in 
collapse_pte_mapped_thp() using set_huge_pmd() just for MADV_COLLAPSE?

Sure, we avoid a single page fault afterwards, but is this *really* 
worth the extra code here? I mean, after we installed the PMD, the page 
could just get reclaimed either way, so there is no guarantee that we 
have a PMD mapped once we return to user space IIUC.


Anyhow, don't want to hijack this thread. I was just forced to 
understand  that code and a lot jumped at me :)
  
Zach O'Keefe Jan. 5, 2023, 9:29 p.m. UTC | #6
On Thu, Jan 5, 2023 at 2:44 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.01.23 01:03, Yang Shi wrote:
> > On Wed, Jan 4, 2023 at 1:20 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>> Or am I wrong?
> >>>>
> >>>>> Is anon_vma lock required?  Almost not: if any page other than expected
> >>>>> subpage of the non-anon huge page is found in the page table, collapse is
> >>>>> aborted without making any change.  However, it is possible that an anon
> >>>>> page was CoWed from this extent in another mm or vma, in which case a
> >>>>> concurrent lookup might look here: so keep it away while clearing pmd
> >>>>> (but perhaps we shall go back to using pmd_lock() there in future).
> >>>>>
> >>>>> Note that collapse_pte_mapped_thp() is exceptional in freeing a page table
> >>>>> without having cleared its ptes: I'm uneasy about that, and had thought
> >>>>> pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem,
> >>>>> and we would have to move the mmu_notification if clearing those ptes.
> >>>>>
> >>>>> Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table
> >>>>> retraction")
> >>>>> Signed-off-by: Hugh Dickins <hughd@google.com>
> >>>>> Cc: Jann Horn <jannh@google.com>
> >>>>> Cc: Yang Shi <shy828301@gmail.com>
> >>>>> Cc: David Hildenbrand <david@redhat.com>
> >>>>> Cc: Zach O'Keefe <zokeefe@google.com>
> >>>>> Cc: Song Liu <songliubraving@fb.com>
> >>>>> Cc: <stable@vger.kernel.org>    [5.4+]
> >>>>> ---
> >>>>> What this fixes is not a dangerous instability!  But I suggest Cc stable
> >>>>> because uprobes "healing" has regressed in that way, so this should follow
> >>>>> 8d3c106e19e8 into those stable releases where it was backported (and may
> >>>>> want adjustment there - I'll supply backports as needed).
> >>>>
> >>>> If it's really something that doesn't matter in practice (e.g., -1%
> >>>> performance while debugging :) ), I guess no CC is needed. If there are real
> >>>> production workloads that suffer, I guess ccing stable is fine.
> >>>
> >>> It's about recovering performance *after* debugging.  It is not something
> >>> that is of any value to me personally, nor (so far as I know) to anyone
> >>> whom I work with.  But it is something which Song Liu went to the trouble
> >>> to make possible in his "THP aware uprobe" series three years ago, and it
> >>> is something which Jann unintentionally regressed in his recent commit:
> >>> so I thought it proper to reinstate where regressed.
> >>
> >> Right, although I wonder if that original series fixed a real
> >> performance issue or was more a "this makes sense, let's just optimize
> >> this corner case by some serious complexity". I hope it's not the latter :)
> >>
> >>>
> >>> (What I do have more of an investment in, is for MADV_COLLAPSE to be able
> >>> to collapse some extents in a large vma where some other extent got CoWed,
> >>> so giving the whole vma an anon_vma.  But that's not an issue for -stable,
> >>> and I cannot tell you offhand whether undoing this anon_vma exclusion is
> >>> enough to enable that or not - I suspect not, I suspect a result code or
> >>> switch statement needs to be adjusted too.)
> >>
> >> Yeah, having a single COWed page in a large MAP_PRIVATE file/shmem
> >> mapping would disable collapse, so it's the right thing to do.
> >>
> >> Thinking about it some more, and the effective code change, stable
> >> doesn't sound wrong.
> >>
> >>>>
> >>>>
> >>>> Side note: set_huge_pmd() wins the award of "ugliest mm function of early
> >>>> 2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
> >>>> writable or not. Turns out it's communicated via vm_fault->flags. Just
> >>>> horrible.

My first Linux award! :) At least it's not "worst mm security issue of
early 2023". I'll take it!

> >>>
> >>> I firmly disagree - it's from 2022! and much too small to be ugliest;
> >>> but I haven't thought about the aspect that is bothering you there.
> >>
> >> The ugliest I stumbled over in early 2023 -- until January 2nd :D
> >>
> >>>
> >>> What's bothered me most about it, is the way its name, and the naming of
> >>> the do_set_pmd() it interfaces with, give no hint that they are entirely
> >>> about file (or shmem) vmas, and would not work right on anon vmas
> >>> (I forget whether it's just a matter of which stats updated, or more).
> >>
> >> Yes. I dug very deep into in-place collapse yesterday because I was
> >> briefly concerned about anon THP, and it took me longer to understand
> >> that whole machinery than it should (and that anon THP never ever
> >> collapse in-place).
> >>
> >> Some of that khugepaged stuff needs some *serious* cleanups and
> >> refactoring. do_set_pmd() is not an exception.
> >>
> >>
> >> Some more examples:
> >>
> >> if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
> >>          ...
> >>          hpage_collapse_scan_file()
> >> } else {
> >>          hpage_collapse_scan_pmd()
> >>          ...
> >> }
> >>
> >>
> >> 1) hpage_collapse_scan_pmd() is only for anon memory. Totally obvious
> >>      from the name. But why are we potentially calling it for VMAs that
> >>      are not applicable? For maximum David confusion?
> >
> > IIRC the VMAs are checked before, what do you mean about "not
> > applicable"? But anyway khugepaged/MADV_COLLAPSE does release and
>
> I assume when CONFIG_SHMEM=n with ordinary file-thp we'll end up calling it.
>
> > reacquire mmap_lock multiple times, so there are multiple places to
> > check VMAs validity.
> >
>
> hpage_collapse_scan_pmd() should be renamed to something like
> hpage_collapse_scan_an/on() and the duplicate code in
> khugepaged_scan_mm_slot() and madvise_collapse() should be factored out
> into something like:
>
> hpage_collapse_scan(vma, addr, cc)
> {
>         if (vma->vm_file) {
>                 ...
>                 hpage_collapse_scan_file()
>                 ...
>         } else if (vma_is_anonymous(vma)) {
>                 hpage_collapse_scan_anon()
>         } else {
>                 WARN_ON_ONCE();
>         }
> }
>
> Any CONFIG_SHMEM etc. optimizations to compile that code out should go
> into hpage_collapse_scan_file() IMHO. ... also properly checking for
> ordinary file THP support.
>
> ... and we'd really decide on a terminology "transhuge", "hugepage",
> "hpage", it's a mess. hpage might be easiest, or simply "thp". We just
> need a way to distinguish all that stuff from hugetlb.
>
> >>
> >> 2) "IS_ENABLED(CONFIG_SHMEM) && vma->vm_file" is also supposed to cover
> >>      ordinary file-thp. Totally obvious from the IS_ENABLED(CONFIG_SHMEM)
> >>      ... I probably spent 30minutes understanding what's happening here.
> >>      Just misleading and wrong without CONFIG_SHMEM.
> >>
> >>
> >> ... and what's easier to get than this magic set of boolean flags:
> >>
> >>          hugepage_vma_check(vma, vma->vm_flags, false, false, true)
> >
> > This is not perfect. I was thinking about changing them to one flag,
> > just like TTU_ flags used by try_to_unmap(). That may make things
> > cleaner.
> >
>
> We should provide similar flags to hugepage_vma_revalidate() and just
> replace the "cc" parameter by a way to indicate is_khugepaged. Passing
> in cc is just overkill.
>
> We'd name the functions thp_vma_validate() and thp_vma_revalidate() or
> sth. like that.
>
> >>
> >> ... and obviously
> >>          hugepage_vma_revalidate()
> >> is supposed to be a follow up to a previous
> >>          hugepage_vma_check()
> >> and totally different from
> >>          transhuge_vma_suitable()
> >>
> >> Hard to make it even less consistent.
> >
> > This was after my cleanup, it was much messier before. And I did add
> > comments to make them more understandable, but anyway better naming is
> > definitely welcome.
>
> Yeah, I appreciate any previous and any future cleanups in that area.

Likewise. I think there's definitely some work that can be done
(though things are actually much better now than they were before
Yang's last cleanup series). One of the issues is that anon and
file/shmem pathways are largely independent here, with very little
overlap, and if you just look at some function in isolation (e.g.
hugepage_vma_revalidate()) it's not clear this is a anon/file-only
function. Renaming with consistent prefixes would help -- among other
things.

> For example: why even *care* about the complexity of installing a PMD in
> collapse_pte_mapped_thp() using set_huge_pmd() just for MADV_COLLAPSE?
>
> Sure, we avoid a single page fault afterwards, but is this *really*
> worth the extra code here? I mean, after we installed the PMD, the page
> could just get reclaimed either way, so there is no guarantee that we
> have a PMD mapped once we return to user space IIUC.

A valid question. The first reason is just semantic symmetry for
MADV_COLLAPSE called on anon vs file/shmem memory. It would be nice to
say that "on success, the memory range provided will be backed by
PMD-mapped hugepages", rather than special-casing file/shmem.

The second reason has a more practical use case. In userfaultfd-based
live migration (using  UFFDIO_REGISTER_MODE_MINOR) pages are migrated
at 4KiB granularity, and it may take a long (O(many minutes)) for the
transfer of all pages to complete. To avoid severe performance
degradation on the target guest, the vmm wants to MADV_COLLAPSE
hugepage-sized regions as they fill up. Since the guest memory is
still uffd-registered, requiring refault post-MADV_COLLAPSE won't
work, since the uffd machinery will intercept the fault, and no PMD
will be mapped. As such, either uffd needs to be taught to install PMD
mappings, or the PMD mapping already must be in-place.

Thanks,
Zach

>
> Anyhow, don't want to hijack this thread. I was just forced to
> understand  that code and a lot jumped at me :)
>
> --
> Thanks,
>
> David / dhildenb
>
  
David Hildenbrand Jan. 9, 2023, 8:50 a.m. UTC | #7
>>>>>>
>>>>>> Side note: set_huge_pmd() wins the award of "ugliest mm function of early
>>>>>> 2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
>>>>>> writable or not. Turns out it's communicated via vm_fault->flags. Just
>>>>>> horrible.
> 
> My first Linux award! :) At least it's not "worst mm security issue of
> early 2023". I'll take it!

Good that you're not taking my words the wrong way.

MADV_COLLAPSE is a very useful feature (especially also for THP tests 
[1]). I wish I could have looked at some of the patches earlier. But we 
cannot wait forever to get something merged, otherwise we'd never get 
bigger changes upstream.

... so there is plenty of time left in 2023 to cleanup khugepaged.c :P


[1] https://lkml.kernel.org/r/20230104144905.460075-1-david@redhat.com

[...]


>> For example: why even *care* about the complexity of installing a PMD in
>> collapse_pte_mapped_thp() using set_huge_pmd() just for MADV_COLLAPSE?
>>
>> Sure, we avoid a single page fault afterwards, but is this *really*
>> worth the extra code here? I mean, after we installed the PMD, the page
>> could just get reclaimed either way, so there is no guarantee that we
>> have a PMD mapped once we return to user space IIUC.
> 
> A valid question. The first reason is just semantic symmetry for
> MADV_COLLAPSE called on anon vs file/shmem memory. It would be nice to
> say that "on success, the memory range provided will be backed by
> PMD-mapped hugepages", rather than special-casing file/shmem.

But there will never be such a guarantee, right? We could even see a 
split before just before we return to user space IIRC.

> 
> The second reason has a more practical use case. In userfaultfd-based
> live migration (using  UFFDIO_REGISTER_MODE_MINOR) pages are migrated
> at 4KiB granularity, and it may take a long (O(many minutes)) for the
> transfer of all pages to complete. To avoid severe performance
> degradation on the target guest, the vmm wants to MADV_COLLAPSE
> hugepage-sized regions as they fill up. Since the guest memory is
> still uffd-registered, requiring refault post-MADV_COLLAPSE won't
> work, since the uffd machinery will intercept the fault, and no PMD
> will be mapped. As such, either uffd needs to be taught to install PMD
> mappings, or the PMD mapping already must be in-place.

That's an interesting point, thanks. I assume we'd get another minor 
fault and when resolving that, we'll default to a PTE mapping.
  
Zach O'Keefe Jan. 17, 2023, 11 p.m. UTC | #8
On Mon, Jan 9, 2023 at 12:50 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>>>>>
> >>>>>> Side note: set_huge_pmd() wins the award of "ugliest mm function of early
> >>>>>> 2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
> >>>>>> writable or not. Turns out it's communicated via vm_fault->flags. Just
> >>>>>> horrible.
> >

Hey David,

Sorry for the late response here.

> > My first Linux award! :) At least it's not "worst mm security issue of
> > early 2023". I'll take it!
>
> Good that you're not taking my words the wrong way.
>
> MADV_COLLAPSE is a very useful feature (especially also for THP tests
> [1]). I wish I could have looked at some of the patches earlier. But we
> cannot wait forever to get something merged, otherwise we'd never get
> bigger changes upstream.
>
> ... so there is plenty of time left in 2023 to cleanup khugepaged.c :P
>
>
> [1] https://lkml.kernel.org/r/20230104144905.460075-1-david@redhat.com

Yes, thank you for these tests. I have them open in another tab along
with a mental TODO to take a closer look at them, and response
half-written. In-place collapse of anonymous memory *is* something
that I was interested in exploring later (I have a use-case for it;
hugepage-aware malloc() implementations). I'm taking so long on it
(sorry) b/c I need to review your point (2) (all PTE's mapping
exclusively). Hopefully I can get to it shortly.

> [...]
>
>
> >> For example: why even *care* about the complexity of installing a PMD in
> >> collapse_pte_mapped_thp() using set_huge_pmd() just for MADV_COLLAPSE?
> >>
> >> Sure, we avoid a single page fault afterwards, but is this *really*
> >> worth the extra code here? I mean, after we installed the PMD, the page
> >> could just get reclaimed either way, so there is no guarantee that we
> >> have a PMD mapped once we return to user space IIUC.
> >
> > A valid question. The first reason is just semantic symmetry for
> > MADV_COLLAPSE called on anon vs file/shmem memory. It would be nice to
> > say that "on success, the memory range provided will be backed by
> > PMD-mapped hugepages", rather than special-casing file/shmem.
>
> But there will never be such a guarantee, right? We could even see a
> split before just before we return to user space IIRC.

Absolutely. But at least we are *attempting* for symmetry here; though
admittedly, even a successful return code provides no guarantees.
Perhaps this is a weak argument by itself, though.

> >
> > The second reason has a more practical use case. In userfaultfd-based
> > live migration (using  UFFDIO_REGISTER_MODE_MINOR) pages are migrated
> > at 4KiB granularity, and it may take a long (O(many minutes)) for the
> > transfer of all pages to complete. To avoid severe performance
> > degradation on the target guest, the vmm wants to MADV_COLLAPSE
> > hugepage-sized regions as they fill up. Since the guest memory is
> > still uffd-registered, requiring refault post-MADV_COLLAPSE won't
> > work, since the uffd machinery will intercept the fault, and no PMD
> > will be mapped. As such, either uffd needs to be taught to install PMD
> > mappings, or the PMD mapping already must be in-place.
>
> That's an interesting point, thanks. I assume we'd get another minor
> fault and when resolving that, we'll default to a PTE mapping.

Yes-ish; I think it depends on how userspace decides to deal with the
event. At least in my own test cases, IIRC (hazy memory here), we
ended up in some loop of:

done faulting all 512 pages -> MADV_COLLAPSE -> fault -> copy page ->
done faulting all 512 pages -> ...

Thanks,
Zach


> --
> Thanks,
>
> David / dhildenb
>
  
David Hildenbrand Jan. 23, 2023, 11:09 a.m. UTC | #9
On 18.01.23 00:00, Zach O'Keefe wrote:
> On Mon, Jan 9, 2023 at 12:50 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>>>>>
>>>>>>>> Side note: set_huge_pmd() wins the award of "ugliest mm function of early
>>>>>>>> 2023". I was briefly concerned how do_set_pmd() decides whether the PMD can be
>>>>>>>> writable or not. Turns out it's communicated via vm_fault->flags. Just
>>>>>>>> horrible.
>>>
> 
> Hey David,
> 
> Sorry for the late response here.
> 
>>> My first Linux award! :) At least it's not "worst mm security issue of
>>> early 2023". I'll take it!
>>
>> Good that you're not taking my words the wrong way.
>>
>> MADV_COLLAPSE is a very useful feature (especially also for THP tests
>> [1]). I wish I could have looked at some of the patches earlier. But we
>> cannot wait forever to get something merged, otherwise we'd never get
>> bigger changes upstream.
>>
>> ... so there is plenty of time left in 2023 to cleanup khugepaged.c :P
>>
>>
>> [1] https://lkml.kernel.org/r/20230104144905.460075-1-david@redhat.com
> 
> Yes, thank you for these tests. I have them open in another tab along
> with a mental TODO to take a closer look at them, and response
> half-written. In-place collapse of anonymous memory *is* something
> that I was interested in exploring later (I have a use-case for it;
> hugepage-aware malloc() implementations). I'm taking so long on it
> (sorry) b/c I need to review your point (2) (all PTE's mapping
> exclusively). Hopefully I can get to it shortly.

Cool, please let me know if you have any questions. Being able to 
in-place collapse a PTE-mapped anon THP again would be nice.
  

Patch

--- 6.2-rc/mm/khugepaged.c
+++ linux/mm/khugepaged.c
@@ -1460,14 +1460,6 @@  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
 		return SCAN_VMA_CHECK;
 
-	/*
-	 * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
-	 * that got written to. Without this, we'd have to also lock the
-	 * anon_vma if one exists.
-	 */
-	if (vma->anon_vma)
-		return SCAN_VMA_CHECK;
-
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
 	if (userfaultfd_wp(vma))
 		return SCAN_PTE_UFFD_WP;
@@ -1567,8 +1559,14 @@  int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	}
 
 	/* step 4: remove pte entries */
+	/* we make no change to anon, but protect concurrent anon page lookup */
+	if (vma->anon_vma)
+		anon_vma_lock_write(vma->anon_vma);
+
 	collapse_and_free_pmd(mm, vma, haddr, pmd);
 
+	if (vma->anon_vma)
+		anon_vma_unlock_write(vma->anon_vma);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
 
 maybe_install_pmd: