[1/1] mm/khugepaged: skip copying lazyfree pages on collapse

Message ID 20240201125226.28372-1-ioworker0@gmail.com
State New
Headers
Series [1/1] mm/khugepaged: skip copying lazyfree pages on collapse |

Commit Message

Lance Yang Feb. 1, 2024, 12:52 p.m. UTC
  The collapsing behavior of khugepaged with pages
marked using MADV_FREE might cause confusion
among users.

For instance, allocate a 2MB chunk using mmap and
later release it by MADV_FREE. Khugepaged will not
collapse this chunk. From the user's perspective,
it treats lazyfree pages as pte_none. However,
for some pages marked as lazyfree with MADV_FREE,
khugepaged might collapse this chunk and copy
these pages to a new huge page. This inconsistency
in behavior could be confusing for users.

After a successful MADV_FREE operation, if there is
no subsequent write, the kernel can free the pages
at any time. Therefore, in my opinion, counting
lazyfree pages in max_pte_none seems reasonable.

Perhaps treating MADV_FREE like MADV_DONTNEED, not
copying lazyfree pages when khugepaged collapses
huge pages in the background better aligns with
user expectations.

Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
 mm/khugepaged.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
  

Comments

Lance Yang Feb. 1, 2024, 1:49 p.m. UTC | #1
THP:
enable=madivse
defrag=defer
max_ptes_none=511
scan_sleep_millisecs=1000
alloc_sleep_millisecs=1000

Test code:

// allocate a 2MB chunk using mmap and later
// release it by MADV_FRE
Link: https://github.com/ioworker0/mmapvsmprotect/blob/main/test5.c

root@x:/tmp# ./a.out | grep -B 23 hg
7f762a200000-7f762a400000 rw-p 00000000 00:00 0
Size:                     2048 kB
Anonymous:         2048 kB
LazyFree:             2048 kB
AnonHugePages:       0 kB
THPeligible:    1
VmFlags: rd wr mr mw me ac sd hg

// allocate a 2MB chunk using mmap and later
// some pages marked as lazyfree with MADV_FREE
Link: https://github.com/ioworker0/mmapvsmprotect/blob/main/test4.c

root@x:/tmp# ./a.out | grep -B 23 hg
7f762a200000-7f762a400000 rw-p 00000000 00:00 0
Size:                     2048 kB
Anonymous:         2048 kB
LazyFree:                   0 kB
AnonHugePages: 2048 kB
THPeligible:    1
VmFlags: rd wr mr mw me ac sd hg

root@x:/tmp# ./a.out
[...]
root@x:/tmp# echo $?
2



On Thu, Feb 1, 2024 at 8:53 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> The collapsing behavior of khugepaged with pages
> marked using MADV_FREE might cause confusion
> among users.
>
> For instance, allocate a 2MB chunk using mmap and
> later release it by MADV_FREE. Khugepaged will not
> collapse this chunk. From the user's perspective,
> it treats lazyfree pages as pte_none. However,
> for some pages marked as lazyfree with MADV_FREE,
> khugepaged might collapse this chunk and copy
> these pages to a new huge page. This inconsistency
> in behavior could be confusing for users.
>
> After a successful MADV_FREE operation, if there is
> no subsequent write, the kernel can free the pages
> at any time. Therefore, in my opinion, counting
> lazyfree pages in max_pte_none seems reasonable.
>
> Perhaps treating MADV_FREE like MADV_DONTNEED, not
> copying lazyfree pages when khugepaged collapses
> huge pages in the background better aligns with
> user expectations.
>
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> ---
>  mm/khugepaged.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b219acb528e..6cbf46d42c6a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -777,6 +777,7 @@ static int __collapse_huge_page_copy(pte_t *pte,
>                                      pmd_t orig_pmd,
>                                      struct vm_area_struct *vma,
>                                      unsigned long address,
> +                                    struct collapse_control *cc,
>                                      spinlock_t *ptl,
>                                      struct list_head *compound_pagelist)
>  {
> @@ -797,6 +798,13 @@ static int __collapse_huge_page_copy(pte_t *pte,
>                         continue;
>                 }
>                 src_page = pte_page(pteval);
> +
> +               if (cc->is_khugepaged
> +                               && !folio_test_swapbacked(page_folio(src_page))) {
> +                       clear_user_highpage(page, _address);
> +                       continue;
> +               }
> +
>                 if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) {
>                         result = SCAN_COPY_MC;
>                         break;
> @@ -1205,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         anon_vma_unlock_write(vma->anon_vma);
>
>         result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> -                                          vma, address, pte_ptl,
> +                                          vma, address, cc, pte_ptl,
>                                            &compound_pagelist);
>         pte_unmap(pte);
>         if (unlikely(result != SCAN_SUCCEED))
> --
> 2.33.1
>
  
Yang Shi Feb. 1, 2024, 8:37 p.m. UTC | #2
On Thu, Feb 1, 2024 at 4:53 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> The collapsing behavior of khugepaged with pages
> marked using MADV_FREE might cause confusion
> among users.
>
> For instance, allocate a 2MB chunk using mmap and
> later release it by MADV_FREE. Khugepaged will not
> collapse this chunk. From the user's perspective,
> it treats lazyfree pages as pte_none. However,
> for some pages marked as lazyfree with MADV_FREE,
> khugepaged might collapse this chunk and copy
> these pages to a new huge page. This inconsistency
> in behavior could be confusing for users.
>
> After a successful MADV_FREE operation, if there is
> no subsequent write, the kernel can free the pages
> at any time. Therefore, in my opinion, counting
> lazyfree pages in max_pte_none seems reasonable.
>
> Perhaps treating MADV_FREE like MADV_DONTNEED, not
> copying lazyfree pages when khugepaged collapses
> huge pages in the background better aligns with
> user expectations.
>
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> ---
>  mm/khugepaged.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b219acb528e..6cbf46d42c6a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -777,6 +777,7 @@ static int __collapse_huge_page_copy(pte_t *pte,
>                                      pmd_t orig_pmd,
>                                      struct vm_area_struct *vma,
>                                      unsigned long address,
> +                                    struct collapse_control *cc,
>                                      spinlock_t *ptl,
>                                      struct list_head *compound_pagelist)
>  {
> @@ -797,6 +798,13 @@ static int __collapse_huge_page_copy(pte_t *pte,
>                         continue;
>                 }
>                 src_page = pte_page(pteval);
> +
> +               if (cc->is_khugepaged
> +                               && !folio_test_swapbacked(page_folio(src_page))) {
> +                       clear_user_highpage(page, _address);
> +                       continue;

If the page was written before khugepaged collapsed it, and khugepaged
collapsed the page before memory reclaim kicked in, didn't this
somehow cause data corruption?

> +               }
> +
>                 if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) {
>                         result = SCAN_COPY_MC;
>                         break;
> @@ -1205,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         anon_vma_unlock_write(vma->anon_vma);
>
>         result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> -                                          vma, address, pte_ptl,
> +                                          vma, address, cc, pte_ptl,
>                                            &compound_pagelist);
>         pte_unmap(pte);
>         if (unlikely(result != SCAN_SUCCEED))
> --
> 2.33.1
>
  
Michal Hocko Feb. 2, 2024, 10:06 a.m. UTC | #3
On Thu 01-02-24 20:52:26, Lance Yang wrote:
> The collapsing behavior of khugepaged with pages
> marked using MADV_FREE might cause confusion
> among users.
> 
> For instance, allocate a 2MB chunk using mmap and
> later release it by MADV_FREE. Khugepaged will not
> collapse this chunk. From the user's perspective,
> it treats lazyfree pages as pte_none. However,
> for some pages marked as lazyfree with MADV_FREE,
> khugepaged might collapse this chunk and copy
> these pages to a new huge page. This inconsistency
> in behavior could be confusing for users.

Is that any more confusing than collapsing pte_none
pages?

TBH I do not really see why this is a problem. MADV_FREE
are correctly recognized same as pte_none so the user
defined trashold applies.
  
Lance Yang Feb. 2, 2024, 11:18 a.m. UTC | #4
On Fri, Feb 2, 2024 at 6:06 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 01-02-24 20:52:26, Lance Yang wrote:
> > The collapsing behavior of khugepaged with pages
> > marked using MADV_FREE might cause confusion
> > among users.
> >
> > For instance, allocate a 2MB chunk using mmap and
> > later release it by MADV_FREE. Khugepaged will not
> > collapse this chunk. From the user's perspective,
> > it treats lazyfree pages as pte_none. However,
> > for some pages marked as lazyfree with MADV_FREE,
> > khugepaged might collapse this chunk and copy
> > these pages to a new huge page. This inconsistency
> > in behavior could be confusing for users.
>
> Is that any more confusing than collapsing pte_none
> pages?
>
> TBH I do not really see why this is a problem. MADV_FREE
> are correctly recognized same as pte_none so the user
> defined trashold applies.

Sorry, I might not have expressed it clearly.
MADV_FREE is correctly recognized, just like
pte_none. This is also reasonable and not an issue.

IMO, since it's treated the same as pte_none,
perhaps lazyfree pages shouldn't be copied to
the new huge page.

Thanks,
Lance
> --
> Michal Hocko
> SUSE Labs
  
Lance Yang Feb. 2, 2024, 11:23 a.m. UTC | #5
On Fri, Feb 2, 2024 at 4:37 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Feb 1, 2024 at 4:53 AM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > The collapsing behavior of khugepaged with pages
> > marked using MADV_FREE might cause confusion
> > among users.
> >
> > For instance, allocate a 2MB chunk using mmap and
> > later release it by MADV_FREE. Khugepaged will not
> > collapse this chunk. From the user's perspective,
> > it treats lazyfree pages as pte_none. However,
> > for some pages marked as lazyfree with MADV_FREE,
> > khugepaged might collapse this chunk and copy
> > these pages to a new huge page. This inconsistency
> > in behavior could be confusing for users.
> >
> > After a successful MADV_FREE operation, if there is
> > no subsequent write, the kernel can free the pages
> > at any time. Therefore, in my opinion, counting
> > lazyfree pages in max_pte_none seems reasonable.
> >
> > Perhaps treating MADV_FREE like MADV_DONTNEED, not
> > copying lazyfree pages when khugepaged collapses
> > huge pages in the background better aligns with
> > user expectations.
> >
> > Signed-off-by: Lance Yang <ioworker0@gmail.com>
> > ---
> >  mm/khugepaged.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2b219acb528e..6cbf46d42c6a 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -777,6 +777,7 @@ static int __collapse_huge_page_copy(pte_t *pte,
> >                                      pmd_t orig_pmd,
> >                                      struct vm_area_struct *vma,
> >                                      unsigned long address,
> > +                                    struct collapse_control *cc,
> >                                      spinlock_t *ptl,
> >                                      struct list_head *compound_pagelist)
> >  {
> > @@ -797,6 +798,13 @@ static int __collapse_huge_page_copy(pte_t *pte,
> >                         continue;
> >                 }
> >                 src_page = pte_page(pteval);
> > +
> > +               if (cc->is_khugepaged
> > +                               && !folio_test_swapbacked(page_folio(src_page))) {
> > +                       clear_user_highpage(page, _address);
> > +                       continue;
>
> If the page was written before khugepaged collapsed it, and khugepaged
> collapsed the page before memory reclaim kicked in, didn't this
> somehow cause data corruption?
>

Thanks a lot! Yang, you're correct; indeed, there is
a potential issue with data corruption.

I took a look at the check for lazyfree pages in
smaps_pte_entry.

Here's the modification:
if (cc->is_khugepaged && !PageSwapBacked(src_page)
        && !pte_dirty(pteval) && !PageDirty(src_page)) {
        clear_user_highpage(page, _address);
        continue;
}

Could you please take a look?

Thanks,
Lance

> > +               }
> > +
> >                 if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) {
> >                         result = SCAN_COPY_MC;
> >                         break;
> > @@ -1205,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >         anon_vma_unlock_write(vma->anon_vma);
> >
> >         result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> > -                                          vma, address, pte_ptl,
> > +                                          vma, address, cc, pte_ptl,
> >                                            &compound_pagelist);
> >         pte_unmap(pte);
> >         if (unlikely(result != SCAN_SUCCEED))
> > --
> > 2.33.1
> >
  
Michal Hocko Feb. 2, 2024, 12:27 p.m. UTC | #6
On Fri 02-02-24 19:18:31, Lance Yang wrote:
> IMO, since it's treated the same as pte_none,
> perhaps lazyfree pages shouldn't be copied to
> the new huge page.

Why? The content of MADV_FREE page is valid until it is reclaimed.
  
Lance Yang Feb. 2, 2024, 12:52 p.m. UTC | #7
On Fri, Feb 2, 2024 at 8:27 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 02-02-24 19:18:31, Lance Yang wrote:
> > IMO, since it's treated the same as pte_none,
> > perhaps lazyfree pages shouldn't be copied to
> > the new huge page.
>
> Why? The content of MADV_FREE page is valid until it is reclaimed.

IMO, if MADV_FREE pages are considered valid until
reclaimed, treating them the same as pte_none might
pose a conflict.

Thanks,
Lance

> --
> Michal Hocko
> SUSE Labs
  
Michal Hocko Feb. 2, 2024, 12:57 p.m. UTC | #8
On Fri 02-02-24 20:52:48, Lance Yang wrote:
> On Fri, Feb 2, 2024 at 8:27 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 02-02-24 19:18:31, Lance Yang wrote:
> > > IMO, since it's treated the same as pte_none,
> > > perhaps lazyfree pages shouldn't be copied to
> > > the new huge page.
> >
> > Why? The content of MADV_FREE page is valid until it is reclaimed.
> 
> IMO, if MADV_FREE pages are considered valid until
> reclaimed, treating them the same as pte_none might
> pose a conflict.

What kind of conflict?
  
Lance Yang Feb. 2, 2024, 1:46 p.m. UTC | #9
Here is a part from the man page explaining
the MADV_FREE semantics:

The kernel can thus free thesepages, but the
freeing could be delayed until memory pressure
occurs. For each of the pages that has been
marked to be freed but has not yet been freed,
the free operation will be canceled if the caller
writes into the page. If there is no subsequent
write, the kernel can free the pages at any time.

IIUC, if there is no subsequent write, lazyfree
pages will eventually be reclaimed. khugepaged
treats lazyfree pages the same as pte_none,
avoiding copying them to the new huge page
during collapse. It seems that lazyfree pages
are reclaimed before khugepaged collapses them.
This aligns with user expectations.

However, IMO, if the content of MADV_FREE pages
remains valid during collapse, then khugepaged
treating lazyfree pages the same as pte_none
might not be suitable.

Thanks,
Lance

On Fri, Feb 2, 2024 at 8:57 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 02-02-24 20:52:48, Lance Yang wrote:
> > On Fri, Feb 2, 2024 at 8:27 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 02-02-24 19:18:31, Lance Yang wrote:
> > > > IMO, since it's treated the same as pte_none,
> > > > perhaps lazyfree pages shouldn't be copied to
> > > > the new huge page.
> > >
> > > Why? The content of MADV_FREE page is valid until it is reclaimed.
> >
> > IMO, if MADV_FREE pages are considered valid until
> > reclaimed, treating them the same as pte_none might
> > pose a conflict.
>
> What kind of conflict?
> --
> Michal Hocko
> SUSE Labs
  
Lance Yang Feb. 2, 2024, 2:20 p.m. UTC | #10
I'd like to add one more point.

Users mark pages as lazyfree with MADV_FREE,
expecting these pages to be reclaimed until memory
pressure occurs. Currently, khugepaged treats lazyfree
pages the same as pte_none, which seems reasonable.
IMO, avoiding the copying of these pages to the new huge
page during khugepaged collapse can better keep the
semantics of MADV_FREE. It appears that lazyfree
pages are reclaimed before khugepaged collapses them.

Thanks,
Lance

On Fri, Feb 2, 2024 at 9:46 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> Here is a part from the man page explaining
> the MADV_FREE semantics:
>
> The kernel can thus free thesepages, but the
> freeing could be delayed until memory pressure
> occurs. For each of the pages that has been
> marked to be freed but has not yet been freed,
> the free operation will be canceled if the caller
> writes into the page. If there is no subsequent
> write, the kernel can free the pages at any time.
>
> IIUC, if there is no subsequent write, lazyfree
> pages will eventually be reclaimed. khugepaged
> treats lazyfree pages the same as pte_none,
> avoiding copying them to the new huge page
> during collapse. It seems that lazyfree pages
> are reclaimed before khugepaged collapses them.
> This aligns with user expectations.
>
> However, IMO, if the content of MADV_FREE pages
> remains valid during collapse, then khugepaged
> treating lazyfree pages the same as pte_none
> might not be suitable.
>
> Thanks,
> Lance
>
> On Fri, Feb 2, 2024 at 8:57 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 02-02-24 20:52:48, Lance Yang wrote:
> > > On Fri, Feb 2, 2024 at 8:27 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 02-02-24 19:18:31, Lance Yang wrote:
> > > > > IMO, since it's treated the same as pte_none,
> > > > > perhaps lazyfree pages shouldn't be copied to
> > > > > the new huge page.
> > > >
> > > > Why? The content of MADV_FREE page is valid until it is reclaimed.
> > >
> > > IMO, if MADV_FREE pages are considered valid until
> > > reclaimed, treating them the same as pte_none might
> > > pose a conflict.
> >
> > What kind of conflict?
> > --
> > Michal Hocko
> > SUSE Labs
  
Michal Hocko Feb. 2, 2024, 2:42 p.m. UTC | #11
On Fri 02-02-24 21:46:45, Lance Yang wrote:
> Here is a part from the man page explaining
> the MADV_FREE semantics:
> 
> The kernel can thus free thesepages, but the
> freeing could be delayed until memory pressure
> occurs. For each of the pages that has been
> marked to be freed but has not yet been freed,
> the free operation will be canceled if the caller
> writes into the page. If there is no subsequent
> write, the kernel can free the pages at any time.
> 
> IIUC, if there is no subsequent write, lazyfree
> pages will eventually be reclaimed.

If there is no memory pressure then this might not
ever happen. User cannot make any assumption about
their content once madvise call has been done. The
content has to be considered lost. Sure the userspace
might have means to tell those pages from zero pages
and recheck after the write but that is about it.

> khugepaged
> treats lazyfree pages the same as pte_none,
> avoiding copying them to the new huge page
> during collapse. It seems that lazyfree pages
> are reclaimed before khugepaged collapses them.
> This aligns with user expectations.
> 
> However, IMO, if the content of MADV_FREE pages
> remains valid during collapse, then khugepaged
> treating lazyfree pages the same as pte_none
> might not be suitable.

Why?

Unless I am missing something (which is possible of
course) I do not really see why dropping the content
of those pages and replacing them with a THP is any
difference from reclaiming those pages and then faulting
in a non-THP zero page.

Now, if khugepaged reused the original content of MADV_FREE
pages that would be a slightly different story. I can
see why users would expect zero pages to back madvised
area.
  
Lance Yang Feb. 2, 2024, 2:52 p.m. UTC | #12
How about blocking khugepaged from
collapsing lazyfree pages? This way,
is it not better to keep the semantics
of MADV_FREE?

What do you think?

Thanks,
Lance

On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 02-02-24 21:46:45, Lance Yang wrote:
> > Here is a part from the man page explaining
> > the MADV_FREE semantics:
> >
> > The kernel can thus free thesepages, but the
> > freeing could be delayed until memory pressure
> > occurs. For each of the pages that has been
> > marked to be freed but has not yet been freed,
> > the free operation will be canceled if the caller
> > writes into the page. If there is no subsequent
> > write, the kernel can free the pages at any time.
> >
> > IIUC, if there is no subsequent write, lazyfree
> > pages will eventually be reclaimed.
>
> If there is no memory pressure then this might not
> ever happen. User cannot make any assumption about
> their content once madvise call has been done. The
> content has to be considered lost. Sure the userspace
> might have means to tell those pages from zero pages
> and recheck after the write but that is about it.
>
> > khugepaged
> > treats lazyfree pages the same as pte_none,
> > avoiding copying them to the new huge page
> > during collapse. It seems that lazyfree pages
> > are reclaimed before khugepaged collapses them.
> > This aligns with user expectations.
> >
> > However, IMO, if the content of MADV_FREE pages
> > remains valid during collapse, then khugepaged
> > treating lazyfree pages the same as pte_none
> > might not be suitable.
>
> Why?
>
> Unless I am missing something (which is possible of
> course) I do not really see why dropping the content
> of those pages and replacing them with a THP is any
> difference from reclaiming those pages and then faulting
> in a non-THP zero page.
>
> Now, if khugepaged reused the original content of MADV_FREE
> pages that would be a slightly different story. I can
> see why users would expect zero pages to back madvised
> area.
> --
> Michal Hocko
> SUSE Labs
  
David Hildenbrand Feb. 2, 2024, 3:26 p.m. UTC | #13
On 02.02.24 15:52, Lance Yang wrote:
> How about blocking khugepaged from
> collapsing lazyfree pages? This way,
> is it not better to keep the semantics
> of MADV_FREE?
> 
> What do you think?

Why do we even want to change any behavior here? A lot of things "might 
cause confusion among users". Even worse, a lot of things do cause 
confusion ;)
  
Michal Hocko Feb. 2, 2024, 3:38 p.m. UTC | #14
On Fri 02-02-24 22:52:49, Lance Yang wrote:
> How about blocking khugepaged from
> collapsing lazyfree pages? This way,
> is it not better to keep the semantics
> of MADV_FREE?

I do not see any reason why. And you are not providing any
actual reasoning either. Unless you have a very specific
example in mind I do not think we want to change the current
behavior.
  
Yang Shi Feb. 2, 2024, 5:42 p.m. UTC | #15
On Fri, Feb 2, 2024 at 6:53 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> How about blocking khugepaged from
> collapsing lazyfree pages? This way,
> is it not better to keep the semantics
> of MADV_FREE?
>
> What do you think?

First of all, khugepaged doesn't treat MADV_FREE pages as pte_none
IIUC. The khugepaged does skip the 2M block if all the pages are old
and unreferenced pages in the range in hpage_collapse_scan_pmd(), then
repeat the check in collapse_huge_page() again.

And MADV_FREE pages are just old and unreferenced. This is actually
what your first test case does. The whole 2M range is MADV_FREE range,
so they are skipped by khugepaged.

But if the partial range is MADV_FREE, khugepaged won't skip them.
This is what your second test case does.

Secondly, I think it depends on the semantics of MADV_FREE,
particularly how to treat the redirtied pages. TBH I'm always confused
by the semantics. For example, the page contained "abcd", then it was
MADV_FREE'ed, then it was written again with "1234" after "abcd". So
the user should expect to see "abcd1234" or "00001234".

I'm supposed it should be "abcd1234" since MADV_FREE pages are still
valid and available, if I'm wrong please feel free to correct me. If
so we should always copy MADV_FREE pages in khugepaged regardless of
whether it is redirtied or not otherwise it may incur data corruption.
If we don't copy, then the follow up redirty after collapse to the
hugepage may return "00001234", right?

The current behavior is copying the page.

>
> Thanks,
> Lance
>
> On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 02-02-24 21:46:45, Lance Yang wrote:
> > > Here is a part from the man page explaining
> > > the MADV_FREE semantics:
> > >
> > > The kernel can thus free thesepages, but the
> > > freeing could be delayed until memory pressure
> > > occurs. For each of the pages that has been
> > > marked to be freed but has not yet been freed,
> > > the free operation will be canceled if the caller
> > > writes into the page. If there is no subsequent
> > > write, the kernel can free the pages at any time.
> > >
> > > IIUC, if there is no subsequent write, lazyfree
> > > pages will eventually be reclaimed.
> >
> > If there is no memory pressure then this might not
> > ever happen. User cannot make any assumption about
> > their content once madvise call has been done. The
> > content has to be considered lost. Sure the userspace
> > might have means to tell those pages from zero pages
> > and recheck after the write but that is about it.
> >
> > > khugepaged
> > > treats lazyfree pages the same as pte_none,
> > > avoiding copying them to the new huge page
> > > during collapse. It seems that lazyfree pages
> > > are reclaimed before khugepaged collapses them.
> > > This aligns with user expectations.
> > >
> > > However, IMO, if the content of MADV_FREE pages
> > > remains valid during collapse, then khugepaged
> > > treating lazyfree pages the same as pte_none
> > > might not be suitable.
> >
> > Why?
> >
> > Unless I am missing something (which is possible of
> > course) I do not really see why dropping the content
> > of those pages and replacing them with a THP is any
> > difference from reclaiming those pages and then faulting
> > in a non-THP zero page.
> >
> > Now, if khugepaged reused the original content of MADV_FREE
> > pages that would be a slightly different story. I can
> > see why users would expect zero pages to back madvised
> > area.
> > --
> > Michal Hocko
> > SUSE Labs
  
Yang Shi Feb. 2, 2024, 5:43 p.m. UTC | #16
On Fri, Feb 2, 2024 at 3:23 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> On Fri, Feb 2, 2024 at 4:37 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Feb 1, 2024 at 4:53 AM Lance Yang <ioworker0@gmail.com> wrote:
> > >
> > > The collapsing behavior of khugepaged with pages
> > > marked using MADV_FREE might cause confusion
> > > among users.
> > >
> > > For instance, allocate a 2MB chunk using mmap and
> > > later release it by MADV_FREE. Khugepaged will not
> > > collapse this chunk. From the user's perspective,
> > > it treats lazyfree pages as pte_none. However,
> > > for some pages marked as lazyfree with MADV_FREE,
> > > khugepaged might collapse this chunk and copy
> > > these pages to a new huge page. This inconsistency
> > > in behavior could be confusing for users.
> > >
> > > After a successful MADV_FREE operation, if there is
> > > no subsequent write, the kernel can free the pages
> > > at any time. Therefore, in my opinion, counting
> > > lazyfree pages in max_pte_none seems reasonable.
> > >
> > > Perhaps treating MADV_FREE like MADV_DONTNEED, not
> > > copying lazyfree pages when khugepaged collapses
> > > huge pages in the background better aligns with
> > > user expectations.
> > >
> > > Signed-off-by: Lance Yang <ioworker0@gmail.com>
> > > ---
> > >  mm/khugepaged.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 2b219acb528e..6cbf46d42c6a 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -777,6 +777,7 @@ static int __collapse_huge_page_copy(pte_t *pte,
> > >                                      pmd_t orig_pmd,
> > >                                      struct vm_area_struct *vma,
> > >                                      unsigned long address,
> > > +                                    struct collapse_control *cc,
> > >                                      spinlock_t *ptl,
> > >                                      struct list_head *compound_pagelist)
> > >  {
> > > @@ -797,6 +798,13 @@ static int __collapse_huge_page_copy(pte_t *pte,
> > >                         continue;
> > >                 }
> > >                 src_page = pte_page(pteval);
> > > +
> > > +               if (cc->is_khugepaged
> > > +                               && !folio_test_swapbacked(page_folio(src_page))) {
> > > +                       clear_user_highpage(page, _address);
> > > +                       continue;
> >
> > If the page was written before khugepaged collapsed it, and khugepaged
> > collapsed the page before memory reclaim kicked in, didn't this
> > somehow cause data corruption?
> >
>
> Thanks a lot! Yang, you're correct; indeed, there is
> a potential issue with data corruption.
>
> I took a look at the check for lazyfree pages in
> smaps_pte_entry.
>
> Here's the modification:
> if (cc->is_khugepaged && !PageSwapBacked(src_page)
>         && !pte_dirty(pteval) && !PageDirty(src_page)) {
>         clear_user_highpage(page, _address);
>         continue;
> }

This may be ok. But as I said in another reply, this may still incur
data corruption.

>
> Could you please take a look?
>
> Thanks,
> Lance
>
> > > +               }
> > > +
> > >                 if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) {
> > >                         result = SCAN_COPY_MC;
> > >                         break;
> > > @@ -1205,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >         anon_vma_unlock_write(vma->anon_vma);
> > >
> > >         result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
> > > -                                          vma, address, pte_ptl,
> > > +                                          vma, address, cc, pte_ptl,
> > >                                            &compound_pagelist);
> > >         pte_unmap(pte);
> > >         if (unlikely(result != SCAN_SUCCEED))
> > > --
> > > 2.33.1
> > >
  
Lance Yang Feb. 3, 2024, 4:17 a.m. UTC | #17
Hey Michal, David, Yang,

I sincerely appreciate your time!

I still have two questions that are perplexing me.

First question:
Given that khugepaged doesn't treat MADV_FREE
pages as pte_none, why skip the 2M block when all
the pages within the range are old and unreferenced,
but won't skip if the partial range is MADV_FREE,
even if it's not redirtied? Why make this distinction?
Would it not be more straightforward to maintain
if either all were skipped or not?

Second question:
Does copying lazyfree pages (not redirtied) to the
new huge page during khugepaged collapse
undermine the semantics of MADV_FREE?
Users mark pages as lazyfree with MADV_FREE,
expecting these pages to be eventually reclaimed.
Even without subsequent writes, these pages will
no longer be reclaimed, even if memory pressure
occurs.

BR,
Lance

On Sat, Feb 3, 2024 at 1:42 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Feb 2, 2024 at 6:53 AM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > How about blocking khugepaged from
> > collapsing lazyfree pages? This way,
> > is it not better to keep the semantics
> > of MADV_FREE?
> >
> > What do you think?
>
> First of all, khugepaged doesn't treat MADV_FREE pages as pte_none
> IIUC. The khugepaged does skip the 2M block if all the pages are old
> and unreferenced pages in the range in hpage_collapse_scan_pmd(), then
> repeat the check in collapse_huge_page() again.
>
> And MADV_FREE pages are just old and unreferenced. This is actually
> what your first test case does. The whole 2M range is MADV_FREE range,
> so they are skipped by khugepaged.
>
> But if the partial range is MADV_FREE, khugepaged won't skip them.
> This is what your second test case does.
>
> Secondly, I think it depends on the semantics of MADV_FREE,
> particularly how to treat the redirtied pages. TBH I'm always confused
> by the semantics. For example, the page contained "abcd", then it was
> MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> the user should expect to see "abcd1234" or "00001234".
>
> I'm supposed it should be "abcd1234" since MADV_FREE pages are still
> valid and available, if I'm wrong please feel free to correct me. If
> so we should always copy MADV_FREE pages in khugepaged regardless of
> whether it is redirtied or not otherwise it may incur data corruption.
> If we don't copy, then the follow up redirty after collapse to the
> hugepage may return "00001234", right?
>
> The current behavior is copying the page.
>
> >
> > Thanks,
> > Lance
> >
> > On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 02-02-24 21:46:45, Lance Yang wrote:
> > > > Here is a part from the man page explaining
> > > > the MADV_FREE semantics:
> > > >
> > > > The kernel can thus free thesepages, but the
> > > > freeing could be delayed until memory pressure
> > > > occurs. For each of the pages that has been
> > > > marked to be freed but has not yet been freed,
> > > > the free operation will be canceled if the caller
> > > > writes into the page. If there is no subsequent
> > > > write, the kernel can free the pages at any time.
> > > >
> > > > IIUC, if there is no subsequent write, lazyfree
> > > > pages will eventually be reclaimed.
> > >
> > > If there is no memory pressure then this might not
> > > ever happen. User cannot make any assumption about
> > > their content once madvise call has been done. The
> > > content has to be considered lost. Sure the userspace
> > > might have means to tell those pages from zero pages
> > > and recheck after the write but that is about it.
> > >
> > > > khugepaged
> > > > treats lazyfree pages the same as pte_none,
> > > > avoiding copying them to the new huge page
> > > > during collapse. It seems that lazyfree pages
> > > > are reclaimed before khugepaged collapses them.
> > > > This aligns with user expectations.
> > > >
> > > > However, IMO, if the content of MADV_FREE pages
> > > > remains valid during collapse, then khugepaged
> > > > treating lazyfree pages the same as pte_none
> > > > might not be suitable.
> > >
> > > Why?
> > >
> > > Unless I am missing something (which is possible of
> > > course) I do not really see why dropping the content
> > > of those pages and replacing them with a THP is any
> > > difference from reclaiming those pages and then faulting
> > > in a non-THP zero page.
> > >
> > > Now, if khugepaged reused the original content of MADV_FREE
> > > pages that would be a slightly different story. I can
> > > see why users would expect zero pages to back madvised
> > > area.
> > > --
> > > Michal Hocko
> > > SUSE Labs
  
Michal Hocko Feb. 5, 2024, 9:45 a.m. UTC | #18
On Fri 02-02-24 09:42:27, Yang Shi wrote:
> But if the partial range is MADV_FREE, khugepaged won't skip them.
> This is what your second test case does.
> 
> Secondly, I think it depends on the semantics of MADV_FREE,
> particularly how to treat the redirtied pages. TBH I'm always confused
> by the semantics. For example, the page contained "abcd", then it was
> MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> the user should expect to see "abcd1234" or "00001234".

Correct. You cannot assume the content of the first page as it could
have been reclaimed at any time.
 
> I'm supposed it should be "abcd1234" since MADV_FREE pages are still
> valid and available, if I'm wrong please feel free to correct me. If
> so we should always copy MADV_FREE pages in khugepaged regardless of
> whether it is redirtied or not otherwise it may incur data corruption.
> If we don't copy, then the follow up redirty after collapse to the
> hugepage may return "00001234", right?

Right. As pointed above this is a valid outcome if the page has been
dropped. User has means to tell that from /proc/vmstat though. Not in a
great precision but I think it would be really surprising to not see any
pglazyfreed yet the content is gone. I think it would be legit to call
it a bug. One could argue the bug would be in the accounting rather than
the khugepaged implementation because madvised pages could be dropped at
any time. But I think it makes more sense to copy the existing content.
  
Yang Shi Feb. 5, 2024, 7:41 p.m. UTC | #19
On Fri, Feb 2, 2024 at 8:17 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> Hey Michal, David, Yang,
>
> I sincerely appreciate your time!
>
> I still have two questions that are perplexing me.
>
> First question:
> Given that khugepaged doesn't treat MADV_FREE
> pages as pte_none, why skip the 2M block when all
> the pages within the range are old and unreferenced,
> but won't skip if the partial range is MADV_FREE,
> even if it's not redirtied? Why make this distinction?
> Would it not be more straightforward to maintain
> if either all were skipped or not?

It is just some heuristic in the code and may be some arbitrary
choice. It could controlled in a more fine-grained way if we really
see some workloads get benefit.

>
> Second question:
> Does copying lazyfree pages (not redirtied) to the
> new huge page during khugepaged collapse
> undermine the semantics of MADV_FREE?
> Users mark pages as lazyfree with MADV_FREE,
> expecting these pages to be eventually reclaimed.
> Even without subsequent writes, these pages will
> no longer be reclaimed, even if memory pressure
> occurs.

Yeah, it just means khugepaged wins the race against page reclaim. I'm
supposed the delayed free is one of the design goals of MADV_FREE, and
the risk is the pages may not be freed eventually. If you want
immediate free or more deterministic behavior, you should use
MADV_DONTNEED or munmap IIUC.

>
> BR,
> Lance
>
> On Sat, Feb 3, 2024 at 1:42 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Fri, Feb 2, 2024 at 6:53 AM Lance Yang <ioworker0@gmail.com> wrote:
> > >
> > > How about blocking khugepaged from
> > > collapsing lazyfree pages? This way,
> > > is it not better to keep the semantics
> > > of MADV_FREE?
> > >
> > > What do you think?
> >
> > First of all, khugepaged doesn't treat MADV_FREE pages as pte_none
> > IIUC. The khugepaged does skip the 2M block if all the pages are old
> > and unreferenced pages in the range in hpage_collapse_scan_pmd(), then
> > repeat the check in collapse_huge_page() again.
> >
> > And MADV_FREE pages are just old and unreferenced. This is actually
> > what your first test case does. The whole 2M range is MADV_FREE range,
> > so they are skipped by khugepaged.
> >
> > But if the partial range is MADV_FREE, khugepaged won't skip them.
> > This is what your second test case does.
> >
> > Secondly, I think it depends on the semantics of MADV_FREE,
> > particularly how to treat the redirtied pages. TBH I'm always confused
> > by the semantics. For example, the page contained "abcd", then it was
> > MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> > the user should expect to see "abcd1234" or "00001234".
> >
> > I'm supposed it should be "abcd1234" since MADV_FREE pages are still
> > valid and available, if I'm wrong please feel free to correct me. If
> > so we should always copy MADV_FREE pages in khugepaged regardless of
> > whether it is redirtied or not otherwise it may incur data corruption.
> > If we don't copy, then the follow up redirty after collapse to the
> > hugepage may return "00001234", right?
> >
> > The current behavior is copying the page.
> >
> > >
> > > Thanks,
> > > Lance
> > >
> > > On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 02-02-24 21:46:45, Lance Yang wrote:
> > > > > Here is a part from the man page explaining
> > > > > the MADV_FREE semantics:
> > > > >
> > > > > The kernel can thus free thesepages, but the
> > > > > freeing could be delayed until memory pressure
> > > > > occurs. For each of the pages that has been
> > > > > marked to be freed but has not yet been freed,
> > > > > the free operation will be canceled if the caller
> > > > > writes into the page. If there is no subsequent
> > > > > write, the kernel can free the pages at any time.
> > > > >
> > > > > IIUC, if there is no subsequent write, lazyfree
> > > > > pages will eventually be reclaimed.
> > > >
> > > > If there is no memory pressure then this might not
> > > > ever happen. User cannot make any assumption about
> > > > their content once madvise call has been done. The
> > > > content has to be considered lost. Sure the userspace
> > > > might have means to tell those pages from zero pages
> > > > and recheck after the write but that is about it.
> > > >
> > > > > khugepaged
> > > > > treats lazyfree pages the same as pte_none,
> > > > > avoiding copying them to the new huge page
> > > > > during collapse. It seems that lazyfree pages
> > > > > are reclaimed before khugepaged collapses them.
> > > > > This aligns with user expectations.
> > > > >
> > > > > However, IMO, if the content of MADV_FREE pages
> > > > > remains valid during collapse, then khugepaged
> > > > > treating lazyfree pages the same as pte_none
> > > > > might not be suitable.
> > > >
> > > > Why?
> > > >
> > > > Unless I am missing something (which is possible of
> > > > course) I do not really see why dropping the content
> > > > of those pages and replacing them with a THP is any
> > > > difference from reclaiming those pages and then faulting
> > > > in a non-THP zero page.
> > > >
> > > > Now, if khugepaged reused the original content of MADV_FREE
> > > > pages that would be a slightly different story. I can
> > > > see why users would expect zero pages to back madvised
> > > > area.
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
  
Yang Shi Feb. 5, 2024, 7:43 p.m. UTC | #20
On Mon, Feb 5, 2024 at 1:45 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 02-02-24 09:42:27, Yang Shi wrote:
> > But if the partial range is MADV_FREE, khugepaged won't skip them.
> > This is what your second test case does.
> >
> > Secondly, I think it depends on the semantics of MADV_FREE,
> > particularly how to treat the redirtied pages. TBH I'm always confused
> > by the semantics. For example, the page contained "abcd", then it was
> > MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> > the user should expect to see "abcd1234" or "00001234".
>
> Correct. You cannot assume the content of the first page as it could
> have been reclaimed at any time.
>
> > I'm supposed it should be "abcd1234" since MADV_FREE pages are still
> > valid and available, if I'm wrong please feel free to correct me. If
> > so we should always copy MADV_FREE pages in khugepaged regardless of
> > whether it is redirtied or not otherwise it may incur data corruption.
> > If we don't copy, then the follow up redirty after collapse to the
> > hugepage may return "00001234", right?
>
> Right. As pointed above this is a valid outcome if the page has been
> dropped. User has means to tell that from /proc/vmstat though. Not in a
> great precision but I think it would be really surprising to not see any
> pglazyfreed yet the content is gone. I think it would be legit to call
> it a bug. One could argue the bug would be in the accounting rather than
> the khugepaged implementation because madvised pages could be dropped at
> any time. But I think it makes more sense to copy the existing content.

Yeah, as long as khugepaged sees the MADV_FREE pages, it means they
have "NOT" been dropped yet. It may be dropped later if memory
pressure occurs, but anyway khugepaged wins the race and khugepaged
can't assume the pages will be dropped before they get redirtied. So
copying the content does make sense.

> --
> Michal Hocko
> SUSE Labs
  
Zach O'Keefe Feb. 5, 2024, 8:26 p.m. UTC | #21
On Mon, Feb 5, 2024 at 11:43 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Feb 5, 2024 at 1:45 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 02-02-24 09:42:27, Yang Shi wrote:
> > > But if the partial range is MADV_FREE, khugepaged won't skip them.
> > > This is what your second test case does.
> > >
> > > Secondly, I think it depends on the semantics of MADV_FREE,
> > > particularly how to treat the redirtied pages. TBH I'm always confused
> > > by the semantics. For example, the page contained "abcd", then it was
> > > MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> > > the user should expect to see "abcd1234" or "00001234".
> >
> > Correct. You cannot assume the content of the first page as it could
> > have been reclaimed at any time.
> >
> > > I'm supposed it should be "abcd1234" since MADV_FREE pages are still
> > > valid and available, if I'm wrong please feel free to correct me. If
> > > so we should always copy MADV_FREE pages in khugepaged regardless of
> > > whether it is redirtied or not otherwise it may incur data corruption.
> > > If we don't copy, then the follow up redirty after collapse to the
> > > hugepage may return "00001234", right?
> >
> > Right. As pointed above this is a valid outcome if the page has been
> > dropped. User has means to tell that from /proc/vmstat though. Not in a
> > great precision but I think it would be really surprising to not see any
> > pglazyfreed yet the content is gone. I think it would be legit to call
> > it a bug. One could argue the bug would be in the accounting rather than
> > the khugepaged implementation because madvised pages could be dropped at
> > any time. But I think it makes more sense to copy the existing content.

+1. I agree that the content should be dropped iff pglazyfreed is
incremented. Of course, we could do that here, but I don't think there
is a good reason to, and we should just copy the contents.

> Yeah, as long as khugepaged sees the MADV_FREE pages, it means they
> have "NOT" been dropped yet. It may be dropped later if memory
> pressure occurs, but anyway khugepaged wins the race and khugepaged
> can't assume the pages will be dropped before they get redirtied. So
> copying the content does make sense.

Per Lance, I kinda get that this "undermines" MADV_FREE, insofar that,
from the user's perspective, that memory which was intended as a
buffer against OOM kill scenarios, is no longer there to reclaim trivially. I
don't have a real world example where this is an issue, however. Also,
not copying the contents doesn't change that fact.

The proper alternative, if you want to make the "undermining"
argument, is for khugepaged to stay away from hugepage regions with
any MADV_FREE pages. I think it's fair to assume MADV_FREE'd memory is
likely cold memory, and therefore not a good hugepage target anyways.
However, it'd be unfortunate if there were a couple MADV_FREE pages in
the middle of an otherwise hot / highly-utilized hugepage region that
would prevent it from being pmd-mapped via khugepaged. But.. this is
exactly-ish what you get when hugepage-ware system/runtime allocators
split THPs to free up internal caches.

Best,
Zach


> > --
> > Michal Hocko
> > SUSE Labs
  
Lance Yang Feb. 20, 2024, 10:15 a.m. UTC | #22
Hey Zach, Yang, Michal, and David,

Please accept my sincerest apologies for the delayed
response.

Thanks for the replies; it‘s been very helpful to me! I also
appreciate the valuable information you’ve shared!

I agree that it’s not a good idea to let khugepaged avoid
any pages marked with MADV_FREE.

Thanks again for your time!

Best,
Lance

On Tue, Feb 6, 2024 at 4:27 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Mon, Feb 5, 2024 at 11:43 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Mon, Feb 5, 2024 at 1:45 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 02-02-24 09:42:27, Yang Shi wrote:
> > > > But if the partial range is MADV_FREE, khugepaged won't skip them.
> > > > This is what your second test case does.
> > > >
> > > > Secondly, I think it depends on the semantics of MADV_FREE,
> > > > particularly how to treat the redirtied pages. TBH I'm always confused
> > > > by the semantics. For example, the page contained "abcd", then it was
> > > > MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> > > > the user should expect to see "abcd1234" or "00001234".
> > >
> > > Correct. You cannot assume the content of the first page as it could
> > > have been reclaimed at any time.
> > >
> > > > I'm supposed it should be "abcd1234" since MADV_FREE pages are still
> > > > valid and available, if I'm wrong please feel free to correct me. If
> > > > so we should always copy MADV_FREE pages in khugepaged regardless of
> > > > whether it is redirtied or not otherwise it may incur data corruption.
> > > > If we don't copy, then the follow up redirty after collapse to the
> > > > hugepage may return "00001234", right?
> > >
> > > Right. As pointed above this is a valid outcome if the page has been
> > > dropped. User has means to tell that from /proc/vmstat though. Not in a
> > > great precision but I think it would be really surprising to not see any
> > > pglazyfreed yet the content is gone. I think it would be legit to call
> > > it a bug. One could argue the bug would be in the accounting rather than
> > > the khugepaged implementation because madvised pages could be dropped at
> > > any time. But I think it makes more sense to copy the existing content.
>
> +1. I agree that the content should be dropped iff pglazyfreed is
> incremented. Of course, we could do that here, but I don't think there
> is a good reason to, and we should just copy the contents.
>
> > Yeah, as long as khugepaged sees the MADV_FREE pages, it means they
> > have "NOT" been dropped yet. It may be dropped later if memory
> > pressure occurs, but anyway khugepaged wins the race and khugepaged
> > can't assume the pages will be dropped before they get redirtied. So
> > copying the content does make sense.
>
> Per Lance, I kinda get that this "undermines" MADV_FREE, insofar that,
> from the user's perspective, that memory which was intended as a
> buffer against OOM kill scenarios, is no longer there to reclaim trivially. I
> don't have a real world example where this is an issue, however. Also,
> not copying the contents doesn't change that fact.
>
> The proper alternative, if you want to make the "undermining"
> argument, is for khugepaged to stay away from hugepage regions with
> any MADV_FREE pages. I think it's fair to assume MADV_FREE'd memory is
> likely cold memory, and therefore not a good hugepage target anyways.
> However, it'd be unfortunate if there were a couple MADV_FREE pages in
> the middle of an otherwise hot / highly-utilized hugepage region that
> would prevent it from being pmd-mapped via khugepaged. But.. this is
> exactly-ish what you get when hugepage-ware system/runtime allocators
> split THPs to free up internal caches.
>
> Best,
> Zach
>
>
> > > --
> > > Michal Hocko
> > > SUSE Labs
  

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2b219acb528e..6cbf46d42c6a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -777,6 +777,7 @@  static int __collapse_huge_page_copy(pte_t *pte,
 				     pmd_t orig_pmd,
 				     struct vm_area_struct *vma,
 				     unsigned long address,
+				     struct collapse_control *cc,
 				     spinlock_t *ptl,
 				     struct list_head *compound_pagelist)
 {
@@ -797,6 +798,13 @@  static int __collapse_huge_page_copy(pte_t *pte,
 			continue;
 		}
 		src_page = pte_page(pteval);
+
+		if (cc->is_khugepaged
+				&& !folio_test_swapbacked(page_folio(src_page))) {
+			clear_user_highpage(page, _address);
+			continue;
+		}
+
 		if (copy_mc_user_highpage(page, src_page, _address, vma) > 0) {
 			result = SCAN_COPY_MC;
 			break;
@@ -1205,7 +1213,7 @@  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	anon_vma_unlock_write(vma->anon_vma);
 
 	result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
-					   vma, address, pte_ptl,
+					   vma, address, cc, pte_ptl,
 					   &compound_pagelist);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))