[0/2] don't use mapcount() to check large folio sharing

Message ID	20230728161356.1784568-1-fengwei.yin@intel.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Yin Fengwei <fengwei.yin@intel.com> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, willy@infradead.org, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, minchan@kernel.org, yuzhao@google.com, david@redhat.com, ryan.roberts@arm.com, shy828301@gmail.com Cc: fengwei.yin@intel.com Subject: [PATCH 0/2] don't use mapcount() to check large folio sharing Date: Sat, 29 Jul 2023 00:13:54 +0800 Message-Id: <20230728161356.1784568-1-fengwei.yin@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	don't use mapcount() to check large folio sharing \| [0/2] don't use mapcount() to check large folio sharing [1/2] madvise: don't use mapcount() against large folio for sharing check [2/2] madvise: don't use mapcount() against large folio for sharing check

Message ID

20230728161356.1784568-1-fengwei.yin@intel.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
        stable@vger.kernel.org, akpm@linux-foundation.org,
        willy@infradead.org, vishal.moola@gmail.com,
        wangkefeng.wang@huawei.com, minchan@kernel.org, yuzhao@google.com,
        david@redhat.com, ryan.roberts@arm.com, shy828301@gmail.com
Cc: fengwei.yin@intel.com
Subject: [PATCH 0/2] don't use mapcount() to check large folio sharing
Date: Sat, 29 Jul 2023 00:13:54 +0800
Message-Id: <20230728161356.1784568-1-fengwei.yin@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

don't use mapcount() to check large folio sharing |

Message

Yin Fengwei July 28, 2023, 4:13 p.m. UTC

  In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
folio_mapcount() is used to check whether the folio is shared. But it's
not correct as folio_mapcount() returns total mapcount of large folio.

Use folio_estimated_sharers() here as the estimated number is enough.

Yin Fengwei (2):
  madvise: don't use mapcount() against large folio for sharing check
  madvise: don't use mapcount() against large folio for sharing check

 mm/huge_memory.c | 2 +-
 mm/madvise.c     | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Comments

Andrew Morton July 28, 2023, 5:24 p.m. UTC | #1

On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:

> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> folio_mapcount() is used to check whether the folio is shared. But it's
> not correct as folio_mapcount() returns total mapcount of large folio.
> 
> Use folio_estimated_sharers() here as the estimated number is enough.

What are the user-visible runtime effects of these changes?

(and please try to avoid using the same Subject: for different patches)

Ryan Roberts Aug. 2, 2023, 10:27 a.m. UTC | #2

On 28/07/2023 17:13, Yin Fengwei wrote:
> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> folio_mapcount() is used to check whether the folio is shared. But it's
> not correct as folio_mapcount() returns total mapcount of large folio.
> 
> Use folio_estimated_sharers() here as the estimated number is enough.
> 
> Yin Fengwei (2):
>   madvise: don't use mapcount() against large folio for sharing check
>   madvise: don't use mapcount() against large folio for sharing check
> 
>  mm/huge_memory.c | 2 +-
>  mm/madvise.c     | 6 +++---
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 

As a set of fixes, I agree this is definitely an improvement, so:

Reviewed-By: Ryan Roberts

But I have a couple of comments around further improvements;

Once we have the scheme that David is working on to be able to provide precise
exclusive vs shared info, we will probably want to move to that. Although that
scheme will need access to the mm_struct of a process known to be mapping the
folio. We have that info, but its not passed to folio_estimated_sharers() so we
can't just reimplement folio_estimated_sharers() - we will need to rework these
call sites again.

Given the aspiration for most of the memory to be large folios going forwards,
wouldn't it be better to avoid splitting the large folio where the large folio
is mapped entirely within the range of the madvise operation? Sorry if this has
already been discussed and decided against - I didn't follow the RFC too
closely. Or perhaps you plan to do this as a follow up?

Thanks,
Ryan

David Hildenbrand Aug. 2, 2023, 10:48 a.m. UTC | #3

On 02.08.23 12:27, Ryan Roberts wrote:
> On 28/07/2023 17:13, Yin Fengwei wrote:
>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>> folio_mapcount() is used to check whether the folio is shared. But it's
>> not correct as folio_mapcount() returns total mapcount of large folio.
>>
>> Use folio_estimated_sharers() here as the estimated number is enough.
>>
>> Yin Fengwei (2):
>>    madvise: don't use mapcount() against large folio for sharing check
>>    madvise: don't use mapcount() against large folio for sharing check
>>
>>   mm/huge_memory.c | 2 +-
>>   mm/madvise.c     | 6 +++---
>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>
> 
> As a set of fixes, I agree this is definitely an improvement, so:
> 
> Reviewed-By: Ryan Roberts
> 
> 
> But I have a couple of comments around further improvements;
> 
> Once we have the scheme that David is working on to be able to provide precise
> exclusive vs shared info, we will probably want to move to that. Although that
> scheme will need access to the mm_struct of a process known to be mapping the

There are probably ways to work around lack of mm_struct, but it would 
not be completely for free. But passing the mm_struct should probably be 
an easy refactoring.

> folio. We have that info, but its not passed to folio_estimated_sharers() so we
> can't just reimplement folio_estimated_sharers() - we will need to rework these
> call sites again.

We should probably just have a

folio_maybe_mapped_shared()

with proper documentation. Nobody should care about the exact number.

If my scheme for anon pages makes it in, that would be precise for anon 
pages and we could document that. Once we can handle pagecache pages as 
well to get a precise answer, we could change to folio_mapped_shared() 
and adjust the documentation.

I just saw

https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com

that converts a lot of code to folio_estimated_sharers().

That patchset, for example, also does

total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1

I'm not 100% sure what to think about that at this point. We eventually 
add false negatives (actually shared but we fail to detect it) all over 
the place, instead of having false positives (actually exclusive, but we 
fail to detect it).

And that patch set doesn't even spell that out.

Maybe it's as good as we will get, especially if my scheme doesn't make 
it in. But we should definitely spell that out.

Ryan Roberts Aug. 2, 2023, 11:20 a.m. UTC | #4

On 02/08/2023 11:48, David Hildenbrand wrote:
> On 02.08.23 12:27, Ryan Roberts wrote:
>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>
>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>
>>> Yin Fengwei (2):
>>>    madvise: don't use mapcount() against large folio for sharing check
>>>    madvise: don't use mapcount() against large folio for sharing check
>>>
>>>   mm/huge_memory.c | 2 +-
>>>   mm/madvise.c     | 6 +++---
>>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>>
>>
>> As a set of fixes, I agree this is definitely an improvement, so:
>>
>> Reviewed-By: Ryan Roberts
>>
>>
>> But I have a couple of comments around further improvements;
>>
>> Once we have the scheme that David is working on to be able to provide precise
>> exclusive vs shared info, we will probably want to move to that. Although that
>> scheme will need access to the mm_struct of a process known to be mapping the
> 
> There are probably ways to work around lack of mm_struct, but it would not be
> completely for free. But passing the mm_struct should probably be an easy
> refactoring.
> 
>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>> call sites again.
> 
> We should probably just have a
> 
> folio_maybe_mapped_shared()
> 
> with proper documentation. Nobody should care about the exact number.
> 
> 
> If my scheme for anon pages makes it in, that would be precise for anon pages
> and we could document that. Once we can handle pagecache pages as well to get a
> precise answer, we could change to folio_mapped_shared() and adjust the
> documentation.

Makes sense to me. I'm assuming your change would allow us to get rid of
PG_anon_exclusive too? In which case we would also want a precise API
specifically for anon folios for the CoW case, without waiting for pagecache
page support.

> 
> 
> I just saw
> 
> https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com
> 
> that converts a lot of code to folio_estimated_sharers().
> 
> 
> That patchset, for example, also does
> 
> total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
> 
> I'm not 100% sure what to think about that at this point. We eventually add
> false negatives (actually shared but we fail to detect it) all over the place,
> instead of having false positives (actually exclusive, but we fail to detect it).
> 
> And that patch set doesn't even spell that out.
> 
> 
> Maybe it's as good as we will get, especially if my scheme doesn't make it in.

I've been working on the assumption that your scheme is plan A, and I'm waiting
for it to unblock forward progress on large anon folios. Is this the right
approach, or do you think your scheme is sufficiently riskly and/or far out that
I should aim not to depend on it?

> But we should definitely spell that out.
>

David Hildenbrand Aug. 2, 2023, 11:36 a.m. UTC | #5

On 02.08.23 13:20, Ryan Roberts wrote:
> On 02/08/2023 11:48, David Hildenbrand wrote:
>> On 02.08.23 12:27, Ryan Roberts wrote:
>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>
>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>
>>>> Yin Fengwei (2):
>>>>     madvise: don't use mapcount() against large folio for sharing check
>>>>     madvise: don't use mapcount() against large folio for sharing check
>>>>
>>>>    mm/huge_memory.c | 2 +-
>>>>    mm/madvise.c     | 6 +++---
>>>>    2 files changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>
>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>
>>> Reviewed-By: Ryan Roberts
>>>
>>>
>>> But I have a couple of comments around further improvements;
>>>
>>> Once we have the scheme that David is working on to be able to provide precise
>>> exclusive vs shared info, we will probably want to move to that. Although that
>>> scheme will need access to the mm_struct of a process known to be mapping the
>>
>> There are probably ways to work around lack of mm_struct, but it would not be
>> completely for free. But passing the mm_struct should probably be an easy
>> refactoring.
>>
>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>> call sites again.
>>
>> We should probably just have a
>>
>> folio_maybe_mapped_shared()
>>
>> with proper documentation. Nobody should care about the exact number.
>>
>>
>> If my scheme for anon pages makes it in, that would be precise for anon pages
>> and we could document that. Once we can handle pagecache pages as well to get a
>> precise answer, we could change to folio_mapped_shared() and adjust the
>> documentation.
> 
> Makes sense to me. I'm assuming your change would allow us to get rid of
> PG_anon_exclusive too? In which case we would also want a precise API
> specifically for anon folios for the CoW case, without waiting for pagecache
> page support.

Not necessarily and I'm currently not planning that

On the COW path, I'm planning on using it only when PG_anon_exclusive is 
clear for a compound page, combined with a check that there are no other 
page references besides from mappings: all mappings from me and #refs == 
#mappings -> reuse (set PG_anon_exclusive). That keeps the default (no 
fork) as fast as possible and simple.

>>
>> I just saw
>>
>> https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com
>>
>> that converts a lot of code to folio_estimated_sharers().
>>
>>
>> That patchset, for example, also does
>>
>> total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
>>
>> I'm not 100% sure what to think about that at this point. We eventually add
>> false negatives (actually shared but we fail to detect it) all over the place,
>> instead of having false positives (actually exclusive, but we fail to detect it).
>>
>> And that patch set doesn't even spell that out.
>>
>>
>> Maybe it's as good as we will get, especially if my scheme doesn't make it in.
> 
> I've been working on the assumption that your scheme is plan A, and I'm waiting
> for it to unblock forward progress on large anon folios. Is this the right
> approach, or do you think your scheme is sufficiently riskly and/or far out that
> I should aim not to depend on it?

It is plan A. IMHO, it does not feel too risky and/or far out at this 
point -- and the implementation should not end up too complicated. But 
as always, I cannot promise anything before it's been implemented and 
discussed upstream.

Hopefully, we know more soon. I'll get at implementing it fairly soon.

Ryan Roberts Aug. 2, 2023, 11:51 a.m. UTC | #6

On 02/08/2023 12:36, David Hildenbrand wrote:
> On 02.08.23 13:20, Ryan Roberts wrote:
>> On 02/08/2023 11:48, David Hildenbrand wrote:
>>> On 02.08.23 12:27, Ryan Roberts wrote:
>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>
>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>
>>>>> Yin Fengwei (2):
>>>>>     madvise: don't use mapcount() against large folio for sharing check
>>>>>     madvise: don't use mapcount() against large folio for sharing check
>>>>>
>>>>>    mm/huge_memory.c | 2 +-
>>>>>    mm/madvise.c     | 6 +++---
>>>>>    2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>
>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>
>>>> Reviewed-By: Ryan Roberts
>>>>
>>>>
>>>> But I have a couple of comments around further improvements;
>>>>
>>>> Once we have the scheme that David is working on to be able to provide precise
>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>
>>> There are probably ways to work around lack of mm_struct, but it would not be
>>> completely for free. But passing the mm_struct should probably be an easy
>>> refactoring.
>>>
>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>> call sites again.
>>>
>>> We should probably just have a
>>>
>>> folio_maybe_mapped_shared()
>>>
>>> with proper documentation. Nobody should care about the exact number.
>>>
>>>
>>> If my scheme for anon pages makes it in, that would be precise for anon pages
>>> and we could document that. Once we can handle pagecache pages as well to get a
>>> precise answer, we could change to folio_mapped_shared() and adjust the
>>> documentation.
>>
>> Makes sense to me. I'm assuming your change would allow us to get rid of
>> PG_anon_exclusive too? In which case we would also want a precise API
>> specifically for anon folios for the CoW case, without waiting for pagecache
>> page support.
> 
> Not necessarily and I'm currently not planning that
> 
> On the COW path, I'm planning on using it only when PG_anon_exclusive is clear
> for a compound page, combined with a check that there are no other page
> references besides from mappings: all mappings from me and #refs == #mappings ->
> reuse (set PG_anon_exclusive). That keeps the default (no fork) as fast as
> possible and simple.
> 
>>>
>>> I just saw
>>>
>>> https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com
>>>
>>> that converts a lot of code to folio_estimated_sharers().
>>>
>>>
>>> That patchset, for example, also does
>>>
>>> total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
>>>
>>> I'm not 100% sure what to think about that at this point. We eventually add
>>> false negatives (actually shared but we fail to detect it) all over the place,
>>> instead of having false positives (actually exclusive, but we fail to detect
>>> it).
>>>
>>> And that patch set doesn't even spell that out.
>>>
>>>
>>> Maybe it's as good as we will get, especially if my scheme doesn't make it in.
>>
>> I've been working on the assumption that your scheme is plan A, and I'm waiting
>> for it to unblock forward progress on large anon folios. Is this the right
>> approach, or do you think your scheme is sufficiently riskly and/or far out that
>> I should aim not to depend on it?
> 
> It is plan A. IMHO, it does not feel too risky and/or far out at this point --
> and the implementation should not end up too complicated. But as always, I
> cannot promise anything before it's been implemented and discussed upstream.

OK, good we are on the same folio... (stolen from Hugh; if a joke is worth
telling once, its worth telling 1000 times ;-)

> 
> Hopefully, we know more soon. I'll get at implementing it fairly soon.
>

David Hildenbrand Aug. 2, 2023, 11:52 a.m. UTC | #7

On 02.08.23 13:51, Ryan Roberts wrote:
> On 02/08/2023 12:36, David Hildenbrand wrote:
>> On 02.08.23 13:20, Ryan Roberts wrote:
>>> On 02/08/2023 11:48, David Hildenbrand wrote:
>>>> On 02.08.23 12:27, Ryan Roberts wrote:
>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>
>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>
>>>>>> Yin Fengwei (2):
>>>>>>      madvise: don't use mapcount() against large folio for sharing check
>>>>>>      madvise: don't use mapcount() against large folio for sharing check
>>>>>>
>>>>>>     mm/huge_memory.c | 2 +-
>>>>>>     mm/madvise.c     | 6 +++---
>>>>>>     2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>
>>>>>
>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>
>>>>> Reviewed-By: Ryan Roberts
>>>>>
>>>>>
>>>>> But I have a couple of comments around further improvements;
>>>>>
>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>
>>>> There are probably ways to work around lack of mm_struct, but it would not be
>>>> completely for free. But passing the mm_struct should probably be an easy
>>>> refactoring.
>>>>
>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>> call sites again.
>>>>
>>>> We should probably just have a
>>>>
>>>> folio_maybe_mapped_shared()
>>>>
>>>> with proper documentation. Nobody should care about the exact number.
>>>>
>>>>
>>>> If my scheme for anon pages makes it in, that would be precise for anon pages
>>>> and we could document that. Once we can handle pagecache pages as well to get a
>>>> precise answer, we could change to folio_mapped_shared() and adjust the
>>>> documentation.
>>>
>>> Makes sense to me. I'm assuming your change would allow us to get rid of
>>> PG_anon_exclusive too? In which case we would also want a precise API
>>> specifically for anon folios for the CoW case, without waiting for pagecache
>>> page support.
>>
>> Not necessarily and I'm currently not planning that
>>
>> On the COW path, I'm planning on using it only when PG_anon_exclusive is clear
>> for a compound page, combined with a check that there are no other page
>> references besides from mappings: all mappings from me and #refs == #mappings ->
>> reuse (set PG_anon_exclusive). That keeps the default (no fork) as fast as
>> possible and simple.
>>
>>>>
>>>> I just saw
>>>>
>>>> https://lkml.kernel.org/r/20230802095346.87449-1-wangkefeng.wang@huawei.com
>>>>
>>>> that converts a lot of code to folio_estimated_sharers().
>>>>
>>>>
>>>> That patchset, for example, also does
>>>>
>>>> total_mapcount(page) > 1 -> folio_estimated_sharers(folio) > 1
>>>>
>>>> I'm not 100% sure what to think about that at this point. We eventually add
>>>> false negatives (actually shared but we fail to detect it) all over the place,
>>>> instead of having false positives (actually exclusive, but we fail to detect
>>>> it).
>>>>
>>>> And that patch set doesn't even spell that out.
>>>>
>>>>
>>>> Maybe it's as good as we will get, especially if my scheme doesn't make it in.
>>>
>>> I've been working on the assumption that your scheme is plan A, and I'm waiting
>>> for it to unblock forward progress on large anon folios. Is this the right
>>> approach, or do you think your scheme is sufficiently riskly and/or far out that
>>> I should aim not to depend on it?
>>
>> It is plan A. IMHO, it does not feel too risky and/or far out at this point --
>> and the implementation should not end up too complicated. But as always, I
>> cannot promise anything before it's been implemented and discussed upstream.
> 
> OK, good we are on the same folio... (stolen from Hugh; if a joke is worth
> telling once, its worth telling 1000 times ;-)

Heard it first the time :))

Yin Fengwei Aug. 2, 2023, 12:35 p.m. UTC | #8

On 8/2/2023 6:27 PM, Ryan Roberts wrote:
> On 28/07/2023 17:13, Yin Fengwei wrote:
>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>> folio_mapcount() is used to check whether the folio is shared. But it's
>> not correct as folio_mapcount() returns total mapcount of large folio.
>>
>> Use folio_estimated_sharers() here as the estimated number is enough.
>>
>> Yin Fengwei (2):
>>   madvise: don't use mapcount() against large folio for sharing check
>>   madvise: don't use mapcount() against large folio for sharing check
>>
>>  mm/huge_memory.c | 2 +-
>>  mm/madvise.c     | 6 +++---
>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>
> 
> As a set of fixes, I agree this is definitely an improvement, so:
> 
> Reviewed-By: Ryan Roberts
Thanks.

> 
> 
> But I have a couple of comments around further improvements;
> 
> Once we have the scheme that David is working on to be able to provide precise
> exclusive vs shared info, we will probably want to move to that. Although that
> scheme will need access to the mm_struct of a process known to be mapping the
> folio. We have that info, but its not passed to folio_estimated_sharers() so we
> can't just reimplement folio_estimated_sharers() - we will need to rework these
> call sites again.
Yes. This could be extra work. Maybe should delay till David's work is done.

> 
> Given the aspiration for most of the memory to be large folios going forwards,
> wouldn't it be better to avoid splitting the large folio where the large folio
> is mapped entirely within the range of the madvise operation? Sorry if this has
> already been discussed and decided against - I didn't follow the RFC too
> closely. Or perhaps you plan to do this as a follow up?
Yes. We are on same page. RFC patchset did that. But there are some other opens
on the RFC. So I tried to submit this part of change which is bug fix. The other
thing left in RFC is optimization (avoid split large folio if we can).


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
>

Yin Fengwei Aug. 2, 2023, 12:39 p.m. UTC | #9

Hi Andrew,

On 7/29/2023 1:24 AM, Andrew Morton wrote:
> On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
> 
>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>> folio_mapcount() is used to check whether the folio is shared. But it's
>> not correct as folio_mapcount() returns total mapcount of large folio.
>>
>> Use folio_estimated_sharers() here as the estimated number is enough.
> 
> What are the user-visible runtime effects of these changes?
> 
> (and please try to avoid using the same Subject: for different patches)
> 

Can you hold on these patches to mm-unstable? I think we need to wait for
David's work on folio_maybe_mapped_shared() and redo the fix base on that.
Thanks and sorry for the noise.


Regards
Yin, Fengwei

Ryan Roberts Aug. 2, 2023, 12:40 p.m. UTC | #10

On 02/08/2023 13:35, Yin, Fengwei wrote:
> 
> 
> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>
>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>
>>> Yin Fengwei (2):
>>>   madvise: don't use mapcount() against large folio for sharing check
>>>   madvise: don't use mapcount() against large folio for sharing check
>>>
>>>  mm/huge_memory.c | 2 +-
>>>  mm/madvise.c     | 6 +++---
>>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>>
>>
>> As a set of fixes, I agree this is definitely an improvement, so:
>>
>> Reviewed-By: Ryan Roberts
> Thanks.
> 
>>
>>
>> But I have a couple of comments around further improvements;
>>
>> Once we have the scheme that David is working on to be able to provide precise
>> exclusive vs shared info, we will probably want to move to that. Although that
>> scheme will need access to the mm_struct of a process known to be mapping the
>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>> call sites again.
> Yes. This could be extra work. Maybe should delay till David's work is done.

What you have is definitely an improvement over what was there before. And is
probably the best we can do without David's scheme. So I wouldn't delay this.
Just pointing out that we will be able to make it even better later on (if
David's stuff goes in).

> 
>>
>> Given the aspiration for most of the memory to be large folios going forwards,
>> wouldn't it be better to avoid splitting the large folio where the large folio
>> is mapped entirely within the range of the madvise operation? Sorry if this has
>> already been discussed and decided against - I didn't follow the RFC too
>> closely. Or perhaps you plan to do this as a follow up?
> Yes. We are on same page. RFC patchset did that. But there are some other opens
> on the RFC. So I tried to submit this part of change which is bug fix. The other
> thing left in RFC is optimization (avoid split large folio if we can).
> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks,
>> Ryan
>>

Yin Fengwei Aug. 2, 2023, 12:42 p.m. UTC | #11

On 8/2/2023 8:40 PM, Ryan Roberts wrote:
> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>
>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>
>>>> Yin Fengwei (2):
>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>
>>>>  mm/huge_memory.c | 2 +-
>>>>  mm/madvise.c     | 6 +++---
>>>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>
>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>
>>> Reviewed-By: Ryan Roberts
>> Thanks.
>>
>>>
>>>
>>> But I have a couple of comments around further improvements;
>>>
>>> Once we have the scheme that David is working on to be able to provide precise
>>> exclusive vs shared info, we will probably want to move to that. Although that
>>> scheme will need access to the mm_struct of a process known to be mapping the
>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>> call sites again.
>> Yes. This could be extra work. Maybe should delay till David's work is done.
> 
> What you have is definitely an improvement over what was there before. And is
> probably the best we can do without David's scheme. So I wouldn't delay this.
> Just pointing out that we will be able to make it even better later on (if
> David's stuff goes in).
Yes. I agree that we should wait for David's work ready and do fix based on that.


Regards
Yin, Fengwei

> 
>>
>>>
>>> Given the aspiration for most of the memory to be large folios going forwards,
>>> wouldn't it be better to avoid splitting the large folio where the large folio
>>> is mapped entirely within the range of the madvise operation? Sorry if this has
>>> already been discussed and decided against - I didn't follow the RFC too
>>> closely. Or perhaps you plan to do this as a follow up?
>> Yes. We are on same page. RFC patchset did that. But there are some other opens
>> on the RFC. So I tried to submit this part of change which is bug fix. The other
>> thing left in RFC is optimization (avoid split large folio if we can).
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> Thanks,
>>> Ryan
>>>
> 
>

David Hildenbrand Aug. 2, 2023, 12:43 p.m. UTC | #12

On 02.08.23 14:40, Ryan Roberts wrote:
> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>
>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>
>>>> Yin Fengwei (2):
>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>
>>>>   mm/huge_memory.c | 2 +-
>>>>   mm/madvise.c     | 6 +++---
>>>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>
>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>
>>> Reviewed-By: Ryan Roberts
>> Thanks.
>>
>>>
>>>
>>> But I have a couple of comments around further improvements;
>>>
>>> Once we have the scheme that David is working on to be able to provide precise
>>> exclusive vs shared info, we will probably want to move to that. Although that
>>> scheme will need access to the mm_struct of a process known to be mapping the
>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>> call sites again.
>> Yes. This could be extra work. Maybe should delay till David's work is done.
> 
> What you have is definitely an improvement over what was there before. And is
> probably the best we can do without David's scheme. So I wouldn't delay this.
> Just pointing out that we will be able to make it even better later on (if
> David's stuff goes in).

Agreed, we just should be careful and clearly spell out the implications 
and that this is eventually also not what we 100% want.

That MADV_PAGEOUT now fails on a PTE-mapped THP -- as can be seen when 
executing the cow selftest where MADV_PAGEOUT will essentially fail -- 
is certainly undesired and should be fixed IMHO.

Ryan Roberts Aug. 2, 2023, 12:49 p.m. UTC | #13

On 02/08/2023 13:42, Yin, Fengwei wrote:
> 
> 
> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>
>>>
>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>
>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>
>>>>> Yin Fengwei (2):
>>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>>
>>>>>  mm/huge_memory.c | 2 +-
>>>>>  mm/madvise.c     | 6 +++---
>>>>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>
>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>
>>>> Reviewed-By: Ryan Roberts
>>> Thanks.
>>>
>>>>
>>>>
>>>> But I have a couple of comments around further improvements;
>>>>
>>>> Once we have the scheme that David is working on to be able to provide precise
>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>> call sites again.
>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>
>> What you have is definitely an improvement over what was there before. And is
>> probably the best we can do without David's scheme. So I wouldn't delay this.
>> Just pointing out that we will be able to make it even better later on (if
>> David's stuff goes in).
> Yes. I agree that we should wait for David's work ready and do fix based on that.

I was suggesting the opposite - not waiting. Then we can do separate improvement
later.

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>>>
>>>>
>>>> Given the aspiration for most of the memory to be large folios going forwards,
>>>> wouldn't it be better to avoid splitting the large folio where the large folio
>>>> is mapped entirely within the range of the madvise operation? Sorry if this has
>>>> already been discussed and decided against - I didn't follow the RFC too
>>>> closely. Or perhaps you plan to do this as a follow up?
>>> Yes. We are on same page. RFC patchset did that. But there are some other opens
>>> on the RFC. So I tried to submit this part of change which is bug fix. The other
>>> thing left in RFC is optimization (avoid split large folio if we can).
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>
>>

Yin Fengwei Aug. 2, 2023, 12:55 p.m. UTC | #14

On 8/2/2023 8:49 PM, Ryan Roberts wrote:
> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>
>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>
>>>>>> Yin Fengwei (2):
>>>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>>>
>>>>>>  mm/huge_memory.c | 2 +-
>>>>>>  mm/madvise.c     | 6 +++---
>>>>>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>
>>>>>
>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>
>>>>> Reviewed-By: Ryan Roberts
>>>> Thanks.
>>>>
>>>>>
>>>>>
>>>>> But I have a couple of comments around further improvements;
>>>>>
>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>> call sites again.
>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>
>>> What you have is definitely an improvement over what was there before. And is
>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>> Just pointing out that we will be able to make it even better later on (if
>>> David's stuff goes in).
>> Yes. I agree that we should wait for David's work ready and do fix based on that.
> 
> I was suggesting the opposite - not waiting. Then we can do separate improvement
> later.
Let's wait for David's work ready. Otherwise, some other similar fix could be
submitted. Unfortunately, we can't build folio_estimated_sharers() based on
folio_maybe_mapped_shared() because latter one requires the extra parameter.


Regards
Yin, Fengwei

> 
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>>>
>>>>> Given the aspiration for most of the memory to be large folios going forwards,
>>>>> wouldn't it be better to avoid splitting the large folio where the large folio
>>>>> is mapped entirely within the range of the madvise operation? Sorry if this has
>>>>> already been discussed and decided against - I didn't follow the RFC too
>>>>> closely. Or perhaps you plan to do this as a follow up?
>>>> Yes. We are on same page. RFC patchset did that. But there are some other opens
>>>> on the RFC. So I tried to submit this part of change which is bug fix. The other
>>>> thing left in RFC is optimization (avoid split large folio if we can).
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>
>>>
>

Yu Zhao Aug. 3, 2023, 8:46 p.m. UTC | #15

On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
> > On 02/08/2023 13:42, Yin, Fengwei wrote:
> >>
> >>
> >> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
> >>> On 02/08/2023 13:35, Yin, Fengwei wrote:
> >>>>
> >>>>
> >>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
> >>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
> >>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> >>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
> >>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
> >>>>>>
> >>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
> >>>>>>
> >>>>>> Yin Fengwei (2):
> >>>>>>   madvise: don't use mapcount() against large folio for sharing check
> >>>>>>   madvise: don't use mapcount() against large folio for sharing check
> >>>>>>
> >>>>>>  mm/huge_memory.c | 2 +-
> >>>>>>  mm/madvise.c     | 6 +++---
> >>>>>>  2 files changed, 4 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>
> >>>>> As a set of fixes, I agree this is definitely an improvement, so:
> >>>>>
> >>>>> Reviewed-By: Ryan Roberts
> >>>> Thanks.
> >>>>
> >>>>>
> >>>>>
> >>>>> But I have a couple of comments around further improvements;
> >>>>>
> >>>>> Once we have the scheme that David is working on to be able to provide precise
> >>>>> exclusive vs shared info, we will probably want to move to that. Although that
> >>>>> scheme will need access to the mm_struct of a process known to be mapping the
> >>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
> >>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
> >>>>> call sites again.
> >>>> Yes. This could be extra work. Maybe should delay till David's work is done.
> >>>
> >>> What you have is definitely an improvement over what was there before. And is
> >>> probably the best we can do without David's scheme. So I wouldn't delay this.
> >>> Just pointing out that we will be able to make it even better later on (if
> >>> David's stuff goes in).
> >> Yes. I agree that we should wait for David's work ready and do fix based on that.
> >
> > I was suggesting the opposite - not waiting. Then we can do separate improvement
> > later.
> Let's wait for David's work ready.

Waiting is fine as long as we don't miss the next merge window -- we
don't want these two bugs to get into another release. Also I think we
should cc stable, since as David mentioned, they have been causing
selftest failures.

Yin Fengwei Aug. 3, 2023, 11:27 p.m. UTC | #16

On 8/4/2023 4:46 AM, Yu Zhao wrote:
> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>
>>>>>>
>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>
>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>
>>>>>>>> Yin Fengwei (2):
>>>>>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>
>>>>>>>>  mm/huge_memory.c | 2 +-
>>>>>>>>  mm/madvise.c     | 6 +++---
>>>>>>>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>
>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>
>>>>>>> Reviewed-By: Ryan Roberts
>>>>>> Thanks.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>
>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>> call sites again.
>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>
>>>>> What you have is definitely an improvement over what was there before. And is
>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>> David's stuff goes in).
>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>
>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>> later.
>> Let's wait for David's work ready.
> 
> Waiting is fine as long as we don't miss the next merge window -- we
> don't want these two bugs to get into another release. Also I think we
> should cc stable, since as David mentioned, they have been causing
> selftest failures.

Stable was CCed. Andrew asked about the user-visible impact and I replied:
https://lore.kernel.org/linux-mm/24e7429c-14ed-d953-e652-eac178de76e3@intel.com/

I was not aware that selftest is impact by the issue. And waiting for whether these
patches are necessary for stable.

But I don't want to promote this kind of change in other place as we will move to
David's solution.


Regards
Yin, Fengwei

Yu Zhao Aug. 3, 2023, 11:38 p.m. UTC | #17

On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 8/4/2023 4:46 AM, Yu Zhao wrote:
> > On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>"
> >>
> >> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
> >>> On 02/08/2023 13:42, Yin, Fengwei wrote:
> >>>>
> >>>>
> >>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
> >>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
> >>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
> >>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> >>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
> >>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
> >>>>>>>>
> >>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
> >>>>>>>>
> >>>>>>>> Yin Fengwei (2):
> >>>>>>>>   madvise: don't use mapcount() against large folio for sharing check
> >>>>>>>>   madvise: don't use mapcount() against large folio for sharing check
> >>>>>>>>
> >>>>>>>>  mm/huge_memory.c | 2 +-
> >>>>>>>>  mm/madvise.c     | 6 +++---
> >>>>>>>>  2 files changed, 4 insertions(+), 4 deletions(-)
> >>>>>>>>
> >>>>>>>
> >>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
> >>>>>>>
> >>>>>>> Reviewed-By: Ryan Roberts
> >>>>>> Thanks.
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> But I have a couple of comments around further improvements;
> >>>>>>>
> >>>>>>> Once we have the scheme that David is working on to be able to provide precise
> >>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
> >>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
> >>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
> >>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
> >>>>>>> call sites again.
> >>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
> >>>>>
> >>>>> What you have is definitely an improvement over what was there before. And is
> >>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
> >>>>> Just pointing out that we will be able to make it even better later on (if
> >>>>> David's stuff goes in).
> >>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
> >>>
> >>> I was suggesting the opposite - not waiting. Then we can do separate improvement
> >>> later.
> >> Let's wait for David's work ready.
> >
> > Waiting is fine as long as we don't miss the next merge window -- we
> > don't want these two bugs to get into another release. Also I think we
> > should cc stable, since as David mentioned, they have been causing
> > selftest failures.
>
> Stable was CCed.

Need to add the "Cc: stable@vger.kernel.org" tag:
Documentation/process/stable-kernel-rules.rst

Yin Fengwei Aug. 4, 2023, 12:17 a.m. UTC | #18

On 8/4/2023 7:38 AM, Yu Zhao wrote:
> On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 8/4/2023 4:46 AM, Yu Zhao wrote:
>>> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>> "
>>>>
>>>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>>>
>>>>>>
>>>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>>>
>>>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>>>
>>>>>>>>>> Yin Fengwei (2):
>>>>>>>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>   madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>
>>>>>>>>>>  mm/huge_memory.c | 2 +-
>>>>>>>>>>  mm/madvise.c     | 6 +++---
>>>>>>>>>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>>>
>>>>>>>>> Reviewed-By: Ryan Roberts
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>>>
>>>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>>>> call sites again.
>>>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>>>
>>>>>>> What you have is definitely an improvement over what was there before. And is
>>>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>>>> David's stuff goes in).
>>>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>>>
>>>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>>>> later.
>>>> Let's wait for David's work ready.
>>>
>>> Waiting is fine as long as we don't miss the next merge window -- we
>>> don't want these two bugs to get into another release. Also I think we
>>> should cc stable, since as David mentioned, they have been causing
>>> selftest failures.
>>
>> Stable was CCed.
> 
> Need to add the "Cc: stable@vger.kernel.org" tag:
> Documentation/process/stable-kernel-rules.rst
OK. Thanks for clarification. I totally mis-understanded this. :).

I'd like to wait for answer from Andrew whether these patches are suitable
for stable (I suppose you think so) branch.


Regards
Yin, Fengwei

Yin Fengwei Aug. 4, 2023, 7:14 a.m. UTC | #19

Hi Andrew,

On 8/2/2023 8:39 PM, Yin, Fengwei wrote:
> Hi Andrew,
> 
> On 7/29/2023 1:24 AM, Andrew Morton wrote:
>> On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>
>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>
>> What are the user-visible runtime effects of these changes?
>>
>> (and please try to avoid using the same Subject: for different patches)
>>
> 
> Can you hold on these patches to mm-unstable? I think we need to wait for
> David's work on folio_maybe_mapped_shared() and redo the fix base on that.
> Thanks and sorry for the noise.
Sorry for bothering you again for this patchset.

Let me explain the situation here:
  - The reason to hold on the patches to mm-unstable is that I don't want to
    promote the fix in this patch (using folio_estimated_sharers()). The
    correct way is waiting for folio_maybe_mapped_shared() from David.

    Merging these patches motivate using folio_estimated_sharers() in other
    places. So once folio_maybe_mapped_shared() is ready, we need to replace
    folio_estimated_sharers() with folio_maybe_mapped_shared().

  - For this specific patches, if they are suitable for stable, we may want to
    merge it (special for stable branch. I assume folio_maybe_mapped_shared()
    may not be back ported to stable branch).

So how do we deal with this situation? Thanks in advance.

Regards
Yin, Fengwei

> 
> 
> Regards
> Yin, Fengwei

David Hildenbrand Aug. 4, 2023, 7:31 a.m. UTC | #20

On 04.08.23 02:17, Yin, Fengwei wrote:
> 
> 
> On 8/4/2023 7:38 AM, Yu Zhao wrote:
>> On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>>
>>>
>>> On 8/4/2023 4:46 AM, Yu Zhao wrote:
>>>> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>
>>>>> "
>>>>>
>>>>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>>>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>>>>
>>>>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>>>>
>>>>>>>>>>> Yin Fengwei (2):
>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>
>>>>>>>>>>>   mm/huge_memory.c | 2 +-
>>>>>>>>>>>   mm/madvise.c     | 6 +++---
>>>>>>>>>>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>>>>
>>>>>>>>>> Reviewed-By: Ryan Roberts
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>>>>
>>>>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>>>>> call sites again.
>>>>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>>>>
>>>>>>>> What you have is definitely an improvement over what was there before. And is
>>>>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>>>>> David's stuff goes in).
>>>>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>>>>
>>>>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>>>>> later.
>>>>> Let's wait for David's work ready.
>>>>
>>>> Waiting is fine as long as we don't miss the next merge window -- we
>>>> don't want these two bugs to get into another release. Also I think we
>>>> should cc stable, since as David mentioned, they have been causing
>>>> selftest failures.
>>>
>>> Stable was CCed.
>>
>> Need to add the "Cc: stable@vger.kernel.org" tag:
>> Documentation/process/stable-kernel-rules.rst
> OK. Thanks for clarification. I totally mis-understanded this. :).
> 
> I'd like to wait for answer from Andrew whether these patches are suitable
> for stable (I suppose you think so) branch.

Note that the COW test does not fail -- it skips -- but the behavir changed:

$ ./cow
# [INFO] detected THP size: 2048 KiB
# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
# [INFO] huge zeropage is enabled
TAP version 13
1..190
# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
ok 1 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped out base page
ok 2 No leak from parent into child
# [RUN] Basic COW after fork() ... with THP
ok 3 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out THP
ok 4 No leak from parent into child
# [RUN] Basic COW after fork() ... with PTE-mapped THP
ok 5 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with single PTE of THP
ok 7 No leak from parent into child
# [RUN] Basic COW after fork() ... with single PTE of swapped-out THP
ok 8 No leak from parent into child
# [RUN] Basic COW after fork() ... with partially mremap()'ed THP
ok 9 No leak from parent into child
# [RUN] Basic COW after fork() ... with partially shared THP
ok 10 No leak from parent into child
...

Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).

The code that broke it is

commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Date:   Wed Dec 21 10:08:46 2022 -0800

     madvise: convert madvise_cold_or_pageout_pte_range() to use folios
     
     This change removes a number of calls to compound_head(), and saves
     1729 bytes of kernel text.
     
     Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@gmail.com
     Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
     Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
     Cc: SeongJae Park <sj@kernel.org>
     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


Ever since v6.3.

The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(),
conversion.


Probably all that is information worth having in the patch description.

Yin Fengwei Aug. 4, 2023, 7:36 a.m. UTC | #21

Hi David,

On 8/4/2023 3:31 PM, David Hildenbrand wrote:
> On 04.08.23 02:17, Yin, Fengwei wrote:
>>
>>
>> On 8/4/2023 7:38 AM, Yu Zhao wrote:
>>> On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/4/2023 4:46 AM, Yu Zhao wrote:
>>>>> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>
>>>>>> "
>>>>>>
>>>>>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>>>>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>>>>>
>>>>>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>>>>>
>>>>>>>>>>>> Yin Fengwei (2):
>>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>>
>>>>>>>>>>>>   mm/huge_memory.c | 2 +-
>>>>>>>>>>>>   mm/madvise.c     | 6 +++---
>>>>>>>>>>>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>>>>>
>>>>>>>>>>> Reviewed-By: Ryan Roberts
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>>>>>
>>>>>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>>>>>> call sites again.
>>>>>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>>>>>
>>>>>>>>> What you have is definitely an improvement over what was there before. And is
>>>>>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>>>>>> David's stuff goes in).
>>>>>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>>>>>
>>>>>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>>>>>> later.
>>>>>> Let's wait for David's work ready.
>>>>>
>>>>> Waiting is fine as long as we don't miss the next merge window -- we
>>>>> don't want these two bugs to get into another release. Also I think we
>>>>> should cc stable, since as David mentioned, they have been causing
>>>>> selftest failures.
>>>>
>>>> Stable was CCed.
>>>
>>> Need to add the "Cc: stable@vger.kernel.org" tag:
>>> Documentation/process/stable-kernel-rules.rst
>> OK. Thanks for clarification. I totally mis-understanded this. :).
>>
>> I'd like to wait for answer from Andrew whether these patches are suitable
>> for stable (I suppose you think so) branch.
> 
> Note that the COW test does not fail -- it skips -- but the behavir changed:
> 
> $ ./cow
> # [INFO] detected THP size: 2048 KiB
> # [INFO] detected hugetlb page size: 2048 KiB
> # [INFO] detected hugetlb page size: 1048576 KiB
> # [INFO] huge zeropage is enabled
> TAP version 13
> 1..190
> # [INFO] Anonymous memory tests in private mappings
> # [RUN] Basic COW after fork() ... with base page
> ok 1 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped out base page
> ok 2 No leak from parent into child
> # [RUN] Basic COW after fork() ... with THP
> ok 3 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out THP
> ok 4 No leak from parent into child
> # [RUN] Basic COW after fork() ... with PTE-mapped THP
> ok 5 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
> # [RUN] Basic COW after fork() ... with single PTE of THP
> ok 7 No leak from parent into child
> # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP
> ok 8 No leak from parent into child
> # [RUN] Basic COW after fork() ... with partially mremap()'ed THP
> ok 9 No leak from parent into child
> # [RUN] Basic COW after fork() ... with partially shared THP
> ok 10 No leak from parent into child
> ...
> 
> Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).
> 
> The code that broke it is
> 
> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
> Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
> Date:   Wed Dec 21 10:08:46 2022 -0800
> 
>     madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>         This change removes a number of calls to compound_head(), and saves
>     1729 bytes of kernel text.
>         Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@gmail.com
>     Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
>     Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>     Cc: SeongJae Park <sj@kernel.org>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> 
> 
> Ever since v6.3.
> 
> The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(),
> conversion.
> 
> 
> Probably all that is information worth having in the patch description.
Thanks a lot for checking this. I will try this patchset to see whether
it can restore the behavior (I suppose so from your broken commit info
but want to confirm).

Regards
Yin, Fengwei

Yin Fengwei Aug. 4, 2023, 8:11 a.m. UTC | #22

On 8/4/2023 3:31 PM, David Hildenbrand wrote:
> On 04.08.23 02:17, Yin, Fengwei wrote:
>>
>>
>> On 8/4/2023 7:38 AM, Yu Zhao wrote:
>>> On Thu, Aug 3, 2023 at 5:27 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/4/2023 4:46 AM, Yu Zhao wrote:
>>>>> On Wed, Aug 2, 2023 at 6:56 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>
>>>>>> "
>>>>>>
>>>>>> On 8/2/2023 8:49 PM, Ryan Roberts wrote:
>>>>>>> On 02/08/2023 13:42, Yin, Fengwei wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/2/2023 8:40 PM, Ryan Roberts wrote:
>>>>>>>>> On 02/08/2023 13:35, Yin, Fengwei wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/2/2023 6:27 PM, Ryan Roberts wrote:
>>>>>>>>>>> On 28/07/2023 17:13, Yin Fengwei wrote:
>>>>>>>>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>>>>>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>>>>>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>>>>>>>>
>>>>>>>>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>>>>>>>>>
>>>>>>>>>>>> Yin Fengwei (2):
>>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>>    madvise: don't use mapcount() against large folio for sharing check
>>>>>>>>>>>>
>>>>>>>>>>>>   mm/huge_memory.c | 2 +-
>>>>>>>>>>>>   mm/madvise.c     | 6 +++---
>>>>>>>>>>>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> As a set of fixes, I agree this is definitely an improvement, so:
>>>>>>>>>>>
>>>>>>>>>>> Reviewed-By: Ryan Roberts
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But I have a couple of comments around further improvements;
>>>>>>>>>>>
>>>>>>>>>>> Once we have the scheme that David is working on to be able to provide precise
>>>>>>>>>>> exclusive vs shared info, we will probably want to move to that. Although that
>>>>>>>>>>> scheme will need access to the mm_struct of a process known to be mapping the
>>>>>>>>>>> folio. We have that info, but its not passed to folio_estimated_sharers() so we
>>>>>>>>>>> can't just reimplement folio_estimated_sharers() - we will need to rework these
>>>>>>>>>>> call sites again.
>>>>>>>>>> Yes. This could be extra work. Maybe should delay till David's work is done.
>>>>>>>>>
>>>>>>>>> What you have is definitely an improvement over what was there before. And is
>>>>>>>>> probably the best we can do without David's scheme. So I wouldn't delay this.
>>>>>>>>> Just pointing out that we will be able to make it even better later on (if
>>>>>>>>> David's stuff goes in).
>>>>>>>> Yes. I agree that we should wait for David's work ready and do fix based on that.
>>>>>>>
>>>>>>> I was suggesting the opposite - not waiting. Then we can do separate improvement
>>>>>>> later.
>>>>>> Let's wait for David's work ready.
>>>>>
>>>>> Waiting is fine as long as we don't miss the next merge window -- we
>>>>> don't want these two bugs to get into another release. Also I think we
>>>>> should cc stable, since as David mentioned, they have been causing
>>>>> selftest failures.
>>>>
>>>> Stable was CCed.
>>>
>>> Need to add the "Cc: stable@vger.kernel.org" tag:
>>> Documentation/process/stable-kernel-rules.rst
>> OK. Thanks for clarification. I totally mis-understanded this. :).
>>
>> I'd like to wait for answer from Andrew whether these patches are suitable
>> for stable (I suppose you think so) branch.
> 
> Note that the COW test does not fail -- it skips -- but the behavir changed:
> 
> $ ./cow
> # [INFO] detected THP size: 2048 KiB
> # [INFO] detected hugetlb page size: 2048 KiB
> # [INFO] detected hugetlb page size: 1048576 KiB
> # [INFO] huge zeropage is enabled
> TAP version 13
> 1..190
> # [INFO] Anonymous memory tests in private mappings
> # [RUN] Basic COW after fork() ... with base page
> ok 1 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped out base page
> ok 2 No leak from parent into child
> # [RUN] Basic COW after fork() ... with THP
> ok 3 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out THP
> ok 4 No leak from parent into child
> # [RUN] Basic COW after fork() ... with PTE-mapped THP
> ok 5 No leak from parent into child
> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
> # [RUN] Basic COW after fork() ... with single PTE of THP
> ok 7 No leak from parent into child
> # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP
> ok 8 No leak from parent into child
> # [RUN] Basic COW after fork() ... with partially mremap()'ed THP
> ok 9 No leak from parent into child
> # [RUN] Basic COW after fork() ... with partially shared THP
> ok 10 No leak from parent into child
> ...
> 
> Observe how patch #6 skips because the MADV_PAGEOUT was not effective (which might have happened due to other reasons as well, thus no failure).
> 
> The code that broke it is
> 
> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
> Author: Vishal Moola (Oracle) <vishal.moola@gmail.com>
> Date:   Wed Dec 21 10:08:46 2022 -0800
> 
>     madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>         This change removes a number of calls to compound_head(), and saves
>     1729 bytes of kernel text.
>         Link: https://lkml.kernel.org/r/20221221180848.20774-3-vishal.moola@gmail.com
>     Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
>     Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>     Cc: SeongJae Park <sj@kernel.org>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> 
> 
> Ever since v6.3.
> 
> The simplest way to fix it would be to revert the page_mapcount() -> folio_mapcount(),
> conversion.
Confirmed this patchset also can make swapped-out, PTE-mapped THP related tests from
skip to pass.


Regards
Yin, Fengwei 

> 
> 
> Probably all that is information worth having in the patch description.
>

Andrew Morton Aug. 7, 2023, 4:43 p.m. UTC | #23

On Fri, 4 Aug 2023 15:14:53 +0800 "Yin, Fengwei" <fengwei.yin@intel.com> wrote:

> Hi Andrew,
> 
> On 8/2/2023 8:39 PM, Yin, Fengwei wrote:
> > Hi Andrew,
> > 
> > On 7/29/2023 1:24 AM, Andrew Morton wrote:
> >> On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
> >>> folio_mapcount() is used to check whether the folio is shared. But it's
> >>> not correct as folio_mapcount() returns total mapcount of large folio.
> >>>
> >>> Use folio_estimated_sharers() here as the estimated number is enough.
> >>
> >> What are the user-visible runtime effects of these changes?
> >>
> >> (and please try to avoid using the same Subject: for different patches)
> >>
> > 
> > Can you hold on these patches to mm-unstable? I think we need to wait for
> > David's work on folio_maybe_mapped_shared() and redo the fix base on that.
> > Thanks and sorry for the noise.
> Sorry for bothering you again for this patchset.
> 
> Let me explain the situation here:
>   - The reason to hold on the patches to mm-unstable is that I don't want to
>     promote the fix in this patch (using folio_estimated_sharers()). The
>     correct way is waiting for folio_maybe_mapped_shared() from David.
> 
>     Merging these patches motivate using folio_estimated_sharers() in other
>     places. So once folio_maybe_mapped_shared() is ready, we need to replace
>     folio_estimated_sharers() with folio_maybe_mapped_shared().
> 
>   - For this specific patches, if they are suitable for stable, we may want to
>     merge it (special for stable branch. I assume folio_maybe_mapped_shared()
>     may not be back ported to stable branch).
> 
> So how do we deal with this situation? Thanks in advance.
> 

I think I'll stage them for 6.5, with a cc:stable.

I'll drop the current three patches.  Please resend with

a) different Subject:s for all patches and

b) changelogs which fully describe the user-visible effects of the change.

Thanks.

Yin Fengwei Aug. 8, 2023, 12:02 a.m. UTC | #24

Hi Andrew,

On 8/8/2023 12:43 AM, Andrew Morton wrote:
> On Fri, 4 Aug 2023 15:14:53 +0800 "Yin, Fengwei" <fengwei.yin@intel.com> wrote:
> 
>> Hi Andrew,
>>
>> On 8/2/2023 8:39 PM, Yin, Fengwei wrote:
>>> Hi Andrew,
>>>
>>> On 7/29/2023 1:24 AM, Andrew Morton wrote:
>>>> On Sat, 29 Jul 2023 00:13:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>> In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
>>>>> folio_mapcount() is used to check whether the folio is shared. But it's
>>>>> not correct as folio_mapcount() returns total mapcount of large folio.
>>>>>
>>>>> Use folio_estimated_sharers() here as the estimated number is enough.
>>>>
>>>> What are the user-visible runtime effects of these changes?
>>>>
>>>> (and please try to avoid using the same Subject: for different patches)
>>>>
>>>
>>> Can you hold on these patches to mm-unstable? I think we need to wait for
>>> David's work on folio_maybe_mapped_shared() and redo the fix base on that.
>>> Thanks and sorry for the noise.
>> Sorry for bothering you again for this patchset.
>>
>> Let me explain the situation here:
>>   - The reason to hold on the patches to mm-unstable is that I don't want to
>>     promote the fix in this patch (using folio_estimated_sharers()). The
>>     correct way is waiting for folio_maybe_mapped_shared() from David.
>>
>>     Merging these patches motivate using folio_estimated_sharers() in other
>>     places. So once folio_maybe_mapped_shared() is ready, we need to replace
>>     folio_estimated_sharers() with folio_maybe_mapped_shared().
>>
>>   - For this specific patches, if they are suitable for stable, we may want to
>>     merge it (special for stable branch. I assume folio_maybe_mapped_shared()
>>     may not be back ported to stable branch).
>>
>> So how do we deal with this situation? Thanks in advance.
>>
> 
> I think I'll stage them for 6.5, with a cc:stable.
> 
> I'll drop the current three patches.  Please resend with
Thanks. Will resend the patches.

Regards
Yin, Fengwei

> 
> a) different Subject:s for all patches and
> 
> b) changelogs which fully describe the user-visible effects of the change.
> 
> Thanks.