[RFC] mm: support large folio numa balancing

Message ID a71a478ce404e93683023dbb7248dd95f11554f4.1699872019.git.baolin.wang@linux.alibaba.com
State New
Series [RFC] mm: support large folio numa balancing

Commit Message

Baolin Wang Nov. 13, 2023, 10:45 a.m. UTC
  Currently, file pages already support large folios, and support for
anonymous pages is also under discussion[1]. Moreover, the NUMA balancing
code was converted to use folios by a previous series[2], and the
migrate_pages() function already supports large folio migration.

So I see no reason to continue restricting NUMA balancing for large
folios.

[1] https://lkml.org/lkml/2023/9/29/342
[2] https://lore.kernel.org/all/20230921074417.24004-4-wangkefeng.wang@huawei.com/T/#md9d10fe34587229a72801f0d731f7457ab3f4a6e
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/memory.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)
  

Comments

David Hildenbrand Nov. 13, 2023, 10:53 a.m. UTC | #1
On 13.11.23 11:45, Baolin Wang wrote:
> Currently, the file pages already support large folio, and supporting for
> anonymous pages is also under discussion[1]. Moreover, the numa balancing
> code are converted to use a folio by previous thread[2], and the migrate_pages
> function also already supports the large folio migration.
> 
> So now I did not see any reason to continue restricting NUMA balancing for
> large folio.

I recall John wanted to look into that. CCing him.

I'll note that the "head page mapcount" heuristic to detect sharers will
now strike on the PTE path and make us believe that a large folio is
exclusive although it isn't.

As spelled out in the commit you are referencing:

commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
Author: Kefeng Wang <wangkefeng.wang@huawei.com>
Date:   Thu Sep 21 15:44:14 2023 +0800

     mm: memory: use a folio in do_numa_page()
     
     Numa balancing only try to migrate non-compound page in do_numa_page(),
     use a folio in it to save several compound_head calls, note we use
     folio_estimated_sharers(), it is enough to check the folio sharers since
     only normal page is handled, if large folio numa balancing is supported, a
     precise folio sharers check would be used, no functional change intended.
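
For reference, folio_estimated_sharers() only samples the mapcount of the
first subpage, roughly like this (paraphrased, not the exact source):

static inline int folio_estimated_sharers(struct folio *folio)
{
        /* Only subpage 0 is sampled; sharers that no longer map it are invisible. */
        return page_mapcount(folio_page(folio, 0));
}

So as soon as the first subpage is unmapped, a folio whose remaining
subpages are still mapped by several processes is reported as exclusive.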


I'll send WIP patches for one approach that can improve the situation soonish.
  
Kefeng Wang Nov. 13, 2023, 12:10 p.m. UTC | #2
On 2023/11/13 18:53, David Hildenbrand wrote:
> On 13.11.23 11:45, Baolin Wang wrote:
>> Currently, the file pages already support large folio, and supporting for
>> anonymous pages is also under discussion[1]. Moreover, the numa balancing
>> code are converted to use a folio by previous thread[2], and the 
>> migrate_pages
>> function also already supports the large folio migration.
>>
>> So now I did not see any reason to continue restricting NUMA balancing 
>> for
>> large folio.
> 
> I recall John wanted to look into that. CCing him.
> 
> I'll note that the "head page mapcount" heuristic to detect sharers will
> now strike on the PTE path and make us believe that a large folios is
> exclusive, although it isn't.
> 
> As spelled out in the commit you are referencing:
> 
> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
> Author: Kefeng Wang <wangkefeng.wang@huawei.com>
> Date:   Thu Sep 21 15:44:14 2023 +0800
> 
>      mm: memory: use a folio in do_numa_page()
>      Numa balancing only try to migrate non-compound page in 
> do_numa_page(),
>      use a folio in it to save several compound_head calls, note we use
>      folio_estimated_sharers(), it is enough to check the folio sharers 
> since
>      only normal page is handled, if large folio numa balancing is 
> supported, a
>      precise folio sharers check would be used, no functional change 
> intended.
> 
> 
> I'll send WIP patches for one approach that can improve the situation 
> soonish.

When converting NUMA balancing to use folios, I made a similar change; it
works with large anon folios (tested with v5). But David's precise folio
sharers check should be merged first. Also, if a large folio is shared by
many processes, we may need to split it; I'm not sure about that, and it
needs some evaluation.
  
Baolin Wang Nov. 13, 2023, 12:59 p.m. UTC | #3
On 11/13/2023 6:53 PM, David Hildenbrand wrote:
> On 13.11.23 11:45, Baolin Wang wrote:
>> Currently, the file pages already support large folio, and supporting for
>> anonymous pages is also under discussion[1]. Moreover, the numa balancing
>> code are converted to use a folio by previous thread[2], and the 
>> migrate_pages
>> function also already supports the large folio migration.
>>
>> So now I did not see any reason to continue restricting NUMA balancing 
>> for
>> large folio.
> 
> I recall John wanted to look into that. CCing him.
> 
> I'll note that the "head page mapcount" heuristic to detect sharers will
> now strike on the PTE path and make us believe that a large folios is
> exclusive, although it isn't.
> 
> As spelled out in the commit you are referencing:
> 
> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
> Author: Kefeng Wang <wangkefeng.wang@huawei.com>
> Date:   Thu Sep 21 15:44:14 2023 +0800
> 
>      mm: memory: use a folio in do_numa_page()
>      Numa balancing only try to migrate non-compound page in 
> do_numa_page(),
>      use a folio in it to save several compound_head calls, note we use
>      folio_estimated_sharers(), it is enough to check the folio sharers 
> since
>      only normal page is handled, if large folio numa balancing is 
> supported, a
>      precise folio sharers check would be used, no functional change 
> intended.

Thanks for pointing out the part I missed.

I see that the migrate_pages() syscall also uses
folio_estimated_sharers() to check whether the folio is shared, and I
wonder whether that will bring about any significant issues?

> I'll send WIP patches for one approach that can improve the situation 
> soonish.

Great. Looking forward to seeing it :)
  
Baolin Wang Nov. 13, 2023, 1:01 p.m. UTC | #4
On 11/13/2023 8:10 PM, Kefeng Wang wrote:
> 
> 
> On 2023/11/13 18:53, David Hildenbrand wrote:
>> On 13.11.23 11:45, Baolin Wang wrote:
>>> Currently, the file pages already support large folio, and supporting 
>>> for
>>> anonymous pages is also under discussion[1]. Moreover, the numa 
>>> balancing
>>> code are converted to use a folio by previous thread[2], and the 
>>> migrate_pages
>>> function also already supports the large folio migration.
>>>
>>> So now I did not see any reason to continue restricting NUMA 
>>> balancing for
>>> large folio.
>>
>> I recall John wanted to look into that. CCing him.
>>
>> I'll note that the "head page mapcount" heuristic to detect sharers will
>> now strike on the PTE path and make us believe that a large folios is
>> exclusive, although it isn't.
>>
>> As spelled out in the commit you are referencing:
>>
>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
>> Author: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Date:   Thu Sep 21 15:44:14 2023 +0800
>>
>>      mm: memory: use a folio in do_numa_page()
>>      Numa balancing only try to migrate non-compound page in 
>> do_numa_page(),
>>      use a folio in it to save several compound_head calls, note we use
>>      folio_estimated_sharers(), it is enough to check the folio 
>> sharers since
>>      only normal page is handled, if large folio numa balancing is 
>> supported, a
>>      precise folio sharers check would be used, no functional change 
>> intended.
>>
>>
>> I'll send WIP patches for one approach that can improve the situation 
>> soonish.
> 
> When convert numa balance to use folio, I make similar change, it works
> with large anon folio(test with v5), but David's precise folio sharers
> should be merged firstly, also if a large folio shared by many process,
> we maybe split it, don't sure about it, this need some evaluation.

IIUC, numa balancing will not split the large folio.
  
David Hildenbrand Nov. 13, 2023, 2:49 p.m. UTC | #5
On 13.11.23 13:59, Baolin Wang wrote:
> 
> 
> On 11/13/2023 6:53 PM, David Hildenbrand wrote:
>> On 13.11.23 11:45, Baolin Wang wrote:
>>> Currently, the file pages already support large folio, and supporting for
>>> anonymous pages is also under discussion[1]. Moreover, the numa balancing
>>> code are converted to use a folio by previous thread[2], and the
>>> migrate_pages
>>> function also already supports the large folio migration.
>>>
>>> So now I did not see any reason to continue restricting NUMA balancing
>>> for
>>> large folio.
>>
>> I recall John wanted to look into that. CCing him.
>>
>> I'll note that the "head page mapcount" heuristic to detect sharers will
>> now strike on the PTE path and make us believe that a large folios is
>> exclusive, although it isn't.
>>
>> As spelled out in the commit you are referencing:
>>
>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
>> Author: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Date:   Thu Sep 21 15:44:14 2023 +0800
>>
>>       mm: memory: use a folio in do_numa_page()
>>       Numa balancing only try to migrate non-compound page in
>> do_numa_page(),
>>       use a folio in it to save several compound_head calls, note we use
>>       folio_estimated_sharers(), it is enough to check the folio sharers
>> since
>>       only normal page is handled, if large folio numa balancing is
>> supported, a
>>       precise folio sharers check would be used, no functional change
>> intended.
> 
> Thanks for pointing out the part I missed.
> 
> I saw the migrate_pages() syscall is also using
> folio_estimated_sharers() to check if the folio is shared, and I wonder
> it will bring about any significant issues?

It's now used all over the place, in some places for making manual 
decisions (e.g., MADV_PAGEOUT works although it shouldn't) and more and 
more automatic places (e.g., the system ends up migrating a folio 
although it shouldn't). The nasty thing about it is that it doesn't give 
you "certainly exclusive" vs. "maybe shared" but "maybe exclusive" vs. 
"certainly shared".

IIUC, the side effect could be that we migrate folios because we assume 
they are exclusive even though they are actually shared. Right now, it's 
sufficient to not have the first page of the folio mapped anymore for 
that to happen.
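
To make that concrete, consider a hypothetical 4-page folio mapped by
processes A and B, where B has already unmapped subpage 0:

/*
 *        subpage:  0   1   2   3
 *   mapped by A:   X   X   X   X
 *   mapped by B:   .   X   X   X
 *
 * page_mapcount(subpage 0) == 1, so folio_estimated_sharers() == 1
 * ("maybe exclusive"), although subpages 1-3 are clearly shared.
 */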

Anyhow, it's worth mentioning that in the commit message as long as we 
have no better solution for that. For many cases it might be just tolerable.

> 
>> I'll send WIP patches for one approach that can improve the situation
>> soonish.
> 
> Great. Look forward to seeing this:)

I'm still trying to evaluate the performance hit of the additional 
tracking ... turns out there is no such thing as free food ;)
  
John Hubbard Nov. 13, 2023, 10:15 p.m. UTC | #6
On 11/13/23 5:01 AM, Baolin Wang wrote:
> 
> 
> On 11/13/2023 8:10 PM, Kefeng Wang wrote:
>>
>>
>> On 2023/11/13 18:53, David Hildenbrand wrote:
>>> On 13.11.23 11:45, Baolin Wang wrote:
>>>> Currently, the file pages already support large folio, and 
>>>> supporting for
>>>> anonymous pages is also under discussion[1]. Moreover, the numa 
>>>> balancing
>>>> code are converted to use a folio by previous thread[2], and the 
>>>> migrate_pages
>>>> function also already supports the large folio migration.
>>>>
>>>> So now I did not see any reason to continue restricting NUMA 
>>>> balancing for
>>>> large folio.
>>>
>>> I recall John wanted to look into that. CCing him.
>>>
>>> I'll note that the "head page mapcount" heuristic to detect sharers will
>>> now strike on the PTE path and make us believe that a large folios is
>>> exclusive, although it isn't.
>>>
>>> As spelled out in the commit you are referencing:
>>>
>>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
>>> Author: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> Date:   Thu Sep 21 15:44:14 2023 +0800
>>>
>>>      mm: memory: use a folio in do_numa_page()
>>>      Numa balancing only try to migrate non-compound page in 
>>> do_numa_page(),
>>>      use a folio in it to save several compound_head calls, note we use
>>>      folio_estimated_sharers(), it is enough to check the folio 
>>> sharers since
>>>      only normal page is handled, if large folio numa balancing is 
>>> supported, a
>>>      precise folio sharers check would be used, no functional change 
>>> intended.
>>>
>>>
>>> I'll send WIP patches for one approach that can improve the situation 
>>> soonish.

To be honest, I'm still catching up on the approximate vs. exact
sharers case. It wasn't clear to me why a precise sharers count
is needed in order to do this. Perhaps the cost of making a wrong
decision is considered just too high?


>>
>> When convert numa balance to use folio, I make similar change, it works
>> with large anon folio(test with v5), but David's precise folio sharers
>> should be merged firstly, also if a large folio shared by many process,
>> we maybe split it, don't sure about it, this need some evaluation.
> 
> IIUC, numa balancing will not split the large folio.

That matches my reading of the code: normal (PMD-based) THPs are
split, but the code does not yet split PTE-THPs (also known as
small-size THPs).

thanks,
  
Huang, Ying Nov. 14, 2023, 1:12 a.m. UTC | #7
David Hildenbrand <david@redhat.com> writes:

> On 13.11.23 11:45, Baolin Wang wrote:
>> Currently, the file pages already support large folio, and supporting for
>> anonymous pages is also under discussion[1]. Moreover, the numa balancing
>> code are converted to use a folio by previous thread[2], and the migrate_pages
>> function also already supports the large folio migration.
>> So now I did not see any reason to continue restricting NUMA
>> balancing for
>> large folio.
>
> I recall John wanted to look into that. CCing him.
>
> I'll note that the "head page mapcount" heuristic to detect sharers will
> now strike on the PTE path and make us believe that a large folios is
> exclusive, although it isn't.

Even a 4K folio may be shared by multiple processes/threads.  So, NUMA
balancing uses a multi-stage node selection algorithm (mostly
implemented in should_numa_migrate_memory()) to identify shared folios.
I think that the algorithm needs to be adjusted for shared, PTE-mapped
large folios.
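
Roughly, the early stages look like this (a compressed paraphrase of
should_numa_migrate_memory(), not the exact source):

        this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
        last_cpupid = page_cpupid_xchg_last(page, this_cpupid);

        /*
         * Two-stage filter: the first fault only records who touched the
         * page; migrate only when a later fault comes from the same node.
         */
        if (!cpupid_pid_unset(last_cpupid) &&
            cpupid_to_nid(last_cpupid) != dst_nid)
                return false;

        /* A private fault (same task twice in a row) migrates eagerly. */
        if (cpupid_match_pid(p, last_cpupid))
                return true;

        /* Shared faults fall through to the group-fault heuristics. */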

Also, as a performance improvement patch, some performance data needs to
be provided, and the effect of the shared folio detection needs to be
tested too.

--
Best Regards,
Huang, Ying
  
Baolin Wang Nov. 14, 2023, 10:53 a.m. UTC | #8
On 11/13/2023 10:49 PM, David Hildenbrand wrote:
> On 13.11.23 13:59, Baolin Wang wrote:
>>
>>
>> On 11/13/2023 6:53 PM, David Hildenbrand wrote:
>>> On 13.11.23 11:45, Baolin Wang wrote:
>>>> Currently, the file pages already support large folio, and 
>>>> supporting for
>>>> anonymous pages is also under discussion[1]. Moreover, the numa 
>>>> balancing
>>>> code are converted to use a folio by previous thread[2], and the
>>>> migrate_pages
>>>> function also already supports the large folio migration.
>>>>
>>>> So now I did not see any reason to continue restricting NUMA balancing
>>>> for
>>>> large folio.
>>>
>>> I recall John wanted to look into that. CCing him.
>>>
>>> I'll note that the "head page mapcount" heuristic to detect sharers will
>>> now strike on the PTE path and make us believe that a large folios is
>>> exclusive, although it isn't.
>>>
>>> As spelled out in the commit you are referencing:
>>>
>>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
>>> Author: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> Date:   Thu Sep 21 15:44:14 2023 +0800
>>>
>>>       mm: memory: use a folio in do_numa_page()
>>>       Numa balancing only try to migrate non-compound page in
>>> do_numa_page(),
>>>       use a folio in it to save several compound_head calls, note we use
>>>       folio_estimated_sharers(), it is enough to check the folio sharers
>>> since
>>>       only normal page is handled, if large folio numa balancing is
>>> supported, a
>>>       precise folio sharers check would be used, no functional change
>>> intended.
>>
>> Thanks for pointing out the part I missed.
>>
>> I saw the migrate_pages() syscall is also using
>> folio_estimated_sharers() to check if the folio is shared, and I wonder
>> it will bring about any significant issues?
> 
> It's now used all over the place, in some places for making manual 
> decisions (e.g., MADV_PAGEOUT works although it shouldn't) and more and 
> more automatic places (e.g., the system ends up migrating a folio 
> although it shouldn't). The nasty thing about it is that it doesn't give 
> you "certainly exclusive" vs. "maybe shared" but "maybe exclusive" vs. 
> "certainly shared".
> 
> IIUC, the side effect could be that we migrate folios because we assume 
> they are exclusive even though they are actually shared. Right now, it's 
> sufficient to not have the first page of the folio mapped anymore for 
> that to happen.

Yes.

> Anyhow, it's worth mentioning that in the commit message as long as we 
> have no better solution for that. For many cases it might be just 
> tolerable.

Agree. The 'maybe shared' folio may affect the NUMA group statistics,
which are used to accumulate the NUMA faults in one group to choose a
preferred node for the tasks. For this case, it may be tolerable too, but
I have no performance numbers now. Let me think about it.

>>> I'll send WIP patches for one approach that can improve the situation
>>> soonish.
>>
>> Great. Look forward to seeing this:)
> 
> I'm still trying to evaluate the performance hit of the additional 
> tracking ... turns out there is no such thing as free food ;)

Makes sense.
  
Baolin Wang Nov. 14, 2023, 11:11 a.m. UTC | #9
On 11/14/2023 9:12 AM, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 13.11.23 11:45, Baolin Wang wrote:
>>> Currently, the file pages already support large folio, and supporting for
>>> anonymous pages is also under discussion[1]. Moreover, the numa balancing
>>> code are converted to use a folio by previous thread[2], and the migrate_pages
>>> function also already supports the large folio migration.
>>> So now I did not see any reason to continue restricting NUMA
>>> balancing for
>>> large folio.
>>
>> I recall John wanted to look into that. CCing him.
>>
>> I'll note that the "head page mapcount" heuristic to detect sharers will
>> now strike on the PTE path and make us believe that a large folios is
>> exclusive, although it isn't.
> 
> Even 4k folio may be shared by multiple processes/threads.  So, numa
> balancing uses a multi-stage node selection algorithm (mostly
> implemented in should_numa_migrate_memory()) to identify shared folios.
> I think that the algorithm needs to be adjusted for PTE mapped large
> folio for shared folios.

Not sure I get you here. should_numa_migrate_memory() uses the last CPU
id, the last PID and the group NUMA faults to determine whether this page
can be migrated to the target node. So for a large folio, a precise folio
sharers check can make the NUMA faults of a group more accurate, which
should be enough for should_numa_migrate_memory() to make a decision?

Could you provide a more detailed description of the algorithm you would
like to change for large folios? Thanks.

> And, as a performance improvement patch, some performance data needs to

Do you have any benchmark recommendations? I know that the autonuma
benchmark cannot support large folios now.

> be provided.  And, the effect of shared folio detection needs to be
> tested too
  
David Hildenbrand Nov. 14, 2023, 11:35 a.m. UTC | #10
On 13.11.23 23:15, John Hubbard wrote:
> On 11/13/23 5:01 AM, Baolin Wang wrote:
>>
>>
>> On 11/13/2023 8:10 PM, Kefeng Wang wrote:
>>>
>>>
>>> On 2023/11/13 18:53, David Hildenbrand wrote:
>>>> On 13.11.23 11:45, Baolin Wang wrote:
>>>>> Currently, the file pages already support large folio, and
>>>>> supporting for
>>>>> anonymous pages is also under discussion[1]. Moreover, the numa
>>>>> balancing
>>>>> code are converted to use a folio by previous thread[2], and the
>>>>> migrate_pages
>>>>> function also already supports the large folio migration.
>>>>>
>>>>> So now I did not see any reason to continue restricting NUMA
>>>>> balancing for
>>>>> large folio.
>>>>
>>>> I recall John wanted to look into that. CCing him.
>>>>
>>>> I'll note that the "head page mapcount" heuristic to detect sharers will
>>>> now strike on the PTE path and make us believe that a large folios is
>>>> exclusive, although it isn't.
>>>>
>>>> As spelled out in the commit you are referencing:
>>>>
>>>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
>>>> Author: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> Date:   Thu Sep 21 15:44:14 2023 +0800
>>>>
>>>>       mm: memory: use a folio in do_numa_page()
>>>>       Numa balancing only try to migrate non-compound page in
>>>> do_numa_page(),
>>>>       use a folio in it to save several compound_head calls, note we use
>>>>       folio_estimated_sharers(), it is enough to check the folio
>>>> sharers since
>>>>       only normal page is handled, if large folio numa balancing is
>>>> supported, a
>>>>       precise folio sharers check would be used, no functional change
>>>> intended.
>>>>
>>>>
>>>> I'll send WIP patches for one approach that can improve the situation
>>>> soonish.
> 
> To be honest, I'm still catching up on the approximate vs. exact
> sharers case. It wasn't clear to me why a precise sharers count
> is needed in order to do this. Perhaps the cost of making a wrong
> decision is considered just too high?

Good question, I didn't really look into the impact for the NUMA hinting
case where we might end up not setting TNF_SHARED although it is shared.
For other folio_estimated_sharers() users it's more obvious.

As a side note, it could have happened already in corner cases (e.g.,
concurrent page migration of a small folio).

Whether the precision documented in that commit is really required
remains to be seen -- I just wanted to spell it out.
  
Kefeng Wang Nov. 14, 2023, 1:12 p.m. UTC | #11
On 2023/11/14 19:35, David Hildenbrand wrote:
> On 13.11.23 23:15, John Hubbard wrote:
>> On 11/13/23 5:01 AM, Baolin Wang wrote:
>>>
>>>
>>> On 11/13/2023 8:10 PM, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2023/11/13 18:53, David Hildenbrand wrote:
>>>>> On 13.11.23 11:45, Baolin Wang wrote:
>>>>>> Currently, the file pages already support large folio, and
>>>>>> supporting for
>>>>>> anonymous pages is also under discussion[1]. Moreover, the numa
>>>>>> balancing
>>>>>> code are converted to use a folio by previous thread[2], and the
>>>>>> migrate_pages
>>>>>> function also already supports the large folio migration.
>>>>>>
>>>>>> So now I did not see any reason to continue restricting NUMA
>>>>>> balancing for
>>>>>> large folio.
>>>>>
>>>>> I recall John wanted to look into that. CCing him.
>>>>>
>>>>> I'll note that the "head page mapcount" heuristic to detect sharers 
>>>>> will
>>>>> now strike on the PTE path and make us believe that a large folios is
>>>>> exclusive, although it isn't.
>>>>>
>>>>> As spelled out in the commit you are referencing:
>>>>>
>>>>> commit 6695cf68b15c215d33b8add64c33e01e3cbe236c
>>>>> Author: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> Date:   Thu Sep 21 15:44:14 2023 +0800
>>>>>
>>>>>       mm: memory: use a folio in do_numa_page()
>>>>>       Numa balancing only try to migrate non-compound page in
>>>>> do_numa_page(),
>>>>>       use a folio in it to save several compound_head calls, note 
>>>>> we use
>>>>>       folio_estimated_sharers(), it is enough to check the folio
>>>>> sharers since
>>>>>       only normal page is handled, if large folio numa balancing is
>>>>> supported, a
>>>>>       precise folio sharers check would be used, no functional change
>>>>> intended.
>>>>>
>>>>>
>>>>> I'll send WIP patches for one approach that can improve the situation
>>>>> soonish.
>>
>> To be honest, I'm still catching up on the approximate vs. exact
>> sharers case. It wasn't clear to me why a precise sharers count
>> is needed in order to do this. Perhaps the cost of making a wrong
>> decision is considered just too high?
> 
> Good question, I didn't really look into the impact for the NUMA hinting 
> case where we might end up not setting TNF_SHARED although it is shared. 
> For other folio_estimate_sharers() users it's more obvious.

task_numa_group() checks TNF_SHARED: if processes share the same
page/folio, they are packed into a single NUMA group, and the group's
fault statistics are used in should_numa_migrate_memory() to decide
whether to migrate or not. If TNF_SHARED is not set, that may lead to
more page/folio migrations.
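
That is, roughly (a compressed paraphrase of task_numa_fault(), not the
exact source):

        if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
                priv = 1;       /* first fault: treat as private */
        } else {
                priv = cpupid_match_pid(p, last_cpupid);
                /* a shared fault may join the tasks into one numa_group */
                if (!priv && !(flags & TNF_NO_GROUP))
                        task_numa_group(p, last_cpupid, flags, &priv);
        }
        ...
        p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;

and inside task_numa_group(), TNF_SHARED is one of the signals for
joining two tasks into the same group.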

> 
> As a side note, it could have happened already in corner cases (e.g., 
> concurrent page migration of a small folio).
> 
> If precision as documented in that commit is really required remains to 
> be seen -- just wanted to spell it out.
>
  
Huang, Ying Nov. 15, 2023, 2:58 a.m. UTC | #12
Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 11/14/2023 9:12 AM, Huang, Ying wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>>> On 13.11.23 11:45, Baolin Wang wrote:
>>>> Currently, the file pages already support large folio, and supporting for
>>>> anonymous pages is also under discussion[1]. Moreover, the numa balancing
>>>> code are converted to use a folio by previous thread[2], and the migrate_pages
>>>> function also already supports the large folio migration.
>>>> So now I did not see any reason to continue restricting NUMA
>>>> balancing for
>>>> large folio.
>>>
>>> I recall John wanted to look into that. CCing him.
>>>
>>> I'll note that the "head page mapcount" heuristic to detect sharers will
>>> now strike on the PTE path and make us believe that a large folios is
>>> exclusive, although it isn't.
>> Even 4k folio may be shared by multiple processes/threads.  So, numa
>> balancing uses a multi-stage node selection algorithm (mostly
>> implemented in should_numa_migrate_memory()) to identify shared folios.
>> I think that the algorithm needs to be adjusted for PTE mapped large
>> folio for shared folios.
>
> Not sure I get you here. In should_numa_migrate_memory(), it will use
> last CPU id, last PID and group numa faults to determine if this page
> can be migrated to the target node. So for large folio, a precise
> folio sharers check can make the numa faults of a group more accurate,
> which is enough for should_numa_migrate_memory() to make a decision?

A large folio that is mapped by multiple processes may be accessed by one
remote NUMA node, so we still want to migrate it.  A large folio that is
mapped by one process but accessed by multiple threads on multiple NUMA
nodes may be better left unmigrated.

> Could you provide a more detailed description of the algorithm you
> would like to change for large folio? Thanks.

I haven't thought about that thoroughly, so please evaluate the
algorithm yourself.

For example, two sub-pages of a shared PTE-mapped large folio may be
accessed together by a task.  This may cause the folio to be migrated
wrongly.  One possible solution is to restore all the other PTE mappings
of the large folio in do_numa_page() as the first step.  This resembles
the PMD-mapped THP behavior.
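
A very rough sketch of that idea (hypothetical and untested; it assumes
the folio is PTE-mapped contiguously and naturally aligned, which a real
implementation would have to verify):

static void numa_restore_folio_ptes(struct vm_fault *vmf, struct folio *folio)
{
        struct vm_area_struct *vma = vmf->vma;
        long nr = folio_nr_pages(folio);
        unsigned long start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
        pte_t *ptep = vmf->pte - (vmf->address - start) / PAGE_SIZE;
        unsigned long addr;

        for (addr = start; addr < start + nr * PAGE_SIZE;
             addr += PAGE_SIZE, ptep++) {
                pte_t oldpte = ptep_get(ptep);

                /* Only fix up the sibling prot-none entries. */
                if (addr == vmf->address || !pte_present(oldpte) ||
                    !pte_protnone(oldpte))
                        continue;
                oldpte = ptep_modify_prot_start(vma, addr, ptep);
                ptep_modify_prot_commit(vma, addr, ptep, oldpte,
                        pte_mkyoung(pte_modify(oldpte, vma->vm_page_prot)));
                update_mmu_cache(vma, addr, ptep);
        }
}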

>> And, as a performance improvement patch, some performance data needs to
>
> Do you have some benchmark recommendation? I know the the autonuma can
> not support large folio now.

There is autonuma-benchmark, and specjbb has been used by some people
before.

>> be provided.  And, the effect of shared folio detection needs to be
>> tested too

--
Best Regards,
Huang, Ying
  
David Hildenbrand Nov. 15, 2023, 10:46 a.m. UTC | #13
On 13.11.23 11:45, Baolin Wang wrote:
> Currently, the file pages already support large folio, and supporting for
> anonymous pages is also under discussion[1]. Moreover, the numa balancing
> code are converted to use a folio by previous thread[2], and the migrate_pages
> function also already supports the large folio migration.
> 
> So now I did not see any reason to continue restricting NUMA balancing for
> large folio.
> 
> [1] https://lkml.org/lkml/2023/9/29/342
> [2] https://lore.kernel.org/all/20230921074417.24004-4-wangkefeng.wang@huawei.com/T/#md9d10fe34587229a72801f0d731f7457ab3f4a6e
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---

I'll note that another piece is missing, and I'd be curious how you
tested your patch set or what I am missing. (no anonymous pages?)

change_pte_range() contains:

if (prot_numa) {
	...
	/* Also skip shared copy-on-write pages */
	if (is_cow_mapping(vma->vm_flags) &&
	    folio_ref_count(folio) != 1)
		continue;

So we'll never end up mapping an anon PTE-mapped THP prot-none (well, unless a
single PTE remains) and consequently never trigger NUMA hinting faults.

Now, that change has some history [1], but the original problem has been
sorted out in the meantime. But we should consider Linus' original feedback.

For pte-mapped THP, we might want to do something like the following
(completely untested):

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 81991102f785..c4e6b9032e40 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
  
                                 /* Also skip shared copy-on-write pages */
                                 if (is_cow_mapping(vma->vm_flags) &&
-                                   folio_ref_count(folio) != 1)
+                                   (folio_maybe_dma_pinned(folio) ||
+                                    folio_estimated_sharers(folio) != 1))
                                         continue;
  

Another note about the possible imprecision that might or might not
be tolerable ;)


[1] https://bugzilla.kernel.org/show_bug.cgi?id=215616
  
David Hildenbrand Nov. 15, 2023, 10:47 a.m. UTC | #14
On 15.11.23 11:46, David Hildenbrand wrote:
> On 13.11.23 11:45, Baolin Wang wrote:
>> Currently, the file pages already support large folio, and supporting for
>> anonymous pages is also under discussion[1]. Moreover, the numa balancing
>> code are converted to use a folio by previous thread[2], and the migrate_pages
>> function also already supports the large folio migration.
>>
>> So now I did not see any reason to continue restricting NUMA balancing for
>> large folio.
>>
>> [1] https://lkml.org/lkml/2023/9/29/342
>> [2] https://lore.kernel.org/all/20230921074417.24004-4-wangkefeng.wang@huawei.com/T/#md9d10fe34587229a72801f0d731f7457ab3f4a6e
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
> 
> I'll note that another piece is missing, and I'd be curious how you
> tested your patch set or what I am missing. (no anonymous pages?)
> 
> change_pte_range() contains:
> 
> if (prot_numa) {
> 	...
> 	/* Also skip shared copy-on-write pages */
> 	if (is_cow_mapping(vma->vm_flags) &&
> 	    folio_ref_count(folio) != 1)
> 		continue;
> 
> So we'll never end up mapping an anon PTE-mapped THP prot-none (well, unless a
> single PTE remains) and consequently never trigger NUMA hinting faults.
> 
> Now, that change has some history [1], but the original problem has been
> sorted out in the meantime. But we should consider Linus' original feedback.
> 
> For pte-mapped THP, we might want to do something like the following
> (completely untested):
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 81991102f785..c4e6b9032e40 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>    
>                                   /* Also skip shared copy-on-write pages */
>                                   if (is_cow_mapping(vma->vm_flags) &&
> -                                   folio_ref_count(folio) != 1)
> +                                   (folio_maybe_dma_pinned(folio) ||
> +                                    folio_estimated_sharers(folio) != 1))

Actually, > 1 might be better if the first subpage is not mapped; it's a 
mess.
  
Mel Gorman Nov. 17, 2023, 10:07 a.m. UTC | #15
On Wed, Nov 15, 2023 at 10:58:32AM +0800, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
> > On 11/14/2023 9:12 AM, Huang, Ying wrote:
> >> David Hildenbrand <david@redhat.com> writes:
> >> 
> >>> On 13.11.23 11:45, Baolin Wang wrote:
> >>>> Currently, the file pages already support large folio, and supporting for
> >>>> anonymous pages is also under discussion[1]. Moreover, the numa balancing
> >>>> code are converted to use a folio by previous thread[2], and the migrate_pages
> >>>> function also already supports the large folio migration.
> >>>> So now I did not see any reason to continue restricting NUMA
> >>>> balancing for
> >>>> large folio.
> >>>
> >>> I recall John wanted to look into that. CCing him.
> >>>
> >>> I'll note that the "head page mapcount" heuristic to detect sharers will
> >>> now strike on the PTE path and make us believe that a large folios is
> >>> exclusive, although it isn't.
> >> Even 4k folio may be shared by multiple processes/threads.  So, numa
> >> balancing uses a multi-stage node selection algorithm (mostly
> >> implemented in should_numa_migrate_memory()) to identify shared folios.
> >> I think that the algorithm needs to be adjusted for PTE mapped large
> >> folio for shared folios.
> >
> > Not sure I get you here. In should_numa_migrate_memory(), it will use
> > last CPU id, last PID and group numa faults to determine if this page
> > can be migrated to the target node. So for large folio, a precise
> > folio sharers check can make the numa faults of a group more accurate,
> > which is enough for should_numa_migrate_memory() to make a decision?
> 
> A large folio that is mapped by multiple process may be accessed by one
> remote NUMA node, so we still want to migrate it.  A large folio that is
> mapped by one process but accessed by multiple threads on multiple NUMA
> node may be not migrated.
> 

This leads into a generic problem with large anything with NUMA
balancing -- false sharing. As it stands, THP can be false shared by
threads if thread-local data is split within a THP range. In this case,
the ideal would be the THP is migrated to the hottest node but such
support doesn't exist. The same applies for folios. If not handled
properly, a large folio of any type can ping-pong between nodes so just
migrating because we can is not necessarily a good idea. The patch
should cover a realistic case showing why this matters, why splitting the
folio is not better, and include supporting data.
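
As a minimal userspace illustration of the false sharing pattern for the
anon THP case (hypothetical; pinning the two threads to different nodes,
e.g. via numactl, is left out): two threads work on disjoint 1M slices of
one 2M-aligned buffer, so every access is thread-local, yet the hint
faults look like cross-node sharing of a single page:

#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>

#define SLICE (1UL << 20)

static char *buf;

static void *worker(void *arg)
{
        char *slice = buf + (long)arg * SLICE;  /* disjoint per-thread slice */
        unsigned long i, iter;

        for (iter = 0; iter < 100000; iter++)
                for (i = 0; i < SLICE; i += 64)
                        slice[i]++;             /* thread-local data, shared THP */
        return NULL;
}

int main(void)
{
        pthread_t t[2];
        long i;

        /* One 2M-aligned buffer, likely backed by a single THP. */
        if (posix_memalign((void **)&buf, 2UL << 20, 2 * SLICE))
                return 1;
        madvise(buf, 2 * SLICE, MADV_HUGEPAGE);

        for (i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
        for (i = 0; i < 2; i++)
                pthread_join(t[i], NULL);
        return 0;
}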
  
Peter Zijlstra Nov. 17, 2023, 10:13 a.m. UTC | #16
On Fri, Nov 17, 2023 at 10:07:45AM +0000, Mel Gorman wrote:

> This leads into a generic problem with large anything with NUMA
> balancing -- false sharing. As it stands, THP can be false shared by
> threads if thread-local data is split within a THP range. In this case,
> the ideal would be the THP is migrated to the hottest node but such
> support doesn't exist. The same applies for folios. If not handled
> properly, a large folio of any type can ping-pong between nodes so just
> migrating because we can is not necessarily a good idea. The patch
> should cover a realistic case why this matters, why splitting the folio
> is not better and supporting data.

Would it make sense to have THP merging conditional on all (most?) pages
having the same node?
  
Mel Gorman Nov. 17, 2023, 4:04 p.m. UTC | #17
On Fri, Nov 17, 2023 at 11:13:43AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 17, 2023 at 10:07:45AM +0000, Mel Gorman wrote:
> 
> > This leads into a generic problem with large anything with NUMA
> > balancing -- false sharing. As it stands, THP can be false shared by
> > threads if thread-local data is split within a THP range. In this case,
> > the ideal would be the THP is migrated to the hottest node but such
> > support doesn't exist. The same applies for folios. If not handled
> > properly, a large folio of any type can ping-pong between nodes so just
> > migrating because we can is not necessarily a good idea. The patch
> > should cover a realistic case why this matters, why splitting the folio
> > is not better and supporting data.
> 
> Would it make sense to have THP merging conditional on all (most?) pages
> having the same node?

Potentially yes, maybe with something similar to max_ptes_none, but it has
corner cases of its own. THP can be allocated up-front so we don't get
the per-base-page hints unless the page is first split. I experimented
with this once upon a time but the post-splitting cost was not offset by
the smarter NUMA placement. While we could always allocate small pages
and promote later (originally known as the promotion threshold), that
was known to have significant penalties of its own so we still eagerly
allocate THP. Part of that is that KVM was the main load to benefit from
THP and always preferred eager promotion. Even if we always started with
base pages, sparse addressing within the THP range may mean the threshold
for collapsing can never be reached.

Both THP and folios have the same false sharing problem but at least
we knew about the false sharing problem for THP and NUMA balancing. It
was found initially that THP false sharing is mostly an edge-case issue
mitigated by the fact that large anonymous buffers tended to be either
2M aligned or only affected the boundaries. Later glibc and ABI changes
made it even more likely that private buffers were THP-aligned. The same
is not true of folios and it is a new problem so I'm uncomfortable with
a patch that essentially says "migrate folios because it's possible"
without considering any of the corner cases or measuring them.
  
Baolin Wang Nov. 20, 2023, 3:28 a.m. UTC | #18
On 11/15/2023 6:47 PM, David Hildenbrand wrote:
> On 15.11.23 11:46, David Hildenbrand wrote:
>> On 13.11.23 11:45, Baolin Wang wrote:
>>> Currently, the file pages already support large folio, and supporting 
>>> for
>>> anonymous pages is also under discussion[1]. Moreover, the numa 
>>> balancing
>>> code are converted to use a folio by previous thread[2], and the 
>>> migrate_pages
>>> function also already supports the large folio migration.
>>>
>>> So now I did not see any reason to continue restricting NUMA 
>>> balancing for
>>> large folio.
>>>
>>> [1] https://lkml.org/lkml/2023/9/29/342
>>> [2] 
>>> https://lore.kernel.org/all/20230921074417.24004-4-wangkefeng.wang@huawei.com/T/#md9d10fe34587229a72801f0d731f7457ab3f4a6e
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>
>> I'll note that another piece is missing, and I'd be curious how you
>> tested your patch set or what I am missing. (no anonymous pages?)

I tested it with file-backed large folios (order = 4) created by the XFS filesystem.

>> change_pte_range() contains:
>>
>> if (prot_numa) {
>>     ...
>>     /* Also skip shared copy-on-write pages */
>>     if (is_cow_mapping(vma->vm_flags) &&
>>         folio_ref_count(folio) != 1)
>>         continue;
>>
>> So we'll never end up mapping an anon PTE-mapped THP prot-none (well, 
>> unless a
>> single PTE remains) and consequently never trigger NUMA hinting faults.
>>
>> Now, that change has some history [1], but the original problem has been
>> sorted out in the meantime. But we should consider Linus' original 
>> feedback.
>>
>> For pte-mapped THP, we might want to do something like the following
>> (completely untested):

Thanks for pointing that out. I have not tried PTE-mapped THP yet, and
will look at it in detail.

>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 81991102f785..c4e6b9032e40 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>>                                   /* Also skip shared copy-on-write 
>> pages */
>>                                   if (is_cow_mapping(vma->vm_flags) &&
>> -                                   folio_ref_count(folio) != 1)
>> +                                   (folio_maybe_dma_pinned(folio) ||
>> +                                    folio_estimated_sharers(folio) != 
>> 1))
> 
> Actually, > 1 might be better if the first subpage is not mapped; it's a 
> mess.
>
  
Baolin Wang Nov. 20, 2023, 8:01 a.m. UTC | #19
On 11/17/2023 6:07 PM, Mel Gorman wrote:
> On Wed, Nov 15, 2023 at 10:58:32AM +0800, Huang, Ying wrote:
>> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
>>
>>> On 11/14/2023 9:12 AM, Huang, Ying wrote:
>>>> David Hildenbrand <david@redhat.com> writes:
>>>>
>>>>> On 13.11.23 11:45, Baolin Wang wrote:
>>>>>> Currently, the file pages already support large folio, and supporting for
>>>>>> anonymous pages is also under discussion[1]. Moreover, the numa balancing
>>>>>> code are converted to use a folio by previous thread[2], and the migrate_pages
>>>>>> function also already supports the large folio migration.
>>>>>> So now I did not see any reason to continue restricting NUMA
>>>>>> balancing for
>>>>>> large folio.
>>>>>
>>>>> I recall John wanted to look into that. CCing him.
>>>>>
>>>>> I'll note that the "head page mapcount" heuristic to detect sharers will
>>>>> now strike on the PTE path and make us believe that a large folios is
>>>>> exclusive, although it isn't.
>>>> Even 4k folio may be shared by multiple processes/threads.  So, numa
>>>> balancing uses a multi-stage node selection algorithm (mostly
>>>> implemented in should_numa_migrate_memory()) to identify shared folios.
>>>> I think that the algorithm needs to be adjusted for PTE mapped large
>>>> folio for shared folios.
>>>
>>> Not sure I get you here. In should_numa_migrate_memory(), it will use
>>> last CPU id, last PID and group numa faults to determine if this page
>>> can be migrated to the target node. So for large folio, a precise
>>> folio sharers check can make the numa faults of a group more accurate,
>>> which is enough for should_numa_migrate_memory() to make a decision?
>>
>> A large folio that is mapped by multiple process may be accessed by one
>> remote NUMA node, so we still want to migrate it.  A large folio that is
>> mapped by one process but accessed by multiple threads on multiple NUMA
>> node may be not migrated.
>>
> 
> This leads into a generic problem with large anything with NUMA
> balancing -- false sharing. As it stands, THP can be false shared by
> threads if thread-local data is split within a THP range. In this case,
> the ideal would be the THP is migrated to the hottest node but such
> support doesn't exist. The same applies for folios. If not handled

So the below checks in should_numa_migrate_memory() cannot avoid the
false sharing of large folios that you mentioned? Please correct me if I
missed anything.

         /*
          * Destination node is much more heavily used than the source
          * node? Allow migration.
          */
         if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) *
                                         ACTIVE_NODE_FRACTION)
                 return true;

         /*
          * Distribute memory according to CPU & memory use on each node,
          * with 3/4 hysteresis to avoid unnecessary memory migrations:
          *
          * faults_cpu(dst)   3   faults_cpu(src)
          * --------------- * - > ---------------
          * faults_mem(dst)   4   faults_mem(src)
          */
         return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 >
                group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;


> properly, a large folio of any type can ping-pong between nodes so just
> migrating because we can is not necessarily a good idea. The patch
> should cover a realistic case why this matters, why splitting the folio
> is not better and supporting data.

Sure. For a private mapping, we should always migrate the large folio.
The tricky part is shared mappings, as you and Ying said, which can
involve different scenarios, and I'm thinking about how to validate them.
Do you have any suggestions? Thanks.
  

Patch

diff --git a/mm/memory.c b/mm/memory.c
index c32954e16b28..8ca21eff294c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4804,7 +4804,7 @@  static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	int last_cpupid;
 	int target_nid;
 	pte_t pte, old_pte;
-	int flags = 0;
+	int flags = 0, nr_pages = 0;
 
 	/*
 	 * The "pte" at this point cannot be used safely without
@@ -4834,10 +4834,6 @@  static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	if (!folio || folio_is_zone_device(folio))
 		goto out_map;
 
-	/* TODO: handle PTE-mapped THP */
-	if (folio_test_large(folio))
-		goto out_map;
-
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
 	 * much anyway since they can be in shared cache state. This misses
@@ -4857,6 +4853,7 @@  static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_SHARED;
 
 	nid = folio_nid(folio);
+	nr_pages = folio_nr_pages(folio);
 	/*
 	 * For memory tiering mode, cpupid of slow memory page is used
 	 * to record page access time.  So use default value.
@@ -4893,7 +4890,7 @@  static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 out:
 	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, 1, flags);
+		task_numa_fault(last_cpupid, nid, nr_pages, flags);
 	return 0;
 out_map:
 	/*