[v3,1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags

Message ID 20231025144546.577640-2-ryan.roberts@arm.com
State New
Headers
Series Swap-out small-sized THP without splitting |

Commit Message

Ryan Roberts Oct. 25, 2023, 2:45 p.m. UTC
  As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
which, when present, always implies PMD-sized THP, which is the same as
the cluster size.

The only use of the flag was to determine whether a swap entry refers to
a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
Instead of relying on the flag, we now pass in nr_pages, which
originates from the folio's number of pages. This allows the logic to
work for folios of any order.

The one snag is that one of the swap_page_trans_huge_swapped() call
sites does not have the folio. But it was only being called there to
avoid bothering to call __try_to_reclaim_swap() in some cases.
__try_to_reclaim_swap() gets the folio and (via some other functions)
calls swap_page_trans_huge_swapped(). So I've removed the problematic
call site and believe the new logic should be equivalent.

Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
which used to be called during folio splitting, since
split_swap_cluster()'s only job was to remove the flag.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h | 10 ----------
 mm/huge_memory.c     |  3 ---
 mm/swapfile.c        | 47 ++++++++------------------------------------
 3 files changed, 8 insertions(+), 52 deletions(-)
  

Comments

David Hildenbrand Feb. 22, 2024, 10:19 a.m. UTC | #1
On 25.10.23 16:45, Ryan Roberts wrote:
> As preparation for supporting small-sized THP in the swap-out path,
> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
> which, when present, always implies PMD-sized THP, which is the same as
> the cluster size.
> 
> The only use of the flag was to determine whether a swap entry refers to
> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
> Instead of relying on the flag, we now pass in nr_pages, which
> originates from the folio's number of pages. This allows the logic to
> work for folios of any order.
> 
> The one snag is that one of the swap_page_trans_huge_swapped() call
> sites does not have the folio. But it was only being called there to
> avoid bothering to call __try_to_reclaim_swap() in some cases.
> __try_to_reclaim_swap() gets the folio and (via some other functions)
> calls swap_page_trans_huge_swapped(). So I've removed the problematic
> call site and believe the new logic should be equivalent.

That is the  __try_to_reclaim_swap() -> folio_free_swap() -> 
folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.

The "difference" is that you will now (1) get another temporary 
reference on the folio and (2) (try)lock the folio every time you 
discard a single PTE of a (possibly) large THP.
  
David Hildenbrand Feb. 22, 2024, 10:20 a.m. UTC | #2
On 22.02.24 11:19, David Hildenbrand wrote:
> On 25.10.23 16:45, Ryan Roberts wrote:
>> As preparation for supporting small-sized THP in the swap-out path,
>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>> which, when present, always implies PMD-sized THP, which is the same as
>> the cluster size.
>>
>> The only use of the flag was to determine whether a swap entry refers to
>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>> Instead of relying on the flag, we now pass in nr_pages, which
>> originates from the folio's number of pages. This allows the logic to
>> work for folios of any order.
>>
>> The one snag is that one of the swap_page_trans_huge_swapped() call
>> sites does not have the folio. But it was only being called there to
>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>> call site and believe the new logic should be equivalent.
> 
> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
> 
> The "difference" is that you will now (1) get another temporary
> reference on the folio and (2) (try)lock the folio every time you
> discard a single PTE of a (possibly) large THP.
> 

Thinking about it, your change will not only affect THP, but any call to 
free_swap_and_cache().

Likely that's not what we want. :/
  
Ryan Roberts Feb. 26, 2024, 5:41 p.m. UTC | #3
On 22/02/2024 10:20, David Hildenbrand wrote:
> On 22.02.24 11:19, David Hildenbrand wrote:
>> On 25.10.23 16:45, Ryan Roberts wrote:
>>> As preparation for supporting small-sized THP in the swap-out path,
>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>>> which, when present, always implies PMD-sized THP, which is the same as
>>> the cluster size.
>>>
>>> The only use of the flag was to determine whether a swap entry refers to
>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>>> Instead of relying on the flag, we now pass in nr_pages, which
>>> originates from the folio's number of pages. This allows the logic to
>>> work for folios of any order.
>>>
>>> The one snag is that one of the swap_page_trans_huge_swapped() call
>>> sites does not have the folio. But it was only being called there to
>>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>>> call site and believe the new logic should be equivalent.
>>
>> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
>>
>> The "difference" is that you will now (1) get another temporary
>> reference on the folio and (2) (try)lock the folio every time you
>> discard a single PTE of a (possibly) large THP.
>>
> 
> Thinking about it, your change will not only affect THP, but any call to
> free_swap_and_cache().
> 
> Likely that's not what we want. :/
> 

Is folio_trylock() really that expensive given the code path is already locking
multiple spinlocks, and I don't think we would expect the folio lock to be very
contended?

I guess filemap_get_folio() could be a bit more expensive, but again, is this
really a deal-breaker?


I'm just trying to refamiliarize myself with this series, but I think I ended up
allocating a cluster per cpu per order. So one potential solution would be to
turn the flag into a size and store it in the cluster info. (In fact I think I
was doing that in an early version of this series - will have to look at why I
got rid of that). Then we could avoid needing to figure out nr_pages from the folio.
  
Ryan Roberts Feb. 27, 2024, 5:10 p.m. UTC | #4
Hi David,

On 26/02/2024 17:41, Ryan Roberts wrote:
> On 22/02/2024 10:20, David Hildenbrand wrote:
>> On 22.02.24 11:19, David Hildenbrand wrote:
>>> On 25.10.23 16:45, Ryan Roberts wrote:
>>>> As preparation for supporting small-sized THP in the swap-out path,
>>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>>>> which, when present, always implies PMD-sized THP, which is the same as
>>>> the cluster size.
>>>>
>>>> The only use of the flag was to determine whether a swap entry refers to
>>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>>>> Instead of relying on the flag, we now pass in nr_pages, which
>>>> originates from the folio's number of pages. This allows the logic to
>>>> work for folios of any order.
>>>>
>>>> The one snag is that one of the swap_page_trans_huge_swapped() call
>>>> sites does not have the folio. But it was only being called there to
>>>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>>>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>>>> call site and believe the new logic should be equivalent.
>>>
>>> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
>>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
>>>
>>> The "difference" is that you will now (1) get another temporary
>>> reference on the folio and (2) (try)lock the folio every time you
>>> discard a single PTE of a (possibly) large THP.
>>>
>>
>> Thinking about it, your change will not only affect THP, but any call to
>> free_swap_and_cache().
>>
>> Likely that's not what we want. :/
>>
> 
> Is folio_trylock() really that expensive given the code path is already locking
> multiple spinlocks, and I don't think we would expect the folio lock to be very
> contended?
> 
> I guess filemap_get_folio() could be a bit more expensive, but again, is this
> really a deal-breaker?
> 
> 
> I'm just trying to refamiliarize myself with this series, but I think I ended up
> allocating a cluster per cpu per order. So one potential solution would be to
> turn the flag into a size and store it in the cluster info. (In fact I think I
> was doing that in an early version of this series - will have to look at why I
> got rid of that). Then we could avoid needing to figure out nr_pages from the folio.

I ran some microbenchmarks to see if these extra operations cause a performance
issue - it all looks OK to me.

I modified your "pte-mapped-folio-benchmarks" to add a "munmap-swapped-forked"
mode, which prepares the 1G memory mapping by first paging it out with
MADV_PAGEOUT, then it forks a child (and keeps that child alive) so that the
swap slots have 2 references, then it measures the duration of munmap() in the
parent on the entire range. The idea is that free_swap_and_cache() is called for
each PTE during munmap(). Prior to my change, swap_page_trans_huge_swapped()
will return true, due to the child's references, and __try_to_reclaim_swap() is
not called. After my change, we no longer have this short cut.

In both cases the results are within 1% (confirmed across multiple runs of 20
seconds each):

mm-stable: Average: 0.004997
 + change: Average: 0.005037

(these numbers are for Ampere Altra. I also tested on M2 VM - no regression
their either).

Do you still have a concern about this change?

An alternative is to store the folio size in the cluster, but that won't be
accurate if the folio is later split or if an entry within the cluster is later
stolen for an order-0 entry. I think it would still work though; it just means
that you might get a false positive in those circumstances, which means taking
the "slow" path. But this is a rare event.

Regardless, I prefer not to do this, since it adds complexity and doesn't
benefit performance.

Thanks,
Ryan
  
David Hildenbrand Feb. 27, 2024, 7:17 p.m. UTC | #5
On 27.02.24 18:10, Ryan Roberts wrote:
> Hi David,
> 
> On 26/02/2024 17:41, Ryan Roberts wrote:
>> On 22/02/2024 10:20, David Hildenbrand wrote:
>>> On 22.02.24 11:19, David Hildenbrand wrote:
>>>> On 25.10.23 16:45, Ryan Roberts wrote:
>>>>> As preparation for supporting small-sized THP in the swap-out path,
>>>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>>>>> which, when present, always implies PMD-sized THP, which is the same as
>>>>> the cluster size.
>>>>>
>>>>> The only use of the flag was to determine whether a swap entry refers to
>>>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>>>>> Instead of relying on the flag, we now pass in nr_pages, which
>>>>> originates from the folio's number of pages. This allows the logic to
>>>>> work for folios of any order.
>>>>>
>>>>> The one snag is that one of the swap_page_trans_huge_swapped() call
>>>>> sites does not have the folio. But it was only being called there to
>>>>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>>>>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>>>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>>>>> call site and believe the new logic should be equivalent.
>>>>
>>>> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
>>>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
>>>>
>>>> The "difference" is that you will now (1) get another temporary
>>>> reference on the folio and (2) (try)lock the folio every time you
>>>> discard a single PTE of a (possibly) large THP.
>>>>
>>>
>>> Thinking about it, your change will not only affect THP, but any call to
>>> free_swap_and_cache().
>>>
>>> Likely that's not what we want. :/
>>>
>>
>> Is folio_trylock() really that expensive given the code path is already locking
>> multiple spinlocks, and I don't think we would expect the folio lock to be very
>> contended?
>>
>> I guess filemap_get_folio() could be a bit more expensive, but again, is this
>> really a deal-breaker?
>>
>>
>> I'm just trying to refamiliarize myself with this series, but I think I ended up
>> allocating a cluster per cpu per order. So one potential solution would be to
>> turn the flag into a size and store it in the cluster info. (In fact I think I
>> was doing that in an early version of this series - will have to look at why I
>> got rid of that). Then we could avoid needing to figure out nr_pages from the folio.
> 
> I ran some microbenchmarks to see if these extra operations cause a performance
> issue - it all looks OK to me.

Sorry, I'm drowning in reviews right now. I was hoping to get some of my own
stuff figured out today ... maybe tomorrow.

> 
> I modified your "pte-mapped-folio-benchmarks" to add a "munmap-swapped-forked"
> mode, which prepares the 1G memory mapping by first paging it out with
> MADV_PAGEOUT, then it forks a child (and keeps that child alive) so that the
> swap slots have 2 references, then it measures the duration of munmap() in the
> parent on the entire range. The idea is that free_swap_and_cache() is called for
> each PTE during munmap(). Prior to my change, swap_page_trans_huge_swapped()
> will return true, due to the child's references, and __try_to_reclaim_swap() is
> not called. After my change, we no longer have this short cut.
> 
> In both cases the results are within 1% (confirmed across multiple runs of 20
> seconds each):
> 
> mm-stable: Average: 0.004997
>   + change: Average: 0.005037
> 
> (these numbers are for Ampere Altra. I also tested on M2 VM - no regression
> their either).
> 
> Do you still have a concern about this change?

The main concern I had was not about overhead due to atomic operations in the
non-concurrent case that you are measuring.

We might now unnecessarily be incrementing the folio refcount and taking
the folio lock. That will affects large folios in the swapcache now IIUC.
Small folios should be unaffected.

The side effects of that can be:
* Code checking for additional folio reference could now detect some and
   back out. (the "mapcount + swapcache*folio_nr_pages != folio_refcount"
   stuff)
* Code that might really benefit from trylocking the folio might fail to
   do so.

For example, splitting a large folio might now fail more often simply
because some process zaps a swap entry and the additional reference+page
lock were optimized out previously.

How relevant is it? Relevant enough that someone decided to put that
optimization in? I don't know :)

Arguably, zapping a present PTE also leaves the refcount elevated for a while
until the mapcount was freed. But here, it could be avoided.

Digging a bit, it was introduced in:

commit e07098294adfd03d582af7626752255e3d170393
Author: Huang Ying <ying.huang@intel.com>
Date:   Wed Sep 6 16:22:16 2017 -0700

     mm, THP, swap: support to reclaim swap space for THP swapped out
     
     The normal swap slot reclaiming can be done when the swap count reaches
     SWAP_HAS_CACHE.  But for the swap slot which is backing a THP, all swap
     slots backing one THP must be reclaimed together, because the swap slot
     may be used again when the THP is swapped out again later.  So the swap
     slots backing one THP can be reclaimed together when the swap count for
     all swap slots for the THP reached SWAP_HAS_CACHE.  In the patch, the
     functions to check whether the swap count for all swap slots backing one
     THP reached SWAP_HAS_CACHE are implemented and used when checking
     whether a swap slot can be reclaimed.
     
     To make it easier to determine whether a swap slot is backing a THP, a
     new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
     cluster which is backing a THP (Transparent Huge Page).  Because THP
     swap in as a whole isn't supported now.  After deleting the THP from the
     swap cache (for example, swapping out finished), the CLUSTER_FLAG_HUGE
     flag will be cleared.  So that, the normal pages inside THP can be
     swapped in individually.


With your change, if we have a swapped out THP with 512 entries and exit(), we
would now 512 times in a row grab a folio reference and trylock the folio. In the
past, we would have done that at most once.

That doesn't feel quite right TBH ... so I'm wondering if there are any low-hanging
fruits to avoid that.
  
Ryan Roberts Feb. 28, 2024, 9:37 a.m. UTC | #6
Hi David, Huang,


On 27/02/2024 19:17, David Hildenbrand wrote:
> On 27.02.24 18:10, Ryan Roberts wrote:
>> Hi David,
>>
>> On 26/02/2024 17:41, Ryan Roberts wrote:
>>> On 22/02/2024 10:20, David Hildenbrand wrote:
>>>> On 22.02.24 11:19, David Hildenbrand wrote:
>>>>> On 25.10.23 16:45, Ryan Roberts wrote:
>>>>>> As preparation for supporting small-sized THP in the swap-out path,
>>>>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>>>>>> which, when present, always implies PMD-sized THP, which is the same as
>>>>>> the cluster size.
>>>>>>
>>>>>> The only use of the flag was to determine whether a swap entry refers to
>>>>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>>>>>> Instead of relying on the flag, we now pass in nr_pages, which
>>>>>> originates from the folio's number of pages. This allows the logic to
>>>>>> work for folios of any order.
>>>>>>
>>>>>> The one snag is that one of the swap_page_trans_huge_swapped() call
>>>>>> sites does not have the folio. But it was only being called there to
>>>>>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>>>>>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>>>>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>>>>>> call site and believe the new logic should be equivalent.
>>>>>
>>>>> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
>>>>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
>>>>>
>>>>> The "difference" is that you will now (1) get another temporary
>>>>> reference on the folio and (2) (try)lock the folio every time you
>>>>> discard a single PTE of a (possibly) large THP.
>>>>>
>>>>
>>>> Thinking about it, your change will not only affect THP, but any call to
>>>> free_swap_and_cache().
>>>>
>>>> Likely that's not what we want. :/
>>>>
>>>
>>> Is folio_trylock() really that expensive given the code path is already locking
>>> multiple spinlocks, and I don't think we would expect the folio lock to be very
>>> contended?
>>>
>>> I guess filemap_get_folio() could be a bit more expensive, but again, is this
>>> really a deal-breaker?
>>>
>>>
>>> I'm just trying to refamiliarize myself with this series, but I think I ended up
>>> allocating a cluster per cpu per order. So one potential solution would be to
>>> turn the flag into a size and store it in the cluster info. (In fact I think I
>>> was doing that in an early version of this series - will have to look at why I
>>> got rid of that). Then we could avoid needing to figure out nr_pages from the
>>> folio.
>>
>> I ran some microbenchmarks to see if these extra operations cause a performance
>> issue - it all looks OK to me.
> 
> Sorry, I'm drowning in reviews right now. I was hoping to get some of my own
> stuff figured out today ... maybe tomorrow.

No need to apologise - as always I appreciate whatever time you can spare.

> 
>>
>> I modified your "pte-mapped-folio-benchmarks" to add a "munmap-swapped-forked"
>> mode, which prepares the 1G memory mapping by first paging it out with
>> MADV_PAGEOUT, then it forks a child (and keeps that child alive) so that the
>> swap slots have 2 references, then it measures the duration of munmap() in the
>> parent on the entire range. The idea is that free_swap_and_cache() is called for
>> each PTE during munmap(). Prior to my change, swap_page_trans_huge_swapped()
>> will return true, due to the child's references, and __try_to_reclaim_swap() is
>> not called. After my change, we no longer have this short cut.
>>
>> In both cases the results are within 1% (confirmed across multiple runs of 20
>> seconds each):
>>
>> mm-stable: Average: 0.004997
>>   + change: Average: 0.005037
>>
>> (these numbers are for Ampere Altra. I also tested on M2 VM - no regression
>> their either).
>>
>> Do you still have a concern about this change?
> 
> The main concern I had was not about overhead due to atomic operations in the
> non-concurrent case that you are measuring.
> 
> We might now unnecessarily be incrementing the folio refcount and taking
> the folio lock. That will affects large folios in the swapcache now IIUC.
> Small folios should be unaffected.

Yes I think you are right, because `count == SWAP_HAS_CACHE` is already checking
the small page is not swapped. So my perf tests weren't actually doing what I
thought they were.

> 
> The side effects of that can be:
> * Code checking for additional folio reference could now detect some and
>   back out. (the "mapcount + swapcache*folio_nr_pages != folio_refcount"
>   stuff)
> * Code that might really benefit from trylocking the folio might fail to
>   do so.
> 
> For example, splitting a large folio might now fail more often simply
> because some process zaps a swap entry and the additional reference+page
> lock were optimized out previously.

Understood. Of course this is the type of fuzzy reasoning that is very dificult
to test objectively :). But it makes sense and I suppose I'll have to come up
with an alternative approach (see below).

> 
> How relevant is it? Relevant enough that someone decided to put that
> optimization in? I don't know :)

I'll have one last go at convincing you: Huang Ying (original author) commented
"I believe this should be OK.  Better to compare the performance too." at [1].
That implies to me that perhaps the optimization wasn't in response to a
specific problem after all. Do you have any thoughts, Huang?

[1]
https://lore.kernel.org/linux-mm/87v8bdfvtj.fsf@yhuang6-desk2.ccr.corp.intel.com/

> 
> Arguably, zapping a present PTE also leaves the refcount elevated for a while
> until the mapcount was freed. But here, it could be avoided.
> 
> Digging a bit, it was introduced in:
> 
> commit e07098294adfd03d582af7626752255e3d170393
> Author: Huang Ying <ying.huang@intel.com>
> Date:   Wed Sep 6 16:22:16 2017 -0700
> 
>     mm, THP, swap: support to reclaim swap space for THP swapped out
>         The normal swap slot reclaiming can be done when the swap count reaches
>     SWAP_HAS_CACHE.  But for the swap slot which is backing a THP, all swap
>     slots backing one THP must be reclaimed together, because the swap slot
>     may be used again when the THP is swapped out again later.  So the swap
>     slots backing one THP can be reclaimed together when the swap count for
>     all swap slots for the THP reached SWAP_HAS_CACHE.  In the patch, the
>     functions to check whether the swap count for all swap slots backing one
>     THP reached SWAP_HAS_CACHE are implemented and used when checking
>     whether a swap slot can be reclaimed.
>         To make it easier to determine whether a swap slot is backing a THP, a
>     new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
>     cluster which is backing a THP (Transparent Huge Page).  Because THP
>     swap in as a whole isn't supported now.  After deleting the THP from the
>     swap cache (for example, swapping out finished), the CLUSTER_FLAG_HUGE
>     flag will be cleared.  So that, the normal pages inside THP can be
>     swapped in individually.

Thanks. I did this same archaeology, but found nothing pointing to the rationale
for this optimization, so decided that if its undocumented, then it probably
wasn't critical.

> 
> 
> With your change, if we have a swapped out THP with 512 entries and exit(), we
> would now 512 times in a row grab a folio reference and trylock the folio. In the
> past, we would have done that at most once.
> 
> That doesn't feel quite right TBH ... so I'm wondering if there are any low-hanging
> fruits to avoid that.
> 

OK so if we really do need to keep this optimization, here are some ideas:

Fundamentally, we would like to be able to figure out the size of the swap slot
from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
to mark it as PMD_SIZE.

Going forwards, we want to support all sizes (power-of-2). Most of the time, a
cluster will contain only one size of THPs, but this is not the case when a THP
in the swapcache gets split or when an order-0 slot gets stolen. We expect these
cases to be rare.

1) Keep the size of the smallest swap entry in the cluster header. Most of the
time it will be the full size of the swap entry, but sometimes it will cover
only a portion. In the latter case you may see a false negative for
swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
to order-0). I think that is safe, but haven't completely convinced myself yet.

2) allocate 4 bits per (small) swap slot to hold the order. This will give
precise information and is conceptually simpler to understand, but will cost
more memory (half as much as the initial swap_map[] again).

I still prefer to avoid this at all if we can (and would like to hear Huang's
thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
prototyping.

Thanks,
Ryan
  
David Hildenbrand Feb. 28, 2024, 12:12 p.m. UTC | #7
>> How relevant is it? Relevant enough that someone decided to put that
>> optimization in? I don't know :)
> 
> I'll have one last go at convincing you: Huang Ying (original author) commented
> "I believe this should be OK.  Better to compare the performance too." at [1].
> That implies to me that perhaps the optimization wasn't in response to a
> specific problem after all. Do you have any thoughts, Huang?

Might make sense to include that in the patch description!

> OK so if we really do need to keep this optimization, here are some ideas:
> 
> Fundamentally, we would like to be able to figure out the size of the swap slot
> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
> to mark it as PMD_SIZE.
> 
> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
> cluster will contain only one size of THPs, but this is not the case when a THP
> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
> cases to be rare.
> 
> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
> time it will be the full size of the swap entry, but sometimes it will cover
> only a portion. In the latter case you may see a false negative for
> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
> to order-0). I think that is safe, but haven't completely convinced myself yet.
> 
> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
> precise information and is conceptually simpler to understand, but will cost
> more memory (half as much as the initial swap_map[] again).
> 
> I still prefer to avoid this at all if we can (and would like to hear Huang's
> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
> prototyping.

Taking a step back: what about we simply batch unmapping of swap entries?

That is, if we're unmapping a PTE range, we'll collect swap entries 
(under PT lock) that reference consecutive swap offsets in the same swap 
file.

There, we can then first decrement all the swap counts, and then try 
minimizing how often we actually have to try reclaiming swap space 
(lookup folio, see it's a large folio that we cannot reclaim or could 
reclaim, ...).

Might need some fine-tuning in swap code to "advance" to the next entry 
to try freeing up, but we certainly can do better than what we would do 
right now.
  
Matthew Wilcox Feb. 28, 2024, 1:33 p.m. UTC | #8
On Wed, Feb 28, 2024 at 09:37:06AM +0000, Ryan Roberts wrote:
> Fundamentally, we would like to be able to figure out the size of the swap slot
> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
> to mark it as PMD_SIZE.
> 
> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
> cluster will contain only one size of THPs, but this is not the case when a THP
> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
> cases to be rare.
> 
> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
> time it will be the full size of the swap entry, but sometimes it will cover
> only a portion. In the latter case you may see a false negative for
> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
> to order-0). I think that is safe, but haven't completely convinced myself yet.
> 
> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
> precise information and is conceptually simpler to understand, but will cost
> more memory (half as much as the initial swap_map[] again).
> 
> I still prefer to avoid this at all if we can (and would like to hear Huang's
> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
> prototyping.

I can't quite bring myself to look up the encoding of swap entries
but as long as we're willing to restrict ourselves to naturally aligning
the clusters, there's an encoding (which I believe I invented) that lets
us encode arbitrary power-of-two sizes with a single bit.

I describe it here:
https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder

Let me know if it's not clear.
  
Ryan Roberts Feb. 28, 2024, 2:24 p.m. UTC | #9
On 28/02/2024 13:33, Matthew Wilcox wrote:
> On Wed, Feb 28, 2024 at 09:37:06AM +0000, Ryan Roberts wrote:
>> Fundamentally, we would like to be able to figure out the size of the swap slot
>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>> to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>> cluster will contain only one size of THPs, but this is not the case when a THP
>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>> cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>> time it will be the full size of the swap entry, but sometimes it will cover
>> only a portion. In the latter case you may see a false negative for
>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>
>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>> precise information and is conceptually simpler to understand, but will cost
>> more memory (half as much as the initial swap_map[] again).
>>
>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>> prototyping.
> 
> I can't quite bring myself to look up the encoding of swap entries
> but as long as we're willing to restrict ourselves to naturally aligning
> the clusters, there's an encoding (which I believe I invented) that lets
> us encode arbitrary power-of-two sizes with a single bit.
> 
> I describe it here:
> https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder
> 
> Let me know if it's not clear.

Ahh yes, I'm familiar with this encoding scheme from other settings. Although
I've previously thought of it as having a bit to indicate whether the scheme is
enabled or not, and if it is enabled then the encoded PFN is:

PFNe = PFNd | (1 << (log2(n) - 1))

Where n is the power-of-2 page count.

Same thing, I think.

I think we would have to steal a bit from the offset to make this work, and it
looks like the size of that is bottlnecked on the arch's swp_entry PTE
representation. Looks like there is a MIPS config that only has 17 bits for
offset to begin with, so I doubt we would be able to spare a bit here? Although
it looks possible that there are some unused low bits that could be used...
  
Ryan Roberts Feb. 28, 2024, 2:57 p.m. UTC | #10
On 28/02/2024 12:12, David Hildenbrand wrote:
>>> How relevant is it? Relevant enough that someone decided to put that
>>> optimization in? I don't know :)
>>
>> I'll have one last go at convincing you: Huang Ying (original author) commented
>> "I believe this should be OK.  Better to compare the performance too." at [1].
>> That implies to me that perhaps the optimization wasn't in response to a
>> specific problem after all. Do you have any thoughts, Huang?
> 
> Might make sense to include that in the patch description!
> 
>> OK so if we really do need to keep this optimization, here are some ideas:
>>
>> Fundamentally, we would like to be able to figure out the size of the swap slot
>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>> to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>> cluster will contain only one size of THPs, but this is not the case when a THP
>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>> cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>> time it will be the full size of the swap entry, but sometimes it will cover
>> only a portion. In the latter case you may see a false negative for
>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>
>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>> precise information and is conceptually simpler to understand, but will cost
>> more memory (half as much as the initial swap_map[] again).
>>
>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>> prototyping.
> 
> Taking a step back: what about we simply batch unmapping of swap entries?
> 
> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
> lock) that reference consecutive swap offsets in the same swap file.

Yes in principle, but there are 4 places where free_swap_and_cache() is called,
and only 2 of those are really amenable to batching (zap_pte_range() and
madvise_free_pte_range()). So the other two users will still take the "slow"
path. Maybe those 2 callsites are the only ones that really matter? I can
certainly have a stab at this approach.

> 
> There, we can then first decrement all the swap counts, and then try minimizing
> how often we actually have to try reclaiming swap space (lookup folio, see it's
> a large folio that we cannot reclaim or could reclaim, ...).
> 
> Might need some fine-tuning in swap code to "advance" to the next entry to try
> freeing up, but we certainly can do better than what we would do right now.

I'm not sure I've understood this. Isn't advancing just a matter of:

entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
  
Ryan Roberts Feb. 28, 2024, 2:59 p.m. UTC | #11
On 28/02/2024 14:24, Ryan Roberts wrote:
> On 28/02/2024 13:33, Matthew Wilcox wrote:
>> On Wed, Feb 28, 2024 at 09:37:06AM +0000, Ryan Roberts wrote:
>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>>> to mark it as PMD_SIZE.
>>>
>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>>> cases to be rare.
>>>
>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>> time it will be the full size of the swap entry, but sometimes it will cover
>>> only a portion. In the latter case you may see a false negative for
>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>
>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>> precise information and is conceptually simpler to understand, but will cost
>>> more memory (half as much as the initial swap_map[] again).
>>>
>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>> prototyping.
>>
>> I can't quite bring myself to look up the encoding of swap entries
>> but as long as we're willing to restrict ourselves to naturally aligning
>> the clusters, there's an encoding (which I believe I invented) that lets
>> us encode arbitrary power-of-two sizes with a single bit.
>>
>> I describe it here:
>> https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder
>>
>> Let me know if it's not clear.
> 
> Ahh yes, I'm familiar with this encoding scheme from other settings. Although
> I've previously thought of it as having a bit to indicate whether the scheme is
> enabled or not, and if it is enabled then the encoded PFN is:
> 
> PFNe = PFNd | (1 << (log2(n) - 1))
> 
> Where n is the power-of-2 page count.
> 
> Same thing, I think.
> 
> I think we would have to steal a bit from the offset to make this work, and it
> looks like the size of that is bottlnecked on the arch's swp_entry PTE
> representation. Looks like there is a MIPS config that only has 17 bits for
> offset to begin with, so I doubt we would be able to spare a bit here? Although
> it looks possible that there are some unused low bits that could be used...
> 

I think the other problem with this is that it won't tell us which slot in the
"swap slot block" each entry is targetting?
  
David Hildenbrand Feb. 28, 2024, 3:12 p.m. UTC | #12
On 28.02.24 15:57, Ryan Roberts wrote:
> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>> How relevant is it? Relevant enough that someone decided to put that
>>>> optimization in? I don't know :)
>>>
>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>> That implies to me that perhaps the optimization wasn't in response to a
>>> specific problem after all. Do you have any thoughts, Huang?
>>
>> Might make sense to include that in the patch description!
>>
>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>
>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>>> to mark it as PMD_SIZE.
>>>
>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>>> cases to be rare.
>>>
>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>> time it will be the full size of the swap entry, but sometimes it will cover
>>> only a portion. In the latter case you may see a false negative for
>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>
>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>> precise information and is conceptually simpler to understand, but will cost
>>> more memory (half as much as the initial swap_map[] again).
>>>
>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>> prototyping.
>>
>> Taking a step back: what about we simply batch unmapping of swap entries?
>>
>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>> lock) that reference consecutive swap offsets in the same swap file.
> 
> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
> and only 2 of those are really amenable to batching (zap_pte_range() and
> madvise_free_pte_range()). So the other two users will still take the "slow"
> path. Maybe those 2 callsites are the only ones that really matter? I can
> certainly have a stab at this approach.

We can ignore the s390x one. That s390x code should only apply to KVM 
guest memory where ordinary THP are not even supported. (and nobody uses 
mTHP there yet).

Long story short: the VM can hint that some memory pages are now unused 
and the hypervisor can reclaim them. That's what that callback does (zap 
guest-provided guest memory). No need to worry about any batching for now.

Then, there is the shmem one in shmem_free_swap(). I really don't know 
how shmem handles THP+swapout.

But looking at shmem_writepage(), we split any large folios before 
moving them to the swapcache, so likely we don't care at all, because 
THP don't apply.

> 
>>
>> There, we can then first decrement all the swap counts, and then try minimizing
>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>> a large folio that we cannot reclaim or could reclaim, ...).
>>
>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>> freeing up, but we certainly can do better than what we would do right now.
> 
> I'm not sure I've understood this. Isn't advancing just a matter of:
> 
> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);

I was talking about the advancing swapslot processing after decrementing 
the swapcounts.

Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, 
you'd have to start with the first swapslot that has now a swapcount=0 
one and try to reclaim swap.

Assume you get a small folio, then you'll have to proceed with the next 
swap slot and try to reclaim swap.

Assume you get a large folio, then you can skip more swapslots 
(depending on offset into the folio etc).

If you get what I mean. :)
  
Ryan Roberts Feb. 28, 2024, 3:18 p.m. UTC | #13
On 28/02/2024 15:12, David Hildenbrand wrote:
> On 28.02.24 15:57, Ryan Roberts wrote:
>> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>>> How relevant is it? Relevant enough that someone decided to put that
>>>>> optimization in? I don't know :)
>>>>
>>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>>> That implies to me that perhaps the optimization wasn't in response to a
>>>> specific problem after all. Do you have any thoughts, Huang?
>>>
>>> Might make sense to include that in the patch description!
>>>
>>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>>
>>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the
>>>> cluster
>>>> to mark it as PMD_SIZE.
>>>>
>>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect
>>>> these
>>>> cases to be rare.
>>>>
>>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>>> time it will be the full size of the swap entry, but sometimes it will cover
>>>> only a portion. In the latter case you may see a false negative for
>>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>>> There is one wrinkle: currently the HUGE flag is cleared in
>>>> put_swap_folio(). We
>>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole
>>>> cluster
>>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>>
>>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>>> precise information and is conceptually simpler to understand, but will cost
>>>> more memory (half as much as the initial swap_map[] again).
>>>>
>>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>>> prototyping.
>>>
>>> Taking a step back: what about we simply batch unmapping of swap entries?
>>>
>>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>>> lock) that reference consecutive swap offsets in the same swap file.
>>
>> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
>> and only 2 of those are really amenable to batching (zap_pte_range() and
>> madvise_free_pte_range()). So the other two users will still take the "slow"
>> path. Maybe those 2 callsites are the only ones that really matter? I can
>> certainly have a stab at this approach.
> 
> We can ignore the s390x one. That s390x code should only apply to KVM guest
> memory where ordinary THP are not even supported. (and nobody uses mTHP there yet).
> 
> Long story short: the VM can hint that some memory pages are now unused and the
> hypervisor can reclaim them. That's what that callback does (zap guest-provided
> guest memory). No need to worry about any batching for now.

OK good.

> 
> Then, there is the shmem one in shmem_free_swap(). I really don't know how shmem
> handles THP+swapout.
> 
> But looking at shmem_writepage(), we split any large folios before moving them
> to the swapcache, so likely we don't care at all, because THP don't apply.

Excellent.

> 
>>
>>>
>>> There, we can then first decrement all the swap counts, and then try minimizing
>>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>>> a large folio that we cannot reclaim or could reclaim, ...).
>>>
>>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>>> freeing up, but we certainly can do better than what we would do right now.
>>
>> I'm not sure I've understood this. Isn't advancing just a matter of:
>>
>> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
> 
> I was talking about the advancing swapslot processing after decrementing the
> swapcounts.
> 
> Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, you'd
> have to start with the first swapslot that has now a swapcount=0 one and try to
> reclaim swap.
> 
> Assume you get a small folio, then you'll have to proceed with the next swap
> slot and try to reclaim swap.
> 
> Assume you get a large folio, then you can skip more swapslots (depending on
> offset into the folio etc).
> 
> If you get what I mean. :)

Ahh gottya. I'll have a play.
  
Ryan Roberts March 1, 2024, 4:27 p.m. UTC | #14
On 28/02/2024 15:12, David Hildenbrand wrote:
> On 28.02.24 15:57, Ryan Roberts wrote:
>> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>>> How relevant is it? Relevant enough that someone decided to put that
>>>>> optimization in? I don't know :)
>>>>
>>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>>> That implies to me that perhaps the optimization wasn't in response to a
>>>> specific problem after all. Do you have any thoughts, Huang?
>>>
>>> Might make sense to include that in the patch description!
>>>
>>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>>
>>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the
>>>> cluster
>>>> to mark it as PMD_SIZE.
>>>>
>>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect
>>>> these
>>>> cases to be rare.
>>>>
>>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>>> time it will be the full size of the swap entry, but sometimes it will cover
>>>> only a portion. In the latter case you may see a false negative for
>>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>>> There is one wrinkle: currently the HUGE flag is cleared in
>>>> put_swap_folio(). We
>>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole
>>>> cluster
>>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>>
>>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>>> precise information and is conceptually simpler to understand, but will cost
>>>> more memory (half as much as the initial swap_map[] again).
>>>>
>>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>>> prototyping.
>>>
>>> Taking a step back: what about we simply batch unmapping of swap entries?
>>>
>>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>>> lock) that reference consecutive swap offsets in the same swap file.
>>
>> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
>> and only 2 of those are really amenable to batching (zap_pte_range() and
>> madvise_free_pte_range()). So the other two users will still take the "slow"
>> path. Maybe those 2 callsites are the only ones that really matter? I can
>> certainly have a stab at this approach.
> 
> We can ignore the s390x one. That s390x code should only apply to KVM guest
> memory where ordinary THP are not even supported. (and nobody uses mTHP there yet).
> 
> Long story short: the VM can hint that some memory pages are now unused and the
> hypervisor can reclaim them. That's what that callback does (zap guest-provided
> guest memory). No need to worry about any batching for now.
> 
> Then, there is the shmem one in shmem_free_swap(). I really don't know how shmem
> handles THP+swapout.
> 
> But looking at shmem_writepage(), we split any large folios before moving them
> to the swapcache, so likely we don't care at all, because THP don't apply.
> 
>>
>>>
>>> There, we can then first decrement all the swap counts, and then try minimizing
>>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>>> a large folio that we cannot reclaim or could reclaim, ...).
>>>
>>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>>> freeing up, but we certainly can do better than what we would do right now.
>>
>> I'm not sure I've understood this. Isn't advancing just a matter of:
>>
>> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
> 
> I was talking about the advancing swapslot processing after decrementing the
> swapcounts.
> 
> Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, you'd
> have to start with the first swapslot that has now a swapcount=0 one and try to
> reclaim swap.
> 
> Assume you get a small folio, then you'll have to proceed with the next swap
> slot and try to reclaim swap.
> 
> Assume you get a large folio, then you can skip more swapslots (depending on
> offset into the folio etc).
> 
> If you get what I mean. :)
> 

I've implemented the batching as David suggested, and I'm pretty confident it's
correct. The only problem is that during testing I can't provoke the code to
take the path. I've been pouring through the code but struggling to figure out
under what situation you would expect the swap entry passed to
free_swap_and_cache() to still have a cached folio? Does anyone have any idea?

This is the original (unbatched) function, after my change, which caused David's
concern that we would end up calling __try_to_reclaim_swap() far too much:

int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned char count;

	if (non_swap_entry(entry))
		return 1;

	p = _swap_info_get(entry);
	if (p) {
		count = __swap_entry_free(p, entry);
		if (count == SWAP_HAS_CACHE)
			__try_to_reclaim_swap(p, swp_offset(entry),
					      TTRS_UNMAPPED | TTRS_FULL);
	}
	return p != NULL;
}

The trouble is, whenever its called, count is always 0, so
__try_to_reclaim_swap() never gets called.

My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
function to be called for every PTE, but count is always 0 after
__swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
order-0 as well as PTE- and PMD-mapped 2M THP.

I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
using a block ram device as my backing store - I think this does synchronous IO
so perhaps if I have a real block device with async IO I might have more luck?
Just a guess...

Or perhaps this code path is a corner case? In which case, perhaps its not worth
adding the batching optimization after all?

Thanks,
Ryan
  
Matthew Wilcox March 1, 2024, 4:31 p.m. UTC | #15
On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
> I've implemented the batching as David suggested, and I'm pretty confident it's
> correct. The only problem is that during testing I can't provoke the code to
> take the path. I've been pouring through the code but struggling to figure out
> under what situation you would expect the swap entry passed to
> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> 
> This is the original (unbatched) function, after my change, which caused David's
> concern that we would end up calling __try_to_reclaim_swap() far too much:
> 
> int free_swap_and_cache(swp_entry_t entry)
> {
> 	struct swap_info_struct *p;
> 	unsigned char count;
> 
> 	if (non_swap_entry(entry))
> 		return 1;
> 
> 	p = _swap_info_get(entry);
> 	if (p) {
> 		count = __swap_entry_free(p, entry);
> 		if (count == SWAP_HAS_CACHE)
> 			__try_to_reclaim_swap(p, swp_offset(entry),
> 					      TTRS_UNMAPPED | TTRS_FULL);
> 	}
> 	return p != NULL;
> }
> 
> The trouble is, whenever its called, count is always 0, so
> __try_to_reclaim_swap() never gets called.
> 
> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> function to be called for every PTE, but count is always 0 after
> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> order-0 as well as PTE- and PMD-mapped 2M THP.

I think you have to page it back in again, then it will have an entry in
the swap cache.  Maybe.  I know little about anon memory ;-)

If that doesn't work, perhaps use tmpfs, and use some memory pressure to
force that to swap?

> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> using a block ram device as my backing store - I think this does synchronous IO
> so perhaps if I have a real block device with async IO I might have more luck?
> Just a guess...
> 
> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> adding the batching optimization after all?
> 
> Thanks,
> Ryan
>
  
Ryan Roberts March 1, 2024, 4:31 p.m. UTC | #16
On 01/03/2024 16:27, Ryan Roberts wrote:
> On 28/02/2024 15:12, David Hildenbrand wrote:
>> On 28.02.24 15:57, Ryan Roberts wrote:
>>> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>>>> How relevant is it? Relevant enough that someone decided to put that
>>>>>> optimization in? I don't know :)
>>>>>
>>>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>>>> That implies to me that perhaps the optimization wasn't in response to a
>>>>> specific problem after all. Do you have any thoughts, Huang?
>>>>
>>>> Might make sense to include that in the patch description!
>>>>
>>>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>>>
>>>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the
>>>>> cluster
>>>>> to mark it as PMD_SIZE.
>>>>>
>>>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect
>>>>> these
>>>>> cases to be rare.
>>>>>
>>>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>>>> time it will be the full size of the swap entry, but sometimes it will cover
>>>>> only a portion. In the latter case you may see a false negative for
>>>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>>>> There is one wrinkle: currently the HUGE flag is cleared in
>>>>> put_swap_folio(). We
>>>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole
>>>>> cluster
>>>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>>>
>>>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>>>> precise information and is conceptually simpler to understand, but will cost
>>>>> more memory (half as much as the initial swap_map[] again).
>>>>>
>>>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>>>> prototyping.
>>>>
>>>> Taking a step back: what about we simply batch unmapping of swap entries?
>>>>
>>>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>>>> lock) that reference consecutive swap offsets in the same swap file.
>>>
>>> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
>>> and only 2 of those are really amenable to batching (zap_pte_range() and
>>> madvise_free_pte_range()). So the other two users will still take the "slow"
>>> path. Maybe those 2 callsites are the only ones that really matter? I can
>>> certainly have a stab at this approach.
>>
>> We can ignore the s390x one. That s390x code should only apply to KVM guest
>> memory where ordinary THP are not even supported. (and nobody uses mTHP there yet).
>>
>> Long story short: the VM can hint that some memory pages are now unused and the
>> hypervisor can reclaim them. That's what that callback does (zap guest-provided
>> guest memory). No need to worry about any batching for now.
>>
>> Then, there is the shmem one in shmem_free_swap(). I really don't know how shmem
>> handles THP+swapout.
>>
>> But looking at shmem_writepage(), we split any large folios before moving them
>> to the swapcache, so likely we don't care at all, because THP don't apply.
>>
>>>
>>>>
>>>> There, we can then first decrement all the swap counts, and then try minimizing
>>>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>>>> a large folio that we cannot reclaim or could reclaim, ...).
>>>>
>>>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>>>> freeing up, but we certainly can do better than what we would do right now.
>>>
>>> I'm not sure I've understood this. Isn't advancing just a matter of:
>>>
>>> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
>>
>> I was talking about the advancing swapslot processing after decrementing the
>> swapcounts.
>>
>> Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, you'd
>> have to start with the first swapslot that has now a swapcount=0 one and try to
>> reclaim swap.
>>
>> Assume you get a small folio, then you'll have to proceed with the next swap
>> slot and try to reclaim swap.
>>
>> Assume you get a large folio, then you can skip more swapslots (depending on
>> offset into the folio etc).
>>
>> If you get what I mean. :)
>>
> 
> I've implemented the batching as David suggested, and I'm pretty confident it's
> correct. The only problem is that during testing I can't provoke the code to
> take the path. I've been pouring through the code but struggling to figure out
> under what situation you would expect the swap entry passed to
> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> 
> This is the original (unbatched) function, after my change, which caused David's
> concern that we would end up calling __try_to_reclaim_swap() far too much:
> 
> int free_swap_and_cache(swp_entry_t entry)
> {
> 	struct swap_info_struct *p;
> 	unsigned char count;
> 
> 	if (non_swap_entry(entry))
> 		return 1;
> 
> 	p = _swap_info_get(entry);
> 	if (p) {
> 		count = __swap_entry_free(p, entry);
> 		if (count == SWAP_HAS_CACHE)
> 			__try_to_reclaim_swap(p, swp_offset(entry),
> 					      TTRS_UNMAPPED | TTRS_FULL);
> 	}
> 	return p != NULL;
> }
> 
> The trouble is, whenever its called, count is always 0, so
> __try_to_reclaim_swap() never gets called.
> 
> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> function to be called for every PTE, but count is always 0 after
> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> order-0 as well as PTE- and PMD-mapped 2M THP.
> 
> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> using a block ram device as my backing store - I think this does synchronous IO
> so perhaps if I have a real block device with async IO I might have more luck?

Ahh I just switched to SSD as swap device and now its getting called. I guess
that's the reason. Sorry for the noise.

> Just a guess...
> 
> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> adding the batching optimization after all?
> 
> Thanks,
> Ryan
>
  
David Hildenbrand March 1, 2024, 4:32 p.m. UTC | #17
On 01.03.24 17:27, Ryan Roberts wrote:
> On 28/02/2024 15:12, David Hildenbrand wrote:
>> On 28.02.24 15:57, Ryan Roberts wrote:
>>> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>>>> How relevant is it? Relevant enough that someone decided to put that
>>>>>> optimization in? I don't know :)
>>>>>
>>>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>>>> That implies to me that perhaps the optimization wasn't in response to a
>>>>> specific problem after all. Do you have any thoughts, Huang?
>>>>
>>>> Might make sense to include that in the patch description!
>>>>
>>>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>>>
>>>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the
>>>>> cluster
>>>>> to mark it as PMD_SIZE.
>>>>>
>>>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect
>>>>> these
>>>>> cases to be rare.
>>>>>
>>>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>>>> time it will be the full size of the swap entry, but sometimes it will cover
>>>>> only a portion. In the latter case you may see a false negative for
>>>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>>>> There is one wrinkle: currently the HUGE flag is cleared in
>>>>> put_swap_folio(). We
>>>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole
>>>>> cluster
>>>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>>>
>>>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>>>> precise information and is conceptually simpler to understand, but will cost
>>>>> more memory (half as much as the initial swap_map[] again).
>>>>>
>>>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>>>> prototyping.
>>>>
>>>> Taking a step back: what about we simply batch unmapping of swap entries?
>>>>
>>>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>>>> lock) that reference consecutive swap offsets in the same swap file.
>>>
>>> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
>>> and only 2 of those are really amenable to batching (zap_pte_range() and
>>> madvise_free_pte_range()). So the other two users will still take the "slow"
>>> path. Maybe those 2 callsites are the only ones that really matter? I can
>>> certainly have a stab at this approach.
>>
>> We can ignore the s390x one. That s390x code should only apply to KVM guest
>> memory where ordinary THP are not even supported. (and nobody uses mTHP there yet).
>>
>> Long story short: the VM can hint that some memory pages are now unused and the
>> hypervisor can reclaim them. That's what that callback does (zap guest-provided
>> guest memory). No need to worry about any batching for now.
>>
>> Then, there is the shmem one in shmem_free_swap(). I really don't know how shmem
>> handles THP+swapout.
>>
>> But looking at shmem_writepage(), we split any large folios before moving them
>> to the swapcache, so likely we don't care at all, because THP don't apply.
>>
>>>
>>>>
>>>> There, we can then first decrement all the swap counts, and then try minimizing
>>>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>>>> a large folio that we cannot reclaim or could reclaim, ...).
>>>>
>>>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>>>> freeing up, but we certainly can do better than what we would do right now.
>>>
>>> I'm not sure I've understood this. Isn't advancing just a matter of:
>>>
>>> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
>>
>> I was talking about the advancing swapslot processing after decrementing the
>> swapcounts.
>>
>> Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, you'd
>> have to start with the first swapslot that has now a swapcount=0 one and try to
>> reclaim swap.
>>
>> Assume you get a small folio, then you'll have to proceed with the next swap
>> slot and try to reclaim swap.
>>
>> Assume you get a large folio, then you can skip more swapslots (depending on
>> offset into the folio etc).
>>
>> If you get what I mean. :)
>>
> 
> I've implemented the batching as David suggested, and I'm pretty confident it's
> correct. The only problem is that during testing I can't provoke the code to
> take the path. I've been pouring through the code but struggling to figure out
> under what situation you would expect the swap entry passed to
> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> 
> This is the original (unbatched) function, after my change, which caused David's
> concern that we would end up calling __try_to_reclaim_swap() far too much:
> 
> int free_swap_and_cache(swp_entry_t entry)
> {
> 	struct swap_info_struct *p;
> 	unsigned char count;
> 
> 	if (non_swap_entry(entry))
> 		return 1;
> 
> 	p = _swap_info_get(entry);
> 	if (p) {
> 		count = __swap_entry_free(p, entry);
> 		if (count == SWAP_HAS_CACHE)
> 			__try_to_reclaim_swap(p, swp_offset(entry),
> 					      TTRS_UNMAPPED | TTRS_FULL);
> 	}
> 	return p != NULL;
> }
> 
> The trouble is, whenever its called, count is always 0, so
> __try_to_reclaim_swap() never gets called.
> 
> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> function to be called for every PTE, but count is always 0 after
> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> order-0 as well as PTE- and PMD-mapped 2M THP.
> 
> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> using a block ram device as my backing store - I think this does synchronous IO
> so perhaps if I have a real block device with async IO I might have more luck?
> Just a guess...
> 
> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> adding the batching optimization after all?

I had to disable zswap in the past and was able to trigger this reliably 
with an ordinary swap backend (e.g., proper disk).

Whenever you involve swap-to-ram, you might just get it reclaimed 
immediately.
  
Ryan Roberts March 1, 2024, 4:44 p.m. UTC | #18
On 01/03/2024 16:31, Matthew Wilcox wrote:
> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>> I've implemented the batching as David suggested, and I'm pretty confident it's
>> correct. The only problem is that during testing I can't provoke the code to
>> take the path. I've been pouring through the code but struggling to figure out
>> under what situation you would expect the swap entry passed to
>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>
>> This is the original (unbatched) function, after my change, which caused David's
>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>
>> int free_swap_and_cache(swp_entry_t entry)
>> {
>> 	struct swap_info_struct *p;
>> 	unsigned char count;
>>
>> 	if (non_swap_entry(entry))
>> 		return 1;
>>
>> 	p = _swap_info_get(entry);
>> 	if (p) {
>> 		count = __swap_entry_free(p, entry);
>> 		if (count == SWAP_HAS_CACHE)
>> 			__try_to_reclaim_swap(p, swp_offset(entry),
>> 					      TTRS_UNMAPPED | TTRS_FULL);
>> 	}
>> 	return p != NULL;
>> }
>>
>> The trouble is, whenever its called, count is always 0, so
>> __try_to_reclaim_swap() never gets called.
>>
>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
>> function to be called for every PTE, but count is always 0 after
>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>> order-0 as well as PTE- and PMD-mapped 2M THP.
> 
> I think you have to page it back in again, then it will have an entry in
> the swap cache.  Maybe.  I know little about anon memory ;-)

Ahh, I was under the impression that the original folio is put into the swap
cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
I'm miles out... what exactly is the lifecycle of a folio going through swap out?

I guess I can try forking after swap out, then fault it back in in the child and
exit. Then do the munmap in the parent. I guess that could force it? Thanks for
the tip - I'll have a play.

> 
> If that doesn't work, perhaps use tmpfs, and use some memory pressure to
> force that to swap?
> 
>> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
>> using a block ram device as my backing store - I think this does synchronous IO
>> so perhaps if I have a real block device with async IO I might have more luck?
>> Just a guess...
>>
>> Or perhaps this code path is a corner case? In which case, perhaps its not worth
>> adding the batching optimization after all?
>>
>> Thanks,
>> Ryan
>>
  
David Hildenbrand March 1, 2024, 5 p.m. UTC | #19
On 01.03.24 17:44, Ryan Roberts wrote:
> On 01/03/2024 16:31, Matthew Wilcox wrote:
>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>> correct. The only problem is that during testing I can't provoke the code to
>>> take the path. I've been pouring through the code but struggling to figure out
>>> under what situation you would expect the swap entry passed to
>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>
>>> This is the original (unbatched) function, after my change, which caused David's
>>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>>
>>> int free_swap_and_cache(swp_entry_t entry)
>>> {
>>> 	struct swap_info_struct *p;
>>> 	unsigned char count;
>>>
>>> 	if (non_swap_entry(entry))
>>> 		return 1;
>>>
>>> 	p = _swap_info_get(entry);
>>> 	if (p) {
>>> 		count = __swap_entry_free(p, entry);
>>> 		if (count == SWAP_HAS_CACHE)
>>> 			__try_to_reclaim_swap(p, swp_offset(entry),
>>> 					      TTRS_UNMAPPED | TTRS_FULL);
>>> 	}
>>> 	return p != NULL;
>>> }
>>>
>>> The trouble is, whenever its called, count is always 0, so
>>> __try_to_reclaim_swap() never gets called.
>>>
>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
>>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
>>> function to be called for every PTE, but count is always 0 after
>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>
>> I think you have to page it back in again, then it will have an entry in
>> the swap cache.  Maybe.  I know little about anon memory ;-)
> 
> Ahh, I was under the impression that the original folio is put into the swap
> cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
> I'm miles out... what exactly is the lifecycle of a folio going through swap out?

I thought with most (disk) backends you will add it to the swapcache and 
leave it there until there is actual memory pressure. Only then, under 
memory pressure, you'd actually reclaim the folio.

You can fault it back in from the swapcache without having to go to disk.

That's how you can today end up with a THP in the swapcache: during 
swapin from disk (after the folio was reclaimed) you'd currently only 
get order-0 folios.

At least that was my assumption with my MADV_PAGEOUT testing so far :)
  
Ryan Roberts March 1, 2024, 5:06 p.m. UTC | #20
On 01/03/2024 16:44, Ryan Roberts wrote:
> On 01/03/2024 16:31, Matthew Wilcox wrote:
>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>> correct. The only problem is that during testing I can't provoke the code to
>>> take the path. I've been pouring through the code but struggling to figure out
>>> under what situation you would expect the swap entry passed to
>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>
>>> This is the original (unbatched) function, after my change, which caused David's
>>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>>
>>> int free_swap_and_cache(swp_entry_t entry)
>>> {
>>> 	struct swap_info_struct *p;
>>> 	unsigned char count;
>>>
>>> 	if (non_swap_entry(entry))
>>> 		return 1;
>>>
>>> 	p = _swap_info_get(entry);
>>> 	if (p) {
>>> 		count = __swap_entry_free(p, entry);
>>> 		if (count == SWAP_HAS_CACHE)
>>> 			__try_to_reclaim_swap(p, swp_offset(entry),
>>> 					      TTRS_UNMAPPED | TTRS_FULL);
>>> 	}
>>> 	return p != NULL;
>>> }
>>>
>>> The trouble is, whenever its called, count is always 0, so
>>> __try_to_reclaim_swap() never gets called.
>>>
>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
>>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
>>> function to be called for every PTE, but count is always 0 after
>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>
>> I think you have to page it back in again, then it will have an entry in
>> the swap cache.  Maybe.  I know little about anon memory ;-)
> 
> Ahh, I was under the impression that the original folio is put into the swap
> cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
> 
> I guess I can try forking after swap out, then fault it back in in the child and
> exit. Then do the munmap in the parent. I guess that could force it? Thanks for
> the tip - I'll have a play.

That has sort of solved it, the only problem now is that all the folios in the
swap cache are small (because I don't have Barry's large swap-in series). So
really I need to figure out how to avoid removing the folio from the cache in
the first place...

> 
>>
>> If that doesn't work, perhaps use tmpfs, and use some memory pressure to
>> force that to swap?
>>
>>> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
>>> using a block ram device as my backing store - I think this does synchronous IO
>>> so perhaps if I have a real block device with async IO I might have more luck?
>>> Just a guess...
>>>
>>> Or perhaps this code path is a corner case? In which case, perhaps its not worth
>>> adding the batching optimization after all?
>>>
>>> Thanks,
>>> Ryan
>>>
>
  
Ryan Roberts March 1, 2024, 5:14 p.m. UTC | #21
On 01/03/2024 17:00, David Hildenbrand wrote:
> On 01.03.24 17:44, Ryan Roberts wrote:
>> On 01/03/2024 16:31, Matthew Wilcox wrote:
>>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>>> correct. The only problem is that during testing I can't provoke the code to
>>>> take the path. I've been pouring through the code but struggling to figure out
>>>> under what situation you would expect the swap entry passed to
>>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>>
>>>> This is the original (unbatched) function, after my change, which caused
>>>> David's
>>>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>>>
>>>> int free_swap_and_cache(swp_entry_t entry)
>>>> {
>>>>     struct swap_info_struct *p;
>>>>     unsigned char count;
>>>>
>>>>     if (non_swap_entry(entry))
>>>>         return 1;
>>>>
>>>>     p = _swap_info_get(entry);
>>>>     if (p) {
>>>>         count = __swap_entry_free(p, entry);
>>>>         if (count == SWAP_HAS_CACHE)
>>>>             __try_to_reclaim_swap(p, swp_offset(entry),
>>>>                           TTRS_UNMAPPED | TTRS_FULL);
>>>>     }
>>>>     return p != NULL;
>>>> }
>>>>
>>>> The trouble is, whenever its called, count is always 0, so
>>>> __try_to_reclaim_swap() never gets called.
>>>>
>>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT)
>>>> over
>>>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause
>>>> this
>>>> function to be called for every PTE, but count is always 0 after
>>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>>
>>> I think you have to page it back in again, then it will have an entry in
>>> the swap cache.  Maybe.  I know little about anon memory ;-)
>>
>> Ahh, I was under the impression that the original folio is put into the swap
>> cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
>> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
> 
> I thought with most (disk) backends you will add it to the swapcache and leave
> it there until there is actual memory pressure. Only then, under memory
> pressure, you'd actually reclaim the folio.

OK, my problem is that I'm using a VM, whose disk shows up as rotating media, so
the swap subsystem refuses to swap out THPs to that and they get split. To solve
that, (and to speed up testing) I moved to the block ram disk, which convinces
swap to swap-out THPs. But that causes the folios to be removed from the swap
cache (I assumed because its syncrhonous, but maybe there is a flag somewhere to
affect that behavior?) And I can't convince QEMU to emulate an SSD to the guest
under macos. Perhaps the easiest thing is to hack it to ignore the rotating
media flag.

> 
> You can fault it back in from the swapcache without having to go to disk.
> 
> That's how you can today end up with a THP in the swapcache: during swapin from
> disk (after the folio was reclaimed) you'd currently only get order-0 folios.
> 
> At least that was my assumption with my MADV_PAGEOUT testing so far :)
>
  
David Hildenbrand March 1, 2024, 5:18 p.m. UTC | #22
On 01.03.24 18:14, Ryan Roberts wrote:
> On 01/03/2024 17:00, David Hildenbrand wrote:
>> On 01.03.24 17:44, Ryan Roberts wrote:
>>> On 01/03/2024 16:31, Matthew Wilcox wrote:
>>>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>>>> correct. The only problem is that during testing I can't provoke the code to
>>>>> take the path. I've been pouring through the code but struggling to figure out
>>>>> under what situation you would expect the swap entry passed to
>>>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>>>
>>>>> This is the original (unbatched) function, after my change, which caused
>>>>> David's
>>>>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>>>>
>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>> {
>>>>>      struct swap_info_struct *p;
>>>>>      unsigned char count;
>>>>>
>>>>>      if (non_swap_entry(entry))
>>>>>          return 1;
>>>>>
>>>>>      p = _swap_info_get(entry);
>>>>>      if (p) {
>>>>>          count = __swap_entry_free(p, entry);
>>>>>          if (count == SWAP_HAS_CACHE)
>>>>>              __try_to_reclaim_swap(p, swp_offset(entry),
>>>>>                            TTRS_UNMAPPED | TTRS_FULL);
>>>>>      }
>>>>>      return p != NULL;
>>>>> }
>>>>>
>>>>> The trouble is, whenever its called, count is always 0, so
>>>>> __try_to_reclaim_swap() never gets called.
>>>>>
>>>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT)
>>>>> over
>>>>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause
>>>>> this
>>>>> function to be called for every PTE, but count is always 0 after
>>>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>>>
>>>> I think you have to page it back in again, then it will have an entry in
>>>> the swap cache.  Maybe.  I know little about anon memory ;-)
>>>
>>> Ahh, I was under the impression that the original folio is put into the swap
>>> cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
>>> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
>>
>> I thought with most (disk) backends you will add it to the swapcache and leave
>> it there until there is actual memory pressure. Only then, under memory
>> pressure, you'd actually reclaim the folio.
> 
> OK, my problem is that I'm using a VM, whose disk shows up as rotating media, so
> the swap subsystem refuses to swap out THPs to that and they get split. To solve
> that, (and to speed up testing) I moved to the block ram disk, which convinces
> swap to swap-out THPs. But that causes the folios to be removed from the swap
> cache (I assumed because its syncrhonous, but maybe there is a flag somewhere to
> affect that behavior?) And I can't convince QEMU to emulate an SSD to the guest
> under macos. Perhaps the easiest thing is to hack it to ignore the rotating
> media flag.

I'm trying to remember how I triggered it in the past, I thought cow.c 
selftest was able to do that.

What certainly works is taking a reference on the page using vmsplice() 
and then doing the MADV_PAGEOUT. But there has to be a better way :)

I'll dig on Monday!
  
Barry Song March 4, 2024, 4:52 a.m. UTC | #23
On Sat, Mar 2, 2024 at 6:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 01/03/2024 16:44, Ryan Roberts wrote:
> > On 01/03/2024 16:31, Matthew Wilcox wrote:
> >> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
> >>> I've implemented the batching as David suggested, and I'm pretty confident it's
> >>> correct. The only problem is that during testing I can't provoke the code to
> >>> take the path. I've been pouring through the code but struggling to figure out
> >>> under what situation you would expect the swap entry passed to
> >>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> >>>
> >>> This is the original (unbatched) function, after my change, which caused David's
> >>> concern that we would end up calling __try_to_reclaim_swap() far too much:
> >>>
> >>> int free_swap_and_cache(swp_entry_t entry)
> >>> {
> >>>     struct swap_info_struct *p;
> >>>     unsigned char count;
> >>>
> >>>     if (non_swap_entry(entry))
> >>>             return 1;
> >>>
> >>>     p = _swap_info_get(entry);
> >>>     if (p) {
> >>>             count = __swap_entry_free(p, entry);
> >>>             if (count == SWAP_HAS_CACHE)
> >>>                     __try_to_reclaim_swap(p, swp_offset(entry),
> >>>                                           TTRS_UNMAPPED | TTRS_FULL);
> >>>     }
> >>>     return p != NULL;
> >>> }
> >>>
> >>> The trouble is, whenever its called, count is always 0, so
> >>> __try_to_reclaim_swap() never gets called.
> >>>
> >>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> >>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> >>> function to be called for every PTE, but count is always 0 after
> >>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> >>> order-0 as well as PTE- and PMD-mapped 2M THP.
> >>
> >> I think you have to page it back in again, then it will have an entry in
> >> the swap cache.  Maybe.  I know little about anon memory ;-)
> >
> > Ahh, I was under the impression that the original folio is put into the swap
> > cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
> > I'm miles out... what exactly is the lifecycle of a folio going through swap out?
> >
> > I guess I can try forking after swap out, then fault it back in in the child and
> > exit. Then do the munmap in the parent. I guess that could force it? Thanks for
> > the tip - I'll have a play.
>
> That has sort of solved it, the only problem now is that all the folios in the
> swap cache are small (because I don't have Barry's large swap-in series). So
> really I need to figure out how to avoid removing the folio from the cache in
> the first place...

I am quite sure we have a chance to hit a large swapcache even using zRAM -
a sync swapfile and even during swap-out.

I have a test case as below,
1. two threads to run MADV_PAGEOUT
2. two threads to read data being swapped-out

in do_swap_page, from time to time, I can get a large swapcache.

We have a short time window after add_to_swap() and before
__removing_mapping() of
vmscan,  a large folio is still in swapcache.

So Ryan, I guess you can trigger this by adding one more thread of
MADV_DONTNEED to do zap_pte_range?


>
> >
> >>
> >> If that doesn't work, perhaps use tmpfs, and use some memory pressure to
> >> force that to swap?
> >>
> >>> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> >>> using a block ram device as my backing store - I think this does synchronous IO
> >>> so perhaps if I have a real block device with async IO I might have more luck?
> >>> Just a guess...
> >>>
> >>> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> >>> adding the batching optimization after all?
> >>>
> >>> Thanks,
> >>> Ryan
> >>>
> >

Thanks
Barry
  
Barry Song March 4, 2024, 5:42 a.m. UTC | #24
On Mon, Mar 4, 2024 at 5:52 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 2, 2024 at 6:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 01/03/2024 16:44, Ryan Roberts wrote:
> > > On 01/03/2024 16:31, Matthew Wilcox wrote:
> > >> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
> > >>> I've implemented the batching as David suggested, and I'm pretty confident it's
> > >>> correct. The only problem is that during testing I can't provoke the code to
> > >>> take the path. I've been pouring through the code but struggling to figure out
> > >>> under what situation you would expect the swap entry passed to
> > >>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> > >>>
> > >>> This is the original (unbatched) function, after my change, which caused David's
> > >>> concern that we would end up calling __try_to_reclaim_swap() far too much:
> > >>>
> > >>> int free_swap_and_cache(swp_entry_t entry)
> > >>> {
> > >>>     struct swap_info_struct *p;
> > >>>     unsigned char count;
> > >>>
> > >>>     if (non_swap_entry(entry))
> > >>>             return 1;
> > >>>
> > >>>     p = _swap_info_get(entry);
> > >>>     if (p) {
> > >>>             count = __swap_entry_free(p, entry);
> > >>>             if (count == SWAP_HAS_CACHE)
> > >>>                     __try_to_reclaim_swap(p, swp_offset(entry),
> > >>>                                           TTRS_UNMAPPED | TTRS_FULL);
> > >>>     }
> > >>>     return p != NULL;
> > >>> }
> > >>>
> > >>> The trouble is, whenever its called, count is always 0, so
> > >>> __try_to_reclaim_swap() never gets called.
> > >>>
> > >>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> > >>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> > >>> function to be called for every PTE, but count is always 0 after
> > >>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> > >>> order-0 as well as PTE- and PMD-mapped 2M THP.
> > >>
> > >> I think you have to page it back in again, then it will have an entry in
> > >> the swap cache.  Maybe.  I know little about anon memory ;-)
> > >
> > > Ahh, I was under the impression that the original folio is put into the swap
> > > cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
> > > I'm miles out... what exactly is the lifecycle of a folio going through swap out?
> > >
> > > I guess I can try forking after swap out, then fault it back in in the child and
> > > exit. Then do the munmap in the parent. I guess that could force it? Thanks for
> > > the tip - I'll have a play.
> >
> > That has sort of solved it, the only problem now is that all the folios in the
> > swap cache are small (because I don't have Barry's large swap-in series). So
> > really I need to figure out how to avoid removing the folio from the cache in
> > the first place...
>
> I am quite sure we have a chance to hit a large swapcache even using zRAM -
> a sync swapfile and even during swap-out.
>
> I have a test case as below,
> 1. two threads to run MADV_PAGEOUT
> 2. two threads to read data being swapped-out
>
> in do_swap_page, from time to time, I can get a large swapcache.
>
> We have a short time window after add_to_swap() and before
> __removing_mapping() of
> vmscan,  a large folio is still in swapcache.
>
> So Ryan, I guess you can trigger this by adding one more thread of
> MADV_DONTNEED to do zap_pte_range?

Ryan, I have modified my test case to have 4 threads:
1. MADV_PAGEOUT
2. MADV_DONTNEED
3. write data
4. read data

and git push the code here so that you can get it,
https://github.com/BarrySong666/swaptest/blob/main/swptest.c

I can reproduce the issue in zap_pte_range() in just a couple of minutes.

>
>
> >
> > >
> > >>
> > >> If that doesn't work, perhaps use tmpfs, and use some memory pressure to
> > >> force that to swap?
> > >>
> > >>> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> > >>> using a block ram device as my backing store - I think this does synchronous IO
> > >>> so perhaps if I have a real block device with async IO I might have more luck?
> > >>> Just a guess...
> > >>>
> > >>> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> > >>> adding the batching optimization after all?
> > >>>
> > >>> Thanks,
> > >>> Ryan
> > >>>
> > >

Thanks
Barry
  
Ryan Roberts March 4, 2024, 4:03 p.m. UTC | #25
On 28/02/2024 12:12, David Hildenbrand wrote:
>>> How relevant is it? Relevant enough that someone decided to put that
>>> optimization in? I don't know :)
>>
>> I'll have one last go at convincing you: Huang Ying (original author) commented
>> "I believe this should be OK.  Better to compare the performance too." at [1].
>> That implies to me that perhaps the optimization wasn't in response to a
>> specific problem after all. Do you have any thoughts, Huang?
> 
> Might make sense to include that in the patch description!
> 
>> OK so if we really do need to keep this optimization, here are some ideas:
>>
>> Fundamentally, we would like to be able to figure out the size of the swap slot
>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>> to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>> cluster will contain only one size of THPs, but this is not the case when a THP
>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>> cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>> time it will be the full size of the swap entry, but sometimes it will cover
>> only a portion. In the latter case you may see a false negative for
>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>
>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>> precise information and is conceptually simpler to understand, but will cost
>> more memory (half as much as the initial swap_map[] again).
>>
>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>> prototyping.
> 
> Taking a step back: what about we simply batch unmapping of swap entries?
> 
> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
> lock) that reference consecutive swap offsets in the same swap file.
> 
> There, we can then first decrement all the swap counts, and then try minimizing
> how often we actually have to try reclaiming swap space (lookup folio, see it's
> a large folio that we cannot reclaim or could reclaim, ...).
> 
> Might need some fine-tuning in swap code to "advance" to the next entry to try
> freeing up, but we certainly can do better than what we would do right now.
> 

Hi,

I'm struggling to convince myself that free_swap_and_cache() can't race with
with swapoff(). Can anyone explain that this is safe?

I *think* they are both serialized by the PTL, since all callers of
free_swap_and_cache() (except shmem) have the PTL, and swapoff() calls
try_to_unuse() early on, which takes the PTL as it iterates over every vma in
every mm. It looks like shmem is handled specially by a call to shmem_unuse(),
but I can't see the exact serialization mechanism.

I've implemented a batching function, as David suggested above, but I'm trying
to convince myself that it is safe for it to access si->swap_map[] without a
lock (i.e. that swapoff() can't concurrently free it). But I think
free_swap_and_cache() as it already exists depends on being able to access the
si without an explicit lock, so I'm assuming the same mechanism will protect my
new changes. But I want to be sure I understand the mechanism...


This is the existing free_swap_and_cache(). I think _swap_info_get() would break
if this could race with swapoff(), and __swap_entry_free() looks up the cluster
from an array, which would also be freed by swapoff if racing:

int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned char count;

	if (non_swap_entry(entry))
		return 1;

	p = _swap_info_get(entry);
	if (p) {
		count = __swap_entry_free(p, entry);
		if (count == SWAP_HAS_CACHE)
			__try_to_reclaim_swap(p, swp_offset(entry),
					      TTRS_UNMAPPED | TTRS_FULL);
	}
	return p != NULL;
}


This is my new function. I want to be sure that it's safe to do the
READ_ONCE(si->swap_info[...]):

void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
	unsigned long end = swp_offset(entry) + nr;
	unsigned type = swp_type(entry);
	struct swap_info_struct *si;
	unsigned long offset;

	if (non_swap_entry(entry))
		return;

	si = _swap_info_get(entry);
	if (!si || end > si->max)
		return;

	/*
	 * First free all entries in the range.
	 */
	for (offset = swp_offset(entry); offset < end; offset++) {
		VM_WARN_ON(data_race(!si->swap_map[offset]));
		__swap_entry_free(si, swp_entry(type, offset));
	}

	/*
	 * Now go back over the range trying to reclaim the swap cache. This is
	 * more efficient for large folios because we will only try to reclaim
	 * the swap once per folio in the common case. If we do
	 * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
	 * latter will get a reference and lock the folio for every individual
	 * page but will only succeed once the swap slot for every subpage is
	 * zero.
	 */
	for (offset = swp_offset(entry); offset < end; offset += nr) {
		nr = 1;
		if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { << HERE
			/*
			 * Folios are always naturally aligned in swap so
			 * advance forward to the next boundary. Zero means no
			 * folio was found for the swap entry, so advance by 1
			 * in this case. Negative value means folio was found
			 * but could not be reclaimed. Here we can still advance
			 * to the next boundary.
			 */
			nr = __try_to_reclaim_swap(si, offset,
					      TTRS_UNMAPPED | TTRS_FULL);
			if (nr == 0)
				nr = 1;
			else if (nr < 0)
				nr = -nr;
			nr = ALIGN(offset + 1, nr) - offset;
		}
	}
}

Thanks,
Ryan
  
David Hildenbrand March 4, 2024, 5:30 p.m. UTC | #26
On 04.03.24 17:03, Ryan Roberts wrote:
> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>> How relevant is it? Relevant enough that someone decided to put that
>>>> optimization in? I don't know :)
>>>
>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>> That implies to me that perhaps the optimization wasn't in response to a
>>> specific problem after all. Do you have any thoughts, Huang?
>>
>> Might make sense to include that in the patch description!
>>
>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>
>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>>> to mark it as PMD_SIZE.
>>>
>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>>> cases to be rare.
>>>
>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>> time it will be the full size of the swap entry, but sometimes it will cover
>>> only a portion. In the latter case you may see a false negative for
>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>
>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>> precise information and is conceptually simpler to understand, but will cost
>>> more memory (half as much as the initial swap_map[] again).
>>>
>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>> prototyping.
>>
>> Taking a step back: what about we simply batch unmapping of swap entries?
>>
>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>> lock) that reference consecutive swap offsets in the same swap file.
>>
>> There, we can then first decrement all the swap counts, and then try minimizing
>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>> a large folio that we cannot reclaim or could reclaim, ...).
>>
>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>> freeing up, but we certainly can do better than what we would do right now.
>>
> 
> Hi,
> 
> I'm struggling to convince myself that free_swap_and_cache() can't race with
> with swapoff(). Can anyone explain that this is safe?
> 
> I *think* they are both serialized by the PTL, since all callers of
> free_swap_and_cache() (except shmem) have the PTL, and swapoff() calls
> try_to_unuse() early on, which takes the PTL as it iterates over every vma in
> every mm. It looks like shmem is handled specially by a call to shmem_unuse(),
> but I can't see the exact serialization mechanism.

As get_swap_device() documents:

"if there aren't some other ways to prevent swapoff, such as the folio 
in swap cache is locked, page table lock is held, etc., the swap entry 
may become invalid because of swapoff"

PTL it is, in theory. But I'm afraid that's half the story.

> 
> I've implemented a batching function, as David suggested above, but I'm trying
> to convince myself that it is safe for it to access si->swap_map[] without a
> lock (i.e. that swapoff() can't concurrently free it). But I think
> free_swap_and_cache() as it already exists depends on being able to access the
> si without an explicit lock, so I'm assuming the same mechanism will protect my
> new changes. But I want to be sure I understand the mechanism...

Very valid concern.

> 
> 
> This is the existing free_swap_and_cache(). I think _swap_info_get() would break
> if this could race with swapoff(), and __swap_entry_free() looks up the cluster
> from an array, which would also be freed by swapoff if racing:
> 
> int free_swap_and_cache(swp_entry_t entry)
> {
> 	struct swap_info_struct *p;
> 	unsigned char count;
> 
> 	if (non_swap_entry(entry))
> 		return 1;
> 
> 	p = _swap_info_get(entry);
> 	if (p) {
> 		count = __swap_entry_free(p, entry);

If count dropped to 0 and

> 		if (count == SWAP_HAS_CACHE)


count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We 
removed it. That one would have to be reclaimed asynchronously.

The existing code we would call swap_page_trans_huge_swapped() with the 
SI it obtained via _swap_info_get().

I also don't see what should be left protecting the SI. It's not locked 
anymore, the swapcounts are at 0. We don't hold the folio lock.

try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...

Would performing the overall operation under lock_cluster_or_swap_info 
help? Not so sure :(
  

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 19f30a29e1f1..a073366a227c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@  struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
-#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
@@ -595,15 +594,6 @@  static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
-#ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
-#else
-static inline int split_swap_cluster(swp_entry_t entry)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f31f02472396..b411dd4f1612 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2598,9 +2598,6 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 		shmem_uncharge(head->mapping->host, nr_dropped);
 	remap_page(folio, nr);
 
-	if (folio_test_swapcache(folio))
-		split_swap_cluster(folio->swap);
-
 	for (i = 0; i < nr; i++) {
 		struct page *subpage = head + i;
 		if (subpage == page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e52f486834eb..b83ad77e04c0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -342,18 +342,6 @@  static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
-static inline bool cluster_is_huge(struct swap_cluster_info *info)
-{
-	if (IS_ENABLED(CONFIG_THP_SWAP))
-		return info->flags & CLUSTER_FLAG_HUGE;
-	return false;
-}
-
-static inline void cluster_clear_huge(struct swap_cluster_info *info)
-{
-	info->flags &= ~CLUSTER_FLAG_HUGE;
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -1021,7 +1009,7 @@  static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
+	cluster_set_count(ci, SWAPFILE_CLUSTER);
 
 	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
 	unlock_cluster(ci);
@@ -1354,7 +1342,6 @@  void put_swap_folio(struct folio *folio, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
 		map = si->swap_map + offset;
 		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 			val = map[i];
@@ -1362,7 +1349,6 @@  void put_swap_folio(struct folio *folio, swp_entry_t entry)
 			if (val == SWAP_HAS_CACHE)
 				free_entries++;
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
@@ -1384,23 +1370,6 @@  void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster_or_swap_info(si, ci);
 }
 
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = _swap_info_get(entry);
-	if (!si)
-		return -EBUSY;
-	ci = lock_cluster(si, offset);
-	cluster_clear_huge(ci);
-	unlock_cluster(ci);
-	return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -1508,22 +1477,23 @@  int swp_swapcount(swp_entry_t entry)
 }
 
 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry)
+					 swp_entry_t entry,
+					 unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
 	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	unsigned long offset = round_down(roffset, nr_pages);
 	int i;
 	bool ret = false;
 
 	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || !cluster_is_huge(ci)) {
+	if (!ci || nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
 	}
-	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (swap_count(map[offset + i])) {
 			ret = true;
 			break;
@@ -1545,7 +1515,7 @@  static bool folio_swapped(struct folio *folio)
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_swapcount(si, entry) != 0;
 
-	return swap_page_trans_huge_swapped(si, entry);
+	return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
 }
 
 /**
@@ -1606,8 +1576,7 @@  int free_swap_and_cache(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE &&
-		    !swap_page_trans_huge_swapped(p, entry))
+		if (count == SWAP_HAS_CACHE)
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
 	}