[RFC,v1,2/2] mm: swap: Swap-out small-sized THP without splitting

Message ID 20231010142111.3997780-3-ryan.roberts@arm.com
State New
Headers
Series Swap-out small-sized THP without splitting |

Commit Message

Ryan Roberts Oct. 10, 2023, 2:21 p.m. UTC
  The upcoming anonymous small-sized THP feature enables performance
improvements by allocating large folios for anonymous memory. However
I've observed that on an arm64 system running a parallel workload (e.g.
kernel compilation) across many cores, under high memory pressure, the
speed regresses. This is due to bottlenecking on the increased number of
TLBIs added due to all the extra folio splitting.

Therefore, solve this regression by adding support for swapping out
small-sized THP without needing to split the folio, just like is already
done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
enabled, and when the swap backing store is a non-rotating block device
- these are the same constraints as for the existing PMD-sized THP
swap-out support.

Note that no attempt is made to swap-in THP here - this is still done
page-by-page, like for PMD-sized THP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [4, (1
<< PMD_ORDER)]. This is done by allocating a cluster for each distinct
order and allocating sequentially from it until the cluster is full.
This ensures that we don't need to search the map and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt to
allocate a free cluster from the list. If there are no free clusters, we
fail the allocation and the caller falls back to splitting the folio and
allocates individual entries (as per existing PMD-sized THP fallback).

As far as I can tell, this should not cause any extra fragmentation
concerns, given how similar it is to the existing PMD-sized THP
allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
concurrent use though. In practice, the number of orders in use will be
small though.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h |  7 ++++++
 mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
 mm/vmscan.c          | 10 +++++---
 3 files changed, 59 insertions(+), 18 deletions(-)

--
2.25.1
  

Comments

Ryan Roberts Oct. 11, 2023, 7:44 a.m. UTC | #1
On 10/10/2023 15:21, Ryan Roberts wrote:
> The upcoming anonymous small-sized THP feature enables performance
> improvements by allocating large folios for anonymous memory. However
> I've observed that on an arm64 system running a parallel workload (e.g.
> kernel compilation) across many cores, under high memory pressure, the
> speed regresses. This is due to bottlenecking on the increased number of
> TLBIs added due to all the extra folio splitting.
> 
> Therefore, solve this regression by adding support for swapping out
> small-sized THP without needing to split the folio, just like is already
> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
> enabled, and when the swap backing store is a non-rotating block device
> - these are the same constraints as for the existing PMD-sized THP
> swap-out support.
> 
> Note that no attempt is made to swap-in THP here - this is still done
> page-by-page, like for PMD-sized THP.
> 
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [4, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller falls back to splitting the folio and
> allocates individual entries (as per existing PMD-sized THP fallback).
> 
> As far as I can tell, this should not cause any extra fragmentation
> concerns, given how similar it is to the existing PMD-sized THP
> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
> concurrent use though. In practice, the number of orders in use will be
> small though.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  7 ++++++
>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>  mm/vmscan.c          | 10 +++++---
>  3 files changed, 59 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a073366a227c..fc55b760aeff 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -320,6 +320,13 @@ struct swap_info_struct {
>  					 */
>  	struct work_struct discard_work; /* discard worker */
>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
> +	unsigned int large_next[PMD_ORDER]; /*

Oh dear, I've tripped over this twice in a week now... PMD_ORDER is not a
compile-time const on powerpc, so results in build fail. This would have to be
allocated at init time, I guess.


> +					     * next free offset within current
> +					     * allocation cluster for large
> +					     * folios, or UINT_MAX if no current
> +					     * cluster. Index is (order - 1).
> +					     * Only when cluster_info is used.
> +					     */
>  	struct plist_node avail_lists[]; /*
>  					   * entries in swap_avail_heads, one
>  					   * entry per node.
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index c668838fa660..f8093dedc866 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -987,8 +987,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return n_ret;
>  }
> 
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
> +			    unsigned int nr_pages)
>  {
> +	int order;
>  	unsigned long idx;
>  	struct swap_cluster_info *ci;
>  	unsigned long offset;
> @@ -1002,20 +1004,47 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>  		return 0;
>  	}
> 
> -	if (cluster_list_empty(&si->free_clusters))
> -		return 0;
> +	VM_WARN_ON(nr_pages < 2);
> +	VM_WARN_ON(nr_pages > SWAPFILE_CLUSTER);
> +	VM_WARN_ON(!is_power_of_2(nr_pages));
> 
> -	idx = cluster_list_first(&si->free_clusters);
> -	offset = idx * SWAPFILE_CLUSTER;
> -	ci = lock_cluster(si, offset);
> -	alloc_cluster(si, idx);
> -	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
> +	order = ilog2(nr_pages);
> +	offset = si->large_next[order - 1];
> +
> +	if (offset == UINT_MAX) {
> +		if (cluster_list_empty(&si->free_clusters))
> +			return 0;
> 
> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> +		idx = cluster_list_first(&si->free_clusters);
> +		offset = idx * SWAPFILE_CLUSTER;
> +
> +		ci = lock_cluster(si, offset);
> +		alloc_cluster(si, idx);
> +		cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
> +
> +		/*
> +		 * If scan_swap_map_slots() can't find a free cluster, it will
> +		 * check si->swap_map directly. To make sure this standby
> +		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
> +		 * entries bad (occupied). (same approach as discard).
> +		 */
> +		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
> +			SWAPFILE_CLUSTER - nr_pages);
> +	} else {
> +		idx = offset / SWAPFILE_CLUSTER;
> +		ci = lock_cluster(si, offset);
> +	}
> +
> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>  	unlock_cluster(ci);
> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> +	swap_range_alloc(si, offset, nr_pages);
>  	*slot = swp_entry(si->type, offset);
> 
> +	offset += nr_pages;
> +	if (idx != offset / SWAPFILE_CLUSTER)
> +		offset = UINT_MAX;
> +	si->large_next[order - 1] = offset;
> +
>  	return 1;
>  }
> 
> @@ -1041,7 +1070,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  	int node;
> 
>  	/* Only single cluster request supported */
> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> +	WARN_ON_ONCE(n_goal > 1 && size > 1);
> 
>  	spin_lock(&swap_avail_lock);
> 
> @@ -1078,14 +1107,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  			spin_unlock(&si->lock);
>  			goto nextsi;
>  		}
> -		if (size == SWAPFILE_CLUSTER) {
> +		if (size > 1) {
>  			if (si->flags & SWP_BLKDEV)
> -				n_ret = swap_alloc_cluster(si, swp_entries);
> +				n_ret = swap_alloc_large(si, swp_entries, size);
>  		} else
>  			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>  						    n_goal, swp_entries);
>  		spin_unlock(&si->lock);
> -		if (n_ret || size == SWAPFILE_CLUSTER)
> +		if (n_ret || size > 1)
>  			goto check_out;
>  		cond_resched();
> 
> @@ -2725,6 +2754,9 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	spin_lock_init(&p->cont_lock);
>  	init_completion(&p->comp);
> 
> +	for (i = 0; i < ARRAY_SIZE(p->large_next); i++)
> +		p->large_next[i] = UINT_MAX;
> +
>  	return p;
>  }
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c16e2b1ea8ae..5984d2ae4547 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  					if (!can_split_folio(folio, NULL))
>  						goto activate_locked;
>  					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split PMD-mappable folios without a
> +					 * PMD map right away. Chances are some
> +					 * or all of the tail pages can be freed
> +					 * without IO.
>  					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (folio_test_pmd_mappable(folio) &&
> +					    !folio_entire_mapcount(folio) &&
>  					    split_folio_to_list(folio,
>  								folio_list))
>  						goto activate_locked;
> --
> 2.25.1
>
  
Huang, Ying Oct. 11, 2023, 8:25 a.m. UTC | #2
Ryan Roberts <ryan.roberts@arm.com> writes:

> The upcoming anonymous small-sized THP feature enables performance
> improvements by allocating large folios for anonymous memory. However
> I've observed that on an arm64 system running a parallel workload (e.g.
> kernel compilation) across many cores, under high memory pressure, the
> speed regresses. This is due to bottlenecking on the increased number of
> TLBIs added due to all the extra folio splitting.
>
> Therefore, solve this regression by adding support for swapping out
> small-sized THP without needing to split the folio, just like is already
> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
> enabled, and when the swap backing store is a non-rotating block device
> - these are the same constraints as for the existing PMD-sized THP
> swap-out support.
>
> Note that no attempt is made to swap-in THP here - this is still done
> page-by-page, like for PMD-sized THP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [4, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller falls back to splitting the folio and
> allocates individual entries (as per existing PMD-sized THP fallback).
>
> As far as I can tell, this should not cause any extra fragmentation
> concerns, given how similar it is to the existing PMD-sized THP
> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
> concurrent use though. In practice, the number of orders in use will be
> small though.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  7 ++++++
>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>  mm/vmscan.c          | 10 +++++---
>  3 files changed, 59 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a073366a227c..fc55b760aeff 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -320,6 +320,13 @@ struct swap_info_struct {
>  					 */
>  	struct work_struct discard_work; /* discard worker */
>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
> +	unsigned int large_next[PMD_ORDER]; /*
> +					     * next free offset within current
> +					     * allocation cluster for large
> +					     * folios, or UINT_MAX if no current
> +					     * cluster. Index is (order - 1).
> +					     * Only when cluster_info is used.
> +					     */

I think that it is better to make this per-CPU.  That is, extend the
percpu_cluster mechanism.  Otherwise, we may have scalability issue.

And this should be enclosed in CONFIG_THP_SWAP.

>  	struct plist_node avail_lists[]; /*
>  					   * entries in swap_avail_heads, one
>  					   * entry per node.

--
Best Regards,
Huang, Ying
  
Ryan Roberts Oct. 11, 2023, 10:36 a.m. UTC | #3
On 11/10/2023 09:25, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> The upcoming anonymous small-sized THP feature enables performance
>> improvements by allocating large folios for anonymous memory. However
>> I've observed that on an arm64 system running a parallel workload (e.g.
>> kernel compilation) across many cores, under high memory pressure, the
>> speed regresses. This is due to bottlenecking on the increased number of
>> TLBIs added due to all the extra folio splitting.
>>
>> Therefore, solve this regression by adding support for swapping out
>> small-sized THP without needing to split the folio, just like is already
>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>> enabled, and when the swap backing store is a non-rotating block device
>> - these are the same constraints as for the existing PMD-sized THP
>> swap-out support.
>>
>> Note that no attempt is made to swap-in THP here - this is still done
>> page-by-page, like for PMD-sized THP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [4, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller falls back to splitting the folio and
>> allocates individual entries (as per existing PMD-sized THP fallback).
>>
>> As far as I can tell, this should not cause any extra fragmentation
>> concerns, given how similar it is to the existing PMD-sized THP
>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>> concurrent use though. In practice, the number of orders in use will be
>> small though.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/swap.h |  7 ++++++
>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>  mm/vmscan.c          | 10 +++++---
>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index a073366a227c..fc55b760aeff 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>  					 */
>>  	struct work_struct discard_work; /* discard worker */
>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>> +	unsigned int large_next[PMD_ORDER]; /*
>> +					     * next free offset within current
>> +					     * allocation cluster for large
>> +					     * folios, or UINT_MAX if no current
>> +					     * cluster. Index is (order - 1).
>> +					     * Only when cluster_info is used.
>> +					     */
> 
> I think that it is better to make this per-CPU.  That is, extend the
> percpu_cluster mechanism.  Otherwise, we may have scalability issue.

Is your concern that the swap_info spinlock will get too contended as its
currently written? From briefly looking at percpu_cluster, it looks like that
spinlock is always held when accessing the per-cpu structures - presumably
that's what's disabling preemption and making sure the thread is not migrated?
So I'm not sure what the benefit is currently? Surely you want to just disable
preemption but not hold the lock? I'm sure I've missed something crucial...

> 
> And this should be enclosed in CONFIG_THP_SWAP.

Yes, I'll fix this in the next version.

Thanks for the review!

> 
>>  	struct plist_node avail_lists[]; /*
>>  					   * entries in swap_avail_heads, one
>>  					   * entry per node.
> 
> --
> Best Regards,
> Huang, Ying
  
Ryan Roberts Oct. 11, 2023, 5:14 p.m. UTC | #4
On 11/10/2023 11:36, Ryan Roberts wrote:
> On 11/10/2023 09:25, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> The upcoming anonymous small-sized THP feature enables performance
>>> improvements by allocating large folios for anonymous memory. However
>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>> kernel compilation) across many cores, under high memory pressure, the
>>> speed regresses. This is due to bottlenecking on the increased number of
>>> TLBIs added due to all the extra folio splitting.
>>>
>>> Therefore, solve this regression by adding support for swapping out
>>> small-sized THP without needing to split the folio, just like is already
>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>> enabled, and when the swap backing store is a non-rotating block device
>>> - these are the same constraints as for the existing PMD-sized THP
>>> swap-out support.
>>>
>>> Note that no attempt is made to swap-in THP here - this is still done
>>> page-by-page, like for PMD-sized THP.
>>>
>>> The main change here is to improve the swap entry allocator so that it
>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>> order and allocating sequentially from it until the cluster is full.
>>> This ensures that we don't need to search the map and we get no
>>> fragmentation due to alignment padding for different orders in the
>>> cluster. If there is no current cluster for a given order, we attempt to
>>> allocate a free cluster from the list. If there are no free clusters, we
>>> fail the allocation and the caller falls back to splitting the folio and
>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>
>>> As far as I can tell, this should not cause any extra fragmentation
>>> concerns, given how similar it is to the existing PMD-sized THP
>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>> concurrent use though. In practice, the number of orders in use will be
>>> small though.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  include/linux/swap.h |  7 ++++++
>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>  mm/vmscan.c          | 10 +++++---
>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index a073366a227c..fc55b760aeff 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>  					 */
>>>  	struct work_struct discard_work; /* discard worker */
>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>> +	unsigned int large_next[PMD_ORDER]; /*
>>> +					     * next free offset within current
>>> +					     * allocation cluster for large
>>> +					     * folios, or UINT_MAX if no current
>>> +					     * cluster. Index is (order - 1).
>>> +					     * Only when cluster_info is used.
>>> +					     */
>>
>> I think that it is better to make this per-CPU.  That is, extend the
>> percpu_cluster mechanism.  Otherwise, we may have scalability issue.
> 
> Is your concern that the swap_info spinlock will get too contended as its
> currently written? From briefly looking at percpu_cluster, it looks like that
> spinlock is always held when accessing the per-cpu structures - presumably
> that's what's disabling preemption and making sure the thread is not migrated?
> So I'm not sure what the benefit is currently? Surely you want to just disable
> preemption but not hold the lock? I'm sure I've missed something crucial...

I looked a bit further at how to implement what you are suggesting.
get_swap_pages() is currently taking the swap_info lock which it needs to check
and update some other parts of the swap_info - I'm not sure that part can be
removed. swap_alloc_large() (my new function) is not doing an awful lot of work,
so I'm not convinced that you would save too much by releasing the lock for that
part. In contrast there is a lot more going on in scan_swap_map_slots() so there
is more benefit to releasing the lock and using the percpu stuff - correct me if
I've missunderstood.

As an alternative approach, perhaps it makes more sense to beef up the caching
layer in swap_slots.c to handle large folios too? Then you avoid taking the
swap_info lock at all most of the time, like you currently do for single entry
allocations.

What do you think?

> 
>>
>> And this should be enclosed in CONFIG_THP_SWAP.
> 
> Yes, I'll fix this in the next version.
> 
> Thanks for the review!
> 
>>
>>>  	struct plist_node avail_lists[]; /*
>>>  					   * entries in swap_avail_heads, one
>>>  					   * entry per node.
>>
>> --
>> Best Regards,
>> Huang, Ying
>
  
Huang, Ying Oct. 16, 2023, 6:17 a.m. UTC | #5
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 11/10/2023 11:36, Ryan Roberts wrote:
>> On 11/10/2023 09:25, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> The upcoming anonymous small-sized THP feature enables performance
>>>> improvements by allocating large folios for anonymous memory. However
>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>> kernel compilation) across many cores, under high memory pressure, the
>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>> TLBIs added due to all the extra folio splitting.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out
>>>> small-sized THP without needing to split the folio, just like is already
>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>> enabled, and when the swap backing store is a non-rotating block device
>>>> - these are the same constraints as for the existing PMD-sized THP
>>>> swap-out support.
>>>>
>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>> page-by-page, like for PMD-sized THP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that it
>>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>> order and allocating sequentially from it until the cluster is full.
>>>> This ensures that we don't need to search the map and we get no
>>>> fragmentation due to alignment padding for different orders in the
>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>> fail the allocation and the caller falls back to splitting the folio and
>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>
>>>> As far as I can tell, this should not cause any extra fragmentation
>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>> concurrent use though. In practice, the number of orders in use will be
>>>> small though.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/swap.h |  7 ++++++
>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>  mm/vmscan.c          | 10 +++++---
>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index a073366a227c..fc55b760aeff 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>  					 */
>>>>  	struct work_struct discard_work; /* discard worker */
>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>> +					     * next free offset within current
>>>> +					     * allocation cluster for large
>>>> +					     * folios, or UINT_MAX if no current
>>>> +					     * cluster. Index is (order - 1).
>>>> +					     * Only when cluster_info is used.
>>>> +					     */
>>>
>>> I think that it is better to make this per-CPU.  That is, extend the
>>> percpu_cluster mechanism.  Otherwise, we may have scalability issue.
>> 
>> Is your concern that the swap_info spinlock will get too contended as its
>> currently written? From briefly looking at percpu_cluster, it looks like that
>> spinlock is always held when accessing the per-cpu structures - presumably
>> that's what's disabling preemption and making sure the thread is not migrated?
>> So I'm not sure what the benefit is currently? Surely you want to just disable
>> preemption but not hold the lock? I'm sure I've missed something crucial...
>
> I looked a bit further at how to implement what you are suggesting.
> get_swap_pages() is currently taking the swap_info lock which it needs to check
> and update some other parts of the swap_info - I'm not sure that part can be
> removed. swap_alloc_large() (my new function) is not doing an awful lot of work,
> so I'm not convinced that you would save too much by releasing the lock for that
> part. In contrast there is a lot more going on in scan_swap_map_slots() so there
> is more benefit to releasing the lock and using the percpu stuff - correct me if
> I've missunderstood.
>
> As an alternative approach, perhaps it makes more sense to beef up the caching
> layer in swap_slots.c to handle large folios too? Then you avoid taking the
> swap_info lock at all most of the time, like you currently do for single entry
> allocations.
>
> What do you think?

Sorry for late reply.

percpu_cluster is introduced in commit ebc2a1a69111 ("swap: make cluster
allocation per-cpu").  Please check the changelog for why it's
introduced.  Sorry about my incorrect memory about scalability.
percpu_cluster is introduced for disk performance mainly instead of
scalability.

--
Best Regards,
Huang, Ying

>> 
>>>
>>> And this should be enclosed in CONFIG_THP_SWAP.
>> 
>> Yes, I'll fix this in the next version.
>> 
>> Thanks for the review!
>> 
>>>
>>>>  	struct plist_node avail_lists[]; /*
>>>>  					   * entries in swap_avail_heads, one
>>>>  					   * entry per node.
  
Ryan Roberts Oct. 16, 2023, 12:10 p.m. UTC | #6
On 16/10/2023 07:17, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 11/10/2023 11:36, Ryan Roberts wrote:
>>> On 11/10/2023 09:25, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> The upcoming anonymous small-sized THP feature enables performance
>>>>> improvements by allocating large folios for anonymous memory. However
>>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>>> kernel compilation) across many cores, under high memory pressure, the
>>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>>> TLBIs added due to all the extra folio splitting.
>>>>>
>>>>> Therefore, solve this regression by adding support for swapping out
>>>>> small-sized THP without needing to split the folio, just like is already
>>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>>> enabled, and when the swap backing store is a non-rotating block device
>>>>> - these are the same constraints as for the existing PMD-sized THP
>>>>> swap-out support.
>>>>>
>>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>>> page-by-page, like for PMD-sized THP.
>>>>>
>>>>> The main change here is to improve the swap entry allocator so that it
>>>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>>> order and allocating sequentially from it until the cluster is full.
>>>>> This ensures that we don't need to search the map and we get no
>>>>> fragmentation due to alignment padding for different orders in the
>>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>>> fail the allocation and the caller falls back to splitting the folio and
>>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>>
>>>>> As far as I can tell, this should not cause any extra fragmentation
>>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>>> concurrent use though. In practice, the number of orders in use will be
>>>>> small though.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>  include/linux/swap.h |  7 ++++++
>>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>>  mm/vmscan.c          | 10 +++++---
>>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index a073366a227c..fc55b760aeff 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>>  					 */
>>>>>  	struct work_struct discard_work; /* discard worker */
>>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>>> +					     * next free offset within current
>>>>> +					     * allocation cluster for large
>>>>> +					     * folios, or UINT_MAX if no current
>>>>> +					     * cluster. Index is (order - 1).
>>>>> +					     * Only when cluster_info is used.
>>>>> +					     */
>>>>
>>>> I think that it is better to make this per-CPU.  That is, extend the
>>>> percpu_cluster mechanism.  Otherwise, we may have scalability issue.
>>>
>>> Is your concern that the swap_info spinlock will get too contended as its
>>> currently written? From briefly looking at percpu_cluster, it looks like that
>>> spinlock is always held when accessing the per-cpu structures - presumably
>>> that's what's disabling preemption and making sure the thread is not migrated?
>>> So I'm not sure what the benefit is currently? Surely you want to just disable
>>> preemption but not hold the lock? I'm sure I've missed something crucial...
>>
>> I looked a bit further at how to implement what you are suggesting.
>> get_swap_pages() is currently taking the swap_info lock which it needs to check
>> and update some other parts of the swap_info - I'm not sure that part can be
>> removed. swap_alloc_large() (my new function) is not doing an awful lot of work,
>> so I'm not convinced that you would save too much by releasing the lock for that
>> part. In contrast there is a lot more going on in scan_swap_map_slots() so there
>> is more benefit to releasing the lock and using the percpu stuff - correct me if
>> I've missunderstood.
>>
>> As an alternative approach, perhaps it makes more sense to beef up the caching
>> layer in swap_slots.c to handle large folios too? Then you avoid taking the
>> swap_info lock at all most of the time, like you currently do for single entry
>> allocations.
>>
>> What do you think?
> 
> Sorry for late reply.
> 
> percpu_cluster is introduced in commit ebc2a1a69111 ("swap: make cluster
> allocation per-cpu").  Please check the changelog for why it's
> introduced.  Sorry about my incorrect memory about scalability.
> percpu_cluster is introduced for disk performance mainly instead of
> scalability.

Thanks for the pointer. I'm not sure if you are still suggesting that I make my
small-sized THP allocation mechanism per-cpu though?

I anticipate that by virtue of allocating multiple contiguous swap entries for a
small-sized THP we already get a lot of the benefits that percpu_cluster gives
order-0 allocations. (Although obviously it will only give contiguity matching
the size of the THP rather than a full cluster). The downside of making this
percpu would be keeping more free clusters tied up in the percpu caches,
potentially causing a need to scan for free entries more often.

What's your view?

Thanks,
Ryan



> 
> --
> Best Regards,
> Huang, Ying
> 
>>>
>>>>
>>>> And this should be enclosed in CONFIG_THP_SWAP.
>>>
>>> Yes, I'll fix this in the next version.
>>>
>>> Thanks for the review!
>>>
>>>>
>>>>>  	struct plist_node avail_lists[]; /*
>>>>>  					   * entries in swap_avail_heads, one
>>>>>  					   * entry per node.
  
Huang, Ying Oct. 17, 2023, 5:44 a.m. UTC | #7
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 16/10/2023 07:17, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 11/10/2023 11:36, Ryan Roberts wrote:
>>>> On 11/10/2023 09:25, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> The upcoming anonymous small-sized THP feature enables performance
>>>>>> improvements by allocating large folios for anonymous memory. However
>>>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>>>> kernel compilation) across many cores, under high memory pressure, the
>>>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>>>> TLBIs added due to all the extra folio splitting.
>>>>>>
>>>>>> Therefore, solve this regression by adding support for swapping out
>>>>>> small-sized THP without needing to split the folio, just like is already
>>>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>>>> enabled, and when the swap backing store is a non-rotating block device
>>>>>> - these are the same constraints as for the existing PMD-sized THP
>>>>>> swap-out support.
>>>>>>
>>>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>>>> page-by-page, like for PMD-sized THP.
>>>>>>
>>>>>> The main change here is to improve the swap entry allocator so that it
>>>>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>>>> order and allocating sequentially from it until the cluster is full.
>>>>>> This ensures that we don't need to search the map and we get no
>>>>>> fragmentation due to alignment padding for different orders in the
>>>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>>>> fail the allocation and the caller falls back to splitting the folio and
>>>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>>>
>>>>>> As far as I can tell, this should not cause any extra fragmentation
>>>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>>>> concurrent use though. In practice, the number of orders in use will be
>>>>>> small though.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> ---
>>>>>>  include/linux/swap.h |  7 ++++++
>>>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>>>  mm/vmscan.c          | 10 +++++---
>>>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>> index a073366a227c..fc55b760aeff 100644
>>>>>> --- a/include/linux/swap.h
>>>>>> +++ b/include/linux/swap.h
>>>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>>>  					 */
>>>>>>  	struct work_struct discard_work; /* discard worker */
>>>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>>>> +					     * next free offset within current
>>>>>> +					     * allocation cluster for large
>>>>>> +					     * folios, or UINT_MAX if no current
>>>>>> +					     * cluster. Index is (order - 1).
>>>>>> +					     * Only when cluster_info is used.
>>>>>> +					     */
>>>>>
>>>>> I think that it is better to make this per-CPU.  That is, extend the
>>>>> percpu_cluster mechanism.  Otherwise, we may have scalability issue.
>>>>
>>>> Is your concern that the swap_info spinlock will get too contended as its
>>>> currently written? From briefly looking at percpu_cluster, it looks like that
>>>> spinlock is always held when accessing the per-cpu structures - presumably
>>>> that's what's disabling preemption and making sure the thread is not migrated?
>>>> So I'm not sure what the benefit is currently? Surely you want to just disable
>>>> preemption but not hold the lock? I'm sure I've missed something crucial...
>>>
>>> I looked a bit further at how to implement what you are suggesting.
>>> get_swap_pages() is currently taking the swap_info lock which it needs to check
>>> and update some other parts of the swap_info - I'm not sure that part can be
>>> removed. swap_alloc_large() (my new function) is not doing an awful lot of work,
>>> so I'm not convinced that you would save too much by releasing the lock for that
>>> part. In contrast there is a lot more going on in scan_swap_map_slots() so there
>>> is more benefit to releasing the lock and using the percpu stuff - correct me if
>>> I've missunderstood.
>>>
>>> As an alternative approach, perhaps it makes more sense to beef up the caching
>>> layer in swap_slots.c to handle large folios too? Then you avoid taking the
>>> swap_info lock at all most of the time, like you currently do for single entry
>>> allocations.
>>>
>>> What do you think?
>> 
>> Sorry for late reply.
>> 
>> percpu_cluster is introduced in commit ebc2a1a69111 ("swap: make cluster
>> allocation per-cpu").  Please check the changelog for why it's
>> introduced.  Sorry about my incorrect memory about scalability.
>> percpu_cluster is introduced for disk performance mainly instead of
>> scalability.
>
> Thanks for the pointer. I'm not sure if you are still suggesting that I make my
> small-sized THP allocation mechanism per-cpu though?

Yes.  I think that the reason for that we introduced percpu_cluster
still applies now.

> I anticipate that by virtue of allocating multiple contiguous swap entries for a
> small-sized THP we already get a lot of the benefits that percpu_cluster gives
> order-0 allocations. (Although obviously it will only give contiguity matching
> the size of the THP rather than a full cluster).

I think that you will still introduce "interleave disk access" when
multiple CPU allocate from and write to swap device simultaneously.
Right?  Yes, 16KB block is better than 4KB block, but I don't think it
solves the problem.

> The downside of making this percpu would be keeping more free clusters
> tied up in the percpu caches, potentially causing a need to scan for
> free entries more often.

Yes.  We may waste several MB swap space per-CPU.  Is this a practical
issue given the swap device capacity becomes larger and larger?

--
Best Regards,
Huang, Ying
  

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a073366a227c..fc55b760aeff 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -320,6 +320,13 @@  struct swap_info_struct {
 					 */
 	struct work_struct discard_work; /* discard worker */
 	struct swap_cluster_list discard_clusters; /* discard clusters list */
+	unsigned int large_next[PMD_ORDER]; /*
+					     * next free offset within current
+					     * allocation cluster for large
+					     * folios, or UINT_MAX if no current
+					     * cluster. Index is (order - 1).
+					     * Only when cluster_info is used.
+					     */
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
 					   * entry per node.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c668838fa660..f8093dedc866 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -987,8 +987,10 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }

-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
+			    unsigned int nr_pages)
 {
+	int order;
 	unsigned long idx;
 	struct swap_cluster_info *ci;
 	unsigned long offset;
@@ -1002,20 +1004,47 @@  static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 		return 0;
 	}

-	if (cluster_list_empty(&si->free_clusters))
-		return 0;
+	VM_WARN_ON(nr_pages < 2);
+	VM_WARN_ON(nr_pages > SWAPFILE_CLUSTER);
+	VM_WARN_ON(!is_power_of_2(nr_pages));

-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+	order = ilog2(nr_pages);
+	offset = si->large_next[order - 1];
+
+	if (offset == UINT_MAX) {
+		if (cluster_list_empty(&si->free_clusters))
+			return 0;

-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
+		idx = cluster_list_first(&si->free_clusters);
+		offset = idx * SWAPFILE_CLUSTER;
+
+		ci = lock_cluster(si, offset);
+		alloc_cluster(si, idx);
+		cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+
+		/*
+		 * If scan_swap_map_slots() can't find a free cluster, it will
+		 * check si->swap_map directly. To make sure this standby
+		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
+		 * entries bad (occupied). (same approach as discard).
+		 */
+		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
+			SWAPFILE_CLUSTER - nr_pages);
+	} else {
+		idx = offset / SWAPFILE_CLUSTER;
+		ci = lock_cluster(si, offset);
+	}
+
+	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
 	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
+	swap_range_alloc(si, offset, nr_pages);
 	*slot = swp_entry(si->type, offset);

+	offset += nr_pages;
+	if (idx != offset / SWAPFILE_CLUSTER)
+		offset = UINT_MAX;
+	si->large_next[order - 1] = offset;
+
 	return 1;
 }

@@ -1041,7 +1070,7 @@  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 	int node;

 	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+	WARN_ON_ONCE(n_goal > 1 && size > 1);

 	spin_lock(&swap_avail_lock);

@@ -1078,14 +1107,14 @@  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
+		if (size > 1) {
 			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
+				n_ret = swap_alloc_large(si, swp_entries, size);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();

@@ -2725,6 +2754,9 @@  static struct swap_info_struct *alloc_swap_info(void)
 	spin_lock_init(&p->cont_lock);
 	init_completion(&p->comp);

+	for (i = 0; i < ARRAY_SIZE(p->large_next); i++)
+		p->large_next[i] = UINT_MAX;
+
 	return p;
 }

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c16e2b1ea8ae..5984d2ae4547 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1212,11 +1212,13 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split PMD-mappable folios without a
+					 * PMD map right away. Chances are some
+					 * or all of the tail pages can be freed
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (folio_test_pmd_mappable(folio) &&
+					    !folio_entire_mapcount(folio) &&
 					    split_folio_to_list(folio,
 								folio_list))
 						goto activate_locked;