[v3,4/4] mm: swap: Swap-out small-sized THP without splitting

Message ID 20231025144546.577640-5-ryan.roberts@arm.com
State New
Series Swap-out small-sized THP without splitting

Commit Message

Ryan Roberts Oct. 25, 2023, 2:45 p.m. UTC
  The upcoming anonymous small-sized THP feature enables performance
improvements by allocating large folios for anonymous memory. However
I've observed that on an arm64 system running a parallel workload (e.g.
kernel compilation) across many cores, under high memory pressure, the
speed regresses. This is due to bottlenecking on the increased number of
TLBIs added due to all the extra folio splitting.

Therefore, solve this regression by adding support for swapping out
small-sized THP without needing to split the folio, just like is already
done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
enabled, and when the swap backing store is a non-rotating block device.
These are the same constraints as for the existing PMD-sized THP
swap-out support.

Note that no attempt is made to swap-in THP here - this is still done
page-by-page, like for PMD-sized THP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [1, (1
<< PMD_ORDER)]. This is done by allocating a cluster for each distinct
order and allocating sequentially from it until the cluster is full.
This ensures that we don't need to search the map and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt to
allocate a free cluster from the list. If there are no free clusters, we
fail the allocation and the caller falls back to splitting the folio and
allocates individual entries (as per existing PMD-sized THP fallback).
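
In rough pseudocode, the per-order fast path looks like this (a
simplified sketch of __scan_swap_map_try_ssd_cluster() below; locking,
the conflict check against the free-cluster list and the discard path
are omitted):

        nr_pages = 1 << order;
        cpu_next = &this_cpu_ptr(si->cpu_next)[order];
        tmp = *cpu_next;
        if (tmp == SWAP_NEXT_NULL) {
                /* No current cluster for this order: take a free one. */
                if (cluster_list_empty(&si->free_clusters))
                        return false;   /* caller splits the folio instead */
                tmp = cluster_list_first(&si->free_clusters) * SWAPFILE_CLUSTER;
        }
        /*
         * If any of the nr_pages entries is already in use, start over
         * with a fresh cluster; otherwise hand out nr_pages entries and
         * remember where to continue next time.
         */
        max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
        *offset = tmp;
        *cpu_next = tmp + nr_pages < max ? tmp + nr_pages : SWAP_NEXT_NULL;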

The per-order current clusters are maintained per-cpu using the existing
infrastructure. This is done to avoid interleaving pages from different
tasks, which would prevent IO being batched. This is already done for
the order-0 allocations so we follow the same pattern.
__scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
for order-0.

As is done for order-0 per-cpu clusters, the scanner now can steal
order-0 entries from any per-cpu-per-order reserved cluster. This
ensures that when the swap file is getting full, space doesn't get tied
up in the per-cpu reserves.

I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
device as the swap device and from inside a memcg limited to 40G memory.
I've then run `usemem` from vm-scalability with 70 processes (each has
its own core), each allocating and writing 1G of memory. I've repeated
everything 5 times and taken the mean:

Mean Performance Improvement vs 4K/baseline

| alloc size |            baseline |       + this series |
|            |  v6.6-rc4+anonfolio |                     |
|:-----------|--------------------:|--------------------:|
| 4K Page    |                0.0% |                4.9% |
| 64K THP    |              -44.1% |               10.7% |
| 2M THP     |               56.0% |               65.9% |

So with this change, the regression for 64K swap performance goes away
and 4K and 2M swap improves slightly too.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h |  10 +--
 mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
 mm/vmscan.c          |  10 +--
 3 files changed, 119 insertions(+), 50 deletions(-)
  

Comments

Huang, Ying Oct. 30, 2023, 8:18 a.m. UTC | #1
Hi, Ryan,

Ryan Roberts <ryan.roberts@arm.com> writes:

> The upcoming anonymous small-sized THP feature enables performance
> improvements by allocating large folios for anonymous memory. However
> I've observed that on an arm64 system running a parallel workload (e.g.
> kernel compilation) across many cores, under high memory pressure, the
> speed regresses. This is due to bottlenecking on the increased number of
> TLBIs added due to all the extra folio splitting.
>
> Therefore, solve this regression by adding support for swapping out
> small-sized THP without needing to split the folio, just like is already
> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
> enabled, and when the swap backing store is a non-rotating block device.
> These are the same constraints as for the existing PMD-sized THP
> swap-out support.
>
> Note that no attempt is made to swap-in THP here - this is still done
> page-by-page, like for PMD-sized THP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [1, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller falls back to splitting the folio and
> allocates individual entries (as per existing PMD-sized THP fallback).
>
> The per-order current clusters are maintained per-cpu using the existing
> infrastructure. This is done to avoid interleving pages from different
> tasks, which would prevent IO being batched. This is already done for
> the order-0 allocations so we follow the same pattern.
> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
> for order-0.
>
> As is done for order-0 per-cpu clusters, the scanner now can steal
> order-0 entries from any per-cpu-per-order reserved cluster. This
> ensures that when the swap file is getting full, space doesn't get tied
> up in the per-cpu reserves.
>
> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
> device as the swap device and from inside a memcg limited to 40G memory.
> I've then run `usemem` from vm-scalability with 70 processes (each has
> its own core), each allocating and writing 1G of memory. I've repeated
> everything 5 times and taken the mean:
>
> Mean Performance Improvement vs 4K/baseline
>
> | alloc size |            baseline |       + this series |
> |            |  v6.6-rc4+anonfolio |                     |
> |:-----------|--------------------:|--------------------:|
> | 4K Page    |                0.0% |                4.9% |
> | 64K THP    |              -44.1% |               10.7% |
> | 2M THP     |               56.0% |               65.9% |
>
> So with this change, the regression for 64K swap performance goes away
> and 4K and 2M swap improves slightly too.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  10 +--
>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>  mm/vmscan.c          |  10 +--
>  3 files changed, 119 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0ca8aaa098ba..ccbca5db851b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -295,11 +295,11 @@ struct swap_info_struct {
>  	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>  	unsigned int __percpu *cpu_next;/*
>  					 * Likely next allocation offset. We
> -					 * assign a cluster to each CPU, so each
> -					 * CPU can allocate swap entry from its
> -					 * own cluster and swapout sequentially.
> -					 * The purpose is to optimize swapout
> -					 * throughput.
> +					 * assign a cluster per-order to each
> +					 * CPU, so each CPU can allocate swap
> +					 * entry from its own cluster and
> +					 * swapout sequentially. The purpose is
> +					 * to optimize swapout throughput.
>  					 */

This is kind of hard to understand.  Better to define some intermediate
data structure to improve readability.  For example,

#ifdef CONFIG_THP_SWAP
#define NR_SWAP_ORDER   PMD_ORDER
#else
#define NR_SWAP_ORDER   1
#endif

struct percpu_clusters {
        unsigned int alloc_next[NR_SWAP_ORDER];
};

PMD_ORDER isn't a constant on powerpc, but THP_SWAP isn't supported on
powerpc either.
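
With such a struct, the hot path could then index per order, e.g. (a
sketch, assuming the percpu pointer in swap_info_struct is renamed to,
say, percpu_clusters):

        struct percpu_clusters *pcp = this_cpu_ptr(si->percpu_clusters);
        unsigned int next = pcp->alloc_next[order];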

>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>  	struct block_device *bdev;	/* swap device or bdev of swap file */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 94f7cc225eb9..b50bce50bed9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>  
>  /*
>   * The cluster corresponding to page_nr will be used. The cluster will be
> - * removed from free cluster list and its usage counter will be increased.
> + * removed from free cluster list and its usage counter will be increased by
> + * count.
>   */
> -static void inc_cluster_info_page(struct swap_info_struct *p,
> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void add_cluster_info_page(struct swap_info_struct *p,
> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
> +	unsigned long count)
>  {
>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>  
> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>  	if (cluster_is_free(&cluster_info[idx]))
>  		alloc_cluster(p, idx);
>  
> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>  	cluster_set_count(&cluster_info[idx],
> -		cluster_count(&cluster_info[idx]) + 1);
> +		cluster_count(&cluster_info[idx]) + count);
> +}
> +
> +/*
> + * The cluster corresponding to page_nr will be used. The cluster will be
> + * removed from free cluster list and its usage counter will be increased.
> + */
> +static void inc_cluster_info_page(struct swap_info_struct *p,
> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +{
> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>  }
>  
>  /*
> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>   * cluster list. Avoiding such abuse to avoid list corruption.
>   */
>  static bool
> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> -	unsigned long offset)
> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> +	unsigned long offset, int order)
>  {
>  	bool conflict;
>  
> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>  	if (!conflict)
>  		return false;
>  
> -	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
> +	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;

This is added in the previous patch.  I don't think SWAP_NEXT_NULL is a
good name, because NEXT isn't a pointer (while cluster_next is).  Better
to name it SWAP_NEXT_INVALID or similar.

>  	return true;
>  }
>  
>  /*
> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> - * might involve allocating a new cluster for current CPU too.
> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
> + * cluster list. Avoiding such abuse to avoid list corruption.
>   */
> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> -	unsigned long *offset, unsigned long *scan_base)
> +static bool
> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> +	unsigned long offset)
> +{
> +	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
> +}
> +
> +/*
> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
> + * entry pool (a cluster). This might involve allocating a new cluster for
> + * current CPU too.
> + */
> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> +	unsigned long *offset, unsigned long *scan_base, int order)
>  {
>  	struct swap_cluster_info *ci;
> -	unsigned int tmp, max;
> +	unsigned int tmp, max, i;
>  	unsigned int *cpu_next;
> +	unsigned int nr_pages = 1 << order;
>  
>  new_cluster:
> -	cpu_next = this_cpu_ptr(si->cpu_next);
> +	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>  	tmp = *cpu_next;
>  	if (tmp == SWAP_NEXT_NULL) {
>  		if (!cluster_list_empty(&si->free_clusters)) {
> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  	 * reserve a new cluster.
>  	 */
>  	ci = lock_cluster(si, tmp);
> -	if (si->swap_map[tmp]) {
> -		unlock_cluster(ci);
> -		*cpu_next = SWAP_NEXT_NULL;
> -		goto new_cluster;
> +	for (i = 0; i < nr_pages; i++) {
> +		if (si->swap_map[tmp + i]) {
> +			unlock_cluster(ci);
> +			*cpu_next = SWAP_NEXT_NULL;
> +			goto new_cluster;
> +		}
>  	}
>  	unlock_cluster(ci);
>  
> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  	*scan_base = tmp;
>  
>  	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;

This line is added in a previous patch.  Can we just use

        max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);

Or, add ALIGN_UP() for this?
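
(For reference, the two forms compute the same value for any tmp:

        max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
        max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);         /* equivalent */

so this is purely a readability change.)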

> -	tmp += 1;
> +	tmp += nr_pages;
>  	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>  
>  	return true;
>  }
>  
> +/*
> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> + * might involve allocating a new cluster for current CPU too.
> + */
> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> +	unsigned long *offset, unsigned long *scan_base)
> +{
> +	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
> +}
> +
>  static void __del_from_avail_list(struct swap_info_struct *p)
>  {
>  	int nid;
> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return n_ret;
>  }
>  
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
> +			    unsigned int nr_pages)

IMHO, it's better to make scan_swap_map_slots() support order > 0
instead of making swap_alloc_cluster() support order != PMD_ORDER.
And we may be able to merge swap_alloc_cluster() into
scan_swap_map_slots() after that.
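
For illustration only (a sketch, not something this patch does), that
could mean growing the existing entry point with an order parameter:

        static int scan_swap_map_slots(struct swap_info_struct *si,
                                       unsigned char usage, int nr,
                                       swp_entry_t slots[], int order);

and letting order 0 follow today's behaviour.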

>  {
> -	unsigned long idx;
>  	struct swap_cluster_info *ci;
> -	unsigned long offset;
> +	unsigned long offset, scan_base;
> +	int order = ilog2(nr_pages);
> +	bool ret;
>  
>  	/*
> -	 * Should not even be attempting cluster allocations when huge
> +	 * Should not even be attempting large allocations when huge
>  	 * page swap is disabled.  Warn and fail the allocation.
>  	 */
> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
> +	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> +	    nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
> +	    !is_power_of_2(nr_pages)) {
>  		VM_WARN_ON_ONCE(1);
>  		return 0;
>  	}
>  
> -	if (cluster_list_empty(&si->free_clusters))
> +	/*
> +	 * Swapfile is not block device or not using clusters so unable to
> +	 * allocate large entries.
> +	 */
> +	if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>  		return 0;
>  
> -	idx = cluster_list_first(&si->free_clusters);
> -	offset = idx * SWAPFILE_CLUSTER;
> -	ci = lock_cluster(si, offset);
> -	alloc_cluster(si, idx);
> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
> +again:
> +	/*
> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> +	 * so indicate that we are scanning to synchronise with swapoff.
> +	 */
> +	si->flags += SWP_SCANNING;
> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> +	si->flags -= SWP_SCANNING;
> +
> +	/*
> +	 * If we failed to allocate or if swapoff is waiting for us (due to lock
> +	 * being dropped for discard above), return immediately.
> +	 */
> +	if (!ret || !(si->flags & SWP_WRITEOK))
> +		return 0;
>  
> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> +	if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
> +		goto again;
> +
> +	ci = lock_cluster(si, offset);
> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>  	unlock_cluster(ci);
> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> -	*slot = swp_entry(si->type, offset);
>  
> +	swap_range_alloc(si, offset, nr_pages);
> +	*slot = swp_entry(si->type, offset);
>  	return 1;
>  }
>  
> @@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  	int node;
>  
>  	/* Only single cluster request supported */
> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> +	WARN_ON_ONCE(n_goal > 1 && size > 1);
>  
>  	spin_lock(&swap_avail_lock);
>  
> @@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  			spin_unlock(&si->lock);
>  			goto nextsi;
>  		}
> -		if (size == SWAPFILE_CLUSTER) {
> -			if (si->flags & SWP_BLKDEV)
> -				n_ret = swap_alloc_cluster(si, swp_entries);
> +		if (size > 1) {
> +			n_ret = swap_alloc_large(si, swp_entries, size);
>  		} else
>  			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>  						    n_goal, swp_entries);
>  		spin_unlock(&si->lock);
> -		if (n_ret || size == SWAPFILE_CLUSTER)
> +		if (n_ret || size > 1)
>  			goto check_out;
>  		cond_resched();
>  
> @@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  	if (p->bdev && bdev_nonrot(p->bdev)) {
>  		int cpu;
>  		unsigned long ci, nr_cluster;
> +		int nr_order;
> +		int i;
>  
>  		p->flags |= SWP_SOLIDSTATE;
>  		p->cluster_next_cpu = alloc_percpu(unsigned int);
> @@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  		for (ci = 0; ci < nr_cluster; ci++)
>  			spin_lock_init(&((cluster_info + ci)->lock));
>  
> -		p->cpu_next = alloc_percpu(unsigned int);
> +		nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
> +		p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
> +					     __alignof__(unsigned int));
>  		if (!p->cpu_next) {
>  			error = -ENOMEM;
>  			goto bad_swap_unlock_inode;
>  		}
> -		for_each_possible_cpu(cpu)
> -			per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
> +		for_each_possible_cpu(cpu) {
> +			unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
> +
> +			for (i = 0; i < nr_order; i++)
> +				cpu_next[i] = SWAP_NEXT_NULL;
> +		}
>  	} else {
>  		atomic_inc(&nr_rotate_swap);
>  		inced_nr_rotate_swap = true;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2cc0cb41fb32..ea19710aa4cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  					if (!can_split_folio(folio, NULL))
>  						goto activate_locked;
>  					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split PMD-mappable folios without a
> +					 * PMD map right away. Chances are some
> +					 * or all of the tail pages can be freed
> +					 * without IO.
>  					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (folio_test_pmd_mappable(folio) &&
> +					    !folio_entire_mapcount(folio) &&
>  					    split_folio_to_list(folio,
>  								folio_list))
>  						goto activate_locked;

--
Best Regards,
Huang, Ying
  
Ryan Roberts Oct. 30, 2023, 1:59 p.m. UTC | #2
On 30/10/2023 08:18, Huang, Ying wrote:
> Hi, Ryan,
> 
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> The upcoming anonymous small-sized THP feature enables performance
>> improvements by allocating large folios for anonymous memory. However
>> I've observed that on an arm64 system running a parallel workload (e.g.
>> kernel compilation) across many cores, under high memory pressure, the
>> speed regresses. This is due to bottlenecking on the increased number of
>> TLBIs added due to all the extra folio splitting.
>>
>> Therefore, solve this regression by adding support for swapping out
>> small-sized THP without needing to split the folio, just like is already
>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>> enabled, and when the swap backing store is a non-rotating block device.
>> These are the same constraints as for the existing PMD-sized THP
>> swap-out support.
>>
>> Note that no attempt is made to swap-in THP here - this is still done
>> page-by-page, like for PMD-sized THP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [1, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller falls back to splitting the folio and
>> allocates individual entries (as per existing PMD-sized THP fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
>> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
>> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
>> for order-0.
>>
>> As is done for order-0 per-cpu clusters, the scanner now can steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>> device as the swap device and from inside a memcg limited to 40G memory.
>> I've then run `usemem` from vm-scalability with 70 processes (each has
>> its own core), each allocating and writing 1G of memory. I've repeated
>> everything 5 times and taken the mean:
>>
>> Mean Performance Improvement vs 4K/baseline
>>
>> | alloc size |            baseline |       + this series |
>> |            |  v6.6-rc4+anonfolio |                     |
>> |:-----------|--------------------:|--------------------:|
>> | 4K Page    |                0.0% |                4.9% |
>> | 64K THP    |              -44.1% |               10.7% |
>> | 2M THP     |               56.0% |               65.9% |
>>
>> So with this change, the regression for 64K swap performance goes away
>> and 4K and 2M swap improves slightly too.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/swap.h |  10 +--
>>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>>  mm/vmscan.c          |  10 +--
>>  3 files changed, 119 insertions(+), 50 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 0ca8aaa098ba..ccbca5db851b 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -295,11 +295,11 @@ struct swap_info_struct {
>>  	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>>  	unsigned int __percpu *cpu_next;/*
>>  					 * Likely next allocation offset. We
>> -					 * assign a cluster to each CPU, so each
>> -					 * CPU can allocate swap entry from its
>> -					 * own cluster and swapout sequentially.
>> -					 * The purpose is to optimize swapout
>> -					 * throughput.
>> +					 * assign a cluster per-order to each
>> +					 * CPU, so each CPU can allocate swap
>> +					 * entry from its own cluster and
>> +					 * swapout sequentially. The purpose is
>> +					 * to optimize swapout throughput.
>>  					 */
> 
> This is kind of hard to understand.  Better to define some intermediate
> data structure to improve readability.  For example,
> 
> #ifdef CONFIG_THP_SWAP
> #define NR_SWAP_ORDER   PMD_ORDER
> #else
> #define NR_SWAP_ORDER   1
> #endif
> 
> struct percpu_clusters {
>         unsigned int alloc_next[NR_SWAP_ORDER];
> };
> 
> PMD_ORDER isn't a constant on powerpc, but THP_SWAP isn't supported on
> powerpc too.

I get your point, but this is just making it more difficult for powerpc to ever
enable the feature in future - you're implicitly depending on !powerpc, which
seems fragile. How about if I change the first line of the comment to be "per-cpu
array indexed by allocation order"? Would that be enough?

> 
>>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>>  	struct block_device *bdev;	/* swap device or bdev of swap file */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 94f7cc225eb9..b50bce50bed9 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  
>>  /*
>>   * The cluster corresponding to page_nr will be used. The cluster will be
>> - * removed from free cluster list and its usage counter will be increased.
>> + * removed from free cluster list and its usage counter will be increased by
>> + * count.
>>   */
>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +static void add_cluster_info_page(struct swap_info_struct *p,
>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>> +	unsigned long count)
>>  {
>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>  
>> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>  	if (cluster_is_free(&cluster_info[idx]))
>>  		alloc_cluster(p, idx);
>>  
>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>  	cluster_set_count(&cluster_info[idx],
>> -		cluster_count(&cluster_info[idx]) + 1);
>> +		cluster_count(&cluster_info[idx]) + count);
>> +}
>> +
>> +/*
>> + * The cluster corresponding to page_nr will be used. The cluster will be
>> + * removed from free cluster list and its usage counter will be increased.
>> + */
>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +{
>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>  }
>>  
>>  /*
>> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>   * cluster list. Avoiding such abuse to avoid list corruption.
>>   */
>>  static bool
>> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> -	unsigned long offset)
>> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> +	unsigned long offset, int order)
>>  {
>>  	bool conflict;
>>  
>> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>  	if (!conflict)
>>  		return false;
>>  
>> -	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
>> +	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
> 
> This is added in the previous patch.  I don't think SWAP_NEXT_NULL is a
> good name.  Because NEXT isn't a pointer (while cluster_next is). Better
> to name it as SWAP_NEXT_INVALID, etc.

ACK, will make change for next version.

> 
>>  	return true;
>>  }
>>  
>>  /*
>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> - * might involve allocating a new cluster for current CPU too.
>> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
>> + * cluster list. Avoiding such abuse to avoid list corruption.
>>   */
>> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> -	unsigned long *offset, unsigned long *scan_base)
>> +static bool
>> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> +	unsigned long offset)
>> +{
>> +	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
>> +}
>> +
>> +/*
>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>> + * entry pool (a cluster). This might involve allocating a new cluster for
>> + * current CPU too.
>> + */
>> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>  {
>>  	struct swap_cluster_info *ci;
>> -	unsigned int tmp, max;
>> +	unsigned int tmp, max, i;
>>  	unsigned int *cpu_next;
>> +	unsigned int nr_pages = 1 << order;
>>  
>>  new_cluster:
>> -	cpu_next = this_cpu_ptr(si->cpu_next);
>> +	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>>  	tmp = *cpu_next;
>>  	if (tmp == SWAP_NEXT_NULL) {
>>  		if (!cluster_list_empty(&si->free_clusters)) {
>> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  	 * reserve a new cluster.
>>  	 */
>>  	ci = lock_cluster(si, tmp);
>> -	if (si->swap_map[tmp]) {
>> -		unlock_cluster(ci);
>> -		*cpu_next = SWAP_NEXT_NULL;
>> -		goto new_cluster;
>> +	for (i = 0; i < nr_pages; i++) {
>> +		if (si->swap_map[tmp + i]) {
>> +			unlock_cluster(ci);
>> +			*cpu_next = SWAP_NEXT_NULL;
>> +			goto new_cluster;
>> +		}
>>  	}
>>  	unlock_cluster(ci);
>>  
>> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  	*scan_base = tmp;
>>  
>>  	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
> 
> This line is added in a previous patch.  Can we just use
> 
>         max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);

Sure. This is how I originally had it, but then decided that the other approach
was a bit clearer. But I don't have a strong opinion, so I'll change it as you
suggest.

> 
> Or, add ALIGN_UP() for this?
> 
>> -	tmp += 1;
>> +	tmp += nr_pages;
>>  	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>>  
>>  	return true;
>>  }
>>  
>> +/*
>> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> + * might involve allocating a new cluster for current CPU too.
>> + */
>> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> +	unsigned long *offset, unsigned long *scan_base)
>> +{
>> +	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
>> +}
>> +
>>  static void __del_from_avail_list(struct swap_info_struct *p)
>>  {
>>  	int nid;
>> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	return n_ret;
>>  }
>>  
>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>> +			    unsigned int nr_pages)
> 
> IMHO, it's better to make scan_swap_map_slots() to support order > 0
> instead of making swap_alloc_cluster() to support order != PMD_ORDER.
> And, we may merge swap_alloc_cluster() with scan_swap_map_slots() after
> that.

I did consider adding a 5th patch to rename swap_alloc_large() to something like
swap_alloc_one_ssd_entry() (which would then be used for order=0 too) and
refactor scan_swap_map_slots() to fully delegate to it for the non-scanning ssd
allocation case. Would something like that suit?

I have reservations about making scan_swap_map_slots() take an order and be the
sole entry point:

  - in the non-ssd case, we can't support order!=0
  - there is a lot of other logic to deal with falling back to scanning which we
    would only want to do for order==0, so we would end up with a few ugly
    conditionals against order.
  - I was concerned the risk of me introducing a bug when refactoring all that
    subtle logic was high

What do you think? Is not making scan_swap_map_slots() support order > 0 a deal
breaker for you?

Thanks,
Ryan


> 
>>  {
>> -	unsigned long idx;
>>  	struct swap_cluster_info *ci;
>> -	unsigned long offset;
>> +	unsigned long offset, scan_base;
>> +	int order = ilog2(nr_pages);
>> +	bool ret;
>>  
>>  	/*
>> -	 * Should not even be attempting cluster allocations when huge
>> +	 * Should not even be attempting large allocations when huge
>>  	 * page swap is disabled.  Warn and fail the allocation.
>>  	 */
>> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>> +	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>> +	    nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
>> +	    !is_power_of_2(nr_pages)) {
>>  		VM_WARN_ON_ONCE(1);
>>  		return 0;
>>  	}
>>  
>> -	if (cluster_list_empty(&si->free_clusters))
>> +	/*
>> +	 * Swapfile is not block device or not using clusters so unable to
>> +	 * allocate large entries.
>> +	 */
>> +	if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>>  		return 0;
>>  
>> -	idx = cluster_list_first(&si->free_clusters);
>> -	offset = idx * SWAPFILE_CLUSTER;
>> -	ci = lock_cluster(si, offset);
>> -	alloc_cluster(si, idx);
>> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
>> +again:
>> +	/*
>> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +	 * so indicate that we are scanning to synchronise with swapoff.
>> +	 */
>> +	si->flags += SWP_SCANNING;
>> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +	si->flags -= SWP_SCANNING;
>> +
>> +	/*
>> +	 * If we failed to allocate or if swapoff is waiting for us (due to lock
>> +	 * being dropped for discard above), return immediately.
>> +	 */
>> +	if (!ret || !(si->flags & SWP_WRITEOK))
>> +		return 0;
>>  
>> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>> +	if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
>> +		goto again;
>> +
>> +	ci = lock_cluster(si, offset);
>> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>  	unlock_cluster(ci);
>> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>> -	*slot = swp_entry(si->type, offset);
>>  
>> +	swap_range_alloc(si, offset, nr_pages);
>> +	*slot = swp_entry(si->type, offset);
>>  	return 1;
>>  }
>>  
>> @@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>  	int node;
>>  
>>  	/* Only single cluster request supported */
>> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>> +	WARN_ON_ONCE(n_goal > 1 && size > 1);
>>  
>>  	spin_lock(&swap_avail_lock);
>>  
>> @@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>  			spin_unlock(&si->lock);
>>  			goto nextsi;
>>  		}
>> -		if (size == SWAPFILE_CLUSTER) {
>> -			if (si->flags & SWP_BLKDEV)
>> -				n_ret = swap_alloc_cluster(si, swp_entries);
>> +		if (size > 1) {
>> +			n_ret = swap_alloc_large(si, swp_entries, size);
>>  		} else
>>  			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>>  						    n_goal, swp_entries);
>>  		spin_unlock(&si->lock);
>> -		if (n_ret || size == SWAPFILE_CLUSTER)
>> +		if (n_ret || size > 1)
>>  			goto check_out;
>>  		cond_resched();
>>  
>> @@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  	if (p->bdev && bdev_nonrot(p->bdev)) {
>>  		int cpu;
>>  		unsigned long ci, nr_cluster;
>> +		int nr_order;
>> +		int i;
>>  
>>  		p->flags |= SWP_SOLIDSTATE;
>>  		p->cluster_next_cpu = alloc_percpu(unsigned int);
>> @@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  		for (ci = 0; ci < nr_cluster; ci++)
>>  			spin_lock_init(&((cluster_info + ci)->lock));
>>  
>> -		p->cpu_next = alloc_percpu(unsigned int);
>> +		nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
>> +		p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
>> +					     __alignof__(unsigned int));
>>  		if (!p->cpu_next) {
>>  			error = -ENOMEM;
>>  			goto bad_swap_unlock_inode;
>>  		}
>> -		for_each_possible_cpu(cpu)
>> -			per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
>> +		for_each_possible_cpu(cpu) {
>> +			unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
>> +
>> +			for (i = 0; i < nr_order; i++)
>> +				cpu_next[i] = SWAP_NEXT_NULL;
>> +		}
>>  	} else {
>>  		atomic_inc(&nr_rotate_swap);
>>  		inced_nr_rotate_swap = true;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2cc0cb41fb32..ea19710aa4cd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>  					if (!can_split_folio(folio, NULL))
>>  						goto activate_locked;
>>  					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>  					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>  					    split_folio_to_list(folio,
>>  								folio_list))
>>  						goto activate_locked;
> 
> --
> Best Regards,
> Huang, Ying
  
Huang, Ying Oct. 31, 2023, 8:12 a.m. UTC | #3
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 30/10/2023 08:18, Huang, Ying wrote:
>> Hi, Ryan,
>> 
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> The upcoming anonymous small-sized THP feature enables performance
>>> improvements by allocating large folios for anonymous memory. However
>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>> kernel compilation) across many cores, under high memory pressure, the
>>> speed regresses. This is due to bottlenecking on the increased number of
>>> TLBIs added due to all the extra folio splitting.
>>>
>>> Therefore, solve this regression by adding support for swapping out
>>> small-sized THP without needing to split the folio, just like is already
>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>> enabled, and when the swap backing store is a non-rotating block device.
>>> These are the same constraints as for the existing PMD-sized THP
>>> swap-out support.
>>>
>>> Note that no attempt is made to swap-in THP here - this is still done
>>> page-by-page, like for PMD-sized THP.
>>>
>>> The main change here is to improve the swap entry allocator so that it
>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>> order and allocating sequentially from it until the cluster is full.
>>> This ensures that we don't need to search the map and we get no
>>> fragmentation due to alignment padding for different orders in the
>>> cluster. If there is no current cluster for a given order, we attempt to
>>> allocate a free cluster from the list. If there are no free clusters, we
>>> fail the allocation and the caller falls back to splitting the folio and
>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>
>>> The per-order current clusters are maintained per-cpu using the existing
>>> infrastructure. This is done to avoid interleving pages from different
>>> tasks, which would prevent IO being batched. This is already done for
>>> the order-0 allocations so we follow the same pattern.
>>> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
>>> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
>>> for order-0.
>>>
>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>> ensures that when the swap file is getting full, space doesn't get tied
>>> up in the per-cpu reserves.
>>>
>>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>>> device as the swap device and from inside a memcg limited to 40G memory.
>>> I've then run `usemem` from vm-scalability with 70 processes (each has
>>> its own core), each allocating and writing 1G of memory. I've repeated
>>> everything 5 times and taken the mean:
>>>
>>> Mean Performance Improvement vs 4K/baseline
>>>
>>> | alloc size |            baseline |       + this series |
>>> |            |  v6.6-rc4+anonfolio |                     |
>>> |:-----------|--------------------:|--------------------:|
>>> | 4K Page    |                0.0% |                4.9% |
>>> | 64K THP    |              -44.1% |               10.7% |
>>> | 2M THP     |               56.0% |               65.9% |
>>>
>>> So with this change, the regression for 64K swap performance goes away
>>> and 4K and 2M swap improves slightly too.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  include/linux/swap.h |  10 +--
>>>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>>>  mm/vmscan.c          |  10 +--
>>>  3 files changed, 119 insertions(+), 50 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 0ca8aaa098ba..ccbca5db851b 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -295,11 +295,11 @@ struct swap_info_struct {
>>>  	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>>>  	unsigned int __percpu *cpu_next;/*
>>>  					 * Likely next allocation offset. We
>>> -					 * assign a cluster to each CPU, so each
>>> -					 * CPU can allocate swap entry from its
>>> -					 * own cluster and swapout sequentially.
>>> -					 * The purpose is to optimize swapout
>>> -					 * throughput.
>>> +					 * assign a cluster per-order to each
>>> +					 * CPU, so each CPU can allocate swap
>>> +					 * entry from its own cluster and
>>> +					 * swapout sequentially. The purpose is
>>> +					 * to optimize swapout throughput.
>>>  					 */
>> 
>> This is kind of hard to understand.  Better to define some intermediate
>> data structure to improve readability.  For example,
>> 
>> #ifdef CONFIG_THP_SWAP
>> #define NR_SWAP_ORDER   PMD_ORDER
>> #else
>> #define NR_SWAP_ORDER   1
>> #endif
>> 
>> struct percpu_clusters {
>>         unsigned int alloc_next[NR_SWAP_ORDER];
>> };
>> 
>> PMD_ORDER isn't a constant on powerpc, but THP_SWAP isn't supported on
>> powerpc too.
>
> I get your point, but this is just making it more difficult for powerpc to ever
> enable the feature in future - you're implicitly depending on !powerpc, which
> seems fragile. How about if I change the first line of the coment to be "per-cpu
> array indexed by allocation order"? Would that be enough?

Even if PMD_ORDER isn't a constant on powerpc, it's not necessary for
NR_SWAP_ORDER to be variable.  At least, (1 << (NR_SWAP_ORDER-1)) should
be less than SWAPFILE_CLUSTER.  When someone adds THP swap support on
powerpc, he can choose a reasonable constant for NR_SWAP_ORDER (for
example, 10 or 7).
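
That is, something along these lines (the value is purely illustrative
and must respect the bound above):

#ifdef CONFIG_THP_SWAP
#define NR_SWAP_ORDER   10
#else
#define NR_SWAP_ORDER   1
#endif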

>> 
>>>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>>>  	struct block_device *bdev;	/* swap device or bdev of swap file */
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 94f7cc225eb9..b50bce50bed9 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>  
>>>  /*
>>>   * The cluster corresponding to page_nr will be used. The cluster will be
>>> - * removed from free cluster list and its usage counter will be increased.
>>> + * removed from free cluster list and its usage counter will be increased by
>>> + * count.
>>>   */
>>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +static void add_cluster_info_page(struct swap_info_struct *p,
>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>>> +	unsigned long count)
>>>  {
>>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>>  
>>> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>>  	if (cluster_is_free(&cluster_info[idx]))
>>>  		alloc_cluster(p, idx);
>>>  
>>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>>  	cluster_set_count(&cluster_info[idx],
>>> -		cluster_count(&cluster_info[idx]) + 1);
>>> +		cluster_count(&cluster_info[idx]) + count);
>>> +}
>>> +
>>> +/*
>>> + * The cluster corresponding to page_nr will be used. The cluster will be
>>> + * removed from free cluster list and its usage counter will be increased.
>>> + */
>>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +{
>>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>>  }
>>>  
>>>  /*
>>> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>>   * cluster list. Avoiding such abuse to avoid list corruption.
>>>   */
>>>  static bool
>>> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> -	unsigned long offset)
>>> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> +	unsigned long offset, int order)
>>>  {
>>>  	bool conflict;
>>>  
>>> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>  	if (!conflict)
>>>  		return false;
>>>  
>>> -	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
>>> +	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
>> 
>> This is added in the previous patch.  I don't think SWAP_NEXT_NULL is a
>> good name.  Because NEXT isn't a pointer (while cluster_next is). Better
>> to name it as SWAP_NEXT_INVALID, etc.
>
> ACK, will make change for next version.

Thanks!

>> 
>>>  	return true;
>>>  }
>>>  
>>>  /*
>>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>> - * might involve allocating a new cluster for current CPU too.
>>> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
>>> + * cluster list. Avoiding such abuse to avoid list corruption.
>>>   */
>>> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> -	unsigned long *offset, unsigned long *scan_base)
>>> +static bool
>>> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> +	unsigned long offset)
>>> +{
>>> +	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
>>> +}
>>> +
>>> +/*
>>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>>> + * entry pool (a cluster). This might involve allocating a new cluster for
>>> + * current CPU too.
>>> + */
>>> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>>  {
>>>  	struct swap_cluster_info *ci;
>>> -	unsigned int tmp, max;
>>> +	unsigned int tmp, max, i;
>>>  	unsigned int *cpu_next;
>>> +	unsigned int nr_pages = 1 << order;
>>>  
>>>  new_cluster:
>>> -	cpu_next = this_cpu_ptr(si->cpu_next);
>>> +	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>>>  	tmp = *cpu_next;
>>>  	if (tmp == SWAP_NEXT_NULL) {
>>>  		if (!cluster_list_empty(&si->free_clusters)) {
>>> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>  	 * reserve a new cluster.
>>>  	 */
>>>  	ci = lock_cluster(si, tmp);
>>> -	if (si->swap_map[tmp]) {
>>> -		unlock_cluster(ci);
>>> -		*cpu_next = SWAP_NEXT_NULL;
>>> -		goto new_cluster;
>>> +	for (i = 0; i < nr_pages; i++) {
>>> +		if (si->swap_map[tmp + i]) {
>>> +			unlock_cluster(ci);
>>> +			*cpu_next = SWAP_NEXT_NULL;
>>> +			goto new_cluster;
>>> +		}
>>>  	}
>>>  	unlock_cluster(ci);
>>>  
>>> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>  	*scan_base = tmp;
>>>  
>>>  	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
>> 
>> This line is added in a previous patch.  Can we just use
>> 
>>         max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
>
> Sure. This is how I originally had it, but then decided that the other approach
> was a bit clearer. But I don't have a strong opinion, so I'll change it as you
> suggest.

Thanks!

>> 
>> Or, add ALIGN_UP() for this?
>> 
>>> -	tmp += 1;
>>> +	tmp += nr_pages;
>>>  	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>>>  
>>>  	return true;
>>>  }
>>>  
>>> +/*
>>> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>> + * might involve allocating a new cluster for current CPU too.
>>> + */
>>> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> +	unsigned long *offset, unsigned long *scan_base)
>>> +{
>>> +	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
>>> +}
>>> +
>>>  static void __del_from_avail_list(struct swap_info_struct *p)
>>>  {
>>>  	int nid;
>>> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  	return n_ret;
>>>  }
>>>  
>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>>> +			    unsigned int nr_pages)
>> 
>> IMHO, it's better to make scan_swap_map_slots() to support order > 0
>> instead of making swap_alloc_cluster() to support order != PMD_ORDER.
>> And, we may merge swap_alloc_cluster() with scan_swap_map_slots() after
>> that.
>
> I did consider adding a 5th patch to rename swap_alloc_large() to something like
> swap_alloc_one_ssd_entry() (which would then be used for order=0 too) and
> refactor scan_swap_map_slots() to fully delegate to it for the non-scaning ssd
> allocation case. Would something like that suit?
>
> I have reservations about making scan_swap_map_slots() take an order and be the
> sole entry point:
>
>   - in the non-ssd case, we can't support order!=0

We don't need to check ssd directly; we only support order != 0 if
si->cluster_info != NULL.
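
In other words, roughly:

        /* large orders are only supported when clusters are in use */
        if (order > 0 && !si->cluster_info)
                return 0;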

>   - there is a lot of other logic to deal with falling back to scanning which we
>     would only want to do for order==0, so we would end up with a few ugly
>     conditionals against order.

We don't need to care about them in most cases.  IIUC, only the "goto
scan" after scan_swap_map_try_ssd_cluster() returns false needs to
become "goto no_page" for order != 0.

>   - I was concerned the risk of me introducing a bug when refactoring all that
>     subtle logic was high

IMHO, readability is more important for long term maintenance.  So, we
need to refactor the existing code for that.

> What do you think? Is not making scan_swap_map_slots() support order > 0 a deal
> breaker for you?

I just think that it's better to use scan_swap_map_slots() for any order
other than PMD_ORDER.  In that way, we share as much code as possible.
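
So the dispatch in get_swap_pages() would look roughly like this
(illustrative only):

        if (size == SWAPFILE_CLUSTER)
                n_ret = swap_alloc_cluster(si, swp_entries);
        else
                n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
                                            swp_entries, ilog2(size));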

--
Best Regards,
Huang, Ying
  
Barry Song Nov. 2, 2023, 7:40 a.m. UTC | #4
On Wed, Oct 25, 2023 at 10:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> The upcoming anonymous small-sized THP feature enables performance
> improvements by allocating large folios for anonymous memory. However
> I've observed that on an arm64 system running a parallel workload (e.g.
> kernel compilation) across many cores, under high memory pressure, the
> speed regresses. This is due to bottlenecking on the increased number of
> TLBIs added due to all the extra folio splitting.
>
> Therefore, solve this regression by adding support for swapping out
> small-sized THP without needing to split the folio, just like is already
> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
> enabled, and when the swap backing store is a non-rotating block device.
> These are the same constraints as for the existing PMD-sized THP
> swap-out support.

Hi Ryan,

We had a problem while enabling THP_SWAP on arm64; see commit
d0637c505f8 ("arm64: enable THP_SWAP for arm64").

This means we have to depend on !system_supports_mte():
static inline bool arch_thp_swp_supported(void)
{
        return !system_supports_mte();
}

Do we have the same problem for small-sized THP? If yes, MTE is already
widely present in various arm64 SoCs. Does that mean we should begin to
fix the issue now?


>
> Note that no attempt is made to swap-in THP here - this is still done
> page-by-page, like for PMD-sized THP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [1, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller falls back to splitting the folio and
> allocates individual entries (as per existing PMD-sized THP fallback).
>
> The per-order current clusters are maintained per-cpu using the existing
> infrastructure. This is done to avoid interleving pages from different
> tasks, which would prevent IO being batched. This is already done for
> the order-0 allocations so we follow the same pattern.
> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
> for order-0.
>
> As is done for order-0 per-cpu clusters, the scanner now can steal
> order-0 entries from any per-cpu-per-order reserved cluster. This
> ensures that when the swap file is getting full, space doesn't get tied
> up in the per-cpu reserves.
>
> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
> device as the swap device and from inside a memcg limited to 40G memory.
> I've then run `usemem` from vm-scalability with 70 processes (each has
> its own core), each allocating and writing 1G of memory. I've repeated
> everything 5 times and taken the mean:
>
> Mean Performance Improvement vs 4K/baseline
>
> | alloc size |            baseline |       + this series |
> |            |  v6.6-rc4+anonfolio |                     |
> |:-----------|--------------------:|--------------------:|
> | 4K Page    |                0.0% |                4.9% |
> | 64K THP    |              -44.1% |               10.7% |
> | 2M THP     |               56.0% |               65.9% |
>
> So with this change, the regression for 64K swap performance goes away
> and 4K and 2M swap improves slightly too.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  10 +--
>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>  mm/vmscan.c          |  10 +--
>  3 files changed, 119 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0ca8aaa098ba..ccbca5db851b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -295,11 +295,11 @@ struct swap_info_struct {
>         unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>         unsigned int __percpu *cpu_next;/*
>                                          * Likely next allocation offset. We
> -                                        * assign a cluster to each CPU, so each
> -                                        * CPU can allocate swap entry from its
> -                                        * own cluster and swapout sequentially.
> -                                        * The purpose is to optimize swapout
> -                                        * throughput.
> +                                        * assign a cluster per-order to each
> +                                        * CPU, so each CPU can allocate swap
> +                                        * entry from its own cluster and
> +                                        * swapout sequentially. The purpose is
> +                                        * to optimize swapout throughput.
>                                          */
>         struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>         struct block_device *bdev;      /* swap device or bdev of swap file */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 94f7cc225eb9..b50bce50bed9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>
>  /*
>   * The cluster corresponding to page_nr will be used. The cluster will be
> - * removed from free cluster list and its usage counter will be increased.
> + * removed from free cluster list and its usage counter will be increased by
> + * count.
>   */
> -static void inc_cluster_info_page(struct swap_info_struct *p,
> -       struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void add_cluster_info_page(struct swap_info_struct *p,
> +       struct swap_cluster_info *cluster_info, unsigned long page_nr,
> +       unsigned long count)
>  {
>         unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>
> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>         if (cluster_is_free(&cluster_info[idx]))
>                 alloc_cluster(p, idx);
>
> -       VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
> +       VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>         cluster_set_count(&cluster_info[idx],
> -               cluster_count(&cluster_info[idx]) + 1);
> +               cluster_count(&cluster_info[idx]) + count);
> +}
> +
> +/*
> + * The cluster corresponding to page_nr will be used. The cluster will be
> + * removed from free cluster list and its usage counter will be increased.
> + */
> +static void inc_cluster_info_page(struct swap_info_struct *p,
> +       struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +{
> +       add_cluster_info_page(p, cluster_info, page_nr, 1);
>  }
>
>  /*
> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>   * cluster list. Avoiding such abuse to avoid list corruption.
>   */
>  static bool
> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> -       unsigned long offset)
> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> +       unsigned long offset, int order)
>  {
>         bool conflict;
>
> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>         if (!conflict)
>                 return false;
>
> -       *this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
> +       this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
>         return true;
>  }
>
>  /*
> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> - * might involve allocating a new cluster for current CPU too.
> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
> + * cluster list. Avoiding such abuse to avoid list corruption.
>   */
> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> -       unsigned long *offset, unsigned long *scan_base)
> +static bool
> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> +       unsigned long offset)
> +{
> +       return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
> +}
> +
> +/*
> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
> + * entry pool (a cluster). This might involve allocating a new cluster for
> + * current CPU too.
> + */
> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> +       unsigned long *offset, unsigned long *scan_base, int order)
>  {
>         struct swap_cluster_info *ci;
> -       unsigned int tmp, max;
> +       unsigned int tmp, max, i;
>         unsigned int *cpu_next;
> +       unsigned int nr_pages = 1 << order;
>
>  new_cluster:
> -       cpu_next = this_cpu_ptr(si->cpu_next);
> +       cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>         tmp = *cpu_next;
>         if (tmp == SWAP_NEXT_NULL) {
>                 if (!cluster_list_empty(&si->free_clusters)) {
> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>          * reserve a new cluster.
>          */
>         ci = lock_cluster(si, tmp);
> -       if (si->swap_map[tmp]) {
> -               unlock_cluster(ci);
> -               *cpu_next = SWAP_NEXT_NULL;
> -               goto new_cluster;
> +       for (i = 0; i < nr_pages; i++) {
> +               if (si->swap_map[tmp + i]) {
> +                       unlock_cluster(ci);
> +                       *cpu_next = SWAP_NEXT_NULL;
> +                       goto new_cluster;
> +               }
>         }
>         unlock_cluster(ci);
>
> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>         *scan_base = tmp;
>
>         max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
> -       tmp += 1;
> +       tmp += nr_pages;
>         *cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>
>         return true;
>  }
>
> +/*
> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> + * might involve allocating a new cluster for current CPU too.
> + */
> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> +       unsigned long *offset, unsigned long *scan_base)
> +{
> +       return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
> +}
> +
>  static void __del_from_avail_list(struct swap_info_struct *p)
>  {
>         int nid;
> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>         return n_ret;
>  }
>
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
> +                           unsigned int nr_pages)
>  {
> -       unsigned long idx;
>         struct swap_cluster_info *ci;
> -       unsigned long offset;
> +       unsigned long offset, scan_base;
> +       int order = ilog2(nr_pages);
> +       bool ret;
>
>         /*
> -        * Should not even be attempting cluster allocations when huge
> +        * Should not even be attempting large allocations when huge
>          * page swap is disabled.  Warn and fail the allocation.
>          */
> -       if (!IS_ENABLED(CONFIG_THP_SWAP)) {
> +       if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> +           nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
> +           !is_power_of_2(nr_pages)) {
>                 VM_WARN_ON_ONCE(1);
>                 return 0;
>         }
>
> -       if (cluster_list_empty(&si->free_clusters))
> +       /*
> +        * Swapfile is not block device or not using clusters so unable to
> +        * allocate large entries.
> +        */
> +       if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>                 return 0;
>
> -       idx = cluster_list_first(&si->free_clusters);
> -       offset = idx * SWAPFILE_CLUSTER;
> -       ci = lock_cluster(si, offset);
> -       alloc_cluster(si, idx);
> -       cluster_set_count(ci, SWAPFILE_CLUSTER);
> +again:
> +       /*
> +        * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> +        * so indicate that we are scanning to synchronise with swapoff.
> +        */
> +       si->flags += SWP_SCANNING;
> +       ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> +       si->flags -= SWP_SCANNING;
> +
> +       /*
> +        * If we failed to allocate or if swapoff is waiting for us (due to lock
> +        * being dropped for discard above), return immediately.
> +        */
> +       if (!ret || !(si->flags & SWP_WRITEOK))
> +               return 0;
>
> -       memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> +       if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
> +               goto again;
> +
> +       ci = lock_cluster(si, offset);
> +       memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
> +       add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>         unlock_cluster(ci);
> -       swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> -       *slot = swp_entry(si->type, offset);
>
> +       swap_range_alloc(si, offset, nr_pages);
> +       *slot = swp_entry(si->type, offset);
>         return 1;
>  }
>
> @@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>         int node;
>
>         /* Only single cluster request supported */
> -       WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> +       WARN_ON_ONCE(n_goal > 1 && size > 1);
>
>         spin_lock(&swap_avail_lock);
>
> @@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>                         spin_unlock(&si->lock);
>                         goto nextsi;
>                 }
> -               if (size == SWAPFILE_CLUSTER) {
> -                       if (si->flags & SWP_BLKDEV)
> -                               n_ret = swap_alloc_cluster(si, swp_entries);
> +               if (size > 1) {
> +                       n_ret = swap_alloc_large(si, swp_entries, size);
>                 } else
>                         n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>                                                     n_goal, swp_entries);
>                 spin_unlock(&si->lock);
> -               if (n_ret || size == SWAPFILE_CLUSTER)
> +               if (n_ret || size > 1)
>                         goto check_out;
>                 cond_resched();
>
> @@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         if (p->bdev && bdev_nonrot(p->bdev)) {
>                 int cpu;
>                 unsigned long ci, nr_cluster;
> +               int nr_order;
> +               int i;
>
>                 p->flags |= SWP_SOLIDSTATE;
>                 p->cluster_next_cpu = alloc_percpu(unsigned int);
> @@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>                 for (ci = 0; ci < nr_cluster; ci++)
>                         spin_lock_init(&((cluster_info + ci)->lock));
>
> -               p->cpu_next = alloc_percpu(unsigned int);
> +               nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
> +               p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
> +                                            __alignof__(unsigned int));
>                 if (!p->cpu_next) {
>                         error = -ENOMEM;
>                         goto bad_swap_unlock_inode;
>                 }
> -               for_each_possible_cpu(cpu)
> -                       per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
> +               for_each_possible_cpu(cpu) {
> +                       unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
> +
> +                       for (i = 0; i < nr_order; i++)
> +                               cpu_next[i] = SWAP_NEXT_NULL;
> +               }
>         } else {
>                 atomic_inc(&nr_rotate_swap);
>                 inced_nr_rotate_swap = true;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2cc0cb41fb32..ea19710aa4cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                         if (!can_split_folio(folio, NULL))
>                                                 goto activate_locked;
>                                         /*
> -                                        * Split folios without a PMD map right
> -                                        * away. Chances are some or all of the
> -                                        * tail pages can be freed without IO.
> +                                        * Split PMD-mappable folios without a
> +                                        * PMD map right away. Chances are some
> +                                        * or all of the tail pages can be freed
> +                                        * without IO.
>                                          */
> -                                       if (!folio_entire_mapcount(folio) &&
> +                                       if (folio_test_pmd_mappable(folio) &&
> +                                           !folio_entire_mapcount(folio) &&
>                                             split_folio_to_list(folio,
>                                                                 folio_list))
>                                                 goto activate_locked;
> --
> 2.25.1
>

Thanks
Barry
  
Ryan Roberts Nov. 2, 2023, 10:21 a.m. UTC | #5
On 02/11/2023 07:40, Barry Song wrote:
> On Wed, Oct 25, 2023 at 10:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> The upcoming anonymous small-sized THP feature enables performance
>> improvements by allocating large folios for anonymous memory. However
>> I've observed that on an arm64 system running a parallel workload (e.g.
>> kernel compilation) across many cores, under high memory pressure, the
>> speed regresses. This is due to bottlenecking on the increased number of
>> TLBIs added due to all the extra folio splitting.
>>
>> Therefore, solve this regression by adding support for swapping out
>> small-sized THP without needing to split the folio, just like is already
>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>> enabled, and when the swap backing store is a non-rotating block device.
>> These are the same constraints as for the existing PMD-sized THP
>> swap-out support.
> 
> Hi Ryan,
> 
> We had a problem while enabling THP SWP on arm64,
> commit d0637c505f8 ("arm64: enable THP_SWAP for arm64")
> 
> this means we have to depend on !system_supports_mte().
> static inline bool arch_thp_swp_supported(void)
> {
>         return !system_supports_mte();
> }
> 
> Do we have the same problem for small-sized THP? If yes, given that MTE
> already exists widely in various ARM64 SoCs, does that mean we should begin
> to fix the issue now?

Hi Barry,

I'm guessing that the current problem for MTE is that when it saves the tags
prior to swap out, it assumes all folios are small (i.e. base page size) and
therefore doesn't have the logic to iterate over a large folio, saving the tags
for each page?

If that's the issue, then yes we have the same problem for small-sized THP, but
this is all safe - arch_thp_swp_supported() will return false and we continue to
use that signal to cause the page to be split prior to swap out.
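
For reference, the gate is consulted when allocating swap space for the folio;
roughly this (paraphrased sketch of folio_alloc_swap(), not verbatim mainline
code):

	if (folio_test_large(folio)) {
		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
			get_swap_pages(1, &entry, folio_nr_pages(folio));
		goto out;
	}

So when arch_thp_swp_supported() returns false, the large folio gets no swap
entries here, add_to_swap() fails, and vmscan falls back to splitting the folio
before swap-out.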

But, yes, it would be nice to fix that! And if I've understood the problem
correctly, it doesn't sound like it should be too hard? Is this something you
are volunteering for?? :)

Thanks,
Ryan


> 
> 
>>
>> Note that no attempt is made to swap-in THP here - this is still done
>> page-by-page, like for PMD-sized THP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [1, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller falls back to splitting the folio and
>> allocates individual entries (as per existing PMD-sized THP fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
>> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
>> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
>> for order-0.
>>
>> As is done for order-0 per-cpu clusters, the scanner now can steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>> device as the swap device and from inside a memcg limited to 40G memory.
>> I've then run `usemem` from vm-scalability with 70 processes (each has
>> its own core), each allocating and writing 1G of memory. I've repeated
>> everything 5 times and taken the mean:
>>
>> Mean Performance Improvement vs 4K/baseline
>>
>> | alloc size |            baseline |       + this series |
>> |            |  v6.6-rc4+anonfolio |                     |
>> |:-----------|--------------------:|--------------------:|
>> | 4K Page    |                0.0% |                4.9% |
>> | 64K THP    |              -44.1% |               10.7% |
>> | 2M THP     |               56.0% |               65.9% |
>>
>> So with this change, the regression for 64K swap performance goes away
>> and 4K and 2M swap improves slightly too.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/swap.h |  10 +--
>>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>>  mm/vmscan.c          |  10 +--
>>  3 files changed, 119 insertions(+), 50 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 0ca8aaa098ba..ccbca5db851b 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -295,11 +295,11 @@ struct swap_info_struct {
>>         unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>>         unsigned int __percpu *cpu_next;/*
>>                                          * Likely next allocation offset. We
>> -                                        * assign a cluster to each CPU, so each
>> -                                        * CPU can allocate swap entry from its
>> -                                        * own cluster and swapout sequentially.
>> -                                        * The purpose is to optimize swapout
>> -                                        * throughput.
>> +                                        * assign a cluster per-order to each
>> +                                        * CPU, so each CPU can allocate swap
>> +                                        * entry from its own cluster and
>> +                                        * swapout sequentially. The purpose is
>> +                                        * to optimize swapout throughput.
>>                                          */
>>         struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>>         struct block_device *bdev;      /* swap device or bdev of swap file */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 94f7cc225eb9..b50bce50bed9 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>
>>  /*
>>   * The cluster corresponding to page_nr will be used. The cluster will be
>> - * removed from free cluster list and its usage counter will be increased.
>> + * removed from free cluster list and its usage counter will be increased by
>> + * count.
>>   */
>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>> -       struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +static void add_cluster_info_page(struct swap_info_struct *p,
>> +       struct swap_cluster_info *cluster_info, unsigned long page_nr,
>> +       unsigned long count)
>>  {
>>         unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>
>> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>         if (cluster_is_free(&cluster_info[idx]))
>>                 alloc_cluster(p, idx);
>>
>> -       VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>> +       VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>         cluster_set_count(&cluster_info[idx],
>> -               cluster_count(&cluster_info[idx]) + 1);
>> +               cluster_count(&cluster_info[idx]) + count);
>> +}
>> +
>> +/*
>> + * The cluster corresponding to page_nr will be used. The cluster will be
>> + * removed from free cluster list and its usage counter will be increased.
>> + */
>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>> +       struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +{
>> +       add_cluster_info_page(p, cluster_info, page_nr, 1);
>>  }
>>
>>  /*
>> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>   * cluster list. Avoiding such abuse to avoid list corruption.
>>   */
>>  static bool
>> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> -       unsigned long offset)
>> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> +       unsigned long offset, int order)
>>  {
>>         bool conflict;
>>
>> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>         if (!conflict)
>>                 return false;
>>
>> -       *this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
>> +       this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
>>         return true;
>>  }
>>
>>  /*
>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> - * might involve allocating a new cluster for current CPU too.
>> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
>> + * cluster list. Avoiding such abuse to avoid list corruption.
>>   */
>> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> -       unsigned long *offset, unsigned long *scan_base)
>> +static bool
>> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> +       unsigned long offset)
>> +{
>> +       return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
>> +}
>> +
>> +/*
>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>> + * entry pool (a cluster). This might involve allocating a new cluster for
>> + * current CPU too.
>> + */
>> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> +       unsigned long *offset, unsigned long *scan_base, int order)
>>  {
>>         struct swap_cluster_info *ci;
>> -       unsigned int tmp, max;
>> +       unsigned int tmp, max, i;
>>         unsigned int *cpu_next;
>> +       unsigned int nr_pages = 1 << order;
>>
>>  new_cluster:
>> -       cpu_next = this_cpu_ptr(si->cpu_next);
>> +       cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>>         tmp = *cpu_next;
>>         if (tmp == SWAP_NEXT_NULL) {
>>                 if (!cluster_list_empty(&si->free_clusters)) {
>> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>          * reserve a new cluster.
>>          */
>>         ci = lock_cluster(si, tmp);
>> -       if (si->swap_map[tmp]) {
>> -               unlock_cluster(ci);
>> -               *cpu_next = SWAP_NEXT_NULL;
>> -               goto new_cluster;
>> +       for (i = 0; i < nr_pages; i++) {
>> +               if (si->swap_map[tmp + i]) {
>> +                       unlock_cluster(ci);
>> +                       *cpu_next = SWAP_NEXT_NULL;
>> +                       goto new_cluster;
>> +               }
>>         }
>>         unlock_cluster(ci);
>>
>> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>         *scan_base = tmp;
>>
>>         max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
>> -       tmp += 1;
>> +       tmp += nr_pages;
>>         *cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>>
>>         return true;
>>  }
>>
>> +/*
>> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> + * might involve allocating a new cluster for current CPU too.
>> + */
>> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> +       unsigned long *offset, unsigned long *scan_base)
>> +{
>> +       return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
>> +}
>> +
>>  static void __del_from_avail_list(struct swap_info_struct *p)
>>  {
>>         int nid;
>> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>         return n_ret;
>>  }
>>
>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>> +                           unsigned int nr_pages)
>>  {
>> -       unsigned long idx;
>>         struct swap_cluster_info *ci;
>> -       unsigned long offset;
>> +       unsigned long offset, scan_base;
>> +       int order = ilog2(nr_pages);
>> +       bool ret;
>>
>>         /*
>> -        * Should not even be attempting cluster allocations when huge
>> +        * Should not even be attempting large allocations when huge
>>          * page swap is disabled.  Warn and fail the allocation.
>>          */
>> -       if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>> +       if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>> +           nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
>> +           !is_power_of_2(nr_pages)) {
>>                 VM_WARN_ON_ONCE(1);
>>                 return 0;
>>         }
>>
>> -       if (cluster_list_empty(&si->free_clusters))
>> +       /*
>> +        * Swapfile is not block device or not using clusters so unable to
>> +        * allocate large entries.
>> +        */
>> +       if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>>                 return 0;
>>
>> -       idx = cluster_list_first(&si->free_clusters);
>> -       offset = idx * SWAPFILE_CLUSTER;
>> -       ci = lock_cluster(si, offset);
>> -       alloc_cluster(si, idx);
>> -       cluster_set_count(ci, SWAPFILE_CLUSTER);
>> +again:
>> +       /*
>> +        * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +        * so indicate that we are scanning to synchronise with swapoff.
>> +        */
>> +       si->flags += SWP_SCANNING;
>> +       ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +       si->flags -= SWP_SCANNING;
>> +
>> +       /*
>> +        * If we failed to allocate or if swapoff is waiting for us (due to lock
>> +        * being dropped for discard above), return immediately.
>> +        */
>> +       if (!ret || !(si->flags & SWP_WRITEOK))
>> +               return 0;
>>
>> -       memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>> +       if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
>> +               goto again;
>> +
>> +       ci = lock_cluster(si, offset);
>> +       memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>> +       add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>         unlock_cluster(ci);
>> -       swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>> -       *slot = swp_entry(si->type, offset);
>>
>> +       swap_range_alloc(si, offset, nr_pages);
>> +       *slot = swp_entry(si->type, offset);
>>         return 1;
>>  }
>>
>> @@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>         int node;
>>
>>         /* Only single cluster request supported */
>> -       WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>> +       WARN_ON_ONCE(n_goal > 1 && size > 1);
>>
>>         spin_lock(&swap_avail_lock);
>>
>> @@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>                         spin_unlock(&si->lock);
>>                         goto nextsi;
>>                 }
>> -               if (size == SWAPFILE_CLUSTER) {
>> -                       if (si->flags & SWP_BLKDEV)
>> -                               n_ret = swap_alloc_cluster(si, swp_entries);
>> +               if (size > 1) {
>> +                       n_ret = swap_alloc_large(si, swp_entries, size);
>>                 } else
>>                         n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>>                                                     n_goal, swp_entries);
>>                 spin_unlock(&si->lock);
>> -               if (n_ret || size == SWAPFILE_CLUSTER)
>> +               if (n_ret || size > 1)
>>                         goto check_out;
>>                 cond_resched();
>>
>> @@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>         if (p->bdev && bdev_nonrot(p->bdev)) {
>>                 int cpu;
>>                 unsigned long ci, nr_cluster;
>> +               int nr_order;
>> +               int i;
>>
>>                 p->flags |= SWP_SOLIDSTATE;
>>                 p->cluster_next_cpu = alloc_percpu(unsigned int);
>> @@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>                 for (ci = 0; ci < nr_cluster; ci++)
>>                         spin_lock_init(&((cluster_info + ci)->lock));
>>
>> -               p->cpu_next = alloc_percpu(unsigned int);
>> +               nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
>> +               p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
>> +                                            __alignof__(unsigned int));
>>                 if (!p->cpu_next) {
>>                         error = -ENOMEM;
>>                         goto bad_swap_unlock_inode;
>>                 }
>> -               for_each_possible_cpu(cpu)
>> -                       per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
>> +               for_each_possible_cpu(cpu) {
>> +                       unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
>> +
>> +                       for (i = 0; i < nr_order; i++)
>> +                               cpu_next[i] = SWAP_NEXT_NULL;
>> +               }
>>         } else {
>>                 atomic_inc(&nr_rotate_swap);
>>                 inced_nr_rotate_swap = true;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2cc0cb41fb32..ea19710aa4cd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>                                         if (!can_split_folio(folio, NULL))
>>                                                 goto activate_locked;
>>                                         /*
>> -                                        * Split folios without a PMD map right
>> -                                        * away. Chances are some or all of the
>> -                                        * tail pages can be freed without IO.
>> +                                        * Split PMD-mappable folios without a
>> +                                        * PMD map right away. Chances are some
>> +                                        * or all of the tail pages can be freed
>> +                                        * without IO.
>>                                          */
>> -                                       if (!folio_entire_mapcount(folio) &&
>> +                                       if (folio_test_pmd_mappable(folio) &&
>> +                                           !folio_entire_mapcount(folio) &&
>>                                             split_folio_to_list(folio,
>>                                                                 folio_list))
>>                                                 goto activate_locked;
>> --
>> 2.25.1
>>
> 
> Thanks
> Barry
  
Barry Song Nov. 2, 2023, 10:36 p.m. UTC | #6
> But, yes, it would be nice to fix that! And if I've understood the problem
> correctly, it doesn't sound like it should be too hard? Is this something you
> are volunteering for?? :)

Unfortunately, right now I don't have real hardware with MTE that can run the
latest kernel, but I have written an RFC; it would be nice to get someone to
test it. Let me figure out if we can find someone :-)

[RFC PATCH] arm64: mm: swap: save and restore mte tags for large folios

This patch makes MTE tags saving and restoring support large folios,
then we don't need to split them into base pages for swapping on
ARM64 SoCs with MTE.

---
 arch/arm64/include/asm/pgtable.h | 21 ++++-----------------
 arch/arm64/mm/mteswap.c          | 20 ++++++++++++++++++++
 2 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7f7d9b1df4e5..b12783dca00a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,12 +45,6 @@
 	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline bool arch_thp_swp_supported(void)
-{
-	return !system_supports_mte();
-}
-#define arch_thp_swp_supported arch_thp_swp_supported
-
 /*
  * Outside of a few very special situations (e.g. hibernation), we always
  * use broadcast TLB invalidation instructions, therefore a spurious page
@@ -1028,12 +1022,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 #ifdef CONFIG_ARM64_MTE
 
 #define __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
-{
-	if (system_supports_mte())
-		return mte_save_tags(page);
-	return 0;
-}
+#define arch_prepare_to_swap arch_prepare_to_swap
+extern int arch_prepare_to_swap(struct page *page);
 
 #define __HAVE_ARCH_SWAP_INVALIDATE
 static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
@@ -1049,11 +1039,8 @@ static inline void arch_swap_invalidate_area(int type)
 }
 
 #define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
-{
-	if (system_supports_mte())
-		mte_restore_tags(entry, &folio->page);
-}
+#define arch_swap_restore arch_swap_restore
+extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
 
 #endif /* CONFIG_ARM64_MTE */
 
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..e5637e931e4f 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -83,3 +83,23 @@ void mte_invalidate_tags_area(int type)
 	}
 	xa_unlock(&mte_pages);
 }
+
+int arch_prepare_to_swap(struct page *page)
+{
+	if (system_supports_mte()) {
+		struct folio *folio = page_folio(page);
+		long i, nr = folio_nr_pages(folio);
+		for (i = 0; i < nr; i++)
+			return mte_save_tags(folio_page(folio, i));
+	}
+	return 0;
+}
+
+void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+{
+	if (system_supports_mte()) {
+		long i, nr = folio_nr_pages(folio);
+		for (i = 0; i < nr; i++)
+			mte_restore_tags(entry, folio_page(folio, i));
+	}
+}
-- 
2.25.1

> Thanks,
> Ryan

Barry
  
Ryan Roberts Nov. 3, 2023, 11:31 a.m. UTC | #7
On 02/11/2023 22:36, Barry Song wrote:
>> But, yes, it would be nice to fix that! And if I've understood the problem
>> correctly, it doesn't sound like it should be too hard? Is this something you
>> are volunteering for?? :)
> 
> Unfortunately, right now I don't have real hardware with MTE that can run the
> latest kernel, but I have written an RFC; it would be nice to get someone to
> test it. Let me figure out if we can find someone :-)

OK, let me know if you find someone. Otherwise I can have a hunt around to see
if I can test it.

> 
> [RFC PATCH] arm64: mm: swap: save and restore mte tags for large folios
> 
> This patch makes MTE tags saving and restoring support large folios,
> then we don't need to split them into base pages for swapping on
> ARM64 SoCs with MTE.
> 
> ---
>  arch/arm64/include/asm/pgtable.h | 21 ++++-----------------
>  arch/arm64/mm/mteswap.c          | 20 ++++++++++++++++++++
>  2 files changed, 24 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7f7d9b1df4e5..b12783dca00a 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
>  	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported

IIRC, arm64 was the only arch implementing this, so perhaps it should be ripped
out from the core code now?

> -
>  /*
>   * Outside of a few very special situations (e.g. hibernation), we always
>   * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1028,12 +1022,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  #ifdef CONFIG_ARM64_MTE
>  
>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> -	if (system_supports_mte())
> -		return mte_save_tags(page);
> -	return 0;
> -}
> +#define arch_prepare_to_swap arch_prepare_to_swap
> +extern int arch_prepare_to_swap(struct page *page);

I think it would be better to modify this API to take a folio explicitly. The
caller already has the folio.
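
Just to illustrate, the arm64 declaration could become something like this
(sketch only; the generic hook definition and the call site would need the same
folio-based treatment):

	#define __HAVE_ARCH_PREPARE_TO_SWAP
	#define arch_prepare_to_swap arch_prepare_to_swap
	extern int arch_prepare_to_swap(struct folio *folio);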

>  
>  #define __HAVE_ARCH_SWAP_INVALIDATE
>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1049,11 +1039,8 @@ static inline void arch_swap_invalidate_area(int type)
>  }
>  
>  #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> -	if (system_supports_mte())
> -		mte_restore_tags(entry, &folio->page);
> -}
> +#define arch_swap_restore arch_swap_restore
> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>  
>  #endif /* CONFIG_ARM64_MTE */
>  
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..e5637e931e4f 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -83,3 +83,23 @@ void mte_invalidate_tags_area(int type)
>  	}
>  	xa_unlock(&mte_pages);
>  }
> +
> +int arch_prepare_to_swap(struct page *page)
> +{
> +	if (system_supports_mte()) {
> +		struct folio *folio = page_folio(page);
> +		long i, nr = folio_nr_pages(folio);
> +		for (i = 0; i < nr; i++)
> +			return mte_save_tags(folio_page(folio, i));

This will return after saving the first page of the folio! You will need to add
each page in a loop, and if you get an error at any point, you will need to
remove the pages that you already added successfully, by calling
arch_swap_invalidate_page() as far as I can see. Steven can you confirm?
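
Something like the below is the shape I have in mind - just a sketch, not
tested, and it assumes the folio already has its swap entry assigned at this
point so that page_swap_entry() is valid for each subpage (it also takes a
folio, as per my comment above):

	int arch_prepare_to_swap(struct folio *folio)
	{
		long i, nr;
		int err;

		if (!system_supports_mte())
			return 0;

		nr = folio_nr_pages(folio);
		for (i = 0; i < nr; i++) {
			err = mte_save_tags(folio_page(folio, i));
			if (err) {
				/* Unwind: drop tags already saved for earlier subpages. */
				while (i--) {
					swp_entry_t entry;

					entry = page_swap_entry(folio_page(folio, i));
					arch_swap_invalidate_page(swp_type(entry),
								  swp_offset(entry));
				}
				return err;
			}
		}
		return 0;
	}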

> +	}
> +	return 0;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> +{
> +	if (system_supports_mte()) {
> +		long i, nr = folio_nr_pages(folio);
> +		for (i = 0; i < nr; i++)
> +			mte_restore_tags(entry, folio_page(folio, i));

swap-in currently doesn't support large folios - everything is a single page
folio. So this isn't technically needed. But from the API POV, it seems
reasonable to make this change - except your implementation is broken. You are
currently setting every page in the folio to use the same tags as the first
page. You need to increment the swap entry for each page.
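
i.e. something like this (sketch only; the entry passed in is the entry for the
head page, and we step it forward as we walk the subpages):

	void arch_swap_restore(swp_entry_t entry, struct folio *folio)
	{
		if (system_supports_mte()) {
			long i, nr = folio_nr_pages(folio);

			/* Each subpage's tags live at consecutive swap offsets. */
			for (i = 0; i < nr; i++) {
				mte_restore_tags(entry, folio_page(folio, i));
				entry.val++;
			}
		}
	}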

Thanks,
Ryan


> +	}
> +}
  
Ryan Roberts Nov. 3, 2023, 11:42 a.m. UTC | #8
On 31/10/2023 08:12, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 30/10/2023 08:18, Huang, Ying wrote:
>>> Hi, Ryan,
>>>
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> The upcoming anonymous small-sized THP feature enables performance
>>>> improvements by allocating large folios for anonymous memory. However
>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>> kernel compilation) across many cores, under high memory pressure, the
>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>> TLBIs added due to all the extra folio splitting.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out
>>>> small-sized THP without needing to split the folio, just like is already
>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>> enabled, and when the swap backing store is a non-rotating block device.
>>>> These are the same constraints as for the existing PMD-sized THP
>>>> swap-out support.
>>>>
>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>> page-by-page, like for PMD-sized THP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that it
>>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>> order and allocating sequentially from it until the cluster is full.
>>>> This ensures that we don't need to search the map and we get no
>>>> fragmentation due to alignment padding for different orders in the
>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>> fail the allocation and the caller falls back to splitting the folio and
>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>
>>>> The per-order current clusters are maintained per-cpu using the existing
>>>> infrastructure. This is done to avoid interleving pages from different
>>>> tasks, which would prevent IO being batched. This is already done for
>>>> the order-0 allocations so we follow the same pattern.
>>>> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
>>>> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
>>>> for order-0.
>>>>
>>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>>> ensures that when the swap file is getting full, space doesn't get tied
>>>> up in the per-cpu reserves.
>>>>
>>>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>>>> device as the swap device and from inside a memcg limited to 40G memory.
>>>> I've then run `usemem` from vm-scalability with 70 processes (each has
>>>> its own core), each allocating and writing 1G of memory. I've repeated
>>>> everything 5 times and taken the mean:
>>>>
>>>> Mean Performance Improvement vs 4K/baseline
>>>>
>>>> | alloc size |            baseline |       + this series |
>>>> |            |  v6.6-rc4+anonfolio |                     |
>>>> |:-----------|--------------------:|--------------------:|
>>>> | 4K Page    |                0.0% |                4.9% |
>>>> | 64K THP    |              -44.1% |               10.7% |
>>>> | 2M THP     |               56.0% |               65.9% |
>>>>
>>>> So with this change, the regression for 64K swap performance goes away
>>>> and 4K and 2M swap improves slightly too.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/swap.h |  10 +--
>>>>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>>>>  mm/vmscan.c          |  10 +--
>>>>  3 files changed, 119 insertions(+), 50 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 0ca8aaa098ba..ccbca5db851b 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -295,11 +295,11 @@ struct swap_info_struct {
>>>>  	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>>>>  	unsigned int __percpu *cpu_next;/*
>>>>  					 * Likely next allocation offset. We
>>>> -					 * assign a cluster to each CPU, so each
>>>> -					 * CPU can allocate swap entry from its
>>>> -					 * own cluster and swapout sequentially.
>>>> -					 * The purpose is to optimize swapout
>>>> -					 * throughput.
>>>> +					 * assign a cluster per-order to each
>>>> +					 * CPU, so each CPU can allocate swap
>>>> +					 * entry from its own cluster and
>>>> +					 * swapout sequentially. The purpose is
>>>> +					 * to optimize swapout throughput.
>>>>  					 */
>>>
>>> This is kind of hard to understand.  Better to define some intermediate
>>> data structure to improve readability.  For example,
>>>
>>> #ifdef CONFIG_THP_SWAP
>>> #define NR_SWAP_ORDER   PMD_ORDER
>>> #else
>>> #define NR_SWAP_ORDER   1
>>> #endif
>>>
>>> struct percpu_clusters {
>>>         unsigned int alloc_next[NR_SWAP_ORDER];
>>> };
>>>
>>> PMD_ORDER isn't a constant on powerpc, but THP_SWAP isn't supported on
>>> powerpc too.
>>
>> I get your point, but this is just making it more difficult for powerpc to ever
>> enable the feature in future - you're implicitly depending on !powerpc, which
>> seems fragile. How about if I change the first line of the comment to be "per-cpu
>> array indexed by allocation order"? Would that be enough?
> 
> Even if PMD_ORDER isn't constant on powerpc, it's not necessary for
> NR_SWAP_ORDER to be variable.  At least (1 << (NR_SWAP_ORDER-1)) should
> be < SWAPFILE_CLUSTER.  When someone adds THP swap support on powerpc,
> they can choose a reasonable constant for NR_SWAP_ORDER (for example,
> 10 or 7).
> 
>>>
>>>>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>>>>  	struct block_device *bdev;	/* swap device or bdev of swap file */
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index 94f7cc225eb9..b50bce50bed9 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>  
>>>>  /*
>>>>   * The cluster corresponding to page_nr will be used. The cluster will be
>>>> - * removed from free cluster list and its usage counter will be increased.
>>>> + * removed from free cluster list and its usage counter will be increased by
>>>> + * count.
>>>>   */
>>>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>>>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>>> +static void add_cluster_info_page(struct swap_info_struct *p,
>>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>>>> +	unsigned long count)
>>>>  {
>>>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>>>  
>>>> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>>>  	if (cluster_is_free(&cluster_info[idx]))
>>>>  		alloc_cluster(p, idx);
>>>>  
>>>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>>>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>>>  	cluster_set_count(&cluster_info[idx],
>>>> -		cluster_count(&cluster_info[idx]) + 1);
>>>> +		cluster_count(&cluster_info[idx]) + count);
>>>> +}
>>>> +
>>>> +/*
>>>> + * The cluster corresponding to page_nr will be used. The cluster will be
>>>> + * removed from free cluster list and its usage counter will be increased.
>>>> + */
>>>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>>> +{
>>>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>>>  }
>>>>  
>>>>  /*
>>>> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>>>   * cluster list. Avoiding such abuse to avoid list corruption.
>>>>   */
>>>>  static bool
>>>> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>> -	unsigned long offset)
>>>> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>> +	unsigned long offset, int order)
>>>>  {
>>>>  	bool conflict;
>>>>  
>>>> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>>  	if (!conflict)
>>>>  		return false;
>>>>  
>>>> -	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
>>>> +	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
>>>
>>> This is added in the previous patch.  I don't think SWAP_NEXT_NULL is a
>>> good name.  Because NEXT isn't a pointer (while cluster_next is). Better
>>> to name it as SWAP_NEXT_INVALID, etc.
>>
>> ACK, will make change for next version.
> 
> Thanks!
> 
>>>
>>>>  	return true;
>>>>  }
>>>>  
>>>>  /*
>>>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>>> - * might involve allocating a new cluster for current CPU too.
>>>> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
>>>> + * cluster list. Avoiding such abuse to avoid list corruption.
>>>>   */
>>>> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>> -	unsigned long *offset, unsigned long *scan_base)
>>>> +static bool
>>>> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>> +	unsigned long offset)
>>>> +{
>>>> +	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>>>> + * entry pool (a cluster). This might involve allocating a new cluster for
>>>> + * current CPU too.
>>>> + */
>>>> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>>>  {
>>>>  	struct swap_cluster_info *ci;
>>>> -	unsigned int tmp, max;
>>>> +	unsigned int tmp, max, i;
>>>>  	unsigned int *cpu_next;
>>>> +	unsigned int nr_pages = 1 << order;
>>>>  
>>>>  new_cluster:
>>>> -	cpu_next = this_cpu_ptr(si->cpu_next);
>>>> +	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>>>>  	tmp = *cpu_next;
>>>>  	if (tmp == SWAP_NEXT_NULL) {
>>>>  		if (!cluster_list_empty(&si->free_clusters)) {
>>>> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>>  	 * reserve a new cluster.
>>>>  	 */
>>>>  	ci = lock_cluster(si, tmp);
>>>> -	if (si->swap_map[tmp]) {
>>>> -		unlock_cluster(ci);
>>>> -		*cpu_next = SWAP_NEXT_NULL;
>>>> -		goto new_cluster;
>>>> +	for (i = 0; i < nr_pages; i++) {
>>>> +		if (si->swap_map[tmp + i]) {
>>>> +			unlock_cluster(ci);
>>>> +			*cpu_next = SWAP_NEXT_NULL;
>>>> +			goto new_cluster;
>>>> +		}
>>>>  	}
>>>>  	unlock_cluster(ci);
>>>>  
>>>> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>>  	*scan_base = tmp;
>>>>  
>>>>  	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
>>>
>>> This line is added in a previous patch.  Can we just use
>>>
>>>         max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
>>
>> Sure. This is how I originally had it, but then decided that the other approach
>> was a bit clearer. But I don't have a strong opinion, so I'll change it as you
>> suggest.
> 
> Thanks!
> 
>>>
>>> Or, add ALIGN_UP() for this?
>>>
>>>> -	tmp += 1;
>>>> +	tmp += nr_pages;
>>>>  	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>>>>  
>>>>  	return true;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>>> + * might involve allocating a new cluster for current CPU too.
>>>> + */
>>>> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>> +	unsigned long *offset, unsigned long *scan_base)
>>>> +{
>>>> +	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
>>>> +}
>>>> +
>>>>  static void __del_from_avail_list(struct swap_info_struct *p)
>>>>  {
>>>>  	int nid;
>>>> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>  	return n_ret;
>>>>  }
>>>>  
>>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>>>> +			    unsigned int nr_pages)
>>>
>>> IMHO, it's better to make scan_swap_map_slots() to support order > 0
>>> instead of making swap_alloc_cluster() to support order != PMD_ORDER.
>>> And, we may merge swap_alloc_cluster() with scan_swap_map_slots() after
>>> that.
>>
>> I did consider adding a 5th patch to rename swap_alloc_large() to something like
>> swap_alloc_one_ssd_entry() (which would then be used for order=0 too) and
>> refactor scan_swap_map_slots() to fully delegate to it for the non-scanning ssd
>> allocation case. Would something like that suit?
>>
>> I have reservations about making scan_swap_map_slots() take an order and be the
>> sole entry point:
>>
>>   - in the non-ssd case, we can't support order!=0
> 
> We don't need to check for SSD directly; we only support order != 0 if
> si->cluster_info != NULL.
> 
>>   - there is a lot of other logic to deal with falling back to scanning which we
>>     would only want to do for order==0, so we would end up with a few ugly
>>     conditionals against order.
> 
> We don't need to care about them in most cases.  IIUC, only the "goto
> scan" after scan_swap_map_try_ssd_cluster() returns false needs to become
> "goto no_page" for order != 0.
> 
>>   - I was concerned the risk of me introducing a bug when refactoring all that
>>     subtle logic was high
> 
> IMHO, readability is more important for long term maintenance.  So, we
> need to refactor the existing code for that.
> 
>> What do you think? Is not making scan_swap_map_slots() support order > 0 a deal
>> breaker for you?
> 
> I just think that it's better to use scan_swap_map_slots() for any order
> other than PMD_ORDER.  In that way, we share as much code as possible.

OK, I'll take a look at implementing it as you propose, although I likely won't
have bandwidth until start of December. Will repost once I have something.
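
For the avoidance of doubt, the rough direction I'll look at is passing the
order down into the slot scanner, i.e. something like (prototype only, details
to be worked out):

	static int scan_swap_map_slots(struct swap_info_struct *si,
				       unsigned char usage, int nr,
				       swp_entry_t slots[], int order);

with get_swap_pages() deriving the order from entry_size, and the scanning
fallback only taken for order 0.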

Thanks,
Ryan

> 
> --
> Best Regards,
> Huang, Ying
  
Steven Price Nov. 3, 2023, 1:57 p.m. UTC | #9
On 03/11/2023 11:31, Ryan Roberts wrote:
> On 02/11/2023 22:36, Barry Song wrote:
>>> But, yes, it would be nice to fix that! And if I've understood the problem
>>> correctly, it doesn't sound like it should be too hard? Is this something you
>>> are volunteering for?? :)
>>
>> Unfortunately, right now I don't have real hardware with MTE that can run the
>> latest kernel, but I have written an RFC; it would be nice to get someone to
>> test it. Let me figure out if we can find someone :-)
> 
> OK, let me know if you find someone. Otherwise I can have a hunt around to see
> if I can test it.
> 
>>
>> [RFC PATCH] arm64: mm: swap: save and restore mte tags for large folios
>>
>> This patch makes MTE tags saving and restoring support large folios,
>> then we don't need to split them into base pages for swapping on
>> ARM64 SoCs with MTE.
>>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 21 ++++-----------------
>>  arch/arm64/mm/mteswap.c          | 20 ++++++++++++++++++++
>>  2 files changed, 24 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 7f7d9b1df4e5..b12783dca00a 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -45,12 +45,6 @@
>>  	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>> -static inline bool arch_thp_swp_supported(void)
>> -{
>> -	return !system_supports_mte();
>> -}
>> -#define arch_thp_swp_supported arch_thp_swp_supported
> 
> IIRC, arm64 was the only arch implementing this, so perhaps it should be ripped
> out from the core code now?
> 
>> -
>>  /*
>>   * Outside of a few very special situations (e.g. hibernation), we always
>>   * use broadcast TLB invalidation instructions, therefore a spurious page
>> @@ -1028,12 +1022,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>  #ifdef CONFIG_ARM64_MTE
>>  
>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
>> -static inline int arch_prepare_to_swap(struct page *page)
>> -{
>> -	if (system_supports_mte())
>> -		return mte_save_tags(page);
>> -	return 0;
>> -}
>> +#define arch_prepare_to_swap arch_prepare_to_swap
>> +extern int arch_prepare_to_swap(struct page *page);
> 
> I think it would be better to modify this API to take a folio explicitly. The
> caller already has the folio.
> 
>>  
>>  #define __HAVE_ARCH_SWAP_INVALIDATE
>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
>> @@ -1049,11 +1039,8 @@ static inline void arch_swap_invalidate_area(int type)
>>  }
>>  
>>  #define __HAVE_ARCH_SWAP_RESTORE
>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>> -{
>> -	if (system_supports_mte())
>> -		mte_restore_tags(entry, &folio->page);
>> -}
>> +#define arch_swap_restore arch_swap_restore
>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>>  
>>  #endif /* CONFIG_ARM64_MTE */
>>  
>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
>> index a31833e3ddc5..e5637e931e4f 100644
>> --- a/arch/arm64/mm/mteswap.c
>> +++ b/arch/arm64/mm/mteswap.c
>> @@ -83,3 +83,23 @@ void mte_invalidate_tags_area(int type)
>>  	}
>>  	xa_unlock(&mte_pages);
>>  }
>> +
>> +int arch_prepare_to_swap(struct page *page)
>> +{
>> +	if (system_supports_mte()) {
>> +		struct folio *folio = page_folio(page);
>> +		long i, nr = folio_nr_pages(folio);
>> +		for (i = 0; i < nr; i++)
>> +			return mte_save_tags(folio_page(folio, i));
> 
> This will return after saving the first page of the folio! You will need to add
> each page in a loop, and if you get an error at any point, you will need to
> remove the pages that you already added successfully, by calling
> arch_swap_invalidate_page() as far as I can see. Steven can you confirm?

Yes, that's right. mte_save_tags() needs to allocate memory, so it can fail,
and if it fails then arch_prepare_to_swap() would need to put things back
how they were with calls to mte_invalidate_tags() (although I think
you'd actually want to refactor to create a function which takes a
struct page *).
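
Something like the below, perhaps (sketch only, and the helper name is made
up; it just wraps the existing mte_invalidate_tags()):

	/* Hypothetical helper: drop the saved tags for a single page. */
	static void mte_invalidate_tags_page(struct page *page)
	{
		swp_entry_t entry = page_swap_entry(page);

		mte_invalidate_tags(swp_type(entry), swp_offset(entry));
	}

Then the error path in arch_prepare_to_swap() can walk back over the subpages
it already handled and call this for each one.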

Steve

>> +	}
>> +	return 0;
>> +}
>> +
>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>> +{
>> +	if (system_supports_mte()) {
>> +		long i, nr = folio_nr_pages(folio);
>> +		for (i = 0; i < nr; i++)
>> +			mte_restore_tags(entry, folio_page(folio, i));
> 
> swap-in currently doesn't support large folios - everything is a single page
> folio. So this isn't technically needed. But from the API POV, it seems
> reasonable to make this change - except your implementation is broken. You are
> currently setting every page in the folio to use the same tags as the first
> page. You need to increment the swap entry for each page.
> 
> Thanks,
> Ryan
> 
> 
>> +	}
>> +}
>
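
As a rough sketch of the per-subpage restore described in the comment above,
assuming entry refers to the folio's first subpage and that the folio's swap
entries are contiguous (which is how this series allocates them):

	long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++) {
		/* each subpage gets its own entry: first offset + i (assumed contiguous) */
		swp_entry_t subentry = swp_entry(swp_type(entry),
						 swp_offset(entry) + i);

		mte_restore_tags(subentry, folio_page(folio, i));
	}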
  
Barry Song Nov. 4, 2023, 5:49 a.m. UTC | #10
>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
>> -static inline int arch_prepare_to_swap(struct page *page)
>> -{
>> -	if (system_supports_mte())
>> -		return mte_save_tags(page);
>> -	return 0;
>> -}
>> +#define arch_prepare_to_swap arch_prepare_to_swap
>> +extern int arch_prepare_to_swap(struct page *page);
> 
> I think it would be better to modify this API to take a folio explicitly. The
> caller already has the folio.

Agree. That was actually what I thought I should change while making this RFC,
though I didn't do it.

>> +int arch_prepare_to_swap(struct page *page)
>> +{
>> +	if (system_supports_mte()) {
>> +		struct folio *folio = page_folio(page);
>> +		long i, nr = folio_nr_pages(folio);
>> +		for (i = 0; i < nr; i++)
>> +			return mte_save_tags(folio_page(folio, i));
>
> This will return after saving the first page of the folio! You will need to add
> each page in a loop, and if you get an error at any point, you will need to
> remove the pages that you already added successfully, by calling
> arch_swap_invalidate_page() as far as I can see. Steven can you confirm?

right. oops...

> 
>> +	}
>> +	return 0;
>> +}
>> +
>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>> +{
>> +	if (system_supports_mte()) {
>> +		long i, nr = folio_nr_pages(folio);
>> +		for (i = 0; i < nr; i++)
>> +			mte_restore_tags(entry, folio_page(folio, i));
>
> swap-in currently doesn't support large folios - everything is a single page
> folio. So this isn't technically needed. But from the API POV, it seems
> reasonable to make this change - except your implementation is broken. You are
> currently setting every page in the folio to use the same tags as the first
> page. You need to increment the swap entry for each page.

One case is that we have a chance to "swap in" a folio which is still in the
swapcache and hasn't been dropped yet. I mean the process's PTEs have already
been converted to swap entries, but the large folio is still in the swapcache.
In that case we hit the swapcache while swapping in, so we are handling a
large folio. It seems we would then be restoring tags multiple times? If a
large folio has 16 base pages, then for each page fault on each base page we
restore the whole large folio, so for 16 page faults we duplicate the restore
16 times. Any thoughts on how to handle this situation? Should we move
arch_swap_restore() to take a page rather than a folio, since swap-in only
supports base pages at the moment?
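
If we do keep the folio argument, one option (just a sketch) would be to pick
the exact subpage for the faulted entry out of the swapcache folio, e.g.:

	struct page *page = folio_file_page(folio, swp_offset(entry));

	mte_restore_tags(entry, page);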

> Thanks,
> Ryan

Thanks
Barry
  
Barry Song Nov. 4, 2023, 9:34 a.m. UTC | #11
> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> and if failing then arch_prepare_to_swap() would need to put things back
> how they were with calls to mte_invalidate_tags() (although I think
> you'd actually want to refactor to create a function which takes a
> struct page *).
> 
> Steve

Thanks, Steve. Combining all the comments from you and Ryan, I made a v2.
One tricky thing is that we are restoring one page rather than the whole
folio in arch_swap_restore(), as we are only swapping in one page at this
stage.

[RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios

This patch makes MTE tags saving and restoring support large folios,
then we don't need to split them into base pages for swapping on
ARM64 SoCs with MTE.

This patch moves arch_prepare_to_swap() to take folio rather than
page, as we support THP swap-out as a whole. And this patch also
drops arch_thp_swp_supported() as ARM64 MTE is the only one who
needs it.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 arch/arm64/include/asm/pgtable.h | 21 +++------------
 arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
 include/linux/huge_mm.h          | 12 ---------
 include/linux/pgtable.h          |  2 +-
 mm/page_io.c                     |  2 +-
 mm/swap_slots.c                  |  2 +-
 6 files changed, 51 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b19a8aee684c..d8f523dc41e7 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,12 +45,6 @@
 	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline bool arch_thp_swp_supported(void)
-{
-	return !system_supports_mte();
-}
-#define arch_thp_swp_supported arch_thp_swp_supported
-
 /*
  * Outside of a few very special situations (e.g. hibernation), we always
  * use broadcast TLB invalidation instructions, therefore a spurious page
@@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 #ifdef CONFIG_ARM64_MTE
 
 #define __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
-{
-	if (system_supports_mte())
-		return mte_save_tags(page);
-	return 0;
-}
+#define arch_prepare_to_swap arch_prepare_to_swap
+extern int arch_prepare_to_swap(struct folio *folio);
 
 #define __HAVE_ARCH_SWAP_INVALIDATE
 static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
@@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
 }
 
 #define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
-{
-	if (system_supports_mte())
-		mte_restore_tags(entry, &folio->page);
-}
+#define arch_swap_restore arch_swap_restore
+extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
 
 #endif /* CONFIG_ARM64_MTE */
 
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..14a479e4ea8e 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
 	mte_free_tag_storage(tags);
 }
 
+static inline void __mte_invalidate_tags(struct page *page)
+{
+	swp_entry_t entry = page_swap_entry(page);
+	mte_invalidate_tags(swp_type(entry), swp_offset(entry));
+}
+
 void mte_invalidate_tags_area(int type)
 {
 	swp_entry_t entry = swp_entry(type, 0);
@@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
 	}
 	xa_unlock(&mte_pages);
 }
+
+int arch_prepare_to_swap(struct folio *folio)
+{
+	int err;
+	long i;
+
+	if (system_supports_mte()) {
+		long nr = folio_nr_pages(folio);
+		for (i = 0; i < nr; i++) {
+			err = mte_save_tags(folio_page(folio, i));
+			if (err)
+				goto out;
+		}
+	}
+	return 0;
+
+out:
+	while (--i)
+		__mte_invalidate_tags(folio_page(folio, i));
+	return err;
+}
+
+void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+{
+	if (system_supports_mte()) {
+		/*
+		 * We don't support large folios swap in as whole yet, but
+		 * we can hit a large folio which is still in swapcache
+		 * after those related processes' PTEs have been unmapped
+		 * but before the swapcache folio  is dropped, in this case,
+		 * we need to find the exact page which "entry" is mapping
+		 * to. If we are not hitting swapcache, this folio won't be
+		 * large
+		 */
+		struct page *page = folio_file_page(folio, swp_offset(entry));
+		mte_restore_tags(entry, page);
+	}
+}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..f83fb8d5241e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
 	return split_folio_to_list(folio, NULL);
 }
 
-/*
- * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
- * limitations in the implementation like arm64 MTE can override this to
- * false
- */
-#ifndef arch_thp_swp_supported
-static inline bool arch_thp_swp_supported(void)
-{
-	return true;
-}
-#endif
-
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..33ab4ddd91dd 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * prototypes must be defined in the arch-specific asm/pgtable.h file.
  */
 #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
+static inline int arch_prepare_to_swap(struct folio *folio)
 {
 	return 0;
 }
diff --git a/mm/page_io.c b/mm/page_io.c
index cb559ae324c6..0fd832474c1d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	 * Arch code may have to preserve more data than just the page
 	 * contents, e.g. memory tags.
 	 */
-	ret = arch_prepare_to_swap(&folio->page);
+	ret = arch_prepare_to_swap(folio);
 	if (ret) {
 		folio_mark_dirty(folio);
 		folio_unlock(folio);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..2325adbb1f19 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 	entry.val = 0;
 
 	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+		if (IS_ENABLED(CONFIG_THP_SWAP))
 			get_swap_pages(1, &entry, folio_nr_pages(folio));
 		goto out;
 	}
  
Steven Price Nov. 6, 2023, 10:12 a.m. UTC | #12
On 04/11/2023 09:34, Barry Song wrote:
>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
>> and if failing then arch_prepare_to_swap() would need to put things back
>> how they were with calls to mte_invalidate_tags() (although I think
>> you'd actually want to refactor to create a function which takes a
>> struct page *).
>>
>> Steve
> 
> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> One tricky thing is that we are restoring one page rather than folio
> in arch_restore_swap() as we are only swapping in one page at this
> stage.
> 
> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> 
> This patch makes MTE tags saving and restoring support large folios,
> then we don't need to split them into base pages for swapping on
> ARM64 SoCs with MTE.
> 
> This patch moves arch_prepare_to_swap() to take folio rather than
> page, as we support THP swap-out as a whole. And this patch also
> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> needs it.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 21 +++------------
>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h          | 12 ---------
>  include/linux/pgtable.h          |  2 +-
>  mm/page_io.c                     |  2 +-
>  mm/swap_slots.c                  |  2 +-
>  6 files changed, 51 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b19a8aee684c..d8f523dc41e7 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
>  	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported
> -
>  /*
>   * Outside of a few very special situations (e.g. hibernation), we always
>   * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  #ifdef CONFIG_ARM64_MTE
>  
>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> -	if (system_supports_mte())
> -		return mte_save_tags(page);
> -	return 0;
> -}
> +#define arch_prepare_to_swap arch_prepare_to_swap
> +extern int arch_prepare_to_swap(struct folio *folio);
>  
>  #define __HAVE_ARCH_SWAP_INVALIDATE
>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
>  }
>  
>  #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> -	if (system_supports_mte())
> -		mte_restore_tags(entry, &folio->page);
> -}
> +#define arch_swap_restore arch_swap_restore
> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>  
>  #endif /* CONFIG_ARM64_MTE */
>  
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..14a479e4ea8e 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>  	mte_free_tag_storage(tags);
>  }
>  
> +static inline void __mte_invalidate_tags(struct page *page)
> +{
> +	swp_entry_t entry = page_swap_entry(page);
> +	mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> +}
> +
>  void mte_invalidate_tags_area(int type)
>  {
>  	swp_entry_t entry = swp_entry(type, 0);
> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
>  	}
>  	xa_unlock(&mte_pages);
>  }
> +
> +int arch_prepare_to_swap(struct folio *folio)
> +{
> +	int err;
> +	long i;
> +
> +	if (system_supports_mte()) {
> +		long nr = folio_nr_pages(folio);
> +		for (i = 0; i < nr; i++) {
> +			err = mte_save_tags(folio_page(folio, i));
> +			if (err)
> +				goto out;
> +		}
> +	}
> +	return 0;
> +
> +out:
> +	while (--i)
> +		__mte_invalidate_tags(folio_page(folio, i));
> +	return err;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> +{
> +	if (system_supports_mte()) {
> +		/*
> +		 * We don't support large folios swap in as whole yet, but
> +		 * we can hit a large folio which is still in swapcache
> +		 * after those related processes' PTEs have been unmapped
> +		 * but before the swapcache folio  is dropped, in this case,
> +		 * we need to find the exact page which "entry" is mapping
> +		 * to. If we are not hitting swapcache, this folio won't be
> +		 * large
> +		 */

Does it make sense to keep arch_swap_restore taking a folio? I'm not
sure I understand why the change was made in the first place. It just
seems odd to have a function taking a struct folio but making the
assumption that it's actually only a single page (and having to use
entry to figure out which page).

It seems particularly broken in the case of unuse_pte() which calls
page_folio() to get the folio in the first place.

Other than that it looks correct to me.

Thanks,

Steve

> +		struct page *page = folio_file_page(folio, swp_offset(entry));
> +		mte_restore_tags(entry, page);
> +	}
> +}
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..f83fb8d5241e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
>  	return split_folio_to_list(folio, NULL);
>  }
>  
> -/*
> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> - * limitations in the implementation like arm64 MTE can override this to
> - * false
> - */
> -#ifndef arch_thp_swp_supported
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return true;
> -}
> -#endif
> -
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..33ab4ddd91dd 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   * prototypes must be defined in the arch-specific asm/pgtable.h file.
>   */
>  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> +static inline int arch_prepare_to_swap(struct folio *folio)
>  {
>  	return 0;
>  }
> diff --git a/mm/page_io.c b/mm/page_io.c
> index cb559ae324c6..0fd832474c1d 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>  	 * Arch code may have to preserve more data than just the page
>  	 * contents, e.g. memory tags.
>  	 */
> -	ret = arch_prepare_to_swap(&folio->page);
> +	ret = arch_prepare_to_swap(folio);
>  	if (ret) {
>  		folio_mark_dirty(folio);
>  		folio_unlock(folio);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..2325adbb1f19 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>  	entry.val = 0;
>  
>  	if (folio_test_large(folio)) {
> -		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> +		if (IS_ENABLED(CONFIG_THP_SWAP))
>  			get_swap_pages(1, &entry, folio_nr_pages(folio));
>  		goto out;
>  	}
  
Barry Song Nov. 6, 2023, 9:39 p.m. UTC | #13
On Mon, Nov 6, 2023 at 6:12 PM Steven Price <steven.price@arm.com> wrote:
>
> On 04/11/2023 09:34, Barry Song wrote:
> >> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> >> and if failing then arch_prepare_to_swap() would need to put things back
> >> how they were with calls to mte_invalidate_tags() (although I think
> >> you'd actually want to refactor to create a function which takes a
> >> struct page *).
> >>
> >> Steve
> >
> > Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> > One tricky thing is that we are restoring one page rather than folio
> > in arch_restore_swap() as we are only swapping in one page at this
> > stage.
> >
> > [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> >
> > This patch makes MTE tags saving and restoring support large folios,
> > then we don't need to split them into base pages for swapping on
> > ARM64 SoCs with MTE.
> >
> > This patch moves arch_prepare_to_swap() to take folio rather than
> > page, as we support THP swap-out as a whole. And this patch also
> > drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> > needs it.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  arch/arm64/include/asm/pgtable.h | 21 +++------------
> >  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
> >  include/linux/huge_mm.h          | 12 ---------
> >  include/linux/pgtable.h          |  2 +-
> >  mm/page_io.c                     |  2 +-
> >  mm/swap_slots.c                  |  2 +-
> >  6 files changed, 51 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index b19a8aee684c..d8f523dc41e7 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -45,12 +45,6 @@
> >       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -     return !system_supports_mte();
> > -}
> > -#define arch_thp_swp_supported arch_thp_swp_supported
> > -
> >  /*
> >   * Outside of a few very special situations (e.g. hibernation), we always
> >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >  #ifdef CONFIG_ARM64_MTE
> >
> >  #define __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > -{
> > -     if (system_supports_mte())
> > -             return mte_save_tags(page);
> > -     return 0;
> > -}
> > +#define arch_prepare_to_swap arch_prepare_to_swap
> > +extern int arch_prepare_to_swap(struct folio *folio);
> >
> >  #define __HAVE_ARCH_SWAP_INVALIDATE
> >  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
> >  }
> >
> >  #define __HAVE_ARCH_SWAP_RESTORE
> > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > -{
> > -     if (system_supports_mte())
> > -             mte_restore_tags(entry, &folio->page);
> > -}
> > +#define arch_swap_restore arch_swap_restore
> > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >
> >  #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > index a31833e3ddc5..14a479e4ea8e 100644
> > --- a/arch/arm64/mm/mteswap.c
> > +++ b/arch/arm64/mm/mteswap.c
> > @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >       mte_free_tag_storage(tags);
> >  }
> >
> > +static inline void __mte_invalidate_tags(struct page *page)
> > +{
> > +     swp_entry_t entry = page_swap_entry(page);
> > +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > +}
> > +
> >  void mte_invalidate_tags_area(int type)
> >  {
> >       swp_entry_t entry = swp_entry(type, 0);
> > @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
> >       }
> >       xa_unlock(&mte_pages);
> >  }
> > +
> > +int arch_prepare_to_swap(struct folio *folio)
> > +{
> > +     int err;
> > +     long i;
> > +
> > +     if (system_supports_mte()) {
> > +             long nr = folio_nr_pages(folio);
> > +             for (i = 0; i < nr; i++) {
> > +                     err = mte_save_tags(folio_page(folio, i));
> > +                     if (err)
> > +                             goto out;
> > +             }
> > +     }
> > +     return 0;
> > +
> > +out:
> > +     while (--i)
> > +             __mte_invalidate_tags(folio_page(folio, i));
> > +     return err;
> > +}
> > +
> > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > +{
> > +     if (system_supports_mte()) {
> > +             /*
> > +              * We don't support large folios swap in as whole yet, but
> > +              * we can hit a large folio which is still in swapcache
> > +              * after those related processes' PTEs have been unmapped
> > +              * but before the swapcache folio  is dropped, in this case,
> > +              * we need to find the exact page which "entry" is mapping
> > +              * to. If we are not hitting swapcache, this folio won't be
> > +              * large
> > +              */
>
> Does it make sense to keep arch_swap_restore taking a folio? I'm not
> sure I understand why the change was made in the first place. It just
> seems odd to have a function taking a struct folio but making the
> assumption that it's actually only a single page (and having to use
> entry to figure out which page).

Steve, let me give an example. Say we have a large anon folio with 16 pages.

While reclaiming, we do add_to_swap() and the folio is added to the swapcache
as a whole; then we unmap the folio; in the last step, we try to release the
folio.

There is a good chance some process accesses the virtual address after the
folio is unmapped but before the folio is finally released. In that case,
do_swap_page() will find the large folio in the swapcache and no I/O is needed.

Let's assume a process reads the 3rd page of the unmapped folio; in
do_swap_page(), the code is like:

vm_fault_t do_swap_page(struct vm_fault *vmf)
{
     swp_entry_t entry;
     ...
     entry = pte_to_swp_entry(vmf->orig_pte);

     folio = swap_cache_get_folio(entry, vma, vmf->address);
     if (folio)
           page = folio_file_page(folio, swp_offset(entry));

     arch_swap_restore(entry, folio);
}

entry points to the 3rd page, but folio points to the head page, so we can't
use the entry parameter to restore the whole folio in arch_swap_restore().

Then we have two choices in arch_swap_restore():
1. get the 1st page's swap entry and restore all 16 tags in this large folio;
2. restore only the 3rd page's tags, by getting the right page in the folio.

If we choose 1, then across the 16 page faults of do_swap_page() for the 16
unmapped PTEs we restore 16*16=256 tags (each PTE takes its own page fault,
since we don't restore all 16 PTEs in one do_swap_page()).

If we choose 2, then across the 16 page faults of do_swap_page() for the 16
unmapped PTEs we restore only 16*1=16 tags.
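
For contrast, a rough sketch of what choice 1 could look like inside
arch_swap_restore() (not what the v2 above does; page_swap_entry() and
folio_page() are used here purely for illustration):

	long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++) {
		struct page *p = folio_page(folio, i);

		/* restores every subpage's tags on every fault */
		mte_restore_tags(page_swap_entry(p), p);
	}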

>
> It seems particularly broken in the case of unuse_pte() which calls
> page_folio() to get the folio in the first place.
>
> Other than that it looks correct to me.
>
> Thanks,
>
> Steve
>
> > +             struct page *page = folio_file_page(folio, swp_offset(entry));
> > +             mte_restore_tags(entry, page);
> > +     }
> > +}
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index fa0350b0812a..f83fb8d5241e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
> >       return split_folio_to_list(folio, NULL);
> >  }
> >
> > -/*
> > - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > - * limitations in the implementation like arm64 MTE can override this to
> > - * false
> > - */
> > -#ifndef arch_thp_swp_supported
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -     return true;
> > -}
> > -#endif
> > -
> >  #endif /* _LINUX_HUGE_MM_H */
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index af7639c3b0a3..33ab4ddd91dd 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   * prototypes must be defined in the arch-specific asm/pgtable.h file.
> >   */
> >  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > +static inline int arch_prepare_to_swap(struct folio *folio)
> >  {
> >       return 0;
> >  }
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index cb559ae324c6..0fd832474c1d 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >        * Arch code may have to preserve more data than just the page
> >        * contents, e.g. memory tags.
> >        */
> > -     ret = arch_prepare_to_swap(&folio->page);
> > +     ret = arch_prepare_to_swap(folio);
> >       if (ret) {
> >               folio_mark_dirty(folio);
> >               folio_unlock(folio);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 0bec1f705f8e..2325adbb1f19 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> >       entry.val = 0;
> >
> >       if (folio_test_large(folio)) {
> > -             if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > +             if (IS_ENABLED(CONFIG_THP_SWAP))
> >                       get_swap_pages(1, &entry, folio_nr_pages(folio));
> >               goto out;
> >       }
>

Thanks
Barry
  
Ryan Roberts Nov. 7, 2023, 12:46 p.m. UTC | #14
On 04/11/2023 09:34, Barry Song wrote:
>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
>> and if failing then arch_prepare_to_swap() would need to put things back
>> how they were with calls to mte_invalidate_tags() (although I think
>> you'd actually want to refactor to create a function which takes a
>> struct page *).
>>
>> Steve
> 
> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> One tricky thing is that we are restoring one page rather than folio
> in arch_restore_swap() as we are only swapping in one page at this
> stage.
> 
> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> 
> This patch makes MTE tags saving and restoring support large folios,
> then we don't need to split them into base pages for swapping on
> ARM64 SoCs with MTE.
> 
> This patch moves arch_prepare_to_swap() to take folio rather than
> page, as we support THP swap-out as a whole. And this patch also
> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> needs it.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 21 +++------------
>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h          | 12 ---------
>  include/linux/pgtable.h          |  2 +-
>  mm/page_io.c                     |  2 +-
>  mm/swap_slots.c                  |  2 +-
>  6 files changed, 51 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b19a8aee684c..d8f523dc41e7 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
>  	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported
> -
>  /*
>   * Outside of a few very special situations (e.g. hibernation), we always
>   * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  #ifdef CONFIG_ARM64_MTE
>  
>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> -	if (system_supports_mte())
> -		return mte_save_tags(page);
> -	return 0;
> -}
> +#define arch_prepare_to_swap arch_prepare_to_swap
> +extern int arch_prepare_to_swap(struct folio *folio);
>  
>  #define __HAVE_ARCH_SWAP_INVALIDATE
>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
>  }
>  
>  #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> -	if (system_supports_mte())
> -		mte_restore_tags(entry, &folio->page);
> -}
> +#define arch_swap_restore arch_swap_restore
> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>  
>  #endif /* CONFIG_ARM64_MTE */
>  
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..14a479e4ea8e 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>  	mte_free_tag_storage(tags);
>  }
>  
> +static inline void __mte_invalidate_tags(struct page *page)
> +{
> +	swp_entry_t entry = page_swap_entry(page);
> +	mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> +}
> +
>  void mte_invalidate_tags_area(int type)
>  {
>  	swp_entry_t entry = swp_entry(type, 0);
> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
>  	}
>  	xa_unlock(&mte_pages);
>  }
> +
> +int arch_prepare_to_swap(struct folio *folio)
> +{
> +	int err;
> +	long i;
> +
> +	if (system_supports_mte()) {
> +		long nr = folio_nr_pages(folio);

nit: there should be a clear line between variable declarations and logic.

> +		for (i = 0; i < nr; i++) {
> +			err = mte_save_tags(folio_page(folio, i));
> +			if (err)
> +				goto out;
> +		}
> +	}
> +	return 0;
> +
> +out:
> +	while (--i)

If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
then it will wrap and run ~forever. I think you meant `while (i--)`?
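
i.e. the unwind would then look like:

out:
	while (i--)
		__mte_invalidate_tags(folio_page(folio, i));
	return err;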

> +		__mte_invalidate_tags(folio_page(folio, i));
> +	return err;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> +{
> +	if (system_supports_mte()) {
> +		/*
> +		 * We don't support large folios swap in as whole yet, but
> +		 * we can hit a large folio which is still in swapcache
> +		 * after those related processes' PTEs have been unmapped
> +		 * but before the swapcache folio  is dropped, in this case,
> +		 * we need to find the exact page which "entry" is mapping
> +		 * to. If we are not hitting swapcache, this folio won't be
> +		 * large
> +		 */

So the currently defined API allows a large folio to be passed but the caller is
supposed to find the single correct page using the swap entry? That feels quite
nasty to me. And that's not what the old version of the function was doing; it
always assumed that the folio was small and passed the first page (which also
doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
to fix that. If the old version is correct, then I guess this version is wrong.

Thanks,
Ryan

> +		struct page *page = folio_file_page(folio, swp_offset(entry));
> +		mte_restore_tags(entry, page);
> +	}
> +}
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..f83fb8d5241e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
>  	return split_folio_to_list(folio, NULL);
>  }
>  
> -/*
> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> - * limitations in the implementation like arm64 MTE can override this to
> - * false
> - */
> -#ifndef arch_thp_swp_supported
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return true;
> -}
> -#endif
> -
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..33ab4ddd91dd 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   * prototypes must be defined in the arch-specific asm/pgtable.h file.
>   */
>  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> +static inline int arch_prepare_to_swap(struct folio *folio)
>  {
>  	return 0;
>  }
> diff --git a/mm/page_io.c b/mm/page_io.c
> index cb559ae324c6..0fd832474c1d 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>  	 * Arch code may have to preserve more data than just the page
>  	 * contents, e.g. memory tags.
>  	 */
> -	ret = arch_prepare_to_swap(&folio->page);
> +	ret = arch_prepare_to_swap(folio);
>  	if (ret) {
>  		folio_mark_dirty(folio);
>  		folio_unlock(folio);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..2325adbb1f19 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>  	entry.val = 0;
>  
>  	if (folio_test_large(folio)) {
> -		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> +		if (IS_ENABLED(CONFIG_THP_SWAP))
>  			get_swap_pages(1, &entry, folio_nr_pages(folio));
>  		goto out;
>  	}
  
Barry Song Nov. 7, 2023, 6:05 p.m. UTC | #15
On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/11/2023 09:34, Barry Song wrote:
> >> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> >> and if failing then arch_prepare_to_swap() would need to put things back
> >> how they were with calls to mte_invalidate_tags() (although I think
> >> you'd actually want to refactor to create a function which takes a
> >> struct page *).
> >>
> >> Steve
> >
> > Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> > One tricky thing is that we are restoring one page rather than folio
> > in arch_restore_swap() as we are only swapping in one page at this
> > stage.
> >
> > [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> >
> > This patch makes MTE tags saving and restoring support large folios,
> > then we don't need to split them into base pages for swapping on
> > ARM64 SoCs with MTE.
> >
> > This patch moves arch_prepare_to_swap() to take folio rather than
> > page, as we support THP swap-out as a whole. And this patch also
> > drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> > needs it.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  arch/arm64/include/asm/pgtable.h | 21 +++------------
> >  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
> >  include/linux/huge_mm.h          | 12 ---------
> >  include/linux/pgtable.h          |  2 +-
> >  mm/page_io.c                     |  2 +-
> >  mm/swap_slots.c                  |  2 +-
> >  6 files changed, 51 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index b19a8aee684c..d8f523dc41e7 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -45,12 +45,6 @@
> >       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -     return !system_supports_mte();
> > -}
> > -#define arch_thp_swp_supported arch_thp_swp_supported
> > -
> >  /*
> >   * Outside of a few very special situations (e.g. hibernation), we always
> >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >  #ifdef CONFIG_ARM64_MTE
> >
> >  #define __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > -{
> > -     if (system_supports_mte())
> > -             return mte_save_tags(page);
> > -     return 0;
> > -}
> > +#define arch_prepare_to_swap arch_prepare_to_swap
> > +extern int arch_prepare_to_swap(struct folio *folio);
> >
> >  #define __HAVE_ARCH_SWAP_INVALIDATE
> >  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
> >  }
> >
> >  #define __HAVE_ARCH_SWAP_RESTORE
> > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > -{
> > -     if (system_supports_mte())
> > -             mte_restore_tags(entry, &folio->page);
> > -}
> > +#define arch_swap_restore arch_swap_restore
> > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >
> >  #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > index a31833e3ddc5..14a479e4ea8e 100644
> > --- a/arch/arm64/mm/mteswap.c
> > +++ b/arch/arm64/mm/mteswap.c
> > @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >       mte_free_tag_storage(tags);
> >  }
> >
> > +static inline void __mte_invalidate_tags(struct page *page)
> > +{
> > +     swp_entry_t entry = page_swap_entry(page);
> > +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > +}
> > +
> >  void mte_invalidate_tags_area(int type)
> >  {
> >       swp_entry_t entry = swp_entry(type, 0);
> > @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
> >       }
> >       xa_unlock(&mte_pages);
> >  }
> > +
> > +int arch_prepare_to_swap(struct folio *folio)
> > +{
> > +     int err;
> > +     long i;
> > +
> > +     if (system_supports_mte()) {
> > +             long nr = folio_nr_pages(folio);
>
> nit: there should be a clear line between variable declarations and logic.

right.

>
> > +             for (i = 0; i < nr; i++) {
> > +                     err = mte_save_tags(folio_page(folio, i));
> > +                     if (err)
> > +                             goto out;
> > +             }
> > +     }
> > +     return 0;
> > +
> > +out:
> > +     while (--i)
>
> If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
> then it will wrap and run ~forever. I think you meant `while (i--)`?

Nope. If i=0 and we goto out, that means page 0 has failed to save tags, so
there is nothing to revert. If i=3 and we goto out, that means 0,1,2 have been
saved, so we revert 0,1,2 and we don't touch 3.

>
> > +             __mte_invalidate_tags(folio_page(folio, i));
> > +     return err;
> > +}
> > +
> > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > +{
> > +     if (system_supports_mte()) {
> > +             /*
> > +              * We don't support large folios swap in as whole yet, but
> > +              * we can hit a large folio which is still in swapcache
> > +              * after those related processes' PTEs have been unmapped
> > +              * but before the swapcache folio  is dropped, in this case,
> > +              * we need to find the exact page which "entry" is mapping
> > +              * to. If we are not hitting swapcache, this folio won't be
> > +              * large
> > +              */
>
> So the currently defined API allows a large folio to be passed but the caller is
> supposed to find the single correct page using the swap entry? That feels quite
> nasty to me. And that's not what the old version of the function was doing; it
> always assumed that the folio was small and passed the first page (which also
> doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
> to fix that. If the old version is correct, then I guess this version is wrong.

The original (mainline) version is wrong, but it works because, once we find
that the SoC supports MTE, we split large folios into small pages, so only
small pages get added to the swapcache.

But now we want to swap out large folios as a whole even on SoCs with MTE; we
don't split, so this breaks the assumption that do_swap_page() will always get
small pages.

>
> Thanks,
> Ryan
>
> > +             struct page *page = folio_file_page(folio, swp_offset(entry));
> > +             mte_restore_tags(entry, page);
> > +     }
> > +}
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index fa0350b0812a..f83fb8d5241e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
> >       return split_folio_to_list(folio, NULL);
> >  }
> >
> > -/*
> > - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > - * limitations in the implementation like arm64 MTE can override this to
> > - * false
> > - */
> > -#ifndef arch_thp_swp_supported
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -     return true;
> > -}
> > -#endif
> > -
> >  #endif /* _LINUX_HUGE_MM_H */
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index af7639c3b0a3..33ab4ddd91dd 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   * prototypes must be defined in the arch-specific asm/pgtable.h file.
> >   */
> >  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > +static inline int arch_prepare_to_swap(struct folio *folio)
> >  {
> >       return 0;
> >  }
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index cb559ae324c6..0fd832474c1d 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >        * Arch code may have to preserve more data than just the page
> >        * contents, e.g. memory tags.
> >        */
> > -     ret = arch_prepare_to_swap(&folio->page);
> > +     ret = arch_prepare_to_swap(folio);
> >       if (ret) {
> >               folio_mark_dirty(folio);
> >               folio_unlock(folio);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 0bec1f705f8e..2325adbb1f19 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> >       entry.val = 0;
> >
> >       if (folio_test_large(folio)) {
> > -             if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > +             if (IS_ENABLED(CONFIG_THP_SWAP))
> >                       get_swap_pages(1, &entry, folio_nr_pages(folio));
> >               goto out;
> >       }
>

Thanks
Barry
  
Barry Song Nov. 8, 2023, 11:23 a.m. UTC | #16
On Wed, Nov 8, 2023 at 2:05 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 04/11/2023 09:34, Barry Song wrote:
> > >> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> > >> and if failing then arch_prepare_to_swap() would need to put things back
> > >> how they were with calls to mte_invalidate_tags() (although I think
> > >> you'd actually want to refactor to create a function which takes a
> > >> struct page *).
> > >>
> > >> Steve
> > >
> > > Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> > > One tricky thing is that we are restoring one page rather than folio
> > > in arch_restore_swap() as we are only swapping in one page at this
> > > stage.
> > >
> > > [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> > >
> > > This patch makes MTE tags saving and restoring support large folios,
> > > then we don't need to split them into base pages for swapping on
> > > ARM64 SoCs with MTE.
> > >
> > > This patch moves arch_prepare_to_swap() to take folio rather than
> > > page, as we support THP swap-out as a whole. And this patch also
> > > drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> > > needs it.
> > >
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  arch/arm64/include/asm/pgtable.h | 21 +++------------
> > >  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
> > >  include/linux/huge_mm.h          | 12 ---------
> > >  include/linux/pgtable.h          |  2 +-
> > >  mm/page_io.c                     |  2 +-
> > >  mm/swap_slots.c                  |  2 +-
> > >  6 files changed, 51 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index b19a8aee684c..d8f523dc41e7 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -45,12 +45,6 @@
> > >       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> > >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > >
> > > -static inline bool arch_thp_swp_supported(void)
> > > -{
> > > -     return !system_supports_mte();
> > > -}
> > > -#define arch_thp_swp_supported arch_thp_swp_supported
> > > -
> > >  /*
> > >   * Outside of a few very special situations (e.g. hibernation), we always
> > >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > > @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > >  #ifdef CONFIG_ARM64_MTE
> > >
> > >  #define __HAVE_ARCH_PREPARE_TO_SWAP
> > > -static inline int arch_prepare_to_swap(struct page *page)
> > > -{
> > > -     if (system_supports_mte())
> > > -             return mte_save_tags(page);
> > > -     return 0;
> > > -}
> > > +#define arch_prepare_to_swap arch_prepare_to_swap
> > > +extern int arch_prepare_to_swap(struct folio *folio);
> > >
> > >  #define __HAVE_ARCH_SWAP_INVALIDATE
> > >  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > > @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
> > >  }
> > >
> > >  #define __HAVE_ARCH_SWAP_RESTORE
> > > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > > -{
> > > -     if (system_supports_mte())
> > > -             mte_restore_tags(entry, &folio->page);
> > > -}
> > > +#define arch_swap_restore arch_swap_restore
> > > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> > >
> > >  #endif /* CONFIG_ARM64_MTE */
> > >
> > > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > > index a31833e3ddc5..14a479e4ea8e 100644
> > > --- a/arch/arm64/mm/mteswap.c
> > > +++ b/arch/arm64/mm/mteswap.c
> > > @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> > >       mte_free_tag_storage(tags);
> > >  }
> > >
> > > +static inline void __mte_invalidate_tags(struct page *page)
> > > +{
> > > +     swp_entry_t entry = page_swap_entry(page);
> > > +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > > +}
> > > +
> > >  void mte_invalidate_tags_area(int type)
> > >  {
> > >       swp_entry_t entry = swp_entry(type, 0);
> > > @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
> > >       }
> > >       xa_unlock(&mte_pages);
> > >  }
> > > +
> > > +int arch_prepare_to_swap(struct folio *folio)
> > > +{
> > > +     int err;
> > > +     long i;
> > > +
> > > +     if (system_supports_mte()) {
> > > +             long nr = folio_nr_pages(folio);
> >
> > nit: there should be a clear line between variable declarations and logic.
>
> right.
>
> >
> > > +             for (i = 0; i < nr; i++) {
> > > +                     err = mte_save_tags(folio_page(folio, i));
> > > +                     if (err)
> > > +                             goto out;
> > > +             }
> > > +     }
> > > +     return 0;
> > > +
> > > +out:
> > > +     while (--i)
> >
> > If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
> > then it will wrap and run ~forever. I think you meant `while (i--)`?
>
> nop. if i=0 and we goto out, that means the page0 has failed to save tags,
> there is nothing to revert. if i=3 and we goto out, that means 0,1,2 have
> saved, we restore 0,1,2 and we don't restore 3.

I am terribly sorry for my previous noise. You are right, Ryan. I actually
meant i--.

>
> >
> > > +             __mte_invalidate_tags(folio_page(folio, i));
> > > +     return err;
> > > +}
> > > +
> > > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > > +{
> > > +     if (system_supports_mte()) {
> > > +             /*
> > > +              * We don't support large folios swap in as whole yet, but
> > > +              * we can hit a large folio which is still in swapcache
> > > +              * after those related processes' PTEs have been unmapped
> > > +              * but before the swapcache folio  is dropped, in this case,
> > > +              * we need to find the exact page which "entry" is mapping
> > > +              * to. If we are not hitting swapcache, this folio won't be
> > > +              * large
> > > +              */
> >
> > So the currently defined API allows a large folio to be passed but the caller is
> > supposed to find the single correct page using the swap entry? That feels quite
> > nasty to me. And that's not what the old version of the function was doing; it
> > always assumed that the folio was small and passed the first page (which also
> > doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
> > to fix that. If the old version is correct, then I guess this version is wrong.
>
> the original version(mainline) is wrong but it works as once we find the SoCs
> support MTE, we will split large folios into small pages. so only small pages
> will be added into swapcache successfully.
>
> but now we want to swap out large folios even on SoCs with MTE as a whole,
> we don't split, so this breaks the assumption do_swap_page() will always get
> small pages.

Let me clarify this some more. The current mainline assumes arch_swap_restore()
always gets a folio with only one page. This is true because we split large
folios when we find the SoC has MTE. But since we are dropping the split now, a
large folio can be seen by do_swap_page(): we can have try_to_unmap_one() done
but the folio not yet put, so the PTEs hold swap entries while the folio is
still there; do_swap_page() hits the swapcache directly and the folio won't be
released.

But after getting the large folio in do_swap_page(), it still only handles the
one base page for the faulted PTE and maps that 4KB PTE only. So it uses the
faulted swap entry and the folio as parameters to call arch_swap_restore(),
which can be something like:

do_swap_page()
{
        arch_swap_restore(the swap entry for the faulted 4KB PTE, large folio);
}
>
> >
> > Thanks,
> > Ryan

Thanks
Barry
  
Steven Price Nov. 8, 2023, 11:51 a.m. UTC | #17
On 06/11/2023 21:39, Barry Song wrote:
> On Mon, Nov 6, 2023 at 6:12 PM Steven Price <steven.price@arm.com> wrote:
>>
>> On 04/11/2023 09:34, Barry Song wrote:
>>>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
>>>> and if failing then arch_prepare_to_swap() would need to put things back
>>>> how they were with calls to mte_invalidate_tags() (although I think
>>>> you'd actually want to refactor to create a function which takes a
>>>> struct page *).
>>>>
>>>> Steve
>>>
>>> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
>>> One tricky thing is that we are restoring one page rather than folio
>>> in arch_restore_swap() as we are only swapping in one page at this
>>> stage.
>>>
>>> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
>>>
>>> This patch makes MTE tags saving and restoring support large folios,
>>> then we don't need to split them into base pages for swapping on
>>> ARM64 SoCs with MTE.
>>>
>>> This patch moves arch_prepare_to_swap() to take folio rather than
>>> page, as we support THP swap-out as a whole. And this patch also
>>> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
>>> needs it.
>>>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>  arch/arm64/include/asm/pgtable.h | 21 +++------------
>>>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
>>>  include/linux/huge_mm.h          | 12 ---------
>>>  include/linux/pgtable.h          |  2 +-
>>>  mm/page_io.c                     |  2 +-
>>>  mm/swap_slots.c                  |  2 +-
>>>  6 files changed, 51 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index b19a8aee684c..d8f523dc41e7 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -45,12 +45,6 @@
>>>       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>
>>> -static inline bool arch_thp_swp_supported(void)
>>> -{
>>> -     return !system_supports_mte();
>>> -}
>>> -#define arch_thp_swp_supported arch_thp_swp_supported
>>> -
>>>  /*
>>>   * Outside of a few very special situations (e.g. hibernation), we always
>>>   * use broadcast TLB invalidation instructions, therefore a spurious page
>>> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>>  #ifdef CONFIG_ARM64_MTE
>>>
>>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
>>> -static inline int arch_prepare_to_swap(struct page *page)
>>> -{
>>> -     if (system_supports_mte())
>>> -             return mte_save_tags(page);
>>> -     return 0;
>>> -}
>>> +#define arch_prepare_to_swap arch_prepare_to_swap
>>> +extern int arch_prepare_to_swap(struct folio *folio);
>>>
>>>  #define __HAVE_ARCH_SWAP_INVALIDATE
>>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
>>> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
>>>  }
>>>
>>>  #define __HAVE_ARCH_SWAP_RESTORE
>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>> -{
>>> -     if (system_supports_mte())
>>> -             mte_restore_tags(entry, &folio->page);
>>> -}
>>> +#define arch_swap_restore arch_swap_restore
>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>>>
>>>  #endif /* CONFIG_ARM64_MTE */
>>>
>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
>>> index a31833e3ddc5..14a479e4ea8e 100644
>>> --- a/arch/arm64/mm/mteswap.c
>>> +++ b/arch/arm64/mm/mteswap.c
>>> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>>>       mte_free_tag_storage(tags);
>>>  }
>>>
>>> +static inline void __mte_invalidate_tags(struct page *page)
>>> +{
>>> +     swp_entry_t entry = page_swap_entry(page);
>>> +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
>>> +}
>>> +
>>>  void mte_invalidate_tags_area(int type)
>>>  {
>>>       swp_entry_t entry = swp_entry(type, 0);
>>> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
>>>       }
>>>       xa_unlock(&mte_pages);
>>>  }
>>> +
>>> +int arch_prepare_to_swap(struct folio *folio)
>>> +{
>>> +     int err;
>>> +     long i;
>>> +
>>> +     if (system_supports_mte()) {
>>> +             long nr = folio_nr_pages(folio);
>>> +             for (i = 0; i < nr; i++) {
>>> +                     err = mte_save_tags(folio_page(folio, i));
>>> +                     if (err)
>>> +                             goto out;
>>> +             }
>>> +     }
>>> +     return 0;
>>> +
>>> +out:
>>> +     while (--i)
>>> +             __mte_invalidate_tags(folio_page(folio, i));
>>> +     return err;
>>> +}
>>> +
>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>> +{
>>> +     if (system_supports_mte()) {
>>> +             /*
>>> +              * We don't support large folios swap in as whole yet, but
>>> +              * we can hit a large folio which is still in swapcache
>>> +              * after those related processes' PTEs have been unmapped
>>> +              * but before the swapcache folio  is dropped, in this case,
>>> +              * we need to find the exact page which "entry" is mapping
>>> +              * to. If we are not hitting swapcache, this folio won't be
>>> +              * large
>>> +              */
>>
>> Does it make sense to keep arch_swap_restore taking a folio? I'm not
>> sure I understand why the change was made in the first place. It just
>> seems odd to have a function taking a struct folio but making the
>> assumption that it's actually only a single page (and having to use
>> entry to figure out which page).
> 
> Steve, let me give an example. in case we have a large anon folios with
> 16 pages.
> 
> while reclaiming, we do add_to_swap(), this folio is added to swapcache
> as a whole; then we unmap the folio; in the last step,  we try to release
> the folio.
> 
> we have a good chance some processes might access the virtual address
> after the folio is unmapped but before the folio is finally released. thus,
> do_swap_page() will find the large folio in swapcache, there is no I/O needed.
> 
> Let's assume processes read the 3rd page of the unmapped folio, in
> do_swap_page(), the code is like,
> 
> vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
>      swp_entry_t entry;
>      ...
>      entry = pte_to_swp_entry(vmf->orig_pte);
> 
>      folio = swap_cache_get_folio(entry, vma, vmf->address);
>      if (folio)
>            page = folio_file_page(folio, swp_offset(entry));
> 
>      arch_swap_restore(entry, folio);
> }
> 
> entry points to the 3rd page, but folio points to the head page. so we
> can't use the entry parameter to restore the whole folio in
> arch_swap_restore()

Sorry, I don't think I explained myself very clearly. My issue was that
with your patch (and currently) we have the situation where
arch_swap_restore() can only restore a single page. But the function
takes a "struct folio *" argument.

Current mainline assumes that the folio is a single page, and with your
patch we now have a big comment explaining what's going on (bonus points
for that!) and we pick out the correct page from the folio. What I'm
puzzled by is why the change was made in the first place to pass a
"struct folio *" - if we passed a "struct page *" then:

 a) It would be clear that the current API only allows a single page at
    a time.

 b) The correct page could be passed by the caller rather than
    arch_swap_restore() having to obtain the offset into the folio
    (see the sketch below).
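
A minimal sketch of what that could look like, purely for illustration:
this is the suggestion above rather than code from the patch, and
example_caller() is a hypothetical stand-in for the call site in
do_swap_page():

/* arch/arm64/include/asm/pgtable.h - hypothetical page-based variant */
#define arch_swap_restore arch_swap_restore
extern void arch_swap_restore(swp_entry_t entry, struct page *page);

/* arch/arm64/mm/mteswap.c - hypothetical page-based variant */
void arch_swap_restore(swp_entry_t entry, struct page *page)
{
        if (system_supports_mte())
                mte_restore_tags(entry, page);
}

/* A caller such as do_swap_page() would then resolve the page itself: */
static void example_caller(swp_entry_t entry, struct folio *folio)
{
        arch_swap_restore(entry, folio_file_page(folio, swp_offset(entry)));
}

Whether that is preferable to restoring the whole folio in one go is
exactly what the two options below discuss.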

> then we have two choices in arch_swap_restore()
> 1. we get the 1st page's swap entry and restore all 16 tags in this large folio.
> 2. we restore the 3rd tag only by getting the right page in the folio
> 
> if we choose 1, in all 16 page faults of do_swap_page for the 16 unmapped
> PTEs, we will restore 16*16=256 tags. One pte will have one page fault
> since we don't restore 16 PTEs in do_swap_page().
> 
> if we choose 2, in all 16 pages fault of do_swap_page for the 16 unmapped
> PTEs, we will only restore 16 *1=16 tags.

So if we choose option 1 then we're changing the API of
arch_swap_restore() to actually restore the entire folio and it makes
sense to pass a "struct folio *" - and I'm happy with that. But AFAICT
that's not what your patch currently implements as it appears to be
doing option 2.

I'm quite happy to believe that the overhead of option 2 is small and
that might be the right solution, but at the moment we've got an API
which implies arch_swap_restore() should be operating on an entire folio.

Note that I don't have any particularly strong views on this - I've not
been following the folio work very closely, but I personally find it
confusing when a function takes a "struct folio *" but then operates on
only one page of it.

Steve

>>
>> It seems particularly broken in the case of unuse_pte() which calls
>> page_folio() to get the folio in the first place.
>>
>> Other than that it looks correct to me.
>>
>> Thanks,
>>
>> Steve
>>
>>> +             struct page *page = folio_file_page(folio, swp_offset(entry));
>>> +             mte_restore_tags(entry, page);
>>> +     }
>>> +}
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index fa0350b0812a..f83fb8d5241e 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
>>>       return split_folio_to_list(folio, NULL);
>>>  }
>>>
>>> -/*
>>> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>>> - * limitations in the implementation like arm64 MTE can override this to
>>> - * false
>>> - */
>>> -#ifndef arch_thp_swp_supported
>>> -static inline bool arch_thp_swp_supported(void)
>>> -{
>>> -     return true;
>>> -}
>>> -#endif
>>> -
>>>  #endif /* _LINUX_HUGE_MM_H */
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..33ab4ddd91dd 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>>>   * prototypes must be defined in the arch-specific asm/pgtable.h file.
>>>   */
>>>  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
>>> -static inline int arch_prepare_to_swap(struct page *page)
>>> +static inline int arch_prepare_to_swap(struct folio *folio)
>>>  {
>>>       return 0;
>>>  }
>>> diff --git a/mm/page_io.c b/mm/page_io.c
>>> index cb559ae324c6..0fd832474c1d 100644
>>> --- a/mm/page_io.c
>>> +++ b/mm/page_io.c
>>> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>>>        * Arch code may have to preserve more data than just the page
>>>        * contents, e.g. memory tags.
>>>        */
>>> -     ret = arch_prepare_to_swap(&folio->page);
>>> +     ret = arch_prepare_to_swap(folio);
>>>       if (ret) {
>>>               folio_mark_dirty(folio);
>>>               folio_unlock(folio);
>>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
>>> index 0bec1f705f8e..2325adbb1f19 100644
>>> --- a/mm/swap_slots.c
>>> +++ b/mm/swap_slots.c
>>> @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>>       entry.val = 0;
>>>
>>>       if (folio_test_large(folio)) {
>>> -             if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
>>> +             if (IS_ENABLED(CONFIG_THP_SWAP))
>>>                       get_swap_pages(1, &entry, folio_nr_pages(folio));
>>>               goto out;
>>>       }
>>
> 
> Thanks
> Barry
  
Ryan Roberts Nov. 8, 2023, 8:20 p.m. UTC | #18
On 08/11/2023 11:23, Barry Song wrote:
> On Wed, Nov 8, 2023 at 2:05 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 04/11/2023 09:34, Barry Song wrote:
>>>>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
>>>>> and if failing then arch_prepare_to_swap() would need to put things back
>>>>> how they were with calls to mte_invalidate_tags() (although I think
>>>>> you'd actually want to refactor to create a function which takes a
>>>>> struct page *).
>>>>>
>>>>> Steve
>>>>
>>>> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
>>>> One tricky thing is that we are restoring one page rather than folio
>>>> in arch_restore_swap() as we are only swapping in one page at this
>>>> stage.
>>>>
>>>> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
>>>>
>>>> This patch makes MTE tags saving and restoring support large folios,
>>>> then we don't need to split them into base pages for swapping on
>>>> ARM64 SoCs with MTE.
>>>>
>>>> This patch moves arch_prepare_to_swap() to take folio rather than
>>>> page, as we support THP swap-out as a whole. And this patch also
>>>> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
>>>> needs it.
>>>>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>> ---
>>>>  arch/arm64/include/asm/pgtable.h | 21 +++------------
>>>>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
>>>>  include/linux/huge_mm.h          | 12 ---------
>>>>  include/linux/pgtable.h          |  2 +-
>>>>  mm/page_io.c                     |  2 +-
>>>>  mm/swap_slots.c                  |  2 +-
>>>>  6 files changed, 51 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index b19a8aee684c..d8f523dc41e7 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -45,12 +45,6 @@
>>>>       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>>
>>>> -static inline bool arch_thp_swp_supported(void)
>>>> -{
>>>> -     return !system_supports_mte();
>>>> -}
>>>> -#define arch_thp_swp_supported arch_thp_swp_supported
>>>> -
>>>>  /*
>>>>   * Outside of a few very special situations (e.g. hibernation), we always
>>>>   * use broadcast TLB invalidation instructions, therefore a spurious page
>>>> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>>>  #ifdef CONFIG_ARM64_MTE
>>>>
>>>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
>>>> -static inline int arch_prepare_to_swap(struct page *page)
>>>> -{
>>>> -     if (system_supports_mte())
>>>> -             return mte_save_tags(page);
>>>> -     return 0;
>>>> -}
>>>> +#define arch_prepare_to_swap arch_prepare_to_swap
>>>> +extern int arch_prepare_to_swap(struct folio *folio);
>>>>
>>>>  #define __HAVE_ARCH_SWAP_INVALIDATE
>>>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
>>>> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
>>>>  }
>>>>
>>>>  #define __HAVE_ARCH_SWAP_RESTORE
>>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>>> -{
>>>> -     if (system_supports_mte())
>>>> -             mte_restore_tags(entry, &folio->page);
>>>> -}
>>>> +#define arch_swap_restore arch_swap_restore
>>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>>>>
>>>>  #endif /* CONFIG_ARM64_MTE */
>>>>
>>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
>>>> index a31833e3ddc5..14a479e4ea8e 100644
>>>> --- a/arch/arm64/mm/mteswap.c
>>>> +++ b/arch/arm64/mm/mteswap.c
>>>> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>>>>       mte_free_tag_storage(tags);
>>>>  }
>>>>
>>>> +static inline void __mte_invalidate_tags(struct page *page)
>>>> +{
>>>> +     swp_entry_t entry = page_swap_entry(page);
>>>> +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
>>>> +}
>>>> +
>>>>  void mte_invalidate_tags_area(int type)
>>>>  {
>>>>       swp_entry_t entry = swp_entry(type, 0);
>>>> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
>>>>       }
>>>>       xa_unlock(&mte_pages);
>>>>  }
>>>> +
>>>> +int arch_prepare_to_swap(struct folio *folio)
>>>> +{
>>>> +     int err;
>>>> +     long i;
>>>> +
>>>> +     if (system_supports_mte()) {
>>>> +             long nr = folio_nr_pages(folio);
>>>
>>> nit: there should be a clear line between variable declarations and logic.
>>
>> right.
>>
>>>
>>>> +             for (i = 0; i < nr; i++) {
>>>> +                     err = mte_save_tags(folio_page(folio, i));
>>>> +                     if (err)
>>>> +                             goto out;
>>>> +             }
>>>> +     }
>>>> +     return 0;
>>>> +
>>>> +out:
>>>> +     while (--i)
>>>
>>> If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
>>> then it will wrap and run ~forever. I think you meant `while (i--)`?
>>
>> nop. if i=0 and we goto out, that means the page0 has failed to save tags,
>> there is nothing to revert. if i=3 and we goto out, that means 0,1,2 have
>> saved, we restore 0,1,2 and we don't restore 3.
> 
> I am terribly sorry for my previous noise. You are right, Ryan. i
> actually meant i--.

No problem - it saves me from writing a long response explaining why --i is
wrong, at least!
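
For anyone following along, here is a minimal, self-contained sketch of
the unwind pattern being discussed; the save/invalidate helpers below
are stand-ins for mte_save_tags()/__mte_invalidate_tags(), and the
injected failure is purely illustrative:

#include <stdio.h>

#define NR_PAGES 16
#define FAIL_AT   3    /* pretend saving tags for page 3 fails */

static int save_tags(long i)         /* stand-in for mte_save_tags() */
{
        return i == FAIL_AT ? -1 : 0;
}

static void invalidate_tags(long i)  /* stand-in for __mte_invalidate_tags() */
{
        printf("invalidate page %ld\n", i);
}

int main(void)
{
        int err = 0;
        long i;

        for (i = 0; i < NR_PAGES; i++) {
                err = save_tags(i);
                if (err)
                        goto out;
        }
        return 0;

out:
        /*
         * `while (i--)` undoes exactly pages 0..i-1 (here 0, 1 and 2)
         * and does nothing when page 0 is the one that failed, whereas
         * `while (--i)` would skip page 0 (and run ~forever if i == 0).
         */
        while (i--)
                invalidate_tags(i);
        return err;
}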

> 
>>
>>>
>>>> +             __mte_invalidate_tags(folio_page(folio, i));
>>>> +     return err;
>>>> +}
>>>> +
>>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>>> +{
>>>> +     if (system_supports_mte()) {
>>>> +             /*
>>>> +              * We don't support large folios swap in as whole yet, but
>>>> +              * we can hit a large folio which is still in swapcache
>>>> +              * after those related processes' PTEs have been unmapped
>>>> +              * but before the swapcache folio  is dropped, in this case,
>>>> +              * we need to find the exact page which "entry" is mapping
>>>> +              * to. If we are not hitting swapcache, this folio won't be
>>>> +              * large
>>>> +              */
>>>
>>> So the currently defined API allows a large folio to be passed but the caller is
>>> supposed to find the single correct page using the swap entry? That feels quite
>>> nasty to me. And that's not what the old version of the function was doing; it
>>> always assumed that the folio was small and passed the first page (which also
>>> doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
>>> to fix that. If the old version is correct, then I guess this version is wrong.
>>
>> the original version(mainline) is wrong but it works as once we find the SoCs
>> support MTE, we will split large folios into small pages. so only small pages
>> will be added into swapcache successfully.
>>
>> but now we want to swap out large folios even on SoCs with MTE as a whole,
>> we don't split, so this breaks the assumption do_swap_page() will always get
>> small pages.
> 
> let me clarify this more. The current mainline assumes
> arch_swap_restore() always
> get a folio with only one page. this is true as we split large folios
> if we find SoCs
> have MTE. but since we are dropping the split now, that means a large
> folio can be
> gotten by do_swap_page(). we have a chance that try_to_unmap_one() has been done
> but folio is not put. so PTEs will have swap entry but folio is still
> there, and do_swap_page()
> to hit cache directly and the folio won't be released.
> 
> but after getting the large folio in do_swap_page, it still only takes
> one basepage particularly
> for the faulted PTE and maps this 4KB PTE only. so it uses the faulted
> swap_entry and
> the folio as parameters to call arch_swap_restore() which can be something like:
> 
> do_swap_page()
> {
>         arch_swap_restore(the swap entry for the faulted 4KB PTE, large folio);
> }

OK, I understand what's going on, but it seems like a bad API decision. I think
Steve is saying the same thing: if it's only intended to operate on a single
page, it would be much clearer to pass the actual page rather than the folio,
i.e. leave the complexity of figuring out the target page to the caller, which
understands all this.

As a side note, if the folio is still in the cache, doesn't that imply that the
tags haven't been torn down yet? So perhaps you can avoid even making the call
in this case?

>>
>>>
>>> Thanks,
>>> Ryan
> 
> Thanks
> Barry
  
Barry Song Nov. 8, 2023, 9:04 p.m. UTC | #19
On Thu, Nov 9, 2023 at 4:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 08/11/2023 11:23, Barry Song wrote:
> > On Wed, Nov 8, 2023 at 2:05 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>
> >>> On 04/11/2023 09:34, Barry Song wrote:
> >>>>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> >>>>> and if failing then arch_prepare_to_swap() would need to put things back
> >>>>> how they were with calls to mte_invalidate_tags() (although I think
> >>>>> you'd actually want to refactor to create a function which takes a
> >>>>> struct page *).
> >>>>>
> >>>>> Steve
> >>>>
> >>>> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> >>>> One tricky thing is that we are restoring one page rather than folio
> >>>> in arch_restore_swap() as we are only swapping in one page at this
> >>>> stage.
> >>>>
> >>>> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> >>>>
> >>>> This patch makes MTE tags saving and restoring support large folios,
> >>>> then we don't need to split them into base pages for swapping on
> >>>> ARM64 SoCs with MTE.
> >>>>
> >>>> This patch moves arch_prepare_to_swap() to take folio rather than
> >>>> page, as we support THP swap-out as a whole. And this patch also
> >>>> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> >>>> needs it.
> >>>>
> >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>> ---
> >>>>  arch/arm64/include/asm/pgtable.h | 21 +++------------
> >>>>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
> >>>>  include/linux/huge_mm.h          | 12 ---------
> >>>>  include/linux/pgtable.h          |  2 +-
> >>>>  mm/page_io.c                     |  2 +-
> >>>>  mm/swap_slots.c                  |  2 +-
> >>>>  6 files changed, 51 insertions(+), 32 deletions(-)
> >>>>
> >>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >>>> index b19a8aee684c..d8f523dc41e7 100644
> >>>> --- a/arch/arm64/include/asm/pgtable.h
> >>>> +++ b/arch/arm64/include/asm/pgtable.h
> >>>> @@ -45,12 +45,6 @@
> >>>>       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >>>>
> >>>> -static inline bool arch_thp_swp_supported(void)
> >>>> -{
> >>>> -     return !system_supports_mte();
> >>>> -}
> >>>> -#define arch_thp_swp_supported arch_thp_swp_supported
> >>>> -
> >>>>  /*
> >>>>   * Outside of a few very special situations (e.g. hibernation), we always
> >>>>   * use broadcast TLB invalidation instructions, therefore a spurious page
> >>>> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >>>>  #ifdef CONFIG_ARM64_MTE
> >>>>
> >>>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> >>>> -static inline int arch_prepare_to_swap(struct page *page)
> >>>> -{
> >>>> -     if (system_supports_mte())
> >>>> -             return mte_save_tags(page);
> >>>> -     return 0;
> >>>> -}
> >>>> +#define arch_prepare_to_swap arch_prepare_to_swap
> >>>> +extern int arch_prepare_to_swap(struct folio *folio);
> >>>>
> >>>>  #define __HAVE_ARCH_SWAP_INVALIDATE
> >>>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> >>>> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
> >>>>  }
> >>>>
> >>>>  #define __HAVE_ARCH_SWAP_RESTORE
> >>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>>> -{
> >>>> -     if (system_supports_mte())
> >>>> -             mte_restore_tags(entry, &folio->page);
> >>>> -}
> >>>> +#define arch_swap_restore arch_swap_restore
> >>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >>>>
> >>>>  #endif /* CONFIG_ARM64_MTE */
> >>>>
> >>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> >>>> index a31833e3ddc5..14a479e4ea8e 100644
> >>>> --- a/arch/arm64/mm/mteswap.c
> >>>> +++ b/arch/arm64/mm/mteswap.c
> >>>> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >>>>       mte_free_tag_storage(tags);
> >>>>  }
> >>>>
> >>>> +static inline void __mte_invalidate_tags(struct page *page)
> >>>> +{
> >>>> +     swp_entry_t entry = page_swap_entry(page);
> >>>> +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> >>>> +}
> >>>> +
> >>>>  void mte_invalidate_tags_area(int type)
> >>>>  {
> >>>>       swp_entry_t entry = swp_entry(type, 0);
> >>>> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
> >>>>       }
> >>>>       xa_unlock(&mte_pages);
> >>>>  }
> >>>> +
> >>>> +int arch_prepare_to_swap(struct folio *folio)
> >>>> +{
> >>>> +     int err;
> >>>> +     long i;
> >>>> +
> >>>> +     if (system_supports_mte()) {
> >>>> +             long nr = folio_nr_pages(folio);
> >>>
> >>> nit: there should be a clear line between variable declarations and logic.
> >>
> >> right.
> >>
> >>>
> >>>> +             for (i = 0; i < nr; i++) {
> >>>> +                     err = mte_save_tags(folio_page(folio, i));
> >>>> +                     if (err)
> >>>> +                             goto out;
> >>>> +             }
> >>>> +     }
> >>>> +     return 0;
> >>>> +
> >>>> +out:
> >>>> +     while (--i)
> >>>
> >>> If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
> >>> then it will wrap and run ~forever. I think you meant `while (i--)`?
> >>
> >> nop. if i=0 and we goto out, that means the page0 has failed to save tags,
> >> there is nothing to revert. if i=3 and we goto out, that means 0,1,2 have
> >> saved, we restore 0,1,2 and we don't restore 3.
> >
> > I am terribly sorry for my previous noise. You are right, Ryan. i
> > actually meant i--.
>
> No problem - it saves me from writing a long response explaining why --i is
> wrong, at least!
>
> >
> >>
> >>>
> >>>> +             __mte_invalidate_tags(folio_page(folio, i));
> >>>> +     return err;
> >>>> +}
> >>>> +
> >>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>>> +{
> >>>> +     if (system_supports_mte()) {
> >>>> +             /*
> >>>> +              * We don't support large folios swap in as whole yet, but
> >>>> +              * we can hit a large folio which is still in swapcache
> >>>> +              * after those related processes' PTEs have been unmapped
> >>>> +              * but before the swapcache folio  is dropped, in this case,
> >>>> +              * we need to find the exact page which "entry" is mapping
> >>>> +              * to. If we are not hitting swapcache, this folio won't be
> >>>> +              * large
> >>>> +              */
> >>>
> >>> So the currently defined API allows a large folio to be passed but the caller is
> >>> supposed to find the single correct page using the swap entry? That feels quite
> >>> nasty to me. And that's not what the old version of the function was doing; it
> >>> always assumed that the folio was small and passed the first page (which also
> >>> doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
> >>> to fix that. If the old version is correct, then I guess this version is wrong.
> >>
> >> the original version(mainline) is wrong but it works as once we find the SoCs
> >> support MTE, we will split large folios into small pages. so only small pages
> >> will be added into swapcache successfully.
> >>
> >> but now we want to swap out large folios even on SoCs with MTE as a whole,
> >> we don't split, so this breaks the assumption do_swap_page() will always get
> >> small pages.
> >
> > let me clarify this more. The current mainline assumes
> > arch_swap_restore() always
> > get a folio with only one page. this is true as we split large folios
> > if we find SoCs
> > have MTE. but since we are dropping the split now, that means a large
> > folio can be
> > gotten by do_swap_page(). we have a chance that try_to_unmap_one() has been done
> > but folio is not put. so PTEs will have swap entry but folio is still
> > there, and do_swap_page()
> > to hit cache directly and the folio won't be released.
> >
> > but after getting the large folio in do_swap_page, it still only takes
> > one basepage particularly
> > for the faulted PTE and maps this 4KB PTE only. so it uses the faulted
> > swap_entry and
> > the folio as parameters to call arch_swap_restore() which can be something like:
> >
> > do_swap_page()
> > {
> >         arch_swap_restore(the swap entry for the faulted 4KB PTE, large folio);
> > }
>
> OK, I understand what's going on, but it seems like a bad API decision. I think
> Steve is saying the same thing; If its only intended to operate on a single
> page, it would be much clearer to pass the actual page rather than the folio;
> i.e. leave the complexity of figuring out the target page to the caller, which
> understands all this.

right.

>
> As a side note, if the folio is still in the cache, doesn't that imply that the
> tags haven't been torn down yet? So perhaps you can avoid even making the call
> in this case?

Right, but it is practically very hard, as arch_swap_restore() is
always called unconditionally. It is hard to find a decent condition to
check before calling arch_swap_restore(). That is why we are actually
doing redundant arch_swap_restore() calls lots of times right now.

For example, A forks B, C, D, E, F and G, so all of them share one page
before CoW. After the page is swapped out, if B is the first process to
swap it in, B adds the page to the swapcache and restores the MTE tags.
After that, A, C, D, E, F and G all hit the page swapped in by B
directly and restore the tags again, so the tags are restored 7 times
even though only B actually needs to do it.

So it seems we could add a condition to let only B do the restore. But
that won't work because we can't guarantee B is the first process to do
the PTE mapping. A, C, D, E, F or G can map their PTEs earlier than B
even if B is the one which did the I/O swap-in: swap-in/adding to the
swapcache and PTE mapping are not done atomically, and PTE mapping needs
to take the PTL, so after B has done the swap-in, the others can still
begin to use the page earlier than B. It turns out whoever first maps
the page should restore the tags, but the question is: how could any of
A, B, C, D, E, F or G know whether it is the first one mapping the page
into a PTE?

>
> >>
> >>>
> >>> Thanks,
> >>> Ryan
> >
> > Thanks
> > Barry
>

Thanks
Barry
  
Barry Song Feb. 5, 2024, 9:51 a.m. UTC | #20
+Chris, Suren and Chuanhua

Hi Ryan,

> +	/*
> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> +	 * so indicate that we are scanning to synchronise with swapoff.
> +	 */
> +	si->flags += SWP_SCANNING;
> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> +	si->flags -= SWP_SCANNING;

Nobody is using this scan_base afterwards; it seems a bit weird to
pass a pointer.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  					if (!can_split_folio(folio, NULL))
>  						goto activate_locked;
>  					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split PMD-mappable folios without a
> +					 * PMD map right away. Chances are some
> +					 * or all of the tail pages can be freed
> +					 * without IO.
>  					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (folio_test_pmd_mappable(folio) &&
> +					    !folio_entire_mapcount(folio) &&
>  					    split_folio_to_list(folio,
>  								folio_list))
>  						goto activate_locked;
> --

Chuanhua and I ran this patchset for a couple of days and found a race
between reclamation and split_folio; it can cause applications to read
corrupted (zero-filled) data while swapping in.

Consider the case where one thread (T1) is reclaiming a large folio by
some means while another thread (T2) is calling madvise MADV_PAGEOUT on
it, and at the same time two threads T3 and T4 swap the same page in, in
parallel. T1 doesn't split the folio and T2 does split, as below:

static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
                                unsigned long addr, unsigned long end, 
                                struct mm_walk *walk)
{

                /*   
                 * Creating a THP page is expensive so split it only if we
                 * are sure it's worth. Split it if we are only owner.
                 */
                if (folio_test_large(folio)) {
                        int err; 

                        if (folio_estimated_sharers(folio) != 1)
                                break;
                        if (pageout_anon_only_filter && !folio_test_anon(folio))
                                break;
                        if (!folio_trylock(folio))
                                break;
                        folio_get(folio);
                        arch_leave_lazy_mmu_mode();
                        pte_unmap_unlock(start_pte, ptl);
                        start_pte = NULL;
                        err = split_folio(folio);
                        folio_unlock(folio);
                        folio_put(folio);
                        if (err)
                                break;
                        start_pte = pte =
                                pte_offset_map_lock(mm, pmd, addr, &ptl);
                        if (!start_pte)
                                break;
                        arch_enter_lazy_mmu_mode();
                        pte--;
                        addr -= PAGE_SIZE;
                        continue;
                }    

        return 0;
}



If T3 and T4 swap in the same page, they both do swap_read_folio().
Whichever of T3 and T4 gets the PTL first will set the pte, and the
second will check pte_same(), find the pte has been changed by the other
thread, and thus goto out_nomap in do_swap_page():
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
        if (!folio) {
                if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
                    __swap_count(entry) == 1) {
                        /* skip swapcache */
                        folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
                                                vma, vmf->address, false);
                        page = &folio->page;
                        if (folio) {
                                __folio_set_locked(folio);
                                __folio_set_swapbacked(folio);
                         
                                /* To provide entry to swap_read_folio() */
                                folio->swap = entry;
                                swap_read_folio(folio, true, NULL);
                                folio->private = NULL;
                        }
                } else {
                }
        
        
        /*
         * Back out if somebody else already faulted in this pte.
         */
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                        &vmf->ptl);
        if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;

        swap_free(entry);
        pte = mk_pte(page, vma->vm_page_prot);

        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
        return ret;
}


While T1 and T2 are working in parallel, T2 will split the folio. This
can race with T1's reclamation, which doesn't split: T2 splits the large
folio into a number of normal pages and reclaims them.

If T3 finishes swap_read_folio() and gets the PTL earlier than T4, it
calls set_pte and swap_free. This causes zRAM to free the slot, so T4
then reads zero data in swap_read_folio(), because the zRAM code below
fills freed slots with zero:

static int zram_read_from_zspool(struct zram *zram, struct page *page,
                                 u32 index)
{
        ...

        handle = zram_get_handle(zram, index);
        if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
                unsigned long value;
                void *mem;

                value = handle ? zram_get_element(zram, index) : 0; 
                mem = kmap_local_page(page);
                zram_fill_page(mem, PAGE_SIZE, value);
                kunmap_local(mem);
                return 0;
        }
}

Usually, after T3 frees the swap entry and does set_pte, T4's pte_same()
check becomes false and it won't set the pte again, so the zRAM driver
filling freed slots with zero data is not a problem at all. But in this
race, T1 and T2 may set swap entries into the ptes twice, since T1
doesn't split while T2 does (the split normal folios are also added to
the reclaim list). Thus the corrupted zero data gets a chance to be set
into a PTE by T4: T4 reads the newer PTE, which was set the second time
and carries the same swap entry as its orig_pte, after T3 has already
swapped in and freed that swap entry.

We have worked around this problem by preventing MADV_PAGEOUT (T2 in the
example above) from splitting large folios and letting it skip the large
folio entirely once it detects a concurrent reclamation of that folio; a
rough sketch follows below.
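
A heavily hedged sketch of that idea, expressed as an extra check in
madvise_cold_or_pageout_pte_range(). Treating "already in the swapcache
or under writeback" as the signal that someone else is reclaiming the
folio is an assumption of this illustration only; the RFC patch linked
below may implement the detection differently:

                if (folio_test_large(folio)) {
                        if (folio_estimated_sharers(folio) != 1)
                                break;
                        /*
                         * Hypothetical: if another thread is already
                         * reclaiming this large folio (it is in the
                         * swapcache or under writeback), skip it rather
                         * than splitting it.
                         */
                        if (folio_test_swapcache(folio) ||
                            folio_test_writeback(folio))
                                break;
                        ...
                }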

So my understanding is that changing vmscan isn't sufficient to support
large folio swap-out without splitting; we have to adjust madvise as
well. We will have a fix for this problem in
[PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/

But I feel this patch should be a part of your swap-out patchset rather
than the swap-in series from Chuanhua and me :-)

Thanks
Barry
  
Ryan Roberts Feb. 5, 2024, 12:14 p.m. UTC | #21
On 05/02/2024 09:51, Barry Song wrote:
> +Chris, Suren and Chuanhua
> 
> Hi Ryan,
> 
>> +	/*
>> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +	 * so indicate that we are scanning to synchronise with swapoff.
>> +	 */
>> +	si->flags += SWP_SCANNING;
>> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +	si->flags -= SWP_SCANNING;
> 
> nobody is using this scan_base afterwards. it seems a bit weird to
> pass a pointer.
> 
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>  					if (!can_split_folio(folio, NULL))
>>  						goto activate_locked;
>>  					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>  					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>  					    split_folio_to_list(folio,
>>  								folio_list))
>>  						goto activate_locked;
>> --
> 
> Chuanhua and I ran this patchset for a couple of days and found a race
> between reclamation and split_folio. this might cause applications get
> wrong data 0 while swapping-in.
> 
> in case one thread(T1) is reclaiming a large folio by some means, still
> another thread is calling madvise MADV_PGOUT(T2). and at the same time,
> we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
> and T2 does split as below,
> 
> static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                                 unsigned long addr, unsigned long end, 
>                                 struct mm_walk *walk)
> {
> 
>                 /*   
>                  * Creating a THP page is expensive so split it only if we
>                  * are sure it's worth. Split it if we are only owner.
>                  */
>                 if (folio_test_large(folio)) {
>                         int err; 
> 
>                         if (folio_estimated_sharers(folio) != 1)
>                                 break;
>                         if (pageout_anon_only_filter && !folio_test_anon(folio))
>                                 break;
>                         if (!folio_trylock(folio))
>                                 break;
>                         folio_get(folio);
>                         arch_leave_lazy_mmu_mode();
>                         pte_unmap_unlock(start_pte, ptl);
>                         start_pte = NULL;
>                         err = split_folio(folio);
>                         folio_unlock(folio);
>                         folio_put(folio);
>                         if (err)
>                                 break;
>                         start_pte = pte =
>                                 pte_offset_map_lock(mm, pmd, addr, &ptl);
>                         if (!start_pte)
>                                 break;
>                         arch_enter_lazy_mmu_mode();
>                         pte--;
>                         addr -= PAGE_SIZE;
>                         continue;
>                 }    
> 
>         return 0;
> }
> 
> 
> 
> if T3 and T4 swap-in same page, and they both do swap_read_folio(). the
> first one of T3 and T4 who gets PTL will set pte, and the 2nd one will
> check pte_same() and find pte has been changed by another thread, thus
> goto out_nomap in do_swap_page.
> vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
>         if (!folio) {
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
>                         /* skip swapcache */
>                         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>                                                 vma, vmf->address, false);
>                         page = &folio->page;
>                         if (folio) {
>                                 __folio_set_locked(folio);
>                                 __folio_set_swapbacked(folio);
>                          
>                                 /* To provide entry to swap_read_folio() */
>                                 folio->swap = entry;
>                                 swap_read_folio(folio, true, NULL);
>                                 folio->private = NULL;
>                         }
>                 } else {
>                 }
>         
>         
>         /*
>          * Back out if somebody else already faulted in this pte.
>          */
>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>                         &vmf->ptl);
>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>                 goto out_nomap;
> 
>         swap_free(entry);
>         pte = mk_pte(page, vma->vm_page_prot);
> 
>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>         return ret;
> }
> 
> 
> while T1 and T2 is working in parallel, T2 will split folio. this can
> run into race with T1's reclamation without splitting. T2 will split
> a large folio into a couple of normal pages and reclaim them.
> 
> If T3 finishes swap_read_folio and gets PTL earlier than T4, it calls
> set_pte and swap_free. this will cause zRAM to free the slot. then
> t4 will get zero data in swap_read_folio() as the below zRAM code
> will fill zero for freed slots, 
> 
> static int zram_read_from_zspool(struct zram *zram, struct page *page,
>                                  u32 index)
> {
>         ...
> 
>         handle = zram_get_handle(zram, index);
>         if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
>                 unsigned long value;
>                 void *mem;
> 
>                 value = handle ? zram_get_element(zram, index) : 0; 
>                 mem = kmap_local_page(page);
>                 zram_fill_page(mem, PAGE_SIZE, value);
>                 kunmap_local(mem);
>                 return 0;
>         }
> }
> 
> usually, after t3 frees swap and does set_pte, t4's pte_same becomes
> false, it won't set pte again. So filling zero data into freed slot
> by zRAM driver is not a problem at all. but the race is that T1 and
> T2 might do set swap to ptes twice as t1 doesn't split but t2 splits
> (splitted normal folios are also added into reclaim_list), thus, the
> corrupted zero data will get a chance to be set into PTE by t4 as t4
> reads the new PTE which is set secondly and has the same swap entry
> as its orig_pte after T3 has swapped-in and free the swap entry.
> 
> we have worked around this problem by preventing T4 from splitting
> large folios and letting it goto skip the large folios entirely in
> MADV PAGEOUT once we detect a concurrent reclamation for this large
> folio.
> 
> so my understanding is changing vmscan isn't sufficient to support
> large folio swap-out without splitting. we have to adjust madvise
> as well. we will have a fix for this problem in
> [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
> https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
> 
> But i feel this patch should be a part of your swap-out patchset rather
> than the swap-in series of Chuanhua and me :-)

Hi Barry, Chuanhua,

Thanks for the very detailed bug report! I'm going to have to take some time to
get my head around the details. But yes, I agree the fix needs to be part of the
swap-out series.

Sorry I haven't progressed this series as I had hoped. I've been concentrating
on getting the contpte series upstream. I'm hoping I will find some time to move
this series along by the tail end of Feb (hoping to get it in shape for v6.10).
Hopefully that doesn't cause you any big problems?

Thanks,
Ryan

> 
> Thanks
> Barry
  
Barry Song Feb. 18, 2024, 11:40 p.m. UTC | #22
On Tue, Feb 6, 2024 at 1:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/02/2024 09:51, Barry Song wrote:
> > +Chris, Suren and Chuanhua
> >
> > Hi Ryan,
> >
> >> +    /*
> >> +     * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> >> +     * so indicate that we are scanning to synchronise with swapoff.
> >> +     */
> >> +    si->flags += SWP_SCANNING;
> >> +    ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> >> +    si->flags -= SWP_SCANNING;
> >
> > nobody is using this scan_base afterwards. it seems a bit weird to
> > pass a pointer.
> >
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                      if (!can_split_folio(folio, NULL))
> >>                                              goto activate_locked;
> >>                                      /*
> >> -                                     * Split folios without a PMD map right
> >> -                                     * away. Chances are some or all of the
> >> -                                     * tail pages can be freed without IO.
> >> +                                     * Split PMD-mappable folios without a
> >> +                                     * PMD map right away. Chances are some
> >> +                                     * or all of the tail pages can be freed
> >> +                                     * without IO.
> >>                                       */
> >> -                                    if (!folio_entire_mapcount(folio) &&
> >> +                                    if (folio_test_pmd_mappable(folio) &&
> >> +                                        !folio_entire_mapcount(folio) &&
> >>                                          split_folio_to_list(folio,
> >>                                                              folio_list))
> >>                                              goto activate_locked;
> >> --
> >
> > Chuanhua and I ran this patchset for a couple of days and found a race
> > between reclamation and split_folio. this might cause applications get
> > wrong data 0 while swapping-in.
> >
> > in case one thread(T1) is reclaiming a large folio by some means, still
> > another thread is calling madvise MADV_PGOUT(T2). and at the same time,
> > we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
> > and T2 does split as below,
> >
> > static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >                                 unsigned long addr, unsigned long end,
> >                                 struct mm_walk *walk)
> > {
> >
> >                 /*
> >                  * Creating a THP page is expensive so split it only if we
> >                  * are sure it's worth. Split it if we are only owner.
> >                  */
> >                 if (folio_test_large(folio)) {
> >                         int err;
> >
> >                         if (folio_estimated_sharers(folio) != 1)
> >                                 break;
> >                         if (pageout_anon_only_filter && !folio_test_anon(folio))
> >                                 break;
> >                         if (!folio_trylock(folio))
> >                                 break;
> >                         folio_get(folio);
> >                         arch_leave_lazy_mmu_mode();
> >                         pte_unmap_unlock(start_pte, ptl);
> >                         start_pte = NULL;
> >                         err = split_folio(folio);
> >                         folio_unlock(folio);
> >                         folio_put(folio);
> >                         if (err)
> >                                 break;
> >                         start_pte = pte =
> >                                 pte_offset_map_lock(mm, pmd, addr, &ptl);
> >                         if (!start_pte)
> >                                 break;
> >                         arch_enter_lazy_mmu_mode();
> >                         pte--;
> >                         addr -= PAGE_SIZE;
> >                         continue;
> >                 }
> >
> >         return 0;
> > }
> >
> >
> >
> > if T3 and T4 swap-in same page, and they both do swap_read_folio(). the
> > first one of T3 and T4 who gets PTL will set pte, and the 2nd one will
> > check pte_same() and find pte has been changed by another thread, thus
> > goto out_nomap in do_swap_page.
> > vm_fault_t do_swap_page(struct vm_fault *vmf)
> > {
> >         if (!folio) {
> >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                     __swap_count(entry) == 1) {
> >                         /* skip swapcache */
> >                         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >                                                 vma, vmf->address, false);
> >                         page = &folio->page;
> >                         if (folio) {
> >                                 __folio_set_locked(folio);
> >                                 __folio_set_swapbacked(folio);
> >
> >                                 /* To provide entry to swap_read_folio() */
> >                                 folio->swap = entry;
> >                                 swap_read_folio(folio, true, NULL);
> >                                 folio->private = NULL;
> >                         }
> >                 } else {
> >                 }
> >
> >
> >         /*
> >          * Back out if somebody else already faulted in this pte.
> >          */
> >         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >                         &vmf->ptl);
> >         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >                 goto out_nomap;
> >
> >         swap_free(entry);
> >         pte = mk_pte(page, vma->vm_page_prot);
> >
> >         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >         return ret;
> > }
> >
> >
> > while T1 and T2 is working in parallel, T2 will split folio. this can
> > run into race with T1's reclamation without splitting. T2 will split
> > a large folio into a couple of normal pages and reclaim them.
> >
> > If T3 finishes swap_read_folio and gets PTL earlier than T4, it calls
> > set_pte and swap_free. this will cause zRAM to free the slot. then
> > t4 will get zero data in swap_read_folio() as the below zRAM code
> > will fill zero for freed slots,
> >
> > static int zram_read_from_zspool(struct zram *zram, struct page *page,
> >                                  u32 index)
> > {
> >         ...
> >
> >         handle = zram_get_handle(zram, index);
> >         if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
> >                 unsigned long value;
> >                 void *mem;
> >
> >                 value = handle ? zram_get_element(zram, index) : 0;
> >                 mem = kmap_local_page(page);
> >                 zram_fill_page(mem, PAGE_SIZE, value);
> >                 kunmap_local(mem);
> >                 return 0;
> >         }
> > }
> >
> > usually, after t3 frees swap and does set_pte, t4's pte_same becomes
> > false, it won't set pte again. So filling zero data into freed slot
> > by zRAM driver is not a problem at all. but the race is that T1 and
> > T2 might do set swap to ptes twice as t1 doesn't split but t2 splits
> > (splitted normal folios are also added into reclaim_list), thus, the
> > corrupted zero data will get a chance to be set into PTE by t4 as t4
> > reads the new PTE which is set secondly and has the same swap entry
> > as its orig_pte after T3 has swapped-in and free the swap entry.
> >
> > we have worked around this problem by preventing T4 from splitting
> > large folios and letting it goto skip the large folios entirely in
> > MADV PAGEOUT once we detect a concurrent reclamation for this large
> > folio.
> >
> > so my understanding is changing vmscan isn't sufficient to support
> > large folio swap-out without splitting. we have to adjust madvise
> > as well. we will have a fix for this problem in
> > [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
> > https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
> >
> > But i feel this patch should be a part of your swap-out patchset rather
> > than the swap-in series of Chuanhua and me :-)
>
> Hi Barry, Chuanhua,
>
> Thanks for the very detailed bug report! I'm going to have to take some time to
> get my head around the details. But yes, I agree the fix needs to be part of the
> swap-out series.
>

Hi Ryan,
I am running into some races, especially with both large folio swap-out
and swap-in enabled. For some of them I am still struggling to work out
the detailed timing of how they happen, but the change below helps
remove the bugs that cause corrupted data.

index da2aab219c40..ef9cfbc84760 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1953,6 +1953,16 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,

                        if (folio_test_pmd_mappable(folio))
                                flags |= TTU_SPLIT_HUGE_PMD;
+                       /*
+                        * make try_to_unmap_one hold ptl from the very first
+                        * beginning if we are reclaiming a folio with multi-
+                        * ptes. otherwise, we may only reclaim a part of the
+                        * folio from the middle.
+                        * for example, a parallel thread might temporarily
+                        * set pte to none for various purposes.
+                        */
+                       else if (folio_test_large(folio))
+                               flags |= TTU_SYNC;

                        try_to_unmap(folio, flags);
                        if (folio_mapped(folio)) {


While we are swapping out a large folio, it has many ptes, and we change
those ptes to swap entries in try_to_unmap_one(). The
"while (page_vma_mapped_walk(&pvmw))" loop iterates over all ptes within
the large folio, but it only begins to acquire the ptl once it meets a
valid pte, as marked with /* xxxxxxx */ below:

static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
{
        pte_t ptent;

        if (pvmw->flags & PVMW_SYNC) {
                /* Use the stricter lookup */
                pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
                                                pvmw->address, &pvmw->ptl);
                *ptlp = pvmw->ptl;
                return !!pvmw->pte;
        }

       ...
        pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
                                          pvmw->address, ptlp);
        if (!pvmw->pte)
                return false;

        ptent = ptep_get(pvmw->pte);

        if (pvmw->flags & PVMW_MIGRATION) {
                if (!is_swap_pte(ptent))
                        return false;
        } else if (is_swap_pte(ptent)) {
                swp_entry_t entry;
                ...
                entry = pte_to_swp_entry(ptent);
                if (!is_device_private_entry(entry) &&
                    !is_device_exclusive_entry(entry))
                        return false;
        } else if (!pte_present(ptent)) {
                return false;
        }
        pvmw->ptl = *ptlp;
        spin_lock(pvmw->ptl);   /* xxxxxxx */
        return true;
}


For various reasons, for example a break-before-make sequence used to
clear access flags, a pte can temporarily be set to none. Since
page_vma_mapped_walk() doesn't hold the ptl from the beginning, it might
only begin to set swap entries from the middle of a large folio.

For example, if a large folio has 16 ptes and ptes 0, 1 and 2 happen to
be zero during the intermediate stage of a break-before-make, the ptl
will only be held from the 3rd pte onwards, and swap entries will only
be set from the 3rd pte as well. That is not good: we are trying to swap
out a large folio, but we end up swapping out only part of it.

I am still struggling with the exact timing of the races, but using
PVMW_SYNC to explicitly ask for the ptl from the first pte seems a good
thing for large folios regardless of those races: it avoids
try_to_unmap_one() reading an intermediate pte and then making the wrong
decision, so that reclaiming a pte-mapped large folio is as atomic as
reclaiming just one pte.
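
For reference, the reason the one-line TTU_SYNC change above is enough
to trigger the stricter lookup in map_pte() is that try_to_unmap_one()
forwards the flag to the walker as PVMW_SYNC, roughly as in mainline
mm/rmap.c (paraphrased excerpt; the comment is an explanation added
here, not the kernel's):

static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                             unsigned long address, void *arg)
{
        struct mm_struct *mm = vma->vm_mm;
        DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
        enum ttu_flags flags = (enum ttu_flags)(long)arg;
        ...

        /*
         * TTU_SYNC becomes PVMW_SYNC here, which is what makes map_pte()
         * quoted above take the "stricter lookup" branch and grab the ptl
         * before looking at the first pte of the range.
         */
        if (flags & TTU_SYNC)
                pvmw.flags |= PVMW_SYNC;
        ...
}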

> Sorry I haven't progressed this series as I had hoped. I've been concentrating
> on getting the contpte series upstream. I'm hoping I will find some time to move
> this series along by the tail end of Feb (hoping to get it in shape for v6.10).
> Hopefully that doesn't cause you any big problems?

no worries. Anyway, we are already using your code to run various tests.

>
> Thanks,
> Ryan

Thanks
Barry
  
Ryan Roberts Feb. 20, 2024, 8:03 p.m. UTC | #23
On 18/02/2024 23:40, Barry Song wrote:
> On Tue, Feb 6, 2024 at 1:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 05/02/2024 09:51, Barry Song wrote:
>>> +Chris, Suren and Chuanhua
>>>
>>> Hi Ryan,
>>>
>>>> +    /*
>>>> +     * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>>>> +     * so indicate that we are scanning to synchronise with swapoff.
>>>> +     */
>>>> +    si->flags += SWP_SCANNING;
>>>> +    ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>>>> +    si->flags -= SWP_SCANNING;
>>>
>>> nobody is using this scan_base afterwards. it seems a bit weird to
>>> pass a pointer.
>>>
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>                                      if (!can_split_folio(folio, NULL))
>>>>                                              goto activate_locked;
>>>>                                      /*
>>>> -                                     * Split folios without a PMD map right
>>>> -                                     * away. Chances are some or all of the
>>>> -                                     * tail pages can be freed without IO.
>>>> +                                     * Split PMD-mappable folios without a
>>>> +                                     * PMD map right away. Chances are some
>>>> +                                     * or all of the tail pages can be freed
>>>> +                                     * without IO.
>>>>                                       */
>>>> -                                    if (!folio_entire_mapcount(folio) &&
>>>> +                                    if (folio_test_pmd_mappable(folio) &&
>>>> +                                        !folio_entire_mapcount(folio) &&
>>>>                                          split_folio_to_list(folio,
>>>>                                                              folio_list))
>>>>                                              goto activate_locked;
>>>> --
>>>
>>> Chuanhua and I ran this patchset for a couple of days and found a race
>>> between reclamation and split_folio. this might cause applications get
>>> wrong data 0 while swapping-in.
>>>
>>> in case one thread(T1) is reclaiming a large folio by some means, still
>>> another thread is calling madvise MADV_PGOUT(T2). and at the same time,
>>> we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
>>> and T2 does split as below,
>>>
>>> static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>                                 unsigned long addr, unsigned long end,
>>>                                 struct mm_walk *walk)
>>> {
>>>
>>>                 /*
>>>                  * Creating a THP page is expensive so split it only if we
>>>                  * are sure it's worth. Split it if we are only owner.
>>>                  */
>>>                 if (folio_test_large(folio)) {
>>>                         int err;
>>>
>>>                         if (folio_estimated_sharers(folio) != 1)
>>>                                 break;
>>>                         if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>                                 break;
>>>                         if (!folio_trylock(folio))
>>>                                 break;
>>>                         folio_get(folio);
>>>                         arch_leave_lazy_mmu_mode();
>>>                         pte_unmap_unlock(start_pte, ptl);
>>>                         start_pte = NULL;
>>>                         err = split_folio(folio);
>>>                         folio_unlock(folio);
>>>                         folio_put(folio);
>>>                         if (err)
>>>                                 break;
>>>                         start_pte = pte =
>>>                                 pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>                         if (!start_pte)
>>>                                 break;
>>>                         arch_enter_lazy_mmu_mode();
>>>                         pte--;
>>>                         addr -= PAGE_SIZE;
>>>                         continue;
>>>                 }
>>>
>>>         return 0;
>>> }
>>>
>>>
>>>
>>> If T3 and T4 swap in the same page, they both do swap_read_folio(). The
>>> first of T3 and T4 to get the PTL will set the pte, and the second will
>>> check pte_same(), find the pte has been changed by the other thread, and
>>> thus goto out_nomap in do_swap_page():
>>> vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> {
>>>         if (!folio) {
>>>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>>                     __swap_count(entry) == 1) {
>>>                         /* skip swapcache */
>>>                         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>>>                                                 vma, vmf->address, false);
>>>                         page = &folio->page;
>>>                         if (folio) {
>>>                                 __folio_set_locked(folio);
>>>                                 __folio_set_swapbacked(folio);
>>>
>>>                                 /* To provide entry to swap_read_folio() */
>>>                                 folio->swap = entry;
>>>                                 swap_read_folio(folio, true, NULL);
>>>                                 folio->private = NULL;
>>>                         }
>>>                 } else {
>>>                 }
>>>
>>>
>>>         /*
>>>          * Back out if somebody else already faulted in this pte.
>>>          */
>>>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>>                         &vmf->ptl);
>>>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>>                 goto out_nomap;
>>>
>>>         swap_free(entry);
>>>         pte = mk_pte(page, vma->vm_page_prot);
>>>
>>>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>>         return ret;
>>> }
>>>
>>>
>>> While T1 and T2 are working in parallel, T2 will split the folio. This
>>> can race with T1's reclamation, which doesn't split: T2 splits the large
>>> folio into a number of normal pages and reclaims them.
>>>
>>> If T3 finishes swap_read_folio and gets the PTL earlier than T4, it calls
>>> set_pte and swap_free. This causes zRAM to free the slot, and T4 will then
>>> read zero data in swap_read_folio(), as the zRAM code below fills freed
>>> slots with zeros:
>>>
>>> static int zram_read_from_zspool(struct zram *zram, struct page *page,
>>>                                  u32 index)
>>> {
>>>         ...
>>>
>>>         handle = zram_get_handle(zram, index);
>>>         if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
>>>                 unsigned long value;
>>>                 void *mem;
>>>
>>>                 value = handle ? zram_get_element(zram, index) : 0;
>>>                 mem = kmap_local_page(page);
>>>                 zram_fill_page(mem, PAGE_SIZE, value);
>>>                 kunmap_local(mem);
>>>                 return 0;
>>>         }
>>> }
>>>
>>> Usually, after T3 frees the swap entry and does set_pte, T4's pte_same
>>> check becomes false and it won't set the pte again, so the zRAM driver
>>> filling freed slots with zero data is not a problem at all. But the race
>>> is that T1 and T2 might set swap entries into the ptes twice, as T1
>>> doesn't split but T2 does (the split normal folios are also added to the
>>> reclaim list). Thus, the corrupted zero data gets a chance to be installed
>>> into the PTE by T4, because T4 reads the PTE that was set the second time,
>>> which has the same swap entry as its orig_pte, after T3 has swapped in and
>>> freed the swap entry.
>>>
>>> We have worked around this problem by preventing MADV_PAGEOUT from
>>> splitting large folios, letting it skip a large folio entirely once we
>>> detect a concurrent reclamation of that folio.
>>>
>>> So my understanding is that changing vmscan isn't sufficient to support
>>> large folio swap-out without splitting; we have to adjust madvise as
>>> well. We will have a fix for this problem in
>>> [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
>>> https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
>>>
>>> But I feel this patch should be a part of your swap-out patchset rather
>>> than the swap-in series from Chuanhua and me :-)
>>
>> Hi Barry, Chuanhua,
>>
>> Thanks for the very detailed bug report! I'm going to have to take some time to
>> get my head around the details. But yes, I agree the fix needs to be part of the
>> swap-out series.
>>
> 
> Hi Ryan,
> I am running into some races, especially with both large folio swap-out and
> swap-in enabled. For some of them I am still struggling to work out the
> detailed timing of how they happen, but the change below helps remove the
> bugs that cause corrupted data.

Thanks for the report! I'm out of office this week, but this is top of my todo
list starting next week, so hopefully will knock these into shape and repost
very soon.

Thanks,
Ryan

> 
> index da2aab219c40..ef9cfbc84760 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1953,6 +1953,16 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> 
>                         if (folio_test_pmd_mappable(folio))
>                                 flags |= TTU_SPLIT_HUGE_PMD;
> +                       /*
> +                        * make try_to_unmap_one hold ptl from the very first
> +                        * beginning if we are reclaiming a folio with multi-
> +                        * ptes. otherwise, we may only reclaim a part of the
> +                        * folio from the middle.
> +                        * for example, a parallel thread might temporarily
> +                        * set pte to none for various purposes.
> +                        */
> +                       else if (folio_test_large(folio))
> +                               flags |= TTU_SYNC;
> 
>                         try_to_unmap(folio, flags);
>                         if (folio_mapped(folio)) {
> 
> 
> While we are swapping out a large folio, it has many ptes, and we change
> those ptes to swap entries in try_to_unmap_one(). "while
> (page_vma_mapped_walk(&pvmw))" will iterate over all ptes within the large
> folio, but it only begins to acquire the ptl when it meets a valid pte, at
> the line marked /* xxxxxxx */ below:
> 
> static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
> {
>         pte_t ptent;
> 
>         if (pvmw->flags & PVMW_SYNC) {
>                 /* Use the stricter lookup */
>                 pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
>                                                 pvmw->address, &pvmw->ptl);
>                 *ptlp = pvmw->ptl;
>                 return !!pvmw->pte;
>         }
> 
>        ...
>         pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>                                           pvmw->address, ptlp);
>         if (!pvmw->pte)
>                 return false;
> 
>         ptent = ptep_get(pvmw->pte);
> 
>         if (pvmw->flags & PVMW_MIGRATION) {
>                 if (!is_swap_pte(ptent))
>                         return false;
>         } else if (is_swap_pte(ptent)) {
>                 swp_entry_t entry;
>                 ...
>                 entry = pte_to_swp_entry(ptent);
>                 if (!is_device_private_entry(entry) &&
>                     !is_device_exclusive_entry(entry))
>                         return false;
>         } else if (!pte_present(ptent)) {
>                 return false;
>         }
>         pvmw->ptl = *ptlp;
>         spin_lock(pvmw->ptl);   /* xxxxxxx */
>         return true;
> }
> 
> 
> For various reasons, for example a break-before-make sequence for clearing
> access flags, a pte can temporarily be set to none. Since
> page_vma_mapped_walk() doesn't hold the ptl from the beginning, it might only
> begin to set swap entries from the middle of a large folio.
> 
> For example, if a large folio has 16 ptes and ptes 0, 1 and 2 happen to be
> zero in the intermediate stage of a break-before-make, the ptl will only be
> held from the 3rd pte, and swap entries will only be set from the 3rd pte
> onwards. That seems wrong: we are trying to swap out a whole large folio, but
> we end up swapping out only part of it.
> 
> I am still struggling with all the timing of the races, but using PVMW_SYNC to
> explicitly take the ptl from the first pte seems a good thing for large folios
> regardless of those races. It avoids try_to_unmap_one reading intermediate
> ptes and then making the wrong decision, so that reclaiming a pte-mapped large
> folio behaves atomically, just as it does with a single pte.
> 
>> Sorry I haven't progressed this series as I had hoped. I've been concentrating
>> on getting the contpte series upstream. I'm hoping I will find some time to move
>> this series along by the tail end of Feb (hoping to get it in shape for v6.10).
>> Hopefully that doesn't cause you any big problems?
> 
> no worries. Anyway, we are already using your code to run various tests.
> 
>>
>> Thanks,
>> Ryan
> 
> Thanks
> Barry
  
Barry Song Feb. 22, 2024, 7:05 a.m. UTC | #24
Hi Ryan,

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2cc0cb41fb32..ea19710aa4cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  					if (!can_split_folio(folio, NULL))
>  						goto activate_locked;
>  					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split PMD-mappable folios without a
> +					 * PMD map right away. Chances are some
> +					 * or all of the tail pages can be freed
> +					 * without IO.
>  					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (folio_test_pmd_mappable(folio) &&
> +					    !folio_entire_mapcount(folio) &&
>  					    split_folio_to_list(folio,
>  								folio_list))
>  						goto activate_locked;

I ran a test to investigate what happens while reclaiming a partially unmapped
large folio: for a 64KiB large folio, MADV_DONTNEED the range 4KiB~64KiB and
keep only the first subpage (0~4KiB) mapped.

My test wants to address my three concerns,
a. whether we will leak swap slots
b. whether we will do redundant I/O
c. whether we will cause races on the swapcache

What I have done is print folio->_nr_pages_mapped and dump the 16 swap_map[]
entries at some specific stages:
1. just after add_to_swap   (swap slots are allocated)
2. before and after try_to_unmap   (ptes are set to swap_entry)
3. before and after pageout (also add printk in zram driver to dump all I/O write)
4. before and after remove_mapping
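
A rough sketch of what such instrumentation might look like (the helper name,
exact print points and formatting here are assumptions, not the exact code
that produced the logs below):

/*
 * Debug-only sketch: dump _nr_pages_mapped and the swap_map bytes backing a
 * swapcache folio. Intended to be called at the four points listed above.
 */
static void dump_large_folio_swap_state(struct folio *folio, const char *stage)
{
	struct swap_info_struct *si = swp_swap_info(folio->swap);
	unsigned long offset = swp_offset(folio->swap);
	long i, nr = min_t(long, folio_nr_pages(folio), 16);
	char buf[3 * 16 + 1];
	int n = 0;

	for (i = 0; i < nr; i++)
		n += scnprintf(buf + n, sizeof(buf) - n, "%02x%s",
			       si->swap_map[offset + i], i == nr - 1 ? "" : "-");

	pr_info("vmscan: %s mapnr:%d offset:%lx swp_map %s\n", stage,
		atomic_read(&folio->_nr_pages_mapped), offset, buf);
}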

Below is the dumped info for a particular large folio:

1. after add_to_swap
[   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
[   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40

As you can see, _nr_pages_mapped is 1 and all 16 swap_map entries are
SWAP_HAS_CACHE (0x40).


2. before and after try_to_unmap
[   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
[   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
[   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
[   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40

As you can see, one pte is set to a swp_entry, and _nr_pages_mapped goes from
1 to 0. The 1st swp_map becomes 0x41, i.e. SWAP_HAS_CACHE + 1.

3. before and after pageout
[   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
[   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
[   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
[   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
[   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
[   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
[   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
[   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
[   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
[   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
[   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
[   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
[   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
[   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
[   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
[   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
[   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
[   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
[   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0

As you can see, we have obviously done redundant I/O: 16 zram_write_page
calls. Although 4~64KiB was zapped by zap_pte_range earlier, we still write
those pages to zRAM.

4. before and after remove_mapping
[   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
[   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
[   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00

As you can see, swp_map entries 1-15 become 0 and only the first swp_map is 1;
all SWAP_HAS_CACHE bits have been cleared. This is perfect and there is no
swap slot leak at all!

Thus, only two concerns are left for me:
1. As we don't split anyway, we have done 15 unnecessary I/Os if a large folio
is partially unmapped.
2. The large folio is added to the swapcache as a whole, covering a range part
of which has already been zapped. I am not quite sure whether this will cause
problems if some concurrent do_anonymous_page, swap-in or swap-out occurs
between stages 3 and 4 on the zapped subpages 1~15. Still struggling.. my
brain is exploding...

To me, it seems safer to split, or do some other similar optimization, if we
find a large folio is partially mapped and partially unmapped.

Thanks
Barry
  
David Hildenbrand Feb. 22, 2024, 10:09 a.m. UTC | #25
On 22.02.24 08:05, Barry Song wrote:
> Hi Ryan,
> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2cc0cb41fb32..ea19710aa4cd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>   					if (!can_split_folio(folio, NULL))
>>   						goto activate_locked;
>>   					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>   					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>   					    split_folio_to_list(folio,
>>   								folio_list))
>>   						goto activate_locked;
> 
> I ran a test to investigate what would happen while reclaiming a partially
> unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
> 4KB~64KB, and keep the first subpage 0~4KiB.

IOW, something that already happens with ordinary THP, IIRC.

>   
> My test wants to address my three concerns,
> a. whether we will have leak on swap slots
> b. whether we will have redundant I/O
> c. whether we will cause races on swapcache
> 
> what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
> at some specific stage
> 1. just after add_to_swap   (swap slots are allocated)
> 2. before and after try_to_unmap   (ptes are set to swap_entry)
> 3. before and after pageout (also add printk in zram driver to dump all I/O write)
> 4. before and after remove_mapping
> 
> The below is the dumped info for a particular large folio,
> 
> 1. after add_to_swap
> [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
> [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> 
> as you can see,
> _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
> 
> 
> 2. before and after try_to_unmap
> [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
> [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
> [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
> [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> 
> as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
> 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
> 
> 3. before and after pageout
> [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
> [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
> [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
> [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
> [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
> [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
> [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
> [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
> [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
> [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
> [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
> [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
> [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
> [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
> [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
> [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
> [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
> [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
> 
> as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
> 4~64KiB has been zap_pte_range before, we still write them to zRAM.
> 
> 4. before and after remove_mapping
> [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
> [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> 
> as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
> all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
> slot leak at all!
> 
> Thus, only two concerns are left for me,
> 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
> is partially unmapped.
> 2. large folio is added as a whole as a swapcache covering the range whose
> part has been zapped. I am not quite sure if this will cause some problems
> while some concurrent do_anon_page, swapin and swapout occurs between 3 and
> 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...

Just noting: I was running into something different in the past with 
THP. And it's effectively the same scenario, just swapout and 
MADV_DONTNEED reversed.

Imagine you swapped out a THP and the THP is still in the swapcache.

Then you unmap/zap some PTEs, freeing up the swap slots.

In zap_pte_range(), we'll call free_swap_and_cache(). There, we run into 
the "!swap_page_trans_huge_swapped(p, entry)", and we won't be calling 
__try_to_reclaim_swap().

So we won't split the large folio that is in the swapcache and it will 
continue consuming "more memory" than intended until fully evicted.
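
For reference, the path being described looks roughly like this (paraphrased
from mm/swapfile.c of that era; the exact helper names and TTRS_* flags should
be checked against the actual tree):

int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned char count;

	if (non_swap_entry(entry))
		return 1;

	p = get_swap_device(entry);
	if (p) {
		count = __swap_entry_free(p, entry);
		/*
		 * The swapcache folio is only reclaimed when no pte still
		 * swaps to any part of the large entry; a partially zapped
		 * THP therefore stays whole in the swapcache.
		 */
		if (count == SWAP_HAS_CACHE &&
		    !swap_page_trans_huge_swapped(p, entry))
			__try_to_reclaim_swap(p, swp_offset(entry),
					      TTRS_UNMAPPED | TTRS_FULL);
		put_swap_device(p);
	}
	return p != NULL;
}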

> 
> To me, it seems safer to split or do some other similar optimization if we find a
> large folio has partial map and unmap.

I'm hoping that we can avoid any new direct users of _nr_pages_mapped if 
possible.

If we find that the folio is on the deferred split list, we might as 
well just split it right away, before swapping it out. That might be a 
reasonable optimization for the case you describe.
  
Barry Song Feb. 23, 2024, 9:46 a.m. UTC | #26
On Thu, Feb 22, 2024 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.02.24 08:05, Barry Song wrote:
> > Hi Ryan,
> >
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 2cc0cb41fb32..ea19710aa4cd 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                      if (!can_split_folio(folio, NULL))
> >>                                              goto activate_locked;
> >>                                      /*
> >> -                                     * Split folios without a PMD map right
> >> -                                     * away. Chances are some or all of the
> >> -                                     * tail pages can be freed without IO.
> >> +                                     * Split PMD-mappable folios without a
> >> +                                     * PMD map right away. Chances are some
> >> +                                     * or all of the tail pages can be freed
> >> +                                     * without IO.
> >>                                       */
> >> -                                    if (!folio_entire_mapcount(folio) &&
> >> +                                    if (folio_test_pmd_mappable(folio) &&
> >> +                                        !folio_entire_mapcount(folio) &&
> >>                                          split_folio_to_list(folio,
> >>                                                              folio_list))
> >>                                              goto activate_locked;
> >
> > I ran a test to investigate what would happen while reclaiming a partially
> > unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
> > 4KB~64KB, and keep the first subpage 0~4KiB.
>
> IOW, something that already happens with ordinary THP already IIRC.
>
> >
> > My test wants to address my three concerns,
> > a. whether we will have leak on swap slots
> > b. whether we will have redundant I/O
> > c. whether we will cause races on swapcache
> >
> > what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
> > at some specific stage
> > 1. just after add_to_swap   (swap slots are allocated)
> > 2. before and after try_to_unmap   (ptes are set to swap_entry)
> > 3. before and after pageout (also add printk in zram driver to dump all I/O write)
> > 4. before and after remove_mapping
> >
> > The below is the dumped info for a particular large folio,
> >
> > 1. after add_to_swap
> > [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
> > [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >
> > as you can see,
> > _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
> >
> >
> > 2. before and after try_to_unmap
> > [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
> > [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
> > [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
> > [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >
> > as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
> > 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
> >
> > 3. before and after pageout
> > [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
> > [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> > [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
> > [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
> > [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
> > [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
> > [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
> > [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
> > [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
> > [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
> > [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
> > [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
> > [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
> > [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
> > [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
> > [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
> > [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
> > [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
> > [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
> >
> > as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
> > 4~64KiB has been zap_pte_range before, we still write them to zRAM.
> >
> > 4. before and after remove_mapping
> > [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> > [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
> > [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> >
> > as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
> > all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
> > slot leak at all!
> >
> > Thus, only two concerns are left for me,
> > 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
> > is partially unmapped.
> > 2. large folio is added as a whole as a swapcache covering the range whose
> > part has been zapped. I am not quite sure if this will cause some problems
> > while some concurrent do_anon_page, swapin and swapout occurs between 3 and
> > 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...
>
> Just noting: I was running into something different in the past with
> THP. And it's effectively the same scenario, just swapout and
> MADV_DONTNEED reversed.
>
> Imagine you swapped out a THP and the THP it still is in the swapcache.
>
> Then you unmap/zap some PTEs, freeing up the swap slots.
>
> In zap_pte_range(), we'll call free_swap_and_cache(). There, we run into
> the "!swap_page_trans_huge_swapped(p, entry)", and we won't be calling
> __try_to_reclaim_swap().

I guess you mean swap_page_trans_huge_swapped(p, entry), not
!swap_page_trans_huge_swapped(p, entry)?

At that point, swap_page_trans_huge_swapped() should be true, as there are
still some entries whose swap_map is 0x41 or above (SWAP_HAS_CACHE plus a
swap_count >= 1):

static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
                                         swp_entry_t entry,
                                         unsigned int nr_pages)
{
        ...
        for (i = 0; i < nr_pages; i++) {
                if (swap_count(map[offset + i])) {
                        ret = true;
                        break;
                }
        }
unlock_out:
        unlock_cluster_or_swap_info(si, ci);
        return ret;
}
So this will stop the swap free even for those ptes which have been
zapped?

Another case I have reported[1] is that while reclaiming a large folio,
try_to_unmap_one calls page_vma_mapped_walk(). As the walk only begins to hold
the PTL after it hits a valid pte, a parallel break-before-make might make the
0th, 1st and following PTEs zero, and try_to_unmap_one can read those
intermediate pte values; thus we can run into the cases below. After
try_to_unmap_one:
pte 0   -  untouched, present pte
pte 1   - untouched, present pte
pte 2  - swap entries
pte 3 - swap entries
..
pte n - swap entries

or

pte 0   -  untouched, present pte
pte 1  - swap entries
pte 2  - swap entries
pte 3  - swap entries
..
pte n - swap entries

etc.

Thus, after try_to_unmap, the folio is still mapped, with some ptes turned
into swap entries while other PTEs are still present. It might stay in the
swapcache for a long time with a broken CONT-PTE.

I also hate that, and hope for a SYNC way to let a large folio hold the PTL
from the 0th pte, so that it won't see intermediate PTEs from another
break-before-make.

This also doesn't increase PTL contention, as my test shows we always get the
PTL for a large folio after skipping zero, one or two PTEs in
try_to_unmap_one. But skipping even 1 or 2 is really bad and sad, breaking a
large folio from one whole unit into nr_pages different segments.

[1] https://lore.kernel.org/linux-mm/CAGsJ_4wo7BiJWSKb1K_WyAai30KmfckMQ3-mCJPXZ892CtXpyQ@mail.gmail.com/
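
For reference, this is roughly how the TTU_SYNC flag discussed above turns
into the stricter PVMW_SYNC lookup in map_pte() (heavily abridged paraphrase
of mm/rmap.c:try_to_unmap_one(); only the flag handling is shown):

static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
			     unsigned long address, void *arg)
{
	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
	enum ttu_flags flags = (enum ttu_flags)(unsigned long)arg;

	/*
	 * With TTU_SYNC the walk takes the PTL before examining the first
	 * pte, instead of at the first valid pte, so a parallel
	 * break-before-make cannot make the walk start in the middle of a
	 * large folio.
	 */
	if (flags & TTU_SYNC)
		pvmw.flags = PVMW_SYNC;

	while (page_vma_mapped_walk(&pvmw)) {
		/* ... unmap one pte per iteration ... */
	}
	/* ... */
	return true;
}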

>
> So we won't split the large folio that is in the swapcache and it will
> continue consuming "more memory" than intended until fully evicted.
>
> >
> > To me, it seems safer to split or do some other similar optimization if we find a
> > large folio has partial map and unmap.
>
> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
> possible.
>

Is _nr_pages_mapped < nr_pages a reasonable trigger for splitting, since we
then know the folio has at least some subpages zapped?

> If we find that the folio is on the deferred split list, we might as
> well just split it right away, before swapping it out. That might be a
> reasonable optimization for the case you describe.

I tried to change Ryan's code as below:

@@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
                                         * PMD map right away. Chances are some
                                         * or all of the tail pages can be freed
                                         * without IO.
+                                        * Similarly, split PTE-mapped folios if
+                                        * they have been already deferred_split.
                                         */
-                                       if (folio_test_pmd_mappable(folio) &&
-                                           !folio_entire_mapcount(folio) &&
-                                           split_folio_to_list(folio,
-                                                               folio_list))
+                                       if (((folio_test_pmd_mappable(folio) && !folio_entire_mapcount(folio)) ||
+                                            (!folio_test_pmd_mappable(folio) && !list_empty(&folio->_deferred_list)))
+                                           && split_folio_to_list(folio, folio_list))
                                                goto activate_locked;
                                }
                                if (!add_to_swap(folio)) {

It seems to work as expected: only one I/O is issued for a large folio with
16 PTEs of which 15 have been zapped before.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
  
Ryan Roberts Feb. 27, 2024, 12:05 p.m. UTC | #27
On 23/02/2024 09:46, Barry Song wrote:
> On Thu, Feb 22, 2024 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 22.02.24 08:05, Barry Song wrote:
>>> Hi Ryan,
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 2cc0cb41fb32..ea19710aa4cd 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>                                      if (!can_split_folio(folio, NULL))
>>>>                                              goto activate_locked;
>>>>                                      /*
>>>> -                                     * Split folios without a PMD map right
>>>> -                                     * away. Chances are some or all of the
>>>> -                                     * tail pages can be freed without IO.
>>>> +                                     * Split PMD-mappable folios without a
>>>> +                                     * PMD map right away. Chances are some
>>>> +                                     * or all of the tail pages can be freed
>>>> +                                     * without IO.
>>>>                                       */
>>>> -                                    if (!folio_entire_mapcount(folio) &&
>>>> +                                    if (folio_test_pmd_mappable(folio) &&
>>>> +                                        !folio_entire_mapcount(folio) &&
>>>>                                          split_folio_to_list(folio,
>>>>                                                              folio_list))
>>>>                                              goto activate_locked;
>>>
>>> I ran a test to investigate what would happen while reclaiming a partially
>>> unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
>>> 4KB~64KB, and keep the first subpage 0~4KiB.
>>
>> IOW, something that already happens with ordinary THP already IIRC.
>>
>>>
>>> My test wants to address my three concerns,
>>> a. whether we will have leak on swap slots
>>> b. whether we will have redundant I/O
>>> c. whether we will cause races on swapcache
>>>
>>> what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
>>> at some specific stage
>>> 1. just after add_to_swap   (swap slots are allocated)
>>> 2. before and after try_to_unmap   (ptes are set to swap_entry)
>>> 3. before and after pageout (also add printk in zram driver to dump all I/O write)
>>> 4. before and after remove_mapping
>>>
>>> The below is the dumped info for a particular large folio,
>>>
>>> 1. after add_to_swap
>>> [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
>>> [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>
>>> as you can see,
>>> _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
>>>
>>>
>>> 2. before and after try_to_unmap
>>> [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
>>> [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
>>> [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
>>> [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>
>>> as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
>>> 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
>>>
>>> 3. before and after pageout
>>> [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
>>> [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>> [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
>>> [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
>>> [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
>>> [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
>>> [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
>>> [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
>>> [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
>>> [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
>>> [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
>>> [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
>>> [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
>>> [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
>>> [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
>>> [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
>>> [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
>>> [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
>>> [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
>>>
>>> as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
>>> 4~64KiB has been zap_pte_range before, we still write them to zRAM.
>>>
>>> 4. before and after remove_mapping
>>> [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>> [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
>>> [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>>>
>>> as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
>>> all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
>>> slot leak at all!
>>>
>>> Thus, only two concerns are left for me,
>>> 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
>>> is partially unmapped.

So the cost of this is increased IO and swap storage, correct? Is this a big
problem in practice? i.e. do you see a lot of partially mapped large folios in
your workload? (I agree the proposed fix below is simple, so I think we should
do it anyway - I'm just interested in the scale of the problem).

>>> 2. large folio is added as a whole as a swapcache covering the range whose
>>> part has been zapped. I am not quite sure if this will cause some problems
>>> while some concurrent do_anon_page, swapin and swapout occurs between 3 and
>>> 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...

Yes, mine too. I would only expect the ptes that map the folio to get replaced
with swap entries, so I would expect it to be safe. Although I understand the
concern about the extra swap consumption.

[...]
>>>
>>> To me, it seems safer to split or do some other similar optimization if we find a
>>> large folio has partial map and unmap.
>>
>> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
>> possible.
>>
> 
> Is _nr_pages_mapped < nr_pages a reasonable case to split as we
> have known the folio has at least some subpages zapped?

I'm not sure we need this - the folio's presence on the split list will tell us
everything we need to know I think?

> 
>> If we find that the folio is on the deferred split list, we might as
>> well just split it right away, before swapping it out. That might be a
>> reasonable optimization for the case you describe.

Yes, agreed. I think there is still a chance of a race though; some other thread
could be munmapping in parallel. But in that case, I think we just end up with
the increased IO and swap storage? That's not the end of the world if it's a
corner case.

> 
> i tried to change Ryan's code as below
> 
> @@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                          * PMD map right away. Chances are some
>                                          * or all of the tail pages can be freed
>                                          * without IO.
> +                                        * Similarly, split PTE-mapped folios if
> +                                        * they have been already deferred_split.
>                                          */
> -                                       if (folio_test_pmd_mappable(folio) &&
> -                                           !folio_entire_mapcount(folio) &&
> -                                           split_folio_to_list(folio,
> -                                                               folio_list))
> +                                       if (((folio_test_pmd_mappable(folio) && !folio_entire_mapcount(folio)) ||
> +                                            (!folio_test_pmd_mappable(folio) && !list_empty(&folio->_deferred_list)))

I'm not sure we need the different tests for pmd_mappable vs !pmd_mappable. I
think presence on the deferred list is a sufficient indicator that there are
unmapped subpages?

I'll incorporate this into my next version.

> +                                           && split_folio_to_list(folio, folio_list))
>                                                 goto activate_locked;
>                                 }
>                                 if (!add_to_swap(folio)) {
> 
> It seems to work as expected. only one I/O is left for a large folio
> with 16 PTEs
> but 15 of them have been zapped before.
> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> Thanks
> Barry
  
Ryan Roberts Feb. 27, 2024, 12:28 p.m. UTC | #28
On 05/02/2024 09:51, Barry Song wrote:
> +Chris, Suren and Chuanhua
> 
> Hi Ryan,
> 
>> +	/*
>> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +	 * so indicate that we are scanning to synchronise with swapoff.
>> +	 */
>> +	si->flags += SWP_SCANNING;
>> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +	si->flags -= SWP_SCANNING;
> 
> nobody is using this scan_base afterwards. it seems a bit weird to
> pass a pointer.
> 
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>  					if (!can_split_folio(folio, NULL))
>>  						goto activate_locked;
>>  					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>  					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>  					    split_folio_to_list(folio,
>>  								folio_list))
>>  						goto activate_locked;
>> --
> 
> Chuanhua and I ran this patchset for a couple of days and found a race
> between reclamation and split_folio. this might cause applications get
> wrong data 0 while swapping-in.

I can't claim to fully understand the problem yet (thanks for all the details -
I'll keep reading it and looking at the code until I do), but I guess this
problem should exist today for PMD-mappable folios? We already skip splitting
those folios if they are pmd-mapped. Or does the problem only apply to
pte-mapped folios?
  
Ryan Roberts Feb. 27, 2024, 1:37 p.m. UTC | #29
On 05/02/2024 09:51, Barry Song wrote:
> +Chris, Suren and Chuanhua
> 
> Hi Ryan,
> 
>> +	/*
>> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +	 * so indicate that we are scanning to synchronise with swapoff.
>> +	 */
>> +	si->flags += SWP_SCANNING;
>> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +	si->flags -= SWP_SCANNING;
> 
> nobody is using this scan_base afterwards. it seems a bit weird to
> pass a pointer.
> 
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>  					if (!can_split_folio(folio, NULL))
>>  						goto activate_locked;
>>  					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>  					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>  					    split_folio_to_list(folio,
>>  								folio_list))
>>  						goto activate_locked;
>> --
> 
> Chuanhua and I ran this patchset for a couple of days and found a race
> between reclamation and split_folio. this might cause applications get
> wrong data 0 while swapping-in.
> 
> in case one thread(T1) is reclaiming a large folio by some means, still
> another thread is calling madvise MADV_PGOUT(T2). and at the same time,
> we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
> and T2 does split as below,

Hi Barry,

Do you have a test case you can share that provokes this problem? And is this a
separate problem to the race you solved with TTU_SYNC or is this solving the
same problem?

Thanks,
Ryan
  
Barry Song Feb. 28, 2024, 1:23 a.m. UTC | #30
On Wed, Feb 28, 2024 at 1:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 23/02/2024 09:46, Barry Song wrote:
> > On Thu, Feb 22, 2024 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 22.02.24 08:05, Barry Song wrote:
> >>> Hi Ryan,
> >>>
> >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>> index 2cc0cb41fb32..ea19710aa4cd 100644
> >>>> --- a/mm/vmscan.c
> >>>> +++ b/mm/vmscan.c
> >>>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>                                      if (!can_split_folio(folio, NULL))
> >>>>                                              goto activate_locked;
> >>>>                                      /*
> >>>> -                                     * Split folios without a PMD map right
> >>>> -                                     * away. Chances are some or all of the
> >>>> -                                     * tail pages can be freed without IO.
> >>>> +                                     * Split PMD-mappable folios without a
> >>>> +                                     * PMD map right away. Chances are some
> >>>> +                                     * or all of the tail pages can be freed
> >>>> +                                     * without IO.
> >>>>                                       */
> >>>> -                                    if (!folio_entire_mapcount(folio) &&
> >>>> +                                    if (folio_test_pmd_mappable(folio) &&
> >>>> +                                        !folio_entire_mapcount(folio) &&
> >>>>                                          split_folio_to_list(folio,
> >>>>                                                              folio_list))
> >>>>                                              goto activate_locked;
> >>>
> >>> I ran a test to investigate what would happen while reclaiming a partially
> >>> unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
> >>> 4KB~64KB, and keep the first subpage 0~4KiB.
> >>
> >> IOW, something that already happens with ordinary THP already IIRC.
> >>
> >>>
> >>> My test wants to address my three concerns,
> >>> a. whether we will have leak on swap slots
> >>> b. whether we will have redundant I/O
> >>> c. whether we will cause races on swapcache
> >>>
> >>> what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
> >>> at some specific stage
> >>> 1. just after add_to_swap   (swap slots are allocated)
> >>> 2. before and after try_to_unmap   (ptes are set to swap_entry)
> >>> 3. before and after pageout (also add printk in zram driver to dump all I/O write)
> >>> 4. before and after remove_mapping
> >>>
> >>> The below is the dumped info for a particular large folio,
> >>>
> >>> 1. after add_to_swap
> >>> [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
> >>> [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >>>
> >>> as you can see,
> >>> _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
> >>>
> >>>
> >>> 2. before and after try_to_unmap
> >>> [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
> >>> [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
> >>> [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
> >>> [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >>>
> >>> as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
> >>> 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
> >>>
> >>> 3. before and after pageout
> >>> [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
> >>> [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >>> [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
> >>> [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
> >>> [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
> >>> [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
> >>> [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
> >>> [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
> >>> [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
> >>> [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
> >>> [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
> >>> [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
> >>> [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
> >>> [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
> >>> [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
> >>> [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
> >>> [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
> >>> [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
> >>> [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
> >>>
> >>> as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
> >>> 4~64KiB has been zap_pte_range before, we still write them to zRAM.
> >>>
> >>> 4. before and after remove_mapping
> >>> [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >>> [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
> >>> [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> >>>
> >>> as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
> >>> all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
> >>> slot leak at all!
> >>>
> >>> Thus, only two concerns are left for me,
> >>> 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
> >>> is partially unmapped.
>
> So the cost of this is increased IO and swap storage, correct? Is this a big
> problem in practice? i.e. do you see a lot of partially mapped large folios in
> your workload? (I agree the proposed fix below is simple, so I think we should
> do it anyway - I'm just interested in the scale of the problem).
>
> >>> 2. large folio is added as a whole as a swapcache covering the range whose
> >>> part has been zapped. I am not quite sure if this will cause some problems
> >>> while some concurrent do_anon_page, swapin and swapout occurs between 3 and
> >>> 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...
>
> Yes mine too. I would only expect the ptes that map the folio will get replaced
> with swap entries? So I would expect it to be safe. Although I understand the
> concern with the extra swap consumption.

Yes, it should still be safe: just more I/O and more swap space, and those
will be released when remove_mapping happens, provided try_to_unmap_one has
made the folio unmapped.

But given the possibility that even mapped PTEs can be skipped by
try_to_unmap_one (the reported intermediate-PTE issue: the PTL is only taken
at the first valid PTE, so some PTEs might be skipped by try_to_unmap without
being set to swap entries), folio_mapped() may still be true after
try_to_unmap_one, so we can't get to __remove_mapping() for a long time. But
it still doesn't cause a crash.

>
> [...]
> >>>
> >>> To me, it seems safer to split or do some other similar optimization if we find a
> >>> large folio has partial map and unmap.
> >>
> >> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
> >> possible.
> >>
> >
> > Is _nr_pages_mapped < nr_pages a reasonable case to split as we
> > have known the folio has at least some subpages zapped?
>
> I'm not sure we need this - the folio's presence on the split list will tell us
> everything we need to know I think?

I agree; this is just a question to David, not my proposal. If the
deferred_list is sufficient, I prefer we use the deferred_list.

I actually don't quite understand why David dislikes _nr_pages_mapped being
used, though I do agree _nr_pages_mapped cannot precisely reflect how a folio
is mapped by multiple processes. But _nr_pages_mapped < nr_pages seems a safe
way to tell that the folio is partially unmapped :-)

>
> >
> >> If we find that the folio is on the deferred split list, we might as
> >> well just split it right away, before swapping it out. That might be a
> >> reasonable optimization for the case you describe.
>
> Yes, agreed. I think there is still chance of a race though; Some other thread
> could be munmapping in parallel. But in that case, I think we just end up with
> the increased IO and swap storage? That's not the end of the world if its a
> corner case.

I agree. BTW, do we need the ds_queue->split_queue_lock spinlock for checking
the list? deferred_split_folio() itself does the first
if (!list_empty(&folio->_deferred_list)) check without the spinlock. Why is
that, given that the read and the write need to be exclusive?

void deferred_split_folio(struct folio *folio)
{
        ...

        if (!list_empty(&folio->_deferred_list))
                return;

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        if (list_empty(&folio->_deferred_list)) {
                count_vm_event(THP_DEFERRED_SPLIT_PAGE);
                list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
                ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
                if (memcg)
                        set_shrinker_bit(memcg, folio_nid(folio),
                                         deferred_split_shrinker->id);
#endif
        }
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}

>
> >
> > i tried to change Ryan's code as below
> >
> > @@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                                          * PMD map right away. Chances are some
> >                                          * or all of the tail pages can be freed
> >                                          * without IO.
> > +                                        * Similarly, split PTE-mapped folios if
> > +                                        * they have been already deferred_split.
> >                                          */
> > -                                       if (folio_test_pmd_mappable(folio) &&
> > -                                           !folio_entire_mapcount(folio) &&
> > -                                           split_folio_to_list(folio,
> > -                                                               folio_list))
> > +                                       if (((folio_test_pmd_mappable(folio) && !folio_entire_mapcount(folio)) ||
> > +                                            (!folio_test_pmd_mappable(folio) && !list_empty(&folio->_deferred_list)))
>
> I'm not sure we need the different tests for pmd_mappable vs !pmd_mappable. I
> think presence on the deferred list is a sufficient indicator that there are
> unmapped subpages?

I don't think there are fundamental differences for pmd and pte. i was
testing pte-mapped folio at that time, so kept the behavior of pmd as is.

>
> I'll incorporate this into my next version.

Great!

>
> > +                                           && split_folio_to_list(folio, folio_list))
> >                                                 goto activate_locked;
> >                                 }
> >                                 if (!add_to_swap(folio)) {
> >
> > It seems to work as expected. only one I/O is left for a large folio
> > with 16 PTEs
> > but 15 of them have been zapped before.
> >
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >>
> >

Thanks
Barry
  
Barry Song Feb. 28, 2024, 2:46 a.m. UTC | #31
On Wed, Feb 28, 2024 at 2:37 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/02/2024 09:51, Barry Song wrote:
> > +Chris, Suren and Chuanhua
> >
> > Hi Ryan,
> >
> >> +    /*
> >> +     * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> >> +     * so indicate that we are scanning to synchronise with swapoff.
> >> +     */
> >> +    si->flags += SWP_SCANNING;
> >> +    ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> >> +    si->flags -= SWP_SCANNING;
> >
> > nobody is using this scan_base afterwards. it seems a bit weird to
> > pass a pointer.
> >
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                      if (!can_split_folio(folio, NULL))
> >>                                              goto activate_locked;
> >>                                      /*
> >> -                                     * Split folios without a PMD map right
> >> -                                     * away. Chances are some or all of the
> >> -                                     * tail pages can be freed without IO.
> >> +                                     * Split PMD-mappable folios without a
> >> +                                     * PMD map right away. Chances are some
> >> +                                     * or all of the tail pages can be freed
> >> +                                     * without IO.
> >>                                       */
> >> -                                    if (!folio_entire_mapcount(folio) &&
> >> +                                    if (folio_test_pmd_mappable(folio) &&
> >> +                                        !folio_entire_mapcount(folio) &&
> >>                                          split_folio_to_list(folio,
> >>                                                              folio_list))
> >>                                              goto activate_locked;
> >> --
> >
> > Chuanhua and I ran this patchset for a couple of days and found a race
> > between reclamation and split_folio. this might cause applications get
> > wrong data 0 while swapping-in.
> >
> > in case one thread(T1) is reclaiming a large folio by some means, still
> > another thread is calling madvise MADV_PGOUT(T2). and at the same time,
> > we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
> > and T2 does split as below,
>

Hi Ryan,

> Hi Barry,
>
> Do you have a test case you can share that provokes this problem? And is this a
> separate problem to the race you solved with TTU_SYNC or is this solving the
> same problem?

They are the same.

After sending you the report about the races, I spent some time and finally
figured out what was happening and why corrupted data appeared while swapping
in. It is absolutely not your fault, but TTU_SYNC does somehow resolve my
problem, even though it is not the root cause. The corrupted data can only be
reproduced after applying patch 4[1] of the swap-in series:
[1]  [PATCH RFC 4/6] mm: support large folios swapin as a whole
https://lore.kernel.org/linux-mm/20240118111036.72641-5-21cnbao@gmail.com/

Suppose we have a large folio with 16 PTEs as below. After add_to_swap() it is
given swap offset 0x10000, and all of its PTEs are still present because the
folio is still mapped:

PTE          pte_stat
PTE0        present
PTE1        present
PTE2        present
PTE3        present
..
PTE15       present

Then we get to try_to_unmap_one(). Because try_to_unmap_one() doesn't hold the
PTL from PTE0, while it scans the PTEs we might see:

PTE          pte_stat
PTE0        none (someone is writing PTE0 for various reasons)
PTE1        present
PTE2        present
PTE3        present
..
PTE15       present

So we end up holding the PTL only from PTE1.

After try_to_unmap_one(), the PTEs become:

PTE          pte_stat
PTE0        present (someone finished the write of PTE0)
PTE1        swap 0x10001
PTE2        swap 0x10002
PTE3        swap 0x10003
..
..
PTE15      swap 0x1000F

Thus, after try_to_unmap_one(), the large folio is still mapped, so its
swapcache entry will still be there.

Now a parallel thread runs MADV_PAGEOUT. It finds that this large folio is not
completely mapped, so it splits the folio into 16 small folios, but their swap
offsets are kept.

Now the swapcache holds 16 folios with contiguous swap offsets. MADV_PAGEOUT
reclaims these 16 folios; after 16 new try_to_unmap_one() calls, we have:

PTE          pte_stat
PTE0        swap 0x10000  SWAP_HAS_CACHE
PTE1        swap 0x10001  SWAP_HAS_CACHE
PTE2        swap 0x10002  SWAP_HAS_CACHE
PTE3        swap 0x10003  SWAP_HAS_CACHE
..
PTE15        swap 0x1000F  SWAP_HAS_CACHE

From this point on, these 16 PTEs can end up in various different states.
For example:

PTE          pte_stat
PTE0        swap 0x10000  SWAP_HAS_CACHE = 0 -> become false due to finished pageout and remove_mapping
PTE1        swap 0x10001  SWAP_HAS_CACHE = 0 -> become false due to finished pageout and remove_mapping
PTE2        swap 0x10002  SWAP_HAS_CACHE = 0 -> become false due to concurrent swapin and swapout
PTE3        swap 0x10003  SWAP_HAS_CACHE = 1
..
PTE13        swap 0x1000D  SWAP_HAS_CACHE = 1
PTE14        swap 0x1000E  SWAP_HAS_CACHE = 1
PTE15        swap 0x1000F  SWAP_HAS_CACHE = 1

But all of them have swap_count = 1 and differing SWAP_HAS_CACHE; some of these
small folios might still be in the swapcache, while others might not.

Then do_swap_page() runs on one PTE whose SWAP_HAS_CACHE = 0 and
swap_count = 1 (the folio is not in the swapcache, so it has already been
written to swap), and we do this check:

static bool pte_range_swap(pte_t *pte, int nr_pages)
{
        int i;
        swp_entry_t entry;
        unsigned type;
        pgoff_t start_offset;

        entry = pte_to_swp_entry(ptep_get_lockless(pte));
        if (non_swap_entry(entry))
                return false;
        start_offset = swp_offset(entry);
        if (start_offset % nr_pages)
                return false;

        type = swp_type(entry);
        for (i = 1; i < nr_pages; i++) {
                entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
                if (non_swap_entry(entry))
                        return false;
                if (swp_offset(entry) != start_offset + i)
                        return false;
                if (swp_type(entry) != type)
                        return false;
        }

        return true;
}

As those swap entries are contiguous, we will call swap_read_folio() for the
whole range. For the subpages whose small folios are still in the swapcache and
haven't been written out yet, we get zero-filled data from zRAM.

So the root cause is that pte_range_swap() should also check that all 16
swap_map entries have SWAP_HAS_CACHE cleared (i.e. the same state):

static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
{
       ...
       count = si->swap_map[start_offset];
       for (i = 1; i < nr_pages; i++) {
               entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
               if (non_swap_entry(entry))
                       return false;
               if (swp_offset(entry) != start_offset + i)
                       return false;
               if (swp_type(entry) != type)
                       return false;
               /* fallback to small folios if SWAP_HAS_CACHE isn't same */
               if (si->swap_map[start_offset + i] != count)
                       return false;
       }

       return true;
}
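
As a tiny standalone illustration (userspace C, not code from either series;
the helper and variable names are made up), the difference between the two
checks on the swap_map state described above looks like this: all 16 entries
are contiguous, but entries 0..2 have finished pageout (count 0x01) while
entries 3..15 still carry SWAP_HAS_CACHE (0x41), so the contiguity-only check
accepts the range while the stricter check rejects it and forces the fall-back
to per-page swap-in.

#include <stdbool.h>
#include <stdio.h>

#define SWAP_HAS_CACHE	0x40
#define NR		16

/* Contiguity-only check, mirroring the original pte_range_swap(). */
static bool contig_only(const unsigned long *offset, int nr)
{
	for (int i = 1; i < nr; i++)
		if (offset[i] != offset[0] + i)
			return false;
	return true;
}

/* Contiguity plus identical swap_map state, mirroring is_pte_range_contig_swap(). */
static bool contig_and_same_count(const unsigned long *offset,
				  const unsigned char *swap_map, int nr)
{
	if (!contig_only(offset, nr))
		return false;
	for (int i = 1; i < nr; i++)
		if (swap_map[i] != swap_map[0])
			return false;
	return true;
}

int main(void)
{
	unsigned long offset[NR];
	unsigned char swap_map[NR];

	for (int i = 0; i < NR; i++) {
		offset[i] = 0x10000 + i;			/* contiguous entries */
		swap_map[i] = i < 3 ? 0x01			/* pageout finished   */
				    : (0x01 | SWAP_HAS_CACHE);	/* still in swapcache */
	}

	/* Prints "old: 1 new: 0": only the old check would read the whole range. */
	printf("old: %d new: %d\n",
	       contig_only(offset, NR),
	       contig_and_same_count(offset, swap_map, NR));
	return 0;
}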

But somehow TTU_SYNC "resolves" the corruption by giving MADV_PAGEOUT no chance
to split this folio, because the large folio's PTEs are either entirely
replaced with swap entries or entirely kept present.

Though the bug is within the swap-in series, I am still a big fan of TTU_SYNC
for large folio reclamation, for at least three reasons:

1. We remove some possibility that large folios fail to be reclaimed, improving
reclamation efficiency.

2. We avoid many strange cases and potential folio splits during reclamation.
Without TTU_SYNC, folios can be split later, or end up with some PTEs set to
swap entries while others are still present.

3. We don't increase PTL contention. My test shows try_to_unmap_one() will
always get the PTL after it sometimes skips one or two PTEs, because the
intermediate break-before-makes are short. Of course, most of the time
try_to_unmap_one() gets the PTL starting from PTE0.

>
> Thanks,
> Ryan
>

Thanks
Barry
  
David Hildenbrand Feb. 28, 2024, 9:34 a.m. UTC | #32
>>>>>
>>>>> To me, it seems safer to split or do some other similar optimization if we find a
>>>>> large folio has partial map and unmap.
>>>>
>>>> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
>>>> possible.
>>>>
>>>
>>> Is _nr_pages_mapped < nr_pages a reasonable case to split as we
>>> have known the folio has at least some subpages zapped?
>>
>> I'm not sure we need this - the folio's presence on the split list will tell us
>> everything we need to know I think?
> 
> I agree, this is just one question to David, not my proposal.  if
> deferred_list is sufficient,
> I prefer we use deferred_list.
> 
> I actually don't quite understand why David dislikes _nr_pages_mapped being used
> though I do think _nr_pages_mapped cannot precisely reflect how a
> folio is mapped
> by multi-processes. but _nr_pages_mapped < nr_pages seems be safe to
> tell the folio
> is partially unmapped :-)

I'm hoping we can get rid of _nr_pages_mapped in some kernel configs in 
the future (that's what I am working on). So the less we depend on it 
the better.

With the total mapcount patch I'll revive shortly, _nr_pages_mapped will 
only be used inside rmap code. I'm hoping we won't have to introduce 
other users that will be harder to get rid of.

So please, if avoidable, no usage of _nr_pages_mapped outside of core 
rmap code.
  
Ryan Roberts Feb. 28, 2024, 3:57 p.m. UTC | #33
On 28/02/2024 01:23, Barry Song wrote:
> On Wed, Feb 28, 2024 at 1:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 23/02/2024 09:46, Barry Song wrote:
>>> On Thu, Feb 22, 2024 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 22.02.24 08:05, Barry Song wrote:
>>>>> Hi Ryan,
>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index 2cc0cb41fb32..ea19710aa4cd 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>                                      if (!can_split_folio(folio, NULL))
>>>>>>                                              goto activate_locked;
>>>>>>                                      /*
>>>>>> -                                     * Split folios without a PMD map right
>>>>>> -                                     * away. Chances are some or all of the
>>>>>> -                                     * tail pages can be freed without IO.
>>>>>> +                                     * Split PMD-mappable folios without a
>>>>>> +                                     * PMD map right away. Chances are some
>>>>>> +                                     * or all of the tail pages can be freed
>>>>>> +                                     * without IO.
>>>>>>                                       */
>>>>>> -                                    if (!folio_entire_mapcount(folio) &&
>>>>>> +                                    if (folio_test_pmd_mappable(folio) &&
>>>>>> +                                        !folio_entire_mapcount(folio) &&
>>>>>>                                          split_folio_to_list(folio,
>>>>>>                                                              folio_list))
>>>>>>                                              goto activate_locked;
>>>>>
>>>>> I ran a test to investigate what would happen while reclaiming a partially
>>>>> unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
>>>>> 4KB~64KB, and keep the first subpage 0~4KiB.
>>>>
>>>> IOW, something that already happens with ordinary THP already IIRC.
>>>>
>>>>>
>>>>> My test wants to address my three concerns,
>>>>> a. whether we will have leak on swap slots
>>>>> b. whether we will have redundant I/O
>>>>> c. whether we will cause races on swapcache
>>>>>
>>>>> what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
>>>>> at some specific stage
>>>>> 1. just after add_to_swap   (swap slots are allocated)
>>>>> 2. before and after try_to_unmap   (ptes are set to swap_entry)
>>>>> 3. before and after pageout (also add printk in zram driver to dump all I/O write)
>>>>> 4. before and after remove_mapping
>>>>>
>>>>> The below is the dumped info for a particular large folio,
>>>>>
>>>>> 1. after add_to_swap
>>>>> [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
>>>>> [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>>>
>>>>> as you can see,
>>>>> _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
>>>>>
>>>>>
>>>>> 2. before and after try_to_unmap
>>>>> [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
>>>>> [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
>>>>> [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
>>>>> [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>>>
>>>>> as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
>>>>> 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
>>>>>
>>>>> 3. before and after pageout
>>>>> [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
>>>>> [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>>> [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
>>>>> [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
>>>>> [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
>>>>> [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
>>>>> [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
>>>>> [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
>>>>> [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
>>>>> [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
>>>>> [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
>>>>> [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
>>>>> [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
>>>>> [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
>>>>> [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
>>>>> [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
>>>>> [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
>>>>> [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
>>>>> [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
>>>>>
>>>>> as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
>>>>> 4~64KiB has been zap_pte_range before, we still write them to zRAM.
>>>>>
>>>>> 4. before and after remove_mapping
>>>>> [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>>> [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
>>>>> [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>>>>>
>>>>> as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
>>>>> all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
>>>>> slot leak at all!
>>>>>
>>>>> Thus, only two concerns are left for me,
>>>>> 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
>>>>> is partially unmapped.
>>
>> So the cost of this is increased IO and swap storage, correct? Is this a big
>> problem in practice? i.e. do you see a lot of partially mapped large folios in
>> your workload? (I agree the proposed fix below is simple, so I think we should
>> do it anyway - I'm just interested in the scale of the problem).
>>
>>>>> 2. large folio is added as a whole as a swapcache covering the range whose
>>>>> part has been zapped. I am not quite sure if this will cause some problems
>>>>> while some concurrent do_anon_page, swapin and swapout occurs between 3 and
>>>>> 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...
>>
>> Yes mine too. I would only expect the ptes that map the folio will get replaced
>> with swap entries? So I would expect it to be safe. Although I understand the
>> concern with the extra swap consumption.
> 
> yes. it should still be safe. just more I/O and more swap spaces. but they will
> be removed while remove_mapping happens if try_to_unmap_one makes
> the folio unmapped.
> 
> but with the potential possibility even mapped PTEs can be skipped by
> try_to_unmap_one (reported intermediate PTEs issue - PTL is held till
> a valid PTE, some PTEs might be skipped by try_to_unmap without being
> set to swap entries), we could have the possibility folio_mapped() is still true
> after try_to_unmap_one. so we can't get to __remove_mapping() for a long
> time. but it still doesn't cause a crash.
> 
>>
>> [...]
>>>>>
>>>>> To me, it seems safer to split or do some other similar optimization if we find a
>>>>> large folio has partial map and unmap.
>>>>
>>>> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
>>>> possible.
>>>>
>>>
>>> Is _nr_pages_mapped < nr_pages a reasonable case to split as we
>>> have known the folio has at least some subpages zapped?
>>
>> I'm not sure we need this - the folio's presence on the split list will tell us
>> everything we need to know I think?
> 
> I agree, this is just one question to David, not my proposal.  if
> deferred_list is sufficient,
> I prefer we use deferred_list.
> 
> I actually don't quite understand why David dislikes _nr_pages_mapped being used
> though I do think _nr_pages_mapped cannot precisely reflect how a
> folio is mapped
> by multi-processes. but _nr_pages_mapped < nr_pages seems be safe to
> tell the folio
> is partially unmapped :-)
> 
>>
>>>
>>>> If we find that the folio is on the deferred split list, we might as
>>>> well just split it right away, before swapping it out. That might be a
>>>> reasonable optimization for the case you describe.
>>
>> Yes, agreed. I think there is still chance of a race though; Some other thread
>> could be munmapping in parallel. But in that case, I think we just end up with
>> the increased IO and swap storage? That's not the end of the world if its a
>> corner case.
> 
> I agree. btw, do we need a spinlock ds_queue->split_queue_lock for checking
> the list? deferred_split_folio(), for itself, has no spinlock while checking
>  if (!list_empty(&folio->_deferred_list)), but why? the read and write
> need to be exclusive.....

I don't think so. It's safe to check if the folio is on the queue like this; but
if it isn't then you need to recheck under the lock, as is done here. So for us,
I think we can also do this safely. It is certainly preferable to avoid taking
the lock.

The original change says this:

Before acquire split_queue_lock, check and bail out early if the THP
head page is in the queue already. The checking without holding
split_queue_lock could race with deferred_split_scan, but it doesn't
impact the correctness here.
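
As a rough sketch only (not the actual follow-up patch), the reclaim-side check
being discussed could then end up looking something like the below, keeping the
unlocked list_empty() read for the same reason:

	/*
	 * Sketch: treat presence on the deferred split queue as a hint that
	 * the large folio is partially unmapped and worth splitting before
	 * swap-out. The unlocked list_empty() read can race with
	 * deferred_split_scan(), but a stale result only changes whether we
	 * attempt the split; split_folio_to_list() takes the proper locks
	 * itself, so correctness is unaffected.
	 */
	if ((folio_test_pmd_mappable(folio) ?
	     !folio_entire_mapcount(folio) :
	     !list_empty(&folio->_deferred_list)) &&
	    split_folio_to_list(folio, folio_list))
		goto activate_locked;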

> 
> void deferred_split_folio(struct folio *folio)
> {
>         ...
> 
>         if (!list_empty(&folio->_deferred_list))
>                 return;
> 
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>         if (list_empty(&folio->_deferred_list)) {
>                 count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>                 list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
>                 ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
>                 if (memcg)
>                         set_shrinker_bit(memcg, folio_nid(folio),
>                                          deferred_split_shrinker->id);
> #endif
>         }
>         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> }
> 
>>
>>>
>>> i tried to change Ryan's code as below
>>>
>>> @@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct
>>> list_head *folio_list,
>>>                                          * PMD map right away. Chances are some
>>>                                          * or all of the tail pages can be freed
>>>                                          * without IO.
>>> +                                        * Similarly, split PTE-mapped folios if
>>> +                                        * they have been already deferred_split.
>>>                                          */
>>> -                                       if (folio_test_pmd_mappable(folio) &&
>>> -                                           !folio_entire_mapcount(folio) &&
>>> -                                           split_folio_to_list(folio,
>>> -                                                               folio_list))
>>> +                                       if (((folio_test_pmd_mappable(folio) &&
>>> +                                             !folio_entire_mapcount(folio)) ||
>>> +                                            (!folio_test_pmd_mappable(folio) &&
>>> +                                             !list_empty(&folio->_deferred_list)))
>>
>> I'm not sure we need the different tests for pmd_mappable vs !pmd_mappable. I
>> think presence on the deferred list is a sufficient indicator that there are
>> unmapped subpages?
> 
> I don't think there are fundamental differences for pmd and pte. i was
> testing pte-mapped folio at that time, so kept the behavior of pmd as is.
> 
>>
>> I'll incorporate this into my next version.
> 
> Great!
> 
>>
>>> +                                           && split_folio_to_list(folio, folio_list))
>>>                                                 goto activate_locked;
>>>                                 }
>>>                                 if (!add_to_swap(folio)) {
>>>
>>> It seems to work as expected. only one I/O is left for a large folio
>>> with 16 PTEs
>>> but 15 of them have been zapped before.
>>>
>>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>>
>>>
> 
> Thanks
> Barry
  
Barry Song Feb. 28, 2024, 11:18 p.m. UTC | #34
On Wed, Feb 28, 2024 at 10:34 PM David Hildenbrand <david@redhat.com> wrote:
>
> >>>>>
> >>>>> To me, it seems safer to split or do some other similar optimization if we find a
> >>>>> large folio has partial map and unmap.
> >>>>
> >>>> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
> >>>> possible.
> >>>>
> >>>
> >>> Is _nr_pages_mapped < nr_pages a reasonable case to split as we
> >>> have known the folio has at least some subpages zapped?
> >>
> >> I'm not sure we need this - the folio's presence on the split list will tell us
> >> everything we need to know I think?
> >
> > I agree, this is just one question to David, not my proposal.  if
> > deferred_list is sufficient,
> > I prefer we use deferred_list.
> >
> > I actually don't quite understand why David dislikes _nr_pages_mapped being used
> > though I do think _nr_pages_mapped cannot precisely reflect how a
> > folio is mapped
> > by multi-processes. but _nr_pages_mapped < nr_pages seems be safe to
> > tell the folio
> > is partially unmapped :-)
>
> I'm hoping we can get rid of _nr_pages_mapped in some kernel configs in
> the future (that's what I am working on). So the less we depend on it
> the better.
>
> With the total mapcount patch I'll revive shortly, _nr_pages_mapped will
> only be used inside rmap code. I'm hoping we won't have to introduce
> other users that will be harder to get rid of.
>
> So please, if avoidable, no usage of _nr_pages_mapped outside of core
> rmap code.

Thanks for the clarification on the plan. Good to use deferred_list in this
swap-out case.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
  

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ca8aaa098ba..ccbca5db851b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -295,11 +295,11 @@  struct swap_info_struct {
 	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	unsigned int __percpu *cpu_next;/*
 					 * Likely next allocation offset. We
-					 * assign a cluster to each CPU, so each
-					 * CPU can allocate swap entry from its
-					 * own cluster and swapout sequentially.
-					 * The purpose is to optimize swapout
-					 * throughput.
+					 * assign a cluster per-order to each
+					 * CPU, so each CPU can allocate swap
+					 * entry from its own cluster and
+					 * swapout sequentially. The purpose is
+					 * to optimize swapout throughput.
 					 */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 94f7cc225eb9..b50bce50bed9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -545,10 +545,12 @@  static void free_cluster(struct swap_info_struct *si, unsigned long idx)
 
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased.
+ * removed from free cluster list and its usage counter will be increased by
+ * count.
  */
-static void inc_cluster_info_page(struct swap_info_struct *p,
-	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void add_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr,
+	unsigned long count)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 
@@ -557,9 +559,19 @@  static void inc_cluster_info_page(struct swap_info_struct *p,
 	if (cluster_is_free(&cluster_info[idx]))
 		alloc_cluster(p, idx);
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
+	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) + 1);
+		cluster_count(&cluster_info[idx]) + count);
+}
+
+/*
+ * The cluster corresponding to page_nr will be used. The cluster will be
+ * removed from free cluster list and its usage counter will be increased.
+ */
+static void inc_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+{
+	add_cluster_info_page(p, cluster_info, page_nr, 1);
 }
 
 /*
@@ -588,8 +600,8 @@  static void dec_cluster_info_page(struct swap_info_struct *p,
  * cluster list. Avoiding such abuse to avoid list corruption.
  */
 static bool
-scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
-	unsigned long offset)
+__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
+	unsigned long offset, int order)
 {
 	bool conflict;
 
@@ -601,23 +613,36 @@  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 	if (!conflict)
 		return false;
 
-	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
+	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
 	return true;
 }
 
 /*
- * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
- * might involve allocating a new cluster for current CPU too.
+ * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
+ * cluster list. Avoiding such abuse to avoid list corruption.
  */
-static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
-	unsigned long *offset, unsigned long *scan_base)
+static bool
+scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
+	unsigned long offset)
+{
+	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
+}
+
+/*
+ * Try to get a swap entry (or size indicated by order) from current cpu's swap
+ * entry pool (a cluster). This might involve allocating a new cluster for
+ * current CPU too.
+ */
+static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+	unsigned long *offset, unsigned long *scan_base, int order)
 {
 	struct swap_cluster_info *ci;
-	unsigned int tmp, max;
+	unsigned int tmp, max, i;
 	unsigned int *cpu_next;
+	unsigned int nr_pages = 1 << order;
 
 new_cluster:
-	cpu_next = this_cpu_ptr(si->cpu_next);
+	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
 	tmp = *cpu_next;
 	if (tmp == SWAP_NEXT_NULL) {
 		if (!cluster_list_empty(&si->free_clusters)) {
@@ -643,10 +668,12 @@  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * reserve a new cluster.
 	 */
 	ci = lock_cluster(si, tmp);
-	if (si->swap_map[tmp]) {
-		unlock_cluster(ci);
-		*cpu_next = SWAP_NEXT_NULL;
-		goto new_cluster;
+	for (i = 0; i < nr_pages; i++) {
+		if (si->swap_map[tmp + i]) {
+			unlock_cluster(ci);
+			*cpu_next = SWAP_NEXT_NULL;
+			goto new_cluster;
+		}
 	}
 	unlock_cluster(ci);
 
@@ -654,12 +681,22 @@  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	*scan_base = tmp;
 
 	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
-	tmp += 1;
+	tmp += nr_pages;
 	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
 
 	return true;
 }
 
+/*
+ * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
+ * might involve allocating a new cluster for current CPU too.
+ */
+static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+	unsigned long *offset, unsigned long *scan_base)
+{
+	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
+}
+
 static void __del_from_avail_list(struct swap_info_struct *p)
 {
 	int nid;
@@ -982,35 +1019,58 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }
 
-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
+			    unsigned int nr_pages)
 {
-	unsigned long idx;
 	struct swap_cluster_info *ci;
-	unsigned long offset;
+	unsigned long offset, scan_base;
+	int order = ilog2(nr_pages);
+	bool ret;
 
 	/*
-	 * Should not even be attempting cluster allocations when huge
+	 * Should not even be attempting large allocations when huge
 	 * page swap is disabled.  Warn and fail the allocation.
 	 */
-	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
+	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+	    nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
+	    !is_power_of_2(nr_pages)) {
 		VM_WARN_ON_ONCE(1);
 		return 0;
 	}
 
-	if (cluster_list_empty(&si->free_clusters))
+	/*
+	 * Swapfile is not block device or not using clusters so unable to
+	 * allocate large entries.
+	 */
+	if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
 		return 0;
 
-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count(ci, SWAPFILE_CLUSTER);
+again:
+	/*
+	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
+	 * so indicate that we are scanning to synchronise with swapoff.
+	 */
+	si->flags += SWP_SCANNING;
+	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
+	si->flags -= SWP_SCANNING;
+
+	/*
+	 * If we failed to allocate or if swapoff is waiting for us (due to lock
+	 * being dropped for discard above), return immediately.
+	 */
+	if (!ret || !(si->flags & SWP_WRITEOK))
+		return 0;
 
-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
+	if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
+		goto again;
+
+	ci = lock_cluster(si, offset);
+	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
+	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
 	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
-	*slot = swp_entry(si->type, offset);
 
+	swap_range_alloc(si, offset, nr_pages);
+	*slot = swp_entry(si->type, offset);
 	return 1;
 }
 
@@ -1036,7 +1096,7 @@  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 	int node;
 
 	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+	WARN_ON_ONCE(n_goal > 1 && size > 1);
 
 	spin_lock(&swap_avail_lock);
 
@@ -1073,14 +1133,13 @@  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
-			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
+		if (size > 1) {
+			n_ret = swap_alloc_large(si, swp_entries, size);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();
 
@@ -3041,6 +3100,8 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (p->bdev && bdev_nonrot(p->bdev)) {
 		int cpu;
 		unsigned long ci, nr_cluster;
+		int nr_order;
+		int i;
 
 		p->flags |= SWP_SOLIDSTATE;
 		p->cluster_next_cpu = alloc_percpu(unsigned int);
@@ -3068,13 +3129,19 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		for (ci = 0; ci < nr_cluster; ci++)
 			spin_lock_init(&((cluster_info + ci)->lock));
 
-		p->cpu_next = alloc_percpu(unsigned int);
+		nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
+		p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
+					     __alignof__(unsigned int));
 		if (!p->cpu_next) {
 			error = -ENOMEM;
 			goto bad_swap_unlock_inode;
 		}
-		for_each_possible_cpu(cpu)
-			per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
+		for_each_possible_cpu(cpu) {
+			unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
+
+			for (i = 0; i < nr_order; i++)
+				cpu_next[i] = SWAP_NEXT_NULL;
+		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cc0cb41fb32..ea19710aa4cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1212,11 +1212,13 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split PMD-mappable folios without a
+					 * PMD map right away. Chances are some
+					 * or all of the tail pages can be freed
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (folio_test_pmd_mappable(folio) &&
+					    !folio_entire_mapcount(folio) &&
 					    split_folio_to_list(folio,
 								folio_list))
 						goto activate_locked;
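
For illustration only, the per-order, per-CPU allocation added by the
swapfile.c hunks above behaves roughly like the following simplified userspace
model (not kernel code; locking, swap_map occupancy checks, discard and the
free-cluster list structure are all omitted, and names are made up): each order
keeps its own "likely next offset", allocations of 1 << order entries bump it
sequentially, and a fresh cluster is taken from the free list once the current
one is exhausted, so entries of different orders never share a cluster.

#include <stdio.h>

#define CLUSTER_SIZE	512	/* entries per cluster (SWAPFILE_CLUSTER) */
#define NR_CLUSTERS	8	/* clusters on the modelled free list */
#define MAX_ALLOC_ORDER	4	/* model orders 0..4, i.e. 1..16 entries */
#define NEXT_NULL	(~0u)	/* models SWAP_NEXT_NULL */

static unsigned int next_free_cluster;			/* models the free cluster list */
static unsigned int cpu_next[MAX_ALLOC_ORDER + 1];	/* models one CPU's cpu_next[] */

/* Allocate 1 << order contiguous entries; return the base offset or -1. */
static long alloc_entries(int order)
{
	unsigned int nr = 1u << order;
	unsigned int off = cpu_next[order];

	if (off == NEXT_NULL) {
		if (next_free_cluster == NR_CLUSTERS)
			return -1;	/* caller falls back to splitting the folio */
		off = next_free_cluster++ * CLUSTER_SIZE;
	}

	/* Bump within the cluster; give it up once the boundary is reached. */
	cpu_next[order] = (off + nr) % CLUSTER_SIZE ? off + nr : NEXT_NULL;
	return off;
}

int main(void)
{
	for (int i = 0; i <= MAX_ALLOC_ORDER; i++)
		cpu_next[i] = NEXT_NULL;

	/* Interleaved orders never share a cluster, so no padding is wasted. */
	printf("order 0 -> %ld\n", alloc_entries(0));	/* 0   (cluster 0) */
	printf("order 4 -> %ld\n", alloc_entries(4));	/* 512 (cluster 1) */
	printf("order 0 -> %ld\n", alloc_entries(0));	/* 1   (cluster 0) */
	printf("order 4 -> %ld\n", alloc_entries(4));	/* 528 (cluster 1) */
	return 0;
}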