[v2] mm: swap: async free swap slot cache entries

Message ID 20240131-async-free-v2-1-525f03e07184@kernel.org
State New
Headers
Series [v2] mm: swap: async free swap slot cache entries |

Commit Message

Chris Li Feb. 1, 2024, 1:17 a.m. UTC
  We discovered that 1% swap page fault is 100us+ while 50% of
the swap fault is under 20us.

Further investigation show that a large portion of the time
spent in the free_swap_slots() function for the long tail case.

The percpu cache of swap slots is freed in a batch of 64 entries
inside free_swap_slots(). These cache entries are accumulated
from previous page faults, which may not be related to the current
process.

Doing the batch free in the page fault handler causes longer
tail latencies and penalizes the current process.

Move free_swap_slots() outside of the swapin page fault handler into an
async work queue to avoid such long tail latencies.

The batch free of swap slots is typically at 100us level. Such a
short time will not have a significant impact on the CPU accounting.
Notice that the previous swap slot batching behavior already performs
a delayed batch free. It waits for the entries accumulated to 64.
Adding the async scheduling time does not change the original
free timing significantly.

Testing:

Chun-Tse did some benchmark in chromebook, showing that
zram_wait_metrics improve about 15% with 80% and 95% confidence.

I recently ran some experiments on about 1000 Google production
machines. It shows swapin latency drops in the long tail
100us - 500us bucket dramatically.

platform	(100-500us)	 	(0-100us)
A		1.12% -> 0.36%		98.47% -> 99.22%
B		0.65% -> 0.15%		98.96% -> 99.46%
C		0.61% -> 0.23%		98.96% -> 99.38%

Signed-off-by: Chris Li <chrisl@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Wei Xu<weixugc@google.com>
Cc: Yu Zhao<yuzhao@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Chun-Tse Shao <ctshao@google.com>
Cc: Suren Baghdasaryan<surenb@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Brain Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Barry Song <v-songbaohua@oppo.com>

remove create_work queue

remove another work queue usage

---
Changes in v2:
- Add description of the impact of time changing suggest by Ying.
- Remove create_workqueue() and use schedule_work()
- Link to v1: https://lore.kernel.org/r/20231221-async-free-v1-1-94b277992cb0@kernel.org
---
 include/linux/swap_slots.h |  1 +
 mm/swap_slots.c            | 29 +++++++++++++++++++++--------
 2 files changed, 22 insertions(+), 8 deletions(-)


---
base-commit: eacce8189e28717da6f44ee492b7404c636ae0de
change-id: 20231216-async-free-bef392015432

Best regards,
  

Comments

Huang, Ying Feb. 1, 2024, 5:33 a.m. UTC | #1
Chris Li <chrisl@kernel.org> writes:

> We discovered that 1% swap page fault is 100us+ while 50% of
> the swap fault is under 20us.
>
> Further investigation show that a large portion of the time
> spent in the free_swap_slots() function for the long tail case.
>
> The percpu cache of swap slots is freed in a batch of 64 entries
> inside free_swap_slots(). These cache entries are accumulated
> from previous page faults, which may not be related to the current
> process.
>
> Doing the batch free in the page fault handler causes longer
> tail latencies and penalizes the current process.
>
> Move free_swap_slots() outside of the swapin page fault handler into an
> async work queue to avoid such long tail latencies.
>
> The batch free of swap slots is typically at 100us level. Such a

Running ~100us operation in an asynchronous task appears overkill for me
too.

Can you try to move some operations out of swapcache_free_entries() to
check whether that can resolve your issue?

--
Best Regards,
Huang, Ying

> short time will not have a significant impact on the CPU accounting.
> Notice that the previous swap slot batching behavior already performs
> a delayed batch free. It waits for the entries accumulated to 64.
> Adding the async scheduling time does not change the original
> free timing significantly.
>
> Testing:
>
> Chun-Tse did some benchmark in chromebook, showing that
> zram_wait_metrics improve about 15% with 80% and 95% confidence.
>
> I recently ran some experiments on about 1000 Google production
> machines. It shows swapin latency drops in the long tail
> 100us - 500us bucket dramatically.
>
> platform	(100-500us)	 	(0-100us)
> A		1.12% -> 0.36%		98.47% -> 99.22%
> B		0.65% -> 0.15%		98.96% -> 99.46%
> C		0.61% -> 0.23%		98.96% -> 99.38%
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> To: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Wei Xu<weixugc@google.com>
> Cc: Yu Zhao<yuzhao@google.com>
> Cc: Greg Thelen <gthelen@google.com>
> Cc: Chun-Tse Shao <ctshao@google.com>
> Cc: Suren Baghdasaryan<surenb@google.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Brain Geffon <bgeffon@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Kairui Song <kasong@tencent.com>
> Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Cc: Kemeng Shi <shikemeng@huaweicloud.com>
> Cc: Barry Song <v-songbaohua@oppo.com>
>
> remove create_work queue
>
> remove another work queue usage
>
> ---
> Changes in v2:
> - Add description of the impact of time changing suggest by Ying.
> - Remove create_workqueue() and use schedule_work()
> - Link to v1: https://lore.kernel.org/r/20231221-async-free-v1-1-94b277992cb0@kernel.org
> ---
>  include/linux/swap_slots.h |  1 +
>  mm/swap_slots.c            | 29 +++++++++++++++++++++--------
>  2 files changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
> index 15adfb8c813a..67bc8fa30d63 100644
> --- a/include/linux/swap_slots.h
> +++ b/include/linux/swap_slots.h
> @@ -19,6 +19,7 @@ struct swap_slots_cache {
>  	spinlock_t	free_lock;  /* protects slots_ret, n_ret */
>  	swp_entry_t	*slots_ret;
>  	int		n_ret;
> +	struct work_struct async_free;
>  };
>  
>  void disable_swap_slots_cache_lock(void);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..71d344564e55 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -44,6 +44,7 @@ static DEFINE_MUTEX(swap_slots_cache_mutex);
>  static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
>  
>  static void __drain_swap_slots_cache(unsigned int type);
> +static void swapcache_async_free_entries(struct work_struct *data);
>  
>  #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
>  #define SLOTS_CACHE 0x1
> @@ -149,6 +150,7 @@ static int alloc_swap_slot_cache(unsigned int cpu)
>  		spin_lock_init(&cache->free_lock);
>  		cache->lock_initialized = true;
>  	}
> +	INIT_WORK(&cache->async_free, swapcache_async_free_entries);
>  	cache->nr = 0;
>  	cache->cur = 0;
>  	cache->n_ret = 0;
> @@ -269,6 +271,20 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
>  	return cache->nr;
>  }
>  
> +static void swapcache_async_free_entries(struct work_struct *data)
> +{
> +	struct swap_slots_cache *cache;
> +
> +	cache = container_of(data, struct swap_slots_cache, async_free);
> +	spin_lock_irq(&cache->free_lock);
> +	/* Swap slots cache may be deactivated before acquiring lock */
> +	if (cache->slots_ret) {
> +		swapcache_free_entries(cache->slots_ret, cache->n_ret);
> +		cache->n_ret = 0;
> +	}
> +	spin_unlock_irq(&cache->free_lock);
> +}
> +
>  void free_swap_slot(swp_entry_t entry)
>  {
>  	struct swap_slots_cache *cache;
> @@ -282,17 +298,14 @@ void free_swap_slot(swp_entry_t entry)
>  			goto direct_free;
>  		}
>  		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
> -			/*
> -			 * Return slots to global pool.
> -			 * The current swap_map value is SWAP_HAS_CACHE.
> -			 * Set it to 0 to indicate it is available for
> -			 * allocation in global pool
> -			 */
> -			swapcache_free_entries(cache->slots_ret, cache->n_ret);
> -			cache->n_ret = 0;
> +			spin_unlock_irq(&cache->free_lock);
> +			schedule_work(&cache->async_free);
> +			goto direct_free;
>  		}
>  		cache->slots_ret[cache->n_ret++] = entry;
>  		spin_unlock_irq(&cache->free_lock);
> +		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE)
> +			schedule_work(&cache->async_free);
>  	} else {
>  direct_free:
>  		swapcache_free_entries(&entry, 1);
>
> ---
> base-commit: eacce8189e28717da6f44ee492b7404c636ae0de
> change-id: 20231216-async-free-bef392015432
>
> Best regards,
  
Tim Chen Feb. 1, 2024, 11:20 p.m. UTC | #2
On Thu, 2024-02-01 at 13:33 +0800, Huang, Ying wrote:
> Chris Li <chrisl@kernel.org> writes:
> 
> > 
> > Changes in v2:
> > - Add description of the impact of time changing suggest by Ying.
> > - Remove create_workqueue() and use schedule_work()
> > - Link to v1: https://lore.kernel.org/r/20231221-async-free-v1-1-94b277992cb0@kernel.org
> > ---
> >  include/linux/swap_slots.h |  1 +
> >  mm/swap_slots.c            | 29 +++++++++++++++++++++--------
> >  2 files changed, 22 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
> > index 15adfb8c813a..67bc8fa30d63 100644
> > --- a/include/linux/swap_slots.h
> > +++ b/include/linux/swap_slots.h
> > @@ -19,6 +19,7 @@ struct swap_slots_cache {
> >  	spinlock_t	free_lock;  /* protects slots_ret, n_ret */
> >  	swp_entry_t	*slots_ret;
> >  	int		n_ret;
> > +	struct work_struct async_free;
> >  };
> >  
> >  void disable_swap_slots_cache_lock(void);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 0bec1f705f8e..71d344564e55 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -44,6 +44,7 @@ static DEFINE_MUTEX(swap_slots_cache_mutex);
> >  static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
> >  
> >  static void __drain_swap_slots_cache(unsigned int type);
> > +static void swapcache_async_free_entries(struct work_struct *data);
> >  
> >  #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
> >  #define SLOTS_CACHE 0x1
> > @@ -149,6 +150,7 @@ static int alloc_swap_slot_cache(unsigned int cpu)
> >  		spin_lock_init(&cache->free_lock);
> >  		cache->lock_initialized = true;
> >  	}
> > +	INIT_WORK(&cache->async_free, swapcache_async_free_entries);
> >  	cache->nr = 0;
> >  	cache->cur = 0;
> >  	cache->n_ret = 0;
> > @@ -269,6 +271,20 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
> >  	return cache->nr;
> >  }
> >  
> > +static void swapcache_async_free_entries(struct work_struct *data)
> > +{
> > +	struct swap_slots_cache *cache;
> > +
> > +	cache = container_of(data, struct swap_slots_cache, async_free);
> > +	spin_lock_irq(&cache->free_lock);
> > +	/* Swap slots cache may be deactivated before acquiring lock */
> > +	if (cache->slots_ret) {
> > +		swapcache_free_entries(cache->slots_ret, cache->n_ret);
> > +		cache->n_ret = 0;
> > +	}
> > +	spin_unlock_irq(&cache->free_lock);
> > +}
> > +
> >  void free_swap_slot(swp_entry_t entry)
> >  {
> >  	struct swap_slots_cache *cache;
> > @@ -282,17 +298,14 @@ void free_swap_slot(swp_entry_t entry)
> >  			goto direct_free;
> >  		}
> >  		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
> > -			/*
> > -			 * Return slots to global pool.
> > -			 * The current swap_map value is SWAP_HAS_CACHE.
> > -			 * Set it to 0 to indicate it is available for
> > -			 * allocation in global pool
> > -			 */
> > -			swapcache_free_entries(cache->slots_ret, cache->n_ret);
> > -			cache->n_ret = 0;
> > +			spin_unlock_irq(&cache->free_lock);
> > +			schedule_work(&cache->async_free);
> > +			goto direct_free;
> >  		}
> >  		cache->slots_ret[cache->n_ret++] = entry;
> >  		spin_unlock_irq(&cache->free_lock);
> > +		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE)
> > +			schedule_work(&cache->async_free);


I have some concerns about the current patch with the change above.
We could hit the direct_free path very often.

By delaying the freeing of entries in the return
cache, we have to do more freeing of swap entry one at a time. When
we try to free an entry, we can find the return cache still full, waiting to be freed.

So we have fewer batch free of swap entries, resulting in an increase in
number of sis->lock acquisitions overall. This could have the
effect of reducing swap throughput overall when swap is under heavy
operations and sis->lock is contended.

Tim

> 

> >  	} else {
> >  direct_free:
> >  		swapcache_free_entries(&entry, 1);
> > 
> > ---
> > base-commit: eacce8189e28717da6f44ee492b7404c636ae0de
> > change-id: 20231216-async-free-bef392015432
> > 
> > Best regards,
>
  
Chris Li Feb. 3, 2024, 6:12 p.m. UTC | #3
On Thu, Feb 1, 2024 at 3:21 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Thu, 2024-02-01 at 13:33 +0800, Huang, Ying wrote:
> > Chris Li <chrisl@kernel.org> writes:
> >
> > >
> > > Changes in v2:
> > > - Add description of the impact of time changing suggest by Ying.
> > > - Remove create_workqueue() and use schedule_work()
> > > - Link to v1: https://lore.kernel.org/r/20231221-async-free-v1-1-94b277992cb0@kernel.org
> > > ---
> > >  include/linux/swap_slots.h |  1 +
> > >  mm/swap_slots.c            | 29 +++++++++++++++++++++--------
> > >  2 files changed, 22 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
> > > index 15adfb8c813a..67bc8fa30d63 100644
> > > --- a/include/linux/swap_slots.h
> > > +++ b/include/linux/swap_slots.h
> > > @@ -19,6 +19,7 @@ struct swap_slots_cache {
> > >     spinlock_t      free_lock;  /* protects slots_ret, n_ret */
> > >     swp_entry_t     *slots_ret;
> > >     int             n_ret;
> > > +   struct work_struct async_free;
> > >  };
> > >
> > >  void disable_swap_slots_cache_lock(void);
> > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > > index 0bec1f705f8e..71d344564e55 100644
> > > --- a/mm/swap_slots.c
> > > +++ b/mm/swap_slots.c
> > > @@ -44,6 +44,7 @@ static DEFINE_MUTEX(swap_slots_cache_mutex);
> > >  static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
> > >
> > >  static void __drain_swap_slots_cache(unsigned int type);
> > > +static void swapcache_async_free_entries(struct work_struct *data);
> > >
> > >  #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
> > >  #define SLOTS_CACHE 0x1
> > > @@ -149,6 +150,7 @@ static int alloc_swap_slot_cache(unsigned int cpu)
> > >             spin_lock_init(&cache->free_lock);
> > >             cache->lock_initialized = true;
> > >     }
> > > +   INIT_WORK(&cache->async_free, swapcache_async_free_entries);
> > >     cache->nr = 0;
> > >     cache->cur = 0;
> > >     cache->n_ret = 0;
> > > @@ -269,6 +271,20 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
> > >     return cache->nr;
> > >  }
> > >
> > > +static void swapcache_async_free_entries(struct work_struct *data)
> > > +{
> > > +   struct swap_slots_cache *cache;
> > > +
> > > +   cache = container_of(data, struct swap_slots_cache, async_free);
> > > +   spin_lock_irq(&cache->free_lock);
> > > +   /* Swap slots cache may be deactivated before acquiring lock */
> > > +   if (cache->slots_ret) {
> > > +           swapcache_free_entries(cache->slots_ret, cache->n_ret);
> > > +           cache->n_ret = 0;
> > > +   }
> > > +   spin_unlock_irq(&cache->free_lock);
> > > +}
> > > +
> > >  void free_swap_slot(swp_entry_t entry)
> > >  {
> > >     struct swap_slots_cache *cache;
> > > @@ -282,17 +298,14 @@ void free_swap_slot(swp_entry_t entry)
> > >                     goto direct_free;
> > >             }
> > >             if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
> > > -                   /*
> > > -                    * Return slots to global pool.
> > > -                    * The current swap_map value is SWAP_HAS_CACHE.
> > > -                    * Set it to 0 to indicate it is available for
> > > -                    * allocation in global pool
> > > -                    */
> > > -                   swapcache_free_entries(cache->slots_ret, cache->n_ret);
> > > -                   cache->n_ret = 0;
> > > +                   spin_unlock_irq(&cache->free_lock);
> > > +                   schedule_work(&cache->async_free);
> > > +                   goto direct_free;
> > >             }
> > >             cache->slots_ret[cache->n_ret++] = entry;
> > >             spin_unlock_irq(&cache->free_lock);
> > > +           if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE)
> > > +                   schedule_work(&cache->async_free);
>
>
> I have some concerns about the current patch with the change above.
> We could hit the direct_free path very often.
>
> By delaying the freeing of entries in the return
> cache, we have to do more freeing of swap entry one at a time. When
> we try to free an entry, we can find the return cache still full, waiting to be freed.

You are describing the async free is not working. In that case it will always
hit the direct free path one by one.

>
> So we have fewer batch free of swap entries, resulting in an increase in
> number of sis->lock acquisitions overall. This could have the
> effect of reducing swap throughput overall when swap is under heavy
> operations and sis->lock is contended.

I  can change the direct free path to free all entries. If the async
free hasn't freed up the batch by the time the next swap fault comes in.
The new swap fault will take the hit, just free the whole batch. It will behave
closer to the original batch free behavior in this path.

Chris
  
Tim Chen Feb. 5, 2024, 6:15 p.m. UTC | #4
On Sat, 2024-02-03 at 10:12 -0800, Chris Li wrote:
> 
> > > >  {
> > > >     struct swap_slots_cache *cache;
> > > > @@ -282,17 +298,14 @@ void free_swap_slot(swp_entry_t entry)
> > > >                     goto direct_free;
> > > >             }
> > > >             if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
> > > > -                   /*
> > > > -                    * Return slots to global pool.
> > > > -                    * The current swap_map value is SWAP_HAS_CACHE.
> > > > -                    * Set it to 0 to indicate it is available for
> > > > -                    * allocation in global pool
> > > > -                    */
> > > > -                   swapcache_free_entries(cache->slots_ret, cache->n_ret);
> > > > -                   cache->n_ret = 0;
> > > > +                   spin_unlock_irq(&cache->free_lock);
> > > > +                   schedule_work(&cache->async_free);
> > > > +                   goto direct_free;
> > > >             }
> > > >             cache->slots_ret[cache->n_ret++] = entry;
> > > >             spin_unlock_irq(&cache->free_lock);
> > > > +           if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE)
> > > > +                   schedule_work(&cache->async_free);
> > 
> > 
> > I have some concerns about the current patch with the change above.
> > We could hit the direct_free path very often.
> > 
> > By delaying the freeing of entries in the return
> > cache, we have to do more freeing of swap entry one at a time. When
> > we try to free an entry, we can find the return cache still full, waiting to be freed.
> 
> You are describing the async free is not working. In that case it will always
> hit the direct free path one by one.
> 
> > 
> > So we have fewer batch free of swap entries, resulting in an increase in
> > number of sis->lock acquisitions overall. This could have the
> > effect of reducing swap throughput overall when swap is under heavy
> > operations and sis->lock is contended.
> 
> I  can change the direct free path to free all entries. If the async
> free hasn't freed up the batch by the time the next swap fault comes in.
> The new swap fault will take the hit, just free the whole batch. It will behave
> closer to the original batch free behavior in this path.
> 
Will that negate the benefit you are looking for?

A hack is to double the SWAP_SLOTS_CACHE_SIZE to 128, and trigger the
background reclaim when entries reach 64. This will allow you to avoid
the one by one relcaim direct path and hopefully the delayed job
would have done its job when slots accumulate between 64 and 128.

However, I am unsure how well this hack is under really heavy
swap load.  It means that the background reclaim will need to 
work through a larger backlog and
hold the sis->lock longer. So if you hit the direct path while
the background reclaim is underway, you may have longer tail latency
to acquire the sis->lock. 

Tim
  
Chris Li Feb. 5, 2024, 7:10 p.m. UTC | #5
On Mon, Feb 5, 2024 at 10:15 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Sat, 2024-02-03 at 10:12 -0800, Chris Li wrote:
> >
> > > > >  {
> > > > >     struct swap_slots_cache *cache;
> > > > > @@ -282,17 +298,14 @@ void free_swap_slot(swp_entry_t entry)
> > > > >                     goto direct_free;
> > > > >             }
> > > > >             if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
> > > > > -                   /*
> > > > > -                    * Return slots to global pool.
> > > > > -                    * The current swap_map value is SWAP_HAS_CACHE.
> > > > > -                    * Set it to 0 to indicate it is available for
> > > > > -                    * allocation in global pool
> > > > > -                    */
> > > > > -                   swapcache_free_entries(cache->slots_ret, cache->n_ret);
> > > > > -                   cache->n_ret = 0;
> > > > > +                   spin_unlock_irq(&cache->free_lock);
> > > > > +                   schedule_work(&cache->async_free);
> > > > > +                   goto direct_free;
> > > > >             }
> > > > >             cache->slots_ret[cache->n_ret++] = entry;
> > > > >             spin_unlock_irq(&cache->free_lock);
> > > > > +           if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE)
> > > > > +                   schedule_work(&cache->async_free);
> > >
> > >
> > > I have some concerns about the current patch with the change above.
> > > We could hit the direct_free path very often.
> > >
> > > By delaying the freeing of entries in the return
> > > cache, we have to do more freeing of swap entry one at a time. When
> > > we try to free an entry, we can find the return cache still full, waiting to be freed.
> >
> > You are describing the async free is not working. In that case it will always
> > hit the direct free path one by one.
> >
> > >
> > > So we have fewer batch free of swap entries, resulting in an increase in
> > > number of sis->lock acquisitions overall. This could have the
> > > effect of reducing swap throughput overall when swap is under heavy
> > > operations and sis->lock is contended.
> >
> > I  can change the direct free path to free all entries. If the async
> > free hasn't freed up the batch by the time the next swap fault comes in.
> > The new swap fault will take the hit, just free the whole batch. It will behave
> > closer to the original batch free behavior in this path.
> >
> Will that negate the benefit you are looking for?

It should not. In our deployment, the rate of swap faults isn't that
high. It is one of the matrices that gets monitored and controlled. If
the swap fault gets that busy, the app's performance most likely
already starts to suffer from it. We should back off from swapping out
that much.
I expect the normal case, the async free already free up the 64
entries before the next swap fault on the same CPU hits.

> A hack is to double the SWAP_SLOTS_CACHE_SIZE to 128, and trigger the
> background reclaim when entries reach 64. This will allow you to avoid
> the one by one relcaim direct path and hopefully the delayed job
> would have done its job when slots accumulate between 64 and 128.

I would have some concern on that due to higher per CPU memory usage.
We have machines with high CPU count. That would mean more memory
reserved. I actually have a variant of the patch that starts the async
free before it reaches the 64 entries limit. e.g. 60 entries. It will
give some head room to avoid the direct free path for another 4
entries. I did not include that in the patch because it makes things
more complicated, also this code path isn't taking much at all. If it
helps I can resurrect that in the patch as V3.
>
> However, I am unsure how well this hack is under really heavy
> swap load.  It means that the background reclaim will need to

In our system, a really heavy swap load is rare and it means something
is already wrong. At that point the app's SLO is likely at risk,
regardless of long tail swap latency. It is already too late to
address it at the swap fault end. We need to address the source of the
problem which is swapping out too much.

> work through a larger backlog and
> hold the sis->lock longer. So if you hit the direct path while
> the background reclaim is underway, you may have longer tail latency
> to acquire the sis->lock.

Chris
  
Tim Chen Feb. 7, 2024, 1:08 a.m. UTC | #6
On Mon, 2024-02-05 at 11:10 -0800, Chris Li wrote:
> 
> 
> In our system, a really heavy swap load is rare and it means something
> is already wrong. At that point the app's SLO is likely at risk,
> regardless of long tail swap latency. It is already too late to
> address it at the swap fault end. We need to address the source of the
> problem which is swapping out too much.
> 
> 

Could some usage scenarios put more pressure on swap than your
usage scenario?  Say system with limited RAM and rely on zswap?

Tim
  
Chris Li Feb. 7, 2024, 1:51 a.m. UTC | #7
On Tue, Feb 6, 2024 at 5:08 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Mon, 2024-02-05 at 11:10 -0800, Chris Li wrote:
> >
> >
> > In our system, a really heavy swap load is rare and it means something
> > is already wrong. At that point the app's SLO is likely at risk,
> > regardless of long tail swap latency. It is already too late to
> > address it at the swap fault end. We need to address the source of the
> > problem which is swapping out too much.
> >
> >
>
> Could some usage scenarios put more pressure on swap than your
> usage scenario?  Say system with limited RAM and rely on zswap?
>
Of course. In that case what I proposed  to do will already doing what
I think is the best of this situation. After grabbing the cache lock
and finding out async fre hasn't started the freeing yet. Just free
all 64 entries in the swap slot cache. It is similar to the old code
behavior.
Yes, it will have the long tail latency due to batch freeing 64 entries.
My point is not that I don't care about heavy swap behavior.
My point is that the app will suffer from the swap strom anyway, it is
unavoidable. That will be the dominant factor shadowing the batch free
optimization effect.

Or do I miss your point as you want to purpose the swap cache double
buffer  so it can perform better under swap storm situations?

Chris
  
Tim Chen Feb. 9, 2024, 5:52 p.m. UTC | #8
On Tue, 2024-02-06 at 17:51 -0800, Chris Li wrote:
> On Tue, Feb 6, 2024 at 5:08 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > 
> > On Mon, 2024-02-05 at 11:10 -0800, Chris Li wrote:
> > > 
> > > 
> > > In our system, a really heavy swap load is rare and it means something
> > > is already wrong. At that point the app's SLO is likely at risk,
> > > regardless of long tail swap latency. It is already too late to
> > > address it at the swap fault end. We need to address the source of the
> > > problem which is swapping out too much.
> > > 
> > > 
> > 
> > Could some usage scenarios put more pressure on swap than your
> > usage scenario?  Say system with limited RAM and rely on zswap?
> > 
> Of course. In that case what I proposed  to do will already doing what
> I think is the best of this situation. After grabbing the cache lock
> and finding out async fre hasn't started the freeing yet. Just free
> all 64 entries in the swap slot cache. It is similar to the old code
> behavior.
> Yes, it will have the long tail latency due to batch freeing 64 entries.
> My point is not that I don't care about heavy swap behavior.
> My point is that the app will suffer from the swap strom anyway, it is
> unavoidable. That will be the dominant factor shadowing the batch free
> optimization effect.

The original optimization introducing swap_slots target such heavy
swap use cases when we have fast swap backend to allow higher sustainable
swap throughput.  We should not ignore it.  And I am afraid your current
patch as is will hurt that performance.  If you change the direct free
path to free all entries, that could maintain the throughput and I'll
be okay with that.

> 
> Or do I miss your point as you want to purpose the swap cache double
> buffer  so it can perform better under swap storm situations?
> 

I am not actually proposing doubling the buffer as that proposal have
its own downside. 

Tim
  
Chris Li Feb. 13, 2024, 10:46 p.m. UTC | #9
On Fri, Feb 9, 2024 at 9:52 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > Yes, it will have the long tail latency due to batch freeing 64 entries.
> > My point is not that I don't care about heavy swap behavior.
> > My point is that the app will suffer from the swap strom anyway, it is
> > unavoidable. That will be the dominant factor shadowing the batch free
> > optimization effect.
>
> The original optimization introducing swap_slots target such heavy
> swap use cases when we have fast swap backend to allow higher sustainable
> swap throughput.  We should not ignore it.  And I am afraid your current
> patch as is will hurt that performance.  If you change the direct free
> path to free all entries, that could maintain the throughput and I'll
> be okay with that.

That is great. Thanks for the confirmation. I will send out the V3 soon.
In V3, I changed the direct free path to free all entries. I also add the
/sys/kernel/swap/swap_slot_async_free to enable the async free behavior.

>
> >
> > Or do I miss your point as you want to purpose the swap cache double
> > buffer  so it can perform better under swap storm situations?
> >
>
> I am not actually proposing doubling the buffer as that proposal have
> its own downside.

Ack

Chris
  

Patch

diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index 15adfb8c813a..67bc8fa30d63 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -19,6 +19,7 @@  struct swap_slots_cache {
 	spinlock_t	free_lock;  /* protects slots_ret, n_ret */
 	swp_entry_t	*slots_ret;
 	int		n_ret;
+	struct work_struct async_free;
 };
 
 void disable_swap_slots_cache_lock(void);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..71d344564e55 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -44,6 +44,7 @@  static DEFINE_MUTEX(swap_slots_cache_mutex);
 static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
 
 static void __drain_swap_slots_cache(unsigned int type);
+static void swapcache_async_free_entries(struct work_struct *data);
 
 #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
 #define SLOTS_CACHE 0x1
@@ -149,6 +150,7 @@  static int alloc_swap_slot_cache(unsigned int cpu)
 		spin_lock_init(&cache->free_lock);
 		cache->lock_initialized = true;
 	}
+	INIT_WORK(&cache->async_free, swapcache_async_free_entries);
 	cache->nr = 0;
 	cache->cur = 0;
 	cache->n_ret = 0;
@@ -269,6 +271,20 @@  static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 	return cache->nr;
 }
 
+static void swapcache_async_free_entries(struct work_struct *data)
+{
+	struct swap_slots_cache *cache;
+
+	cache = container_of(data, struct swap_slots_cache, async_free);
+	spin_lock_irq(&cache->free_lock);
+	/* Swap slots cache may be deactivated before acquiring lock */
+	if (cache->slots_ret) {
+		swapcache_free_entries(cache->slots_ret, cache->n_ret);
+		cache->n_ret = 0;
+	}
+	spin_unlock_irq(&cache->free_lock);
+}
+
 void free_swap_slot(swp_entry_t entry)
 {
 	struct swap_slots_cache *cache;
@@ -282,17 +298,14 @@  void free_swap_slot(swp_entry_t entry)
 			goto direct_free;
 		}
 		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
-			/*
-			 * Return slots to global pool.
-			 * The current swap_map value is SWAP_HAS_CACHE.
-			 * Set it to 0 to indicate it is available for
-			 * allocation in global pool
-			 */
-			swapcache_free_entries(cache->slots_ret, cache->n_ret);
-			cache->n_ret = 0;
+			spin_unlock_irq(&cache->free_lock);
+			schedule_work(&cache->async_free);
+			goto direct_free;
 		}
 		cache->slots_ret[cache->n_ret++] = entry;
 		spin_unlock_irq(&cache->free_lock);
+		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE)
+			schedule_work(&cache->async_free);
 	} else {
 direct_free:
 		swapcache_free_entries(&entry, 1);