[v4] mm: swap: async free swap slot cache entries

Message ID 20240214-async-free-v4-1-6abe0d59f85f@kernel.org
State New
Series [v4] mm: swap: async free swap slot cache entries

Commit Message

Chris Li Feb. 15, 2024, 1:02 a.m. UTC
We discovered that 1% of swap page faults take 100us or longer,
while 50% of swap faults complete in under 20us.

Further investigation shows that, in the long-tail case, a large
portion of the time is spent in free_swap_slot().

The percpu cache of swap slots is freed in a batch of 64 entries
inside free_swap_slot(). These cache entries accumulate from
previous page faults, which may not be related to the current
process.

Doing the batch free in the page fault handler causes longer
tail latencies and penalizes the current process.

When the swap slot cache is full, schedule an async free of the
cached swap slots on a work queue, so that it can run before the
next swap fault comes in. If the next swap fault arrives before
the async free gets a chance to run, the fault path frees the
whole cache directly, the same way as before this patch.
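
The fallback described above can be modeled in user space. The
following is a minimal sketch, not kernel code: struct model_cache,
model_free_slot() and model_async_worker() are hypothetical stand-ins
for struct swap_slots_cache, free_swap_slot() and
swapcache_async_free_entries(), and the work_pending flag only
imitates schedule_work().

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CACHE_SIZE 64	/* stands in for SWAP_SLOTS_CACHE_SIZE */

/* Hypothetical user-space stand-in for struct swap_slots_cache. */
struct model_cache {
	unsigned long slots_ret[MODEL_CACHE_SIZE];
	int n_ret;		/* entries waiting to be freed */
	bool work_pending;	/* models a queued async_free work item */
	int direct_batches;	/* batches freed in the caller's context */
	int async_batches;	/* batches freed by the worker */
};

/* Models free_swap_slot(); returns true if the caller paid for a batch free. */
static bool model_free_slot(struct model_cache *c, unsigned long entry)
{
	bool paid = false;

	if (c->n_ret >= MODEL_CACHE_SIZE) {
		/* The worker has not run yet: free the whole batch directly. */
		c->direct_batches++;
		c->n_ret = 0;
		paid = true;
	}
	c->slots_ret[c->n_ret++] = entry;
	if (c->n_ret >= MODEL_CACHE_SIZE)
		c->work_pending = true;	/* models schedule_work() */
	return paid;
}

/* Models swapcache_async_free_entries() running from the workqueue. */
static void model_async_worker(struct model_cache *c)
{
	c->work_pending = false;
	if (c->n_ret) {
		c->async_batches++;
		c->n_ret = 0;
	}
}
```

In this model, 64 back-to-back calls to model_free_slot() followed by
model_async_worker() leave no caller paying for a batch free. If a 65th
call arrives before the worker runs, that caller frees the batch itself.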

Testing:

Chun-Tse ran a benchmark on a Chromebook, showing that
zram_wait_metrics improves by about 15% with 80% and 95% confidence.

I recently ran experiments on about 1000 Google production
machines. They show that swapin latency in the long-tail
100us - 500us bucket drops dramatically.

platform	(100-500us)	 	(0-100us)
A		1.12% -> 0.36%		98.47% -> 99.22%
B		0.65% -> 0.15%		98.96% -> 99.46%
C		0.61% -> 0.23%		98.96% -> 99.38%

Signed-off-by: Chris Li <chrisl@kernel.org>
---
Changes in v4:
- Remove the sysfs interface file, according to the feedback.
- Move the full condition test inside the spinlock.
- Link to v3: https://lore.kernel.org/r/20240213-async-free-v3-1-b89c3cc48384@kernel.org

Changes in v3:
- Address feedback from Tim Chen, direct free path will free all swap slots.
- Add /sys/kernel/mm/swap/swap_slot_async_fee to enable async free. Default is off.
- Link to v2: https://lore.kernel.org/r/20240131-async-free-v2-1-525f03e07184@kernel.org

Changes in v2:
- Add description of the impact of time changing, as suggested by Ying.
- Remove create_workqueue() and use schedule_work()
- Link to v1: https://lore.kernel.org/r/20231221-async-free-v1-1-94b277992cb0@kernel.org
---
 include/linux/swap_slots.h |  1 +
 mm/swap_slots.c            | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)


---
base-commit: eacce8189e28717da6f44ee492b7404c636ae0de
change-id: 20231216-async-free-bef392015432

Best regards,
  

Comments

Tim Chen Feb. 15, 2024, 6:31 p.m. UTC | #1
On Wed, 2024-02-14 at 17:02 -0800, Chris Li wrote:
> We discovered that 1% swap page fault is 100us+ while 50% of
> the swap fault is under 20us.
> 
> Further investigation shows that a large portion of the time
> spent in the free_swap_slots() function for the long tail case.
> 
> The percpu cache of swap slots is freed in a batch of 64 entries
> inside free_swap_slots(). These cache entries are accumulated
> from previous page faults, which may not be related to the current
> process.
> 
> Doing the batch free in the page fault handler causes longer
> tail latencies and penalizes the current process.
> 
> When the swap cache slot is full, schedule async free cached
> swap slots in a work queue, before the next swap fault comes in.
> If the next swap fault comes in very fast, before the async
> free gets a chance to run. It will directly free all the swap
> cache in the swap fault the same way as previously.
> 
> Testing:
> 
> Chun-Tse did some benchmark in chromebook, showing that
> zram_wait_metrics improve about 15% with 80% and 95% confidence.
> 
> I recently ran some experiments on about 1000 Google production
> machines. It shows swapin latency drops in the long tail
> 100us - 500us bucket dramatically.
> 
> platform	(100-500us)	 	(0-100us)
> A		1.12% -> 0.36%		98.47% -> 99.22%
> B		0.65% -> 0.15%		98.96% -> 99.46%
> C		0.61% -> 0.23%		98.96% -> 99.38%
> 
> Signed-off-by: Chris Li <chrisl@kernel.org>

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>

> ---
> Changes in v4:
> - Remove the sysfs interface file, according the feedback.
> - Move the full condition test inside the spinlock.
> - Link to v3: https://lore.kernel.org/r/20240213-async-free-v3-1-b89c3cc48384@kernel.org
> 
> Changes in v3:
> - Address feedback from Tim Chen, direct free path will free all swap slots.
> - Add /sys/kernel/mm/swap/swap_slot_async_fee to enable async free. Default is off.
> - Link to v2: https://lore.kernel.org/r/20240131-async-free-v2-1-525f03e07184@kernel.org
> 
> Changes in v2:
> - Add description of the impact of time changing suggest by Ying.
> - Remove create_workqueue() and use schedule_work()
> - Link to v1: https://lore.kernel.org/r/20231221-async-free-v1-1-94b277992cb0@kernel.org
> ---
>  include/linux/swap_slots.h |  1 +
>  mm/swap_slots.c            | 20 ++++++++++++++++++++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
> index 15adfb8c813a..67bc8fa30d63 100644
> --- a/include/linux/swap_slots.h
> +++ b/include/linux/swap_slots.h
> @@ -19,6 +19,7 @@ struct swap_slots_cache {
>  	spinlock_t	free_lock;  /* protects slots_ret, n_ret */
>  	swp_entry_t	*slots_ret;
>  	int		n_ret;
> +	struct work_struct async_free;
>  };
>  
>  void disable_swap_slots_cache_lock(void);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..23dc04bce9ca 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -44,6 +44,7 @@ static DEFINE_MUTEX(swap_slots_cache_mutex);
>  static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
>  
>  static void __drain_swap_slots_cache(unsigned int type);
> +static void swapcache_async_free_entries(struct work_struct *data);
>  
>  #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
>  #define SLOTS_CACHE 0x1
> @@ -149,6 +150,7 @@ static int alloc_swap_slot_cache(unsigned int cpu)
>  		spin_lock_init(&cache->free_lock);
>  		cache->lock_initialized = true;
>  	}
> +	INIT_WORK(&cache->async_free, swapcache_async_free_entries);
>  	cache->nr = 0;
>  	cache->cur = 0;
>  	cache->n_ret = 0;
> @@ -269,12 +271,27 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
>  	return cache->nr;
>  }
>  
> +static void swapcache_async_free_entries(struct work_struct *data)
> +{
> +	struct swap_slots_cache *cache;
> +
> +	cache = container_of(data, struct swap_slots_cache, async_free);
> +	spin_lock_irq(&cache->free_lock);
> +	/* Swap slots cache may be deactivated before acquiring lock */
> +	if (cache->slots_ret && cache->n_ret) {
> +		swapcache_free_entries(cache->slots_ret, cache->n_ret);
> +		cache->n_ret = 0;
> +	}
> +	spin_unlock_irq(&cache->free_lock);
> +}
> +
>  void free_swap_slot(swp_entry_t entry)
>  {
>  	struct swap_slots_cache *cache;
>  
>  	cache = raw_cpu_ptr(&swp_slots);
>  	if (likely(use_swap_slot_cache && cache->slots_ret)) {
> +		bool full;
>  		spin_lock_irq(&cache->free_lock);
>  		/* Swap slots cache may be deactivated before acquiring lock */
>  		if (!use_swap_slot_cache || !cache->slots_ret) {
> @@ -292,7 +309,10 @@ void free_swap_slot(swp_entry_t entry)
>  			cache->n_ret = 0;
>  		}
>  		cache->slots_ret[cache->n_ret++] = entry;
> +		full = cache->n_ret >= SWAP_SLOTS_CACHE_SIZE;
>  		spin_unlock_irq(&cache->free_lock);
> +		if (full)
> +			schedule_work(&cache->async_free);
>  	} else {
>  direct_free:
>  		swapcache_free_entries(&entry, 1);
> 
> ---
> base-commit: eacce8189e28717da6f44ee492b7404c636ae0de
> change-id: 20231216-async-free-bef392015432
> 
> Best regards,
  
Chris Li Feb. 15, 2024, 10:57 p.m. UTC | #2
On Thu, Feb 15, 2024 at 10:31 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Wed, 2024-02-14 at 17:02 -0800, Chris Li wrote:
> > We discovered that 1% swap page fault is 100us+ while 50% of
> > the swap fault is under 20us.
> >
> > Further investigation shows that a large portion of the time
> > spent in the free_swap_slots() function for the long tail case.
> >
> > The percpu cache of swap slots is freed in a batch of 64 entries
> > inside free_swap_slots(). These cache entries are accumulated
> > from previous page faults, which may not be related to the current
> > process.
> >
> > Doing the batch free in the page fault handler causes longer
> > tail latencies and penalizes the current process.
> >
> > When the swap cache slot is full, schedule async free cached
> > swap slots in a work queue, before the next swap fault comes in.
> > If the next swap fault comes in very fast, before the async
> > free gets a chance to run. It will directly free all the swap
> > cache in the swap fault the same way as previously.
> >
> > Testing:
> >
> > Chun-Tse did some benchmark in chromebook, showing that
> > zram_wait_metrics improve about 15% with 80% and 95% confidence.
> >
> > I recently ran some experiments on about 1000 Google production
> > machines. It shows swapin latency drops in the long tail
> > 100us - 500us bucket dramatically.
> >
> > platform      (100-500us)             (0-100us)
> > A             1.12% -> 0.36%          98.47% -> 99.22%
> > B             0.65% -> 0.15%          98.96% -> 99.46%
> > C             0.61% -> 0.23%          98.96% -> 99.38%
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>

Thank you so much for your review.

Chris
  
Andrew Morton Feb. 16, 2024, 12:11 a.m. UTC | #3
On Wed, 14 Feb 2024 17:02:13 -0800 Chris Li <chrisl@kernel.org> wrote:

> We discovered that 1% swap page fault is 100us+ while 50% of
> the swap fault is under 20us.
> 
> Further investigation shows that a large portion of the time
> spent in the free_swap_slots() function for the long tail case.
> 
> The percpu cache of swap slots is freed in a batch of 64 entries
> inside free_swap_slots(). These cache entries are accumulated
> from previous page faults, which may not be related to the current
> process.
> 
> Doing the batch free in the page fault handler causes longer
> tail latencies and penalizes the current process.
> 
> When the swap cache slot is full, schedule async free cached
> swap slots in a work queue, before the next swap fault comes in.
> If the next swap fault comes in very fast, before the async
> free gets a chance to run. It will directly free all the swap
> cache in the swap fault the same way as previously.
> 
> Testing:
> 
> Chun-Tse did some benchmark in chromebook, showing that
> zram_wait_metrics improve about 15% with 80% and 95% confidence.
> 
> I recently ran some experiments on about 1000 Google production
> machines. It shows swapin latency drops in the long tail
> 100us - 500us bucket dramatically.
> 
> platform	(100-500us)	 	(0-100us)
> A		1.12% -> 0.36%		98.47% -> 99.22%
> B		0.65% -> 0.15%		98.96% -> 99.46%
> C		0.61% -> 0.23%		98.96% -> 99.38%
> 

What this description lacks is any description of why anyone cares. 

The patch clearly decreases overall throughput (speed-vs-latency is a
common tradeoff).

And the "we don't know how to fix this properly so punt it into a
kernel thread" approach remains lame.  For example, the risk that the
now-liberated allocator can outpace the async freeing, resulting in
unlimited object windup.

And here's a fun one: what happens if the producer of these objects has
SCHED_FIFO policy and it's a uniprocessor machine?  If the producer sits
there allocating objects and the freeing thread never executes?  Has
this been considered, and tested for?


All these concerns, risks and complexity and the changelog offers us no
reason to take any of this on.  What's wrong with the existing code? 
Please exhaustively describe the issues which are being seen.  And
explain why those issues are sufficiently serious to leave the above
issues and risks unaddressed.
  
Tim Chen Feb. 16, 2024, 1:38 a.m. UTC | #4
On Thu, 2024-02-15 at 16:11 -0800, Andrew Morton wrote:
> On Wed, 14 Feb 2024 17:02:13 -0800 Chris Li <chrisl@kernel.org> wrote:
> 
> > We discovered that 1% swap page fault is 100us+ while 50% of
> > the swap fault is under 20us.
> > 
> > Further investigation shows that a large portion of the time
> > spent in the free_swap_slots() function for the long tail case.
> > 
> > The percpu cache of swap slots is freed in a batch of 64 entries
> > inside free_swap_slots(). These cache entries are accumulated
> > from previous page faults, which may not be related to the current
> > process.
> > 
> > Doing the batch free in the page fault handler causes longer
> > tail latencies and penalizes the current process.
> > 
> > When the swap cache slot is full, schedule async free cached
> > swap slots in a work queue, before the next swap fault comes in.
> > If the next swap fault comes in very fast, before the async
> > free gets a chance to run. It will directly free all the swap
> > cache in the swap fault the same way as previously.
> > 
> > Testing:
> > 
> > Chun-Tse did some benchmark in chromebook, showing that
> > zram_wait_metrics improve about 15% with 80% and 95% confidence.
> > 
> > I recently ran some experiments on about 1000 Google production
> > machines. It shows swapin latency drops in the long tail
> > 100us - 500us bucket dramatically.
> > 
> > platform	(100-500us)	 	(0-100us)
> > A		1.12% -> 0.36%		98.47% -> 99.22%
> > B		0.65% -> 0.15%		98.96% -> 99.46%
> > C		0.61% -> 0.23%		98.96% -> 99.38%
> > 
> 
> What this description lacks is any description of why anyone cares. 
> 
> The patch clearly decreases overall throughput (speed-vs-latency is a
> common tradeoff).
> 
> And the "we don't know how to fix this properly so punt it into a
> kernel thread" approach remains lame.  For example, the risk that the
> now-liberated allocator can outpace the async freeing, resulting in
> unlimited object windup.


Andrew,

What you are saying about outpacing async free is true for the v1 and v2 versions of the patch.

But in this latest version, if another reclaim comes in before the async free has kicked in,
we free the whole cache directly, same as the original code, without waiting
for the async free.  This differs from the first version,
where you go into a free-one-at-a-time mode while waiting for the async free.
That was also my objection to the first two versions, as you could be in this
slow free-one-at-a-time mode for a long time.

So now we should not have unlimited object windup.  And we would be freeing
in batches of 64, either in the direct path or in the async path.

> 
> 
> And here's a fun one: what happens if the producer of these objects has
> SCHED_FIFO policy and it's a uniprocessor machine?  If the producer sits
> there allocating objects and the freeing thread never executes?  Has
> this been considered, and tested for?

If the free thread did not execute, in this version of the patch, we would
free the full cache directly, should the allocate path see a full cache. This works
just as before the patch is applied.
So I do not foresee the current change reducing throughput.
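
The starved-worker case (for example the SCHED_FIFO uniprocessor
scenario raised above) can be checked with a small counter model. This
is a sketch under stated assumptions, not kernel code: run_starved()
and struct starve_stats are hypothetical names, the worker is assumed
never to run, and only the n_ret bookkeeping of free_swap_slot() is
imitated.

```c
#include <assert.h>

#define MODEL_CACHE_SIZE 64	/* stands in for SWAP_SLOTS_CACHE_SIZE */

struct starve_stats {
	int max_pending;	/* largest backlog ever observed */
	int direct_batches;	/* batch frees done in the caller's context */
};

/*
 * Issue total_frees slot frees while the async worker never gets CPU
 * time. Each free first checks for a full cache and, if it is full,
 * frees the whole batch directly, as the v4 free path does.
 */
static struct starve_stats run_starved(int total_frees)
{
	struct starve_stats s = { 0, 0 };
	int n_ret = 0;

	for (int i = 0; i < total_frees; i++) {
		if (n_ret >= MODEL_CACHE_SIZE) {
			s.direct_batches++;
			n_ret = 0;
		}
		n_ret++;
		if (n_ret > s.max_pending)
			s.max_pending = n_ret;
	}
	return s;
}
```

Under these assumptions the backlog never exceeds one full cache of 64
entries, and every direct free is still a full batch, which matches the
pre-patch behaviour.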

The current patch does allow a chance to do a background free, so it cuts down the
chances that the allocate path needs to free the cache directly.

That should help the tail latency and reduce the number of times you have to wait for
the free to complete. And most of the time, we would not have to do the direct free
ourselves.

Tim
 
> 
> 
> All these concerns, risks and complexity and the changelog offers us no
> reason to take any of this on.  What's wrong with the existing code? 
> Please exhaustively describe the issues which are being seen.  And
> explain why those issues are sufficiently serious to leave the above
> issues and risks unaddressed.
>
  
Andrew Morton Feb. 16, 2024, 4:16 a.m. UTC | #5
On Thu, 15 Feb 2024 17:38:38 -0800 Tim Chen <tim.c.chen@linux.intel.com> wrote:

> > What this description lacks is any description of why anyone cares. 
> > 
> > The patch clearly decreases overall throughput (speed-vs-latency is a
> > common tradeoff).

This, please.

> > And the "we don't know how to fix this properly so punt it into a
> > kernel thread" approach remains lame.  For example, the risk that the
> > now-liberated allocator can outpace the async freeing, resulting in
> > unlimited object windup.
> 
> 
> Andrew,
> 
> What you are saying about outpacing asyn free is true for v1 and v2 versions of the patch.
> 
> But in this latest version, if another reclaim comes in before the async free has kicked in,
> we would be freeing the whole cache directly, same as original code, without waiting
> for the async free.  It is different from the first version
> where you go into the free one at a time mode while waiting for the async free. 
> That was also my objection to the first two versions as you could be in this
> slow free one at a time mode for a long time.
> 
> So now we should not have unlimited object windup.  And we would be doing free
> in batch of 64, either still in the direct path or in the async path.
> 

OK, thanks, I didn't read closely enough.

> If the next swap fault comes in very fast, before the async
> free gets a chance to run. It will directly free all the swap
> cache in the swap fault the same way as previously.

And might it be a win to cancel the async_work in this case?


Again, without a clear description of the userspace-visible effects of
this problem I am groping in the dark.  My hands blindly landed upon
the question: the overall effect here is to leave worst-case latency
unaltered, but to decrease average latency.  Does this satisfy the
yet-to-be-described requirements?


Also, the V4 patch's quoted quantitative testing results are pasted
from the V2 patch's.  V2 was a fundamentally different implementation. 
I think it is fair to say that V4 is "untested", with regard to
satisfying its runtime objectives.
  
Tim Chen Feb. 16, 2024, 4:57 p.m. UTC | #6
On Thu, 2024-02-15 at 20:16 -0800, Andrew Morton wrote:
> On Thu, 15 Feb 2024 17:38:38 -0800 Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> > > What this description lacks is any description of why anyone cares. 
> > > 
> > > The patch clearly decreases overall throughput (speed-vs-latency is a
> > > common tradeoff).
> 
> This, please.
> 
> > > And the "we don't know how to fix this properly so punt it into a
> > > kernel thread" approach remains lame.  For example, the risk that the
> > > now-liberated allocator can outpace the async freeing, resulting in
> > > unlimited object windup.
> > 
> > 
> > Andrew,
> > 
> > What you are saying about outpacing asyn free is true for v1 and v2 versions of the patch.
> > 
> > But in this latest version, if another reclaim comes in before the async free has kicked in,
> > we would be freeing the whole cache directly, same as original code, without waiting
> > for the async free.  It is different from the first version
> > where you go into the free one at a time mode while waiting for the async free. 
> > That was also my objection to the first two versions as you could be in this
> > slow free one at a time mode for a long time.
> > 
> > So now we should not have unlimited object windup.  And we would be doing free
> > in batch of 64, either still in the direct path or in the async path.
> > 
> 
> OK, thanks, I didn't read closely enough,
> 
> > If the next swap fault comes in very fast, before the async
> > free gets a chance to run. It will directly free all the swap
> > cache in the swap fault the same way as previously.
> 
> And might it be a win to cancel the async_work in this case?
> 
Canceling async_work will matter for the case where we push swap hard,
and have a better chance of finding that the async free has not yet run when
we need to free additional swap slots.

Chris' tests so far have been for his use cases where swap is lightly
loaded.  The scenarios you listed arise when
we push swap hard, close to its max throughput.

It would help answer your concerns if Chris could also test a high-swap scenario.
Then we can make sure sustainable swap throughput does not regress and
latency is improved, and check whether it is beneficial to cancel outstanding async_work on
the direct free path. I think the pro of canceling the async_work is to
skip an extra lock acquisition on the cache, though
there is also some overhead in canceling the work itself.
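
The trade-off can be illustrated with a user-space model that compares
the two policies. This is a hedged sketch only: struct cancel_model,
cm_free_slot(), cm_run_worker_if_pending() and cm_scenario() are
hypothetical names, the cancel_on_direct switch does not exist in the
patch, and the cost of the cancel itself is not modeled. In the kernel,
the cancel would presumably be a non-blocking call such as
cancel_work(); that choice is an assumption, not something the thread
specifies.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CACHE_SIZE 64	/* stands in for SWAP_SLOTS_CACHE_SIZE */

struct cancel_model {
	int n_ret;		/* entries waiting to be freed */
	bool work_pending;	/* models a queued async_free work item */
	bool cancel_on_direct;	/* hypothetical policy switch */
	int worker_lock_taken;	/* times the worker took the cache lock */
	int partial_batches;	/* worker runs that freed fewer than 64 */
};

/* Models the free path, optionally cancelling stale work on direct free. */
static void cm_free_slot(struct cancel_model *m)
{
	if (m->n_ret >= MODEL_CACHE_SIZE) {
		m->n_ret = 0;	/* direct batch free */
		if (m->cancel_on_direct)
			m->work_pending = false;
	}
	m->n_ret++;
	if (m->n_ret >= MODEL_CACHE_SIZE)
		m->work_pending = true;	/* models schedule_work() */
}

/* Models the workqueue eventually running the work item, if still queued. */
static void cm_run_worker_if_pending(struct cancel_model *m)
{
	if (!m->work_pending)
		return;
	m->work_pending = false;
	m->worker_lock_taken++;
	if (m->n_ret) {
		if (m->n_ret < MODEL_CACHE_SIZE)
			m->partial_batches++;
		m->n_ret = 0;
	}
}

/* 65 frees back to back, and only then does the workqueue get to run. */
static struct cancel_model cm_scenario(bool cancel_on_direct)
{
	struct cancel_model m = { .cancel_on_direct = cancel_on_direct };

	for (int i = 0; i < MODEL_CACHE_SIZE + 1; i++)
		cm_free_slot(&m);
	cm_run_worker_if_pending(&m);
	return m;
}
```

Without the cancel, the late worker takes the lock once and frees a
one-entry batch. With the cancel, the worker does not run and the single
entry stays cached for the next batch. Whether that saving outweighs the
cancel overhead is exactly what the high-swap test would need to show.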

Tim

> 
> Again, without a clear description of the userspace-visible effects of
> this problem I am groping in the dark.  My hands blindly landed upon
> the question: the overall effect here is to leave worst-case latency
> unaltered, but to decrease average latency.  Does this satisfy the
> yet-to-be-described requirements?
> 
> 
> Also, the V4 patch's quoted quantitative testing results are pasted
> from the V2 patch's.  V2 was a fundamentally different implementation. 
> I think it is fair to say that V4 is "untested", with regard to
> satisfying its runtime objectives.
>
  

Patch

diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index 15adfb8c813a..67bc8fa30d63 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -19,6 +19,7 @@  struct swap_slots_cache {
 	spinlock_t	free_lock;  /* protects slots_ret, n_ret */
 	swp_entry_t	*slots_ret;
 	int		n_ret;
+	struct work_struct async_free;
 };
 
 void disable_swap_slots_cache_lock(void);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..23dc04bce9ca 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -44,6 +44,7 @@  static DEFINE_MUTEX(swap_slots_cache_mutex);
 static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
 
 static void __drain_swap_slots_cache(unsigned int type);
+static void swapcache_async_free_entries(struct work_struct *data);
 
 #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
 #define SLOTS_CACHE 0x1
@@ -149,6 +150,7 @@  static int alloc_swap_slot_cache(unsigned int cpu)
 		spin_lock_init(&cache->free_lock);
 		cache->lock_initialized = true;
 	}
+	INIT_WORK(&cache->async_free, swapcache_async_free_entries);
 	cache->nr = 0;
 	cache->cur = 0;
 	cache->n_ret = 0;
@@ -269,12 +271,27 @@  static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 	return cache->nr;
 }
 
+static void swapcache_async_free_entries(struct work_struct *data)
+{
+	struct swap_slots_cache *cache;
+
+	cache = container_of(data, struct swap_slots_cache, async_free);
+	spin_lock_irq(&cache->free_lock);
+	/* Swap slots cache may be deactivated before acquiring lock */
+	if (cache->slots_ret && cache->n_ret) {
+		swapcache_free_entries(cache->slots_ret, cache->n_ret);
+		cache->n_ret = 0;
+	}
+	spin_unlock_irq(&cache->free_lock);
+}
+
 void free_swap_slot(swp_entry_t entry)
 {
 	struct swap_slots_cache *cache;
 
 	cache = raw_cpu_ptr(&swp_slots);
 	if (likely(use_swap_slot_cache && cache->slots_ret)) {
+		bool full;
 		spin_lock_irq(&cache->free_lock);
 		/* Swap slots cache may be deactivated before acquiring lock */
 		if (!use_swap_slot_cache || !cache->slots_ret) {
@@ -292,7 +309,10 @@  void free_swap_slot(swp_entry_t entry)
 			cache->n_ret = 0;
 		}
 		cache->slots_ret[cache->n_ret++] = entry;
+		full = cache->n_ret >= SWAP_SLOTS_CACHE_SIZE;
 		spin_unlock_irq(&cache->free_lock);
+		if (full)
+			schedule_work(&cache->async_free);
 	} else {
 direct_free:
 		swapcache_free_entries(&entry, 1);