[v6,6/6] zsmalloc: Implement writeback mechanism for zsmalloc

Message ID 20221119001536.2086599-7-nphamcs@gmail.com
State New
Series Implement writeback for zsmalloc

Commit Message

Nhat Pham Nov. 19, 2022, 12:15 a.m. UTC
  This commit adds the writeback mechanism for zsmalloc, analogous to the
one in the zbud allocator. Zsmalloc will attempt to determine the coldest
zspage (i.e. the least recently used) in the pool and write back all of
its stored compressed objects via the pool's evict handler.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zsmalloc.c | 193 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 182 insertions(+), 11 deletions(-)

--
2.30.2
  

Comments

Johannes Weiner Nov. 19, 2022, 4:45 p.m. UTC | #1
On Fri, Nov 18, 2022 at 04:15:36PM -0800, Nhat Pham wrote:
> This commit adds the writeback mechanism for zsmalloc, analogous to the
> zbud allocator. Zsmalloc will attempt to determine the coldest zspage
> (i.e least recently used) in the pool, and attempt to write back all the
> stored compressed objects via the pool's evict handler.
> 
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Excellent!
  
Minchan Kim Nov. 19, 2022, 5:35 p.m. UTC | #2
On Fri, Nov 18, 2022 at 04:15:36PM -0800, Nhat Pham wrote:
> This commit adds the writeback mechanism for zsmalloc, analogous to the
> zbud allocator. Zsmalloc will attempt to determine the coldest zspage
> (i.e least recently used) in the pool, and attempt to write back all the
> stored compressed objects via the pool's evict handler.
> 
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>

Thanks.
  
Sergey Senozhatsky Nov. 22, 2022, 1:40 a.m. UTC | #3
On (22/11/18 16:15), Nhat Pham wrote:
> +static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries);
> +
> +static int zs_zpool_shrink(void *pool, unsigned int pages,
> +			unsigned int *reclaimed)
> +{
> +	unsigned int total = 0;
> +	int ret = -EINVAL;
> +
> +	while (total < pages) {
> +		ret = zs_reclaim_page(pool, 8);

Just curious why 8 retries and how was 8 picked?

> +		if (ret < 0)
> +			break;
> +		total++;
> +	}
> +
> +	if (reclaimed)
> +		*reclaimed = total;
> +
> +	return ret;
> +}
  
Sergey Senozhatsky Nov. 22, 2022, 2 a.m. UTC | #4
On (22/11/18 16:15), Nhat Pham wrote:
> +static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
> +{
> +	int i, obj_idx, ret = 0;
> +	unsigned long handle;
> +	struct zspage *zspage;
> +	struct page *page;
> +	enum fullness_group fullness;
> +
> +	/* Lock LRU and fullness list */
> +	spin_lock(&pool->lock);
> +	if (list_empty(&pool->lru)) {
> +		spin_unlock(&pool->lock);
> +		return -EINVAL;
> +	}
> +
> +	for (i = 0; i < retries; i++) {
> +		struct size_class *class;
> +
> +		zspage = list_last_entry(&pool->lru, struct zspage, lru);
> +		list_del(&zspage->lru);
> +
> +		/* zs_free may free objects, but not the zspage and handles */
> +		zspage->under_reclaim = true;
> +
> +		class = zspage_class(pool, zspage);
> +		fullness = get_fullness_group(class, zspage);
> +
> +		/* Lock out object allocations and object compaction */
> +		remove_zspage(class, zspage, fullness);
> +
> +		spin_unlock(&pool->lock);
> +
> +		/* Lock backing pages into place */
> +		lock_zspage(zspage);
> +
> +		obj_idx = 0;
> +		page = zspage->first_page;

A nit: we usually call get_first_page() in such cases.

> +		while (1) {
> +			handle = find_alloced_obj(class, page, &obj_idx);
> +			if (!handle) {
> +				page = get_next_page(page);
> +				if (!page)
> +					break;
> +				obj_idx = 0;
> +				continue;
> +			}
> +
> +			/*
> +			 * This will write the object and call zs_free.
> +			 *
> +			 * zs_free will free the object, but the
> +			 * under_reclaim flag prevents it from freeing
> +			 * the zspage altogether. This is necessary so
> +			 * that we can continue working with the
> +			 * zspage potentially after the last object
> +			 * has been freed.
> +			 */
> +			ret = pool->zpool_ops->evict(pool->zpool, handle);
> +			if (ret)
> +				goto next;
> +
> +			obj_idx++;
> +		}
  
Sergey Senozhatsky Nov. 22, 2022, 2:15 a.m. UTC | #5
On (22/11/18 16:15), Nhat Pham wrote:
> +
> +static int zs_zpool_shrink(void *pool, unsigned int pages,
> +			unsigned int *reclaimed)
> +{
> +	unsigned int total = 0;
> +	int ret = -EINVAL;
> +
> +	while (total < pages) {
> +		ret = zs_reclaim_page(pool, 8);
> +		if (ret < 0)
> +			break;
> +		total++;
> +	}
> +
> +	if (reclaimed)
> +		*reclaimed = total;
> +
> +	return ret;
> +}

A silly question: why do we need a retry loop in zs_reclaim_page()?
  
Johannes Weiner Nov. 22, 2022, 3:12 a.m. UTC | #6
On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> On (22/11/18 16:15), Nhat Pham wrote:
> > +
> > +static int zs_zpool_shrink(void *pool, unsigned int pages,
> > +			unsigned int *reclaimed)
> > +{
> > +	unsigned int total = 0;
> > +	int ret = -EINVAL;
> > +
> > +	while (total < pages) {
> > +		ret = zs_reclaim_page(pool, 8);
> > +		if (ret < 0)
> > +			break;
> > +		total++;
> > +	}
> > +
> > +	if (reclaimed)
> > +		*reclaimed = total;
> > +
> > +	return ret;
> > +}
> 
> A silly question: why do we need a retry loop in zs_reclaim_page()?

Individual objects in a zspage can be busy (swapped in simultaneously
for example), which will prevent the zspage from being freed. Zswap
currently requests reclaim of one backend page at a time (another
project...), so if we don't retry we don't meet the reclaim goal
and cause rejections for new stores. Rejections are worse than moving
on to the adjacent LRU item, because a rejected page, which should be
at the head of the LRU, bypasses the list and goes straight to swap.

The number 8 is cribbed from zbud and z3fold. It works well in
practice, but is as arbitrary as MAX_RECLAIM_RETRIES used all over MM.
We may want to revisit it at some point, but we should probably do it
for all backends then.
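
For context, a minimal sketch of the caller side described above, assuming
the zpool API as it exists alongside this series. Only zpool_shrink() and
the driver's .shrink op come from the code under review; the wrapper name
is invented for the illustration:

	#include <linux/zpool.h>

	/*
	 * Sketch only, not part of the patch: a zpool user such as zswap
	 * asks the backend to reclaim a single page. With this series the
	 * request lands in zs_zpool_shrink(), whose zs_reclaim_page(pool, 8)
	 * call is what absorbs the occasional busy LRU tail.
	 */
	static int reclaim_one_backend_page(struct zpool *zpool)
	{
		unsigned int reclaimed = 0;
		int err;

		err = zpool_shrink(zpool, 1, &reclaimed);
		if (err)
			return err;	/* nothing freed: the new store may be rejected */

		return reclaimed == 1 ? 0 : -EAGAIN;
	}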
  
Sergey Senozhatsky Nov. 22, 2022, 3:42 a.m. UTC | #7
On (22/11/21 22:12), Johannes Weiner wrote:
> On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> > On (22/11/18 16:15), Nhat Pham wrote:
> > > +
> > > +static int zs_zpool_shrink(void *pool, unsigned int pages,
> > > +			unsigned int *reclaimed)
> > > +{
> > > +	unsigned int total = 0;
> > > +	int ret = -EINVAL;
> > > +
> > > +	while (total < pages) {
> > > +		ret = zs_reclaim_page(pool, 8);
> > > +		if (ret < 0)
> > > +			break;
> > > +		total++;
> > > +	}
> > > +
> > > +	if (reclaimed)
> > > +		*reclaimed = total;
> > > +
> > > +	return ret;
> > > +}
> > 
> > A silly question: why do we need a retry loop in zs_reclaim_page()?
> 
> Individual objects in a zspage can be busy (swapped in simultaneously
> for example), which will prevent the zspage from being freed. Zswap
> currently requests reclaim of one backend page at a time (another
> project...), so if we don't retry we're not meeting the reclaim goal
> and cause rejections for new stores.

What I meant was: if zs_reclaim_page() makes only partial progress
with the current LRU tail zspage and returns -EAGAIN, then we just
don't increment `total` and continue looping in zs_zpool_shrink().
On each iteration zs_reclaim_page() picks the new LRU tail (if any)
and tries to write it back.

> The number 8 is cribbed from zbud and z3fold.

OK.
  
Johannes Weiner Nov. 22, 2022, 6:09 a.m. UTC | #8
On Tue, Nov 22, 2022 at 12:42:20PM +0900, Sergey Senozhatsky wrote:
> On (22/11/21 22:12), Johannes Weiner wrote:
> > On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> > > On (22/11/18 16:15), Nhat Pham wrote:
> > > > +
> > > > +static int zs_zpool_shrink(void *pool, unsigned int pages,
> > > > +			unsigned int *reclaimed)
> > > > +{
> > > > +	unsigned int total = 0;
> > > > +	int ret = -EINVAL;
> > > > +
> > > > +	while (total < pages) {
> > > > +		ret = zs_reclaim_page(pool, 8);
> > > > +		if (ret < 0)
> > > > +			break;
> > > > +		total++;
> > > > +	}
> > > > +
> > > > +	if (reclaimed)
> > > > +		*reclaimed = total;
> > > > +
> > > > +	return ret;
> > > > +}
> > > 
> > > A silly question: why do we need a retry loop in zs_reclaim_page()?
> > 
> > Individual objects in a zspage can be busy (swapped in simultaneously
> > for example), which will prevent the zspage from being freed. Zswap
> > currently requests reclaim of one backend page at a time (another
> > project...), so if we don't retry we're not meeting the reclaim goal
> > and cause rejections for new stores.
> 
> What I meant was: if zs_reclaim_page() makes only partial progress
> with the current LRU tail zspage and returns -EAGAIN, then we just
> don't increment `total` and continue looping in zs_zpool_shrink().

Hm, but it breaks on -EAGAIN, it doesn't continue.

This makes sense, IMO. zs_reclaim_page() will try to reclaim one page,
but considers up to 8 LRU tail pages until one succeeds. If it does,
it continues (total++). But if one of these calls fails, we exit the
loop, give up and return failure from zs_zpool_shrink().
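
To see these semantics in isolation, here is a small userspace model of the
two loops. It is a toy illustration, not kernel code, and every name in it
is made up; it only mirrors the shape of zs_zpool_shrink()/zs_reclaim_page():

	/*
	 * Toy userspace model of the control flow discussed above: the inner
	 * loop may try up to 8 different LRU tails before giving up, so a
	 * single busy zspage does not fail the whole shrink request.
	 */
	#include <stdio.h>
	#include <stdbool.h>

	#define RECLAIM_RETRIES 8

	/* Pretend every third LRU tail has a busy object and cannot be freed. */
	static bool evict_lru_tail(int attempt)
	{
		return (attempt % 3) != 0;
	}

	static int reclaim_page(void)
	{
		for (int i = 0; i < RECLAIM_RETRIES; i++) {
			if (evict_lru_tail(i))
				return 0;	/* one zspage fully written back */
		}
		return -1;			/* stands in for -EAGAIN */
	}

	static unsigned int shrink(unsigned int pages)
	{
		unsigned int total = 0;

		while (total < pages) {
			if (reclaim_page() < 0)
				break;		/* give up, report partial progress */
			total++;
		}
		return total;
	}

	int main(void)
	{
		printf("reclaimed %u of 4 requested pages\n", shrink(4));
		return 0;
	}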
  
Sergey Senozhatsky Nov. 22, 2022, 6:35 a.m. UTC | #9
On (22/11/22 01:09), Johannes Weiner wrote:
> On Tue, Nov 22, 2022 at 12:42:20PM +0900, Sergey Senozhatsky wrote:
> > On (22/11/21 22:12), Johannes Weiner wrote:
> > > On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> > > > On (22/11/18 16:15), Nhat Pham wrote:

[..]

> > What I meant was: if zs_reclaim_page() makes only partial progress
> > with the current LRU tail zspage and returns -EAGAIN, then we just
> > don't increment `total` and continue looping in zs_zpool_shrink().
> 
> Hm, but it breaks on -EAGAIN, it doesn't continue.

Yes, I meant: what if it did continue? Would it make sense to not
break on -EAGAIN?

	while (total < pages) {
		ret = zs_reclaim_page(pool);
		if (ret == -EAGAIN)
			continue;
		if (ret < 0)
			break;
		total++;
	}

Then we don't need retry loop in zs_reclaim_page().
  
Sergey Senozhatsky Nov. 22, 2022, 6:37 a.m. UTC | #10
On (22/11/18 16:15), Nhat Pham wrote:
[..]
> +static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
> +{
> +	int i, obj_idx, ret = 0;
> +	unsigned long handle;
> +	struct zspage *zspage;
> +	struct page *page;
> +	enum fullness_group fullness;
> +
> +	/* Lock LRU and fullness list */
> +	spin_lock(&pool->lock);
> +	if (list_empty(&pool->lru)) {
> +		spin_unlock(&pool->lock);
> +		return -EINVAL;
> +	}
> +
> +	for (i = 0; i < retries; i++) {
> +		struct size_class *class;
> +
> +		zspage = list_last_entry(&pool->lru, struct zspage, lru);
> +		list_del(&zspage->lru);
> +
> +		/* zs_free may free objects, but not the zspage and handles */
> +		zspage->under_reclaim = true;
> +
> +		class = zspage_class(pool, zspage);
> +		fullness = get_fullness_group(class, zspage);
> +
> +		/* Lock out object allocations and object compaction */
> +		remove_zspage(class, zspage, fullness);
> +
> +		spin_unlock(&pool->lock);
> +
> +		/* Lock backing pages into place */
> +		lock_zspage(zspage);
> +
> +		obj_idx = 0;
> +		page = zspage->first_page;
> +		while (1) {
> +			handle = find_alloced_obj(class, page, &obj_idx);
> +			if (!handle) {
> +				page = get_next_page(page);
> +				if (!page)
> +					break;
> +				obj_idx = 0;
> +				continue;
> +			}
> +
> +			/*
> +			 * This will write the object and call zs_free.
> +			 *
> +			 * zs_free will free the object, but the
> +			 * under_reclaim flag prevents it from freeing
> +			 * the zspage altogether. This is necessary so
> +			 * that we can continue working with the
> +			 * zspage potentially after the last object
> +			 * has been freed.
> +			 */
> +			ret = pool->zpool_ops->evict(pool->zpool, handle);
> +			if (ret)
> +				goto next;
> +
> +			obj_idx++;
> +		}
> +
> +next:
> +		/* For freeing the zspage, or putting it back in the pool and LRU list. */
> +		spin_lock(&pool->lock);
> +		zspage->under_reclaim = false;
> +
> +		if (!get_zspage_inuse(zspage)) {
> +			/*
> +			 * Fullness went stale as zs_free() won't touch it
> +			 * while the page is removed from the pool. Fix it
> +			 * up for the check in __free_zspage().
> +			 */
> +			zspage->fullness = ZS_EMPTY;
> +
> +			__free_zspage(pool, class, zspage);
> +			spin_unlock(&pool->lock);
> +			return 0;
> +		}
> +
> +		putback_zspage(class, zspage);
> +		list_add(&zspage->lru, &pool->lru);
> +		unlock_zspage(zspage);

We should probably cond_resched() somewhere here, or in the zs_zpool_shrink()
loop.

> +	}
> +
> +	spin_unlock(&pool->lock);
> +	return -EAGAIN;
> +}
  
Johannes Weiner Nov. 22, 2022, 7:10 a.m. UTC | #11
On Tue, Nov 22, 2022 at 03:35:18PM +0900, Sergey Senozhatsky wrote:
> On (22/11/22 01:09), Johannes Weiner wrote:
> > On Tue, Nov 22, 2022 at 12:42:20PM +0900, Sergey Senozhatsky wrote:
> > > On (22/11/21 22:12), Johannes Weiner wrote:
> > > > On Tue, Nov 22, 2022 at 11:15:20AM +0900, Sergey Senozhatsky wrote:
> > > > > On (22/11/18 16:15), Nhat Pham wrote:
> 
> [..]
> 
> > > What I meant was: if zs_reclaim_page() makes only partial progress
> > > with the current LRU tail zspage and returns -EAGAIN, then we just
> > > don't increment `total` and continue looping in zs_zpool_shrink().
> > 
> > Hm, but it breaks on -EAGAIN, it doesn't continue.
> 
> Yes. "What if it would continue". Would it make sense to not
> break on EAGAIN?
> 
> 	while (total < pages) {
> 		ret = zs_reclaim_page(pool);
> 		if (ret == -EAGAIN)
> 			continue;
> 		if (ret < 0)
> 			break;
> 		total++;
> 	}
> 
> Then we don't need retry loop in zs_reclaim_page().

But that's an indefinite busy-loop?

I don't see what the problem with limited retrying in
zs_reclaim_page() is. It's robust and has worked for years.
  
Sergey Senozhatsky Nov. 22, 2022, 7:19 a.m. UTC | #12
On (22/11/22 02:10), Johannes Weiner wrote:
> > Yes. "What if it would continue". Would it make sense to not
> > break on EAGAIN?
> > 
> > 	while (total < pages) {
> > 		ret = zs_reclaim_page(pool);
> > 		if (ret == -EAGAIN)
> > 			continue;
> > 		if (ret < 0)
> > 			break;
> > 		total++;
> > 	}
> > 
> > Then we don't need retry loop in zs_reclaim_page().
> 
> But that's an indefinite busy-loop?

That would mean that all lru pages constantly have locked objects
and we can only make partial progress.

> I don't see what the problem with limited retrying in
> zs_reclaim_page() is. It's robust and has worked for years.

No problem with it, just asking.
  
Nhat Pham Nov. 23, 2022, 4:30 p.m. UTC | #13
Thanks for the comments, Sergey!

> A nit: we usually call get_first_page() in such cases.

I'll use get_first_page() here in v7.

> We should probably cond_resched() somewhere here, or in the zs_zpool_shrink()
> loop.

I'll put it right after releasing the pool's lock in the retry loop:

		/* Lock out object allocations and object compaction */
		remove_zspage(class, zspage, fullness);

		spin_unlock(&pool->lock);
		cond_resched();

		/* Lock backing pages into place */
		lock_zspage(zspage);

This will also appear in v7. In the meantime, please feel free to discuss all
the patches - I'll try to batch the changes to minimize the churning.
  
Johannes Weiner Nov. 23, 2022, 5:18 p.m. UTC | #14
On Tue, Nov 22, 2022 at 03:37:29PM +0900, Sergey Senozhatsky wrote:
> On (22/11/18 16:15), Nhat Pham wrote:
> [..]
> > +static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
> > +{
> > +	int i, obj_idx, ret = 0;
> > +	unsigned long handle;
> > +	struct zspage *zspage;
> > +	struct page *page;
> > +	enum fullness_group fullness;
> > +
> > +	/* Lock LRU and fullness list */
> > +	spin_lock(&pool->lock);
> > +	if (list_empty(&pool->lru)) {
> > +		spin_unlock(&pool->lock);
> > +		return -EINVAL;
> > +	}
> > +
> > +	for (i = 0; i < retries; i++) {
> > +		struct size_class *class;
> > +
> > +		zspage = list_last_entry(&pool->lru, struct zspage, lru);
> > +		list_del(&zspage->lru);
> > +
> > +		/* zs_free may free objects, but not the zspage and handles */
> > +		zspage->under_reclaim = true;
> > +
> > +		class = zspage_class(pool, zspage);
> > +		fullness = get_fullness_group(class, zspage);
> > +
> > +		/* Lock out object allocations and object compaction */
> > +		remove_zspage(class, zspage, fullness);
> > +
> > +		spin_unlock(&pool->lock);
> > +
> > +		/* Lock backing pages into place */
> > +		lock_zspage(zspage);
> > +
> > +		obj_idx = 0;
> > +		page = zspage->first_page;
> > +		while (1) {
> > +			handle = find_alloced_obj(class, page, &obj_idx);
> > +			if (!handle) {
> > +				page = get_next_page(page);
> > +				if (!page)
> > +					break;
> > +				obj_idx = 0;
> > +				continue;
> > +			}
> > +
> > +			/*
> > +			 * This will write the object and call zs_free.
> > +			 *
> > +			 * zs_free will free the object, but the
> > +			 * under_reclaim flag prevents it from freeing
> > +			 * the zspage altogether. This is necessary so
> > +			 * that we can continue working with the
> > +			 * zspage potentially after the last object
> > +			 * has been freed.
> > +			 */
> > +			ret = pool->zpool_ops->evict(pool->zpool, handle);
> > +			if (ret)
> > +				goto next;
> > +
> > +			obj_idx++;
> > +		}
> > +
> > +next:
> > +		/* For freeing the zspage, or putting it back in the pool and LRU list. */
> > +		spin_lock(&pool->lock);
> > +		zspage->under_reclaim = false;
> > +
> > +		if (!get_zspage_inuse(zspage)) {
> > +			/*
> > +			 * Fullness went stale as zs_free() won't touch it
> > +			 * while the page is removed from the pool. Fix it
> > +			 * up for the check in __free_zspage().
> > +			 */
> > +			zspage->fullness = ZS_EMPTY;
> > +
> > +			__free_zspage(pool, class, zspage);
> > +			spin_unlock(&pool->lock);
> > +			return 0;
> > +		}
> > +
> > +		putback_zspage(class, zspage);
> > +		list_add(&zspage->lru, &pool->lru);
> > +		unlock_zspage(zspage);
> 
> We should probably cond_resched() somewhere here, or in the zs_zpool_shrink()
> loop.

Hm, yeah I suppose that could make sense if we try more than one page.

We always hold either the pool lock or the page locks, and we probably
don't want to schedule with the page locks held. So it would need to
actually lockbreak the pool lock. And then somebody can steal the page
and empty the LRU under us, so we need to check that on looping, too.

Something like this?

for (i = 0; i < retries; i++) {
	spin_lock(&pool->lock);
	if (list_empty(&pool->lru)) {
		spin_unlock(&pool->lock);
		return -EINVAL;
	}
	zspage = list_last_entry(&pool->lru, ...);

	...

	putback_zspage(class, zspage);
	list_add(&zspage->lru, &pool->lru);
	unlock_zspage(zspage);
	spin_unlock(&pool->lock);

	cond_resched();
}
return -EAGAIN;
  
Johannes Weiner Nov. 23, 2022, 5:27 p.m. UTC | #15
On Wed, Nov 23, 2022 at 08:30:44AM -0800, Nhat Pham wrote:
> I'll put it right after releasing the pool's lock in the retry loop:
> 
> 		/* Lock out object allocations and object compaction */
> 		remove_zspage(class, zspage, fullness);
> 
> 		spin_unlock(&pool->lock);
> 		cond_resched();
> 
> 		/* Lock backing pages into place */
> 		lock_zspage(zspage);
> 
> This will also appear in v7. In the meantime, please feel free to discuss all
> the patches - I'll try to batch the changes to minimize the churning.

Oh, our emails collided. This is easier than my version :)
  

Patch

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 9920f3584511..3fba04e10227 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -271,12 +271,13 @@  struct zspage {
 #ifdef CONFIG_ZPOOL
 	/* links the zspage to the lru list in the pool */
 	struct list_head lru;
+	bool under_reclaim;
+	/* list of unfreed handles whose objects have been reclaimed */
+	unsigned long *deferred_handles;
 #endif

 	struct zs_pool *pool;
-#ifdef CONFIG_COMPACTION
 	rwlock_t lock;
-#endif
 };

 struct mapping_area {
@@ -297,10 +298,11 @@  static bool ZsHugePage(struct zspage *zspage)
 	return zspage->huge;
 }

-#ifdef CONFIG_COMPACTION
 static void migrate_lock_init(struct zspage *zspage);
 static void migrate_read_lock(struct zspage *zspage);
 static void migrate_read_unlock(struct zspage *zspage);
+
+#ifdef CONFIG_COMPACTION
 static void migrate_write_lock(struct zspage *zspage);
 static void migrate_write_lock_nested(struct zspage *zspage);
 static void migrate_write_unlock(struct zspage *zspage);
@@ -308,9 +310,6 @@  static void kick_deferred_free(struct zs_pool *pool);
 static void init_deferred_free(struct zs_pool *pool);
 static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage);
 #else
-static void migrate_lock_init(struct zspage *zspage) {}
-static void migrate_read_lock(struct zspage *zspage) {}
-static void migrate_read_unlock(struct zspage *zspage) {}
 static void migrate_write_lock(struct zspage *zspage) {}
 static void migrate_write_lock_nested(struct zspage *zspage) {}
 static void migrate_write_unlock(struct zspage *zspage) {}
@@ -413,6 +412,27 @@  static void zs_zpool_free(void *pool, unsigned long handle)
 	zs_free(pool, handle);
 }

+static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries);
+
+static int zs_zpool_shrink(void *pool, unsigned int pages,
+			unsigned int *reclaimed)
+{
+	unsigned int total = 0;
+	int ret = -EINVAL;
+
+	while (total < pages) {
+		ret = zs_reclaim_page(pool, 8);
+		if (ret < 0)
+			break;
+		total++;
+	}
+
+	if (reclaimed)
+		*reclaimed = total;
+
+	return ret;
+}
+
 static void *zs_zpool_map(void *pool, unsigned long handle,
 			enum zpool_mapmode mm)
 {
@@ -451,6 +471,7 @@  static struct zpool_driver zs_zpool_driver = {
 	.malloc_support_movable = true,
 	.malloc =		  zs_zpool_malloc,
 	.free =			  zs_zpool_free,
+	.shrink =		  zs_zpool_shrink,
 	.map =			  zs_zpool_map,
 	.unmap =		  zs_zpool_unmap,
 	.total_size =		  zs_zpool_total_size,
@@ -924,6 +945,25 @@  static int trylock_zspage(struct zspage *zspage)
 	return 0;
 }

+#ifdef CONFIG_ZPOOL
+/*
+ * Free all the deferred handles whose objects are freed in zs_free.
+ */
+static void free_handles(struct zs_pool *pool, struct zspage *zspage)
+{
+	unsigned long handle = (unsigned long)zspage->deferred_handles;
+
+	while (handle) {
+		unsigned long nxt_handle = handle_to_obj(handle);
+
+		cache_free_handle(pool, handle);
+		handle = nxt_handle;
+	}
+}
+#else
+static inline void free_handles(struct zs_pool *pool, struct zspage *zspage) {}
+#endif
+
 static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 				struct zspage *zspage)
 {
@@ -938,6 +978,9 @@  static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 	VM_BUG_ON(get_zspage_inuse(zspage));
 	VM_BUG_ON(fg != ZS_EMPTY);

+	/* Free all deferred handles from zs_free */
+	free_handles(pool, zspage);
+
 	next = page = get_first_page(zspage);
 	do {
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1023,6 +1066,8 @@  static void init_zspage(struct size_class *class, struct zspage *zspage)

 #ifdef CONFIG_ZPOOL
 	INIT_LIST_HEAD(&zspage->lru);
+	zspage->under_reclaim = false;
+	zspage->deferred_handles = NULL;
 #endif

 	set_freeobj(zspage, 0);
@@ -1535,12 +1580,26 @@  void zs_free(struct zs_pool *pool, unsigned long handle)

 	obj_free(class->size, obj);
 	class_stat_dec(class, OBJ_USED, 1);
+
+#ifdef CONFIG_ZPOOL
+	if (zspage->under_reclaim) {
+		/*
+		 * Reclaim needs the handles during writeback. It'll free
+		 * them along with the zspage when it's done with them.
+		 *
+		 * Record current deferred handle at the memory location
+		 * whose address is given by handle.
+		 */
+		record_obj(handle, (unsigned long)zspage->deferred_handles);
+		zspage->deferred_handles = (unsigned long *)handle;
+		spin_unlock(&pool->lock);
+		return;
+	}
+#endif
 	fullness = fix_fullness_group(class, zspage);
-	if (fullness != ZS_EMPTY)
-		goto out;
+	if (fullness == ZS_EMPTY)
+		free_zspage(pool, class, zspage);

-	free_zspage(pool, class, zspage);
-out:
 	spin_unlock(&pool->lock);
 	cache_free_handle(pool, handle);
 }
@@ -1740,7 +1799,7 @@  static enum fullness_group putback_zspage(struct size_class *class,
 	return fullness;
 }

-#ifdef CONFIG_COMPACTION
+#if defined(CONFIG_ZPOOL) || defined(CONFIG_COMPACTION)
 /*
  * To prevent zspage destroy during migration, zspage freeing should
  * hold locks of all pages in the zspage.
@@ -1782,6 +1841,24 @@  static void lock_zspage(struct zspage *zspage)
 	}
 	migrate_read_unlock(zspage);
 }
+#endif /* defined(CONFIG_ZPOOL) || defined(CONFIG_COMPACTION) */
+
+#ifdef CONFIG_ZPOOL
+/*
+ * Unlocks all the pages of the zspage.
+ *
+ * pool->lock must be held before this function is called
+ * to prevent the underlying pages from migrating.
+ */
+static void unlock_zspage(struct zspage *zspage)
+{
+	struct page *page = get_first_page(zspage);
+
+	do {
+		unlock_page(page);
+	} while ((page = get_next_page(page)) != NULL);
+}
+#endif /* CONFIG_ZPOOL */

 static void migrate_lock_init(struct zspage *zspage)
 {
@@ -1798,6 +1875,7 @@  static void migrate_read_unlock(struct zspage *zspage) __releases(&zspage->lock)
 	read_unlock(&zspage->lock);
 }

+#ifdef CONFIG_COMPACTION
 static void migrate_write_lock(struct zspage *zspage)
 {
 	write_lock(&zspage->lock);
@@ -2362,6 +2440,99 @@  void zs_destroy_pool(struct zs_pool *pool)
 }
 EXPORT_SYMBOL_GPL(zs_destroy_pool);

+#ifdef CONFIG_ZPOOL
+static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
+{
+	int i, obj_idx, ret = 0;
+	unsigned long handle;
+	struct zspage *zspage;
+	struct page *page;
+	enum fullness_group fullness;
+
+	/* Lock LRU and fullness list */
+	spin_lock(&pool->lock);
+	if (list_empty(&pool->lru)) {
+		spin_unlock(&pool->lock);
+		return -EINVAL;
+	}
+
+	for (i = 0; i < retries; i++) {
+		struct size_class *class;
+
+		zspage = list_last_entry(&pool->lru, struct zspage, lru);
+		list_del(&zspage->lru);
+
+		/* zs_free may free objects, but not the zspage and handles */
+		zspage->under_reclaim = true;
+
+		class = zspage_class(pool, zspage);
+		fullness = get_fullness_group(class, zspage);
+
+		/* Lock out object allocations and object compaction */
+		remove_zspage(class, zspage, fullness);
+
+		spin_unlock(&pool->lock);
+
+		/* Lock backing pages into place */
+		lock_zspage(zspage);
+
+		obj_idx = 0;
+		page = zspage->first_page;
+		while (1) {
+			handle = find_alloced_obj(class, page, &obj_idx);
+			if (!handle) {
+				page = get_next_page(page);
+				if (!page)
+					break;
+				obj_idx = 0;
+				continue;
+			}
+
+			/*
+			 * This will write the object and call zs_free.
+			 *
+			 * zs_free will free the object, but the
+			 * under_reclaim flag prevents it from freeing
+			 * the zspage altogether. This is necessary so
+			 * that we can continue working with the
+			 * zspage potentially after the last object
+			 * has been freed.
+			 */
+			ret = pool->zpool_ops->evict(pool->zpool, handle);
+			if (ret)
+				goto next;
+
+			obj_idx++;
+		}
+
+next:
+		/* For freeing the zspage, or putting it back in the pool and LRU list. */
+		spin_lock(&pool->lock);
+		zspage->under_reclaim = false;
+
+		if (!get_zspage_inuse(zspage)) {
+			/*
+			 * Fullness went stale as zs_free() won't touch it
+			 * while the page is removed from the pool. Fix it
+			 * up for the check in __free_zspage().
+			 */
+			zspage->fullness = ZS_EMPTY;
+
+			__free_zspage(pool, class, zspage);
+			spin_unlock(&pool->lock);
+			return 0;
+		}
+
+		putback_zspage(class, zspage);
+		list_add(&zspage->lru, &pool->lru);
+		unlock_zspage(zspage);
+	}
+
+	spin_unlock(&pool->lock);
+	return -EAGAIN;
+}
+#endif /* CONFIG_ZPOOL */
+
 static int __init zs_init(void)
 {
 	int ret;