[v6,4/6] zsmalloc: Add a LRU to zs_pool to keep track of zspages in LRU order

Message ID 20221119001536.2086599-5-nphamcs@gmail.com
State New
Headers
Series Implement writeback for zsmalloc |

Commit Message

Nhat Pham Nov. 19, 2022, 12:15 a.m. UTC
  This helps determines the coldest zspages as candidates for writeback.

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zsmalloc.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

--
2.30.2
  

Comments

Johannes Weiner Nov. 19, 2022, 4:38 p.m. UTC | #1
On Fri, Nov 18, 2022 at 04:15:34PM -0800, Nhat Pham wrote:
> This helps determines the coldest zspages as candidates for writeback.
> 
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

This looks good to me. The ifdefs are higher than usual, but in this
case they actually really nicely annotate exactly which hunks need to
move to zswap (as CONFIG_ZPOOL == CONFIG_ZSWAP) when we unify the LRU!
zbud and z3fold don't have those helpful annotations (since they're
zswap-only to begin with), which will make their conversion a bit more
laborious. But zsmalloc can be a (rough) guiding template for them.

Thanks
  
Minchan Kim Nov. 19, 2022, 5:34 p.m. UTC | #2
On Fri, Nov 18, 2022 at 04:15:34PM -0800, Nhat Pham wrote:
> This helps determines the coldest zspages as candidates for writeback.
> 
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
  
Sergey Senozhatsky Nov. 22, 2022, 1:52 a.m. UTC | #3
On (22/11/18 16:15), Nhat Pham wrote:
[..]
> @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  	obj_to_location(obj, &page, &obj_idx);
>  	zspage = get_zspage(page);
> 
> +#ifdef CONFIG_ZPOOL
> +	/* Move the zspage to front of pool's LRU */
> +	if (mm == ZS_MM_WO) {
> +		if (!list_empty(&zspage->lru))
> +			list_del(&zspage->lru);
> +		list_add(&zspage->lru, &pool->lru);
> +	}
> +#endif

Do we consider pages that were mapped for MM_RO/MM_RW as cold?
I wonder why, we use them, so technically they are not exactly
"least recently used".
  
Johannes Weiner Nov. 22, 2022, 5:42 p.m. UTC | #4
On Tue, Nov 22, 2022 at 10:52:58AM +0900, Sergey Senozhatsky wrote:
> On (22/11/18 16:15), Nhat Pham wrote:
> [..]
> > @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> >  	obj_to_location(obj, &page, &obj_idx);
> >  	zspage = get_zspage(page);
> > 
> > +#ifdef CONFIG_ZPOOL
> > +	/* Move the zspage to front of pool's LRU */
> > +	if (mm == ZS_MM_WO) {
> > +		if (!list_empty(&zspage->lru))
> > +			list_del(&zspage->lru);
> > +		list_add(&zspage->lru, &pool->lru);
> > +	}
> > +#endif
> 
> Do we consider pages that were mapped for MM_RO/MM_RW as cold?
> I wonder why, we use them, so technically they are not exactly
> "least recently used".

This is a swap LRU. Per definition there are no ongoing accesses to
the memory while the page is swapped out that would make it "hot". A
new entry is hot, then ages to the tail until it gets either written
back or swaps back in. Because of that, the zswap backends have
traditionally had the lru-add in the allocation function (zs_malloc,
zbud_alloc, z3fold_alloc).

Minchan insisted we move it here for zsmalloc, since 'update lru on
data access' is more generic. Unfortunately, one of the data accesses
is when we write the swap entry to disk - during reclaim when the page
is isolated from the LRU! Obviously we MUST NOT put it back on the LRU
mid-reclaim.

So now we have very generic LRU code, and exactly one usecase that
needs exceptions from the generic behavior.

The code is raising questions, not surprisingly. We can add a lengthy
comment to it - a variant of the above text?

My vote would still be to just move it back to zs_malloc, where it
makes sense, is easier to explain, and matches the other backends.
  
Sergey Senozhatsky Nov. 23, 2022, 3:50 a.m. UTC | #5
On (22/11/22 12:42), Johannes Weiner wrote:
> On Tue, Nov 22, 2022 at 10:52:58AM +0900, Sergey Senozhatsky wrote:
> > On (22/11/18 16:15), Nhat Pham wrote:
> > [..]
> > > @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > >  	obj_to_location(obj, &page, &obj_idx);
> > >  	zspage = get_zspage(page);
> > > 
> > > +#ifdef CONFIG_ZPOOL
> > > +	/* Move the zspage to front of pool's LRU */
> > > +	if (mm == ZS_MM_WO) {
> > > +		if (!list_empty(&zspage->lru))
> > > +			list_del(&zspage->lru);
> > > +		list_add(&zspage->lru, &pool->lru);
> > > +	}
> > > +#endif
> > 
> > Do we consider pages that were mapped for MM_RO/MM_RW as cold?
> > I wonder why, we use them, so technically they are not exactly
> > "least recently used".
> 
> This is a swap LRU. Per definition there are no ongoing accesses to
> the memory while the page is swapped out that would make it "hot".

Hmm. Not arguing, just trying to understand some things.

There are no accesses to swapped out pages yes, but zspage holds multiple
objects, which are compressed swapped out pages in this particular case.
For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
is at the tail of the LRU list. Suppose that we page-faulted 20 times and
read 20 objects from that zspage, IOW zspage has been in use 20 times very
recently, while writeback still considers it to be "not-used" and will
evict it.

So if this works for you then I'm fine. But we probably, like you suggested,
can document a couple of things here - namely why WRITE access to zspage
counts as "zspage is in use" but READ access to the same zspage does not
count as "zspage is in use".
  
Sergey Senozhatsky Nov. 23, 2022, 3:58 a.m. UTC | #6
On (22/11/18 16:15), Nhat Pham wrote:
> +#ifdef CONFIG_ZPOOL
> +	/* Move the zspage to front of pool's LRU */
> +	if (mm == ZS_MM_WO) {
> +		if (!list_empty(&zspage->lru))
> +			list_del(&zspage->lru);
> +		list_add(&zspage->lru, &pool->lru);
> +	}
> +#endif

Just an idea.

Have you considered having size class LRU instead of pool LRU?

Evicting pages from different classes can have different impact on the
system, in theory. For instance, ZS_FULL zspage in class size 3264
(bytes) holds 5 compressed objects per-zspage, which are 5 compressed
swapped out pages. While zspage in a class size 176 (bytes) holds 93
compressed objects (swapped pages). Both zspages consist of 4
non-contiguous 0-order physical pages, so when we free zspage from these
classes we release 4 physical pages. However, in terms of major
page faults evicting a page from size class 3264 looks better than from
a size class 176: 5 major page faults vs 93 major page faults.

Does this make sense?
  
Yosry Ahmed Nov. 23, 2022, 8:02 a.m. UTC | #7
On Tue, Nov 22, 2022 at 7:50 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (22/11/22 12:42), Johannes Weiner wrote:
> > On Tue, Nov 22, 2022 at 10:52:58AM +0900, Sergey Senozhatsky wrote:
> > > On (22/11/18 16:15), Nhat Pham wrote:
> > > [..]
> > > > @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > > >   obj_to_location(obj, &page, &obj_idx);
> > > >   zspage = get_zspage(page);
> > > >
> > > > +#ifdef CONFIG_ZPOOL
> > > > + /* Move the zspage to front of pool's LRU */
> > > > + if (mm == ZS_MM_WO) {
> > > > +         if (!list_empty(&zspage->lru))
> > > > +                 list_del(&zspage->lru);
> > > > +         list_add(&zspage->lru, &pool->lru);
> > > > + }
> > > > +#endif
> > >
> > > Do we consider pages that were mapped for MM_RO/MM_RW as cold?
> > > I wonder why, we use them, so technically they are not exactly
> > > "least recently used".
> >
> > This is a swap LRU. Per definition there are no ongoing accesses to
> > the memory while the page is swapped out that would make it "hot".
>
> Hmm. Not arguing, just trying to understand some things.
>
> There are no accesses to swapped out pages yes, but zspage holds multiple
> objects, which are compressed swapped out pages in this particular case.
> For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
> that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
> is at the tail of the LRU list. Suppose that we page-faulted 20 times and
> read 20 objects from that zspage, IOW zspage has been in use 20 times very
> recently, while writeback still considers it to be "not-used" and will
> evict it.
>
> So if this works for you then I'm fine. But we probably, like you suggested,
> can document a couple of things here - namely why WRITE access to zspage
> counts as "zspage is in use" but READ access to the same zspage does not
> count as "zspage is in use".
>

I guess the key here is that we have an LRU of zspages, when we really
want an LRU of compressed objects. In some cases, we may end up
reclaiming the wrong pages.

Assuming we have 2 zspages, Z1 and Z2, and 4 physical pages that we
compress over time, P1 -> P4.

Let's assume P1 -> P4 get compressed in order (P4 is the hottest
page), and they get assigned to zspages as follows:
Z1: P1, P3
Z2: P2, P4

In this case, the zspages LRU would be Z2->Z1, because Z2 was touched
last when we compressed P4. Now if we want to writeback, we will look
at Z1, and we might end up reclaiming P3, depending on the order the
pages are stored in.

A worst case scenario of this would be if we have a large number of
pages, maybe 1000, P1->P1000 (where P1000 is the hottest), and they
all go into Z1 and Z2 in this way:
Z1: P1 -> P499, P1000
Z2: P500 -> P999

In this case, Z1 contains 499 cold pages, but it got P1000 at the end
which caused us to put it on the front of the LRU. Now writeback will
consistently use Z2. This is bad. Now I have no idea how practical
this is, but it seems fairly random, based on the compression size of
pages and access patterns.

Does this mean we should move zspages to the front of the LRU when we
writeback from them? No, I wouldn't say so. The same exact scenario
can happen because of this. Imagine the following assignment of the
1000 pages:
Z1: P<odd> (P1, P3, .., P999)
Z2: P<even> (P2, P4, .., P1000)

Z2 is at the front of the LRU because it has P1000, so the first time
we do writeback we will start at Z1. Once we reclaim one object from
Z1, we will start writeback from Z2 next time, and we will keep
alternating. Now if we are really unlucky, we can end up reclaiming in
this order P999, P1000, P997, P998, ... . So yeah I don't think
putting zspages in the front of the LRU when we writeback is the
answer. I would even say it's completely orthogonal to the problem,
because writing back an object from the zspage at the end of the LRU
gives us 0 information about the state of other objects on the same
zspage.

Ideally, we would have an LRU of objects instead, but this would be
very complicated with the current form of writeback. It would be much
easier if we have an LRU for zswap entries instead, which is something
I am looking into, and is a much bigger surgery, and should be
separate from this work. Today zswap inverts LRU priorities anyway by
sending hot pages to the swapfile when zswap is full, when colder
pages are in zswap, so I wouldn't really worry about this now :)
  
Yosry Ahmed Nov. 23, 2022, 8:11 a.m. UTC | #8
On Wed, Nov 23, 2022 at 12:02 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Nov 22, 2022 at 7:50 PM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> >
> > On (22/11/22 12:42), Johannes Weiner wrote:
> > > On Tue, Nov 22, 2022 at 10:52:58AM +0900, Sergey Senozhatsky wrote:
> > > > On (22/11/18 16:15), Nhat Pham wrote:
> > > > [..]
> > > > > @@ -1249,6 +1267,15 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > > > >   obj_to_location(obj, &page, &obj_idx);
> > > > >   zspage = get_zspage(page);
> > > > >
> > > > > +#ifdef CONFIG_ZPOOL
> > > > > + /* Move the zspage to front of pool's LRU */
> > > > > + if (mm == ZS_MM_WO) {
> > > > > +         if (!list_empty(&zspage->lru))
> > > > > +                 list_del(&zspage->lru);
> > > > > +         list_add(&zspage->lru, &pool->lru);
> > > > > + }
> > > > > +#endif
> > > >
> > > > Do we consider pages that were mapped for MM_RO/MM_RW as cold?
> > > > I wonder why, we use them, so technically they are not exactly
> > > > "least recently used".
> > >
> > > This is a swap LRU. Per definition there are no ongoing accesses to
> > > the memory while the page is swapped out that would make it "hot".
> >
> > Hmm. Not arguing, just trying to understand some things.
> >
> > There are no accesses to swapped out pages yes, but zspage holds multiple
> > objects, which are compressed swapped out pages in this particular case.
> > For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
> > that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
> > is at the tail of the LRU list. Suppose that we page-faulted 20 times and
> > read 20 objects from that zspage, IOW zspage has been in use 20 times very
> > recently, while writeback still considers it to be "not-used" and will
> > evict it.
> >
> > So if this works for you then I'm fine. But we probably, like you suggested,
> > can document a couple of things here - namely why WRITE access to zspage
> > counts as "zspage is in use" but READ access to the same zspage does not
> > count as "zspage is in use".
> >
>
> I guess the key here is that we have an LRU of zspages, when we really
> want an LRU of compressed objects. In some cases, we may end up
> reclaiming the wrong pages.
>
> Assuming we have 2 zspages, Z1 and Z2, and 4 physical pages that we
> compress over time, P1 -> P4.
>
> Let's assume P1 -> P4 get compressed in order (P4 is the hottest
> page), and they get assigned to zspages as follows:
> Z1: P1, P3
> Z2: P2, P4
>
> In this case, the zspages LRU would be Z2->Z1, because Z2 was touched
> last when we compressed P4. Now if we want to writeback, we will look
> at Z1, and we might end up reclaiming P3, depending on the order the
> pages are stored in.
>
> A worst case scenario of this would be if we have a large number of
> pages, maybe 1000, P1->P1000 (where P1000 is the hottest), and they
> all go into Z1 and Z2 in this way:
> Z1: P1 -> P499, P1000
> Z2: P500 -> P999
>
> In this case, Z1 contains 499 cold pages, but it got P1000 at the end
> which caused us to put it on the front of the LRU. Now writeback will
> consistently use Z2. This is bad. Now I have no idea how practical
> this is, but it seems fairly random, based on the compression size of
> pages and access patterns.
>
> Does this mean we should move zspages to the front of the LRU when we
> writeback from them? No, I wouldn't say so. The same exact scenario
> can happen because of this. Imagine the following assignment of the
> 1000 pages:
> Z1: P<odd> (P1, P3, .., P999)
> Z2: P<even> (P2, P4, .., P1000)
>
> Z2 is at the front of the LRU because it has P1000, so the first time
> we do writeback we will start at Z1. Once we reclaim one object from
> Z1, we will start writeback from Z2 next time, and we will keep
> alternating. Now if we are really unlucky, we can end up reclaiming in
> this order P999, P1000, P997, P998, ... . So yeah I don't think
> putting zspages in the front of the LRU when we writeback is the
> answer. I would even say it's completely orthogonal to the problem,
> because writing back an object from the zspage at the end of the LRU
> gives us 0 information about the state of other objects on the same
> zspage.
>
> Ideally, we would have an LRU of objects instead, but this would be
> very complicated with the current form of writeback. It would be much
> easier if we have an LRU for zswap entries instead, which is something
> I am looking into, and is a much bigger surgery, and should be
> separate from this work. Today zswap inverts LRU priorities anyway by
> sending hot pages to the swapfile when zswap is full, when colder
> pages are in zswap, so I wouldn't really worry about this now :)

Oh, I didn't realize we reclaim all the objects in the zspage at the
end of the LRU. All the examples are wrong, but the concept still
stands, the problem is that we have an LRU of zspages not an LRU of
objects.

Nonetheless, the fact that we refaulted an object in a zspage does not
necessarily mean that other objects on the same are hotter than
objects in other zspages IIUC.
  
Johannes Weiner Nov. 23, 2022, 4:30 p.m. UTC | #9
On Wed, Nov 23, 2022 at 12:11:24AM -0800, Yosry Ahmed wrote:
> On Wed, Nov 23, 2022 at 12:02 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > On Tue, Nov 22, 2022 at 7:50 PM Sergey Senozhatsky
> > > There are no accesses to swapped out pages yes, but zspage holds multiple
> > > objects, which are compressed swapped out pages in this particular case.
> > > For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
> > > that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
> > > is at the tail of the LRU list. Suppose that we page-faulted 20 times and
> > > read 20 objects from that zspage, IOW zspage has been in use 20 times very
> > > recently, while writeback still considers it to be "not-used" and will
> > > evict it.
> > >
> > > So if this works for you then I'm fine. But we probably, like you suggested,
> > > can document a couple of things here - namely why WRITE access to zspage
> > > counts as "zspage is in use" but READ access to the same zspage does not
> > > count as "zspage is in use".

> Nonetheless, the fact that we refaulted an object in a zspage does not
> necessarily mean that other objects on the same are hotter than
> objects in other zspages IIUC.

Yes.

On allocation, we know that there is at least one hot object in the
page. On refault, the connection between objects in a page is weak.

And it's weaker on zsmalloc than with other backends due to the many
size classes making temporal grouping less likely. So I think you're
quite right, Segey, that a per-class LRU would be more accurate.

It's no-LRU < zspage-LRU < class-LRU < object-LRU.

Like Yosry said, the plan is to implement an object-LRU next as part
of the generalized LRU for zsmalloc, zbud and z3fold.

For now, the zspage LRU is an improvement to no-LRU. Our production
experiments confirmed that.
  
Sergey Senozhatsky Nov. 24, 2022, 3:21 a.m. UTC | #10
On (22/11/23 00:02), Yosry Ahmed wrote:
> > There are no accesses to swapped out pages yes, but zspage holds multiple
> > objects, which are compressed swapped out pages in this particular case.
> > For example, zspage in class size 176 (bytes) can hold 93 objects per-zspage,
> > that is 93 compressed swapped out pages. Consider ZS_FULL zspages which
> > is at the tail of the LRU list. Suppose that we page-faulted 20 times and
> > read 20 objects from that zspage, IOW zspage has been in use 20 times very
> > recently, while writeback still considers it to be "not-used" and will
> > evict it.
> >
> > So if this works for you then I'm fine. But we probably, like you suggested,
> > can document a couple of things here - namely why WRITE access to zspage
> > counts as "zspage is in use" but READ access to the same zspage does not
> > count as "zspage is in use".
> >
> 
> I guess the key here is that we have an LRU of zspages, when we really
> want an LRU of compressed objects. In some cases, we may end up
> reclaiming the wrong pages.

Yes, completely agree.

[..]
> Ideally, we would have an LRU of objects instead, but this would be
> very complicated with the current form of writeback.

Right. So we have two writebacks now: one in zram and on in zsmalloc.
And zram writeback works with objects' access patterns, it simply tracks
timestamps per entry and it doesn't know/care about zspages. Writeback
targets in zram are selected by simply looking at timestamps of objects
(compressed normal pages). And that is the right level for LRU, allocator
is too low-level for this.

I'm looking forward to seeing new LRU implementation (at a level higher
than allocator) :)
  
Sergey Senozhatsky Nov. 24, 2022, 3:29 a.m. UTC | #11
On (22/11/23 11:30), Johannes Weiner wrote:
> Like Yosry said, the plan is to implement an object-LRU next as part
> of the generalized LRU for zsmalloc, zbud and z3fold.
> 
> For now, the zspage LRU is an improvement to no-LRU. Our production
> experiments confirmed that.

Sounds good!
  

Patch

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 326faa751f0a..7dd464b5a6a5 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -239,6 +239,11 @@  struct zs_pool {
 	/* Compact classes */
 	struct shrinker shrinker;

+#ifdef CONFIG_ZPOOL
+	/* List tracking the zspages in LRU order by most recently added object */
+	struct list_head lru;
+#endif
+
 #ifdef CONFIG_ZSMALLOC_STAT
 	struct dentry *stat_dentry;
 #endif
@@ -260,6 +265,12 @@  struct zspage {
 	unsigned int freeobj;
 	struct page *first_page;
 	struct list_head list; /* fullness list */
+
+#ifdef CONFIG_ZPOOL
+	/* links the zspage to the lru list in the pool */
+	struct list_head lru;
+#endif
+
 	struct zs_pool *pool;
 #ifdef CONFIG_COMPACTION
 	rwlock_t lock;
@@ -953,6 +964,9 @@  static void free_zspage(struct zs_pool *pool, struct size_class *class,
 	}

 	remove_zspage(class, zspage, ZS_EMPTY);
+#ifdef CONFIG_ZPOOL
+	list_del(&zspage->lru);
+#endif
 	__free_zspage(pool, class, zspage);
 }

@@ -998,6 +1012,10 @@  static void init_zspage(struct size_class *class, struct zspage *zspage)
 		off %= PAGE_SIZE;
 	}

+#ifdef CONFIG_ZPOOL
+	INIT_LIST_HEAD(&zspage->lru);
+#endif
+
 	set_freeobj(zspage, 0);
 }

@@ -1249,6 +1267,15 @@  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	obj_to_location(obj, &page, &obj_idx);
 	zspage = get_zspage(page);

+#ifdef CONFIG_ZPOOL
+	/* Move the zspage to front of pool's LRU */
+	if (mm == ZS_MM_WO) {
+		if (!list_empty(&zspage->lru))
+			list_del(&zspage->lru);
+		list_add(&zspage->lru, &pool->lru);
+	}
+#endif
+
 	/*
 	 * migration cannot move any zpages in this zspage. Here, pool->lock
 	 * is too heavy since callers would take some time until they calls
@@ -1967,6 +1994,9 @@  static void async_free_zspage(struct work_struct *work)
 		VM_BUG_ON(fullness != ZS_EMPTY);
 		class = pool->size_class[class_idx];
 		spin_lock(&pool->lock);
+#ifdef CONFIG_ZPOOL
+		list_del(&zspage->lru);
+#endif
 		__free_zspage(pool, class, zspage);
 		spin_unlock(&pool->lock);
 	}
@@ -2278,6 +2308,10 @@  struct zs_pool *zs_create_pool(const char *name)
 	 */
 	zs_register_shrinker(pool);

+#ifdef CONFIG_ZPOOL
+	INIT_LIST_HEAD(&pool->lru);
+#endif
+
 	return pool;

 err: