[2/5] zsmalloc: Consolidate zs_pool's migrate_lock and size_class's locks

Message ID 20221026200613.1031261-3-nphamcs@gmail.com
State New
Series Implement writeback for zsmalloc

Commit Message

Nhat Pham Oct. 26, 2022, 8:06 p.m. UTC
  Currently, zsmalloc has a hierarchy of locks, which includes a
pool-level migrate_lock, and a lock for each size class. We have to
obtain both locks in the hotpath in most cases anyway, except for
zs_malloc. This exception will no longer exist when we introduce a LRU
into the zs_pool for the new writeback functionality - we will need to
obtain a pool-level lock to synchronize LRU handling even in zs_malloc.

In preparation for zsmalloc writeback, consolidate these locks into a
single pool-level lock, which drastically reduces the complexity of
synchronization in zsmalloc.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zsmalloc.c | 87 ++++++++++++++++++++++-----------------------------
 1 file changed, 37 insertions(+), 50 deletions(-)
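
To make the change concrete, here is a minimal sketch of how an object-path
operation takes locks before and after this patch (hypothetical helper names,
simplified from the diff below; not actual mm/zsmalloc.c code):

/* Before: a pool-level rwlock nests the per-class spinlock. */
static void obj_path_before(struct zs_pool *pool, struct size_class *class)
{
	read_lock(&pool->migrate_lock);	/* keep the zspage from migrating away */
	spin_lock(&class->lock);	/* protect fullness lists and class stats */
	/* ... operate on the zspage and class state ... */
	spin_unlock(&class->lock);
	read_unlock(&pool->migrate_lock);
}

/* After: a single pool->lock covers migration, class state and, later, the LRU. */
static void obj_path_after(struct zs_pool *pool)
{
	spin_lock(&pool->lock);
	/* ... operate on the zspage, class state and pool LRU ... */
	spin_unlock(&pool->lock);
}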
  

Comments

Johannes Weiner Oct. 28, 2022, 2:46 p.m. UTC | #1
On Wed, Oct 26, 2022 at 01:06:10PM -0700, Nhat Pham wrote:
> Currently, zsmalloc has a hierarchy of locks, which includes a
> pool-level migrate_lock, and a lock for each size class. We have to
> obtain both locks in the hotpath in most cases anyway, except for
> zs_malloc. This exception will no longer exist when we introduce a LRU
> into the zs_pool for the new writeback functionality - we will need to
> obtain a pool-level lock to synchronize LRU handling even in zs_malloc.
> 
> In preparation for zsmalloc writeback, consolidate these locks into a
> single pool-level lock, which drastically reduces the complexity of
> synchronization in zsmalloc.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
  
Sergey Senozhatsky Nov. 2, 2022, 3:28 a.m. UTC | #2
On (22/10/26 13:06), Nhat Pham wrote:
>  struct size_class {
> -	spinlock_t lock;
>  	struct list_head fullness_list[NR_ZS_FULLNESS];
>  	/*
>  	 * Size of objects stored in this class. Must be multiple
> @@ -247,8 +245,7 @@ struct zs_pool {
>  #ifdef CONFIG_COMPACTION
>  	struct work_struct free_work;
>  #endif
> -	/* protect page/zspage migration */
> -	rwlock_t migrate_lock;
> +	spinlock_t lock;
>  };

I'm not in love with this, to be honest. One big pool lock instead
of 255 per-class locks doesn't look attractive, as one big pool lock
is going to be hammered quite a lot when zram is used, e.g. as a regular
block device with a file system and is under heavy parallel writes/reads.
  
Minchan Kim Nov. 2, 2022, 9:36 p.m. UTC | #3
On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote:
> On (22/10/26 13:06), Nhat Pham wrote:
> >  struct size_class {
> > -	spinlock_t lock;
> >  	struct list_head fullness_list[NR_ZS_FULLNESS];
> >  	/*
> >  	 * Size of objects stored in this class. Must be multiple
> > @@ -247,8 +245,7 @@ struct zs_pool {
> >  #ifdef CONFIG_COMPACTION
> >  	struct work_struct free_work;
> >  #endif
> > -	/* protect page/zspage migration */
> > -	rwlock_t migrate_lock;
> > +	spinlock_t lock;
> >  };
> 
> I'm not in love with this, to be honest. One big pool lock instead
> of 255 per-class locks doesn't look attractive, as one big pool lock
> is going to be hammered quite a lot when zram is used, e.g. as a regular
> block device with a file system and is under heavy parallel writes/reads.
> 
I agree with Sergey.

I am also worried about the LRU stuff being made part of the allocator
instead of a higher level.
  
Johannes Weiner Nov. 3, 2022, 3:18 p.m. UTC | #4
On Wed, Nov 02, 2022 at 02:36:35PM -0700, Minchan Kim wrote:
> On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote:
> > On (22/10/26 13:06), Nhat Pham wrote:
> > >  struct size_class {
> > > -	spinlock_t lock;
> > >  	struct list_head fullness_list[NR_ZS_FULLNESS];
> > >  	/*
> > >  	 * Size of objects stored in this class. Must be multiple
> > > @@ -247,8 +245,7 @@ struct zs_pool {
> > >  #ifdef CONFIG_COMPACTION
> > >  	struct work_struct free_work;
> > >  #endif
> > > -	/* protect page/zspage migration */
> > > -	rwlock_t migrate_lock;
> > > +	spinlock_t lock;
> > >  };
> > 
> > I'm not in love with this, to be honest. One big pool lock instead
> > of 255 per-class locks doesn't look attractive, as one big pool lock
> > is going to be hammered quite a lot when zram is used, e.g. as a regular
> > block device with a file system and is under heavy parallel writes/reads.

TBH the class always struck me as an odd scope to split the lock. Lock
contention depends on how variable the compression rate is of the
hottest incoming data, which is unpredictable from a user POV.
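
For reference, a rough standalone sketch of that bucketing (the constants here
are assumptions loosely mirroring zsmalloc's 32-byte minimum object size and
16-byte class granularity on 4K pages; not the exact kernel code). Which class,
and hence which per-class lock, an allocation hits is purely a function of the
compressed object size:

#include <stddef.h>

/* e.g. size_to_class_index(2208) == 136 */
static int size_to_class_index(size_t size)
{
	const size_t min_alloc = 32;	/* assumed minimum object size */
	const size_t delta = 16;	/* assumed class step */
	const int max_index = 254;	/* assumed ~255 classes total */
	int idx = 0;

	if (size > min_alloc)
		idx = (int)((size - min_alloc + delta - 1) / delta);	/* round up */
	return idx < max_index ? idx : max_index;
}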

My understanding is that the primary usecase for zram is swapping, and
the pool lock is the same granularity as the swap locking.

Regardless, we'll do some benchmarks with filesystems to understand
what a reasonable tradeoff would be between overhead and complexity.
Do you have a particular one in mind? (I'm thinking journaled ones are
not of much interest, since their IO tends to be fairly serialized.)

btrfs?

> I am also worry about that LRU stuff should be part of allocator
> instead of higher level.

I'm sorry, but that's not a reasonable objection.

These patches implement a core feature of being a zswap backend, using
standard LRU and locking techniques established by the other backends.

I don't disagree that it would be nicer if zswap had a strong abstraction
for backend pages and a generalized LRU. But that is major surgery on
a codebase of over 6,500 lines. It's not a reasonable ask to change
all that first before implementing a basic feature that's useful now.

I get that your main interest is zram, and so this feature isn't of
interest to you. But zram isn't the only user, nor is it the primary
user, of zsmalloc.
  
Minchan Kim Nov. 3, 2022, 3:53 p.m. UTC | #5
On Thu, Nov 03, 2022 at 11:18:04AM -0400, Johannes Weiner wrote:
> On Wed, Nov 02, 2022 at 02:36:35PM -0700, Minchan Kim wrote:
> > On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote:
> > > On (22/10/26 13:06), Nhat Pham wrote:
> > > >  struct size_class {
> > > > -	spinlock_t lock;
> > > >  	struct list_head fullness_list[NR_ZS_FULLNESS];
> > > >  	/*
> > > >  	 * Size of objects stored in this class. Must be multiple
> > > > @@ -247,8 +245,7 @@ struct zs_pool {
> > > >  #ifdef CONFIG_COMPACTION
> > > >  	struct work_struct free_work;
> > > >  #endif
> > > > -	/* protect page/zspage migration */
> > > > -	rwlock_t migrate_lock;
> > > > +	spinlock_t lock;
> > > >  };
> > > 
> > > I'm not in love with this, to be honest. One big pool lock instead
> > > of 255 per-class locks doesn't look attractive, as one big pool lock
> > > is going to be hammered quite a lot when zram is used, e.g. as a regular
> > > block device with a file system and is under heavy parallel writes/reads.
> 
> TBH the class always struck me as an odd scope to split the lock. Lock
> contention depends on how variable the compression rate is of the
> hottest incoming data, which is unpredictable from a user POV.
> 
> My understanding is that the primary usecase for zram is swapping, and
> the pool lock is the same granularity as the swap locking.

People use zram to store cached object files on build servers.

> 
> Regardless, we'll do some benchmarks with filesystems to understand
> what a reasonable tradeoff would be between overhead and complexity.

Thanks.

> Do you have a particular one in mind? (I'm thinking journaled ones are
> not of much interest, since their IO tends to be fairly serialized.)
> 
> btrfs?

I am not sure what FSes others are using but at least for me, just
plain ext4.

> 
> > I am also worry about that LRU stuff should be part of allocator
> > instead of higher level.
> 
> I'm sorry, but that's not a reasonable objection.
> 
> These patches implement a core feature of being a zswap backend, using
> standard LRU and locking techniques established by the other backends.
> 
> I don't disagree that it would nicer if zswap had a strong abstraction
> for backend pages and a generalized LRU. But that is major surgery on
> a codebase of over 6,500 lines. It's not a reasonable ask to change
> all that first before implementing a basic feature that's useful now.

By the same logic, folks added LRU logic into their allocators
without putting in the effort to consider moving the LRU into an upper layer.

And that trend is still going on, since I have seen multiple times that
people are trying to add more allocators. So if it's not a reasonable
ask to consider, we can't stop the trend in the end.

> 
> I get that your main interest is zram, and so this feature isn't of
> interest to you. But zram isn't the only user, nor is it the primary

I am interested in the feature, but my interest is more in a general swap
layer that manages the LRU so that it could support any hierarchy among
swap devices, not only zswap.
  
Johannes Weiner Nov. 3, 2022, 6:08 p.m. UTC | #6
On Thu, Nov 03, 2022 at 08:53:29AM -0700, Minchan Kim wrote:
> On Thu, Nov 03, 2022 at 11:18:04AM -0400, Johannes Weiner wrote:
> > On Wed, Nov 02, 2022 at 02:36:35PM -0700, Minchan Kim wrote:
> > > On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote:
> > > > On (22/10/26 13:06), Nhat Pham wrote:
> > > > >  struct size_class {
> > > > > -	spinlock_t lock;
> > > > >  	struct list_head fullness_list[NR_ZS_FULLNESS];
> > > > >  	/*
> > > > >  	 * Size of objects stored in this class. Must be multiple
> > > > > @@ -247,8 +245,7 @@ struct zs_pool {
> > > > >  #ifdef CONFIG_COMPACTION
> > > > >  	struct work_struct free_work;
> > > > >  #endif
> > > > > -	/* protect page/zspage migration */
> > > > > -	rwlock_t migrate_lock;
> > > > > +	spinlock_t lock;
> > > > >  };
> > > > 
> > > > I'm not in love with this, to be honest. One big pool lock instead
> > > > of 255 per-class locks doesn't look attractive, as one big pool lock
> > > > is going to be hammered quite a lot when zram is used, e.g. as a regular
> > > > block device with a file system and is under heavy parallel writes/reads.
> > 
> > TBH the class always struck me as an odd scope to split the lock. Lock
> > contention depends on how variable the compression rate is of the
> > hottest incoming data, which is unpredictable from a user POV.
> > 
> > My understanding is that the primary usecase for zram is swapping, and
> > the pool lock is the same granularity as the swap locking.
> 
> People uses the zram to store caching object files in build server.

Oh, interesting. We can try with a kernel build directory on zram.

> > Do you have a particular one in mind? (I'm thinking journaled ones are
> > not of much interest, since their IO tends to be fairly serialized.)
> > 
> > btrfs?
> 
> I am not sure what FSes others are using but at least for me, just
> plain ext4.

Okay, we can test with both.

> > > I am also worry about that LRU stuff should be part of allocator
> > > instead of higher level.
> > 
> > I'm sorry, but that's not a reasonable objection.
> > 
> > These patches implement a core feature of being a zswap backend, using
> > standard LRU and locking techniques established by the other backends.
> > 
> > I don't disagree that it would nicer if zswap had a strong abstraction
> > for backend pages and a generalized LRU. But that is major surgery on
> > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > all that first before implementing a basic feature that's useful now.
> 
> With same logic, folks added the LRU logic into their allocators
> without the effort considering moving the LRU into upper layer.
> 
> And then trend is still going on since I have seen multiple times
> people are trying to add more allocators. So if it's not a reasonable
> ask to consier, we couldn't stop the trend in the end.

So there is actually an ongoing effort to do that. Yosry and I have
spent quite some time on coming up with an LRU design that's
independent from compression policy over email and at Plumbers.

My concern is more about the order of doing things:

1. The missing writeback support is a gaping hole in zsmalloc, which
   affects production systems. A generalized LRU list is a good idea,
   but it's a huge task that from a user pov really is not
   critical. Even from a kernel dev / maintainer POV, there are bigger
   fish to fry in the zswap code base and the backends than this.

2. Refactoring existing functionality is much easier than writing
   generalized code that simultaneously enables new behavior. zsmalloc
   is the most complex of our backends. To make its LRU writeback work
   we had to patch zswap's ->map ordering to accommodate it, for example.
   Such tricky changes are easier to make and test incrementally.

   The generalized LRU project will hugely benefit from already having
   a proven writeback implementation in zsmalloc, because then all the
   requirements in zswap and zsmalloc will be in black and white.

> > I get that your main interest is zram, and so this feature isn't of
> > interest to you. But zram isn't the only user, nor is it the primary
> 
> I am interest to the feature but my interest is more of general swap
> layer to manage the LRU so that it could support any hierarchy among
> swap devices, not only zswap.

I think we're on the same page about the longer term goals.
  
Yosry Ahmed Nov. 3, 2022, 6:10 p.m. UTC | #7
On Thu, Nov 3, 2022 at 11:08 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Nov 03, 2022 at 08:53:29AM -0700, Minchan Kim wrote:
> > On Thu, Nov 03, 2022 at 11:18:04AM -0400, Johannes Weiner wrote:
> > > On Wed, Nov 02, 2022 at 02:36:35PM -0700, Minchan Kim wrote:
> > > > On Wed, Nov 02, 2022 at 12:28:56PM +0900, Sergey Senozhatsky wrote:
> > > > > On (22/10/26 13:06), Nhat Pham wrote:
> > > > > >  struct size_class {
> > > > > > -     spinlock_t lock;
> > > > > >       struct list_head fullness_list[NR_ZS_FULLNESS];
> > > > > >       /*
> > > > > >        * Size of objects stored in this class. Must be multiple
> > > > > > @@ -247,8 +245,7 @@ struct zs_pool {
> > > > > >  #ifdef CONFIG_COMPACTION
> > > > > >       struct work_struct free_work;
> > > > > >  #endif
> > > > > > -     /* protect page/zspage migration */
> > > > > > -     rwlock_t migrate_lock;
> > > > > > +     spinlock_t lock;
> > > > > >  };
> > > > >
> > > > > I'm not in love with this, to be honest. One big pool lock instead
> > > > > of 255 per-class locks doesn't look attractive, as one big pool lock
> > > > > is going to be hammered quite a lot when zram is used, e.g. as a regular
> > > > > block device with a file system and is under heavy parallel writes/reads.
> > >
> > > TBH the class always struck me as an odd scope to split the lock. Lock
> > > contention depends on how variable the compression rate is of the
> > > hottest incoming data, which is unpredictable from a user POV.
> > >
> > > My understanding is that the primary usecase for zram is swapping, and
> > > the pool lock is the same granularity as the swap locking.
> >
> > People uses the zram to store caching object files in build server.
>
> Oh, interesting. We can try with a kernel build directory on zram.
>
> > > Do you have a particular one in mind? (I'm thinking journaled ones are
> > > not of much interest, since their IO tends to be fairly serialized.)
> > >
> > > btrfs?
> >
> > I am not sure what FSes others are using but at least for me, just
> > plain ext4.
>
> Okay, we can test with both.
>
> > > > I am also worry about that LRU stuff should be part of allocator
> > > > instead of higher level.
> > >
> > > I'm sorry, but that's not a reasonable objection.
> > >
> > > These patches implement a core feature of being a zswap backend, using
> > > standard LRU and locking techniques established by the other backends.
> > >
> > > I don't disagree that it would nicer if zswap had a strong abstraction
> > > for backend pages and a generalized LRU. But that is major surgery on
> > > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > > all that first before implementing a basic feature that's useful now.
> >
> > With same logic, folks added the LRU logic into their allocators
> > without the effort considering moving the LRU into upper layer.
> >
> > And then trend is still going on since I have seen multiple times
> > people are trying to add more allocators. So if it's not a reasonable
> > ask to consier, we couldn't stop the trend in the end.
>
> So there is actually an ongoing effort to do that. Yosry and I have
> spent quite some time on coming up with an LRU design that's
> independent from compression policy over email and at Plumbers.
>
> My concern is more about the order of doing things:
>
> 1. The missing writeback support is a gaping hole in zsmalloc, which
>    affects production systems. A generalized LRU list is a good idea,
>    but it's a huge task that from a user pov really is not
>    critical. Even from a kernel dev / maintainer POV, there are bigger
>    fish to fry in the zswap code base and the backends than this.
>
> 2. Refactoring existing functionality is much easier than writing
>    generalized code that simultaneously enables new behavior. zsmalloc
>    is the most complex of our backends. To make its LRU writeback work
>    we had to patch zswap's ->map ordering to accomodate it, e.g. Such
>    tricky changes are easier to make and test incrementally.
>
>    The generalized LRU project will hugely benefit from already having
>    a proven writeback implementation in zsmalloc, because then all the
>    requirements in zswap and zsmalloc will be in black and white.
>
> > > I get that your main interest is zram, and so this feature isn't of
> > > interest to you. But zram isn't the only user, nor is it the primary
> >
> > I am interest to the feature but my interest is more of general swap
> > layer to manage the LRU so that it could support any hierarchy among
> > swap devices, not only zswap.
>
> I think we're on the same page about the longer term goals.
>

Yeah. As Johannes said, I was also recently looking into this. This
can also help solve other problems besides consolidating implementations.
Currently if zswap rejects a page, it goes into swap, which is
more-or-less a violation of page LRUs since hotter pages that are more
recently reclaimed end up in swap (slow), while colder pages that were
reclaimed before are in zswap. Having a separate layer managing the
LRU of swap pages can also make sure this doesn't happen.

More broadly, making zswap a separate layer from swap enables other
improvements such as using zswap regardless of the presence of a
backend swapfile and not consuming space in swapfiles if a page is in
zswap. Of course, this is a much larger surgery.

I am intending to spend more time looking further into this, but other
things keep popping up :)
  
Minchan Kim Nov. 3, 2022, 8:22 p.m. UTC | #8
On Thu, Nov 03, 2022 at 02:08:01PM -0400, Johannes Weiner wrote:
< snip >

> > > > I am also worry about that LRU stuff should be part of allocator
> > > > instead of higher level.
> > > 
> > > I'm sorry, but that's not a reasonable objection.
> > > 
> > > These patches implement a core feature of being a zswap backend, using
> > > standard LRU and locking techniques established by the other backends.
> > > 
> > > I don't disagree that it would nicer if zswap had a strong abstraction
> > > for backend pages and a generalized LRU. But that is major surgery on
> > > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > > all that first before implementing a basic feature that's useful now.
> > 
> > With same logic, folks added the LRU logic into their allocators
> > without the effort considering moving the LRU into upper layer.
> > 
> > And then trend is still going on since I have seen multiple times
> > people are trying to add more allocators. So if it's not a reasonable
> > ask to consier, we couldn't stop the trend in the end.
> 
> So there is actually an ongoing effort to do that. Yosry and I have
> spent quite some time on coming up with an LRU design that's
> independent from compression policy over email and at Plumbers.

I am really glad to hear somebody is working in the right direction.

> 
> My concern is more about the order of doing things:
> 
> 1. The missing writeback support is a gaping hole in zsmalloc, which
>    affects production systems. A generalized LRU list is a good idea,
>    but it's a huge task that from a user pov really is not
>    critical. Even from a kernel dev / maintainer POV, there are bigger
>    fish to fry in the zswap code base and the backends than this.

Even though I believe a general LRU in the swap subsystem is the way to go,
I was about to suggest putting the LRU logic in the zswap layer
to stop this trend, since it doesn't look too difficult at first glance (sure,
I might be missing something there). However, if you guys are working
toward the generalized direction, I am totally in favor of the approach
and look forward to seeing the project, with the expectation that we will
clean up all the duplicated logic, fix the weird layering, and then
finally support hierarchical swap writeback for any combination of swap
devices.

> 
> 2. Refactoring existing functionality is much easier than writing
>    generalized code that simultaneously enables new behavior. zsmalloc
>    is the most complex of our backends. To make its LRU writeback work
>    we had to patch zswap's ->map ordering to accomodate it, e.g. Such
>    tricky changes are easier to make and test incrementally.
> 
>    The generalized LRU project will hugely benefit from already having
>    a proven writeback implementation in zsmalloc, because then all the
>    requirements in zswap and zsmalloc will be in black and white.

Agreed, as long as we are working in the right direction and this work
helps fill the gap until we can see all the requirements
and achieve them.


> 
> > > I get that your main interest is zram, and so this feature isn't of
> > > interest to you. But zram isn't the only user, nor is it the primary
> > 
> > I am interest to the feature but my interest is more of general swap
> > layer to manage the LRU so that it could support any hierarchy among
> > swap devices, not only zswap.
> 
> I think we're on the same page about the longer term goals.

Fingers crossed.
  
Minchan Kim Nov. 3, 2022, 8:37 p.m. UTC | #9
On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote:
< snip >

> > > > > I am also worry about that LRU stuff should be part of allocator
> > > > > instead of higher level.
> > > >
> > > > I'm sorry, but that's not a reasonable objection.
> > > >
> > > > These patches implement a core feature of being a zswap backend, using
> > > > standard LRU and locking techniques established by the other backends.
> > > >
> > > > I don't disagree that it would nicer if zswap had a strong abstraction
> > > > for backend pages and a generalized LRU. But that is major surgery on
> > > > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > > > all that first before implementing a basic feature that's useful now.
> > >
> > > With same logic, folks added the LRU logic into their allocators
> > > without the effort considering moving the LRU into upper layer.
> > >
> > > And then trend is still going on since I have seen multiple times
> > > people are trying to add more allocators. So if it's not a reasonable
> > > ask to consier, we couldn't stop the trend in the end.
> >
> > So there is actually an ongoing effort to do that. Yosry and I have
> > spent quite some time on coming up with an LRU design that's
> > independent from compression policy over email and at Plumbers.
> >
> > My concern is more about the order of doing things:
> >
> > 1. The missing writeback support is a gaping hole in zsmalloc, which
> >    affects production systems. A generalized LRU list is a good idea,
> >    but it's a huge task that from a user pov really is not
> >    critical. Even from a kernel dev / maintainer POV, there are bigger
> >    fish to fry in the zswap code base and the backends than this.
> >
> > 2. Refactoring existing functionality is much easier than writing
> >    generalized code that simultaneously enables new behavior. zsmalloc
> >    is the most complex of our backends. To make its LRU writeback work
> >    we had to patch zswap's ->map ordering to accomodate it, e.g. Such
> >    tricky changes are easier to make and test incrementally.
> >
> >    The generalized LRU project will hugely benefit from already having
> >    a proven writeback implementation in zsmalloc, because then all the
> >    requirements in zswap and zsmalloc will be in black and white.
> >
> > > > I get that your main interest is zram, and so this feature isn't of
> > > > interest to you. But zram isn't the only user, nor is it the primary
> > >
> > > I am interest to the feature but my interest is more of general swap
> > > layer to manage the LRU so that it could support any hierarchy among
> > > swap devices, not only zswap.
> >
> > I think we're on the same page about the longer term goals.
> >
> 
> Yeah. As Johannes said, I was also recently looking into this. This
> can also help solve other problems than consolidating implementations.
> Currently if zswap rejects a page, it goes into swap, which is
> more-or-less a violation of page LRUs since hotter pages that are more
> recently reclaimed end up in swap (slow), while colder pages that were
> reclaimed before are in zswap. Having a separate layer managing the
> LRU of swap pages can also make sure this doesn't happen.

True.

> 
> More broadly, making zswap a separate layer from swap enables other
> improvements such as using zswap regardless of the presence of a
> backend swapfile and not consuming space in swapfiles if a page is in
> zswap. Of course, this is a much larger surgery.

If we could decouple the LRU writeback from zswap and support
compression without a backing swapfile, that sounds like it would become
more like zram. ;-)

> 
> I am intending to spend more time looking further into this, but other
> things keep popping up :)

Same with me. Thanks for looking into it, Yosry!
  
Yosry Ahmed Nov. 3, 2022, 8:46 p.m. UTC | #10
On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote:
> < snip >
>
> > > > > > I am also worry about that LRU stuff should be part of allocator
> > > > > > instead of higher level.
> > > > >
> > > > > I'm sorry, but that's not a reasonable objection.
> > > > >
> > > > > These patches implement a core feature of being a zswap backend, using
> > > > > standard LRU and locking techniques established by the other backends.
> > > > >
> > > > > I don't disagree that it would nicer if zswap had a strong abstraction
> > > > > for backend pages and a generalized LRU. But that is major surgery on
> > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > > > > all that first before implementing a basic feature that's useful now.
> > > >
> > > > With same logic, folks added the LRU logic into their allocators
> > > > without the effort considering moving the LRU into upper layer.
> > > >
> > > > And then trend is still going on since I have seen multiple times
> > > > people are trying to add more allocators. So if it's not a reasonable
> > > > ask to consier, we couldn't stop the trend in the end.
> > >
> > > So there is actually an ongoing effort to do that. Yosry and I have
> > > spent quite some time on coming up with an LRU design that's
> > > independent from compression policy over email and at Plumbers.
> > >
> > > My concern is more about the order of doing things:
> > >
> > > 1. The missing writeback support is a gaping hole in zsmalloc, which
> > >    affects production systems. A generalized LRU list is a good idea,
> > >    but it's a huge task that from a user pov really is not
> > >    critical. Even from a kernel dev / maintainer POV, there are bigger
> > >    fish to fry in the zswap code base and the backends than this.
> > >
> > > 2. Refactoring existing functionality is much easier than writing
> > >    generalized code that simultaneously enables new behavior. zsmalloc
> > >    is the most complex of our backends. To make its LRU writeback work
> > >    we had to patch zswap's ->map ordering to accomodate it, e.g. Such
> > >    tricky changes are easier to make and test incrementally.
> > >
> > >    The generalized LRU project will hugely benefit from already having
> > >    a proven writeback implementation in zsmalloc, because then all the
> > >    requirements in zswap and zsmalloc will be in black and white.
> > >
> > > > > I get that your main interest is zram, and so this feature isn't of
> > > > > interest to you. But zram isn't the only user, nor is it the primary
> > > >
> > > > I am interest to the feature but my interest is more of general swap
> > > > layer to manage the LRU so that it could support any hierarchy among
> > > > swap devices, not only zswap.
> > >
> > > I think we're on the same page about the longer term goals.
> > >
> >
> > Yeah. As Johannes said, I was also recently looking into this. This
> > can also help solve other problems than consolidating implementations.
> > Currently if zswap rejects a page, it goes into swap, which is
> > more-or-less a violation of page LRUs since hotter pages that are more
> > recently reclaimed end up in swap (slow), while colder pages that were
> > reclaimed before are in zswap. Having a separate layer managing the
> > LRU of swap pages can also make sure this doesn't happen.
>
> True.
>
> >
> > More broadly, making zswap a separate layer from swap enables other
> > improvements such as using zswap regardless of the presence of a
> > backend swapfile and not consuming space in swapfiles if a page is in
> > zswap. Of course, this is a much larger surgery.
>
> If we could decouple the LRU writeback from zswap and supports
> compression without backing swapfile, sounds like becoming more of
> zram. ;-)

That's a little bit grey. Maybe we can consolidate them one day :)

We have been using zswap without a swapfile at Google for a while. This
gives us the ability to reject pages that do not compress well enough
for us, which I suspect zram would not support, given that it is
designed to be the final destination of the page. Also, having the
same configuration and code running on machines whether or not they
have a swapfile is nice; otherwise one would need to use zram if there
is no swapfile and switch to zswap if there is one.

>
> >
> > I am intending to spend more time looking further into this, but other
> > things keep popping up :)
>
> Same with me. Thanks for looking it, Yosry!
  
Yu Zhao Nov. 3, 2022, 9:15 p.m. UTC | #11
On Thu, Nov 3, 2022 at 2:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote:
> > < snip >
> >
> > > > > > > I am also worry about that LRU stuff should be part of allocator
> > > > > > > instead of higher level.
> > > > > >
> > > > > > I'm sorry, but that's not a reasonable objection.
> > > > > >
> > > > > > These patches implement a core feature of being a zswap backend, using
> > > > > > standard LRU and locking techniques established by the other backends.
> > > > > >
> > > > > > I don't disagree that it would nicer if zswap had a strong abstraction
> > > > > > for backend pages and a generalized LRU. But that is major surgery on
> > > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > > > > > all that first before implementing a basic feature that's useful now.
> > > > >
> > > > > With same logic, folks added the LRU logic into their allocators
> > > > > without the effort considering moving the LRU into upper layer.
> > > > >
> > > > > And then trend is still going on since I have seen multiple times
> > > > > people are trying to add more allocators. So if it's not a reasonable
> > > > > ask to consier, we couldn't stop the trend in the end.
> > > >
> > > > So there is actually an ongoing effort to do that. Yosry and I have
> > > > spent quite some time on coming up with an LRU design that's
> > > > independent from compression policy over email and at Plumbers.
> > > >
> > > > My concern is more about the order of doing things:
> > > >
> > > > 1. The missing writeback support is a gaping hole in zsmalloc, which
> > > >    affects production systems. A generalized LRU list is a good idea,
> > > >    but it's a huge task that from a user pov really is not
> > > >    critical. Even from a kernel dev / maintainer POV, there are bigger
> > > >    fish to fry in the zswap code base and the backends than this.
> > > >
> > > > 2. Refactoring existing functionality is much easier than writing
> > > >    generalized code that simultaneously enables new behavior. zsmalloc
> > > >    is the most complex of our backends. To make its LRU writeback work
> > > >    we had to patch zswap's ->map ordering to accomodate it, e.g. Such
> > > >    tricky changes are easier to make and test incrementally.
> > > >
> > > >    The generalized LRU project will hugely benefit from already having
> > > >    a proven writeback implementation in zsmalloc, because then all the
> > > >    requirements in zswap and zsmalloc will be in black and white.
> > > >
> > > > > > I get that your main interest is zram, and so this feature isn't of
> > > > > > interest to you. But zram isn't the only user, nor is it the primary
> > > > >
> > > > > I am interest to the feature but my interest is more of general swap
> > > > > layer to manage the LRU so that it could support any hierarchy among
> > > > > swap devices, not only zswap.
> > > >
> > > > I think we're on the same page about the longer term goals.
> > > >
> > >
> > > Yeah. As Johannes said, I was also recently looking into this. This
> > > can also help solve other problems than consolidating implementations.
> > > Currently if zswap rejects a page, it goes into swap, which is
> > > more-or-less a violation of page LRUs since hotter pages that are more
> > > recently reclaimed end up in swap (slow), while colder pages that were
> > > reclaimed before are in zswap. Having a separate layer managing the
> > > LRU of swap pages can also make sure this doesn't happen.
> >
> > True.
> >
> > >
> > > More broadly, making zswap a separate layer from swap enables other
> > > improvements such as using zswap regardless of the presence of a
> > > backend swapfile and not consuming space in swapfiles if a page is in
> > > zswap. Of course, this is a much larger surgery.
> >
> > If we could decouple the LRU writeback from zswap and supports
> > compression without backing swapfile, sounds like becoming more of
> > zram. ;-)
>
> That's a little bit grey. Maybe we can consolidate them one day :)
>
> We have been using zswap without swapfile at Google for a while

We do require swapfiles for zswap -- they can be truncated.

> this
> gives us the ability to reject pages that do not compress well enough
> for us, which I suspect zram would not support given that it is
> designed to be the final destination of the page. Also, having the
> same configuration and code running on machines whether or not they
> have a swapfile is nice, otherwise one would need to use zram if there
> is no swapfile and switch to zswap if there is one.
>
> >
> > >
> > > I am intending to spend more time looking further into this, but other
> > > things keep popping up :)
> >
> > Same with me. Thanks for looking it, Yosry!
>
  
Minchan Kim Nov. 3, 2022, 9:43 p.m. UTC | #12
On Thu, Nov 03, 2022 at 01:46:56PM -0700, Yosry Ahmed wrote:
> On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote:
> > < snip >
> >
> > > > > > > I am also worry about that LRU stuff should be part of allocator
> > > > > > > instead of higher level.
> > > > > >
> > > > > > I'm sorry, but that's not a reasonable objection.
> > > > > >
> > > > > > These patches implement a core feature of being a zswap backend, using
> > > > > > standard LRU and locking techniques established by the other backends.
> > > > > >
> > > > > > I don't disagree that it would nicer if zswap had a strong abstraction
> > > > > > for backend pages and a generalized LRU. But that is major surgery on
> > > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > > > > > all that first before implementing a basic feature that's useful now.
> > > > >
> > > > > With same logic, folks added the LRU logic into their allocators
> > > > > without the effort considering moving the LRU into upper layer.
> > > > >
> > > > > And then trend is still going on since I have seen multiple times
> > > > > people are trying to add more allocators. So if it's not a reasonable
> > > > > ask to consier, we couldn't stop the trend in the end.
> > > >
> > > > So there is actually an ongoing effort to do that. Yosry and I have
> > > > spent quite some time on coming up with an LRU design that's
> > > > independent from compression policy over email and at Plumbers.
> > > >
> > > > My concern is more about the order of doing things:
> > > >
> > > > 1. The missing writeback support is a gaping hole in zsmalloc, which
> > > >    affects production systems. A generalized LRU list is a good idea,
> > > >    but it's a huge task that from a user pov really is not
> > > >    critical. Even from a kernel dev / maintainer POV, there are bigger
> > > >    fish to fry in the zswap code base and the backends than this.
> > > >
> > > > 2. Refactoring existing functionality is much easier than writing
> > > >    generalized code that simultaneously enables new behavior. zsmalloc
> > > >    is the most complex of our backends. To make its LRU writeback work
> > > >    we had to patch zswap's ->map ordering to accomodate it, e.g. Such
> > > >    tricky changes are easier to make and test incrementally.
> > > >
> > > >    The generalized LRU project will hugely benefit from already having
> > > >    a proven writeback implementation in zsmalloc, because then all the
> > > >    requirements in zswap and zsmalloc will be in black and white.
> > > >
> > > > > > I get that your main interest is zram, and so this feature isn't of
> > > > > > interest to you. But zram isn't the only user, nor is it the primary
> > > > >
> > > > > I am interest to the feature but my interest is more of general swap
> > > > > layer to manage the LRU so that it could support any hierarchy among
> > > > > swap devices, not only zswap.
> > > >
> > > > I think we're on the same page about the longer term goals.
> > > >
> > >
> > > Yeah. As Johannes said, I was also recently looking into this. This
> > > can also help solve other problems than consolidating implementations.
> > > Currently if zswap rejects a page, it goes into swap, which is
> > > more-or-less a violation of page LRUs since hotter pages that are more
> > > recently reclaimed end up in swap (slow), while colder pages that were
> > > reclaimed before are in zswap. Having a separate layer managing the
> > > LRU of swap pages can also make sure this doesn't happen.
> >
> > True.
> >
> > >
> > > More broadly, making zswap a separate layer from swap enables other
> > > improvements such as using zswap regardless of the presence of a
> > > backend swapfile and not consuming space in swapfiles if a page is in
> > > zswap. Of course, this is a much larger surgery.
> >
> > If we could decouple the LRU writeback from zswap and supports
> > compression without backing swapfile, sounds like becoming more of
> > zram. ;-)
> 
> That's a little bit grey. Maybe we can consolidate them one day :)
> 
> We have been using zswap without swapfile at Google for a while, this
> gives us the ability to reject pages that do not compress well enough
> for us, which I suspect zram would not support given that it is
> designed to be the final destination of the page. Also, having the

zRAM could do that with a little change, but with the current implementation it
will print a swapout failure message (not a big deal, since we could
suppress it). Rather than that, though, I have thought we need to move the
page to the unevictable LRU list while marking it CoW, to catch the time
to move the page back to the evictable LRU list and provide a second chance
for it to be compressed. Just off-topic.

> same configuration and code running on machines whether or not they
> have a swapfile is nice, otherwise one would need to use zram if there
> is no swapfile and switch to zswap if there is one.
  
Yosry Ahmed Nov. 3, 2022, 11:19 p.m. UTC | #13
On Thu, Nov 3, 2022 at 2:16 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Nov 3, 2022 at 2:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote:
> > > < snip >
> > >
> > > > > > > > I am also worry about that LRU stuff should be part of allocator
> > > > > > > > instead of higher level.
> > > > > > >
> > > > > > > I'm sorry, but that's not a reasonable objection.
> > > > > > >
> > > > > > > These patches implement a core feature of being a zswap backend, using
> > > > > > > standard LRU and locking techniques established by the other backends.
> > > > > > >
> > > > > > > I don't disagree that it would nicer if zswap had a strong abstraction
> > > > > > > for backend pages and a generalized LRU. But that is major surgery on
> > > > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > > > > > > all that first before implementing a basic feature that's useful now.
> > > > > >
> > > > > > With same logic, folks added the LRU logic into their allocators
> > > > > > without the effort considering moving the LRU into upper layer.
> > > > > >
> > > > > > And then trend is still going on since I have seen multiple times
> > > > > > people are trying to add more allocators. So if it's not a reasonable
> > > > > > ask to consier, we couldn't stop the trend in the end.
> > > > >
> > > > > So there is actually an ongoing effort to do that. Yosry and I have
> > > > > spent quite some time on coming up with an LRU design that's
> > > > > independent from compression policy over email and at Plumbers.
> > > > >
> > > > > My concern is more about the order of doing things:
> > > > >
> > > > > 1. The missing writeback support is a gaping hole in zsmalloc, which
> > > > >    affects production systems. A generalized LRU list is a good idea,
> > > > >    but it's a huge task that from a user pov really is not
> > > > >    critical. Even from a kernel dev / maintainer POV, there are bigger
> > > > >    fish to fry in the zswap code base and the backends than this.
> > > > >
> > > > > 2. Refactoring existing functionality is much easier than writing
> > > > >    generalized code that simultaneously enables new behavior. zsmalloc
> > > > >    is the most complex of our backends. To make its LRU writeback work
> > > > >    we had to patch zswap's ->map ordering to accomodate it, e.g. Such
> > > > >    tricky changes are easier to make and test incrementally.
> > > > >
> > > > >    The generalized LRU project will hugely benefit from already having
> > > > >    a proven writeback implementation in zsmalloc, because then all the
> > > > >    requirements in zswap and zsmalloc will be in black and white.
> > > > >
> > > > > > > I get that your main interest is zram, and so this feature isn't of
> > > > > > > interest to you. But zram isn't the only user, nor is it the primary
> > > > > >
> > > > > > I am interest to the feature but my interest is more of general swap
> > > > > > layer to manage the LRU so that it could support any hierarchy among
> > > > > > swap devices, not only zswap.
> > > > >
> > > > > I think we're on the same page about the longer term goals.
> > > > >
> > > >
> > > > Yeah. As Johannes said, I was also recently looking into this. This
> > > > can also help solve other problems than consolidating implementations.
> > > > Currently if zswap rejects a page, it goes into swap, which is
> > > > more-or-less a violation of page LRUs since hotter pages that are more
> > > > recently reclaimed end up in swap (slow), while colder pages that were
> > > > reclaimed before are in zswap. Having a separate layer managing the
> > > > LRU of swap pages can also make sure this doesn't happen.
> > >
> > > True.
> > >
> > > >
> > > > More broadly, making zswap a separate layer from swap enables other
> > > > improvements such as using zswap regardless of the presence of a
> > > > backend swapfile and not consuming space in swapfiles if a page is in
> > > > zswap. Of course, this is a much larger surgery.
> > >
> > > If we could decouple the LRU writeback from zswap and supports
> > > compression without backing swapfile, sounds like becoming more of
> > > zram. ;-)
> >
> > That's a little bit grey. Maybe we can consolidate them one day :)
> >
> > We have been using zswap without swapfile at Google for a while
>
> We do require swapfiles for zswap -- they can be truncated.

Right. We detect truncated swapfiles and use them as "ghost" swapfiles,
which more or less enables us to use zswap without a "real" swapfile.

>
> > this
> > gives us the ability to reject pages that do not compress well enough
> > for us, which I suspect zram would not support given that it is
> > designed to be the final destination of the page. Also, having the
> > same configuration and code running on machines whether or not they
> > have a swapfile is nice, otherwise one would need to use zram if there
> > is no swapfile and switch to zswap if there is one.
> >
> > >
> > > >
> > > > I am intending to spend more time looking further into this, but other
> > > > things keep popping up :)
> > >
> > > Same with me. Thanks for looking it, Yosry!
> >
  
Yosry Ahmed Nov. 3, 2022, 11:31 p.m. UTC | #14
On Thu, Nov 3, 2022 at 2:43 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, Nov 03, 2022 at 01:46:56PM -0700, Yosry Ahmed wrote:
> > On Thu, Nov 3, 2022 at 1:37 PM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Thu, Nov 03, 2022 at 11:10:47AM -0700, Yosry Ahmed wrote:
> > > < snip >
> > >
> > > > > > > > I am also worry about that LRU stuff should be part of allocator
> > > > > > > > instead of higher level.
> > > > > > >
> > > > > > > I'm sorry, but that's not a reasonable objection.
> > > > > > >
> > > > > > > These patches implement a core feature of being a zswap backend, using
> > > > > > > standard LRU and locking techniques established by the other backends.
> > > > > > >
> > > > > > > I don't disagree that it would nicer if zswap had a strong abstraction
> > > > > > > for backend pages and a generalized LRU. But that is major surgery on
> > > > > > > a codebase of over 6,500 lines. It's not a reasonable ask to change
> > > > > > > all that first before implementing a basic feature that's useful now.
> > > > > >
> > > > > > With same logic, folks added the LRU logic into their allocators
> > > > > > without the effort considering moving the LRU into upper layer.
> > > > > >
> > > > > > And then trend is still going on since I have seen multiple times
> > > > > > people are trying to add more allocators. So if it's not a reasonable
> > > > > > ask to consier, we couldn't stop the trend in the end.
> > > > >
> > > > > So there is actually an ongoing effort to do that. Yosry and I have
> > > > > spent quite some time on coming up with an LRU design that's
> > > > > independent from compression policy over email and at Plumbers.
> > > > >
> > > > > My concern is more about the order of doing things:
> > > > >
> > > > > 1. The missing writeback support is a gaping hole in zsmalloc, which
> > > > >    affects production systems. A generalized LRU list is a good idea,
> > > > >    but it's a huge task that from a user pov really is not
> > > > >    critical. Even from a kernel dev / maintainer POV, there are bigger
> > > > >    fish to fry in the zswap code base and the backends than this.
> > > > >
> > > > > 2. Refactoring existing functionality is much easier than writing
> > > > >    generalized code that simultaneously enables new behavior. zsmalloc
> > > > >    is the most complex of our backends. To make its LRU writeback work
> > > > >    we had to patch zswap's ->map ordering to accomodate it, e.g. Such
> > > > >    tricky changes are easier to make and test incrementally.
> > > > >
> > > > >    The generalized LRU project will hugely benefit from already having
> > > > >    a proven writeback implementation in zsmalloc, because then all the
> > > > >    requirements in zswap and zsmalloc will be in black and white.
> > > > >
> > > > > > > I get that your main interest is zram, and so this feature isn't of
> > > > > > > interest to you. But zram isn't the only user, nor is it the primary
> > > > > >
> > > > > > I am interest to the feature but my interest is more of general swap
> > > > > > layer to manage the LRU so that it could support any hierarchy among
> > > > > > swap devices, not only zswap.
> > > > >
> > > > > I think we're on the same page about the longer term goals.
> > > > >
> > > >
> > > > Yeah. As Johannes said, I was also recently looking into this. This
> > > > can also help solve other problems than consolidating implementations.
> > > > Currently if zswap rejects a page, it goes into swap, which is
> > > > more-or-less a violation of page LRUs since hotter pages that are more
> > > > recently reclaimed end up in swap (slow), while colder pages that were
> > > > reclaimed before are in zswap. Having a separate layer managing the
> > > > LRU of swap pages can also make sure this doesn't happen.
> > >
> > > True.
> > >
> > > >
> > > > More broadly, making zswap a separate layer from swap enables other
> > > > improvements such as using zswap regardless of the presence of a
> > > > backend swapfile and not consuming space in swapfiles if a page is in
> > > > zswap. Of course, this is a much larger surgery.
> > >
> > > If we could decouple the LRU writeback from zswap and supports
> > > compression without backing swapfile, sounds like becoming more of
> > > zram. ;-)
> >
> > That's a little bit grey. Maybe we can consolidate them one day :)
> >
> > We have been using zswap without swapfile at Google for a while, this
> > gives us the ability to reject pages that do not compress well enough
> > for us, which I suspect zram would not support given that it is
> > designed to be the final destination of the page. Also, having the
>
> zRAM could do with little change but at current implmentation, it will
> print swapout failure message(it's not a big deal since we could
> suppress) but I have thought rather than that, we needs to move the
> page unevictable LRU list with marking the page CoW to catch a time
> to move the page into evictable LRU list o provide second chance to
> be compressed. Just off-topic.

Right. We do something similar-ish today. However, this does not work
for zswap if there is a backing swapfile, as the page still needs to
be evictable to the swapfile. A decoupled LRU can also manage
this appropriately.

>
> > same configuration and code running on machines whether or not they
> > have a swapfile is nice, otherwise one would need to use zram if there
> > is no swapfile and switch to zswap if there is one.
  
Sergey Senozhatsky Nov. 4, 2022, 3:58 a.m. UTC | #15
On (22/11/03 11:18), Johannes Weiner wrote:
> > > I'm not in love with this, to be honest. One big pool lock instead
> > > of 255 per-class locks doesn't look attractive, as one big pool lock
> > > is going to be hammered quite a lot when zram is used, e.g. as a regular
> > > block device with a file system and is under heavy parallel writes/reads.
> 
> TBH the class always struck me as an odd scope to split the lock. Lock
> contention depends on how variable the compression rate is of the
> hottest incoming data, which is unpredictable from a user POV.
> 
> My understanding is that the primary usecase for zram is swapping, and
> the pool lock is the same granularity as the swap locking.

That's what we thought, until a couple of merge windows ago we found out
(the hard way) that SUSE uses ZRAM as a normal block device with a real
file system on it. And they use it often enough to have immediately spotted
the regression we landed.

> Do you have a particular one in mind? (I'm thinking journaled ones are
> not of much interest, since their IO tends to be fairly serialized.)
> 
> btrfs?

Probably some parallel fio workloads? Seq, random reads/writes from
numerous workers.

I personally sometimes use ZRAM when I want to compile something and
care only about the package; I don't need the .o files for recompilation or
anything, just the final package.
  
Nhat Pham Nov. 7, 2022, 9:31 p.m. UTC | #16
We have benchmarked the lock consolidation to see the performance effect of
this change on zram. First, we ran a synthetic FS workload on a server machine
with 36 cores (same machine for all runs), using this benchmark script:

https://github.com/josefbacik/fs_mark

using 32 threads, and cranking the pressure up to > 80% FS usage.

Here is the result (unit is file/second):

With lock consolidation (btrfs):
Average: 13520.2, Median: 13531.0, Stddev: 137.5961482019028

Without lock consolidation (btrfs):
Average: 13487.2, Median: 13575.0, Stddev: 309.08283679298665

With lock consolidation (ext4):
Average: 16824.4, Median: 16839.0, Stddev: 89.97388510006668

Without lock consolidation (ext4)
Average: 16958.0, Median: 16986.0, Stddev: 194.7370021336469

As you can see, we observe a 0.3% regression for btrfs, and a 0.9% regression
for ext4. This is a small, barely measurable difference in my opinion.

For a more realistic scenario, we also tried building the kernel on zram.
Here is the time it takes (in seconds):

With lock consolidation (btrfs):
real
Average: 319.6, Median: 320.0, Stddev: 0.8944271909999159
user
Average: 6894.2, Median: 6895.0, Stddev: 25.528415540334656
sys
Average: 521.4, Median: 522.0, Stddev: 1.51657508881031

Without lock consolidation (btrfs):
real
Average: 319.8, Median: 320.0, Stddev: 0.8366600265340756
user
Average: 6896.6, Median: 6899.0, Stddev: 16.04057355583023
sys
Average: 520.6, Median: 521.0, Stddev: 1.140175425099138

With lock consolidation (ext4):
real
Average: 320.0, Median: 319.0, Stddev: 1.4142135623730951
user
Average: 6896.8, Median: 6878.0, Stddev: 28.621670111997307
sys
Average: 521.2, Median: 521.0, Stddev: 1.7888543819998317

Without lock consolidation (ext4)
real
Average: 319.6, Median: 319.0, Stddev: 0.8944271909999159
user
Average: 6886.2, Median: 6887.0, Stddev: 16.93221781102523
sys
Average: 520.4, Median: 520.0, Stddev: 1.140175425099138

The difference is entirely within the noise of a typical run on zram. This
hardly justifies the complexity of maintaining both the pool lock and the class
lock. In fact, for writeback, we would need to introduce yet another lock to
prevent data races on the pool's LRU, further complicating the lock handling
logic. IMHO, it is just better to collapse all of these into a single
pool-level lock.
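
As an illustration of that last point, a minimal sketch of how a pool-level
LRU naturally hangs off the same pool->lock (the lru fields and the helper are
hypothetical, not the actual writeback series):

/*
 * Assume writeback adds "struct list_head lru" to struct zs_pool (the list
 * head) and to struct zspage (the list node), both protected by pool->lock.
 * Touching the LRU then reuses the lock the rest of the pool already takes,
 * instead of introducing a third lock level.
 */
static void zs_pool_lru_touch(struct zs_pool *pool, struct zspage *zspage)
{
	spin_lock(&pool->lock);
	list_move(&zspage->lru, &pool->lru);	/* bump to the MRU end */
	spin_unlock(&pool->lock);
}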
  
Minchan Kim Nov. 7, 2022, 10:35 p.m. UTC | #17
On Mon, Nov 07, 2022 at 01:31:14PM -0800, Nhat Pham wrote:
> We have benchmarked the lock consolidation to see the performance effect of
> this change on zram. First, we ran a synthetic FS workload on a server machine
> with 36 cores (same machine for all runs), using this benchmark script:
> 
> https://github.com/josefbacik/fs_mark
> 
> using 32 threads, and cranking the pressure up to > 80% FS usage.
> 
> Here is the result (unit is file/second):
> 
> With lock consolidation (btrfs):
> Average: 13520.2, Median: 13531.0, Stddev: 137.5961482019028
> 
> Without lock consolidation (btrfs):
> Average: 13487.2, Median: 13575.0, Stddev: 309.08283679298665
> 
> With lock consolidation (ext4):
> Average: 16824.4, Median: 16839.0, Stddev: 89.97388510006668
> 
> Without lock consolidation (ext4)
> Average: 16958.0, Median: 16986.0, Stddev: 194.7370021336469
> 
> As you can see, we observe a 0.3% regression for btrfs, and a 0.9% regression
> for ext4. This is a small, barely measurable difference in my opinion.
> 
> For a more realistic scenario, we also tries building the kernel on zram.
> Here is the time it takes (in seconds):
> 
> With lock consolidation (btrfs):
> real
> Average: 319.6, Median: 320.0, Stddev: 0.8944271909999159
> user
> Average: 6894.2, Median: 6895.0, Stddev: 25.528415540334656
> sys
> Average: 521.4, Median: 522.0, Stddev: 1.51657508881031
> 
> Without lock consolidation (btrfs):
> real
> Average: 319.8, Median: 320.0, Stddev: 0.8366600265340756
> user
> Average: 6896.6, Median: 6899.0, Stddev: 16.04057355583023
> sys
> Average: 520.6, Median: 521.0, Stddev: 1.140175425099138
> 
> With lock consolidation (ext4):
> real
> Average: 320.0, Median: 319.0, Stddev: 1.4142135623730951
> user
> Average: 6896.8, Median: 6878.0, Stddev: 28.621670111997307
> sys
> Average: 521.2, Median: 521.0, Stddev: 1.7888543819998317
> 
> Without lock consolidation (ext4):
> real
> Average: 319.6, Median: 319.0, Stddev: 0.8944271909999159
> user
> Average: 6886.2, Median: 6887.0, Stddev: 16.93221781102523
> sys
> Average: 520.4, Median: 520.0, Stddev: 1.140175425099138
> 
> The difference is entirely within the noise of a typical run on zram. This
> hardly justifies the complexity of maintaining both the pool lock and the class
> lock. In fact, for writeback, we would need to introduce yet another lock to

I am glad to make the zsmalloc lock scheme simpler without a meaningful
regression, since it had introduced a lot of mess. Please include the
test results in the description.

Thanks for the testing, Nhat.
  

Patch

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index d03941cace2c..326faa751f0a 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -33,8 +33,7 @@ 
 /*
  * lock ordering:
  *	page_lock
- *	pool->migrate_lock
- *	class->lock
+ *	pool->lock
  *	zspage->lock
  */
 
@@ -192,7 +191,6 @@  static const int fullness_threshold_frac = 4;
 static size_t huge_class_size;
 
 struct size_class {
-	spinlock_t lock;
 	struct list_head fullness_list[NR_ZS_FULLNESS];
 	/*
 	 * Size of objects stored in this class. Must be multiple
@@ -247,8 +245,7 @@  struct zs_pool {
 #ifdef CONFIG_COMPACTION
 	struct work_struct free_work;
 #endif
-	/* protect page/zspage migration */
-	rwlock_t migrate_lock;
+	spinlock_t lock;
 };
 
 struct zspage {
@@ -355,7 +352,7 @@  static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
 	kmem_cache_free(pool->zspage_cachep, zspage);
 }
 
-/* class->lock(which owns the handle) synchronizes races */
+/* pool->lock(which owns the handle) synchronizes races */
 static void record_obj(unsigned long handle, unsigned long obj)
 {
 	*(unsigned long *)handle = obj;
@@ -452,7 +449,7 @@  static __maybe_unused int is_first_page(struct page *page)
 	return PagePrivate(page);
 }
 
-/* Protected by class->lock */
+/* Protected by pool->lock */
 static inline int get_zspage_inuse(struct zspage *zspage)
 {
 	return zspage->inuse;
@@ -597,13 +594,13 @@  static int zs_stats_size_show(struct seq_file *s, void *v)
 		if (class->index != i)
 			continue;
 
-		spin_lock(&class->lock);
+		spin_lock(&pool->lock);
 		class_almost_full = zs_stat_get(class, CLASS_ALMOST_FULL);
 		class_almost_empty = zs_stat_get(class, CLASS_ALMOST_EMPTY);
 		obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
 		obj_used = zs_stat_get(class, OBJ_USED);
 		freeable = zs_can_compact(class);
-		spin_unlock(&class->lock);
+		spin_unlock(&pool->lock);
 
 		objs_per_zspage = class->objs_per_zspage;
 		pages_used = obj_allocated / objs_per_zspage *
@@ -916,7 +913,7 @@  static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 
 	get_zspage_mapping(zspage, &class_idx, &fg);
 
-	assert_spin_locked(&class->lock);
+	assert_spin_locked(&pool->lock);
 
 	VM_BUG_ON(get_zspage_inuse(zspage));
 	VM_BUG_ON(fg != ZS_EMPTY);
@@ -1247,19 +1244,19 @@  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	BUG_ON(in_interrupt());
 
 	/* It guarantees it can get zspage from handle safely */
-	read_lock(&pool->migrate_lock);
+	spin_lock(&pool->lock);
 	obj = handle_to_obj(handle);
 	obj_to_location(obj, &page, &obj_idx);
 	zspage = get_zspage(page);
 
 	/*
-	 * migration cannot move any zpages in this zspage. Here, class->lock
+	 * migration cannot move any zpages in this zspage. Here, pool->lock
 	 * is too heavy since callers would take some time until they calls
 	 * zs_unmap_object API so delegate the locking from class to zspage
 	 * which is smaller granularity.
 	 */
 	migrate_read_lock(zspage);
-	read_unlock(&pool->migrate_lock);
+	spin_unlock(&pool->lock);
 
 	class = zspage_class(pool, zspage);
 	off = (class->size * obj_idx) & ~PAGE_MASK;
@@ -1412,8 +1409,8 @@  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 	size += ZS_HANDLE_SIZE;
 	class = pool->size_class[get_size_class_index(size)];
 
-	/* class->lock effectively protects the zpage migration */
-	spin_lock(&class->lock);
+	/* pool->lock effectively protects the zpage migration */
+	spin_lock(&pool->lock);
 	zspage = find_get_zspage(class);
 	if (likely(zspage)) {
 		obj = obj_malloc(pool, zspage, handle);
@@ -1421,12 +1418,12 @@  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 		fix_fullness_group(class, zspage);
 		record_obj(handle, obj);
 		class_stat_inc(class, OBJ_USED, 1);
-		spin_unlock(&class->lock);
+		spin_unlock(&pool->lock);
 
 		return handle;
 	}
 
-	spin_unlock(&class->lock);
+	spin_unlock(&pool->lock);
 
 	zspage = alloc_zspage(pool, class, gfp);
 	if (!zspage) {
@@ -1434,7 +1431,7 @@  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 		return (unsigned long)ERR_PTR(-ENOMEM);
 	}
 
-	spin_lock(&class->lock);
+	spin_lock(&pool->lock);
 	obj = obj_malloc(pool, zspage, handle);
 	newfg = get_fullness_group(class, zspage);
 	insert_zspage(class, zspage, newfg);
@@ -1447,7 +1444,7 @@  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 
 	/* We completely set up zspage so mark them as movable */
 	SetZsPageMovable(pool, zspage);
-	spin_unlock(&class->lock);
+	spin_unlock(&pool->lock);
 
 	return handle;
 }
@@ -1491,16 +1488,14 @@  void zs_free(struct zs_pool *pool, unsigned long handle)
 		return;
 
 	/*
-	 * The pool->migrate_lock protects the race with zpage's migration
+	 * The pool->lock protects the race with zpage's migration
 	 * so it's safe to get the page from handle.
 	 */
-	read_lock(&pool->migrate_lock);
+	spin_lock(&pool->lock);
 	obj = handle_to_obj(handle);
 	obj_to_page(obj, &f_page);
 	zspage = get_zspage(f_page);
 	class = zspage_class(pool, zspage);
-	spin_lock(&class->lock);
-	read_unlock(&pool->migrate_lock);
 
 	obj_free(class->size, obj);
 	class_stat_dec(class, OBJ_USED, 1);
@@ -1510,7 +1505,7 @@  void zs_free(struct zs_pool *pool, unsigned long handle)
 
 	free_zspage(pool, class, zspage);
 out:
-	spin_unlock(&class->lock);
+	spin_unlock(&pool->lock);
 	cache_free_handle(pool, handle);
 }
 EXPORT_SYMBOL_GPL(zs_free);
@@ -1867,16 +1862,12 @@  static int zs_page_migrate(struct page *newpage, struct page *page,
 	pool = zspage->pool;
 
 	/*
-	 * The pool migrate_lock protects the race between zpage migration
+	 * The pool's lock protects the race between zpage migration
 	 * and zs_free.
 	 */
-	write_lock(&pool->migrate_lock);
+	spin_lock(&pool->lock);
 	class = zspage_class(pool, zspage);
 
-	/*
-	 * the class lock protects zpage alloc/free in the zspage.
-	 */
-	spin_lock(&class->lock);
 	/* the migrate_write_lock protects zpage access via zs_map_object */
 	migrate_write_lock(zspage);
 
@@ -1906,10 +1897,9 @@  static int zs_page_migrate(struct page *newpage, struct page *page,
 	replace_sub_page(class, zspage, newpage, page);
 	/*
 	 * Since we complete the data copy and set up new zspage structure,
-	 * it's okay to release migration_lock.
+	 * it's okay to release the pool's lock.
 	 */
-	write_unlock(&pool->migrate_lock);
-	spin_unlock(&class->lock);
+	spin_unlock(&pool->lock);
 	dec_zspage_isolation(zspage);
 	migrate_write_unlock(zspage);
 
@@ -1964,9 +1954,9 @@  static void async_free_zspage(struct work_struct *work)
 		if (class->index != i)
 			continue;
 
-		spin_lock(&class->lock);
+		spin_lock(&pool->lock);
 		list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages);
-		spin_unlock(&class->lock);
+		spin_unlock(&pool->lock);
 	}
 
 	list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
@@ -1976,9 +1966,9 @@  static void async_free_zspage(struct work_struct *work)
 		get_zspage_mapping(zspage, &class_idx, &fullness);
 		VM_BUG_ON(fullness != ZS_EMPTY);
 		class = pool->size_class[class_idx];
-		spin_lock(&class->lock);
+		spin_lock(&pool->lock);
 		__free_zspage(pool, class, zspage);
-		spin_unlock(&class->lock);
+		spin_unlock(&pool->lock);
 	}
 };
 
@@ -2039,10 +2029,11 @@  static unsigned long __zs_compact(struct zs_pool *pool,
 	struct zspage *dst_zspage = NULL;
 	unsigned long pages_freed = 0;
 
-	/* protect the race between zpage migration and zs_free */
-	write_lock(&pool->migrate_lock);
-	/* protect zpage allocation/free */
-	spin_lock(&class->lock);
+	/*
+	 * protect the race between zpage migration and zs_free
+	 * as well as zpage allocation/free
+	 */
+	spin_lock(&pool->lock);
 	while ((src_zspage = isolate_zspage(class, true))) {
 		/* protect someone accessing the zspage(i.e., zs_map_object) */
 		migrate_write_lock(src_zspage);
@@ -2067,7 +2058,7 @@  static unsigned long __zs_compact(struct zs_pool *pool,
 			putback_zspage(class, dst_zspage);
 			migrate_write_unlock(dst_zspage);
 			dst_zspage = NULL;
-			if (rwlock_is_contended(&pool->migrate_lock))
+			if (spin_is_contended(&pool->lock))
 				break;
 		}
 
@@ -2084,11 +2075,9 @@  static unsigned long __zs_compact(struct zs_pool *pool,
 			pages_freed += class->pages_per_zspage;
 		} else
 			migrate_write_unlock(src_zspage);
-		spin_unlock(&class->lock);
-		write_unlock(&pool->migrate_lock);
+		spin_unlock(&pool->lock);
 		cond_resched();
-		write_lock(&pool->migrate_lock);
-		spin_lock(&class->lock);
+		spin_lock(&pool->lock);
 	}
 
 	if (src_zspage) {
@@ -2096,8 +2085,7 @@  static unsigned long __zs_compact(struct zs_pool *pool,
 		migrate_write_unlock(src_zspage);
 	}
 
-	spin_unlock(&class->lock);
-	write_unlock(&pool->migrate_lock);
+	spin_unlock(&pool->lock);
 
 	return pages_freed;
 }
@@ -2200,7 +2188,7 @@  struct zs_pool *zs_create_pool(const char *name)
 		return NULL;
 
 	init_deferred_free(pool);
-	rwlock_init(&pool->migrate_lock);
+	spin_lock_init(&pool->lock);
 
 	pool->name = kstrdup(name, GFP_KERNEL);
 	if (!pool->name)
@@ -2271,7 +2259,6 @@  struct zs_pool *zs_create_pool(const char *name)
 		class->index = i;
 		class->pages_per_zspage = pages_per_zspage;
 		class->objs_per_zspage = objs_per_zspage;
-		spin_lock_init(&class->lock);
 		pool->size_class[i] = class;
 		for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS;
 							fullness++)