[PATCHv4,0/9] zsmalloc/zram: configurable zspage size

Message ID 20221031054108.541190-1-senozhatsky@chromium.org
Series zsmalloc/zram: configurable zspage size

Message

Sergey Senozhatsky Oct. 31, 2022, 5:40 a.m. UTC
  Hello,

	Some use-cases and/or data patterns may benefit from
larger zspages. Currently the limit on the number of physical
pages that are linked into a zspage is hardcoded to 4. A higher
limit changes key characteristics of a number of size classes,
improving compactness of the pool and reducing the amount of
memory the zsmalloc pool uses. More on this in the 0002 commit
message.
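
To illustrate the effect, here is a rough userspace sketch (not the
kernel code; it only loosely mirrors zsmalloc's chain-length selection
heuristic, and the page size and object sizes are illustrative) showing
how per-class waste can change when the limit goes from 4 to 8 pages:

/*
 * Rough userspace model of how zsmalloc picks a zspage chain length:
 * for a given object size, try every chain length up to the limit and
 * keep the one with the best usage ratio. Loosely mirrors the in-kernel
 * heuristic; the object sizes below are just examples.
 */
#include <stdio.h>

#define PAGE_SZ 4096

static int best_chain(int obj_size, int limit, int *waste)
{
	int best = 1, best_used_pc = 0;

	for (int pages = 1; pages <= limit; pages++) {
		int zspage_size = pages * PAGE_SZ;
		int w = zspage_size % obj_size;
		int used_pc = (zspage_size - w) * 100 / zspage_size;

		if (used_pc > best_used_pc) {
			best_used_pc = used_pc;
			best = pages;
			*waste = w;
		}
	}
	return best;
}

int main(void)
{
	int sizes[] = { 1040, 2336, 3408 };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		int w4 = 0, w8 = 0;
		int c4 = best_chain(sizes[i], 4, &w4);
		int c8 = best_chain(sizes[i], 8, &w8);

		printf("size %4d: limit 4 -> %d page(s), %4d bytes wasted; "
		       "limit 8 -> %d page(s), %4d bytes wasted\n",
		       sizes[i], c4, w4, c8, w8);
	}
	return 0;
}

With these example sizes, a class like 3408 goes from wasting 688 bytes
per page to wasting 32 bytes per 5-page zspage, while a class like 2336
is unaffected by the higher limit.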

v4:
-- Fixed type of the max_pages_per_zspage (kbuild reported a
   "warning: right shift count >= width of type" warning)
-- Renamed max_pages_per_zspage variable

v3:
-- Removed lots of text from 0002 commit message. Now it's shorter
   and simpler.

v2:
-- Cherry-picked a patch from Alexey (minor code tweaks to move
   it ahead of this series)
-- zsmalloc no longer requires the pages-per-zspage limit to be a
   power-of-2 value, and overall doesn't use "order" any longer
-- zram no longer requires a "zspage order" (power of 2) value
   and instead accepts an integer in the [1,16] range
-- There is no global huge_class_size in zsmalloc anymore.
   huge_class_size is per-pool, since it depends on pages-per-zspage,
   which can be different for different pools.
-- There is no global huge_class_size in zram anymore. It is now
   per-pool (per-device).
-- Updated documentation
-- Fixed documentation htmldocs warning (Stephen)
-- Dropped get_pages_per_zspage() patch
-- Renamed zram sysfs knob (device attribute)
-- Re-worked the "synthetic test" section in the first commit: more
   numbers, object distribution analysis, etc.

Alexey Romanov (1):
  zram: add size class equals check into recompression

Sergey Senozhatsky (8):
  zsmalloc: turn zspage order into runtime variable
  zsmalloc: move away from page order defines
  zsmalloc: make huge class watermark zs_pool member
  zram: huge size watermark cannot be global
  zsmalloc: pass limit on pages per-zspage to zs_create_pool()
  zram: add pages_per_pool_page device attribute
  Documentation: document zram pages_per_pool_page attribute
  zsmalloc: break out of loop when found perfect zspage order

 Documentation/admin-guide/blockdev/zram.rst |  38 +++++--
 drivers/block/zram/zram_drv.c               |  63 +++++++++--
 drivers/block/zram/zram_drv.h               |   7 ++
 include/linux/zsmalloc.h                    |  14 ++-
 mm/zsmalloc.c                               | 112 +++++++++++++-------
 5 files changed, 176 insertions(+), 58 deletions(-)
  

Comments

Minchan Kim Nov. 10, 2022, 10:44 p.m. UTC | #1
On Mon, Oct 31, 2022 at 02:40:59PM +0900, Sergey Senozhatsky wrote:
> 	Hello,
> 
> 	Some use-cases and/or data patterns may benefit from
> larger zspages. Currently the limit on the number of physical
> pages that are linked into a zspage is hardcoded to 4. A higher
> limit changes key characteristics of a number of size classes,
> improving compactness of the pool and reducing the amount of
> memory the zsmalloc pool uses. More on this in the 0002 commit
> message.

Hi Sergey,

I think the idea of breaking away from a fixed number of subpages per
zspage is a really good starting point for further optimization.
However, I am worried about introducing a per-pool config at this
stage. How about introducing just one golden value for the zspage
size, say order-3 or 4 in Kconfig, while keeping the default of order-2?

And then we can make more of an effort to auto-tune based on the wasted
memory and the number of size classes on the fly. A good thing we can
build on is that we have an indirection table (handle <-> zspage), so we
can move objects at any time, and I think we could end up with a better
approach in the end.
  
Sergey Senozhatsky Nov. 11, 2022, 12:56 a.m. UTC | #2
Hi,

On (22/11/10 14:44), Minchan Kim wrote:
> On Mon, Oct 31, 2022 at 02:40:59PM +0900, Sergey Senozhatsky wrote:
> > 	Hello,
> > 
> > 	Some use-cases and/or data patterns may benefit from
> > larger zspages. Currently the limit on the number of physical
> > pages that are linked into a zspage is hardcoded to 4. A higher
> > limit changes key characteristics of a number of size classes,
> > improving compactness of the pool and reducing the amount of
> > memory the zsmalloc pool uses. More on this in the 0002 commit
> > message.
> 
> Hi Sergey,
> 
> I think the idea of breaking away from a fixed number of subpages per
> zspage is a really good starting point for further optimization.
> However, I am worried about introducing a per-pool config at this
> stage. How about introducing just one golden value for the zspage
> size, say order-3 or 4 in Kconfig, while keeping the default of order-2?

Sorry, not sure I'm following. So you want a .config value
for the zspage limit? I really like the sysfs knob, because then
one may set values on a per-device basis (if they have multiple
zram devices in a system with different data patterns):

	zram0 which is used as a swap device uses, say, 4
	zram1 which is vfat block device uses, say, 6
	zram2 which is ext4 block device uses, say, 8

The whole point of the series is that one single value does
not fit all purposes. There is no silver bullet.
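
For example, the configuration above could be applied like this (a
hypothetical sketch: the attribute name/path follow the
pages_per_pool_page knob added by this series and may differ in the
final patches, and, as with other zram tunables, it presumably has to
be set before the device is initialized):

/*
 * Hypothetical sketch: set a different zspage limit for each zram
 * device via the per-device sysfs attribute added by this series.
 * The attribute name comes from the cover letter and may differ;
 * presumably it must be written before disksize is set.
 */
#include <stdio.h>

static int set_zspage_limit(const char *dev, int pages)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/pages_per_pool_page", dev);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%d\n", pages);
	return fclose(f);
}

int main(void)
{
	set_zspage_limit("zram0", 4);	/* swap device */
	set_zspage_limit("zram1", 6);	/* vfat block device */
	set_zspage_limit("zram2", 8);	/* ext4 block device */
	return 0;
}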

> And then we make more efforts to have auto tune based on
> the wasted memory and the number of size classes on the
> fly. A good thing to be able to achieve is we have indirect
> table(handle <-> zpage) so we could move the object anytime
> so I think we could do better way in the end.

It still needs to be per zram device (per zspool). A sysfs knob
doesn't stop us from having auto-tuned values in the future.
  
Minchan Kim Nov. 11, 2022, 5:03 p.m. UTC | #3
On Fri, Nov 11, 2022 at 09:56:36AM +0900, Sergey Senozhatsky wrote:
> Hi,
> 
> On (22/11/10 14:44), Minchan Kim wrote:
> > On Mon, Oct 31, 2022 at 02:40:59PM +0900, Sergey Senozhatsky wrote:
> > > 	Hello,
> > > 
> > > 	Some use-cases and/or data patterns may benefit from
> > > larger zspages. Currently the limit on the number of physical
> > > pages that are linked into a zspage is hardcoded to 4. A higher
> > > limit changes key characteristics of a number of size classes,
> > > improving compactness of the pool and reducing the amount of
> > > memory the zsmalloc pool uses. More on this in the 0002 commit
> > > message.
> > 
> > Hi Sergey,
> > 
> > I think the idea of breaking away from a fixed number of subpages per
> > zspage is a really good starting point for further optimization.
> > However, I am worried about introducing a per-pool config at this
> > stage. How about introducing just one golden value for the zspage
> > size, say order-3 or 4 in Kconfig, while keeping the default of order-2?
> 
> Sorry, not sure I'm following. So you want a .config value
> for the zspage limit? I really like the sysfs knob, because then
> one may set values on a per-device basis (if they have multiple
> zram devices in a system with different data patterns):

Yes, I wanted to have just a global policy that drives zsmalloc smarter
without requiring a big effort from users to decide the right tuning
value (I thought the decision process would be quite painful for normal
users who don't have enough resources), since zsmalloc's design makes it
possible. But as an interim solution, until we prove there is no
regression, we could just provide a config option and then remove it
later when we add aggressive zspage compaction (if necessary, please see
below), since that is easier than deprecating a sysfs knob.

> 
> 	zram0 which is used as a swap device uses, say, 4
> 	zram1 which is vfat block device uses, say, 6
> 	zram2 which is ext4 block device uses, say, 8
> 
> The whole point of the series is that one single value does
> not fit all purposes. There is no silver bullet.

I understand what you want to achieve with a per-pool config exposing
the knob to the user, but my worry is still how a user could decide the
best fit, since workloads are so dynamic. Some groups have enough
resources to run fleet-wide experiments while many others don't, so if
we really need the per-pool config step, at least I'd like to provide a
default guide for users in the documentation along with the tunable
knobs for experiments. Maybe we can suggest 4 for the swap case and 8
for the fs case.

I don't object to the sysfs knobs for these use cases, but can't we
deal with the issue in a better way?

In general, the bigger pages_per_zspage is, the more memory we save. It
would be the same with slab_order in the slab allocator, but slab has a
limit due to the cost of high-order allocations and internal
fragmentation with bigger-order slabs. However, zsmalloc is different in
that it doesn't expose memory addresses directly and it knows when an
object is accessed by the user. And it doesn't need high-order
allocations, either. That's how zsmalloc can support object migration
and page migration. With those features, theoretically, zsmalloc doesn't
need a limit on pages_per_zspage, so I am looking forward to seeing
zsmalloc handle the memory fragmentation problem in a better way.

My only concern with a bigger pages_per_zspage (e.g., 8 or 16) is
exhausting memory when zram is used for swap. That use case aims to help
under memory pressure, but in the worst case, the bigger pages_per_zspage
is, the higher the chance of running out of memory. However, we could
bound the worst-case memory consumption to

for class in classes:
    wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size

with *aggressive zspage compaction*. Right now we rely on the shrinker
(which might already be enough) to trigger it, but we could change the
policy to compact once the wasted memory in a size class crosses a
threshold we define for the zram fs use case, since that use case would
run without memory pressure.
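
For reference, a literal, runnable rendering of the bound above; the
class sizes and chain lengths are made-up illustrative values, not
numbers read from a real pool:

/*
 * Runnable rendering of the worst-case bound sketched above: if
 * compaction can always pack a class down to one partially filled
 * zspage, the pool wastes at most (pages_per_zspage * PAGE_SIZE -
 * object size) bytes per class. Class data here is illustrative.
 */
#include <stdio.h>

#define PAGE_SZ 4096L

struct size_class {
	int size;		/* object size in bytes */
	int pages_per_zspage;	/* chain length for this class */
};

int main(void)
{
	struct size_class classes[] = {
		{ 1040, 4 }, { 2336, 4 }, { 3408, 5 },
	};
	long wasted_bytes = 0;

	for (unsigned int i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
		wasted_bytes += classes[i].pages_per_zspage * PAGE_SZ -
				classes[i].size;

	printf("worst-case waste bound: %ld bytes\n", wasted_bytes);
	return 0;
}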

What do you think about?
  
Sergey Senozhatsky Nov. 14, 2022, 3:53 a.m. UTC | #4
Hi Minchan,

On (22/11/11 09:03), Minchan Kim wrote:
> > Sorry, not sure I'm following. So you want a .config value
> > for the zspage limit? I really like the sysfs knob, because then
> > one may set values on a per-device basis (if they have multiple
> > zram devices in a system with different data patterns):
> 
> Yes, I wanted to have just a global policy that drives zsmalloc smarter
> without requiring a big effort from users to decide the right tuning
> value (I thought the decision process would be quite painful for normal
> users who don't have enough resources), since zsmalloc's design makes it
> possible. But as an interim solution, until we prove there is no
> regression, we could just provide a config option and then remove it
> later when we add aggressive zspage compaction (if necessary, please see
> below), since that is easier than deprecating a sysfs knob.

[..]

> I understand what you want to achieve with a per-pool config exposing
> the knob to the user, but my worry is still how a user could decide the
> best fit, since workloads are so dynamic. Some groups have enough
> resources to run fleet-wide experiments while many others don't, so if
> we really need the per-pool config step, at least I'd like to provide a
> default guide for users in the documentation along with the tunable
> knobs for experiments. Maybe we can suggest 4 for the swap case and 8
> for the fs case.
> 
> I don't object to the sysfs knobs for these use cases, but can't we
> deal with the issue in a better way?

[..]

> with *aggressive zspage compaction*. Right now we rely on the shrinker
> (which might already be enough) to trigger it, but we could change the
> policy to compact once the wasted memory in a size class crosses a
> threshold we define for the zram fs use case, since that use case would
> run without memory pressure.
> 
> What do you think about?

This is tricky. I didn't want us to come up with any sort of policies
based on assumptions. For instance, we know that SUSE uses zram with fs
under severe memory pressure (so severe that they immediately noticed
when we removed the zsmalloc handle allocation slow path and reported a
regression), so the assumption that the fs zram use-case is not memory
sensitive does not always hold.

There are too many variables. We have different data patterns, yes, but
even the same data patterns have different characteristics when
compressed with different algorithms; then we also have different host
states (memory pressure, etc.) and so on.

I think that it'll be safer for us to approach this the other way around.
We can (that's what I was going to do) reach out to people (Android,
SUSE, Meta, ChromeOS, Google cloud, WebOS, Tizen) and ask them to run
experiments (try out various numbers). Then (several months later) we
can take a look at the data - what numbers work for which workloads -
and then we can introduce/change policies based on evidence and real
use cases. Who knows, maybe a zspage_chain_size of 6 can be the new
default, and then we can add a .config policy, maybe 7 or 8. Or maybe we
won't find a single number that works equally well for everyone (even
in similar use cases).

This is where a sysfs knob is very useful. Unlike .config, which has no
flexibility, especially when your entire fleet uses the same .config for
all builds, a sysfs knob lets people run numerous A/B tests simultaneously
(not to mention that some setups have many zram devices, which can have
different zspage_chain_size-s). And we don't even need to deprecate it
if we introduce a generic knob like allocator_tunables, which would
support `key=val` tuples. Then we can just deprecate a specific `key`.
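
As a purely hypothetical illustration of the `key=val` idea (no such
attribute exists in the series as posted, the key names below are made
up, and a real implementation would live in the zram sysfs store handler
rather than in userspace):

/*
 * Purely hypothetical sketch of parsing "key=val" tuples for a generic
 * allocator_tunables attribute. The key names are made up; a real
 * implementation would be a sysfs store handler in the kernel.
 */
#include <stdio.h>
#include <string.h>

static void apply_tunable(const char *key, long val)
{
	if (!strcmp(key, "pages_per_zspage"))
		printf("set pages_per_zspage = %ld\n", val);
	else if (!strcmp(key, "compaction_threshold"))
		printf("set compaction_threshold = %ld bytes\n", val);
	else
		printf("unknown key '%s' ignored\n", key);
}

int main(void)
{
	char input[] = "pages_per_zspage=8 compaction_threshold=65536";
	char *tok, *save = NULL;

	for (tok = strtok_r(input, " ", &save); tok;
	     tok = strtok_r(NULL, " ", &save)) {
		char key[32];
		long val;

		if (sscanf(tok, "%31[^=]=%ld", key, &val) == 2)
			apply_tunable(key, val);
	}
	return 0;
}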
  
Sergey Senozhatsky Nov. 14, 2022, 7:55 a.m. UTC | #5
On (22/11/11 09:03), Minchan Kim wrote:
[..]
> My only concern with a bigger pages_per_zspage (e.g., 8 or 16) is
> exhausting memory when zram is used for swap. That use case aims to help
> under memory pressure, but in the worst case, the bigger pages_per_zspage
> is, the higher the chance of running out of memory.

It's hard to speak in concrete terms here. What locally may look like a
less optimal configuration can result in a more optimal configuration
globally.

Yes, some zspage chains get longer, but in return we have very different
clustering and zspool performance/configuration.

As an example, here is a synthetic test on my host.

zspage_chain_size 4
-------------------

zsmalloc classes
 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
 ...
 Total                13           51        413836     412973     159955                         3

zram mm_stat
1691783168 628083717 655175680        0 655175680       60        0    34048    34049

zspage_chain_size 8
-------------------

zsmalloc classes
 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
 ...
 Total                18           87        414852     412978     156666                         0

zram mm_stat
1691803648 627793930 641703936        0 641703936       60        0    33591    33591


Note that we have a lower "pages_used" value for the same amount of
stored data: down to 156666 from 159955 pages.
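
As a quick cross-check (assuming 4 KiB pages and that the third mm_stat
column is mem_used_total), the two stat sources above agree on the
saving:

/* Cross-check the numbers above: the drop in zsmalloc "pages_used"
 * (159955 -> 156666) should match the drop in zram's mem_used_total
 * (third mm_stat column), assuming 4 KiB pages. */
#include <stdio.h>

int main(void)
{
	long pages_saved = 159955L - 156666L;
	long bytes_saved = 655175680L - 641703936L;

	printf("pages_used delta:     %ld pages = %ld bytes\n",
	       pages_saved, pages_saved * 4096);
	printf("mem_used_total delta: %ld bytes = %ld pages\n",
	       bytes_saved, bytes_saved / 4096);
	return 0;
}

Both come out to 3289 pages (about 13.5 MB) saved.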

So it *could be* that longer zspage chains can be beneficial even in
memory-sensitive cases, but we need more data on this, so that we can
speak "statistically".
  
Sergey Senozhatsky Nov. 14, 2022, 8:37 a.m. UTC | #6
On (22/11/11 09:03), Minchan Kim wrote:
[..]
> for class in classes:
>     wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
> 
> with *aggressive zspage compaction*. Right now we rely on the shrinker
> (which might already be enough) to trigger it, but we could change the
> policy to compact once the wasted memory in a size class crosses a
> threshold

That threshold can be another tunable in the zramX/allocator_tunables
sysfs knob and struct zs_tunables.

But overall it sounds like a bigger project for some time next year.
We already have the zs_compact() sysfs knob, so user-space can invoke it
as often as it wants to (I'm not aware of anyone doing so, btw); maybe
the new compaction should be something slightly different. I don't have
any ideas yet. One way or the other, it still can use the same sysfs
knob :)
  
Sergey Senozhatsky Nov. 15, 2022, 6:01 a.m. UTC | #7
On (22/11/11 09:03), Minchan Kim wrote:
[..]
> for class in classes:
>     wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
> 
> with *aggressive zspage compaction*. Right now we rely on the shrinker
> (which might already be enough) to trigger it, but we could change the
> policy to compact once the wasted memory in a size class crosses a
> threshold

Compaction does something good only when we can release a zspage in the
end. Otherwise we just hold the global pool->lock (assuming that we land
the zsmalloc writeback series) and simply move objects around zspages.
So the ability to limit the zspage chain size can still be valuable, on
another level, as a measure to reduce the dependency on compaction
success.

Maybe we can make compaction slightly more successful. For instance, we
could start moving objects not only within zspages of the same size
class, but also to class size + X (upper size classes), for example when
all zspages in a class are almost full but class size + 1 has almost
empty zspages. In other words, sort of as if those classes had been
merged (a virtual merge). A single pool->lock would be handy for that.

But this is more of a research project (an intern project?), with an
unclear outcome and ETA. I think in the meantime we can let people start
experimenting with various zspage chain sizes, so that maybe at some
point we can arrive at a new "default" value for all zspools, higher
than the current 4, which has been around for many years. I can't think,
at present, of a better way forward.
  
Sergey Senozhatsky Nov. 15, 2022, 7:59 a.m. UTC | #8
On (22/11/15 15:01), Sergey Senozhatsky wrote:
> On (22/11/11 09:03), Minchan Kim wrote:
> [..]
> > for class in classes:
> >     wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
> > 
> > with *aggressive zspage compaction*. Right now we rely on the shrinker
> > (which might already be enough) to trigger it, but we could change the
> > policy to compact once the wasted memory in a size class crosses a
> > threshold
> 
> Compaction does something good only when we can release a zspage in the
> end. Otherwise we just hold the global pool->lock (assuming that we land
> the zsmalloc writeback series) and simply move objects around zspages.
> So the ability to limit the zspage chain size can still be valuable, on
> another level, as a measure to reduce the dependency on compaction
> success.
> 
> Maybe we can make compaction slightly more successful. For instance, we
> could start moving objects not only within zspages of the same size
> class, but also to class size + X (upper size classes), for example when
> all zspages in a class are almost full but class size + 1 has almost
> empty zspages. In other words, sort of as if those classes had been
> merged (a virtual merge). A single pool->lock would be handy for that.

What I'm trying to say here is that "aggressiveness of compaction"
probably should be measured not by compaction frequency, but by overall
cost of compaction operations.

Aggressive frequency of compaction doesn't help us much if the state of
the pool doesn't change significantly between compactions. E.g., if we
do 10 compaction calls, only the first one potentially compacts some
zspages; the remaining ones don't do anything.

The cost of compaction operations is a measure of how hard compaction
tries. Does it move objects to neighbouring classes, and so on? Maybe
we can do something here.

But then the question is: how do we make sure that we don't drain the
battery too fast? And perhaps some other questions too.
  
Minchan Kim Nov. 15, 2022, 11:23 p.m. UTC | #9
On Tue, Nov 15, 2022 at 04:59:29PM +0900, Sergey Senozhatsky wrote:
> On (22/11/15 15:01), Sergey Senozhatsky wrote:
> > On (22/11/11 09:03), Minchan Kim wrote:
> > [..]
> > > for class in classes:
> > >     wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
> > > 
> > > with *aggressive zspage compaction*. Right now we rely on the shrinker
> > > (which might already be enough) to trigger it, but we could change the
> > > policy to compact once the wasted memory in a size class crosses a
> > > threshold
> > 
> > Compaction does something good only when we can release a zspage in the
> > end. Otherwise we just hold the global pool->lock (assuming that we land
> > the zsmalloc writeback series) and simply move objects around zspages.
> > So the ability to limit the zspage chain size can still be valuable, on
> > another level, as a measure to reduce the dependency on compaction
> > success.
> > 
> > Maybe we can make compaction slightly more successful. For instance, we
> > could start moving objects not only within zspages of the same size
> > class, but also to class size + X (upper size classes), for example when
> > all zspages in a class are almost full but class size + 1 has almost
> > empty zspages. In other words, sort of as if those classes had been
> > merged (a virtual merge). A single pool->lock would be handy for that.
> 
> What I'm trying to say here is that "aggressiveness of compaction"
> probably should be measured not by compaction frequency, but by overall
> cost of compaction operations.
> 
> Aggressive frequency of compaction doesn't help us much if the state of
> the pool doesn't change significantly between compactions. E.g., if we
> do 10 compaction calls, only the first one potentially compacts some
> zspages; the remaining ones don't do anything.
> 
> The cost of compaction operations is a measure of how hard compaction
> tries. Does it move objects to neighbouring classes, and so on? Maybe
> we can do something here.
> 
> But then the question is: how do we make sure that we don't drain the
> battery too fast? And perhaps some other questions too.
> 

Sure, once we start talking about battery, there are a lot of things we
need to consider, not only zram's direct cost but also other indirect
costs caused by memory pressure and workload patterns. That's not
something we can control, and it would consume much more battery. I
understand your concern, but I'm also not sure a per-device sysfs knob
can solve the issue, since workloads are too dynamic even within the
same swap file/fs. I'd like to try finding a sweet spot in general. If
that turns out to be too hard, then we need to introduce the knob along
with a reasonable guideline for how to find the right value.

Let me try to get data under an Android workload on how much just
blindly increasing ZS_MAX_PAGES_PER_ZSPAGE changes things.
  
Sergey Senozhatsky Nov. 16, 2022, 12:52 a.m. UTC | #10
On (22/11/15 15:23), Minchan Kim wrote:
> Sure, once we start talking about battery, there are a lot of things we
> need to consider, not only zram's direct cost but also other indirect
> costs caused by memory pressure and workload patterns. That's not
> something we can control, and it would consume much more battery. I
> understand your concern, but I'm also not sure a per-device sysfs knob
> can solve the issue, since workloads are too dynamic even within the
> same swap file/fs. I'd like to try finding a sweet spot in general. If
> that turns out to be too hard, then we need to introduce the knob along
> with a reasonable guideline for how to find the right value.
> 
> Let me try to get data under an Android workload on how much just
> blindly increasing ZS_MAX_PAGES_PER_ZSPAGE changes things.

I don't want to push for the sysfs knob.

What I like about a sysfs knob vs Kconfig is that sysfs is opt-in. We
can ask folks to try things out; people will know what to look at, they
will keep an eye on metrics, and then they come back to us. So we can
sit down, look at the numbers and draw some conclusions. Kconfig is not
opt-in. It'll happen for everyone, as a policy, transparently, and then
we rely on
a) people tracking metrics that they were not asked to track
b) people noticing changes (positive or negative) in metrics that they
   don't keep an eye on
c) people figuring out that the change in metrics is related to the
   zsmalloc Kconfig option (and that's a very non-obvious conclusion)
d) people reaching out to us

That's way too much to rely on. Chances are we will never hear back.

I understand that you don't like sysfs, and it's probably not the best
thing, but Kconfig is not better. I like the opt-in nature of sysfs:
if you change it, then you know what you are doing.