mm: memcg: Use larger chunks for proactive reclaim

Message ID 20240131162442.3487473-1-tjmercier@google.com
State New
Series mm: memcg: Use larger chunks for proactive reclaim |

Commit Message

T.J. Mercier Jan. 31, 2024, 4:24 p.m. UTC
Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
reclaim") we passed the number of pages for the reclaim request directly
to try_to_free_mem_cgroup_pages, which could lead to significant
overreclaim in order to achieve fairness. After 0388536ac291, the number
of pages was limited to a maximum of 32 (SWAP_CLUSTER_MAX) to reduce
the amount of overreclaim. However, such a small chunk size caused a
regression in reclaim performance due to many more reclaim start/stop
cycles inside memory_reclaim.

Instead of limiting reclaim chunk size to the SWAP_CLUSTER_MAX constant,
adjust the chunk size proportionally to the number of pages left to
reclaim. This allows for higher reclaim efficiency with large chunk
sizes at the beginning of memory_reclaim, and reduces potential
overreclaim by using small chunk sizes as the total reclaim amount is
approached. Using 1/4 of the amount left to reclaim as the chunk size
gives a good compromise between reclaim performance and overreclaim:

root - full reclaim       pages/sec   time (sec)
pre-0388536ac291      :    68047        10.46
post-0388536ac291     :    13742        inf
(reclaim-reclaimed)/4 :    67352        10.51

/uid_0 - 1G reclaim       pages/sec   time (sec)  overreclaim (MiB)
pre-0388536ac291      :    258822       1.12            107.8
post-0388536ac291     :    105174       2.49            3.5
(reclaim-reclaimed)/4 :    233396       1.12            -7.4

/uid_0 - full reclaim     pages/sec   time (sec)
pre-0388536ac291      :    72334        7.09
post-0388536ac291     :    38105        14.45
(reclaim-reclaimed)/4 :    72914        6.96

Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
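
To make the start/stop-cycle cost concrete, here is a small userspace C
model of the three chunking policies. It assumes each
try_to_free_mem_cgroup_pages() call reclaims exactly the requested chunk;
real reclaim typically overshoots, so actual iteration counts are lower,
but the orders of magnitude match:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }
static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

int main(void)
{
	unsigned long nr_to_reclaim = 1UL << 18;	/* 1G of 4K pages */
	const char *names[] = { "pre-0388536ac291", "post-0388536ac291",
				"(reclaim-reclaimed)/4" };

	for (int policy = 0; policy < 3; policy++) {
		unsigned long nr_reclaimed = 0, iters = 0, chunk;

		while (nr_reclaimed < nr_to_reclaim) {
			unsigned long delta = nr_to_reclaim - nr_reclaimed;

			if (policy == 0)	/* whole request at once */
				chunk = delta;
			else if (policy == 1)	/* capped at SWAP_CLUSTER_MAX */
				chunk = min_ul(delta, SWAP_CLUSTER_MAX);
			else			/* decaying quarters */
				chunk = max_ul(delta / 4, delta % 4);

			/* reclaim enforces a minimum batch of SWAP_CLUSTER_MAX */
			nr_reclaimed += max_ul(chunk, SWAP_CLUSTER_MAX);
			iters++;
		}
		printf("%-21s: %lu iterations\n", names[policy], iters);
	}
	return 0;
}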
  

Comments

Johannes Weiner Jan. 31, 2024, 5:50 p.m. UTC | #1
On Wed, Jan 31, 2024 at 04:24:41PM +0000, T.J. Mercier wrote:
> <snip>
> @@ -6977,7 +6977,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  			lru_add_drain_all();
>  
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> +					max((nr_to_reclaim - nr_reclaimed) / 4,
> +					    (nr_to_reclaim - nr_reclaimed) % 4),

I don't see why the % 4 is needed. It only kicks in when the delta
drops below 4, but try_to_free_mem_cgroup_pages() already has

		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),

so it looks like dead code.
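
As a quick sanity check of the small-delta arithmetic (throwaway
userspace code, not a kernel patch):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

int main(void)
{
	/* below SWAP_CLUSTER_MAX, the clamp hides the %4 term entirely */
	for (unsigned long delta = 1; delta <= 8; delta++)
		printf("delta=%lu: max(d/4, d%%4)=%lu, d/4=%lu -> reclaim uses %lu either way\n",
		       delta, max_ul(delta / 4, delta % 4), delta / 4,
		       max_ul(max_ul(delta / 4, delta % 4), SWAP_CLUSTER_MAX));
	return 0;
}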
  
T.J. Mercier Jan. 31, 2024, 6:01 p.m. UTC | #2
On Wed, Jan 31, 2024 at 9:51 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jan 31, 2024 at 04:24:41PM +0000, T.J. Mercier wrote:
> > <snip>
> > @@ -6977,7 +6977,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> >                       lru_add_drain_all();
> >
> >               reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > -                                     min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > +                                     max((nr_to_reclaim - nr_reclaimed) / 4,
> > +                                         (nr_to_reclaim - nr_reclaimed) % 4),
>
> I don't see why the % 4 is needed. It only kicks in when the delta
> drops below 4, but try_to_free_mem_cgroup_pages() already has
>
>                 .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>
> so it looks like dead code.

That's right, it's only there for when the integer division reaches
zero. I didn't want to assume anything about the implementation of
try_to_free_mem_cgroup_pages, but I can just remove it entirely if
you'd like.
  
Johannes Weiner Jan. 31, 2024, 8:12 p.m. UTC | #3
On Wed, Jan 31, 2024 at 10:01:27AM -0800, T.J. Mercier wrote:
> On Wed, Jan 31, 2024 at 9:51 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Jan 31, 2024 at 04:24:41PM +0000, T.J. Mercier wrote:
> > > <snip>
> > > @@ -6977,7 +6977,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > >                       lru_add_drain_all();
> > >
> > >               reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > -                                     min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > +                                     max((nr_to_reclaim - nr_reclaimed) / 4,
> > > +                                         (nr_to_reclaim - nr_reclaimed) % 4),
> >
> > I don't see why the % 4 is needed. It only kicks in when the delta
> > drops below 4, but try_to_free_mem_cgroup_pages() already has
> >
> >                 .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> >
> > so it looks like dead code.
> 
> That's right, it's only there for when the integer division reaches
> zero. I didn't want to assume anything about the implementation of
> try_to_free_mem_cgroup_pages, but I can just remove it entirely if
> you'd like.

What do others think?

We rely on the rounding up in a few other places and it's been doing
that for a decade. Maybe lampshade it for the benefit of the reader:

	/* Will converge on zero, but reclaim enforces a minimum */

but otherwise there is IMO no need to have defensive extra code.
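
If the modulo is dropped, the call site would presumably end up looking
like this (a sketch of the suggestion, not a posted patch):

	/* Will converge on zero, but reclaim enforces a minimum */
	reclaimed = try_to_free_mem_cgroup_pages(memcg,
				(nr_to_reclaim - nr_reclaimed) / 4,
				GFP_KERNEL, reclaim_options);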
  
Michal Koutný Feb. 1, 2024, 1:57 p.m. UTC | #4
Hello.

On Wed, Jan 31, 2024 at 04:24:41PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> +					max((nr_to_reclaim - nr_reclaimed) / 4,
> +					    (nr_to_reclaim - nr_reclaimed) % 4),

The 1/4 factor looks like magic. 

Commit 0388536ac291 says:
| In theory, the amount of reclaimed would be in [request, 2 * request).

Doesn't this suggest 1/2 as a better option? (I didn't pursue the
theory.)

Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
the formula gives arbitrary (unrelated to delta's magnitude) values.

Regards,
Michal
  
Johannes Weiner Feb. 1, 2024, 3:34 p.m. UTC | #5
On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
> Hello.
> 
> On Wed, Jan 31, 2024 at 04:24:41PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> >  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > +					max((nr_to_reclaim - nr_reclaimed) / 4,
> > +					    (nr_to_reclaim - nr_reclaimed) % 4),
> 
> The 1/4 factor looks like magic.

It's just cutting the work into quarters to balance throughput with
goal accuracy. It's no more or less magic than DEF_PRIORITY being 12,
or SWAP_CLUSTER_MAX being 32.

> Commit 0388536ac291 says:
> | In theory, the amount of reclaimed would be in [request, 2 * request).

Looking at the code, I'm not quite sure if this can be read this
literally. Efly might be able to elaborate, but we do a full loop of
all nodes and cgroups in the tree before checking nr_to_reclaimed, and
rely on priority level for granularity. So request size and complexity
of the cgroup tree play a role. I don't know where the exact factor
two would come from.

IMO it's more accurate to phrase it like this:

Reclaim tries to balance nr_to_reclaim fidelity with fairness across
nodes and cgroups over which the pages are spread. As such, the bigger
the request, the bigger the absolute overreclaim error. Historic
in-kernel users of reclaim have used fixed, small request batches to
approach an appropriate reclaim rate over time. When we reclaim a user
request of arbitrary size, use decaying batches to manage error while
maintaining reasonable throughput.

> Doesn't this suggest 1/2 as a better option? (I didn't pursue the
> theory.)

That was TJ's first suggestion as well, but as per above I suggested
quartering as a safer option.

> Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
> the formula gives arbitrary (unrelated to delta's magnitude) values.

try_to_free_mem_cgroup_pages() rounds up to SWAP_CLUSTER_MAX. So the
error margin is much higher at the smaller end of requests anyway.
But practically speaking, users care much less if you reclaim 32 pages
when 16 were requested than if you reclaim 2G when 1G was requested.
  
T.J. Mercier Feb. 1, 2024, 6:10 p.m. UTC | #6
On Thu, Feb 1, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
> > Hello.
> >
> > On Wed, Jan 31, 2024 at 04:24:41PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> > >             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > -                                   min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > +                                   max((nr_to_reclaim - nr_reclaimed) / 4,
> > > +                                       (nr_to_reclaim - nr_reclaimed) % 4),
> >
> > The 1/4 factor looks like magic.
>
> It's just cutting the work into quarters to balance throughput with
> goal accuracy. It's no more or less magic than DEF_PRIORITY being 12,
> or SWAP_CLUSTER_MAX being 32.

Using SWAP_CLUSTER_MAX is sort of like having a really large divisor
instead of 4 (or 1 like before).

I recorded the average number of iterations required to complete the
1G reclaim for the measurements I took and it looks like this:
pre-0388536ac291     : 1
post-0388536ac291    : 1814
(reclaim-reclaimed)/4: 17

Given the results with /4, I don't think the perf we get here is
particularly sensitive to the number we choose, but it's definitely a
tradeoff.

<snip>

> > Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
> > the formula gives arbitrary (unrelated to delta's magnitude) values.
>
> try_to_free_mem_cgroup_pages() rounds up to SWAP_CLUSTER_MAX. So the
> error margin is much higher at the smaller end of requests anyway.
> But practically speaking, users care much less if you reclaim 32 pages
> when 16 were requested than if you reclaim 2G when 1G was requested.

I like Johannes's suggestion of just a comment instead of the mod op.
It's easier to read, slightly less generated code, and even if we
didn't have the .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX) in
try_to_free_mem_cgroup_pages, memory_reclaim would still get very
close to the target before running out of nr_retries.
  
Efly Young Feb. 2, 2024, 5:02 a.m. UTC | #7
> On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
> > Hello.
> > 
> > On Wed, Jan 31, 2024 at 04:24:41PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> > >     reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > -         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > +         max((nr_to_reclaim - nr_reclaimed) / 4,
> > > +             (nr_to_reclaim - nr_reclaimed) % 4),
> > 
> > The 1/4 factor looks like magic.
> 
> It's just cutting the work into quarters to balance throughput with
> goal accuracy. It's no more or less magic than DEF_PRIORITY being 12,
> or SWAP_CLUSTER_MAX being 32.
> 
> > Commit 0388536ac291 says:
> > | In theory, the amount of reclaimed would be in [request, 2 * request).
> 
> Looking at the code, I'm not quite sure if this can be read this
> literally. Efly might be able to elaborate, but we do a full loop of
> all nodes and cgroups in the tree before checking nr_reclaimed, and
> rely on priority level for granularity. So request size and complexity
> of the cgroup tree play a role. I don't know where the exact factor
> two would come from.

I'm sorry that this conclusion may be arbitrary. It might only suit
my case. In my case, I traced it looping twice every time before checking
nr_reclaimed, and it reclaimed less than my request size (1G) every time.
So I thought the upper bound was 2 * request. But now it seems that this
is related to the cgroup tree I constructed, my system status, and my
request size (a relatively large chunk). There are many influencing
factors, so a specific upper bound is not accurate.

> IMO it's more accurate to phrase it like this:
> 
> Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> nodes and cgroups over which the pages are spread. As such, the bigger
> the request, the bigger the absolute overreclaim error. Historic
> in-kernel users of reclaim have used fixed, small request batches to
> approach an appropriate reclaim rate over time. When we reclaim a user
> request of arbitrary size, use decaying batches to manage error while
> maintaining reasonable throughput.
> 
> > Doesn't this suggest 1/2 as a better option? (I didn't pursue the
> > theory.)
> 
> That was TJ's first suggestion as well, but as per above I suggested
> quartering as a safer option.
> 
> > Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
> > the formula gives arbitrary (unrelated to delta's magnitude) values.
> 
> try_to_free_mem_cgroup_pages() rounds up to SWAP_CLUSTER_MAX. So the
> error margin is much higher at the smaller end of requests anyway.
> But practically speaking, users care much less if you reclaim 32 pages
> when 16 were requested than if you reclaim 2G when 1G was requested.

Yes, I agree completely that the bigger the request, the bigger the
absolute overreclaim error. The focus now is the tradeoff between
accurate reclaim and efficient reclaim. I think TJ's test results are
instructive.
  
Michal Koutný Feb. 2, 2024, 10:15 a.m. UTC | #8
On Fri, Feb 02, 2024 at 01:02:47PM +0800, Efly Young <yangyifei03@kuaishou.com> wrote:
> > Looking at the code, I'm not quite sure if this can be read this
> > literally. Efly might be able to elaborate, but we do a full loop of
> > all nodes and cgroups in the tree before checking nr_reclaimed, and
> > rely on priority level for granularity. So request size and complexity
> > of the cgroup tree play a role. I don't know where the exact factor
> > two would come from.
> 
> I'm sorry that this conclusion may be arbitrary. It might only suit
> my case. In my case, I traced it looping twice every time before checking
> nr_reclaimed, and it reclaimed less than my request size (1G) every time.
> So I thought the upper bound was 2 * request. But now it seems that this
> is related to the cgroup tree I constructed, my system status, and my
> request size (a relatively large chunk). There are many influencing
> factors, so a specific upper bound is not accurate.

Alright, thanks for the background.

> > IMO it's more accurate to phrase it like this:
> > 
> > Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> > nodes and cgroups over which the pages are spread. As such, the bigger
> > the request, the bigger the absolute overreclaim error. Historic
> > in-kernel users of reclaim have used fixed, small request batches to
> > approach an appropriate reclaim rate over time. When we reclaim a user
> > request of arbitrary size, use decaying batches to manage error while
> > maintaining reasonable throughput.

Hm, decay...
So shouldn't the formula be
  nr_pages = delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
where
  delta = nr_to_reclaim - nr_reclaimed
?
(So that convergence for smaller deltas is same like original- and other
reclaims while conservative factor is applied for effectivity of higher
user requests.)

Thanks,
Michal
  
T.J. Mercier Feb. 2, 2024, 6:22 p.m. UTC | #9
On Fri, Feb 2, 2024 at 2:15 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Feb 02, 2024 at 01:02:47PM +0800, Efly Young <yangyifei03@kuaishou.com> wrote:
> <snip>
>
> Alright, thanks for the background.
>
> > > IMO it's more accurate to phrase it like this:
> > >
> > > Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> > > nodes and cgroups over which the pages are spread. As such, the bigger
> > > the request, the bigger the absolute overreclaim error. Historic
> > > in-kernel users of reclaim have used fixed, small request batches to
> > > approach an appropriate reclaim rate over time. When we reclaim a user
> > > request of arbitrary size, use decaying batches to manage error while
> > > maintaining reasonable throughput.
>
> Hm, decay...
> So shouldn't the formula be
>   nr_pages = delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
> where
>   delta = nr_to_reclaim - nr_reclaimed
> ?
> (So that convergence for smaller deltas is same like original- and other
> reclaims while conservative factor is applied for effectivity of higher
> user requests.)

Tapering out at 32 instead of 4 doesn't make much difference in
practice because of how far off the actually reclaimed amount can be
from the request size. We're talking thousands of pages of error for a
request size of a few megs, and hundreds of pages of error for
requests less than 100 pages.

So all of these should be more or less equivalent:
delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
max((nr_to_reclaim - nr_reclaimed) / 4, (nr_to_reclaim - nr_reclaimed) % 4)
(nr_to_reclaim - nr_reclaimed) / 4 + 4
(nr_to_reclaim - nr_reclaimed) / 4

I was just trying to avoid putting in a 0 for the request size with the mod.
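
For concreteness, the raw request sizes from the four variants for a few
deltas (with SWAP_CLUSTER_MAX = 32; anything below 32 is then clamped up
to 32 inside try_to_free_mem_cgroup_pages, which is what makes the
small-delta differences moot):

delta     variant1  variant2  variant3  variant4
8                8         2         6         2
64              40        16        20        16
1024           280       256       260       256
262144       65560     65536     65540     65536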
  
Michal Koutný Feb. 2, 2024, 7:46 p.m. UTC | #10
On Fri, Feb 02, 2024 at 10:22:34AM -0800, "T.J. Mercier" <tjmercier@google.com> wrote:
> So all of these should be more or less equivalent:
> delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
> max((nr_to_reclaim - nr_reclaimed) / 4, (nr_to_reclaim - nr_reclaimed) % 4)
> (nr_to_reclaim - nr_reclaimed) / 4 + 4
> (nr_to_reclaim - nr_reclaimed) / 4
> 
> I was just trying to avoid putting in a 0 for the request size with the mod.

The third variant would be simpler then. Modulo looks weird.

Oh, and I just realized that try_to_free_mem_cgroup_pages() does
max(nr_pages, SWAP_CLUSTER_MAX). Then I'd vote for the fourth variant +
possible comment about harmless 0.
(I'm sorry if this was discussed before.)

Michal
  
T.J. Mercier Feb. 2, 2024, 9:42 p.m. UTC | #11
On Fri, Feb 2, 2024 at 11:46 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Feb 02, 2024 at 10:22:34AM -0800, "T.J. Mercier" <tjmercier@google.com> wrote:
> > So all of these should be more or less equivalent:
> > delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
> > max((nr_to_reclaim - nr_reclaimed) / 4, (nr_to_reclaim - nr_reclaimed) % 4)
> > (nr_to_reclaim - nr_reclaimed) / 4 + 4
> > (nr_to_reclaim - nr_reclaimed) / 4
> >
> > I was just trying to avoid putting in a 0 for the request size with the mod.
>
> The third variant would be simpler then. Modulo looks weird.
>
> Oh, and I just realized that try_to_free_mem_cgroup_pages() does
> max(nr_pages, SWAP_CLUSTER_MAX). Then I'd vote for the fourth variant +
> possible comment about harmless 0.
> (I'm sorry if this was discussed before.)
>
> Michal

Ok great, let's do that. Thanks for your input.
  

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46d8d02114cf..d68fb89eadd2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6977,7 +6977,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			lru_add_drain_all();
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
+					max((nr_to_reclaim - nr_reclaimed) / 4,
+					    (nr_to_reclaim - nr_reclaimed) % 4),
 					GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)
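
(For context, memory_reclaim() is the write handler for the cgroup v2
memory.reclaim file. A proactive reclaim request from userspace looks
roughly like the sketch below; the cgroup path is an example and depends
on where cgroup2 is mounted and how the hierarchy is named:)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* example path, matching the /uid_0 cgroup from the benchmarks */
	const char *path = "/sys/fs/cgroup/uid_0/memory.reclaim";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* ask the kernel to proactively reclaim 1G from this cgroup; the
	 * write fails with EAGAIN if retries run out short of the goal */
	if (write(fd, "1G", 2) < 0)
		perror("write");
	close(fd);
	return 0;
}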