[rcu,13/14] workqueue: Make queue_rcu_work() use call_rcu_flush()

Message ID 20221019225144.2500095-13-paulmck@kernel.org
State New
Headers
Series Lazy call_rcu() updates for v6.2 |

Commit Message

Paul E. McKenney Oct. 19, 2022, 10:51 p.m. UTC
  From: Uladzislau Rezki <urezki@gmail.com>

call_rcu() changes to save power will slow down RCU workqueue items
queued via queue_rcu_work(). This may not be an issue, however we cannot
assume that workqueue users are OK with long delays. Use
call_rcu_flush() API instead which reverts to the old behavior.

Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Joel Fernandes Oct. 24, 2022, 12:36 a.m. UTC | #1
Hello,

On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> From: Uladzislau Rezki <urezki@gmail.com>
>
> call_rcu() changes to save power will slow down RCU workqueue items
> queued via queue_rcu_work(). This may not be an issue, however we cannot
> assume that workqueue users are OK with long delays. Use
> call_rcu_flush() API instead which reverts to the old behavio

On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the
batching is much better if we can just keep it as call_rcu() instead
of call_rcu_flush().

Is there really any reason to keep it as call_rcu_flush() ?  If I
recall, the real reason Vlad's system was slowing down was because of
scsi and the queue_rcu_work() conversion was really a red herring.

Vlad, any thoughts?

thanks,

 - Joel

.
>
> Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  kernel/workqueue.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 7cd5f5e7e0a1b..b4b0e828b529e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
>
>         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
>                 rwork->wq = wq;
> -               call_rcu(&rwork->rcu, rcu_work_rcufn);
> +               call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
>                 return true;
>         }
>
> --
> 2.31.1.189.g2e36527f23
>
  
Paul E. McKenney Oct. 24, 2022, 3:15 a.m. UTC | #2
On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote:
> Hello,
> 
> On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > From: Uladzislau Rezki <urezki@gmail.com>
> >
> > call_rcu() changes to save power will slow down RCU workqueue items
> > queued via queue_rcu_work(). This may not be an issue, however we cannot
> > assume that workqueue users are OK with long delays. Use
> > call_rcu_flush() API instead which reverts to the old behavio
> 
> On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the
> batching is much better if we can just keep it as call_rcu() instead
> of call_rcu_flush().
> 
> Is there really any reason to keep it as call_rcu_flush() ?  If I
> recall, the real reason Vlad's system was slowing down was because of
> scsi and the queue_rcu_work() conversion was really a red herring.

There are less than 20 invocations of queue_rcu_work(), so it should
be possible look through each.  The low-risk approach is of course to
have queue_rcu_work() use call_rcu_flush().

The next approach might be to have a Kconfig option and/or kernel
boot parameter that allowed a per-system choice.

But it would not hurt to double-check on Android.

							Thanx, Paul

> Vlad, any thoughts?
> 
> thanks,
> 
>  - Joel
> 
> .
> >
> > Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  kernel/workqueue.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 7cd5f5e7e0a1b..b4b0e828b529e 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
> >
> >         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> >                 rwork->wq = wq;
> > -               call_rcu(&rwork->rcu, rcu_work_rcufn);
> > +               call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
> >                 return true;
> >         }
> >
> > --
> > 2.31.1.189.g2e36527f23
> >
  
Uladzislau Rezki Oct. 24, 2022, 10:49 a.m. UTC | #3
> On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote:
> > Hello,
> > 
> > On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > From: Uladzislau Rezki <urezki@gmail.com>
> > >
> > > call_rcu() changes to save power will slow down RCU workqueue items
> > > queued via queue_rcu_work(). This may not be an issue, however we cannot
> > > assume that workqueue users are OK with long delays. Use
> > > call_rcu_flush() API instead which reverts to the old behavio
> > 
> > On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the
> > batching is much better if we can just keep it as call_rcu() instead
> > of call_rcu_flush().
> > 
> > Is there really any reason to keep it as call_rcu_flush() ?  If I
> > recall, the real reason Vlad's system was slowing down was because of
> > scsi and the queue_rcu_work() conversion was really a red herring.
> 
<snip>
*** drivers/acpi/osl.c:
acpi_os_drop_map_ref[401]      queue_rcu_work(system_wq, &map->track.rwork);

*** drivers/gpu/drm/i915/gt/intel_execlists_submission.c:
virtual_context_destroy[3653]  queue_rcu_work(system_wq, &ve->rcu);

*** fs/aio.c:
free_ioctx_reqs[632]           queue_rcu_work(system_wq, &ctx->free_rwork);

*** fs/fs-writeback.c:
inode_switch_wbs[604]          queue_rcu_work(isw_wq, &isw->work);
cleanup_offline_cgwb[676]      queue_rcu_work(isw_wq, &isw->work);

*** include/linux/workqueue.h:
__printf[446]                  extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);

*** kernel/cgroup/cgroup.c:
css_release_work_fn[5253]      queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
css_create[5384]               queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);

*** kernel/rcu/tree.c:
kfree_rcu_monitor[3192]        queue_rcu_work(system_wq, &krwp->rcu_work);

*** net/core/skmsg.c:
sk_psock_drop[852]             queue_rcu_work(system_wq, &psock->rwork);

*** net/sched/act_ct.c:
tcf_ct_flow_table_put[355]     queue_rcu_work(act_ct_wq, &ct_ft->rwork);

*** net/sched/cls_api.c:
tcf_queue_work[225]            return queue_rcu_work(tc_filter_wq, rwork);
<snip>
There are 9 users of the queue_rcu_work() functions. I think there can be
a side effect if we keep it as lazy variant. Please note that i have not
checked all those users.

> There are less than 20 invocations of queue_rcu_work(), so it should
> be possible look through each.  The low-risk approach is of course to
> have queue_rcu_work() use call_rcu_flush().
> 
> The next approach might be to have a Kconfig option and/or kernel
> boot parameter that allowed a per-system choice.
> 
> But it would not hurt to double-check on Android.
> 
I did not see such noise but i will come back some data on 5.10 kernel
today.

> 
> > Vlad, any thoughts?
> > 
At least for the kvfree_rcu() i would like to keep the sync variant, because
we have the below patch that improves bathing:

<snip>
commit 51824b780b719c53113dc39e027fbf670dc66028
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Thu Jun 30 18:33:35 2022 +0200

    rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval

    Currently the monitor work is scheduled with a fixed interval of HZ/20,
    which is roughly 50 milliseconds. The drawback of this approach is
    low utilization of the 512 page slots in scenarios with infrequence
    kvfree_rcu() calls.  For example on an Android system:
<snip>

Apparently i see it in the "dev" branch only.

--
Uladzislau Rezki
  
Uladzislau Rezki Oct. 24, 2022, 12:23 p.m. UTC | #4
> > On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote:
> > > Hello,
> > > 
> > > On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > From: Uladzislau Rezki <urezki@gmail.com>
> > > >
> > > > call_rcu() changes to save power will slow down RCU workqueue items
> > > > queued via queue_rcu_work(). This may not be an issue, however we cannot
> > > > assume that workqueue users are OK with long delays. Use
> > > > call_rcu_flush() API instead which reverts to the old behavio
> > > 
> > > On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the
> > > batching is much better if we can just keep it as call_rcu() instead
> > > of call_rcu_flush().
> > > 
> > > Is there really any reason to keep it as call_rcu_flush() ?  If I
> > > recall, the real reason Vlad's system was slowing down was because of
> > > scsi and the queue_rcu_work() conversion was really a red herring.
> > 
> <snip>
> *** drivers/acpi/osl.c:
> acpi_os_drop_map_ref[401]      queue_rcu_work(system_wq, &map->track.rwork);
> 
> *** drivers/gpu/drm/i915/gt/intel_execlists_submission.c:
> virtual_context_destroy[3653]  queue_rcu_work(system_wq, &ve->rcu);
> 
> *** fs/aio.c:
> free_ioctx_reqs[632]           queue_rcu_work(system_wq, &ctx->free_rwork);
> 
> *** fs/fs-writeback.c:
> inode_switch_wbs[604]          queue_rcu_work(isw_wq, &isw->work);
> cleanup_offline_cgwb[676]      queue_rcu_work(isw_wq, &isw->work);
> 
> *** include/linux/workqueue.h:
> __printf[446]                  extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
> 
> *** kernel/cgroup/cgroup.c:
> css_release_work_fn[5253]      queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
> css_create[5384]               queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
> 
> *** kernel/rcu/tree.c:
> kfree_rcu_monitor[3192]        queue_rcu_work(system_wq, &krwp->rcu_work);
> 
> *** net/core/skmsg.c:
> sk_psock_drop[852]             queue_rcu_work(system_wq, &psock->rwork);
> 
> *** net/sched/act_ct.c:
> tcf_ct_flow_table_put[355]     queue_rcu_work(act_ct_wq, &ct_ft->rwork);
> 
> *** net/sched/cls_api.c:
> tcf_queue_work[225]            return queue_rcu_work(tc_filter_wq, rwork);
> <snip>
> There are 9 users of the queue_rcu_work() functions. I think there can be
> a side effect if we keep it as lazy variant. Please note that i have not
> checked all those users.
> 
> > There are less than 20 invocations of queue_rcu_work(), so it should
> > be possible look through each.  The low-risk approach is of course to
> > have queue_rcu_work() use call_rcu_flush().
> > 
> > The next approach might be to have a Kconfig option and/or kernel
> > boot parameter that allowed a per-system choice.
> > 
> > But it would not hurt to double-check on Android.
> > 
> I did not see such noise but i will come back some data on 5.10 kernel
> today.
> 
Home screen swipe:
<snip>
       <...>-15      [003] d..1   202.142205: rcu_batch_start: rcu_preempt CBs=105 bl=10
       <...>-55      [001] d..1   202.166174: rcu_batch_start: rcu_preempt CBs=135 bl=10
       <...>-26      [001] d..1   202.402182: rcu_batch_start: rcu_preempt CBs=221 bl=10
     rcuop/3-40      [003] d..1   202.650323: rcu_batch_start: rcu_preempt CBs=213 bl=10
     rcuop/3-40      [000] d..1   203.210537: rcu_batch_start: rcu_preempt CBs=90 bl=10
     rcuop/5-55      [001] d..1   204.675671: rcu_batch_start: rcu_preempt CBs=14 bl=10
     rcuop/2-33      [002] d..1   205.162229: rcu_batch_start: rcu_preempt CBs=649 bl=10
     rcuop/3-40      [000] d..1   205.418214: rcu_batch_start: rcu_preempt CBs=291 bl=10
     rcuop/3-40      [003] d..1   206.134204: rcu_batch_start: rcu_preempt CBs=174 bl=10
     rcuop/0-15      [003] d..1   206.726311: rcu_batch_start: rcu_preempt CBs=738 bl=10
     rcuop/1-26      [001] d..1   206.814168: rcu_batch_start: rcu_preempt CBs=865 bl=10
     rcuop/3-40      [003] d..1   207.278178: rcu_batch_start: rcu_preempt CBs=287 bl=10
     rcuop/1-26      [001] d..1   208.826279: rcu_batch_start: rcu_preempt CBs=506 bl=10
<snip>

An app launch:
<snip>
         rcuop/3-40      [002] d..1   322.118620: rcu_batch_start: rcu_preempt CBs=99 bl=10
         rcuop/4-48      [005] dn.1   322.454052: rcu_batch_start: rcu_preempt CBs=270 bl=10
         rcuop/5-55      [005] d..1   322.454109: rcu_batch_start: rcu_preempt CBs=91 bl=10
         rcuop/5-55      [007] d..1   322.470054: rcu_batch_start: rcu_preempt CBs=106 bl=10
         rcuop/6-62      [005] d..1   322.482120: rcu_batch_start: rcu_preempt CBs=231 bl=10
         rcuop/4-48      [001] d..1   322.494150: rcu_batch_start: rcu_preempt CBs=227 bl=10
           <...>-69      [002] d..1   322.502442: rcu_batch_start: rcu_preempt CBs=3350 bl=26
         rcuop/1-26      [001] d..1   322.646099: rcu_batch_start: rcu_preempt CBs=1685 bl=13
         rcuop/2-33      [001] d..1   322.670071: rcu_batch_start: rcu_preempt CBs=438 bl=10
         rcuop/1-26      [001] d..1   322.674120: rcu_batch_start: rcu_preempt CBs=18 bl=10
         rcuop/2-33      [003] d..1   322.690152: rcu_batch_start: rcu_preempt CBs=10 bl=10
         rcuop/1-26      [002] d..1   322.698104: rcu_batch_start: rcu_preempt CBs=10 bl=10
         rcuop/3-40      [002] d..1   322.706167: rcu_batch_start: rcu_preempt CBs=313 bl=10
         rcuop/2-33      [003] d..1   322.710075: rcu_batch_start: rcu_preempt CBs=15 bl=10
         rcuop/3-40      [002] d..1   322.742137: rcu_batch_start: rcu_preempt CBs=13 bl=10
         rcuop/5-55      [000] d..1   322.754270: rcu_batch_start: rcu_preempt CBs=157 bl=10
         rcuop/3-40      [000] d..1   322.762182: rcu_batch_start: rcu_preempt CBs=17 bl=10
         rcuop/2-33      [003] d..1   322.774088: rcu_batch_start: rcu_preempt CBs=38 bl=10
         rcuop/3-40      [000] d..1   322.778131: rcu_batch_start: rcu_preempt CBs=23 bl=10
         rcuop/1-26      [002] d..1   322.790105: rcu_batch_start: rcu_preempt CBs=33 bl=10
         rcuop/4-48      [001] d..1   322.798074: rcu_batch_start: rcu_preempt CBs=340 bl=10
         rcuop/2-33      [002] d..1   322.806158: rcu_batch_start: rcu_preempt CBs=18 bl=10
         rcuop/1-26      [002] d..1   322.814057: rcu_batch_start: rcu_preempt CBs=18 bl=10
         rcuop/0-15      [001] d..1   322.822476: rcu_batch_start: rcu_preempt CBs=333 bl=10
         rcuop/4-48      [003] d..1   322.830102: rcu_batch_start: rcu_preempt CBs=11 bl=10
         rcuop/2-33      [001] d..1   322.846109: rcu_batch_start: rcu_preempt CBs=80 bl=10
         rcuop/3-40      [001] d..1   322.854162: rcu_batch_start: rcu_preempt CBs=145 bl=10
         rcuop/4-48      [003] d..1   322.874129: rcu_batch_start: rcu_preempt CBs=21 bl=10
         rcuop/3-40      [001] d..1   322.878149: rcu_batch_start: rcu_preempt CBs=43 bl=10
         rcuop/3-40      [001] d..1   322.906273: rcu_batch_start: rcu_preempt CBs=10 bl=10
         rcuop/4-48      [001] d..1   322.918201: rcu_batch_start: rcu_preempt CBs=23 bl=10
         rcuop/2-33      [001] d..1   322.926212: rcu_batch_start: rcu_preempt CBs=86 bl=10
         rcuop/2-33      [001] d..1   322.946251: rcu_batch_start: rcu_preempt CBs=12 bl=10
         rcuop/5-55      [003] d..1   322.954482: rcu_batch_start: rcu_preempt CBs=70 bl=10
         rcuop/2-33      [003] d..1   322.978146: rcu_batch_start: rcu_preempt CBs=20 bl=10
         rcuop/1-26      [002] d..1   323.014290: rcu_batch_start: rcu_preempt CBs=230 bl=10
         rcuop/4-48      [001] d..1   323.026119: rcu_batch_start: rcu_preempt CBs=73 bl=10
         rcuop/5-55      [003] d..1   323.026175: rcu_batch_start: rcu_preempt CBs=94 bl=10
         rcuop/3-40      [001] d..1   323.035310: rcu_batch_start: rcu_preempt CBs=70 bl=10
         rcuop/0-15      [001] d..1   323.046231: rcu_batch_start: rcu_preempt CBs=165 bl=10
         rcuop/6-62      [005] d..1   323.066132: rcu_batch_start: rcu_preempt CBs=179 bl=10
         rcuop/1-26      [002] d..1   323.174202: rcu_batch_start: rcu_preempt CBs=61 bl=10
         rcuop/2-33      [003] d..1   323.190203: rcu_batch_start: rcu_preempt CBs=80 bl=10
         rcuop/3-40      [003] d..1   323.206210: rcu_batch_start: rcu_preempt CBs=84 bl=10
         rcuop/2-33      [003] d..1   323.226880: rcu_batch_start: rcu_preempt CBs=5 bl=10
<snip>

It is on Android with 5.10 kernel running. I do not see that queue_rcu_work() makes
some noise.

Joel Could you please post your batch_start trace point output where you see the noise?

--
Uladzislau Rezki
  
Joel Fernandes Oct. 24, 2022, 2:31 p.m. UTC | #5
On Mon, Oct 24, 2022 at 02:23:39PM +0200, Uladzislau Rezki wrote:
> > > On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote:
> > > > On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > From: Uladzislau Rezki <urezki@gmail.com>
> > > > >
> > > > > call_rcu() changes to save power will slow down RCU workqueue items
> > > > > queued via queue_rcu_work(). This may not be an issue, however we cannot
> > > > > assume that workqueue users are OK with long delays. Use
> > > > > call_rcu_flush() API instead which reverts to the old behavio
> > > > 
> > > > On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the
> > > > batching is much better if we can just keep it as call_rcu() instead
> > > > of call_rcu_flush().
> > > > 
> > > > Is there really any reason to keep it as call_rcu_flush() ?  If I
> > > > recall, the real reason Vlad's system was slowing down was because of
> > > > scsi and the queue_rcu_work() conversion was really a red herring.
> > > 
> > <snip>
> > *** drivers/acpi/osl.c:
> > acpi_os_drop_map_ref[401]      queue_rcu_work(system_wq, &map->track.rwork);
> > 
> > *** drivers/gpu/drm/i915/gt/intel_execlists_submission.c:
> > virtual_context_destroy[3653]  queue_rcu_work(system_wq, &ve->rcu);
> > 
> > *** fs/aio.c:
> > free_ioctx_reqs[632]           queue_rcu_work(system_wq, &ctx->free_rwork);
> > 
> > *** fs/fs-writeback.c:
> > inode_switch_wbs[604]          queue_rcu_work(isw_wq, &isw->work);
> > cleanup_offline_cgwb[676]      queue_rcu_work(isw_wq, &isw->work);
> > 
> > *** include/linux/workqueue.h:
> > __printf[446]                  extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
> > 
> > *** kernel/cgroup/cgroup.c:
> > css_release_work_fn[5253]      queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
> > css_create[5384]               queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
> > 
> > *** kernel/rcu/tree.c:
> > kfree_rcu_monitor[3192]        queue_rcu_work(system_wq, &krwp->rcu_work);
> > 
> > *** net/core/skmsg.c:
> > sk_psock_drop[852]             queue_rcu_work(system_wq, &psock->rwork);
> > 
> > *** net/sched/act_ct.c:
> > tcf_ct_flow_table_put[355]     queue_rcu_work(act_ct_wq, &ct_ft->rwork);
> > 
> > *** net/sched/cls_api.c:
> > tcf_queue_work[225]            return queue_rcu_work(tc_filter_wq, rwork);
> > <snip>
> > There are 9 users of the queue_rcu_work() functions. I think there can be
> > a side effect if we keep it as lazy variant. Please note that i have not
> > checked all those users.
> > 
> > > There are less than 20 invocations of queue_rcu_work(), so it should
> > > be possible look through each.  The low-risk approach is of course to
> > > have queue_rcu_work() use call_rcu_flush().

Yes, once I get to a device (tomorrow), I'll look more. Last I checked it was
kvfree_rcu() -- this was a few weeks/months ago though.

> > > The next approach might be to have a Kconfig option and/or kernel
> > > boot parameter that allowed a per-system choice.
> > > 
> > > But it would not hurt to double-check on Android.
> > > 
> > I did not see such noise but i will come back some data on 5.10 kernel
> > today.
> > 
> Home screen swipe:
> <snip>
>        <...>-15      [003] d..1   202.142205: rcu_batch_start: rcu_preempt CBs=105 bl=10
>        <...>-55      [001] d..1   202.166174: rcu_batch_start: rcu_preempt CBs=135 bl=10
>        <...>-26      [001] d..1   202.402182: rcu_batch_start: rcu_preempt CBs=221 bl=10
>      rcuop/3-40      [003] d..1   202.650323: rcu_batch_start: rcu_preempt CBs=213 bl=10
>      rcuop/3-40      [000] d..1   203.210537: rcu_batch_start: rcu_preempt CBs=90 bl=10
>      rcuop/5-55      [001] d..1   204.675671: rcu_batch_start: rcu_preempt CBs=14 bl=10
>      rcuop/2-33      [002] d..1   205.162229: rcu_batch_start: rcu_preempt CBs=649 bl=10
>      rcuop/3-40      [000] d..1   205.418214: rcu_batch_start: rcu_preempt CBs=291 bl=10
>      rcuop/3-40      [003] d..1   206.134204: rcu_batch_start: rcu_preempt CBs=174 bl=10
>      rcuop/0-15      [003] d..1   206.726311: rcu_batch_start: rcu_preempt CBs=738 bl=10
>      rcuop/1-26      [001] d..1   206.814168: rcu_batch_start: rcu_preempt CBs=865 bl=10
>      rcuop/3-40      [003] d..1   207.278178: rcu_batch_start: rcu_preempt CBs=287 bl=10
>      rcuop/1-26      [001] d..1   208.826279: rcu_batch_start: rcu_preempt CBs=506 bl=10
> <snip>

This looks fine to me, but..

> An app launch:
> <snip>
>          rcuop/3-40      [002] d..1   322.118620: rcu_batch_start: rcu_preempt CBs=99 bl=10
>          rcuop/4-48      [005] dn.1   322.454052: rcu_batch_start: rcu_preempt CBs=270 bl=10
>          rcuop/5-55      [005] d..1   322.454109: rcu_batch_start: rcu_preempt CBs=91 bl=10
>          rcuop/5-55      [007] d..1   322.470054: rcu_batch_start: rcu_preempt CBs=106 bl=10
>          rcuop/6-62      [005] d..1   322.482120: rcu_batch_start: rcu_preempt CBs=231 bl=10
>          rcuop/4-48      [001] d..1   322.494150: rcu_batch_start: rcu_preempt CBs=227 bl=10
>            <...>-69      [002] d..1   322.502442: rcu_batch_start: rcu_preempt CBs=3350 bl=26
>          rcuop/1-26      [001] d..1   322.646099: rcu_batch_start: rcu_preempt CBs=1685 bl=13
>          rcuop/2-33      [001] d..1   322.670071: rcu_batch_start: rcu_preempt CBs=438 bl=10
>          rcuop/1-26      [001] d..1   322.674120: rcu_batch_start: rcu_preempt CBs=18 bl=10
>          rcuop/2-33      [003] d..1   322.690152: rcu_batch_start: rcu_preempt CBs=10 bl=10
>          rcuop/1-26      [002] d..1   322.698104: rcu_batch_start: rcu_preempt CBs=10 bl=10
>          rcuop/3-40      [002] d..1   322.706167: rcu_batch_start: rcu_preempt CBs=313 bl=10
>          rcuop/2-33      [003] d..1   322.710075: rcu_batch_start: rcu_preempt CBs=15 bl=10

The above does not look fine to me (^^^) from a Lazu-RCU PoV.  Here, RCU
callbacks are being invoked every 10-20ms.  The batching I seek is of the
order of seconds, when the system is relatively idle.

Why is Lazy-RCU not in effect for app launch? IOW, which callback is causing
Lazy-RCU to not be lazy here?  Could it be queue_rcu_work()?  Whenever a
non-lazy callback is queued, all the lazy ones are 'promoted' to non-lazy.
That's why I am asking. Even if you queue one non-lazy callback at a high
enough frequency, the lazy ones will no longer giving you batching or the
benefits of laziness.

>          rcuop/3-40      [002] d..1   322.742137: rcu_batch_start: rcu_preempt CBs=13 bl=10
>          rcuop/5-55      [000] d..1   322.754270: rcu_batch_start: rcu_preempt CBs=157 bl=10
>          rcuop/3-40      [000] d..1   322.762182: rcu_batch_start: rcu_preempt CBs=17 bl=10
>          rcuop/2-33      [003] d..1   322.774088: rcu_batch_start: rcu_preempt CBs=38 bl=10
>          rcuop/3-40      [000] d..1   322.778131: rcu_batch_start: rcu_preempt CBs=23 bl=10
>          rcuop/1-26      [002] d..1   322.790105: rcu_batch_start: rcu_preempt CBs=33 bl=10
>          rcuop/4-48      [001] d..1   322.798074: rcu_batch_start: rcu_preempt CBs=340 bl=10
>          rcuop/2-33      [002] d..1   322.806158: rcu_batch_start: rcu_preempt CBs=18 bl=10
>          rcuop/1-26      [002] d..1   322.814057: rcu_batch_start: rcu_preempt CBs=18 bl=10
>          rcuop/0-15      [001] d..1   322.822476: rcu_batch_start: rcu_preempt CBs=333 bl=10
>          rcuop/4-48      [003] d..1   322.830102: rcu_batch_start: rcu_preempt CBs=11 bl=10
>          rcuop/2-33      [001] d..1   322.846109: rcu_batch_start: rcu_preempt CBs=80 bl=10
>          rcuop/3-40      [001] d..1   322.854162: rcu_batch_start: rcu_preempt CBs=145 bl=10
>          rcuop/4-48      [003] d..1   322.874129: rcu_batch_start: rcu_preempt CBs=21 bl=10
>          rcuop/3-40      [001] d..1   322.878149: rcu_batch_start: rcu_preempt CBs=43 bl=10
>          rcuop/3-40      [001] d..1   322.906273: rcu_batch_start: rcu_preempt CBs=10 bl=10
>          rcuop/4-48      [001] d..1   322.918201: rcu_batch_start: rcu_preempt CBs=23 bl=10
>          rcuop/2-33      [001] d..1   322.926212: rcu_batch_start: rcu_preempt CBs=86 bl=10
>          rcuop/2-33      [001] d..1   322.946251: rcu_batch_start: rcu_preempt CBs=12 bl=10
>          rcuop/5-55      [003] d..1   322.954482: rcu_batch_start: rcu_preempt CBs=70 bl=10
>          rcuop/2-33      [003] d..1   322.978146: rcu_batch_start: rcu_preempt CBs=20 bl=10
>          rcuop/1-26      [002] d..1   323.014290: rcu_batch_start: rcu_preempt CBs=230 bl=10
>          rcuop/4-48      [001] d..1   323.026119: rcu_batch_start: rcu_preempt CBs=73 bl=10
>          rcuop/5-55      [003] d..1   323.026175: rcu_batch_start: rcu_preempt CBs=94 bl=10
>          rcuop/3-40      [001] d..1   323.035310: rcu_batch_start: rcu_preempt CBs=70 bl=10
>          rcuop/0-15      [001] d..1   323.046231: rcu_batch_start: rcu_preempt CBs=165 bl=10
>          rcuop/6-62      [005] d..1   323.066132: rcu_batch_start: rcu_preempt CBs=179 bl=10
>          rcuop/1-26      [002] d..1   323.174202: rcu_batch_start: rcu_preempt CBs=61 bl=10
>          rcuop/2-33      [003] d..1   323.190203: rcu_batch_start: rcu_preempt CBs=80 bl=10
>          rcuop/3-40      [003] d..1   323.206210: rcu_batch_start: rcu_preempt CBs=84 bl=10
>          rcuop/2-33      [003] d..1   323.226880: rcu_batch_start: rcu_preempt CBs=5 bl=10

And for several seconds you have the same thing going ^^^.

> <snip>
> 
> It is on Android with 5.10 kernel running. I do not see that queue_rcu_work() makes
> some noise.

Your rcu_batch_start tracepoint output above does not really reveal much
information about which callbacks are lazy and which are not.
rcu_invoke_callback is better in the sense at least you have the name of the
callback and can take a guess.

> Joel Could you please post your batch_start trace point output where you see the noise?

Sure, I'll do that once I get to a device.

thanks,

 - Joel
  
Paul E. McKenney Oct. 24, 2022, 3:39 p.m. UTC | #6
On Mon, Oct 24, 2022 at 02:31:15PM +0000, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 02:23:39PM +0200, Uladzislau Rezki wrote:
> > > > On Sun, Oct 23, 2022 at 08:36:00PM -0400, Joel Fernandes wrote:
> > > > > On Wed, Oct 19, 2022 at 6:51 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > >
> > > > > > From: Uladzislau Rezki <urezki@gmail.com>
> > > > > >
> > > > > > call_rcu() changes to save power will slow down RCU workqueue items
> > > > > > queued via queue_rcu_work(). This may not be an issue, however we cannot
> > > > > > assume that workqueue users are OK with long delays. Use
> > > > > > call_rcu_flush() API instead which reverts to the old behavio
> > > > > 
> > > > > On ChromeOS, I can see that queue_rcu_work() is pretty noisy and the
> > > > > batching is much better if we can just keep it as call_rcu() instead
> > > > > of call_rcu_flush().
> > > > > 
> > > > > Is there really any reason to keep it as call_rcu_flush() ?  If I
> > > > > recall, the real reason Vlad's system was slowing down was because of
> > > > > scsi and the queue_rcu_work() conversion was really a red herring.
> > > > 
> > > <snip>
> > > *** drivers/acpi/osl.c:
> > > acpi_os_drop_map_ref[401]      queue_rcu_work(system_wq, &map->track.rwork);
> > > 
> > > *** drivers/gpu/drm/i915/gt/intel_execlists_submission.c:
> > > virtual_context_destroy[3653]  queue_rcu_work(system_wq, &ve->rcu);
> > > 
> > > *** fs/aio.c:
> > > free_ioctx_reqs[632]           queue_rcu_work(system_wq, &ctx->free_rwork);
> > > 
> > > *** fs/fs-writeback.c:
> > > inode_switch_wbs[604]          queue_rcu_work(isw_wq, &isw->work);
> > > cleanup_offline_cgwb[676]      queue_rcu_work(isw_wq, &isw->work);
> > > 
> > > *** include/linux/workqueue.h:
> > > __printf[446]                  extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
> > > 
> > > *** kernel/cgroup/cgroup.c:
> > > css_release_work_fn[5253]      queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
> > > css_create[5384]               queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
> > > 
> > > *** kernel/rcu/tree.c:
> > > kfree_rcu_monitor[3192]        queue_rcu_work(system_wq, &krwp->rcu_work);
> > > 
> > > *** net/core/skmsg.c:
> > > sk_psock_drop[852]             queue_rcu_work(system_wq, &psock->rwork);
> > > 
> > > *** net/sched/act_ct.c:
> > > tcf_ct_flow_table_put[355]     queue_rcu_work(act_ct_wq, &ct_ft->rwork);
> > > 
> > > *** net/sched/cls_api.c:
> > > tcf_queue_work[225]            return queue_rcu_work(tc_filter_wq, rwork);
> > > <snip>
> > > There are 9 users of the queue_rcu_work() functions. I think there can be
> > > a side effect if we keep it as lazy variant. Please note that i have not
> > > checked all those users.
> > > 
> > > > There are less than 20 invocations of queue_rcu_work(), so it should
> > > > be possible look through each.  The low-risk approach is of course to
> > > > have queue_rcu_work() use call_rcu_flush().
> 
> Yes, once I get to a device (tomorrow), I'll look more. Last I checked it was
> kvfree_rcu() -- this was a few weeks/months ago though.
> 
> > > > The next approach might be to have a Kconfig option and/or kernel
> > > > boot parameter that allowed a per-system choice.
> > > > 
> > > > But it would not hurt to double-check on Android.
> > > > 
> > > I did not see such noise but i will come back some data on 5.10 kernel
> > > today.
> > > 
> > Home screen swipe:
> > <snip>
> >        <...>-15      [003] d..1   202.142205: rcu_batch_start: rcu_preempt CBs=105 bl=10
> >        <...>-55      [001] d..1   202.166174: rcu_batch_start: rcu_preempt CBs=135 bl=10
> >        <...>-26      [001] d..1   202.402182: rcu_batch_start: rcu_preempt CBs=221 bl=10
> >      rcuop/3-40      [003] d..1   202.650323: rcu_batch_start: rcu_preempt CBs=213 bl=10
> >      rcuop/3-40      [000] d..1   203.210537: rcu_batch_start: rcu_preempt CBs=90 bl=10
> >      rcuop/5-55      [001] d..1   204.675671: rcu_batch_start: rcu_preempt CBs=14 bl=10
> >      rcuop/2-33      [002] d..1   205.162229: rcu_batch_start: rcu_preempt CBs=649 bl=10
> >      rcuop/3-40      [000] d..1   205.418214: rcu_batch_start: rcu_preempt CBs=291 bl=10
> >      rcuop/3-40      [003] d..1   206.134204: rcu_batch_start: rcu_preempt CBs=174 bl=10
> >      rcuop/0-15      [003] d..1   206.726311: rcu_batch_start: rcu_preempt CBs=738 bl=10
> >      rcuop/1-26      [001] d..1   206.814168: rcu_batch_start: rcu_preempt CBs=865 bl=10
> >      rcuop/3-40      [003] d..1   207.278178: rcu_batch_start: rcu_preempt CBs=287 bl=10
> >      rcuop/1-26      [001] d..1   208.826279: rcu_batch_start: rcu_preempt CBs=506 bl=10
> > <snip>
> 
> This looks fine to me, but..
> 
> > An app launch:
> > <snip>
> >          rcuop/3-40      [002] d..1   322.118620: rcu_batch_start: rcu_preempt CBs=99 bl=10
> >          rcuop/4-48      [005] dn.1   322.454052: rcu_batch_start: rcu_preempt CBs=270 bl=10
> >          rcuop/5-55      [005] d..1   322.454109: rcu_batch_start: rcu_preempt CBs=91 bl=10
> >          rcuop/5-55      [007] d..1   322.470054: rcu_batch_start: rcu_preempt CBs=106 bl=10
> >          rcuop/6-62      [005] d..1   322.482120: rcu_batch_start: rcu_preempt CBs=231 bl=10
> >          rcuop/4-48      [001] d..1   322.494150: rcu_batch_start: rcu_preempt CBs=227 bl=10
> >            <...>-69      [002] d..1   322.502442: rcu_batch_start: rcu_preempt CBs=3350 bl=26
> >          rcuop/1-26      [001] d..1   322.646099: rcu_batch_start: rcu_preempt CBs=1685 bl=13
> >          rcuop/2-33      [001] d..1   322.670071: rcu_batch_start: rcu_preempt CBs=438 bl=10
> >          rcuop/1-26      [001] d..1   322.674120: rcu_batch_start: rcu_preempt CBs=18 bl=10
> >          rcuop/2-33      [003] d..1   322.690152: rcu_batch_start: rcu_preempt CBs=10 bl=10
> >          rcuop/1-26      [002] d..1   322.698104: rcu_batch_start: rcu_preempt CBs=10 bl=10
> >          rcuop/3-40      [002] d..1   322.706167: rcu_batch_start: rcu_preempt CBs=313 bl=10
> >          rcuop/2-33      [003] d..1   322.710075: rcu_batch_start: rcu_preempt CBs=15 bl=10
> 
> The above does not look fine to me (^^^) from a Lazu-RCU PoV.  Here, RCU
> callbacks are being invoked every 10-20ms.  The batching I seek is of the
> order of seconds, when the system is relatively idle.
> 
> Why is Lazy-RCU not in effect for app launch? IOW, which callback is causing
> Lazy-RCU to not be lazy here?  Could it be queue_rcu_work()?  Whenever a
> non-lazy callback is queued, all the lazy ones are 'promoted' to non-lazy.
> That's why I am asking. Even if you queue one non-lazy callback at a high
> enough frequency, the lazy ones will no longer giving you batching or the
> benefits of laziness.
> 
> >          rcuop/3-40      [002] d..1   322.742137: rcu_batch_start: rcu_preempt CBs=13 bl=10
> >          rcuop/5-55      [000] d..1   322.754270: rcu_batch_start: rcu_preempt CBs=157 bl=10
> >          rcuop/3-40      [000] d..1   322.762182: rcu_batch_start: rcu_preempt CBs=17 bl=10
> >          rcuop/2-33      [003] d..1   322.774088: rcu_batch_start: rcu_preempt CBs=38 bl=10
> >          rcuop/3-40      [000] d..1   322.778131: rcu_batch_start: rcu_preempt CBs=23 bl=10
> >          rcuop/1-26      [002] d..1   322.790105: rcu_batch_start: rcu_preempt CBs=33 bl=10
> >          rcuop/4-48      [001] d..1   322.798074: rcu_batch_start: rcu_preempt CBs=340 bl=10
> >          rcuop/2-33      [002] d..1   322.806158: rcu_batch_start: rcu_preempt CBs=18 bl=10
> >          rcuop/1-26      [002] d..1   322.814057: rcu_batch_start: rcu_preempt CBs=18 bl=10
> >          rcuop/0-15      [001] d..1   322.822476: rcu_batch_start: rcu_preempt CBs=333 bl=10
> >          rcuop/4-48      [003] d..1   322.830102: rcu_batch_start: rcu_preempt CBs=11 bl=10
> >          rcuop/2-33      [001] d..1   322.846109: rcu_batch_start: rcu_preempt CBs=80 bl=10
> >          rcuop/3-40      [001] d..1   322.854162: rcu_batch_start: rcu_preempt CBs=145 bl=10
> >          rcuop/4-48      [003] d..1   322.874129: rcu_batch_start: rcu_preempt CBs=21 bl=10
> >          rcuop/3-40      [001] d..1   322.878149: rcu_batch_start: rcu_preempt CBs=43 bl=10
> >          rcuop/3-40      [001] d..1   322.906273: rcu_batch_start: rcu_preempt CBs=10 bl=10
> >          rcuop/4-48      [001] d..1   322.918201: rcu_batch_start: rcu_preempt CBs=23 bl=10
> >          rcuop/2-33      [001] d..1   322.926212: rcu_batch_start: rcu_preempt CBs=86 bl=10
> >          rcuop/2-33      [001] d..1   322.946251: rcu_batch_start: rcu_preempt CBs=12 bl=10
> >          rcuop/5-55      [003] d..1   322.954482: rcu_batch_start: rcu_preempt CBs=70 bl=10
> >          rcuop/2-33      [003] d..1   322.978146: rcu_batch_start: rcu_preempt CBs=20 bl=10
> >          rcuop/1-26      [002] d..1   323.014290: rcu_batch_start: rcu_preempt CBs=230 bl=10
> >          rcuop/4-48      [001] d..1   323.026119: rcu_batch_start: rcu_preempt CBs=73 bl=10
> >          rcuop/5-55      [003] d..1   323.026175: rcu_batch_start: rcu_preempt CBs=94 bl=10
> >          rcuop/3-40      [001] d..1   323.035310: rcu_batch_start: rcu_preempt CBs=70 bl=10
> >          rcuop/0-15      [001] d..1   323.046231: rcu_batch_start: rcu_preempt CBs=165 bl=10
> >          rcuop/6-62      [005] d..1   323.066132: rcu_batch_start: rcu_preempt CBs=179 bl=10
> >          rcuop/1-26      [002] d..1   323.174202: rcu_batch_start: rcu_preempt CBs=61 bl=10
> >          rcuop/2-33      [003] d..1   323.190203: rcu_batch_start: rcu_preempt CBs=80 bl=10
> >          rcuop/3-40      [003] d..1   323.206210: rcu_batch_start: rcu_preempt CBs=84 bl=10
> >          rcuop/2-33      [003] d..1   323.226880: rcu_batch_start: rcu_preempt CBs=5 bl=10
> 
> And for several seconds you have the same thing going ^^^.
> 
> > <snip>
> > 
> > It is on Android with 5.10 kernel running. I do not see that queue_rcu_work() makes
> > some noise.
> 
> Your rcu_batch_start tracepoint output above does not really reveal much
> information about which callbacks are lazy and which are not.
> rcu_invoke_callback is better in the sense at least you have the name of the
> callback and can take a guess.
> 
> > Joel Could you please post your batch_start trace point output where you see the noise?
> 
> Sure, I'll do that once I get to a device.

You guys might need to agree on the definition of "good" here.  Or maybe
understand the differences in your respective platforms' definitions of
"good".  ;-)

							Thanx, Paul
  
Uladzislau Rezki Oct. 24, 2022, 4:25 p.m. UTC | #7
>
> You guys might need to agree on the definition of "good" here.  Or maybe
> understand the differences in your respective platforms' definitions of
> "good".  ;-)
>
Indeed. Bad is when once per-millisecond infinitely :) At least in such use
workload a can detect a power delta and power gain. Anyway, below is a new
trace where i do not use "flush" variant for the kvfree_rcu():

<snip>
1. Home screen swipe:
         rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
         rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
         rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
         rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
         rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
         rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
         rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
         rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
         rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
2. App launches:
         rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
         rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
         rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
         rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
         rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
         rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
         rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
         rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
         rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
           <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
         rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
           <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
           <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
         rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
           <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
         rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
         rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
         rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
         rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
         rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
         rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
         rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
         rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
         rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
         rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
         rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
<snip>

it is much more better. But. As i wrote earlier there is a patch that i have submitted
some time ago improving kvfree_rcu() batching:

<snip>
commit 51824b780b719c53113dc39e027fbf670dc66028
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Thu Jun 30 18:33:35 2022 +0200

    rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval

    Currently the monitor work is scheduled with a fixed interval of HZ/20,
    which is roughly 50 milliseconds. The drawback of this approach is
    low utilization of the 512 page slots in scenarios with infrequence
    kvfree_rcu() calls.  For example on an Android system:
<snip>

The trace that i posted was taken without it.

--
Uladzislau Rezki
  
Paul E. McKenney Oct. 24, 2022, 4:48 p.m. UTC | #8
On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> >
> > You guys might need to agree on the definition of "good" here.  Or maybe
> > understand the differences in your respective platforms' definitions of
> > "good".  ;-)
> >
> Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> workload a can detect a power delta and power gain. Anyway, below is a new
> trace where i do not use "flush" variant for the kvfree_rcu():
> 
> <snip>
> 1. Home screen swipe:
>          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
>          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
>          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
>          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
>          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
>          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
>          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
>          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
>          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> 2. App launches:
>          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
>          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
>          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
>          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
>          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
>          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
>          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
>          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
>          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
>            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
>          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
>            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
>            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
>          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
>            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
>          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
>          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
>          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
>          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
>          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
>          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
>          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
>          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
>          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
>          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
>          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> <snip>
> 
> it is much more better. But. As i wrote earlier there is a patch that i have submitted
> some time ago improving kvfree_rcu() batching:
> 
> <snip>
> commit 51824b780b719c53113dc39e027fbf670dc66028
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Thu Jun 30 18:33:35 2022 +0200
> 
>     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> 
>     Currently the monitor work is scheduled with a fixed interval of HZ/20,
>     which is roughly 50 milliseconds. The drawback of this approach is
>     low utilization of the 512 page slots in scenarios with infrequence
>     kvfree_rcu() calls.  For example on an Android system:
> <snip>
> 
> The trace that i posted was taken without it.

And if I am not getting too confused, that patch is now in mainline.
So it does make sense to rely on it, then.  ;-)

							Thanx, Paul
  
Joel Fernandes Oct. 24, 2022, 4:54 p.m. UTC | #9
On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> >
> > You guys might need to agree on the definition of "good" here.  Or maybe
> > understand the differences in your respective platforms' definitions of
> > "good".  ;-)
> >
> Indeed. Bad is when once per-millisecond infinitely :) At least in such use

To me once per-ms is really bad, and once per 20ms indefinitely is also not
ideal ;-). Just to give you a sense of why I feel this, I see the RCU thread
wake ups that periodically happen can disturb CPUIdle.

The act of queuing Callback + gp delay + rcu threads running is enough to
disrupt overlaps between CPUidle time and the gp delay. Further the idle
governor will refrain from entering deeper CPUidle states because it will see
timers queued in the near future to wake up the RCU grace-period kthreads.

> workload a can detect a power delta and power gain. Anyway, below is a new
> trace where i do not use "flush" variant for the kvfree_rcu():
> 
> <snip>
> 1. Home screen swipe:
>          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
>          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
>          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
>          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
>          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
>          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
>          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
>          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
>          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> 2. App launches:
>          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
>          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
>          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
>          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
>          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
>          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
>          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
>          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
>          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
>            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
>          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
>            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
>            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
>          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
>            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
>          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
>          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
>          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
>          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
>          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
>          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
>          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
>          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
>          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
>          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
>          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> <snip>
> 
> it is much more better. But. As i wrote earlier there is a patch that i have submitted
> some time ago improving kvfree_rcu() batching:

Yes it seems much better than your last traces! I'd propose to drop this
patch because as you show, it effects not only yours but ChromeOS. It appears
kvfree_rcu() use of queue_rcu_work() is a perfect candidate for call_rcu()
batching because it is purely driven by memory pressure. And we have a
shrinker for lazy-RCU as well.

For non-kvfree uses, we can introduce a queue_rcu_work_flush() if need-be.

What do you think?

thanks,

 - Joel
  
Uladzislau Rezki Oct. 24, 2022, 4:55 p.m. UTC | #10
On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > >
> > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > understand the differences in your respective platforms' definitions of
> > > "good".  ;-)
> > >
> > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > workload a can detect a power delta and power gain. Anyway, below is a new
> > trace where i do not use "flush" variant for the kvfree_rcu():
> > 
> > <snip>
> > 1. Home screen swipe:
> >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > 2. App launches:
> >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > <snip>
> > 
> > it is much more better. But. As i wrote earlier there is a patch that i have submitted
> > some time ago improving kvfree_rcu() batching:
> > 
> > <snip>
> > commit 51824b780b719c53113dc39e027fbf670dc66028
> > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Date:   Thu Jun 30 18:33:35 2022 +0200
> > 
> >     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> > 
> >     Currently the monitor work is scheduled with a fixed interval of HZ/20,
> >     which is roughly 50 milliseconds. The drawback of this approach is
> >     low utilization of the 512 page slots in scenarios with infrequence
> >     kvfree_rcu() calls.  For example on an Android system:
> > <snip>
> > 
> > The trace that i posted was taken without it.
> 
> And if I am not getting too confused, that patch is now in mainline.
> So it does make sense to rely on it, then.  ;-)
> 
Right.

urezki@pc638:~/data/raid0/coding/linux.git$ git tag --contains
51824b780b719c53113dc39e027fbf670dc66028
v6.1-rc1
v6.1-rc2
urezki@pc638:~/data/raid0/coding/linux.git$

--
Uladzislau Rezki
  
Uladzislau Rezki Oct. 24, 2022, 5:08 p.m. UTC | #11
On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > >
> > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > understand the differences in your respective platforms' definitions of
> > > > "good".  ;-)
> > > >
> > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > 
> > > <snip>
> > > 1. Home screen swipe:
> > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > > 2. App launches:
> > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > > <snip>
> > > 
> > > it is much more better. But. As i wrote earlier there is a patch that i have submitted
> > > some time ago improving kvfree_rcu() batching:
> > > 
> > > <snip>
> > > commit 51824b780b719c53113dc39e027fbf670dc66028
> > > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Date:   Thu Jun 30 18:33:35 2022 +0200
> > > 
> > >     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> > > 
> > >     Currently the monitor work is scheduled with a fixed interval of HZ/20,
> > >     which is roughly 50 milliseconds. The drawback of this approach is
> > >     low utilization of the 512 page slots in scenarios with infrequence
> > >     kvfree_rcu() calls.  For example on an Android system:
> > > <snip>
> > > 
> > > The trace that i posted was taken without it.
> > 
> > And if I am not getting too confused, that patch is now in mainline.
> > So it does make sense to rely on it, then.  ;-)
> > 
> Right.
> 
> urezki@pc638:~/data/raid0/coding/linux.git$ git tag --contains
> 51824b780b719c53113dc39e027fbf670dc66028
> v6.1-rc1
> v6.1-rc2
> urezki@pc638:~/data/raid0/coding/linux.git$
> 
Just in case: 5.10 + "rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval":

<snip>
1. Home screen swipe:
         rcuop/3-40      [003] d..1    94.202849: rcu_batch_start: rcu_preempt CBs=664 bl=10
         rcuop/4-48      [001] d..1    95.999352: rcu_batch_start: rcu_preempt CBs=252 bl=10
         rcuop/6-62      [002] d..1    97.534875: rcu_batch_start: rcu_preempt CBs=152 bl=10
         rcuop/5-55      [003] d..1    98.042912: rcu_batch_start: rcu_preempt CBs=189 bl=10
         rcuop/0-15      [002] d..1    98.306769: rcu_batch_start: rcu_preempt CBs=1457 bl=11
         rcuop/1-26      [000] d..1    99.582931: rcu_batch_start: rcu_preempt CBs=2115 bl=16
         rcuop/2-33      [003] d..1    99.582935: rcu_batch_start: rcu_preempt CBs=2019 bl=15
         rcuop/3-40      [001] d..1    99.838885: rcu_batch_start: rcu_preempt CBs=1168 bl=10
         rcuop/1-26      [000] d..1   100.603496: rcu_batch_start: rcu_preempt CBs=168 bl=10
2. Apps launches:
         rcuop/4-48      [007] d..1   102.910580: rcu_batch_start: rcu_preempt CBs=1150 bl=10
         rcuop/6-62      [007] d..1   102.910682: rcu_batch_start: rcu_preempt CBs=1001 bl=10
         rcuop/5-55      [007] d..1   103.166607: rcu_batch_start: rcu_preempt CBs=939 bl=10
         rcuop/0-15      [007] d..1   104.450598: rcu_batch_start: rcu_preempt CBs=1694 bl=13
         rcuop/5-55      [006] d..1   104.478640: rcu_batch_start: rcu_preempt CBs=3125 bl=24
         rcuop/3-40      [007] d..1   104.958565: rcu_batch_start: rcu_preempt CBs=1108 bl=10
         rcuop/7-69      [007] d..1   106.238634: rcu_batch_start: rcu_preempt CBs=10275 bl=80
         rcuop/4-48      [007] d..1   107.258586: rcu_batch_start: rcu_preempt CBs=8142 bl=63
         rcuop/7-69      [007] d..1   107.260769: rcu_batch_start: rcu_preempt CBs=1880 bl=14
         rcuop/2-33      [007] d..1   107.526638: rcu_batch_start: rcu_preempt CBs=1968 bl=15
         rcuop/1-26      [007] d..1   107.542612: rcu_batch_start: rcu_preempt CBs=1796 bl=14
         rcuop/5-55      [007] d..1   108.286588: rcu_batch_start: rcu_preempt CBs=3547 bl=27
         rcuop/6-62      [007] d..1   108.287639: rcu_batch_start: rcu_preempt CBs=5820 bl=45
         rcuop/7-69      [007] d..1   108.290548: rcu_batch_start: rcu_preempt CBs=2430 bl=18
         rcuop/0-15      [000] d..1   109.826843: rcu_batch_start: rcu_preempt CBs=2282 bl=17
         rcuop/3-40      [002] d..1   110.595455: rcu_batch_start: rcu_preempt CBs=1960 bl=15
           <...>-48      [005] d..1   112.390702: rcu_batch_start: rcu_preempt CBs=5143 bl=40
         rcuop/7-69      [004] d..1   112.402607: rcu_batch_start: rcu_preempt CBs=3379 bl=26
         rcuop/2-33      [005] d..1   112.638614: rcu_batch_start: rcu_preempt CBs=3223 bl=25
         rcuop/1-26      [004] d..1   112.638617: rcu_batch_start: rcu_preempt CBs=3026 bl=23
         rcuop/5-55      [007] d..1   113.402581: rcu_batch_start: rcu_preempt CBs=7251 bl=56
         rcuop/6-62      [007] d..1   113.658582: rcu_batch_start: rcu_preempt CBs=7035 bl=54
<snip>

--
Uladzislau Rezki
  
Joel Fernandes Oct. 24, 2022, 5:20 p.m. UTC | #12
On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > >
> > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > understand the differences in your respective platforms' definitions of
> > > > > "good".  ;-)
> > > > >
> > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > >
> > > > <snip>
> > > > 1. Home screen swipe:
> > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23

> > > > 2. App launches:
> > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10

Definitely better, but I'd still ask why not just rely on the lazy
batching that we now have, since it is a memory pressure related
usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
disturb the lazy-RCU batching by queuing these "free memory" CBs; and
instead keep your improved kvfree_rcu() batching only for
!CONFIG_RCU_LAZY.
  
Paul E. McKenney Oct. 24, 2022, 5:35 p.m. UTC | #13
On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > >
> > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > understand the differences in your respective platforms' definitions of
> > > > > > "good".  ;-)
> > > > > >
> > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > >
> > > > > <snip>
> > > > > 1. Home screen swipe:
> > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> 
> > > > > 2. App launches:
> > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> 
> Definitely better, but I'd still ask why not just rely on the lazy
> batching that we now have, since it is a memory pressure related
> usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> instead keep your improved kvfree_rcu() batching only for
> !CONFIG_RCU_LAZY.

Given that making the kvfree_rcu()-level batching conditional on
CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
keeping the kvfree_rcu-level batching unconditionally?

							Thanx, Paul
  
Uladzislau Rezki Oct. 24, 2022, 5:40 p.m. UTC | #14
On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > >
> > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > understand the differences in your respective platforms' definitions of
> > > > > > "good".  ;-)
> > > > > >
> > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > >
> > > > > <snip>
> > > > > 1. Home screen swipe:
> > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> 
> > > > > 2. App launches:
> > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> 
> Definitely better, but I'd still ask why not just rely on the lazy
> batching that we now have, since it is a memory pressure related
> usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> instead keep your improved kvfree_rcu() batching only for
> !CONFIG_RCU_LAZY.
>

1. Double-batching?

The kvfree_rcu() interface itself keeps track of when to reclaim:
  a) when a page is full;
  b) when i high storm of freeing over rcu;
  c) when a low memory condition.

such control stays inside the kvfree_rcu(). Converting it to lazy
variant:
  a) lose the control, what will become as a problem;
  b) nothing is improved.

2. Converting the queue_rcu_work() to lazy variant breaks a humanity
interpretation when a queued work is supposed to be run. People do not
expect seconds when they queue the work. Same as in the kvfree_rcu()
we do not expect it we even used a high_prio queue in the beginning.

There are ~10 users who queue the work and they did not expect it to
be run in 10 seconds when they wrote the code.

3. With the "rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval" there
is no sense in doing it. Same data in active and idle use cases.

--
Uladzislau Rezki
  
Joel Fernandes Oct. 24, 2022, 8:08 p.m. UTC | #15
On Mon, Oct 24, 2022 at 1:40 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > > >
> > > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > > understand the differences in your respective platforms' definitions of
> > > > > > > "good".  ;-)
> > > > > > >
> > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > > >
> > > > > > <snip>
> > > > > > 1. Home screen swipe:
> > > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> >
> > > > > > 2. App launches:
> > > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> >
> > Definitely better, but I'd still ask why not just rely on the lazy
> > batching that we now have, since it is a memory pressure related
> > usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > instead keep your improved kvfree_rcu() batching only for
> > !CONFIG_RCU_LAZY.
> >
>
> 1. Double-batching?
>
> The kvfree_rcu() interface itself keeps track of when to reclaim:
>   a) when a page is full;
>   b) when i high storm of freeing over rcu;
>   c) when a low memory condition.
>
> such control stays inside the kvfree_rcu(). Converting it to lazy
> variant:
>   a) lose the control, what will become as a problem;
>   b) nothing is improved.

AFAICS, the only thing being changed is when you are giving memory
back to the system. So you will be holding on to memory a bit longer.
And there's shrinkers that are already flushing those. I don't think
the users of kvfree_rcu() want to free memory instantly. If there is
such usecase, please share it.

> 2. Converting the queue_rcu_work() to lazy variant breaks a humanity
> interpretation when a queued work is supposed to be run. People do not
> expect seconds when they queue the work.

Which people? ;)

> Same as in the kvfree_rcu()
> we do not expect it we even used a high_prio queue in the beginning.
> There are ~10 users who queue the work and they did not expect it to
> be run in 10 seconds when they wrote the code.

That's a bit of a misinterpretation of what I'm saying. A variant
queue_rcu_work_flush() can be added for those users (such as ones that
are not freeing memory).

Thanks.
  
Joel Fernandes Oct. 24, 2022, 8:12 p.m. UTC | #16
On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > > >
> > > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > > understand the differences in your respective platforms' definitions of
> > > > > > > "good".  ;-)
> > > > > > >
> > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > > >
> > > > > > <snip>
> > > > > > 1. Home screen swipe:
> > > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> >
> > > > > > 2. App launches:
> > > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> >
> > Definitely better, but I'd still ask why not just rely on the lazy
> > batching that we now have, since it is a memory pressure related
> > usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > instead keep your improved kvfree_rcu() batching only for
> > !CONFIG_RCU_LAZY.
>
> Given that making the kvfree_rcu()-level batching conditional on
> CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
> keeping the kvfree_rcu-level batching unconditionally?

The bad thing happening is power impact. There is a noticeable impact
in our testing, and when we dropped this particular patch, it got much
better results.

I also run rcutop and I see without the patch that I have several
seconds of laziness at a time, unlike with the patch.

Even in the beginning when I came up with an implementation for
call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well since
it was quite frequent (on ChromeOS). But when we introduced the
flush() API, I forgot to not use flush() on it.  But unfortunately
this patch slipped into my last series when Vlad and I were debugging
the SCSI issue, and did not really help for the SCSI issue itself.

Thanks,

 - Joel
  
Joel Fernandes Oct. 24, 2022, 8:16 p.m. UTC | #17
On Mon, Oct 24, 2022 at 4:12 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > > On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > > > >
> > > > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > > > understand the differences in your respective platforms' definitions of
> > > > > > > > "good".  ;-)
> > > > > > > >
> > > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > > > >
> > > > > > > <snip>
> > > > > > > 1. Home screen swipe:
> > > > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > >
> > > > > > > 2. App launches:
> > > > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > >
> > > Definitely better, but I'd still ask why not just rely on the lazy
> > > batching that we now have, since it is a memory pressure related
> > > usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > > instead keep your improved kvfree_rcu() batching only for
> > > !CONFIG_RCU_LAZY.
> >
> > Given that making the kvfree_rcu()-level batching conditional on
> > CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
> > keeping the kvfree_rcu-level batching unconditionally?
>
> The bad thing happening is power impact. There is a noticeable impact
> in our testing, and when we dropped this particular patch, it got much
> better results.
>
> I also run rcutop and I see without the patch that I have several
> seconds of laziness at a time, unlike with the patch.
>
> Even in the beginning when I came up with an implementation for
> call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well since
> it was quite frequent (on ChromeOS). But when we introduced the
> flush() API, I forgot to not use flush() on it.  But unfortunately
> this patch slipped into my last series when Vlad and I were debugging
> the SCSI issue, and did not really help for the SCSI issue itself.

I could try to run Vlad's other mainline patch itself and measure
power, I'll get back on that. Thanks!
  
Paul E. McKenney Oct. 24, 2022, 8:19 p.m. UTC | #18
On Mon, Oct 24, 2022 at 04:12:59PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > > On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > > > >
> > > > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > > > understand the differences in your respective platforms' definitions of
> > > > > > > > "good".  ;-)
> > > > > > > >
> > > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > > > >
> > > > > > > <snip>
> > > > > > > 1. Home screen swipe:
> > > > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > >
> > > > > > > 2. App launches:
> > > > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > >
> > > Definitely better, but I'd still ask why not just rely on the lazy
> > > batching that we now have, since it is a memory pressure related
> > > usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > > instead keep your improved kvfree_rcu() batching only for
> > > !CONFIG_RCU_LAZY.
> >
> > Given that making the kvfree_rcu()-level batching conditional on
> > CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
> > keeping the kvfree_rcu-level batching unconditionally?
> 
> The bad thing happening is power impact. There is a noticeable impact
> in our testing, and when we dropped this particular patch, it got much
> better results.
> 
> I also run rcutop and I see without the patch that I have several
> seconds of laziness at a time, unlike with the patch.

Fair point, but is this visible at the power meter?

							Thanx, Paul

> Even in the beginning when I came up with an implementation for
> call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well since
> it was quite frequent (on ChromeOS). But when we introduced the
> flush() API, I forgot to not use flush() on it.  But unfortunately
> this patch slipped into my last series when Vlad and I were debugging
> the SCSI issue, and did not really help for the SCSI issue itself.
> 
> Thanks,
> 
>  - Joel
  
Joel Fernandes Oct. 24, 2022, 8:26 p.m. UTC | #19
On Mon, Oct 24, 2022 at 4:19 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Oct 24, 2022 at 04:12:59PM -0400, Joel Fernandes wrote:
> > On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > > > On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > > > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > > > > >
> > > > > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > > > > understand the differences in your respective platforms' definitions of
> > > > > > > > > "good".  ;-)
> > > > > > > > >
> > > > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > > > > >
> > > > > > > > <snip>
> > > > > > > > 1. Home screen swipe:
> > > > > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > > >
> > > > > > > > 2. App launches:
> > > > > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > > >
> > > > Definitely better, but I'd still ask why not just rely on the lazy
> > > > batching that we now have, since it is a memory pressure related
> > > > usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > > > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > > > instead keep your improved kvfree_rcu() batching only for
> > > > !CONFIG_RCU_LAZY.
> > >
> > > Given that making the kvfree_rcu()-level batching conditional on
> > > CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
> > > keeping the kvfree_rcu-level batching unconditionally?
> >
> > The bad thing happening is power impact. There is a noticeable impact
> > in our testing, and when we dropped this particular patch, it got much
> > better results.
> >
> > I also run rcutop and I see without the patch that I have several
> > seconds of laziness at a time, unlike with the patch.
>
> Fair point, but is this visible at the power meter?

Yes it is and it came up as a part of debugging I did, I am not making
it up ;-) The delta in power is 10%. As you saw in Vlad's traces as
well, the kvfree_rcu() can be called quite frequently.

 - Joel
  
Uladzislau Rezki Oct. 25, 2022, 10:47 a.m. UTC | #20
On Mon, Oct 24, 2022 at 04:08:17PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 1:40 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > > On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > > > >
> > > > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > > > understand the differences in your respective platforms' definitions of
> > > > > > > > "good".  ;-)
> > > > > > > >
> > > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > > > >
> > > > > > > <snip>
> > > > > > > 1. Home screen swipe:
> > > > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > >
> > > > > > > 2. App launches:
> > > > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > >
> > > Definitely better, but I'd still ask why not just rely on the lazy
> > > batching that we now have, since it is a memory pressure related
> > > usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > > instead keep your improved kvfree_rcu() batching only for
> > > !CONFIG_RCU_LAZY.
> > >
> >
> > 1. Double-batching?
> >
> > The kvfree_rcu() interface itself keeps track of when to reclaim:
> >   a) when a page is full;
> >   b) when i high storm of freeing over rcu;
> >   c) when a low memory condition.
> >
> > such control stays inside the kvfree_rcu(). Converting it to lazy
> > variant:
> >   a) lose the control, what will become as a problem;
> >   b) nothing is improved.
> 
> AFAICS, the only thing being changed is when you are giving memory
> back to the system. So you will be holding on to memory a bit longer.
> And there's shrinkers that are already flushing those. I don't think
> the users of kvfree_rcu() want to free memory instantly. If there is
> such usecase, please share it.
> 
Actually, ideally, we want to free a memory asap. The problem with extra 10
seconds is a big amount of un-reclaimed memory that usually can lead
to more frequent memory pressure and doing shrinking. We do not want
ideally trigger any shrinker. Because for us it is a big slow down
in a device behaviour.

> > 2. Converting the queue_rcu_work() to lazy variant breaks a humanity
> > interpretation when a queued work is supposed to be run. People do not
> > expect seconds when they queue the work.
> 
> Which people? ;)
> 
Who wrote the code :)

> > Same as in the kvfree_rcu()
> > we do not expect it we even used a high_prio queue in the beginning.
> > There are ~10 users who queue the work and they did not expect it to
> > be run in 10 seconds when they wrote the code.
> 
> That's a bit of a misinterpretation of what I'm saying. A variant
> queue_rcu_work_flush() can be added for those users (such as ones that
> are not freeing memory).
> 
If it is added for the kvfree_rcu() it is totally fine. Because there is
a batching in place.

--
Uladzislau Rezki
  
Uladzislau Rezki Oct. 25, 2022, 10:48 a.m. UTC | #21
On Mon, Oct 24, 2022 at 04:16:20PM -0400, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 4:12 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > > > On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > > > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > > > > >
> > > > > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > > > > understand the differences in your respective platforms' definitions of
> > > > > > > > > "good".  ;-)
> > > > > > > > >
> > > > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > > > > >
> > > > > > > > <snip>
> > > > > > > > 1. Home screen swipe:
> > > > > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > > >
> > > > > > > > 2. App launches:
> > > > > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > > >
> > > > Definitely better, but I'd still ask why not just rely on the lazy
> > > > batching that we now have, since it is a memory pressure related
> > > > usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > > > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > > > instead keep your improved kvfree_rcu() batching only for
> > > > !CONFIG_RCU_LAZY.
> > >
> > > Given that making the kvfree_rcu()-level batching conditional on
> > > CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
> > > keeping the kvfree_rcu-level batching unconditionally?
> >
> > The bad thing happening is power impact. There is a noticeable impact
> > in our testing, and when we dropped this particular patch, it got much
> > better results.
> >
> > I also run rcutop and I see without the patch that I have several
> > seconds of laziness at a time, unlike with the patch.
> >
> > Even in the beginning when I came up with an implementation for
> > call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well since
> > it was quite frequent (on ChromeOS). But when we introduced the
> > flush() API, I forgot to not use flush() on it.  But unfortunately
> > this patch slipped into my last series when Vlad and I were debugging
> > the SCSI issue, and did not really help for the SCSI issue itself.
> 
> I could try to run Vlad's other mainline patch itself and measure
> power, I'll get back on that. Thanks!
>
That makes sense. It would be good to have a look at your power figures
and traces.

--
Uladzislau Rezki
  
Joel Fernandes Oct. 25, 2022, 3:05 p.m. UTC | #22
On Tue, Oct 25, 2022 at 6:48 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Oct 24, 2022 at 04:16:20PM -0400, Joel Fernandes wrote:
> > On Mon, Oct 24, 2022 at 4:12 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > > > > On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > > > > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > > > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > > > > > >
> > > > > > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > > > > > understand the differences in your respective platforms' definitions of
> > > > > > > > > > "good".  ;-)
> > > > > > > > > >
> > > > > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > > > > > >
> > > > > > > > > <snip>
> > > > > > > > > 1. Home screen swipe:
> > > > > > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > > > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > > > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > > > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > > > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > > > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > > > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > > > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > > > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > > > >
> > > > > > > > > 2. App launches:
> > > > > > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > > > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > > > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > > > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > > > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > > > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > > > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > > > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > > > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > > > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > > > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > > > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > > > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > > > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > > > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > > > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > > > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > > > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > > > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > > > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > > > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > > > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > > > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > > > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > > > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > > > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > > > >
> > > > > Definitely better, but I'd still ask why not just rely on the lazy
> > > > > batching that we now have, since it is a memory pressure related
> > > > > usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > > > > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > > > > instead keep your improved kvfree_rcu() batching only for
> > > > > !CONFIG_RCU_LAZY.
> > > >
> > > > Given that making the kvfree_rcu()-level batching conditional on
> > > > CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
> > > > keeping the kvfree_rcu-level batching unconditionally?
> > >
> > > The bad thing happening is power impact. There is a noticeable impact
> > > in our testing, and when we dropped this particular patch, it got much
> > > better results.
> > >
> > > I also run rcutop and I see without the patch that I have several
> > > seconds of laziness at a time, unlike with the patch.
> > >
> > > Even in the beginning when I came up with an implementation for
> > > call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well since
> > > it was quite frequent (on ChromeOS). But when we introduced the
> > > flush() API, I forgot to not use flush() on it.  But unfortunately
> > > this patch slipped into my last series when Vlad and I were debugging
> > > the SCSI issue, and did not really help for the SCSI issue itself.
> >
> > I could try to run Vlad's other mainline patch itself and measure
> > power, I'll get back on that. Thanks!
> >
> That makes sense. It would be good to have a look at your power figures
> and traces.

If you don't mind, could you backport that patch to 5.10?
Here is my 5.10 tree for reference (without the patch)
https://github.com/joelagnel/linux-kernel.git    (branch
5.10-v9-minus-queuework-plus-kfreebatch)

and I am getting conflicts if I cherry-pick:
51824b780b71 ("rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval")

I am assuming you have already done the backport, that's why you got
the traces above. If so, I would appreciate a link to your branch so I
don't mess the backport up!

 - Joel
  
Uladzislau Rezki Oct. 26, 2022, 8:35 p.m. UTC | #23
On Tue, Oct 25, 2022 at 11:05:57AM -0400, Joel Fernandes wrote:
> On Tue, Oct 25, 2022 at 6:48 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Mon, Oct 24, 2022 at 04:16:20PM -0400, Joel Fernandes wrote:
> > > On Mon, Oct 24, 2022 at 4:12 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > On Mon, Oct 24, 2022 at 1:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > On Mon, Oct 24, 2022 at 01:20:26PM -0400, Joel Fernandes wrote:
> > > > > > On Mon, Oct 24, 2022 at 1:08 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > On Mon, Oct 24, 2022 at 06:55:16PM +0200, Uladzislau Rezki wrote:
> > > > > > > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > > > > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > > > > > > >
> > > > > > > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > > > > > > understand the differences in your respective platforms' definitions of
> > > > > > > > > > > "good".  ;-)
> > > > > > > > > > >
> > > > > > > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > > > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > > > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > > > > > > >
> > > > > > > > > > <snip>
> > > > > > > > > > 1. Home screen swipe:
> > > > > > > > > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > > > > > > > > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > > > > > > > > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > > > > > > > > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > > > > > > > > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > > > > > > > > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > > > > > > > > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > > > > > > > > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > > > > > > > > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > > > > >
> > > > > > > > > > 2. App launches:
> > > > > > > > > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > > > > > > > > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > > > > > > > > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > > > > > > > > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > > > > > > > > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > > > > > > > > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > > > > > > > > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > > > > > > > > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > > > > > > > > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > > > > > > > > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > > > > > > > > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > > > > > > > > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > > > > > > > > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > > > > > > > > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > > > > > > > > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > > > > > > > > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > > > > > > > > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > > > > > > > > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > > > > > > > > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > > > > > > > > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > > > > > > > > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > > > > > > > > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > > > > > > > > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > > > > > > > > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > > > > > > > > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > > > > > > > > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > > > > >
> > > > > > Definitely better, but I'd still ask why not just rely on the lazy
> > > > > > batching that we now have, since it is a memory pressure related
> > > > > > usecase. Or another approach could be, for CONFIG_RCU_LAZY, don't
> > > > > > disturb the lazy-RCU batching by queuing these "free memory" CBs; and
> > > > > > instead keep your improved kvfree_rcu() batching only for
> > > > > > !CONFIG_RCU_LAZY.
> > > > >
> > > > > Given that making the kvfree_rcu()-level batching conditional on
> > > > > CONFIG_RCU_LAZY would complicate the code, what bad thing happens when
> > > > > keeping the kvfree_rcu-level batching unconditionally?
> > > >
> > > > The bad thing happening is power impact. There is a noticeable impact
> > > > in our testing, and when we dropped this particular patch, it got much
> > > > better results.
> > > >
> > > > I also run rcutop and I see without the patch that I have several
> > > > seconds of laziness at a time, unlike with the patch.
> > > >
> > > > Even in the beginning when I came up with an implementation for
> > > > call_rcu_lazy(), I had to mark queue_rcu_work() as lazy as well since
> > > > it was quite frequent (on ChromeOS). But when we introduced the
> > > > flush() API, I forgot to not use flush() on it.  But unfortunately
> > > > this patch slipped into my last series when Vlad and I were debugging
> > > > the SCSI issue, and did not really help for the SCSI issue itself.
> > >
> > > I could try to run Vlad's other mainline patch itself and measure
> > > power, I'll get back on that. Thanks!
> > >
> > That makes sense. It would be good to have a look at your power figures
> > and traces.
> 
> If you don't mind, could you backport that patch to 5.10?
> Here is my 5.10 tree for reference (without the patch)
> https://github.com/joelagnel/linux-kernel.git    (branch
> 5.10-v9-minus-queuework-plus-kfreebatch)
> 
> and I am getting conflicts if I cherry-pick:
> 51824b780b71 ("rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval")
> 
> I am assuming you have already done the backport, that's why you got
> the traces above. If so, I would appreciate a link to your branch so I
> don't mess the backport up!
> 
Sure. I sent you the patches privately so i do not want to paste
here a lot of code to make extra line-nose.

--
Uladzislau Rezki
  
Joel Fernandes Oct. 28, 2022, 9:23 p.m. UTC | #24
On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > >
> > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > understand the differences in your respective platforms' definitions of
> > > "good".  ;-)
> > >
> > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > workload a can detect a power delta and power gain. Anyway, below is a new
> > trace where i do not use "flush" variant for the kvfree_rcu():
> > 
> > <snip>
> > 1. Home screen swipe:
> >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > 2. App launches:
> >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > <snip>
> > 
> > it is much more better. But. As i wrote earlier there is a patch that i have submitted
> > some time ago improving kvfree_rcu() batching:
> > 
> > <snip>
> > commit 51824b780b719c53113dc39e027fbf670dc66028
> > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Date:   Thu Jun 30 18:33:35 2022 +0200
> > 
> >     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> > 
> >     Currently the monitor work is scheduled with a fixed interval of HZ/20,
> >     which is roughly 50 milliseconds. The drawback of this approach is
> >     low utilization of the 512 page slots in scenarios with infrequence
> >     kvfree_rcu() calls.  For example on an Android system:
> > <snip>
> > 
> > The trace that i posted was taken without it.
> 
> And if I am not getting too confused, that patch is now in mainline.
> So it does make sense to rely on it, then.  ;-)

Vlad's patch to change the KFREE_DRAIN_JIFFIES to 5 seconds seems reasonable
to me. However, can we unify KFREE_DRAIN_JIFFIES and LAZY_FLUSH_JIFFIES ?

One at 5 and other at 10 seems odd, especially because the former seems to
negate the effects of the latter and anyone tweaking that in the future (say
via new command line options) should probably tweak them together to increase
batching.

Testing shows significantly better batching with Vlad's updates, however I am
wondering why the rcu_callback fires in pairs like that from separate
kworkers:

     kworker/6:1-157     [006] d..1   288.861610: rcu_callback: rcu_preempt  rhp=0000000079b895f9 func=rcu_work_rcufn 1214
     kworker/4:2-158     [004]  d..1   288.861612: rcu_callback: rcu_preempt rhp=00000000d83fcc90 func=rcu_work_rcufn 798

I wonder if the queued kwork is executing twice accidentally, or something.
This kernel does have the additional trace patch below, fyi.

Another thought I have is, if we can just keep the kvfree_rcu() mapped to
call_rcu() via a config option say CONFIG_RCU_LAZY_KFREE, or something.
Because I am personally not much a fan of the kfree_rcu() induced additional
timer wake ups and kworker queue+wakeup which we don't need per-se, if we are
already batching with Lazyfied-call_rcu. Too many moving parts which might
hurt power.

---8<-----------------------

From: Joel Fernandes <joelaf@google.com>
Subject: [PATCH] debug: reorder trace_rcu_callback

Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 kernel/rcu/tree.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 73feb09fd51b..a7c175e9533a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2978,10 +2978,6 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 	}
 
 	check_cb_ovld(rdp);
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
-		return; // Enqueued onto ->nocb_bypass, so just leave.
-	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
-	rcu_segcblist_enqueue(&rdp->cblist, head);
 	if (__is_kvfree_rcu_offset((unsigned long)func))
 		trace_rcu_kvfree_callback(rcu_state.name, head,
 					 (unsigned long)func,
@@ -2990,6 +2986,11 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 		trace_rcu_callback(rcu_state.name, head,
 				   rcu_segcblist_n_cbs(&rdp->cblist));
 
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
+		return; // Enqueued onto ->nocb_bypass, so just leave.
+	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
+	rcu_segcblist_enqueue(&rdp->cblist, head);
+
 	/* Go handle any RCU core processing required. */
 	if (IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
 	    unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
  
Joel Fernandes Oct. 28, 2022, 9:42 p.m. UTC | #25
On Fri, Oct 28, 2022 at 09:23:47PM +0000, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > >
> > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > understand the differences in your respective platforms' definitions of
> > > > "good".  ;-)
> > > >
> > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > 
> > > <snip>
> > > 1. Home screen swipe:
> > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > > 2. App launches:
> > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > > <snip>
> > > 
> > > it is much more better. But. As i wrote earlier there is a patch that i have submitted
> > > some time ago improving kvfree_rcu() batching:
> > > 
> > > <snip>
> > > commit 51824b780b719c53113dc39e027fbf670dc66028
> > > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Date:   Thu Jun 30 18:33:35 2022 +0200
> > > 
> > >     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> > > 
> > >     Currently the monitor work is scheduled with a fixed interval of HZ/20,
> > >     which is roughly 50 milliseconds. The drawback of this approach is
> > >     low utilization of the 512 page slots in scenarios with infrequence
> > >     kvfree_rcu() calls.  For example on an Android system:
> > > <snip>
> > > 
> > > The trace that i posted was taken without it.
> > 
> > And if I am not getting too confused, that patch is now in mainline.
> > So it does make sense to rely on it, then.  ;-)
> 
> Vlad's patch to change the KFREE_DRAIN_JIFFIES to 5 seconds seems reasonable
> to me. However, can we unify KFREE_DRAIN_JIFFIES and LAZY_FLUSH_JIFFIES ?
> 
> One at 5 and other at 10 seems odd, especially because the former seems to
> negate the effects of the latter and anyone tweaking that in the future (say
> via new command line options) should probably tweak them together to increase
> batching.
> 
> Testing shows significantly better batching with Vlad's updates, however I am
> wondering why the rcu_callback fires in pairs like that from separate
> kworkers:
> 
>      kworker/6:1-157     [006] d..1   288.861610: rcu_callback: rcu_preempt  rhp=0000000079b895f9 func=rcu_work_rcufn 1214
>      kworker/4:2-158     [004]  d..1   288.861612: rcu_callback: rcu_preempt rhp=00000000d83fcc90 func=rcu_work_rcufn 798

I think this is just 2 kvfree_call_rcu() happening on 2 different CPUs, and
then the draining happened on 2 different kworkers, so appears normal.

Here is also some more evidence from the user side, of kvfree_call_rcu()
noisiness on ChromeOS. So we definitely need the batching to happen on
ChromeOS:

   kworker/u24:6-1448  [005]    77.290344: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
 ThreadPoolForeg-5130  [011]    77.301101: function:             kvfree_call_rcu <-- cgroup_migrate_finish
 irq/144-iwlwifi-1010  [004]    77.314367: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
 ThreadPoolSingl-2050  [004]    77.330359: function:             kvfree_call_rcu <-- cgroup_migrate_finish
 ThreadPoolSingl-2050  [004]    77.330362: function:             kvfree_call_rcu <-- cgroup_migrate_finish
 ThreadPoolForeg-5130  [011]    77.331513: function:             kvfree_call_rcu <-- cgroup_migrate_finish
     patchpaneld-2195  [009]    77.337726: function:             kvfree_call_rcu <-- neigh_flush_dev
     patchpaneld-2195  [009]    77.337737: function:             kvfree_call_rcu <-- __hw_addr_del_entry
     patchpaneld-2195  [009]    77.337744: function:             kvfree_call_rcu <-- addrconf_ifdown
     patchpaneld-2195  [009]    77.337744: function:             kvfree_call_rcu <-- __hw_addr_del_entry
 irq/144-iwlwifi-1010  [004]    77.633595: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
 irq/144-iwlwifi-1010  [004]    77.633609: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
 irq/144-iwlwifi-1010  [004]    77.769844: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
   kworker/u24:1-9     [008]    77.769858: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
 irq/144-iwlwifi-1010  [004]    77.880114: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
   kworker/u24:6-1448  [005]    77.880129: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
 irq/144-iwlwifi-1010  [004]    77.880131: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
   kworker/u24:6-1448  [005]    77.880133: function:             kvfree_call_rcu <-- cfg80211_update_known_bss
      virtio_gpu-5882  [010]    78.337302: function:             kvfree_call_rcu <-- context_close
      virtio_gpu-5882  [010]    78.337303: function:             kvfree_call_rcu <-- i915_driver_postclose
      virtio_gpu-5882  [010]    78.346058: function:             kvfree_call_rcu <-- fence_notify
      virtio_gpu-5882  [010]    78.346070: function:             kvfree_call_rcu <-- fence_notify
      virtio_gpu-5882  [010]    78.346079: function:             kvfree_call_rcu <-- fence_notify
      virtio_gpu-5882  [010]    78.346086: function:             kvfree_call_rcu <-- fence_notify
      virtio_gpu-5882  [010]    78.346184: function:             kvfree_call_rcu <-- fence_notify
      virtio_gpu-5882  [010]    78.346196: function:             kvfree_call_rcu <-- fence_notify

thanks,

 - Joel

> 
> I wonder if the queued kwork is executing twice accidentally, or something.
> This kernel does have the additional trace patch below, fyi.
> 
> Another thought I have is, if we can just keep the kvfree_rcu() mapped to
> call_rcu() via a config option say CONFIG_RCU_LAZY_KFREE, or something.
> Because I am personally not much a fan of the kfree_rcu() induced additional
> timer wake ups and kworker queue+wakeup which we don't need per-se, if we are
> already batching with Lazyfied-call_rcu. Too many moving parts which might
> hurt power.
> 
> ---8<-----------------------
> 
> From: Joel Fernandes <joelaf@google.com>
> Subject: [PATCH] debug: reorder trace_rcu_callback
> 
> Signed-off-by: Joel Fernandes <joelaf@google.com>
> ---
>  kernel/rcu/tree.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 73feb09fd51b..a7c175e9533a 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2978,10 +2978,6 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>  	}
>  
>  	check_cb_ovld(rdp);
> -	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
> -		return; // Enqueued onto ->nocb_bypass, so just leave.
> -	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
> -	rcu_segcblist_enqueue(&rdp->cblist, head);
>  	if (__is_kvfree_rcu_offset((unsigned long)func))
>  		trace_rcu_kvfree_callback(rcu_state.name, head,
>  					 (unsigned long)func,
> @@ -2990,6 +2986,11 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>  		trace_rcu_callback(rcu_state.name, head,
>  				   rcu_segcblist_n_cbs(&rdp->cblist));
>  
> +	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
> +		return; // Enqueued onto ->nocb_bypass, so just leave.
> +	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
> +	rcu_segcblist_enqueue(&rdp->cblist, head);
> +
>  	/* Go handle any RCU core processing required. */
>  	if (IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
>  	    unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
> -- 
> 2.38.1.273.g43a17bfeac-goog
>
  
Uladzislau Rezki Oct. 31, 2022, 1:21 p.m. UTC | #26
On Fri, Oct 28, 2022 at 09:23:47PM +0000, Joel Fernandes wrote:
> On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > >
> > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > understand the differences in your respective platforms' definitions of
> > > > "good".  ;-)
> > > >
> > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > 
> > > <snip>
> > > 1. Home screen swipe:
> > >          rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
> > >          rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
> > >          rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
> > >          rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
> > >          rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
> > >          rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
> > >          rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
> > >          rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
> > >          rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
> > > 2. App launches:
> > >          rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
> > >          rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
> > >          rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
> > >          rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
> > >          rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
> > >          rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
> > >          rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
> > >          rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
> > >          rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
> > >            <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
> > >          rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84
> > >            <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
> > >            <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
> > >          rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
> > >            <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
> > >          rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
> > >          rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
> > >          rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
> > >          rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
> > >          rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
> > >          rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
> > >          rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
> > >          rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
> > >          rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
> > >          rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
> > >          rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
> > > <snip>
> > > 
> > > it is much more better. But. As i wrote earlier there is a patch that i have submitted
> > > some time ago improving kvfree_rcu() batching:
> > > 
> > > <snip>
> > > commit 51824b780b719c53113dc39e027fbf670dc66028
> > > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Date:   Thu Jun 30 18:33:35 2022 +0200
> > > 
> > >     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> > > 
> > >     Currently the monitor work is scheduled with a fixed interval of HZ/20,
> > >     which is roughly 50 milliseconds. The drawback of this approach is
> > >     low utilization of the 512 page slots in scenarios with infrequence
> > >     kvfree_rcu() calls.  For example on an Android system:
> > > <snip>
> > > 
> > > The trace that i posted was taken without it.
> > 
> > And if I am not getting too confused, that patch is now in mainline.
> > So it does make sense to rely on it, then.  ;-)
> 
> Vlad's patch to change the KFREE_DRAIN_JIFFIES to 5 seconds seems reasonable
> to me. However, can we unify KFREE_DRAIN_JIFFIES and LAZY_FLUSH_JIFFIES ?
> 
This is very good.

Below is a plot that i have taken during one use-case. It is about three
apps usage in parallel. It was done by running "monkey" test:

wget ftp://vps418301.ovh.net/incoming/monkey_3_apps_slab_usage_5_minutes.png

i set up three apps as usage scenario: Google Chrome, YoTube and Camera.
I logged the Slab metric from the /proc/meminfo. Sampling rate is 0.1 second.

Please have a look at results. It reflects what i am saying. non-flush
kvfree RCU variant makes a memory usage higher. What is not acceptable
for our mobile devices and workloads.

>
> One at 5 and other at 10 seems odd, especially because the former seems to
> negate the effects of the latter and anyone tweaking that in the future (say
> via new command line options) should probably tweak them together to increase
> batching.
> 
Well. Convert 5 seconds to 10? What will it solve for you? We can do it
and from a kvfree_rcu() perspective nothing really is changed.

> Testing shows significantly better batching with Vlad's updates, however I am
> wondering why the rcu_callback fires in pairs like that from separate
> kworkers:
> 
>      kworker/6:1-157     [006] d..1   288.861610: rcu_callback: rcu_preempt  rhp=0000000079b895f9 func=rcu_work_rcufn 1214
>      kworker/4:2-158     [004]  d..1   288.861612: rcu_callback: rcu_preempt rhp=00000000d83fcc90 func=rcu_work_rcufn 798
> 
> I wonder if the queued kwork is executing twice accidentally, or something.
>
Because a kfree_rcu_cpu() is a per-cpu thing.

> This kernel does have the additional trace patch below, fyi.
> 
> Another thought I have is, if we can just keep the kvfree_rcu() mapped to
> call_rcu() via a config option say CONFIG_RCU_LAZY_KFREE, or something.
>
I am not sure you need it, really. If you wake-up "rcuop" or whatever
with 0.5 second interval or with 5 seconds interval you will not notice
anything in terms of power between both.

--
Uladzislau Rezki
  
Joel Fernandes Oct. 31, 2022, 1:37 p.m. UTC | #27
> On Oct 31, 2022, at 9:21 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> On Fri, Oct 28, 2022 at 09:23:47PM +0000, Joel Fernandes wrote:
>>> On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
>>> On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
>>>>> 
>>>>> You guys might need to agree on the definition of "good" here.  Or maybe
>>>>> understand the differences in your respective platforms' definitions of
>>>>> "good".  ;-)
>>>>> 
>>>> Indeed. Bad is when once per-millisecond infinitely :) At least in such use
>>>> workload a can detect a power delta and power gain. Anyway, below is a new
>>>> trace where i do not use "flush" variant for the kvfree_rcu():
>>>> 
>>>> <snip>
>>>> 1. Home screen swipe:
>>>>         rcuop/0-15      [003] d..1  1792.767750: rcu_batch_start: rcu_preempt CBs=1003 bl=10
>>>>         rcuop/2-33      [002] d..1  1792.771717: rcu_batch_start: rcu_preempt CBs=934 bl=10
>>>>         rcuop/3-40      [001] d..1  1794.811816: rcu_batch_start: rcu_preempt CBs=1508 bl=11
>>>>         rcuop/1-26      [003] d..1  1797.116382: rcu_batch_start: rcu_preempt CBs=2127 bl=16
>>>>         rcuop/4-48      [001] d..1  1797.124422: rcu_batch_start: rcu_preempt CBs=95 bl=10
>>>>         rcuop/5-55      [002] d..1  1797.124731: rcu_batch_start: rcu_preempt CBs=143 bl=10
>>>>         rcuop/6-62      [005] d..1  1798.911719: rcu_batch_start: rcu_preempt CBs=132 bl=10
>>>>         rcuop/2-33      [002] d..1  1803.003966: rcu_batch_start: rcu_preempt CBs=3797 bl=29
>>>>         rcuop/0-15      [003] d..1  1803.004707: rcu_batch_start: rcu_preempt CBs=2969 bl=23
>>>> 2. App launches:
>>>>         rcuop/4-48      [005] d..1  1831.087612: rcu_batch_start: rcu_preempt CBs=6141 bl=47
>>>>         rcuop/7-69      [007] d..1  1831.095578: rcu_batch_start: rcu_preempt CBs=5464 bl=42
>>>>         rcuop/5-55      [004] d..1  1832.703571: rcu_batch_start: rcu_preempt CBs=8461 bl=66
>>>>         rcuop/0-15      [004] d..1  1833.731603: rcu_batch_start: rcu_preempt CBs=2548 bl=19
>>>>         rcuop/1-26      [006] d..1  1833.743691: rcu_batch_start: rcu_preempt CBs=2567 bl=20
>>>>         rcuop/2-33      [006] d..1  1833.744005: rcu_batch_start: rcu_preempt CBs=2359 bl=18
>>>>         rcuop/3-40      [006] d..1  1833.744286: rcu_batch_start: rcu_preempt CBs=3681 bl=28
>>>>         rcuop/4-48      [002] d..1  1838.079777: rcu_batch_start: rcu_preempt CBs=10444 bl=81
>>>>         rcuop/7-69      [001] d..1  1838.080375: rcu_batch_start: rcu_preempt CBs=12572 bl=98
>>>>           <...>-62      [002] d..1  1838.080646: rcu_batch_start: rcu_preempt CBs=14135 bl=110
>>>>         rcuop/6-62      [000] d..1  1838.087722: rcu_batch_start: rcu_preempt CBs=10839 bl=84

Please let us try to trim emails. That goes for me too.

>>>>           <...>-62      [003] d..1  1839.227022: rcu_batch_start: rcu_preempt CBs=1834 bl=14
>>>>           <...>-26      [001] d..1  1839.963315: rcu_batch_start: rcu_preempt CBs=5769 bl=45
>>>>         rcuop/2-33      [001] d..1  1839.966485: rcu_batch_start: rcu_preempt CBs=3789 bl=29
>>>>           <...>-40      [001] d..1  1839.966596: rcu_batch_start: rcu_preempt CBs=6425 bl=50
>>>>         rcuop/2-33      [005] d..1  1840.541272: rcu_batch_start: rcu_preempt CBs=825 bl=10
>>>>         rcuop/2-33      [005] d..1  1840.547724: rcu_batch_start: rcu_preempt CBs=44 bl=10
>>>>         rcuop/2-33      [005] d..1  1841.075759: rcu_batch_start: rcu_preempt CBs=516 bl=10
>>>>         rcuop/0-15      [002] d..1  1841.695716: rcu_batch_start: rcu_preempt CBs=6312 bl=49
>>>>         rcuop/0-15      [003] d..1  1841.709714: rcu_batch_start: rcu_preempt CBs=39 bl=10
>>>>         rcuop/5-55      [004] d..1  1843.112442: rcu_batch_start: rcu_preempt CBs=16007 bl=125
>>>>         rcuop/5-55      [004] d..1  1843.115444: rcu_batch_start: rcu_preempt CBs=7901 bl=61
>>>>         rcuop/6-62      [001] dn.1  1843.123983: rcu_batch_start: rcu_preempt CBs=8427 bl=65
>>>>         rcuop/6-62      [006] d..1  1843.412383: rcu_batch_start: rcu_preempt CBs=981 bl=10
>>>>         rcuop/0-15      [003] d..1  1844.659812: rcu_batch_start: rcu_preempt CBs=1851 bl=14
>>>>         rcuop/0-15      [003] d..1  1844.667790: rcu_batch_start: rcu_preempt CBs=135 bl=10
>>>> <snip>
>>>> 
>>>> it is much more better. But. As i wrote earlier there is a patch that i have submitted
>>>> some time ago improving kvfree_rcu() batching:
>>>> 
>>>> <snip>
>>>> commit 51824b780b719c53113dc39e027fbf670dc66028
>>>> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>> Date:   Thu Jun 30 18:33:35 2022 +0200
>>>> 
>>>>    rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
>>>> 
>>>>    Currently the monitor work is scheduled with a fixed interval of HZ/20,
>>>>    which is roughly 50 milliseconds. The drawback of this approach is
>>>>    low utilization of the 512 page slots in scenarios with infrequence
>>>>    kvfree_rcu() calls.  For example on an Android system:
>>>> <snip>
>>>> 
>>>> The trace that i posted was taken without it.
>>> 
>>> And if I am not getting too confused, that patch is now in mainline.
>>> So it does make sense to rely on it, then.  ;-)
>> 
>> Vlad's patch to change the KFREE_DRAIN_JIFFIES to 5 seconds seems reasonable
>> to me. However, can we unify KFREE_DRAIN_JIFFIES and LAZY_FLUSH_JIFFIES ?
>> 
> This is very good.
> 
> Below is a plot that i have taken during one use-case. It is about three
> apps usage in parallel. It was done by running "monkey" test:
> 
> wget ftp://vps418301.ovh.net/incoming/monkey_3_apps_slab_usage_5_minutes.png
> 
> i set up three apps as usage scenario: Google Chrome, YoTube and Camera.
> I logged the Slab metric from the /proc/meminfo. Sampling rate is 0.1 second.
> 
> Please have a look at results. It reflects what i am saying. non-flush
> kvfree RCU variant makes a memory usage higher. What is not acceptable
> for our mobile devices and workloads.

Thanks I’ll take a closer look at the data (currently commuting) but here’s a quick reply:

I am curious that with the 5 second timer, you are delaying RCU anyway. Are you saying that, adding another 10 on top (due to lazyfying) seems to be causing issues? I find it hard to believe that you cannot give the shrinker enough work within 5 seconds, such that it also triggers the issues you’re seeing. However the workload and data speaks.

>> One at 5 and other at 10 seems odd, especially because the former seems to
>> negate the effects of the latter and anyone tweaking that in the future (say
>> via new command line options) should probably tweak them together to increase
>> batching.
>> 
> Well. Convert 5 seconds to 10? What will it solve for you? We can do it
> and from a kvfree_rcu() perspective nothing really is changed.

True. In fact with my last patch, I see almost never even the need to go to RCU. However my point with unification is just to keep it simple for user for 2 knobs that do the same thing. Granted this is a compiler knob but that might change in the future. We already have enough knobs in RCU and as you guys know, I’m a fan of not letting the user mess things up too much.

>> Testing shows significantly better batching with Vlad's updates, however I am
>> wondering why the rcu_callback fires in pairs like that from separate
>> kworkers:
>> 
>>     kworker/6:1-157     [006] d..1   288.861610: rcu_callback: rcu_preempt  rhp=0000000079b895f9 func=rcu_work_rcufn 1214
>>     kworker/4:2-158     [004]  d..1   288.861612: rcu_callback: rcu_preempt rhp=00000000d83fcc90 func=rcu_work_rcufn 798
>> 
>> I wonder if the queued kwork is executing twice accidentally, or something.
>> 
> Because a kfree_rcu_cpu() is a per-cpu thing.

Right, got it.

>> This kernel does have the additional trace patch below, fyi.
>> 
>> Another thought I have is, if we can just keep the kvfree_rcu() mapped to
>> call_rcu() via a config option say CONFIG_RCU_LAZY_KFREE, or something.
>> 
> I am not sure you need it, really. If you wake-up "rcuop" or whatever
> with 0.5 second interval or with 5 seconds interval you will not notice
> anything in terms of power between both.

Yes, you are right. This is not needed considering the improvements you recently made.

Cheers,

 - Joel 


> 
> --
> Uladzislau Rezki
  
Joel Fernandes Oct. 31, 2022, 6:15 p.m. UTC | #28
On Mon, Oct 31, 2022 at 9:21 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Fri, Oct 28, 2022 at 09:23:47PM +0000, Joel Fernandes wrote:
> > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > >
> > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > understand the differences in your respective platforms' definitions of
> > > > > "good".  ;-)
> > > > >
> > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > >
> > > > <snip>
> > > > 1. Home screen swipe:
[...]
> > > > 2. App launches:
[...]
> > > > <snip>
> > > >
> > > > it is much more better. But. As i wrote earlier there is a patch that i have submitted
> > > > some time ago improving kvfree_rcu() batching:
> > > >
> > > > <snip>
> > > > commit 51824b780b719c53113dc39e027fbf670dc66028
> > > > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > Date:   Thu Jun 30 18:33:35 2022 +0200
> > > >
> > > >     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> > > >
> > > >     Currently the monitor work is scheduled with a fixed interval of HZ/20,
> > > >     which is roughly 50 milliseconds. The drawback of this approach is
> > > >     low utilization of the 512 page slots in scenarios with infrequence
> > > >     kvfree_rcu() calls.  For example on an Android system:
> > > > <snip>
> > > >
> > > > The trace that i posted was taken without it.
> > >
> > > And if I am not getting too confused, that patch is now in mainline.
> > > So it does make sense to rely on it, then.  ;-)
> >
> > Vlad's patch to change the KFREE_DRAIN_JIFFIES to 5 seconds seems reasonable
> > to me. However, can we unify KFREE_DRAIN_JIFFIES and LAZY_FLUSH_JIFFIES ?
> >
> This is very good.
>
> Below is a plot that i have taken during one use-case. It is about three
> apps usage in parallel. It was done by running "monkey" test:
>
> wget ftp://vps418301.ovh.net/incoming/monkey_3_apps_slab_usage_5_minutes.png
>
> i set up three apps as usage scenario: Google Chrome, YoTube and Camera.
> I logged the Slab metric from the /proc/meminfo. Sampling rate is 0.1 second.
>
> Please have a look at results. It reflects what i am saying. non-flush
> kvfree RCU variant makes a memory usage higher. What is not acceptable
> for our mobile devices and workloads.

That does look higher, though honestly about ~5%. But that's just the
effect of more "laziness". The graph itself does not show a higher
number of shrinker invocations, in fact I think shrinker invocations
are not happening much that's why the slab holds more memory. The
system may not be under memory pressure?

Anyway, I agree with your point of view and I think my concern does
not even occur with the latest patch on avoiding RCU that I posted
[1], so I come in peace.

[1] https://lore.kernel.org/rcu/20221029132856.3752018-1-joel@joelfernandes.org/

I am going to start merging all the lazy patches to ChromeOS 5.10 now
including your kfree updates, except for [1] while we discuss it.
  
Uladzislau Rezki Nov. 1, 2022, 4:49 a.m. UTC | #29
> On Mon, Oct 31, 2022 at 9:21 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Fri, Oct 28, 2022 at 09:23:47PM +0000, Joel Fernandes wrote:
> > > On Mon, Oct 24, 2022 at 09:48:19AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Oct 24, 2022 at 06:25:30PM +0200, Uladzislau Rezki wrote:
> > > > > >
> > > > > > You guys might need to agree on the definition of "good" here.  Or maybe
> > > > > > understand the differences in your respective platforms' definitions of
> > > > > > "good".  ;-)
> > > > > >
> > > > > Indeed. Bad is when once per-millisecond infinitely :) At least in such use
> > > > > workload a can detect a power delta and power gain. Anyway, below is a new
> > > > > trace where i do not use "flush" variant for the kvfree_rcu():
> > > > >
> > > > > <snip>
> > > > > 1. Home screen swipe:
> [...]
> > > > > 2. App launches:
> [...]
> > > > > <snip>
> > > > >
> > > > > it is much more better. But. As i wrote earlier there is a patch that i have submitted
> > > > > some time ago improving kvfree_rcu() batching:
> > > > >
> > > > > <snip>
> > > > > commit 51824b780b719c53113dc39e027fbf670dc66028
> > > > > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > Date:   Thu Jun 30 18:33:35 2022 +0200
> > > > >
> > > > >     rcu/kvfree: Update KFREE_DRAIN_JIFFIES interval
> > > > >
> > > > >     Currently the monitor work is scheduled with a fixed interval of HZ/20,
> > > > >     which is roughly 50 milliseconds. The drawback of this approach is
> > > > >     low utilization of the 512 page slots in scenarios with infrequence
> > > > >     kvfree_rcu() calls.  For example on an Android system:
> > > > > <snip>
> > > > >
> > > > > The trace that i posted was taken without it.
> > > >
> > > > And if I am not getting too confused, that patch is now in mainline.
> > > > So it does make sense to rely on it, then.  ;-)
> > >
> > > Vlad's patch to change the KFREE_DRAIN_JIFFIES to 5 seconds seems reasonable
> > > to me. However, can we unify KFREE_DRAIN_JIFFIES and LAZY_FLUSH_JIFFIES ?
> > >
> > This is very good.
> >
> > Below is a plot that i have taken during one use-case. It is about three
> > apps usage in parallel. It was done by running "monkey" test:
> >
> > wget ftp://vps418301.ovh.net/incoming/monkey_3_apps_slab_usage_5_minutes.png
> >
> > i set up three apps as usage scenario: Google Chrome, YoTube and Camera.
> > I logged the Slab metric from the /proc/meminfo. Sampling rate is 0.1 second.
> >
> > Please have a look at results. It reflects what i am saying. non-flush
> > kvfree RCU variant makes a memory usage higher. What is not acceptable
> > for our mobile devices and workloads.
> 
> That does look higher, though honestly about ~5%. But that's just the
> effect of more "laziness". The graph itself does not show a higher
> number of shrinker invocations, in fact I think shrinker invocations
> are not happening much that's why the slab holds more memory. The
> system may not be under memory pressure?
> 
The idea is to minimize a possibility of entering into a low memory
condition mode. This is bad from a sluggishness point of view for users.
I am saying it in a context of android devices.

> Anyway, I agree with your point of view and I think my concern does
> not even occur with the latest patch on avoiding RCU that I posted
> [1], so I come in peace.
> 
> [1] https://lore.kernel.org/rcu/20221029132856.3752018-1-joel@joelfernandes.org/
> 
I will have a look at it.

>
> I am going to start merging all the lazy patches to ChromeOS 5.10 now
> including your kfree updates, except for [1] while we discuss it.
>
Good for ChromeOS users :)

--
Uladzislau Rezki
  

Patch

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7cd5f5e7e0a1b..b4b0e828b529e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1771,7 +1771,7 @@  bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
 
 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		rwork->wq = wq;
-		call_rcu(&rwork->rcu, rcu_work_rcufn);
+		call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
 		return true;
 	}