[v2] rcu/kfree: Do not request RCU when not needed

Message ID 20221109024758.2644936-1-joel@joelfernandes.org
State New
Series [v2] rcu/kfree: Do not request RCU when not needed

Commit Message

Joel Fernandes Nov. 9, 2022, 2:47 a.m. UTC
  On ChromeOS, using this with the increased timeout, we see that we almost
never need to initiate a new grace period. Testing also shows that this frees
large amounts of unreclaimed memory under intense kfree_rcu() pressure.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v1->v2: Same logic but use polled grace periods instead of sampling gp_seq.

 kernel/rcu/tree.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
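
For context, the change relies on RCU's polled grace-period API. Here is a
minimal sketch of how that API behaves, assuming a hypothetical deferred-free
helper; only get_state_synchronize_rcu(), poll_state_synchronize_rcu() and
cond_synchronize_rcu() are real interfaces in this snippet.

<snip>
/* Illustrative sketch only; not part of the patch. */
static void example_deferred_free(void *obj)
{
	/* Take a grace-period cookie once @obj is unreachable by new readers. */
	unsigned long cookie = get_state_synchronize_rcu();

	/* ...later, typically from a worker... */

	if (poll_state_synchronize_rcu(cookie)) {
		/* A full grace period already elapsed since the cookie was taken. */
		kfree(obj);
	} else {
		/* Otherwise block until a full grace period has elapsed. */
		cond_synchronize_rcu(cookie);
		kfree(obj);
	}
}
<snip>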
  

Comments

Uladzislau Rezki Nov. 10, 2022, 1:05 p.m. UTC | #1
> On ChromeOS, using this with the increased timeout, we see that we almost always
> never need to initiate a new grace period. Testing also shows this frees large
> amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> v1->v2: Same logic but use polled grace periods instead of sampling gp_seq.
> 
>  kernel/rcu/tree.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 591187b6352e..ed41243f7a49 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
>  
>  /**
>   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
> + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor work.
>   * @head: List of kfree_rcu() objects not yet waiting for a grace period
>   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
>   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
> @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
>  	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
>  	raw_spinlock_t lock;
>  	struct delayed_work monitor_work;
> +	unsigned long gp_snap;
>  	bool initialized;
>  	int count;
>  
> @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
>  			mod_delayed_work(system_wq, &krcp->monitor_work, delay);
>  		return;
>  	}
> +	krcp->gp_snap = get_state_synchronize_rcu();
>  	queue_delayed_work(system_wq, &krcp->monitor_work, delay);
>  }
>
How do you guarantee a full grace period for objects which proceed
to be placed into an input stream that is not yet detached?
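
Roughly, the sequence I have in mind (illustrative only, names as in the
patch above):

<snip>
/*
 *   kvfree_rcu(obj_A)
 *     -> schedule_delayed_monitor_work()
 *          krcp->gp_snap = get_state_synchronize_rcu();  // snapshot taken here
 *          queue_delayed_work(...);
 *
 *   kvfree_rcu(obj_B)   // monitor work already pending: snapshot is NOT refreshed
 *
 *   ...a full grace period elapses since the snapshot...
 *
 *   kfree_rcu_monitor()
 *     -> detaches the channels, including obj_B
 *     -> poll_state_synchronize_rcu(krcp->gp_snap) == true
 *     -> queue_work()   // obj_B is freed without a full GP since its kvfree_rcu()
 */
<snip>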

>  
> @@ -3217,7 +3220,10 @@ static void kfree_rcu_monitor(struct work_struct *work)
>  			// be that the work is in the pending state when
>  			// channels have been detached following by each
>  			// other.
> -			queue_rcu_work(system_wq, &krwp->rcu_work);
> +			if (poll_state_synchronize_rcu(krcp->gp_snap))
> +				queue_work(system_wq, &krwp->rcu_work.work);
> +			else
> +				queue_rcu_work(system_wq, &krwp->rcu_work);
>  		}
>
Why do you want to queue a work over RCU-core?

1.
call_rcu()
   -> queue_work();
      -> do reclaim

if it can be improved and simplified as:

2.
queue_work();
    -> cond_synchronize_rcu(), do reclaim

Could you please clarify it?
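
To make the second option concrete, a rough sketch (not tested; names as in
the existing code, the snapshot handling is only indicated):

<snip>
/* Option 2, sketched: the reclaim worker itself checks/waits for the GP. */
static void kfree_rcu_work(struct work_struct *work)
{
	/* ...detach the batch under krcp->lock and read its GP snapshot... */

	/*
	 * Returns immediately if a full grace period has already elapsed
	 * since the snapshot was taken, otherwise blocks in this worker
	 * until one has.
	 */
	cond_synchronize_rcu(gp_snap);

	/* ...free the objects... */
}

/* and the caller side becomes: queue_work(system_wq, &krwp->rcu_work); */
<snip>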

--
Uladzislau Rezki
  
Uladzislau Rezki Nov. 10, 2022, 2:01 p.m. UTC | #2
> Hi,
> 
> On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> > > On ChromeOS, using this with the increased timeout, we see that we
> > almost always
> > > never need to initiate a new grace period. Testing also shows this frees
> > large
> > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > >
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > > v1->v2: Same logic but use polled grace periods instead of sampling
> > gp_seq.
> > >
> > >  kernel/rcu/tree.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 591187b6352e..ed41243f7a49 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > >
> > >  /**
> > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > period
> > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > work.
> > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > grace period
> > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > grace period
> > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > >       raw_spinlock_t lock;
> > >       struct delayed_work monitor_work;
> > > +     unsigned long gp_snap;
> > >       bool initialized;
> > >       int count;
> > >
> > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > *krcp)
> > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > delay);
> > >               return;
> > >       }
> > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > >  }
> > >
> > How do you guarantee a full grace period for objects which proceed
> > to be placed into an input stream that is not yet detached?
> 
> 
> Just replying from phone as I’m OOO today.
> 
> Hmm, so you’re saying that objects can be queued after the delayed work has
> been queued, but processed when the delayed work is run? I’m looking at
> this code after few years so I may have missed something.
> 
> That’s a good point and I think I missed that. I think your version did too
> but I’ll have to double check.
> 
> The fix then is to sample the clock for the latest object queued, not for
> when the delayed work is queued.
> 
The patch I sent guarantees it. Just in case, see v2:

From 7ff4038d5dac8d2044b240adeee630ad7944ab8d Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Date: Wed, 2 Nov 2022 19:26:27 +0100
Subject: [PATCH] rcu: kvfree_rcu: Reduce a memory footprint by using polling
 APIs

Total time taken by all kfree'ers: 6564718459 ns, loops: 10000, batches: 1110, memory footprint: 5057MB
Total time taken by all kfree'ers: 8431051895 ns, loops: 10000, batches: 1109, memory footprint: 2749MB
Total time taken by all kfree'ers: 9477830789 ns, loops: 10000, batches: 1158, memory footprint: 2934MB
Total time taken by all kfree'ers: 9950211144 ns, loops: 10000, batches: 981, memory footprint: 2704MB

with a patch:

Total time taken by all kfree'ers: 7712110118 ns, loops: 10000, batches: 1660, memory footprint: 91MB
Total time taken by all kfree'ers: 7002403664 ns, loops: 10000, batches: 1482, memory footprint: 86MB
Total time taken by all kfree'ers: 7842282319 ns, loops: 10000, batches: 1738, memory footprint: 86MB
Total time taken by all kfree'ers: 7230161977 ns, loops: 10000, batches: 1542, memory footprint: 72MB

Tested with NOCB option, all offloading CPUs:

kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
  --kconfig CONFIG_NR_CPUS=64 \
  --kconfig CONFIG_RCU_NOCB_CPU=y \
  --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
  --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 \
  rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make

According to the data there is a big gain in memory footprint with this patch.
It is because call_rcu() and call_rcu_flush() take more effort and time to
queue a callback and then wait for a grace period.

With the polling API:
  a) we do not need to queue any callback;
  b) we might not even need to wait for a GP completion.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 76973d716921..2be4f80867f2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2919,18 +2919,20 @@ struct kvfree_rcu_bulk_data {
 	((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
 
 /**
+ * @rcu_work: A work to reclaim a memory after a grace period
  * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
- * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
  * @head_free: List of kfree_rcu() objects waiting for a grace period
  * @bkvhead_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
  * @krcp: Pointer to @kfree_rcu_cpu structure
+ * @gp_snap: A snapshot of current grace period
  */
 
 struct kfree_rcu_cpu_work {
-	struct rcu_work rcu_work;
+	struct work_struct rcu_work;
 	struct rcu_head *head_free;
 	struct kvfree_rcu_bulk_data *bkvhead_free[FREE_N_CHANNELS];
 	struct kfree_rcu_cpu *krcp;
+	unsigned long gp_snap;
 };
 
 /**
@@ -3064,10 +3066,11 @@ static void kfree_rcu_work(struct work_struct *work)
 	struct rcu_head *head, *next;
 	struct kfree_rcu_cpu *krcp;
 	struct kfree_rcu_cpu_work *krwp;
+	unsigned long this_krwp_gp_snap;
 	int i, j;
 
-	krwp = container_of(to_rcu_work(work),
-			    struct kfree_rcu_cpu_work, rcu_work);
+	krwp = container_of(work,
+		struct kfree_rcu_cpu_work, rcu_work);
 	krcp = krwp->krcp;
 
 	raw_spin_lock_irqsave(&krcp->lock, flags);
@@ -3080,8 +3083,15 @@ static void kfree_rcu_work(struct work_struct *work)
 	// Channel 3.
 	head = krwp->head_free;
 	krwp->head_free = NULL;
+
+	// Get the latest saved state of grace period
+	// associated with this "krwp" and objects there.
+	this_krwp_gp_snap = krwp->gp_snap;
 	raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
+	// Check the state.
+	cond_synchronize_rcu(this_krwp_gp_snap);
+
 	// Handle the first two channels.
 	for (i = 0; i < FREE_N_CHANNELS; i++) {
 		for (; bkvhead[i]; bkvhead[i] = bnext) {
@@ -3212,12 +3222,22 @@ static void kfree_rcu_monitor(struct work_struct *work)
 
 			WRITE_ONCE(krcp->count, 0);
 
+			// 1. Take a snapshot for this krwp. Please note no
+			// more any objects can be added to channels which
+			// have already been attached.
+			//
+			// 2. Since a "krwp" has several free channels a GP
+			// status of latest attached one is recorded, i.e.
+			// it allows us to maintain a GP status for last in
+			// objects among all channels.
+			krwp->gp_snap = get_state_synchronize_rcu();
+
 			// One work is per one batch, so there are three
 			// "free channels", the batch can handle. It can
 			// be that the work is in the pending state when
 			// channels have been detached following by each
 			// other.
-			queue_rcu_work(system_wq, &krwp->rcu_work);
+			queue_work(system_wq, &krwp->rcu_work);
 		}
 	}
 
@@ -4808,7 +4828,7 @@ static void __init kfree_rcu_batch_init(void)
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
 		for (i = 0; i < KFREE_N_BATCHES; i++) {
-			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
+			INIT_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
 			krcp->krw_arr[i].krcp = krcp;
 		}
 
> > > @@ -3217,7 +3220,10 @@ static void kfree_rcu_monitor(struct work_struct
> > *work)
> > >                       // be that the work is in the pending state when
> > >                       // channels have been detached following by each
> > >                       // other.
> > > -                     queue_rcu_work(system_wq, &krwp->rcu_work);
> > > +                     if (poll_state_synchronize_rcu(krcp->gp_snap))
> > > +                             queue_work(system_wq, &krwp->rcu_work.work
> > );
> > > +                     else
> > > +                             queue_rcu_work(system_wq, &krwp->rcu_work);
> > >               }
> > >
> > Why do you want to queue a work over RCU-core?
> >
> > 1.
> > call_rcu()
> >    -> queue_work();
> >       -> do reclaim
> >
> > if it can be improved and simplified as:
> >
> > 2.
> > queue_work();
> >     -> cond_synchronize_rcu(), do reclaim
> 
> 
> The second one blocks, the first doesn’t. So I’m not sure how that’s an
> improvement? I think we don’t want to block in kworker due to possible
> scheduling issues and hurting other kwork during critical operations, if
> possible.
> 
Does the first wait for a full grace period so our kworker is not even run?

--
Uladzislau Rezki
  
Joel Fernandes Nov. 11, 2022, 1:56 a.m. UTC | #3
On Thu, Nov 10, 2022 at 03:01:30PM +0100, Uladzislau Rezki wrote:
> > Hi,
> > 
> > On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > 
> > > > On ChromeOS, using this with the increased timeout, we see that we
> > > almost always
> > > > never need to initiate a new grace period. Testing also shows this frees
> > > large
> > > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > > >
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > ---
> > > > v1->v2: Same logic but use polled grace periods instead of sampling
> > > gp_seq.
> > > >
> > > >  kernel/rcu/tree.c | 8 +++++++-
> > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 591187b6352e..ed41243f7a49 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > > >
> > > >  /**
> > > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > > period
> > > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > > work.
> > > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > > grace period
> > > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > > grace period
> > > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > > >       raw_spinlock_t lock;
> > > >       struct delayed_work monitor_work;
> > > > +     unsigned long gp_snap;
> > > >       bool initialized;
> > > >       int count;
> > > >
> > > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > > *krcp)
> > > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > > delay);
> > > >               return;
> > > >       }
> > > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > > >  }
> > > >
> > > How do you guarantee a full grace period for objects which proceed
> > > to be placed into an input stream that is not yet detached?
> > 
> > 
> > Just replying from phone as I’m OOO today.
> > 
> > Hmm, so you’re saying that objects can be queued after the delayed work has
> > been queued, but processed when the delayed work is run? I’m looking at
> > this code after few years so I may have missed something.
> > 
> > That’s a good point and I think I missed that. I think your version did too
> > but I’ll have to double check.
> > 
> > The fix then is to sample the clock for the latest object queued, not for
> > when the delayed work is queued.
> > 
> The patch i sent gurantee it. Just in case see v2:

You are right, and thank you! CBs can be queued while the monitor timer is in
progress, so we need to sample unconditionally. I think my approach is still
better since I take advantage of multiple seconds (I update the snapshot on
every CB queue and sample it in the monitor handler).

Whereas your patch snapshots just before queuing the regular work and reads it
when the work is executed (a much shorter duration, so I bet you would be
blocking in cond_synchronize_rcu() more often).

As you pointed out, I was sampling too late; that should be fixed below. Thoughts?

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH v3] rcu/kfree: Do not request RCU when not needed

On ChromeOS, using this with the increased timeout, we see that we almost
never need to initiate a new grace period. Testing also shows that this frees
large amounts of unreclaimed memory under intense kfree_rcu() pressure.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 591187b6352e..499e6ab56fbf 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
 
 /**
  * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
+ * @gp_snap: The GP snapshot recorded at the last scheduling of monitor work.
  * @head: List of kfree_rcu() objects not yet waiting for a grace period
  * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
  * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
@@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
 	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
 	raw_spinlock_t lock;
 	struct delayed_work monitor_work;
+	unsigned long gp_snap;
 	bool initialized;
 	int count;
 
@@ -3217,7 +3219,10 @@ static void kfree_rcu_monitor(struct work_struct *work)
 			// be that the work is in the pending state when
 			// channels have been detached following by each
 			// other.
-			queue_rcu_work(system_wq, &krwp->rcu_work);
+			if (poll_state_synchronize_rcu(krcp->gp_snap))
+				queue_work(system_wq, &krwp->rcu_work.work);
+			else
+				queue_rcu_work(system_wq, &krwp->rcu_work);
 		}
 	}
 
@@ -3409,6 +3414,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 
 	WRITE_ONCE(krcp->count, krcp->count + 1);
 
+	// Snapshot the GP clock for the latest callback.
+	krcp->gp_snap = get_state_synchronize_rcu();
+
 	// Set timer to drain after KFREE_DRAIN_JIFFIES.
 	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
 		schedule_delayed_monitor_work(krcp);
  
Uladzislau Rezki Nov. 14, 2022, 12:20 p.m. UTC | #4
> On Thu, Nov 10, 2022 at 03:01:30PM +0100, Uladzislau Rezki wrote:
> > > Hi,
> > > 
> > > On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > 
> > > > > On ChromeOS, using this with the increased timeout, we see that we
> > > > almost always
> > > > > never need to initiate a new grace period. Testing also shows this frees
> > > > large
> > > > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > > > >
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > ---
> > > > > v1->v2: Same logic but use polled grace periods instead of sampling
> > > > gp_seq.
> > > > >
> > > > >  kernel/rcu/tree.c | 8 +++++++-
> > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 591187b6352e..ed41243f7a49 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > > > >
> > > > >  /**
> > > > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > > > period
> > > > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > > > work.
> > > > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > > > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > > > grace period
> > > > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > > > grace period
> > > > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > > > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > > > >       raw_spinlock_t lock;
> > > > >       struct delayed_work monitor_work;
> > > > > +     unsigned long gp_snap;
> > > > >       bool initialized;
> > > > >       int count;
> > > > >
> > > > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > > > *krcp)
> > > > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > > > delay);
> > > > >               return;
> > > > >       }
> > > > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > > > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > > > >  }
> > > > >
> > > > How do you guarantee a full grace period for objects which proceed
> > > > to be placed into an input stream that is not yet detached?
> > > 
> > > 
> > > Just replying from phone as I’m OOO today.
> > > 
> > > Hmm, so you’re saying that objects can be queued after the delayed work has
> > > been queued, but processed when the delayed work is run? I’m looking at
> > > this code after few years so I may have missed something.
> > > 
> > > That’s a good point and I think I missed that. I think your version did too
> > > but I’ll have to double check.
> > > 
> > > The fix then is to sample the clock for the latest object queued, not for
> > > when the delayed work is queued.
> > > 
> > The patch i sent gurantee it. Just in case see v2:
> 
> You are right and thank you! CBs can be queued while the monitor timer is in
> progress. So we need to sample unconditionally. I think my approach is still
> better since I take advantage of multiple seconds (I update snapshot on every
> CB queue monitor and sample in the monitor handler).
> 
> Whereas your patch is snapshotting before queuing the regular work and when
> the work is executed (This is a much shorter duration and I bet you would be
> blocking in cond_synchronize..() more often).
> 
There is a performance test that measures elapsed time and memory
footprint, so you can do a quick comparison. "rcutorture" can be
run with various parameters to figure out whether the patch in question
makes any difference.

Usually I run it as:

<snip>
#! /usr/bin/env bash

LOOPS=10

for (( i=0; i<$LOOPS; i++ )); do
        tools/testing/selftests/rcutorture/bin/kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
        --kconfig CONFIG_NR_CPUS=64 \
        --kconfig CONFIG_RCU_NOCB_CPU=y \
        --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
        --kconfig CONFIG_RCU_LAZY=n \
        --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make
        echo "Done $i"
done
<snip>

Just run it from your Linux sandbox.

--
Uladzislau Rezki
  
Paul E. McKenney Nov. 14, 2022, 4:17 p.m. UTC | #5
On Mon, Nov 14, 2022 at 01:20:33PM +0100, Uladzislau Rezki wrote:
> > On Thu, Nov 10, 2022 at 03:01:30PM +0100, Uladzislau Rezki wrote:
> > > > Hi,
> > > > 
> > > > On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > 
> > > > > > On ChromeOS, using this with the increased timeout, we see that we
> > > > > almost always
> > > > > > never need to initiate a new grace period. Testing also shows this frees
> > > > > large
> > > > > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > > > > >
> > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > ---
> > > > > > v1->v2: Same logic but use polled grace periods instead of sampling
> > > > > gp_seq.
> > > > > >
> > > > > >  kernel/rcu/tree.c | 8 +++++++-
> > > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 591187b6352e..ed41243f7a49 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > > > > >
> > > > > >  /**
> > > > > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > > > > period
> > > > > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > > > > work.
> > > > > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > > > > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > > > > grace period
> > > > > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > > > > grace period
> > > > > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > > > > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > > > > >       raw_spinlock_t lock;
> > > > > >       struct delayed_work monitor_work;
> > > > > > +     unsigned long gp_snap;
> > > > > >       bool initialized;
> > > > > >       int count;
> > > > > >
> > > > > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > > > > *krcp)
> > > > > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > > > > delay);
> > > > > >               return;
> > > > > >       }
> > > > > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > > > > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > > > > >  }
> > > > > >
> > > > > How do you guarantee a full grace period for objects which proceed
> > > > > to be placed into an input stream that is not yet detached?
> > > > 
> > > > 
> > > > Just replying from phone as I’m OOO today.
> > > > 
> > > > Hmm, so you’re saying that objects can be queued after the delayed work has
> > > > been queued, but processed when the delayed work is run? I’m looking at
> > > > this code after few years so I may have missed something.
> > > > 
> > > > That’s a good point and I think I missed that. I think your version did too
> > > > but I’ll have to double check.
> > > > 
> > > > The fix then is to sample the clock for the latest object queued, not for
> > > > when the delayed work is queued.
> > > > 
> > > The patch i sent gurantee it. Just in case see v2:
> > 
> > You are right and thank you! CBs can be queued while the monitor timer is in
> > progress. So we need to sample unconditionally. I think my approach is still
> > better since I take advantage of multiple seconds (I update snapshot on every
> > CB queue monitor and sample in the monitor handler).
> > 
> > Whereas your patch is snapshotting before queuing the regular work and when
> > the work is executed (This is a much shorter duration and I bet you would be
> > blocking in cond_synchronize..() more often).
> > 
> There is a performance test that measures a taken time and memory
> footprint, so you can do a quick comparison. A "rcutorture" can be
> run with various parameters to figure out if a patch that is in question
> makes any difference.
> 
> Usually i run it as:
> 
> <snip>
> #! /usr/bin/env bash
> 
> LOOPS=10
> 
> for (( i=0; i<$LOOPS; i++ )); do
>         tools/testing/selftests/rcutorture/bin/kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
>         --kconfig CONFIG_NR_CPUS=64 \
>         --kconfig CONFIG_RCU_NOCB_CPU=y \
>         --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
>         --kconfig CONFIG_RCU_LAZY=n \
>         --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make
>         echo "Done $i"
> done
> <snip>
> 
> just run it from your linux sandbox.

Would it make sense to modify the "if test "$do_kvfree" = "yes" code
in tools/testing/selftests/rcutorture/bin/torture.sh to do something
like this instead of what it currently does?

Though if so, it would be much faster to use kvm.sh's --buildonly flag
to build the kernel, then kvm-again.sh to rerun that kernel.

						Thanx, Paul
  
Joel Fernandes Nov. 14, 2022, 8:49 p.m. UTC | #6
On Mon, Nov 14, 2022 at 7:20 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> > On Thu, Nov 10, 2022 at 03:01:30PM +0100, Uladzislau Rezki wrote:
> > > > Hi,
> > > >
> > > > On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > > On ChromeOS, using this with the increased timeout, we see that we
> > > > > almost always
> > > > > > never need to initiate a new grace period. Testing also shows this frees
> > > > > large
> > > > > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > > > > >
> > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > ---
> > > > > > v1->v2: Same logic but use polled grace periods instead of sampling
> > > > > gp_seq.
> > > > > >
> > > > > >  kernel/rcu/tree.c | 8 +++++++-
> > > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 591187b6352e..ed41243f7a49 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > > > > >
> > > > > >  /**
> > > > > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > > > > period
> > > > > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > > > > work.
> > > > > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > > > > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > > > > grace period
> > > > > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > > > > grace period
> > > > > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > > > > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > > > > >       raw_spinlock_t lock;
> > > > > >       struct delayed_work monitor_work;
> > > > > > +     unsigned long gp_snap;
> > > > > >       bool initialized;
> > > > > >       int count;
> > > > > >
> > > > > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > > > > *krcp)
> > > > > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > > > > delay);
> > > > > >               return;
> > > > > >       }
> > > > > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > > > > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > > > > >  }
> > > > > >
> > > > > How do you guarantee a full grace period for objects which proceed
> > > > > to be placed into an input stream that is not yet detached?
> > > >
> > > >
> > > > Just replying from phone as I’m OOO today.
> > > >
> > > > Hmm, so you’re saying that objects can be queued after the delayed work has
> > > > been queued, but processed when the delayed work is run? I’m looking at
> > > > this code after few years so I may have missed something.
> > > >
> > > > That’s a good point and I think I missed that. I think your version did too
> > > > but I’ll have to double check.
> > > >
> > > > The fix then is to sample the clock for the latest object queued, not for
> > > > when the delayed work is queued.
> > > >
> > > The patch i sent gurantee it. Just in case see v2:
> >
> > You are right and thank you! CBs can be queued while the monitor timer is in
> > progress. So we need to sample unconditionally. I think my approach is still
> > better since I take advantage of multiple seconds (I update snapshot on every
> > CB queue monitor and sample in the monitor handler).
> >
> > Whereas your patch is snapshotting before queuing the regular work and when
> > the work is executed (This is a much shorter duration and I bet you would be
> > blocking in cond_synchronize..() more often).
> >
> There is a performance test that measures a taken time and memory
> footprint, so you can do a quick comparison. A "rcutorture" can be
> run with various parameters to figure out if a patch that is in question
> makes any difference.

Yes, sure, I am doing a run now with my patch. However, I have a
question -- why do you feel blocking in the kworker is not an issue?
You are taking a snapshot before queuing the normal kwork and then
reading the snapshot when the normal kwork runs. Considering it is a
high-priority queue, the delay between taking the snapshot and reading
it is likely small, so there is a bigger chance of blocking in
cond_synchronize_rcu(). Did I miss something?

> Usually i run it as:
>
> <snip>
> #! /usr/bin/env bash
>
> LOOPS=10
>
> for (( i=0; i<$LOOPS; i++ )); do
>         tools/testing/selftests/rcutorture/bin/kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
>         --kconfig CONFIG_NR_CPUS=64 \
>         --kconfig CONFIG_RCU_NOCB_CPU=y \
>         --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
>         --kconfig CONFIG_RCU_LAZY=n \
>         --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make
>         echo "Done $i"
> done

Sounds good, thanks.

> <snip>
>
> just run it from your linux sandbox.

Ok, will do.

 - Joel
  
Joel Fernandes Nov. 14, 2022, 8:54 p.m. UTC | #7
On Mon, Nov 14, 2022 at 11:17 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Nov 14, 2022 at 01:20:33PM +0100, Uladzislau Rezki wrote:
> > > On Thu, Nov 10, 2022 at 03:01:30PM +0100, Uladzislau Rezki wrote:
> > > > > Hi,
> > > > >
> > > > > On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > > > On ChromeOS, using this with the increased timeout, we see that we
> > > > > > almost always
> > > > > > > never need to initiate a new grace period. Testing also shows this frees
> > > > > > large
> > > > > > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > > > > > >
> > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > > ---
> > > > > > > v1->v2: Same logic but use polled grace periods instead of sampling
> > > > > > gp_seq.
> > > > > > >
> > > > > > >  kernel/rcu/tree.c | 8 +++++++-
> > > > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 591187b6352e..ed41243f7a49 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > > > > > >
> > > > > > >  /**
> > > > > > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > > > > > period
> > > > > > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > > > > > work.
> > > > > > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > > > > > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > > > > > grace period
> > > > > > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > > > > > grace period
> > > > > > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > > > > > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > > > > > >       raw_spinlock_t lock;
> > > > > > >       struct delayed_work monitor_work;
> > > > > > > +     unsigned long gp_snap;
> > > > > > >       bool initialized;
> > > > > > >       int count;
> > > > > > >
> > > > > > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > > > > > *krcp)
> > > > > > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > > > > > delay);
> > > > > > >               return;
> > > > > > >       }
> > > > > > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > > > > > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > > > > > >  }
> > > > > > >
> > > > > > How do you guarantee a full grace period for objects which proceed
> > > > > > to be placed into an input stream that is not yet detached?
> > > > >
> > > > >
> > > > > Just replying from phone as I’m OOO today.
> > > > >
> > > > > Hmm, so you’re saying that objects can be queued after the delayed work has
> > > > > been queued, but processed when the delayed work is run? I’m looking at
> > > > > this code after few years so I may have missed something.
> > > > >
> > > > > That’s a good point and I think I missed that. I think your version did too
> > > > > but I’ll have to double check.
> > > > >
> > > > > The fix then is to sample the clock for the latest object queued, not for
> > > > > when the delayed work is queued.
> > > > >
> > > > The patch i sent gurantee it. Just in case see v2:
> > >
> > > You are right and thank you! CBs can be queued while the monitor timer is in
> > > progress. So we need to sample unconditionally. I think my approach is still
> > > better since I take advantage of multiple seconds (I update snapshot on every
> > > CB queue monitor and sample in the monitor handler).
> > >
> > > Whereas your patch is snapshotting before queuing the regular work and when
> > > the work is executed (This is a much shorter duration and I bet you would be
> > > blocking in cond_synchronize..() more often).
> > >
> > There is a performance test that measures a taken time and memory
> > footprint, so you can do a quick comparison. A "rcutorture" can be
> > run with various parameters to figure out if a patch that is in question
> > makes any difference.
> >
> > Usually i run it as:
> >
> > <snip>
> > #! /usr/bin/env bash
> >
> > LOOPS=10
> >
> > for (( i=0; i<$LOOPS; i++ )); do
> >         tools/testing/selftests/rcutorture/bin/kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
> >         --kconfig CONFIG_NR_CPUS=64 \
> >         --kconfig CONFIG_RCU_NOCB_CPU=y \
> >         --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
> >         --kconfig CONFIG_RCU_LAZY=n \
> >         --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make
> >         echo "Done $i"
> > done
> > <snip>
> >
> > just run it from your linux sandbox.
>
> Would it make sense to modify the "if test "$do_kvfree" = "yes" code
> in tools/testing/selftests/rcutorture/bin/torture.sh to do something
> like this instead of what it currently does?

Yes, I think so. Were you also thinking of adding it to
tools/testing/selftests/rcutorture/configs with a corresponding
".boot" file for the kfree test's boot parameters?

If it means the bots will run it more, that would be awesome :-)

Thanks,

 - Joel
  
Paul E. McKenney Nov. 14, 2022, 9:26 p.m. UTC | #8
On Mon, Nov 14, 2022 at 03:54:30PM -0500, Joel Fernandes wrote:
> On Mon, Nov 14, 2022 at 11:17 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Nov 14, 2022 at 01:20:33PM +0100, Uladzislau Rezki wrote:
> > > > On Thu, Nov 10, 2022 at 03:01:30PM +0100, Uladzislau Rezki wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > >
> > > > > > > > On ChromeOS, using this with the increased timeout, we see that we
> > > > > > > almost always
> > > > > > > > never need to initiate a new grace period. Testing also shows this frees
> > > > > > > large
> > > > > > > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > > > > > > >
> > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > > > ---
> > > > > > > > v1->v2: Same logic but use polled grace periods instead of sampling
> > > > > > > gp_seq.
> > > > > > > >
> > > > > > > >  kernel/rcu/tree.c | 8 +++++++-
> > > > > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > > index 591187b6352e..ed41243f7a49 100644
> > > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > > > > > > >
> > > > > > > >  /**
> > > > > > > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > > > > > > period
> > > > > > > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > > > > > > work.
> > > > > > > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > > > > > > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > > > > > > grace period
> > > > > > > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > > > > > > grace period
> > > > > > > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > > > > > > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > > > > > > >       raw_spinlock_t lock;
> > > > > > > >       struct delayed_work monitor_work;
> > > > > > > > +     unsigned long gp_snap;
> > > > > > > >       bool initialized;
> > > > > > > >       int count;
> > > > > > > >
> > > > > > > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > > > > > > *krcp)
> > > > > > > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > > > > > > delay);
> > > > > > > >               return;
> > > > > > > >       }
> > > > > > > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > > > > > > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > > > > > > >  }
> > > > > > > >
> > > > > > > How do you guarantee a full grace period for objects which proceed
> > > > > > > to be placed into an input stream that is not yet detached?
> > > > > >
> > > > > >
> > > > > > Just replying from phone as I’m OOO today.
> > > > > >
> > > > > > Hmm, so you’re saying that objects can be queued after the delayed work has
> > > > > > been queued, but processed when the delayed work is run? I’m looking at
> > > > > > this code after few years so I may have missed something.
> > > > > >
> > > > > > That’s a good point and I think I missed that. I think your version did too
> > > > > > but I’ll have to double check.
> > > > > >
> > > > > > The fix then is to sample the clock for the latest object queued, not for
> > > > > > when the delayed work is queued.
> > > > > >
> > > > > The patch i sent gurantee it. Just in case see v2:
> > > >
> > > > You are right and thank you! CBs can be queued while the monitor timer is in
> > > > progress. So we need to sample unconditionally. I think my approach is still
> > > > better since I take advantage of multiple seconds (I update snapshot on every
> > > > CB queue monitor and sample in the monitor handler).
> > > >
> > > > Whereas your patch is snapshotting before queuing the regular work and when
> > > > the work is executed (This is a much shorter duration and I bet you would be
> > > > blocking in cond_synchronize..() more often).
> > > >
> > > There is a performance test that measures a taken time and memory
> > > footprint, so you can do a quick comparison. A "rcutorture" can be
> > > run with various parameters to figure out if a patch that is in question
> > > makes any difference.
> > >
> > > Usually i run it as:
> > >
> > > <snip>
> > > #! /usr/bin/env bash
> > >
> > > LOOPS=10
> > >
> > > for (( i=0; i<$LOOPS; i++ )); do
> > >         tools/testing/selftests/rcutorture/bin/kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
> > >         --kconfig CONFIG_NR_CPUS=64 \
> > >         --kconfig CONFIG_RCU_NOCB_CPU=y \
> > >         --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
> > >         --kconfig CONFIG_RCU_LAZY=n \
> > >         --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make
> > >         echo "Done $i"
> > > done
> > > <snip>
> > >
> > > just run it from your linux sandbox.
> >
> > Would it make sense to modify the "if test "$do_kvfree" = "yes" code
> > in tools/testing/selftests/rcutorture/bin/torture.sh to do something
> > like this instead of what it currently does?
> 
> Yes I think so, Were you also thinking of adding it to
> tools/testing/selftests/rcutorture/configs with a corresponding
> ".boot" file for the kfree test's boot parameters?

No, because I run torture.sh reasonably frequently, and the style of
test makes it better to run separately in that way.  In particular,
it needs lots of memory and runs for a short duration.  This is not all
that compatible with typical rcutorture runs, where lots of scenarios run
concurrently (thus limiting each VM's memory) and for long time periods.

> If it means the bots will run it more, that would be awesome :-)

I am not quite ready to have the bots run torture.sh.  Though maybe
with --do-no-scf it would be fine?

							Thanx, Paul
  
Uladzislau Rezki Nov. 15, 2022, 12:05 p.m. UTC | #9
On Mon, Nov 14, 2022 at 08:17:33AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 14, 2022 at 01:20:33PM +0100, Uladzislau Rezki wrote:
> > > On Thu, Nov 10, 2022 at 03:01:30PM +0100, Uladzislau Rezki wrote:
> > > > > Hi,
> > > > > 
> > > > > On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > 
> > > > > > > On ChromeOS, using this with the increased timeout, we see that we
> > > > > > almost always
> > > > > > > never need to initiate a new grace period. Testing also shows this frees
> > > > > > large
> > > > > > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > > > > > >
> > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > > ---
> > > > > > > v1->v2: Same logic but use polled grace periods instead of sampling
> > > > > > gp_seq.
> > > > > > >
> > > > > > >  kernel/rcu/tree.c | 8 +++++++-
> > > > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 591187b6352e..ed41243f7a49 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > > > > > >
> > > > > > >  /**
> > > > > > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > > > > > period
> > > > > > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > > > > > work.
> > > > > > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > > > > > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > > > > > grace period
> > > > > > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > > > > > grace period
> > > > > > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > > > > > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > > > > > >       raw_spinlock_t lock;
> > > > > > >       struct delayed_work monitor_work;
> > > > > > > +     unsigned long gp_snap;
> > > > > > >       bool initialized;
> > > > > > >       int count;
> > > > > > >
> > > > > > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > > > > > *krcp)
> > > > > > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > > > > > delay);
> > > > > > >               return;
> > > > > > >       }
> > > > > > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > > > > > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > > > > > >  }
> > > > > > >
> > > > > > How do you guarantee a full grace period for objects which proceed
> > > > > > to be placed into an input stream that is not yet detached?
> > > > > 
> > > > > 
> > > > > Just replying from phone as I’m OOO today.
> > > > > 
> > > > > Hmm, so you’re saying that objects can be queued after the delayed work has
> > > > > been queued, but processed when the delayed work is run? I’m looking at
> > > > > this code after few years so I may have missed something.
> > > > > 
> > > > > That’s a good point and I think I missed that. I think your version did too
> > > > > but I’ll have to double check.
> > > > > 
> > > > > The fix then is to sample the clock for the latest object queued, not for
> > > > > when the delayed work is queued.
> > > > > 
> > > > The patch i sent gurantee it. Just in case see v2:
> > > 
> > > You are right and thank you! CBs can be queued while the monitor timer is in
> > > progress. So we need to sample unconditionally. I think my approach is still
> > > better since I take advantage of multiple seconds (I update snapshot on every
> > > CB queue monitor and sample in the monitor handler).
> > > 
> > > Whereas your patch is snapshotting before queuing the regular work and when
> > > the work is executed (This is a much shorter duration and I bet you would be
> > > blocking in cond_synchronize..() more often).
> > > 
> > There is a performance test that measures a taken time and memory
> > footprint, so you can do a quick comparison. A "rcutorture" can be
> > run with various parameters to figure out if a patch that is in question
> > makes any difference.
> > 
> > Usually i run it as:
> > 
> > <snip>
> > #! /usr/bin/env bash
> > 
> > LOOPS=10
> > 
> > for (( i=0; i<$LOOPS; i++ )); do
> >         tools/testing/selftests/rcutorture/bin/kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
> >         --kconfig CONFIG_NR_CPUS=64 \
> >         --kconfig CONFIG_RCU_NOCB_CPU=y \
> >         --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
> >         --kconfig CONFIG_RCU_LAZY=n \
> >         --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make
> >         echo "Done $i"
> > done
> > <snip>
> > 
> > just run it from your linux sandbox.
> 
> Would it make sense to modify the "if test "$do_kvfree" = "yes" code
> in tools/testing/selftests/rcutorture/bin/torture.sh to do something
> like this instead of what it currently does?
> 

<snip>
if test "$do_kvfree" = "yes"
then
        torture_bootargs="rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot"
        torture_set "rcuscale-kvfree" tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig "CONFIG_NR_CPUS=$HALF_ALLOTED_CPUS" --memory 2G --trust-make
fi
<snip>

It does almost the same, but I usually add some repeat loops to get some
average results.

> Though if so, it would be much faster to use kvm.sh's --buildonly flag
> to build the kernel, then kvm-again.sh to rerun that kernel.
> 
I think it can easily be run from a bash "while true; do ... done" script
with a desired number of iterations.

But maybe torture.sh already has it, i.e. a number of iterations.

--
Uladzislau Rezki
  
Uladzislau Rezki Nov. 15, 2022, 1:07 p.m. UTC | #10
On Mon, Nov 14, 2022 at 03:49:16PM -0500, Joel Fernandes wrote:
> On Mon, Nov 14, 2022 at 7:20 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > > On Thu, Nov 10, 2022 at 03:01:30PM +0100, Uladzislau Rezki wrote:
> > > > > Hi,
> > > > >
> > > > > On Thu, Nov 10, 2022 at 8:05 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > > > On ChromeOS, using this with the increased timeout, we see that we
> > > > > > almost always
> > > > > > > never need to initiate a new grace period. Testing also shows this frees
> > > > > > large
> > > > > > > amounts of unreclaimed memory, under intense kfree_rcu() pressure.
> > > > > > >
> > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > > ---
> > > > > > > v1->v2: Same logic but use polled grace periods instead of sampling
> > > > > > gp_seq.
> > > > > > >
> > > > > > >  kernel/rcu/tree.c | 8 +++++++-
> > > > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 591187b6352e..ed41243f7a49 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -2935,6 +2935,7 @@ struct kfree_rcu_cpu_work {
> > > > > > >
> > > > > > >  /**
> > > > > > >   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace
> > > > > > period
> > > > > > > + * @gp_snap: The GP snapshot recorded at the last scheduling of monitor
> > > > > > work.
> > > > > > >   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> > > > > > >   * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a
> > > > > > grace period
> > > > > > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a
> > > > > > grace period
> > > > > > > @@ -2964,6 +2965,7 @@ struct kfree_rcu_cpu {
> > > > > > >       struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > > > > > >       raw_spinlock_t lock;
> > > > > > >       struct delayed_work monitor_work;
> > > > > > > +     unsigned long gp_snap;
> > > > > > >       bool initialized;
> > > > > > >       int count;
> > > > > > >
> > > > > > > @@ -3167,6 +3169,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu
> > > > > > *krcp)
> > > > > > >                       mod_delayed_work(system_wq, &krcp->monitor_work,
> > > > > > delay);
> > > > > > >               return;
> > > > > > >       }
> > > > > > > +     krcp->gp_snap = get_state_synchronize_rcu();
> > > > > > >       queue_delayed_work(system_wq, &krcp->monitor_work, delay);
> > > > > > >  }
> > > > > > >
> > > > > > How do you guarantee a full grace period for objects which proceed
> > > > > > to be placed into an input stream that is not yet detached?
> > > > >
> > > > >
> > > > > Just replying from phone as I’m OOO today.
> > > > >
> > > > > Hmm, so you’re saying that objects can be queued after the delayed work has
> > > > > been queued, but processed when the delayed work is run? I’m looking at
> > > > > this code after few years so I may have missed something.
> > > > >
> > > > > That’s a good point and I think I missed that. I think your version did too
> > > > > but I’ll have to double check.
> > > > >
> > > > > The fix then is to sample the clock for the latest object queued, not for
> > > > > when the delayed work is queued.
> > > > >
> > > > The patch i sent gurantee it. Just in case see v2:
> > >
> > > You are right and thank you! CBs can be queued while the monitor timer is in
> > > progress. So we need to sample unconditionally. I think my approach is still
> > > better since I take advantage of multiple seconds (I update snapshot on every
> > > CB queue monitor and sample in the monitor handler).
> > >
> > > Whereas your patch is snapshotting before queuing the regular work and when
> > > the work is executed (This is a much shorter duration and I bet you would be
> > > blocking in cond_synchronize..() more often).
> > >
> > There is a performance test that measures a taken time and memory
> > footprint, so you can do a quick comparison. A "rcutorture" can be
> > run with various parameters to figure out if a patch that is in question
> > makes any difference.
> 
> Yes sure, I am doing a run now with my patch. However, I have a
> question -- why do you feel blocking in the kworker is not an issue?
> You are taking a snapshot before queuing the normal kwork and then
> reading the snapshot when the normal kwork runs. Considering it is a
> high priority queue, the delay between when you are taking the
> snapshot, and reading it is likely small so there is a bigger chance
> of blocking in cond_synchronize_rcu(). Did I miss something?
> 
Indeed, we can wait in the reclaim worker. But the worker does not do any
nasty or extra work here. If there is a need, we block and wait; after a
grace period, we are awoken and proceed.

Therefore I do not see the point of handling two cases:

if (gp_done)
    queue_work();
else
    queue_rcu_work();

It is the same as just queuing the work and checking on entry. The current
scenario is: queue the work after a grace period. That is the difference.

Right, if the reclaimer were a high-priority kthread, the time would be shorter.

In your scenario the window seems even shorter (I have not checked) because
you update the krcp snapshot each time kvfree_rcu() is invoked. So basically,
even though you have objects whose grace period has already passed, you do
not separate them in any way, because you update

krcp->gp_snap = get_state_synchronize_rcu();

too often.
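
To put the two placements side by side (condensed from the patches in this
thread, call flow only):

<snip>
/* v3 (Joel): snapshot refreshed on every enqueue. */
kvfree_call_rcu()
    -> krcp->gp_snap = get_state_synchronize_rcu();
kfree_rcu_monitor()
    -> poll_state_synchronize_rcu(krcp->gp_snap) ? queue_work() : queue_rcu_work();

/* v2 (mine): snapshot taken once per batch, when the channels are detached. */
kfree_rcu_monitor()
    -> krwp->gp_snap = get_state_synchronize_rcu();
    -> queue_work();
kfree_rcu_work()
    -> cond_synchronize_rcu(krwp->gp_snap);  /* waits only if the GP has not yet elapsed */
<snip>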

--
Uladzislau Rezki
  
Uladzislau Rezki Nov. 16, 2022, 7:19 p.m. UTC | #11
Hello, Paul, Joel.

> > 
> > Yes sure, I am doing a run now with my patch. However, I have a
> > question -- why do you feel blocking in the kworker is not an issue?
> > You are taking a snapshot before queuing the normal kwork and then
> > reading the snapshot when the normal kwork runs. Considering it is a
> > high priority queue, the delay between when you are taking the
> > snapshot, and reading it is likely small so there is a bigger chance
> > of blocking in cond_synchronize_rcu(). Did I miss something?
> > 
> We can wait indeed in the reclaim worker. But the worker does not do any
> nasty or extra work here. If there is a need we block and wait. After a
> grace period, we are awoken and proceed.
> 
> Therefore i do not see the reason in handling two cases:
> 
> if (gp_done)
>     queue_work();
> else
>     queue_rcu_work();
> 
> it is the same if we just queue the work and check on entry. The current
> scenario is: queue the work after a grace period. This is the difference.
> 
> Right if the reclaimer was a high prio kthread a time would be shorter. 
> 
> In your scenario the time seems even shorter(i have not checked) because
> you update a snapshot of krcp each time a kvfree_rcu() is invoked. So
> basically even though you have objects whose grace period is passed you
> do not separate it anyhow. Because you update the:
> 
> krcp->gp_snap = get_state_synchronize_rcu();
> 
> too often.
> 
Once upon a time we discussed that it is worth keeping track of the GP
per page in order to reduce the memory footprint. The patch below
addresses it:

<snip>
From 76fc6a1398f22341758edcd9aa911127e0cf5129 Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Date: Wed, 2 Nov 2022 19:26:27 +0100
Subject: [PATCH v3 1/1] rcu: kvfree_rcu: Reduce a memory footprint by using
 polling APIs

Total time taken by all kfree'ers: 6564718459 ns, loops: 10000, batches: 1110, memory footprint: 5057MB
Total time taken by all kfree'ers: 8431051895 ns, loops: 10000, batches: 1109, memory footprint: 2749MB
Total time taken by all kfree'ers: 9477830789 ns, loops: 10000, batches: 1158, memory footprint: 2934MB
Total time taken by all kfree'ers: 9950211144 ns, loops: 10000, batches: 981, memory footprint: 2704MB

with a patch:

Total time taken by all kfree'ers: 7712110118 ns, loops: 10000, batches: 1660, memory footprint: 91MB
Total time taken by all kfree'ers: 7002403664 ns, loops: 10000, batches: 1482, memory footprint: 86MB
Total time taken by all kfree'ers: 7842282319 ns, loops: 10000, batches: 1738, memory footprint: 86MB
Total time taken by all kfree'ers: 7230161977 ns, loops: 10000, batches: 1542, memory footprint: 72MB

Tested with NOCB option, all offloading CPUs:

kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
  --kconfig CONFIG_NR_CPUS=64 \
  --kconfig CONFIG_RCU_NOCB_CPU=y \
  --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
  --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 \
  rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make

According to the data there is a big gain in memory footprint with the
patch. This is because call_rcu() and call_rcu_flush() take more effort
and time to queue a callback and then wait for a GP.

With the polling API:
  a) we do not need to queue any callback;
  b) we might not even need to wait for a GP completion.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 115 +++++++++++++++++++++++++++-------------------
 1 file changed, 67 insertions(+), 48 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 76973d716921..6a1f66dd5f09 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2900,13 +2900,16 @@ EXPORT_SYMBOL_GPL(call_rcu);
 
 /**
  * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
+ * @gp_snap: Snapshot of current GP for objects in a page
  * @nr_records: Number of active pointers in the array
+ * @list: Page list
  * @next: Next bulk object in the block chain
  * @records: Array of the kvfree_rcu() pointers
  */
 struct kvfree_rcu_bulk_data {
+	unsigned long gp_snap;
 	unsigned long nr_records;
-	struct kvfree_rcu_bulk_data *next;
+	struct list_head list;
 	void *records[];
 };
 
@@ -2919,24 +2922,26 @@ struct kvfree_rcu_bulk_data {
 	((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
 
 /**
+ * @rcu_work: A work to reclaim a memory after a grace period
  * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
- * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
  * @head_free: List of kfree_rcu() objects waiting for a grace period
- * @bkvhead_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
+ * @head_free_gp_snap: Snapshot of current GP for "@head_free" objects
  * @krcp: Pointer to @kfree_rcu_cpu structure
  */
 
 struct kfree_rcu_cpu_work {
-	struct rcu_work rcu_work;
+	struct work_struct rcu_work;
 	struct rcu_head *head_free;
-	struct kvfree_rcu_bulk_data *bkvhead_free[FREE_N_CHANNELS];
+	unsigned long head_free_gp_snap;
+
+	struct list_head page_free_head[FREE_N_CHANNELS];
 	struct kfree_rcu_cpu *krcp;
 };
 
 /**
  * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
  * @head: List of kfree_rcu() objects not yet waiting for a grace period
- * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
+ * @page_head: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
  * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
  * @lock: Synchronize access to this structure
  * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
@@ -2960,7 +2965,7 @@ struct kfree_rcu_cpu_work {
  */
 struct kfree_rcu_cpu {
 	struct rcu_head *head;
-	struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS];
+	struct list_head page_head[FREE_N_CHANNELS];
 	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
 	raw_spinlock_t lock;
 	struct delayed_work monitor_work;
@@ -3060,60 +3065,62 @@ drain_page_cache(struct kfree_rcu_cpu *krcp)
 static void kfree_rcu_work(struct work_struct *work)
 {
 	unsigned long flags;
-	struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS], *bnext;
+	struct kvfree_rcu_bulk_data *page, *n;
+	struct list_head local_page_head[FREE_N_CHANNELS];
 	struct rcu_head *head, *next;
 	struct kfree_rcu_cpu *krcp;
 	struct kfree_rcu_cpu_work *krwp;
+	unsigned long head_free_gp_snap;
 	int i, j;
 
-	krwp = container_of(to_rcu_work(work),
-			    struct kfree_rcu_cpu_work, rcu_work);
+	krwp = container_of(work,
+		struct kfree_rcu_cpu_work, rcu_work);
 	krcp = krwp->krcp;
 
 	raw_spin_lock_irqsave(&krcp->lock, flags);
 	// Channels 1 and 2.
-	for (i = 0; i < FREE_N_CHANNELS; i++) {
-		bkvhead[i] = krwp->bkvhead_free[i];
-		krwp->bkvhead_free[i] = NULL;
-	}
+	for (i = 0; i < FREE_N_CHANNELS; i++)
+		// Initialized or empty it does not matter just replace.
+		list_replace_init(&krwp->page_free_head[i], &local_page_head[i]);
 
 	// Channel 3.
 	head = krwp->head_free;
 	krwp->head_free = NULL;
+
+	head_free_gp_snap = krwp->head_free_gp_snap;
 	raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
 	// Handle the first two channels.
 	for (i = 0; i < FREE_N_CHANNELS; i++) {
-		for (; bkvhead[i]; bkvhead[i] = bnext) {
-			bnext = bkvhead[i]->next;
-			debug_rcu_bhead_unqueue(bkvhead[i]);
+		// Start from the tail page, so a GP is likely passed for it.
+		list_for_each_entry_safe_reverse(page, n, &local_page_head[i], list) {
+			cond_synchronize_rcu(page->gp_snap);
+			debug_rcu_bhead_unqueue(page);
 
 			rcu_lock_acquire(&rcu_callback_map);
 			if (i == 0) { // kmalloc() / kfree().
 				trace_rcu_invoke_kfree_bulk_callback(
-					rcu_state.name, bkvhead[i]->nr_records,
-					bkvhead[i]->records);
+					rcu_state.name, page->nr_records,
+					page->records);
 
-				kfree_bulk(bkvhead[i]->nr_records,
-					bkvhead[i]->records);
+				kfree_bulk(page->nr_records, page->records);
 			} else { // vmalloc() / vfree().
-				for (j = 0; j < bkvhead[i]->nr_records; j++) {
+				for (j = 0; j < page->nr_records; j++) {
 					trace_rcu_invoke_kvfree_callback(
-						rcu_state.name,
-						bkvhead[i]->records[j], 0);
+						rcu_state.name, page->records[j], 0);
 
-					vfree(bkvhead[i]->records[j]);
+					vfree(page->records[j]);
 				}
 			}
 			rcu_lock_release(&rcu_callback_map);
 
 			raw_spin_lock_irqsave(&krcp->lock, flags);
-			if (put_cached_bnode(krcp, bkvhead[i]))
-				bkvhead[i] = NULL;
+			if (put_cached_bnode(krcp, page))
+				page = NULL;
 			raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
-			if (bkvhead[i])
-				free_page((unsigned long) bkvhead[i]);
+			if (page)
+				free_page((unsigned long) page);
 
 			cond_resched_tasks_rcu_qs();
 		}
@@ -3126,6 +3133,9 @@ static void kfree_rcu_work(struct work_struct *work)
 	 * queued on a linked list through their rcu_head structures.
 	 * This list is named "Channel 3".
 	 */
+	if (head)
+		cond_synchronize_rcu(head_free_gp_snap);
+
 	for (; head; head = next) {
 		unsigned long offset = (unsigned long)head->func;
 		void *ptr = (void *)head - offset;
@@ -3149,7 +3159,7 @@ need_offload_krc(struct kfree_rcu_cpu *krcp)
 	int i;
 
 	for (i = 0; i < FREE_N_CHANNELS; i++)
-		if (krcp->bkvhead[i])
+		if (!list_empty(&krcp->page_head[i]))
 			return true;
 
 	return !!krcp->head;
@@ -3191,16 +3201,15 @@ static void kfree_rcu_monitor(struct work_struct *work)
 		// a previous RCU batch is in progress, it means that
 		// immediately to queue another one is not possible so
 		// in that case the monitor work is rearmed.
-		if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
-			(krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
+		if ((!list_empty(&krcp->page_head[0]) && list_empty(&krwp->page_free_head[0])) ||
+			(!list_empty(&krcp->page_head[1]) && list_empty(&krwp->page_free_head[1])) ||
 				(krcp->head && !krwp->head_free)) {
+
 			// Channel 1 corresponds to the SLAB-pointer bulk path.
 			// Channel 2 corresponds to vmalloc-pointer bulk path.
 			for (j = 0; j < FREE_N_CHANNELS; j++) {
-				if (!krwp->bkvhead_free[j]) {
-					krwp->bkvhead_free[j] = krcp->bkvhead[j];
-					krcp->bkvhead[j] = NULL;
-				}
+				if (list_empty(&krwp->page_free_head[j]))
+					list_replace_init(&krcp->page_head[j], &krwp->page_free_head[j]);
 			}
 
 			// Channel 3 corresponds to both SLAB and vmalloc
@@ -3208,6 +3217,11 @@ static void kfree_rcu_monitor(struct work_struct *work)
 			if (!krwp->head_free) {
 				krwp->head_free = krcp->head;
 				krcp->head = NULL;
+
+				// Take a snapshot for this krwp. Please note no more
+				// any objects can be added to attached head_free channel
+				// therefore fixate a GP for it here.
+				krwp->head_free_gp_snap = get_state_synchronize_rcu();
 			}
 
 			WRITE_ONCE(krcp->count, 0);
@@ -3217,7 +3231,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
 			// be that the work is in the pending state when
 			// channels have been detached following by each
 			// other.
-			queue_rcu_work(system_wq, &krwp->rcu_work);
+			queue_work(system_wq, &krwp->rcu_work);
 		}
 	}
 
@@ -3312,10 +3326,11 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 		return false;
 
 	idx = !!is_vmalloc_addr(ptr);
+	bnode = list_first_entry_or_null(&(*krcp)->page_head[idx],
+		struct kvfree_rcu_bulk_data, list);
 
 	/* Check if a new block is required. */
-	if (!(*krcp)->bkvhead[idx] ||
-			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
+	if (!bnode || bnode->nr_records == KVFREE_BULK_MAX_ENTR) {
 		bnode = get_cached_bnode(*krcp);
 		if (!bnode && can_alloc) {
 			krc_this_cpu_unlock(*krcp, *flags);
@@ -3339,18 +3354,16 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 		if (!bnode)
 			return false;
 
-		/* Initialize the new block. */
+		/* Initialize a new block. */
 		bnode->nr_records = 0;
-		bnode->next = (*krcp)->bkvhead[idx];
-
-		/* Attach it to the head. */
-		(*krcp)->bkvhead[idx] = bnode;
+		list_add(&bnode->list, &(*krcp)->page_head[idx]);
 	}
 
 	/* Finally insert. */
-	(*krcp)->bkvhead[idx]->records
-		[(*krcp)->bkvhead[idx]->nr_records++] = ptr;
+	bnode->records[bnode->nr_records++] = ptr;
 
+	/* Keep updated a GP status of this page. */
+	bnode->gp_snap = get_state_synchronize_rcu();
 	return true;
 }
 
@@ -4790,7 +4803,7 @@ struct workqueue_struct *rcu_gp_wq;
 static void __init kfree_rcu_batch_init(void)
 {
 	int cpu;
-	int i;
+	int i, j;
 
 	/* Clamp it to [0:100] seconds interval. */
 	if (rcu_delay_page_cache_fill_msec < 0 ||
@@ -4808,10 +4821,16 @@ static void __init kfree_rcu_batch_init(void)
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
 		for (i = 0; i < KFREE_N_BATCHES; i++) {
-			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
+			INIT_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
 			krcp->krw_arr[i].krcp = krcp;
+
+			for (j = 0; j < FREE_N_CHANNELS; j++)
+				INIT_LIST_HEAD(&krcp->krw_arr[i].page_free_head[j]);
 		}
 
+		for (i = 0; i < FREE_N_CHANNELS; i++)
+			INIT_LIST_HEAD(&krcp->page_head[i]);
+
 		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
 		INIT_DELAYED_WORK(&krcp->page_cache_work, fill_page_cache_func);
 		krcp->initialized = true;
  
Joel Fernandes Nov. 16, 2022, 10:05 p.m. UTC | #12
On Wed, Nov 16, 2022 at 7:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> Hello, Paul, Joel.
>
> > >
> > > Yes sure, I am doing a run now with my patch. However, I have a
> > > question -- why do you feel blocking in the kworker is not an issue?
> > > You are taking a snapshot before queuing the normal kwork and then
> > > reading the snapshot when the normal kwork runs. Considering it is a
> > > high priority queue, the delay between when you are taking the
> > > snapshot, and reading it is likely small so there is a bigger chance
> > > of blocking in cond_synchronize_rcu(). Did I miss something?
> > >
> > We can wait indeed in the reclaim worker. But the worker does not do any
> > nasty or extra work here. If there is a need we block and wait. After a
> > grace period, we are awoken and proceed.
> >
> > Therefore i do not see the reason in handling two cases:
> >
> > if (gp_done)
> >     queue_work();
> > else
> >     queue_rcu_work();
> >
> > it is the same if we just queue the work and check on entry. The current
> > scenario is: queue the work after a grace period. This is the difference.
> >
> > Right if the reclaimer was a high prio kthread a time would be shorter.
> >
> > In your scenario the time seems even shorter(i have not checked) because
> > you update a snapshot of krcp each time a kvfree_rcu() is invoked. So
> > basically even though you have objects whose grace period is passed you
> > do not separate it anyhow. Because you update the:
> >
> > krcp->gp_snap = get_state_synchronize_rcu();
> >
> > too often.
> >
> Once upon a time we discussed that it is worth to keep track of GP
> per-a-page in order to reduce a memory footprint. Below patch addresses
> it:

In the patch below, it appears you are tracking the GP per krwp, and
not per page. But I could be missing something - could you split it
into separate patches for easier review?

Also it still does cond_synchronize_rcu() :-(

thanks,

 - Joel


>
> <snip>
> From 76fc6a1398f22341758edcd9aa911127e0cf5129 Mon Sep 17 00:00:00 2001
> From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
> Date: Wed, 2 Nov 2022 19:26:27 +0100
> Subject: [PATCH v3 1/1] rcu: kvfree_rcu: Reduce a memory footptint by using
>  polling APIs
>
> Total time taken by all kfree'ers: 6564718459 ns, loops: 10000, batches: 1110, memory footprint: 5057MB
> Total time taken by all kfree'ers: 8431051895 ns, loops: 10000, batches: 1109, memory footprint: 2749MB
> Total time taken by all kfree'ers: 9477830789 ns, loops: 10000, batches: 1158, memory footprint: 2934MB
> Total time taken by all kfree'ers: 9950211144 ns, loops: 10000, batches: 981, memory footprint: 2704MB
>
> with a patch:
>
> Total time taken by all kfree'ers: 7712110118 ns, loops: 10000, batches: 1660, memory footprint: 91MB
> Total time taken by all kfree'ers: 7002403664 ns, loops: 10000, batches: 1482, memory footprint: 86MB
> Total time taken by all kfree'ers: 7842282319 ns, loops: 10000, batches: 1738, memory footprint: 86MB
> Total time taken by all kfree'ers: 7230161977 ns, loops: 10000, batches: 1542, memory footprint: 72MB
>
> Tested with NOCB option, all offloading CPUs:
>
> kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
>   --kconfig CONFIG_NR_CPUS=64 \
>   --kconfig CONFIG_RCU_NOCB_CPU=y \
>   --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
>   --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 \
>   rcuscale.holdoff=20 rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make
>
> According to data there is a big gain in memory footprint with a patch.
> It is because of call_rcu() and call_rcu_flush() take more effort and
> time to queue a callback and then wait for a gp.
>
> With polling API:
>   a) we do not need to queue any callback;
>   b) we might not even need wait for a GP completion.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  kernel/rcu/tree.c | 115 +++++++++++++++++++++++++++-------------------
>  1 file changed, 67 insertions(+), 48 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 76973d716921..6a1f66dd5f09 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2900,13 +2900,16 @@ EXPORT_SYMBOL_GPL(call_rcu);
>
>  /**
>   * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
> + * @gp_snap: Snapshot of current GP for objects in a page
>   * @nr_records: Number of active pointers in the array
> + * @list: Page list
>   * @next: Next bulk object in the block chain
>   * @records: Array of the kvfree_rcu() pointers
>   */
>  struct kvfree_rcu_bulk_data {
> +       unsigned long gp_snap;
>         unsigned long nr_records;
> -       struct kvfree_rcu_bulk_data *next;
> +       struct list_head list;
>         void *records[];
>  };
>
> @@ -2919,24 +2922,26 @@ struct kvfree_rcu_bulk_data {
>         ((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
>
>  /**
> + * @rcu_work: A work to reclaim a memory after a grace period
>   * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
> - * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
>   * @head_free: List of kfree_rcu() objects waiting for a grace period
> - * @bkvhead_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
> + * @head_free_gp_snap: Snapshot of current GP for "@head_free" objects
>   * @krcp: Pointer to @kfree_rcu_cpu structure
>   */
>
>  struct kfree_rcu_cpu_work {
> -       struct rcu_work rcu_work;
> +       struct work_struct rcu_work;
>         struct rcu_head *head_free;
> -       struct kvfree_rcu_bulk_data *bkvhead_free[FREE_N_CHANNELS];
> +       unsigned long head_free_gp_snap;
> +
> +       struct list_head page_free_head[FREE_N_CHANNELS];
>         struct kfree_rcu_cpu *krcp;
>  };
>
>  /**
>   * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
>   * @head: List of kfree_rcu() objects not yet waiting for a grace period
> - * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
> + * @page_head: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
>   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
>   * @lock: Synchronize access to this structure
>   * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
> @@ -2960,7 +2965,7 @@ struct kfree_rcu_cpu_work {
>   */
>  struct kfree_rcu_cpu {
>         struct rcu_head *head;
> -       struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS];
> +       struct list_head page_head[FREE_N_CHANNELS];
>         struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
>         raw_spinlock_t lock;
>         struct delayed_work monitor_work;
> @@ -3060,60 +3065,62 @@ drain_page_cache(struct kfree_rcu_cpu *krcp)
>  static void kfree_rcu_work(struct work_struct *work)
>  {
>         unsigned long flags;
> -       struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS], *bnext;
> +       struct kvfree_rcu_bulk_data *page, *n;
> +       struct list_head local_page_head[FREE_N_CHANNELS];
>         struct rcu_head *head, *next;
>         struct kfree_rcu_cpu *krcp;
>         struct kfree_rcu_cpu_work *krwp;
> +       unsigned long head_free_gp_snap;
>         int i, j;
>
> -       krwp = container_of(to_rcu_work(work),
> -                           struct kfree_rcu_cpu_work, rcu_work);
> +       krwp = container_of(work,
> +               struct kfree_rcu_cpu_work, rcu_work);
>         krcp = krwp->krcp;
>
>         raw_spin_lock_irqsave(&krcp->lock, flags);
>         // Channels 1 and 2.
> -       for (i = 0; i < FREE_N_CHANNELS; i++) {
> -               bkvhead[i] = krwp->bkvhead_free[i];
> -               krwp->bkvhead_free[i] = NULL;
> -       }
> +       for (i = 0; i < FREE_N_CHANNELS; i++)
> +               // Initialized or empty it does not matter just replace.
> +               list_replace_init(&krwp->page_free_head[i], &local_page_head[i]);
>
>         // Channel 3.
>         head = krwp->head_free;
>         krwp->head_free = NULL;
> +
> +       head_free_gp_snap = krwp->head_free_gp_snap;
>         raw_spin_unlock_irqrestore(&krcp->lock, flags);
>
>         // Handle the first two channels.
>         for (i = 0; i < FREE_N_CHANNELS; i++) {
> -               for (; bkvhead[i]; bkvhead[i] = bnext) {
> -                       bnext = bkvhead[i]->next;
> -                       debug_rcu_bhead_unqueue(bkvhead[i]);
> +               // Start from the tail page, so a GP is likely passed for it.
> +               list_for_each_entry_safe_reverse(page, n, &local_page_head[i], list) {
> +                       cond_synchronize_rcu(page->gp_snap);
> +                       debug_rcu_bhead_unqueue(page);
>
>                         rcu_lock_acquire(&rcu_callback_map);
>                         if (i == 0) { // kmalloc() / kfree().
>                                 trace_rcu_invoke_kfree_bulk_callback(
> -                                       rcu_state.name, bkvhead[i]->nr_records,
> -                                       bkvhead[i]->records);
> +                                       rcu_state.name, page->nr_records,
> +                                       page->records);
>
> -                               kfree_bulk(bkvhead[i]->nr_records,
> -                                       bkvhead[i]->records);
> +                               kfree_bulk(page->nr_records, page->records);
>                         } else { // vmalloc() / vfree().
> -                               for (j = 0; j < bkvhead[i]->nr_records; j++) {
> +                               for (j = 0; j < page->nr_records; j++) {
>                                         trace_rcu_invoke_kvfree_callback(
> -                                               rcu_state.name,
> -                                               bkvhead[i]->records[j], 0);
> +                                               rcu_state.name, page->records[j], 0);
>
> -                                       vfree(bkvhead[i]->records[j]);
> +                                       vfree(page->records[j]);
>                                 }
>                         }
>                         rcu_lock_release(&rcu_callback_map);
>
>                         raw_spin_lock_irqsave(&krcp->lock, flags);
> -                       if (put_cached_bnode(krcp, bkvhead[i]))
> -                               bkvhead[i] = NULL;
> +                       if (put_cached_bnode(krcp, page))
> +                               page = NULL;
>                         raw_spin_unlock_irqrestore(&krcp->lock, flags);
>
> -                       if (bkvhead[i])
> -                               free_page((unsigned long) bkvhead[i]);
> +                       if (page)
> +                               free_page((unsigned long) page);
>
>                         cond_resched_tasks_rcu_qs();
>                 }
> @@ -3126,6 +3133,9 @@ static void kfree_rcu_work(struct work_struct *work)
>          * queued on a linked list through their rcu_head structures.
>          * This list is named "Channel 3".
>          */
> +       if (head)
> +               cond_synchronize_rcu(head_free_gp_snap);
> +
>         for (; head; head = next) {
>                 unsigned long offset = (unsigned long)head->func;
>                 void *ptr = (void *)head - offset;
> @@ -3149,7 +3159,7 @@ need_offload_krc(struct kfree_rcu_cpu *krcp)
>         int i;
>
>         for (i = 0; i < FREE_N_CHANNELS; i++)
> -               if (krcp->bkvhead[i])
> +               if (!list_empty(&krcp->page_head[i]))
>                         return true;
>
>         return !!krcp->head;
> @@ -3191,16 +3201,15 @@ static void kfree_rcu_monitor(struct work_struct *work)
>                 // a previous RCU batch is in progress, it means that
>                 // immediately to queue another one is not possible so
>                 // in that case the monitor work is rearmed.
> -               if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
> -                       (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
> +               if ((!list_empty(&krcp->page_head[0]) && list_empty(&krwp->page_free_head[0])) ||
> +                       (!list_empty(&krcp->page_head[1]) && list_empty(&krwp->page_free_head[1])) ||
>                                 (krcp->head && !krwp->head_free)) {
> +
>                         // Channel 1 corresponds to the SLAB-pointer bulk path.
>                         // Channel 2 corresponds to vmalloc-pointer bulk path.
>                         for (j = 0; j < FREE_N_CHANNELS; j++) {
> -                               if (!krwp->bkvhead_free[j]) {
> -                                       krwp->bkvhead_free[j] = krcp->bkvhead[j];
> -                                       krcp->bkvhead[j] = NULL;
> -                               }
> +                               if (list_empty(&krwp->page_free_head[j]))
> +                                       list_replace_init(&krcp->page_head[j], &krwp->page_free_head[j]);
>                         }
>
>                         // Channel 3 corresponds to both SLAB and vmalloc
> @@ -3208,6 +3217,11 @@ static void kfree_rcu_monitor(struct work_struct *work)
>                         if (!krwp->head_free) {
>                                 krwp->head_free = krcp->head;
>                                 krcp->head = NULL;
> +
> +                               // Take a snapshot for this krwp. Please note no more
> +                               // any objects can be added to attached head_free channel
> +                               // therefore fixate a GP for it here.
> +                               krwp->head_free_gp_snap = get_state_synchronize_rcu();
>                         }
>
>                         WRITE_ONCE(krcp->count, 0);
> @@ -3217,7 +3231,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
>                         // be that the work is in the pending state when
>                         // channels have been detached following by each
>                         // other.
> -                       queue_rcu_work(system_wq, &krwp->rcu_work);
> +                       queue_work(system_wq, &krwp->rcu_work);
>                 }
>         }
>
> @@ -3312,10 +3326,11 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>                 return false;
>
>         idx = !!is_vmalloc_addr(ptr);
> +       bnode = list_first_entry_or_null(&(*krcp)->page_head[idx],
> +               struct kvfree_rcu_bulk_data, list);
>
>         /* Check if a new block is required. */
> -       if (!(*krcp)->bkvhead[idx] ||
> -                       (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
> +       if (!bnode || bnode->nr_records == KVFREE_BULK_MAX_ENTR) {
>                 bnode = get_cached_bnode(*krcp);
>                 if (!bnode && can_alloc) {
>                         krc_this_cpu_unlock(*krcp, *flags);
> @@ -3339,18 +3354,16 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>                 if (!bnode)
>                         return false;
>
> -               /* Initialize the new block. */
> +               /* Initialize a new block. */
>                 bnode->nr_records = 0;
> -               bnode->next = (*krcp)->bkvhead[idx];
> -
> -               /* Attach it to the head. */
> -               (*krcp)->bkvhead[idx] = bnode;
> +               list_add(&bnode->list, &(*krcp)->page_head[idx]);
>         }
>
>         /* Finally insert. */
> -       (*krcp)->bkvhead[idx]->records
> -               [(*krcp)->bkvhead[idx]->nr_records++] = ptr;
> +       bnode->records[bnode->nr_records++] = ptr;
>
> +       /* Keep updated a GP status of this page. */
> +       bnode->gp_snap = get_state_synchronize_rcu();
>         return true;
>  }
>
> @@ -4790,7 +4803,7 @@ struct workqueue_struct *rcu_gp_wq;
>  static void __init kfree_rcu_batch_init(void)
>  {
>         int cpu;
> -       int i;
> +       int i, j;
>
>         /* Clamp it to [0:100] seconds interval. */
>         if (rcu_delay_page_cache_fill_msec < 0 ||
> @@ -4808,10 +4821,16 @@ static void __init kfree_rcu_batch_init(void)
>                 struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
>
>                 for (i = 0; i < KFREE_N_BATCHES; i++) {
> -                       INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
> +                       INIT_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
>                         krcp->krw_arr[i].krcp = krcp;
> +
> +                       for (j = 0; j < FREE_N_CHANNELS; j++)
> +                               INIT_LIST_HEAD(&krcp->krw_arr[i].page_free_head[j]);
>                 }
>
> +               for (i = 0; i < FREE_N_CHANNELS; i++)
> +                       INIT_LIST_HEAD(&krcp->page_head[i]);
> +
>                 INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
>                 INIT_DELAYED_WORK(&krcp->page_cache_work, fill_page_cache_func);
>                 krcp->initialized = true;
> --
> 2.30.2
> <snip>
>
> it is pretty simple. It does the following:
>
> 1) A GP status is sampled per a page that drives pointers;
> 2) Reclaim is done in reverse order because an oldest page more likely passed its GP;
> 3) Returning a memory occurs faster thus it reduces a memory footprint;
> 4) Improves readability of the code.
>
> Any inputs? I will test and check on our devices with real workloads.
>
> --
> Uladzislau Rezki
  
Uladzislau Rezki Nov. 17, 2022, 12:58 p.m. UTC | #13
On Wed, Nov 16, 2022 at 10:05:46PM +0000, Joel Fernandes wrote:
> On Wed, Nov 16, 2022 at 7:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > Hello, Paul, Joel.
> >
> > > >
> > > > Yes sure, I am doing a run now with my patch. However, I have a
> > > > question -- why do you feel blocking in the kworker is not an issue?
> > > > You are taking a snapshot before queuing the normal kwork and then
> > > > reading the snapshot when the normal kwork runs. Considering it is a
> > > > high priority queue, the delay between when you are taking the
> > > > snapshot, and reading it is likely small so there is a bigger chance
> > > > of blocking in cond_synchronize_rcu(). Did I miss something?
> > > >
> > > We can wait indeed in the reclaim worker. But the worker does not do any
> > > nasty or extra work here. If there is a need we block and wait. After a
> > > grace period, we are awoken and proceed.
> > >
> > > Therefore i do not see the reason in handling two cases:
> > >
> > > if (gp_done)
> > >     queue_work();
> > > else
> > >     queue_rcu_work();
> > >
> > > it is the same if we just queue the work and check on entry. The current
> > > scenario is: queue the work after a grace period. This is the difference.
> > >
> > > Right if the reclaimer was a high prio kthread a time would be shorter.
> > >
> > > In your scenario the time seems even shorter(i have not checked) because
> > > you update a snapshot of krcp each time a kvfree_rcu() is invoked. So
> > > basically even though you have objects whose grace period is passed you
> > > do not separate it anyhow. Because you update the:
> > >
> > > krcp->gp_snap = get_state_synchronize_rcu();
> > >
> > > too often.
> > >
> > Once upon a time we discussed that it is worth to keep track of GP
> > per-a-page in order to reduce a memory footprint. Below patch addresses
> > it:
> 
> In the patch below, it appears you are tracking the GP per krwp, and
> not per page. But I could be missing something - could you split it
> into separate patches for easier review?
>
I will split it; I was thinking about it. The GP is tracked per page. As
for krwp, it is only for channel 3. Everything goes there if there is no
page or no cache.
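
Stripped down to its essence, the per-page tracking in the diff above
boils down to this (simplified, the kmalloc channel shown):

<snip>
/* On insertion, in add_ptr_to_bulk_krc_lock(): the page remembers the latest GP state. */
bnode->records[bnode->nr_records++] = ptr;
bnode->gp_snap = get_state_synchronize_rcu();

/* On reclaim, in kfree_rcu_work(): walk pages oldest first, so the wait is usually a no-op. */
list_for_each_entry_safe_reverse(page, n, &local_page_head[i], list) {
	cond_synchronize_rcu(page->gp_snap);
	kfree_bulk(page->nr_records, page->records);
	...
}
<snip>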

> 
> Also it still does cond_synchronize_rcu() :-(
> 
Sometimes we need to wait for a GP; we cannot just release :)

--
Uladzislau Rezki
  
Joel Fernandes Nov. 17, 2022, 1:06 p.m. UTC | #14
> On Nov 17, 2022, at 7:58 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> On Wed, Nov 16, 2022 at 10:05:46PM +0000, Joel Fernandes wrote:
>>> On Wed, Nov 16, 2022 at 7:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>>> 
>>> Hello, Paul, Joel.
>>> 
>>>>> 
>>>>> Yes sure, I am doing a run now with my patch. However, I have a
>>>>> question -- why do you feel blocking in the kworker is not an issue?
>>>>> You are taking a snapshot before queuing the normal kwork and then
>>>>> reading the snapshot when the normal kwork runs. Considering it is a
>>>>> high priority queue, the delay between when you are taking the
>>>>> snapshot, and reading it is likely small so there is a bigger chance
>>>>> of blocking in cond_synchronize_rcu(). Did I miss something?
>>>>> 
>>>> We can wait indeed in the reclaim worker. But the worker does not do any
>>>> nasty or extra work here. If there is a need we block and wait. After a
>>>> grace period, we are awoken and proceed.
>>>> 
>>>> Therefore i do not see the reason in handling two cases:
>>>> 
>>>> if (gp_done)
>>>>    queue_work();
>>>> else
>>>>    queue_rcu_work();
>>>> 
>>>> it is the same if we just queue the work and check on entry. The current
>>>> scenario is: queue the work after a grace period. This is the difference.
>>>> 
>>>> Right if the reclaimer was a high prio kthread a time would be shorter.
>>>> 
>>>> In your scenario the time seems even shorter(i have not checked) because
>>>> you update a snapshot of krcp each time a kvfree_rcu() is invoked. So
>>>> basically even though you have objects whose grace period is passed you
>>>> do not separate it anyhow. Because you update the:
>>>> 
>>>> krcp->gp_snap = get_state_synchronize_rcu();
>>>> 
>>>> too often.
>>>> 
>>> Once upon a time we discussed that it is worth to keep track of GP
>>> per-a-page in order to reduce a memory footprint. Below patch addresses
>>> it:
>> 
>> In the patch below, it appears you are tracking the GP per krwp, and
>> not per page. But I could be missing something - could you split it
>> into separate patches for easier review?
>> 
> I will split. I was thinking about it. The GP is tracked per-a-page. As for
> krwp it is only for channel_3. Everything goes there if no-page or no cache.
> 
Ah, ok.

>> 
>> Also it still does cond_synchronize_rcu() :-(
>> 
> Sometimes we need to wait for a GP we can not just release :)

You know that is not what I meant ;) I was concerned about the blocking.

Thanks,

 - Joel 

> 
> --
> Uladzislau Rezki
  
Uladzislau Rezki Nov. 17, 2022, 1:11 p.m. UTC | #15
On Thu, Nov 17, 2022 at 08:06:21AM -0500, Joel Fernandes wrote:
> 
> 
> > On Nov 17, 2022, at 7:58 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > 
> > On Wed, Nov 16, 2022 at 10:05:46PM +0000, Joel Fernandes wrote:
> >>> On Wed, Nov 16, 2022 at 7:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >>> 
> >>> Hello, Paul, Joel.
> >>> 
> >>>>> 
> >>>>> Yes sure, I am doing a run now with my patch. However, I have a
> >>>>> question -- why do you feel blocking in the kworker is not an issue?
> >>>>> You are taking a snapshot before queuing the normal kwork and then
> >>>>> reading the snapshot when the normal kwork runs. Considering it is a
> >>>>> high priority queue, the delay between when you are taking the
> >>>>> snapshot, and reading it is likely small so there is a bigger chance
> >>>>> of blocking in cond_synchronize_rcu(). Did I miss something?
> >>>>> 
> >>>> We can wait indeed in the reclaim worker. But the worker does not do any
> >>>> nasty or extra work here. If there is a need we block and wait. After a
> >>>> grace period, we are awoken and proceed.
> >>>> 
> >>>> Therefore i do not see the reason in handling two cases:
> >>>> 
> >>>> if (gp_done)
> >>>>    queue_work();
> >>>> else
> >>>>    queue_rcu_work();
> >>>> 
> >>>> it is the same if we just queue the work and check on entry. The current
> >>>> scenario is: queue the work after a grace period. This is the difference.
> >>>> 
> >>>> Right if the reclaimer was a high prio kthread a time would be shorter.
> >>>> 
> >>>> In your scenario the time seems even shorter(i have not checked) because
> >>>> you update a snapshot of krcp each time a kvfree_rcu() is invoked. So
> >>>> basically even though you have objects whose grace period is passed you
> >>>> do not separate it anyhow. Because you update the:
> >>>> 
> >>>> krcp->gp_snap = get_state_synchronize_rcu();
> >>>> 
> >>>> too often.
> >>>> 
> >>> Once upon a time we discussed that it is worth to keep track of GP
> >>> per-a-page in order to reduce a memory footprint. Below patch addresses
> >>> it:
> >> 
> >> In the patch below, it appears you are tracking the GP per krwp, and
> >> not per page. But I could be missing something - could you split it
> >> into separate patches for easier review?
> >> 
> > I will split. I was thinking about it. The GP is tracked per-a-page. As for
> > krwp it is only for channel_3. Everything goes there if no-page or no cache.
> > 
> Ah, ok.
> 
> >> 
> >> Also it still does cond_synchronize_rcu() :-(
> >> 
> > Sometimes we need to wait for a GP we can not just release :)
> 
> You know that is not what I meant ;) I was concerned about the blocking.
> 
Let me split it. After that we/you can test and check whether there is any
issue with sleeping on entry to wait for a GP when needed.

--
Uladzislau Rezki
  
Joel Fernandes Nov. 17, 2022, 1:23 p.m. UTC | #16
> On Nov 17, 2022, at 8:11 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> On Thu, Nov 17, 2022 at 08:06:21AM -0500, Joel Fernandes wrote:
>> 
>> 
>>>> On Nov 17, 2022, at 7:58 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
>>> 
>>> On Wed, Nov 16, 2022 at 10:05:46PM +0000, Joel Fernandes wrote:
>>>>> On Wed, Nov 16, 2022 at 7:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>>>>> 
>>>>> Hello, Paul, Joel.
>>>>> 
>>>>>>> 
>>>>>>> Yes sure, I am doing a run now with my patch. However, I have a
>>>>>>> question -- why do you feel blocking in the kworker is not an issue?
>>>>>>> You are taking a snapshot before queuing the normal kwork and then
>>>>>>> reading the snapshot when the normal kwork runs. Considering it is a
>>>>>>> high priority queue, the delay between when you are taking the
>>>>>>> snapshot, and reading it is likely small so there is a bigger chance
>>>>>>> of blocking in cond_synchronize_rcu(). Did I miss something?
>>>>>>> 
>>>>>> We can wait indeed in the reclaim worker. But the worker does not do any
>>>>>> nasty or extra work here. If there is a need we block and wait. After a
>>>>>> grace period, we are awoken and proceed.
>>>>>> 
>>>>>> Therefore i do not see the reason in handling two cases:
>>>>>> 
>>>>>> if (gp_done)
>>>>>>   queue_work();
>>>>>> else
>>>>>>   queue_rcu_work();
>>>>>> 
>>>>>> it is the same if we just queue the work and check on entry. The current
>>>>>> scenario is: queue the work after a grace period. This is the difference.
>>>>>> 
>>>>>> Right if the reclaimer was a high prio kthread a time would be shorter.
>>>>>> 
>>>>>> In your scenario the time seems even shorter(i have not checked) because
>>>>>> you update a snapshot of krcp each time a kvfree_rcu() is invoked. So
>>>>>> basically even though you have objects whose grace period is passed you
>>>>>> do not separate it anyhow. Because you update the:
>>>>>> 
>>>>>> krcp->gp_snap = get_state_synchronize_rcu();
>>>>>> 
>>>>>> too often.
>>>>>> 
>>>>> Once upon a time we discussed that it is worth to keep track of GP
>>>>> per-a-page in order to reduce a memory footprint. Below patch addresses
>>>>> it:
>>>> 
>>>> In the patch below, it appears you are tracking the GP per krwp, and
>>>> not per page. But I could be missing something - could you split it
>>>> into separate patches for easier review?
>>>> 
>>> I will split. I was thinking about it. The GP is tracked per-a-page. As for
>>> krwp it is only for channel_3. Everything goes there if no-page or no cache.
>>> 
>> Ah, ok.
>> 
>>>> 
>>>> Also it still does cond_synchronize_rcu() :-(
>>>> 
>>> Sometimes we need to wait for a GP we can not just release :)
>> 
>> You know that is not what I meant ;) I was concerned about the blocking.
>> 
> Let me split. After that we/you can test and check if there is any issue
> with sleeping on entry for waiting a GP if needed.

Ack. What I’ll also do is, whenever you split it, put it on ChromeOS and see in the real world how many times we block. It could be that I’m missing something about how the polled GP works.
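
For example, something as simple as a hypothetical counter around the wait would do. This is not part of any posted patch, just the kind of instrumentation I have in mind:

<snip>
/* Hypothetical debug counter, only to measure how often reclaim must block. */
static atomic_long_t kvfree_rcu_blocked_gps;

	if (!poll_state_synchronize_rcu(page->gp_snap))
		atomic_long_inc(&kvfree_rcu_blocked_gps);	/* GP not yet elapsed, about to wait */
	cond_synchronize_rcu(page->gp_snap);
<snip>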

thanks,

 - Joel 

> 
> --
> Uladzislau Rezki
  
Uladzislau Rezki Nov. 17, 2022, 1:43 p.m. UTC | #17
On Thu, Nov 17, 2022 at 08:23:30AM -0500, Joel Fernandes wrote:
> 
> 
> > On Nov 17, 2022, at 8:11 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > 
> > On Thu, Nov 17, 2022 at 08:06:21AM -0500, Joel Fernandes wrote:
> >> 
> >> 
> >>>> On Nov 17, 2022, at 7:58 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> >>> 
> >>> On Wed, Nov 16, 2022 at 10:05:46PM +0000, Joel Fernandes wrote:
> >>>>> On Wed, Nov 16, 2022 at 7:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >>>>> 
> >>>>> Hello, Paul, Joel.
> >>>>> 
> >>>>>>> 
> >>>>>>> Yes sure, I am doing a run now with my patch. However, I have a
> >>>>>>> question -- why do you feel blocking in the kworker is not an issue?
> >>>>>>> You are taking a snapshot before queuing the normal kwork and then
> >>>>>>> reading the snapshot when the normal kwork runs. Considering it is a
> >>>>>>> high priority queue, the delay between when you are taking the
> >>>>>>> snapshot, and reading it is likely small so there is a bigger chance
> >>>>>>> of blocking in cond_synchronize_rcu(). Did I miss something?
> >>>>>>> 
> >>>>>> We can wait indeed in the reclaim worker. But the worker does not do any
> >>>>>> nasty or extra work here. If there is a need we block and wait. After a
> >>>>>> grace period, we are awoken and proceed.
> >>>>>> 
> >>>>>> Therefore i do not see the reason in handling two cases:
> >>>>>> 
> >>>>>> if (gp_done)
> >>>>>>   queue_work();
> >>>>>> else
> >>>>>>   queue_rcu_work();
> >>>>>> 
> >>>>>> it is the same if we just queue the work and check on entry. The current
> >>>>>> scenario is: queue the work after a grace period. This is the difference.
> >>>>>> 
> >>>>>> Right if the reclaimer was a high prio kthread a time would be shorter.
> >>>>>> 
> >>>>>> In your scenario the time seems even shorter(i have not checked) because
> >>>>>> you update a snapshot of krcp each time a kvfree_rcu() is invoked. So
> >>>>>> basically even though you have objects whose grace period is passed you
> >>>>>> do not separate it anyhow. Because you update the:
> >>>>>> 
> >>>>>> krcp->gp_snap = get_state_synchronize_rcu();
> >>>>>> 
> >>>>>> too often.
> >>>>>> 
> >>>>> Once upon a time we discussed that it is worth to keep track of GP
> >>>>> per-a-page in order to reduce a memory footprint. Below patch addresses
> >>>>> it:
> >>>> 
> >>>> In the patch below, it appears you are tracking the GP per krwp, and
> >>>> not per page. But I could be missing something - could you split it
> >>>> into separate patches for easier review?
> >>>> 
> >>> I will split. I was thinking about it. The GP is tracked per-a-page. As for
> >>> krwp it is only for channel_3. Everything goes there if no-page or no cache.
> >>> 
> >> Ah, ok.
> >> 
> >>>> 
> >>>> Also it still does cond_synchronize_rcu() :-(
> >>>> 
> >>> Sometimes we need to wait for a GP we can not just release :)
> >> 
> >> You know that is not what I meant ;) I was concerned about the blocking.
> >> 
> > Let me split. After that we/you can test and check if there is any issue
> > with sleeping on entry for waiting a GP if needed.
> 
> Ack. What I’ll also do is, whenever you split it, I’ll put it on ChromeOS and see in real world, how many times we block. It could be I’m missing something of how polled GP works.
> 
Good!

--
Uladzislau Rezki
  

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 591187b6352e..ed41243f7a49 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2935,6 +2935,7 @@  struct kfree_rcu_cpu_work {
 
 /**
  * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
+ * @gp_snap: The GP snapshot recorded at the last scheduling of monitor work.
  * @head: List of kfree_rcu() objects not yet waiting for a grace period
  * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
  * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
@@ -2964,6 +2965,7 @@  struct kfree_rcu_cpu {
 	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
 	raw_spinlock_t lock;
 	struct delayed_work monitor_work;
+	unsigned long gp_snap;
 	bool initialized;
 	int count;
 
@@ -3167,6 +3169,7 @@  schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
 			mod_delayed_work(system_wq, &krcp->monitor_work, delay);
 		return;
 	}
+	krcp->gp_snap = get_state_synchronize_rcu();
 	queue_delayed_work(system_wq, &krcp->monitor_work, delay);
 }
 
@@ -3217,7 +3220,10 @@  static void kfree_rcu_monitor(struct work_struct *work)
 			// be that the work is in the pending state when
 			// channels have been detached following by each
 			// other.
-			queue_rcu_work(system_wq, &krwp->rcu_work);
+			if (poll_state_synchronize_rcu(krcp->gp_snap))
+				queue_work(system_wq, &krwp->rcu_work.work);
+			else
+				queue_rcu_work(system_wq, &krwp->rcu_work);
 		}
 	}