diff mbox series

[v5,2/4] rcu: Reduce synchronize_rcu() latency

Message ID	20240220183115.74124-3-urezki@gmail.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel+bounces-73509-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3; From: "Uladzislau Rezki (Sony)" <urezki@gmail.com> To: "Paul E . McKenney" <paulmck@kernel.org> Cc: RCU <rcu@vger.kernel.org>, Neeraj upadhyay <Neeraj.Upadhyay@amd.com>, Boqun Feng <boqun.feng@gmail.com>, Hillf Danton <hdanton@sina.com>, Joel Fernandes <joel@joelfernandes.org>, LKML <linux-kernel@vger.kernel.org>, Uladzislau Rezki <urezki@gmail.com>, Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>, Frederic Weisbecker <frederic@kernel.org> Subject: [PATCH v5 2/4] rcu: Reduce synchronize_rcu() latency Date: Tue, 20 Feb 2024 19:31:13 +0100 Message-Id: <20240220183115.74124-3-urezki@gmail.com> In-Reply-To: <20240220183115.74124-1-urezki@gmail.com> References: <20240220183115.74124-1-urezki@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-getmail-retrieved-from-mailbox: INBOX
Series	Reduce synchronize_rcu() latency(v5) \| [v5,0/4] Reduce synchronize_rcu() latency(v5) [v5,2/4] rcu: Reduce synchronize_rcu() latency [v5,3/4] rcu: Add a trace event for synchronize_rcu_normal()

Commit Message

Uladzislau Rezki Feb. 20, 2024, 6:31 p.m. UTC

  A call to a synchronize_rcu() can be optimized from a latency
point of view. Workloads which depend on this can benefit of it.

The delay of wakeme_after_rcu() callback, which unblocks a waiter,
depends on several factors:

- how fast a process of offloading is started. Combination of:
    - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
    - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
    - other.
- when started, invoking path is interrupted due to:
    - time limit;
    - need_resched();
    - if limit is reached.
- where in a nocb list it is located;
- how fast previous callbacks completed;

Example:

1. On our embedded devices i can easily trigger the scenario when
it is a last in the list out of ~3600 callbacks:

<snip>
  <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
..
  <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
  <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>

2. We use cpuset/cgroup to classify tasks and assign them into
different cgroups. For example "backgrond" group which binds tasks
only to little CPUs or "foreground" which makes use of all CPUs.
Tasks can be migrated between groups by a request if an acceleration
is needed.

See below an example how "surfaceflinger" task gets migrated.
Initially it is located in the "system-background" cgroup which
allows to run only on little cores. In order to speed it up it
can be temporary moved into "foreground" cgroup which allows
to use big/all CPUs:

cgroup_attach_task():
 -> cgroup_migrate_execute()
   -> cpuset_can_attach()
     -> percpu_down_write()
       -> rcu_sync_enter()
         -> synchronize_rcu()
   -> now move tasks to the new cgroup.
 -> cgroup_migrate_finish()

<snip>
         rcuop/1-29      [000] .....  7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
    PERFD-SERVER-1855    [000] d..1.  7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
   TimerDispatch-2768    [002] d..5.  7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
<snip>

"Boosting a task" depends on synchronize_rcu() latency:

- first trace shows a completion of synchronize_rcu();
- second shows attaching a task to a new group;
- last shows a final step when migration occurs.

3. To address this drawback, maintain a separate track that consists
of synchronize_rcu() callers only. After completion of a grace period
users are deferred to a dedicated worker to process requests.

4. This patch reduces the latency of synchronize_rcu() approximately
by ~30-40% on synthetic tests. The real test case, camera launch time,
shows(time is in milliseconds):

1-run 542 vs 489 improvement 9%
2-run 540 vs 466 improvement 13%
3-run 518 vs 468 improvement 9%
4-run 531 vs 457 improvement 13%
5-run 548 vs 475 improvement 13%
6-run 509 vs 484 improvement 4%

Synthetic test(no "noise" from other callbacks):
Hardware: x86_64 64 CPUs, 64GB of memory
Linux-6.6

- 10K tasks(simultaneous);
- each task does(1000 loops)
     synchronize_rcu();
     kfree(p);

default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.

Running 60K gives approximately same results on my setup. Please note
it is without any interaction with another type of callbacks, otherwise
it will impact a lot a default case.

5. By default it is disabled. To enable this perform one of the
below sequence:

echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 .../admin-guide/kernel-parameters.txt         |  14 +
 kernel/rcu/tree.c                             | 331 +++++++++++++++++-
 kernel/rcu/tree_exp.h                         |   2 +-
 3 files changed, 345 insertions(+), 2 deletions(-)

Comments

Frederic Weisbecker Feb. 26, 2024, 11:07 p.m. UTC | #1

On Tue, Feb 20, 2024 at 07:31:13PM +0100, Uladzislau Rezki (Sony) wrote:
> +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> +{
> +	struct llist_node *done, *rcu, *next, *head;
> +
> +	/*
> +	 * This work execution can potentially execute
> +	 * while a new done tail is being updated by
> +	 * grace period kthread in rcu_sr_normal_gp_cleanup().
> +	 * So, read and updates of done tail need to
> +	 * follow acq-rel semantics.
> +	 *
> +	 * Given that wq semantics guarantees that a single work
> +	 * cannot execute concurrently by multiple kworkers,
> +	 * the done tail list manipulations are protected here.
> +	 */
> +	done = smp_load_acquire(&rcu_state.srs_done_tail);
> +	if (!done)
> +		return;
> +
> +	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
> +	head = done->next;
> +	done->next = NULL;

Can the following race happen?

CPU 0                                                   CPU 1
-----                                                   -----

// wait_tail == HEAD1
rcu_sr_normal_gp_cleanup() {
    // has passed SR_MAX_USERS_WAKE_FROM_GP
    wait_tail->next = next;
    // done_tail = HEAD1
    smp_store_release(&rcu_state.srs_done_tail, wait_tail);
    queue_work() {
        test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
        __queue_work()
    }
}

                                                      set_work_pool_and_clear_pending()
                                                      rcu_sr_normal_gp_cleanup_work() {
// new GP, wait_tail == HEAD2
rcu_sr_normal_gp_cleanup() {
    // executes all completion, but stop at HEAD1
    wait_tail->next = HEAD1;
    // done_tail = HEAD2
    smp_store_release(&rcu_state.srs_done_tail, wait_tail);
    queue_work() {
        test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
        __queue_work()
    }
}
                                                          // done = HEAD2
                                                          done = smp_load_acquire(&rcu_state.srs_done_tail);
                                                          // head = HEAD1
                                                          head = done->next;
                                                          done->next = NULL;
                                                          llist_for_each_safe() {
                                                              // completes all callbacks, release HEAD1
                                                          }
                                                      }
                                                      // Process second queue
                                                      set_work_pool_and_clear_pending()
                                                      rcu_sr_normal_gp_cleanup_work() {
                                                          // done = HEAD2
                                                          done = smp_load_acquire(&rcu_state.srs_done_tail);

// new GP, wait_tail == HEAD3
rcu_sr_normal_gp_cleanup() {
    // Finds HEAD2 with ->next == NULL at the end
    rcu_sr_put_wait_head(HEAD2)
    ...

// A few more GPs later
rcu_sr_normal_gp_init() {
     HEAD2 = rcu_sr_get_wait_head();
     llist_add(HEAD2, &rcu_state.srs_next);
                                                          // head == rcu_state.srs_next
                                                          head = done->next;
                                                          done->next = NULL;
                                                          llist_for_each_safe() {
                                                              // EXECUTE CALLBACKS TOO EARLY!!!
                                                          }
                                                      }

Z qiang Feb. 27, 2024, 6:39 a.m. UTC | #2

>
> On Tue, Feb 20, 2024 at 07:31:13PM +0100, Uladzislau Rezki (Sony) wrote:
> > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > +{
> > +     struct llist_node *done, *rcu, *next, *head;
> > +
> > +     /*
> > +      * This work execution can potentially execute
> > +      * while a new done tail is being updated by
> > +      * grace period kthread in rcu_sr_normal_gp_cleanup().
> > +      * So, read and updates of done tail need to
> > +      * follow acq-rel semantics.
> > +      *
> > +      * Given that wq semantics guarantees that a single work
> > +      * cannot execute concurrently by multiple kworkers,
> > +      * the done tail list manipulations are protected here.
> > +      */
> > +     done = smp_load_acquire(&rcu_state.srs_done_tail);
> > +     if (!done)
> > +             return;
> > +
> > +     WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
> > +     head = done->next;
> > +     done->next = NULL;
>
> Can the following race happen?
>
> CPU 0                                                   CPU 1
> -----                                                   -----
>
> // wait_tail == HEAD1
> rcu_sr_normal_gp_cleanup() {
>     // has passed SR_MAX_USERS_WAKE_FROM_GP
>     wait_tail->next = next;
>     // done_tail = HEAD1
>     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>     queue_work() {
>         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
>         __queue_work()
>     }
> }
>
>                                                       set_work_pool_and_clear_pending()
>                                                       rcu_sr_normal_gp_cleanup_work() {
> // new GP, wait_tail == HEAD2
> rcu_sr_normal_gp_cleanup() {
>     // executes all completion, but stop at HEAD1
>     wait_tail->next = HEAD1;
>     // done_tail = HEAD2
>     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>     queue_work() {
>         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
>         __queue_work()
>     }
> }
>                                                           // done = HEAD2
>                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
>                                                           // head = HEAD1
>                                                           head = done->next;
>                                                           done->next = NULL;
>                                                           llist_for_each_safe() {
>                                                               // completes all callbacks, release HEAD1
>                                                           }
>                                                       }
>                                                       // Process second queue
>                                                       set_work_pool_and_clear_pending()
>                                                       rcu_sr_normal_gp_cleanup_work() {
>                                                           // done = HEAD2
>                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
>
> // new GP, wait_tail == HEAD3
> rcu_sr_normal_gp_cleanup() {
>     // Finds HEAD2 with ->next == NULL at the end
>     rcu_sr_put_wait_head(HEAD2)

It seems that we should move rcu_sr_put_wait_head() from
rcu_sr_normal_gp_cleanup() to
rcu_sr_normal_gp_cleanup_work(), if find wait_head->next == NULL, invoke
rcu_sr_put_wait_head() to release wait_head.

Thanks
Zqiang

>     ...
>
> // A few more GPs later
> rcu_sr_normal_gp_init() {
>      HEAD2 = rcu_sr_get_wait_head();
>      llist_add(HEAD2, &rcu_state.srs_next);
>                                                           // head == rcu_state.srs_next
>                                                           head = done->next;
>                                                           done->next = NULL;
>                                                           llist_for_each_safe() {
>                                                               // EXECUTE CALLBACKS TOO EARLY!!!
>                                                           }
>                                                       }
>

Frederic Weisbecker Feb. 27, 2024, 2:37 p.m. UTC | #3

Le Tue, Feb 27, 2024 at 02:39:55PM +0800, Z qiang a écrit :
> > Can the following race happen?
> >
> > CPU 0                                                   CPU 1
> > -----                                                   -----
> >
> > // wait_tail == HEAD1
> > rcu_sr_normal_gp_cleanup() {
> >     // has passed SR_MAX_USERS_WAKE_FROM_GP
> >     wait_tail->next = next;
> >     // done_tail = HEAD1
> >     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >     queue_work() {
> >         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> >         __queue_work()
> >     }
> > }
> >
> >                                                       set_work_pool_and_clear_pending()
> >                                                       rcu_sr_normal_gp_cleanup_work() {
> > // new GP, wait_tail == HEAD2
> > rcu_sr_normal_gp_cleanup() {
> >     // executes all completion, but stop at HEAD1
> >     wait_tail->next = HEAD1;
> >     // done_tail = HEAD2
> >     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >     queue_work() {
> >         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> >         __queue_work()
> >     }
> > }
> >                                                           // done = HEAD2
> >                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> >                                                           // head = HEAD1
> >                                                           head = done->next;
> >                                                           done->next = NULL;
> >                                                           llist_for_each_safe() {
> >                                                               // completes all callbacks, release HEAD1
> >                                                           }
> >                                                       }
> >                                                       // Process second queue
> >                                                       set_work_pool_and_clear_pending()
> >                                                       rcu_sr_normal_gp_cleanup_work() {
> >                                                           // done = HEAD2
> >                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> >
> > // new GP, wait_tail == HEAD3
> > rcu_sr_normal_gp_cleanup() {
> >     // Finds HEAD2 with ->next == NULL at the end
> >     rcu_sr_put_wait_head(HEAD2)
> 
> It seems that we should move rcu_sr_put_wait_head() from
> rcu_sr_normal_gp_cleanup() to
> rcu_sr_normal_gp_cleanup_work(), if find wait_head->next == NULL, invoke
> rcu_sr_put_wait_head() to release wait_head.

Well, rcu_sr_normal_gp_cleanup_work() already put all the wait heads
that are _after_ srs_done_tail. But it can't put the srs_done_tail itself
without introducing even worse races...

Frederic Weisbecker Feb. 27, 2024, 4:15 p.m. UTC | #4

Le Tue, Feb 20, 2024 at 07:31:13PM +0100, Uladzislau Rezki (Sony) a écrit :
> +static struct llist_node *rcu_sr_get_wait_head(void)
> +{
> +	struct sr_wait_node *sr_wn;
> +	int i;
> +
> +	for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
> +		sr_wn = &(rcu_state.srs_wait_nodes)[i];
> +
> +		if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
> +			return &sr_wn->node;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void rcu_sr_put_wait_head(struct llist_node *node)
> +{
> +	struct sr_wait_node *sr_wn = container_of(node, struct sr_wait_node, node);
> +	atomic_set_release(&sr_wn->inuse, 0);

Can we comment this release/acquire pair?

IIUC, the point is that we must make sure that the wait_head->next fetch
happens before it gets reused and rewritten to something else.

Thanks.

Frederic Weisbecker Feb. 27, 2024, 4:16 p.m. UTC | #5

Le Tue, Feb 27, 2024 at 03:37:26PM +0100, Frederic Weisbecker a écrit :
> Le Tue, Feb 27, 2024 at 02:39:55PM +0800, Z qiang a écrit :
> > > Can the following race happen?
> > >
> > > CPU 0                                                   CPU 1
> > > -----                                                   -----
> > >
> > > // wait_tail == HEAD1
> > > rcu_sr_normal_gp_cleanup() {
> > >     // has passed SR_MAX_USERS_WAKE_FROM_GP
> > >     wait_tail->next = next;
> > >     // done_tail = HEAD1
> > >     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > >     queue_work() {
> > >         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> > >         __queue_work()
> > >     }
> > > }
> > >
> > >                                                       set_work_pool_and_clear_pending()
> > >                                                       rcu_sr_normal_gp_cleanup_work() {
> > > // new GP, wait_tail == HEAD2
> > > rcu_sr_normal_gp_cleanup() {
> > >     // executes all completion, but stop at HEAD1
> > >     wait_tail->next = HEAD1;
> > >     // done_tail = HEAD2
> > >     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > >     queue_work() {
> > >         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> > >         __queue_work()
> > >     }
> > > }
> > >                                                           // done = HEAD2
> > >                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> > >                                                           // head = HEAD1
> > >                                                           head = done->next;
> > >                                                           done->next = NULL;
> > >                                                           llist_for_each_safe() {
> > >                                                               // completes all callbacks, release HEAD1
> > >                                                           }
> > >                                                       }
> > >                                                       // Process second queue
> > >                                                       set_work_pool_and_clear_pending()
> > >                                                       rcu_sr_normal_gp_cleanup_work() {
> > >                                                           // done = HEAD2
> > >                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> > >
> > > // new GP, wait_tail == HEAD3
> > > rcu_sr_normal_gp_cleanup() {
> > >     // Finds HEAD2 with ->next == NULL at the end
> > >     rcu_sr_put_wait_head(HEAD2)
> > 
> > It seems that we should move rcu_sr_put_wait_head() from
> > rcu_sr_normal_gp_cleanup() to
> > rcu_sr_normal_gp_cleanup_work(), if find wait_head->next == NULL, invoke
> > rcu_sr_put_wait_head() to release wait_head.
> 
> Well, rcu_sr_normal_gp_cleanup_work() already put all the wait heads
> that are _after_ srs_done_tail. But it can't put the srs_done_tail itself
> without introducing even worse races...
> 

(I forgot to mention this race actually concerns the last patch (4/4))

Frederic Weisbecker Feb. 27, 2024, 5:03 p.m. UTC | #6

Le Tue, Feb 20, 2024 at 07:31:13PM +0100, Uladzislau Rezki (Sony) a écrit :
> +/*
> + * Helper function for rcu_gp_cleanup().
> + */
> +static void rcu_sr_normal_gp_cleanup(void)
> +{
> +	struct llist_node *wait_tail;
> +
> +	wait_tail = rcu_state.srs_wait_tail;
> +	if (wait_tail == NULL)
> +		return;
> +
> +	rcu_state.srs_wait_tail = NULL;
> +	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
> +
> +	// concurrent sr_normal_gp_cleanup work might observe this update.
> +	smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> +	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
> +
> +	if (wait_tail)
> +		queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);

You can queue unconditionally here, wait_tail is non-NULL at this point.

Thanks.

Uladzislau Rezki Feb. 27, 2024, 7:35 p.m. UTC | #7

> On Tue, Feb 20, 2024 at 07:31:13PM +0100, Uladzislau Rezki (Sony) wrote:
> > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > +{
> > +	struct llist_node *done, *rcu, *next, *head;
> > +
> > +	/*
> > +	 * This work execution can potentially execute
> > +	 * while a new done tail is being updated by
> > +	 * grace period kthread in rcu_sr_normal_gp_cleanup().
> > +	 * So, read and updates of done tail need to
> > +	 * follow acq-rel semantics.
> > +	 *
> > +	 * Given that wq semantics guarantees that a single work
> > +	 * cannot execute concurrently by multiple kworkers,
> > +	 * the done tail list manipulations are protected here.
> > +	 */
> > +	done = smp_load_acquire(&rcu_state.srs_done_tail);
> > +	if (!done)
> > +		return;
> > +
> > +	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
> > +	head = done->next;
> > +	done->next = NULL;
> 
> Can the following race happen?
> 
> CPU 0                                                   CPU 1
> -----                                                   -----
> 
> // wait_tail == HEAD1
> rcu_sr_normal_gp_cleanup() {
>     // has passed SR_MAX_USERS_WAKE_FROM_GP
>     wait_tail->next = next;
>     // done_tail = HEAD1
>     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>     queue_work() {
>         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
>         __queue_work()
>     }
> }
> 
>                                                       set_work_pool_and_clear_pending()
>                                                       rcu_sr_normal_gp_cleanup_work() {
> // new GP, wait_tail == HEAD2
> rcu_sr_normal_gp_cleanup() {
>     // executes all completion, but stop at HEAD1
>     wait_tail->next = HEAD1;
>     // done_tail = HEAD2
>     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>     queue_work() {
>         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
>         __queue_work()
>     }
> }
>                                                           // done = HEAD2
>                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
>                                                           // head = HEAD1
>                                                           head = done->next;
>                                                           done->next = NULL;
>                                                           llist_for_each_safe() {
>                                                               // completes all callbacks, release HEAD1
>                                                           }
>                                                       }
>                                                       // Process second queue
>                                                       set_work_pool_and_clear_pending()
>                                                       rcu_sr_normal_gp_cleanup_work() {
>                                                           // done = HEAD2
>                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> 
>
> // new GP, wait_tail == HEAD3
> rcu_sr_normal_gp_cleanup() {
>     // Finds HEAD2 with ->next == NULL at the end
>     rcu_sr_put_wait_head(HEAD2)
>     ...
> 
> // A few more GPs later
> rcu_sr_normal_gp_init() {
>      HEAD2 = rcu_sr_get_wait_head();
>      llist_add(HEAD2, &rcu_state.srs_next);
>                                                           // head == rcu_state.srs_next
>                                                           head = done->next;
>                                                           done->next = NULL;
>                                                           llist_for_each_safe() {
>                                                               // EXECUTE CALLBACKS TOO EARLY!!!
>                                                           }
>                                                       }
Let me check it tomorrow!

--
Uladzislau Rezki

Joel Fernandes Feb. 27, 2024, 8:51 p.m. UTC | #8

On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> A call to a synchronize_rcu() can be optimized from a latency
> point of view. Workloads which depend on this can benefit of it.
> 
> The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> depends on several factors:
> 
> - how fast a process of offloading is started. Combination of:
>     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
>     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
>     - other.
> - when started, invoking path is interrupted due to:
>     - time limit;
>     - need_resched();
>     - if limit is reached.
> - where in a nocb list it is located;
> - how fast previous callbacks completed;
> 
> Example:
> 
> 1. On our embedded devices i can easily trigger the scenario when
> it is a last in the list out of ~3600 callbacks:
> 
> <snip>
>   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
>   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
>   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
> 
> 2. We use cpuset/cgroup to classify tasks and assign them into
> different cgroups. For example "backgrond" group which binds tasks
> only to little CPUs or "foreground" which makes use of all CPUs.
> Tasks can be migrated between groups by a request if an acceleration
> is needed.
> 
> See below an example how "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup which
> allows to run only on little cores. In order to speed it up it
> can be temporary moved into "foreground" cgroup which allows
> to use big/all CPUs:
> 
> cgroup_attach_task():
>  -> cgroup_migrate_execute()
>    -> cpuset_can_attach()
>      -> percpu_down_write()
>        -> rcu_sync_enter()
>          -> synchronize_rcu()

We should do this patch but I wonder also if cgroup_attach_task() usage of
synchronize_rcu() should actually be using the _expedited() variant (via some
possible flag to the percpu rwsem / rcu_sync).

If the user assumes it a slow path, then usage of _expedited() should probably
be OK. If it is assumed a fast path, then it is probably hurting latency anyway
without the enablement of this patch's rcu_normal_wake_from_gp.

Thoughts?

Then it becomes a matter of how to plumb the expeditedness down the stack.

Also, speaking of percpu rwsem, I noticed that percpu refcounts don't use
rcu_sync. I haven't looked closely why, but something I hope to get time to look
into is if it can be converted over and what benefits would that entail if any.

Also will continue reviewing the patch. Thanks.

 - Joel

Joel Fernandes Feb. 27, 2024, 10:50 p.m. UTC | #9

On Tue, Feb 27, 2024 at 5:39 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> > A call to a synchronize_rcu() can be optimized from a latency
> > point of view. Workloads which depend on this can benefit of it.
> >
> > The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> > depends on several factors:
> >
> > - how fast a process of offloading is started. Combination of:
> >     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> >     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> >     - other.
> > - when started, invoking path is interrupted due to:
> >     - time limit;
> >     - need_resched();
> >     - if limit is reached.
> > - where in a nocb list it is located;
> > - how fast previous callbacks completed;
> >
> > Example:
> >
> > 1. On our embedded devices i can easily trigger the scenario when
> > it is a last in the list out of ~3600 callbacks:
> >
> > <snip>
> >   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > ...
> >   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> >   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > <snip>
> >
> > 2. We use cpuset/cgroup to classify tasks and assign them into
> > different cgroups. For example "backgrond" group which binds tasks
> > only to little CPUs or "foreground" which makes use of all CPUs.
> > Tasks can be migrated between groups by a request if an acceleration
> > is needed.
> >
[...]
> >   * Initialize a new grace period.  Return false if no grace period required.
> >   */
> > @@ -1432,6 +1711,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> >       unsigned long mask;
> >       struct rcu_data *rdp;
> >       struct rcu_node *rnp = rcu_get_root();
> > +     bool start_new_poll;
> >
> >       WRITE_ONCE(rcu_state.gp_activity, jiffies);
> >       raw_spin_lock_irq_rcu_node(rnp);
> > @@ -1456,10 +1736,24 @@ static noinline_for_stack bool rcu_gp_init(void)
> >       /* Record GP times before starting GP, hence rcu_seq_start(). */
> >       rcu_seq_start(&rcu_state.gp_seq);
> >       ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > +     start_new_poll = rcu_sr_normal_gp_init();
> >       trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> >       rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> >       raw_spin_unlock_irq_rcu_node(rnp);
> >
> > +     /*
> > +      * The "start_new_poll" is set to true, only when this GP is not able
> > +      * to handle anything and there are outstanding users. It happens when
> > +      * the rcu_sr_normal_gp_init() function was not able to insert a dummy
> > +      * separator to the llist, because there were no left any dummy-nodes.
> > +      *
> > +      * Number of dummy-nodes is fixed, it could be that we are run out of
> > +      * them, if so we start a new pool request to repeat a try. It is rare
> > +      * and it means that a system is doing a slow processing of callbacks.
> > +      */
> > +     if (start_new_poll)
> > +             (void) start_poll_synchronize_rcu();
> > +
>
> Optionally, does it make sense to print a warning if too many retries occurred?

For future work, I was wondering about slight modification to even
avoid this "running out of nodes" issues, why not add a wait node to
task_struct and use that. rcu_gp_init() can just use that. Then, there
is no limit to how many callers or to the length of the list. And by
definition, you cannot have more than 1 caller per task-struct. Would
that not work?

So in rcu_gp_init(), use the wait node of the first task_struct on the
top of the list, mark it as a "special node", perhaps with a flag that
says it is also the dummy node.

But yeah the new design of the patch is really cool... if you are
leaving it alone without going for above suggestion, I can add it to
my backlog for future work.

Thanks.

Joel Fernandes Feb. 27, 2024, 10:53 p.m. UTC | #10

On Tue, Feb 27, 2024 at 5:50 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Tue, Feb 27, 2024 at 5:39 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> >
> > On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> > > A call to a synchronize_rcu() can be optimized from a latency
> > > point of view. Workloads which depend on this can benefit of it.
> > >
> > > The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> > > depends on several factors:
> > >
> > > - how fast a process of offloading is started. Combination of:
> > >     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> > >     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> > >     - other.
> > > - when started, invoking path is interrupted due to:
> > >     - time limit;
> > >     - need_resched();
> > >     - if limit is reached.
> > > - where in a nocb list it is located;
> > > - how fast previous callbacks completed;
> > >
> > > Example:
> > >
> > > 1. On our embedded devices i can easily trigger the scenario when
> > > it is a last in the list out of ~3600 callbacks:
> > >
> > > <snip>
> > >   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > > ...
> > >   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> > >   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> > >   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> > >   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> > >   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> > >   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> > >   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > > <snip>
> > >
> > > 2. We use cpuset/cgroup to classify tasks and assign them into
> > > different cgroups. For example "backgrond" group which binds tasks
> > > only to little CPUs or "foreground" which makes use of all CPUs.
> > > Tasks can be migrated between groups by a request if an acceleration
> > > is needed.
> > >
> [...]
> > >   * Initialize a new grace period.  Return false if no grace period required.
> > >   */
> > > @@ -1432,6 +1711,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >       unsigned long mask;
> > >       struct rcu_data *rdp;
> > >       struct rcu_node *rnp = rcu_get_root();
> > > +     bool start_new_poll;
> > >
> > >       WRITE_ONCE(rcu_state.gp_activity, jiffies);
> > >       raw_spin_lock_irq_rcu_node(rnp);
> > > @@ -1456,10 +1736,24 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >       /* Record GP times before starting GP, hence rcu_seq_start(). */
> > >       rcu_seq_start(&rcu_state.gp_seq);
> > >       ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > +     start_new_poll = rcu_sr_normal_gp_init();
> > >       trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > >       rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > >       raw_spin_unlock_irq_rcu_node(rnp);
> > >
> > > +     /*
> > > +      * The "start_new_poll" is set to true, only when this GP is not able
> > > +      * to handle anything and there are outstanding users. It happens when
> > > +      * the rcu_sr_normal_gp_init() function was not able to insert a dummy
> > > +      * separator to the llist, because there were no left any dummy-nodes.
> > > +      *
> > > +      * Number of dummy-nodes is fixed, it could be that we are run out of
> > > +      * them, if so we start a new pool request to repeat a try. It is rare
> > > +      * and it means that a system is doing a slow processing of callbacks.
> > > +      */
> > > +     if (start_new_poll)
> > > +             (void) start_poll_synchronize_rcu();
> > > +
> >
> > Optionally, does it make sense to print a warning if too many retries occurred?
>
> For future work, I was wondering about slight modification to even
> avoid this "running out of nodes" issues, why not add a wait node to
> task_struct and use that. rcu_gp_init() can just use that. Then, there
> is no limit to how many callers or to the length of the list. And by
> definition, you cannot have more than 1 caller per task-struct. Would
> that not work?
>
> So in rcu_gp_init(), use the wait node of the first task_struct on the
> top of the list, mark it as a "special node", perhaps with a flag that
> says it is also the dummy node.
>
> But yeah the new design of the patch is really cool... if you are
> leaving it alone without going for above suggestion, I can add it to
> my backlog for future work.

Ouch, let me clarify and sorry for so many messages,  I meant use the
same storage of the synchronous caller who's on top of the list (its
rcu_synchronize node) as the "Dummy" demarkation in the list. Not sure
if that makes sense or it will work, but I'm of the feeling that the
limit of 5 might not be needed.

Uladzislau Rezki Feb. 28, 2024, 9:28 a.m. UTC | #11

On Tue, Feb 27, 2024 at 03:51:03PM -0500, Joel Fernandes wrote:
> 
> 
> On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> > A call to a synchronize_rcu() can be optimized from a latency
> > point of view. Workloads which depend on this can benefit of it.
> > 
> > The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> > depends on several factors:
> > 
> > - how fast a process of offloading is started. Combination of:
> >     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> >     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> >     - other.
> > - when started, invoking path is interrupted due to:
> >     - time limit;
> >     - need_resched();
> >     - if limit is reached.
> > - where in a nocb list it is located;
> > - how fast previous callbacks completed;
> > 
> > Example:
> > 
> > 1. On our embedded devices i can easily trigger the scenario when
> > it is a last in the list out of ~3600 callbacks:
> > 
> > <snip>
> >   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > ...
> >   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> >   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> >   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > <snip>
> > 
> > 2. We use cpuset/cgroup to classify tasks and assign them into
> > different cgroups. For example "backgrond" group which binds tasks
> > only to little CPUs or "foreground" which makes use of all CPUs.
> > Tasks can be migrated between groups by a request if an acceleration
> > is needed.
> > 
> > See below an example how "surfaceflinger" task gets migrated.
> > Initially it is located in the "system-background" cgroup which
> > allows to run only on little cores. In order to speed it up it
> > can be temporary moved into "foreground" cgroup which allows
> > to use big/all CPUs:
> > 
> > cgroup_attach_task():
> >  -> cgroup_migrate_execute()
> >    -> cpuset_can_attach()
> >      -> percpu_down_write()
> >        -> rcu_sync_enter()
> >          -> synchronize_rcu()
> 
> We should do this patch but I wonder also if cgroup_attach_task() usage of
> synchronize_rcu() should actually be using the _expedited() variant (via some
> possible flag to the percpu rwsem / rcu_sync).
> 
> If the user assumes it a slow path, then usage of _expedited() should probably
> be OK. If it is assumed a fast path, then it is probably hurting latency anyway
> without the enablement of this patch's rcu_normal_wake_from_gp.
> 
> Thoughts?
> 
How i see it, the rcu_normal_wake_from_gp is disabled so far. We need to
work on this further to have it on by default. But we will move toward
this.


> Then it becomes a matter of how to plumb the expeditedness down the stack.
> 
> Also, speaking of percpu rwsem, I noticed that percpu refcounts don't use
> rcu_sync. I haven't looked closely why, but something I hope to get time to look
> into is if it can be converted over and what benefits would that entail if any.
> 
> Also will continue reviewing the patch. Thanks.
> 
Thanks.

--
Uladzislau Rezki

Joel Fernandes Feb. 28, 2024, 2:32 p.m. UTC | #12

On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> A call to a synchronize_rcu() can be optimized from a latency
> point of view. Workloads which depend on this can benefit of it.
> 
> The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> depends on several factors:
> 
> - how fast a process of offloading is started. Combination of:
>     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
>     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
>     - other.
> - when started, invoking path is interrupted due to:
>     - time limit;
>     - need_resched();
>     - if limit is reached.
> - where in a nocb list it is located;
> - how fast previous callbacks completed;
> 
> Example:
> 
> 1. On our embedded devices i can easily trigger the scenario when
> it is a last in the list out of ~3600 callbacks:
> 
> <snip>
>   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
>   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
>   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
> 
> 2. We use cpuset/cgroup to classify tasks and assign them into
> different cgroups. For example "backgrond" group which binds tasks
> only to little CPUs or "foreground" which makes use of all CPUs.
> Tasks can be migrated between groups by a request if an acceleration
> is needed.
> 
> See below an example how "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup which
> allows to run only on little cores. In order to speed it up it
> can be temporary moved into "foreground" cgroup which allows
> to use big/all CPUs:
> 
> cgroup_attach_task():
>  -> cgroup_migrate_execute()
>    -> cpuset_can_attach()
>      -> percpu_down_write()
>        -> rcu_sync_enter()
>          -> synchronize_rcu()
>    -> now move tasks to the new cgroup.
>  -> cgroup_migrate_finish()
> 
> <snip>
>          rcuop/1-29      [000] .....  7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
>     PERFD-SERVER-1855    [000] d..1.  7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
>    TimerDispatch-2768    [002] d..5.  7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
> <snip>
> 
> "Boosting a task" depends on synchronize_rcu() latency:
> 
> - first trace shows a completion of synchronize_rcu();
> - second shows attaching a task to a new group;
> - last shows a final step when migration occurs.
> 
> 3. To address this drawback, maintain a separate track that consists
> of synchronize_rcu() callers only. After completion of a grace period
> users are deferred to a dedicated worker to process requests.
> 
> 4. This patch reduces the latency of synchronize_rcu() approximately
> by ~30-40% on synthetic tests. The real test case, camera launch time,
> shows(time is in milliseconds):
> 
> 1-run 542 vs 489 improvement 9%
> 2-run 540 vs 466 improvement 13%
> 3-run 518 vs 468 improvement 9%
> 4-run 531 vs 457 improvement 13%
> 5-run 548 vs 475 improvement 13%
> 6-run 509 vs 484 improvement 4%
> 
> Synthetic test(no "noise" from other callbacks):
> Hardware: x86_64 64 CPUs, 64GB of memory
> Linux-6.6
> 
> - 10K tasks(simultaneous);
> - each task does(1000 loops)
>      synchronize_rcu();
>      kfree(p);
> 
> default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
> patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.
> 
> Running 60K gives approximately same results on my setup. Please note
> it is without any interaction with another type of callbacks, otherwise
> it will impact a lot a default case.
> 
> 5. By default it is disabled. To enable this perform one of the
> below sequence:
> 
> echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
> 
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> Co-developed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  14 +
>  kernel/rcu/tree.c                             | 331 +++++++++++++++++-
>  kernel/rcu/tree_exp.h                         |   2 +-
>  3 files changed, 345 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 2244aa0a013b..7ca84cf7b4f4 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5059,6 +5059,20 @@
>  			delay, memory pressure or callback list growing too
>  			big.
>  
> +	rcutree.rcu_normal_wake_from_gp= [KNL]
> +			Reduces a latency of synchronize_rcu() call. This approach
> +			maintains its own track of synchronize_rcu() callers, so it
> +			does not interact with regular callbacks because it does not
> +			use a call_rcu[_hurry]() path. Please note, this is for a
> +			normal grace period.
> +
> +			How to enable it:
> +
> +			echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> +			or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
> +
> +			Default is 0.
> +
>  	rcuscale.gp_async= [KNL]
>  			Measure performance of asynchronous
>  			grace-period primitives such as call_rcu().
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c8980d76f402..1328da63c3cd 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -75,6 +75,7 @@
>  #define MODULE_PARAM_PREFIX "rcutree."
>  
>  /* Data structures. */
> +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *);
>  
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
>  	.gpwrap = true,
> @@ -93,6 +94,8 @@ static struct rcu_state rcu_state = {
>  	.exp_mutex = __MUTEX_INITIALIZER(rcu_state.exp_mutex),
>  	.exp_wake_mutex = __MUTEX_INITIALIZER(rcu_state.exp_wake_mutex),
>  	.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
> +	.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
> +		rcu_sr_normal_gp_cleanup_work),
>  };
>  
>  /* Dump rcu_node combining tree at boot to verify correct setup. */
> @@ -1422,6 +1425,282 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  }
[..]
> +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> +{
> +	llist_add((struct llist_node *) &rs->head, &rcu_state.srs_next);
> +}
> +

I'm a bit concerned from a memory order PoV about this llist_add() happening
possibly on a different CPU than the GP thread, and different than the kworker
thread. Basically we can have 3 CPUs simultaneously modifying and reading the
list, but only 2 CPUs have the acq-rel pair AFAICS.

Consider the following situation:

synchronize_rcu() user
----------------------
llist_add the user U - update srs_next list

rcu_gp_init() and rcu_gp_cleanup (SAME THREAD)
--------------------
insert dummy node in front of U, call it S
update wait_tail to U

and then cleanup:
read wait_tail to W
set wait_tail to NULL
set done_tail to W (RELEASE) -- this release ensures U and S are seen by worker.

workqueue handler
-----------------
read done_tail (ACQUIRE)
disconnect rest of list -- disconnected list guaranteed to have U and S,
                           if done_tail read was W.
---------------------------------

So llist_add() does this (assume new_first and new_last are same):

	struct llist_node *first = READ_ONCE(head->first);

	do {
		new_last->next = first;
	} while (!try_cmpxchg(&head->first, &first, new_first));

	return !first;
---

It reads head->first, then writes the new_last->next (call it new_first->next)
to the old first, then sets head->first to the new_first if head->first did not
change in the meanwhile.

The problem I guess happens if the update the head->first is seen *after* the
update to the new_first->next.

This potentially means a corrupted list is seen in the workqueue handler..
because the "U" node is not yet seen pointing to the rest of the list
(previously added nodes), but is already seen the head of the list.

I am not sure if this can happen, but AFAIK try_cmpxchg() doesn't imply ordering
per-se. Maybe that try_cmpxchg() should be a try_cmpxchg_release() in llist_add() ?

thanks,

 - Joel

Joel Fernandes Feb. 28, 2024, 4:44 p.m. UTC | #13

On 2/28/2024 9:32 AM, Joel Fernandes wrote:
> 
> 
> On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
[...]
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index c8980d76f402..1328da63c3cd 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -75,6 +75,7 @@
>>  #define MODULE_PARAM_PREFIX "rcutree."
>>  
>>  /* Data structures. */
>> +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *);
>>  
>>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
>>  	.gpwrap = true,
>> @@ -93,6 +94,8 @@ static struct rcu_state rcu_state = {
>>  	.exp_mutex = __MUTEX_INITIALIZER(rcu_state.exp_mutex),
>>  	.exp_wake_mutex = __MUTEX_INITIALIZER(rcu_state.exp_wake_mutex),
>>  	.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
>> +	.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
>> +		rcu_sr_normal_gp_cleanup_work),
>>  };
>>  
>>  /* Dump rcu_node combining tree at boot to verify correct setup. */
>> @@ -1422,6 +1425,282 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>>  }
> [..]
>> +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
>> +{
>> +	llist_add((struct llist_node *) &rs->head, &rcu_state.srs_next);
>> +}
>> +
> 
> I'm a bit concerned from a memory order PoV about this llist_add() happening
> possibly on a different CPU than the GP thread, and different than the kworker
> thread. Basically we can have 3 CPUs simultaneously modifying and reading the
> list, but only 2 CPUs have the acq-rel pair AFAICS.
> 
> Consider the following situation:
> 
> synchronize_rcu() user
> ----------------------
> llist_add the user U - update srs_next list
> 
> rcu_gp_init() and rcu_gp_cleanup (SAME THREAD)
> --------------------
> insert dummy node in front of U, call it S
> update wait_tail to U
> 
> and then cleanup:
> read wait_tail to W
> set wait_tail to NULL
> set done_tail to W (RELEASE) -- this release ensures U and S are seen by worker.
> 
> workqueue handler
> -----------------
> read done_tail (ACQUIRE)
> disconnect rest of list -- disconnected list guaranteed to have U and S,
>                            if done_tail read was W.
> ---------------------------------
> 
> So llist_add() does this (assume new_first and new_last are same):
> 
> 	struct llist_node *first = READ_ONCE(head->first);
> 
> 	do {
> 		new_last->next = first;
> 	} while (!try_cmpxchg(&head->first, &first, new_first));
> 
> 	return !first;
> ---
> 
> It reads head->first, then writes the new_last->next (call it new_first->next)
> to the old first, then sets head->first to the new_first if head->first did not
> change in the meanwhile.
> 
> The problem I guess happens if the update the head->first is seen *after* the
> update to the new_first->next.
> 
> This potentially means a corrupted list is seen in the workqueue handler..
> because the "U" node is not yet seen pointing to the rest of the list
> (previously added nodes), but is already seen the head of the list.
> 
> I am not sure if this can happen, but AFAIK try_cmpxchg() doesn't imply ordering
> per-se. Maybe that try_cmpxchg() should be a try_cmpxchg_release() in llist_add() ?

Everyone in the internal RCU crew corrected me offline that try_cmpxchg() has
full ordering if the cmpxchg succeeded.

So I don't think the issue I mentioned can occur, So we can park this.

Thanks!

 - Joel

Uladzislau Rezki Feb. 28, 2024, 6:04 p.m. UTC | #14

On Tue, Feb 27, 2024 at 12:07:32AM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 20, 2024 at 07:31:13PM +0100, Uladzislau Rezki (Sony) wrote:
> > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > +{
> > +	struct llist_node *done, *rcu, *next, *head;
> > +
> > +	/*
> > +	 * This work execution can potentially execute
> > +	 * while a new done tail is being updated by
> > +	 * grace period kthread in rcu_sr_normal_gp_cleanup().
> > +	 * So, read and updates of done tail need to
> > +	 * follow acq-rel semantics.
> > +	 *
> > +	 * Given that wq semantics guarantees that a single work
> > +	 * cannot execute concurrently by multiple kworkers,
> > +	 * the done tail list manipulations are protected here.
> > +	 */
> > +	done = smp_load_acquire(&rcu_state.srs_done_tail);
> > +	if (!done)
> > +		return;
> > +
> > +	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
> > +	head = done->next;
> > +	done->next = NULL;
> 
> Can the following race happen?
> 
> CPU 0                                                   CPU 1
> -----                                                   -----
> 
> // wait_tail == HEAD1
> rcu_sr_normal_gp_cleanup() {
>     // has passed SR_MAX_USERS_WAKE_FROM_GP
>     wait_tail->next = next;
>     // done_tail = HEAD1
>     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>     queue_work() {
>         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
>         __queue_work()
>     }
> }
> 
>                                                       set_work_pool_and_clear_pending()
>                                                       rcu_sr_normal_gp_cleanup_work() {
> // new GP, wait_tail == HEAD2
> rcu_sr_normal_gp_cleanup() {
>     // executes all completion, but stop at HEAD1
>     wait_tail->next = HEAD1;
>     // done_tail = HEAD2
>     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>     queue_work() {
>         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
>         __queue_work()
>     }
> }
>                                                           // done = HEAD2
>                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
>                                                           // head = HEAD1
>                                                           head = done->next;
>                                                           done->next = NULL;
>                                                           llist_for_each_safe() {
>                                                               // completes all callbacks, release HEAD1
>                                                           }
>                                                       }
>                                                       // Process second queue
>                                                       set_work_pool_and_clear_pending()
>                                                       rcu_sr_normal_gp_cleanup_work() {
>                                                           // done = HEAD2
>                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> 
> // new GP, wait_tail == HEAD3
> rcu_sr_normal_gp_cleanup() {
>     // Finds HEAD2 with ->next == NULL at the end
>     rcu_sr_put_wait_head(HEAD2)
>     ...
> 
> // A few more GPs later
> rcu_sr_normal_gp_init() {
>      HEAD2 = rcu_sr_get_wait_head();
>      llist_add(HEAD2, &rcu_state.srs_next);
>                                                           // head == rcu_state.srs_next
>                                                           head = done->next;
>                                                           done->next = NULL;
>                                                           llist_for_each_safe() {
>                                                               // EXECUTE CALLBACKS TOO EARLY!!!
>                                                           }
>                                                       }
Looks like that. To address this, we should not release the head in the GP kthread.

Thanks!

--
Uladzislau Rezki

Frederic Weisbecker March 4, 2024, 11:55 a.m. UTC | #15

Le Wed, Feb 28, 2024 at 07:04:21PM +0100, Uladzislau Rezki a écrit :
> On Tue, Feb 27, 2024 at 12:07:32AM +0100, Frederic Weisbecker wrote:
> > On Tue, Feb 20, 2024 at 07:31:13PM +0100, Uladzislau Rezki (Sony) wrote:
> > > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > +{
> > > +	struct llist_node *done, *rcu, *next, *head;
> > > +
> > > +	/*
> > > +	 * This work execution can potentially execute
> > > +	 * while a new done tail is being updated by
> > > +	 * grace period kthread in rcu_sr_normal_gp_cleanup().
> > > +	 * So, read and updates of done tail need to
> > > +	 * follow acq-rel semantics.
> > > +	 *
> > > +	 * Given that wq semantics guarantees that a single work
> > > +	 * cannot execute concurrently by multiple kworkers,
> > > +	 * the done tail list manipulations are protected here.
> > > +	 */
> > > +	done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > +	if (!done)
> > > +		return;
> > > +
> > > +	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
> > > +	head = done->next;
> > > +	done->next = NULL;
> > 
> > Can the following race happen?
> > 
> > CPU 0                                                   CPU 1
> > -----                                                   -----
> > 
> > // wait_tail == HEAD1
> > rcu_sr_normal_gp_cleanup() {
> >     // has passed SR_MAX_USERS_WAKE_FROM_GP
> >     wait_tail->next = next;
> >     // done_tail = HEAD1
> >     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >     queue_work() {
> >         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> >         __queue_work()
> >     }
> > }
> > 
> >                                                       set_work_pool_and_clear_pending()
> >                                                       rcu_sr_normal_gp_cleanup_work() {
> > // new GP, wait_tail == HEAD2
> > rcu_sr_normal_gp_cleanup() {
> >     // executes all completion, but stop at HEAD1
> >     wait_tail->next = HEAD1;
> >     // done_tail = HEAD2
> >     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> >     queue_work() {
> >         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> >         __queue_work()
> >     }
> > }
> >                                                           // done = HEAD2
> >                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> >                                                           // head = HEAD1
> >                                                           head = done->next;
> >                                                           done->next = NULL;
> >                                                           llist_for_each_safe() {
> >                                                               // completes all callbacks, release HEAD1
> >                                                           }
> >                                                       }
> >                                                       // Process second queue
> >                                                       set_work_pool_and_clear_pending()
> >                                                       rcu_sr_normal_gp_cleanup_work() {
> >                                                           // done = HEAD2
> >                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> > 
> > // new GP, wait_tail == HEAD3
> > rcu_sr_normal_gp_cleanup() {
> >     // Finds HEAD2 with ->next == NULL at the end
> >     rcu_sr_put_wait_head(HEAD2)
> >     ...
> > 
> > // A few more GPs later
> > rcu_sr_normal_gp_init() {
> >      HEAD2 = rcu_sr_get_wait_head();
> >      llist_add(HEAD2, &rcu_state.srs_next);
> >                                                           // head == rcu_state.srs_next
> >                                                           head = done->next;
> >                                                           done->next = NULL;
> >                                                           llist_for_each_safe() {
> >                                                               // EXECUTE CALLBACKS TOO EARLY!!!
> >                                                           }
> >                                                       }
> Looks like that. To address this, we should not release the head in the GP
> > kthread.

But then you have to unconditionally schedule the work, right? Otherwise the
HEADs are not released. And that means dropping this patch (right now I don't
have a better idea).

Uladzislau Rezki March 4, 2024, 4:23 p.m. UTC | #16

On Mon, Mar 04, 2024 at 12:55:47PM +0100, Frederic Weisbecker wrote:
> Le Wed, Feb 28, 2024 at 07:04:21PM +0100, Uladzislau Rezki a écrit :
> > On Tue, Feb 27, 2024 at 12:07:32AM +0100, Frederic Weisbecker wrote:
> > > On Tue, Feb 20, 2024 at 07:31:13PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > > > +{
> > > > +	struct llist_node *done, *rcu, *next, *head;
> > > > +
> > > > +	/*
> > > > +	 * This work execution can potentially execute
> > > > +	 * while a new done tail is being updated by
> > > > +	 * grace period kthread in rcu_sr_normal_gp_cleanup().
> > > > +	 * So, read and updates of done tail need to
> > > > +	 * follow acq-rel semantics.
> > > > +	 *
> > > > +	 * Given that wq semantics guarantees that a single work
> > > > +	 * cannot execute concurrently by multiple kworkers,
> > > > +	 * the done tail list manipulations are protected here.
> > > > +	 */
> > > > +	done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > > +	if (!done)
> > > > +		return;
> > > > +
> > > > +	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
> > > > +	head = done->next;
> > > > +	done->next = NULL;
> > > 
> > > Can the following race happen?
> > > 
> > > CPU 0                                                   CPU 1
> > > -----                                                   -----
> > > 
> > > // wait_tail == HEAD1
> > > rcu_sr_normal_gp_cleanup() {
> > >     // has passed SR_MAX_USERS_WAKE_FROM_GP
> > >     wait_tail->next = next;
> > >     // done_tail = HEAD1
> > >     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > >     queue_work() {
> > >         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> > >         __queue_work()
> > >     }
> > > }
> > > 
> > >                                                       set_work_pool_and_clear_pending()
> > >                                                       rcu_sr_normal_gp_cleanup_work() {
> > > // new GP, wait_tail == HEAD2
> > > rcu_sr_normal_gp_cleanup() {
> > >     // executes all completion, but stop at HEAD1
> > >     wait_tail->next = HEAD1;
> > >     // done_tail = HEAD2
> > >     smp_store_release(&rcu_state.srs_done_tail, wait_tail);
> > >     queue_work() {
> > >         test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)
> > >         __queue_work()
> > >     }
> > > }
> > >                                                           // done = HEAD2
> > >                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> > >                                                           // head = HEAD1
> > >                                                           head = done->next;
> > >                                                           done->next = NULL;
> > >                                                           llist_for_each_safe() {
> > >                                                               // completes all callbacks, release HEAD1
> > >                                                           }
> > >                                                       }
> > >                                                       // Process second queue
> > >                                                       set_work_pool_and_clear_pending()
> > >                                                       rcu_sr_normal_gp_cleanup_work() {
> > >                                                           // done = HEAD2
> > >                                                           done = smp_load_acquire(&rcu_state.srs_done_tail);
> > > 
> > > // new GP, wait_tail == HEAD3
> > > rcu_sr_normal_gp_cleanup() {
> > >     // Finds HEAD2 with ->next == NULL at the end
> > >     rcu_sr_put_wait_head(HEAD2)
> > >     ...
> > > 
> > > // A few more GPs later
> > > rcu_sr_normal_gp_init() {
> > >      HEAD2 = rcu_sr_get_wait_head();
> > >      llist_add(HEAD2, &rcu_state.srs_next);
> > >                                                           // head == rcu_state.srs_next
> > >                                                           head = done->next;
> > >                                                           done->next = NULL;
> > >                                                           llist_for_each_safe() {
> > >                                                               // EXECUTE CALLBACKS TOO EARLY!!!
> > >                                                           }
> > >                                                       }
> > Looks like that. To address this, we should not release the head in the GP
> > > kthread.
> 
> But then you have to unconditionally schedule the work, right? Otherwise the
> HEADs are not released. And that means dropping this patch (right now I don't
> have a better idea).
>
The easiest way is to drop the patch. To address it we can go with:

<snip>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 31f3a61f9c38..9aa2cd46583e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1661,16 +1661,8 @@ static void rcu_sr_normal_gp_cleanup(void)
 	 * wait-head is released if last. The worker is not kicked.
 	 */
 	llist_for_each_safe(rcu, next, wait_tail->next) {
-		if (rcu_sr_is_wait_head(rcu)) {
-			if (!rcu->next) {
-				rcu_sr_put_wait_head(rcu);
-				wait_tail->next = NULL;
-			} else {
-				wait_tail->next = rcu;
-			}
-
+		if (rcu_sr_is_wait_head(rcu))
 			break;
-		}
 
 		rcu_sr_normal_complete(rcu);
 		// It can be last, update a next on this step.
<snip>

i.e. the process of users from GP is still there. The work is triggered
to perform a final complete(if there are users) + releasing wait-heads
so we do not race anymore.

I am OK with both cases. Dropping the patch will make it more simple
for sure.

--
Uladzislau Rezki

diff mbox series

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2244aa0a013b..7ca84cf7b4f4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5059,6 +5059,20 @@ 
 			delay, memory pressure or callback list growing too
 			big.
 
+	rcutree.rcu_normal_wake_from_gp= [KNL]
+			Reduces a latency of synchronize_rcu() call. This approach
+			maintains its own track of synchronize_rcu() callers, so it
+			does not interact with regular callbacks because it does not
+			use a call_rcu[_hurry]() path. Please note, this is for a
+			normal grace period.
+
+			How to enable it:
+
+			echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
+			or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
+
+			Default is 0.
+
 	rcuscale.gp_async= [KNL]
 			Measure performance of asynchronous
 			grace-period primitives such as call_rcu().
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c8980d76f402..1328da63c3cd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -75,6 +75,7 @@ 
 #define MODULE_PARAM_PREFIX "rcutree."
 
 /* Data structures. */
+static void rcu_sr_normal_gp_cleanup_work(struct work_struct *);
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
 	.gpwrap = true,
@@ -93,6 +94,8 @@  static struct rcu_state rcu_state = {
 	.exp_mutex = __MUTEX_INITIALIZER(rcu_state.exp_mutex),
 	.exp_wake_mutex = __MUTEX_INITIALIZER(rcu_state.exp_wake_mutex),
 	.ofl_lock = __ARCH_SPIN_LOCK_UNLOCKED,
+	.srs_cleanup_work = __WORK_INITIALIZER(rcu_state.srs_cleanup_work,
+		rcu_sr_normal_gp_cleanup_work),
 };
 
 /* Dump rcu_node combining tree at boot to verify correct setup. */
@@ -1422,6 +1425,282 @@  static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
+/*
+ * There is a single llist, which is used for handling
+ * synchronize_rcu() users' enqueued rcu_synchronize nodes.
+ * Within this llist, there are two tail pointers:
+ *
+ * wait tail: Tracks the set of nodes, which need to
+ *            wait for the current GP to complete.
+ * done tail: Tracks the set of nodes, for which grace
+ *            period has elapsed. These nodes processing
+ *            will be done as part of the cleanup work
+ *            execution by a kworker.
+ *
+ * At every grace period init, a new wait node is added
+ * to the llist. This wait node is used as wait tail
+ * for this new grace period. Given that there are a fixed
+ * number of wait nodes, if all wait nodes are in use
+ * (which can happen when kworker callback processing
+ * is delayed) and additional grace period is requested.
+ * This means, a system is slow in processing callbacks.
+ *
+ * TODO: If a slow processing is detected, a first node
+ * in the llist should be used as a wait-tail for this
+ * grace period, therefore users which should wait due
+ * to a slow process are handled by _this_ grace period
+ * and not next.
+ *
+ * Below is an illustration of how the done and wait
+ * tail pointers move from one set of rcu_synchronize nodes
+ * to the other, as grace periods start and finish and
+ * nodes are processed by kworker.
+ *
+ *
+ * a. Initial llist callbacks list:
+ *
+ * +----------+           +--------+          +-------+
+ * |          |           |        |          |       |
+ * |   head   |---------> |   cb2  |--------->| cb1   |
+ * |          |           |        |          |       |
+ * +----------+           +--------+          +-------+
+ *
+ *
+ *
+ * b. New GP1 Start:
+ *
+ *                    WAIT TAIL
+ *                      |
+ *                      |
+ *                      v
+ * +----------+     +--------+      +--------+        +-------+
+ * |          |     |        |      |        |        |       |
+ * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
+ * |          |     | head1  |      |        |        |       |
+ * +----------+     +--------+      +--------+        +-------+
+ *
+ *
+ *
+ * c. GP completion:
+ *
+ * WAIT_TAIL == DONE_TAIL
+ *
+ *                   DONE TAIL
+ *                     |
+ *                     |
+ *                     v
+ * +----------+     +--------+      +--------+        +-------+
+ * |          |     |        |      |        |        |       |
+ * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
+ * |          |     | head1  |      |        |        |       |
+ * +----------+     +--------+      +--------+        +-------+
+ *
+ *
+ *
+ * d. New callbacks and GP2 start:
+ *
+ *                    WAIT TAIL                          DONE TAIL
+ *                      |                                 |
+ *                      |                                 |
+ *                      v                                 v
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ * |          |     |      |    |      |    |      |    |     |    |     |    |     |
+ * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
+ * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ *
+ *
+ *
+ * e. GP2 completion:
+ *
+ * WAIT_TAIL == DONE_TAIL
+ *                   DONE TAIL
+ *                      |
+ *                      |
+ *                      v
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ * |          |     |      |    |      |    |      |    |     |    |     |    |     |
+ * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
+ * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ *
+ *
+ * While the llist state transitions from d to e, a kworker
+ * can start executing rcu_sr_normal_gp_cleanup_work() and
+ * can observe either the old done tail (@c) or the new
+ * done tail (@e). So, done tail updates and reads need
+ * to use the rel-acq semantics. If the concurrent kworker
+ * observes the old done tail, the newly queued work
+ * execution will process the updated done tail. If the
+ * concurrent kworker observes the new done tail, then
+ * the newly queued work will skip processing the done
+ * tail, as workqueue semantics guarantees that the new
+ * work is executed only after the previous one completes.
+ *
+ * f. kworker callbacks processing complete:
+ *
+ *
+ *                   DONE TAIL
+ *                     |
+ *                     |
+ *                     v
+ * +----------+     +--------+
+ * |          |     |        |
+ * |   head   ------> wait   |
+ * |          |     | head2  |
+ * +----------+     +--------+
+ *
+ */
+static bool rcu_sr_is_wait_head(struct llist_node *node)
+{
+	return &(rcu_state.srs_wait_nodes)[0].node <= node &&
+		node <= &(rcu_state.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
+}
+
+static struct llist_node *rcu_sr_get_wait_head(void)
+{
+	struct sr_wait_node *sr_wn;
+	int i;
+
+	for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
+		sr_wn = &(rcu_state.srs_wait_nodes)[i];
+
+		if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
+			return &sr_wn->node;
+	}
+
+	return NULL;
+}
+
+static void rcu_sr_put_wait_head(struct llist_node *node)
+{
+	struct sr_wait_node *sr_wn = container_of(node, struct sr_wait_node, node);
+	atomic_set_release(&sr_wn->inuse, 0);
+}
+
+/* Disabled by default. */
+static int rcu_normal_wake_from_gp;
+module_param(rcu_normal_wake_from_gp, int, 0644);
+
+static void rcu_sr_normal_complete(struct llist_node *node)
+{
+	struct rcu_synchronize *rs = container_of(
+		(struct rcu_head *) node, struct rcu_synchronize, head);
+	unsigned long oldstate = (unsigned long) rs->head.func;
+
+	WARN_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) &&
+		!poll_state_synchronize_rcu(oldstate),
+		"A full grace period is not passed yet: %lu",
+		rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
+
+	/* Finally. */
+	complete(&rs->completion);
+}
+
+static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
+{
+	struct llist_node *done, *rcu, *next, *head;
+
+	/*
+	 * This work execution can potentially execute
+	 * while a new done tail is being updated by
+	 * grace period kthread in rcu_sr_normal_gp_cleanup().
+	 * So, read and updates of done tail need to
+	 * follow acq-rel semantics.
+	 *
+	 * Given that wq semantics guarantees that a single work
+	 * cannot execute concurrently by multiple kworkers,
+	 * the done tail list manipulations are protected here.
+	 */
+	done = smp_load_acquire(&rcu_state.srs_done_tail);
+	if (!done)
+		return;
+
+	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
+	head = done->next;
+	done->next = NULL;
+
+	/*
+	 * The dummy node, which is pointed to by the
+	 * done tail which is acq-read above is not removed
+	 * here.  This allows lockless additions of new
+	 * rcu_synchronize nodes in rcu_sr_normal_add_req(),
+	 * while the cleanup work executes. The dummy
+	 * nodes is removed, in next round of cleanup
+	 * work execution.
+	 */
+	llist_for_each_safe(rcu, next, head) {
+		if (!rcu_sr_is_wait_head(rcu)) {
+			rcu_sr_normal_complete(rcu);
+			continue;
+		}
+
+		rcu_sr_put_wait_head(rcu);
+	}
+}
+
+/*
+ * Helper function for rcu_gp_cleanup().
+ */
+static void rcu_sr_normal_gp_cleanup(void)
+{
+	struct llist_node *wait_tail;
+
+	wait_tail = rcu_state.srs_wait_tail;
+	if (wait_tail == NULL)
+		return;
+
+	rcu_state.srs_wait_tail = NULL;
+	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
+
+	// concurrent sr_normal_gp_cleanup work might observe this update.
+	smp_store_release(&rcu_state.srs_done_tail, wait_tail);
+	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
+
+	if (wait_tail)
+		queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
+}
+
+/*
+ * Helper function for rcu_gp_init().
+ */
+static bool rcu_sr_normal_gp_init(void)
+{
+	struct llist_node *first;
+	struct llist_node *wait_head;
+	bool start_new_poll = false;
+
+	first = READ_ONCE(rcu_state.srs_next.first);
+	if (!first || rcu_sr_is_wait_head(first))
+		return start_new_poll;
+
+	wait_head = rcu_sr_get_wait_head();
+	if (!wait_head) {
+		// Kick another GP to retry.
+		start_new_poll = true;
+		return start_new_poll;
+	}
+
+	/* Inject a wait-dummy-node. */
+	llist_add(wait_head, &rcu_state.srs_next);
+
+	/*
+	 * A waiting list of rcu_synchronize nodes should be empty on
+	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
+	 * rolls it over. If not, it is a BUG, warn a user.
+	 */
+	WARN_ON_ONCE(rcu_state.srs_wait_tail != NULL);
+	rcu_state.srs_wait_tail = wait_head;
+	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_wait_tail);
+
+	return start_new_poll;
+}
+
+static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
+{
+	llist_add((struct llist_node *) &rs->head, &rcu_state.srs_next);
+}
+
 /*
  * Initialize a new grace period.  Return false if no grace period required.
  */
@@ -1432,6 +1711,7 @@  static noinline_for_stack bool rcu_gp_init(void)
 	unsigned long mask;
 	struct rcu_data *rdp;
 	struct rcu_node *rnp = rcu_get_root();
+	bool start_new_poll;
 
 	WRITE_ONCE(rcu_state.gp_activity, jiffies);
 	raw_spin_lock_irq_rcu_node(rnp);
@@ -1456,10 +1736,24 @@  static noinline_for_stack bool rcu_gp_init(void)
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
 	rcu_seq_start(&rcu_state.gp_seq);
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
+	start_new_poll = rcu_sr_normal_gp_init();
 	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
 	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
 	raw_spin_unlock_irq_rcu_node(rnp);
 
+	/*
+	 * The "start_new_poll" is set to true, only when this GP is not able
+	 * to handle anything and there are outstanding users. It happens when
+	 * the rcu_sr_normal_gp_init() function was not able to insert a dummy
+	 * separator to the llist, because there were no left any dummy-nodes.
+	 *
+	 * Number of dummy-nodes is fixed, it could be that we are run out of
+	 * them, if so we start a new pool request to repeat a try. It is rare
+	 * and it means that a system is doing a slow processing of callbacks.
+	 */
+	if (start_new_poll)
+		(void) start_poll_synchronize_rcu();
+
 	/*
 	 * Apply per-leaf buffered online and offline operations to
 	 * the rcu_node tree. Note that this new grace period need not
@@ -1825,6 +2119,9 @@  static noinline void rcu_gp_cleanup(void)
 	}
 	raw_spin_unlock_irq_rcu_node(rnp);
 
+	// Make synchronize_rcu() users aware of the end of old grace period.
+	rcu_sr_normal_gp_cleanup();
+
 	// If strict, make all CPUs aware of the end of the old grace period.
 	if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
 		on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
@@ -3559,6 +3856,38 @@  static int rcu_blocking_is_gp(void)
 	return true;
 }
 
+/*
+ * Helper function for the synchronize_rcu() API.
+ */
+static void synchronize_rcu_normal(void)
+{
+	struct rcu_synchronize rs;
+
+	if (!READ_ONCE(rcu_normal_wake_from_gp)) {
+		wait_rcu_gp(call_rcu_hurry);
+		return;
+	}
+
+	init_rcu_head_on_stack(&rs.head);
+	init_completion(&rs.completion);
+
+	/*
+	 * This code might be preempted, therefore take a GP
+	 * snapshot before adding a request.
+	 */
+	if (IS_ENABLED(CONFIG_PROVE_RCU))
+		rs.head.func = (void *) get_state_synchronize_rcu();
+
+	rcu_sr_normal_add_req(&rs);
+
+	/* Kick a GP and start waiting. */
+	(void) start_poll_synchronize_rcu();
+
+	/* Now we can wait. */
+	wait_for_completion(&rs.completion);
+	destroy_rcu_head_on_stack(&rs.head);
+}
+
 /**
  * synchronize_rcu - wait until a grace period has elapsed.
  *
@@ -3610,7 +3939,7 @@  void synchronize_rcu(void)
 		if (rcu_gp_is_expedited())
 			synchronize_rcu_expedited();
 		else
-			wait_rcu_gp(call_rcu_hurry);
+			synchronize_rcu_normal();
 		return;
 	}
 
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6b83537480b1..8a1d9c8bd9f7 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -930,7 +930,7 @@  void synchronize_rcu_expedited(void)
 
 	/* If expedited grace periods are prohibited, fall back to normal. */
 	if (rcu_gp_is_normal()) {
-		wait_rcu_gp(call_rcu_hurry);
+		synchronize_rcu_normal();
 		return;
 	}