[v3] rcu: Add a minimum time for marking boot as completed

Message ID 20230303213851.2090365-1-joel@joelfernandes.org
State New
Series [v3] rcu: Add a minimum time for marking boot as completed

Commit Message

Joel Fernandes March 3, 2023, 9:38 p.m. UTC
  On many systems, a great deal of boot (in userspace) happens after the
kernel thinks the boot has completed. It is difficult to determine if
the system has really booted from the kernel side. Some features like
lazy-RCU can risk slowing down boot time if, say, a callback has been
added that the boot synchronously depends on. Furthermore, expedited callbacks
can get unexpedited much earlier than they should be, thus slowing down
boot (as shown in the data below).

For these reasons, this commit adds a config option
'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.rcu_boot_end_delay.
Userspace can also mark the system as booted from RCU's perspective by writing
the time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
or even by just writing a value of 0 to this sysfs node.
However, under no circumstance will the boot be allowed to end earlier
than just before init is launched.

The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
suits ChromeOS and also a PREEMPT_RT system below very well, both of which need
no config or parameter changes, just a simple application of this patch. A
system designer can also choose a specific value here to keep RCU from marking
boot completion.  As noted earlier, the system will not be marked as booted
from RCU's perspective until at least rcu_boot_end_delay milliseconds have
passed or an update is made by writing a small value (or 0) in milliseconds to:
/sys/module/rcupdate/parameters/rcu_boot_end_delay.

One side effect of this patch is the risk that a real-time workload
launched just after the kernel boots will suffer interruptions due to expedited
RCU, which previously ended just before init was launched. However, to mitigate
such an issue (however unlikely), the user should either tune
CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay once userspace
boots and before launching the real-time workload.
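
As a rough illustration of the sysfs write described above, a userspace init
program could do something like the following. This is only a sketch: the
helper name write_delay_ms() is hypothetical, and the node exists only on a
kernel that carries this patch.

```c
#include <stdio.h>

/*
 * Write an unsigned millisecond value to a module-parameter file.
 * Returns 0 on success, -1 on failure.  write_delay_ms() is a
 * hypothetical helper name; the caller supplies the path.
 */
int write_delay_ms(const char *path, unsigned int ms)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%u\n", ms) < 0) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer, so a rejected write shows up here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Calling write_delay_ms("/sys/module/rcupdate/parameters/rcu_boot_end_delay", 0)
once userspace is up would correspond to the "write a value of 0" case.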

Qiuxu also noted impressive boot-time improvements with an earlier version
of the patch. An excerpt from the data he shared:

1) Testing environment:
    OS        : CentOS Stream 8 (non-RT OS)
    Kernel    : v6.2
    Machine   : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
    Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …

2) OS boot time definition:
    The time from the start of the kernel boot to the shell command line
    prompt is shown from the console. [ Different people may have
    different OS boot time definitions. ]

3) Measurement method (very rough method):
    A timer in the kernel periodically prints the boot time every 100ms.
    As soon as the shell command line prompt is shown from the console,
    we record the boot time printed by the timer, then the printed boot
    time is the OS boot time.

4) Measured OS boot time (in seconds)
   a) Measured 10 times w/o this patch:
        8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
        The average OS boot time was: ~8.7s

   b) Measured 10 times w/ this patch:
        8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
        The average OS boot time was: ~8.3s.

Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v1->v2:
	Update some comments and description.
v2->v3:
        Add sysfs param, and update with Test data.

 .../admin-guide/kernel-parameters.txt         | 12 ++++
 cc_list                                       |  8 +++
 kernel/rcu/Kconfig                            | 19 ++++++
 kernel/rcu/update.c                           | 68 ++++++++++++++++++-
 4 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 cc_list
  

Comments

Paul E. McKenney March 4, 2023, 1:02 a.m. UTC | #1
On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> On many systems, a great deal of boot (in userspace) happens after the
> kernel thinks the boot has completed. It is difficult to determine if
> the system has really booted from the kernel side. Some features like
> lazy-RCU can risk slowing down boot time if, say, a callback has been
> added that the boot synchronously depends on. Furthermore, expedited callbacks
> can get unexpedited much earlier than they should be, thus slowing down
> boot (as shown in the data below).
> 
> For these reasons, this commit adds a config option
> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.rcu_boot_end_delay.
> Userspace can also mark the system as booted from RCU's perspective by writing
> the time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> or even by just writing a value of 0 to this sysfs node.
> However, under no circumstance will the boot be allowed to end earlier
> than just before init is launched.
> 
> The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> suits ChromeOS and also a PREEMPT_RT system below very well, both of which need
> no config or parameter changes, just a simple application of this patch. A
> system designer can also choose a specific value here to keep RCU from marking
> boot completion.  As noted earlier, the system will not be marked as booted
> from RCU's perspective until at least rcu_boot_end_delay milliseconds have
> passed or an update is made by writing a small value (or 0) in milliseconds to:
> /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> 
> One side effect of this patch is the risk that a real-time workload
> launched just after the kernel boots will suffer interruptions due to expedited
> RCU, which previously ended just before init was launched. However, to mitigate
> such an issue (however unlikely), the user should either tune
> CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay once userspace
> boots and before launching the real-time workload.

Much better, thank you!

> Qiuxu also noted impressive boot-time improvements with an earlier version
> of the patch. An excerpt from the data he shared:
> 
> 1) Testing environment:
>     OS            : CentOS Stream 8 (non-RT OS)
>     Kernel     : v6.2
>     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
>     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> 
> 2) OS boot time definition:
>     The time from the start of the kernel boot to the shell command line
>     prompt is shown from the console. [ Different people may have
>     different OS boot time definitions. ]
> 
> 3) Measurement method (very rough method):
>     A timer in the kernel periodically prints the boot time every 100ms.
>     As soon as the shell command line prompt is shown from the console,
>     we record the boot time printed by the timer, then the printed boot
>     time is the OS boot time.
> 
> 4) Measured OS boot time (in seconds)
>    a) Measured 10 times w/o this patch:
>         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
>         The average OS boot time was: ~8.7s
> 
>    b) Measured 10 times w/ this patch:
>         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
>         The average OS boot time was: ~8.3s.

Unfortunately, given that a's average is within one standard deviation
of b's average, this is most definitely not statistically significant.
Especially given only ten measurements for each case -- you need *at*
*least* 24, preferably more.  Especially in this case, where you don't
really know what the underlying distribution is.
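
For readers who want to check the comparison of averages, here is a small
sketch that computes the mean and the sample variance of the two data sets
quoted above (the standard deviation is the square root of the variance).
The array and function names are made up for this example. The means come
out at about 8.68 and 8.31, and the difference is within one sample standard
deviation of the patched runs.

```c
#include <stddef.h>

/* Boot times in seconds, copied from the quoted measurements. */
const double boot_without[10] = { 8.7, 8.4, 8.6, 8.2, 9.0, 8.7, 8.8, 9.3, 8.8, 8.3 };
const double boot_with[10]    = { 8.5, 8.2, 7.6, 8.2, 8.7, 8.2, 7.8, 8.2, 9.3, 8.4 };

/* Arithmetic mean of n samples; requires n >= 1. */
double mean(const double *x, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/* Unbiased sample variance (n - 1 denominator); requires n >= 2. */
double sample_variance(const double *x, size_t n)
{
	double m = mean(x, n);
	double ss = 0.0;

	for (size_t i = 0; i < n; i++)
		ss += (x[i] - m) * (x[i] - m);
	return ss / (double)(n - 1);
}

/*
 * True if |a - b| is at most one standard deviation.  Comparing squares
 * avoids needing sqrt().
 */
int within_one_stddev(double a, double b, double variance)
{
	double d = a - b;

	return d * d <= variance;
}
```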

But we can apply the binomial distribution instead of the usual
normal distribution.  First, let's sort and take the medians:

a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3  Median: 8.7
b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3  Median: 8.2

8/10 of a's data points are greater than 0.1 more than b's median
and 8/10 of b's data points are less than 0.1 less than a's median.
What are the odds that this happens by random chance?

This is given by sum_{i=0}^{2} (0.5^10 * binomial(10,i)), which is about 0.055.
This is not quite 95% confidence, so not hugely convincing, but it is at
least close.  Note that this is the confidence that (b) is 100ms faster
than (a), not just that (b) is faster than (a).
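
The arithmetic above can be reproduced with a few lines of C. This is only a
sketch with made-up helper names. The samples are stored in tenths of a
second so that the threshold comparisons are exact integer comparisons.

```c
#include <stdint.h>

/* Boot times in tenths of a second, from the measurements above. */
const int a_tenths[10] = { 87, 84, 86, 82, 90, 87, 88, 93, 88, 83 };
const int b_tenths[10] = { 85, 82, 76, 82, 87, 82, 78, 82, 93, 84 };

/* Number of samples strictly greater than threshold. */
unsigned int count_greater(const int *x, unsigned int n, int threshold)
{
	unsigned int count = 0;

	for (unsigned int i = 0; i < n; i++)
		if (x[i] > threshold)
			count++;
	return count;
}

/* C(n, k) by the multiplicative formula; exact for small n. */
uint64_t binomial(unsigned int n, unsigned int k)
{
	uint64_t c = 1;

	if (k > n)
		return 0;
	for (unsigned int i = 1; i <= k; i++)
		c = c * (n - k + i) / i;	/* stays an integer at each step */
	return c;
}

/* P(X <= k) for X ~ Binomial(n, 1/2); assumes n < 64. */
double binomial_tail_half(unsigned int n, unsigned int k)
{
	uint64_t sum = 0;

	for (unsigned int i = 0; i <= k && i <= n; i++)
		sum += binomial(n, i);
	return (double)sum / (double)(1ULL << n);
}
```

With b's median at 8.2s, count_greater(a_tenths, 10, 83) counts the (a)
samples more than 0.1s above it, and binomial_tail_half(10, 2) evaluates the
sum as (1 + 10 + 45) / 1024.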

Not sure that this really carries its weight, but in contrast to the
usual statistics based on the normal distribution, it does suggest at
least a little improvement.  On the other hand, anyone who has carefully
studied nonparametric statistics probably jumped out of the boat several
paragraphs ago.  ;-)

A few more questions interspersed below.

							Thanx, Paul

> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> v1->v2:
> 	Update some comments and description.
> v2->v3:
>         Add sysfs param, and update with Test data.
> 
>  .../admin-guide/kernel-parameters.txt         | 12 ++++
>  cc_list                                       |  8 +++
>  kernel/rcu/Kconfig                            | 19 ++++++
>  kernel/rcu/update.c                           | 68 ++++++++++++++++++-
>  4 files changed, 106 insertions(+), 1 deletion(-)
>  create mode 100644 cc_list
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 2429b5e3184b..611de90d9c13 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5085,6 +5085,18 @@
>  	rcutorture.verbose= [KNL]
>  			Enable additional printk() statements.
>  
> +	rcupdate.rcu_boot_end_delay= [KNL]
> +			Minimum time in milliseconds that must elapse
> +			before the boot sequence can be marked complete
> +			from RCU's perspective, after which RCU's behavior
> +			becomes more relaxed. The default value is also
> +			configurable via CONFIG_RCU_BOOT_END_DELAY.
> +			Userspace can also mark the boot as completed
> +			sooner by writing the time in milliseconds, say once
> +			userspace considers the system as booted, to:
> +			/sys/module/rcupdate/parameters/rcu_boot_end_delay
> +			Or even just writing a value of 0 to this sysfs node.

Can userspace also extend the time in this manner?  I am not too worried
either way, but it would be good to make this clear.

If userspace writes a non-zero value, is that from the current time or
from boot?

> +
>  	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
>  			Dump ftrace buffer after reporting RCU CPU
>  			stall warning.
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 9071182b1284..4b5ffa36cbaf 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -217,6 +217,25 @@ config RCU_BOOST_DELAY
>  
>  	  Accept the default if unsure.
>  
> +config RCU_BOOT_END_DELAY
> +	int "Minimum time before RCU may consider in-kernel boot as completed"
> +	range 0 120000
> +	default 15000
> +	help
> +	  Default value of the minimum time in milliseconds that must elapse
> +	  before the boot sequence can be marked complete from RCU's perspective,
> +	  after which RCU's behavior becomes more relaxed.
> +	  Userspace can also mark the boot as completed sooner than this default
> +	  by writing the time in milliseconds, say once userspace considers
> +	  the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> +	  Or even just writing a value of 0 to this sysfs node.
> +
> +	  The actual delay for RCU's view of the system to be marked as booted can be
> +	  higher than this value if the kernel takes a long time to initialize but it
> +	  will never be smaller than this value.
> +
> +	  Accept the default if unsure.
> +
>  config RCU_EXP_KTHREAD
>  	bool "Perform RCU expedited work in a real-time kthread"
>  	depends on RCU_BOOST && RCU_EXPERT
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 19bf6fa3ee6a..93138c92136e 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -224,18 +224,84 @@ void rcu_unexpedite_gp(void)
>  }
>  EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
>  
> +/*
> + * Minimum time in milliseconds until RCU can consider in-kernel boot as
> + * completed.  This can also be tuned at runtime to end the boot earlier, by
> + * userspace init code writing the time in milliseconds (even 0) to:
> + * /sys/module/rcupdate/parameters/rcu_boot_end_delay
> + */
> +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
> +
>  static bool rcu_boot_ended __read_mostly;
> +static bool rcu_boot_end_called __read_mostly;
> +static DEFINE_MUTEX(rcu_boot_end_lock);
> +
> +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
> +{
> +	uint end_ms;
> +	int ret = kstrtouint(val, 0, &end_ms);
> +
> +	if (ret)
> +		return ret;
> +	WRITE_ONCE(*(uint *)kp->arg, end_ms);

Doesn't this write to rcu_boot_end_delay outside of the lock?

> +
> +	/*
> +	 * rcu_end_inkernel_boot() should be called at least once during init
> +	 * before we can allow param changes to end the boot.
> +	 */
> +	mutex_lock(&rcu_boot_end_lock);
> +	rcu_boot_end_delay = end_ms;
> +	if (!rcu_boot_ended && rcu_boot_end_called) {
> +		mutex_unlock(&rcu_boot_end_lock);
> +		rcu_end_inkernel_boot();

Temporarily dropping rcu_boot_end_lock looks like an accident waiting
to happen.

> +	}
> +	mutex_unlock(&rcu_boot_end_lock);

And dropping it twice does not seem good, either.  Or am I missing some
subtle control-flow trick?

> +	return ret;
> +}
> +
> +static const struct kernel_param_ops rcu_boot_end_ops = {
> +	.set = param_set_rcu_boot_end,
> +	.get = param_get_uint,
> +};
> +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
>  
>  /*
> - * Inform RCU of the end of the in-kernel boot sequence.
> + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
> + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
>   */
> +void rcu_end_inkernel_boot(void);
> +static void rcu_boot_end_work_fn(struct work_struct *work)
> +{
> +	rcu_end_inkernel_boot();
> +}
> +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
> +
>  void rcu_end_inkernel_boot(void)
>  {
> +	mutex_lock(&rcu_boot_end_lock);
> +	rcu_boot_end_called = true;
> +
> +	if (rcu_boot_ended)
> +		return;
> +
> +	if (rcu_boot_end_delay) {
> +		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
> +
> +		if (boot_ms < rcu_boot_end_delay) {

Isn't it necessary to cancel a previously scheduled work to make sure
that the new value overrides the old one?

Mightn't this be simpler if the user was only permitted to write zero,
thus just saying "stop immediately"?  If people really need the ability
to extend or shorten the time, a patch can be produced at that point.
And then a non-zero write to the file would become legal.

> +			schedule_delayed_work(&rcu_boot_end_work,
> +					rcu_boot_end_delay - boot_ms);
> +			mutex_unlock(&rcu_boot_end_lock);
> +			return;
> +		}
> +	}
> +
> +	cancel_delayed_work(&rcu_boot_end_work);
>  	rcu_unexpedite_gp();
>  	rcu_async_relax();
>  	if (rcu_normal_after_boot)
>  		WRITE_ONCE(rcu_normal, 1);
>  	rcu_boot_ended = true;
> +	mutex_unlock(&rcu_boot_end_lock);
>  }
>  
>  /*
> -- 
> 2.40.0.rc0.216.gc4246ad0f0-goog
  
Joel Fernandes March 4, 2023, 4:51 a.m. UTC | #2
Hi Paul,

On Fri, Mar 03, 2023 at 05:02:51PM -0800, Paul E. McKenney wrote:
[..]
> > Qiuxu also noted impressive boot-time improvements with an earlier version
> > of the patch. An excerpt from the data he shared:
> > 
> > 1) Testing environment:
> >     OS            : CentOS Stream 8 (non-RT OS)
> >     Kernel     : v6.2
> >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > 
> > 2) OS boot time definition:
> >     The time from the start of the kernel boot to the shell command line
> >     prompt is shown from the console. [ Different people may have
> >     different OS boot time definitions. ]
> > 
> > 3) Measurement method (very rough method):
> >     A timer in the kernel periodically prints the boot time every 100ms.
> >     As soon as the shell command line prompt is shown from the console,
> >     we record the boot time printed by the timer, then the printed boot
> >     time is the OS boot time.
> > 
> > 4) Measured OS boot time (in seconds)
> >    a) Measured 10 times w/o this patch:
> >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> >         The average OS boot time was: ~8.7s
> > 
> >    b) Measured 10 times w/ this patch:
> >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> >         The average OS boot time was: ~8.3s.
> 
> Unfortunately, given that a's average is within one standard deviation
> of b's average, this is most definitely not statistically significant.
> Especially given only ten measurements for each case -- you need *at*
> *least* 24, preferably more.  Especially in this case, where you don't
> really know what the underlying distribution is.
> 
> But we can apply the binomial distribution instead of the usual
> normal distribution.  First, let's sort and take the medians:
> 
> a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3  Median: 8.7
> b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3  Median: 8.2
> 
> 8/10 of a's data points are greater than 0.1 more than b's median
> and 8/10 of b's data points are less than 0.1 less than a's median.
> What are the odds that this happens by random chance?
> 
> This is given by sum_{i=0}^{2} (0.5^10 * binomial(10,i)), which is about 0.055.
> This is not quite 95% confidence, so not hugely convincing, but it is at
> least close.  Note that this is the confidence that (b) is 100ms faster
> than (a), not just that (b) is faster than (a).
> 
> Not sure that this really carries its weight, but in contrast to the
> usual statistics based on the normal distribution, it does suggest at
> least a little improvement.  On the other hand, anyone who has carefully
> studied nonparametric statistics probably jumped out of the boat several
> paragraphs ago.  ;-)

Thanks for the analysis, I did feel the samples were few. I am happy to
update it with more data if Qiuxu can collect and provide more samples.

[..]
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5085,6 +5085,18 @@
> >  	rcutorture.verbose= [KNL]
> >  			Enable additional printk() statements.
> >  
> > +	rcupdate.rcu_boot_end_delay= [KNL]
> > +			Minimum time in milliseconds that must elapse
> > +			before the boot sequence can be marked complete
> > +			from RCU's perspective, after which RCU's behavior
> > +			becomes more relaxed. The default value is also
> > +			configurable via CONFIG_RCU_BOOT_END_DELAY.
> > +			Userspace can also mark the boot as completed
> > +			sooner by writing the time in milliseconds, say once
> > +			userspace considers the system as booted, to:
> > +			/sys/module/rcupdate/parameters/rcu_boot_end_delay
> > +			Or even just writing a value of 0 to this sysfs node.
> 
> Can userspace also extend the time in this manner?  I am not too worried
> either way, but it would be good to make this clear.

Yes, it can be extended because once the default timer fires, it will
schedule a new timer to account for that. Thanks, I'll clarify in the above
docs.

> If userspace writes a non-zero value, is that from the current time or
> from boot?

Good point, it is from the start of boot always, I fixed it.

[..]
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index 19bf6fa3ee6a..93138c92136e 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -224,18 +224,84 @@ void rcu_unexpedite_gp(void)
> >  }
> >  EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
> >  
> > +/*
> > + * Minimum time in milliseconds until RCU can consider in-kernel boot as
> > + * completed.  This can also be tuned at runtime to end the boot earlier, by
> > + * userspace init code writing the time in milliseconds (even 0) to:
> > + * /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > + */
> > +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
> > +
> >  static bool rcu_boot_ended __read_mostly;
> > +static bool rcu_boot_end_called __read_mostly;
> > +static DEFINE_MUTEX(rcu_boot_end_lock);
> > +
> > +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
> > +{
> > +	uint end_ms;
> > +	int ret = kstrtouint(val, 0, &end_ms);
> > +
> > +	if (ret)
> > +		return ret;
> > +	WRITE_ONCE(*(uint *)kp->arg, end_ms);
> 
> Doesn't this write to rcu_boot_end_delay outside of the lock?

True, but actually I realize I don't even need to do it because I overwrite
it in the next step ;-). So I'll just remove it.

> > +
> > +	/*
> > +	 * rcu_end_inkernel_boot() should be called at least once during init
> > +	 * before we can allow param changes to end the boot.
> > +	 */
> > +	mutex_lock(&rcu_boot_end_lock);
> > +	rcu_boot_end_delay = end_ms;
> > +	if (!rcu_boot_ended && rcu_boot_end_called) {
> > +		mutex_unlock(&rcu_boot_end_lock);
> > +		rcu_end_inkernel_boot();
> 
> Temporarily dropping rcu_boot_end_lock looks like an accident waiting
> to happen.
> 
> > +	}
> > +	mutex_unlock(&rcu_boot_end_lock);
> 
> And dropping it twice does not seem good, either.  Or am I missing some
> subtle control-flow trick?

You are quite right, sorry I missed it. To prevent this sort of issue
from happening again, I moved the locking to the caller, which also simplifies
the code a bit and prevents such traps.
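
To make the "locking in the caller" idea concrete, here is a single-threaded
userspace model of that structure. It is a sketch under stated assumptions,
not the v4 code: struct toy_lock, end_boot_locked(), end_boot() and
param_set() are names invented for this example, and the toy lock only
records misuse instead of providing mutual exclusion.

```c
#include <stdbool.h>

/* Toy lock that only records misuse; a stand-in for the kernel mutex. */
struct toy_lock {
	bool held;
	int misuse;	/* double acquire, or release without acquire */
};

struct boot_state {
	struct toy_lock lock;
	unsigned int delay_ms;	/* minimum time since boot */
	bool end_called;	/* init has asked to end boot at least once */
	bool ended;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		l->misuse++;
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	if (!l->held)
		l->misuse++;
	l->held = false;
}

/* Caller must hold s->lock.  This helper never touches the lock itself. */
static void end_boot_locked(struct boot_state *s, unsigned int now_ms)
{
	s->end_called = true;
	if (s->ended)
		return;
	if (now_ms < s->delay_ms)
		return;	/* too early; real code would arm a timer here */
	s->ended = true;
}

/* Entry point used by init: exactly one acquire and one release. */
void end_boot(struct boot_state *s, unsigned int now_ms)
{
	toy_lock_acquire(&s->lock);
	end_boot_locked(s, now_ms);
	toy_lock_release(&s->lock);
}

/* Entry point for a parameter write: exactly one acquire and one release. */
void param_set(struct boot_state *s, unsigned int new_delay_ms,
	       unsigned int now_ms)
{
	toy_lock_acquire(&s->lock);
	s->delay_ms = new_delay_ms;
	if (!s->ended && s->end_called)
		end_boot_locked(s, now_ms);
	toy_lock_release(&s->lock);
}
```

Because the helper has early returns but no unlock calls, no return path can
leak the lock or release it twice.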

> > +	return ret;
> > +}
> > +
> > +static const struct kernel_param_ops rcu_boot_end_ops = {
> > +	.set = param_set_rcu_boot_end,
> > +	.get = param_get_uint,
> > +};
> > +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
> >  
> >  /*
> > - * Inform RCU of the end of the in-kernel boot sequence.
> > + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
> > + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
> >   */
> > +void rcu_end_inkernel_boot(void);
> > +static void rcu_boot_end_work_fn(struct work_struct *work)
> > +{
> > +	rcu_end_inkernel_boot();
> > +}
> > +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
> > +
> >  void rcu_end_inkernel_boot(void)
> >  {
> > +	mutex_lock(&rcu_boot_end_lock);
> > +	rcu_boot_end_called = true;
> > +
> > +	if (rcu_boot_ended)
> > +		return;
> > +
> > +	if (rcu_boot_end_delay) {
> > +		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
> > +
> > +		if (boot_ms < rcu_boot_end_delay) {
> 
> Isn't it necessary to cancel a previously scheduled work to make sure
> that the new value overrides the old one?

No it is not necessary, as we can keep the older timer and once it fires, its
callback will call rcu_end_inkernel_boot() which will queue another timer to
extend the delay further. As long as 'rcu_boot_end_delay' is updated, it will
work fine. You can see that in test #3 below.

Actually this part of the code is equivalent to what I had in my first
patch, so it is not any new code I am adding.
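
The re-arming behavior described above can be modeled in userspace as
follows. This is a simplified sketch, not the patch code: the struct and
function names are invented, times are absolute milliseconds since boot, and
it assumes that arming an already-pending timer is a no-op (which is how
schedule_delayed_work() treats work that is already queued). The
rcu_boot_end_called check is left out.

```c
#include <stdbool.h>
#include <stdint.h>

struct boot_sim {
	uint64_t delay_ms;	/* minimum time since boot, in ms */
	uint64_t expiry_ms;	/* absolute expiry of the pending timer */
	bool pending;		/* a timer is already armed */
	bool ended;
	uint64_t ended_at_ms;
};

/* Model of the end-of-boot check at time now_ms (since boot). */
void sim_try_end(struct boot_sim *s, uint64_t now_ms)
{
	if (s->ended)
		return;
	if (s->delay_ms && now_ms < s->delay_ms) {
		/* Arming is a no-op while a timer is still pending. */
		if (!s->pending) {
			s->pending = true;
			s->expiry_ms = s->delay_ms; /* now + (delay - now) */
		}
		return;
	}
	s->pending = false;	/* stands in for cancel_delayed_work() */
	s->ended = true;
	s->ended_at_ms = now_ms;
}

/* Model of a sysfs write: update the delay, then re-run the check. */
void sim_write_delay(struct boot_sim *s, uint64_t now_ms, uint64_t new_delay_ms)
{
	s->delay_ms = new_delay_ms;
	sim_try_end(s, now_ms);
}

/* Model of the pending timer firing: the handler re-runs the check. */
void sim_fire(struct boot_sim *s)
{
	uint64_t now_ms;

	if (!s->pending)
		return;
	now_ms = s->expiry_ms;
	s->pending = false;
	sim_try_end(s, now_ms);
}
```

Replaying test #3 in this model (delay 10000, write 20000 at about 6.9s)
gives one expiry at 10000 that re-arms for 20000, and the boot is marked
ended at 20000. In the same model, a non-zero value shorter than the pending
expiry would only take effect when the pending timer fires, whereas a write
of 0 ends the boot immediately.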

> Mightn't this be simpler if the user was only permitted to write zero,
> thus just saying "stop immediately"?  If people really need the ability
> to extend or shorten the time, a patch can be produced at that point.
> And then a non-zero write to the file would become legal.

I prefer to keep it this way: with this method, I not only get a variable
rcu_boot_end_delay via the boot parameter (as in my first patch), I also
don't need to add a separate sysfs entry and can just reuse the
'rcu_boot_end_delay' parameter, which I also had in my first patch. Adding
yet another sysfs parameter would actually complicate it even more and
add more lines of code.

I tested different scenarios and it works fine. Though I unfortunately
missed that mutex locking bug, I did verify by manual testing that the
different test cases work as expected.

Here are some printks from simple testing in Qemu:

1. End the boot early, CONFIG is set to 120 seconds:
==================================================
[    1.614968] rcu_boot_end_delay = 120000
[    1.617630] schedule delayed work joel

Boot took 1.57 seconds
root@(none):/# cat /sys/module/rcupdate/parameters/rcu_boot_end_delay
120000
root@(none):/#
root@(none):/#
root@(none):/# echo 0 > /sys/module/rcupdate/parameters/rcu_boot_end_delay
[   10.108394] param called joel
[   10.110520] sys calling boot ended
[   10.112730] rcu_boot_end_delay = 0
[   10.115017] boot ended joel
-----------------------------------------------

2. End the boot passing in rcupdate.rcu_boot_end_delay as 10s.
   This should override the CONFIG of 120 seconds:
==================================================
[    1.700090] rcu_boot_end_delay = 10000
[    1.702628] schedule delayed work joel

Boot took 1.64 seconds

root@(none):/# [   10.414008] rcu_boot_end_delay = 10000
[   10.416670] boot ended joel
-----------------------------------------------

3. Do the same thing as #2, but extend the boot via sysfs to be longer than
10 seconds:
==================================================
[    0.060025] param called joel
[    0.060026] param called too early joel
[    1.663905] rcu_boot_end_delay = 10000
[    1.667051] schedule delayed work joel

Boot took 1.61 seconds

root@(none):/#
root@(none):/# echo 20000 > /sys/module/rcupdate/parameters/rcu_boot_end_delay
[    6.932517] param called joel
[    6.934637] sys calling boot ended
[    6.936845] rcu_boot_end_delay = 20000
[    6.939291] schedule delayed work joel
root@(none):/# [   10.389366] rcu_boot_end_delay = 20000
[   10.392047] schedule delayed work joel
[   20.117416] rcu_boot_end_delay = 20000
[   20.120073] boot ended joel
-----------------------------------------------

The debug patch is here: https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=rcu/lazy/postboot

Appended is the updated v4 patch, tested as shown above. More testing is in progress.

thanks,

 - Joel

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH v4] rcu: Add a minimum time for marking boot as completed

On many systems, a great deal of boot (in userspace) happens after the
kernel thinks the boot has completed. It is difficult to determine if
the system has really booted from the kernel side. Some features like
lazy-RCU can risk slowing down boot time if, say, a callback has been
added that the boot synchronously depends on. Furthermore, expedited callbacks
can get unexpedited much earlier than they should be, thus slowing down
boot (as shown in the data below).

For these reasons, this commit adds a config option
'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.rcu_boot_end_delay.
Userspace can also mark the system as booted from RCU's perspective by writing
the time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
or even by just writing a value of 0 to this sysfs node.
However, under no circumstance will the boot be allowed to end earlier
than just before init is launched.

The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
suits ChromeOS and also a PREEMPT_RT system below very well, both of which need
no config or parameter changes, just a simple application of this patch. A
system designer can also choose a specific value here to keep RCU from marking
boot completion.  As noted earlier, the system will not be marked as booted
from RCU's perspective until at least rcu_boot_end_delay milliseconds have
passed or an update is made by writing a small value (or 0) in milliseconds to:
/sys/module/rcupdate/parameters/rcu_boot_end_delay.

One side effect of this patch is the risk that a real-time workload
launched just after the kernel boots will suffer interruptions due to expedited
RCU, which previously ended just before init was launched. However, to mitigate
such an issue (however unlikely), the user should either tune
CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay once userspace
boots and before launching the real-time workload.

Qiuxu also noted impressive boot-time improvements with an earlier version
of the patch. An excerpt from the data he shared:

1) Testing environment:
    OS        : CentOS Stream 8 (non-RT OS)
    Kernel    : v6.2
    Machine   : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
    Qemu args : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …

2) OS boot time definition:
    The time from the start of the kernel boot to the shell command line
    prompt is shown from the console. [ Different people may have
    different OS boot time definitions. ]

3) Measurement method (very rough method):
    A timer in the kernel periodically prints the boot time every 100ms.
    As soon as the shell command line prompt is shown from the console,
    we record the boot time printed by the timer, then the printed boot
    time is the OS boot time.

4) Measured OS boot time (in seconds)
   a) Measured 10 times w/o this patch:
        8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
        The average OS boot time was: ~8.7s

   b) Measured 10 times w/ this patch:
        8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
        The average OS boot time was: ~8.3s.

option-prefix PATCH v4
option-start
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

diff-note-start
v1->v2:
	Update some comments and description.
v2->v3:
        Add sysfs param, and update with Test data.
v3->v4:
        Fix locking bug found by Paul, make code more robust
        by refactoring locking code.
        Doc updates.
---
 .../admin-guide/kernel-parameters.txt         | 15 ++++
 cc_list                                       |  8 ++
 kernel/rcu/Kconfig                            | 21 ++++++
 kernel/rcu/update.c                           | 74 ++++++++++++++++++-
 4 files changed, 116 insertions(+), 2 deletions(-)
 create mode 100644 cc_list

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2429b5e3184b..878c2780f5db 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5085,6 +5085,21 @@
 	rcutorture.verbose= [KNL]
 			Enable additional printk() statements.
 
+	rcupdate.rcu_boot_end_delay= [KNL]
+			Minimum time in milliseconds from the start of boot
+			that must elapse before the boot sequence can be marked
+			complete from RCU's perspective, after which RCU's
+			behavior becomes more relaxed. The default value is also
+			configurable via CONFIG_RCU_BOOT_END_DELAY.
+			Userspace can also mark the boot as completed
+			sooner by writing the time in milliseconds, say once
+			userspace considers the system as booted, to:
+			/sys/module/rcupdate/parameters/rcu_boot_end_delay
+			Or even just writing a value of 0 to this sysfs node.
+			The sysfs node can also be used to extend the delay
+			to be larger than the default, assuming the marking
+			of boot complete has not yet occurred.
+
 	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
 			Dump ftrace buffer after reporting RCU CPU
 			stall warning.
diff --git a/cc_list b/cc_list
new file mode 100644
index 000000000000..7daed4877f5a
--- /dev/null
+++ b/cc_list
@@ -0,0 +1,8 @@
+Frederic Weisbecker <frederic@kernel.org>
+Joel Fernandes <joel@joelfernandes.org>
+Lai Jiangshan <jiangshanlai@gmail.com>
+linux-doc@vger.kernel.org
+linux-kernel@vger.kernel.org
+"Paul E. McKenney" <paulmck@kernel.org>
+rcu@vger.kernel.org
+urezki@gmail.com
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 9071182b1284..97f68120d1c0 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -217,6 +217,27 @@ config RCU_BOOST_DELAY
 
 	  Accept the default if unsure.
 
+config RCU_BOOT_END_DELAY
+	int "Minimum time before RCU may consider in-kernel boot as completed"
+	range 0 120000
+	default 15000
+	help
+	  Default value of the minimum time in milliseconds from the start of boot
+	  that must elapse before the boot sequence can be marked complete from RCU's
+	  perspective, after which RCU's behavior becomes more relaxed.
+	  Userspace can also mark the boot as completed sooner than this default
+	  by writing the time in milliseconds, say once userspace considers
+	  the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
+	  Or even just writing a value of 0 to this sysfs node. The sysfs node can
+	  also be used to extend the delay to be larger than the default, assuming
+	  the marking of boot completion has not yet occurred.
+
+	  The actual delay for RCU's view of the system to be marked as booted can be
+	  higher than this value if the kernel takes a long time to initialize but it
+	  will never be smaller than this value.
+
+	  Accept the default if unsure.
+
 config RCU_EXP_KTHREAD
 	bool "Perform RCU expedited work in a real-time kthread"
 	depends on RCU_BOOST && RCU_EXPERT
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..18ed3c15e6b5 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -224,13 +224,50 @@ void rcu_unexpedite_gp(void)
 }
 EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
 
+/*
+ * Minimum time in milliseconds from the start of boot until RCU can consider
+ * in-kernel boot as completed.  This can also be tuned at runtime to end the
+ * boot earlier, by userspace init code writing the time in milliseconds (even
+ * 0) to: /sys/module/rcupdate/parameters/rcu_boot_end_delay. The sysfs node
+ * can also be used to extend the delay to be larger than the default, assuming
+ * the marking of boot complete has not yet occurred.
+ */
+static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
+
 static bool rcu_boot_ended __read_mostly;
+static bool rcu_boot_end_called __read_mostly;
+static DEFINE_MUTEX(rcu_boot_end_lock);
 
 /*
- * Inform RCU of the end of the in-kernel boot sequence.
+ * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
+ * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
  */
-void rcu_end_inkernel_boot(void)
+void rcu_end_inkernel_boot(void);
+static void rcu_boot_end_work_fn(struct work_struct *work)
+{
+	rcu_end_inkernel_boot();
+}
+static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
+
+/* Must be called with rcu_boot_end_lock held. */
+static void rcu_end_inkernel_boot_locked(void)
 {
+	rcu_boot_end_called = true;
+
+	if (rcu_boot_ended)
+		return;
+
+	if (rcu_boot_end_delay) {
+		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
+
+		if (boot_ms < rcu_boot_end_delay) {
+			schedule_delayed_work(&rcu_boot_end_work,
+					rcu_boot_end_delay - boot_ms);
+			return;
+		}
+	}
+
+	cancel_delayed_work(&rcu_boot_end_work);
 	rcu_unexpedite_gp();
 	rcu_async_relax();
 	if (rcu_normal_after_boot)
@@ -238,6 +275,39 @@ void rcu_end_inkernel_boot(void)
 	rcu_boot_ended = true;
 }
 
+void rcu_end_inkernel_boot(void)
+{
+	mutex_lock(&rcu_boot_end_lock);
+	rcu_end_inkernel_boot_locked();
+	mutex_unlock(&rcu_boot_end_lock);
+}
+
+static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
+{
+	uint end_ms;
+	int ret = kstrtouint(val, 0, &end_ms);
+
+	if (ret)
+		return ret;
+	/*
+	 * rcu_end_inkernel_boot() should be called at least once during init
+	 * before we can allow param changes to end the boot.
+	 */
+	mutex_lock(&rcu_boot_end_lock);
+	rcu_boot_end_delay = end_ms;
+	if (!rcu_boot_ended && rcu_boot_end_called) {
+		rcu_end_inkernel_boot_locked();
+	}
+	mutex_unlock(&rcu_boot_end_lock);
+	return ret;
+}
+
+static const struct kernel_param_ops rcu_boot_end_ops = {
+	.set = param_set_rcu_boot_end,
+	.get = param_get_uint,
+};
+module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
+
 /*
  * Let rcutorture know when it is OK to turn it up to eleven.
  */
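
The documentation added by the patch above says userspace can end RCU's boot phase early by writing a value (even 0) to /sys/module/rcupdate/parameters/rcu_boot_end_delay. A minimal sketch of how an init program might do that from C follows. The helper name write_boot_end_delay() is hypothetical and not part of the patch; the sysfs path is the one named in the patch, and it only exists on a kernel carrying this change (writing to it normally requires root).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): write a delay in
 * milliseconds to a module-parameter file such as
 * /sys/module/rcupdate/parameters/rcu_boot_end_delay.
 * Returns 0 on success and -1 on any failure.
 */
static int write_boot_end_delay(const char *path, unsigned int delay_ms)
{
	FILE *f;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	/* Parameter files take a decimal string; a trailing newline is fine. */
	if (fprintf(f, "%u\n", delay_ms) < 0) {
		fclose(f);
		return -1;
	}

	/* fclose() flushes, so a rejected write can surface here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

An init system that considers the machine booted could call write_boot_end_delay("/sys/module/rcupdate/parameters/rcu_boot_end_delay", 0) just before launching a latency-sensitive workload. Per the patch, the write only takes effect once rcu_end_inkernel_boot() has been called at least once.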
  
Uladzislau Rezki March 5, 2023, 11:39 a.m. UTC | #3
On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> On many systems, a great deal of boot (in userspace) happens after the
> kernel thinks the boot has completed. It is difficult to determine if
> the system has really booted from the kernel side. Some features like
> lazy-RCU can risk slowing down boot time if, say, a callback has been
> added that the boot synchronously depends on. Further expedited callbacks
> can get unexpedited way earlier than it should be, thus slowing down
> boot (as shown in the data below).
> 
> For these reasons, this commit adds a config option
> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> Userspace can also make RCU's view of the system as booted, by writing the
> time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> Or even just writing a value of 0 to this sysfs node.
> However, under no circumstance will the boot be allowed to end earlier
> than just before init is launched.
> 
> The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> suites ChromeOS and also a PREEMPT_RT system below very well, which need
> no config or parameter changes, and just a simple application of this patch. A
> system designer can also choose a specific value here to keep RCU from marking
> boot completion.  As noted earlier, RCU's perspective of the system as booted
> will not be marker until at least rcu_boot_end_delay milliseconds have passed
> or an update is made via writing a small value (or 0) in milliseconds to:
> /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> 
> One side-effect of this patch is, there is a risk that a real-time workload
> launched just after the kernel boots will suffer interruptions due to expedited
> RCU, which previous ended just before init was launched. However, to mitigate
> such an issue (however unlikely), the user should either tune
> CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> boots, and before launching the real-time workload.
> 
> Qiuxu also noted impressive boot-time improvements with earlier version
> of patch. An excerpt from the data he shared:
> 
> 1) Testing environment:
>     OS            : CentOS Stream 8 (non-RT OS)
>     Kernel     : v6.2
>     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
>     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> 
> 2) OS boot time definition:
>     The time from the start of the kernel boot to the shell command line
>     prompt is shown from the console. [ Different people may have
>     different OS boot time definitions. ]
> 
> 3) Measurement method (very rough method):
>     A timer in the kernel periodically prints the boot time every 100ms.
>     As soon as the shell command line prompt is shown from the console,
>     we record the boot time printed by the timer, then the printed boot
>     time is the OS boot time.
> 
> 4) Measured OS boot time (in seconds)
>    a) Measured 10 times w/o this patch:
>         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
>         The average OS boot time was: ~8.7s
> 
>    b) Measure 10 times w/ this patch:
>         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
>         The average OS boot time was: ~8.3s.
> 
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> v1->v2:
> 	Update some comments and description.
> v2->v3:
>         Add sysfs param, and update with Test data.
> 
>  .../admin-guide/kernel-parameters.txt         | 12 ++++
>  cc_list                                       |  8 +++
>  kernel/rcu/Kconfig                            | 19 ++++++
>  kernel/rcu/update.c                           | 68 ++++++++++++++++++-
>  4 files changed, 106 insertions(+), 1 deletion(-)
>  create mode 100644 cc_list
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 2429b5e3184b..611de90d9c13 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5085,6 +5085,18 @@
>  	rcutorture.verbose= [KNL]
>  			Enable additional printk() statements.
>  
> +	rcupdate.rcu_boot_end_delay= [KNL]
> +			Minimum time in milliseconds that must elapse
> +			before the boot sequence can be marked complete
> +			from RCU's perspective, after which RCU's behavior
> +			becomes more relaxed. The default value is also
> +			configurable via CONFIG_RCU_BOOT_END_DELAY.
> +			Userspace can also mark the boot as completed
> +			sooner by writing the time in milliseconds, say once
> +			userspace considers the system as booted, to:
> +			/sys/module/rcupdate/parameters/rcu_boot_end_delay
> +			Or even just writing a value of 0 to this sysfs node.
> +
>  	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
>  			Dump ftrace buffer after reporting RCU CPU
>  			stall warning.
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 9071182b1284..4b5ffa36cbaf 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -217,6 +217,25 @@ config RCU_BOOST_DELAY
>  
>  	  Accept the default if unsure.
>  
> +config RCU_BOOT_END_DELAY
> +	int "Minimum time before RCU may consider in-kernel boot as completed"
> +	range 0 120000
> +	default 15000
> +	help
> +	  Default value of the minimum time in milliseconds that must elapse
> +	  before the boot sequence can be marked complete from RCU's perspective,
> +	  after which RCU's behavior becomes more relaxed.
> +	  Userspace can also mark the boot as completed sooner than this default
> +	  by writing the time in milliseconds, say once userspace considers
> +	  the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> +	  Or even just writing a value of 0 to this sysfs node.
> +
> +	  The actual delay for RCU's view of the system to be marked as booted can be
> +	  higher than this value if the kernel takes a long time to initialize but it
> +	  will never be smaller than this value.
> +
> +	  Accept the default if unsure.
> +
>  config RCU_EXP_KTHREAD
>  	bool "Perform RCU expedited work in a real-time kthread"
>  	depends on RCU_BOOST && RCU_EXPERT
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 19bf6fa3ee6a..93138c92136e 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -224,18 +224,84 @@ void rcu_unexpedite_gp(void)
>  }
>  EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
>  
> +/*
> + * Minimum time in milliseconds until RCU can consider in-kernel boot as
> + * completed.  This can also be tuned at runtime to end the boot earlier, by
> + * userspace init code writing the time in milliseconds (even 0) to:
> + * /sys/module/rcupdate/parameters/rcu_boot_end_delay
> + */
> +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
> +
>  static bool rcu_boot_ended __read_mostly;
> +static bool rcu_boot_end_called __read_mostly;
> +static DEFINE_MUTEX(rcu_boot_end_lock);
> +
> +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
> +{
> +	uint end_ms;
> +	int ret = kstrtouint(val, 0, &end_ms);
> +
> +	if (ret)
> +		return ret;
> +	WRITE_ONCE(*(uint *)kp->arg, end_ms);
> +
> +	/*
> +	 * rcu_end_inkernel_boot() should be called at least once during init
> +	 * before we can allow param changes to end the boot.
> +	 */
> +	mutex_lock(&rcu_boot_end_lock);
> +	rcu_boot_end_delay = end_ms;
> +	if (!rcu_boot_ended && rcu_boot_end_called) {
> +		mutex_unlock(&rcu_boot_end_lock);
> +		rcu_end_inkernel_boot();
> +	}
> +	mutex_unlock(&rcu_boot_end_lock);
> +	return ret;
> +}
> +
> +static const struct kernel_param_ops rcu_boot_end_ops = {
> +	.set = param_set_rcu_boot_end,
> +	.get = param_get_uint,
> +};
> +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
>  
>  /*
> - * Inform RCU of the end of the in-kernel boot sequence.
> + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
> + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
>   */
> +void rcu_end_inkernel_boot(void);
> +static void rcu_boot_end_work_fn(struct work_struct *work)
> +{
> +	rcu_end_inkernel_boot();
> +}
> +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
> +
>  void rcu_end_inkernel_boot(void)
>  {
> +	mutex_lock(&rcu_boot_end_lock);
> +	rcu_boot_end_called = true;
> +
> +	if (rcu_boot_ended)
> +		return;
> +
> +	if (rcu_boot_end_delay) {
> +		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
> +
> +		if (boot_ms < rcu_boot_end_delay) {
> +			schedule_delayed_work(&rcu_boot_end_work,
> +					rcu_boot_end_delay - boot_ms);
<snip>
urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 93138c92136e..93f426f0f4ec 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -289,7 +289,7 @@ void rcu_end_inkernel_boot(void)
 
                if (boot_ms < rcu_boot_end_delay) {
                        schedule_delayed_work(&rcu_boot_end_work,
-                                       rcu_boot_end_delay - boot_ms);
+                               msecs_to_jiffies(rcu_boot_end_delay - boot_ms));
                        mutex_unlock(&rcu_boot_end_lock);
                        return;
                }
urezki@pc638:~/data/raid0/coding/linux-rcu.git$
<snip>

I think you need to apply the above patch. I am not sure whether Paul
has already mentioned it, but just in case.

--
Uladzislau Rezki
  
Joel Fernandes March 5, 2023, 3:03 p.m. UTC | #4
> On Mar 5, 2023, at 6:39 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
>> On many systems, a great deal of boot (in userspace) happens after the
>> kernel thinks the boot has completed. It is difficult to determine if
>> the system has really booted from the kernel side. Some features like
>> lazy-RCU can risk slowing down boot time if, say, a callback has been
>> added that the boot synchronously depends on. Further expedited callbacks
>> can get unexpedited way earlier than it should be, thus slowing down
>> boot (as shown in the data below).
>> 
>> For these reasons, this commit adds a config option
>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
>> Userspace can also make RCU's view of the system as booted, by writing the
>> time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
>> Or even just writing a value of 0 to this sysfs node.
>> However, under no circumstance will the boot be allowed to end earlier
>> than just before init is launched.
>> 
>> The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
>> suites ChromeOS and also a PREEMPT_RT system below very well, which need
>> no config or parameter changes, and just a simple application of this patch. A
>> system designer can also choose a specific value here to keep RCU from marking
>> boot completion.  As noted earlier, RCU's perspective of the system as booted
>> will not be marker until at least rcu_boot_end_delay milliseconds have passed
>> or an update is made via writing a small value (or 0) in milliseconds to:
>> /sys/module/rcupdate/parameters/rcu_boot_end_delay.
>> 
>> One side-effect of this patch is, there is a risk that a real-time workload
>> launched just after the kernel boots will suffer interruptions due to expedited
>> RCU, which previous ended just before init was launched. However, to mitigate
>> such an issue (however unlikely), the user should either tune
>> CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
>> of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
>> boots, and before launching the real-time workload.
>> 
>> Qiuxu also noted impressive boot-time improvements with earlier version
>> of patch. An excerpt from the data he shared:
>> 
>> 1) Testing environment:
>>    OS            : CentOS Stream 8 (non-RT OS)
>>    Kernel     : v6.2
>>    Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
>>    Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
>> 
>> 2) OS boot time definition:
>>    The time from the start of the kernel boot to the shell command line
>>    prompt is shown from the console. [ Different people may have
>>    different OS boot time definitions. ]
>> 
>> 3) Measurement method (very rough method):
>>    A timer in the kernel periodically prints the boot time every 100ms.
>>    As soon as the shell command line prompt is shown from the console,
>>    we record the boot time printed by the timer, then the printed boot
>>    time is the OS boot time.
>> 
>> 4) Measured OS boot time (in seconds)
>>   a) Measured 10 times w/o this patch:
>>        8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
>>        The average OS boot time was: ~8.7s
>> 
>>   b) Measure 10 times w/ this patch:
>>        8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
>>        The average OS boot time was: ~8.3s.
>> 
>> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>> v1->v2:
>>    Update some comments and description.
>> v2->v3:
>>        Add sysfs param, and update with Test data.
>> 
>> .../admin-guide/kernel-parameters.txt         | 12 ++++
>> cc_list                                       |  8 +++
>> kernel/rcu/Kconfig                            | 19 ++++++
>> kernel/rcu/update.c                           | 68 ++++++++++++++++++-
>> 4 files changed, 106 insertions(+), 1 deletion(-)
>> create mode 100644 cc_list
>> 
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 2429b5e3184b..611de90d9c13 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -5085,6 +5085,18 @@
>>    rcutorture.verbose= [KNL]
>>            Enable additional printk() statements.
>> 
>> +    rcupdate.rcu_boot_end_delay= [KNL]
>> +            Minimum time in milliseconds that must elapse
>> +            before the boot sequence can be marked complete
>> +            from RCU's perspective, after which RCU's behavior
>> +            becomes more relaxed. The default value is also
>> +            configurable via CONFIG_RCU_BOOT_END_DELAY.
>> +            Userspace can also mark the boot as completed
>> +            sooner by writing the time in milliseconds, say once
>> +            userspace considers the system as booted, to:
>> +            /sys/module/rcupdate/parameters/rcu_boot_end_delay
>> +            Or even just writing a value of 0 to this sysfs node.
>> +
>>    rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
>>            Dump ftrace buffer after reporting RCU CPU
>>            stall warning.
>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>> index 9071182b1284..4b5ffa36cbaf 100644
>> --- a/kernel/rcu/Kconfig
>> +++ b/kernel/rcu/Kconfig
>> @@ -217,6 +217,25 @@ config RCU_BOOST_DELAY
>> 
>>      Accept the default if unsure.
>> 
>> +config RCU_BOOT_END_DELAY
>> +    int "Minimum time before RCU may consider in-kernel boot as completed"
>> +    range 0 120000
>> +    default 15000
>> +    help
>> +      Default value of the minimum time in milliseconds that must elapse
>> +      before the boot sequence can be marked complete from RCU's perspective,
>> +      after which RCU's behavior becomes more relaxed.
>> +      Userspace can also mark the boot as completed sooner than this default
>> +      by writing the time in milliseconds, say once userspace considers
>> +      the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
>> +      Or even just writing a value of 0 to this sysfs node.
>> +
>> +      The actual delay for RCU's view of the system to be marked as booted can be
>> +      higher than this value if the kernel takes a long time to initialize but it
>> +      will never be smaller than this value.
>> +
>> +      Accept the default if unsure.
>> +
>> config RCU_EXP_KTHREAD
>>    bool "Perform RCU expedited work in a real-time kthread"
>>    depends on RCU_BOOST && RCU_EXPERT
>> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
>> index 19bf6fa3ee6a..93138c92136e 100644
>> --- a/kernel/rcu/update.c
>> +++ b/kernel/rcu/update.c
>> @@ -224,18 +224,84 @@ void rcu_unexpedite_gp(void)
>> }
>> EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
>> 
>> +/*
>> + * Minimum time in milliseconds until RCU can consider in-kernel boot as
>> + * completed.  This can also be tuned at runtime to end the boot earlier, by
>> + * userspace init code writing the time in milliseconds (even 0) to:
>> + * /sys/module/rcupdate/parameters/rcu_boot_end_delay
>> + */
>> +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
>> +
>> static bool rcu_boot_ended __read_mostly;
>> +static bool rcu_boot_end_called __read_mostly;
>> +static DEFINE_MUTEX(rcu_boot_end_lock);
>> +
>> +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
>> +{
>> +    uint end_ms;
>> +    int ret = kstrtouint(val, 0, &end_ms);
>> +
>> +    if (ret)
>> +        return ret;
>> +    WRITE_ONCE(*(uint *)kp->arg, end_ms);
>> +
>> +    /*
>> +     * rcu_end_inkernel_boot() should be called at least once during init
>> +     * before we can allow param changes to end the boot.
>> +     */
>> +    mutex_lock(&rcu_boot_end_lock);
>> +    rcu_boot_end_delay = end_ms;
>> +    if (!rcu_boot_ended && rcu_boot_end_called) {
>> +        mutex_unlock(&rcu_boot_end_lock);
>> +        rcu_end_inkernel_boot();
>> +    }
>> +    mutex_unlock(&rcu_boot_end_lock);
>> +    return ret;
>> +}
>> +
>> +static const struct kernel_param_ops rcu_boot_end_ops = {
>> +    .set = param_set_rcu_boot_end,
>> +    .get = param_get_uint,
>> +};
>> +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
>> 
>> /*
>> - * Inform RCU of the end of the in-kernel boot sequence.
>> + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
>> + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
>>  */
>> +void rcu_end_inkernel_boot(void);
>> +static void rcu_boot_end_work_fn(struct work_struct *work)
>> +{
>> +    rcu_end_inkernel_boot();
>> +}
>> +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
>> +
>> void rcu_end_inkernel_boot(void)
>> {
>> +    mutex_lock(&rcu_boot_end_lock);
>> +    rcu_boot_end_called = true;
>> +
>> +    if (rcu_boot_ended)
>> +        return;
>> +
>> +    if (rcu_boot_end_delay) {
>> +        u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
>> +
>> +        if (boot_ms < rcu_boot_end_delay) {
>> +            schedule_delayed_work(&rcu_boot_end_work,
>> +                    rcu_boot_end_delay - boot_ms);
> <snip>
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 93138c92136e..93f426f0f4ec 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -289,7 +289,7 @@ void rcu_end_inkernel_boot(void)
> 
>                if (boot_ms < rcu_boot_end_delay) {
>                        schedule_delayed_work(&rcu_boot_end_work,
> -                                       rcu_boot_end_delay - boot_ms);
> +                               msecs_to_jiffies(rcu_boot_end_delay - boot_ms));
>                        mutex_unlock(&rcu_boot_end_lock);
>                        return;
>                }
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$
> <snip>
> 
> I think you need to apply the above patch. I am not sure whether Paul
> has already mentioned it, but just in case.

Ah, the reason my testing did not catch it is that, for HZ=1000, one
jiffy is one millisecond, so the unconverted value happened to be correct.

Great eyes and thank you Vlad, I’ll make the fix and repost it.

 - Joel
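
The unit mismatch discussed in this exchange can be illustrated with a small userspace model. This is only a sketch: the model_* helpers and remaining_delay_ms() are hypothetical stand-ins written for this illustration, not the kernel's msecs_to_jiffies() implementation, and the HZ values are passed in as plain parameters.

```c
#include <assert.h>

/*
 * Simplified model of a milliseconds-to-jiffies conversion: round up so
 * the resulting timeout is never shorter than requested. (Assumption:
 * the real kernel helper also deals with overflow and other corner
 * cases, which this sketch ignores.)
 */
static unsigned long long model_msecs_to_jiffies(unsigned int ms,
						 unsigned int hz)
{
	assert(hz > 0);
	return ((unsigned long long)ms * hz + 999) / 1000;
}

/* How long a timeout of 'jiffies' ticks lasts, in milliseconds. */
static unsigned long long model_jiffies_to_msecs(unsigned long long jiffies,
						 unsigned int hz)
{
	assert(hz > 0);
	return jiffies * 1000 / hz;
}

/*
 * Mirrors the check in the patch: time still to wait before boot may be
 * marked ended, or 0 if the minimum delay has already elapsed.
 */
static unsigned int remaining_delay_ms(unsigned int delay_ms,
				       unsigned int boot_ms)
{
	return boot_ms < delay_ms ? delay_ms - boot_ms : 0;
}
```

With a 15000 ms delay and 2000 ms elapsed since boot, remaining_delay_ms() gives 13000. Passing 13000 directly as a jiffies count waits 13000 ms at HZ=1000 but 130000 ms at HZ=100, whereas model_msecs_to_jiffies(13000, 100) yields 1300 ticks, which is the intended 13000 ms.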

> 
> --
> Uladzislau Rezki
  
Paul E. McKenney March 5, 2023, 8:34 p.m. UTC | #5
On Sun, Mar 05, 2023 at 12:39:01PM +0100, Uladzislau Rezki wrote:
> On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > On many systems, a great deal of boot (in userspace) happens after the
> > kernel thinks the boot has completed. It is difficult to determine if
> > the system has really booted from the kernel side. Some features like
> > lazy-RCU can risk slowing down boot time if, say, a callback has been
> > added that the boot synchronously depends on. Further expedited callbacks
> > can get unexpedited way earlier than it should be, thus slowing down
> > boot (as shown in the data below).
> > 
> > For these reasons, this commit adds a config option
> > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> > Userspace can also make RCU's view of the system as booted, by writing the
> > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > Or even just writing a value of 0 to this sysfs node.
> > However, under no circumstance will the boot be allowed to end earlier
> > than just before init is launched.
> > 
> > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> > suites ChromeOS and also a PREEMPT_RT system below very well, which need
> > no config or parameter changes, and just a simple application of this patch. A
> > system designer can also choose a specific value here to keep RCU from marking
> > boot completion.  As noted earlier, RCU's perspective of the system as booted
> > will not be marker until at least rcu_boot_end_delay milliseconds have passed
> > or an update is made via writing a small value (or 0) in milliseconds to:
> > /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > 
> > One side-effect of this patch is, there is a risk that a real-time workload
> > launched just after the kernel boots will suffer interruptions due to expedited
> > RCU, which previous ended just before init was launched. However, to mitigate
> > such an issue (however unlikely), the user should either tune
> > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> > boots, and before launching the real-time workload.
> > 
> > Qiuxu also noted impressive boot-time improvements with earlier version
> > of patch. An excerpt from the data he shared:
> > 
> > 1) Testing environment:
> >     OS            : CentOS Stream 8 (non-RT OS)
> >     Kernel     : v6.2
> >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > 
> > 2) OS boot time definition:
> >     The time from the start of the kernel boot to the shell command line
> >     prompt is shown from the console. [ Different people may have
> >     different OS boot time definitions. ]
> > 
> > 3) Measurement method (very rough method):
> >     A timer in the kernel periodically prints the boot time every 100ms.
> >     As soon as the shell command line prompt is shown from the console,
> >     we record the boot time printed by the timer, then the printed boot
> >     time is the OS boot time.
> > 
> > 4) Measured OS boot time (in seconds)
> >    a) Measured 10 times w/o this patch:
> >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> >         The average OS boot time was: ~8.7s
> > 
> >    b) Measure 10 times w/ this patch:
> >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> >         The average OS boot time was: ~8.3s.
> > 
> > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> > v1->v2:
> > 	Update some comments and description.
> > v2->v3:
> >         Add sysfs param, and update with Test data.
> > 
> >  .../admin-guide/kernel-parameters.txt         | 12 ++++
> >  cc_list                                       |  8 +++
> >  kernel/rcu/Kconfig                            | 19 ++++++
> >  kernel/rcu/update.c                           | 68 ++++++++++++++++++-
> >  4 files changed, 106 insertions(+), 1 deletion(-)
> >  create mode 100644 cc_list
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 2429b5e3184b..611de90d9c13 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5085,6 +5085,18 @@
> >  	rcutorture.verbose= [KNL]
> >  			Enable additional printk() statements.
> >  
> > +	rcupdate.rcu_boot_end_delay= [KNL]
> > +			Minimum time in milliseconds that must elapse
> > +			before the boot sequence can be marked complete
> > +			from RCU's perspective, after which RCU's behavior
> > +			becomes more relaxed. The default value is also
> > +			configurable via CONFIG_RCU_BOOT_END_DELAY.
> > +			Userspace can also mark the boot as completed
> > +			sooner by writing the time in milliseconds, say once
> > +			userspace considers the system as booted, to:
> > +			/sys/module/rcupdate/parameters/rcu_boot_end_delay
> > +			Or even just writing a value of 0 to this sysfs node.
> > +
> >  	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
> >  			Dump ftrace buffer after reporting RCU CPU
> >  			stall warning.
> > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > index 9071182b1284..4b5ffa36cbaf 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -217,6 +217,25 @@ config RCU_BOOST_DELAY
> >  
> >  	  Accept the default if unsure.
> >  
> > +config RCU_BOOT_END_DELAY
> > +	int "Minimum time before RCU may consider in-kernel boot as completed"
> > +	range 0 120000
> > +	default 15000
> > +	help
> > +	  Default value of the minimum time in milliseconds that must elapse
> > +	  before the boot sequence can be marked complete from RCU's perspective,
> > +	  after which RCU's behavior becomes more relaxed.
> > +	  Userspace can also mark the boot as completed sooner than this default
> > +	  by writing the time in milliseconds, say once userspace considers
> > +	  the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > +	  Or even just writing a value of 0 to this sysfs node.
> > +
> > +	  The actual delay for RCU's view of the system to be marked as booted can be
> > +	  higher than this value if the kernel takes a long time to initialize but it
> > +	  will never be smaller than this value.
> > +
> > +	  Accept the default if unsure.
> > +
> >  config RCU_EXP_KTHREAD
> >  	bool "Perform RCU expedited work in a real-time kthread"
> >  	depends on RCU_BOOST && RCU_EXPERT
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index 19bf6fa3ee6a..93138c92136e 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -224,18 +224,84 @@ void rcu_unexpedite_gp(void)
> >  }
> >  EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
> >  
> > +/*
> > + * Minimum time in milliseconds until RCU can consider in-kernel boot as
> > + * completed.  This can also be tuned at runtime to end the boot earlier, by
> > + * userspace init code writing the time in milliseconds (even 0) to:
> > + * /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > + */
> > +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
> > +
> >  static bool rcu_boot_ended __read_mostly;
> > +static bool rcu_boot_end_called __read_mostly;
> > +static DEFINE_MUTEX(rcu_boot_end_lock);
> > +
> > +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
> > +{
> > +	uint end_ms;
> > +	int ret = kstrtouint(val, 0, &end_ms);
> > +
> > +	if (ret)
> > +		return ret;
> > +	WRITE_ONCE(*(uint *)kp->arg, end_ms);
> > +
> > +	/*
> > +	 * rcu_end_inkernel_boot() should be called at least once during init
> > +	 * before we can allow param changes to end the boot.
> > +	 */
> > +	mutex_lock(&rcu_boot_end_lock);
> > +	rcu_boot_end_delay = end_ms;
> > +	if (!rcu_boot_ended && rcu_boot_end_called) {
> > +		mutex_unlock(&rcu_boot_end_lock);
> > +		rcu_end_inkernel_boot();
> > +	}
> > +	mutex_unlock(&rcu_boot_end_lock);
> > +	return ret;
> > +}
> > +
> > +static const struct kernel_param_ops rcu_boot_end_ops = {
> > +	.set = param_set_rcu_boot_end,
> > +	.get = param_get_uint,
> > +};
> > +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
> >  
> >  /*
> > - * Inform RCU of the end of the in-kernel boot sequence.
> > + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
> > + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
> >   */
> > +void rcu_end_inkernel_boot(void);
> > +static void rcu_boot_end_work_fn(struct work_struct *work)
> > +{
> > +	rcu_end_inkernel_boot();
> > +}
> > +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
> > +
> >  void rcu_end_inkernel_boot(void)
> >  {
> > +	mutex_lock(&rcu_boot_end_lock);
> > +	rcu_boot_end_called = true;
> > +
> > +	if (rcu_boot_ended)
> > +		return;
> > +
> > +	if (rcu_boot_end_delay) {
> > +		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
> > +
> > +		if (boot_ms < rcu_boot_end_delay) {
> > +			schedule_delayed_work(&rcu_boot_end_work,
> > +					rcu_boot_end_delay - boot_ms);
> <snip>
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$ git diff
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 93138c92136e..93f426f0f4ec 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -289,7 +289,7 @@ void rcu_end_inkernel_boot(void)
>  
>                 if (boot_ms < rcu_boot_end_delay) {
>                         schedule_delayed_work(&rcu_boot_end_work,
> -                                       rcu_boot_end_delay - boot_ms);
> +                               msecs_to_jiffies(rcu_boot_end_delay - boot_ms));
>                         mutex_unlock(&rcu_boot_end_lock);
>                         return;
>                 }
> urezki@pc638:~/data/raid0/coding/linux-rcu.git$
> <snip>
> 
> I think you need to apply above patch. I am not sure maybe Paul
> has already mentioned about it. But just in case.

No, I did miss that one, so thank you very much for spotting it!

							Thanx, Paul
  
Qiuxu Zhuo March 6, 2023, 8:24 a.m. UTC | #6
> From: Paul E. McKenney <paulmck@kernel.org>
> [...]
> > Qiuxu also noted impressive boot-time improvements with earlier
> > version of patch. An excerpt from the data he shared:
> >
> > 1) Testing environment:
> >     OS            : CentOS Stream 8 (non-RT OS)
> >     Kernel     : v6.2
> >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> >
> > 2) OS boot time definition:
> >     The time from the start of the kernel boot to the shell command line
> >     prompt is shown from the console. [ Different people may have
> >     different OS boot time definitions. ]
> >
> > 3) Measurement method (very rough method):
> >     A timer in the kernel periodically prints the boot time every 100ms.
> >     As soon as the shell command line prompt is shown from the console,
> >     we record the boot time printed by the timer, then the printed boot
> >     time is the OS boot time.
> >
> > 4) Measured OS boot time (in seconds)
> >    a) Measured 10 times w/o this patch:
> >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> >         The average OS boot time was: ~8.7s
> >
> >    b) Measure 10 times w/ this patch:
> >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> >         The average OS boot time was: ~8.3s.
> 
> Unfortunately, given that a's average is within one standard deviation of b's
> average, this is most definitely not statistically significant.
> Especially given only ten measurements for each case -- you need *at*
> *least* 24, preferably more.  Especially in this case, where you don't really
> know what the underlying distribution is.

Thank you so much Paul for the detailed comments on the measured data.

I'm curious how did you figure out the number 24 that we at *least* need.
This can guide me on whether the number of samples is enough for 
future testing ;-).

I did another 48 measurements (2x of 24) for each case 
(w/o and w/ Joel's v2 patch) as below. 
All the testing configurations for the new testing
are the same as before.

a) Measured 48 times w/o v2 patch (in seconds):
    8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4,
    8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8,
    8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
    8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6,
    8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0,
    9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7
    The average OS boot time was: ~9.0s

b) Measure 48 times w/ v2 patch (in seconds):
    7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2,
    9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2,
    8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
    8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0,
    8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8,
    8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3,
    The average OS boot time was: ~8.5s

@Joel Fernandes (Google), you may replace my old data with the above 
new data in your commit message.

> But we can apply the binomial distribution instead of the usual normal
> distribution.  First, let's sort and take the medians:
> 
> a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3  Median: 8.7
> b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3  Median: 8.2
> 
> 8/10 of a's data points are greater than 0.1 more than b's median and 8/10
> of b's data points are less than 0.1 less than a's median.
> What are the odds that this happens by random chance?
> 
> This is given by sum_0^2 (0.5^10 * binomial(10,i)), which is about 0.055.

What's the meaning of 0.5 here? Was it the probability (we assume?) that 
each time b's data point failed (or didn't satisfy) "less than 0.1 less than 
a's median"?

> This is not quite 95% confidence, so not hugely convincing, but it is at least
> close.  Not that this is the confidence that (b) is 100ms faster than (a), not
> just that (b) is faster than (a).
> 
> Not sure that this really carries its weight, but in contrast to the usual
> statistics based on the normal distribution, it does suggest at least a little
> improvement.  On the other hand, anyone who has carefully studied
> nonparametric statistics probably jumped out of the boat several paragraphs
> ago.  ;-)
> 
> A few more questions interspersed below.
> 
> 							Thanx, Paul
> 
> > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
  
Qiuxu Zhuo March 6, 2023, 8:37 a.m. UTC | #7
> From: Joel Fernandes <joel@joelfernandes.org>
> [...]
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -289,7 +289,7 @@ void rcu_end_inkernel_boot(void)
> >
> >                if (boot_ms < rcu_boot_end_delay) {
> >                        schedule_delayed_work(&rcu_boot_end_work,
> > -                                       rcu_boot_end_delay - boot_ms);
> > +                               msecs_to_jiffies(rcu_boot_end_delay - boot_ms));
> >                        mutex_unlock(&rcu_boot_end_lock);
> >                        return;
> >                }
> > urezki@pc638:~/data/raid0/coding/linux-rcu.git$
> > <snip>
> >
> > I think you need to apply above patch. I am not sure maybe Paul has
> > already mentioned about it. But just in case.
> 
> Ah, the reason my testing did not catch it is because for HZ=1000, msecs and
> jiffies are the same.

   So was my system :-)
   
       CONFIG_HZ_1000=y
       CONFIG_HZ=1000
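
   To illustrate why HZ=1000 hides the missing conversion, here is a rough
   user-space sketch of the arithmetic. msecs_to_jiffies_sketch() is a
   simplified stand-in for the kernel helper (round-up only, no overflow
   handling), and the 13000 ms remaining delay is a made-up value.

```python
def msecs_to_jiffies_sketch(ms: int, hz: int) -> int:
    """Simplified stand-in: milliseconds to timer ticks, rounded up."""
    return (ms * hz + 999) // 1000


# Hypothetical case: 15 s boot-end delay with 2 s of boot already elapsed.
remaining_ms = 15000 - 2000

for hz in (100, 250, 1000):
    correct_ticks = msecs_to_jiffies_sketch(remaining_ms, hz)
    # Without the conversion, the millisecond count is treated as a tick
    # count, so the work is delayed by remaining_ms ticks instead.
    unconverted_wait_ms = remaining_ms * 1000 // hz
    print(f"HZ={hz}: correct={correct_ticks} ticks, "
          f"unconverted delay={unconverted_wait_ms} ms")
```

   Only for HZ=1000 does the unconverted delay equal the intended 13000 ms;
   for lower HZ values the delayed work would fire much later than intended.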

-Qiuxu

> Great eyes and thank you Vlad, I’ll make the fix and repost it.
> 
>  - Joel
> 
> >
> > --
> > Uladzislau Rezki
  
Paul E. McKenney March 6, 2023, 2:49 p.m. UTC | #8
On Mon, Mar 06, 2023 at 08:24:44AM +0000, Zhuo, Qiuxu wrote:
> > From: Paul E. McKenney <paulmck@kernel.org>
> > [...]
> > > Qiuxu also noted impressive boot-time improvements with earlier
> > > version of patch. An excerpt from the data he shared:
> > >
> > > 1) Testing environment:
> > >     OS            : CentOS Stream 8 (non-RT OS)
> > >     Kernel     : v6.2
> > >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> > >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > >
> > > 2) OS boot time definition:
> > >     The time from the start of the kernel boot to the shell command line
> > >     prompt is shown from the console. [ Different people may have
> > >     different OS boot time definitions. ]
> > >
> > > 3) Measurement method (very rough method):
> > >     A timer in the kernel periodically prints the boot time every 100ms.
> > >     As soon as the shell command line prompt is shown from the console,
> > >     we record the boot time printed by the timer, then the printed boot
> > >     time is the OS boot time.
> > >
> > > 4) Measured OS boot time (in seconds)
> > >    a) Measured 10 times w/o this patch:
> > >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> > >         The average OS boot time was: ~8.7s
> > >
> > >    b) Measure 10 times w/ this patch:
> > >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> > >         The average OS boot time was: ~8.3s.
> > 
> > Unfortunately, given that a's average is within one standard deviation of b's
> > average, this is most definitely not statistically significant.
> > Especially given only ten measurements for each case -- you need *at*
> > *least* 24, preferably more.  Especially in this case, where you don't really
> > know what the underlying distribution is.
> 
> Thank you so much Paul for the detailed comments on the measured data.
> 
> I'm curious how did you figure out the number 24 that we at *least* need.
> This can guide me on whether the number of samples is enough for 
> future testing ;-).

It is a rough rule of thumb.  For more details and accuracy, study up
on the Student's t-test and related statistical tests.

Of course, this all assumes that the data fits a normal distribution.

> I did another 48 measurements (2x of 24) for each case 
> (w/o and w/ Joel's v2 patch) as below. 
> All the testing configurations for the new testing
> are the same as before.
> 
> a) Measured 48 times w/o v2 patch (in seconds):
>     8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4,
>     8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8,
>     8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
>     8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6,
>     8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0,
>     9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7
>     The average OS boot time was: ~9.0s

The range is 8.2 through 9.8.

> b) Measure 48 times w/ v2 patch (in seconds):
>     7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2,
>     9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2,
>     8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
>     8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0,
>     8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8,
>     8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3,
>     The average OS boot time was: ~8.5s

The range is 7.7 through 9.8.

There is again significant overlap, so it is again unclear that you have
a statistically significant difference.  So could you please calculate
the standard deviations?

> @Joel Fernandes (Google), you may replace my old data with the above 
> new data in your commit message.
> 
> > But we can apply the binomial distribution instead of the usual normal
> > distribution.  First, let's sort and take the medians:
> > 
> > a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3  Median: 8.7
> > b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3  Median: 8.2
> > 
> > 8/10 of a's data points are greater than 0.1 more than b's median and 8/10
> > of b's data points are less than 0.1 less than a's median.
> > What are the odds that this happens by random chance?
> > 
> > This is given by sum_0^2 (0.5^10 * binomial(10,i)), which is about 0.055.
> 
> What's the meaning of 0.5 here? Was it the probability (we assume?) that 
> each time b's data point failed (or didn't satisfy) "less than 0.1 less than 
> a's median"?

The meaning of 0.5 is the probability of a given data point being on one
side or the other of the corresponding distribution's median.  This of
course assumes that the median of the measured data matches that of the
corresponding distribution, though the fact that the median is also a
mode of both of the old data sets gives some hope.

The meaning of the 0.1 is the smallest difference that the data could
measure.  I could have instead chosen 0.0 and asked if there was likely
some (perhaps tiny) difference, but instead, I chose to ask if there
was likely some small but meaningful difference.  It is better to choose
the desired difference before measuring the data.

Why don't you try applying this approach to the new data?  You will need
the general binomial formula.
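
For reference, the sum quoted above for the old data can be evaluated with
a few lines of Python. This is only a sketch: the helper name binomial_cdf
is made up for this example, and p = 0.5 is the per-point probability of
landing on one side of the median, as described above.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float = 0.5) -> float:
    """P(X <= k) for X ~ Binomial(n, p), using the general binomial formula."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


# Old data: at most 2 of the 10 points fall on the "wrong" side.
tail = binomial_cdf(2, 10)
print(f"{tail:.4f}")  # (1 + 10 + 45) / 1024 = 0.0546875, printed as 0.0547
```

With p other than 0.5 the same helper covers the general case.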

							Thanx, Paul

> > This is not quite 95% confidence, so not hugely convincing, but it is at least
> > close.  Not that this is the confidence that (b) is 100ms faster than (a), not
> > just that (b) is faster than (a).
> > 
> > Not sure that this really carries its weight, but in contrast to the usual
> > statistics based on the normal distribution, it does suggest at least a little
> > improvement.  On the other hand, anyone who has carefully studied
> > nonparametric statistics probably jumped out of the boat several paragraphs
> > ago.  ;-)
> > 
> > A few more questions interspersed below.
> > 
> > 							Thanx, Paul
> > 
> > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
  
Qiuxu Zhuo March 7, 2023, 7:49 a.m. UTC | #9
> From: Paul E. McKenney <paulmck@kernel.org>
> [...]
> >
> > Thank you so much Paul for the detailed comments on the measured data.
> >
> > I'm curious how did you figure out the number 24 that we at *least* need.
> > This can guide me on whether the number of samples is enough for
> > future testing ;-).
> 
> It is a rough rule of thumb.  For more details and accuracy, study up on the
> Student's t-test and related statistical tests.
> 
> Of course, this all assumes that the data fits a normal distribution.

Thanks for this extra information. Good to know the Student's t-test.

> > I did another 48 measurements (2x of 24) for each case (w/o and w/
> > Joel's v2 patch) as below.
> > All the testing configurations for the new testing are the same as
> > before.
> >
> > a) Measured 48 times w/o v2 patch (in seconds):
> >     8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4,
> >     8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8,
> >     8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
> >     8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6,
> >     8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0,
> >     9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7
> >     The average OS boot time was: ~9.0s
> 
> The range is 8.2 through 9.8.
> 
> > b) Measure 48 times w/ v2 patch (in seconds):
> >     7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2,
> >     9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2,
> >     8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
> >     8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0,
> >     8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8,
> >     8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3,
> >     The average OS boot time was: ~8.5s
> 
> The range is 7.7 through 9.8.
> 
> There is again significant overlap, so it is again unclear that you have a
> statistically significant difference.  So could you please calculate the standard
> deviations?

a's standard deviation is ~0.4.
b's standard deviation is ~0.5.

a's average of 9.0 sits at the upper bound of b's one-standard-deviation
interval [8.0, 9.0]. So the difference should be statistically significant
to some degree.

The standard deviations were calculated with:
https://www.gigacalculator.com/calculators/standard-deviation-calculator.php
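
The same numbers can be reproduced locally. The sketch below uses only the
Python standard library on the two 48-sample data sets above. The
welch_t() helper is an addition for illustration (Welch's unequal-variance
form of the t statistic), not something used earlier in this thread, and it
assumes the data are roughly normally distributed.

```python
from math import sqrt
from statistics import mean, stdev

# Boot times in seconds, copied from the 48-sample runs above.
a = [8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4, 8.7, 9.2, 8.3, 9.4,
     8.4, 9.6, 8.5, 8.8, 8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
     8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6, 8.2, 9.1, 8.8, 9.2,
     9.1, 8.9, 8.4, 9.0, 9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7]
b = [7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2, 9.8, 8.0, 9.2, 8.8,
     9.2, 8.5, 8.4, 9.2, 8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
     8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0, 8.8, 8.8, 9.1, 7.9,
     9.7, 7.9, 8.2, 7.8, 8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3]


def welch_t(x, y):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_of_mean_x = stdev(x) ** 2 / len(x)
    var_of_mean_y = stdev(y) ** 2 / len(y)
    return (mean(x) - mean(y)) / sqrt(var_of_mean_x + var_of_mean_y)


print(f"a: mean={mean(a):.2f} stdev={stdev(a):.2f}")
print(f"b: mean={mean(b):.2f} stdev={stdev(b):.2f}")
print(f"Welch t statistic: {welch_t(a, b):.2f}")
```

The t statistic would then be compared against a t distribution with the
appropriate degrees of freedom to get a p-value.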

> > @Joel Fernandes (Google), you may replace my old data with the above
> > new data in your commit message.
> >
> > > But we can apply the binomial distribution instead of the usual
> > > normal distribution.  First, let's sort and take the medians:
> > >
> > > a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3  Median: 8.7
> > > b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3  Median: 8.2
> > >
> > > 8/10 of a's data points are greater than 0.1 more than b's median
> > > and 8/10 of b's data points are less than 0.1 less than a's median.
> > > What are the odds that this happens by random chance?
> > >
> > > This is given by sum_0^2 (0.5^10 * binomial(10,i)), which is about 0.055.
> >
> > What's the meaning of 0.5 here? Was it the probability (we assume?)
> > that each time b's data point failed (or didn't satisfy) "less than
> > 0.1 less than a's median"?
> 
> The meaning of 0.5 is the probability of a given data point being on one side
> or the other of the corresponding distribution's median.  This of course
> assumes that the median of the measured data matches that of the
> corresponding distribution, though the fact that the median is also a mode of
> both of the old data sets gives some hope.

  Thanks for the detailed comments on the meaning of 0.5 here. :-)

> The meaning of the 0.1 is the smallest difference that the data could measure.
> I could have instead chosen 0.0 and asked if there was likely some (perhaps
> tiny) difference, but instead, I chose to ask if there was likely some small but
> meaningful difference.  It is better to choose the desired difference before
> measuring the data.

  Thanks for the detailed comments on the meaning of 0.1 here. :-)

> Why don't you try applying this approach to the new data?  You will need the
> general binomial formula.

   Thank you Paul for the suggestion.
   I just tried it, but I am not sure whether my analysis is correct ...

   Analysis 1:
   a's median is 8.9.
   35/48 of b's data points are less than 0.1 less than a's median.
   Under a binomial distribution with p=0.5, P(X >= 35) = 0.1%.
   So, we have strong confidence that b is 100ms faster than a.

   Analysis 2:
   a's median - 0.4 = 8.9 - 0.4 = 8.5.
   24/48 of b's data points are less than 0.4 less than a's median.
   The probability that one of a's data points is less than 8.5 is
   p = 7/48 = 0.1458.
   Under a binomial distribution with p=0.1458, P(X >= 24) = 0.0%.
   So, it looks like we have confidence that b is 400ms faster than a.

   The cumulative binomial probabilities P(X) were calculated with:
   https://www.gigacalculator.com/calculators/binomial-probability-calculator.php
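
   Analysis 1 can also be checked with a short script. This is only a
   sketch: binomial_sf() is a made-up helper name, a's median of 8.9 is
   taken as stated above, and the times are compared in tenths of a second
   to avoid floating-point rounding surprises.

```python
from math import comb


def binomial_sf(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# Boot times with the v2 patch, in seconds (the 48-sample run above).
b = [7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2, 9.8, 8.0, 9.2, 8.8,
     9.2, 8.5, 8.4, 9.2, 8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
     8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0, 8.8, 8.8, 9.1, 7.9,
     9.7, 7.9, 8.2, 7.8, 8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3]

# a's median (8.9 s) minus 0.1 s, expressed in tenths of a second.
threshold_ds = 89 - 1

# Count b's points that are more than 0.1 s below a's median.
hits = sum(1 for x in b if round(x * 10) < threshold_ds)
print(f"{hits}/{len(b)} points below the threshold")
print(f"P(X >= {hits}) = {binomial_sf(hits, len(b)):.4%}")
```

   Changing threshold_ds and p gives the corresponding check for Analysis 2.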

   I apologize if this analysis/discussion bored some of you. ;-)

-Qiuxu

> [...]
  
Frederic Weisbecker March 7, 2023, 1:01 p.m. UTC | #10
On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> On many systems, a great deal of boot (in userspace) happens after the
> kernel thinks the boot has completed. It is difficult to determine if
> the system has really booted from the kernel side. Some features like
> lazy-RCU can risk slowing down boot time if, say, a callback has been
> added that the boot synchronously depends on. Further expedited callbacks
> can get unexpedited way earlier than it should be, thus slowing down
> boot (as shown in the data below).
> 
> For these reasons, this commit adds a config option
> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> Userspace can also make RCU's view of the system as booted, by writing the
> time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> Or even just writing a value of 0 to this sysfs node.
> However, under no circumstance will the boot be allowed to end earlier
> than just before init is launched.
> 
> The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> suites ChromeOS and also a PREEMPT_RT system below very well, which need
> no config or parameter changes, and just a simple application of this patch. A
> system designer can also choose a specific value here to keep RCU from marking
> boot completion.  As noted earlier, RCU's perspective of the system as booted
> will not be marker until at least rcu_boot_end_delay milliseconds have passed
> or an update is made via writing a small value (or 0) in milliseconds to:
> /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> 
> One side-effect of this patch is, there is a risk that a real-time workload
> launched just after the kernel boots will suffer interruptions due to expedited
> RCU, which previous ended just before init was launched. However, to mitigate
> such an issue (however unlikely), the user should either tune
> CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> boots, and before launching the real-time workload.
> 
> Qiuxu also noted impressive boot-time improvements with earlier version
> of patch. An excerpt from the data he shared:
> 
> 1) Testing environment:
>     OS            : CentOS Stream 8 (non-RT OS)
>     Kernel     : v6.2
>     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
>     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> 
> 2) OS boot time definition:
>     The time from the start of the kernel boot to the shell command line
>     prompt is shown from the console. [ Different people may have
>     different OS boot time definitions. ]
> 
> 3) Measurement method (very rough method):
>     A timer in the kernel periodically prints the boot time every 100ms.
>     As soon as the shell command line prompt is shown from the console,
>     we record the boot time printed by the timer, then the printed boot
>     time is the OS boot time.
> 
> 4) Measured OS boot time (in seconds)
>    a) Measured 10 times w/o this patch:
>         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
>         The average OS boot time was: ~8.7s
> 
>    b) Measure 10 times w/ this patch:
>         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
>         The average OS boot time was: ~8.3s.
> 
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

I still don't really like that:

1) It feels like we are curing a symptom for which we don't know the cause.
   Which RCU write side caller is the source of this slow boot? Some tracepoints
   reporting the wait duration within synchronize_rcu() calls between the end of
   the kernel boot and the end of userspace boot may be helpful.
   
2) The kernel boot was already covered before this patch so this is about
   userspace code calling into the kernel. Is that piece of code also called
   after the boot? In that case are we missing a conversion from
   synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
   the problem is more general than just boot.

This needs to be analyzed first and if it happens that the issue really
needs to be fixed with telling the kernel that userspace has completed
booting, eg: because the problem is not in a few callsites that need conversion
to expedited but instead in the accumulation of lots of calls that should stay
as is:

3) This arbitrary timeout looks dangerous to me as latency sensitive code
   may run right after the boot. Either you choose a value that is too low
   and you miss the optimization or the value is too high and you may break
   things.

4) This should be fixed the way you did:
   a) a kernel parameter like you did
   b) The init process (systemd?) tells the kernel when it judges that userspace
      has completed booting.
   c) Make these interfaces more generic, maybe that information will be useful
      outside RCU. For example the kernel parameter should be
      "user_booted_reported" and the sysfs (should be sysctl?):
      kernel.user_booted = 1
   d) But yuck, this means we must know if the init process supports that...

For these reasons, let's make sure we know exactly what is going on first.

Thanks.
  
Uladzislau Rezki March 7, 2023, 1:40 p.m. UTC | #11
On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > On many systems, a great deal of boot (in userspace) happens after the
> > kernel thinks the boot has completed. It is difficult to determine if
> > the system has really booted from the kernel side. Some features like
> > lazy-RCU can risk slowing down boot time if, say, a callback has been
> > added that the boot synchronously depends on. Further expedited callbacks
> > can get unexpedited way earlier than it should be, thus slowing down
> > boot (as shown in the data below).
> > 
> > For these reasons, this commit adds a config option
> > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> > Userspace can also make RCU's view of the system as booted, by writing the
> > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > Or even just writing a value of 0 to this sysfs node.
> > However, under no circumstance will the boot be allowed to end earlier
> > than just before init is launched.
> > 
> > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> > suites ChromeOS and also a PREEMPT_RT system below very well, which need
> > no config or parameter changes, and just a simple application of this patch. A
> > system designer can also choose a specific value here to keep RCU from marking
> > boot completion.  As noted earlier, RCU's perspective of the system as booted
> > will not be marker until at least rcu_boot_end_delay milliseconds have passed
> > or an update is made via writing a small value (or 0) in milliseconds to:
> > /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > 
> > One side-effect of this patch is, there is a risk that a real-time workload
> > launched just after the kernel boots will suffer interruptions due to expedited
> > RCU, which previous ended just before init was launched. However, to mitigate
> > such an issue (however unlikely), the user should either tune
> > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> > boots, and before launching the real-time workload.
> > 
> > Qiuxu also noted impressive boot-time improvements with earlier version
> > of patch. An excerpt from the data he shared:
> > 
> > 1) Testing environment:
> >     OS            : CentOS Stream 8 (non-RT OS)
> >     Kernel     : v6.2
> >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > 
> > 2) OS boot time definition:
> >     The time from the start of the kernel boot to the shell command line
> >     prompt is shown from the console. [ Different people may have
> >     different OS boot time definitions. ]
> > 
> > 3) Measurement method (very rough method):
> >     A timer in the kernel periodically prints the boot time every 100ms.
> >     As soon as the shell command line prompt is shown from the console,
> >     we record the boot time printed by the timer, then the printed boot
> >     time is the OS boot time.
> > 
> > 4) Measured OS boot time (in seconds)
> >    a) Measured 10 times w/o this patch:
> >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> >         The average OS boot time was: ~8.7s
> > 
> >    b) Measure 10 times w/ this patch:
> >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> >         The average OS boot time was: ~8.3s.
> > 
> > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> I still don't really like that:
> 
> 1) It feels like we are curing a symptom for which we don't know the cause.
>    Which RCU write side caller is the source of this slow boot? Some tracepoints
>    reporting the wait duration within synchronize_rcu() calls between the end of
>    the kernel boot and the end of userspace boot may be helpful.
>    
> 2) The kernel boot was already covered before this patch so this is about
>    userspace code calling into the kernel. Is that piece of code also called
>    after the boot? In that case are we missing a conversion from
>    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
>    the problem is more general than just boot.
> 
> This needs to be analyzed first and if it happens that the issue really
> needs to be fixed with telling the kernel that userspace has completed
> booting, eg: because the problem is not in a few callsites that need conversion
> to expedited but instead in the accumulation of lots of calls that should stay
> as is:
> 
> 3) This arbitrary timeout looks dangerous to me as latency sensitive code
>    may run right after the boot. Either you choose a value that is too low
>    and you miss the optimization or the value is too high and you may break
>    things.
> 
> 4) This should be fixed the way you did:
>    a) a kernel parameter like you did
>    b) The init process (systemd?) tells the kernel when it judges that userspace
>       has completed booting.
>    c) Make these interfaces more generic, maybe that information will be useful
>       outside RCU. For example the kernel parameter should be
>       "user_booted_reported" and the sysfs (should be sysctl?):
>       kernel.user_booted = 1
>    d) But yuck, this means we must know if the init process supports that...
> 
> For these reasons, let's make sure we know exactly what is going on first.
> 
> Thanks.

Just adding some notes and thoughts. There is an rcupdate.rcu_expedited=1
parameter that can be used during boot. For example, on our devices we
boot the kernel with rcu_expedited to speed up boot:

XQ-DQ54:/ # cat /proc/cmdline
stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug  msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init  qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001
XQ-DQ54:/ #

then user space can decide whether it is needed or not:

<snip>
rcu_expedited  rcu_normal
XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
-rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
-rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
XQ-DQ54:/ #
<snip>

For lazy, we can add an "rcu_cb_lazy" parameter and boot the kernel with
it set to true or false. That way we stay aligned with the rcu_expedited
and rcu_normal parameters.
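
As a rough sketch of the user-space side, an init-time script could clear
the expedited setting once it considers boot complete. The helper names
below are hypothetical, the script needs root to write the real sysfs
file, and writing 0 to /sys/kernel/rcu_expedited to return to normal grace
periods is my assumption about how that knob is meant to be used.

```python
from pathlib import Path

# sysfs knob shown in the listing above.
RCU_EXPEDITED = Path("/sys/kernel/rcu_expedited")


def set_rcu_knob(path: Path, value: int) -> None:
    """Write an integer value to a sysfs-style file."""
    path.write_text(f"{value}\n")


def mark_userspace_booted(path: Path = RCU_EXPEDITED) -> None:
    """Hypothetical hook: stop forcing expedited grace periods after boot."""
    set_rcu_knob(path, 0)


# Usage (as root, once user space judges that boot has finished):
#     mark_userspace_booted()
```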

--
Uladzislau Rezki
  
Joel Fernandes March 7, 2023, 1:41 p.m. UTC | #12
On Tue, Mar 7, 2023 at 8:01 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > On many systems, a great deal of boot (in userspace) happens after the
> > kernel thinks the boot has completed. It is difficult to determine if
> > the system has really booted from the kernel side. Some features like
> > lazy-RCU can risk slowing down boot time if, say, a callback has been
> > added that the boot synchronously depends on. Further expedited callbacks
> > can get unexpedited way earlier than it should be, thus slowing down
> > boot (as shown in the data below).
> >
> > For these reasons, this commit adds a config option
> > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> > Userspace can also make RCU's view of the system as booted, by writing the
> > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > Or even just writing a value of 0 to this sysfs node.
> > However, under no circumstance will the boot be allowed to end earlier
> > than just before init is launched.
> >
> > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> > suites ChromeOS and also a PREEMPT_RT system below very well, which need
> > no config or parameter changes, and just a simple application of this patch. A
> > system designer can also choose a specific value here to keep RCU from marking
> > boot completion.  As noted earlier, RCU's perspective of the system as booted
> > will not be marker until at least rcu_boot_end_delay milliseconds have passed
> > or an update is made via writing a small value (or 0) in milliseconds to:
> > /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> >
> > One side-effect of this patch is, there is a risk that a real-time workload
> > launched just after the kernel boots will suffer interruptions due to expedited
> > RCU, which previous ended just before init was launched. However, to mitigate
> > such an issue (however unlikely), the user should either tune
> > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> > boots, and before launching the real-time workload.
> >
> > Qiuxu also noted impressive boot-time improvements with earlier version
> > of patch. An excerpt from the data he shared:
> >
> > 1) Testing environment:
> >     OS            : CentOS Stream 8 (non-RT OS)
> >     Kernel     : v6.2
> >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> >
> > 2) OS boot time definition:
> >     The time from the start of the kernel boot to the shell command line
> >     prompt is shown from the console. [ Different people may have
> >     different OS boot time definitions. ]
> >
> > 3) Measurement method (very rough method):
> >     A timer in the kernel periodically prints the boot time every 100ms.
> >     As soon as the shell command line prompt is shown from the console,
> >     we record the boot time printed by the timer, then the printed boot
> >     time is the OS boot time.
> >
> > 4) Measured OS boot time (in seconds)
> >    a) Measured 10 times w/o this patch:
> >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> >         The average OS boot time was: ~8.7s
> >
> >    b) Measure 10 times w/ this patch:
> >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> >         The average OS boot time was: ~8.3s.
> >
> > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> I still don't really like that:
>
> 1) It feels like we are curing a symptom for which we don't know the cause.
>    Which RCU write side caller is the source of this slow boot? Some tracepoints
>    reporting the wait duration within synchronize_rcu() calls between the end of
>    the kernel boot and the end of userspace boot may be helpful.

Just to clarify (and I feel we discussed this recently) -- there is no
callback I am aware of right now causing a slow boot. The reason for
doing this is so that we don't have such issues in the future; it is a
protection. Note the repeated call-outs to the scsi callback and also
the rcu_barrier() issue previously fixed. Further, we already see
slight improvements in boot times with disabling lazy during boot (it's
not much but it's there). Yes, we should fix issues instead of hiding
them - but we also would like to improve the user experience -- just
like we disable lazy and expedited during suspend.

So what is the problem that you really have with this patch even with
data showing improvements? I actually wanted a mechanism like this
from the beginning and was trying to get Intel to write the patch, but
I ended up writing it.

> 2) The kernel boot was already covered before this patch so this is about
>    userspace code calling into the kernel. Is that piece of code also called
>    after the boot? In that case are we missing a conversion from
>    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
>    the problem is more general than just boot.
>
> This needs to be analyzed first and if it happens that the issue really
> needs to be fixed with telling the kernel that userspace has completed
> booting, eg: because the problem is not in a few callsites that need conversion
> to expedited but instead in the accumulation of lots of calls that should stay
> as is:

There is no such callback I am aware of that needs such a conversion
and I don't think that will help give any guarantees because there is
no preventing someone from adding a callback that synchronously slows
boot. The approach here is to put a protection in place. However, I will
do some more investigation into what else may be slowing things, as I do
give a lot of weight to your words! :)

>
> 3) This arbitrary timeout looks dangerous to me as latency sensitive code
>    may run right after the boot. Either you choose a value that is too low
>    and you miss the optimization or the value is too high and you may break
>    things.

So someone is presenting a timing sensitive workload within 15 seconds
of boot? Please provide some evidence of that. The only evidence right
now is on the plus side even for the RT system.

> 4) This should be fixed the way you did:
>    a) a kernel parameter like you did
>    b) The init process (systemd?) tells the kernel when it judges that userspace
>       has completed booting.
>    c) Make these interfaces more generic, maybe that information will be useful
>       outside RCU. For example the kernel parameter should be
>       "user_booted_reported" and the sysfs (should be sysctl?):
>       kernel.user_booted = 1
>    d) But yuck, this means we must know if the init process supports that...
>
> For these reasons, let's make sure we know exactly what is going on first.

I can investigate this more and get back to you.

One of the challenges is getting boot tracing working properly.
Systems do weird things like turning off tracing during boot and/or
clearing trace buffers.

 - Joel
  
Joel Fernandes March 7, 2023, 1:48 p.m. UTC | #13
On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > > On many systems, a great deal of boot (in userspace) happens after the
> > > kernel thinks the boot has completed. It is difficult to determine if
> > > the system has really booted from the kernel side. Some features like
> > > lazy-RCU can risk slowing down boot time if, say, a callback has been
> > > added that the boot synchronously depends on. Further expedited callbacks
> > > can get unexpedited way earlier than it should be, thus slowing down
> > > boot (as shown in the data below).
> > >
> > > For these reasons, this commit adds a config option
> > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> > > Userspace can also make RCU's view of the system as booted, by writing the
> > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > > Or even just writing a value of 0 to this sysfs node.
> > > However, under no circumstance will the boot be allowed to end earlier
> > > than just before init is launched.
> > >
> > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> > > suites ChromeOS and also a PREEMPT_RT system below very well, which need
> > > no config or parameter changes, and just a simple application of this patch. A
> > > system designer can also choose a specific value here to keep RCU from marking
> > > boot completion.  As noted earlier, RCU's perspective of the system as booted
> > > will not be marker until at least rcu_boot_end_delay milliseconds have passed
> > > or an update is made via writing a small value (or 0) in milliseconds to:
> > > /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > >
> > > One side-effect of this patch is, there is a risk that a real-time workload
> > > launched just after the kernel boots will suffer interruptions due to expedited
> > > RCU, which previous ended just before init was launched. However, to mitigate
> > > such an issue (however unlikely), the user should either tune
> > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> > > boots, and before launching the real-time workload.
> > >
> > > Qiuxu also noted impressive boot-time improvements with earlier version
> > > of patch. An excerpt from the data he shared:
> > >
> > > 1) Testing environment:
> > >     OS            : CentOS Stream 8 (non-RT OS)
> > >     Kernel     : v6.2
> > >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> > >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > >
> > > 2) OS boot time definition:
> > >     The time from the start of the kernel boot to the shell command line
> > >     prompt is shown from the console. [ Different people may have
> > >     different OS boot time definitions. ]
> > >
> > > 3) Measurement method (very rough method):
> > >     A timer in the kernel periodically prints the boot time every 100ms.
> > >     As soon as the shell command line prompt is shown from the console,
> > >     we record the boot time printed by the timer, then the printed boot
> > >     time is the OS boot time.
> > >
> > > 4) Measured OS boot time (in seconds)
> > >    a) Measured 10 times w/o this patch:
> > >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> > >         The average OS boot time was: ~8.7s
> > >
> > >    b) Measure 10 times w/ this patch:
> > >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> > >         The average OS boot time was: ~8.3s.
> > >
> > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >
> > I still don't really like that:
> >
> > 1) It feels like we are curing a symptom for which we don't know the cause.
> >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> >    reporting the wait duration within synchronize_rcu() calls between the end of
> >    the kernel boot and the end of userspace boot may be helpful.
> >
> > 2) The kernel boot was already covered before this patch so this is about
> >    userspace code calling into the kernel. Is that piece of code also called
> >    after the boot? In that case are we missing a conversion from
> >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> >    the problem is more general than just boot.
> >
> > This needs to be analyzed first and if it happens that the issue really
> > needs to be fixed with telling the kernel that userspace has completed
> > booting, eg: because the problem is not in a few callsites that need conversion
> > to expedited but instead in the accumulation of lots of calls that should stay
> > as is:
> >
> > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> >    may run right after the boot. Either you choose a value that is too low
> >    and you miss the optimization or the value is too high and you may break
> >    things.
> >
> > 4) This should be fixed the way you did:
> >    a) a kernel parameter like you did
> >    b) The init process (systemd?) tells the kernel when it judges that userspace
> >       has completed booting.
> >    c) Make these interfaces more generic, maybe that information will be useful
> >       outside RCU. For example the kernel parameter should be
> >       "user_booted_reported" and the sysfs (should be sysctl?):
> >       kernel.user_booted = 1
> >    d) But yuck, this means we must know if the init process supports that...
> >
> > For these reasons, let's make sure we know exactly what is going on first.
> >
> > Thanks.
> Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> parameter that can be used during the boot. For example on our devices
> to speedup a boot we boot the kernel with rcu_expedited:
>
> XQ-DQ54:/ # cat /proc/cmdline
> stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug  msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init  qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001
> XQ-DQ54:/ #
>
> then a user space can decides if it is needed or not:
>
> <snip>
> rcu_expedited  rcu_normal
> XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> XQ-DQ54:/ #
> <snip>
>
> for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with
> true or false. So we can follow and be aligned with rcu_expedited and
> rcu_normal parameters.

Speaking of aligning, there is also the automated
rcu_normal_after_boot boot option, correct? I prefer the automated
way of doing this. So the approach here is not really unprecedented
and is much more robust than relying on userspace too much (I am ok
with adding your suggestion *on top* of the automated toggle, but I
probably would not have ChromeOS use it if the automated way exists).
Or did I miss something?

thanks,

 - Joel
  
Paul E. McKenney March 7, 2023, 3:22 p.m. UTC | #14
On Tue, Mar 07, 2023 at 07:49:49AM +0000, Zhuo, Qiuxu wrote:
> > From: Paul E. McKenney <paulmck@kernel.org>
> > [...]
> > >
> > > Thank you so much Paul for the detailed comments on the measured data.
> > >
> > > I'm curious how did you figure out the number 24 that we at *least* need.
> > > This can guide me on whether the number of samples is enough for
> > > future testing ;-).
> > 
> > It is a rough rule of thumb.  For more details and accuracy, study up on the
> > Student's t-test and related statistical tests.
> > 
> > Of course, this all assumes that the data fits a normal distribution.
> 
> Thanks for this extra information. Good to know the Student's t-test.
> 
> > > I did another 48 measurements (2x of 24) for each case (w/o and w/
> > > Joel's v2 patch) as below.
> > > All the testing configurations for the new testing are the same as
> > > before.
> > >
> > > a) Measured 48 times w/o v2 patch (in seconds):
> > >     8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4,
> > >     8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8,
> > >     8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
> > >     8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6,
> > >     8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0,
> > >     9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7
> > >     The average OS boot time was: ~9.0s
> > 
> > The range is 8.2 through 9.8.
> > 
> > > b) Measure 48 times w/ v2 patch (in seconds):
> > >     7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2,
> > >     9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2,
> > >     8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
> > >     8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0,
> > >     8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8,
> > >     8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3,
> > >     The average OS boot time was: ~8.5s
> > 
> > The range is 7.7 through 9.8.
> > 
> > There is again significant overlap, so it is again unclear that you have a
> > statistically significant difference.  So could you please calculate the standard
> > deviations?
> 
> a's standard deviation is ~0.4.
> b's standard deviation is ~0.5.
> 
> a's average 9.0 is at the upbound of the standard deviation of b's [8.0, 9].
> So, the measurements should be statistically significant to some degree.

That single standard deviation means that you have 68% confidence that the
difference is real.  This is not far above the 50% level of random noise.
95% is the lowest level that is normally considered to be statistically
significant.

> The calculated standard deviations are via: 
> https://www.gigacalculator.com/calculators/standard-deviation-calculator.php

Fair enough.  Formulas are readily available as well, and most spreadsheets
support standard deviation.
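
For anyone who prefers a script to a web calculator, here is a minimal
sketch using Python's standard statistics module on the two 48-sample
data sets quoted above.  The names are purely illustrative, and it assumes
the sample (n-1) form of the standard deviation.

```python
import statistics

# Boot times in seconds, copied from the 48-sample measurements quoted above.
a = [8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4,
     8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8,
     8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
     8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6,
     8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0,
     9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7]   # without the v2 patch
b = [7.7, 8.6, 8.1, 7.8, 8.2, 8.2, 8.8, 8.2,
     9.8, 8.0, 9.2, 8.8, 9.2, 8.5, 8.4, 9.2,
     8.5, 8.3, 8.1, 8.3, 8.6, 7.9, 8.3, 8.3,
     8.6, 8.9, 8.0, 8.5, 8.4, 8.6, 8.7, 8.0,
     8.8, 8.8, 9.1, 7.9, 9.7, 7.9, 8.2, 7.8,
     8.1, 8.5, 8.6, 8.4, 9.2, 8.6, 9.6, 8.3]   # with the v2 patch


def summarize(samples):
    """Return (mean, sample standard deviation, median) of the samples."""
    return (statistics.mean(samples),
            statistics.stdev(samples),    # n-1 denominator
            statistics.median(samples))


if __name__ == "__main__":
    for name, data in (("a", a), ("b", b)):
        mean, sd, med = summarize(data)
        print(f"{name}: mean={mean:.2f} stdev={sd:.2f} median={med:.2f}")
```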

> > > @Joel Fernandes (Google), you may replace my old data with the above
> > > new data in your commit message.
> > >
> > > > But we can apply the binomial distribution instead of the usual
> > > > normal distribution.  First, let's sort and take the medians:
> > > >
> > > > a: 8.2 8.3 8.4 8.6 8.7 8.7 8.8 8.8 9.0 9.3  Median: 8.7
> > > > b: 7.6 7.8 8.2 8.2 8.2 8.2 8.4 8.5 8.7 9.3  Median: 8.2
> > > >
> > > > 8/10 of a's data points are greater than 0.1 more than b's median
> > > > and 8/10 of b's data points are less than 0.1 less than a's median.
> > > > What are the odds that this happens by random chance?
> > > >
> > > > This is given by sum_0^2 (0.5^10 * binomial(10,i)), which is about 0.055.
> > >
> > > What's the meaning of 0.5 here? Was it the probability (we assume?)
> > > that each time b's data point failed (or didn't satisfy) "less than
> > > 0.1 less than a's median"?
> > 
> > The meaning of 0.5 is the probability of a given data point being on one side
> > or the other of the corresponding distribution's median.  This of course
> > assumes that the median of the measured data matches that of the
> > corresponding distribution, though the fact that the median is also a mode of
> > both of the old data sets gives some hope.
> 
>   Thanks for the detailed comments on the meaning of 0.5 here. :-)
> 
> > The meaning of the 0.1 is the smallest difference that the data could measure.
> > I could have instead chosen 0.0 and asked if there was likely some (perhaps
> > tiny) difference, but instead, I chose to ask if there was likely some small but
> > meaningful difference.  It is better to choose the desired difference before
> > measuring the data.
> 
>   Thanks for the detailed comments on the meaning of 0.1 here. :-)
> 
> > Why don't you try applying this approach to the new data?  You will need the
> > general binomial formula.
> 
>    Thank you Paul for the suggestion. 
>    I just tried it, but not sure whether my analysis was correct ...
> 
>    Analysis 1:
>    a's median is 8.9. 

I get 8.95, which is the average of the 24th and 25th members of a
in numerical order.

>    35/48 b's data points are less than 0.1 less than a's median.
>    For a's binomial distribution P(X >= 35) = 0.1%, where p=0.5.
>    So, we have strong confidence that b is 100ms faster than a.

I of course get quite a bit stronger confidence, but your 99.9% is
good enough.  And I get even stronger confidence going in the other
direction.  However, the fact that a's median varies from 8.7 in the old
experiment to 8.95 in this experiment does give some pause.  These are
after all supposedly drawn from the same distribution.  Or did you use
a different machine or different OS version or some such in the two
sets of measurements?  Different time of day and thus different ambient
temperature, thus different CPU clock frequency?

Assuming identical test setups, let's try the old value of 8.7 from old
a to new b.  There are 14 elements in new b greater than 8.6, for a
probability of 0.17%, or about 98.3% significance.  This is still OK.
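
The arithmetic behind this kind of estimate can be sketched in a few
lines of Python.  This is only an illustration, under the assumption that
each data point independently lands on either side of the true median
with probability 0.5; the function names are made up for this example.
For n=48 and k=14, the single term for exactly 14 comes out at roughly
0.17%, and the cumulative tail for 14 or fewer is a bit larger but still
well under 1%.

```python
from math import comb


def binom_pmf(n, k, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def binom_cdf(n, k, p=0.5):
    """Probability of k or fewer successes in n independent trials."""
    return sum(binom_pmf(n, i, p) for i in range(k + 1))


if __name__ == "__main__":
    # Old 10-sample data: at most 2 of 10 points on the "wrong" side.
    print(f"P(X <= 2  | n=10) = {binom_cdf(10, 2):.4f}")
    # New data: 14 of 48 elements of b are greater than 8.6.
    print(f"P(X == 14 | n=48) = {binom_pmf(48, 14):.5f}")
    print(f"P(X <= 14 | n=48) = {binom_cdf(48, 14):.5f}")
```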

In contrast, the median of the old b is 8.2, which gives extreme
confidence.  So let's be conservative and use the large-set median.

In real life, additional procedures would be needed to estimate the
confidence in the median, which turns out to be nontrivial.  When I apply
this sort of technique, I usually have all data from each sample being
on one side of the median of the other, which simplifies things.  ;-)

The easiest way to estimate bounds on the median is to "bootstrap",
but that works best if you have 1000 samples and can randomly draw 1000
sub-samples each of size 10 from the larger sample and compute the median
of each.  You can sort these medians and obtain a cumulative distribution.
But you have to have an extremely good reason to collect data from 1000
boots, and I don't believe we have that good of a reason.
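
For the curious, a rough sketch of that bootstrap procedure in Python
follows.  It assumes that sub-samples are drawn without replacement, uses
a fixed seed for repeatability, and takes the 2.5th and 97.5th percentiles
of the sorted medians as the bounds; all names are illustrative.  It is
run on the 48 "a" measurements only to show the mechanics, since 48
samples is far fewer than the 1000 mentioned above.

```python
import random
import statistics

# The 48 boot times (seconds) measured without the v2 patch, quoted above.
a = [8.4, 8.8, 9.2, 9.0, 8.3, 9.6, 8.8, 9.4,
     8.7, 9.2, 8.3, 9.4, 8.4, 9.6, 8.5, 8.8,
     8.8, 8.9, 9.3, 9.2, 8.6, 9.7, 9.2, 8.8,
     8.7, 9.0, 9.1, 9.5, 8.6, 8.9, 9.1, 8.6,
     8.2, 9.1, 8.8, 9.2, 9.1, 8.9, 8.4, 9.0,
     9.8, 9.8, 8.7, 8.8, 9.1, 9.5, 9.5, 8.7]


def bootstrap_median_bounds(samples, draws=1000, size=10, seed=0):
    """Estimate rough 95% bounds on the median from random sub-samples.

    Draws `draws` sub-samples of `size` points each (without replacement),
    computes the median of each, sorts those medians, and returns the
    values near the 2.5th and 97.5th percentiles.
    """
    rng = random.Random(seed)
    medians = sorted(statistics.median(rng.sample(samples, size))
                     for _ in range(draws))
    lo = medians[int(0.025 * draws)]
    hi = medians[int(0.975 * draws) - 1]
    return lo, hi


if __name__ == "__main__":
    lo, hi = bootstrap_median_bounds(a)
    print(f"median of a lies roughly in [{lo:.2f}, {hi:.2f}]")
```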

>    Analysis 2:
>    a's median - 0.4 = 8.9 - 0.4 = 8.5. 
>    24/48 b's data points are less than 0.4 less than a's median.
>    The probability that a's data points are less than 8.5 is p = 7/48 = 0.1458 

This is only 85.4% significant, so...

>    For a's binomial distribution P(X >= 24) = 0.0%, where p=0.1458.
>    So, looks like we have confidence that b is 400ms faster than a.

...we really cannot say anything about 400ms faster.  Again, you need 95%
and preferably 99% to really make any sort of claim.  You probably need
quite a few more samples to say much about 200ms, let alone 400ms.

Plus, you really should select the speedup and only then take the
measurements.  Otherwise, you end up fitting noise.

However, assuming identical test setups, you really can calculate
the median from the full data set.

>    The calculated cumulative binomial distributions P(X) is via:
>    https://www.gigacalculator.com/calculators/binomial-probability-calculator.php

The maxima program's binomial() function agrees with it, so good.  ;-)

>    I apologize if this analysis/discussion bored some of you. ;-)

Let's just say that it is a lot simpler when you are measuring
larger differences in data with tighter distributions.  Me, I usually
just say "no" to drawing any sort of conclusion from data sets that
overlap this much.

Instead, I might check to see if there are some random events adding
noise to the boot duration, eliminate them, and hopefully get data
that is easier to analyze.

But I am good with the 98.3% confidence in a 100ms improvement.

So if Joel wishes to make this point, he should feel free to take both
of your datasets and use the computation with the worse mean.

							Thanx, Paul
  
Frederic Weisbecker March 7, 2023, 5:19 p.m. UTC | #15
On Tue, Mar 07, 2023 at 08:41:17AM -0500, Joel Fernandes wrote:
> On Tue, Mar 7, 2023 at 8:01 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > > On many systems, a great deal of boot (in userspace) happens after the
> > > kernel thinks the boot has completed. It is difficult to determine if
> > > the system has really booted from the kernel side. Some features like
> > > lazy-RCU can risk slowing down boot time if, say, a callback has been
> > > added that the boot synchronously depends on. Further expedited callbacks
> > > can get unexpedited way earlier than it should be, thus slowing down
> > > boot (as shown in the data below).
> > >
> > > For these reasons, this commit adds a config option
> > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> > > Userspace can also make RCU's view of the system as booted, by writing the
> > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > > Or even just writing a value of 0 to this sysfs node.
> > > However, under no circumstance will the boot be allowed to end earlier
> > > than just before init is launched.
> > >
> > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> > > suites ChromeOS and also a PREEMPT_RT system below very well, which need
> > > no config or parameter changes, and just a simple application of this patch. A
> > > system designer can also choose a specific value here to keep RCU from marking
> > > boot completion.  As noted earlier, RCU's perspective of the system as booted
> > > will not be marker until at least rcu_boot_end_delay milliseconds have passed
> > > or an update is made via writing a small value (or 0) in milliseconds to:
> > > /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > >
> > > One side-effect of this patch is, there is a risk that a real-time workload
> > > launched just after the kernel boots will suffer interruptions due to expedited
> > > RCU, which previous ended just before init was launched. However, to mitigate
> > > such an issue (however unlikely), the user should either tune
> > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> > > boots, and before launching the real-time workload.
> > >
> > > Qiuxu also noted impressive boot-time improvements with earlier version
> > > of patch. An excerpt from the data he shared:
> > >
> > > 1) Testing environment:
> > >     OS            : CentOS Stream 8 (non-RT OS)
> > >     Kernel     : v6.2
> > >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> > >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > >
> > > 2) OS boot time definition:
> > >     The time from the start of the kernel boot to the shell command line
> > >     prompt is shown from the console. [ Different people may have
> > >     different OS boot time definitions. ]
> > >
> > > 3) Measurement method (very rough method):
> > >     A timer in the kernel periodically prints the boot time every 100ms.
> > >     As soon as the shell command line prompt is shown from the console,
> > >     we record the boot time printed by the timer, then the printed boot
> > >     time is the OS boot time.
> > >
> > > 4) Measured OS boot time (in seconds)
> > >    a) Measured 10 times w/o this patch:
> > >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> > >         The average OS boot time was: ~8.7s
> > >
> > >    b) Measure 10 times w/ this patch:
> > >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> > >         The average OS boot time was: ~8.3s.
> > >
> > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >
> > I still don't really like that:
> >
> > 1) It feels like we are curing a symptom for which we don't know the cause.
> >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> >    reporting the wait duration within synchronize_rcu() calls between the end of
> >    the kernel boot and the end of userspace boot may be helpful.
> 
> Just to clarify (and I feel we discussed this recently) -- there is no
> callback I am aware of right now causing a slow boot. The reason for
> doing this is we don't have such issues in the future; so it is a
> protection. Note the repeated call outs to the scsi callback and also
> the rcu_barrier() issue previously fixed. Further, we already see
> slight improvements in boot times with disabling lazy during boot (its
> not much but its there). Yes, we should fix issues instead of hiding
> them - but we also would like to improve the user experience -- just
> like we disable lazy and expedited during suspend.
> 
> So what is the problem that you really have with this patch even with
> data showing improvements? I actually wanted a mechanism like this
> from the beginning and was trying to get Intel to write the patch, but
> I ended up writing it.

Let's put it another way: kernel boot is mostly code that won't execute
again. User boot (or rather the kernel part of it) OTOH is code that is
likely to be executed again.

A lot of the kernel boot code is __init code that will execute only once.
And there it makes sense to force hurry and expedited because we may easily
miss something and after all this all happens only once, also there is no
interference with userspace, etc...

User boot OTOH uses common kernel code: syscalls, signals, files, etc... And that
code will also be called after the boot.

So if there is something slowing down user boot, there is a good chance
that this thing slows down userspace in general.

Therefore we need to know exactly what's going on because the problem may be
bigger than what you observe on boot.

> 
> > 2) The kernel boot was already covered before this patch so this is about
> >    userspace code calling into the kernel. Is that piece of code also called
> >    after the boot? In that case are we missing a conversion from
> >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> >    the problem is more general than just boot.
> >
> > This needs to be analyzed first and if it happens that the issue really
> > needs to be fixed with telling the kernel that userspace has completed
> > booting, eg: because the problem is not in a few callsites that need conversion
> > to expedited but instead in the accumulation of lots of calls that should stay
> > as is:
> 
> There is no such callback I am aware off that needs such a conversion
> and I don't think that will help give any guarantees because there is
> no preventing someone from adding a callback that synchronously slows
> boot. The approach here is to put a protection. However, I will do
> some more investigations into what else may be slowing things as I do
> hold a lot of weight for your words! :)

Kernel boot is already handled and userspace boot can not add a new RCU callback.

> 
> >
> > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> >    may run right after the boot. Either you choose a value that is too low
> >    and you miss the optimization or the value is too high and you may break
> >    things.
> 
> So someone is presenting a timing sensitive workload within 15 seconds
> of boot? Please provide some evidence of that.

I have no idea, there are billions of computers running out there, it's a disaster...

> The only evidence right now is on the plus side even for the RT system.

Right, it's improving the boot of an RT system; that doesn't mean it's not
breaking post-boot of others.

> 
> > 4) This should be fixed the way you did:
> >    a) a kernel parameter like you did
> >    b) The init process (systemd?) tells the kernel when it judges that userspace
> >       has completed booting.
> >    c) Make these interfaces more generic, maybe that information will be useful
> >       outside RCU. For example the kernel parameter should be
> >       "user_booted_reported" and the sysfs (should be sysctl?):
> >       kernel.user_booted = 1
> >    d) But yuck, this means we must know if the init process supports that...
> >
> > For these reasons, let's make sure we know exactly what is going on first.
> 
> I can investigate this more and get back to you.
> 
> One of the challenges is getting boot tracing working properly.
> Systems do weird things like turning off tracing during boot and/or
> clearing trace buffers.

Just compare the average and total duration of all synchronize_rcu() calls
(before and after forcing expedited) between launching init and userspace boot
completion. Sure, there will be noise, but if a difference can be measured before
and after your patch, then a difference might be measurable with tracing as
well... Well, of course tracing can induce subtle things... But let's try at
least, we want to know what we are fixing here.
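
As a rough sketch of that comparison (not something from the patch): assuming
the per-call wait times of synchronize_rcu() have already been extracted from
a trace into plain lists of milliseconds, one list per kernel, the average and
total could be compared along these lines. The function names and the input
format are made up for illustration.

```python
def summarize_waits(durations_ms):
    """Return (count, total_ms, average_ms) for one boot's synchronize_rcu() waits."""
    count = len(durations_ms)
    total = float(sum(durations_ms))
    average = total / count if count else 0.0
    return count, total, average


def compare_boots(before_ms, after_ms):
    """Compare waits measured before and after forcing expedited grace periods.

    Both arguments are hypothetical lists of per-call wait times in
    milliseconds, covering the window from init launch to the end of
    userspace boot.
    """
    _, total_before, avg_before = summarize_waits(before_ms)
    _, total_after, avg_after = summarize_waits(after_ms)
    return {
        "total_saved_ms": total_before - total_after,
        "average_saved_ms": avg_before - avg_after,
    }
```

A clearly positive total_saved_ms over several boots would suggest the waits
themselves got shorter, rather than the boot-time delta being noise.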

Thanks.
  
Paul E. McKenney March 7, 2023, 5:33 p.m. UTC | #16
On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > >
> > > I still don't really like that:
> > >
> > > 1) It feels like we are curing a symptom for which we don't know the cause.
> > >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> > >    reporting the wait duration within synchronize_rcu() calls between the end of
> > >    the kernel boot and the end of userspace boot may be helpful.
> > >
> > > 2) The kernel boot was already covered before this patch so this is about
> > >    userspace code calling into the kernel. Is that piece of code also called
> > >    after the boot? In that case are we missing a conversion from
> > >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > >    the problem is more general than just boot.
> > >
> > > This needs to be analyzed first and if it happens that the issue really
> > > needs to be fixed with telling the kernel that userspace has completed
> > > booting, eg: because the problem is not in a few callsites that need conversion
> > > to expedited but instead in the accumulation of lots of calls that should stay
> > > as is:
> > >
> > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> > >    may run right after the boot. Either you choose a value that is too low
> > >    and you miss the optimization or the value is too high and you may break
> > >    things.
> > >
> > > 4) This should be fixed the way you did:
> > >    a) a kernel parameter like you did
> > >    b) The init process (systemd?) tells the kernel when it judges that userspace
> > >       has completed booting.
> > >    c) Make these interfaces more generic, maybe that information will be useful
> > >       outside RCU. For example the kernel parameter should be
> > >       "user_booted_reported" and the sysfs (should be sysctl?):
> > >       kernel.user_booted = 1
> > >    d) But yuck, this means we must know if the init process supports that...
> > >
> > > For these reasons, let's make sure we know exactly what is going on first.
> > >
> > > Thanks.
> > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> > parameter that can be used during the boot. For example on our devices
> > to speedup a boot we boot the kernel with rcu_expedited:
> >
> > XQ-DQ54:/ # cat /proc/cmdline
> > stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug  msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init  qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001
> > XQ-DQ54:/ #
> >
> > then a user space can decides if it is needed or not:
> >
> > <snip>
> > rcu_expedited  rcu_normal
> > XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> > XQ-DQ54:/ #
> > <snip>
> >
> > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with
> > true or false. So we can follow and be aligned with rcu_expedited and
> > rcu_normal parameters.
> 
> Speaking of aligning, there is also the automated
> rcu_normal_after_boot boot option correct? I prefer the automated
> option of doing this. So the approach here is not really unprecedented
> and is much more robust than relying on userspace too much (I am ok
> with adding your suggestion *on top* of the automated toggle, but I
> probably would not have ChromeOS use it if the automated way exists).
> Or did I miss something?

See this commit:

3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives")

Antti provided this commit precisely in order to allow Android devices
to expedite the boot process and to shut off the expediting at a time of
Android userspace's choosing.  So Android has been making this work for
about ten years, which strikes me as an adequate proof of concept.  ;-)
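
For illustration only, here is a minimal sketch of what that userspace side
might look like, assuming the kernel was booted with rcupdate.rcu_expedited=1
and that the /sys/kernel/rcu_expedited file shown in Vlad's listing is writable
by the caller (root on a real system). The function names are hypothetical, and
this is not taken from the Android sources.

```python
from pathlib import Path

# sysfs knob shown in the listing above; writing it requires root.
RCU_EXPEDITED = Path("/sys/kernel/rcu_expedited")


def set_rcu_expedited(enabled, knob=RCU_EXPEDITED):
    """Write 1 (force expedited grace periods) or 0 (stop forcing) to the knob."""
    knob.write_text("1\n" if enabled else "0\n")


def userspace_boot_done(knob=RCU_EXPEDITED):
    """Hypothetical hook for init to call once it judges that boot has completed."""
    set_rcu_expedited(False, knob)
```

The knob path is a parameter only so that the sketch can be exercised against
an ordinary file instead of the real sysfs node.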

Of course, Android has a rather tightly controlled userspace, as do
real-time embedded systems (I sure hope, anyway!).  Which is why your
timeout-based fallback/backup makes a lot of sense.  And why someone might
want an aggressive indication when that timeout-based backup is needed.

							Thanx, Paul
  
Joel Fernandes March 7, 2023, 6:19 p.m. UTC | #17
On Tue, Mar 7, 2023 at 12:19 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Tue, Mar 07, 2023 at 08:41:17AM -0500, Joel Fernandes wrote:
> > On Tue, Mar 7, 2023 at 8:01 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> > >
> > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > >
> > > I still don't really like that:
> > >
> > > 1) It feels like we are curing a symptom for which we don't know the cause.
> > >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> > >    reporting the wait duration within synchronize_rcu() calls between the end of
> > >    the kernel boot and the end of userspace boot may be helpful.
> >
> > Just to clarify (and I feel we discussed this recently) -- there is no
> > callback I am aware of right now causing a slow boot. The reason for
> > doing this is we don't have such issues in the future; so it is a
> > protection. Note the repeated call outs to the scsi callback and also
> > the rcu_barrier() issue previously fixed. Further, we already see
> > slight improvements in boot times with disabling lazy during boot (its
> > not much but its there). Yes, we should fix issues instead of hiding
> > them - but we also would like to improve the user experience -- just
> > like we disable lazy and expedited during suspend.
> >
> > So what is the problem that you really have with this patch even with
> > data showing improvements? I actually wanted a mechanism like this
> > from the beginning and was trying to get Intel to write the patch, but
> > I ended up writing it.
>
> Let's put it another way: kernel boot is mostly code that won't execute
> again. User boot (or rather the kernel part of it) OTOH is code that is
> subject to be repeated again.
>
> A lot of the kernel boot code is __init code that will execute only once.
> And there it makes sense to force hurry and expedited because we may easily
> miss something and after all this all happens only once, also there is no
> interference with userspace, etc...
>
> User boot OTOH use common kernel code: syscalls, signal, files, etc... And that
> code will be called also after the boot.
>
> So if there is something slowing down user boot, there are some good chances
> that this thing slows down userspace in general.
>
> Therefore we need to know exactly what's going on because the problem may be
> bigger than what you observe on boot.

These are good points. They motivate me to dig further, as we may otherwise
be setting ourselves up for longer-term problems in exchange for
shorter-term gains. I am thinking of finishing my debugobjects patch soon,
which adds metadata to callbacks and exposes the details via debugfs, and
providing it to Qiuxu and the ChromeOS folks to run and study the boot time.

> > > 2) The kernel boot was already covered before this patch so this is about
> > >    userspace code calling into the kernel. Is that piece of code also called
> > >    after the boot? In that case are we missing a conversion from
> > >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > >    the problem is more general than just boot.
> > >
> > > This needs to be analyzed first and if it happens that the issue really
> > > needs to be fixed with telling the kernel that userspace has completed
> > > booting, eg: because the problem is not in a few callsites that need conversion
> > > to expedited but instead in the accumulation of lots of calls that should stay
> > > as is:
> >
> > There is no such callback I am aware off that needs such a conversion
> > and I don't think that will help give any guarantees because there is
> > no preventing someone from adding a callback that synchronously slows
> > boot. The approach here is to put a protection. However, I will do
> > some more investigations into what else may be slowing things as I do
> > hold a lot of weight for your words! :)
>
> Kernel boot is already handled and userspace boot can not add a new RCU callback.

Right, so that is in line with your point about userspace slowing down
even after boot, if I am not mistaken.

> > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> > >    may run right after the boot. Either you choose a value that is too low
> > >    and you miss the optimization or the value is too high and you may break
> > >    things.
> >
> > So someone is presenting a timing sensitive workload within 15 seconds
> > of boot? Please provide some evidence of that.
>
> I have no idea, there are billions of computers running out there, it's a disaster...

Haha... Linux success sounds like a nice problem to have. ;-)

> > The only evidence right now is on the plus side even for the RT system.
>
> Right it's improving the boot of an RT system, doesn't mean it's not breaking
> post boot of others.

True. However, I still feel a protection in the future would make
sense in general after we finish these investigations.

> > > 4) This should be fixed the way you did:
> > >    a) a kernel parameter like you did
> > >    b) The init process (systemd?) tells the kernel when it judges that userspace
> > >       has completed booting.
> > >    c) Make these interfaces more generic, maybe that information will be useful
> > >       outside RCU. For example the kernel parameter should be
> > >       "user_booted_reported" and the sysfs (should be sysctl?):
> > >       kernel.user_booted = 1
> > >    d) But yuck, this means we must know if the init process supports that...
> > >
> > > For these reasons, let's make sure we know exactly what is going on first.
> >
> > I can investigate this more and get back to you.
> >
> > One of the challenges is getting boot tracing working properly.
> > Systems do weird things like turning off tracing during boot and/or
> > clearing trace buffers.
>
> Just compare the average and total duration of all synchronize_rcu() calls
> (before and after forcing expedited) between launching init and userspace boot
> completion. Sure, there will be noise, but if a difference can be measured before
> and after your patch, then a difference might be measurable with tracing as
> well... Well, of course tracing can induce subtle things... But let's try at
> least, we want to know what we are fixing here.

You mean using the function graph tracer? For the synchronize_rcu() stuff,
I'll have to defer to Qiuxu to try tracing synchronize_rcu() on
his PREEMPT_RT system since I don't have access to that system. I can
try to provide a patch that will make tracing that easier, but that
will probably take a few days as I'm traveling...

On ChromeOS we are seeing slight improvements with this patch (though
it is not clear whether they are statistically significant). So I have to
dig deeper into what is going on there.
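
To make the significance question concrete, here is a small sketch using the
only per-run numbers in this thread, namely Qiuxu's ten boot times with and
without the patch from the commit message (not ChromeOS data). It computes
Welch's t statistic and the Welch-Satterthwaite degrees of freedom with the
Python standard library only; the choice of Welch's test is my assumption, not
something anyone here has agreed on.

```python
from math import sqrt
from statistics import mean, variance

# Boot times in seconds, copied from the commit message.
WITHOUT_PATCH = [8.7, 8.4, 8.6, 8.2, 9.0, 8.7, 8.8, 9.3, 8.8, 8.3]
WITH_PATCH = [8.5, 8.2, 7.6, 8.2, 8.7, 8.2, 7.8, 8.2, 9.3, 8.4]


def welch_t(a, b):
    """Return (t, degrees_of_freedom) for two samples with unequal variances."""
    # Squared standard error of each sample mean.
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


if __name__ == "__main__":
    t, df = welch_t(WITHOUT_PATCH, WITH_PATCH)
    print(f"mean w/o patch: {mean(WITHOUT_PATCH):.2f}s")
    print(f"mean w/  patch: {mean(WITH_PATCH):.2f}s")
    print(f"Welch t = {t:.2f}, df = {df:.1f}")
```

The resulting t still has to be compared against a t-distribution critical
value for the chosen confidence level, which the standard library does not
provide. With only ten runs per side and 100ms timer resolution, I would treat
the result as a hint at best.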


 - Joel
  
Joel Fernandes March 7, 2023, 6:54 p.m. UTC | #18
On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote:
> On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > > >
> > > > I still don't really like that:
> > > >
> > > > 1) It feels like we are curing a symptom for which we don't know the cause.
> > > >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> > > >    reporting the wait duration within synchronize_rcu() calls between the end of
> > > >    the kernel boot and the end of userspace boot may be helpful.
> > > >
> > > > 2) The kernel boot was already covered before this patch so this is about
> > > >    userspace code calling into the kernel. Is that piece of code also called
> > > >    after the boot? In that case are we missing a conversion from
> > > >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > > >    the problem is more general than just boot.
> > > >
> > > > This needs to be analyzed first and if it happens that the issue really
> > > > needs to be fixed with telling the kernel that userspace has completed
> > > > booting, eg: because the problem is not in a few callsites that need conversion
> > > > to expedited but instead in the accumulation of lots of calls that should stay
> > > > as is:
> > > >
> > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> > > >    may run right after the boot. Either you choose a value that is too low
> > > >    and you miss the optimization or the value is too high and you may break
> > > >    things.
> > > >
> > > > 4) This should be fixed the way you did:
> > > >    a) a kernel parameter like you did
> > > >    b) The init process (systemd?) tells the kernel when it judges that userspace
> > > >       has completed booting.
> > > >    c) Make these interfaces more generic, maybe that information will be useful
> > > >       outside RCU. For example the kernel parameter should be
> > > >       "user_booted_reported" and the sysfs (should be sysctl?):
> > > >       kernel.user_booted = 1
> > > >    d) But yuck, this means we must know if the init process supports that...
> > > >
> > > > For these reasons, let's make sure we know exactly what is going on first.
> > > >
> > > > Thanks.
> > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> > > parameter that can be used during the boot. For example on our devices
> > > to speedup a boot we boot the kernel with rcu_expedited:
> > >
> > > XQ-DQ54:/ # cat /proc/cmdline
> > > XQ-DQ54:/ #
> > >
> > > then a user space can decides if it is needed or not:
> > >
> > > <snip>
> > > rcu_expedited  rcu_normal
> > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> > > XQ-DQ54:/ #
> > > <snip>
> > >
> > > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with
> > > true or false. So we can follow and be aligned with rcu_expedited and
> > > rcu_normal parameters.
> > 
> > Speaking of aligning, there is also the automated
> > rcu_normal_after_boot boot option correct? I prefer the automated
> > option of doing this. So the approach here is not really unprecedented
> > and is much more robust than relying on userspace too much (I am ok
> > with adding your suggestion *on top* of the automated toggle, but I
> > probably would not have ChromeOS use it if the automated way exists).
> > Or did I miss something?
> 
> See this commit:
> 
> 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives")
> 
> Antti provided this commit precisely in order to allow Android devices
> to expedite the boot process and to shut off the expediting at a time of
> Android userspace's choosing.  So Android has been making this work for
> about ten years, which strikes me as an adequate proof of concept.  ;-)

Thanks for the pointer. That's true. Looking at Android sources, I find that
Android Mediatek devices at least are setting rcu_expedited to 1 at a late
stage of their userspace boot (which is weird, it should be set to 1 as early
as possible), and interestingly I cannot find them resetting it back to 0!
Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P

> Of course, Android has a rather tightly controlled userspace, as do
> real-time embedded systems (I sure hope, anyway!).  Which is why your
> timeout-based fallback/backup makes a lot of sense.  And why someone might
> want an aggressive indication when that timeout-based backup is needed.

Or someone designs a system but is unaware of RCU behavior during boot. ;-)

thanks,

 - Joel
  
Paul E. McKenney March 7, 2023, 7:27 p.m. UTC | #19
On Tue, Mar 07, 2023 at 06:54:43PM +0000, Joel Fernandes wrote:
> On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote:
> > On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> > > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > > > >
> > > > > I still don't really like that:
> > > > >
> > > > > 1) It feels like we are curing a symptom for which we don't know the cause.
> > > > >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> > > > >    reporting the wait duration within synchronize_rcu() calls between the end of
> > > > >    the kernel boot and the end of userspace boot may be helpful.
> > > > >
> > > > > 2) The kernel boot was already covered before this patch so this is about
> > > > >    userspace code calling into the kernel. Is that piece of code also called
> > > > >    after the boot? In that case are we missing a conversion from
> > > > >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > > > >    the problem is more general than just boot.
> > > > >
> > > > > This needs to be analyzed first and if it happens that the issue really
> > > > > needs to be fixed with telling the kernel that userspace has completed
> > > > > booting, eg: because the problem is not in a few callsites that need conversion
> > > > > to expedited but instead in the accumulation of lots of calls that should stay
> > > > > as is:
> > > > >
> > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> > > > >    may run right after the boot. Either you choose a value that is too low
> > > > >    and you miss the optimization or the value is too high and you may break
> > > > >    things.
> > > > >
> > > > > 4) This should be fixed the way you did:
> > > > >    a) a kernel parameter like you did
> > > > >    b) The init process (systemd?) tells the kernel when it judges that userspace
> > > > >       has completed booting.
> > > > >    c) Make these interfaces more generic, maybe that information will be useful
> > > > >       outside RCU. For example the kernel parameter should be
> > > > >       "user_booted_reported" and the sysfs (should be sysctl?):
> > > > >       kernel.user_booted = 1
> > > > >    d) But yuck, this means we must know if the init process supports that...
> > > > >
> > > > > For these reasons, let's make sure we know exactly what is going on first.
> > > > >
> > > > > Thanks.
> > > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> > > > parameter that can be used during the boot. For example on our devices
> > > > to speedup a boot we boot the kernel with rcu_expedited:
> > > >
> > > > XQ-DQ54:/ # cat /proc/cmdline
> > > > XQ-DQ54:/ #
> > > >
> > > > then a user space can decides if it is needed or not:
> > > >
> > > > <snip>
> > > > rcu_expedited  rcu_normal
> > > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> > > > XQ-DQ54:/ #
> > > > <snip>
> > > >
> > > > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with
> > > > true or false. So we can follow and be aligned with rcu_expedited and
> > > > rcu_normal parameters.
> > > 
> > > Speaking of aligning, there is also the automated
> > > rcu_normal_after_boot boot option correct? I prefer the automated
> > > option of doing this. So the approach here is not really unprecedented
> > > and is much more robust than relying on userspace too much (I am ok
> > > with adding your suggestion *on top* of the automated toggle, but I
> > > probably would not have ChromeOS use it if the automated way exists).
> > > Or did I miss something?
> > 
> > See this commit:
> > 
> > 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives")
> > 
> > Antti provided this commit precisely in order to allow Android devices
> > to expedite the boot process and to shut off the expediting at a time of
> > Android userspace's choosing.  So Android has been making this work for
> > about ten years, which strikes me as an adequate proof of concept.  ;-)
> 
> Thanks for the pointer. That's true. Looking at Android sources, I find that
> Android Mediatek devices at least are setting rcu_expedited to 1 at late
> stage of their userspace boot (which is weird, it should be set to 1 as early
> as possible), and interestingly I cannot find them resetting it back to 0!.
> Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P

Interesting.  Though this is consistent with Antti's commit log, where
he talks about expediting grace periods but not unexpediting them.

> > Of course, Android has a rather tightly controlled userspace, as do
> > real-time embedded systems (I sure hope, anyway!).  Which is why your
> > timeout-based fallback/backup makes a lot of sense.  And why someone might
> > want an aggressive indication when that timeout-based backup is needed.
> 
> Or someone designs a system but is unaware of RCU behavior during boot. ;-)

RCU is just doing what they told it to!  ;-)

							Thanx, Paul
  
Uladzislau Rezki March 8, 2023, 9:41 a.m. UTC | #20
On Tue, Mar 07, 2023 at 11:27:26AM -0800, Paul E. McKenney wrote:
> On Tue, Mar 07, 2023 at 06:54:43PM +0000, Joel Fernandes wrote:
> > On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> > > > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > > > > I still don't really like that:
> > > > > >
> > > > > > 1) It feels like we are curing a symptom for which we don't know the cause.
> > > > > >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> > > > > >    reporting the wait duration within synchronize_rcu() calls between the end of
> > > > > >    the kernel boot and the end of userspace boot may be helpful.
> > > > > >
> > > > > > 2) The kernel boot was already covered before this patch so this is about
> > > > > >    userspace code calling into the kernel. Is that piece of code also called
> > > > > >    after the boot? In that case are we missing a conversion from
> > > > > >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > > > > >    the problem is more general than just boot.
> > > > > >
> > > > > > This needs to be analyzed first and if it happens that the issue really
> > > > > > needs to be fixed with telling the kernel that userspace has completed
> > > > > > booting, eg: because the problem is not in a few callsites that need conversion
> > > > > > to expedited but instead in the accumulation of lots of calls that should stay
> > > > > > as is:
> > > > > >
> > > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> > > > > >    may run right after the boot. Either you choose a value that is too low
> > > > > >    and you miss the optimization or the value is too high and you may break
> > > > > >    things.
> > > > > >
> > > > > > 4) This should be fixed the way you did:
> > > > > >    a) a kernel parameter like you did
> > > > > >    b) The init process (systemd?) tells the kernel when it judges that userspace
> > > > > >       has completed booting.
> > > > > >    c) Make these interfaces more generic, maybe that information will be useful
> > > > > >       outside RCU. For example the kernel parameter should be
> > > > > >       "user_booted_reported" and the sysfs (should be sysctl?):
> > > > > >       kernel.user_booted = 1
> > > > > >    d) But yuck, this means we must know if the init process supports that...
> > > > > >
> > > > > > For these reasons, let's make sure we know exactly what is going on first.
> > > > > >
> > > > > > Thanks.
> > > > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> > > > > parameter that can be used during the boot. For example on our devices
> > > > > to speedup a boot we boot the kernel with rcu_expedited:
> > > > >
> > > > > XQ-DQ54:/ # cat /proc/cmdline
> > > > > XQ-DQ54:/ #
> > > > >
> > > > > then a user space can decides if it is needed or not:
> > > > >
> > > > > <snip>
> > > > > rcu_expedited  rcu_normal
> > > > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> > > > > XQ-DQ54:/ #
> > > > > <snip>
> > > > >
> > > > > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with
> > > > > true or false. So we can follow and be aligned with rcu_expedited and
> > > > > rcu_normal parameters.
> > > > 
> > > > Speaking of aligning, there is also the automated
> > > > rcu_normal_after_boot boot option correct? I prefer the automated
> > > > option of doing this. So the approach here is not really unprecedented
> > > > and is much more robust than relying on userspace too much (I am ok
> > > > with adding your suggestion *on top* of the automated toggle, but I
> > > > probably would not have ChromeOS use it if the automated way exists).
> > > > Or did I miss something?
> > > 
> > > See this commit:
> > > 
> > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives")
> > > 
> > > Antti provided this commit precisely in order to allow Android devices
> > > to expedite the boot process and to shut off the expediting at a time of
> > > Android userspace's choosing.  So Android has been making this work for
> > > about ten years, which strikes me as an adequate proof of concept.  ;-)
> > 
> > Thanks for the pointer. That's true. Looking at Android sources, I find that
> > Android Mediatek devices at least are setting rcu_expedited to 1 at late
> > stage of their userspace boot (which is weird, it should be set to 1 as early
> > as possible), and interestingly I cannot find them resetting it back to 0!.
> > Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> 
> Interesting.  Though this is consistent with Antti's commit log, where
> he talks about expediting grace periods but not unexpediting them.
> 
Do you think we need to unexpedite it? :))))

--
Uladzislau Rezki
  
Uladzislau Rezki March 8, 2023, 10:14 a.m. UTC | #21
On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > I still don't really like that:
> > >
> > > 1) It feels like we are curing a symptom for which we don't know the cause.
> > >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> > >    reporting the wait duration within synchronize_rcu() calls between the end of
> > >    the kernel boot and the end of userspace boot may be helpful.
> > >
> > > 2) The kernel boot was already covered before this patch so this is about
> > >    userspace code calling into the kernel. Is that piece of code also called
> > >    after the boot? In that case are we missing a conversion from
> > >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > >    the problem is more general than just boot.
> > >
> > > This needs to be analyzed first and if it happens that the issue really
> > > needs to be fixed with telling the kernel that userspace has completed
> > > booting, eg: because the problem is not in a few callsites that need conversion
> > > to expedited but instead in the accumulation of lots of calls that should stay
> > > as is:
> > >
> > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> > >    may run right after the boot. Either you choose a value that is too low
> > >    and you miss the optimization or the value is too high and you may break
> > >    things.
> > >
> > > 4) This should be fixed the way you did:
> > >    a) a kernel parameter like you did
> > >    b) The init process (systemd?) tells the kernel when it judges that userspace
> > >       has completed booting.
> > >    c) Make these interfaces more generic, maybe that information will be useful
> > >       outside RCU. For example the kernel parameter should be
> > >       "user_booted_reported" and the sysfs (should be sysctl?):
> > >       kernel.user_booted = 1
> > >    d) But yuck, this means we must know if the init process supports that...
> > >
> > > For these reasons, let's make sure we know exactly what is going on first.
> > >
> > > Thanks.
> > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> > parameter that can be used during the boot. For example on our devices
> > to speedup a boot we boot the kernel with rcu_expedited:
> >
> > XQ-DQ54:/ # cat /proc/cmdline
> > stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug  msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init  qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001
> > XQ-DQ54:/ #
> >
> > then a user space can decides if it is needed or not:
> >
> > <snip>
> > rcu_expedited  rcu_normal
> > XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> > XQ-DQ54:/ #
> > <snip>
> >
> > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with
> > true or false. So we can follow and be aligned with rcu_expedited and
> > rcu_normal parameters.
> 
> Speaking of aligning, there is also the automated
> rcu_normal_after_boot boot option correct? I prefer the automated
> option of doing this. So the approach here is not really unprecedented
> and is much more robust than relying on userspace too much (I am ok
> with adding your suggestion *on top* of the automated toggle, but I
> probably would not have ChromeOS use it if the automated way exists).
> Or did I miss something?
> 
Judging by the name of the rcu_end_inkernel_boot() function and the place
where it is invoked, we can conclude that it marks the end of kernel boot
and that it happens before the "init" process is run.

Your patch changes that behavior. The end of boot is no longer marked right
after the kernel is up and running but rather after a 15-second timeout,
which at the very least does not match the function name. Apart from that,
it may differ from the behavior others expect, for example some test suites
or smoke tests.

Another thought about an "automated boot complete": from kernel space we do
not know when boot really completes for user space. The kernel can only
detect that its own part is done. In this case user space is the right
candidate to say when it is ready.

For example, on Android boot is complete when the home screen appears.
For Chrome OS I think there is something similar. There must be a boot
complete event in its init scripts or something like that.

These are just my thoughts. I do not really mind, but I also do not see a
strong need for it.
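
To make the "user space says when it is ready" option concrete, here is a
minimal sketch. It assumes the init system has some boot-complete hook from
which a helper like this can be run. The /sys/kernel/rcu_expedited file is
the one from the listing quoted above; the function names and the hook are
hypothetical, and the directory is a parameter only so the helper can be
tried against a scratch directory without root.

```python
import os
import tempfile


def write_knob(path: str, value: int) -> bool:
    """Write an integer to an existing sysfs-style file.

    Returns False instead of raising if the file is missing or not
    writable, so a boot script can carry on when the knob is absent.
    """
    try:
        # No O_CREAT: sysfs files cannot be created, only written.
        fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    except OSError:
        return False
    try:
        os.write(fd, f"{value}\n".encode())
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


def on_userspace_boot_complete(kernel_dir: str = "/sys/kernel") -> bool:
    """Hypothetical hook: stop forcing expedited grace periods."""
    return write_knob(os.path.join(kernel_dir, "rcu_expedited"), 0)


if __name__ == "__main__":
    # Demo against a scratch directory standing in for /sys/kernel.
    with tempfile.TemporaryDirectory() as d:
        knob = os.path.join(d, "rcu_expedited")
        with open(knob, "w") as f:
            f.write("1\n")
        print(on_userspace_boot_complete(d))
        with open(knob) as f:
            print(f.read().strip())
```

On a real system the hook would be called with the default directory and
needs root; whether and when to call it stays a policy decision of the
userspace in question.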

--
Uladzislau Rezki
  
Joel Fernandes March 8, 2023, 1:52 p.m. UTC | #22
> On Mar 7, 2023, at 9:19 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Tue, Mar 07, 2023 at 08:41:17AM -0500, Joel Fernandes wrote:
>>> On Tue, Mar 7, 2023 at 8:01 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>>> 
>>> I still don't really like that:
>>> 
>>> 1) It feels like we are curing a symptom for which we don't know the cause.
>>>   Which RCU write side caller is the source of this slow boot? Some tracepoints
>>>   reporting the wait duration within synchronize_rcu() calls between the end of
>>>   the kernel boot and the end of userspace boot may be helpful.
>> 
>> Just to clarify (and I feel we discussed this recently) -- there is no
>> callback I am aware of right now causing a slow boot. The reason for
>> doing this is we don't have such issues in the future; so it is a
>> protection. Note the repeated call outs to the scsi callback and also
>> the rcu_barrier() issue previously fixed. Further, we already see
>> slight improvements in boot times with disabling lazy during boot (its
>> not much but its there). Yes, we should fix issues instead of hiding
>> them - but we also would like to improve the user experience -- just
>> like we disable lazy and expedited during suspend.
>> 
>> So what is the problem that you really have with this patch even with
>> data showing improvements? I actually wanted a mechanism like this
>> from the beginning and was trying to get Intel to write the patch, but
>> I ended up writing it.
> 
> Let's put it another way: kernel boot is mostly code that won't execute
> again. User boot (or rather the kernel part of it) OTOH is code that is
> subject to be repeated again.
> 
> A lot of the kernel boot code is __init code that will execute only once.
> And there it makes sense to force hurry and expedited because we may easily
> miss something and after all this all happens only once, also there is no
> interference with userspace, etc...
> 
> User boot OTOH use common kernel code: syscalls, signal, files, etc... And that
> code will be called also after the boot.
> 
> So if there is something slowing down user boot, there are some good chances
> that this thing slows down userspace in general.
> 
> Therefore we need to know exactly what's going on because the problem may be
> bigger than what you observe on boot.

Just to add to my previous reply:

One thing to consider is that booting in expedited mode and falling back
to normal later is more of a performance improvement than a bug fix.
Repeated synchronize_rcu() calls can easily add hundreds of milliseconds,
and converting individual calls from the normal API to the expedited API
will not remedy that.

What is needed is for user mode to notify the kernel, or some kind of
timed fallback like I did here. Both of these approaches have their pros
and cons (and so IMO we should probably give an option to do both).
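
As a rough illustration of offering both options at once, here is a toy
model in Python. It is not the kernel code from the patch: the class and
method names are made up, and time is simply counted in milliseconds from
the launch of init. Boot counts as ended at whichever comes first, the
userspace notification or the timeout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootEndModel:
    """Toy model; all times are milliseconds since init was launched."""

    delay_ms: int                         # timed fallback
    notified_at_ms: Optional[int] = None  # set once userspace reports in

    def notify(self, now_ms: int) -> None:
        # Only the first notification counts.
        if self.notified_at_ms is None:
            self.notified_at_ms = now_ms

    def boot_ended(self, now_ms: int) -> bool:
        notified = (self.notified_at_ms is not None
                    and now_ms >= self.notified_at_ms)
        # Whichever happens first ends the boot phase.
        return notified or now_ms >= self.delay_ms


if __name__ == "__main__":
    m = BootEndModel(delay_ms=15_000)  # the 15s default from the patch
    print(m.boot_ended(5_000))         # still booting
    m.notify(8_000)                    # userspace reports it is done
    print(m.boot_ended(8_000))
```

A userspace that never notifies still falls back at the timeout, and one
that does notify is not held back by it.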

Thanks,

- Joel

> 
>> 
>>> 2) The kernel boot was already covered before this patch so this is about
>>>   userspace code calling into the kernel. Is that piece of code also called
>>>   after the boot? In that case are we missing a conversion from
>>>   synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
>>>   the problem is more general than just boot.
>>> 
>>> This needs to be analyzed first and if it happens that the issue really
>>> needs to be fixed with telling the kernel that userspace has completed
>>> booting, eg: because the problem is not in a few callsites that need conversion
>>> to expedited but instead in the accumulation of lots of calls that should stay
>>> as is:
>> 
> >> There is no such callback I am aware of that needs such a conversion
>> and I don't think that will help give any guarantees because there is
>> no preventing someone from adding a callback that synchronously slows
>> boot. The approach here is to put a protection. However, I will do
>> some more investigations into what else may be slowing things as I do
>> hold a lot of weight for your words! :)
> 
> Kernel boot is already handled and userspace boot can not add a new RCU callback.
> 
>> 
>>> 
>>> 3) This arbitrary timeout looks dangerous to me as latency sensitive code
>>>   may run right after the boot. Either you choose a value that is too low
>>>   and you miss the optimization or the value is too high and you may break
>>>   things.
>> 
>> So someone is presenting a timing sensitive workload within 15 seconds
>> of boot? Please provide some evidence of that.
> 
> I have no idea, there are billions of computers running out there, it's a disaster...
> 
>> The only evidence right now is on the plus side even for the RT system.
> 
> Right it's improving the boot of an RT system, doesn't mean it's not breaking
> post boot of others.
> 
>> 
>>> 4) This should be fixed the way you did:
>>>   a) a kernel parameter like you did
>>>   b) The init process (systemd?) tells the kernel when it judges that userspace
>>>      has completed booting.
>>>   c) Make these interfaces more generic, maybe that information will be useful
>>>      outside RCU. For example the kernel parameter should be
>>>      "user_booted_reported" and the sysfs (should be sysctl?):
>>>      kernel.user_booted = 1
>>>   d) But yuck, this means we must know if the init process supports that...
>>> 
>>> For these reasons, let's make sure we know exactly what is going on first.
>> 
>> I can investigate this more and get back to you.
>> 
>> One of the challenges is getting boot tracing working properly.
>> Systems do weird things like turning off tracing during boot and/or
>> clearing trace buffers.
> 
> Just compare the average and total duration of all synchronize_rcu() calls
> (before and after forcing expedited) between launching init and userspace boot
> completion. Sure there will be noise but if a difference can be measured before
> and after your patch, then a difference might be measurable on tracing as
> well... Well of course tracing can induce subtle things... But let's try at
> least, we want to know what we are fixing here.
> 
> Thanks.
>
  
Paul E. McKenney March 8, 2023, 2:45 p.m. UTC | #23
On Wed, Mar 08, 2023 at 10:41:19AM +0100, Uladzislau Rezki wrote:
> On Tue, Mar 07, 2023 at 11:27:26AM -0800, Paul E. McKenney wrote:
> > On Tue, Mar 07, 2023 at 06:54:43PM +0000, Joel Fernandes wrote:
> > > On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> > > > > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > > > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > > > > > > > On many systems, a great deal of boot (in userspace) happens after the
> > > > > > > > kernel thinks the boot has completed. It is difficult to determine if
> > > > > > > > the system has really booted from the kernel side. Some features like
> > > > > > > > lazy-RCU can risk slowing down boot time if, say, a callback has been
> > > > > > > > added that the boot synchronously depends on. Further expedited callbacks
> > > > > > > > can get unexpedited way earlier than it should be, thus slowing down
> > > > > > > > boot (as shown in the data below).
> > > > > > > >
> > > > > > > > For these reasons, this commit adds a config option
> > > > > > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> > > > > > > > Userspace can also make RCU's view of the system as booted, by writing the
> > > > > > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > > > > > > > Or even just writing a value of 0 to this sysfs node.
> > > > > > > > However, under no circumstance will the boot be allowed to end earlier
> > > > > > > > than just before init is launched.
> > > > > > > >
> > > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> > > > > > > > suits ChromeOS and also a PREEMPT_RT system below very well, which need
> > > > > > > > no config or parameter changes, and just a simple application of this patch. A
> > > > > > > > system designer can also choose a specific value here to keep RCU from marking
> > > > > > > > boot completion.  As noted earlier, RCU's perspective of the system as booted
> > > > > > > > will not be marked until at least rcu_boot_end_delay milliseconds have passed
> > > > > > > > or an update is made via writing a small value (or 0) in milliseconds to:
> > > > > > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > > > > > > >
> > > > > > > > One side-effect of this patch is, there is a risk that a real-time workload
> > > > > > > > launched just after the kernel boots will suffer interruptions due to expedited
> > > > > > > > RCU, which previously ended just before init was launched. However, to mitigate
> > > > > > > > such an issue (however unlikely), the user should either tune
> > > > > > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> > > > > > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> > > > > > > > boots, and before launching the real-time workload.
> > > > > > > >
> > > > > > > > Qiuxu also noted impressive boot-time improvements with earlier version
> > > > > > > > of patch. An excerpt from the data he shared:
> > > > > > > >
> > > > > > > > 1) Testing environment:
> > > > > > > >     OS            : CentOS Stream 8 (non-RT OS)
> > > > > > > >     Kernel     : v6.2
> > > > > > > >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> > > > > > > >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > > > > > > >
> > > > > > > > 2) OS boot time definition:
> > > > > > > >     The time from the start of the kernel boot to the shell command line
> > > > > > > >     prompt is shown from the console. [ Different people may have
> > > > > > > >     different OS boot time definitions. ]
> > > > > > > >
> > > > > > > > 3) Measurement method (very rough method):
> > > > > > > >     A timer in the kernel periodically prints the boot time every 100ms.
> > > > > > > >     As soon as the shell command line prompt is shown from the console,
> > > > > > > >     we record the boot time printed by the timer, then the printed boot
> > > > > > > >     time is the OS boot time.
> > > > > > > >
> > > > > > > > 4) Measured OS boot time (in seconds)
> > > > > > > >    a) Measured 10 times w/o this patch:
> > > > > > > >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> > > > > > > >         The average OS boot time was: ~8.7s
> > > > > > > >
> > > > > > > >    b) Measure 10 times w/ this patch:
> > > > > > > >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> > > > > > > >         The average OS boot time was: ~8.3s.
> > > > > > > >
> > > > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > >
> > > > > > > I still don't really like that:
> > > > > > >
> > > > > > > 1) It feels like we are curing a symptom for which we don't know the cause.
> > > > > > >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> > > > > > >    reporting the wait duration within synchronize_rcu() calls between the end of
> > > > > > >    the kernel boot and the end of userspace boot may be helpful.
> > > > > > >
> > > > > > > 2) The kernel boot was already covered before this patch so this is about
> > > > > > >    userspace code calling into the kernel. Is that piece of code also called
> > > > > > >    after the boot? In that case are we missing a conversion from
> > > > > > >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > > > > > >    the problem is more general than just boot.
> > > > > > >
> > > > > > > This needs to be analyzed first and if it happens that the issue really
> > > > > > > needs to be fixed with telling the kernel that userspace has completed
> > > > > > > booting, eg: because the problem is not in a few callsites that need conversion
> > > > > > > to expedited but instead in the accumulation of lots of calls that should stay
> > > > > > > as is:
> > > > > > >
> > > > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> > > > > > >    may run right after the boot. Either you choose a value that is too low
> > > > > > >    and you miss the optimization or the value is too high and you may break
> > > > > > >    things.
> > > > > > >
> > > > > > > 4) This should be fixed the way you did:
> > > > > > >    a) a kernel parameter like you did
> > > > > > >    b) The init process (systemd?) tells the kernel when it judges that userspace
> > > > > > >       has completed booting.
> > > > > > >    c) Make these interfaces more generic, maybe that information will be useful
> > > > > > >       outside RCU. For example the kernel parameter should be
> > > > > > >       "user_booted_reported" and the sysfs (should be sysctl?):
> > > > > > >       kernel.user_booted = 1
> > > > > > >    d) But yuck, this means we must know if the init process supports that...
> > > > > > >
> > > > > > > For these reasons, let's make sure we know exactly what is going on first.
> > > > > > >
> > > > > > > Thanks.
> > > > > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> > > > > > parameter that can be used during the boot. For example on our devices
> > > > > > to speedup a boot we boot the kernel with rcu_expedited:
> > > > > >
> > > > > > XQ-DQ54:/ # cat /proc/cmdline
> > > > > > XQ-DQ54:/ #
> > > > > >
> > > > > > then user space can decide whether it is needed or not:
> > > > > >
> > > > > > <snip>
> > > > > > rcu_expedited  rcu_normal
> > > > > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> > > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> > > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> > > > > > XQ-DQ54:/ #
> > > > > > <snip>
> > > > > >
> > > > > > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with
> > > > > > true or false. So we can follow and be aligned with rcu_expedited and
> > > > > > rcu_normal parameters.
> > > > > 
> > > > > Speaking of aligning, there is also the automated
> > > > > rcu_normal_after_boot boot option correct? I prefer the automated
> > > > > option of doing this. So the approach here is not really unprecedented
> > > > > and is much more robust than relying on userspace too much (I am ok
> > > > > with adding your suggestion *on top* of the automated toggle, but I
> > > > > probably would not have ChromeOS use it if the automated way exists).
> > > > > Or did I miss something?
> > > > 
> > > > See this commit:
> > > > 
> > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives")
> > > > 
> > > > Antti provided this commit precisely in order to allow Android devices
> > > > to expedite the boot process and to shut off the expediting at a time of
> > > > Android userspace's choosing.  So Android has been making this work for
> > > > about ten years, which strikes me as an adequate proof of concept.  ;-)
> > > 
> > > Thanks for the pointer. That's true. Looking at Android sources, I find that
> > > Android Mediatek devices at least are setting rcu_expedited to 1 at late
> > > stage of their userspace boot (which is weird, it should be set to 1 as early
> > > as possible), and interestingly I cannot find them resetting it back to 0!.
> > > Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > 
> > Interesting.  Though this is consistent with Antti's commit log, where
> > he talks about expediting grace periods but not unexpediting them.
> > 
> Do you think we need to unexpedite it? :))))

Android runs on smallish systems, so quite possibly not!

							Thanx, Paul
  
Frederic Weisbecker March 8, 2023, 3:01 p.m. UTC | #24
On Wed, Mar 08, 2023 at 05:52:50AM -0800, Joel Fernandes wrote:
> Just to add to previous reply:
> 
> One thing to consider is that booting in expedited mode and falling back
> to normal later is more of a performance improvement than a bug fix.
> Repeated synchronize_rcu() calls can easily add hundreds of milliseconds,
> and converting a call from the normal API to the expedited API will not
> remedy that.

2 things to consider:

1) Is this about specific calls to synchronize_rcu() that repeat a lot
   and thus create such measurable impact? If so the specific callsites should
   be considered for a conversion.

2) Is it about lots of different calls to synchronize_rcu() that gather a big
   noise? Then the solution is different.
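
A rough way to tell the two cases apart, once per-call wait times are available, is to aggregate them by callsite and look at how concentrated the total is. The sketch below assumes the data has already been reduced to (callsite, wait in ms) pairs by whatever tracing is used; that input format and the function name are hypothetical.

```python
from collections import defaultdict


def summarize_waits(samples):
    """Aggregate synchronize_rcu() wait times per callsite.

    samples: iterable of (callsite, wait_ms) pairs.
    Returns a list of (callsite, calls, total_ms, share_of_total),
    sorted with the most expensive callsite first.
    """
    totals = defaultdict(lambda: [0, 0.0])  # callsite -> [calls, total_ms]
    for callsite, wait_ms in samples:
        entry = totals[callsite]
        entry[0] += 1
        entry[1] += wait_ms

    grand_total = sum(total for _, total in totals.values())
    rows = [
        (callsite, calls, total, total / grand_total if grand_total else 0.0)
        for callsite, (calls, total) in totals.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

If one or two callsites hold most of the share, that points at case 1); a long flat tail points at case 2).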

Again without proper analysis, what do we know?

Thanks.
  
Joel Fernandes March 8, 2023, 3:09 p.m. UTC | #25
> On Mar 8, 2023, at 7:01 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Wed, Mar 08, 2023 at 05:52:50AM -0800, Joel Fernandes wrote:
>> Just to add to previous reply:
>> 
>> One thing to consider is that booting in expedited mode and falling back
>> to normal later is more of a performance improvement than a bug fix.
>> Repeated synchronize_rcu() calls can easily add hundreds of milliseconds,
>> and converting a call from the normal API to the expedited API will not
>> remedy that.
> 
> 2 things to consider:
> 
> 1) Is this about specific calls to synchronize_rcu() that repeat a lot
>   and thus create such measurable impact? If so the specific callsites should
>   be considered for a conversion.
> 
> 2) Is it about lots of different calls to synchronize_rcu() that gather a big
>   noise? Then the solution is different.
> 
> Again without proper analysis, what do we know?

Again, no one disputed that proper analysis is needed. That is obvious. I was just responding to your assumption that if boot is slow, user space will also be slow. That is not a good conclusion to draw because there are many factors. Slowness at boot may be considered a bug, but slowness after boot may not be (say, if the user cares more about power later).

On my side I am planning to dig deeper into our boot process, but it will take time. I hope Qiuxu can do the boot analysis on his side.

Thanks.

> 
> Thanks.
  
Uladzislau Rezki March 9, 2023, 12:57 p.m. UTC | #26
On Wed, Mar 08, 2023 at 06:45:28AM -0800, Paul E. McKenney wrote:
> On Wed, Mar 08, 2023 at 10:41:19AM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 07, 2023 at 11:27:26AM -0800, Paul E. McKenney wrote:
> > > On Tue, Mar 07, 2023 at 06:54:43PM +0000, Joel Fernandes wrote:
> > > > On Tue, Mar 07, 2023 at 09:33:13AM -0800, Paul E. McKenney wrote:
> > > > > On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> > > > > > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > > > > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > > > > > > > > On many systems, a great deal of boot (in userspace) happens after the
> > > > > > > > > kernel thinks the boot has completed. It is difficult to determine if
> > > > > > > > > the system has really booted from the kernel side. Some features like
> > > > > > > > > lazy-RCU can risk slowing down boot time if, say, a callback has been
> > > > > > > > > added that the boot synchronously depends on. Further expedited callbacks
> > > > > > > > > can get unexpedited way earlier than it should be, thus slowing down
> > > > > > > > > boot (as shown in the data below).
> > > > > > > > >
> > > > > > > > > For these reasons, this commit adds a config option
> > > > > > > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> > > > > > > > > Userspace can also make RCU's view of the system as booted, by writing the
> > > > > > > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > > > > > > > > Or even just writing a value of 0 to this sysfs node.
> > > > > > > > > However, under no circumstance will the boot be allowed to end earlier
> > > > > > > > > than just before init is launched.
> > > > > > > > >
> > > > > > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> > > > > > > > > suits ChromeOS and also a PREEMPT_RT system below very well, which need
> > > > > > > > > no config or parameter changes, and just a simple application of this patch. A
> > > > > > > > > system designer can also choose a specific value here to keep RCU from marking
> > > > > > > > > boot completion.  As noted earlier, RCU's perspective of the system as booted
> > > > > > > > > will not be marked until at least rcu_boot_end_delay milliseconds have passed
> > > > > > > > > or an update is made via writing a small value (or 0) in milliseconds to:
> > > > > > > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > > > > > > > >
> > > > > > > > > One side-effect of this patch is, there is a risk that a real-time workload
> > > > > > > > > launched just after the kernel boots will suffer interruptions due to expedited
> > > > > > > > > RCU, which previously ended just before init was launched. However, to mitigate
> > > > > > > > > such an issue (however unlikely), the user should either tune
> > > > > > > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> > > > > > > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> > > > > > > > > boots, and before launching the real-time workload.
> > > > > > > > >
> > > > > > > > > Qiuxu also noted impressive boot-time improvements with earlier version
> > > > > > > > > of patch. An excerpt from the data he shared:
> > > > > > > > >
> > > > > > > > > 1) Testing environment:
> > > > > > > > >     OS            : CentOS Stream 8 (non-RT OS)
> > > > > > > > >     Kernel     : v6.2
> > > > > > > > >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> > > > > > > > >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > > > > > > > >
> > > > > > > > > 2) OS boot time definition:
> > > > > > > > >     The time from the start of the kernel boot to the shell command line
> > > > > > > > >     prompt is shown from the console. [ Different people may have
> > > > > > > > >     different OS boot time definitions. ]
> > > > > > > > >
> > > > > > > > > 3) Measurement method (very rough method):
> > > > > > > > >     A timer in the kernel periodically prints the boot time every 100ms.
> > > > > > > > >     As soon as the shell command line prompt is shown from the console,
> > > > > > > > >     we record the boot time printed by the timer, then the printed boot
> > > > > > > > >     time is the OS boot time.
> > > > > > > > >
> > > > > > > > > 4) Measured OS boot time (in seconds)
> > > > > > > > >    a) Measured 10 times w/o this patch:
> > > > > > > > >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> > > > > > > > >         The average OS boot time was: ~8.7s
> > > > > > > > >
> > > > > > > > >    b) Measure 10 times w/ this patch:
> > > > > > > > >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> > > > > > > > >         The average OS boot time was: ~8.3s.
> > > > > > > > >
> > > > > > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > > >
> > > > > > > > I still don't really like that:
> > > > > > > >
> > > > > > > > 1) It feels like we are curing a symptom for which we don't know the cause.
> > > > > > > >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> > > > > > > >    reporting the wait duration within synchronize_rcu() calls between the end of
> > > > > > > >    the kernel boot and the end of userspace boot may be helpful.
> > > > > > > >
> > > > > > > > 2) The kernel boot was already covered before this patch so this is about
> > > > > > > >    userspace code calling into the kernel. Is that piece of code also called
> > > > > > > >    after the boot? In that case are we missing a conversion from
> > > > > > > >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > > > > > > >    the problem is more general than just boot.
> > > > > > > >
> > > > > > > > This needs to be analyzed first and if it happens that the issue really
> > > > > > > > needs to be fixed with telling the kernel that userspace has completed
> > > > > > > > booting, eg: because the problem is not in a few callsites that need conversion
> > > > > > > > to expedited but instead in the accumulation of lots of calls that should stay
> > > > > > > > as is:
> > > > > > > >
> > > > > > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> > > > > > > >    may run right after the boot. Either you choose a value that is too low
> > > > > > > >    and you miss the optimization or the value is too high and you may break
> > > > > > > >    things.
> > > > > > > >
> > > > > > > > 4) This should be fixed the way you did:
> > > > > > > >    a) a kernel parameter like you did
> > > > > > > >    b) The init process (systemd?) tells the kernel when it judges that userspace
> > > > > > > >       has completed booting.
> > > > > > > >    c) Make these interfaces more generic, maybe that information will be useful
> > > > > > > >       outside RCU. For example the kernel parameter should be
> > > > > > > >       "user_booted_reported" and the sysfs (should be sysctl?):
> > > > > > > >       kernel.user_booted = 1
> > > > > > > >    d) But yuck, this means we must know if the init process supports that...
> > > > > > > >
> > > > > > > > For these reasons, let's make sure we know exactly what is going on first.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> > > > > > > parameter that can be used during the boot. For example on our devices
> > > > > > > to speedup a boot we boot the kernel with rcu_expedited:
> > > > > > >
> > > > > > > XQ-DQ54:/ # cat /proc/cmdline
> > > > > > > XQ-DQ54:/ #
> > > > > > >
> > > > > > > then user space can decide whether it is needed or not:
> > > > > > >
> > > > > > > <snip>
> > > > > > > rcu_expedited  rcu_normal
> > > > > > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> > > > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> > > > > > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> > > > > > > XQ-DQ54:/ #
> > > > > > > <snip>
> > > > > > >
> > > > > > > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with
> > > > > > > true or false. So we can follow and be aligned with rcu_expedited and
> > > > > > > rcu_normal parameters.
> > > > > > 
> > > > > > Speaking of aligning, there is also the automated
> > > > > > rcu_normal_after_boot boot option correct? I prefer the automated
> > > > > > option of doing this. So the approach here is not really unprecedented
> > > > > > and is much more robust than relying on userspace too much (I am ok
> > > > > > with adding your suggestion *on top* of the automated toggle, but I
> > > > > > probably would not have ChromeOS use it if the automated way exists).
> > > > > > Or did I miss something?
> > > > > 
> > > > > See this commit:
> > > > > 
> > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of expedited RCU primitives")
> > > > > 
> > > > > Antti provided this commit precisely in order to allow Android devices
> > > > > to expedite the boot process and to shut off the expediting at a time of
> > > > > Android userspace's choosing.  So Android has been making this work for
> > > > > about ten years, which strikes me as an adequate proof of concept.  ;-)
> > > > 
> > > > Thanks for the pointer. That's true. Looking at Android sources, I find that
> > > > Android Mediatek devices at least are setting rcu_expedited to 1 at late
> > > > stage of their userspace boot (which is weird, it should be set to 1 as early
> > > > as possible), and interestingly I cannot find them resetting it back to 0!.
> > > > Maybe they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > 
> > > Interesting.  Though this is consistent with Antti's commit log, where
> > > he talks about expediting grace periods but not unexpediting them.
> > > 
> > Do you think we need to unexpedite it? :))))
> 
> Android runs on smallish systems, so quite possibly not!
> 
We keep it enabled and never unexpedite it. The reason is performance.
I have done some app-launch time analysis with it enabled and disabled.

The expedited case is much better when it comes to app launch time. It
takes ~25% less time to run an app compared with the unexpedited variant.
So we have a big gain here.

--
Uladzislau Rezki
  
Qiuxu Zhuo March 9, 2023, 3:17 p.m. UTC | #27
> From: Paul E. McKenney <paulmck@kernel.org>
> [...]
> >
> > a's standard deviation is ~0.4.
> > b's standard deviation is ~0.5.
> >
> > a's average 9.0 is at the upper bound of b's standard deviation range [8.0, 9.0].
> > So, the measurements should be statistically significant to some degree.
> 
> That single standard deviation means that you have 68% confidence that the
> difference is real.  This is not far above the 50% level of random noise.
> 95% is the lowest level that is normally considered to be statistically
> significant.

95% means there is no overlap between two standard deviations of a
and two standard deviations of b.

This relies on either much less noise during testing or a big enough 
difference between a and b. 
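
As a quick illustration of that rule of thumb (it is only a rough overlap check, not a formal significance test), here is a sketch that computes mean +/- 2 standard deviations for two samples and reports whether the ranges overlap. The numbers are the ten-boot samples from the commit message, not the newer 48-boot data, and the function names are made up for the example.

```python
import statistics

# Boot times in seconds, copied from the commit message in this thread.
WITHOUT_PATCH = [8.7, 8.4, 8.6, 8.2, 9.0, 8.7, 8.8, 9.3, 8.8, 8.3]
WITH_PATCH = [8.5, 8.2, 7.6, 8.2, 8.7, 8.2, 7.8, 8.2, 9.3, 8.4]


def sigma_interval(samples, k=2.0):
    """Return (mean - k*sd, mean + k*sd) using the sample standard deviation."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    return mean - k * sd, mean + k * sd


def intervals_overlap(first, second):
    """True if the closed intervals (lo, hi) share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


if __name__ == "__main__":
    a = sigma_interval(WITHOUT_PATCH)
    b = sigma_interval(WITH_PATCH)
    print("without patch:", a)
    print("with patch:   ", b)
    print("overlap:", intervals_overlap(a, b))
```

For these ten-boot samples the two ranges do overlap, which is the "too much noise" situation described above.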

> > The calculated standard deviations are via:
> > https://www.gigacalculator.com/calculators/standard-deviation-calculator.php
> 
> Fair enough.  Formulas are readily available as well, and most spreadsheets
> support standard deviation.
> 
> [...]
>
> > > Why don't you try applying this approach to the new data?  You will
> > > need the general binomial formula.
> >
> >    Thank you Paul for the suggestion.
> >    I just tried it, but not sure whether my analysis was correct ...
> >
> >    Analysis 1:
> >    a's median is 8.9.
> 
> I get 8.95, which is the average of the 24th and 25th members of a in
> numerical order.

Yes, it should be 8.95. Thanks for correcting me. 

> >    35/48 of b's data points are below a's median minus 0.1.
> >    For a's binomial distribution P(X >= 35) = 0.1%, where p=0.5.
> >    So, we have strong confidence that b is 100ms faster than a.
> 
> I of course get quite a bit stronger confidence, but your 99.9% is good
> enough.  And I get even stronger confidence going in the other direction.
> However, the fact that a's median varies from 8.7 in the old experiment to
> 8.95 in this experiment does give some pause.  These are after all supposedly
> drawn from the same distribution.  Or did you use a different machine or
> different OS version or some such in the two sets of measurements?
> Different time of day and thus different ambient temperature, thus different
> CPU clock frequency?

All the testing setups were identical except for the testing time. 

      Old a median   : 8.7
      New a median : 8.95

      Old b median   : 8.2
      New b median : 8.45

I'm a bit surprised that both new medians are exactly 0.25 greater than
the old medians.  Coincidence?

> Assuming identical test setups, let's try the old value of 8.7 from old a to new
> b.  There are 14 elements in new b greater than 8.6, for a probability of
> 0.17%, or about 98.3% significance.  This is still OK.
> 
> In contrast, the median of the old b is 8.2, which gives extreme confidence.
> So let's be conservative and use the large-set median.
> 
> In real life, additional procedures would be needed to estimate the
> confidence in the median, which turns out to be nontrivial.  When I apply

Luckily, I could just simply pick up the medians in numerical order in this case. ;-)

> this sort of technique, I usually have all data from each sample being on one
> side of the median of the other, which simplifies things.  ;-)

I like it when all data points are on one side of the median of the other ;-)

But this also relies on either much less noise during testing or a big enough 
difference between a and b, right?

> The easiest way to estimate bounds on the median is to "bootstrap", but that
> works best if you have 1000 samples and can randomly draw 1000 sub-
> samples each of size 10 from the larger sample and compute the median of
> each.  You can sort these medians and obtain a cumulative distribution.

Good to know about "bootstrap".
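
For my own notes, a minimal sketch of the idea. Note that this is the common resample-with-replacement variant applied to a small sample, not the exact "1000 samples, 1000 sub-samples of size 10" procedure described above, and the function name and defaults are my assumptions.

```python
import random
import statistics


def bootstrap_median_interval(samples, resamples=1000, confidence=0.95, seed=0):
    """Estimate a confidence interval for the median by bootstrapping.

    Draws `resamples` resamples (with replacement, same size as `samples`),
    computes the median of each, sorts the medians, and reads the interval
    bounds off the resulting empirical distribution.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    medians = sorted(
        statistics.median(rng.choices(samples, k=len(samples)))
        for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = medians[int(tail * resamples)]
    upper = medians[min(resamples - 1, int((1.0 - tail) * resamples))]
    return lower, upper


if __name__ == "__main__":
    # Ten boot times (seconds) without the patch, from the commit message.
    without_patch = [8.7, 8.4, 8.6, 8.2, 9.0, 8.7, 8.8, 9.3, 8.8, 8.3]
    print(bootstrap_median_interval(without_patch))
```

With only ten points the interval will be wide, which matches the remark that this works best with far more samples.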

> But you have to have an extremely good reason to collect data from 1000
> boots, and I don't believe we have that good of a reason.
>

1000 boots, Oh my ...
No. No. I don't have a good reason for that ;-)

> >    Analysis 2:
> >    a's median - 0.4 = 8.9 - 0.4 = 8.5.
> >    24/48 of b's data points are below a's median minus 0.4.
> >    The probability that a's data points are less than 8.5 is
> >    p = 7/48 = 0.1458
> This is only 85.4% significant, so...
> 
> >    For a's binomial distribution P(X >= 24) = 0.0%, where p=0.1458.
> >    So, looks like we have confidence that b is 400ms faster than a.
> 
> ...we really cannot say anything about 400ms faster.  Again, you need 95%
> and preferably 99% to really make any sort of claim.  You probably need
> quite a few more samples to say much about 200ms, let alone 400ms.

OK. Thanks for correcting me. 
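
For reference, the cumulative binomial probability P(X >= k) used in both analyses can be sketched as below. Since the 48-boot data is not reproduced in this mail, the example applies it to the ten-boot samples from the commit message: 8 of the 10 with-patch boots are below the without-patch median of 8.7, and under the "no difference" assumption (p = 0.5) that has a probability of 56/1024, about 5.5%, so the small data set alone does not reach 95%. The helper name is my own.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Ten boot times (seconds) with the patch, from the commit message.
WITH_PATCH = [8.5, 8.2, 7.6, 8.2, 8.7, 8.2, 7.8, 8.2, 9.3, 8.4]
MEDIAN_WITHOUT_PATCH = 8.7  # "old a median" quoted in this thread

if __name__ == "__main__":
    below = sum(1 for t in WITH_PATCH if t < MEDIAN_WITHOUT_PATCH)
    print(below, "of", len(WITH_PATCH), "boots below the median")
    print("P(X >= %d) = %.4f" % (below, binomial_tail(below, len(WITH_PATCH))))
```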

> 
> Plus, you really should select the speedup and only then take the
> measurements.  Otherwise, you end up fitting noise.
> 
> However, assuming identical tests setups, you really can calculate the median
> from the full data set.
> 
> >    The calculated cumulative binomial distributions P(X) is via:
> >
> > https://www.gigacalculator.com/calculators/binomial-probability-calculator.php
> 
> The maxima program's binomial() function agrees with it, so good.  ;-)
> 
> >    I apologize if this analysis/discussion bored some of you. ;-)
> 
> Let's just say that it is a lot simpler when you are measuring larger
> differences in data with tighter distributions.  Me, I usually just say "no" to
> drawing any sort of conclusion from data sets that overlap this much.
> Instead, I might check to see if there are some random events adding noise to
> the boot duration, eliminate that, and hopefully get data that is easier to
> analyze.

Agree. 

> But I am good with the 98.3% confidence in a 100ms improvement.
> 
> So if Joel wishes to make this point, he should feel free to take both of your
> datasets and use the computation with the worse mean.

Thank you so much Paul for your patience and detailed comments. 

-Qiuxu
  
Paul E. McKenney March 9, 2023, 9:53 p.m. UTC | #28
On Thu, Mar 09, 2023 at 03:17:09PM +0000, Zhuo, Qiuxu wrote:
> > From: Paul E. McKenney <paulmck@kernel.org>
> > [...]
> > >
> > > a's standard deviation is ~0.4.
> > > b's standard deviation is ~0.5.
> > >
> > > > a's average 9.0 is at the upper bound of b's standard deviation range [8.0, 9.0].
> > > So, the measurements should be statistically significant to some degree.
> > 
> > That single standard deviation means that you have 68% confidence that the
> > > difference is real.  This is not far above the 50% level of random noise.
> > 95% is the lowest level that is normally considered to be statistically
> > significant.
> 
> 95% means there is no overlap between two standard deviations of a
> and two standard deviations of b.
> 
> This relies on either much less noise during testing or a big enough 
> difference between a and b. 
> 
> > > The calculated standard deviations are via:
> > > https://www.gigacalculator.com/calculators/standard-deviation-calculat
> > > or.php
> > 
> > Fair enough.  Formulas are readily available as well, and most spreadsheets
> > support standard deviation.
> > 
> > [...]
> >
> > > > Why don't you try applying this approach to the new data?  You will
> > > > need the general binomial formula.
> > >
> > >    Thank you Paul for the suggestion.
> > >    I just tried it, but not sure whether my analysis was correct ...
> > >
> > >    Analysis 1:
> > >    a's median is 8.9.
> > 
> > I get 8.95, which is the average of the 24th and 25th members of a in
> > numerical order.
> 
> Yes, it should be 8.95. Thanks for correcting me. 
> 
> > >    35/48 b's data points are less than 0.1 less than a's median.
> > >    For a's binomial distribution P(X >= 35) = 0.1%, where p=0.5.
> > >    So, we have strong confidence that b is 100ms faster than a.
> > 
> > I of course get quite a bit stronger confidence, but your 99.9% is good
> > enough.  And I get even stronger confidence going in the other direction.
> > However, the fact that a's median varies from 8.7 in the old experiment to
> > 8.95 in this experiment does give some pause.  These are after all supposedly
> > drawn from the same distribution.  Or did you use a different machine or
> > different OS version or some such in the two sets of measurements?
> > Different time of day and thus different ambient temperature, thus different
> > CPU clock frequency?
> 
> All the testing setups were identical except for the testing time. 
> 
>       Old a median   : 8.7
>       New a median : 8.95
> 
>       Old b median   : 8.2
>       New b median : 8.45
> 
> I'm a bit surprised that both new medians are exactly 0.25 greater than
> the old medians.  Coincidence?

Possibly some semi-rare race condition makes boot take longer, and 48
boots have a higher probability of getting more of them?  But without
analyzing the boot sequence, your guess is as good as mine.

> > Assuming identical test setups, let's try the old value of 8.7 from old a to new
> > b.  There are 14 elements in new b greater than 8.6, for a probability of
> > 0.17%, or about 98.3% significance.  This is still OK.
> > 
> > In contrast, the median of the old b is 8.2, which gives extreme confidence.
> > So let's be conservative and use the large-set median.
> > 
> > In real life, additional procedures would be needed to estimate the
> > confidence in the median, which turns out to be nontrivial.  When I apply
> 
> Luckily, I could just simply pick up the medians in numerical order in this case. ;-)
> 
> > this sort of technique, I usually have all data from each sample being on one
> > side of the median of the other, which simplifies things.  ;-)
> 
> I like having all data points on one side of the median of the other ;-)
> 
> But this also relies on either much less noise during testing or a big enough 
> difference between a and b, right?

Yes, life is indeed *much* easier when there is less noise or larger
differences.  ;-)

> > The easiest way to estimate bounds on the median is to "bootstrap", but that
> > works best if you have 1000 samples and can randomly draw 1000 sub-
> > samples each of size 10 from the larger sample and compute the median of
> > each.  You can sort these medians and obtain a cumulative distribution.
> 
> Good to know "bootstrap".
> 
> > But you have to have an extremely good reason to collect data from 1000
> > boots, and I don't believe we have that good of a reason.
> >
> 
> 1000 boots, Oh my ...
> No. No. I don't have a good reason for that ;-)
> 
> > >    Analysis 2:
> > >    a's median - 0.4 = 8.9 - 0.4 = 8.5.
> > >    24/48 b's data points are less than 0.4 less than a's median.
> > >    The probability that a's data points are less than 8.5 is p = 7/48
> > > = 0.1458
> > This is only 85.4% significant, so...
> > 
> > >    For a's binomial distribution P(X >= 24) = 0.0%, where p=0.1458.
> > >    So, looks like we have confidence that b is 400ms faster than a.
> > 
> > ...we really cannot say anything about 400ms faster.  Again, you need 95%
> > and preferably 99% to really make any sort of claim.  You probably need
> > quite a few more samples to say much about 200ms, let alone 400ms.
> 
> OK. Thanks for correcting me. 
> 
> > Plus, you really should select the speedup and only then take the
> > measurements.  Otherwise, you end up fitting noise.
> > 
> > However, assuming identical tests setups, you really can calculate the median
> > from the full data set.
> > 
> > >    The calculated cumulative binomial distributions P(X) is via:
> > >
> > > https://www.gigacalculator.com/calculators/binomial-probability-calcul
> > > ator.php
> > 
> > The maxima program's binomial() function agrees with it, so good.  ;-)
> > 
> > >    I apologize if this analysis/discussion bored some of you. ;-)
> > 
> > Let's just say that it is a lot simpler when you are measuring larger
> > differences in data with tighter distributions.  Me, I usually just say "no" to
> > drawing any sort of conclusion from data sets that overlap this much.
> > Instead, I might check to see if there is some random events adding noise to
> > the boot duration, eliminate that, and hopefully get data that is easier to
> > analyze.
> 
> Agree. 
> 
> > But I am good with the 98.3% confidence in a 100ms improvement.
> > 
> > So if Joel wishes to make this point, he should feel free to take both of your
> > datasets and use the computation with the worse mean.
> 
> Thank you so much Paul for your patience and detailed comments.

And thank you for bearing with me.

							Thanx, Paul
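
The sub-sampling procedure described above (draw many sub-samples, take the median of each, sort the medians) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the default of 1000 draws of size 10, and the 95% read-out are hypothetical choices modeled on the description in the thread, and no measured boot data is included.

```python
import random
from statistics import median


def subsample_medians(samples, draws=1000, size=10, seed=0):
    """Sorted medians of `draws` random sub-samples, each of `size` elements."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return sorted(median(rng.sample(samples, size)) for _ in range(draws))


def percentile_bounds(sorted_medians, confidence=0.95):
    """Read lower and upper bounds off the sorted (cumulative) medians."""
    tail = (1.0 - confidence) / 2.0
    n = len(sorted_medians)
    lower = sorted_medians[int(tail * n)]
    upper = sorted_medians[int((1.0 - tail) * n) - 1]
    return lower, upper


# Usage, with boot_times being a list of measured boot durations:
#   medians = subsample_medians(boot_times)
#   low, high = percentile_bounds(medians)
```

The sorted list of medians is the cumulative distribution mentioned above; the bounds are simply read off at the chosen percentiles.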
  
Joel Fernandes March 9, 2023, 10:10 p.m. UTC | #29
On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
[..]
> > > > > > See this commit:
> > > > > > 
> > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > > expedited RCU primitives")
> > > > > > 
> > > > > > Antti provided this commit precisely in order to allow Android
> > > > > > devices to expedite the boot process and to shut off the
> > > > > > expediting at a time of Android userspace's choosing.  So Android
> > > > > > has been making this work for about ten years, which strikes me
> > > > > > as an adequate proof of concept.  ;-)
> > > > > 
> > > > > Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > find that Android Mediatek devices at least are setting
> > > > > rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > weird, it should be set to 1 as early as possible), and
> > > > > interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > 
> > > > Interesting.  Though this is consistent with Antti's commit log,
> > > > where he talks about expediting grace periods but not unexpediting
> > > > them.
> > > > 
> > > Do you think we need to unexpedite it? :))))
> > 
> > Android runs on smallish systems, so quite possibly not!
> > 
> We keep it enabled and never unexpedite it. The reason is a performance.  I
> have done some app-launch time analysis with enabling and disabling of it.
> 
> An expedited case is much better when it comes to app launch time. It
> requires ~25% less time to run an app comparing with unexpedited variant.
> So we have a big gain here.

Wow, that's huge. I wonder if you can dig deeper and find out why that is
so, as the callers may then need to use synchronize_rcu_expedited(), and the
same delay could be slowing down other usecases! I find it hard to believe
that real-time workloads will run better without always-expedited grace
periods if expediting actually gives back 25% in performance!

thanks,

 - Joel
  
Akira Yokosawa March 10, 2023, 12:11 a.m. UTC | #30
Hi,

Let me chime in on this interesting thread.

On Thu, 9 Mar 2023 13:53:39 -0800, Paul E. McKenney wrote:
> On Thu, Mar 09, 2023 at 03:17:09PM +0000, Zhuo, Qiuxu wrote:
>> > From: Paul E. McKenney <paulmck@kernel.org>
>> > [...]
>> > >
>> > > a's standard deviation is ~0.4.
>> > > b's standard deviation is ~0.5.
>> > >
>> > > a's average 9.0 is at the upbound of the standard deviation of b's [8.0, 9].
>> > > So, the measurements should be statistically significant to some degree.
>> > 
>> > That single standard deviation means that you have 68% confidence that the
>> > difference is real.  This is not far above the 50% level of random noise.
>> > 95% is the lowest level that is normally considered to be statistically
>> > significant.
>> 
>> 95% means there is no overlap between two standard deviations of a
>> and two standard deviations of b.
>> 
>> This relies on either much less noise during testing or a big enough 
>> difference between a and b. 

Appended is a histogram comparing 2 data sets.

As you see, the one with v2 patch is far from a normal distribution.
I think there are at least two peaks.
The one at the right around 9.7 does not seem to be affected by the patch.
In such a case, average and standard deviation of all the data don't
tell much.

It is hard to say anything for sure with such a small set of samples.
And the shape of the plot is likely to be highly dependent on machine
setups.

Hope this helps.

        Thanks, Akira

>> 
[...]
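
A quick way to look for multiple peaks of the kind described above is a text histogram. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the bin width of 0.1 matches the 100ms resolution of the measurements discussed in the thread, and the demo values are made up purely to show two clusters. They are not the data from the thread.

```python
from collections import Counter


def text_histogram(samples, bin_width=0.1):
    """Return (bin_center, count) pairs sorted by bin center."""
    counts = Counter(round(s / bin_width) for s in samples)
    return [(round(k * bin_width, 10), counts[k]) for k in sorted(counts)]


def render(rows):
    """Format histogram rows as one line of '#' marks per bin."""
    return "\n".join(f"{center:5.1f} | {'#' * n}" for center, n in rows)


# Made-up values, chosen only to show two separate clusters.
demo = [8.2, 8.3, 8.3, 8.3, 8.4, 9.6, 9.7, 9.7, 9.7]
print(render(text_histogram(demo)))
```

With two clusters like these, the overall mean falls in the gap between them, which is why the average and standard deviation of the whole set say little.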
  
Paul E. McKenney March 10, 2023, 1:47 a.m. UTC | #31
On Fri, Mar 10, 2023 at 09:11:54AM +0900, Akira Yokosawa wrote:
> Hi,
> 
> Let me chime in this interesting thread.
> 
> On Thu, 9 Mar 2023 13:53:39 -0800, Paul E. McKenney wrote:
> > On Thu, Mar 09, 2023 at 03:17:09PM +0000, Zhuo, Qiuxu wrote:
> >> > From: Paul E. McKenney <paulmck@kernel.org>
> >> > [...]
> >> > >
> >> > > a's standard deviation is ~0.4.
> >> > > b's standard deviation is ~0.5.
> >> > >
> >> > > a's average 9.0 is at the upbound of the standard deviation of b's [8.0, 9].
> >> > > So, the measurements should be statistically significant to some degree.
> >> > 
> >> > That single standard deviation means that you have 68% confidence that the
> >> > difference is real.  This is not far above the 50% level of random noise.
> >> > 95% is the lowest level that is normally considered to be statistically
> >> > significant.
> >> 
> >> 95% means there is no overlap between two standard deviations of a
> >> and two standard deviations of b.
> >> 
> >> This relies on either much less noise during testing or a big enough 
> >> difference between a and b. 
> 
> Appended is a histogram comparing 2 data sets.
> 
> As you see, the one with v2 patch is far from normal distribution.
> I think there is at least two peaks.
> The one at the right around 9.7 seems not affected by the patch.
> In such a case, average and standard deviation of all the data don't
> tell much.
> 
> It is hard to say anything for sure with such small set of samples.
> And the shape of the plot is likely to be highly dependent on machine
> setups.
> 
> Hope this helps.

Thank you, Akira!  Definitely an abnormal distribution!  ;-)

							Thanx, Paul
  
Qiuxu Zhuo March 10, 2023, 2:35 a.m. UTC | #32
> From: Akira Yokosawa <akiyks@gmail.com>
> Sent: Friday, March 10, 2023 8:12 AM
> To: paulmck@kernel.org; Zhuo, Qiuxu <qiuxu.zhuo@intel.com>
> Cc: frederic@kernel.org; jiangshanlai@gmail.com; joel@joelfernandes.org;
> linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org;
> rcu@vger.kernel.org; urezki@gmail.com; Akira Yokosawa
> <akiyks@gmail.com>
> Subject: Re: [PATCH v3] rcu: Add a minimum time for marking boot as
> completed
> 
> Hi,
> 
> Let me chime in this interesting thread.
> 
> On Thu, 9 Mar 2023 13:53:39 -0800, Paul E. McKenney wrote:
> > On Thu, Mar 09, 2023 at 03:17:09PM +0000, Zhuo, Qiuxu wrote:
> >> > From: Paul E. McKenney <paulmck@kernel.org> [...]
> >> > >
> >> > > a's standard deviation is ~0.4.
> >> > > b's standard deviation is ~0.5.
> >> > >
> >> > > a's average 9.0 is at the upbound of the standard deviation of b's [8.0, 9].
> >> > > So, the measurements should be statistically significant to some degree.
> >> >
> >> > That single standard deviation means that you have 68% confidence
> >> > that the difference is real.  This is not far above the 50% level of random noise.
> >> > 95% is the lowest level that is normally considered to be
> >> > statistically significant.
> >>
> >> 95% means there is no overlap between two standard deviations of a
> >> and two standard deviations of b.
> >>
> >> This relies on either much less noise during testing or a big enough
> >> difference between a and b.
> 
> Appended is a histogram comparing 2 data sets.
> 
> As you see, the one with v2 patch is far from normal distribution.
> I think there is at least two peaks.
> The one at the right around 9.7 seems not affected by the patch.
> In such a case, average and standard deviation of all the data don't tell much.
> 
> It is hard to say anything for sure with such small set of samples.
> And the shape of the plot is likely to be highly dependent on machine setups.
> 
> Hope this helps.

Thank you Yokosawa for sharing the histogram to provide an 
intuitive view of these data points and your analysis. ;-)

-Qiuxu
  
Uladzislau Rezki March 10, 2023, 8:55 a.m. UTC | #33
On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> [..]
> > > > > > > See this commit:
> > > > > > > 
> > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > > > expedited RCU primitives")
> > > > > > > 
> > > > > > > Antti provided this commit precisely in order to allow Android
> > > > > > > devices to expedite the boot process and to shut off the
> > > > > > > expediting at a time of Android userspace's choosing.  So Android
> > > > > > > has been making this work for about ten years, which strikes me
> > > > > > > as an adequate proof of concept.  ;-)
> > > > > > 
> > > > > > Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > > find that Android Mediatek devices at least are setting
> > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > > weird, it should be set to 1 as early as possible), and
> > > > > > interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > > 
> > > > > Interesting.  Though this is consistent with Antti's commit log,
> > > > > where he talks about expediting grace periods but not unexpediting
> > > > > them.
> > > > > 
> > > > Do you think we need to unexpedite it? :))))
> > > 
> > > Android runs on smallish systems, so quite possibly not!
> > > 
> > We keep it enabled and never unexpedite it. The reason is a performance.  I
> > have done some app-launch time analysis with enabling and disabling of it.
> > 
> > An expedited case is much better when it comes to app launch time. It
> > requires ~25% less time to run an app comparing with unexpedited variant.
> > So we have a big gain here.
> 
> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> be slowing down other usecases! I find it hard to believe, real-time
> workloads will run better without those callbacks being always-expedited if
> it actually gives back 25% in performance!
> 
I can dig further, but at a high level I think there are some spots
which show better performance if expedited is set. I mean that
synchronize_rcu() blocks its calling context for less time.

The problem with a regular synchronize_rcu() is that it can impose large
latencies on its caller. For example, in the nocb case we do not know where
in the list our callback is located or when it will be invoked to unblock the
caller.

As I have already mentioned elsewhere, it probably makes sense to wake up
callers directly from the GP kthread instead of via the nocb kthread that
invokes our callbacks one by one.

--
Uladzislau Rezki
  
Paul E. McKenney March 11, 2023, 6:24 a.m. UTC | #34
On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > [..]
> > > > > > > > See this commit:
> > > > > > > > 
> > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > > > > expedited RCU primitives")
> > > > > > > > 
> > > > > > > > Antti provided this commit precisely in order to allow Android
> > > > > > > > devices to expedite the boot process and to shut off the
> > > > > > > > expediting at a time of Android userspace's choosing.  So Android
> > > > > > > > has been making this work for about ten years, which strikes me
> > > > > > > > as an adequate proof of concept.  ;-)
> > > > > > > 
> > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > > > find that Android Mediatek devices at least are setting
> > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > > > weird, it should be set to 1 as early as possible), and
> > > > > > > interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > > > 
> > > > > > Interesting.  Though this is consistent with Antti's commit log,
> > > > > > where he talks about expediting grace periods but not unexpediting
> > > > > > them.
> > > > > > 
> > > > > Do you think we need to unexpedite it? :))))
> > > > 
> > > > Android runs on smallish systems, so quite possibly not!
> > > > 
> > > We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > have done some app-launch time analysis with enabling and disabling of it.
> > > 
> > > An expedited case is much better when it comes to app launch time. It
> > > requires ~25% less time to run an app comparing with unexpedited variant.
> > > So we have a big gain here.
> > 
> > Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > be slowing down other usecases! I find it hard to believe, real-time
> > workloads will run better without those callbacks being always-expedited if
> > it actually gives back 25% in performance!
> > 
> I can dig further, but on a high level i think there are some spots
> which show better performance if expedited is set. I mean synchronize_rcu()
> becomes as "less blocking a context" from a time point of view.
> 
> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> delays for a caller. For example for nocb case we do not know where in a list
> our callback is located and when it is invoked to unblock a caller.

True, expedited RCU grace periods do not have this callback-invocation
delay that normal RCU does.

> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> one by one.

Makes sense, but it is necessary to be careful.  Wakeups are not fast,
so making the RCU grace-period kthread do them all sequentially is not
a strategy to win.  For example, note that the next expedited grace
period can start before the previous expedited grace period has finished
its wakeups.

							Thanx, Paul
  
Joel Fernandes March 11, 2023, 5:19 p.m. UTC | #35
On Sat, Mar 11, 2023 at 1:24 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > > [..]
> > > > > > > > > See this commit:
> > > > > > > > >
> > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > > > > > expedited RCU primitives")
> > > > > > > > >
> > > > > > > > > Antti provided this commit precisely in order to allow Android
> > > > > > > > > devices to expedite the boot process and to shut off the
> > > > > > > > > expediting at a time of Android userspace's choosing.  So Android
> > > > > > > > > has been making this work for about ten years, which strikes me
> > > > > > > > > as an adequate proof of concept.  ;-)
> > > > > > > >
> > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > > > > find that Android Mediatek devices at least are setting
> > > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > > > > weird, it should be set to 1 as early as possible), and
> > > > > > > > interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > > > >
> > > > > > > Interesting.  Though this is consistent with Antti's commit log,
> > > > > > > where he talks about expediting grace periods but not unexpediting
> > > > > > > them.
> > > > > > >
> > > > > > Do you think we need to unexpedite it? :))))
> > > > >
> > > > > Android runs on smallish systems, so quite possibly not!
> > > > >
> > > > We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > > have done some app-launch time analysis with enabling and disabling of it.
> > > >
> > > > An expedited case is much better when it comes to app launch time. It
> > > > requires ~25% less time to run an app comparing with unexpedited variant.
> > > > So we have a big gain here.
> > >
> > > Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > be slowing down other usecases! I find it hard to believe, real-time
> > > workloads will run better without those callbacks being always-expedited if
> > > it actually gives back 25% in performance!
> > >
> > I can dig further, but on a high level i think there are some spots
> > which show better performance if expedited is set. I mean synchronize_rcu()
> > becomes as "less blocking a context" from a time point of view.
> >
> > The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > delays for a caller. For example for nocb case we do not know where in a list
> > our callback is located and when it is invoked to unblock a caller.
>
> True, expedited RCU grace periods do not have this callback-invocation
> delay that normal RCU does.
>
> > I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > one by one.
>
> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> so making the RCU grace-period kthread do them all sequentially is not
> a strategy to win.  For example, note that the next expedited grace
> period can start before the previous expedited grace period has finished
> its wakeups.

The kthreads could be undergoing scheduler contention too, especially
since the workload is launching an app, if I understand Vlad's usecase.
Hence my desire for an rcutop one-stop tool which shows all these
things (RCU kthread scheduler delays, callback latencies, etc.).
;-) The more issues I run into, the more urgent that tool becomes,
and I'm working on it...

thanks,

 - Joel
  
Paul E. McKenney March 11, 2023, 8:44 p.m. UTC | #36
On Sat, Mar 04, 2023 at 04:51:45AM +0000, Joel Fernandes wrote:
> Hi Paul,
> 
> On Fri, Mar 03, 2023 at 05:02:51PM -0800, Paul E. McKenney wrote:
> [..]
> > > Qiuxu also noted impressive boot-time improvements with earlier version
> > > of patch. An excerpt from the data he shared:

Now that we have the measurement methodology put to bed...

[ . . . ]

> > Mightn't this be simpler if the user was only permitted to write zero,
> > thus just saying "stop immediately"?  If people really need the ability
> > to extend or shorten the time, a patch can be produced at that point.
> > And then a non-zero write to the file would become legal.
> 
> I prefer to keep it this way as with this method, I can not only get to
> have variable rcu_boot_end_delay via boot parameter (as in my first patch), I
> also don't need to add a separate sysfs entry, and can just reuse
> 'rcu_boot_end_delay' parameter, which I also had in my first patch. And
> adding yet another sysfs parameter will actually complicate it even more and
> add more lines of code.
> 
> I tested different scenarios and it works fine. I unfortunately missed the
> mutex locking, but I did verify by manual testing that the different test
> cases work as expected.

Except that you don't need that extra sysfs value.  You could instead use
any of a number of state variables that tell you that early boot is done.
If the state says early boot (as in parsing the kernel command line),
make the code act as it does now.  Otherwise, make it accept only zero.

If there really is some system that wants to set one time limit via
the kernel boot parameter and set another at some time during boot,
there are very simple userspace facilities to make this happen.

And there is also a smaller state space and less testing to be done,
benefits which accrue on an ongoing basis.

							Thanx, Paul
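
As one illustration of the kind of simple userspace facility mentioned above, a late-boot helper could write 0 to the sysfs node once userspace considers the system booted. This is a hedged sketch only: the node path is the one named in this patch and exists only on kernels that carry it, the function name mark_boot_ended is hypothetical, and writing to the node requires root.

```python
from pathlib import Path

# Path as named in this patch; absent on kernels without the patch applied.
RCU_BOOT_END_DELAY = Path("/sys/module/rcupdate/parameters/rcu_boot_end_delay")


def mark_boot_ended(node=RCU_BOOT_END_DELAY, delay_ms=0):
    """Write delay_ms to the node; return False if the node does not exist."""
    node = Path(node)
    if not node.exists():
        return False
    node.write_text(f"{delay_ms}\n")
    return True


# Intended use, from a late-boot service running as root:
#   mark_boot_ended()
```

A helper like this would run as the last step of userspace boot, before any real-time workload is launched.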

> Here are some printks from simple testing in Qemu:
> 
> 1. End the boot early, CONFIG is set to 120 seconds:
> ==================================================
> [    1.614968] rcu_boot_end_delay = 120000
> [    1.617630] schedule delayed work joel
> 
> Boot took 1.57 seconds
> root@(none):/# cat /sys/module/rcupdate/parameters/rcu_boot_end_delay
> 120000
> root@(none):/#
> root@(none):/#
> root@(none):/# echo 0 > /sys/module/rcupdate/parameters/rcu_boot_end_delay
> [   10.108394] param called joel
> [   10.110520] sys calling boot ended
> [   10.112730] rcu_boot_end_delay = 0
> [   10.115017] boot ended joel
> -----------------------------------------------
> 
> 2. End the boot passing in rcupdate.rcu_boot_end_delay as 10s.
>    This should override the CONFIG of 120 seconds:
> ==================================================
> [    1.700090] rcu_boot_end_delay = 10000
> [    1.702628] schedule delayed work joel
> 
> Boot took 1.64 seconds
> 
> root@(none):/# [   10.414008] rcu_boot_end_delay = 10000
> [   10.416670] boot ended joel
> -----------------------------------------------
> 
> 3. Do the same thing as #2, but extend the boot via sysfs to be longer than
> 10 seconds:
> ==================================================
> [    0.060025] param called joel
> [    0.060026] param called too early joel
> [    1.663905] rcu_boot_end_delay = 10000
> [    1.667051] schedule delayed work joel
> 
> Boot took 1.61 seconds
> 
> root@(none):/#
> root@(none):/# echo 20000 > /sys/module/rcupdate/parameters/rcu_boot_end_delay
> [    6.932517] param called joel
> [    6.934637] sys calling boot ended
> [    6.936845] rcu_boot_end_delay = 20000
> [    6.939291] schedule delayed work joel
> root@(none):/# [   10.389366] rcu_boot_end_delay = 20000
> [   10.392047] schedule delayed work joel
> [   20.117416] rcu_boot_end_delay = 20000
> [   20.120073] boot ended joel
> -----------------------------------------------
> 
> The debug patch is here: https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=rcu/lazy/postboot
> 
> Appended is the updated v4 patch, tested as shown above, more testing is in progress.
> 
> thanks,
> 
>  - Joel
> 
> ---8<-----------------------
> 
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Subject: [PATCH v4] rcu: Add a minimum time for marking boot as completed
> 
> On many systems, a great deal of boot (in userspace) happens after the
> kernel thinks the boot has completed. It is difficult to determine if
> the system has really booted from the kernel side. Some features like
> lazy-RCU can risk slowing down boot time if, say, a callback has been
> added that the boot synchronously depends on. Further, expedited callbacks
> can get unexpedited way earlier than they should be, thus slowing down
> boot (as shown in the data below).
> 
> For these reasons, this commit adds a config option
> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.rcu_boot_end_delay.
> Userspace can also mark the system as booted from RCU's point of view by
> writing the time in milliseconds to:
> /sys/module/rcupdate/parameters/rcu_boot_end_delay
> or even just by writing a value of 0 to this sysfs node.
> However, under no circumstance will the boot be allowed to end earlier
> than just before init is launched.
> 
> The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> suits ChromeOS and also a PREEMPT_RT system below very well, which need
> no config or parameter changes, and just a simple application of this patch. A
> system designer can also choose a specific value here to keep RCU from marking
> boot completion.  As noted earlier, RCU's perspective of the system as booted
> will not be marked until at least rcu_boot_end_delay milliseconds have passed
> or an update is made via writing a small value (or 0) in milliseconds to:
> /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> 
> One side-effect of this patch is, there is a risk that a real-time workload
> launched just after the kernel boots will suffer interruptions due to expedited
> RCU, which previously ended just before init was launched. However, to mitigate
> such an issue (however unlikely), the user should either tune
> CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> boots, and before launching the real-time workload.
> 
> Qiuxu also noted impressive boot-time improvements with an earlier version
> of the patch. An excerpt from the data he shared:
> 
> 1) Testing environment:
>     OS            : CentOS Stream 8 (non-RT OS)
>     Kernel     : v6.2
>     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
>     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> 
> 2) OS boot time definition:
>     The time from the start of the kernel boot to the shell command line
>     prompt is shown from the console. [ Different people may have
>     different OS boot time definitions. ]
> 
> 3) Measurement method (very rough method):
>     A timer in the kernel periodically prints the boot time every 100ms.
>     As soon as the shell command line prompt is shown from the console,
>     we record the boot time printed by the timer, then the printed boot
>     time is the OS boot time.
> 
> 4) Measured OS boot time (in seconds)
>    a) Measured 10 times w/o this patch:
>         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
>         The average OS boot time was: ~8.7s
> 
>    b) Measured 10 times w/ this patch:
>         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
>         The average OS boot time was: ~8.3s.
> 
> option-prefix PATCH v4
> option-start
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> diff-note-start
> v1->v2:
> 	Update some comments and description.
> v2->v3:
>         Add sysfs param, and update with Test data.
> v3->v4:
>         Fix locking bug found by Paul, make code more robust
>         by refactoring locking code.
>         Doc updates.
> ---
>  .../admin-guide/kernel-parameters.txt         | 15 ++++
>  cc_list                                       |  8 ++
>  kernel/rcu/Kconfig                            | 21 ++++++
>  kernel/rcu/update.c                           | 74 ++++++++++++++++++-
>  4 files changed, 116 insertions(+), 2 deletions(-)
>  create mode 100644 cc_list
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 2429b5e3184b..878c2780f5db 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5085,6 +5085,21 @@
>  	rcutorture.verbose= [KNL]
>  			Enable additional printk() statements.
>  
> +	rcupdate.rcu_boot_end_delay= [KNL]
> +			Minimum time in milliseconds from the start of boot
> +			that must elapse before the boot sequence can be marked
> +			complete from RCU's perspective, after which RCU's
> +			behavior becomes more relaxed. The default value is also
> +			configurable via CONFIG_RCU_BOOT_END_DELAY.
> +			Userspace can also mark the boot as completed
> +			sooner by writing the time in milliseconds, say once
> +			userspace considers the system as booted, to:
> +			/sys/module/rcupdate/parameters/rcu_boot_end_delay
> +			Or even just writing a value of 0 to this sysfs node.
> +			The sysfs node can also be used to extend the delay
> +			to be larger than the default, assuming the marking
> +			of boot complete has not yet occurred.
> +
>  	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
>  			Dump ftrace buffer after reporting RCU CPU
>  			stall warning.
> diff --git a/cc_list b/cc_list
> new file mode 100644
> index 000000000000..7daed4877f5a
> --- /dev/null
> +++ b/cc_list
> @@ -0,0 +1,8 @@
> +Frederic Weisbecker <frederic@kernel.org>
> +Joel Fernandes <joel@joelfernandes.org>
> +Lai Jiangshan <jiangshanlai@gmail.com>
> +linux-doc@vger.kernel.org
> +linux-kernel@vger.kernel.org
> +"Paul E. McKenney" <paulmck@kernel.org>
> +rcu@vger.kernel.org
> +urezki@gmail.com
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 9071182b1284..97f68120d1c0 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -217,6 +217,27 @@ config RCU_BOOST_DELAY
>  
>  	  Accept the default if unsure.
>  
> +config RCU_BOOT_END_DELAY
> +	int "Minimum time before RCU may consider in-kernel boot as completed"
> +	range 0 120000
> +	default 15000
> +	help
> +	  Default value of the minimum time in milliseconds from the start of boot
> +	  that must elapse before the boot sequence can be marked complete from RCU's
> +	  perspective, after which RCU's behavior becomes more relaxed.
> +	  Userspace can also mark the boot as completed sooner than this default
> +	  by writing the time in milliseconds, say once userspace considers
> +	  the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> +	  Or even just writing a value of 0 to this sysfs node. The sysfs node can
> +	  also be used to extend the delay to be larger than the default, assuming
> +	  the marking of boot completion has not yet occurred.
> +
> +	  The actual delay for RCU's view of the system to be marked as booted can be
> +	  higher than this value if the kernel takes a long time to initialize but it
> +	  will never be smaller than this value.
> +
> +	  Accept the default if unsure.
> +
>  config RCU_EXP_KTHREAD
>  	bool "Perform RCU expedited work in a real-time kthread"
>  	depends on RCU_BOOST && RCU_EXPERT
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 19bf6fa3ee6a..18ed3c15e6b5 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -224,13 +224,50 @@ void rcu_unexpedite_gp(void)
>  }
>  EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
>  
> +/*
> + * Minimum time in milliseconds from the start of boot until RCU can consider
> + * in-kernel boot as completed.  This can also be tuned at runtime to end the
> + * boot earlier, by userspace init code writing the time in milliseconds (even
> + * 0) to: /sys/module/rcupdate/parameters/rcu_boot_end_delay. The sysfs node
> + * can also be used to extend the delay to be larger than the default, assuming
> + * the marking of boot complete has not yet occurred.
> + */
> +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
> +
>  static bool rcu_boot_ended __read_mostly;
> +static bool rcu_boot_end_called __read_mostly;
> +static DEFINE_MUTEX(rcu_boot_end_lock);
>  
>  /*
> - * Inform RCU of the end of the in-kernel boot sequence.
> + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
> + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
>   */
> -void rcu_end_inkernel_boot(void)
> +void rcu_end_inkernel_boot(void);
> +static void rcu_boot_end_work_fn(struct work_struct *work)
> +{
> +	rcu_end_inkernel_boot();
> +}
> +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
> +
> +/* Must be called with rcu_boot_end_lock held. */
> +static void rcu_end_inkernel_boot_locked(void)
>  {
> +	rcu_boot_end_called = true;
> +
> +	if (rcu_boot_ended)
> +		return;
> +
> +	if (rcu_boot_end_delay) {
> +		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
> +
> +		if (boot_ms < rcu_boot_end_delay) {
> +			schedule_delayed_work(&rcu_boot_end_work,
> +					rcu_boot_end_delay - boot_ms);
> +			return;
> +		}
> +	}
> +
> +	cancel_delayed_work(&rcu_boot_end_work);
>  	rcu_unexpedite_gp();
>  	rcu_async_relax();
>  	if (rcu_normal_after_boot)
> @@ -238,6 +275,39 @@ void rcu_end_inkernel_boot(void)
>  	rcu_boot_ended = true;
>  }
>  
> +void rcu_end_inkernel_boot(void)
> +{
> +	mutex_lock(&rcu_boot_end_lock);
> +	rcu_end_inkernel_boot_locked();
> +	mutex_unlock(&rcu_boot_end_lock);
> +}
> +
> +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
> +{
> +	uint end_ms;
> +	int ret = kstrtouint(val, 0, &end_ms);
> +
> +	if (ret)
> +		return ret;
> +	/*
> +	 * rcu_end_inkernel_boot() should be called at least once during init
> +	 * before we can allow param changes to end the boot.
> +	 */
> +	mutex_lock(&rcu_boot_end_lock);
> +	rcu_boot_end_delay = end_ms;
> +	if (!rcu_boot_ended && rcu_boot_end_called) {
> +		rcu_end_inkernel_boot_locked();
> +	}
> +	mutex_unlock(&rcu_boot_end_lock);
> +	return ret;
> +}
> +
> +static const struct kernel_param_ops rcu_boot_end_ops = {
> +	.set = param_set_rcu_boot_end,
> +	.get = param_get_uint,
> +};
> +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
> +
>  /*
>   * Let rcutorture know when it is OK to turn it up to eleven.
>   */
> -- 
> 2.40.0.rc0.216.gc4246ad0f0-goog
>
  
Joel Fernandes March 11, 2023, 10:23 p.m. UTC | #37
On Sat, Mar 11, 2023 at 12:44:53PM -0800, Paul E. McKenney wrote:
> On Sat, Mar 04, 2023 at 04:51:45AM +0000, Joel Fernandes wrote:
> > Hi Paul,
> > 
> > On Fri, Mar 03, 2023 at 05:02:51PM -0800, Paul E. McKenney wrote:
> > [..]
> > > > Qiuxu also noted impressive boot-time improvements with earlier version
> > > > of patch. An excerpt from the data he shared:
> 
> Now that we have the measurement methodology put to bed...
> 
> [ . . . ]
> 
> > > Mightn't this be simpler if the user was only permitted to write zero,
> > > thus just saying "stop immediately"?  If people really need the ability
> > > to extend or shorten the time, a patch can be produced at that point.
> > > And then a non-zero write to the file would become legal.
> > 
> > I prefer to keep it this way as with this method, I can not only get to
> > have variable rcu_boot_end_delay via boot parameter (as in my first patch), I
> > also don't need to add a separate sysfs entry, and can just reuse
> > 'rcu_boot_end_delay' parameter, which I also had in my first patch. And
> > adding yet another sysfs parameter will actually complicate it even more and
> > add more lines of code.
> > 
> > > I tested different scenarios and it works fine, though I missed that
> > mutex locking unfortunately, I did verify different test cases work as
> > expected by manual testing.
> 
> Except that you don't need that extra sysfs value.  You could instead use
> any of a number of state variables that tell you that early boot is done.
> If the state says early boot (as in parsing the kernel command line),
> make the code act as it does now.  Otherwise, make it accept only zero.
> 
> If there really is some system that wants to set one time limit via
> the kernel boot parameter and set another at some time during boot,
> there are very simple userspace facilities to make this happen.
> 
> And there is also a smaller state space and less testing to be done,
> benefits which accrue on an ongoing basis.

Ok, thanks for the suggestion; I will consider it when/if posting the next
revision of this idea. I got strong pushback from Frederic, Vlad and Steven
Rostedt on doing the timeout-based thing, so currently I am analyzing the
boot process more to see if it could be optimized instead. I tend to agree
with them now, also because this feature is new and there could be bugs
that this patch might hide.

thanks,

 - Joel


> 
> 							Thanx, Paul
> 
> > Here are some printks and on simple testing in Qemu:
> > 
> > 1. End the boot early, CONFIG is set to 120 seconds:
> > ==================================================
> > [    1.614968] rcu_boot_end_delay = 120000
> > [    1.617630] schedule delayed work joel
> > 
> > Boot took 1.57 seconds
> > root@(none):/# cat /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > 120000
> > root@(none):/#
> > root@(none):/#
> > root@(none):/# echo 0 > /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > [   10.108394] param called joel
> > [   10.110520] sys calling boot ended
> > [   10.112730] rcu_boot_end_delay = 0
> > [   10.115017] boot ended joel
> > -----------------------------------------------
> > 
> > 2. End the boot passing in rcupdate.rcu_boot_end_delay as 10s.
> >    This should override the CONFIG of 120 seconds:
> > ==================================================
> > [    1.700090] rcu_boot_end_delay = 10000
> > [    1.702628] schedule delayed work joel
> > 
> > Boot took 1.64 seconds
> > 
> > root@(none):/# [   10.414008] rcu_boot_end_delay = 10000
> > [   10.416670] boot ended joel
> > -----------------------------------------------
> > 
> > 3. Do the same thing as #2, but extend the boot via sysfs to be longer than
> > 10 seconds:
> > ==================================================
> > [    0.060025] param called joel
> > [    0.060026] param called too early joel
> > [    1.663905] rcu_boot_end_delay = 10000
> > [    1.667051] schedule delayed work joel
> > 
> > Boot took 1.61 seconds
> > 
> > root@(none):/#
> > root@(none):/# echo 20000 > /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > [    6.932517] param called joel
> > [    6.934637] sys calling boot ended
> > [    6.936845] rcu_boot_end_delay = 20000
> > [    6.939291] schedule delayed work joel
> > root@(none):/# [   10.389366] rcu_boot_end_delay = 20000
> > [   10.392047] schedule delayed work joel
> > [   20.117416] rcu_boot_end_delay = 20000
> > [   20.120073] boot ended joel
> > -----------------------------------------------
> > 
> > The debug patch is here: https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=rcu/lazy/postboot
> > 
> > Appended is the updated v4 patch, tested as shown above, more testing is in progress.
> > 
> > thanks,
> > 
> >  - Joel
> > 
> > ---8<-----------------------
> > 
> > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > Subject: [PATCH v4] rcu: Add a minimum time for marking boot as completed
> > 
> > On many systems, a great deal of boot (in userspace) happens after the
> > kernel thinks the boot has completed. It is difficult to determine if
> > the system has really booted from the kernel side. Some features like
> > lazy-RCU can risk slowing down boot time if, say, a callback has been
> > added that the boot synchronously depends on. Further expedited callbacks
> > can get unexpedited way earlier than it should be, thus slowing down
> > boot (as shown in the data below).
> > 
> > For these reasons, this commit adds a config option
> > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.rcu_boot_end_delay.
> > Userspace can also make RCU's view of the system as booted, by writing the
> > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > Or even just writing a value of 0 to this sysfs node.
> > However, under no circumstance will the boot be allowed to end earlier
> > than just before init is launched.
> > 
> > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> > suits ChromeOS and also a PREEMPT_RT system below very well, which need
> > no config or parameter changes, and just a simple application of this patch. A
> > system designer can also choose a specific value here to keep RCU from marking
> > boot completion.  As noted earlier, RCU's perspective of the system as booted
> > will not be marked until at least rcu_boot_end_delay milliseconds have passed
> > or an update is made via writing a small value (or 0) in milliseconds to:
> > /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > 
> > One side-effect of this patch is, there is a risk that a real-time workload
> > launched just after the kernel boots will suffer interruptions due to expedited
> > RCU, which previously ended just before init was launched. However, to mitigate
> > such an issue (however unlikely), the user should either tune
> > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> > boots, and before launching the real-time workload.
> > 
> > Qiuxu also noted impressive boot-time improvements with earlier version
> > of patch. An excerpt from the data he shared:
> > 
> > 1) Testing environment:
> >     OS            : CentOS Stream 8 (non-RT OS)
> >     Kernel     : v6.2
> >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > 
> > 2) OS boot time definition:
> >     The time from the start of the kernel boot to the shell command line
> >     prompt is shown from the console. [ Different people may have
> >     different OS boot time definitions. ]
> > 
> > 3) Measurement method (very rough method):
> >     A timer in the kernel periodically prints the boot time every 100ms.
> >     As soon as the shell command line prompt is shown from the console,
> >     we record the boot time printed by the timer, then the printed boot
> >     time is the OS boot time.
> > 
> > 4) Measured OS boot time (in seconds)
> >    a) Measured 10 times w/o this patch:
> >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> >         The average OS boot time was: ~8.7s
> > 
> >    b) Measure 10 times w/ this patch:
> >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> >         The average OS boot time was: ~8.3s.
> > 
> > option-prefix PATCH v4
> > option-start
> > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > diff-note-start
> > v1->v2:
> > 	Update some comments and description.
> > v2->v3:
> >         Add sysfs param, and update with Test data.
> > v3->v4:
> >         Fix locking bug found by Paul, make code more robust
> >         by refactoring locking code.
> >         Doc updates.
> > ---
> >  .../admin-guide/kernel-parameters.txt         | 15 ++++
> >  cc_list                                       |  8 ++
> >  kernel/rcu/Kconfig                            | 21 ++++++
> >  kernel/rcu/update.c                           | 74 ++++++++++++++++++-
> >  4 files changed, 116 insertions(+), 2 deletions(-)
> >  create mode 100644 cc_list
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 2429b5e3184b..878c2780f5db 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5085,6 +5085,21 @@
> >  	rcutorture.verbose= [KNL]
> >  			Enable additional printk() statements.
> >  
> > +	rcupdate.rcu_boot_end_delay= [KNL]
> > +			Minimum time in milliseconds from the start of boot
> > +			that must elapse before the boot sequence can be marked
> > +			complete from RCU's perspective, after which RCU's
> > +			behavior becomes more relaxed. The default value is also
> > +			configurable via CONFIG_RCU_BOOT_END_DELAY.
> > +			Userspace can also mark the boot as completed
> > +			sooner by writing the time in milliseconds, say once
> > +			userspace considers the system as booted, to:
> > +			/sys/module/rcupdate/parameters/rcu_boot_end_delay
> > +			Or even just writing a value of 0 to this sysfs node.
> > +			The sysfs node can also be used to extend the delay
> > +			to be larger than the default, assuming the marking
> > +			of boot complete has not yet occurred.
> > +
> >  	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
> >  			Dump ftrace buffer after reporting RCU CPU
> >  			stall warning.
> > diff --git a/cc_list b/cc_list
> > new file mode 100644
> > index 000000000000..7daed4877f5a
> > --- /dev/null
> > +++ b/cc_list
> > @@ -0,0 +1,8 @@
> > +Frederic Weisbecker <frederic@kernel.org>
> > +Joel Fernandes <joel@joelfernandes.org>
> > +Lai Jiangshan <jiangshanlai@gmail.com>
> > +linux-doc@vger.kernel.org
> > +linux-kernel@vger.kernel.org
> > +"Paul E. McKenney" <paulmck@kernel.org>
> > +rcu@vger.kernel.org
> > +urezki@gmail.com
> > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > index 9071182b1284..97f68120d1c0 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -217,6 +217,27 @@ config RCU_BOOST_DELAY
> >  
> >  	  Accept the default if unsure.
> >  
> > +config RCU_BOOT_END_DELAY
> > +	int "Minimum time before RCU may consider in-kernel boot as completed"
> > +	range 0 120000
> > +	default 15000
> > +	help
> > +	  Default value of the minimum time in milliseconds from the start of boot
> > +	  that must elapse before the boot sequence can be marked complete from RCU's
> > +	  perspective, after which RCU's behavior becomes more relaxed.
> > +	  Userspace can also mark the boot as completed sooner than this default
> > +	  by writing the time in milliseconds, say once userspace considers
> > +	  the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > +	  Or even just writing a value of 0 to this sysfs node. The sysfs node can
> > +	  also be used to extend the delay to be larger than the default, assuming
> > +	  the marking of boot completion has not yet occurred.
> > +
> > +	  The actual delay for RCU's view of the system to be marked as booted can be
> > +	  higher than this value if the kernel takes a long time to initialize but it
> > +	  will never be smaller than this value.
> > +
> > +	  Accept the default if unsure.
> > +
> >  config RCU_EXP_KTHREAD
> >  	bool "Perform RCU expedited work in a real-time kthread"
> >  	depends on RCU_BOOST && RCU_EXPERT
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index 19bf6fa3ee6a..18ed3c15e6b5 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -224,13 +224,50 @@ void rcu_unexpedite_gp(void)
> >  }
> >  EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
> >  
> > +/*
> > + * Minimum time in milliseconds from the start of boot until RCU can consider
> > + * in-kernel boot as completed.  This can also be tuned at runtime to end the
> > + * boot earlier, by userspace init code writing the time in milliseconds (even
> > + * 0) to: /sys/module/rcupdate/parameters/rcu_boot_end_delay. The sysfs node
> > + * can also be used to extend the delay to be larger than the default, assuming
> > + * the marking of boot complete has not yet occurred.
> > + */
> > +static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
> > +
> >  static bool rcu_boot_ended __read_mostly;
> > +static bool rcu_boot_end_called __read_mostly;
> > +static DEFINE_MUTEX(rcu_boot_end_lock);
> >  
> >  /*
> > - * Inform RCU of the end of the in-kernel boot sequence.
> > + * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
> > + * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
> >   */
> > -void rcu_end_inkernel_boot(void)
> > +void rcu_end_inkernel_boot(void);
> > +static void rcu_boot_end_work_fn(struct work_struct *work)
> > +{
> > +	rcu_end_inkernel_boot();
> > +}
> > +static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
> > +
> > +/* Must be called with rcu_boot_end_lock held. */
> > +static void rcu_end_inkernel_boot_locked(void)
> >  {
> > +	rcu_boot_end_called = true;
> > +
> > +	if (rcu_boot_ended)
> > +		return;
> > +
> > +	if (rcu_boot_end_delay) {
> > +		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
> > +
> > +		if (boot_ms < rcu_boot_end_delay) {
> > +			schedule_delayed_work(&rcu_boot_end_work,
> > +					rcu_boot_end_delay - boot_ms);
> > +			return;
> > +		}
> > +	}
> > +
> > +	cancel_delayed_work(&rcu_boot_end_work);
> >  	rcu_unexpedite_gp();
> >  	rcu_async_relax();
> >  	if (rcu_normal_after_boot)
> > @@ -238,6 +275,39 @@ void rcu_end_inkernel_boot(void)
> >  	rcu_boot_ended = true;
> >  }
> >  
> > +void rcu_end_inkernel_boot(void)
> > +{
> > +	mutex_lock(&rcu_boot_end_lock);
> > +	rcu_end_inkernel_boot_locked();
> > +	mutex_unlock(&rcu_boot_end_lock);
> > +}
> > +
> > +static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
> > +{
> > +	uint end_ms;
> > +	int ret = kstrtouint(val, 0, &end_ms);
> > +
> > +	if (ret)
> > +		return ret;
> > +	/*
> > +	 * rcu_end_inkernel_boot() should be called at least once during init
> > +	 * before we can allow param changes to end the boot.
> > +	 */
> > +	mutex_lock(&rcu_boot_end_lock);
> > +	rcu_boot_end_delay = end_ms;
> > +	if (!rcu_boot_ended && rcu_boot_end_called) {
> > +		rcu_end_inkernel_boot_locked();
> > +	}
> > +	mutex_unlock(&rcu_boot_end_lock);
> > +	return ret;
> > +}
> > +
> > +static const struct kernel_param_ops rcu_boot_end_ops = {
> > +	.set = param_set_rcu_boot_end,
> > +	.get = param_get_uint,
> > +};
> > +module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
> > +
> >  /*
> >   * Let rcutorture know when it is OK to turn it up to eleven.
> >   */
> > -- 
> > 2.40.0.rc0.216.gc4246ad0f0-goog
> >
  
Paul E. McKenney March 11, 2023, 10:57 p.m. UTC | #38
On Sat, Mar 11, 2023 at 10:23:54PM +0000, Joel Fernandes wrote:
> On Sat, Mar 11, 2023 at 12:44:53PM -0800, Paul E. McKenney wrote:
> > On Sat, Mar 04, 2023 at 04:51:45AM +0000, Joel Fernandes wrote:
> > > Hi Paul,
> > > 
> > > On Fri, Mar 03, 2023 at 05:02:51PM -0800, Paul E. McKenney wrote:
> > > [..]
> > > > > Qiuxu also noted impressive boot-time improvements with earlier version
> > > > > of patch. An excerpt from the data he shared:
> > 
> > Now that we have the measurement methodology put to bed...
> > 
> > [ . . . ]
> > 
> > > > Mightn't this be simpler if the user was only permitted to write zero,
> > > > thus just saying "stop immediately"?  If people really need the ability
> > > > to extend or shorten the time, a patch can be produced at that point.
> > > > And then a non-zero write to the file would become legal.
> > > 
> > > I prefer to keep it this way as with this method, I can not only get to
> > > have variable rcu_boot_end_delay via boot parameter (as in my first patch), I
> > > also don't need to add a separate sysfs entry, and can just reuse
> > > 'rcu_boot_end_delay' parameter, which I also had in my first patch. And
> > > adding yet another sysfs parameter will actually complicate it even more and
> > > add more lines of code.
> > > 
> > > > I tested different scenarios and it works fine, though I missed that
> > > mutex locking unfortunately, I did verify different test cases work as
> > > expected by manual testing.
> > 
> > Except that you don't need that extra sysfs value.  You could instead use
> > any of a number of state variables that tell you that early boot is done.
> > If the state says early boot (as in parsing the kernel command line),
> > make the code act as it does now.  Otherwise, make it accept only zero.
> > 
> > If there really is some system that wants to set one time limit via
> > the kernel boot parameter and set another at some time during boot,
> > there are very simple userspace facilities to make this happen.
> > 
> > And there is also a smaller state space and less testing to be done,
> > benefits which accrue on an ongoing basis.
> 
> Ok, thanks for the suggestion and I will consider it when/if posting the next
> revision of this idea. I got strong pushback from Frederic, Vlad and Steven
> Rostedt on doing the timeout-based thing, so currently I am analyzing the
> boot process more to see if it could be optimized instead. I tend to agree
> with them now also because this feature is new and there could be bugs that
> this patch might hide..

Agreed, fixing underlying causes is even better.

							Thanx, Paul
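
For reference, the zero-only policy suggested in the quoted text above
could look roughly like the following userspace model. All names here
(struct boot_end_state, early_boot, boot_end_delay_set) are made up for
illustration; in the kernel the early-boot test would come from an
existing state variable, and the discussion leaves open which one.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state, not a kernel structure. */
struct boot_end_state {
	bool early_boot;	/* true while the command line is being parsed */
	unsigned int delay_ms;	/* current minimum boot-end delay */
};

/*
 * During early boot any delay is accepted, as the patch does today.
 * Afterwards only 0 ("end the boot now") is accepted; anything else
 * is rejected with -EINVAL and the stored delay is left untouched.
 */
int boot_end_delay_set(struct boot_end_state *s, unsigned int ms)
{
	assert(s != NULL);
	if (!s->early_boot && ms != 0)
		return -EINVAL;
	s->delay_ms = ms;
	return 0;
}
```

This keeps the boot parameter fully tunable while shrinking the runtime
state space to a single transition, which is the testing benefit argued
for above.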
  
Uladzislau Rezki March 13, 2023, 9:51 a.m. UTC | #39
On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > > [..]
> > > > > > > > > See this commit:
> > > > > > > > > 
> > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > > > > > expedited RCU primitives")
> > > > > > > > > 
> > > > > > > > > Antti provided this commit precisely in order to allow Android
> > > > > > > > > devices to expedite the boot process and to shut off the
> > > > > > > > > expediting at a time of Android userspace's choosing.  So Android
> > > > > > > > > has been making this work for about ten years, which strikes me
> > > > > > > > > as an adequate proof of concept.  ;-)
> > > > > > > > 
> > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > > > > find that Android Mediatek devices at least are setting
> > > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > > > > weird, it should be set to 1 as early as possible), and
> > > > > > > > interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > > > > 
> > > > > > > Interesting.  Though this is consistent with Antti's commit log,
> > > > > > > where he talks about expediting grace periods but not unexpediting
> > > > > > > them.
> > > > > > > 
> > > > > > Do you think we need to unexpedite it? :))))
> > > > > 
> > > > > Android runs on smallish systems, so quite possibly not!
> > > > > 
> > > > We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > > have done some app-launch time analysis with enabling and disabling of it.
> > > > 
> > > > An expedited case is much better when it comes to app launch time. It
> > > > requires ~25% less time to run an app comparing with unexpedited variant.
> > > > So we have a big gain here.
> > > 
> > > Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > be slowing down other usecases! I find it hard to believe, real-time
> > > workloads will run better without those callbacks being always-expedited if
> > > it actually gives back 25% in performance!
> > > 
> > I can dig further, but on a high level i think there are some spots
> > which show better performance if expedited is set. I mean synchronize_rcu()
> > becomes as "less blocking a context" from a time point of view.
> > 
> > The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > delays for a caller. For example for nocb case we do not know where in a list
> > our callback is located and when it is invoked to unblock a caller.
> 
> True, expedited RCU grace periods do not have this callback-invocation
> delay that normal RCU does.
> 
> > I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > one by one.
> 
> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> so making the RCU grace-period kthread do them all sequentially is not
> a strategy to win.  For example, note that the next expedited grace
> period can start before the previous expedited grace period has finished
> its wakeups.
> 
I have done a small and quick prototype:

<snip>
diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
index 699b938358bf..e1a4cca9a208 100644
--- a/include/linux/rcupdate_wait.h
+++ b/include/linux/rcupdate_wait.h
@@ -9,6 +9,8 @@
 #include <linux/rcupdate.h>
 #include <linux/completion.h>

+extern struct llist_head gp_wait_llist;
+
 /*
  * Structure allowing asynchronous waiting on RCU.
  */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ee27a03d7576..50b81ca54104 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
 int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
 int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */

+/* Waiters for a GP kthread. */
+LLIST_HEAD(gp_wait_llist);
+
 /*
  * The rcu_scheduler_active variable is initialized to the value
  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
@@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
 }

+static void rcu_notify_gp_end(struct llist_node *llist)
+{
+       struct llist_node *rcu, *next;
+
+       llist_for_each_safe(rcu, next, llist)
+               complete(&((struct rcu_synchronize *) rcu)->completion);
+}
+
 /*
  * Body of kthread that handles grace periods.
  */
@@ -1811,6 +1822,9 @@ static int __noreturn rcu_gp_kthread(void *unused)
                WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANUP);
                rcu_gp_cleanup();
                WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANED);
+
+               /* Wake up all users. */
+               rcu_notify_gp_end(llist_del_all(&gp_wait_llist));
        }
 }

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..1de7c328a3e5 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -426,7 +426,10 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
                if (j == i) {
                        init_rcu_head_on_stack(&rs_array[i].head);
                        init_completion(&rs_array[i].completion);
-                       (crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
+
+                       /* Kick a grace period if needed. */
+                       (void) start_poll_synchronize_rcu();
+                       llist_add((struct llist_node *) &rs_array[i].head, &gp_wait_llist);
                }
        }
<snip>

and did some experiments to compare performance. The test case is:

thread_X:
  synchronize_rcu();
  kfree(ptr);
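
The wake-up path of the prototype can be modeled in userspace with C11
atomics, which may help when reasoning about it outside the kernel. This
is only a sketch: struct waiter, waiter_add(), waiter_del_all() and
notify_all() are hypothetical stand-ins for struct rcu_synchronize,
llist_add(), llist_del_all() and rcu_notify_gp_end(), and a plain flag
replaces the completion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a blocked synchronize_rcu() caller. */
struct waiter {
	struct waiter *next;
	bool done;		/* stands in for the completion */
};

/* Head of the lock-free list of waiters (models gp_wait_llist). */
static _Atomic(struct waiter *) wait_list;

/* Lock-free push onto the list head. */
void waiter_add(struct waiter *w)
{
	struct waiter *head = atomic_load(&wait_list);

	assert(w != NULL);
	do {
		w->next = head;	/* head is refreshed when the CAS fails */
	} while (!atomic_compare_exchange_weak(&wait_list, &head, w));
}

/* Detach the whole list in one atomic step. */
struct waiter *waiter_del_all(void)
{
	return atomic_exchange(&wait_list, (struct waiter *)NULL);
}

/* Complete every detached waiter and return how many were completed. */
int notify_all(struct waiter *w)
{
	int n = 0;

	while (w) {
		/* Read next first: a completed waiter may go away. */
		struct waiter *next = w->next;

		w->done = true;
		w = next;
		n++;
	}
	return n;
}
```

In the model, as far as I can tell also in the prototype, everything on
the list at detach time is completed. A waiter that queued itself after
the grace period had already started would therefore be released by that
same grace period, which a real implementation would presumably have to
guard against.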

Below are the results of 10 parallel workers, each running the test
scenario above 1000 times:

# default(NOCB)
[   29.322944] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17286604 usec
[   29.325759] All test took worker0=63964052068 cycles
[   29.327255] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23414575 usec
[   29.329974] All test took worker1=86638822563 cycles
[   29.331460] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23357988 usec
[   29.334205] All test took worker2=86429439193 cycles
[   29.350808] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17174001 usec
[   29.353553] All test took worker3=63547397954 cycles
[   29.355039] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17141904 usec
[   29.357770] All test took worker4=63428630877 cycles
[   29.374831] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23397952 usec
[   29.377577] All test took worker5=86577316353 cycles
[   29.398809] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17142038 usec
[   29.401549] All test took worker6=63429124938 cycles
[   29.414828] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17158248 usec
[   29.417574] All test took worker7=63489107118 cycles
[   29.438811] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 18102109 usec
[   29.441550] All test took worker8=66981588881 cycles
[   29.462826] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23446042 usec
[   29.465561] All test took worker9=86755258455 cycles

# patch(NOCB)
[   14.720986] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837883 usec
[   14.723753] All test took worker0=32702015768 cycles
[   14.740386] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837750 usec
[   14.743076] All test took worker1=32701525814 cycles
[   14.760350] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837734 usec
[   14.763036] All test took worker2=32701466281 cycles
[   14.780369] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837707 usec
[   14.783057] All test took worker3=32701364901 cycles
[   14.800352] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837730 usec
[   14.803041] All test took worker4=32701449927 cycles
[   14.820355] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837724 usec
[   14.823048] All test took worker5=32701428134 cycles
[   14.840359] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837705 usec
[   14.843052] All test took worker6=32701356465 cycles
[   14.860322] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837742 usec
[   14.863005] All test took worker7=32701494475 cycles
[   14.880363] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837750 usec
[   14.883081] All test took worker8=32701525074 cycles
[   14.900362] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837918 usec
[   14.903065] All test took worker9=32702145379 cycles
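
For a rough sense of scale, the per-worker "avg" values from the two runs above can be reduced to a single ratio. The following is only a sketch of that arithmetic: the array and function names are made up for illustration, and the numbers are copied from the log lines as posted, with no claim about how the test module computes them.

```c
#include <assert.h>
#include <stddef.h>

/* Per-worker "avg" values (usec) copied from the # default(NOCB) log. */
static const double default_avg_usec[] = {
	17286604, 23414575, 23357988, 17174001, 17141904,
	23397952, 17142038, 17158248, 18102109, 23446042,
};

/* Per-worker "avg" values (usec) copied from the # patch(NOCB) log. */
static const double patch_avg_usec[] = {
	8837883, 8837750, 8837734, 8837707, 8837730,
	8837724, 8837705, 8837742, 8837750, 8837918,
};

#define N_WORKERS (sizeof(default_avg_usec) / sizeof(default_avg_usec[0]))

/* Arithmetic mean of n samples; n must be non-zero. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Ratio of the mean default average to the mean patched average. */
static double speedup(void)
{
	return mean(default_avg_usec, N_WORKERS) /
	       mean(patch_avg_usec, N_WORKERS);
}
```

With these inputs speedup() comes out at roughly 2.2, which matches the impression from the raw logs that the patched run takes a bit less than half the time.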

--
Uladzislau Rezki
  
Uladzislau Rezki March 13, 2023, 12:27 p.m. UTC | #40
On Mon, Mar 13, 2023 at 10:51:39AM +0100, Uladzislau Rezki wrote:
> On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> > On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > > > [..]
> > > > > > > > > > See this commit:
> > > > > > > > > > 
> > > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > > > > > > expedited RCU primitives")
> > > > > > > > > > 
> > > > > > > > > > Antti provided this commit precisely in order to allow Android
> > > > > > > > > > devices to expedite the boot process and to shut off the
> > > > > > > > > > expediting at a time of Android userspace's choosing.  So Android
> > > > > > > > > > has been making this work for about ten years, which strikes me
> > > > > > > > > > as an adequate proof of concept.  ;-)
> > > > > > > > > 
> > > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > > > > > find that Android Mediatek devices at least are setting
> > > > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > > > > > weird, it should be set to 1 as early as possible), and
> > > > > > > > > interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > > > > > 
> > > > > > > > Interesting.  Though this is consistent with Antti's commit log,
> > > > > > > > where he talks about expediting grace periods but not unexpediting
> > > > > > > > them.
> > > > > > > > 
> > > > > > > Do you think we need to unexpedite it? :))))
> > > > > > 
> > > > > > Android runs on smallish systems, so quite possibly not!
> > > > > > 
> > > > > We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > > > have done some app-launch time analysis with enabling and disabling of it.
> > > > > 
> > > > > An expedited case is much better when it comes to app launch time. It
> > > > > requires ~25% less time to run an app comparing with unexpedited variant.
> > > > > So we have a big gain here.
> > > > 
> > > > Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > > as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > > be slowing down other usecases! I find it hard to believe, real-time
> > > > workloads will run better without those callbacks being always-expedited if
> > > > it actually gives back 25% in performance!
> > > > 
> > > I can dig further, but on a high level i think there are some spots
> > > which show better performance if expedited is set. I mean synchronize_rcu()
> > > becomes as "less blocking a context" from a time point of view.
> > > 
> > > The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > > delays for a caller. For example for nocb case we do not know where in a list
> > > our callback is located and when it is invoked to unblock a caller.
> > 
> > True, expedited RCU grace periods do not have this callback-invocation
> > delay that normal RCU does.
> > 
> > > I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > > callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > > one by one.
> > 
> > Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> > so making the RCU grace-period kthread do them all sequentially is not
> > a strategy to win.  For example, note that the next expedited grace
> > period can start before the previous expedited grace period has finished
> > its wakeups.
> > 
> I have done a small and quick prototype:
> 
> <snip>
> diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> index 699b938358bf..e1a4cca9a208 100644
> --- a/include/linux/rcupdate_wait.h
> +++ b/include/linux/rcupdate_wait.h
> @@ -9,6 +9,8 @@
>  #include <linux/rcupdate.h>
>  #include <linux/completion.h>
> 
> +extern struct llist_head gp_wait_llist;
> +
>  /*
>   * Structure allowing asynchronous waiting on RCU.
>   */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index ee27a03d7576..50b81ca54104 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
>  int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
>  int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> 
> +/* Waiters for a GP kthread. */
> +LLIST_HEAD(gp_wait_llist);
> +
>  /*
>   * The rcu_scheduler_active variable is initialized to the value
>   * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
>                 on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
>  }
> 
> +static void rcu_notify_gp_end(struct llist_node *llist)
> +{
> +       struct llist_node *rcu, *next;
> +
> +       llist_for_each_safe(rcu, next, llist)
> +               complete(&((struct rcu_synchronize *) rcu)->completion);
> +}
> +
>  /*
>   * Body of kthread that handles grace periods.
>   */
> @@ -1811,6 +1822,9 @@ static int __noreturn rcu_gp_kthread(void *unused)
>                 WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANUP);
>                 rcu_gp_cleanup();
>                 WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANED);
> +
> +               /* Wake up all users. */
> +               rcu_notify_gp_end(llist_del_all(&gp_wait_llist));
>         }
>  }
> 
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 19bf6fa3ee6a..1de7c328a3e5 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -426,7 +426,10 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
>                 if (j == i) {
>                         init_rcu_head_on_stack(&rs_array[i].head);
>                         init_completion(&rs_array[i].completion);
> -                       (crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
> +
> +                       /* Kick a grace period if needed. */
> +                       (void) start_poll_synchronize_rcu();
> +                       llist_add((struct llist_node *) &rs_array[i].head, &gp_wait_llist);
>                 }
>         }
> <snip>
> 
> and did some experiments in terms of performance and comparison. A test case is:
> 
> thread_X:
>   synchronize_rcu();
>   kfree(ptr);
> 
> below are results with running 10 parallel workers running 1000 times of mentioned
> test scenario:
> 
> # default(NOCB)
> [   29.322944] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17286604 usec
> [   29.325759] All test took worker0=63964052068 cycles
> [   29.327255] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23414575 usec
> [   29.329974] All test took worker1=86638822563 cycles
> [   29.331460] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23357988 usec
> [   29.334205] All test took worker2=86429439193 cycles
> [   29.350808] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17174001 usec
> [   29.353553] All test took worker3=63547397954 cycles
> [   29.355039] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17141904 usec
> [   29.357770] All test took worker4=63428630877 cycles
> [   29.374831] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23397952 usec
> [   29.377577] All test took worker5=86577316353 cycles
> [   29.398809] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17142038 usec
> [   29.401549] All test took worker6=63429124938 cycles
> [   29.414828] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17158248 usec
> [   29.417574] All test took worker7=63489107118 cycles
> [   29.438811] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 18102109 usec
> [   29.441550] All test took worker8=66981588881 cycles
> [   29.462826] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23446042 usec
> [   29.465561] All test took worker9=86755258455 cycles
> 
> # patch(NOCB)
> [   14.720986] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837883 usec
> [   14.723753] All test took worker0=32702015768 cycles
> [   14.740386] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837750 usec
> [   14.743076] All test took worker1=32701525814 cycles
> [   14.760350] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837734 usec
> [   14.763036] All test took worker2=32701466281 cycles
> [   14.780369] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837707 usec
> [   14.783057] All test took worker3=32701364901 cycles
> [   14.800352] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837730 usec
> [   14.803041] All test took worker4=32701449927 cycles
> [   14.820355] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837724 usec
> [   14.823048] All test took worker5=32701428134 cycles
> [   14.840359] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837705 usec
> [   14.843052] All test took worker6=32701356465 cycles
> [   14.860322] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837742 usec
> [   14.863005] All test took worker7=32701494475 cycles
> [   14.880363] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837750 usec
> [   14.883081] All test took worker8=32701525074 cycles
> [   14.900362] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837918 usec
> [   14.903065] All test took worker9=32702145379 cycles
> 
> --
> Uladzislau Rezki
Here is a quick app-launch test, using the camera app on our device:

urezki@pc636:~/data/yoshino_bin/scripts$ ./test-cam.sh
629
572
652
622
642
650
613
654
607
urezki@pc636:~/data/yoshino_bin/scripts$ adb shell
XQ-DQ54:/ $ su
XQ-DQ54:/ # echo 1 > /sy
sys/          system/       system_dlkm/  system_ext/
XQ-DQ54:/ # echo 1 > /sys/kernel/rc
rcu_expedited       rcu_improve_normal  rcu_normal
XQ-DQ54:/ # echo 1 > /sys/kernel/rcu_improve_normal
XQ-DQ54:/ # exit
XQ-DQ54:/ $ exit
urezki@pc636:~/data/yoshino_bin/scripts$ ./test-cam.sh
533
549
563
537
540
563
531
549
548
urezki@pc636:~/data/yoshino_bin/scripts$

Each number is the time taken to launch the app, in milliseconds.
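
The two sets of launch times can be summarized as means and a relative reduction. The sketch below only restates that arithmetic: the names are hypothetical, and it assumes the first run is the default configuration and the second is the run after writing 1 to rcu_improve_normal, as the shell transcript suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Launch times (ms) from the first ./test-cam.sh run. */
static const int before_ms[] = {629, 572, 652, 622, 642, 650, 613, 654, 607};

/* Launch times (ms) from the run after "echo 1 > rcu_improve_normal". */
static const int after_ms[] = {533, 549, 563, 537, 540, 563, 531, 549, 548};

#define N_RUNS (sizeof(before_ms) / sizeof(before_ms[0]))

/* Arithmetic mean of n samples in milliseconds; n must be non-zero. */
static double mean_ms(const int *v, size_t n)
{
	long sum = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return (double)sum / (double)n;
}

/* Percentage by which the mean launch time dropped between the runs. */
static double reduction_percent(void)
{
	double before = mean_ms(before_ms, N_RUNS);
	double after = mean_ms(after_ms, N_RUNS);

	return 100.0 * (before - after) / before;
}
```

On these nine samples per run the means are about 627 ms and 546 ms, a reduction of roughly 13%. Nine samples is a small set, so this is an indication rather than a measurement with error bars.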

--
Uladzislau Rezki
  
Qiuxu Zhuo March 13, 2023, 1:48 p.m. UTC | #41
> From: Uladzislau Rezki <urezki@gmail.com>
> [...]
> XQ-DQ54:/ # echo 1 > /sys/kernel/rc
> rcu_expedited       rcu_improve_normal  rcu_normal
> XQ-DQ54:/ # echo 1 > /sys/kernel/rcu_improve_normal 

Hi Rezki,

I applied your prototype patch, but I did NOT find the sysfs node
"/sys/kernel/rcu_improve_normal" on my system.

What is this node used for? What am I missing? Thanks!

[ There were only "rcu_expedited" and "rcu_normal" sysfs nodes
on my system. ]

-Qiuxu

> XQ-DQ54:/ # exit
> XQ-DQ54:/ $ exit
> urezki@pc636:~/data/yoshino_bin/scripts$ ./test-cam.sh
> 533
> 549
> 563
> 537
> 540
> 563
> 531
> 549
> 548
> urezki@pc636:~/data/yoshino_bin/scripts$
> 
> the taken time to run an app in milliseconds.
> 
> --
> Uladzislau Rezki
  
Joel Fernandes March 13, 2023, 1:58 p.m. UTC | #42
> On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
>>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
>>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
>>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
>>>> [..]
>>>>>>>>>> See this commit:
>>>>>>>>>> 
>>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of
>>>>>>>>>> expedited RCU primitives")
>>>>>>>>>> 
>>>>>>>>>> Antti provided this commit precisely in order to allow Android
>>>>>>>>>> devices to expedite the boot process and to shut off the
>>>>>>>>>> expediting at a time of Android userspace's choosing.  So Android
>>>>>>>>>> has been making this work for about ten years, which strikes me
>>>>>>>>>> as an adequate proof of concept.  ;-)
>>>>>>>>> 
>>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I
>>>>>>>>> find that Android Mediatek devices at least are setting
>>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is
>>>>>>>>> weird, it should be set to 1 as early as possible), and
>>>>>>>>> interestingly I cannot find them resetting it back to 0!.  Maybe
>>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P
>>>>>>>> 
>>>>>>>> Interesting.  Though this is consistent with Antti's commit log,
>>>>>>>> where he talks about expediting grace periods but not unexpediting
>>>>>>>> them.
>>>>>>>> 
>>>>>>> Do you think we need to unexpedite it? :))))
>>>>>> 
>>>>>> Android runs on smallish systems, so quite possibly not!
>>>>>> 
>>>>> We keep it enabled and never unexpedite it. The reason is a performance.  I
>>>>> have done some app-launch time analysis with enabling and disabling of it.
>>>>> 
>>>>> An expedited case is much better when it comes to app launch time. It
>>>>> requires ~25% less time to run an app comparing with unexpedited variant.
>>>>> So we have a big gain here.
>>>> 
>>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
>>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
>>>> be slowing down other usecases! I find it hard to believe, real-time
>>>> workloads will run better without those callbacks being always-expedited if
>>>> it actually gives back 25% in performance!
>>>> 
>>> I can dig further, but on a high level i think there are some spots
>>> which show better performance if expedited is set. I mean synchronize_rcu()
>>> becomes as "less blocking a context" from a time point of view.
>>> 
>>> The problem of a regular synchronize_rcu() is - it can trigger a big latency
>>> delays for a caller. For example for nocb case we do not know where in a list
>>> our callback is located and when it is invoked to unblock a caller.
>> 
>> True, expedited RCU grace periods do not have this callback-invocation
>> delay that normal RCU does.
>> 
>>> I have already mentioned somewhere. Probably it makes sense to directly wake-up
>>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
>>> one by one.
>> 
>> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
>> so making the RCU grace-period kthread do them all sequentially is not
>> a strategy to win.  For example, note that the next expedited grace
>> period can start before the previous expedited grace period has finished
>> its wakeups.
>> 
> I have done a small and quick prototype:
> 
> <snip>
> diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> index 699b938358bf..e1a4cca9a208 100644
> --- a/include/linux/rcupdate_wait.h
> +++ b/include/linux/rcupdate_wait.h
> @@ -9,6 +9,8 @@
> #include <linux/rcupdate.h>
> #include <linux/completion.h>
> 
> +extern struct llist_head gp_wait_llist;
> +
> /*
>  * Structure allowing asynchronous waiting on RCU.
>  */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index ee27a03d7576..50b81ca54104 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> 
> +/* Waiters for a GP kthread. */
> +LLIST_HEAD(gp_wait_llist);
> +
> /*
>  * The rcu_scheduler_active variable is initialized to the value
>  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
>                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> }
> 
> +static void rcu_notify_gp_end(struct llist_node *llist)
> +{
> +       struct llist_node *rcu, *next;
> +
> +       llist_for_each_safe(rcu, next, llist)
> +               complete(&((struct rcu_synchronize *) rcu)->completion);

This looks broken to me. Won't the synchronize_rcu() complete even
if it was called in the middle of an ongoing GP?
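
To make the concern concrete, here is a toy user-space model (not kernel code; every name in it is hypothetical). A waiter that is queued while a grace period is already running must not be released when that grace period ends, because the grace period may have started before the waiter's update. The prototype above discards the cookie returned by start_poll_synchronize_rcu(); the sketch instead records, per waiter, which grace period it needs, loosely in the spirit of the kernel's get_state_synchronize_rcu() / poll_state_synchronize_rcu() cookies, and contrasts that with releasing everything on the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of grace-period (GP) state; not the kernel's rcu_state. */
struct gp_model {
	unsigned long completed;	/* GPs that have fully ended */
	bool in_progress;		/* a GP is running right now */
};

/* Hypothetical waiter: one blocked synchronize_rcu() caller. */
struct waiter {
	unsigned long need;	/* value of 'completed' required for release */
	bool done;
};

static void gp_start(struct gp_model *gp)
{
	assert(!gp->in_progress);
	gp->in_progress = true;
}

/*
 * Queue a waiter. A GP that is already running may have begun before the
 * caller's update, so it does not count: wait for the one after it.
 */
static void waiter_enqueue(const struct gp_model *gp, struct waiter *w)
{
	w->need = gp->completed + (gp->in_progress ? 2 : 1);
	w->done = false;
}

/* Mark the running GP as ended. */
static void gp_finish(struct gp_model *gp)
{
	assert(gp->in_progress);
	gp->in_progress = false;
	gp->completed++;
}

/* Naive wakeup: end the GP and release every queued waiter. */
static size_t gp_end_naive(struct gp_model *gp, struct waiter *ws, size_t n)
{
	size_t released = 0;

	gp_finish(gp);
	for (size_t i = 0; i < n; i++) {
		if (!ws[i].done) {
			ws[i].done = true;
			released++;
		}
	}
	return released;
}

/* Checked wakeup: release only waiters whose snapshot is satisfied. */
static size_t gp_end_checked(struct gp_model *gp, struct waiter *ws, size_t n)
{
	size_t released = 0;

	gp_finish(gp);
	for (size_t i = 0; i < n; i++) {
		if (!ws[i].done && ws[i].need <= gp->completed) {
			ws[i].done = true;
			released++;
		}
	}
	return released;
}
```

In this model, a waiter queued mid-GP is released by gp_end_naive() at the end of that same GP, while gp_end_checked() holds it until the following GP has ended. Whether the real prototype has this problem depends on details of grace-period start and cleanup that the model does not capture.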

Thanks,

 - Joel



> +}
> +
> /*
>  * Body of kthread that handles grace periods.
>  */
> @@ -1811,6 +1822,9 @@ static int __noreturn rcu_gp_kthread(void *unused)
>                WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANUP);
>                rcu_gp_cleanup();
>                WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANED);
> +
> +               /* Wake up all users. */
> +               rcu_notify_gp_end(llist_del_all(&gp_wait_llist));
>        }
> }
> 
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 19bf6fa3ee6a..1de7c328a3e5 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -426,7 +426,10 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
>                if (j == i) {
>                        init_rcu_head_on_stack(&rs_array[i].head);
>                        init_completion(&rs_array[i].completion);
> -                       (crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
> +
> +                       /* Kick a grace period if needed. */
> +                       (void) start_poll_synchronize_rcu();
> +                       llist_add((struct llist_node *) &rs_array[i].head, &gp_wait_llist);
>                }
>        }
> <snip>
> 
> and did some experiments in terms of performance and comparison. A test case is:
> 
> thread_X:
>  synchronize_rcu();
>  kfree(ptr);
> 
> below are results with running 10 parallel workers running 1000 times of mentioned
> test scenario:
> 
> # default(NOCB)
> [   29.322944] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17286604 usec
> [   29.325759] All test took worker0=63964052068 cycles
> [   29.327255] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23414575 usec
> [   29.329974] All test took worker1=86638822563 cycles
> [   29.331460] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23357988 usec
> [   29.334205] All test took worker2=86429439193 cycles
> [   29.350808] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17174001 usec
> [   29.353553] All test took worker3=63547397954 cycles
> [   29.355039] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17141904 usec
> [   29.357770] All test took worker4=63428630877 cycles
> [   29.374831] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23397952 usec
> [   29.377577] All test took worker5=86577316353 cycles
> [   29.398809] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17142038 usec
> [   29.401549] All test took worker6=63429124938 cycles
> [   29.414828] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 17158248 usec
> [   29.417574] All test took worker7=63489107118 cycles
> [   29.438811] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 18102109 usec
> [   29.441550] All test took worker8=66981588881 cycles
> [   29.462826] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 23446042 usec
> [   29.465561] All test took worker9=86755258455 cycles
> 
> # patch(NOCB)
> [   14.720986] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837883 usec
> [   14.723753] All test took worker0=32702015768 cycles
> [   14.740386] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837750 usec
> [   14.743076] All test took worker1=32701525814 cycles
> [   14.760350] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837734 usec
> [   14.763036] All test took worker2=32701466281 cycles
> [   14.780369] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837707 usec
> [   14.783057] All test took worker3=32701364901 cycles
> [   14.800352] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837730 usec
> [   14.803041] All test took worker4=32701449927 cycles
> [   14.820355] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837724 usec
> [   14.823048] All test took worker5=32701428134 cycles
> [   14.840359] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837705 usec
> [   14.843052] All test took worker6=32701356465 cycles
> [   14.860322] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837742 usec
> [   14.863005] All test took worker7=32701494475 cycles
> [   14.880363] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837750 usec
> [   14.883081] All test took worker8=32701525074 cycles
> [   14.900362] Summary: kvfree_rcu_1_arg_vmalloc_test loops: 1000 avg: 8837918 usec
> [   14.903065] All test took worker9=32702145379 cycles
> 
> --
> Uladzislau Rezki
  
Uladzislau Rezki March 13, 2023, 3:28 p.m. UTC | #43
On Mon, Mar 13, 2023 at 01:48:18PM +0000, Zhuo, Qiuxu wrote:
> > From: Uladzislau Rezki <urezki@gmail.com>
> > [...]
> > XQ-DQ54:/ # echo 1 > /sys/kernel/rc
> > rcu_expedited       rcu_improve_normal  rcu_normal
> > XQ-DQ54:/ # echo 1 > /sys/kernel/rcu_improve_normal 
> 
> Hi Rezki,
> 
> I applied your prototype patch, but I did NOT find the sys-node:
>  "/sys/kernel/rcu_improve_normal" on my system.
> 
> What is this node used for? What am I missing? Thanks!
> 
> [ There were only "rcu_expedited" & " rcu_normal" sys nodes
> on my system. ]
> 
The prototype I posted does not have such a helper; I added it just
for my local tests.

--
Uladzislau Rezki
  
Uladzislau Rezki March 13, 2023, 3:32 p.m. UTC | #44
On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote:
> 
> 
> > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > 
> > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> >>>> [..]
> >>>>>>>>>> See this commit:
> >>>>>>>>>> 
> >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> >>>>>>>>>> expedited RCU primitives")
> >>>>>>>>>> 
> >>>>>>>>>> Antti provided this commit precisely in order to allow Android
> >>>>>>>>>> devices to expedite the boot process and to shut off the
> >>>>>>>>>> expediting at a time of Android userspace's choosing.  So Android
> >>>>>>>>>> has been making this work for about ten years, which strikes me
> >>>>>>>>>> as an adequate proof of concept.  ;-)
> >>>>>>>>> 
> >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I
> >>>>>>>>> find that Android Mediatek devices at least are setting
> >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is
> >>>>>>>>> weird, it should be set to 1 as early as possible), and
> >>>>>>>>> interestingly I cannot find them resetting it back to 0!.  Maybe
> >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> >>>>>>>> 
> >>>>>>>> Interesting.  Though this is consistent with Antti's commit log,
> >>>>>>>> where he talks about expediting grace periods but not unexpediting
> >>>>>>>> them.
> >>>>>>>> 
> >>>>>>> Do you think we need to unexpedite it? :))))
> >>>>>> 
> >>>>>> Android runs on smallish systems, so quite possibly not!
> >>>>>> 
> >>>>> We keep it enabled and never unexpedite it. The reason is a performance.  I
> >>>>> have done some app-launch time analysis with enabling and disabling of it.
> >>>>> 
> >>>>> An expedited case is much better when it comes to app launch time. It
> >>>>> requires ~25% less time to run an app comparing with unexpedited variant.
> >>>>> So we have a big gain here.
> >>>> 
> >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> >>>> be slowing down other usecases! I find it hard to believe, real-time
> >>>> workloads will run better without those callbacks being always-expedited if
> >>>> it actually gives back 25% in performance!
> >>>> 
> >>> I can dig further, but on a high level i think there are some spots
> >>> which show better performance if expedited is set. I mean synchronize_rcu()
> >>> becomes as "less blocking a context" from a time point of view.
> >>> 
> >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> >>> delays for a caller. For example for nocb case we do not know where in a list
> >>> our callback is located and when it is invoked to unblock a caller.
> >> 
> >> True, expedited RCU grace periods do not have this callback-invocation
> >> delay that normal RCU does.
> >> 
> >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> >>> one by one.
> >> 
> >> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> >> so making the RCU grace-period kthread do them all sequentially is not
> >> a strategy to win.  For example, note that the next expedited grace
> >> period can start before the previous expedited grace period has finished
> >> its wakeups.
> >> 
> > I have done a small and quick prototype:
> > 
> > <snip>
> > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> > index 699b938358bf..e1a4cca9a208 100644
> > --- a/include/linux/rcupdate_wait.h
> > +++ b/include/linux/rcupdate_wait.h
> > @@ -9,6 +9,8 @@
> > #include <linux/rcupdate.h>
> > #include <linux/completion.h>
> > 
> > +extern struct llist_head gp_wait_llist;
> > +
> > /*
> >  * Structure allowing asynchronous waiting on RCU.
> >  */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index ee27a03d7576..50b81ca54104 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> > int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > 
> > +/* Waiters for a GP kthread. */
> > +LLIST_HEAD(gp_wait_llist);
> > +
> > /*
> >  * The rcu_scheduler_active variable is initialized to the value
> >  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
> >                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > }
> > 
> > +static void rcu_notify_gp_end(struct llist_node *llist)
> > +{
> > +       struct llist_node *rcu, *next;
> > +
> > +       llist_for_each_safe(rcu, next, llist)
> > +               complete(&((struct rcu_synchronize *) rcu)->completion);
> 
> This looks broken to me, so the synchronize will complete even
> if it was called in the middle of an ongoing GP?
> 
Do you mean that a new GP sequence can be initiated after rcu_gp_cleanup()
but before the list is replaced?

--
Uladzislau Rezki
  
Joel Fernandes March 13, 2023, 3:49 p.m. UTC | #45
On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote:
> >
> >
> > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > >>>> [..]
> > >>>>>>>>>> See this commit:
> > >>>>>>>>>>
> > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > >>>>>>>>>> expedited RCU primitives")
> > >>>>>>>>>>
> > >>>>>>>>>> Antti provided this commit precisely in order to allow Android
> > >>>>>>>>>> devices to expedite the boot process and to shut off the
> > >>>>>>>>>> expediting at a time of Android userspace's choosing.  So Android
> > >>>>>>>>>> has been making this work for about ten years, which strikes me
> > >>>>>>>>>> as an adequate proof of concept.  ;-)
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I
> > >>>>>>>>> find that Android Mediatek devices at least are setting
> > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is
> > >>>>>>>>> weird, it should be set to 1 as early as possible), and
> > >>>>>>>>> interestingly I cannot find them resetting it back to 0!.  Maybe
> > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > >>>>>>>>
> > >>>>>>>> Interesting.  Though this is consistent with Antti's commit log,
> > >>>>>>>> where he talks about expediting grace periods but not unexpediting
> > >>>>>>>> them.
> > >>>>>>>>
> > >>>>>>> Do you think we need to unexpedite it? :))))
> > >>>>>>
> > >>>>>> Android runs on smallish systems, so quite possibly not!
> > >>>>>>
> > >>>>> We keep it enabled and never unexpedite it. The reason is a performance.  I
> > >>>>> have done some app-launch time analysis with enabling and disabling of it.
> > >>>>>
> > >>>>> An expedited case is much better when it comes to app launch time. It
> > >>>>> requires ~25% less time to run an app comparing with unexpedited variant.
> > >>>>> So we have a big gain here.
> > >>>>
> > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > >>>> be slowing down other usecases! I find it hard to believe, real-time
> > >>>> workloads will run better without those callbacks being always-expedited if
> > >>>> it actually gives back 25% in performance!
> > >>>>
> > >>> I can dig further, but on a high level i think there are some spots
> > >>> which show better performance if expedited is set. I mean synchronize_rcu()
> > >>> becomes as "less blocking a context" from a time point of view.
> > >>>
> > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > >>> delays for a caller. For example for nocb case we do not know where in a list
> > >>> our callback is located and when it is invoked to unblock a caller.
> > >>
> > >> True, expedited RCU grace periods do not have this callback-invocation
> > >> delay that normal RCU does.
> > >>
> > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > >>> one by one.
> > >>
> > >> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> > >> so making the RCU grace-period kthread do them all sequentially is not
> > >> a strategy to win.  For example, note that the next expedited grace
> > >> period can start before the previous expedited grace period has finished
> > >> its wakeups.
> > >>
> > > I hove done a small and quick prototype:
> > >
> > > <snip>
> > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> > > index 699b938358bf..e1a4cca9a208 100644
> > > --- a/include/linux/rcupdate_wait.h
> > > +++ b/include/linux/rcupdate_wait.h
> > > @@ -9,6 +9,8 @@
> > > #include <linux/rcupdate.h>
> > > #include <linux/completion.h>
> > >
> > > +extern struct llist_head gp_wait_llist;
> > > +
> > > /*
> > >  * Structure allowing asynchronous waiting on RCU.
> > >  */
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index ee27a03d7576..50b81ca54104 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > >
> > > +/* Waiters for a GP kthread. */
> > > +LLIST_HEAD(gp_wait_llist);
> > > +
> > > /*
> > >  * The rcu_scheduler_active variable is initialized to the value
> > >  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
> > >                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > }
> > >
> > > +static void rcu_notify_gp_end(struct llist_node *llist)
> > > +{
> > > +       struct llist_node *rcu, *next;
> > > +
> > > +       llist_for_each_safe(rcu, next, llist)
> > > +               complete(&((struct rcu_synchronize *) rcu)->completion);
> >
> > This looks broken to me, so the synchronize will complete even
> > if it was called in the middle of an ongoing GP?
> >
> Do you mean before replacing the list(and after rcu_gp_cleanup()) a new
> GP sequence can be initiated?

I guess I mean rcu_notify_gp_end() is called at the end of the current
grace period, which might be the grace period which started _before_
the synchronize_rcu() was called. So the callback needs to be invoked
after the end of the next grace period, not the current one.
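
To make the ordering requirement above concrete, here is a toy user-space
model, not kernel code. The names gp_seq, gp_snap() and gp_done() are made
up for illustration. The counter is assumed to be odd while a grace period
is running and even when idle, and counter wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy counter: odd while a grace period (GP) is running, even when idle. */
static unsigned long gp_seq;

static void gp_start(void)
{
	assert(!(gp_seq & 1));	/* no GP may already be running */
	gp_seq++;		/* even -> odd */
}

static void gp_end(void)
{
	assert(gp_seq & 1);	/* a GP must be running */
	gp_seq++;		/* odd -> even */
}

/*
 * Value gp_seq must reach before a waiter arriving right now has seen
 * a full GP.  If a GP is already running, it does not count: the
 * waiter needs the end of the next one.
 */
static unsigned long gp_snap(void)
{
	return (gp_seq + 3) & ~1UL;
}

/* True once a full GP has elapsed since the snapshot was taken. */
static bool gp_done(unsigned long snap)
{
	return gp_seq >= snap;
}
```

With a grace period already running (gp_seq == 1), gp_snap() returns 4.
gp_done(4) is still false when that running grace period ends (gp_seq == 2)
and only becomes true after the next full one (gp_seq == 4).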

Did I miss some part of your patch that is handling this?

thanks,

 - Joel
  
Uladzislau Rezki March 13, 2023, 6:12 p.m. UTC | #46
On Mon, Mar 13, 2023 at 11:49:58AM -0400, Joel Fernandes wrote:
> On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote:
> > >
> > >
> > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > > >>>> [..]
> > > >>>>>>>>>> See this commit:
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > >>>>>>>>>> expedited RCU primitives")
> > > >>>>>>>>>>
> > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android
> > > >>>>>>>>>> devices to expedite the boot process and to shut off the
> > > >>>>>>>>>> expediting at a time of Android userspace's choosing.  So Android
> > > >>>>>>>>>> has been making this work for about ten years, which strikes me
> > > >>>>>>>>>> as an adequate proof of concept.  ;-)
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I
> > > >>>>>>>>> find that Android Mediatek devices at least are setting
> > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is
> > > >>>>>>>>> weird, it should be set to 1 as early as possible), and
> > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!.  Maybe
> > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > >>>>>>>>
> > > >>>>>>>> Interesting.  Though this is consistent with Antti's commit log,
> > > >>>>>>>> where he talks about expediting grace periods but not unexpediting
> > > >>>>>>>> them.
> > > >>>>>>>>
> > > >>>>>>> Do you think we need to unexpedite it? :))))
> > > >>>>>>
> > > >>>>>> Android runs on smallish systems, so quite possibly not!
> > > >>>>>>
> > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > >>>>> have done some app-launch time analysis with enabling and disabling of it.
> > > >>>>>
> > > >>>>> An expedited case is much better when it comes to app launch time. It
> > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant.
> > > >>>>> So we have a big gain here.
> > > >>>>
> > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > >>>> be slowing down other usecases! I find it hard to believe, real-time
> > > >>>> workloads will run better without those callbacks being always-expedited if
> > > >>>> it actually gives back 25% in performance!
> > > >>>>
> > > >>> I can dig further, but on a high level i think there are some spots
> > > >>> which show better performance if expedited is set. I mean synchronize_rcu()
> > > >>> becomes as "less blocking a context" from a time point of view.
> > > >>>
> > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > > >>> delays for a caller. For example for nocb case we do not know where in a list
> > > >>> our callback is located and when it is invoked to unblock a caller.
> > > >>
> > > >> True, expedited RCU grace periods do not have this callback-invocation
> > > >> delay that normal RCU does.
> > > >>
> > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > > >>> one by one.
> > > >>
> > > >> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> > > >> so making the RCU grace-period kthread do them all sequentially is not
> > > >> a strategy to win.  For example, note that the next expedited grace
> > > >> period can start before the previous expedited grace period has finished
> > > >> its wakeups.
> > > >>
> > > > I hove done a small and quick prototype:
> > > >
> > > > <snip>
> > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> > > > index 699b938358bf..e1a4cca9a208 100644
> > > > --- a/include/linux/rcupdate_wait.h
> > > > +++ b/include/linux/rcupdate_wait.h
> > > > @@ -9,6 +9,8 @@
> > > > #include <linux/rcupdate.h>
> > > > #include <linux/completion.h>
> > > >
> > > > +extern struct llist_head gp_wait_llist;
> > > > +
> > > > /*
> > > >  * Structure allowing asynchronous waiting on RCU.
> > > >  */
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index ee27a03d7576..50b81ca54104 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > >
> > > > +/* Waiters for a GP kthread. */
> > > > +LLIST_HEAD(gp_wait_llist);
> > > > +
> > > > /*
> > > >  * The rcu_scheduler_active variable is initialized to the value
> > > >  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
> > > >                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > > }
> > > >
> > > > +static void rcu_notify_gp_end(struct llist_node *llist)
> > > > +{
> > > > +       struct llist_node *rcu, *next;
> > > > +
> > > > +       llist_for_each_safe(rcu, next, llist)
> > > > +               complete(&((struct rcu_synchronize *) rcu)->completion);
> > >
> > > This looks broken to me, so the synchronize will complete even
> > > if it was called in the middle of an ongoing GP?
> > >
> > Do you mean before replacing the list(and after rcu_gp_cleanup()) a new
> > GP sequence can be initiated?
> 
> I guess I mean rcu_notify_gp_end() is called at the end of the current
> grace period, which might be the grace period which started _before_
> the synchronize_rcu() was called. So the callback needs to be invoked
> after the end of the next grace period, not the current one.
> 
> Did I miss some part of your patch that is handling this?
> 
No, you did not! That was my mistake: I placed llist_del_all() in the
wrong place. We have to guarantee a full grace period. But this is
a prototype and a kind of kick-off :)

--
Uladzislau Rezki
  
Paul E. McKenney March 13, 2023, 6:56 p.m. UTC | #47
On Mon, Mar 13, 2023 at 07:12:07PM +0100, Uladzislau Rezki wrote:
> On Mon, Mar 13, 2023 at 11:49:58AM -0400, Joel Fernandes wrote:
> > On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote:
> > > >
> > > >
> > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> > > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > > > >>>> [..]
> > > > >>>>>>>>>> See this commit:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > >>>>>>>>>> expedited RCU primitives")
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android
> > > > >>>>>>>>>> devices to expedite the boot process and to shut off the
> > > > >>>>>>>>>> expediting at a time of Android userspace's choosing.  So Android
> > > > >>>>>>>>>> has been making this work for about ten years, which strikes me
> > > > >>>>>>>>>> as an adequate proof of concept.  ;-)
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I
> > > > >>>>>>>>> find that Android Mediatek devices at least are setting
> > > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > >>>>>>>>> weird, it should be set to 1 as early as possible), and
> > > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > >>>>>>>>
> > > > >>>>>>>> Interesting.  Though this is consistent with Antti's commit log,
> > > > >>>>>>>> where he talks about expediting grace periods but not unexpediting
> > > > >>>>>>>> them.
> > > > >>>>>>>>
> > > > >>>>>>> Do you think we need to unexpedite it? :))))
> > > > >>>>>>
> > > > >>>>>> Android runs on smallish systems, so quite possibly not!
> > > > >>>>>>
> > > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > > >>>>> have done some app-launch time analysis with enabling and disabling of it.
> > > > >>>>>
> > > > >>>>> An expedited case is much better when it comes to app launch time. It
> > > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant.
> > > > >>>>> So we have a big gain here.
> > > > >>>>
> > > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > > >>>> be slowing down other usecases! I find it hard to believe, real-time
> > > > >>>> workloads will run better without those callbacks being always-expedited if
> > > > >>>> it actually gives back 25% in performance!
> > > > >>>>
> > > > >>> I can dig further, but on a high level i think there are some spots
> > > > >>> which show better performance if expedited is set. I mean synchronize_rcu()
> > > > >>> becomes as "less blocking a context" from a time point of view.
> > > > >>>
> > > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > > > >>> delays for a caller. For example for nocb case we do not know where in a list
> > > > >>> our callback is located and when it is invoked to unblock a caller.
> > > > >>
> > > > >> True, expedited RCU grace periods do not have this callback-invocation
> > > > >> delay that normal RCU does.
> > > > >>
> > > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > > > >>> one by one.
> > > > >>
> > > > >> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> > > > >> so making the RCU grace-period kthread do them all sequentially is not
> > > > >> a strategy to win.  For example, note that the next expedited grace
> > > > >> period can start before the previous expedited grace period has finished
> > > > >> its wakeups.
> > > > >>
> > > > > I hove done a small and quick prototype:
> > > > >
> > > > > <snip>
> > > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> > > > > index 699b938358bf..e1a4cca9a208 100644
> > > > > --- a/include/linux/rcupdate_wait.h
> > > > > +++ b/include/linux/rcupdate_wait.h
> > > > > @@ -9,6 +9,8 @@
> > > > > #include <linux/rcupdate.h>
> > > > > #include <linux/completion.h>
> > > > >
> > > > > +extern struct llist_head gp_wait_llist;
> > > > > +
> > > > > /*
> > > > >  * Structure allowing asynchronous waiting on RCU.
> > > > >  */
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index ee27a03d7576..50b81ca54104 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> > > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > > >
> > > > > +/* Waiters for a GP kthread. */
> > > > > +LLIST_HEAD(gp_wait_llist);

This being a single global will of course fail due to memory contention
on large systems.  So a patch that is ready for mainline must have
per-rcu_node-structure lists or something similar.

> > > > > /*
> > > > >  * The rcu_scheduler_active variable is initialized to the value
> > > > >  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> > > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
> > > > >                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > > > }
> > > > >
> > > > > +static void rcu_notify_gp_end(struct llist_node *llist)

And calling this directly from rcu_gp_kthread() is a no-go for large
systems because the large number of wakeups will CPU-bound that kthread.
Also, it would be better to invoke this from rcu_gp_cleanup().

One option would be to do the wakeups from a workqueue handler.

You might also want to have an array of lists indexed by the bottom few
bits of the RCU grace-period sequence number.  This would reduce the
number of spurious wakeups.
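
A rough user-space sketch of the bucketing idea, not kernel code. It
assumes a plain count of completed grace periods rather than the kernel's
real sequence-number encoding, and struct waiter, NBUCKETS,
waiter_enqueue() and wake_bucket() are hypothetical names. Each waiter is
filed under the low bits of the grace period it needs, so the end of a
grace period only walks one bucket. Because buckets alias, the target is
still re-checked before a waiter is woken.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 4	/* hypothetical size; must be a power of two */

struct waiter {
	unsigned long target;	/* GP number that must have completed */
	int woken;		/* stands in for complete() on the waiter */
	struct waiter *next;
};

static struct waiter *buckets[NBUCKETS];

/* File the waiter under the low bits of the GP number it waits for. */
static void waiter_enqueue(struct waiter *w, unsigned long target)
{
	struct waiter **head = &buckets[target & (NBUCKETS - 1)];

	assert(w != NULL);
	w->target = target;
	w->woken = 0;
	w->next = *head;
	*head = w;
}

/*
 * Called when GP number 'completed' ends.  Only one bucket is scanned.
 * Waiters for a later GP that alias into the same bucket stay queued.
 * Returns the number of waiters woken.
 */
static int wake_bucket(unsigned long completed)
{
	struct waiter **pp = &buckets[completed & (NBUCKETS - 1)];
	int n = 0;

	while (*pp) {
		struct waiter *w = *pp;

		if (w->target <= completed) {
			*pp = w->next;	/* unlink and wake */
			w->next = NULL;
			w->woken = 1;
			n++;
		} else {
			pp = &w->next;	/* aliased later GP: keep waiting */
		}
	}
	return n;
}
```

For example, with NBUCKETS == 4, waiters for GP 5 and GP 9 share a bucket.
wake_bucket(5) wakes only the first one and leaves the second queued until
wake_bucket(9).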

> > > > > +{
> > > > > +       struct llist_node *rcu, *next;
> > > > > +
> > > > > +       llist_for_each_safe(rcu, next, llist)
> > > > > +               complete(&((struct rcu_synchronize *) rcu)->completion);

If you don't eliminate spurious wakeups, it is necessary to do something
like checking poll_state_synchronize_rcu() to reject those wakeups.

							Thanx, Paul

> > > >
> > > > This looks broken to me, so the synchronize will complete even
> > > > if it was called in the middle of an ongoing GP?
> > > >
> > > Do you mean before replacing the list(and after rcu_gp_cleanup()) a new
> > > GP sequence can be initiated?
> > 
> > I guess I mean rcu_notify_gp_end() is called at the end of the current
> > grace period, which might be the grace period which started _before_
> > the synchronize_rcu() was called. So the callback needs to be invoked
> > after the end of the next grace period, not the current one.
> > 
> > Did I miss some part of your patch that is handling this?
> > 
> No, you did not! That was my mistake: I placed llist_del_all() in the
> wrong place. We have to guarantee a full grace period. But this is
> a prototype and a kind of kick-off :)
> 
> --
> Uladzislau Rezki
  
Uladzislau Rezki March 14, 2023, 11:16 a.m. UTC | #48
On Mon, Mar 13, 2023 at 11:56:34AM -0700, Paul E. McKenney wrote:
> On Mon, Mar 13, 2023 at 07:12:07PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 13, 2023 at 11:49:58AM -0400, Joel Fernandes wrote:
> > > On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote:
> > > > >
> > > > >
> > > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> > > > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > > > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > > > > >>>> [..]
> > > > > >>>>>>>>>> See this commit:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > >>>>>>>>>> expedited RCU primitives")
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android
> > > > > >>>>>>>>>> devices to expedite the boot process and to shut off the
> > > > > >>>>>>>>>> expediting at a time of Android userspace's choosing.  So Android
> > > > > >>>>>>>>>> has been making this work for about ten years, which strikes me
> > > > > >>>>>>>>>> as an adequate proof of concept.  ;-)
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > >>>>>>>>> find that Android Mediatek devices at least are setting
> > > > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > >>>>>>>>> weird, it should be set to 1 as early as possible), and
> > > > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > > >>>>>>>>
> > > > > >>>>>>>> Interesting.  Though this is consistent with Antti's commit log,
> > > > > >>>>>>>> where he talks about expediting grace periods but not unexpediting
> > > > > >>>>>>>> them.
> > > > > >>>>>>>>
> > > > > >>>>>>> Do you think we need to unexpedite it? :))))
> > > > > >>>>>>
> > > > > >>>>>> Android runs on smallish systems, so quite possibly not!
> > > > > >>>>>>
> > > > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > > > >>>>> have done some app-launch time analysis with enabling and disabling of it.
> > > > > >>>>>
> > > > > >>>>> An expedited case is much better when it comes to app launch time. It
> > > > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant.
> > > > > >>>>> So we have a big gain here.
> > > > > >>>>
> > > > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > > > >>>> be slowing down other usecases! I find it hard to believe, real-time
> > > > > >>>> workloads will run better without those callbacks being always-expedited if
> > > > > >>>> it actually gives back 25% in performance!
> > > > > >>>>
> > > > > >>> I can dig further, but on a high level i think there are some spots
> > > > > >>> which show better performance if expedited is set. I mean synchronize_rcu()
> > > > > >>> becomes as "less blocking a context" from a time point of view.
> > > > > >>>
> > > > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > > > > >>> delays for a caller. For example for nocb case we do not know where in a list
> > > > > >>> our callback is located and when it is invoked to unblock a caller.
> > > > > >>
> > > > > >> True, expedited RCU grace periods do not have this callback-invocation
> > > > > >> delay that normal RCU does.
> > > > > >>
> > > > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > > > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > > > > >>> one by one.
> > > > > >>
> > > > > >> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> > > > > >> so making the RCU grace-period kthread do them all sequentially is not
> > > > > >> a strategy to win.  For example, note that the next expedited grace
> > > > > >> period can start before the previous expedited grace period has finished
> > > > > >> its wakeups.
> > > > > >>
> > > > > > I hove done a small and quick prototype:
> > > > > >
> > > > > > <snip>
> > > > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> > > > > > index 699b938358bf..e1a4cca9a208 100644
> > > > > > --- a/include/linux/rcupdate_wait.h
> > > > > > +++ b/include/linux/rcupdate_wait.h
> > > > > > @@ -9,6 +9,8 @@
> > > > > > #include <linux/rcupdate.h>
> > > > > > #include <linux/completion.h>
> > > > > >
> > > > > > +extern struct llist_head gp_wait_llist;
> > > > > > +
> > > > > > /*
> > > > > >  * Structure allowing asynchronous waiting on RCU.
> > > > > >  */
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index ee27a03d7576..50b81ca54104 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> > > > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > > > >
> > > > > > +/* Waiters for a GP kthread. */
> > > > > > +LLIST_HEAD(gp_wait_llist);
> 
> This being a single global will of course fail due to memory contention
> on large systems.  So a patch that is ready for mainline must have
> per-rcu_node-structure lists or something similar.
> 
I agree. This is a prototype and the aim is a proof of concept :)
On bigger systems the GP kthread can starve if it wakes up a lot of users.

At least I see that a camera app improves in terms of launch time.
The improvement is around 12%.

> > > > > > /*
> > > > > >  * The rcu_scheduler_active variable is initialized to the value
> > > > > >  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> > > > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
> > > > > >                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > > > > }
> > > > > >
> > > > > > +static void rcu_notify_gp_end(struct llist_node *llist)
> 
> And calling this directly from rcu_gp_kthread() is a no-go for large
> systems because the large number of wakeups will CPU-bound that kthread.
> Also, it would be better to invoke this from rcu_gp_cleanup().
> 
> One option would be to do the wakeups from a workqueue handler.
> 
> You might also want to have an array of lists indexed by the bottom few
> bits of the RCU grace-period sequence number.  This would reduce the
> number of spurious wakeups.
> 
> > > > > > +{
> > > > > > +       struct llist_node *rcu, *next;
> > > > > > +
> > > > > > +       llist_for_each_safe(rcu, next, llist)
> > > > > > +               complete(&((struct rcu_synchronize *) rcu)->completion);
> 
> If you don't eliminate spurious wakeups, it is necessary to do something
> like checking poll_state_synchronize_rcu() to reject those wakeups.
> 
OK.

I will come up with some data and figures soon.

--
Uladzislau Rezki
  
Paul E. McKenney March 14, 2023, 1:49 p.m. UTC | #49
On Tue, Mar 14, 2023 at 12:16:51PM +0100, Uladzislau Rezki wrote:
> On Mon, Mar 13, 2023 at 11:56:34AM -0700, Paul E. McKenney wrote:
> > On Mon, Mar 13, 2023 at 07:12:07PM +0100, Uladzislau Rezki wrote:
> > > On Mon, Mar 13, 2023 at 11:49:58AM -0400, Joel Fernandes wrote:
> > > > On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote:
> > > > > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> > > > > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > > > > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > > > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > > > > > >>>> [..]
> > > > > > >>>>>>>>>> See this commit:
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > > >>>>>>>>>> expedited RCU primitives")
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android
> > > > > > >>>>>>>>>> devices to expedite the boot process and to shut off the
> > > > > > >>>>>>>>>> expediting at a time of Android userspace's choosing.  So Android
> > > > > > >>>>>>>>>> has been making this work for about ten years, which strikes me
> > > > > > >>>>>>>>>> as an adequate proof of concept.  ;-)
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > > >>>>>>>>> find that Android Mediatek devices at least are setting
> > > > > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > > >>>>>>>>> weird, it should be set to 1 as early as possible), and
> > > > > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Interesting.  Though this is consistent with Antti's commit log,
> > > > > > >>>>>>>> where he talks about expediting grace periods but not unexpediting
> > > > > > >>>>>>>> them.
> > > > > > >>>>>>>>
> > > > > > >>>>>>> Do you think we need to unexpedite it? :))))
> > > > > > >>>>>>
> > > > > > >>>>>> Android runs on smallish systems, so quite possibly not!
> > > > > > >>>>>>
> > > > > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > > > > >>>>> have done some app-launch time analysis with enabling and disabling of it.
> > > > > > >>>>>
> > > > > > >>>>> An expedited case is much better when it comes to app launch time. It
> > > > > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant.
> > > > > > >>>>> So we have a big gain here.
> > > > > > >>>>
> > > > > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > > > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > > > > >>>> be slowing down other usecases! I find it hard to believe, real-time
> > > > > > >>>> workloads will run better without those callbacks being always-expedited if
> > > > > > >>>> it actually gives back 25% in performance!
> > > > > > >>>>
> > > > > > >>> I can dig further, but on a high level i think there are some spots
> > > > > > >>> which show better performance if expedited is set. I mean synchronize_rcu()
> > > > > > >>> becomes as "less blocking a context" from a time point of view.
> > > > > > >>>
> > > > > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > > > > > >>> delays for a caller. For example for nocb case we do not know where in a list
> > > > > > >>> our callback is located and when it is invoked to unblock a caller.
> > > > > > >>
> > > > > > >> True, expedited RCU grace periods do not have this callback-invocation
> > > > > > >> delay that normal RCU does.
> > > > > > >>
> > > > > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > > > > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > > > > > >>> one by one.
> > > > > > >>
> > > > > > >> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> > > > > > >> so making the RCU grace-period kthread do them all sequentially is not
> > > > > > >> a strategy to win.  For example, note that the next expedited grace
> > > > > > >> period can start before the previous expedited grace period has finished
> > > > > > >> its wakeups.
> > > > > > >>
> > > > > > > I hove done a small and quick prototype:
> > > > > > >
> > > > > > > <snip>
> > > > > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> > > > > > > index 699b938358bf..e1a4cca9a208 100644
> > > > > > > --- a/include/linux/rcupdate_wait.h
> > > > > > > +++ b/include/linux/rcupdate_wait.h
> > > > > > > @@ -9,6 +9,8 @@
> > > > > > > #include <linux/rcupdate.h>
> > > > > > > #include <linux/completion.h>
> > > > > > >
> > > > > > > +extern struct llist_head gp_wait_llist;
> > > > > > > +
> > > > > > > /*
> > > > > > >  * Structure allowing asynchronous waiting on RCU.
> > > > > > >  */
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index ee27a03d7576..50b81ca54104 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> > > > > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > > > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > > > > >
> > > > > > > +/* Waiters for a GP kthread. */
> > > > > > > +LLIST_HEAD(gp_wait_llist);
> > 
> > This being a single global will of course fail due to memory contention
> > on large systems.  So a patch that is ready for mainline must have
> > per-rcu_node-structure lists or something similar.
> > 
> I agree. This is a prototype and the aim is a proof of concept :)
> On bigger systems the GP kthread can starve if it wakes up a lot of users.
> 
> At least I see that a camera app improves in terms of launch time.
> The improvement is around 12%.

Understood and agreed, lack of scalability is OK for a prototype
for testing purposes.

> > > > > > > /*
> > > > > > >  * The rcu_scheduler_active variable is initialized to the value
> > > > > > >  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> > > > > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
> > > > > > >                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > > > > > }
> > > > > > >
> > > > > > > +static void rcu_notify_gp_end(struct llist_node *llist)
> > 
> > And calling this directly from rcu_gp_kthread() is a no-go for large
> > systems because the large number of wakeups will CPU-bound that kthread.
> > Also, it would be better to invoke this from rcu_gp_cleanup().
> > 
> > One option would be to do the wakeups from a workqueue handler.
> > 
> > You might also want to have an array of lists indexed by the bottom few
> > bits of the RCU grace-period sequence number.  This would reduce the
> > number of spurious wakeups.
> > 
> > > > > > > +{
> > > > > > > +       struct llist_node *rcu, *next;
> > > > > > > +
> > > > > > > +       llist_for_each_safe(rcu, next, llist)
> > > > > > > +               complete(&((struct rcu_synchronize *) rcu)->completion);
> > 
> > If you don't eliminate spurious wakeups, it is necessary to do something
> > like checking poll_state_synchronize_rcu() to reject those wakeups.
> > 
> OK.
> 
> I will come up with some data and figures soon.

Sounds good!

							Thanx, Paul
  
Joel Fernandes March 14, 2023, 10:44 p.m. UTC | #50
On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote:
> >
> >
> > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > >>>> [..]
> > >>>>>>>>>> See this commit:
> > >>>>>>>>>>
> > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > >>>>>>>>>> expedited RCU primitives")
> > >>>>>>>>>>
> > >>>>>>>>>> Antti provided this commit precisely in order to allow Android
> > >>>>>>>>>> devices to expedite the boot process and to shut off the
> > >>>>>>>>>> expediting at a time of Android userspace's choosing.  So Android
> > >>>>>>>>>> has been making this work for about ten years, which strikes me
> > >>>>>>>>>> as an adequate proof of concept.  ;-)
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I
> > >>>>>>>>> find that Android Mediatek devices at least are setting
> > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is
> > >>>>>>>>> weird, it should be set to 1 as early as possible), and
> > >>>>>>>>> interestingly I cannot find them resetting it back to 0!.  Maybe
> > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > >>>>>>>>
> > >>>>>>>> Interesting.  Though this is consistent with Antti's commit log,
> > >>>>>>>> where he talks about expediting grace periods but not unexpediting
> > >>>>>>>> them.
> > >>>>>>>>
> > >>>>>>> Do you think we need to unexpedite it? :))))
> > >>>>>>
> > >>>>>> Android runs on smallish systems, so quite possibly not!
> > >>>>>>
> > >>>>> We keep it enabled and never unexpedite it. The reason is a performance.  I
> > >>>>> have done some app-launch time analysis with enabling and disabling of it.
> > >>>>>
> > >>>>> An expedited case is much better when it comes to app launch time. It
> > >>>>> requires ~25% less time to run an app comparing with unexpedited variant.
> > >>>>> So we have a big gain here.
> > >>>>
> > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > >>>> be slowing down other usecases! I find it hard to believe, real-time
> > >>>> workloads will run better without those callbacks being always-expedited if
> > >>>> it actually gives back 25% in performance!
> > >>>>
> > >>> I can dig further, but on a high level i think there are some spots
> > >>> which show better performance if expedited is set. I mean synchronize_rcu()
> > >>> becomes as "less blocking a context" from a time point of view.
> > >>>
> > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > >>> delays for a caller. For example for nocb case we do not know where in a list
> > >>> our callback is located and when it is invoked to unblock a caller.
> > >>
> > >> True, expedited RCU grace periods do not have this callback-invocation
> > >> delay that normal RCU does.
> > >>
> > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > >>> one by one.
> > >>
> > >> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> > >> so making the RCU grace-period kthread do them all sequentially is not
> > >> a strategy to win.  For example, note that the next expedited grace
> > >> period can start before the previous expedited grace period has finished
> > >> its wakeups.
> > >>
> > > I have done a small and quick prototype:
> > >
> > > <snip>
> > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> > > index 699b938358bf..e1a4cca9a208 100644
> > > --- a/include/linux/rcupdate_wait.h
> > > +++ b/include/linux/rcupdate_wait.h
> > > @@ -9,6 +9,8 @@
> > > #include <linux/rcupdate.h>
> > > #include <linux/completion.h>
> > >
> > > +extern struct llist_head gp_wait_llist;
> > > +
> > > /*
> > >  * Structure allowing asynchronous waiting on RCU.
> > >  */
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index ee27a03d7576..50b81ca54104 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > >
> > > +/* Waiters for a GP kthread. */
> > > +LLIST_HEAD(gp_wait_llist);
> > > +
> > > /*
> > >  * The rcu_scheduler_active variable is initialized to the value
> > >  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
> > >                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > }
> > >
> > > +static void rcu_notify_gp_end(struct llist_node *llist)
> > > +{
> > > +       struct llist_node *rcu, *next;
> > > +
> > > +       llist_for_each_safe(rcu, next, llist)
> > > +               complete(&((struct rcu_synchronize *) rcu)->completion);
> >
> > This looks broken to me, so the synchronize will complete even
> > if it was called in the middle of an ongoing GP?
> >
> Do you mean before replacing the list(and after rcu_gp_cleanup()) a new
> GP sequence can be initiated?

It looks interesting, I am happy to try it on ChromeOS once you
provide a patch, in case it improves something, even if that is
suspend or boot time.

My main concern was that if you do not wait for a full grace period
(which, as you indicated, you would fix), you are not really measuring
the long delays that a full grace period can cause.  So IMHO it is
important to measure only once the modification preserves correctness.
To that end, having rcutorture pass with your modification could be a
vote of confidence before proceeding to performance tests.

 - Joel
  
Joel Fernandes March 15, 2023, 12:18 p.m. UTC | #51
On Wed, Mar 08, 2023 at 11:14:40AM +0100, Uladzislau Rezki wrote:
> On Tue, Mar 07, 2023 at 08:48:52AM -0500, Joel Fernandes wrote:
> > On Tue, Mar 7, 2023 at 8:40 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Tue, Mar 07, 2023 at 02:01:54PM +0100, Frederic Weisbecker wrote:
> > > > On Fri, Mar 03, 2023 at 09:38:51PM +0000, Joel Fernandes (Google) wrote:
> > > > > On many systems, a great deal of boot (in userspace) happens after the
> > > > > kernel thinks the boot has completed. It is difficult to determine if
> > > > > the system has really booted from the kernel side. Some features like
> > > > > lazy-RCU can risk slowing down boot time if, say, a callback has been
> > > > > added that the boot synchronously depends on. Further expedited callbacks
> > > > > can get unexpedited way earlier than it should be, thus slowing down
> > > > > boot (as shown in the data below).
> > > > >
> > > > > For these reasons, this commit adds a config option
> > > > > 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.
> > > > > Userspace can also make RCU's view of the system as booted, by writing the
> > > > > time in milliseconds to: /sys/module/rcupdate/parameters/rcu_boot_end_delay
> > > > > Or even just writing a value of 0 to this sysfs node.
> > > > > However, under no circumstance will the boot be allowed to end earlier
> > > > > than just before init is launched.
> > > > >
> > > > > The default value of CONFIG_RCU_BOOT_END_DELAY is chosen as 15s. This
> > > > > suites ChromeOS and also a PREEMPT_RT system below very well, which need
> > > > > no config or parameter changes, and just a simple application of this patch. A
> > > > > system designer can also choose a specific value here to keep RCU from marking
> > > > > boot completion.  As noted earlier, RCU's perspective of the system as booted
> > > > > will not be marker until at least rcu_boot_end_delay milliseconds have passed
> > > > > or an update is made via writing a small value (or 0) in milliseconds to:
> > > > > /sys/module/rcupdate/parameters/rcu_boot_end_delay.
> > > > >
> > > > > One side-effect of this patch is, there is a risk that a real-time workload
> > > > > launched just after the kernel boots will suffer interruptions due to expedited
> > > > > RCU, which previous ended just before init was launched. However, to mitigate
> > > > > such an issue (however unlikely), the user should either tune
> > > > > CONFIG_RCU_BOOT_END_DELAY to a smaller value than 15 seconds or write a value
> > > > > of 0 to /sys/module/rcupdate/parameters/rcu_boot_end_delay, once userspace
> > > > > boots, and before launching the real-time workload.
> > > > >
> > > > > Qiuxu also noted impressive boot-time improvements with earlier version
> > > > > of patch. An excerpt from the data he shared:
> > > > >
> > > > > 1) Testing environment:
> > > > >     OS            : CentOS Stream 8 (non-RT OS)
> > > > >     Kernel     : v6.2
> > > > >     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
> > > > >     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
> > > > >
> > > > > 2) OS boot time definition:
> > > > >     The time from the start of the kernel boot to the shell command line
> > > > >     prompt is shown from the console. [ Different people may have
> > > > >     different OS boot time definitions. ]
> > > > >
> > > > > 3) Measurement method (very rough method):
> > > > >     A timer in the kernel periodically prints the boot time every 100ms.
> > > > >     As soon as the shell command line prompt is shown from the console,
> > > > >     we record the boot time printed by the timer, then the printed boot
> > > > >     time is the OS boot time.
> > > > >
> > > > > 4) Measured OS boot time (in seconds)
> > > > >    a) Measured 10 times w/o this patch:
> > > > >         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
> > > > >         The average OS boot time was: ~8.7s
> > > > >
> > > > >    b) Measure 10 times w/ this patch:
> > > > >         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
> > > > >         The average OS boot time was: ~8.3s.
> > > > >
> > > > > Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > >
> > > > I still don't really like that:
> > > >
> > > > 1) It feels like we are curing a symptom for which we don't know the cause.
> > > >    Which RCU write side caller is the source of this slow boot? Some tracepoints
> > > >    reporting the wait duration within synchronize_rcu() calls between the end of
> > > >    the kernel boot and the end of userspace boot may be helpful.
> > > >
> > > > 2) The kernel boot was already covered before this patch so this is about
> > > >    userspace code calling into the kernel. Is that piece of code also called
> > > >    after the boot? In that case are we missing a conversion from
> > > >    synchronize_rcu() to synchronize_rcu_expedited() somewhere? Because then
> > > >    the problem is more general than just boot.
> > > >
> > > > This needs to be analyzed first and if it happens that the issue really
> > > > needs to be fixed with telling the kernel that userspace has completed
> > > > booting, eg: because the problem is not in a few callsites that need conversion
> > > > to expedited but instead in the accumulation of lots of calls that should stay
> > > > as is:
> > > >
> > > > 3) This arbitrary timeout looks dangerous to me as latency sensitive code
> > > >    may run right after the boot. Either you choose a value that is too low
> > > >    and you miss the optimization or the value is too high and you may break
> > > >    things.
> > > >
> > > > 4) This should be fixed the way you did:
> > > >    a) a kernel parameter like you did
> > > >    b) The init process (systemd?) tells the kernel when it judges that userspace
> > > >       has completed booting.
> > > >    c) Make these interfaces more generic, maybe that information will be useful
> > > >       outside RCU. For example the kernel parameter should be
> > > >       "user_booted_reported" and the sysfs (should be sysctl?):
> > > >       kernel.user_booted = 1
> > > >    d) But yuck, this means we must know if the init process supports that...
> > > >
> > > > For these reasons, let's make sure we know exactly what is going on first.
> > > >
> > > > Thanks.
> > > Just add some notes and thoughts. There is a rcupdate.rcu_expedited=1
> > > parameter that can be used during the boot. For example on our devices
> > > to speedup a boot we boot the kernel with rcu_expedited:
> > >
> > > XQ-DQ54:/ # cat /proc/cmdline
> > > stack_depot_disable=on kasan.stacktrace=off kvm-arm.mode=protected cgroup_disable=pressure console=ttyMSM0,115200n8 loglevel=6 kpti=0 log_buf_len=256K kernel.panic_on_rcu_stall=1 service_locator.enable=1 msm_rtb.filter=0x237 rcupdate.rcu_expedited=1 rcu_nocbs=0-7 ftrace_dump_on_oops swiotlb=noforce loop.max_part=7 fw_devlink.strict=1 allow_mismatched_32bit_el0 cpufreq.default_governor=performance printk.console_no_auto_verbose=1 kasan=off sysctl.kernel.sched_pelt_multiplier=4 can.stats_timer=0 pcie_ports=compat irqaffinity=0-2 disable_dma32=on no-steal-acc cgroup.memory=nokmem,nosocket video=vfb:640x400,bpp=32,memsize=3072000 page_owner=on stack_depot_disable=off printk.console_no_auto_verbose=0 nosoftlockup bootconfig buildvariant=userdebug  msm_drm.dsi_display0=somc,1_panel: rootwait ro init=/init  qcom_geni_serial.con_enabled=0 oembootloader.startup=0x00000001 oembootloader.warmboot=0x00000000 oembootloader.securityflags=0x00000001
> > > XQ-DQ54:/ #
> > >
> > > then a user space can decides if it is needed or not:
> > >
> > > <snip>
> > > rcu_expedited  rcu_normal
> > > XQ-DQ54:/ # ls -al /sys/kernel/rcu_*
> > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_expedited
> > > -rw-r--r-- 1 root root 4096 2023-02-16 09:27 /sys/kernel/rcu_normal
> > > XQ-DQ54:/ #
> > > <snip>
> > >
> > > for lazy we can add "rcu_cb_lazy" parameter and boot the kernel with
> > > true or false. So we can follow and be aligned with rcu_expedited and
> > > rcu_normal parameters.
> > 
> > Speaking of aligning, there is also the automated
> > rcu_normal_after_boot boot option correct? I prefer the automated
> > option of doing this. So the approach here is not really unprecedented
> > and is much more robust than relying on userspace too much (I am ok
> > with adding your suggestion *on top* of the automated toggle, but I
> > probably would not have ChromeOS use it if the automated way exists).
> > Or did I miss something?
> > 
> According to name of the rcu_end_inkernel_boot() function and a place
> when it is invoked we can conclude that it marks the end of kernel boot
> and it happens before running an "init" process.
> 
> With your patch we change a behavior. The initialization occurs not right
> after a kernel is up and running but rather after 15 seconds timeout what
> at least does not correspond to a function name. Apart from that an expected
> behavior might be different. For example some test-suites or smoke tests, etc.
> 
> Another thought about "automated boot complete" is we do not know from
> kernel space when it really completes for user space, because from kernel
> space we are done and we can detect it. In this cases a user space is a
> right candidate to say when it is ready.
> 
> For example for Android a boot complete happens when a home-screen appears.
> For Chrome OS i think there is something similar. There must be a boot complete
> event in its init scripts or something similar.
> 
> This is just my thoughts. I do not really mind but i also do not see a high
> need in having it.

Thanks for your thoughts. Perhaps if I am the only one who wants it, then it
is a bad idea. Here's hoping to get some more time to dig deeper into this;
this week has been crazy on the personal front.

thanks,

 - Joel
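
For reference, the userspace-driven alternative discussed above (boot with
rcupdate.rcu_expedited=1, then let init or a boot-complete hook turn
expediting off) amounts to one write to the sysfs knob shown in the quoted
`ls` output. A minimal sketch in C follows; write_knob() and the
RCU_EXPEDITED_KNOB macro are hypothetical names for illustration, and the
sketch assumes that writing "0" to the knob is enough to end expediting on
the target kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Path taken from the ls output quoted in this thread. */
#define RCU_EXPEDITED_KNOB "/sys/kernel/rcu_expedited"

/*
 * Write a short string to a sysfs-style file.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_knob(const char *path, const char *value)
{
	FILE *f;

	assert(path != NULL && value != NULL);

	f = fopen(path, "w");
	if (!f)
		return -errno;

	if (fputs(value, f) == EOF) {
		int err = errno;

		fclose(f);
		return -err;
	}

	/* sysfs store errors can surface when the buffer is flushed. */
	if (fclose(f) == EOF)
		return -errno;

	return 0;
}
```

A boot-complete hook would then call write_knob(RCU_EXPEDITED_KNOB, "0") as
root, before launching any latency-sensitive workload.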
  
Joel Fernandes March 15, 2023, 12:21 p.m. UTC | #52
On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > [..]
> > > > > > > > See this commit:
> > > > > > > > 
> > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > > > > expedited RCU primitives")
> > > > > > > > 
> > > > > > > > Antti provided this commit precisely in order to allow Android
> > > > > > > > devices to expedite the boot process and to shut off the
> > > > > > > > expediting at a time of Android userspace's choosing.  So Android
> > > > > > > > has been making this work for about ten years, which strikes me
> > > > > > > > as an adequate proof of concept.  ;-)
> > > > > > > 
> > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > > > find that Android Mediatek devices at least are setting
> > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > > > weird, it should be set to 1 as early as possible), and
> > > > > > > interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > > > 
> > > > > > Interesting.  Though this is consistent with Antti's commit log,
> > > > > > where he talks about expediting grace periods but not unexpediting
> > > > > > them.
> > > > > > 
> > > > > Do you think we need to unexpedite it? :))))
> > > > 
> > > > Android runs on smallish systems, so quite possibly not!
> > > > 
> > > We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > have done some app-launch time analysis with enabling and disabling of it.
> > > 
> > > An expedited case is much better when it comes to app launch time. It
> > > requires ~25% less time to run an app comparing with unexpedited variant.
> > > So we have a big gain here.
> > 
> > Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > be slowing down other usecases! I find it hard to believe, real-time
> > workloads will run better without those callbacks being always-expedited if
> > it actually gives back 25% in performance!
> > 
> I can dig further, but on a high level i think there are some spots
> which show better performance if expedited is set. I mean synchronize_rcu()
> becomes as "less blocking a context" from a time point of view.
> 
> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> delays for a caller. For example for nocb case we do not know where in a list
> our callback is located and when it is invoked to unblock a caller.
> 
> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> one by one.

Looking forward to your optimization. To overcome the wakeup overhead that
Paul mentioned, I wonder whether it is possible to cheaply find out how many
tasks there are to wake and, for the likely common case of a single task
blocked in synchronize_rcu(), to wake that one directly.
But there could be dragons..

thanks,

 - Joel
  
Paul E. McKenney March 15, 2023, 4:12 p.m. UTC | #53
On Wed, Mar 15, 2023 at 12:21:48PM +0000, Joel Fernandes wrote:
> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > > [..]
> > > > > > > > > See this commit:
> > > > > > > > > 
> > > > > > > > > 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > > > > > > > expedited RCU primitives")
> > > > > > > > > 
> > > > > > > > > Antti provided this commit precisely in order to allow Android
> > > > > > > > > devices to expedite the boot process and to shut off the
> > > > > > > > > expediting at a time of Android userspace's choosing.  So Android
> > > > > > > > > has been making this work for about ten years, which strikes me
> > > > > > > > > as an adequate proof of concept.  ;-)
> > > > > > > > 
> > > > > > > > Thanks for the pointer. That's true. Looking at Android sources, I
> > > > > > > > find that Android Mediatek devices at least are setting
> > > > > > > > rcu_expedited to 1 at late stage of their userspace boot (which is
> > > > > > > > weird, it should be set to 1 as early as possible), and
> > > > > > > > interestingly I cannot find them resetting it back to 0!.  Maybe
> > > > > > > > they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > > > > > 
> > > > > > > Interesting.  Though this is consistent with Antti's commit log,
> > > > > > > where he talks about expediting grace periods but not unexpediting
> > > > > > > them.
> > > > > > > 
> > > > > > Do you think we need to unexpedite it? :))))
> > > > > 
> > > > > Android runs on smallish systems, so quite possibly not!
> > > > > 
> > > > We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > > have done some app-launch time analysis with enabling and disabling of it.
> > > > 
> > > > An expedited case is much better when it comes to app launch time. It
> > > > requires ~25% less time to run an app comparing with unexpedited variant.
> > > > So we have a big gain here.
> > > 
> > > Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > be slowing down other usecases! I find it hard to believe, real-time
> > > workloads will run better without those callbacks being always-expedited if
> > > it actually gives back 25% in performance!
> > > 
> > I can dig further, but on a high level i think there are some spots
> > which show better performance if expedited is set. I mean synchronize_rcu()
> > becomes as "less blocking a context" from a time point of view.
> > 
> > The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > delays for a caller. For example for nocb case we do not know where in a list
> > our callback is located and when it is invoked to unblock a caller.
> > 
> > I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > one by one.
> 
> Looking forward to your optimization. To overcome the wakeup overhead that
> Paul mentioned, I wonder whether it is possible to cheaply find out how many
> tasks there are to wake and, for the likely common case of a single task
> blocked in synchronize_rcu(), to wake that one directly.
> But there could be dragons..

A per-rcu_node count of the number of tasks needing wakeups might work.
But for best results, there would be an array of such numbers indexed
by the low-order bits of the grace-period number (excluding the bottom
status bits).  The callback-offloading code uses such arrays, for example,
though not for counts of sleeping tasks.  (There cannot be that many
rcuo kthreads per group, so there has been no need to count them.)

							Thanx, Paul
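
A rough userspace sketch of the counting scheme described above: one small
array of counters per node, indexed by the low-order bits of the grace-period
number after the bottom status bits are shifted out. Everything here is an
assumption made for illustration (struct node_wait_counts, gp_slot(),
waiter_register(), waiters_to_wake(), two status bits, four slots); a real
implementation would need locking or atomics.

```c
#include <assert.h>
#include <stddef.h>

#define SEQ_STATE_BITS	2	/* assumed number of low-order status bits */
#define NR_SLOTS	4	/* assumed number of counters; power of two */

/* Hypothetical per-node state: one waiter count per grace-period slot. */
struct node_wait_counts {
	unsigned int nr_waiters[NR_SLOTS];
};

/* Map a grace-period sequence value to a slot, ignoring the status bits. */
static unsigned int gp_slot(unsigned long gp_seq)
{
	return (gp_seq >> SEQ_STATE_BITS) & (NR_SLOTS - 1);
}

/* Record that one more task sleeps until the grace period gp_seq ends. */
static void waiter_register(struct node_wait_counts *nc, unsigned long gp_seq)
{
	assert(nc != NULL);
	nc->nr_waiters[gp_slot(gp_seq)]++;
}

/*
 * At the end of grace period gp_seq, report how many tasks need a wakeup
 * and reset the slot so that it can be reused by a later grace period.
 */
static unsigned int waiters_to_wake(struct node_wait_counts *nc,
				    unsigned long gp_seq)
{
	unsigned int slot = gp_slot(gp_seq);
	unsigned int n;

	assert(nc != NULL);
	n = nc->nr_waiters[slot];
	nc->nr_waiters[slot] = 0;
	return n;
}
```

A caller could use the returned count to pick a strategy, for example waking
a single waiter directly and deferring larger batches to a workqueue.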
  
Uladzislau Rezki March 15, 2023, 5:12 p.m. UTC | #54
On Tue, Mar 14, 2023 at 06:44:44PM -0400, Joel Fernandes wrote:
> On Mon, Mar 13, 2023 at 11:32 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Mon, Mar 13, 2023 at 06:58:30AM -0700, Joel Fernandes wrote:
> > >
> > >
> > > > On Mar 13, 2023, at 2:51 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Fri, Mar 10, 2023 at 10:24:34PM -0800, Paul E. McKenney wrote:
> > > >>> On Fri, Mar 10, 2023 at 09:55:02AM +0100, Uladzislau Rezki wrote:
> > > >>> On Thu, Mar 09, 2023 at 10:10:56PM +0000, Joel Fernandes wrote:
> > > >>>> On Thu, Mar 09, 2023 at 01:57:42PM +0100, Uladzislau Rezki wrote:
> > > >>>> [..]
> > > >>>>>>>>>> See this commit:
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3705b88db0d7cc ("rcu: Add a module parameter to force use of
> > > >>>>>>>>>> expedited RCU primitives")
> > > >>>>>>>>>>
> > > >>>>>>>>>> Antti provided this commit precisely in order to allow Android
> > > >>>>>>>>>> devices to expedite the boot process and to shut off the
> > > >>>>>>>>>> expediting at a time of Android userspace's choosing.  So Android
> > > >>>>>>>>>> has been making this work for about ten years, which strikes me
> > > >>>>>>>>>> as an adequate proof of concept.  ;-)
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks for the pointer. That's true. Looking at Android sources, I
> > > >>>>>>>>> find that Android Mediatek devices at least are setting
> > > >>>>>>>>> rcu_expedited to 1 at late stage of their userspace boot (which is
> > > >>>>>>>>> weird, it should be set to 1 as early as possible), and
> > > >>>>>>>>> interestingly I cannot find them resetting it back to 0!.  Maybe
> > > >>>>>>>>> they set rcu_normal to 1? But I cannot find that either. Vlad? :P
> > > >>>>>>>>
> > > >>>>>>>> Interesting.  Though this is consistent with Antti's commit log,
> > > >>>>>>>> where he talks about expediting grace periods but not unexpediting
> > > >>>>>>>> them.
> > > >>>>>>>>
> > > >>>>>>> Do you think we need to unexpedite it? :))))
> > > >>>>>>
> > > >>>>>> Android runs on smallish systems, so quite possibly not!
> > > >>>>>>
> > > >>>>> We keep it enabled and never unexpedite it. The reason is a performance.  I
> > > >>>>> have done some app-launch time analysis with enabling and disabling of it.
> > > >>>>>
> > > >>>>> An expedited case is much better when it comes to app launch time. It
> > > >>>>> requires ~25% less time to run an app comparing with unexpedited variant.
> > > >>>>> So we have a big gain here.
> > > >>>>
> > > >>>> Wow, that's huge. I wonder if you can dig deeper and find out why that is so
> > > >>>> as the callbacks may need to be synchronize_rcu_expedited() then, as it could
> > > >>>> be slowing down other usecases! I find it hard to believe, real-time
> > > >>>> workloads will run better without those callbacks being always-expedited if
> > > >>>> it actually gives back 25% in performance!
> > > >>>>
> > > >>> I can dig further, but on a high level i think there are some spots
> > > >>> which show better performance if expedited is set. I mean synchronize_rcu()
> > > >>> becomes as "less blocking a context" from a time point of view.
> > > >>>
> > > >>> The problem of a regular synchronize_rcu() is - it can trigger a big latency
> > > >>> delays for a caller. For example for nocb case we do not know where in a list
> > > >>> our callback is located and when it is invoked to unblock a caller.
> > > >>
> > > >> True, expedited RCU grace periods do not have this callback-invocation
> > > >> delay that normal RCU does.
> > > >>
> > > >>> I have already mentioned somewhere. Probably it makes sense to directly wake-up
> > > >>> callers from the GP kthread instead and not via nocb-kthread that invokes our callbacks
> > > >>> one by one.
> > > >>
> > > >> Makes sense, but it is necessary to be careful.  Wakeups are not fast,
> > > >> so making the RCU grace-period kthread do them all sequentially is not
> > > >> a strategy to win.  For example, note that the next expedited grace
> > > >> period can start before the previous expedited grace period has finished
> > > >> its wakeups.
> > > >>
> > > > I have done a small and quick prototype:
> > > >
> > > > <snip>
> > > > diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
> > > > index 699b938358bf..e1a4cca9a208 100644
> > > > --- a/include/linux/rcupdate_wait.h
> > > > +++ b/include/linux/rcupdate_wait.h
> > > > @@ -9,6 +9,8 @@
> > > > #include <linux/rcupdate.h>
> > > > #include <linux/completion.h>
> > > >
> > > > +extern struct llist_head gp_wait_llist;
> > > > +
> > > > /*
> > > >  * Structure allowing asynchronous waiting on RCU.
> > > >  */
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index ee27a03d7576..50b81ca54104 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> > > > int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > > > int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > > >
> > > > +/* Waiters for a GP kthread. */
> > > > +LLIST_HEAD(gp_wait_llist);
> > > > +
> > > > /*
> > > >  * The rcu_scheduler_active variable is initialized to the value
> > > >  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
> > > > @@ -1776,6 +1779,14 @@ static noinline void rcu_gp_cleanup(void)
> > > >                on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> > > > }
> > > >
> > > > +static void rcu_notify_gp_end(struct llist_node *llist)
> > > > +{
> > > > +       struct llist_node *rcu, *next;
> > > > +
> > > > +       llist_for_each_safe(rcu, next, llist)
> > > > +               complete(&((struct rcu_synchronize *) rcu)->completion);
> > >
> > > This looks broken to me, so the synchronize will complete even
> > > if it was called in the middle of an ongoing GP?
> > >
> > Do you mean before replacing the list(and after rcu_gp_cleanup()) a new
> > GP sequence can be initiated?
> 
> It looks interesting, I am happy to try it on ChromeOS once you
> provide a patch, in case it improves something, even if that is
> suspend or boot time.
> 
> My main concern was that if you do not wait for a full grace period
> (which, as you indicated, you would fix), you are not really measuring
> the long delays that a full grace period can cause.  So IMHO it is
> important to measure only once the modification preserves correctness.
> To that end, having rcutorture pass with your modification could be a
> vote of confidence before proceeding to performance tests.
> 
No problem. Please note it is just a proof of concept. Here we go:

<snip>
diff --git a/include/linux/rcupdate_wait.h b/include/linux/rcupdate_wait.h
index 699b938358bf..e1a4cca9a208 100644
--- a/include/linux/rcupdate_wait.h
+++ b/include/linux/rcupdate_wait.h
@@ -9,6 +9,8 @@
 #include <linux/rcupdate.h>
 #include <linux/completion.h>
 
+extern struct llist_head gp_wait_llist;
+
 /*
  * Structure allowing asynchronous waiting on RCU.
  */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ee27a03d7576..a35b779471eb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -113,6 +113,9 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
 int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
 int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
 
+/* Waiters for a GP kthread. */
+LLIST_HEAD(gp_wait_llist);
+
 /*
  * The rcu_scheduler_active variable is initialized to the value
  * RCU_SCHEDULER_INACTIVE and transitions RCU_SCHEDULER_INIT just before the
@@ -1383,7 +1386,7 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 /*
  * Initialize a new grace period.  Return false if no grace period required.
  */
-static noinline_for_stack bool rcu_gp_init(void)
+static noinline_for_stack bool rcu_gp_init(struct llist_node **wait_list)
 {
 	unsigned long flags;
 	unsigned long oldmask;
@@ -1409,6 +1412,12 @@ static noinline_for_stack bool rcu_gp_init(void)
 		return false;
 	}
 
+	/*
+	 * Snapshot callers of synchronize_rcu() for which
+	 * we guarantee a full grace period to be passed.
+	 */
+	*wait_list = llist_del_all(&gp_wait_llist);
+
 	/* Advance to a new grace period and initialize state. */
 	record_gp_stall_check_time();
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
@@ -1776,11 +1785,27 @@ static noinline void rcu_gp_cleanup(void)
 		on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
 }
 
+static void rcu_notify_gp_end(struct llist_node *llist)
+{
+	struct llist_node *rcu, *next;
+	int n = 0;
+
+	llist_for_each_safe(rcu, next, llist) {
+		complete(&((struct rcu_synchronize *) rcu)->completion);
+		n++;
+	}
+
+	if (n)
+		trace_printk("Awoken %d users.\n", n);
+}
+
 /*
  * Body of kthread that handles grace periods.
  */
 static int __noreturn rcu_gp_kthread(void *unused)
 {
+	struct llist_node *wait_list;
+
 	rcu_bind_gp_kthread();
 	for (;;) {
 
@@ -1795,7 +1820,7 @@ static int __noreturn rcu_gp_kthread(void *unused)
 			rcu_gp_torture_wait();
 			WRITE_ONCE(rcu_state.gp_state, RCU_GP_DONE_GPS);
 			/* Locking provides needed memory barrier. */
-			if (rcu_gp_init())
+			if (rcu_gp_init(&wait_list))
 				break;
 			cond_resched_tasks_rcu_qs();
 			WRITE_ONCE(rcu_state.gp_activity, jiffies);
@@ -1811,6 +1836,9 @@ static int __noreturn rcu_gp_kthread(void *unused)
 		WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANUP);
 		rcu_gp_cleanup();
 		WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANED);
+
+		/* Wake up synchronize_rcu() users. */
+		rcu_notify_gp_end(wait_list);
 	}
 }
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..483997edd58e 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -426,7 +426,10 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
 		if (j == i) {
 			init_rcu_head_on_stack(&rs_array[i].head);
 			init_completion(&rs_array[i].completion);
-			(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
+			llist_add((struct llist_node *) &rs_array[i].head, &gp_wait_llist);
+
+			/* Kick a grace period if needed. */
+			(void) start_poll_synchronize_rcu();
 		}
 	}
<snip> 

I do not think that it improves your boot time. My concern, and what I
would like to fix, is:

<snip>
           <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
...
           <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
           <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
           <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
           <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
           <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
           <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
           <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>

I grabbed this example from our phone device: a user of synchronize_rcu() is
unblocked last, because its wakeme_after_rcu callback was the last one in the list.
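
To make the idea behind the proof of concept easier to follow, here is a
minimal user-space sketch of the "add to a lockless list, snapshot it at
grace-period start, wake the snapshot at grace-period end" pattern. It is
only an illustration under stated assumptions, not kernel code: struct
waiter, struct wait_list and the three helpers are hypothetical stand-ins
for struct rcu_synchronize, gp_wait_llist, llist_add(), llist_del_all() and
rcu_notify_gp_end(), and a plain bool stands in for the completion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_synchronize: one blocked caller. */
struct waiter {
	struct waiter *next;
	bool done;	/* stands in for complete(&rs->completion) */
};

/* Hypothetical stand-in for gp_wait_llist. */
struct wait_list {
	_Atomic(struct waiter *) first;
};

/* Models llist_add(): lock-free push onto the head of the list. */
static void wait_list_add(struct wait_list *l, struct waiter *w)
{
	struct waiter *old = atomic_load(&l->first);

	assert(w != NULL);
	do {
		w->next = old;	/* 'old' is refreshed when the CAS fails */
	} while (!atomic_compare_exchange_weak(&l->first, &old, w));
}

/* Models llist_del_all(): detach the whole list with one atomic exchange. */
static struct waiter *wait_list_del_all(struct wait_list *l)
{
	return atomic_exchange(&l->first, (struct waiter *)NULL);
}

/* Models rcu_notify_gp_end(): wake every waiter in the snapshot. */
static int notify_gp_end(struct waiter *snap)
{
	struct waiter *next;
	int n = 0;

	for (; snap; snap = next) {
		/* Read ->next first: a woken waiter may go away at once. */
		next = snap->next;
		snap->done = true;
		n++;
	}
	return n;
}
```

A waiter that is added after the snapshot was taken is not woken by that
grace period; it stays on the list and is picked up by the next snapshot.
The waiters that were snapshotted are woken directly, without sitting behind
thousands of unrelated callbacks.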

--
Uladzislau Rezki
  

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2429b5e3184b..611de90d9c13 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5085,6 +5085,18 @@ 
 	rcutorture.verbose= [KNL]
 			Enable additional printk() statements.
 
+	rcupdate.rcu_boot_end_delay= [KNL]
+			Minimum time in milliseconds that must elapse
+			before the boot sequence can be marked complete
+			from RCU's perspective, after which RCU's behavior
+			becomes more relaxed. The default value is also
+			configurable via CONFIG_RCU_BOOT_END_DELAY.
+			Userspace can also mark the boot as completed
+			sooner by writing the time in milliseconds, say once
+			userspace considers the system as booted, to:
+			/sys/module/rcupdate/parameters/rcu_boot_end_delay
+			Or even just writing a value of 0 to this sysfs node.
+
 	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
 			Dump ftrace buffer after reporting RCU CPU
 			stall warning.
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 9071182b1284..4b5ffa36cbaf 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -217,6 +217,25 @@  config RCU_BOOST_DELAY
 
 	  Accept the default if unsure.
 
+config RCU_BOOT_END_DELAY
+	int "Minimum time before RCU may consider in-kernel boot as completed"
+	range 0 120000
+	default 15000
+	help
+	  Default value of the minimum time in milliseconds that must elapse
+	  before the boot sequence can be marked complete from RCU's perspective,
+	  after which RCU's behavior becomes more relaxed.
+	  Userspace can also mark the boot as completed sooner than this default
+	  by writing the time in milliseconds, say once userspace considers
+	  the system as booted, to: /sys/module/rcupdate/parameters/rcu_boot_end_delay.
+	  Or even just writing a value of 0 to this sysfs node.
+
+	  The actual delay for RCU's view of the system to be marked as booted can be
+	  higher than this value if the kernel takes a long time to initialize but it
+	  will never be smaller than this value.
+
+	  Accept the default if unsure.
+
 config RCU_EXP_KTHREAD
 	bool "Perform RCU expedited work in a real-time kthread"
 	depends on RCU_BOOST && RCU_EXPERT
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..93138c92136e 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -224,18 +224,84 @@  void rcu_unexpedite_gp(void)
 }
 EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
 
+/*
+ * Minimum time in milliseconds until RCU can consider in-kernel boot as
+ * completed.  This can also be tuned at runtime to end the boot earlier, by
+ * userspace init code writing the time in milliseconds (even 0) to:
+ * /sys/module/rcupdate/parameters/rcu_boot_end_delay
+ */
+static int rcu_boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
+
 static bool rcu_boot_ended __read_mostly;
+static bool rcu_boot_end_called __read_mostly;
+static DEFINE_MUTEX(rcu_boot_end_lock);
+
+static int param_set_rcu_boot_end(const char *val, const struct kernel_param *kp)
+{
+	uint end_ms;
+	int ret = kstrtouint(val, 0, &end_ms);
+
+	if (ret)
+		return ret;
+	WRITE_ONCE(*(uint *)kp->arg, end_ms);
+
+	/*
+	 * rcu_end_inkernel_boot() should be called at least once during init
+	 * before we can allow param changes to end the boot.
+	 */
+	mutex_lock(&rcu_boot_end_lock);
+	rcu_boot_end_delay = end_ms;
+	if (!rcu_boot_ended && rcu_boot_end_called) {
+		mutex_unlock(&rcu_boot_end_lock);
+		rcu_end_inkernel_boot();
+	}
+	mutex_unlock(&rcu_boot_end_lock);
+	return ret;
+}
+
+static const struct kernel_param_ops rcu_boot_end_ops = {
+	.set = param_set_rcu_boot_end,
+	.get = param_get_uint,
+};
+module_param_cb(rcu_boot_end_delay, &rcu_boot_end_ops, &rcu_boot_end_delay, 0644);
 
 /*
- * Inform RCU of the end of the in-kernel boot sequence.
+ * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
+ * not be marked ended until at least rcu_boot_end_delay milliseconds have passed.
  */
+void rcu_end_inkernel_boot(void);
+static void rcu_boot_end_work_fn(struct work_struct *work)
+{
+	rcu_end_inkernel_boot();
+}
+static DECLARE_DELAYED_WORK(rcu_boot_end_work, rcu_boot_end_work_fn);
+
 void rcu_end_inkernel_boot(void)
 {
+	mutex_lock(&rcu_boot_end_lock);
+	rcu_boot_end_called = true;
+
+	if (rcu_boot_ended)
+		return;
+
+	if (rcu_boot_end_delay) {
+		u64 boot_ms = div_u64(ktime_get_boot_fast_ns(), 1000000UL);
+
+		if (boot_ms < rcu_boot_end_delay) {
+			schedule_delayed_work(&rcu_boot_end_work,
+					msecs_to_jiffies(rcu_boot_end_delay - boot_ms));
+			mutex_unlock(&rcu_boot_end_lock);
+			return;
+		}
+	}
+
+	cancel_delayed_work(&rcu_boot_end_work);
 	rcu_unexpedite_gp();
 	rcu_async_relax();
 	if (rcu_normal_after_boot)
 		WRITE_ONCE(rcu_normal, 1);
 	rcu_boot_ended = true;
+	mutex_unlock(&rcu_boot_end_lock);
 }
 
 /*