[RFC,v2] rcu: Add a minimum time for marking boot as completed

Message ID 20230225033421.3323164-1-joel@joelfernandes.org
State New
Headers
Series [RFC,v2] rcu: Add a minimum time for marking boot as completed |

Commit Message

Joel Fernandes Feb. 25, 2023, 3:34 a.m. UTC
  On many systems, a great deal of boot happens after the kernel thinks the boot
has completed. It is difficult to determine if the system has really booted
from the kernel side. Some features like lazy-RCU can risk slowing down boot
time if, say, a callback has been added that the boot synchronously depends on.

Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
stay expedited for as long as the system is still booting.

For these reasons, this commit adds a config option
'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter rcupdate.boot_end_delay.

By default, this value is 20s. A system designer can choose to specify a value
here to keep RCU from marking boot completion.  The boot sequence will not be
marked ended until at least boot_end_delay milliseconds have passed.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v1->v2:
	Update some comments and description.

 .../admin-guide/kernel-parameters.txt         |  4 +++
 cc_list                                       |  7 ++++++
 kernel/rcu/Kconfig                            | 12 +++++++++
 kernel/rcu/update.c                           | 25 +++++++++++++++++--
 4 files changed, 46 insertions(+), 2 deletions(-)
 create mode 100644 cc_list
  

Comments

Randy Dunlap Feb. 25, 2023, 3:36 a.m. UTC | #1
On 2/24/23 19:34, Joel Fernandes (Google) wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 2429b5e3184b..0943139fdf01 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5085,6 +5085,10 @@
>  	rcutorture.verbose= [KNL]
>  			Enable additional printk() statements.
>  
> +	rcupdate.boot_end_delay= [KNL]
> +			Minimum time that must elapse before the boot
> +			sequence can be marked as completed.

+			in milliseconds.

> +
>  	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
>  			Dump ftrace buffer after reporting RCU CPU
>  			stall warning.
  
Frederic Weisbecker Feb. 26, 2023, 7:45 p.m. UTC | #2
On Sat, Feb 25, 2023 at 03:34:21AM +0000, Joel Fernandes (Google) wrote:
> On many systems, a great deal of boot happens after the kernel thinks the boot
> has completed. It is difficult to determine if the system has really booted
> from the kernel side.

Is this only about kernel booting? What makes it hard to determine if the
system has really booted?

Thanks.
  
Joel Fernandes Feb. 26, 2023, 11:07 p.m. UTC | #3
On Sun, Feb 26, 2023 at 2:45 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Sat, Feb 25, 2023 at 03:34:21AM +0000, Joel Fernandes (Google) wrote:
> > On many systems, a great deal of boot happens after the kernel thinks the boot
> > has completed. It is difficult to determine if the system has really booted
> > from the kernel side.
>
> Is this only about kernel booting? What makes it hard to determine if the
> system has really booted?

Yes, I should probably clarify in the change log. It is more than the
kernel booting, it is all OS userspace code that runs as well during
which the CPU load is high (init spawning new programs such as shells,
daemons setup etc). The kernel does not know when that will complete
per-se. A timer based approach is not ideal, but it does not hurt
either IMHO. In other words, it is a "something better than nothing"
patch which reduces risk of slowness.


Thanks,

 - Joel
  
Qiuxu Zhuo Feb. 27, 2023, 7:53 a.m. UTC | #4
> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> Sent: Saturday, February 25, 2023 11:34 AM
> To: linux-kernel@vger.kernel.org
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> rcu@vger.kernel.org
> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> completed
> 
> On many systems, a great deal of boot happens after the kernel thinks the
> boot has completed. It is difficult to determine if the system has really
> booted from the kernel side. Some features like lazy-RCU can risk slowing
> down boot time if, say, a callback has been added that the boot
> synchronously depends on.
> 
> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> stay expedited for as long as the system is still booting.
> 
> For these reasons, this commit adds a config option
> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> rcupdate.boot_end_delay.
> 
> By default, this value is 20s. A system designer can choose to specify a value
> here to keep RCU from marking boot completion.  The boot sequence will not
> be marked ended until at least boot_end_delay milliseconds have passed.

Hi Joel,

Just some thoughts on the default value of 20s, correct me if I'm wrong :-).

Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the 
real-time latency than the overall OS boot time?

If so, we might make rcupdate.boot_end_delay = 0 as the default value 
(NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels? 

-Qiuxu

> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> v1->v2:
> 	Update some comments and description.
>  ...
  
Joel Fernandes Feb. 27, 2023, 1:22 p.m. UTC | #5
> On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> 
> 
>> 
>> From: Joel Fernandes (Google) <joel@joelfernandes.org>
>> Sent: Saturday, February 25, 2023 11:34 AM
>> To: linux-kernel@vger.kernel.org
>> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
>> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
>> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
>> rcu@vger.kernel.org
>> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
>> completed
>> 
>> On many systems, a great deal of boot happens after the kernel thinks the
>> boot has completed. It is difficult to determine if the system has really
>> booted from the kernel side. Some features like lazy-RCU can risk slowing
>> down boot time if, say, a callback has been added that the boot
>> synchronously depends on.
>> 
>> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
>> stay expedited for as long as the system is still booting.
>> 
>> For these reasons, this commit adds a config option
>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
>> rcupdate.boot_end_delay.
>> 
>> By default, this value is 20s. A system designer can choose to specify a value
>> here to keep RCU from marking boot completion.  The boot sequence will not
>> be marked ended until at least boot_end_delay milliseconds have passed.
> 
> Hi Joel,
> 
> Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> 
> Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the 
> real-time latency than the overall OS boot time?

But every system has to boot, even an RT system.

> 
> If so, we might make rcupdate.boot_end_delay = 0 as the default value 
> (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels? 

Could you measure how much time your RT system takes to boot before the application runs?

I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.

Thanks,

 - Joel


> 
> -Qiuxu
> 
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>> v1->v2:
>>    Update some comments and description.
>> ...
>
  
Paul E. McKenney Feb. 27, 2023, 2:55 p.m. UTC | #6
On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
> 
> 
> > On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> > 
> > 
> >> 
> >> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> Sent: Saturday, February 25, 2023 11:34 AM
> >> To: linux-kernel@vger.kernel.org
> >> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> >> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> >> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> >> rcu@vger.kernel.org
> >> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> >> completed
> >> 
> >> On many systems, a great deal of boot happens after the kernel thinks the
> >> boot has completed. It is difficult to determine if the system has really
> >> booted from the kernel side. Some features like lazy-RCU can risk slowing
> >> down boot time if, say, a callback has been added that the boot
> >> synchronously depends on.
> >> 
> >> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> >> stay expedited for as long as the system is still booting.
> >> 
> >> For these reasons, this commit adds a config option
> >> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> >> rcupdate.boot_end_delay.
> >> 
> >> By default, this value is 20s. A system designer can choose to specify a value
> >> here to keep RCU from marking boot completion.  The boot sequence will not
> >> be marked ended until at least boot_end_delay milliseconds have passed.
> > 
> > Hi Joel,
> > 
> > Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> > 
> > Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the 
> > real-time latency than the overall OS boot time?
> 
> But every system has to boot, even an RT system.
> 
> > 
> > If so, we might make rcupdate.boot_end_delay = 0 as the default value 
> > (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels? 
> 
> Could you measure how much time your RT system takes to boot before the application runs?
> 
> I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.

Provide a /sys location that the userspace code writes to when it
is ready?  Different systems with different hardware and software
configurations are going to take different amounts of time to boot,
correct?

								Thanx, Paul

> Thanks,
> 
>  - Joel
> 
> 
> > 
> > -Qiuxu
> > 
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >> v1->v2:
> >>    Update some comments and description.
> >> ...
> >
  
Joel Fernandes Feb. 27, 2023, 3:16 p.m. UTC | #7
On Mon, Feb 27, 2023 at 9:55 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
> >
> >
> > > On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> > >
> > > 
> > >>
> > >> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> > >> Sent: Saturday, February 25, 2023 11:34 AM
> > >> To: linux-kernel@vger.kernel.org
> > >> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> > >> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> > >> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> > >> rcu@vger.kernel.org
> > >> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> > >> completed
> > >>
> > >> On many systems, a great deal of boot happens after the kernel thinks the
> > >> boot has completed. It is difficult to determine if the system has really
> > >> booted from the kernel side. Some features like lazy-RCU can risk slowing
> > >> down boot time if, say, a callback has been added that the boot
> > >> synchronously depends on.
> > >>
> > >> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> > >> stay expedited for as long as the system is still booting.
> > >>
> > >> For these reasons, this commit adds a config option
> > >> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> > >> rcupdate.boot_end_delay.
> > >>
> > >> By default, this value is 20s. A system designer can choose to specify a value
> > >> here to keep RCU from marking boot completion.  The boot sequence will not
> > >> be marked ended until at least boot_end_delay milliseconds have passed.
> > >
> > > Hi Joel,
> > >
> > > Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> > >
> > > Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
> > > real-time latency than the overall OS boot time?
> >
> > But every system has to boot, even an RT system.
> >
> > >
> > > If so, we might make rcupdate.boot_end_delay = 0 as the default value
> > > (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> >
> > Could you measure how much time your RT system takes to boot before the application runs?
> >
> > I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
>
> Provide a /sys location that the userspace code writes to when it
> is ready?  Different systems with different hardware and software
> configurations are going to take different amounts of time to boot,
> correct?

I could add a sysfs node, but I still wanted this patch as well
because I am wary of systems where yet more userspace changes are
required. I feel the kernel should itself be able to do this. Yes, it
is possible the system completes "booting" at a different time than
what the kernel thinks. But it does that anyway (even without this
patch), so I am not seeing a good reason to not do this in the kernel.
It is also only a minimum cap, so if the in-kernel boot takes too
long, then the patch will have no effect.

Thoughts?

thanks,

 - Joel
  
Uladzislau Rezki Feb. 27, 2023, 6:06 p.m. UTC | #8
On Mon, Feb 27, 2023 at 10:16:51AM -0500, Joel Fernandes wrote:
> On Mon, Feb 27, 2023 at 9:55 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
> > >
> > >
> > > > On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> > > >
> > > > 
> > > >>
> > > >> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > >> Sent: Saturday, February 25, 2023 11:34 AM
> > > >> To: linux-kernel@vger.kernel.org
> > > >> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> > > >> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> > > >> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> > > >> rcu@vger.kernel.org
> > > >> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> > > >> completed
> > > >>
> > > >> On many systems, a great deal of boot happens after the kernel thinks the
> > > >> boot has completed. It is difficult to determine if the system has really
> > > >> booted from the kernel side. Some features like lazy-RCU can risk slowing
> > > >> down boot time if, say, a callback has been added that the boot
> > > >> synchronously depends on.
> > > >>
> > > >> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> > > >> stay expedited for as long as the system is still booting.
> > > >>
> > > >> For these reasons, this commit adds a config option
> > > >> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> > > >> rcupdate.boot_end_delay.
> > > >>
> > > >> By default, this value is 20s. A system designer can choose to specify a value
> > > >> here to keep RCU from marking boot completion.  The boot sequence will not
> > > >> be marked ended until at least boot_end_delay milliseconds have passed.
> > > >
> > > > Hi Joel,
> > > >
> > > > Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> > > >
> > > > Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
> > > > real-time latency than the overall OS boot time?
> > >
> > > But every system has to boot, even an RT system.
> > >
> > > >
> > > > If so, we might make rcupdate.boot_end_delay = 0 as the default value
> > > > (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> > >
> > > Could you measure how much time your RT system takes to boot before the application runs?
> > >
> > > I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
> >
> > Provide a /sys location that the userspace code writes to when it
> > is ready?  Different systems with different hardware and software
> > configurations are going to take different amounts of time to boot,
> > correct?
> 
> I could add a sysfs node, but I still wanted this patch as well
> because I am wary of systems where yet more userspace changes are
> required. I feel the kernel should itself be able to do this. Yes, it
> is possible the system completes "booting" at a different time than
> what the kernel thinks. But it does that anyway (even without this
> patch), so I am not seeing a good reason to not do this in the kernel.
> It is also only a minimum cap, so if the in-kernel boot takes too
> long, then the patch will have no effect.
> 
> Thoughts?
> 
Why "rcu_boot_ended" is not enough? As i see right after that an "init"
process or shell or panic is going to be invoked by the kernel. It basically
indicates that a kernel is fully functional.

Or an idea to wait even further? Until all kernel modules are loaded by
user space.

Thanks!

--
Uladzislau Rezki
  
Joel Fernandes Feb. 27, 2023, 6:15 p.m. UTC | #9
> On Feb 27, 2023, at 1:06 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> On Mon, Feb 27, 2023 at 10:16:51AM -0500, Joel Fernandes wrote:
>>> On Mon, Feb 27, 2023 at 9:55 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>>> 
>>> On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
>>>> 
>>>> 
>>>>> On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> From: Joel Fernandes (Google) <joel@joelfernandes.org>
>>>>>> Sent: Saturday, February 25, 2023 11:34 AM
>>>>>> To: linux-kernel@vger.kernel.org
>>>>>> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
>>>>>> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
>>>>>> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
>>>>>> rcu@vger.kernel.org
>>>>>> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
>>>>>> completed
>>>>>> 
>>>>>> On many systems, a great deal of boot happens after the kernel thinks the
>>>>>> boot has completed. It is difficult to determine if the system has really
>>>>>> booted from the kernel side. Some features like lazy-RCU can risk slowing
>>>>>> down boot time if, say, a callback has been added that the boot
>>>>>> synchronously depends on.
>>>>>> 
>>>>>> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
>>>>>> stay expedited for as long as the system is still booting.
>>>>>> 
>>>>>> For these reasons, this commit adds a config option
>>>>>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
>>>>>> rcupdate.boot_end_delay.
>>>>>> 
>>>>>> By default, this value is 20s. A system designer can choose to specify a value
>>>>>> here to keep RCU from marking boot completion.  The boot sequence will not
>>>>>> be marked ended until at least boot_end_delay milliseconds have passed.
>>>>> 
>>>>> Hi Joel,
>>>>> 
>>>>> Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
>>>>> 
>>>>> Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
>>>>> real-time latency than the overall OS boot time?
>>>> 
>>>> But every system has to boot, even an RT system.
>>>> 
>>>>> 
>>>>> If so, we might make rcupdate.boot_end_delay = 0 as the default value
>>>>> (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
>>>> 
>>>> Could you measure how much time your RT system takes to boot before the application runs?
>>>> 
>>>> I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
>>> 
>>> Provide a /sys location that the userspace code writes to when it
>>> is ready?  Different systems with different hardware and software
>>> configurations are going to take different amounts of time to boot,
>>> correct?
>> 
>> I could add a sysfs node, but I still wanted this patch as well
>> because I am wary of systems where yet more userspace changes are
>> required. I feel the kernel should itself be able to do this. Yes, it
>> is possible the system completes "booting" at a different time than
>> what the kernel thinks. But it does that anyway (even without this
>> patch), so I am not seeing a good reason to not do this in the kernel.
>> It is also only a minimum cap, so if the in-kernel boot takes too
>> long, then the patch will have no effect.
>> 
>> Thoughts?
>> 
> Why "rcu_boot_ended" is not enough? As i see right after that an "init"
> process or shell or panic is going to be invoked by the kernel. It basically
> indicates that a kernel is fully functional.
> 
> Or an idea to wait even further? Until all kernel modules are loaded by
> user space.

I mentioned in commit message it is daemons, userspace initialization etc. There is a lot of userspace booting up as well and using the kernel while doing so.

So, It does not make sense to me to mark kernel as booted too early. And no harm in adding some builtin kernel hysteresis. What am I missing?

Thanks,

- Joel
(And Steven please dont kick my butt for the lack of wrapping, I am working on fixing it :P).


> 
> Thanks!
> 
> --
> Uladzislau Rezki
  
Uladzislau Rezki Feb. 27, 2023, 6:20 p.m. UTC | #10
On Mon, Feb 27, 2023 at 01:15:47PM -0500, Joel Fernandes wrote:
> 
> 
> > On Feb 27, 2023, at 1:06 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > 
> > On Mon, Feb 27, 2023 at 10:16:51AM -0500, Joel Fernandes wrote:
> >>> On Mon, Feb 27, 2023 at 9:55 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>> 
> >>> On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
> >>>> 
> >>>> 
> >>>>> On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> >>>>> 
> >>>>> 
> >>>>>> 
> >>>>>> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> >>>>>> Sent: Saturday, February 25, 2023 11:34 AM
> >>>>>> To: linux-kernel@vger.kernel.org
> >>>>>> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> >>>>>> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> >>>>>> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> >>>>>> rcu@vger.kernel.org
> >>>>>> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> >>>>>> completed
> >>>>>> 
> >>>>>> On many systems, a great deal of boot happens after the kernel thinks the
> >>>>>> boot has completed. It is difficult to determine if the system has really
> >>>>>> booted from the kernel side. Some features like lazy-RCU can risk slowing
> >>>>>> down boot time if, say, a callback has been added that the boot
> >>>>>> synchronously depends on.
> >>>>>> 
> >>>>>> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> >>>>>> stay expedited for as long as the system is still booting.
> >>>>>> 
> >>>>>> For these reasons, this commit adds a config option
> >>>>>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> >>>>>> rcupdate.boot_end_delay.
> >>>>>> 
> >>>>>> By default, this value is 20s. A system designer can choose to specify a value
> >>>>>> here to keep RCU from marking boot completion.  The boot sequence will not
> >>>>>> be marked ended until at least boot_end_delay milliseconds have passed.
> >>>>> 
> >>>>> Hi Joel,
> >>>>> 
> >>>>> Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> >>>>> 
> >>>>> Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
> >>>>> real-time latency than the overall OS boot time?
> >>>> 
> >>>> But every system has to boot, even an RT system.
> >>>> 
> >>>>> 
> >>>>> If so, we might make rcupdate.boot_end_delay = 0 as the default value
> >>>>> (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> >>>> 
> >>>> Could you measure how much time your RT system takes to boot before the application runs?
> >>>> 
> >>>> I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
> >>> 
> >>> Provide a /sys location that the userspace code writes to when it
> >>> is ready?  Different systems with different hardware and software
> >>> configurations are going to take different amounts of time to boot,
> >>> correct?
> >> 
> >> I could add a sysfs node, but I still wanted this patch as well
> >> because I am wary of systems where yet more userspace changes are
> >> required. I feel the kernel should itself be able to do this. Yes, it
> >> is possible the system completes "booting" at a different time than
> >> what the kernel thinks. But it does that anyway (even without this
> >> patch), so I am not seeing a good reason to not do this in the kernel.
> >> It is also only a minimum cap, so if the in-kernel boot takes too
> >> long, then the patch will have no effect.
> >> 
> >> Thoughts?
> >> 
> > Why "rcu_boot_ended" is not enough? As i see right after that an "init"
> > process or shell or panic is going to be invoked by the kernel. It basically
> > indicates that a kernel is fully functional.
> > 
> > Or an idea to wait even further? Until all kernel modules are loaded by
> > user space.
> 
> I mentioned in commit message it is daemons, userspace initialization etc. There is a lot of userspace booting up as well and using the kernel while doing so.
> 
> So, It does not make sense to me to mark kernel as booted too early. And no harm in adding some builtin kernel hysteresis. What am I missing?
> 
Than it is up to user space to decide when it is ready in terms of "boot completed".

--
Uladzislau Rezki
  
Joel Fernandes Feb. 27, 2023, 6:27 p.m. UTC | #11
> On Feb 27, 2023, at 1:20 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> On Mon, Feb 27, 2023 at 01:15:47PM -0500, Joel Fernandes wrote:
>> 
>> 
>>>> On Feb 27, 2023, at 1:06 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
>>> 
>>> On Mon, Feb 27, 2023 at 10:16:51AM -0500, Joel Fernandes wrote:
>>>>> On Mon, Feb 27, 2023 at 9:55 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>>>>> 
>>>>> On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> From: Joel Fernandes (Google) <joel@joelfernandes.org>
>>>>>>>> Sent: Saturday, February 25, 2023 11:34 AM
>>>>>>>> To: linux-kernel@vger.kernel.org
>>>>>>>> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
>>>>>>>> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
>>>>>>>> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
>>>>>>>> rcu@vger.kernel.org
>>>>>>>> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
>>>>>>>> completed
>>>>>>>> 
>>>>>>>> On many systems, a great deal of boot happens after the kernel thinks the
>>>>>>>> boot has completed. It is difficult to determine if the system has really
>>>>>>>> booted from the kernel side. Some features like lazy-RCU can risk slowing
>>>>>>>> down boot time if, say, a callback has been added that the boot
>>>>>>>> synchronously depends on.
>>>>>>>> 
>>>>>>>> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
>>>>>>>> stay expedited for as long as the system is still booting.
>>>>>>>> 
>>>>>>>> For these reasons, this commit adds a config option
>>>>>>>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
>>>>>>>> rcupdate.boot_end_delay.
>>>>>>>> 
>>>>>>>> By default, this value is 20s. A system designer can choose to specify a value
>>>>>>>> here to keep RCU from marking boot completion.  The boot sequence will not
>>>>>>>> be marked ended until at least boot_end_delay milliseconds have passed.
>>>>>>> 
>>>>>>> Hi Joel,
>>>>>>> 
>>>>>>> Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
>>>>>>> 
>>>>>>> Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
>>>>>>> real-time latency than the overall OS boot time?
>>>>>> 
>>>>>> But every system has to boot, even an RT system.
>>>>>> 
>>>>>>> 
>>>>>>> If so, we might make rcupdate.boot_end_delay = 0 as the default value
>>>>>>> (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
>>>>>> 
>>>>>> Could you measure how much time your RT system takes to boot before the application runs?
>>>>>> 
>>>>>> I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
>>>>> 
>>>>> Provide a /sys location that the userspace code writes to when it
>>>>> is ready?  Different systems with different hardware and software
>>>>> configurations are going to take different amounts of time to boot,
>>>>> correct?
>>>> 
>>>> I could add a sysfs node, but I still wanted this patch as well
>>>> because I am wary of systems where yet more userspace changes are
>>>> required. I feel the kernel should itself be able to do this. Yes, it
>>>> is possible the system completes "booting" at a different time than
>>>> what the kernel thinks. But it does that anyway (even without this
>>>> patch), so I am not seeing a good reason to not do this in the kernel.
>>>> It is also only a minimum cap, so if the in-kernel boot takes too
>>>> long, then the patch will have no effect.
>>>> 
>>>> Thoughts?
>>>> 
>>> Why "rcu_boot_ended" is not enough? As i see right after that an "init"
>>> process or shell or panic is going to be invoked by the kernel. It basically
>>> indicates that a kernel is fully functional.
>>> 
>>> Or an idea to wait even further? Until all kernel modules are loaded by
>>> user space.
>> 
>> I mentioned in commit message it is daemons, userspace initialization etc. There is a lot of userspace booting up as well and using the kernel while doing so.
>> 
>> So, It does not make sense to me to mark kernel as booted too early. And no harm in adding some builtin kernel hysteresis. What am I missing?
>> 
> Than it is up to user space to decide when it is ready in terms of "boot completed".

I dont know if you caught up with the other threads. See replies from Paul and my reply to that.

Also what you are proposing can be more harmful. If user space has a bug and does not notify the kernel that boot completed, then the boot can stay incomplete forever. The idea with this patch is to make things better, not worse.

Thanks,

- Joel


> 
> --
> Uladzislau Rezki
  
Uladzislau Rezki Feb. 27, 2023, 6:57 p.m. UTC | #12
On Mon, Feb 27, 2023 at 01:27:20PM -0500, Joel Fernandes wrote:
> 
> 
> > On Feb 27, 2023, at 1:20 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > 
> > On Mon, Feb 27, 2023 at 01:15:47PM -0500, Joel Fernandes wrote:
> >> 
> >> 
> >>>> On Feb 27, 2023, at 1:06 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> >>> 
> >>> On Mon, Feb 27, 2023 at 10:16:51AM -0500, Joel Fernandes wrote:
> >>>>> On Mon, Feb 27, 2023 at 9:55 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>>>> 
> >>>>> On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
> >>>>>> 
> >>>>>> 
> >>>>>>> On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> 
> >>>>>>>> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> >>>>>>>> Sent: Saturday, February 25, 2023 11:34 AM
> >>>>>>>> To: linux-kernel@vger.kernel.org
> >>>>>>>> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> >>>>>>>> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> >>>>>>>> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> >>>>>>>> rcu@vger.kernel.org
> >>>>>>>> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> >>>>>>>> completed
> >>>>>>>> 
> >>>>>>>> On many systems, a great deal of boot happens after the kernel thinks the
> >>>>>>>> boot has completed. It is difficult to determine if the system has really
> >>>>>>>> booted from the kernel side. Some features like lazy-RCU can risk slowing
> >>>>>>>> down boot time if, say, a callback has been added that the boot
> >>>>>>>> synchronously depends on.
> >>>>>>>> 
> >>>>>>>> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> >>>>>>>> stay expedited for as long as the system is still booting.
> >>>>>>>> 
> >>>>>>>> For these reasons, this commit adds a config option
> >>>>>>>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> >>>>>>>> rcupdate.boot_end_delay.
> >>>>>>>> 
> >>>>>>>> By default, this value is 20s. A system designer can choose to specify a value
> >>>>>>>> here to keep RCU from marking boot completion.  The boot sequence will not
> >>>>>>>> be marked ended until at least boot_end_delay milliseconds have passed.
> >>>>>>> 
> >>>>>>> Hi Joel,
> >>>>>>> 
> >>>>>>> Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> >>>>>>> 
> >>>>>>> Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
> >>>>>>> real-time latency than the overall OS boot time?
> >>>>>> 
> >>>>>> But every system has to boot, even an RT system.
> >>>>>> 
> >>>>>>> 
> >>>>>>> If so, we might make rcupdate.boot_end_delay = 0 as the default value
> >>>>>>> (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> >>>>>> 
> >>>>>> Could you measure how much time your RT system takes to boot before the application runs?
> >>>>>> 
> >>>>>> I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
> >>>>> 
> >>>>> Provide a /sys location that the userspace code writes to when it
> >>>>> is ready?  Different systems with different hardware and software
> >>>>> configurations are going to take different amounts of time to boot,
> >>>>> correct?
> >>>> 
> >>>> I could add a sysfs node, but I still wanted this patch as well
> >>>> because I am wary of systems where yet more userspace changes are
> >>>> required. I feel the kernel should itself be able to do this. Yes, it
> >>>> is possible the system completes "booting" at a different time than
> >>>> what the kernel thinks. But it does that anyway (even without this
> >>>> patch), so I am not seeing a good reason to not do this in the kernel.
> >>>> It is also only a minimum cap, so if the in-kernel boot takes too
> >>>> long, then the patch will have no effect.
> >>>> 
> >>>> Thoughts?
> >>>> 
> >>> Why "rcu_boot_ended" is not enough? As i see right after that an "init"
> >>> process or shell or panic is going to be invoked by the kernel. It basically
> >>> indicates that a kernel is fully functional.
> >>> 
> >>> Or an idea to wait even further? Until all kernel modules are loaded by
> >>> user space.
> >> 
> >> I mentioned in commit message it is daemons, userspace initialization etc. There is a lot of userspace booting up as well and using the kernel while doing so.
> >> 
> >> So, It does not make sense to me to mark kernel as booted too early. And no harm in adding some builtin kernel hysteresis. What am I missing?
> >> 
> > Than it is up to user space to decide when it is ready in terms of "boot completed".
> 
> I dont know if you caught up with the other threads. See replies from Paul and my reply to that.
> 
> Also what you are proposing can be more harmful. If user space has a bug and does not notify the kernel that boot completed, then the boot can stay incomplete forever. The idea with this patch is to make things better, not worse.
> 
I saw that Paul proposed to have a sysfs attribute using which you can
send a notification.

IMHO, to me this patch does not provide a clear correlation between what
is a boot complete and when it occurs. A boot complete is a synchronous
event whereas the patch thinks that after some interval a "boot" is completed.

We can imply that after, say 100 seconds an initialization of user space
is done. Maybe 100 seconds then? :)

--
Uladzislau Rezki
  
Joel Fernandes Feb. 27, 2023, 7:10 p.m. UTC | #13
On Mon, Feb 27, 2023 at 1:57 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Mon, Feb 27, 2023 at 01:27:20PM -0500, Joel Fernandes wrote:
> >
> >
> > > On Feb 27, 2023, at 1:20 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Mon, Feb 27, 2023 at 01:15:47PM -0500, Joel Fernandes wrote:
> > >>
> > >>
> > >>>> On Feb 27, 2023, at 1:06 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > >>>
> > >>> On Mon, Feb 27, 2023 at 10:16:51AM -0500, Joel Fernandes wrote:
> > >>>>> On Mon, Feb 27, 2023 at 9:55 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >>>>>
> > >>>>> On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> > >>>>>>>
> > >>>>>>> 
> > >>>>>>>>
> > >>>>>>>> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> > >>>>>>>> Sent: Saturday, February 25, 2023 11:34 AM
> > >>>>>>>> To: linux-kernel@vger.kernel.org
> > >>>>>>>> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> > >>>>>>>> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> > >>>>>>>> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> > >>>>>>>> rcu@vger.kernel.org
> > >>>>>>>> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> > >>>>>>>> completed
> > >>>>>>>>
> > >>>>>>>> On many systems, a great deal of boot happens after the kernel thinks the
> > >>>>>>>> boot has completed. It is difficult to determine if the system has really
> > >>>>>>>> booted from the kernel side. Some features like lazy-RCU can risk slowing
> > >>>>>>>> down boot time if, say, a callback has been added that the boot
> > >>>>>>>> synchronously depends on.
> > >>>>>>>>
> > >>>>>>>> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> > >>>>>>>> stay expedited for as long as the system is still booting.
> > >>>>>>>>
> > >>>>>>>> For these reasons, this commit adds a config option
> > >>>>>>>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> > >>>>>>>> rcupdate.boot_end_delay.
> > >>>>>>>>
> > >>>>>>>> By default, this value is 20s. A system designer can choose to specify a value
> > >>>>>>>> here to keep RCU from marking boot completion.  The boot sequence will not
> > >>>>>>>> be marked ended until at least boot_end_delay milliseconds have passed.
> > >>>>>>>
> > >>>>>>> Hi Joel,
> > >>>>>>>
> > >>>>>>> Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> > >>>>>>>
> > >>>>>>> Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
> > >>>>>>> real-time latency than the overall OS boot time?
> > >>>>>>
> > >>>>>> But every system has to boot, even an RT system.
> > >>>>>>
> > >>>>>>>
> > >>>>>>> If so, we might make rcupdate.boot_end_delay = 0 as the default value
> > >>>>>>> (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> > >>>>>>
> > >>>>>> Could you measure how much time your RT system takes to boot before the application runs?
> > >>>>>>
> > >>>>>> I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
> > >>>>>
> > >>>>> Provide a /sys location that the userspace code writes to when it
> > >>>>> is ready?  Different systems with different hardware and software
> > >>>>> configurations are going to take different amounts of time to boot,
> > >>>>> correct?
> > >>>>
> > >>>> I could add a sysfs node, but I still wanted this patch as well
> > >>>> because I am wary of systems where yet more userspace changes are
> > >>>> required. I feel the kernel should itself be able to do this. Yes, it
> > >>>> is possible the system completes "booting" at a different time than
> > >>>> what the kernel thinks. But it does that anyway (even without this
> > >>>> patch), so I am not seeing a good reason to not do this in the kernel.
> > >>>> It is also only a minimum cap, so if the in-kernel boot takes too
> > >>>> long, then the patch will have no effect.
> > >>>>
> > >>>> Thoughts?
> > >>>>
> > >>> Why "rcu_boot_ended" is not enough? As i see right after that an "init"
> > >>> process or shell or panic is going to be invoked by the kernel. It basically
> > >>> indicates that a kernel is fully functional.
> > >>>
> > >>> Or an idea to wait even further? Until all kernel modules are loaded by
> > >>> user space.
> > >>
> > >> I mentioned in commit message it is daemons, userspace initialization etc. There is a lot of userspace booting up as well and using the kernel while doing so.
> > >>
> > >> So, It does not make sense to me to mark kernel as booted too early. And no harm in adding some builtin kernel hysteresis. What am I missing?
> > >>
> > > Than it is up to user space to decide when it is ready in terms of "boot completed".
> >
> > I dont know if you caught up with the other threads. See replies from Paul and my reply to that.
> >
> > Also what you are proposing can be more harmful. If user space has a bug and does not notify the kernel that boot completed, then the boot can stay incomplete forever. The idea with this patch is to make things better, not worse.
> >
> I saw that Paul proposed to have a sysfs attribute using which you can
> send a notification.

Maybe I am missing something but how will a sysfs node on its own work really?

1. delete kernel marking itself boot completed  -- and then sysfs
marks it completed?

2. delete kernel marking itself boot completed  -- and then sysfs
marks it completed, if sysfs does not come in in N seconds, then
kernel marks as completed?

#1 is a no go, that just means a bug waiting to happen if userspace
forgets to write to sysfs.

#2 is just an extension of this patch. So I can add a sysfs node on
top of this. And we can make the minimum time as a long period of
time, as you noted below:

> IMHO, to me this patch does not provide a clear correlation between what
> is a boot complete and when it occurs. A boot complete is a synchronous
> event whereas the patch thinks that after some interval a "boot" is completed.

But that is exactly how the kernel code is now without this patch, so
it is already broken in that sense, I am not really breaking it more
;-)

> We can imply that after, say 100 seconds an initialization of user space
> is done. Maybe 100 seconds then? :)

Yes I am Ok with that. So are you suggesting we change the default to
100 seconds and then add a sysfs node to mark as boot done whenever
userspace notifies?

Thanks,

 - Joel
  
Paul E. McKenney Feb. 27, 2023, 11:05 p.m. UTC | #14
On Mon, Feb 27, 2023 at 02:10:30PM -0500, Joel Fernandes wrote:
> On Mon, Feb 27, 2023 at 1:57 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Mon, Feb 27, 2023 at 01:27:20PM -0500, Joel Fernandes wrote:
> > >
> > >
> > > > On Feb 27, 2023, at 1:20 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Mon, Feb 27, 2023 at 01:15:47PM -0500, Joel Fernandes wrote:
> > > >>
> > > >>
> > > >>>> On Feb 27, 2023, at 1:06 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >>>
> > > >>> On Mon, Feb 27, 2023 at 10:16:51AM -0500, Joel Fernandes wrote:
> > > >>>>> On Mon, Feb 27, 2023 at 9:55 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >>>>>
> > > >>>>> On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> > > >>>>>>>
> > > >>>>>>> 
> > > >>>>>>>>
> > > >>>>>>>> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > >>>>>>>> Sent: Saturday, February 25, 2023 11:34 AM
> > > >>>>>>>> To: linux-kernel@vger.kernel.org
> > > >>>>>>>> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> > > >>>>>>>> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> > > >>>>>>>> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> > > >>>>>>>> rcu@vger.kernel.org
> > > >>>>>>>> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> > > >>>>>>>> completed
> > > >>>>>>>>
> > > >>>>>>>> On many systems, a great deal of boot happens after the kernel thinks the
> > > >>>>>>>> boot has completed. It is difficult to determine if the system has really
> > > >>>>>>>> booted from the kernel side. Some features like lazy-RCU can risk slowing
> > > >>>>>>>> down boot time if, say, a callback has been added that the boot
> > > >>>>>>>> synchronously depends on.
> > > >>>>>>>>
> > > >>>>>>>> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> > > >>>>>>>> stay expedited for as long as the system is still booting.
> > > >>>>>>>>
> > > >>>>>>>> For these reasons, this commit adds a config option
> > > >>>>>>>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> > > >>>>>>>> rcupdate.boot_end_delay.
> > > >>>>>>>>
> > > >>>>>>>> By default, this value is 20s. A system designer can choose to specify a value
> > > >>>>>>>> here to keep RCU from marking boot completion.  The boot sequence will not
> > > >>>>>>>> be marked ended until at least boot_end_delay milliseconds have passed.
> > > >>>>>>>
> > > >>>>>>> Hi Joel,
> > > >>>>>>>
> > > >>>>>>> Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> > > >>>>>>>
> > > >>>>>>> Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
> > > >>>>>>> real-time latency than the overall OS boot time?
> > > >>>>>>
> > > >>>>>> But every system has to boot, even an RT system.
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> If so, we might make rcupdate.boot_end_delay = 0 as the default value
> > > >>>>>>> (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> > > >>>>>>
> > > >>>>>> Could you measure how much time your RT system takes to boot before the application runs?
> > > >>>>>>
> > > >>>>>> I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
> > > >>>>>
> > > >>>>> Provide a /sys location that the userspace code writes to when it
> > > >>>>> is ready?  Different systems with different hardware and software
> > > >>>>> configurations are going to take different amounts of time to boot,
> > > >>>>> correct?
> > > >>>>
> > > >>>> I could add a sysfs node, but I still wanted this patch as well
> > > >>>> because I am wary of systems where yet more userspace changes are
> > > >>>> required. I feel the kernel should itself be able to do this. Yes, it
> > > >>>> is possible the system completes "booting" at a different time than
> > > >>>> what the kernel thinks. But it does that anyway (even without this
> > > >>>> patch), so I am not seeing a good reason to not do this in the kernel.
> > > >>>> It is also only a minimum cap, so if the in-kernel boot takes too
> > > >>>> long, then the patch will have no effect.
> > > >>>>
> > > >>>> Thoughts?
> > > >>>>
> > > >>> Why "rcu_boot_ended" is not enough? As i see right after that an "init"
> > > >>> process or shell or panic is going to be invoked by the kernel. It basically
> > > >>> indicates that a kernel is fully functional.
> > > >>>
> > > >>> Or an idea to wait even further? Until all kernel modules are loaded by
> > > >>> user space.
> > > >>
> > > >> I mentioned in commit message it is daemons, userspace initialization etc. There is a lot of userspace booting up as well and using the kernel while doing so.
> > > >>
> > > >> So, It does not make sense to me to mark kernel as booted too early. And no harm in adding some builtin kernel hysteresis. What am I missing?
> > > >>
> > > > Than it is up to user space to decide when it is ready in terms of "boot completed".
> > >
> > > I dont know if you caught up with the other threads. See replies from Paul and my reply to that.
> > >
> > > Also what you are proposing can be more harmful. If user space has a bug and does not notify the kernel that boot completed, then the boot can stay incomplete forever. The idea with this patch is to make things better, not worse.
> > >
> > I saw that Paul proposed to have a sysfs attribute using which you can
> > send a notification.
> 
> Maybe I am missing something but how will a sysfs node on its own work really?
> 
> 1. delete kernel marking itself boot completed  -- and then sysfs
> marks it completed?
> 
> 2. delete kernel marking itself boot completed  -- and then sysfs
> marks it completed, if sysfs does not come in in N seconds, then
> kernel marks as completed?
> 
> #1 is a no go, that just means a bug waiting to happen if userspace
> forgets to write to sysfs.
> 
> #2 is just an extension of this patch. So I can add a sysfs node on
> top of this. And we can make the minimum time as a long period of
> time, as you noted below:
> 
> > IMHO, to me this patch does not provide a clear correlation between what
> > is a boot complete and when it occurs. A boot complete is a synchronous
> > event whereas the patch thinks that after some interval a "boot" is completed.
> 
> But that is exactly how the kernel code is now without this patch, so
> it is already broken in that sense, I am not really breaking it more
> ;-)
> 
> > We can imply that after, say 100 seconds an initialization of user space
> > is done. Maybe 100 seconds then? :)
> 
> Yes I am Ok with that. So are you suggesting we change the default to
> 100 seconds and then add a sysfs node to mark as boot done whenever
> userspace notifies?

The combination of sysfs manipulated by userspace and a kernel failsafe
makes sense to me.  Especially if by default triggering the failsafe
splats.  That way, bugs where userspace fails to update the sysfs file
get caught.

The non-default silent-failsafe mode is also useful to allow some power
savings in advance of userspace getting the sysfs updating in place.
And of course the default splatting setup can be used in internal testing
with the release software being more tolerant of userspace foibles.

							Thanx, Paul
  
Joel Fernandes Feb. 27, 2023, 11:24 p.m. UTC | #15
On Mon, Feb 27, 2023 at 6:05 PM Paul E. McKenney <paulmck@kernel.org> wrote:
[...]
> > > > >>>>> On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>> On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> > > > >>>>>>>
> > > > >>>>>>> 
> > > > >>>>>>>>
> > > > >>>>>>>> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > >>>>>>>> Sent: Saturday, February 25, 2023 11:34 AM
> > > > >>>>>>>> To: linux-kernel@vger.kernel.org
> > > > >>>>>>>> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> > > > >>>>>>>> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> > > > >>>>>>>> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> > > > >>>>>>>> rcu@vger.kernel.org
> > > > >>>>>>>> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> > > > >>>>>>>> completed
> > > > >>>>>>>>
> > > > >>>>>>>> On many systems, a great deal of boot happens after the kernel thinks the
> > > > >>>>>>>> boot has completed. It is difficult to determine if the system has really
> > > > >>>>>>>> booted from the kernel side. Some features like lazy-RCU can risk slowing
> > > > >>>>>>>> down boot time if, say, a callback has been added that the boot
> > > > >>>>>>>> synchronously depends on.
> > > > >>>>>>>>
> > > > >>>>>>>> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> > > > >>>>>>>> stay expedited for as long as the system is still booting.
> > > > >>>>>>>>
> > > > >>>>>>>> For these reasons, this commit adds a config option
> > > > >>>>>>>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> > > > >>>>>>>> rcupdate.boot_end_delay.
> > > > >>>>>>>>
> > > > >>>>>>>> By default, this value is 20s. A system designer can choose to specify a value
> > > > >>>>>>>> here to keep RCU from marking boot completion.  The boot sequence will not
> > > > >>>>>>>> be marked ended until at least boot_end_delay milliseconds have passed.
> > > > >>>>>>>
> > > > >>>>>>> Hi Joel,
> > > > >>>>>>>
> > > > >>>>>>> Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> > > > >>>>>>>
> > > > >>>>>>> Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
> > > > >>>>>>> real-time latency than the overall OS boot time?
> > > > >>>>>>
> > > > >>>>>> But every system has to boot, even an RT system.
> > > > >>>>>>
> > > > >>>>>>>
> > > > >>>>>>> If so, we might make rcupdate.boot_end_delay = 0 as the default value
> > > > >>>>>>> (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> > > > >>>>>>
> > > > >>>>>> Could you measure how much time your RT system takes to boot before the application runs?
> > > > >>>>>>
> > > > >>>>>> I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
> > > > >>>>>
> > > > >>>>> Provide a /sys location that the userspace code writes to when it
> > > > >>>>> is ready?  Different systems with different hardware and software
> > > > >>>>> configurations are going to take different amounts of time to boot,
> > > > >>>>> correct?
> > > > >>>>
> > > > >>>> I could add a sysfs node, but I still wanted this patch as well
> > > > >>>> because I am wary of systems where yet more userspace changes are
> > > > >>>> required. I feel the kernel should itself be able to do this. Yes, it
> > > > >>>> is possible the system completes "booting" at a different time than
> > > > >>>> what the kernel thinks. But it does that anyway (even without this
> > > > >>>> patch), so I am not seeing a good reason to not do this in the kernel.
> > > > >>>> It is also only a minimum cap, so if the in-kernel boot takes too
> > > > >>>> long, then the patch will have no effect.
> > > > >>>>
> > > > >>>> Thoughts?
> > > > >>>>
> > > > >>> Why "rcu_boot_ended" is not enough? As i see right after that an "init"
> > > > >>> process or shell or panic is going to be invoked by the kernel. It basically
> > > > >>> indicates that a kernel is fully functional.
> > > > >>>
> > > > >>> Or an idea to wait even further? Until all kernel modules are loaded by
> > > > >>> user space.
> > > > >>
> > > > >> I mentioned in commit message it is daemons, userspace initialization etc. There is a lot of userspace booting up as well and using the kernel while doing so.
> > > > >>
> > > > >> So, It does not make sense to me to mark kernel as booted too early. And no harm in adding some builtin kernel hysteresis. What am I missing?
> > > > >>
> > > > > Than it is up to user space to decide when it is ready in terms of "boot completed".
> > > >
> > > > I dont know if you caught up with the other threads. See replies from Paul and my reply to that.
> > > >
> > > > Also what you are proposing can be more harmful. If user space has a bug and does not notify the kernel that boot completed, then the boot can stay incomplete forever. The idea with this patch is to make things better, not worse.
> > > >
> > > I saw that Paul proposed to have a sysfs attribute using which you can
> > > send a notification.
> >
> > Maybe I am missing something but how will a sysfs node on its own work really?
> >
> > 1. delete kernel marking itself boot completed  -- and then sysfs
> > marks it completed?
> >
> > 2. delete kernel marking itself boot completed  -- and then sysfs
> > marks it completed, if sysfs does not come in in N seconds, then
> > kernel marks as completed?
> >
> > #1 is a no go, that just means a bug waiting to happen if userspace
> > forgets to write to sysfs.
> >
> > #2 is just an extension of this patch. So I can add a sysfs node on
> > top of this. And we can make the minimum time as a long period of
> > time, as you noted below:
> >
> > > IMHO, to me this patch does not provide a clear correlation between what
> > > is a boot complete and when it occurs. A boot complete is a synchronous
> > > event whereas the patch thinks that after some interval a "boot" is completed.
> >
> > But that is exactly how the kernel code is now without this patch, so
> > it is already broken in that sense, I am not really breaking it more
> > ;-)
> >
> > > We can imply that after, say 100 seconds an initialization of user space
> > > is done. Maybe 100 seconds then? :)
> >
> > Yes I am Ok with that. So are you suggesting we change the default to
> > 100 seconds and then add a sysfs node to mark as boot done whenever
> > userspace notifies?
>
> The combination of sysfs manipulated by userspace and a kernel failsafe
> makes sense to me.  Especially if by default triggering the failsafe
> splats.  That way, bugs where userspace fails to update the sysfs file
> get caught.

By splat, if we could do an "info" message, that would work for me
instead of a WARN_ON. I'm afraid of Android and other folks who
upgrade to the new kernel only to now have to go patch userspace.

So,
pr_info("RCU is still in boot-mode for the next N seconds, please
consider writing X to /sys/.. to avoid this message.");
?

> The non-default silent-failsafe mode is also useful to allow some power
> savings in advance of userspace getting the sysfs updating in place.
> And of course the default splatting setup can be used in internal testing
> with the release software being more tolerant of userspace foibles.

Sounds good, would 100 seconds be a good fail-safe trigger value?

Thanks,

 - Joel
  
Frederic Weisbecker Feb. 27, 2023, 11:40 p.m. UTC | #16
On Mon, Feb 27, 2023 at 03:05:02PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 27, 2023 at 02:10:30PM -0500, Joel Fernandes wrote:
> 
> The combination of sysfs manipulated by userspace and a kernel failsafe
> makes sense to me.  Especially if by default triggering the failsafe
> splats.  That way, bugs where userspace fails to update the sysfs file
> get caught.
> 
> The non-default silent-failsafe mode is also useful to allow some power
> savings in advance of userspace getting the sysfs updating in place.
> And of course the default splatting setup can be used in internal testing
> with the release software being more tolerant of userspace foibles.

I'm wondering, this is all about CONFIG_RCU_LAZY, right? Or does also expedited
GP turned off a bit early or late on boot matter for anybody in practice?

So shouldn't we disable lazy callbacks by default when CONFIG_RCU_LAZY=y and then
turn it on with "sysctl kernel.rcu.lazy=1" only whenever userspace feels ready
about it? We can still keep the current call to rcu_end_inkernel_boot().

And if suddenly disabling lazy by default is an ABI breakage we can still add
CONFIG_RCU_LAZY_DEFAULT_DISABLED.
  
Joel Fernandes Feb. 28, 2023, 1:30 a.m. UTC | #17
On Tue, Feb 28, 2023 at 12:40:38AM +0100, Frederic Weisbecker wrote:
> On Mon, Feb 27, 2023 at 03:05:02PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 27, 2023 at 02:10:30PM -0500, Joel Fernandes wrote:
> > 
> > The combination of sysfs manipulated by userspace and a kernel failsafe
> > makes sense to me.  Especially if by default triggering the failsafe
> > splats.  That way, bugs where userspace fails to update the sysfs file
> > get caught.
> > 
> > The non-default silent-failsafe mode is also useful to allow some power
> > savings in advance of userspace getting the sysfs updating in place.
> > And of course the default splatting setup can be used in internal testing
> > with the release software being more tolerant of userspace foibles.
> 
> I'm wondering, this is all about CONFIG_RCU_LAZY, right? Or does also expedited
> GP turned off a bit early or late on boot matter for anybody in practice?

Yes, if you provide 'rcu_normal_after_boot', then after the boot ends, it
switches expedited GPs to normal ones.

It is the same issue for expedited, the kernel's version of what is 'boot' is
much shorter than what is actually boot.

This is also the case with suspend/resume's rcu_pm_notify(). See the comment:
  /*
   * On non-huge systems, use expedited RCU grace periods to make suspend
   * and hibernation run faster.
   */

There also we turn on/off both lazy and expedited. I don't see why we
shouldn't do it for boot.

> So shouldn't we disable lazy callbacks by default when CONFIG_RCU_LAZY=y and then
> turn it on with "sysctl kernel.rcu.lazy=1" only whenever userspace feels ready
> about it? We can still keep the current call to rcu_end_inkernel_boot().

Hmm IMHO that would add more knobs for not much reason honestly. We already
have CONFIG_RCU_LAZY default disabled, I really don't want to add more
dependency (like user enables the config and does not see laziness).

Thanks!

 - Joel
  
Qiuxu Zhuo Feb. 28, 2023, 6:40 a.m. UTC | #18
> From: Joel Fernandes <joel@joelfernandes.org>
> Sent: Monday, February 27, 2023 9:22 PM
> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>
> Cc: linux-kernel@vger.kernel.org; Frederic Weisbecker <frederic@kernel.org>;
> Lai Jiangshan <jiangshanlai@gmail.com>; linux-doc@vger.kernel.org; Paul E.
> McKenney <paulmck@kernel.org>; rcu@vger.kernel.org
> Subject: Re: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> completed
> 
> 
> 
> > On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> >
> > 
> >>
> >> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> Sent: Saturday, February 25, 2023 11:34 AM
> >> To: linux-kernel@vger.kernel.org
> >> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic
> >> Weisbecker <frederic@kernel.org>; Lai Jiangshan
> >> <jiangshanlai@gmail.com>; linux- doc@vger.kernel.org; Paul E.
> >> McKenney <paulmck@kernel.org>; rcu@vger.kernel.org
> >> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> >> completed
> >>
> >> On many systems, a great deal of boot happens after the kernel thinks
> >> the boot has completed. It is difficult to determine if the system
> >> has really booted from the kernel side. Some features like lazy-RCU
> >> can risk slowing down boot time if, say, a callback has been added
> >> that the boot synchronously depends on.
> >>
> >> Further, it is better to boot systems which pass
> >> 'rcu_normal_after_boot' to stay expedited for as long as the system is still
> booting.
> >>
> >> For these reasons, this commit adds a config option
> >> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> >> rcupdate.boot_end_delay.
> >>
> >> By default, this value is 20s. A system designer can choose to
> >> specify a value here to keep RCU from marking boot completion.  The
> >> boot sequence will not be marked ended until at least boot_end_delay
> milliseconds have passed.
> >
> > Hi Joel,
> >
> > Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> >
> > Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
> > real-time latency than the overall OS boot time?
> 
> But every system has to boot, even an RT system.

   Yes, this is true.

> >
> > If so, we might make rcupdate.boot_end_delay = 0 as the default value
> > (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> 
> Could you measure how much time your RT system takes to boot before the
> application runs?

I don't have a real-time OS environment to measure the OS boot time. 
I tried to measure the OS boot time of my "CentOS Stream 8" w/o and
w/ Joel’s patch. 

My testing showed the positive result that the OS boot time was 
reduced by ~4.6% on my side after applying Joel’s patch.

For testing details, please see the below:

1) Testing environment:
    OS            : CentOS Stream 8 (non-RT OS)
    Kernel     : v6.2
    Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
    Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, … 

2) My OS boot time definition: 
    The time from the start of the kernel boot to the shell command line 
    prompt is shown from the console. [ Different people may have
    different OS boot time definitions. ]
  
3) My measurement method (very rough method): 
    A timer in the kernel periodically prints the boot time every 100ms. 
    As soon as the shell command line prompt is shown from the console,
    we record the boot time printed by the timer, then the printed boot 
    time is the OS boot time. 
   
    The console log (mixed userspace and kernel logs) looked like this:

           [  OK  ] Started Permit User Sessions.
                        Starting Terminate Plymouth Boot Screen...
                        Starting Hold until boot process finishes up...
           [  OK  ] Started Command Scheduler.
           [    6.824466] input: ImExPS/2 Generic Explorer ...
           [    6.884685] Boot ms 6863
		...
           [    7.170920] Spectre V2 : WARNING: Unprivileged eBPF ...
           [    7.173140] Spectre V2 : WARNING: Unprivileged eBPF ...
           [    7.196741] Boot ms 7175
		...
           [    8.236757] Boot ms 8215

           CentOS Stream 8
           Kernel 6.2.0-rcu+ on an x86_64

           login: [    8.340751] Boot ms 8319
           [    8.444756] Boot ms 8423
		...

     Then the log "login: [    8.340751] Boot ms 8319" roughly showed the OS boot time was ~8.3s.

4) Measured OS boot time (in seconds)
   a) Measured 10 times w/o Joel's patch:
        8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
        The average OS boot time was: ~8.7s
   
   b) Measure 10 times w/ Joel's patch: 
        8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
        The average OS boot time was: ~8.3s.
		 
The OS boot time was reduced by : 8.7 – 8.3 = 0.4 second
The reduction percentage was       : 0.4/8.7 * 100% = 4.6%

If the testing above makes sense to you, please feel free to 
add

      Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Thanks!
-Qiuxu

> I can change it to default 0 essentially NOOPing it, but I would rather have a
> saner default (10 seconds even), than having someone forget to tune this for
> their system.
> 
> Thanks,
> 
>  - Joel
  
Frederic Weisbecker Feb. 28, 2023, 11:04 a.m. UTC | #19
On Tue, Feb 28, 2023 at 01:30:25AM +0000, Joel Fernandes wrote:
> On Tue, Feb 28, 2023 at 12:40:38AM +0100, Frederic Weisbecker wrote:
> > On Mon, Feb 27, 2023 at 03:05:02PM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 27, 2023 at 02:10:30PM -0500, Joel Fernandes wrote:
> > > 
> > > The combination of sysfs manipulated by userspace and a kernel failsafe
> > > makes sense to me.  Especially if by default triggering the failsafe
> > > splats.  That way, bugs where userspace fails to update the sysfs file
> > > get caught.
> > > 
> > > The non-default silent-failsafe mode is also useful to allow some power
> > > savings in advance of userspace getting the sysfs updating in place.
> > > And of course the default splatting setup can be used in internal testing
> > > with the release software being more tolerant of userspace foibles.
> > 
> > I'm wondering, this is all about CONFIG_RCU_LAZY, right? Or does also expedited
> > GP turned off a bit early or late on boot matter for anybody in practice?
> 
> Yes, if you provide 'rcu_normal_after_boot', then after the boot ends, it
> switches expedited GPs to normal ones.
> 
> It is the same issue for expedited, the kernel's version of what is 'boot' is
> much shorter than what is actually boot.
> 
> This is also the case with suspend/resume's rcu_pm_notify(). See the comment:
>   /*
>    * On non-huge systems, use expedited RCU grace periods to make suspend
>    * and hibernation run faster.
>    */
> 
> There also we turn on/off both lazy and expedited. I don't see why we
> shouldn't do it for boot.

Of course but I mean currently rcu_end_inkernel_boot() is called explicitly
before the kernel calls init. From that point on, what is the source of the
issue? Delaying lazy further would be enough or do we really need to delay
forcing expedited as well? Or is it the reverse: delaying expedited further
would matter and lazy doesn't play much role from there.

It matters to know because if delaying expedited further is enough, then indeed
we must delay the call to rcu_end_inkernel_boot() somehow. But if delaying
expedited further doesn't matter and delaying lazy matter then it's possible
that the issue is a callback that should be marked as call_rcu_hurry() and then
the source of the problem is much broader.

I think the confusion comes from the fact that your changelog doesn't state precisely
what the problem exactly is. Also do we need to wait for the kernel boot completion?
And if so what is missing from kernel boot after the current explicit call to
rcu_end_inkernel_boot()?

Or do we also need to wait for userspace to complete the boot? Different
problems, different solutions.

But in any case a countdown is not a way to go. Consider that rcu_lazy may
be used by a larger audience than just chromium in the long run. You can not
ask every admin to provide his own estimation per type of machine. You can't
either rely on a long default value because that may have bad impact on
workload assumptions launched right after boot.

> 
> > So shouldn't we disable lazy callbacks by default when CONFIG_RCU_LAZY=y and then
> > turn it on with "sysctl kernel.rcu.lazy=1" only whenever userspace feels ready
> > about it? We can still keep the current call to rcu_end_inkernel_boot().
> 
> Hmm IMHO that would add more knobs for not much reason honestly. We already
> have CONFIG_RCU_LAZY default disabled, I really don't want to add more
> dependency (like user enables the config and does not see laziness).

I don't know. Like I said, different problems, different solutions. Let's
identify what the issue is precisely. For example can we expect that the issues
on boot can be a problem also on some temporary workloads?

Besides I'm currently testing a very hacky flavour of rcu_lazy and so far it
shows many idle calls that would have been delayed if callbacks weren't queued
as lazy. I have yet to do actual energy and performance measurements but if it
happens to show improvements, I suspect distros will want a supported yet
default disabled Kconfig that can be turned on on boot or later. Of course we
are not there yet but things to keep in mind...

Thanks.
  
Uladzislau Rezki Feb. 28, 2023, 11:42 a.m. UTC | #20
> On Mon, Feb 27, 2023 at 1:57 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Mon, Feb 27, 2023 at 01:27:20PM -0500, Joel Fernandes wrote:
> > >
> > >
> > > > On Feb 27, 2023, at 1:20 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Mon, Feb 27, 2023 at 01:15:47PM -0500, Joel Fernandes wrote:
> > > >>
> > > >>
> > > >>>> On Feb 27, 2023, at 1:06 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >>>
> > > >>> On Mon, Feb 27, 2023 at 10:16:51AM -0500, Joel Fernandes wrote:
> > > >>>>> On Mon, Feb 27, 2023 at 9:55 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >>>>>
> > > >>>>> On Mon, Feb 27, 2023 at 08:22:06AM -0500, Joel Fernandes wrote:
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> On Feb 27, 2023, at 2:53 AM, Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> > > >>>>>>>
> > > >>>>>>> 
> > > >>>>>>>>
> > > >>>>>>>> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > >>>>>>>> Sent: Saturday, February 25, 2023 11:34 AM
> > > >>>>>>>> To: linux-kernel@vger.kernel.org
> > > >>>>>>>> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>; Frederic Weisbecker
> > > >>>>>>>> <frederic@kernel.org>; Lai Jiangshan <jiangshanlai@gmail.com>; linux-
> > > >>>>>>>> doc@vger.kernel.org; Paul E. McKenney <paulmck@kernel.org>;
> > > >>>>>>>> rcu@vger.kernel.org
> > > >>>>>>>> Subject: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> > > >>>>>>>> completed
> > > >>>>>>>>
> > > >>>>>>>> On many systems, a great deal of boot happens after the kernel thinks the
> > > >>>>>>>> boot has completed. It is difficult to determine if the system has really
> > > >>>>>>>> booted from the kernel side. Some features like lazy-RCU can risk slowing
> > > >>>>>>>> down boot time if, say, a callback has been added that the boot
> > > >>>>>>>> synchronously depends on.
> > > >>>>>>>>
> > > >>>>>>>> Further, it is better to boot systems which pass 'rcu_normal_after_boot' to
> > > >>>>>>>> stay expedited for as long as the system is still booting.
> > > >>>>>>>>
> > > >>>>>>>> For these reasons, this commit adds a config option
> > > >>>>>>>> 'CONFIG_RCU_BOOT_END_DELAY' and a boot parameter
> > > >>>>>>>> rcupdate.boot_end_delay.
> > > >>>>>>>>
> > > >>>>>>>> By default, this value is 20s. A system designer can choose to specify a value
> > > >>>>>>>> here to keep RCU from marking boot completion.  The boot sequence will not
> > > >>>>>>>> be marked ended until at least boot_end_delay milliseconds have passed.
> > > >>>>>>>
> > > >>>>>>> Hi Joel,
> > > >>>>>>>
> > > >>>>>>> Just some thoughts on the default value of 20s, correct me if I'm wrong :-).
> > > >>>>>>>
> > > >>>>>>> Does the OS with CONFIG_PREEMPT_RT=y kernel concern more about the
> > > >>>>>>> real-time latency than the overall OS boot time?
> > > >>>>>>
> > > >>>>>> But every system has to boot, even an RT system.
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> If so, we might make rcupdate.boot_end_delay = 0 as the default value
> > > >>>>>>> (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> > > >>>>>>
> > > >>>>>> Could you measure how much time your RT system takes to boot before the application runs?
> > > >>>>>>
> > > >>>>>> I can change it to default 0 essentially NOOPing it, but I would rather have a saner default (10 seconds even), than having someone forget to tune this for their system.
> > > >>>>>
> > > >>>>> Provide a /sys location that the userspace code writes to when it
> > > >>>>> is ready?  Different systems with different hardware and software
> > > >>>>> configurations are going to take different amounts of time to boot,
> > > >>>>> correct?
> > > >>>>
> > > >>>> I could add a sysfs node, but I still wanted this patch as well
> > > >>>> because I am wary of systems where yet more userspace changes are
> > > >>>> required. I feel the kernel should itself be able to do this. Yes, it
> > > >>>> is possible the system completes "booting" at a different time than
> > > >>>> what the kernel thinks. But it does that anyway (even without this
> > > >>>> patch), so I am not seeing a good reason to not do this in the kernel.
> > > >>>> It is also only a minimum cap, so if the in-kernel boot takes too
> > > >>>> long, then the patch will have no effect.
> > > >>>>
> > > >>>> Thoughts?
> > > >>>>
> > > >>> Why "rcu_boot_ended" is not enough? As i see right after that an "init"
> > > >>> process or shell or panic is going to be invoked by the kernel. It basically
> > > >>> indicates that a kernel is fully functional.
> > > >>>
> > > >>> Or an idea to wait even further? Until all kernel modules are loaded by
> > > >>> user space.
> > > >>
> > > >> I mentioned in commit message it is daemons, userspace initialization etc. There is a lot of userspace booting up as well and using the kernel while doing so.
> > > >>
> > > >> So, It does not make sense to me to mark kernel as booted too early. And no harm in adding some builtin kernel hysteresis. What am I missing?
> > > >>
> > > > Than it is up to user space to decide when it is ready in terms of "boot completed".
> > >
> > > I dont know if you caught up with the other threads. See replies from Paul and my reply to that.
> > >
> > > Also what you are proposing can be more harmful. If user space has a bug and does not notify the kernel that boot completed, then the boot can stay incomplete forever. The idea with this patch is to make things better, not worse.
> > >
> > I saw that Paul proposed to have a sysfs attribute using which you can
> > send a notification.
> 
> Maybe I am missing something but how will a sysfs node on its own work really?
> 
> 1. delete kernel marking itself boot completed  -- and then sysfs
> marks it completed?
> 
> 2. delete kernel marking itself boot completed  -- and then sysfs
> marks it completed, if sysfs does not come in in N seconds, then
> kernel marks as completed?
> 
> #1 is a no go, that just means a bug waiting to happen if userspace
> forgets to write to sysfs.
> 
> #2 is just an extension of this patch. So I can add a sysfs node on
> top of this. And we can make the minimum time as a long period of
> time, as you noted below:
> 
> > IMHO, to me this patch does not provide a clear correlation between what
> > is a boot complete and when it occurs. A boot complete is a synchronous
> > event whereas the patch thinks that after some interval a "boot" is completed.
> 
> But that is exactly how the kernel code is now without this patch, so
> it is already broken in that sense, I am not really breaking it more
> ;-)
> 
> > We can imply that after, say 100 seconds an initialization of user space
> > is done. Maybe 100 seconds then? :)
> 
> Yes I am Ok with that. So are you suggesting we change the default to
> 100 seconds and then add a sysfs node to mark as boot done whenever
> userspace notifies?
> 
As i see it correctly the patch tries to address at least two issues. Fist
one is about lazy callbacks. If it stucks the boot-time is degraded so you
want it to be disabled during the boot.

I wonder why not keeping lazy stuff to be disabled by __default__. User decides
if it is needed or not? No matter if user space hangs or whatever. Or is it hard
to add such runtime switcher?

Second you want to have an expidited gp during a boot. I guess same here
it is about boot performance. Should not it be as a separate patch with
different commit message and with some data if you see a speedup?

Thanks!

--
Uladzislau Rezki
  
Joel Fernandes Feb. 28, 2023, 2:27 p.m. UTC | #21
On Tue, Feb 28, 2023 at 1:40 AM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>
[...]
> > > If so, we might make rcupdate.boot_end_delay = 0 as the default value
> > > (NOT the default 20s) for CONFIG_PREEMPT_RT=y kernels?
> >
> > Could you measure how much time your RT system takes to boot before the
> > application runs?
>
> I don't have a real-time OS environment to measure the OS boot time.
> I tried to measure the OS boot time of my "CentOS Stream 8" w/o and
> w/ Joel’s patch.
>
> My testing showed the positive result that the OS boot time was
> reduced by ~4.6% on my side after applying Joel’s patch.

Wow, this is great! I am guessing you have CONFIG_RCU_LAZY disabled,
when you tested. If so, that is great news that expediting RCU for a
bit longer improves boot time! Please confirm that your config had
LAZY disabled.

> For testing details, please see the below:
>
> 1) Testing environment:
>     OS            : CentOS Stream 8 (non-RT OS)
>     Kernel     : v6.2
>     Machine : Intel Cascade Lake server (2 sockets, each with 44 logical threads)
>     Qemu  args  : -cpu host -enable-kvm, -smp 88,threads=2,sockets=2, …
>
> 2) My OS boot time definition:
>     The time from the start of the kernel boot to the shell command line
>     prompt is shown from the console. [ Different people may have
>     different OS boot time definitions. ]
>
> 3) My measurement method (very rough method):
>     A timer in the kernel periodically prints the boot time every 100ms.
>     As soon as the shell command line prompt is shown from the console,
>     we record the boot time printed by the timer, then the printed boot
>     time is the OS boot time.

Hmm, Can you not just print the boot time from userspace using
clock_gettime() and CLOCK_BOOTTIME? But yeah either way, good data!

>     The console log (mixed userspace and kernel logs) looked like this:
>
>            [  OK  ] Started Permit User Sessions.
>                         Starting Terminate Plymouth Boot Screen...
>                         Starting Hold until boot process finishes up...
>            [  OK  ] Started Command Scheduler.
>            [    6.824466] input: ImExPS/2 Generic Explorer ...
>            [    6.884685] Boot ms 6863
>                 ...
>            [    7.170920] Spectre V2 : WARNING: Unprivileged eBPF ...
>            [    7.173140] Spectre V2 : WARNING: Unprivileged eBPF ...
>            [    7.196741] Boot ms 7175
>                 ...
>            [    8.236757] Boot ms 8215
>
>            CentOS Stream 8
>            Kernel 6.2.0-rcu+ on an x86_64
>
>            login: [    8.340751] Boot ms 8319
>            [    8.444756] Boot ms 8423
>                 ...
>
>      Then the log "login: [    8.340751] Boot ms 8319" roughly showed the OS boot time was ~8.3s.
>
> 4) Measured OS boot time (in seconds)
>    a) Measured 10 times w/o Joel's patch:
>         8.7s, 8.4s, 8.6s, 8.2s, 9.0s, 8.7s, 8.8s, 9.3s, 8.8s, 8.3s
>         The average OS boot time was: ~8.7s
>
>    b) Measure 10 times w/ Joel's patch:
>         8.5s, 8.2s, 7.6s, 8.2s, 8.7s, 8.2s, 7.8s, 8.2s, 9.3s, 8.4s
>         The average OS boot time was: ~8.3s.
>
> The OS boot time was reduced by : 8.7 – 8.3 = 0.4 second
> The reduction percentage was       : 0.4/8.7 * 100% = 4.6%
>
> If the testing above makes sense to you, please feel free to
> add
>
>       Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Yes, it makes sense. I will add these. Thanks!

 - Joel

>
> Thanks!
> -Qiuxu
>
> > I can change it to default 0 essentially NOOPing it, but I would rather have a
> > saner default (10 seconds even), than having someone forget to tune this for
> > their system.
> >
> > Thanks,
> >
> >  - Joel
>
  
Joel Fernandes Feb. 28, 2023, 8:09 p.m. UTC | #22
Hi Frederic,

On Tue, Feb 28, 2023 at 12:04:36PM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 28, 2023 at 01:30:25AM +0000, Joel Fernandes wrote:
> > On Tue, Feb 28, 2023 at 12:40:38AM +0100, Frederic Weisbecker wrote:
> > > On Mon, Feb 27, 2023 at 03:05:02PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 27, 2023 at 02:10:30PM -0500, Joel Fernandes wrote:
> > > > 
> > > > The combination of sysfs manipulated by userspace and a kernel failsafe
> > > > makes sense to me.  Especially if by default triggering the failsafe
> > > > splats.  That way, bugs where userspace fails to update the sysfs file
> > > > get caught.
> > > > 
> > > > The non-default silent-failsafe mode is also useful to allow some power
> > > > savings in advance of userspace getting the sysfs updating in place.
> > > > And of course the default splatting setup can be used in internal testing
> > > > with the release software being more tolerant of userspace foibles.
> > > 
> > > I'm wondering, this is all about CONFIG_RCU_LAZY, right? Or does also expedited
> > > GP turned off a bit early or late on boot matter for anybody in practice?
> > 
> > Yes, if you provide 'rcu_normal_after_boot', then after the boot ends, it
> > switches expedited GPs to normal ones.
> > 
> > It is the same issue for expedited, the kernel's version of what is 'boot' is
> > much shorter than what is actually boot.
> > 
> > This is also the case with suspend/resume's rcu_pm_notify(). See the comment:
> >   /*
> >    * On non-huge systems, use expedited RCU grace periods to make suspend
> >    * and hibernation run faster.
> >    */
> > 
> > There also we turn on/off both lazy and expedited. I don't see why we
> > shouldn't do it for boot.
> 
> Of course but I mean currently rcu_end_inkernel_boot() is called explicitly
> before the kernel calls init. From that point on, what is the source of the
> issue? Delaying lazy further would be enough or do we really need to delay
> forcing expedited as well? Or is it the reverse: delaying expedited further
> would matter and lazy doesn't play much role from there.

Both should play a role. For lazy, we found callbacks that showed later in
the full boot sequence (like the SCSI issue).

For expedited, there is new data from Qiuxu showing 5% improvement in boot
time.

> It matters to know because if delaying expedited further is enough, then indeed
> we must delay the call to rcu_end_inkernel_boot() somehow. But if delaying
> expedited further doesn't matter and delaying lazy matter then it's possible
> that the issue is a callback that should be marked as call_rcu_hurry() and then
> the source of the problem is much broader.

Right, and we also don't know if in the future, somebody queues a CB that
slows down boot as well (say they queue a lazy CB that does a wakeup), even
if currently there are not any such. As noted, that SCSI issue did show. Just
to note, callbacks doing wakeups are supposed to call call_rcu_hurry().

> I think the confusion comes from the fact that your changelog doesn't state precisely
> what the problem exactly is. Also do we need to wait for the kernel boot completion?
> And if so what is missing from kernel boot after the current explicit call to
> rcu_end_inkernel_boot()?

Yes, sorry, it was more an RFC but still should have been more clear. For the
v3 I'll definitely make it clear.

rcu_end_inkernel_boot() is called before init is run. But the kernel cannot
posibly know when init has finished running and say the system is now waiting
for user login, or something. There's a considerable amount time from
rcu_end_inkernel_boot() to when the system is actually "booted". That's the
main issue. We could look at CPU load, but that's not ideal. Maybe wait for
user input, but that sucks as well.

> Or do we also need to wait for userspace to complete the boot? Different
> problems, different solutions.
> 
> But in any case a countdown is not a way to go. Consider that rcu_lazy may
> be used by a larger audience than just chromium in the long run. You can not
> ask every admin to provide his own estimation per type of machine. You can't
> either rely on a long default value because that may have bad impact on
> workload assumptions launched right after boot.

Hmmm I see what you mean, so a conservative and configurable "fail-safe"
timeout followed by sysctl to end the boot earlier than the timeout, should
do it (something like 30 seconds IMHO sounds reasonable)? In any case,
whatever way we go, we would not end the kernel boot before
rcu_end_inkernel_boot() is called at least once (which is the current
behavior).

So it would be:

  low level boot + initcalls
       20 sec                         30 second timeout
|------------------------------|--------------------------
                               |                         |
	        old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()

But it could be, if user decides:
  low level boot + initcalls
       20 sec                         10 second timeout
|------------------------------|--------------------------
                               |                         |
	        old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
		                                 via /sys/ entry.

> > > So shouldn't we disable lazy callbacks by default when CONFIG_RCU_LAZY=y and then
> > > turn it on with "sysctl kernel.rcu.lazy=1" only whenever userspace feels ready
> > > about it? We can still keep the current call to rcu_end_inkernel_boot().
> > 
> > Hmm IMHO that would add more knobs for not much reason honestly. We already
> > have CONFIG_RCU_LAZY default disabled, I really don't want to add more
> > dependency (like user enables the config and does not see laziness).
> 
> I don't know. Like I said, different problems, different solutions. Let's
> identify what the issue is precisely. For example can we expect that the issues
> on boot can be a problem also on some temporary workloads?
> 
> Besides I'm currently testing a very hacky flavour of rcu_lazy and so far it
> shows many idle calls that would have been delayed if callbacks weren't queued
> as lazy.

Can you provide more details? What kind of hack flavor, and what is it doing?

thanks,

 - Joel


> I have yet to do actual energy and performance measurements but if it
> happens to show improvements, I suspect distros will want a supported yet
> default disabled Kconfig that can be turned on on boot or later. Of course we
> are not there yet but things to keep in mind...
> 
> Thanks.
  
Qiuxu Zhuo March 1, 2023, 1:34 a.m. UTC | #23
> From: Joel Fernandes <joel@joelfernandes.org>
> Sent: Tuesday, February 28, 2023 10:27 PM
> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>
> Cc: linux-kernel@vger.kernel.org; Frederic Weisbecker <frederic@kernel.org>;
> Lai Jiangshan <jiangshanlai@gmail.com>; linux-doc@vger.kernel.org; Paul E.
> McKenney <paulmck@kernel.org>; rcu@vger.kernel.org
> Subject: Re: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> completed
> 
> On Tue, Feb 28, 2023 at 1:40 AM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> >
> [...]
> > My testing showed the positive result that the OS boot time was
> > reduced by ~4.6% on my side after applying Joel’s patch.
> 
> Wow, this is great! I am guessing you have CONFIG_RCU_LAZY disabled,
> when you tested. If so, that is great news that expediting RCU for a bit longer
> improves boot time! Please confirm that your config had LAZY disabled.
> 

I confirm that CONFIG_RCU_LAZY kernel configuration was disabled in my testing.
All the kernel configurations for RCU used in my testing are as below:

     #
     # RCU Subsystem
     #
     CONFIG_TREE_RCU=y
     CONFIG_PREEMPT_RCU=y
     CONFIG_RCU_EXPERT=y
     CONFIG_SRCU=y
     CONFIG_TREE_SRCU=y
     CONFIG_TASKS_RCU_GENERIC=y
     # CONFIG_FORCE_TASKS_RCU is not set
     CONFIG_TASKS_RCU=y
     # CONFIG_FORCE_TASKS_RUDE_RCU is not set
     # CONFIG_FORCE_TASKS_TRACE_RCU is not set
     CONFIG_TASKS_TRACE_RCU=y
     CONFIG_RCU_STALL_COMMON=y
     CONFIG_RCU_NEED_SEGCBLIST=y
     CONFIG_RCU_FANOUT=32
     CONFIG_RCU_FANOUT_LEAF=16
     # CONFIG_RCU_BOOST is not set
     CONFIG_RCU_BOOT_END_DELAY=20000
     CONFIG_RCU_NOCB_CPU=y
     # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
     # CONFIG_TASKS_TRACE_RCU_READ_MB is not set
     # CONFIG_RCU_LAZY is not set
     # end of RCU Subsystem

-Qiuxu
  
Joel Fernandes March 1, 2023, 3:57 p.m. UTC | #24
On Tue, Feb 28, 2023 at 8:34 PM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>
> > From: Joel Fernandes <joel@joelfernandes.org>
> > Sent: Tuesday, February 28, 2023 10:27 PM
> > To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>
> > Cc: linux-kernel@vger.kernel.org; Frederic Weisbecker <frederic@kernel.org>;
> > Lai Jiangshan <jiangshanlai@gmail.com>; linux-doc@vger.kernel.org; Paul E.
> > McKenney <paulmck@kernel.org>; rcu@vger.kernel.org
> > Subject: Re: [PATCH RFC v2] rcu: Add a minimum time for marking boot as
> > completed
> >
> > On Tue, Feb 28, 2023 at 1:40 AM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
> > >
> > [...]
> > > My testing showed the positive result that the OS boot time was
> > > reduced by ~4.6% on my side after applying Joel’s patch.
> >
> > Wow, this is great! I am guessing you have CONFIG_RCU_LAZY disabled,
> > when you tested. If so, that is great news that expediting RCU for a bit longer
> > improves boot time! Please confirm that your config had LAZY disabled.
> >
>
> I confirm that CONFIG_RCU_LAZY kernel configuration was disabled in my testing.
> All the kernel configurations for RCU used in my testing are as below:

Thank you! Good to know there is improvement in *existing code*,
that's the power of upstream... new features fix/improve old ones
sometimes.

Any case, I will include all your data in v3. Thanks!

 - Joel

>
>      #
>      # RCU Subsystem
>      #
>      CONFIG_TREE_RCU=y
>      CONFIG_PREEMPT_RCU=y
>      CONFIG_RCU_EXPERT=y
>      CONFIG_SRCU=y
>      CONFIG_TREE_SRCU=y
>      CONFIG_TASKS_RCU_GENERIC=y
>      # CONFIG_FORCE_TASKS_RCU is not set
>      CONFIG_TASKS_RCU=y
>      # CONFIG_FORCE_TASKS_RUDE_RCU is not set
>      # CONFIG_FORCE_TASKS_TRACE_RCU is not set
>      CONFIG_TASKS_TRACE_RCU=y
>      CONFIG_RCU_STALL_COMMON=y
>      CONFIG_RCU_NEED_SEGCBLIST=y
>      CONFIG_RCU_FANOUT=32
>      CONFIG_RCU_FANOUT_LEAF=16
>      # CONFIG_RCU_BOOST is not set
>      CONFIG_RCU_BOOT_END_DELAY=20000
>      CONFIG_RCU_NOCB_CPU=y
>      # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
>      # CONFIG_TASKS_TRACE_RCU_READ_MB is not set
>      # CONFIG_RCU_LAZY is not set
>      # end of RCU Subsystem
>
> -Qiuxu
  
Frederic Weisbecker March 1, 2023, 5:11 p.m. UTC | #25
On Tue, Feb 28, 2023 at 08:09:02PM +0000, Joel Fernandes wrote:
> Hi Frederic,
> 
> On Tue, Feb 28, 2023 at 12:04:36PM +0100, Frederic Weisbecker wrote:
> Both should play a role. For lazy, we found callbacks that showed later in
> the full boot sequence (like the SCSI issue).
> 
> For expedited, there is new data from Qiuxu showing 5% improvement in boot
> time.

Indeed, nice one!

> Right, and we also don't know if in the future, somebody queues a CB that
> slows down boot as well (say they queue a lazy CB that does a wakeup), even
> if currently there are not any such. As noted, that SCSI issue did show. Just
> to note, callbacks doing wakeups are supposed to call call_rcu_hurry().

Sure, ideally we should have a way to debug lazy callbacks involved in wait/wake
dependency against the enqueuer but that doesn't sound easy to do...

> Yes, sorry, it was more an RFC but still should have been more clear. For the
> v3 I'll definitely make it clear.
> 
> rcu_end_inkernel_boot() is called before init is run. But the kernel cannot
> posibly know when init has finished running and say the system is now waiting
> for user login, or something. There's a considerable amount time from
> rcu_end_inkernel_boot() to when the system is actually "booted". That's the
> main issue. We could look at CPU load, but that's not ideal. Maybe wait for
> user input, but that sucks as well.

So it looks like user boot is involved in what you want to wait for. In that
case only userspace can tell.

> Hmmm I see what you mean, so a conservative and configurable "fail-safe"
> timeout followed by sysctl to end the boot earlier than the timeout, should
> do it (something like 30 seconds IMHO sounds reasonable)? In any case,
> whatever way we go, we would not end the kernel boot before
> rcu_end_inkernel_boot() is called at least once (which is the current
> behavior).
> 
> So it would be:
> 
>   low level boot + initcalls
>        20 sec                         30 second timeout
> |------------------------------|--------------------------
>                                |                         |
> 	        old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> 
> But it could be, if user decides:
>   low level boot + initcalls
>        20 sec                         10 second timeout
> |------------------------------|--------------------------
>                                |                         |
> 	        old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> 		                                 via /sys/ entry.

The problem I have with a random default timeout is that it may break sensitive
workloads. If the default is 30 and say the boot only takes 5 seconds and
immediately launches a latency sensitive task, this may break things in a
subtle way during these 25 seconds when it usually didn't. Because expedited
RCU is a hammer interrupting all non-idle CPUs.

Until now forcing expedited RCU was only performed before any user code. Now it
crosses the boundary so better be careful. I'd personally advocate for keeping
the current call before init is launched. Then provide an end_boot_sysctl kernel
boot parameter that will ignore the current call before init and let the user
signal that through sysctl.


> > > > So shouldn't we disable lazy callbacks by default when CONFIG_RCU_LAZY=y and then
> > > > turn it on with "sysctl kernel.rcu.lazy=1" only whenever userspace feels ready
> > > > about it? We can still keep the current call to rcu_end_inkernel_boot().
> > > 
> > > Hmm IMHO that would add more knobs for not much reason honestly. We already
> > > have CONFIG_RCU_LAZY default disabled, I really don't want to add more
> > > dependency (like user enables the config and does not see laziness).
> > 
> > I don't know. Like I said, different problems, different solutions. Let's
> > identify what the issue is precisely. For example can we expect that the issues
> > on boot can be a problem also on some temporary workloads?
> > 
> > Besides I'm currently testing a very hacky flavour of rcu_lazy and so far it
> > shows many idle calls that would have been delayed if callbacks weren't queued
> > as lazy.
> 
> Can you provide more details? What kind of hack flavor, and what is it doing?

Sorry, I meant a hacky implementation of lazy to work with !NOCB.

Thanks.
  
Joel Fernandes March 1, 2023, 9:31 p.m. UTC | #26
On Wed, Mar 1, 2023 at 12:11 PM Frederic Weisbecker <frederic@kernel.org> wrote:
[...]
> > Hmmm I see what you mean, so a conservative and configurable "fail-safe"
> > timeout followed by sysctl to end the boot earlier than the timeout, should
> > do it (something like 30 seconds IMHO sounds reasonable)? In any case,
> > whatever way we go, we would not end the kernel boot before
> > rcu_end_inkernel_boot() is called at least once (which is the current
> > behavior).
> >
> > So it would be:
> >
> >   low level boot + initcalls
> >        20 sec                         30 second timeout
> > |------------------------------|--------------------------
> >                                |                         |
> >               old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> >
> > But it could be, if user decides:
> >   low level boot + initcalls
> >        20 sec                         10 second timeout
> > |------------------------------|--------------------------
> >                                |                         |
> >               old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> >                                                via /sys/ entry.
>
> The problem I have with a random default timeout is that it may break sensitive
> workloads. If the default is 30 and say the boot only takes 5 seconds and
> immediately launches a latency sensitive task, this may break things in a
> subtle way during these 25 seconds when it usually didn't. Because expedited
> RCU is a hammer interrupting all non-idle CPUs.
>
> Until now forcing expedited RCU was only performed before any user code. Now it
> crosses the boundary so better be careful. I'd personally advocate for keeping
> the current call before init is launched. Then provide an end_boot_sysctl kernel
> boot parameter that will ignore the current call before init and let the user
> signal that through sysctl.

Considering that the PREEMPT-RT system benefits from it within the 8
seconds, I will go ahead make the default 15 seconds or so and make it
tunable. Hopefully that will be an acceptable compromise, with
sufficient documentation, changelog, and so forth... If you agree I'd
appreciate your Ack on the next posting.

> > > > > So shouldn't we disable lazy callbacks by default when CONFIG_RCU_LAZY=y and then
> > > > > turn it on with "sysctl kernel.rcu.lazy=1" only whenever userspace feels ready
> > > > > about it? We can still keep the current call to rcu_end_inkernel_boot().
> > > >
> > > > Hmm IMHO that would add more knobs for not much reason honestly. We already
> > > > have CONFIG_RCU_LAZY default disabled, I really don't want to add more
> > > > dependency (like user enables the config and does not see laziness).
> > >
> > > I don't know. Like I said, different problems, different solutions. Let's
> > > identify what the issue is precisely. For example can we expect that the issues
> > > on boot can be a problem also on some temporary workloads?
> > >
> > > Besides I'm currently testing a very hacky flavour of rcu_lazy and so far it
> > > shows many idle calls that would have been delayed if callbacks weren't queued
> > > as lazy.
> >
> > Can you provide more details? What kind of hack flavor, and what is it doing?
>
> Sorry, I meant a hacky implementation of lazy to work with !NOCB.

Interesting, I'm curious which calls those are and if they should also
be call_rcu_hurry().

 - Joel
  
Paul E. McKenney March 2, 2023, 12:49 a.m. UTC | #27
On Wed, Mar 01, 2023 at 04:31:01PM -0500, Joel Fernandes wrote:
> On Wed, Mar 1, 2023 at 12:11 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> [...]
> > > Hmmm I see what you mean, so a conservative and configurable "fail-safe"
> > > timeout followed by sysctl to end the boot earlier than the timeout, should
> > > do it (something like 30 seconds IMHO sounds reasonable)? In any case,
> > > whatever way we go, we would not end the kernel boot before
> > > rcu_end_inkernel_boot() is called at least once (which is the current
> > > behavior).
> > >
> > > So it would be:
> > >
> > >   low level boot + initcalls
> > >        20 sec                         30 second timeout
> > > |------------------------------|--------------------------
> > >                                |                         |
> > >               old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> > >
> > > But it could be, if user decides:
> > >   low level boot + initcalls
> > >        20 sec                         10 second timeout
> > > |------------------------------|--------------------------
> > >                                |                         |
> > >               old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> > >                                                via /sys/ entry.
> >
> > The problem I have with a random default timeout is that it may break sensitive
> > workloads. If the default is 30 and say the boot only takes 5 seconds and
> > immediately launches a latency sensitive task, this may break things in a
> > subtle way during these 25 seconds when it usually didn't. Because expedited
> > RCU is a hammer interrupting all non-idle CPUs.
> >
> > Until now forcing expedited RCU was only performed before any user code. Now it
> > crosses the boundary so better be careful. I'd personally advocate for keeping
> > the current call before init is launched. Then provide an end_boot_sysctl kernel
> > boot parameter that will ignore the current call before init and let the user
> > signal that through sysctl.
> 
> Considering that the PREEMPT-RT system benefits from it within the 8
> seconds, I will go ahead make the default 15 seconds or so and make it
> tunable. Hopefully that will be an acceptable compromise, with
> sufficient documentation, changelog, and so forth... If you agree I'd
> appreciate your Ack on the next posting.

Just checking on the sysfs portion of this.  After all, tuning kernel
boot parameters to specific systems is not all that much fun, especially
when you have lots of systems.

							Thanx, Paul
  
Joel Fernandes March 2, 2023, 1:08 a.m. UTC | #28
On Wed, Mar 1, 2023 at 7:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Mar 01, 2023 at 04:31:01PM -0500, Joel Fernandes wrote:
> > On Wed, Mar 1, 2023 at 12:11 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > [...]
> > > > Hmmm I see what you mean, so a conservative and configurable "fail-safe"
> > > > timeout followed by sysctl to end the boot earlier than the timeout, should
> > > > do it (something like 30 seconds IMHO sounds reasonable)? In any case,
> > > > whatever way we go, we would not end the kernel boot before
> > > > rcu_end_inkernel_boot() is called at least once (which is the current
> > > > behavior).
> > > >
> > > > So it would be:
> > > >
> > > >   low level boot + initcalls
> > > >        20 sec                         30 second timeout
> > > > |------------------------------|--------------------------
> > > >                                |                         |
> > > >               old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> > > >
> > > > But it could be, if user decides:
> > > >   low level boot + initcalls
> > > >        20 sec                         10 second timeout
> > > > |------------------------------|--------------------------
> > > >                                |                         |
> > > >               old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> > > >                                                via /sys/ entry.
> > >
> > > The problem I have with a random default timeout is that it may break sensitive
> > > workloads. If the default is 30 and say the boot only takes 5 seconds and
> > > immediately launches a latency sensitive task, this may break things in a
> > > subtle way during these 25 seconds when it usually didn't. Because expedited
> > > RCU is a hammer interrupting all non-idle CPUs.
> > >
> > > Until now forcing expedited RCU was only performed before any user code. Now it
> > > crosses the boundary so better be careful. I'd personally advocate for keeping
> > > the current call before init is launched. Then provide an end_boot_sysctl kernel
> > > boot parameter that will ignore the current call before init and let the user
> > > signal that through sysctl.
> >
> > Considering that the PREEMPT-RT system benefits from it within the 8
> > seconds, I will go ahead make the default 15 seconds or so and make it
> > tunable. Hopefully that will be an acceptable compromise, with
> > sufficient documentation, changelog, and so forth... If you agree I'd
> > appreciate your Ack on the next posting.
>
> Just checking on the sysfs portion of this.

Yes, the current plan is to add a sysfs node (likely sysctl) with the
15-second failsafe.

> After all, tuning kernel
> boot parameters to specific systems is not all that much fun, especially
> when you have lots of systems.

Are you suggesting to drop "end the boot" sysfs and just have a
minimum-time approach as I initially did?

 - Joel
  
Paul E. McKenney March 2, 2023, 1:19 a.m. UTC | #29
On Wed, Mar 01, 2023 at 08:08:54PM -0500, Joel Fernandes wrote:
> On Wed, Mar 1, 2023 at 7:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Mar 01, 2023 at 04:31:01PM -0500, Joel Fernandes wrote:
> > > On Wed, Mar 1, 2023 at 12:11 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > [...]
> > > > > Hmmm I see what you mean, so a conservative and configurable "fail-safe"
> > > > > timeout followed by sysctl to end the boot earlier than the timeout, should
> > > > > do it (something like 30 seconds IMHO sounds reasonable)? In any case,
> > > > > whatever way we go, we would not end the kernel boot before
> > > > > rcu_end_inkernel_boot() is called at least once (which is the current
> > > > > behavior).
> > > > >
> > > > > So it would be:
> > > > >
> > > > >   low level boot + initcalls
> > > > >        20 sec                         30 second timeout
> > > > > |------------------------------|--------------------------
> > > > >                                |                         |
> > > > >               old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> > > > >
> > > > > But it could be, if user decides:
> > > > >   low level boot + initcalls
> > > > >        20 sec                         10 second timeout
> > > > > |------------------------------|--------------------------
> > > > >                                |                         |
> > > > >               old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
> > > > >                                                via /sys/ entry.
> > > >
> > > > The problem I have with a random default timeout is that it may break sensitive
> > > > workloads. If the default is 30 and say the boot only takes 5 seconds and
> > > > immediately launches a latency sensitive task, this may break things in a
> > > > subtle way during these 25 seconds when it usually didn't. Because expedited
> > > > RCU is a hammer interrupting all non-idle CPUs.
> > > >
> > > > Until now forcing expedited RCU was only performed before any user code. Now it
> > > > crosses the boundary so better be careful. I'd personally advocate for keeping
> > > > the current call before init is launched. Then provide an end_boot_sysctl kernel
> > > > boot parameter that will ignore the current call before init and let the user
> > > > signal that through sysctl.
> > >
> > > Considering that the PREEMPT-RT system benefits from it within the 8
> > > seconds, I will go ahead make the default 15 seconds or so and make it
> > > tunable. Hopefully that will be an acceptable compromise, with
> > > sufficient documentation, changelog, and so forth... If you agree I'd
> > > appreciate your Ack on the next posting.
> >
> > Just checking on the sysfs portion of this.
> 
> Yes, the current plan is to add a sysfs node (likely sysctl) with the
> 15-second failsafe.

Whew!!!  Very good, then!

> > After all, tuning kernel
> > boot parameters to specific systems is not all that much fun, especially
> > when you have lots of systems.
> 
> Are you suggesting to drop "end the boot" sysfs and just have a
> minimum-time approach as I initially did?

Not at all.  I was just concerned that the sysfs was getting lost in
the shuffle.

All good now, thank you!

							Thanx, Paul
  

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2429b5e3184b..0943139fdf01 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5085,6 +5085,10 @@ 
 	rcutorture.verbose= [KNL]
 			Enable additional printk() statements.
 
+	rcupdate.boot_end_delay= [KNL]
+			Minimum time that must elapse before the boot
+			sequence can be marked as completed.
+
 	rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
 			Dump ftrace buffer after reporting RCU CPU
 			stall warning.
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 9071182b1284..0d77006a948f 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -217,6 +217,18 @@  config RCU_BOOST_DELAY
 
 	  Accept the default if unsure.
 
+config RCU_BOOT_END_DELAY
+	int "Minimum time before RCU may consider in-kernel boot as completed"
+	range 0 120000
+	default 20000
+	help
+	  This option specifies the minimum time in milliseconds from boot start
+	  before RCU believes boot has completed. The actual delay can be
+	  higher than this if the kernel takes a long time to initialize
+	  but it will never be smaller than this.
+
+	  Accept the default if unsure.
+
 config RCU_EXP_KTHREAD
 	bool "Perform RCU expedited work in a real-time kthread"
 	depends on RCU_BOOST && RCU_EXPERT
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..cbdad7b46841 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -62,6 +62,10 @@  module_param(rcu_normal_after_boot, int, 0444);
 #endif
 #endif /* #ifndef CONFIG_TINY_RCU */
 
+/* Minimum time in ms until RCU can consider in-kernel boot as completed. */
+static int boot_end_delay = CONFIG_RCU_BOOT_END_DELAY;
+module_param(boot_end_delay, int, 0444);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 /**
  * rcu_read_lock_held_common() - might we be in RCU-sched read-side critical section?
@@ -225,12 +229,29 @@  void rcu_unexpedite_gp(void)
 EXPORT_SYMBOL_GPL(rcu_unexpedite_gp);
 
 static bool rcu_boot_ended __read_mostly;
-
 /*
- * Inform RCU of the end of the in-kernel boot sequence.
+ * Inform RCU of the end of the in-kernel boot sequence. The boot sequence will
+ * not be marked ended until at least boot_end_delay milliseconds have passed.
  */
+void rcu_end_inkernel_boot(void);
+static void boot_rcu_work_fn(struct work_struct *work)
+{
+	rcu_end_inkernel_boot();
+}
+static DECLARE_DELAYED_WORK(boot_rcu_work, boot_rcu_work_fn);
+
 void rcu_end_inkernel_boot(void)
 {
+	if (boot_end_delay) {
+		u64 boot_ms = ktime_get_boot_fast_ns() / 1000000UL;
+
+		if (boot_ms < boot_end_delay) {
+			schedule_delayed_work(&boot_rcu_work,
+					boot_end_delay - boot_ms);
+			return;
+		}
+	}
+
 	rcu_unexpedite_gp();
 	rcu_async_relax();
 	if (rcu_normal_after_boot)