init/main: Clear boot task idle flag

Message ID 20230913005647.1534747-1-Liam.Howlett@oracle.com
State New
Headers
Series init/main: Clear boot task idle flag |

Commit Message

Liam R. Howlett Sept. 13, 2023, 12:56 a.m. UTC
  Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle().  Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set.  Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed.  Recent changes
have caused just this scenario and IRQs have been enabled early.

This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.

Fix this issue by clearing the PF_IDLE flag on return from sched_init()
and restore the flag in rest_init().  Although the boot task was marked
as idle since (at least) d80e4fda576d, I am not sure that it is wrong to
do so.  The forced context-switch on idle task was introduced in the
tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee.

Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Cc: stable@vger.kernel.org
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 init/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
  

Comments

Peter Zijlstra Sept. 14, 2023, 7:13 a.m. UTC | #1
On Wed, Sep 13, 2023 at 01:32:38PM -0400, Liam R. Howlett wrote:
> * Peter Zijlstra <peterz@infradead.org> [230913 12:13]:
> > On Wed, Sep 13, 2023 at 10:51:25AM -0400, Liam R. Howlett wrote:
> > > * Peter Zijlstra <peterz@infradead.org> [230913 09:53]:
> > > > On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:
> > > > 
> > > > > diff --git a/init/main.c b/init/main.c
> > > > > index ad920fac325c..f74772acf612 100644
> > > > > --- a/init/main.c
> > > > > +++ b/init/main.c
> > > > > @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void)
> > > > >  	 */
> > > > >  	rcu_read_lock();
> > > > >  	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
> > > > > -	tsk->flags |= PF_NO_SETAFFINITY;
> > > > > +	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
> > > > >  	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
> > > > >  	rcu_read_unlock();
> > > > >  
> > > > 
> > > > Hmm, isn't that pid-1 you're setting PF_IDLE on?
> > > 
> > > Yes, thanks.  I think that is what Geert is hitting with my patch.
> > > 
> > > debug __might_resched() in kernel/sched/core.c is failing to return in
> > > that first (complex) if statement.  His report says pid 1 so this is
> > > likely the issue.
> > > 
> > > > 
> > > > The task becoming idle is 'current' at this point, see the
> > > > cpu_startup_entry() call below.
> > > > 
> > > > Would not something like so be the right thing?
> > > > 
> > > > 
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 2299a5cfbfb9..802551e0009b 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
> > > >  	 * PF_KTHREAD should already be set at this point; regardless, make it
> > > >  	 * look like a proper per-CPU kthread.
> > > >  	 */
> > > > -	idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
> > > > +	idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
> > > 
> > > I am concerned this will alter more than just the current task, which
> > > would mean more modifications later.  There is a comment about it being
> > > called 'more than once' and 'per cpu' so I am hesitant to change the
> > > function itself.
> > > 
> > > Although I am unsure of the call path.. fork_idle() -> init_idle() I
> > > guess?
> > 
> > There's only 2 ways to get into do_idle(), through cpu_startup_entry()
> > and play_idle_precise(). The latter already frobs PF_IDLE since it is
> > the forced idle path, this then leaves cpu_startup_entry() which is the
> > regular idle path.
> > 
> > All idle threads will end up calling into it, the boot CPU through the
> > rest_init() and the SMP cpus through arch SMP bringup.
> > 
> > IOW, this ensures all idle loops will have PF_IDLE set but not the
> > pre-idle loop setup code these threads run.
> 
> Thanks for the information.  This does leave the init_idle() function in
> the odd state of not setting PF_IDLE, but I guess that's okay?

Yep, the few things that care about PF_IDLE seem to really only care
about do_idle() and very much not (per the rcutiny thing) any code that
comes before it.
  
kernel test robot Sept. 14, 2023, 4:05 p.m. UTC | #2
Hello,

kernel test robot noticed "BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c" on:

commit: afa19b70494c22ba27224f90c0805e88e5db41e1 ("[PATCH] init/main: Clear boot task idle flag")
url: https://github.com/intel-lab-lkp/linux/commits/Liam-R-Howlett/init-main-Clear-boot-task-idle-flag/20230913-085921
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3669558bdf354cd352be955ef2764cde6a9bf5ec
patch link: https://lore.kernel.org/all/20230913005647.1534747-1-Liam.Howlett@oracle.com/
patch subject: [PATCH] init/main: Clear boot task idle flag

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


besides BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c,
we also observed other issues on this commit but clean on parent:


3669558bdf354cd3 afa19b70494c22ba27224f90c08
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :6           17%           1:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_drivers/dma-buf/selftest.c
           :6           17%           1:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_drivers/net/ethernet/intel/e1000_hw.c
           :6          100%           6:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_include/linux/sched/mm.h
           :6           83%           5:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/mutex.c
           :6           50%           3:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/rwsem.c
           :6           17%           1:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/rcu/tree.c
           :6          100%           6:6     dmesg.BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c
           :6          100%           6:6     dmesg.RIP:__do_softirq
           :6          100%           6:6     dmesg.RIP:__flush_smp_call_function_queue
           :6           33%           2:6     dmesg.RIP:__handle_irq_event_percpu
           :6          100%           6:6     dmesg.RIP:__hrtimer_run_queues
           :6          100%           6:6     dmesg.RIP:__note_gp_changes
           :6          100%           6:6     dmesg.RIP:__queue_work
           :6          100%           6:6     dmesg.RIP:__raise_softirq_irqoff
           :6          100%           6:6     dmesg.RIP:__run_timers
           :6          100%           6:6     dmesg.RIP:__smp_call_single_queue
           :6          100%           6:6     dmesg.RIP:__sysvec_apic_timer_interrupt
           :6          100%           6:6     dmesg.RIP:__sysvec_call_function_single
           :6           50%           3:6     dmesg.RIP:__update_load_avg_blocked_se
           :6          100%           6:6     dmesg.RIP:__update_load_avg_cfs_rq
           :6          100%           6:6     dmesg.RIP:__update_load_avg_se
           :6          100%           6:6     dmesg.RIP:__wake_nocb_gp
           :6           67%           4:6     dmesg.RIP:_raw_spin_unlock_irqrestore
           :6          100%           6:6     dmesg.RIP:arch_ftrace_ops_list_func
           :6          100%           6:6     dmesg.RIP:attach_entity_load_avg
           :6          100%           6:6     dmesg.RIP:call_timer_fn
           :6          100%           6:6     dmesg.RIP:default_idle
           :6          100%           6:6     dmesg.RIP:dequeue_task_fair
           :6          100%           6:6     dmesg.RIP:detach_entity_load_avg
           :6           33%           2:6     dmesg.RIP:detach_if_pending
           :6           67%           4:6     dmesg.RIP:dma_fence_signal_timestamp_locked
           :6           33%           2:6     dmesg.RIP:do_nocb_deferred_wakeup
           :6          100%           6:6     dmesg.RIP:do_nocb_deferred_wakeup_timer
           :6          100%           6:6     dmesg.RIP:enqueue_hrtimer
           :6          100%           6:6     dmesg.RIP:enqueue_task_fair
           :6          100%           6:6     dmesg.RIP:enqueue_timer
           :6           17%           1:6     dmesg.RIP:hrtimer_init
           :6           67%           4:6     dmesg.RIP:load_balance
           :6          100%           6:6     dmesg.RIP:lock_acquire
           :6          100%           6:6     dmesg.RIP:lock_acquired
           :6          100%           6:6     dmesg.RIP:lock_contended
           :6          100%           6:6     dmesg.RIP:lock_release
           :6           33%           2:6     dmesg.RIP:printk_sprint
           :6           17%           1:6     dmesg.RIP:rcu_barrier_trace
           :6          100%           6:6     dmesg.RIP:rcu_core
           :6          100%           6:6     dmesg.RIP:rcu_is_cpu_rrupt_from_idle
           :6           83%           5:6     dmesg.RIP:rcu_qs
           :6          100%           6:6     dmesg.RIP:rcu_report_qs_rnp
           :6          100%           6:6     dmesg.RIP:rcu_sched_clock_irq
           :6           50%           3:6     dmesg.RIP:resched_curr
           :6          100%           6:6     dmesg.RIP:set_task_cpu
           :6          100%           6:6     dmesg.RIP:trace_hardirqs_off
           :6          100%           6:6     dmesg.RIP:trace_hardirqs_off_finish
           :6          100%           6:6     dmesg.RIP:trace_hardirqs_on
           :6          100%           6:6     dmesg.RIP:trace_reschedule_entry
           :6          100%           6:6     dmesg.RIP:trace_reschedule_exit
           :6          100%           6:6     dmesg.RIP:try_to_wake_up
           :6          100%           6:6     dmesg.RIP:ttwu_do_activate
           :6          100%           6:6     dmesg.RIP:update_cpu_capacity
           :6          100%           6:6     dmesg.RIP:update_curr
           :6          100%           6:6     dmesg.RIP:update_dl_rq_load_avg
           :6          100%           6:6     dmesg.RIP:update_irq_load_avg
           :6          100%           6:6     dmesg.RIP:update_rt_rq_load_avg
           :6          100%           6:6     dmesg.RIP:update_sd_lb_stats
           :6           17%           1:6     dmesg.RIP:wake_up_nohz_cpu
           :6          100%           6:6     dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#__sysvec_apic_timer_interrupt
           :6          100%           6:6     dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#__sysvec_call_function_single
           :6          100%           6:6     dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#trace_reschedule_entry
           :6          100%           6:6     dmesg.WARNING:at_arch/x86/include/asm/trace/irq_vectors.h:#trace_reschedule_exit
           :6          100%           6:6     dmesg.WARNING:at_include/linux/trace_recursion.h:#arch_ftrace_ops_list_func
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/csd.h:#__flush_smp_call_function_queue
           :6           67%           4:6     dmesg.WARNING:at_include/trace/events/dma_fence.h:#dma_fence_signal_timestamp_locked
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/ipi.h:#__smp_call_single_queue
           :6           50%           3:6     dmesg.WARNING:at_include/trace/events/ipi.h:#resched_curr
           :6           17%           1:6     dmesg.WARNING:at_include/trace/events/ipi.h:#wake_up_nohz_cpu
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/irq.h:#__do_softirq
           :6           33%           2:6     dmesg.WARNING:at_include/trace/events/irq.h:#__handle_irq_event_percpu
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/irq.h:#__raise_softirq_irqoff
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/lock.h:#lock_acquire
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/lock.h:#lock_acquired
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/lock.h:#lock_contended
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/lock.h:#lock_release
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/preemptirq.h:#trace_hardirqs_off
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/preemptirq.h:#trace_hardirqs_off_finish
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/preemptirq.h:#trace_hardirqs_on
           :6           33%           2:6     dmesg.WARNING:at_include/trace/events/printk.h:#printk_sprint
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#__note_gp_changes
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#__wake_nocb_gp
           :6           33%           2:6     dmesg.WARNING:at_include/trace/events/rcu.h:#do_nocb_deferred_wakeup
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#do_nocb_deferred_wakeup_timer
           :6           17%           1:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_barrier_trace
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_core
           :6           83%           5:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_qs
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_report_qs_rnp
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/rcu.h:#rcu_sched_clock_irq
           :6           50%           3:6     dmesg.WARNING:at_include/trace/events/sched.h:#__update_load_avg_blocked_se
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#__update_load_avg_cfs_rq
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#__update_load_avg_se
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#attach_entity_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#dequeue_task_fair
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#detach_entity_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#enqueue_task_fair
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#set_task_cpu
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#try_to_wake_up
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#ttwu_do_activate
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_cpu_capacity
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_curr
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_dl_rq_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_irq_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_rt_rq_load_avg
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/sched.h:#update_sd_lb_stats
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#__hrtimer_run_queues
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#__run_timers
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#call_timer_fn
           :6           33%           2:6     dmesg.WARNING:at_include/trace/events/timer.h:#detach_if_pending
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#enqueue_hrtimer
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/timer.h:#enqueue_timer
           :6           17%           1:6     dmesg.WARNING:at_include/trace/events/timer.h:#hrtimer_init
           :6          100%           6:6     dmesg.WARNING:at_include/trace/events/workqueue.h:#__queue_work
           :6          100%           6:6     dmesg.WARNING:at_kernel/rcu/tree.c:#rcu_is_cpu_rrupt_from_idle
           :6           17%           1:6     dmesg.WARNING:inconsistent_lock_state
           :6          100%           6:6     dmesg.WARNING:suspicious_RCU_usage
           :6           33%           2:6     dmesg.arch/x86/kernel/irq.c:#IRQ_failed_to_wake_up_RCU
           :6          100%           6:6     dmesg.include/linux/cgroup.h:#suspicious_rcu_dereference_check()usage
           :6          100%           6:6     dmesg.include/linux/rcupdate.h:#rcu_read_lock()used_illegally_while_idle
           :6          100%           6:6     dmesg.include/linux/rcupdate.h:#rcu_read_unlock()used_illegally_while_idle
           :6           17%           1:6     dmesg.inconsistent{IN-HARDIRQ-W}->{HARDIRQ-ON-W}usage
           :6          100%           6:6     dmesg.kernel/events/core.c:#suspicious_rcu_dereference_check()usage
           :6           83%           5:6     dmesg.kernel/exit.c:#suspicious_rcu_dereference_check()usage
           :6          100%           6:6     dmesg.kernel/rcu/tree.c:#RCU_dynticks_nmi_nesting_counter_underflow/zero
           :6           67%           4:6     dmesg.kernel/sched/core.c:#suspicious_rcu_dereference_check()usage
           :6          100%           6:6     dmesg.kernel/sched/fair.c:#suspicious_rcu_dereference_check()usage
           :6           83%           5:6     dmesg.kernel/workqueue.c:#RCU_or_wq_pool_mutex_should_be_held
           :6           83%           5:6     dmesg.kernel/workqueue.c:#suspicious_rcu_dereference_check()usage



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202309142334.a345320b-oliver.sang@intel.com


[    2.318233][    T1] BUG: sleeping function called from invalid context at kernel/sched/completion.c:111
[    2.318647][    T1] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
[    2.319725][    T1] preempt_count: 0, expected: 0
[    2.320293][    T1] no locks held by swapper/0/1.
[    2.320882][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc1-00034-gafa19b70494c #1
[    2.321957][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[    2.321957][    T1] Call Trace:
[    2.321957][    T1]  <TASK>
[ 2.321957][ T1] dump_stack_lvl (lib/dump_stack.c:107) 
[ 2.321957][ T1] __might_resched (kernel/sched/core.c:10188) 
[ 2.321957][ T1] ? sched_core_find (kernel/sched/core.c:10142) 
[ 2.321957][ T1] __wait_for_common (kernel/sched/completion.c:115) 
[ 2.321957][ T1] ? usleep_range_state (kernel/time/timer.c:2129) 
[ 2.321957][ T1] ? reacquire_held_locks (kernel/locking/lockdep.c:5404) 
[ 2.321957][ T1] ? wait_for_completion_io_timeout (kernel/sched/completion.c:110) 
[ 2.321957][ T1] ? do_raw_spin_lock (kernel/locking/spinlock_debug.c:121) 
[ 2.321957][ T1] ? rest_init (init/main.c:1431) 
[ 2.321957][ T1] kernel_init (init/main.c:1439) 
[ 2.321957][ T1] ? _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202) 
[ 2.321957][ T1] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 2.321957][ T1] ? rest_init (init/main.c:1431) 
[ 2.321957][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) 
[    2.321957][    T1]  </TASK>
[    2.322276][    T1] smpboot: CPU0: Intel Xeon E312xx (Sandy Bridge) (family: 0x6, model: 0x2a, stepping: 0x1)
[    2.327212][    T1] RCU Tasks: Setting shift to 1 and lim to 1 rcu_task_cb_adjust=1.
[    2.328461][    T1] RCU Tasks Rude: Setting shift to 1 and lim to 1 rcu_task_cb_adjust=1.
[    2.330045][    T1] RCU Tasks Trace: Setting shift to 1 and lim to 1 rcu_task_cb_adjust=1.
[    2.331385][    T1] Running RCU-tasks wait API self tests
[    2.333117][    T1] Performance Events: unsupported p6 CPU model 42 no PMU driver, software events only.
[    2.334654][    T1] signal: max sigframe size: 1776
[    2.335915][    T1] rcu: Hierarchical SRCU implementation.
[    2.336653][    T1] rcu: 	Max phase no-delay instances is 1000.
[    2.342550][   T14] Callback from call_rcu_tasks_trace() invoked.
[    2.344299][    T1] smp: Bringing up secondary CPUs ...
[    2.347251][    T1] smpboot: x86: Booting SMP configuration:
[    2.347999][    T1] .... node  #0, CPUs:      #1
[    0.487313][    T0] masked ExtINT on CPU#1
[    2.348624][    T0] ------------[ cut here ]------------
[ 2.348624][ T0] WARNING: CPU: 0 PID: 0 at include/trace/events/preemptirq.h:36 trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[    2.348624][    T0] Modules linked in:
[    2.348624][    T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W          6.6.0-rc1-00034-gafa19b70494c #1
[    2.348624][    T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 2.348624][ T0] RIP: 0010:trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[ 2.348624][ T0] Code: 03 80 3c 02 00 75 2a 48 8b 7b 08 31 d2 48 89 ee e8 99 f4 ff ff 65 ff 0d 0a 69 ce 7e e9 5d ff ff ff e8 38 18 e9 ff 84 c0 75 80 <0f> 0b e9 79 ff ff ff e8 c8 c0 3a 00 eb cf 66 0f 1f 44 00 00 65 8b
All code
========
   0:	03 80 3c 02 00 75    	add    0x7500023c(%rax),%eax
   6:	2a 48 8b             	sub    -0x75(%rax),%cl
   9:	7b 08                	jnp    0x13
   b:	31 d2                	xor    %edx,%edx
   d:	48 89 ee             	mov    %rbp,%rsi
  10:	e8 99 f4 ff ff       	call   0xfffffffffffff4ae
  15:	65 ff 0d 0a 69 ce 7e 	decl   %gs:0x7ece690a(%rip)        # 0x7ece6926
  1c:	e9 5d ff ff ff       	jmp    0xffffffffffffff7e
  21:	e8 38 18 e9 ff       	call   0xffffffffffe9185e
  26:	84 c0                	test   %al,%al
  28:	75 80                	jne    0xffffffffffffffaa
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	e9 79 ff ff ff       	jmp    0xffffffffffffffaa
  31:	e8 c8 c0 3a 00       	call   0x3ac0fe
  36:	eb cf                	jmp    0x7
  38:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  3e:	65                   	gs
  3f:	8b                   	.byte 0x8b

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	e9 79 ff ff ff       	jmp    0xffffffffffffff80
   7:	e8 c8 c0 3a 00       	call   0x3ac0d4
   c:	eb cf                	jmp    0xffffffffffffffdd
   e:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  14:	65                   	gs
  15:	8b                   	.byte 0x8b
[    2.348624][    T0] RSP: 0000:ffffffff84a07d60 EFLAGS: 00010046
[    2.348624][    T0] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81513ac2
[    2.348624][    T0] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffff85a79540
[    2.348624][    T0] RBP: ffffffff8352c418 R08: 0000000000000000 R09: fffffbfff0b4f2a8
[    2.348624][    T0] R10: ffffffff85a79547 R11: 0000000000000000 R12: 0000000000000000
[    2.348624][    T0] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    2.348624][    T0] FS:  0000000000000000(0000) GS:ffff8883aea00000(0000) knlGS:0000000000000000
[    2.348624][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.348624][    T0] CR2: ffff88843ffff000 CR3: 0000000004a89000 CR4: 00000000000406f0
[    2.348624][    T0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.348624][    T0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.348624][    T0] Call Trace:
[    2.348624][    T0]  <TASK>
[ 2.348624][ T0] ? __warn (kernel/panic.c:673) 
[ 2.348624][ T0] ? trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[ 2.348624][ T0] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 2.348624][ T0] ? handle_bug (arch/x86/kernel/traps.c:237) 
[ 2.348624][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) 
[ 2.348624][ T0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) 
[ 2.348624][ T0] ? irqentry_enter (kernel/entry/common.c:315) 
[ 2.348624][ T0] ? trace_hardirqs_off_finish (arch/x86/include/asm/bitops.h:228 arch/x86/include/asm/bitops.h:240 include/asm-generic/bitops/instrumented-non-atomic.h:142 include/linux/cpumask.h:504 include/linux/cpumask.h:1082 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[ 2.348624][ T0] ? trace_hardirqs_off_finish (include/trace/events/preemptirq.h:36 include/trace/events/preemptirq.h:36 kernel/trace/trace_preemptirq.c:78 kernel/trace/trace_preemptirq.c:73) 
[ 2.348624][ T0] irqentry_enter (kernel/entry/common.c:315) 
[ 2.348624][ T0] sysvec_call_function_single (arch/x86/kernel/smp.c:287) 
[ 2.348624][ T0] asm_sysvec_call_function_single (arch/x86/include/asm/idtentry.h:652) 
[ 2.348624][ T0] RIP: 0010:default_idle (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:72 arch/x86/kernel/process.c:750) 
[ 2.348624][ T0] Code: 07 76 e7 48 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc eb 07 0f 00 2d 37 79 30 00 fb f4 <fa> c3 0f 1f 00 65 48 8b 04 25 40 a4 1f 00 f0 80 48 02 20 48 8b 10
All code
========
   0:	07                   	(bad)
   1:	76 e7                	jbe    0xffffffffffffffea
   3:	48 89 07             	mov    %rax,(%rdi)
   6:	49 c7 c0 08 00 00 00 	mov    $0x8,%r8
   d:	4d 29 c8             	sub    %r9,%r8
  10:	4c 01 c7             	add    %r8,%rdi
  13:	4c 29 c2             	sub    %r8,%rdx
  16:	e9 72 ff ff ff       	jmp    0xffffffffffffff8d
  1b:	cc                   	int3
  1c:	cc                   	int3
  1d:	cc                   	int3
  1e:	cc                   	int3
  1f:	eb 07                	jmp    0x28
  21:	0f 00 2d 37 79 30 00 	verw   0x307937(%rip)        # 0x30795f
  28:	fb                   	sti
  29:	f4                   	hlt
  2a:*	fa                   	cli		<-- trapping instruction
  2b:	c3                   	ret
  2c:	0f 1f 00             	nopl   (%rax)
  2f:	65 48 8b 04 25 40 a4 	mov    %gs:0x1fa440,%rax
  36:	1f 00 
  38:	f0 80 48 02 20       	lock orb $0x20,0x2(%rax)
  3d:	48 8b 10             	mov    (%rax),%rdx

Code starting with the faulting instruction
===========================================
   0:	fa                   	cli
   1:	c3                   	ret
   2:	0f 1f 00             	nopl   (%rax)
   5:	65 48 8b 04 25 40 a4 	mov    %gs:0x1fa440,%rax
   c:	1f 00 
   e:	f0 80 48 02 20       	lock orb $0x20,0x2(%rax)
  13:	48 8b 10             	mov    (%rax),%rdx
[    2.348624][    T0] RSP: 0000:ffffffff84a07e40 EFLAGS: 00000206
[    2.348624][    T0] RAX: 0000000000003056 RBX: 0000000000000000 RCX: ffffffff8352c7ba
[    2.348624][    T0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812e3bd3
[    2.348624][    T0] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1075d7ed9d
[    2.348624][    T0] R10: ffff8883aebf6ceb R11: 0000000000000000 R12: 1ffffffff0940fca
[    2.348624][    T0] R13: ffffffff84a1e940 R14: dffffc0000000000 R15: 0000000000013f10
[ 2.348624][ T0] ? ct_kernel_exit+0xca/0x100 
[ 2.348624][ T0] ? cpuidle_idle_call (kernel/sched/idle.c:171) 
[ 2.348624][ T0] default_idle_call (include/linux/cpuidle.h:143 kernel/sched/idle.c:98) 
[ 2.348624][ T0] cpuidle_idle_call (kernel/sched/idle.c:171) 
[ 2.348624][ T0] ? arch_cpu_idle_exit+0x30/0x30 
[ 2.348624][ T0] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4565) 
[ 2.348624][ T0] do_idle (kernel/sched/idle.c:282) 
[ 2.348624][ T0] cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1)) 
[ 2.348624][ T0] rest_init (include/linux/rcupdate.h:750 init/main.c:697) 
[ 2.348624][ T0] arch_call_rest_init+0xf/0x20 
[ 2.348624][ T0] start_kernel (init/main.c:1021 (discriminator 1)) 
[ 2.348624][ T0] x86_64_start_reservations (arch/x86/kernel/head64.c:544) 
[ 2.348624][ T0] x86_64_start_kernel (??:?) 
[ 2.348624][ T0] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433) 
[    2.348624][    T0]  </TASK>
[    2.348624][    T0] irq event stamp: 12375
[ 2.348624][ T0] hardirqs last enabled at (12374): cpuidle_idle_call (kernel/sched/idle.c:171) 
[ 2.348624][ T0] hardirqs last disabled at (12375): sysvec_call_function_single (arch/x86/kernel/smp.c:287) 
[ 2.348624][ T0] softirqs last enabled at (8563): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) 
[ 2.348624][ T0] softirqs last disabled at (8558): __irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632) 
[    2.348624][    T0] ---[ end trace 0000000000000000 ]---


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230914/202309142334.a345320b-oliver.sang@intel.com
  

Patch

diff --git a/init/main.c b/init/main.c
index ad920fac325c..f74772acf612 100644
--- a/init/main.c
+++ b/init/main.c
@@ -696,7 +696,7 @@  noinline void __ref __noreturn rest_init(void)
 	 */
 	rcu_read_lock();
 	tsk = find_task_by_pid_ns(pid, &init_pid_ns);
-	tsk->flags |= PF_NO_SETAFFINITY;
+	tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE;
 	set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
 	rcu_read_unlock();
 
@@ -938,6 +938,8 @@  void start_kernel(void)
 	 * time - but meanwhile we still have a functioning scheduler.
 	 */
 	sched_init();
+	/* Avoid early context switch, rest_init() restores PF_IDLE */
+	current->flags &= ~PF_IDLE;
 
 	if (WARN(!irqs_disabled(),
 		 "Interrupts were enabled *very* early, fixing it\n"))