[RFC] workqueue: Fix kernel panic on CPU hot-unplug

Message ID: ZbqfMR_mVLaSCj4Q@carbonx1
State: New
Series: [RFC] workqueue: Fix kernel panic on CPU hot-unplug

Commit Message

Helge Deller Jan. 31, 2024, 7:27 p.m. UTC
  When hot-unplugging a 32-bit CPU on the parisc platform with
"chcpu -d 1", I get the following kernel panic. Adding a check
for !pwq prevents the panic.

 Kernel Fault: Code=26 (Data memory access rights trap) at addr 00000000
 CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.8.0-rc1-32bit+ #1291
 Hardware name: 9000/778/B160L
 
 IASQ: 00000000 00000000 IAOQ: 10446db4 10446db8
  IIR: 0f80109c    ISR: 00000000  IOR: 00000000
  CPU:        1   CR30: 11dd1710 CR31: 00000000
  IAOQ[0]: wq_update_pod+0x98/0x14c
  IAOQ[1]: wq_update_pod+0x9c/0x14c
  RP(r2): wq_update_pod+0x80/0x14c
 Backtrace:
  [<10448744>] workqueue_offline_cpu+0x1d4/0x1dc
  [<10429db4>] cpuhp_invoke_callback+0xf8/0x200
  [<1042a1d0>] cpuhp_thread_fun+0xb8/0x164
  [<10452970>] smpboot_thread_fn+0x284/0x288
  [<1044d8f4>] kthread+0x12c/0x13c
  [<1040201c>] ret_from_kernel_thread+0x1c/0x24
 Kernel panic - not syncing: Kernel Fault

Signed-off-by: Helge Deller <deller@gmx.de>

---
  

Comments

Tejun Heo Jan. 31, 2024, 10:28 p.m. UTC | #1
Hello,

On Wed, Jan 31, 2024 at 08:27:45PM +0100, Helge Deller wrote:
> When hot-unplugging a 32-bit CPU on the parisc platform with
> "chcpu -d 1", I get the following kernel panic. Adding a check
> for !pwq prevents the panic.
> 
>  Kernel Fault: Code=26 (Data memory access rights trap) at addr 00000000
>  CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.8.0-rc1-32bit+ #1291
>  Hardware name: 9000/778/B160L
>  
>  IASQ: 00000000 00000000 IAOQ: 10446db4 10446db8
>   IIR: 0f80109c    ISR: 00000000  IOR: 00000000
>   CPU:        1   CR30: 11dd1710 CR31: 00000000
>   IAOQ[0]: wq_update_pod+0x98/0x14c
>   IAOQ[1]: wq_update_pod+0x9c/0x14c
>   RP(r2): wq_update_pod+0x80/0x14c
>  Backtrace:
>   [<10448744>] workqueue_offline_cpu+0x1d4/0x1dc
>   [<10429db4>] cpuhp_invoke_callback+0xf8/0x200
>   [<1042a1d0>] cpuhp_thread_fun+0xb8/0x164
>   [<10452970>] smpboot_thread_fn+0x284/0x288
>   [<1044d8f4>] kthread+0x12c/0x13c
>   [<1040201c>] ret_from_kernel_thread+0x1c/0x24
>  Kernel panic - not syncing: Kernel Fault
> 
> Signed-off-by: Helge Deller <deller@gmx.de>
> 
> ---
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 76e60faed892..dfeee7b7322c 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4521,6 +4521,8 @@ static void wq_update_pod(struct workqueue_struct *wq, int cpu,
>  	wq_calc_pod_cpumask(target_attrs, cpu, off_cpu);
>  	pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu),
>  					lockdep_is_held(&wq_pool_mutex));
> +	if (!pwq)
> +		return;

Hmm... I have a hard time imagining a scenario where some CPUs don't have
pwq installed on wq->cpu_pwq. Can you please run `drgn
tools/workqueue/wq_dump.py` before triggering the hotplug event and paste
the output along with full dmesg?

Thanks.
  
Helge Deller Feb. 1, 2024, 4:41 p.m. UTC | #2
On 1/31/24 23:28, Tejun Heo wrote:
> On Wed, Jan 31, 2024 at 08:27:45PM +0100, Helge Deller wrote:
>> When hot-unplugging a 32-bit CPU on the parisc platform with
>> "chcpu -d 1", I get the following kernel panic. Adding a check
>> for !pwq prevents the panic.
>>
>>   Kernel Fault: Code=26 (Data memory access rights trap) at addr 00000000
>>   CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.8.0-rc1-32bit+ #1291
>>   Hardware name: 9000/778/B160L
>>
>>   IASQ: 00000000 00000000 IAOQ: 10446db4 10446db8
>>    IIR: 0f80109c    ISR: 00000000  IOR: 00000000
>>    CPU:        1   CR30: 11dd1710 CR31: 00000000
>>    IAOQ[0]: wq_update_pod+0x98/0x14c
>>    IAOQ[1]: wq_update_pod+0x9c/0x14c
>>    RP(r2): wq_update_pod+0x80/0x14c
>>   Backtrace:
>>    [<10448744>] workqueue_offline_cpu+0x1d4/0x1dc
>>    [<10429db4>] cpuhp_invoke_callback+0xf8/0x200
>>    [<1042a1d0>] cpuhp_thread_fun+0xb8/0x164
>>    [<10452970>] smpboot_thread_fn+0x284/0x288
>>    [<1044d8f4>] kthread+0x12c/0x13c
>>    [<1040201c>] ret_from_kernel_thread+0x1c/0x24
>>   Kernel panic - not syncing: Kernel Fault
>>
>> Signed-off-by: Helge Deller <deller@gmx.de>
>>
>> ---
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 76e60faed892..dfeee7b7322c 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -4521,6 +4521,8 @@ static void wq_update_pod(struct workqueue_struct *wq, int cpu,
>>   	wq_calc_pod_cpumask(target_attrs, cpu, off_cpu);
>>   	pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu),
>>   					lockdep_is_held(&wq_pool_mutex));
>> +	if (!pwq)
>> +		return;
>
> Hmm... I have a hard time imagining a scenario where some CPUs don't have
> pwq installed on wq->cpu_pwq. Can you please run `drgn
> tools/workqueue/wq_dump.py` before triggering the hotplug event and paste
> the output along with full dmesg?

I'm not sure if parisc is already fully supported with that tool, or
if I'm doing something wrong:

root@debian:~# uname -a
Linux debian 6.8.0-rc1-32bit+ #1292 SMP PREEMPT Thu Feb  1 11:31:38 CET 2024 parisc GNU/Linux

root@debian:~# drgn --main-symbols -s ./vmlinux ./wq_dump.py
Traceback (most recent call last):
   File "/usr/bin/drgn", line 33, in <module>
     sys.exit(load_entry_point('drgn==0.0.25', 'console_scripts', 'drgn')())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib/python3/dist-packages/drgn/cli.py", line 301, in _main
     runpy.run_path(script, init_globals={"prog": prog}, run_name="__main__")
   File "<frozen runpy>", line 291, in run_path
   File "<frozen runpy>", line 98, in _run_module_code
   File "<frozen runpy>", line 88, in _run_code
   File "./wq_dump.py", line 78, in <module>
     worker_pool_idr         = prog['worker_pool_idr']
                               ~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'worker_pool_idr'

Maybe you have an idea? I'll check further, but otherwise it's probably
easier for me to add some printk() to the kernel function wq_update_pod()
and send that info?

Helge
  
Tejun Heo Feb. 1, 2024, 4:54 p.m. UTC | #3
Hello, Helge.

On Thu, Feb 01, 2024 at 05:41:10PM +0100, Helge Deller wrote:
> > Hmm... I have a hard time imagining a scenario where some CPUs don't have
> > pwq installed on wq->cpu_pwq. Can you please run `drgn
> > tools/workqueue/wq_dump.py` before triggering the hotplug event and paste
> > the output along with full dmesg?
> 
> I'm not sure if parisc is already fully supported with that tool, or
> if I'm doing something wrong:
> 
> root@debian:~# uname -a
> Linux debian 6.8.0-rc1-32bit+ #1292 SMP PREEMPT Thu Feb  1 11:31:38 CET 2024 parisc GNU/Linux
> 
> root@debian:~# drgn --main-symbols -s ./vmlinux ./wq_dump.py
> Traceback (most recent call last):
>   File "/usr/bin/drgn", line 33, in <module>
>     sys.exit(load_entry_point('drgn==0.0.25', 'console_scripts', 'drgn')())
>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/lib/python3/dist-packages/drgn/cli.py", line 301, in _main
>     runpy.run_path(script, init_globals={"prog": prog}, run_name="__main__")
>   File "<frozen runpy>", line 291, in run_path
>   File "<frozen runpy>", line 98, in _run_module_code
>   File "<frozen runpy>", line 88, in _run_code
>   File "./wq_dump.py", line 78, in <module>
>     worker_pool_idr         = prog['worker_pool_idr']
>                               ~~~~^^^^^^^^^^^^^^^^^^^
> KeyError: 'worker_pool_idr'

Does the kernel have CONFIG_DEBUG_INFO enabled? If you can look up
worker_pool_idr in gdb, drgn should be able to do the same.

> Maybe you have an idea? I'll check further, but otherwise it's probably
> easier for me to add some printk() to the kernel function wq_update_pod()
> and send that info?

Can you first try with drgn? The script dumps all the config info, so it's
likely easier to view that way. If that doesn't work out, I can write up a
debug patch.

Thanks.
  
Helge Deller Feb. 1, 2024, 5:56 p.m. UTC | #4
Hi Tejun,

On 2/1/24 17:54, Tejun Heo wrote:
> On Thu, Feb 01, 2024 at 05:41:10PM +0100, Helge Deller wrote:
>>> Hmm... I have a hard time imagining a scenario where some CPUs don't have
>>> pwq installed on wq->cpu_pwq. Can you please run `drgn
>>> tools/workqueue/wq_dump.py` before triggering the hotplug event and paste
>>> the output along with full dmesg?

Enabling CONFIG_DEBUG_INFO=y did the trick :-)


root@debian:~# drgn --main-symbols -s ./vmlinux ./wq_dump.py 2>&1 | tee L
Affinity Scopes
===============
wq_unbound_cpumask=0000ffff

CPU
   nr_pods  16
   pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 [4]=00000010 [5]=00000020 [6]=00000040 [7]=00000080 [8]=00000100 [9]=00000200 [10]=00000400 [11]=00000800 [12]=00001000 [13]=00002000 [14]=00004000 [15]=00008000
   pod_node [0]=0 [1]=0 [2]=0 [3]=0 [4]=0 [5]=0 [6]=0 [7]=0 [8]=0 [9]=0 [10]=0 [11]=0 [12]=0 [13]=0 [14]=0 [15]=0
   cpu_pod  [0]=0 [1]=1

SMT
   nr_pods  16
   pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 [4]=00000010 [5]=00000020 [6]=00000040 [7]=00000080 [8]=00000100 [9]=00000200 [10]=00000400 [11]=00000800 [12]=00001000 [13]=00002000 [14]=00004000 [15]=00008000
   pod_node [0]=0 [1]=0 [2]=0 [3]=0 [4]=0 [5]=0 [6]=0 [7]=0 [8]=0 [9]=0 [10]=0 [11]=0 [12]=0 [13]=0 [14]=0 [15]=0
   cpu_pod  [0]=0 [1]=1

CACHE (default)
   nr_pods  1
   pod_cpus [0]=0000ffff
   pod_node [0]=0
   cpu_pod  [0]=0 [1]=0

NUMA
   nr_pods  1
   pod_cpus [0]=0000ffff
   pod_node [0]=0
   cpu_pod  [0]=0 [1]=0

SYSTEM
   nr_pods  1
   pod_cpus [0]=0000ffff
   pod_node [0]=-1
   cpu_pod  [0]=0 [1]=0

Worker Pools
============
pool[00] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  0
pool[01] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  0
pool[02] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  1
pool[03] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  1
pool[04] ref= 1 nice=  0 idle/workers=  0/  0 cpu=  2
pool[05] ref= 1 nice=-20 idle/workers=  0/  0 cpu=  2
pool[06] ref= 1 nice=  0 idle/workers=  0/  0 cpu=  3
pool[07] ref= 1 nice=-20 idle/workers=  0/  0 cpu=  3
pool[08] ref= 1 nice=  0 idle/workers=  0/  0 cpu=  4
pool[09] ref= 1 nice=-20 idle/workers=  0/  0 cpu=  4
pool[10] ref= 1 nice=  0 idle/workers=  0/  0 cpu=  5
pool[11] ref= 1 nice=-20 idle/workers=  0/  0 cpu=  5
pool[12] ref= 1 nice=  0 idle/workers=  0/  0 cpu=  6
pool[13] ref= 1 nice=-20 idle/workers=  0/  0 cpu=  6
pool[14] ref= 1 nice=  0 idle/workers=  0/  0 cpu=  7
pool[15] ref= 1 nice=-20 idle/workers=  0/  0 cpu=  7
pool[16] ref= 1 nice=  0 idle/workers=  0/  0 cpu=  8
pool[17] ref= 1 nice=-20 idle/workers=  0/  0 cpu=  8
pool[18] ref= 1 nice=  0 idle/workers=  0/  0 cpu=  9
pool[19] ref= 1 nice=-20 idle/workers=  0/  0 cpu=  9
pool[20] ref= 1 nice=  0 idle/workers=  0/  0 cpu= 10
pool[21] ref= 1 nice=-20 idle/workers=  0/  0 cpu= 10
pool[22] ref= 1 nice=  0 idle/workers=  0/  0 cpu= 11
pool[23] ref= 1 nice=-20 idle/workers=  0/  0 cpu= 11
pool[24] ref= 1 nice=  0 idle/workers=  0/  0 cpu= 12
pool[25] ref= 1 nice=-20 idle/workers=  0/  0 cpu= 12
pool[26] ref= 1 nice=  0 idle/workers=  0/  0 cpu= 13
pool[27] ref= 1 nice=-20 idle/workers=  0/  0 cpu= 13
pool[28] ref= 1 nice=  0 idle/workers=  0/  0 cpu= 14
pool[29] ref= 1 nice=-20 idle/workers=  0/  0 cpu= 14
pool[30] ref= 1 nice=  0 idle/workers=  0/  0 cpu= 15
pool[31] ref= 1 nice=-20 idle/workers=  0/  0 cpu= 15
pool[32] ref=28 nice=  0 idle/workers=  8/  8 cpus=0000ffff pod_cpus=0000ffff

Workqueue CPU -> pool
=====================
[    workqueue     \     type   CPU  0  1 dfl]
events                   percpu      0  2
events_highpri           percpu      1  3
events_long              percpu      0  2
events_unbound           unbound    32 32 32
events_freezable         percpu      0  2
events_power_efficient   percpu      0  2
events_freezable_power_  percpu      0  2
rcu_gp                   percpu      0  2
rcu_par_gp               percpu      0  2
slub_flushwq             percpu      0  2
netns                    ordered    32 32 32
mm_percpu_wq             percpu      0  2
inet_frag_wq             percpu      0  2
cgroup_destroy           percpu      0  2
cgroup_pidlist_destroy   percpu      0  2
cgwb_release             percpu      0  2
writeback                unbound    32 32 32
kintegrityd              percpu      1  3
kblockd                  percpu      1  3
blkcg_punt_bio           unbound    32 32 32
ata_sff                  percpu      0  2
usb_hub_wq               percpu      0  2
inode_switch_wbs         percpu      0  2
virtio-blk               percpu      0  2
scsi_tmf_0               ordered    32 32 32
psmouse-smbus            percpu      0  2
kpsmoused                ordered    32 32 32
sock_diag_events         percpu      0  2
kstrp                    ordered    32 32 32
ext4-rsv-conversion      ordered    32 32 32
root@debian:~#
root@debian:~# lscpu
Architecture:          parisc
   Byte Order:          Big Endian
CPU(s):                2
   On-line CPU(s) list: 0,1
Model name:            PA7300LC (PCX-L2)
   CPU family:          PA-RISC 1.1e
   Model:               9000/778/B160L - Merlin L2 160 (9000/778/B160L)
   Thread(s) per core:  1
   Core(s) per socket:  1
   Socket(s):           2
   BogoMIPS:            2446.13
root@debian:~#
root@debian:~# chcpu -d 1
[  261.926353] Backtrace:
[  261.928292]  [<10448744>] workqueue_offline_cpu+0x1d4/0x1dc
[  261.928292]  [<10429db4>] cpuhp_invoke_callback+0xf8/0x200
[  261.928292]  [<1042a1d0>] cpuhp_thread_fun+0xb8/0x164
[  261.928292]  [<10452970>] smpboot_thread_fn+0x284/0x288
[  261.928292]  [<1044d8f4>] kthread+0x12c/0x13c
[  261.928292]  [<1040201c>] ret_from_kernel_thread+0x1c/0x24
[  261.928292]
[  261.928292]
[  261.928292] Kernel Fault: Code=26 (Data memory access rights trap) at addr 00000000
[  261.928292] CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.8.0-rc1-32bit+ #1293
[  261.928292] Hardware name: 9000/778/B160L
[  261.928292]
[  261.928292]      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[  261.928292] PSW: 00000000000001101111111100001111 Not tainted
[  261.928292] r00-03  0006ff0f 11011540 10446d9c 11e00500
[  261.928292] r04-07  11c0b800 00000002 11c0d000 00000001
[  261.928292] r08-11  110194e4 11018f08 00000000 00000004
[  261.928292] r12-15  10c78800 00000612 f0028050 f0027fd8
[  261.928292] r16-19  fffffffc fee01180 f0027ed8 01735000
[  261.928292] r20-23  0000ffff 1249cc00 1249cc00 00000000
[  261.928292] r24-27  11c0c580 11c0d004 11c0d000 10ceb708
[  261.928292] r28-31  00000000 0000000e 11e00580 00000018
[  261.928292] sr00-03  00000000 00000000 00000000 000004be
[  261.928292] sr04-07  00000000 00000000 00000000 00000000
[  261.928292]
[  261.928292] IASQ: 00000000 00000000 IAOQ: 10446db4 10446db8
[  261.928292]  IIR: 0f80109c    ISR: 00000000  IOR: 00000000
[  261.928292]  CPU:        1   CR30: 11dd1710 CR31: 00000000
[  261.928292]  ORIG_R28: 00000612
[  261.928292]  IAOQ[0]: wq_update_pod+0x98/0x14c
[  261.928292]  IAOQ[1]: wq_update_pod+0x9c/0x14c
[  261.928292]  RP(r2): wq_update_pod+0x80/0x14c
[  261.928292] Backtrace:
[  261.928292]  [<10448744>] workqueue_offline_cpu+0x1d4/0x1dc
[  261.928292]  [<10429db4>] cpuhp_invoke_callback+0xf8/0x200
[  261.928292]  [<1042a1d0>] cpuhp_thread_fun+0xb8/0x164
[  261.928292]  [<10452970>] smpboot_thread_fn+0x284/0x288
[  261.928292]  [<1044d8f4>] kthread+0x12c/0x13c
[  261.928292]  [<1040201c>] ret_from_kernel_thread+0x1c/0x24
[  261.928292]
[  261.928292] Kernel panic - not syncing: Kernel Fault
  
Tejun Heo Feb. 2, 2024, 1:39 a.m. UTC | #5
Hello,

On Thu, Feb 01, 2024 at 06:56:20PM +0100, Helge Deller wrote:
> root@debian:~# drgn --main-symbols -s ./vmlinux ./wq_dump.py 2>&1 | tee L
> Affinity Scopes
> ===============
> wq_unbound_cpumask=0000ffff
> 
> CPU
>   nr_pods  16
>   pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 [4]=00000010 [5]=00000020 [6]=00000040 [7]=00000080 [8]=00000100 [9]=00000200 [10]=00000400 [11]=00000800 [12]=00001000 [13]=00002000 [14]=00004000 [15]=00008000
>   pod_node [0]=0 [1]=0 [2]=0 [3]=0 [4]=0 [5]=0 [6]=0 [7]=0 [8]=0 [9]=0 [10]=0 [11]=0 [12]=0 [13]=0 [14]=0 [15]=0
>   cpu_pod  [0]=0 [1]=1

wq_unbound_cpumask is saying there are 16 possible cpus but
for_each_possible_cpu() iteration is only giving two. Can you please apply
the following patch and post the boot dmesg? Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ffb625db9771..d3fa2bea4d75 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7146,6 +7146,9 @@ void __init workqueue_init_early(void)
 	BUG_ON(!alloc_cpumask_var(&wq_requested_unbound_cpumask, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&wq_isolated_cpumask, GFP_KERNEL));
 
+	printk("XXX workqueue_init_early: possible_cpus=%*pb\n",
+	       cpumask_pr_args(cpu_possible_mask));
+
 	cpumask_copy(wq_unbound_cpumask, cpu_possible_mask);
 	restrict_unbound_cpumask("HK_TYPE_WQ", housekeeping_cpumask(HK_TYPE_WQ));
 	restrict_unbound_cpumask("HK_TYPE_DOMAIN", housekeeping_cpumask(HK_TYPE_DOMAIN));
@@ -7290,6 +7293,9 @@ void __init workqueue_init(void)
 	struct worker_pool *pool;
 	int cpu, bkt;
 
+	printk("XXX workqueue_init: possible_cpus=%*pb\n",
+	       cpumask_pr_args(cpu_possible_mask));
+
 	wq_cpu_intensive_thresh_init();
 
 	mutex_lock(&wq_pool_mutex);
@@ -7401,6 +7407,9 @@ void __init workqueue_init_topology(void)
 	struct workqueue_struct *wq;
 	int cpu;
 
+	printk("XXX workqueue_init_topology: possible_cpus=%*pb\n",
+	       cpumask_pr_args(cpu_possible_mask));
+
 	init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
 	init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
 	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
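[Editorial aside: the mask arithmetic being discussed here can be checked with a short standalone sketch. This is illustrative Python, not part of any patch in this thread, and `cpus_in_mask` is a made-up helper name, not a kernel or drgn API.]

```python
# Decode the hex cpumask strings printed by wq_dump.py and the debug
# printk above, to see how many CPUs each mask actually claims.

def cpus_in_mask(hex_mask: str) -> list[int]:
    """Return the CPU ids whose bits are set in a hex cpumask string."""
    value = int(hex_mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if value & (1 << cpu)]

if __name__ == "__main__":
    # wq_unbound_cpumask from the dump: 0000ffff covers CPUs 0-15
    print(len(cpus_in_mask("0000ffff")))   # 16
    # a shrunken mask of 0x3 would explain iterating only CPUs 0 and 1
    print(cpus_in_mask("00000003"))        # [0, 1]
```

This makes the mismatch concrete: the boot-time mask `0000ffff` implies 16 possible CPUs, while a later mask of `0x3` would yield exactly the two CPUs seen in the dump.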
  
Tejun Heo Feb. 2, 2024, 5:29 p.m. UTC | #6
Hello, Helge.

On Fri, Feb 02, 2024 at 09:41:38AM +0100, Helge Deller wrote:
> In a second step I extended your patch to print the present
> and online CPUs too. Below is the relevant dmesg part.
> 
> Note, that on parisc the second CPU will be activated later in the
> boot process, after the kernel has the inventory.
> This I think differs vs x86, where all CPUs are available earlier
> in the boot process.
> ...
> [    0.000000] XXX workqueue_init_early: possible_cpus=ffff  present=0001  online=0001
..
> [    0.228080] XXX workqueue_init: possible_cpus=ffff  present=0001  online=0001
..
> [    0.263466] XXX workqueue_init_topology: possible_cpus=ffff  present=0001  online=0001

So, what's bothersome is that when the wq_dump.py script prints each cpu's
pwq, it's only printing for CPU 0 and 1. The for_each_possible_cpu() drgn
helper reads cpu_possible_mask from the kernel and iterates it, so that
most likely indicates that at some point cpu_possible_mask becomes 0x3
instead of the one used during boot - 0xffff, which is problematic.

Can you please sprinkle more printks to find out whether and when the
cpu_possible_mask changes during boot?

Thanks.
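[Editorial aside: once debug printks like the ones in the patch above are in place, the resulting dmesg can be scanned mechanically for the point where the mask changes. The following is a hypothetical helper, not something posted in this thread; the `XXX ... possible_cpus=` line format is taken from the quoted dmesg output.]

```python
# Scan dmesg text for "XXX <stage>: possible_cpus=<mask>" lines emitted
# by the debug patch, and report each stage where the mask value differs
# from the previously seen one.
import re

def mask_changes(dmesg: str) -> list[tuple[str, str]]:
    """Return (stage, mask) pairs wherever possible_cpus changes."""
    changes = []
    last = None
    for m in re.finditer(r"XXX (\S+): possible_cpus=([0-9a-f,]+)", dmesg):
        stage, mask = m.group(1), m.group(2)
        if mask != last:
            changes.append((stage, mask))
            last = mask
    return changes

if __name__ == "__main__":
    log = """\
[    0.000000] XXX workqueue_init_early: possible_cpus=ffff
[    0.228080] XXX workqueue_init: possible_cpus=ffff
[    0.263466] XXX workqueue_init_topology: possible_cpus=ffff
"""
    # Only the first stage introduces a new mask; a later entry with a
    # different mask would pinpoint where cpu_possible_mask shrank.
    print(mask_changes(log))  # [('workqueue_init_early', 'ffff')]
```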
  

Patch

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 76e60faed892..dfeee7b7322c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4521,6 +4521,8 @@ static void wq_update_pod(struct workqueue_struct *wq, int cpu,
 	wq_calc_pod_cpumask(target_attrs, cpu, off_cpu);
 	pwq = rcu_dereference_protected(*per_cpu_ptr(wq->cpu_pwq, cpu),
 					lockdep_is_held(&wq_pool_mutex));
+	if (!pwq)
+		return;
 	if (wqattrs_equal(target_attrs, pwq->pool->attrs))
 		return;