sched/numa, mm: do not promote folios to nodes not set N_MEMORY

Message ID 20240214035355.18335-1-byungchul@sk.com
State New
Series sched/numa, mm: do not promote folios to nodes not set N_MEMORY

Commit Message

Byungchul Park Feb. 14, 2024, 3:53 a.m. UTC
While running qemu with a configuration where some CPUs don't have their
local memory and with kernel NUMA balancing on, the following oops has
been observed. It is caused by the NULL ->zone_pgdat pointers of zones of
those nodes that are not initialized at boot time. So we should prevent
folios from being promoted to nodes that do not have N_MEMORY set.

> BUG: unable to handle page fault for address: 00000000000033f3
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> Code: (omitted)
> RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  <TASK>
> ? __die
> ? page_fault_oops
> ? __pte_offset_map_lock
> ? exc_page_fault
> ? asm_exc_page_fault
> ? wakeup_kswapd
> migrate_misplaced_page
> __handle_mm_fault
> handle_mm_fault
> do_user_addr_fault
> exc_page_fault
> asm_exc_page_fault
> RIP: 0033:0x55b897ba0808
> Code: (omitted)
> RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
>  </TASK>
> Modules linked in:
> CR2: 00000000000033f3
> ---[ end trace 0000000000000000  ]---
> RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> Code: (omitted)
> RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> note: masim[895] exited with irqs disabled

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reported-by: hyeongtak.ji@sk.com
---
 kernel/sched/fair.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
  

Comments

Phil Auld Feb. 14, 2024, 12:31 p.m. UTC | #1
Hi,

On Wed, Feb 14, 2024 at 12:53:55PM +0900 Byungchul Park wrote:
> While running qemu with a configuration where some CPUs don't have their
> local memory and with a kernel numa balancing on, the following oops has
> been observed. It's because of null pointers of ->zone_pgdat of zones of
> those nodes that are not initialized at booting time. So should avoid
> nodes not set N_MEMORY from getting promoted.
> 
> > [ oops trace snipped; full trace in the commit message above ]

I think you could trim that down a little bit.


> 
> Signed-off-by: Byungchul Park <byungchul@sk.com>
> Reported-by: hyeongtak.ji@sk.com
> ---
>  kernel/sched/fair.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..6d215cc85f14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1828,6 +1828,23 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  	int dst_nid = cpu_to_node(dst_cpu);
>  	int last_cpupid, this_cpupid;
>  
> +	/*
> +	 * A node of dst_nid might not have its local memory. Promoting
> +	 * a folio to the node is meaningless. What's even worse, oops
> +	 * can be observed by the null pointer of ->zone_pgdat in
> +	 * various points of the code during migration.
> +	 *

> +	 * For instance, oops has been observed at CPU2 while qemu'ing:
> +	 *
> +	 * {qemu} \
> +	 *    -numa node,nodeid=0,mem=1G,cpus=0-1 \
> +	 *    -numa node,nodeid=1,cpus=2-3 \
> +	 *    -numa node,nodeid=2,mem=8G \
> +	 *    ...

This part above should probably be in the commit message not in the code.
The first paragraph of comment is plenty.

Otherwise, I think the check probably makes sense.


Cheers,
Phil

> +	 */
> +	if (!node_state(dst_nid, N_MEMORY))
> +		return false;
> +
>  	/*
>  	 * The pages in slow memory node should be migrated according
>  	 * to hot/cold instead of private/shared.
> -- 
> 2.17.1
> 
> 

--
  
Phil Auld Feb. 14, 2024, 8:03 p.m. UTC | #2
On Wed, Feb 14, 2024 at 07:31:37AM -0500 Phil Auld wrote:
> Hi,
> 
> On Wed, Feb 14, 2024 at 12:53:55PM +0900 Byungchul Park wrote:
> > While running qemu with a configuration where some CPUs don't have their
> > local memory and with a kernel numa balancing on, the following oops has
> > been observed. It's because of null pointers of ->zone_pgdat of zones of
> > those nodes that are not initialized at booting time. So should avoid
> > nodes not set N_MEMORY from getting promoted.
> > 
> > > [ oops trace snipped ]
> 
> I think you could trim the down a little bit.
> 
> 
> > 
> > Signed-off-by: Byungchul Park <byungchul@sk.com>
> > Reported-by: hyeongtak.ji@sk.com
> > ---
> >  kernel/sched/fair.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d7a3c63a2171..6d215cc85f14 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1828,6 +1828,23 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> >  	int dst_nid = cpu_to_node(dst_cpu);
> >  	int last_cpupid, this_cpupid;
> >  
> > +	/*
> > +	 * A node of dst_nid might not have its local memory. Promoting
> > +	 * a folio to the node is meaningless. What's even worse, oops
> > +	 * can be observed by the null pointer of ->zone_pgdat in
> > +	 * various points of the code during migration.
> > +	 *
> 
> > +	 * For instance, oops has been observed at CPU2 while qemu'ing:
> > +	 *
> > +	 * {qemu} \
> > +	 *    -numa node,nodeid=0,mem=1G,cpus=0-1 \
> > +	 *    -numa node,nodeid=1,cpus=2-3 \
> > +	 *    -numa node,nodeid=2,mem=8G \
> > +	 *    ...
> 
> This part above should probably be in the commit message not in the code.
> The first paragraph of comment is plenty.
> 
> Otherwise, I think the check probably makes sense.
>

Actually, after looking at the memory.c code, I wonder if this check
shouldn't be made farther up in the numa migrate machinery.


Cheers,
Phil

> 
> Cheers,
> Phil
> 
> > +	 */
> > +	if (!node_state(dst_nid, N_MEMORY))
> > +		return false;
> > +
> >  	/*
> >  	 * The pages in slow memory node should be migrated according
> >  	 * to hot/cold instead of private/shared.
> > -- 
> > 2.17.1
> > 
> > 
> 
> -- 
> 
> 

--
  
Oscar Salvador Feb. 14, 2024, 9:13 p.m. UTC | #3
On Wed, Feb 14, 2024 at 12:53:55PM +0900, Byungchul Park wrote:
> While running qemu with a configuration where some CPUs don't have their
> local memory and with a kernel numa balancing on, the following oops has
> been observed. It's because of null pointers of ->zone_pgdat of zones of
> those nodes that are not initialized at booting time. So should avoid
> nodes not set N_MEMORY from getting promoted.

Looking at free_area_init(), we call free_area_init_node() for each node
found on the system.
And free_area_init_node()->free_area_init_core() inits all zones
belonging to the system via zone_init_internals().
Now, I am not saying the check is wrong, because we obviously do not want
to migrate memory to a memoryless node, but I am confused as to where
we are crashing.
  
Byungchul Park Feb. 16, 2024, 5:26 a.m. UTC | #4
On Wed, Feb 14, 2024 at 07:31:37AM -0500, Phil Auld wrote:
> Hi,
> 
> On Wed, Feb 14, 2024 at 12:53:55PM +0900 Byungchul Park wrote:
> > While running qemu with a configuration where some CPUs don't have their
> > local memory and with a kernel numa balancing on, the following oops has
> > been observed. It's because of null pointers of ->zone_pgdat of zones of
> > those nodes that are not initialized at booting time. So should avoid
> > nodes not set N_MEMORY from getting promoted.
> > 
> > > [ oops trace snipped ]
> 
> I think you could trim the down a little bit.

Thank you for the feedback. I will.

> > 
> > Signed-off-by: Byungchul Park <byungchul@sk.com>
> > Reported-by: hyeongtak.ji@sk.com
> > ---
> >  kernel/sched/fair.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d7a3c63a2171..6d215cc85f14 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1828,6 +1828,23 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> >  	int dst_nid = cpu_to_node(dst_cpu);
> >  	int last_cpupid, this_cpupid;
> >  
> > +	/*
> > +	 * A node of dst_nid might not have its local memory. Promoting
> > +	 * a folio to the node is meaningless. What's even worse, oops
> > +	 * can be observed by the null pointer of ->zone_pgdat in
> > +	 * various points of the code during migration.
> > +	 *
> 
> > +	 * For instance, oops has been observed at CPU2 while qemu'ing:
> > +	 *
> > +	 * {qemu} \
> > +	 *    -numa node,nodeid=0,mem=1G,cpus=0-1 \
> > +	 *    -numa node,nodeid=1,cpus=2-3 \
> > +	 *    -numa node,nodeid=2,mem=8G \
> > +	 *    ...
> 
> This part above should probably be in the commit message not in the code.
> The first paragraph of comment is plenty.

I will.

Thanks. I will respin it.

	Byungchul

> Otherwise, I think the check probably makes sense.
> 
> 
> Cheers,
> Phil
> 
> > +	 */
> > +	if (!node_state(dst_nid, N_MEMORY))
> > +		return false;
> > +
> >  	/*
> >  	 * The pages in slow memory node should be migrated according
> >  	 * to hot/cold instead of private/shared.
> > -- 
> > 2.17.1
> > 
> > 
> 
> --
  
Byungchul Park Feb. 16, 2024, 7:07 a.m. UTC | #5
On Wed, Feb 14, 2024 at 10:13:57PM +0100, Oscar Salvador wrote:
> On Wed, Feb 14, 2024 at 12:53:55PM +0900, Byungchul Park wrote:
> > While running qemu with a configuration where some CPUs don't have their
> > local memory and with a kernel numa balancing on, the following oops has
> > been observed. It's because of null pointers of ->zone_pgdat of zones of
> > those nodes that are not initialized at booting time. So should avoid
> > nodes not set N_MEMORY from getting promoted.
> 
> Looking at free_area_init(), we call free_area_init_node() for each node
> found on the system.
> And free_area_init_node()->free_area_init_core() inits all zones
> belonging to the system via zone_init_internals().

For normal NUMA nodes, node_data[] is initialized in alloc_node_data(),
but it is not for memoryless nodes. However, the node *gets onlined* in
init_cpu_to_node().

Let's look back at free_area_init(). free_area_init_node() will be called
with node_data[] not set yet, because the node is already *onlined*. So
->zone_pgdat cannot be initialized properly in the path you mentioned.

	Byungchul

> Now, I am not saying the check is wrong because we obviously do not want
> migrate memory to a memoryless node, but I am confused as to where
> we are crashing.
> 
> 
> -- 
> Oscar Salvador
> SUSE Labs
  
Byungchul Park Feb. 16, 2024, 7:45 a.m. UTC | #6
On Wed, Feb 14, 2024 at 03:03:18PM -0500, Phil Auld wrote:
> On Wed, Feb 14, 2024 at 07:31:37AM -0500 Phil Auld wrote:
> > Hi,
> > 
> > On Wed, Feb 14, 2024 at 12:53:55PM +0900 Byungchul Park wrote:
> > > While running qemu with a configuration where some CPUs don't have their
> > > local memory and with a kernel numa balancing on, the following oops has
> > > been observed. It's because of null pointers of ->zone_pgdat of zones of
> > > those nodes that are not initialized at booting time. So should avoid
> > > nodes not set N_MEMORY from getting promoted.
> > > 
> > > > [ oops trace snipped ]
> > 
> > I think you could trim the down a little bit.
> > 
> > 
> > > 
> > > Signed-off-by: Byungchul Park <byungchul@sk.com>
> > > Reported-by: hyeongtak.ji@sk.com
> > > ---
> > >  kernel/sched/fair.c | 17 +++++++++++++++++
> > >  1 file changed, 17 insertions(+)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d7a3c63a2171..6d215cc85f14 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1828,6 +1828,23 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> > >  	int dst_nid = cpu_to_node(dst_cpu);
> > >  	int last_cpupid, this_cpupid;
> > >  
> > > +	/*
> > > +	 * A node of dst_nid might not have its local memory. Promoting
> > > +	 * a folio to the node is meaningless. What's even worse, oops
> > > +	 * can be observed by the null pointer of ->zone_pgdat in
> > > +	 * various points of the code during migration.
> > > +	 *
> > 
> > > +	 * For instance, oops has been observed at CPU2 while qemu'ing:
> > > +	 *
> > > +	 * {qemu} \
> > > +	 *    -numa node,nodeid=0,mem=1G,cpus=0-1 \
> > > +	 *    -numa node,nodeid=1,cpus=2-3 \
> > > +	 *    -numa node,nodeid=2,mem=8G \
> > > +	 *    ...
> > 
> > This part above should probably be in the commit message not in the code.
> > The first paragraph of comment is plenty.
> > 
> > Otherwise, I think the check probably makes sense.
> >
> 
> Actually, after looking at the memory.c code I wonder if this check should
> not be made farther up in the numa migrate machinery.

First of all, we cannot avoid the hinting fault, because no one knows
which node a task will eventually run on until a hinting fault occurs. So
we should let the task take the hinting fault, *and then* we can decide
whether we can migrate the folio or not. Given that, IMHO,
should_numa_migrate_memory() is a good place to make that decision.

Thoughts? Am I missing something?

	Byungchul

> Cheers,
> Phil
> 
> > 
> > Cheers,
> > Phil
> > 
> > > +	 */
> > > +	if (!node_state(dst_nid, N_MEMORY))
> > > +		return false;
> > > +
> > >  	/*
> > >  	 * The pages in slow memory node should be migrated according
> > >  	 * to hot/cold instead of private/shared.
> > > -- 
> > > 2.17.1
> > > 
> > > 
> > 
> > -- 
> > 
> > 
> 
> --
  
Oscar Salvador Feb. 16, 2024, 7:52 a.m. UTC | #7
On Fri, Feb 16, 2024 at 04:07:54PM +0900, Byungchul Park wrote:
> For normal numa nodes, node_data[] is initialized at alloc_node_data(),
> but it's not for memoryless node. However, the node *gets onlined* at
> init_cpu_to_node().
> 
> Let's look at back free_area_init(). free_area_init_node() will be called
> with node_data[] not set yet, because it's already *onlined*. So
> ->zone_pgdat cannot be initialized properly in the path you mentioned.

I might be missing something, so bear with me.

free_area_init() gets called before init_cpu_to_node() does.
free_area_init_node() gets called on every possible node.

free_area_init_node() then does

 pg_data_t *pgdat = NODE_DATA(nid);

and then we call free_area_init_core().

free_area_init_core() does

  zone_init_internals()

which ends up doing zone->zone_pgdat = NODE_DATA(nid);

If node_data[] was not set at all, we would already blow up when doing
the first

  for_each_node()
    pgdat = NODE_DATA(nid);
    free_area_init_node(nid);

back in free_area_init().
  
Byungchul Park Feb. 16, 2024, 9:11 a.m. UTC | #8
On Fri, Feb 16, 2024 at 08:52:30AM +0100, Oscar Salvador wrote:
> On Fri, Feb 16, 2024 at 04:07:54PM +0900, Byungchul Park wrote:
> > For normal numa nodes, node_data[] is initialized at alloc_node_data(),
> > but it's not for memoryless node. However, the node *gets onlined* at
> > init_cpu_to_node().
> > 
> > Let's look at back free_area_init(). free_area_init_node() will be called
> > with node_data[] not set yet, because it's already *onlined*. So
> > ->zone_pgdat cannot be initialized properly in the path you mentioned.
> 
> I am might be missing something., so bear with me.
> 
> free_area_init() gets called before init_cpu_to_node() does.
> free_area_init_node() gets called on every possible node.
> 
> free_area_init_node then() does
> 
>  pg_data_t *pgdat = NODE_DATA(nid);,
> 
> and then we call free_area_init_core().
> 
> free_area_init_core() does
> 
>  free_area_init_core() does
>   zone_init_internals()
> 
> which ends up doing zone->zone_pgdat = NODE_DATA(nid);
> 
> If node_data[] was not set at all, we would already blow up when doing
> the first
> 
>   for_each_node()
>     pgdat = NODE_DATA(nid);
>     free_area_init_node(nid);
> 
> back in free_area_init().

It seems that I got it wrong about the reason. Let me check it again and
share the reason.

Just in case, this patch is still definitely necessary tho.

	Byungchul
  
Byungchul Park Feb. 16, 2024, 9:23 a.m. UTC | #9
On Fri, Feb 16, 2024 at 06:11:40PM +0900, Byungchul Park wrote:
> On Fri, Feb 16, 2024 at 08:52:30AM +0100, Oscar Salvador wrote:
> > On Fri, Feb 16, 2024 at 04:07:54PM +0900, Byungchul Park wrote:
> > > For normal numa nodes, node_data[] is initialized at alloc_node_data(),
> > > but it's not for memoryless node. However, the node *gets onlined* at
> > > init_cpu_to_node().
> > > 
> > > Let's look at back free_area_init(). free_area_init_node() will be called
> > > with node_data[] not set yet, because it's already *onlined*. So
> > > ->zone_pgdat cannot be initialized properly in the path you mentioned.
> > 
> > I am might be missing something., so bear with me.
> > 
> > free_area_init() gets called before init_cpu_to_node() does.
> > free_area_init_node() gets called on every possible node.
> > 
> > free_area_init_node then() does
> > 
> >  pg_data_t *pgdat = NODE_DATA(nid);,
> > 
> > and then we call free_area_init_core().
> > 
> > free_area_init_core() does
> > 
> >  free_area_init_core() does
> >   zone_init_internals()
> > 
> > which ends up doing zone->zone_pgdat = NODE_DATA(nid);
> > 
> > If node_data[] was not set at all, we would already blow up when doing
> > the first
> > 
> >   for_each_node()
> >     pgdat = NODE_DATA(nid);
> >     free_area_init_node(nid);
> > 
> > back in free_area_init().
> 
> It seems that I got it wrong about the reason. Let me check it again and
> share the reason.
> 
> Just in case, this patch is still definitely necessary tho.

Sorry for the confusing expression. Please don't misunderstand it. The
oops has always been observed in the configuration that I described. I
meant:

   Just in case, I need to say the fix is still necessary.

	Byungchul
  
Byungchul Park Feb. 16, 2024, 11:26 a.m. UTC | #10
On Fri, Feb 16, 2024 at 06:23:05PM +0900, Byungchul Park wrote:
> On Fri, Feb 16, 2024 at 06:11:40PM +0900, Byungchul Park wrote:
> > On Fri, Feb 16, 2024 at 08:52:30AM +0100, Oscar Salvador wrote:
> > > On Fri, Feb 16, 2024 at 04:07:54PM +0900, Byungchul Park wrote:
> > > > For normal numa nodes, node_data[] is initialized at alloc_node_data(),
> > > > but it's not for memoryless node. However, the node *gets onlined* at
> > > > init_cpu_to_node().
> > > > 
> > > > Let's look at back free_area_init(). free_area_init_node() will be called
> > > > with node_data[] not set yet, because it's already *onlined*. So
> > > > ->zone_pgdat cannot be initialized properly in the path you mentioned.
> > > 
> > > I am might be missing something., so bear with me.
> > > 
> > > free_area_init() gets called before init_cpu_to_node() does.
> > > free_area_init_node() gets called on every possible node.
> > > 
> > > free_area_init_node then() does
> > > 
> > >  pg_data_t *pgdat = NODE_DATA(nid);,
> > > 
> > > and then we call free_area_init_core().
> > > 
> > > free_area_init_core() does
> > > 
> > >  free_area_init_core() does
> > >   zone_init_internals()
> > > 
> > > which ends up doing zone->zone_pgdat = NODE_DATA(nid);
> > > 
> > > If node_data[] was not set at all, we would already blow up when doing
> > > the first
> > > 
> > >   for_each_node()
> > >     pgdat = NODE_DATA(nid);
> > >     free_area_init_node(nid);
> > > 
> > > back in free_area_init().
> > 
> > It seems that I got it wrong about the reason. Let me check it again and
> > share the reason.

I analyzed it wrong. Even though the issue went away with this patch,
it's not the proper fix. Sorry for making you confused. I submitted the
fix as another patch:

   https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/

	Byungchul
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..6d215cc85f14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1828,6 +1828,23 @@  bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 	int dst_nid = cpu_to_node(dst_cpu);
 	int last_cpupid, this_cpupid;
 
+	/*
+	 * A node of dst_nid might not have its local memory. Promoting
+	 * a folio to such a node is meaningless. What's worse, an oops
+	 * can be triggered by the NULL ->zone_pgdat pointer at various
+	 * points of the code during migration.
+	 *
+	 * For instance, oops has been observed at CPU2 while qemu'ing:
+	 *
+	 * {qemu} \
+	 *    -numa node,nodeid=0,mem=1G,cpus=0-1 \
+	 *    -numa node,nodeid=1,cpus=2-3 \
+	 *    -numa node,nodeid=2,mem=8G \
+	 *    ...
+	 */
+	if (!node_state(dst_nid, N_MEMORY))
+		return false;
+
 	/*
 	 * The pages in slow memory node should be migrated according
 	 * to hot/cold instead of private/shared.