Message ID | 20240214035355.18335-1-byungchul@sk.com |
---|---|
State | New |
Headers |
From: Byungchul Park <byungchul@sk.com>
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, kernel_team@skhynix.com, akpm@linux-foundation.org
Subject: [PATCH] sched/numa, mm: do not promote folios to nodes not set N_MEMORY
Date: Wed, 14 Feb 2024 12:53:55 +0900
Message-Id: <20240214035355.18335-1-byungchul@sk.com> |
Series | sched/numa, mm: do not promote folios to nodes not set N_MEMORY |
Commit Message
Byungchul Park
Feb. 14, 2024, 3:53 a.m. UTC
While running qemu with a configuration where some CPUs have no local memory, and with kernel NUMA balancing enabled, the following oops was observed. It is caused by the NULL ->zone_pgdat pointers of the zones belonging to nodes that are not initialized at boot time. Nodes that do not have N_MEMORY set should therefore be prevented from being chosen as promotion targets.

> BUG: unable to handle page fault for address: 00000000000033f3
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> Code: (omitted)
> RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <TASK>
> ? __die
> ? page_fault_oops
> ? __pte_offset_map_lock
> ? exc_page_fault
> ? asm_exc_page_fault
> ? wakeup_kswapd
> migrate_misplaced_page
> __handle_mm_fault
> handle_mm_fault
> do_user_addr_fault
> exc_page_fault
> asm_exc_page_fault
> RIP: 0033:0x55b897ba0808
> Code: (omitted)
> RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> </TASK>
> Modules linked in:
> CR2: 00000000000033f3
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> Code: (omitted)
> RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> note: masim[895] exited with irqs disabled

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reported-by: hyeongtak.ji@sk.com
---
 kernel/sched/fair.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
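For context on the mechanism: a fault on a small address such as 00000000000033f3 is the typical signature of a field access through a NULL or garbage base pointer, here the uninitialized ->zone_pgdat. Below is a self-contained userspace mock of the failure mode; the struct and function names mirror the kernel's, but the layouts and bodies are hypothetical simplifications, not the real mm/vmscan.c code.

	/*
	 * Userspace mock of the failure mode. Struct names mirror the
	 * kernel's; the layouts here are hypothetical simplifications.
	 */
	#include <stdio.h>

	struct pglist_data { int kswapd_order; };	/* stands in for pg_data_t */
	struct zone { struct pglist_data *zone_pgdat; };

	static void mock_wakeup_kswapd(struct zone *zone)
	{
		struct pglist_data *pgdat = zone->zone_pgdat;

		/*
		 * On a memoryless node whose pgdat was never set up,
		 * zone_pgdat is NULL (or garbage), so this field access
		 * faults at a small offset from 0.
		 */
		printf("kswapd_order=%d\n", pgdat->kswapd_order);
	}

	int main(void)
	{
		struct zone uninitialized_zone = { .zone_pgdat = NULL };

		mock_wakeup_kswapd(&uninitialized_zone);	/* segfaults here */
		return 0;
	}

Compiled and run, the mock segfaults at the pgdat field access, just as wakeup_kswapd() does when handed a zone of a never-initialized node.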
Comments
Hi,

On Wed, Feb 14, 2024 at 12:53:55PM +0900 Byungchul Park wrote:
> While running qemu with a configuration where some CPUs have no local
> memory, and with kernel NUMA balancing enabled, the following oops was
> observed. It is caused by the NULL ->zone_pgdat pointers of the zones
> belonging to nodes that are not initialized at boot time. Nodes that do
> not have N_MEMORY set should therefore be prevented from being chosen
> as promotion targets.
>
> > BUG: unable to handle page fault for address: 00000000000033f3
> > [...]
> > note: masim[895] exited with irqs disabled

I think you could trim this down a little bit.

> Signed-off-by: Byungchul Park <byungchul@sk.com>
> Reported-by: hyeongtak.ji@sk.com
> ---
>  kernel/sched/fair.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..6d215cc85f14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1828,6 +1828,23 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  	int dst_nid = cpu_to_node(dst_cpu);
>  	int last_cpupid, this_cpupid;
>  
> +	/*
> +	 * A node of dst_nid might not have its local memory. Promoting
> +	 * a folio to the node is meaningless. What's even worse, oops
> +	 * can be observed by the null pointer of ->zone_pgdat in
> +	 * various points of the code during migration.
> +	 *
> +	 * For instance, oops has been observed at CPU2 while qemu'ing:
> +	 *
> +	 *    {qemu} \
> +	 *       -numa node,nodeid=0,mem=1G,cpus=0-1 \
> +	 *       -numa node,nodeid=1,cpus=2-3 \
> +	 *       -numa node,nodeid=2,mem=8G \
> +	 *       ...

This part above should probably be in the commit message not in the code.
The first paragraph of the comment is plenty.

Otherwise, I think the check probably makes sense.

Cheers,
Phil

> +	 */
> +	if (!node_state(dst_nid, N_MEMORY))
> +		return false;
> +
>  	/*
>  	 * The pages in slow memory node should be migrated according
>  	 * to hot/cold instead of private/shared.
> --
> 2.17.1

--
On Wed, Feb 14, 2024 at 07:31:37AM -0500 Phil Auld wrote:
> Hi,
>
> On Wed, Feb 14, 2024 at 12:53:55PM +0900 Byungchul Park wrote:
> > While running qemu with a configuration where some CPUs have no local
> > memory, and with kernel NUMA balancing enabled, the following oops was
> > observed. [...]
> >
> > > BUG: unable to handle page fault for address: 00000000000033f3
> > > [...]
> > > note: masim[895] exited with irqs disabled
>
> I think you could trim this down a little bit.
>
> > +	/*
> > +	 * A node of dst_nid might not have its local memory. Promoting
> > +	 * a folio to the node is meaningless. What's even worse, oops
> > +	 * can be observed by the null pointer of ->zone_pgdat in
> > +	 * various points of the code during migration.
> > +	 *
> > +	 * For instance, oops has been observed at CPU2 while qemu'ing:
> > +	 * [...]
>
> This part above should probably be in the commit message not in the code.
> The first paragraph of the comment is plenty.
>
> Otherwise, I think the check probably makes sense.

Actually, after looking at the memory.c code I wonder if this check should
not be made farther up in the numa migrate machinery.

Cheers,
Phil
On Wed, Feb 14, 2024 at 12:53:55PM +0900, Byungchul Park wrote:
> While running qemu with a configuration where some CPUs have no local
> memory, and with kernel NUMA balancing enabled, the following oops was
> observed. It is caused by the NULL ->zone_pgdat pointers of the zones
> belonging to nodes that are not initialized at boot time. Nodes that do
> not have N_MEMORY set should therefore be prevented from being chosen
> as promotion targets.

Looking at free_area_init(), we call free_area_init_node() for each node
found on the system. And free_area_init_node()->free_area_init_core()
inits all zones belonging to the system via zone_init_internals().

Now, I am not saying the check is wrong, because we obviously do not want
to migrate memory to a memoryless node, but I am confused as to where we
are crashing.
On Wed, Feb 14, 2024 at 07:31:37AM -0500, Phil Auld wrote:
> Hi,
>
> On Wed, Feb 14, 2024 at 12:53:55PM +0900 Byungchul Park wrote:
> > While running qemu with a configuration where some CPUs have no local
> > memory, and with kernel NUMA balancing enabled, the following oops was
> > observed. [...]
> >
> > > BUG: unable to handle page fault for address: 00000000000033f3
> > > [...]
> > > note: masim[895] exited with irqs disabled
>
> I think you could trim this down a little bit.

Thank you for the feedback. I will.

> > +	/*
> > +	 * A node of dst_nid might not have its local memory. Promoting
> > +	 * a folio to the node is meaningless. What's even worse, oops
> > +	 * can be observed by the null pointer of ->zone_pgdat in
> > +	 * various points of the code during migration.
> > +	 *
> > +	 * For instance, oops has been observed at CPU2 while qemu'ing:
> > +	 * [...]
>
> This part above should probably be in the commit message not in the code.
> The first paragraph of the comment is plenty.

I will. Thanks. I will respin it.

Byungchul
On Wed, Feb 14, 2024 at 10:13:57PM +0100, Oscar Salvador wrote:
> On Wed, Feb 14, 2024 at 12:53:55PM +0900, Byungchul Park wrote:
> > While running qemu with a configuration where some CPUs have no local
> > memory, and with kernel NUMA balancing enabled, the following oops was
> > observed. [...]
>
> Looking at free_area_init(), we call free_area_init_node() for each node
> found on the system. And free_area_init_node()->free_area_init_core()
> inits all zones belonging to the system via zone_init_internals().

For normal numa nodes, node_data[] is initialized at alloc_node_data(),
but it's not for a memoryless node. However, the node *gets onlined* at
init_cpu_to_node().

Now look back at free_area_init(). free_area_init_node() will be called
with node_data[] not set yet, because it's already *onlined*. So
->zone_pgdat cannot be initialized properly in the path you mentioned.

Byungchul

> Now, I am not saying the check is wrong, because we obviously do not
> want to migrate memory to a memoryless node, but I am confused as to
> where we are crashing.
>
> --
> Oscar Salvador
> SUSE Labs
On Wed, Feb 14, 2024 at 03:03:18PM -0500, Phil Auld wrote:
> On Wed, Feb 14, 2024 at 07:31:37AM -0500 Phil Auld wrote:
> > Hi,
> >
> > On Wed, Feb 14, 2024 at 12:53:55PM +0900 Byungchul Park wrote:
> > > While running qemu with a configuration where some CPUs have no
> > > local memory, and with kernel NUMA balancing enabled, the following
> > > oops was observed. [...]
> >
> > I think you could trim this down a little bit.
> >
> > [...]
> >
> > Otherwise, I think the check probably makes sense.
>
> Actually, after looking at the memory.c code I wonder if this check
> should not be made farther up in the numa migrate machinery.

First of all, we cannot avoid the hinting fault. No one knows which node
a task will eventually run on until a hinting fault occurs. So we should
let the hinting fault happen, *and then* we can decide whether to migrate
the folio or not.

Assuming that, IMHO, should_numa_migrate_memory() is a good place to make
the decision. Thoughts? Am I missing something?

Byungchul
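To make the placement question concrete, here is a rough sketch of the hinting-fault call path, reconstructed from memory for roughly v6.6, so the exact intermediate names may differ by version:

	/*
	 * Rough call path of a NUMA hinting fault (approximate, ~v6.6):
	 *
	 *   handle_mm_fault()
	 *     do_numa_page()                      // mm/memory.c
	 *       mpol_misplaced()                  // mm/mempolicy.c
	 *         should_numa_migrate_memory()    // kernel/sched/fair.c
	 *           // returning false here vetoes the migration before
	 *           // migrate_misplaced_page() -> ... -> wakeup_kswapd()
	 *           // can ever touch an uninitialized ->zone_pgdat
	 *       migrate_misplaced_page()          // only if approved above
	 */

On this view, should_numa_migrate_memory() is the earliest point every candidate folio passes through, which supports the argument for putting the check there rather than deeper in the migrate machinery.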
On Fri, Feb 16, 2024 at 04:07:54PM +0900, Byungchul Park wrote:
> For normal numa nodes, node_data[] is initialized at alloc_node_data(),
> but it's not for a memoryless node. However, the node *gets onlined* at
> init_cpu_to_node().
>
> Now look back at free_area_init(). free_area_init_node() will be called
> with node_data[] not set yet, because it's already *onlined*. So
> ->zone_pgdat cannot be initialized properly in the path you mentioned.

I might be missing something, so bear with me.

free_area_init() gets called before init_cpu_to_node() does.
free_area_init_node() gets called on every possible node.

free_area_init_node() then does

	pg_data_t *pgdat = NODE_DATA(nid);

and then we call free_area_init_core().

free_area_init_core() does zone_init_internals(), which ends up doing
zone->zone_pgdat = NODE_DATA(nid);

If node_data[] was not set at all, we would already blow up when doing
the first

	for_each_node()
		pgdat = NODE_DATA(nid);
		free_area_init_node(nid);

back in free_area_init().
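Condensing the chain Oscar walks through into one view, a heavily simplified sketch (loosely based on mm/mm_init.c; only the assignments relevant to this discussion are kept, and the real bodies do far more):

	/* Heavily simplified sketch; not the literal mm/mm_init.c code. */
	void __init free_area_init(unsigned long *max_zone_pfn)
	{
		int nid;

		for_each_node(nid) {
			/* requires node_data[nid] to already be allocated */
			pg_data_t *pgdat = NODE_DATA(nid);

			free_area_init_node(nid);   /* -> free_area_init_core() */
		}
	}

	static void __init free_area_init_core(struct pglist_data *pgdat)
	{
		enum zone_type j;

		for (j = 0; j < MAX_NR_ZONES; j++) {
			struct zone *zone = pgdat->node_zones + j;

			/* zone_init_internals() boils down to, among others: */
			zone->zone_pgdat = NODE_DATA(pgdat->node_id);
		}
	}

If this sketch is right, every possible node, memoryless or not, should get ->zone_pgdat wired up here, which is exactly why the original analysis did not hold up.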
On Fri, Feb 16, 2024 at 08:52:30AM +0100, Oscar Salvador wrote:
> On Fri, Feb 16, 2024 at 04:07:54PM +0900, Byungchul Park wrote:
> > For normal numa nodes, node_data[] is initialized at alloc_node_data(),
> > but it's not for a memoryless node. However, the node *gets onlined*
> > at init_cpu_to_node().
> >
> > [...]
>
> I might be missing something, so bear with me.
>
> free_area_init() gets called before init_cpu_to_node() does.
> free_area_init_node() gets called on every possible node.
>
> [...]
>
> If node_data[] was not set at all, we would already blow up when doing
> the first
>
> 	for_each_node()
> 		pgdat = NODE_DATA(nid);
> 		free_area_init_node(nid);
>
> back in free_area_init().

It seems that I got the reason wrong. Let me check it again and share
the actual cause.

Just in case: this patch is still definitely necessary, though.

Byungchul
On Fri, Feb 16, 2024 at 06:11:40PM +0900, Byungchul Park wrote:
> On Fri, Feb 16, 2024 at 08:52:30AM +0100, Oscar Salvador wrote:
> > [...]
>
> It seems that I got the reason wrong. Let me check it again and share
> the actual cause.
>
> Just in case: this patch is still definitely necessary, though.

Sorry for the confusing expression; please don't misunderstand it. The
oops has always been observed in the configuration that I described.

I meant: just in case, I need to say that the fix is still necessary.

Byungchul
On Fri, Feb 16, 2024 at 06:23:05PM +0900, Byungchul Park wrote:
> On Fri, Feb 16, 2024 at 06:11:40PM +0900, Byungchul Park wrote:
> > On Fri, Feb 16, 2024 at 08:52:30AM +0100, Oscar Salvador wrote:
> > > [...]
> >
> > It seems that I got the reason wrong. Let me check it again and share
> > the actual cause.

I analyzed it wrong. Even though the issue went away with this patch, it
is not the right fix. Sorry for making you confused. I submitted the fix
as another patch:

   https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/

Byungchul
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..6d215cc85f14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1828,6 +1828,23 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 	int dst_nid = cpu_to_node(dst_cpu);
 	int last_cpupid, this_cpupid;
 
+	/*
+	 * A node of dst_nid might not have its local memory. Promoting
+	 * a folio to the node is meaningless. What's even worse, oops
+	 * can be observed by the null pointer of ->zone_pgdat in
+	 * various points of the code during migration.
+	 *
+	 * For instance, oops has been observed at CPU2 while qemu'ing:
+	 *
+	 *    {qemu} \
+	 *       -numa node,nodeid=0,mem=1G,cpus=0-1 \
+	 *       -numa node,nodeid=1,cpus=2-3 \
+	 *       -numa node,nodeid=2,mem=8G \
+	 *       ...
+	 */
+	if (!node_state(dst_nid, N_MEMORY))
+		return false;
+
 	/*
 	 * The pages in slow memory node should be migrated according
 	 * to hot/cold instead of private/shared.
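For reference, the helper the new check relies on is tiny; as of recent kernels, node_state() in include/linux/nodemask.h is essentially the following (quoted from memory, so verify against your tree):

	/* Essentially include/linux/nodemask.h; verify against your tree. */
	static inline int node_state(int node, enum node_states state)
	{
		return node_isset(node, node_states[state]);
	}

N_MEMORY is set only for nodes that actually have memory, which is exactly the property the check needs: a memoryless node never gets the bit, so should_numa_migrate_memory() refuses to promote folios to it.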