Message ID | 20221031183122.470962-1-shy828301@gmail.com |
---|---|
State | New |
Headers |
From: Yang Shi <shy828301@gmail.com>
To: zokeefe@google.com, mhocko@suse.com, akpm@linux-foundation.org
Cc: shy828301@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH] mm: don't warn if the node is offlined
Date: Mon, 31 Oct 2022 11:31:22 -0700
Message-Id: <20221031183122.470962-1-shy828301@gmail.com>
X-Mailer: git-send-email 2.26.3 |
Series | mm: don't warn if the node is offlined |
Commit Message
Yang Shi
Oct. 31, 2022, 6:31 p.m. UTC
Syzbot reported the below splat:
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Modules linked in:
CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
do_madvise mm/madvise.c:1432 [inline]
__do_sys_madvise mm/madvise.c:1432 [inline]
__se_sys_madvise mm/madvise.c:1430 [inline]
__x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6b48a4eef9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
</TASK>
It is because khugepaged allocates pages with __GFP_THISNODE, but the
preferred node is offlined. The warning was even stronger before commit
8addc2d00fe17 ("mm: do not warn on offline nodes unless the specific node
is explicitly requested"), which softened the warning for __GFP_THISNODE.
But this warning does not seem very useful because:
* There is no guarantee that the node is online in every __GFP_THISNODE
  context for all the callsites.
* The kernel just fails the allocation regardless of the warning, and it
  looks like all callsites handle the allocation failure gracefully.
So removing the warning seems like a good move.
Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Michal Hocko <mhocko@suse.com>
---
include/linux/gfp.h | 2 --
1 file changed, 2 deletions(-)
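
The change itself is just the removal of two VM_WARN_ON() lines from include/linux/gfp.h (the full diff is quoted in the replies below). As a rough sketch of the result, __alloc_pages_node() after the patch looks like this; the comments are editorial, not part of the header:

static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
	/* A negative or out-of-range node id is still treated as a hard bug. */
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

	/*
	 * The VM_WARN_ON for __GFP_THISNODE on an offline node is gone;
	 * the allocation simply fails and the caller deals with the NULL
	 * return.
	 */
	return __alloc_pages(gfp_mask, order, nid, NULL);
}

__folio_alloc_node() loses the same warning in the same way.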
Comments
On Mon, Oct 31, 2022 at 11:31 AM Yang Shi <shy828301@gmail.com> wrote:
>
> Syzbot reported the below splat:
[...]
> Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> Cc: Zach O'Keefe <zokeefe@google.com>
> Cc: Michal Hocko <mhocko@suse.com>

Reviewed-by: Zach O'Keefe <zokeefe@google.com>

> ---
>  include/linux/gfp.h | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ef4aea3b356e..594d6dee5646 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -218,7 +218,6 @@ static inline struct page *
>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  {
>  	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> -	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>
>  	return __alloc_pages(gfp_mask, order, nid, NULL);
>  }
> @@ -227,7 +226,6 @@ static inline
>  struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
>  {
>  	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> -	VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
>
>  	return __folio_alloc(gfp, order, nid, NULL);
>  }
> --
> 2.26.3
On Mon 31-10-22 11:31:22, Yang Shi wrote:
> Syzbot reported the below splat:
>
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
[...]
> Call Trace:
> <TASK>
> collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715

This is quite weird, isn't it? alloc_charge_hpage is selecting the most
busy node (as per collapse_control). How come this can be an offline
node? Is a parallel memory hotplug happening?

[...]

> It is because khugepaged allocates pages with __GFP_THISNODE, but the
> preferred node is offlined. The warning was even stronger before commit
> 8addc2d00fe17 ("mm: do not warn on offline nodes unless the specific node
> is explicitly requested"), which softened the warning for __GFP_THISNODE.
>
> But this warning does not seem very useful because:
> * There is no guarantee that the node is online in every __GFP_THISNODE
>   context for all the callsites.

The original idea IIRC was to catch a buggy code which mishandled node
assignment. But this looks like a perfectly valid code. There is no
synchronization with the memory hotplug so it is possible that memory
gets offline during a longer taking scanning.

I do agree that the warning is not really helpful in this case. It is
actually even harmful for those running in panic-on-warn mode.

> * The kernel just fails the allocation regardless of the warning, and it
>   looks like all callsites handle the allocation failure gracefully.
>
> So removing the warning seems like a good move.
>
> Reported-by: syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> Cc: Zach O'Keefe <zokeefe@google.com>
> Cc: Michal Hocko <mhocko@suse.com>

Unless I am wrong in my above statement I would appreciate extending the
changelog to describe the actual code is correct so the warning is
harmful.

Acked-by: Michal Hocko <mhocko@suse.com>

[...]
On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > Syzbot reported the below splat:
[...]
> > Call Trace:
> > <TASK>
> > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
>
> This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> busy node (as per collapse_control). How come this can be an offline
> node? Is a parallel memory hotplug happening?

TBH -- I did not look closely at the syzbot reproducer (let alone
attempt to run it) and assumed this was the case. Taking a quick look,
at least memory hot remove is enabled:

CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
CONFIG_MEMORY_HOTREMOVE=y

But looking at the C reproducer, I don't immediately see anywhere
where we offline nodes. I'll try to run this tomorrow to make sure I'm
not missing something real here.

Thanks,
Zach

[...]
On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
> On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <mhocko@suse.com> wrote:
> >
[...]
> > This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> > busy node (as per collapse_control). How come this can be an offline
> > node? Is a parallel memory hotplug happening?
>
> TBH -- I did not look closely at the syzbot reproducer (let alone
> attempt to run it) and assumed this was the case. Taking a quick look,
> at least memory hot remove is enabled:
>
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> CONFIG_MEMORY_HOTPLUG=y
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
> CONFIG_MEMORY_HOTREMOVE=y
>
> But looking at the C reproducer, I don't immediately see anywhere
> where we offline nodes. I'll try to run this tomorrow to make sure I'm
> not missing something real here.

Looking slightly closer at hpage_collapse_scan_file I think that it is
possible that xas_for_each simply doesn't find any entries in the page
cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
to collapse_file even without any real entries.

But the mere possibility of the hotplug race should be a sufficient
ground to remove those WARN_ONs.

Thanks!
On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
[...]
> > But looking at the C reproducer, I don't immediately see anywhere
> > where we offline nodes. I'll try to run this tomorrow to make sure I'm
> > not missing something real here.
>
> Looking slightly closer at hpage_collapse_scan_file I think that it is
> possible that xas_for_each simply doesn't find any entries in the page
> cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
> to collapse_file even without any real entries.

The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <=
(HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it.

But a closer look at the code about how to pick up the preferred node,
there seems to be a corner case for MADV_COLLAPSE.

The code tried to do some balance if several nodes have the same hit
record. Basically it does conceptually:
  * If the target_node <= last_target_node, then iterate from
    last_target_node + 1 to MAX_NUMNODES (1024 on default config)
  * If the max_value == node_load[nid], then target_node = nid

So assuming the system has 2 nodes, the target_node is 0 and the
last_target_node is 1, if MADV_COLLAPSE path is hit, then it may
return 2 for target_node, but it is actually not existing (offline),
so the warn is triggered.

The below patch should be able to fix it:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ea0d186bc9d4..d24405e6736b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 	if (target_node <= cc->last_target_node)
 		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
 		     nid++)
-			if (max_value == cc->node_load[nid]) {
+			if (node_online(nid) &&
+			    max_value == cc->node_load[nid]) {
 				target_node = nid;
 				break;
 			}

> But the mere possibility of the hotplug race should be a sufficient
> ground to remove those WARN_ONs.

The warn_on did help to catch this bug. But the reasons for removing
it still stand TBH, so we may consider to move this warn_on to the
callers which care about it?

> Thanks!
> --
> Michal Hocko
> SUSE Labs
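
For readers following the corner case, here is a hedged reconstruction of the node-selection helper being discussed, pieced together from the description and the hunk above; the exact mm/khugepaged.c source may differ slightly. The proposed node_online() check is folded in:

/*
 * Sketch of hpage_collapse_find_target_node() as described above --
 * a reconstruction for illustration, not the verbatim kernel source.
 */
static int hpage_collapse_find_target_node(struct collapse_control *cc)
{
	int nid, target_node = 0, max_value = 0;

	/* Pick the first node with the highest hit count. */
	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (cc->node_load[nid] > max_value) {
			max_value = cc->node_load[nid];
			target_node = nid;
		}

	/*
	 * Balance between nodes that tie for the maximum hit count: if the
	 * winner is not past the node chosen last time, keep scanning upward
	 * from last_target_node + 1.  Without the node_online() check this
	 * can pick a node id with no memory behind it (e.g. nid 2 on a
	 * two-node system), which is exactly what tripped the warning.
	 */
	if (target_node <= cc->last_target_node)
		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
		     nid++)
			if (node_online(nid) &&
			    max_value == cc->node_load[nid]) {
				target_node = nid;
				break;
			}

	cc->last_target_node = target_node;
	return target_node;
}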
On Tue, Nov 1, 2022 at 10:13 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Looking slightly closer at hpage_collapse_scan_file I think that it is
> > possible that xas_for_each simply doesn't find any entries in the page
> > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
> > to collapse_file even without any real entries.
>
> The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <=
> (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it.
>
> But a closer look at the code about how to pick up the preferred node,
> there seems to be a corner case for MADV_COLLAPSE.
>
> The code tried to do some balance if several nodes have the same hit
> record. Basically it does conceptually:
>   * If the target_node <= last_target_node, then iterate from
>     last_target_node + 1 to MAX_NUMNODES (1024 on default config)
>   * If the max_value == node_load[nid], then target_node = nid
>
> So assuming the system has 2 nodes, the target_node is 0 and the
> last_target_node is 1, if MADV_COLLAPSE path is hit, then it may
> return 2 for target_node, but it is actually not existing (offline),
> so the warn is triggered.

You're one step ahead of me, Yang. I was just debugging the syzbot C
reproducer, and this seems to be exactly the case that is happening.

> The below patch should be able to fix it:
[...]

Thanks for the patch. I think this is the right place to do the check.

This is slightly tangential - but I don't want to send a new mail
about it -- but I wonder if we should be doing __GFP_THISNODE +
explicit node vs having hpage_collapse_find_target_node() set a
nodemask. We could then provide fallback nodes for ties, or if some
node contained > some threshold number of pages.

> > But the mere possibility of the hotplug race should be a sufficient
> > ground to remove those WARN_ONs.

Agreed.

> The warn_on did help to catch this bug. But the reasons for removing
> it still stand TBH, so we may consider to move this warn_on to the
> callers which care about it?

I didn't come across in a cursory search -- but if there are callers
which try to synchronize with hot remove to ensure __GFP_THISNODE
succeeds, then sure, the warn makes sense to them.
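
To make the tangent concrete, the nodemask direction would look roughly like the sketch below. This is a purely hypothetical illustration, not part of the patch or of mm/khugepaged.c; the helper name collapse_alloc_hpage_nodemask() is made up, and only the "ties" case is shown (no threshold logic):

/*
 * Hypothetical sketch only: build a mask of acceptable fallback nodes
 * (here, the online nodes that tie for the maximum hit count) and let
 * the page allocator choose among them, instead of pinning the
 * allocation to a single node with __GFP_THISNODE.
 */
static struct page *collapse_alloc_hpage_nodemask(struct collapse_control *cc,
						  gfp_t gfp, int preferred_nid,
						  int max_value)
{
	nodemask_t nmask;
	int nid;

	nodes_clear(nmask);
	for_each_online_node(nid)
		if (cc->node_load[nid] == max_value)
			node_set(nid, nmask);

	/* No __GFP_THISNODE: fallback is allowed, but only within the mask. */
	return __alloc_pages(gfp, HPAGE_PMD_ORDER, preferred_nid, &nmask);
}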
On Tue, Nov 1, 2022 at 12:14 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> On Tue, Nov 1, 2022 at 10:13 AM Yang Shi <shy828301@gmail.com> wrote:
[...]
> > The below patch should be able to fix it:
[...]
>
> Thanks for the patch. I think this is the right place to do the check.
>
> This is slightly tangential - but I don't want to send a new mail
> about it -- but I wonder if we should be doing __GFP_THISNODE +
> explicit node vs having hpage_collapse_find_target_node() set a
> nodemask. We could then provide fallback nodes for ties, or if some
> node contained > some threshold number of pages.

We definitely could, but I'm not sure whether we should make this code
more complicated or not. TBH I think it is already overengineered. If
fallback is fine, then just removing __GFP_THISNODE may be fine too.
Actually I doubt there would be any noticeable difference for real
life workload.

> > The warn_on did help to catch this bug. But the reasons for removing
> > it still stand TBH, so we may consider to move this warn_on to the
> > callers which care about it?
>
> I didn't come across in a cursory search -- but if there are callers
> which try to synchronize with hot remove to ensure __GFP_THISNODE
> succeeds, then sure, the warn makes sense to them.

Yeah, we are on the same page, the specific warning could be added by
the caller instead of in the common path.
On Tue, Nov 1, 2022 at 1:09 PM Yang Shi <shy828301@gmail.com> wrote: > > On Tue, Nov 1, 2022 at 12:14 PM Zach O'Keefe <zokeefe@google.com> wrote: > > > > On Tue, Nov 1, 2022 at 10:13 AM Yang Shi <shy828301@gmail.com> wrote: > > > > > > On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote: > > > > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Mon 31-10-22 11:31:22, Yang Shi wrote: > > > > > > > Syzbot reported the below splat: > > > > > > > > > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline] > > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] > > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 > > > > > > > Modules linked in: > > > > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0 > > > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 > > > > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline] > > > > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] > > > > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 > > > > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae > > > > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293 > > > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > > > > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001 > > > > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000 > > > > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > > > > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000 > > > > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 > > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0 > > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > > > > > Call Trace: > > > > > > > <TASK> > > > > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715 > > > > > > > > > > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most > > > > > > busy node (as per collapse_control). How come this can be an offline > > > > > > node? Is a parallel memory hotplug happening? > > > > > > > > > > TBH -- I did not look closely at the syzbot reproducer (let alone > > > > > attempt to run it) and assumed this was the case. Taking a quick look, > > > > > at least memory hot remove is enabled: > > > > > > > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y > > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y > > > > > CONFIG_MEMORY_HOTPLUG=y > > > > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y > > > > > CONFIG_MEMORY_HOTREMOVE=y > > > > > > > > > > But looking at the C reproducer, I don't immediately see anywhere > > > > > where we offline nodes. I'll try to run this tomorrow to make sure I'm > > > > > not missing something real here. 
> > > > > > > > Looking slightly closer at hpage_collapse_scan_file I think that it is > > > > possible that xas_for_each simply doesn't find any entries in the page > > > > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back > > > > to collapse_file even without any real entries. > > > > > > The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <= > > > (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it. > > > > > > But a closer look at the code about how to pick up the preferred node, > > > there seems to be a corner case for MADV_COLLAPSE. > > > > > > The code tried to do some balance if several nodes have the same hit > > > record. Basically it does conceptually: > > > * If the target_node <= last_target_node, then iterate from > > > last_target_node + 1 to MAX_NUMNODES (1024 on default config) > > > * If the max_value == node_load[nid], then target_node = nid > > > > > > So assuming the system has 2 nodes, the target_node is 0 and the > > > last_target_node is 1, if MADV_COLLAPSE path is hit, then it may > > > return 2 for target_node, but it is actually not existing (offline), > > > so the warn is triggered. > > > > > > > You're one step ahead of me, Yang. I was just debugging the syzbot C > > reproducer, and this seems to be exactly the case that is happening. > > > > > The below patch should be able to fix it: > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > > index ea0d186bc9d4..d24405e6736b 100644 > > > --- a/mm/khugepaged.c > > > +++ b/mm/khugepaged.c > > > @@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct > > > collapse_control *cc) > > > if (target_node <= cc->last_target_node) > > > for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES; > > > nid++) > > > - if (max_value == cc->node_load[nid]) { > > > + if (node_online(nid) && > > > + max_value == cc->node_load[nid]) { > > > target_node = nid; > > > break; > > > } > > > > > > > Thanks for the patch. I think this is the right place to do the check. > > > > This is slightly tangential - but I don't want to send a new mail > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > explicit node vs having hpage_collapse_find_target_node() set a > > nodemask. We could then provide fallback nodes for ties, or if some > > node contained > some threshold number of pages. > > We definitely could, but I'm not sure whether we should make this code > more complicated or not. TBH I think it is already overengineered. If > fallback is fine, then just removing __GFP_THISNODE may be fine too. > Actually I doubt there would be any noticeable difference for real > life workload. > It would definitely make it more complicated. Fallback to certain nodes is ok -- we really would like to avoid allocating a THP if it's mostly accessed remotely. Anyways -- as you mention, likely no difference in real workloads, so fine leaving as-is. > > > > > > But the mere possibility of the hotplug race should be a sufficient > > > > ground to remove those WARN_ONs > > > > > > > Agreed. > > > > > The warn_on did help to catch this bug. But the reasons for removing > > > it still stand TBH, so we may consider to move this warn_on to the > > > callers which care about it? > > > > I didn't come across in a cursory search -- but if there are callers > > which try to synchronize with hot remove to ensure __GFP_THISNODE > > succeeds, then sure, the warn makes sense to them. > > Yeah, we are on the same page, the specific warning could be added by > the caller instead of in the common path. 
> > > > > > > > > > > > > > > Thanks! > > > > -- > > > > Michal Hocko > > > > SUSE Labs
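To make the corner case discussed above concrete, here is a minimal userspace simulation of the node-selection logic in question. It is not the kernel code itself; the all-zero node_load contents, the 2-node setup, and last_target_node = 1 are assumptions taken from the scenario Yang describes:

#include <stdio.h>

#define MAX_NUMNODES 1024			/* default CONFIG_NODES_SHIFT */

static unsigned int node_load[MAX_NUMNODES];	/* all zero: no entries found */
static int last_target_node = 1;		/* node 1 was picked last time */

static int find_target_node(void)
{
	int nid, target_node = 0, max_value = 0;

	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		if (node_load[nid] > max_value) {
			max_value = node_load[nid];
			target_node = nid;
		}
	}

	/* "balance" step: with max_value == 0 every nid matches */
	if (target_node <= last_target_node)
		for (nid = last_target_node + 1; nid < MAX_NUMNODES; nid++)
			if (max_value == node_load[nid]) {
				target_node = nid;
				break;
			}

	return target_node;
}

int main(void)
{
	/* prints 2 -- a node id that does not exist on a 2-node machine */
	printf("target_node = %d\n", find_target_node());
	return 0;
}

With every node_load counter at zero (the MADV_COLLAPSE-on-an-empty-range case), the balance loop hands back nid 2, and the subsequent __GFP_THISNODE allocation trips the VM_WARN_ON in __alloc_pages_node().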
On Tue 01-11-22 10:12:49, Yang Shi wrote: > On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote: > > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 31-10-22 11:31:22, Yang Shi wrote: > > > > > Syzbot reported the below splat: > > > > > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline] > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 > > > > > Modules linked in: > > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0 > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 > > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline] > > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] > > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 > > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae > > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293 > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001 > > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000 > > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000 > > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0 > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > > > Call Trace: > > > > > <TASK> > > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715 > > > > > > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most > > > > busy node (as per collapse_control). How come this can be an offline > > > > node? Is a parallel memory hotplug happening? > > > > > > TBH -- I did not look closely at the syzbot reproducer (let alone > > > attempt to run it) and assumed this was the case. Taking a quick look, > > > at least memory hot remove is enabled: > > > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y > > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y > > > CONFIG_MEMORY_HOTPLUG=y > > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y > > > CONFIG_MEMORY_HOTREMOVE=y > > > > > > But looking at the C reproducer, I don't immediately see anywhere > > > where we offline nodes. I'll try to run this tomorrow to make sure I'm > > > not missing something real here. > > > > Looking slightly closer at hpage_collapse_scan_file I think that it is > > possible that xas_for_each simply doesn't find any entries in the page > > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back > > to collapse_file even without any real entries. > > The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <= > (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it. OK, I see. 
> But a closer look at the code about how to pick up the preferred node,
> there seems to be a corner case for MADV_COLLAPSE.
>
> The code tried to do some balance if several nodes have the same hit
> record. Basically it does conceptually:
>     * If the target_node <= last_target_node, then iterate from
>       last_target_node + 1 to MAX_NUMNODES (1024 on default config)
>     * If the max_value == node_load[nid], then target_node = nid

Correct

> So assuming the system has 2 nodes, the target_node is 0 and the
> last_target_node is 1, if MADV_COLLAPSE path is hit, then it may
> return 2 for target_node, but it is actually not existing (offline),
> so the warn is triggered.

How can node_load[2] > 0 (IIUC max_value > 0 here) if the node is
offline (other than a race with hotplug)?

> The below patch should be able to fix it:
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ea0d186bc9d4..d24405e6736b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
>  	if (target_node <= cc->last_target_node)
>  		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
>  		     nid++)
> -			if (max_value == cc->node_load[nid]) {
> +			if (node_online(nid) &&
> +			    max_value == cc->node_load[nid]) {
>  				target_node = nid;
>  				break;
>  			}

Nope, this is equally racy. node_online might become false right after
the check.
On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
[...]
> This is slightly tangential - but I don't want to send a new mail
> about it -- but I wonder if we should be doing __GFP_THISNODE +
> explicit node vs having hpage_collapse_find_target_node() set a
> nodemask. We could then provide fallback nodes for ties, or if some
> node contained > some threshold number of pages.

I would simply go with something like this (not even compile tested):

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4734315f7940..947a5158fe11 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -96,9 +96,6 @@ struct collapse_control {
 
 	/* Num pages scanned per node */
 	u32 node_load[MAX_NUMNODES];
-
-	/* Last target selected in hpage_collapse_find_target_node() */
-	int last_target_node;
 };
 
 /**
@@ -734,7 +731,6 @@ static void khugepaged_alloc_sleep(void)
 
 struct collapse_control khugepaged_collapse_control = {
 	.is_khugepaged = true,
-	.last_target_node = NUMA_NO_NODE,
 };
 
 static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
@@ -772,7 +768,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc, nodemask_t *alloc_mask)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -783,28 +779,25 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 			target_node = nid;
 	}
 
+	nodes_clear(&alloc_mask);
 	/* do some balance if several nodes have the same hit record */
-	if (target_node <= cc->last_target_node)
-		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
-		     nid++)
-			if (max_value == cc->node_load[nid]) {
-				target_node = nid;
-				break;
-			}
+	for_each_online_node(nid) {_
+		if (max_value == cc->node_load[nid])
+			node_set(nid, &alloc_mask)
+	}
 
-	cc->last_target_node = target_node;
 	return target_node;
 }
 #else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc, nodemask_t *alloc_mask)
 {
 	return 0;
 }
 #endif
 
-static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node, nodemask_t *nmask)
 {
-	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	*hpage = __alloc_pages(gfp, HPAGE_PMD_ORDER, node, nmask);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		return false;
@@ -958,9 +951,18 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 	/* Only allocate from the target node */
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE) | __GFP_THISNODE;
-	int node = hpage_collapse_find_target_node(cc);
+	NODEMASK_ALLOC(nodemask_t, nmask, GFP_KERNEL);
+	int node;
+	int ret;
+
+	if (!nmaks)
+		return SCAN_ALLOC_HUGE_PAGE_FAIL;
+
+	node = hpage_collapse_find_target_node(cc, nmask);
+	ret = hpage_collapse_alloc_page(hpage, gfp, node, nmask);
+	NODEMASK_FREE(nmask);
 
-	if (!hpage_collapse_alloc_page(hpage, gfp, node))
+	if (!ret)
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
 		return SCAN_CGROUP_CHARGE_FAIL;
@@ -2576,7 +2578,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	if (!cc)
 		return -ENOMEM;
 	cc->is_khugepaged = false;
-	cc->last_target_node = NUMA_NO_NODE;
 
 	mmgrab(mm);
 	lru_add_drain_all();
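For readability, the core of the suggestion above, with the obvious slips in the untested diff corrected (the "nmaks" typo, the stray "_" after the opening brace, the missing semicolons, and nodes_clear()/node_set() being handed the address of a pointer), might look roughly like this. It remains a sketch of the idea, not a tested patch:

/* Sketch: return the most loaded node and collect every online node that
 * ties with it into *alloc_mask so the allocation can fall back on ties. */
static int hpage_collapse_find_target_node(struct collapse_control *cc,
					   nodemask_t *alloc_mask)
{
	int nid, target_node = 0, max_value = 0;

	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		if (cc->node_load[nid] > max_value) {
			max_value = cc->node_load[nid];
			target_node = nid;
		}
	}

	nodes_clear(*alloc_mask);
	/* allow fallback among all online nodes with the same hit record */
	for_each_online_node(nid) {
		if (max_value == cc->node_load[nid])
			node_set(nid, *alloc_mask);
	}

	return target_node;
}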
On Wed 02-11-22 08:39:25, Michal Hocko wrote: > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > [...] > > This is slightly tangential - but I don't want to send a new mail > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > explicit node vs having hpage_collapse_find_target_node() set a > > nodemask. We could then provide fallback nodes for ties, or if some > > node contained > some threshold number of pages. > > I would simply go with something like this (not even compile tested): Btw. while at it. It is really ugly to allocate 4kB stack space for node mask for !NUMA configurations! If you are touching that area then this shouldn't be hard to fix.
On Wed, Nov 2, 2022 at 12:14 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 01-11-22 10:12:49, Yang Shi wrote: > > On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote: > > > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Mon 31-10-22 11:31:22, Yang Shi wrote: > > > > > > Syzbot reported the below splat: > > > > > > > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline] > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 > > > > > > Modules linked in: > > > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0 > > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 > > > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline] > > > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] > > > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 > > > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae > > > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293 > > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > > > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001 > > > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000 > > > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > > > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000 > > > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0 > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > > > > Call Trace: > > > > > > <TASK> > > > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715 > > > > > > > > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most > > > > > busy node (as per collapse_control). How come this can be an offline > > > > > node? Is a parallel memory hotplug happening? > > > > > > > > TBH -- I did not look closely at the syzbot reproducer (let alone > > > > attempt to run it) and assumed this was the case. Taking a quick look, > > > > at least memory hot remove is enabled: > > > > > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y > > > > CONFIG_MEMORY_HOTPLUG=y > > > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y > > > > CONFIG_MEMORY_HOTREMOVE=y > > > > > > > > But looking at the C reproducer, I don't immediately see anywhere > > > > where we offline nodes. I'll try to run this tomorrow to make sure I'm > > > > not missing something real here. > > > > > > Looking slightly closer at hpage_collapse_scan_file I think that it is > > > possible that xas_for_each simply doesn't find any entries in the page > > > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back > > > to collapse_file even without any real entries. 
> > > > The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <= > > (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it. > > OK, I see. > > > But a closer look at the code about how to pick up the preferred node, > > there seems to be a corner case for MADV_COLLAPSE. > > > > The code tried to do some balance if several nodes have the same hit > > record. Basically it does conceptually: > > * If the target_node <= last_target_node, then iterate from > > last_target_node + 1 to MAX_NUMNODES (1024 on default config) > > * If the max_value == node_load[nid], then target_node = nid > > Correct > > > So assuming the system has 2 nodes, the target_node is 0 and the > > last_target_node is 1, if MADV_COLLAPSE path is hit, then it may > > return 2 for target_node, but it is actually not existing (offline), > > so the warn is triggered. > > How can node_load[2] > 0 (IIUC max_value > 0 here) if the node is > offline (other than a race with hotplug)? The max_value may be 0 if there is no entry in page cache for that range IIUC. > > > The below patch should be able to fix it: > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index ea0d186bc9d4..d24405e6736b 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct > > collapse_control *cc) > > if (target_node <= cc->last_target_node) > > for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES; > > nid++) > > - if (max_value == cc->node_load[nid]) { > > + if (node_online(nid) && > > + max_value == cc->node_load[nid]) { > > target_node = nid; > > break; > > } > > Node, this is equally racy. node_online might become false right after > the check. Yes, my point is this could filter out the non-existing node case, for example, the system just has 2 nodes physically, but MAX_NUMNODES > 2. It doesn't prevent offlining, it is still racy, just like other __GFP_THISNODE call sites. > -- > Michal Hocko > SUSE Labs
On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > [...] > > This is slightly tangential - but I don't want to send a new mail > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > explicit node vs having hpage_collapse_find_target_node() set a > > nodemask. We could then provide fallback nodes for ties, or if some > > node contained > some threshold number of pages. > > I would simply go with something like this (not even compile tested): Thanks, Michal. It is definitely an option. As I talked with Zach, I'm not sure whether it is worth making the code more complicated for such micro optimization or not. Removing __GFP_THISNODE or even removing the node balance code should be fine too IMHO. TBH I doubt there would be any noticeable difference. > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 4734315f7940..947a5158fe11 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -96,9 +96,6 @@ struct collapse_control { > > /* Num pages scanned per node */ > u32 node_load[MAX_NUMNODES]; > - > - /* Last target selected in hpage_collapse_find_target_node() */ > - int last_target_node; > }; > > /** > @@ -734,7 +731,6 @@ static void khugepaged_alloc_sleep(void) > > struct collapse_control khugepaged_collapse_control = { > .is_khugepaged = true, > - .last_target_node = NUMA_NO_NODE, > }; > > static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc) > @@ -772,7 +768,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void) > } > > #ifdef CONFIG_NUMA > -static int hpage_collapse_find_target_node(struct collapse_control *cc) > +static int hpage_collapse_find_target_node(struct collapse_control *cc, nodemask_t *alloc_mask) > { > int nid, target_node = 0, max_value = 0; > > @@ -783,28 +779,25 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc) > target_node = nid; > } > > + nodes_clear(&alloc_mask); > /* do some balance if several nodes have the same hit record */ > - if (target_node <= cc->last_target_node) > - for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES; > - nid++) > - if (max_value == cc->node_load[nid]) { > - target_node = nid; > - break; > - } > + for_each_online_node(nid) {_ > + if (max_value == cc->node_load[nid]) > + node_set(nid, &alloc_mask) > + } > > - cc->last_target_node = target_node; > return target_node; > } > #else > -static int hpage_collapse_find_target_node(struct collapse_control *cc) > +static int hpage_collapse_find_target_node(struct collapse_control *cc, nodemask_t *alloc_mask) > { > return 0; > } > #endif > > -static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node) > +static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node, nodemask_t *nmask) > { > - *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER); > + *hpage = __alloc_pages(gfp, HPAGE_PMD_ORDER, node, nmask); > if (unlikely(!*hpage)) { > count_vm_event(THP_COLLAPSE_ALLOC_FAILED); > return false; > @@ -958,9 +951,18 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm, > /* Only allocate from the target node */ > gfp_t gfp = (cc->is_khugepaged ? 
alloc_hugepage_khugepaged_gfpmask() : > GFP_TRANSHUGE) | __GFP_THISNODE; > - int node = hpage_collapse_find_target_node(cc); > + NODEMASK_ALLOC(nodemask_t, nmask, GFP_KERNEL); > + int node; > + int ret; > + > + if (!nmaks) > + return SCAN_ALLOC_HUGE_PAGE_FAIL; > + > + node = hpage_collapse_find_target_node(cc, nmask); > + ret = hpage_collapse_alloc_page(hpage, gfp, node, nmask); > + NODEMASK_FREE(nmask); > > - if (!hpage_collapse_alloc_page(hpage, gfp, node)) > + if (!ret) > return SCAN_ALLOC_HUGE_PAGE_FAIL; > if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp))) > return SCAN_CGROUP_CHARGE_FAIL; > @@ -2576,7 +2578,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, > if (!cc) > return -ENOMEM; > cc->is_khugepaged = false; > - cc->last_target_node = NUMA_NO_NODE; > > mmgrab(mm); > lru_add_drain_all(); > -- > Michal Hocko > SUSE Labs
On Wed 02-11-22 08:58:09, Yang Shi wrote: > On Wed, Nov 2, 2022 at 12:14 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 01-11-22 10:12:49, Yang Shi wrote: > > > On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote: > > > > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Mon 31-10-22 11:31:22, Yang Shi wrote: > > > > > > > Syzbot reported the below splat: > > > > > > > > > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline] > > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] > > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 > > > > > > > Modules linked in: > > > > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0 > > > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 > > > > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline] > > > > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] > > > > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 > > > > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae > > > > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293 > > > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > > > > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001 > > > > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000 > > > > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > > > > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000 > > > > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 > > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0 > > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > > > > > Call Trace: > > > > > > > <TASK> > > > > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715 > > > > > > > > > > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most > > > > > > busy node (as per collapse_control). How come this can be an offline > > > > > > node? Is a parallel memory hotplug happening? > > > > > > > > > > TBH -- I did not look closely at the syzbot reproducer (let alone > > > > > attempt to run it) and assumed this was the case. Taking a quick look, > > > > > at least memory hot remove is enabled: > > > > > > > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y > > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y > > > > > CONFIG_MEMORY_HOTPLUG=y > > > > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y > > > > > CONFIG_MEMORY_HOTREMOVE=y > > > > > > > > > > But looking at the C reproducer, I don't immediately see anywhere > > > > > where we offline nodes. I'll try to run this tomorrow to make sure I'm > > > > > not missing something real here. 
> > > > > > > > Looking slightly closer at hpage_collapse_scan_file I think that it is > > > > possible that xas_for_each simply doesn't find any entries in the page > > > > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back > > > > to collapse_file even without any real entries. > > > > > > The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <= > > > (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it. > > > > OK, I see. > > > > > But a closer look at the code about how to pick up the preferred node, > > > there seems to be a corner case for MADV_COLLAPSE. > > > > > > The code tried to do some balance if several nodes have the same hit > > > record. Basically it does conceptually: > > > * If the target_node <= last_target_node, then iterate from > > > last_target_node + 1 to MAX_NUMNODES (1024 on default config) > > > * If the max_value == node_load[nid], then target_node = nid > > > > Correct > > > > > So assuming the system has 2 nodes, the target_node is 0 and the > > > last_target_node is 1, if MADV_COLLAPSE path is hit, then it may > > > return 2 for target_node, but it is actually not existing (offline), > > > so the warn is triggered. > > > > How can node_load[2] > 0 (IIUC max_value > 0 here) if the node is > > offline (other than a race with hotplug)? > > The max_value may be 0 if there is no entry in page cache for that range IIUC. Ohh, I am blind. I believe you have already mentioned that but I must have overlooked it. I have only now noticed cc->is_khugepaged part of the check. Supposedly, the primary idea here being that madvise calls should de-facto create a brand new THP in that case. A creative way to call collapsing but I can see some point in that. See my other reply on how to deal with that in a more reasonable (IMHO) way.
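For readers following along, the check Michal refers to sits at the end of hpage_collapse_scan_file(). Paraphrased from the 6.1-era code (from memory, so treat the identifiers as approximate rather than a verbatim quote), it looks roughly like this:

	if (result == SCAN_SUCCEED) {
		if (cc->is_khugepaged &&
		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
			/* khugepaged gives up if too much of the range is absent */
			result = SCAN_EXCEED_NONE_PTE;
			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
		} else {
			/* MADV_COLLAPSE proceeds even with zero present pages */
			result = collapse_file(mm, addr, file, start, cc);
		}
	}

Because only khugepaged honours max_ptes_none here, an MADV_COLLAPSE of a fully unpopulated range reaches collapse_file() with every cc->node_load[] counter still zero, which is exactly the max_value == 0 situation discussed above.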
On Wed 02-11-22 09:03:57, Yang Shi wrote: > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > > [...] > > > This is slightly tangential - but I don't want to send a new mail > > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > > explicit node vs having hpage_collapse_find_target_node() set a > > > nodemask. We could then provide fallback nodes for ties, or if some > > > node contained > some threshold number of pages. > > > > I would simply go with something like this (not even compile tested): > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm > not sure whether it is worth making the code more complicated for such > micro optimization or not. Removing __GFP_THISNODE or even removing > the node balance code should be fine too IMHO. TBH I doubt there would > be any noticeable difference. I do agree that an explicit nodes (quasi)round robin sounds over engineered. It makes some sense to try to target the prevalent node though because this code can be executed from khugepaged and therefore allocating with a completely different affinity than the original fault.
On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 02-11-22 09:03:57, Yang Shi wrote: > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > > > [...] > > > > This is slightly tangential - but I don't want to send a new mail > > > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > > > explicit node vs having hpage_collapse_find_target_node() set a > > > > nodemask. We could then provide fallback nodes for ties, or if some > > > > node contained > some threshold number of pages. > > > > > > I would simply go with something like this (not even compile tested): > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm > > not sure whether it is worth making the code more complicated for such > > micro optimization or not. Removing __GFP_THISNODE or even removing > > the node balance code should be fine too IMHO. TBH I doubt there would > > be any noticeable difference. > > I do agree that an explicit nodes (quasi)round robin sounds over > engineered. It makes some sense to try to target the prevalent node > though because this code can be executed from khugepaged and therefore > allocating with a completely different affinity than the original fault. Yeah, the corner case comes from the node balance code, it just tries to balance between multiple prevalent nodes, so you agree to remove it IIRC? > -- > Michal Hocko > SUSE Labs
On Wed 02-11-22 10:36:07, Yang Shi wrote: > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote: > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > > > > [...] > > > > > This is slightly tangential - but I don't want to send a new mail > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > > > > explicit node vs having hpage_collapse_find_target_node() set a > > > > > nodemask. We could then provide fallback nodes for ties, or if some > > > > > node contained > some threshold number of pages. > > > > > > > > I would simply go with something like this (not even compile tested): > > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm > > > not sure whether it is worth making the code more complicated for such > > > micro optimization or not. Removing __GFP_THISNODE or even removing > > > the node balance code should be fine too IMHO. TBH I doubt there would > > > be any noticeable difference. > > > > I do agree that an explicit nodes (quasi)round robin sounds over > > engineered. It makes some sense to try to target the prevalent node > > though because this code can be executed from khugepaged and therefore > > allocating with a completely different affinity than the original fault. > > Yeah, the corner case comes from the node balance code, it just tries > to balance between multiple prevalent nodes, so you agree to remove it > IIRC? Yeah, let's just collect all good nodes into a nodemask and keep __GFP_THISNODE in place. You can consider having the nodemask per collapse_control so that you allocate it only once in the struct lifetime. And as mentioned in other reply it would be really nice to hide this under CONFIG_NUMA (in a standalong follow up of course).
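A per-collapse_control nodemask, as suggested here, could be as simple as the following sketch; the field name is illustrative and not taken from any committed patch:

struct collapse_control {
	bool is_khugepaged;

	/* Num pages scanned per node */
	u32 node_load[MAX_NUMNODES];

	/* Refilled by hpage_collapse_find_target_node() on every scan */
	nodemask_t alloc_nmask;
};

hpage_collapse_find_target_node() would then clear and repopulate cc->alloc_nmask instead of taking a caller-allocated mask, and alloc_charge_hpage() would pass &cc->alloc_nmask to __alloc_pages(), avoiding a NODEMASK_ALLOC()/NODEMASK_FREE() pair on every collapse attempt.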
On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 02-11-22 10:36:07, Yang Shi wrote: > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote: > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > > > > > [...] > > > > > > This is slightly tangential - but I don't want to send a new mail > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > > > > > explicit node vs having hpage_collapse_find_target_node() set a > > > > > > nodemask. We could then provide fallback nodes for ties, or if some > > > > > > node contained > some threshold number of pages. > > > > > > > > > > I would simply go with something like this (not even compile tested): > > > > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm > > > > not sure whether it is worth making the code more complicated for such > > > > micro optimization or not. Removing __GFP_THISNODE or even removing > > > > the node balance code should be fine too IMHO. TBH I doubt there would > > > > be any noticeable difference. > > > > > > I do agree that an explicit nodes (quasi)round robin sounds over > > > engineered. It makes some sense to try to target the prevalent node > > > though because this code can be executed from khugepaged and therefore > > > allocating with a completely different affinity than the original fault. > > > > Yeah, the corner case comes from the node balance code, it just tries > > to balance between multiple prevalent nodes, so you agree to remove it > > IIRC? > > Yeah, let's just collect all good nodes into a nodemask and keep > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control > so that you allocate it only once in the struct lifetime. Actually my intention is more aggressive, just remove that node balance code. > > And as mentioned in other reply it would be really nice to hide this > under CONFIG_NUMA (in a standalong follow up of course). The hpage_collapse_find_target_node() function itself is defined under CONFIG_NUMA. > > -- > Michal Hocko > SUSE Labs
On Wed, Nov 2, 2022 at 11:18 AM Yang Shi <shy828301@gmail.com> wrote: > > On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 02-11-22 10:36:07, Yang Shi wrote: > > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote: > > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > > > > > > [...] > > > > > > > This is slightly tangential - but I don't want to send a new mail > > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > > > > > > explicit node vs having hpage_collapse_find_target_node() set a > > > > > > > nodemask. We could then provide fallback nodes for ties, or if some > > > > > > > node contained > some threshold number of pages. > > > > > > > > > > > > I would simply go with something like this (not even compile tested): > > > > > > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm > > > > > not sure whether it is worth making the code more complicated for such > > > > > micro optimization or not. Removing __GFP_THISNODE or even removing > > > > > the node balance code should be fine too IMHO. TBH I doubt there would > > > > > be any noticeable difference. > > > > > > > > I do agree that an explicit nodes (quasi)round robin sounds over > > > > engineered. It makes some sense to try to target the prevalent node > > > > though because this code can be executed from khugepaged and therefore > > > > allocating with a completely different affinity than the original fault. > > > > > > Yeah, the corner case comes from the node balance code, it just tries > > > to balance between multiple prevalent nodes, so you agree to remove it > > > IIRC? > > > > Yeah, let's just collect all good nodes into a nodemask and keep > > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control > > so that you allocate it only once in the struct lifetime. > > Actually my intention is more aggressive, just remove that node balance code. > The balancing code dates back to 2013 commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") where it was made to satisfy "numactl --interleave=all". I don't know why any real workloads would want this -- but there very well could be a valid use case. If not, I think it could be removed independent of what we do with __GFP_THISNODE and nodemask. Balancing aside -- I haven't fully thought through what an ideal (and further overengineered) solution would be for numa, but one (perceived - not measured) issue that khugepaged might have (MADV_COLLAPSE doesn't have the choice) is on systems with many, many nodes with source pages sprinkled across all of them. Should we collapse these pages into a single THP from the node with the most (but could still be a small %) pages? Probably there are better candidates. So, maybe a khugepaged-only check for max_value > (HPAGE_PMD_NR >> 1) or something makes sense. > > > > And as mentioned in other reply it would be really nice to hide this > > under CONFIG_NUMA (in a standalong follow up of course). > > The hpage_collapse_find_target_node() function itself is defined under > CONFIG_NUMA. > > > > > -- > > Michal Hocko > > SUSE Labs
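A khugepaged-only threshold like the one floated above could be expressed as a small helper along these lines; the helper name and the exact cut-off are hypothetical, since the message only suggests "max_value > (HPAGE_PMD_NR >> 1) or something":

/* Hypothetical: is the most loaded node worth collapsing into? */
static bool hpage_collapse_node_worthwhile(struct collapse_control *cc,
					   int max_value)
{
	/* MADV_COLLAPSE has to allocate somewhere regardless */
	if (!cc->is_khugepaged)
		return true;

	/* for khugepaged, require one node to back most of the PMD range */
	return max_value > (HPAGE_PMD_NR >> 1);
}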
On Wed, Nov 2, 2022 at 11:59 AM Zach O'Keefe <zokeefe@google.com> wrote: > > On Wed, Nov 2, 2022 at 11:18 AM Yang Shi <shy828301@gmail.com> wrote: > > > > On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Wed 02-11-22 10:36:07, Yang Shi wrote: > > > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote: > > > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > > > > > > > [...] > > > > > > > > This is slightly tangential - but I don't want to send a new mail > > > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > > > > > > > explicit node vs having hpage_collapse_find_target_node() set a > > > > > > > > nodemask. We could then provide fallback nodes for ties, or if some > > > > > > > > node contained > some threshold number of pages. > > > > > > > > > > > > > > I would simply go with something like this (not even compile tested): > > > > > > > > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm > > > > > > not sure whether it is worth making the code more complicated for such > > > > > > micro optimization or not. Removing __GFP_THISNODE or even removing > > > > > > the node balance code should be fine too IMHO. TBH I doubt there would > > > > > > be any noticeable difference. > > > > > > > > > > I do agree that an explicit nodes (quasi)round robin sounds over > > > > > engineered. It makes some sense to try to target the prevalent node > > > > > though because this code can be executed from khugepaged and therefore > > > > > allocating with a completely different affinity than the original fault. > > > > > > > > Yeah, the corner case comes from the node balance code, it just tries > > > > to balance between multiple prevalent nodes, so you agree to remove it > > > > IIRC? > > > > > > Yeah, let's just collect all good nodes into a nodemask and keep > > > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control > > > so that you allocate it only once in the struct lifetime. > > > > Actually my intention is more aggressive, just remove that node balance code. > > > > The balancing code dates back to 2013 commit 9f1b868a13ac ("mm: thp: > khugepaged: add policy for finding target node") where it was made to > satisfy "numactl --interleave=all". I don't know why any real > workloads would want this -- but there very well could be a valid use > case. If not, I think it could be removed independent of what we do > with __GFP_THISNODE and nodemask. Hmm... if the code is used for interleave, I don't think nodemask could preserve the behavior IIUC. The nodemask also tries to allocate memory from the preferred node, and fallback to the allowed nodes from nodemask when the allocation fails on the preferred node. But the round robin style node balance tries to distribute the THP on the nodes evenly. And I just thought of __GFP_THISNODE + nodemask should not be the right combination IIUC, right? __GFP_THISNODE does disallow any fallback, so nodemask is actually useless. So I think we narrowed down to two options: 1. Preserve the interleave behavior but bail out if the target node is not online (it is also racy, but doesn't hurt) 2. 
Remove the node balance code entirely > > Balancing aside -- I haven't fully thought through what an ideal (and > further overengineered) solution would be for numa, but one (perceived > - not measured) issue that khugepaged might have (MADV_COLLAPSE > doesn't have the choice) is on systems with many, many nodes with > source pages sprinkled across all of them. Should we collapse these > pages into a single THP from the node with the most (but could still > be a small %) pages? Probably there are better candidates. So, maybe a > khugepaged-only check for max_value > (HPAGE_PMD_NR >> 1) or something > makes sense. Anyway you have to allocate a THP on one node, I don't think of a better idea to make the node selection fairer. But I'd prefer to wait for real life usecase surfaces. > > > > > > > And as mentioned in other reply it would be really nice to hide this > > > under CONFIG_NUMA (in a standalong follow up of course). > > > > The hpage_collapse_find_target_node() function itself is defined under > > CONFIG_NUMA. > > > > > > > > -- > > > Michal Hocko > > > SUSE Labs
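Of the two options, the second reduces hpage_collapse_find_target_node() to the plain "most loaded node" scan. A sketch of what that would leave behind (again untested, shown only to make the comparison concrete):

static int hpage_collapse_find_target_node(struct collapse_control *cc)
{
	int nid, target_node = 0, max_value = 0;

	/* no balancing: simply take the node with the most scanned pages */
	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		if (cc->node_load[nid] > max_value) {
			max_value = cc->node_load[nid];
			target_node = nid;
		}
	}

	return target_node;
}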
On Wed, Nov 2, 2022 at 1:08 PM Yang Shi <shy828301@gmail.com> wrote: > > On Wed, Nov 2, 2022 at 11:59 AM Zach O'Keefe <zokeefe@google.com> wrote: > > > > On Wed, Nov 2, 2022 at 11:18 AM Yang Shi <shy828301@gmail.com> wrote: > > > > > > On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Wed 02-11-22 10:36:07, Yang Shi wrote: > > > > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote: > > > > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > > > > > > > > [...] > > > > > > > > > This is slightly tangential - but I don't want to send a new mail > > > > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > > > > > > > > explicit node vs having hpage_collapse_find_target_node() set a > > > > > > > > > nodemask. We could then provide fallback nodes for ties, or if some > > > > > > > > > node contained > some threshold number of pages. > > > > > > > > > > > > > > > > I would simply go with something like this (not even compile tested): > > > > > > > > > > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm > > > > > > > not sure whether it is worth making the code more complicated for such > > > > > > > micro optimization or not. Removing __GFP_THISNODE or even removing > > > > > > > the node balance code should be fine too IMHO. TBH I doubt there would > > > > > > > be any noticeable difference. > > > > > > > > > > > > I do agree that an explicit nodes (quasi)round robin sounds over > > > > > > engineered. It makes some sense to try to target the prevalent node > > > > > > though because this code can be executed from khugepaged and therefore > > > > > > allocating with a completely different affinity than the original fault. > > > > > > > > > > Yeah, the corner case comes from the node balance code, it just tries > > > > > to balance between multiple prevalent nodes, so you agree to remove it > > > > > IIRC? > > > > > > > > Yeah, let's just collect all good nodes into a nodemask and keep > > > > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control > > > > so that you allocate it only once in the struct lifetime. > > > > > > Actually my intention is more aggressive, just remove that node balance code. > > > > > > > The balancing code dates back to 2013 commit 9f1b868a13ac ("mm: thp: > > khugepaged: add policy for finding target node") where it was made to > > satisfy "numactl --interleave=all". I don't know why any real > > workloads would want this -- but there very well could be a valid use > > case. If not, I think it could be removed independent of what we do > > with __GFP_THISNODE and nodemask. > > Hmm... if the code is used for interleave, I don't think nodemask > could preserve the behavior IIUC. The nodemask also tries to allocate > memory from the preferred node, and fallback to the allowed nodes from > nodemask when the allocation fails on the preferred node. But the > round robin style node balance tries to distribute the THP on the > nodes evenly. Ya, I don't think this has anything to do with nodemask -- I think I inadvertently started a discussion about it and we now have 2 threads merged into one :) > And I just thought of __GFP_THISNODE + nodemask should not be the > right combination IIUC, right? __GFP_THISNODE does disallow any > fallback, so nodemask is actually useless. 
Ya I was confused when I read this the first time -- thanks for clarifying my understanding. > So I think we narrowed down to two options: > 1. Preserve the interleave behavior but bail out if the target node is > not online (it is also racy, but doesn't hurt) > 2. Remove the node balance code entirely > Agreed. Really comes down to if we care about that "numactl --interleave" use case. My inclination would be to just remove it -- if we didn't have that code today, and someone raised this use case and asked for the code to be added, I'm not sure it'd be approved. > > > > Balancing aside -- I haven't fully thought through what an ideal (and > > further overengineered) solution would be for numa, but one (perceived > > - not measured) issue that khugepaged might have (MADV_COLLAPSE > > doesn't have the choice) is on systems with many, many nodes with > > source pages sprinkled across all of them. Should we collapse these > > pages into a single THP from the node with the most (but could still > > be a small %) pages? Probably there are better candidates. So, maybe a > > khugepaged-only check for max_value > (HPAGE_PMD_NR >> 1) or something > > makes sense. > > Anyway you have to allocate a THP on one node, I don't think of a > better idea to make the node selection fairer. But I'd prefer to wait > for real life usecase surfaces. So, the thought here is that we don't _have_ to allocate a THP. We can bail-out, just as we do with max_ptes_*, when we think allocating a THP isn't beneficial. As mentioned, MADV_COLLAPSE still has to allocate a THP -- but khugepaged need not. I'm fine waiting on this until needed, however. > > > > > > > > > > And as mentioned in other reply it would be really nice to hide this > > > > under CONFIG_NUMA (in a standalong follow up of course). > > > > > > The hpage_collapse_find_target_node() function itself is defined under > > > CONFIG_NUMA. > > > > > > > > > > > -- > > > > Michal Hocko > > > > SUSE Labs
On Wed 02-11-22 11:58:26, Zach O'Keefe wrote: > On Wed, Nov 2, 2022 at 11:18 AM Yang Shi <shy828301@gmail.com> wrote: > > > > On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Wed 02-11-22 10:36:07, Yang Shi wrote: > > > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote: > > > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote: > > > > > > > [...] > > > > > > > > This is slightly tangential - but I don't want to send a new mail > > > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE + > > > > > > > > explicit node vs having hpage_collapse_find_target_node() set a > > > > > > > > nodemask. We could then provide fallback nodes for ties, or if some > > > > > > > > node contained > some threshold number of pages. > > > > > > > > > > > > > > I would simply go with something like this (not even compile tested): > > > > > > > > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm > > > > > > not sure whether it is worth making the code more complicated for such > > > > > > micro optimization or not. Removing __GFP_THISNODE or even removing > > > > > > the node balance code should be fine too IMHO. TBH I doubt there would > > > > > > be any noticeable difference. > > > > > > > > > > I do agree that an explicit nodes (quasi)round robin sounds over > > > > > engineered. It makes some sense to try to target the prevalent node > > > > > though because this code can be executed from khugepaged and therefore > > > > > allocating with a completely different affinity than the original fault. > > > > > > > > Yeah, the corner case comes from the node balance code, it just tries > > > > to balance between multiple prevalent nodes, so you agree to remove it > > > > IIRC? > > > > > > Yeah, let's just collect all good nodes into a nodemask and keep > > > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control > > > so that you allocate it only once in the struct lifetime. > > > > Actually my intention is more aggressive, just remove that node balance code. > > > > The balancing code dates back to 2013 commit 9f1b868a13ac ("mm: thp: > khugepaged: add policy for finding target node") where it was made to > satisfy "numactl --interleave=all". I don't know why any real > workloads would want this -- but there very well could be a valid use > case. If not, I think it could be removed independent of what we do > with __GFP_THISNODE and nodemask. Thanks for the reference. The patch is really dubious. If the primary usecase is a memory policy then one should be used. We have the vma handy. Sure per task policy would be a bigger problem but interleaving is a mere hint rather than something that has hard requirements. > Balancing aside -- I haven't fully thought through what an ideal (and > further overengineered) solution would be for numa, but one (perceived > - not measured) issue that khugepaged might have (MADV_COLLAPSE > doesn't have the choice) is on systems with many, many nodes with > source pages sprinkled across all of them. Should we collapse these > pages into a single THP from the node with the most (but could still > be a small %) pages? Probably there are better candidates. So, maybe a > khugepaged-only check for max_value > (HPAGE_PMD_NR >> 1) or something > makes sense. 
Honestly I do not see any problem to be solved here.
On Wed 02-11-22 13:08:08, Yang Shi wrote:
[...]
> So I think we narrowed down to two options:
> 1. Preserve the interleave behavior but bail out if the target node is
> not online (it is also racy, but doesn't hurt)

I do not think there is merit in preserving the interleave behavior; the
original patch is dubious to say the least.

> 2. Remove the node balance code entirely

Yes, removing the balancing makes sense but I would still hope that we
do not fail too easily if the range is populated on multiple nodes
equally. In practice it will likely not matter much I guess but setting
up all nodes with top score is just easy to achieve.
On Thu, Nov 3, 2022 at 12:54 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 02-11-22 13:08:08, Yang Shi wrote: > [...] > > So I think we narrowed down to two options: > > 1. Preserve the interleave behavior but bail out if the target node is > > not online (it is also racy, but doesn't hurt) > > I do not think there is merit in the interleave patch is dubious to say > the least. > > > 2. Remove the node balance code entirely > > Yes, removing the balancing makes sense but I would still hope that we > do not fail too easily if the range is populated on multiple nodes > equally. In practice it will likely not matter much I guess but setting > up all nodes with top score is just easy to achieve. OK, thanks. I will come up with a patch to allow fallback between the nodes, it should be largely based on the change suggested by you. > -- > Michal Hocko > SUSE Labs
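The follow-up Yang describes would mainly change the allocation side so that a fallback nodemask can take effect. Based only on what is quoted in this thread, the allocation helper would end up in roughly this shape; whether __GFP_THISNODE stays in the gfp mask is still the open question above, so take this as a sketch:

static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp,
				      int node, nodemask_t *nmask)
{
	/* preferred node first, then whatever the nodemask allows */
	*hpage = __alloc_pages(gfp, HPAGE_PMD_ORDER, node, nmask);
	if (unlikely(!*hpage)) {
		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
		return false;
	}
	/* remainder of the function (THP prep and accounting) unchanged */
	return true;
}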
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ef4aea3b356e..594d6dee5646 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -218,7 +218,6 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
 
 	return __alloc_pages(gfp_mask, order, nid, NULL);
 }
@@ -227,7 +226,6 @@ static inline struct folio *__folio_alloc_node(gfp_t gfp,
 		unsigned int order, int nid)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
 
 	return __folio_alloc(gfp, order, nid, NULL);
 }
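With the generic warnings gone, the "let the caller warn" idea from earlier in the thread would look something like this on the caller side. hypothetical_caller() is, as the name says, hypothetical; only a caller that actually synchronizes against hot-remove would want such a check next to its allocation:

static struct page *hypothetical_caller(int nid)
{
	/* this caller guarantees nid stays online, so it may warn itself */
	VM_WARN_ON_ONCE(!node_online(nid));

	return __alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);
}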