Message ID | 20240123045843.75969-1-shijie@os.amperecomputing.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-34715-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:2553:b0:103:945f:af90 with SMTP id p19csp124212dyi; Mon, 22 Jan 2024 21:00:29 -0800 (PST) X-Google-Smtp-Source: AGHT+IHNsTVYTlgvaPpjIaLYIVi9rVGZ6bSFA8/0neloKeRrZX573yPdOFYfUsnIA6yd5fPECPbc X-Received: by 2002:a17:906:c1c2:b0:a2e:a517:df64 with SMTP id bw2-20020a170906c1c200b00a2ea517df64mr1423431ejb.42.1705986029536; Mon, 22 Jan 2024 21:00:29 -0800 (PST) Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [147.75.80.249]) by mx.google.com with ESMTPS id k1-20020a17090666c100b00a2ceb076c21si10227334ejp.801.2024.01.22.21.00.29 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Jan 2024 21:00:29 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-34715-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) client-ip=147.75.80.249; Authentication-Results: mx.google.com; dkim=pass header.i=@os.amperecomputing.com header.s=selector2 header.b=m+AjAekQ; arc=fail (signature failed); spf=pass (google.com: domain of linux-kernel+bounces-34715-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-34715-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amperecomputing.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 04CD51F264CC for <ouuuleilei@gmail.com>; Tue, 23 Jan 2024 05:00:29 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id F05DB5242; Tue, 23 Jan 2024 05:00:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=os.amperecomputing.com header.i=@os.amperecomputing.com header.b="m+AjAekQ" Received: from NAM04-MW2-obe.outbound.protection.outlook.com (mail-mw2nam04on2113.outbound.protection.outlook.com [40.107.101.113]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 16C33468C; Tue, 23 Jan 2024 05:00:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=40.107.101.113 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1705986011; cv=fail; b=EGWiusQFXAOwDsaihCO/pwqoEleaF06xVKx16Z+6qZxj46RZUX/03GUt0+GkNAFv5Y9oGYOkr7r034tnjpLEWxd3xz4Wv1y0bxqv6VhQKfNl0u5xXoxiuOp1p8NbR0zPNSFzOU/vYudNqCUwW4Ee4kY0gx/++7nP6IVK3nA+6N8= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1705986011; c=relaxed/simple; bh=L979Qw/iaV24ouemHtWpyWkG+WRhHTZA/64ZShB4PaQ=; h=From:To:Cc:Subject:Date:Message-Id:Content-Type:MIME-Version; b=THApC3gPIvUc4m+A5Wyu7KC07vJu3Vwck2CcP5z/OWn8K51eXfyl2uECwoyjNldQZEE5WVSRwjT5lPqxursarFTzOXar4mj8+WGsELPl5+MSAwDPH3Bf4FmM8erAgaBghe6BKjRKQG4VakR5v+pVal46SephNS9N/xSX8byDz+o= ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=os.amperecomputing.com; spf=pass smtp.mailfrom=os.amperecomputing.com; dkim=pass (1024-bit key) header.d=os.amperecomputing.com header.i=@os.amperecomputing.com header.b=m+AjAekQ; arc=fail smtp.client-ip=40.107.101.113 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=os.amperecomputing.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=os.amperecomputing.com ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=WwIZwVw05GXifOwX+7u4/hT4BktsxgdX9gUq7jsv3dadOZCmikbqWZYDnV6pVuaQAQ1mQajsbieS+K3AvtTtioHNGihvJoPJWoLsnu1R3GOAa0oQwYMfVtnU9+tw0AimlsO4P4aIAgzzu1mDBpYUrI/mJNoMfvq7jPPEzSFfAFFxjKw27nwDpaue7Vjj+i+puOuP8WsVsVGIl5YHcXI1kF07E1DBqIVb7kaLVR13FSi0CaGbbABv55zyFJlzLexYw0wLqjbS+VrVuMWoLAeaXeqPigvBlcM97P/gePb2i+AwSlUKzTlrxEA/YCeZN/xCuN7JcPyRI8VQ7jvUSoe2Tg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=L8VMOla1IfoGtWF3wkPssQAMpxGPlgWmvfrDwBZXah8=; b=jbm04+Y9wNsQITlTP7oaVLwR2sdTuUxZTR6Oek8/WM7Wg9eNivbdsNncs1t3qBbCTTwftp6JFq7AuW2nQdcU0gXJWRYTzXEhs8vX08mFkurac4Xyk2iPj8GvULzcUdHf4ybn/GKFcmjSVxZqPY7+PK6X8ofbRchMmIq5+NPNiz/Zq6zwaiNd0tQphaaFubf+ePSIL9wZsHhSx8PI+s4+RDwN39cy1nob22YtsS/CD5iTzI/ROK8lRYnmiq1f8aM9jYlCIDNeVvbEbeBLJqxZei7T+91vr48UfM4+vPbnOKN/JthVCsmxMNlvBHHi6lcvSLb3wBH7gEKnr/l1NzJm5w== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=os.amperecomputing.com; dmarc=pass action=none header.from=os.amperecomputing.com; dkim=pass header.d=os.amperecomputing.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=os.amperecomputing.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=L8VMOla1IfoGtWF3wkPssQAMpxGPlgWmvfrDwBZXah8=; b=m+AjAekQwvzmwcaQLWU2c4FXfV4iIOXemadq1ERnmUxlVOiulnPl65nYBaXcVkdcjCkzehI/NLrnBHM1DMBPxB7HIhgts4fOJHVzmH3ZbkmLt90vhKM+4V+bpUeqnoKytZuWtZg2G2Wo9n/m166KukD7ZAIC8GC896Or24g3nlo= Authentication-Results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=os.amperecomputing.com; Received: from PH0PR01MB7975.prod.exchangelabs.com (2603:10b6:510:26d::15) by BL1PR01MB7865.prod.exchangelabs.com (2603:10b6:208:39c::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.37; Tue, 23 Jan 2024 05:00:06 +0000 Received: from PH0PR01MB7975.prod.exchangelabs.com ([fe80::91c:92f:45a5:e68a]) by PH0PR01MB7975.prod.exchangelabs.com ([fe80::91c:92f:45a5:e68a%6]) with mapi id 15.20.7202.026; Tue, 23 Jan 2024 05:00:05 +0000 From: Huang Shijie <shijie@os.amperecomputing.com> To: gregkh@linuxfoundation.org Cc: patches@amperecomputing.com, rafael@kernel.org, paul.walmsley@sifive.com, palmer@dabbelt.com, aou@eecs.berkeley.edu, yury.norov@gmail.com, kuba@kernel.org, vschneid@redhat.com, mingo@kernel.org, akpm@linux-foundation.org, vbabka@suse.cz, rppt@kernel.org, tglx@linutronix.de, jpoimboe@kernel.org, ndesaulniers@google.com, mikelley@microsoft.com, mhiramat@kernel.org, arnd@arndb.de, linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org, linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com, will@kernel.org, mark.rutland@arm.com, mpe@ellerman.id.au, linuxppc-dev@lists.ozlabs.org, chenhuacai@kernel.org, jiaxun.yang@flygoat.com, linux-mips@vger.kernel.org, cl@os.amperecomputing.com, Huang Shijie <shijie@os.amperecomputing.com> Subject: [PATCH v2] NUMA: Early use of cpu_to_node() returns 0 instead of the correct node id Date: Tue, 23 Jan 2024 12:58:43 +0800 Message-Id: <20240123045843.75969-1-shijie@os.amperecomputing.com> X-Mailer: git-send-email 2.40.1 Content-Transfer-Encoding: 8bit Content-Type: text/plain X-ClientProxiedBy: CH0PR03CA0384.namprd03.prod.outlook.com (2603:10b6:610:119::26) To PH0PR01MB7975.prod.exchangelabs.com (2603:10b6:510:26d::15) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: PH0PR01MB7975:EE_|BL1PR01MB7865:EE_ X-MS-Office365-Filtering-Correlation-Id: e587ba2d-86b5-4ad6-ea82-08dc1bd02a01 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: +dPw5AkUKDAFa6fB1cwtJupmWRsmlVWFpO/rFtQ6UaY0Hcb4jQxQ0voerUo4skt68kMU2LFdmg8u/GQvuiB+n44nhuuS8B9mnjwFn4MMjHs5iuiGO1wBvDWBxgVWP6CWVqSo1gACs7TuQecKIi8EnueOTvbiM/1TNxgm35qqHDojoEAnn+HG42M2lGzEHv54MEDpQA3EUnOh5HVmujk2A4KkC5v3VB5aOcr/nS27n1BkpeJdGdmnse+JxGXNVO9mKaqizwTaTPpVL+g6SD4kmpYRSW1/lkxIZmbNfIS55V47nQsi6bn5oT6nzE4nd+CzjoW2BiCkRUxsuxZBmsFgW1falAFAkXfBQV0kr0FAgmlM0cb7HHm+ORtxl7gtI6mnaeTu/TsXtm4vMAV8Tmj1SfPdnifMU9rX7mpq3motbCYKa9DioieZpT2Mb0f/wW07UQMxyqGlfztbxpkWrxvbrJZ6Zy5KrRUsSnyZYmwYMNdCewawXETbsFWLktsRpYCiXJ3rCN6KApJ+KEVjsYYv85JYpUVZKvSA8qKJwLvEaQ6j+l9s6+O/J85gvIUIcBnihjY/LFMDKrcQv4uoHi4RnVbDAiTxoCU68ZrZ2wLtRVo= X-Forefront-Antispam-Report: CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:PH0PR01MB7975.prod.exchangelabs.com;PTR:;CAT:NONE;SFS:(13230031)(396003)(346002)(136003)(366004)(39850400004)(376002)(230922051799003)(186009)(64100799003)(1800799012)(451199024)(6666004)(86362001)(66476007)(66556008)(66946007)(107886003)(316002)(6916009)(2616005)(38350700005)(6512007)(41300700001)(52116002)(6506007)(5660300002)(8936002)(8676002)(4326008)(83380400001)(38100700002)(1076003)(26005)(7416002)(2906002)(478600001)(6486002)(966005);DIR:OUT;SFP:1102; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: fqghm1328fx4Erk9LUmtZxirlH34qnrW7vkFv1ctbjWLlHKxU20vcrZRVRfuljM5Mb38oKRJ9HUQgDAEGvsOmGn7T86xTPDSlhDSidLyHFrghqHi+2cBzh9fqe4MxQHH6JPkcwTxG1nvsoWgfdYv3vqhIG37FQzLsHVjBXgkFTg0Mtw5tneQkKd6XKcevkSbm9keBudBeCc0rByr4PtcsbECgWvCt1HGyppD9/06BctP/oeMalUQkDPTczuWSeuaotN9CAZMTyS0gFhioFt3M1JWV3KglYZ/jbjKmFh0EElO6mutVFYNkm8yKg35pYi+YKK7lLDPkfiGpmDD2PObSUdS8bbdMJmxqibdHBpABWy0f64bxL6uj3p7KnzQzxc4rNp+1bmeekPorGK81sojUwgFyyG5EJ/wDJpC9gx5+qB+vvQpCLQGPLAxb7/5h+weNb7/qx5rKL4XSgZFXDr2jRLvarLRO/NOtVczoMRY0SGHRNcaJ6LwITUFNMbZ9VdZLg5X0vowFNwvm0lbjhqz4su48kvBZaTkk7AMUR2CAvfdFT6n35nOfh30mZFOje4SutEQxHOLzPy/yr6mWnHUYgWljAapEqkHFl+1mm5Exwt4/Bue6W64pUk8WZh9d+aYsIu06Ow7/pp23407i2LJ2ICVD6HBnqFj3z0Rf/HUtHAdYSR03IR1zj3Ee7dqA4x9AQRDIQNBp0V7kky8cZPu/GwCYrJA9htFddx+bNKAzwqiY8Dd9UGqSSmWWqQlbWkpj1xGtHdmPv2phYiPEJyitBjPQM/GqroFccQhMsrdJ/OdlmgSzsaFuaypxF+U0NkpFCs7qNRsgw+KH2HxnfQNOgNYalg7DxKNY+eWS721J4xBp+s68hCQiTq9B4mDLPeW9OwqJMHz3sfxnbH+TerTFJUAPSth2NRp34t1wuUxIyNhUqzgeZII65AUCSQcV5eKzC6nNgouJffoITKEdxQ+6A6QFQYwlUD394Ek6eF/M8DSrUG5cc3g8MY1xg+JYqx29mtnk91DNEbl3NOPAZgufElxMpUHNzpvjMbHktJby7vhlpHYn1V2g6P3/W1WxSQW4+rq/qYzFu92EjqZ9G+jSpIuqGT1YfwEiC14gIdRcWoEAHEsX1rkhinJKMZNFHKD4YcSs7Nu3b4r2vS1gm1uf+lIwnyucyUJDD0VGhW3QITL9fW2hCFlQf3K5mBfFUAPZx4szOqEpYaY2+Q2uuWqArJ8ou5KI5QP9G6XCb/5W+qb2ZxfuCcZqDAqYuenche9AmiXNKikcibRDRCBQHVUAuwFphLGW7365hO3HScdKWeJmX9f7fYTEf2104qTKRf4Hd1AlAeJYKW9WSrxcLTHleNt31zXrmHhox7alRkcrnVxoioE87Ki+nVDoFspbX6WUnGJ0vhKwMcviLYwtmTrqQVY+kfI7ztVeAQrHzFJyBmE7+S0Aqs6W6rRolrWojN/uL7CaG5MQSk6riLMsVp/k8RydfA/i0NbscMe/3B2JBE/e/1sEgOkPRURzhAEIIwf1Wn9X/j08nqD24Vv+Y2bN1pgJ2NIqsB7xwHjf9EO9ry2BLMY4L4Qj0wNg8Fn1fGWUaT0wSrNd7q4ekpjxbaejtCzs6nRN4KKy2PGAeMvO8Q= X-OriginatorOrg: os.amperecomputing.com X-MS-Exchange-CrossTenant-Network-Message-Id: e587ba2d-86b5-4ad6-ea82-08dc1bd02a01 X-MS-Exchange-CrossTenant-AuthSource: PH0PR01MB7975.prod.exchangelabs.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 23 Jan 2024 05:00:05.7255 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 3bc2b170-fd94-476d-b0ce-4229bdc904a7 X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: 0En8K7/fzhbbvkh2ervdoWUhcfdH/JRiWXlMNnHUiV+XvFu/WY4HNWZU5BqloIkoh/Rj+OW9K5QXJytl4QBJKC9aLaFi0NasK44tSqUdj6+PbvAYH5RE0lgbRH9TZz0/ X-MS-Exchange-Transport-CrossTenantHeadersStamped: BL1PR01MB7865 X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1788856006565063060 X-GMAIL-MSGID: 1788856006565063060 |
Series |
[v2] NUMA: Early use of cpu_to_node() returns 0 instead of the correct node id
|
|
Commit Message
Huang Shijie
Jan. 23, 2024, 4:58 a.m. UTC
During the kernel booting, the generic cpu_to_node() is called too early in
arm64, powerpc and riscv when CONFIG_NUMA is enabled.
For arm64/powerpc/riscv, there are at least four places in the common code
where the generic cpu_to_node() is called before it is initialized:
1.) early_trace_init() in kernel/trace/trace.c
2.) sched_init() in kernel/sched/core.c
3.) init_sched_fair_class() in kernel/sched/fair.c
4.) workqueue_init_early() in kernel/workqueue.c
In order to fix the bug, the patch changes generic cpu_to_node to
function pointer, and export it for kernel modules.
Introduce smp_prepare_boot_cpu_start() to wrap the original
smp_prepare_boot_cpu(), and set cpu_to_node with early_cpu_to_node.
Introduce smp_prepare_cpus_done() to wrap the original smp_prepare_cpus(),
and set the cpu_to_node to formal _cpu_to_node().
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
v1 --> v2:
In order to fix the x86 compiling error, move the cpu_to_node()
from driver/base/arch_numa.c to driver/base/node.c.
v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2024-January/896160.html
An old different title patch:
http://lists.infradead.org/pipermail/linux-arm-kernel/2024-January/895963.html
---
drivers/base/node.c | 11 +++++++++++
include/linux/topology.h | 6 ++----
init/main.c | 29 +++++++++++++++++++++++++++--
3 files changed, 40 insertions(+), 6 deletions(-)
Comments
On Tue, 23 Jan 2024, Huang Shijie wrote: > During the kernel booting, the generic cpu_to_node() is called too early in > arm64, powerpc and riscv when CONFIG_NUMA is enabled. > > For arm64/powerpc/riscv, there are at least four places in the common code > where the generic cpu_to_node() is called before it is initialized: > 1.) early_trace_init() in kernel/trace/trace.c > 2.) sched_init() in kernel/sched/core.c > 3.) init_sched_fair_class() in kernel/sched/fair.c > 4.) workqueue_init_early() in kernel/workqueue.c > > In order to fix the bug, the patch changes generic cpu_to_node to > function pointer, and export it for kernel modules. > Introduce smp_prepare_boot_cpu_start() to wrap the original > smp_prepare_boot_cpu(), and set cpu_to_node with early_cpu_to_node. > Introduce smp_prepare_cpus_done() to wrap the original smp_prepare_cpus(), > and set the cpu_to_node to formal _cpu_to_node(). Would you please fix this cleanly without a function pointer? What I think needs to be done is a patch series. 1. Instrument cpu_to_node so that some warning is issued if it is used too early. Preloading the array with NUMA_NO_NODE would allow us to do that. 2. Implement early_cpu_to_node on platforms that currently do not have it. 3. A series of patches that fix each place where cpu_to_node is used too early.
On Wed, Jan 24, 2024 at 09:19:00AM -0800, Lameter, Christopher wrote: > On Tue, 23 Jan 2024, Huang Shijie wrote: > > > During the kernel booting, the generic cpu_to_node() is called too early in > > arm64, powerpc and riscv when CONFIG_NUMA is enabled. > > > > For arm64/powerpc/riscv, there are at least four places in the common code > > where the generic cpu_to_node() is called before it is initialized: > > 1.) early_trace_init() in kernel/trace/trace.c > > 2.) sched_init() in kernel/sched/core.c > > 3.) init_sched_fair_class() in kernel/sched/fair.c > > 4.) workqueue_init_early() in kernel/workqueue.c > > > > In order to fix the bug, the patch changes generic cpu_to_node to > > function pointer, and export it for kernel modules. > > Introduce smp_prepare_boot_cpu_start() to wrap the original > > smp_prepare_boot_cpu(), and set cpu_to_node with early_cpu_to_node. > > Introduce smp_prepare_cpus_done() to wrap the original smp_prepare_cpus(), > > and set the cpu_to_node to formal _cpu_to_node(). > > Would you please fix this cleanly without a function pointer? > > What I think needs to be done is a patch series. > > 1. Instrument cpu_to_node so that some warning is issued if it is used too > early. Preloading the array with NUMA_NO_NODE would allow us to do that. By preloading do you mean compile-time initialization? > 2. Implement early_cpu_to_node on platforms that currently do not have it. > > 3. A series of patches that fix each place where cpu_to_node is used too > early. Agree. This is the right way to go. And pretty well all of it was discussed in v1, isn't? Thanks, Yury
在 2024/1/25 1:19, Lameter, Christopher 写道: > On Tue, 23 Jan 2024, Huang Shijie wrote: > >> During the kernel booting, the generic cpu_to_node() is called too >> early in >> arm64, powerpc and riscv when CONFIG_NUMA is enabled. >> >> For arm64/powerpc/riscv, there are at least four places in the common >> code >> where the generic cpu_to_node() is called before it is initialized: >> 1.) early_trace_init() in kernel/trace/trace.c >> 2.) sched_init() in kernel/sched/core.c >> 3.) init_sched_fair_class() in kernel/sched/fair.c >> 4.) workqueue_init_early() in kernel/workqueue.c >> >> In order to fix the bug, the patch changes generic cpu_to_node to >> function pointer, and export it for kernel modules. >> Introduce smp_prepare_boot_cpu_start() to wrap the original >> smp_prepare_boot_cpu(), and set cpu_to_node with early_cpu_to_node. >> Introduce smp_prepare_cpus_done() to wrap the original >> smp_prepare_cpus(), >> and set the cpu_to_node to formal _cpu_to_node(). > > Would you please fix this cleanly without a function pointer? > > What I think needs to be done is a patch series. > > 1. Instrument cpu_to_node so that some warning is issued if it is used > too early. Preloading the array with NUMA_NO_NODE would allow us to do > that. > > 2. Implement early_cpu_to_node on platforms that currently do not have > it. > > 3. A series of patches that fix each place where cpu_to_node is used > too early. okay, I will try to do it. Thanks Huang Shijie
On Wed, Jan 24, 2024 at 09:19:00AM -0800, Lameter, Christopher wrote: > On Tue, 23 Jan 2024, Huang Shijie wrote: > > > During the kernel booting, the generic cpu_to_node() is called too early in > > arm64, powerpc and riscv when CONFIG_NUMA is enabled. > > > > For arm64/powerpc/riscv, there are at least four places in the common code > > where the generic cpu_to_node() is called before it is initialized: > > 1.) early_trace_init() in kernel/trace/trace.c > > 2.) sched_init() in kernel/sched/core.c > > 3.) init_sched_fair_class() in kernel/sched/fair.c > > 4.) workqueue_init_early() in kernel/workqueue.c > > > > In order to fix the bug, the patch changes generic cpu_to_node to > > function pointer, and export it for kernel modules. > > Introduce smp_prepare_boot_cpu_start() to wrap the original > > smp_prepare_boot_cpu(), and set cpu_to_node with early_cpu_to_node. > > Introduce smp_prepare_cpus_done() to wrap the original smp_prepare_cpus(), > > and set the cpu_to_node to formal _cpu_to_node(). > > Would you please fix this cleanly without a function pointer? > > What I think needs to be done is a patch series. > > 1. Instrument cpu_to_node so that some warning is issued if it is used too > early. Preloading the array with NUMA_NO_NODE would allow us to do that. > > 2. Implement early_cpu_to_node on platforms that currently do not have it. > > 3. A series of patches that fix each place where cpu_to_node is used too > early. I think step 3 can be simplified with a generic function that sets per_cpu(numa_node) using early_cpu_to_node(). It can be called right after setup_per_cpu_areas().
在 2024/1/25 15:31, Mike Rapoport 写道: > On Wed, Jan 24, 2024 at 09:19:00AM -0800, Lameter, Christopher wrote: >> On Tue, 23 Jan 2024, Huang Shijie wrote: >> >>> During the kernel booting, the generic cpu_to_node() is called too early in >>> arm64, powerpc and riscv when CONFIG_NUMA is enabled. >>> >>> For arm64/powerpc/riscv, there are at least four places in the common code >>> where the generic cpu_to_node() is called before it is initialized: >>> 1.) early_trace_init() in kernel/trace/trace.c >>> 2.) sched_init() in kernel/sched/core.c >>> 3.) init_sched_fair_class() in kernel/sched/fair.c >>> 4.) workqueue_init_early() in kernel/workqueue.c >>> >>> In order to fix the bug, the patch changes generic cpu_to_node to >>> function pointer, and export it for kernel modules. >>> Introduce smp_prepare_boot_cpu_start() to wrap the original >>> smp_prepare_boot_cpu(), and set cpu_to_node with early_cpu_to_node. >>> Introduce smp_prepare_cpus_done() to wrap the original smp_prepare_cpus(), >>> and set the cpu_to_node to formal _cpu_to_node(). >> Would you please fix this cleanly without a function pointer? >> >> What I think needs to be done is a patch series. >> >> 1. Instrument cpu_to_node so that some warning is issued if it is used too >> early. Preloading the array with NUMA_NO_NODE would allow us to do that. >> >> 2. Implement early_cpu_to_node on platforms that currently do not have it. >> >> 3. A series of patches that fix each place where cpu_to_node is used too >> early. For step 3, I find it it hard to change the cpu_to_node() to early_cpu_to_node() for early_trace_init(). In early_trace_init(), the __ring_buffer_alloc() calls the cpu_to_node(). In order to fix the bug, we should use early_cpu_to_node() for __ring_buffer_alloc(). But __ring_buffer_alloc() is also used by the kernel after the booting finished. After the booting finishes, we should use the cpu_to_node(), not the early_cpu_to_node(). > I think step 3 can be simplified with a generic function that sets > per_cpu(numa_node) using early_cpu_to_node(). It can be called right after > setup_per_cpu_areas(). I think this method maybe better.. I will try this too. Thanks Huang Shijie
diff --git a/drivers/base/node.c b/drivers/base/node.c index 1c05640461dd..477d58c12886 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -976,3 +976,14 @@ void __init node_dev_init(void) panic("%s() failed to add node: %d\n", __func__, ret); } } + +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +#ifndef cpu_to_node +int _cpu_to_node(int cpu) +{ + return per_cpu(numa_node, cpu); +} +int (*cpu_to_node)(int cpu); +EXPORT_SYMBOL(cpu_to_node); +#endif +#endif diff --git a/include/linux/topology.h b/include/linux/topology.h index 52f5850730b3..e7ce2bae11dd 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -91,10 +91,8 @@ static inline int numa_node_id(void) #endif #ifndef cpu_to_node -static inline int cpu_to_node(int cpu) -{ - return per_cpu(numa_node, cpu); -} +extern int (*cpu_to_node)(int cpu); +extern int _cpu_to_node(int cpu); #endif #ifndef set_numa_node diff --git a/init/main.c b/init/main.c index e24b0780fdff..b142e9c51161 100644 --- a/init/main.c +++ b/init/main.c @@ -870,6 +870,18 @@ static void __init print_unknown_bootoptions(void) memblock_free(unknown_options, len); } +static void __init smp_prepare_boot_cpu_start(void) +{ + smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ + +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +#ifndef cpu_to_node + /* The early_cpu_to_node should be ready now. */ + cpu_to_node = early_cpu_to_node; +#endif +#endif +} + asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector void start_kernel(void) { @@ -899,7 +911,7 @@ void start_kernel(void) setup_command_line(command_line); setup_nr_cpu_ids(); setup_per_cpu_areas(); - smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ + smp_prepare_boot_cpu_start(); boot_cpu_hotplug_init(); pr_notice("Kernel command line: %s\n", saved_command_line); @@ -1519,6 +1531,19 @@ void __init console_on_rootfs(void) fput(file); } +static void __init smp_prepare_cpus_done(unsigned int setup_max_cpus) +{ + /* Different ARCHs may override smp_prepare_cpus() */ + smp_prepare_cpus(setup_max_cpus); + +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +#ifndef cpu_to_node + /* Change to the formal function. */ + cpu_to_node = _cpu_to_node; +#endif +#endif +} + static noinline void __init kernel_init_freeable(void) { /* Now the scheduler is fully set up and can do blocking allocations */ @@ -1531,7 +1556,7 @@ static noinline void __init kernel_init_freeable(void) cad_pid = get_pid(task_pid(current)); - smp_prepare_cpus(setup_max_cpus); + smp_prepare_cpus_done(setup_max_cpus); workqueue_init();