Message ID: 20240207225632.159276-3-avadhut.naik@amd.com
State: New
Series: Extend size of the MCE Records pool
Headers:
    From: Avadhut Naik <avadhut.naik@amd.com>
    To: <x86@kernel.org>, <linux-edac@vger.kernel.org>
    CC: <bp@alien8.de>, <tony.luck@intel.com>, <linux-kernel@vger.kernel.org>, <yazen.ghannam@amd.com>, <avadnaik@amd.com>
    Subject: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool
    Date: Wed, 7 Feb 2024 16:56:32 -0600
    Message-ID: <20240207225632.159276-3-avadhut.naik@amd.com>
    In-Reply-To: <20240207225632.159276-1-avadhut.naik@amd.com>
    References: <20240207225632.159276-1-avadhut.naik@amd.com>
Commit Message
Avadhut Naik
Feb. 7, 2024, 10:56 p.m. UTC
Extension of MCE Records pool, based on system's CPU count, was undertaken
through the previous patch (x86/MCE: Extend size of the MCE Records pool).

Add a new command line parameter "mce-genpool-extend" to set the size of
MCE Records pool to a predetermined number of pages instead of system's
CPU count.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
---
.../admin-guide/kernel-parameters.txt | 2 ++
arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++---
2 files changed, 21 insertions(+), 3 deletions(-)
Comments
On 2/7/2024 2:56 PM, Avadhut Naik wrote:
> Extension of MCE Records pool, based on system's CPU count, was undertaken
> through the previous patch (x86/MCE: Extend size of the MCE Records pool).
>

This statement is unnecessary.

> Add a new command line parameter "mce-genpool-extend" to set the size of
> MCE Records pool to a predetermined number of pages instead of system's
> CPU count.
>

Like Tony, I am unsure of when this would be useful.

Also, why does it need to override the CPU count based extension mechanism? Would just adding more pages not work for them?

If there really is a good reason to do this, how about changing mce-genpool-extend to mce-genpool-add-pages and keeping the description the same?

	mce-genpool-add-pages= [X86-64] Number of pages to add to MCE Records pool.

Then you can simply add this many additional pages to the new CPU based mechanism.

Sohil

> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> ---
>  .../admin-guide/kernel-parameters.txt |  2 ++
>  arch/x86/kernel/cpu/mce/genpool.c     | 22 ++++++++++++++++---
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
Hi,

On 2/8/2024 19:36, Sohil Mehta wrote:
> On 2/7/2024 2:56 PM, Avadhut Naik wrote:
>> Extension of MCE Records pool, based on system's CPU count, was undertaken
>> through the previous patch (x86/MCE: Extend size of the MCE Records pool).
>>
>
> This statement is unnecessary.
>

Noted.

>> Add a new command line parameter "mce-genpool-extend" to set the size of
>> MCE Records pool to a predetermined number of pages instead of system's
>> CPU count.
>>
>
> Like Tony, I am unsure of when this would be useful.
>
> Also, why does it need to override the CPU count based extension
> mechanism? Would just adding more pages not work for them?
>
> If there really is a good reason to do this, how about changing
> mce-genpool-extend to mce-genpool-add-pages and keeping the description
> the same?
>
> mce-genpool-add-pages= [X86-64] Number of pages to add to MCE Records pool.
>
> Then you can simply add this many additional pages to the new
> CPU based mechanism.
>

Is it safe to assume that users will always want to increase the size of the pool and not decrease it?

IMO, the command-line option provides flexibility for users to choose the size of MCE Records pool in case they don't agree with the CPU count logic. Just added it to ensure that we are not enforcing this increased memory footprint across the board.

Would you agree?
On Fri, Feb 09, 2024 at 02:02:49PM -0600, Naik, Avadhut wrote:
> Is it safe to assume that users will always want to increase the size
> of the pool and not decrease it?

Why don't you make the gen pool size a function of the number of CPUs on the system and have it all work automagically?

Burdening the user with yet another cmdline switch is a bad idea. We have way too many as it is.

This stuff should work out-of-the-box, without user intervention if possible. And it is possible in this case.

Thx.
On 2/9/2024 12:02 PM, Naik, Avadhut wrote:
> Is it safe to assume that users will always want to increase the size of the
> pool and not decrease it?
>
> IMO, the command-line option provides flexibility for users to choose the size of
> MCE Records pool in case they don't agree with the CPU count logic. Just added it
> to ensure that we are not enforcing this increased memory footprint across the board.
>
> Would you agree?
>

Not really. Providing this level of configuration seems excessive and unnecessary. To me, it seems that we are over-compensating with the calculations in the previous patch and then providing a mechanism to correct it here and putting this burden on the user.

How about being more conservative with the allocations in the previous patch so that we don't need to introduce this additional mechanism right now? Later, if there is really a need for some specific usage, the patch can be re-submitted then with the supporting data.

Sohil
> How about being more conservative with the allocations in the previous
> patch so that we don't need to introduce this additional mechanism right
> now? Later, if there is really a need for some specific usage, the patch
> can be re-submitted then with the supporting data.

There used to be a rule-of-thumb when configuring systems to have at least one GByte of memory per CPU. Anyone following that rule shouldn't be worried about sub-kilobyte allocations per CPU.

-Tony
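For scale (editor's arithmetic, using the 256 bytes-per-CPU figure from patch 1/2): even a 4096-CPU machine would grow the pool by only 4096 * 256 bytes = 1 MiB, while the one-GByte-per-CPU rule of thumb would put 4 TiB of memory in the same box.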
Hi,

On 2/9/2024 14:09, Borislav Petkov wrote:
> On Fri, Feb 09, 2024 at 02:02:49PM -0600, Naik, Avadhut wrote:
>> Is it safe to assume that users will always want to increase the size
>> of the pool and not decrease it?
>
> Why don't you make the gen pool size a function of the number of CPUs on
> the system and have it all work automagically?
>

IIUC, this is exactly what the first patch in this series is trying to accomplish. Please correct me if I understood wrong.

> Burdening the user with yet another cmdline switch is a bad idea. We
> have way too many as it is.
>
> This stuff should work out-of-the-box, without user intervention if
> possible. And it is possible in this case.
>

Okay. Will drop the command line argument.
On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote:
> IIUC, this is exactly what the first patch in this series is trying to
> accomplish. Please correct me if I understood wrong.

Yes, you did.

I don't mean to extend it - I mean to allocate it from the very beginning to

	min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);

There's a sane minimum and one page per logical CPU should be fine on pretty much every configuration...

Thx.
On 2/9/2024 12:28 PM, Luck, Tony wrote:
>> How about being more conservative with the allocations in the previous
>> patch so that we don't need to introduce this additional mechanism right
>> now? Later, if there is really a need for some specific usage, the patch
>> can be re-submitted then with the supporting data.
>
> There used to be a rule-of-thumb when configuring systems to have at least
> one GByte of memory per CPU. Anyone following that rule shouldn't be
> worried about sub-kilobyte allocations per CPU.
>

I meant, to avoid the need for this second patch we can always start lower and increase it later. 256 bytes per cpu seems fine to me as done in the previous patch. But, if that seems too high as described by Avadhut below then maybe we can start with 200 bytes or any other number. It's just a heuristic IIUC.

https://lore.kernel.org/lkml/8d2d0dac-b188-4826-a43a-bb5fc0528f0d@amd.com/

Sohil
On February 9, 2024 9:51:11 PM GMT+01:00, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote:
>> IIUC, this is exactly what the first patch in this series is trying to
>> accomplish. Please correct me if I understood wrong.
>
> Yes, you did.
>
> I don't mean to extend it - I mean to allocate it from the very
> beginning to
>
>	min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);

max() ofc.

> There's a sane minimum and one page per logical CPU should be fine on
> pretty much every configuration...
>
> Thx.
>
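As an illustration (an editor's sketch, not code from the series; mce_pool_size() is an invented name), the up-front sizing rule being proposed here, with the max() correction, would amount to roughly:

	/* Sketch: pick the pool size once, at creation time. */
	static unsigned long mce_pool_size(void)
	{
		/* Sane minimum of four pages; one page per logical CPU beyond that. */
		return max(4 * PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
	}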
Hi,

On 2/10/2024 01:52, Borislav Petkov wrote:
> On February 9, 2024 9:51:11 PM GMT+01:00, Borislav Petkov <bp@alien8.de> wrote:
>> On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote:
>>> IIUC, this is exactly what the first patch in this series is trying to
>>> accomplish. Please correct me if I understood wrong.
>>
>> Yes, you did.
>>
>> I don't mean to extend it - I mean to allocate it from the very
>> beginning to
>>
>>	min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
>

IIUC, you wouldn't want to extend the pool through late_initcall(). Instead, you would want for memory to be allocated (on the heap) and size of the pool to be set at the very beginning i.e. when the pool is created (~2 seconds, according to dmesg timestamps).

Please correct me if I have understood wrong.

I was actually doing something similar initially (setting the size of the pool right after its creation) but then went with extending the pool since I wasn't sure if heap allocations should be undertaken that early during bootup.

> max() ofc.
>

Thanks! This does clear up one part of my confusion.
On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote:
> IIUC, you wouldn't want to extend the pool through late_initcall().
> Instead, you would want for memory to be allocated (on the heap) and
> size of the pool to be set at the very beginning i.e. when the pool
> is created (~2 seconds, according to dmesg timestamps).
>
> Please correct me if I have understood wrong.

Nah, you got it right. I went, looked and realized that we have to do this early dance because we have no allocator yet. And we can't move this gen_pool allocation to later, when we *do* have an allocator because MCA is up and logging already.

But your extending approach doesn't fly in all cases either:

	gen_pool_add->gen_pool_add_virt->gen_pool_add_owner

it grabs the pool->lock spinlock and adds to &pool->chunks while, at the exact same time, gen_pool_alloc(), in *NMI* context iterates over that same &pool->chunks in the case we're logging an MCE at exact that same time when you're extending the buffer.

And Tony already said that in the thread you're quoting:

https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/

So no, that doesn't work either.

Thx.
Hi,

On 2/11/2024 05:14, Borislav Petkov wrote:
> On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote:
>> IIUC, you wouldn't want to extend the pool through late_initcall().
>> Instead, you would want for memory to be allocated (on the heap) and
>> size of the pool to be set at the very beginning i.e. when the pool
>> is created (~2 seconds, according to dmesg timestamps).
>>
>> Please correct me if I have understood wrong.
>
> Nah, you got it right. I went, looked and realized that we have to do
> this early dance because we have no allocator yet. And we can't move
> this gen_pool allocation to later, when we *do* have an allocator
> because MCA is up and logging already.
>

Okay. Will make changes to allocate memory and set size of the pool when it is created. Also, will remove the command line parameter and resubmit.

> But your extending approach doesn't fly in all cases either:
>
>	gen_pool_add->gen_pool_add_virt->gen_pool_add_owner
>
> it grabs the pool->lock spinlock and adds to &pool->chunks while, at the
> exact same time, gen_pool_alloc(), in *NMI* context iterates over that
> same &pool->chunks in the case we're logging an MCE at exact that same
> time when you're extending the buffer.
>
> And Tony already said that in the thread you're quoting:
>
> https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/
>
> So no, that doesn't work either.
>
> Thx.

Thanks for this explanation!
On Sun, Feb 11, 2024 at 08:54:29PM -0600, Naik, Avadhut wrote:
> Okay. Will make changes to allocate memory and set size of the pool
> when it is created. Also, will remove the command line parameter and
> resubmit.

Before you do, go read that original thread again but this time take your time to grok it.

And then try answering those questions:

* Why are *you* fixing this? I know what the AWS reason is, what is yours?

* Can you think of a slick deduplication scheme instead of blindly raising the buffer size?

* What's wrong with not logging some early errors, can we live with that too? If it were firmware-first, it cannot simply extend its buffer size because it has limited space. So what does firmware do in such cases?

Think long and hard about the big picture, analyze the problem properly and from all angles before you go and do patches.

Thx.
On Mon, Feb 12, 2024 at 09:58:01AM +0100, Borislav Petkov wrote:
> * Can you think of a slick deduplication scheme instead of blindly
> raising the buffer size?

And here's the simplest scheme: you don't extend the buffer. On overflow, you say "Buffer full, here's the MCE" and you dump the error log into dmesg. Problem solved.

A slicker deduplication scheme would be even better, tho. Maybe struct mce.count which gets incremented instead of adding the error record to the buffer again. And so on...
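A minimal sketch of that fallback, assuming the existing mce_gen_pool_add() and mce_event_llist machinery in genpool.c (the in-tree function already warns on a full pool; dumping the record itself is the new part, and whether printing is safe in #MC context is questioned later in the thread):

	int mce_gen_pool_add(struct mce *mce)
	{
		struct mce_evt_llist *node;

		if (!mce_evt_pool)
			return -EINVAL;

		node = (void *)gen_pool_alloc(mce_evt_pool, sizeof(*node));
		if (!node) {
			/* Pool exhausted: don't grow it, dump the record instead. */
			pr_warn_ratelimited("MCE records pool full!\n");
			/* sketch: a __print_mce()-style dump of *mce would go here */
			return -ENOMEM;
		}

		memcpy(&node->mce, mce, sizeof(*mce));
		llist_add(&node->llnode, &mce_event_llist);

		return 0;
	}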
> And here's the simplest scheme: you don't extend the buffer. On
> overflow, you say "Buffer full, here's the MCE" and you dump the error
> log into dmesg. Problem solved.
>
> A slicker deduplication scheme would be even better, tho. Maybe struct
> mce.count which gets incremented instead of adding the error record to
> the buffer again. And so on...

Walking the structures already allocated from the genpool in the #MC handler may be possible, but what is the criteria for "duplicates"? Do we avoid entering duplicates into the pool altogether? Or when the pool is full overwrite a duplicate?

How about compile time allocation of extra space. Algorithm below for illustrative purposes only. May need some more thought about how to scale up.

-Tony

[Diff pasted into Outlook, chances that it will automatically apply = 0%]

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..0fc2925c0839 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,10 +16,15 @@
  * used to save error information organized in a lock-less list.
  *
  * This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, so a fixed size memory pool should be enough.
+ * Low CPU count systems allocate 2 pages (enough for ~64 "struct mce"
+ * records). Large systems scale up the allocation based on CPU count.
  */
+#if CONFIG_NR_CPUS < 128
 #define MCE_POOLSZ	(2 * PAGE_SIZE)
+#else
+#define MCE_POOLSZ	(CONFIG_NR_CPUS / 64 * PAGE_SIZE)
+#endif
 
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);

[agluck@agluck-desk3 mywork]$ vi arch/x86/kernel/cpu/mce/genpool.c
[agluck@agluck-desk3 mywork]$ git diff
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..47bf677578ca 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,10 +16,15 @@
  * used to save error information organized in a lock-less list.
  *
  * This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, but scale up with CPU count. Low CPU count
+ * systems allocate 2 pages (enough for ~64 "struct mce" records). Large
+ * systems scale up the allocation based on CPU count.
  */
+#if CONFIG_NR_CPUS < 128
 #define MCE_POOLSZ	(2 * PAGE_SIZE)
+#else
+#define MCE_POOLSZ	(CONFIG_NR_CPUS / 64 * PAGE_SIZE)
+#endif
 
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
On Mon, Feb 12, 2024 at 05:29:31PM +0000, Luck, Tony wrote:
> Walking the structures already allocated from the genpool in the #MC
> handler may be possible, but what is the criteria for "duplicates"?

	for each i in pool:
		memcmp(mce[i], new_mce, sizeof(struct mce));

It'll probably need to mask out fields like ->time etc.

> Do we avoid entering duplicates into the pool altogether?

We could do

	mce[i].count++;

in the same loop. Dunno yet if we even need to state how many duplicates are there...

> Or when the pool is full overwrite a duplicate?

Nah, not overwrite the duplicate but not add the new one. Cheaper.

> How about compile time allocation of extra space. Algorithm below for
> illustrative purposes only. May need some more thought about how
> to scale up.

Yeah, it is too static IMO. Especially if NR_CPUS is a build-time thing - needs to be based on the actual number of CPUs on the machine.

BUT, we don't have an allocator yet.

So we end up allocating it there on the heap.

Unless we can do something with memblock...

But then this still needs code audit whether num_possible_cpus() is final at that point. Because if it is, that would be the optimal thing to do a memblock_alloc() there...

> [Diff pasted into Outlook, chances that it will automatically apply = 0%]

How you even do patches with outschmook is mindboggling :-)

At least use Thunderbird. That's what I do for the company mail but then again I don't even try to do patches over company mail...

Thx.
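Spelled out as a sketch (hypothetical: struct mce has no ->count field today, and mce_is_duplicate() is an invented name), the walk with the ->time field masked out could look like:

	/* Return true if "new" matches a record already queued in the pool. */
	static bool mce_is_duplicate(struct mce *new)
	{
		struct mce_evt_llist *node;
		struct mce tmp;

		llist_for_each_entry(node, mce_event_llist.first, llnode) {
			tmp = node->mce;
			tmp.time = new->time;	/* mask out the timestamp */
			if (!memcmp(&tmp, new, sizeof(tmp)))
				return true;
		}

		return false;
	}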
> Yeah, it is too static IMO. Especially if NR_CPUS is a build-time thing
> - needs to be based on the actual number of CPUs on the machine.
>
> BUT, we don't have an allocator yet.
>
> So we end up allocating it there on the heap.
>
> Unless we can do something with memblock...
>
> But then this still needs code audit whether num_possible_cpus() is
> final at that point. Because if it is, that would be the optimal thing
> to do a memblock_alloc() there...

I threw a printk() into mce_gen_pool_init() and got:

	[    2.948285] mce: mce_gen_pool_init: num_possible_cpus = 144

So it is currently true that number of CPUs has been computed at this point. So using memblock_alloc() based on number of CPUs may work.

> [Diff pasted into Outlook, chances that it will automatically apply = 0%]
>
> How you even do patches with outschmook is mindboggling :-)
>
> At least use Thunderbird. That's what I do for the company mail but then
> again I don't even try to do patches over company mail...

I use "git send-email" to send out patches, and usually "b4" to get them to my desktop. I do have "mutt" on there, but IT have made it complex for me to use it to simply read & reply. It requires separate mutt config files to fetch mail and a different config to send mail because of the firewall rules on the lab where my desktop lives.

-Tony
On 2/11/2024 6:14 AM, Borislav Petkov wrote:
> On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote:
>> IIUC, you wouldn't want to extend the pool through late_initcall().
>> Instead, you would want for memory to be allocated (on the heap) and
>> size of the pool to be set at the very beginning i.e. when the pool
>> is created (~2 seconds, according to dmesg timestamps).
>>
>> Please correct me if I have understood wrong.
>
> Nah, you got it right. I went, looked and realized that we have to do
> this early dance because we have no allocator yet. And we can't move
> this gen_pool allocation to later, when we *do* have an allocator
> because MCA is up and logging already.
>
> But your extending approach doesn't fly in all cases either:
>
>	gen_pool_add->gen_pool_add_virt->gen_pool_add_owner
>
> it grabs the pool->lock spinlock and adds to &pool->chunks while, at the
> exact same time, gen_pool_alloc(), in *NMI* context iterates over that
> same &pool->chunks in the case we're logging an MCE at exact that same
> time when you're extending the buffer.
>
> And Tony already said that in the thread you're quoting:
>
> https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/
>
> So no, that doesn't work either.
>

I'm confused why it won't work. X86 has ARCH_HAVE_NMI_SAFE_CMPXCHG. I expect atomics/caches will still work in interrupt or #MC context. If not, then we'd have a fatal error that causes a hardware reset or a kernel panic before we get to logging, I think.

Or is the issue when running on the same CPU? In this case, either &pool->chunks was updated before taking the #MC, so the extra memory is there and can be used. Or it wasn't updated, so the extra memory is not available during the #MC, which is the same behavior as now.

I need to look more at the genpool code, but I thought I'd ask too.

Thanks,
Yazen
> I need to look more at the genpool code, but I thought I'd ask too.
Yazen,
gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool.
This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of
the list_add_rcu()
spin_lock(&pool->lock);
list_add_rcu(&chunk->next_chunk, &pool->chunks);
spin_unlock(&pool->lock);
-Tony
On Mon, Feb 12, 2024 at 06:45:34PM +0000, Luck, Tony wrote:
> So it is currently true that number of CPUs has been computed at this point.

It needs a proper explanation why that's ok rather than an empirical test only.

> I use "git send-email" to send out patches, and usually "b4" to get them
> to my desktop. I do have "mutt" on there, but IT have made it complex
> for me to use it to simply read & reply.

IT departments all around the world make sure of that. :)

> It requires separate mutt config files to fetch mail and a different
> config to send mail because of the firewall rules on the lab where my
> desktop lives.

Yeah, that's why I'm working with !company mail account. For my own sanity.
Hi,

On 2/12/2024 12:58, Luck, Tony wrote:
>> I need to look more at the genpool code, but I thought I'd ask too.
>
> Yazen,
>
> gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool.
>
> This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of
> the list_add_rcu()
>
>	spin_lock(&pool->lock);
>	list_add_rcu(&chunk->next_chunk, &pool->chunks);
>	spin_unlock(&pool->lock);
>

Even I am somewhat confused by this. The spinlock is mostly held to prevent other primitives from modifying chunks within the genpool.

In #MC context however, we are not modifying the existing chunks, per se. While in the #MC context, records will be added to the genpool through gen_pool_alloc() which eventually drops down into gen_pool_alloc_algo_owner().

gen_pool_alloc_algo_owner() iterates over the existing chunks within the genpool through list_for_each_entry_rcu(), within an RCU read-side critical section (rcu_read_lock()).

Now, the below description of list_for_each_entry_rcu():

	* list_for_each_entry_rcu - iterate over rcu list of given type
	* @pos: the type * to use as a loop cursor.
	* @head: the head for your list.
	* @member: the name of the list_head within the struct.
	* @cond: optional lockdep expression if called from non-RCU protection.
	*
	* This list-traversal primitive may safely run concurrently with
	* the _rcu list-mutation primitives such as list_add_rcu()
	* as long as the traversal is guarded by rcu_read_lock().

Makes me wonder if the genpool can be extended and traversed concurrently. OFC, not sure if gen_pool_alloc_algo_owner() being in #MC context makes a difference here or if I am missing something.
> It needs a proper explanation why that's ok rather than an empirical
> test only.

start_kernel()
  ... setup_arch()
      .... acpi stuff parses MADT and sets bits in possible map

  ... arch_cpu_finalize_init()
      ... calls mce_gen_pool_init()

-Tony
On 2/12/2024 1:58 PM, Luck, Tony wrote:
>> I need to look more at the genpool code, but I thought I'd ask too.
>
> Yazen,
>
> gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool.
>
> This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of
> the list_add_rcu()
>
>	spin_lock(&pool->lock);
>	list_add_rcu(&chunk->next_chunk, &pool->chunks);
>	spin_unlock(&pool->lock);
>

Thanks Tony. So the concern is not about traversal, but rather that the #MC can break the list_add_rcu(). Is this correct?

Thanks,
Yazen
I said:

>	spin_lock(&pool->lock);
>	list_add_rcu(&chunk->next_chunk, &pool->chunks);
>	spin_unlock(&pool->lock);

Avadhut said:

> gen_pool_alloc_algo_owner() iterates over the existing chunks
> within the genpool through list_for_each_entry_rcu(), within
> an RCU read-side critical section (rcu_read_lock()).

> So the concern is not about traversal, but rather that the #MC can break the
> list_add_rcu(). Is this correct?

Yazen,

Yes. The question is whether a #MC that comes in the middle of list_add_rcu() can safely do list_for_each_entry_rcu() on the same list.

RCU is black magic ... maybe it can do this? Adding Paul.

-Tony
On Mon, Feb 12, 2024 at 07:49:43PM +0000, Luck, Tony wrote:
> Yes. The question is whether a #MC that comes in the middle of list_add_rcu()
> can safely do list_for_each_entry_rcu() on the same list.
>
> RCU is black magic ... maybe it can do this? Adding Paul.

Yeah, the list traversal might be ok as this is what that list_add does - you can't encounter an inconsistent list - but we still take a spinlock on addition and the commit which added it:

	7f184275aa30 ("lib, Make gen_pool memory allocator lockless")

says

	The lockless operation only works if there is enough memory available.
	If new memory is added to the pool a lock has to be still taken. So
	any user relying on locklessness has to ensure that sufficient memory
	is preallocated.

and this is exactly what we're doing - adding new memory.

So, until we're absolutely sure that it is ok to interrupt a context holding a spinlock with a #MC which is non-maskable, I don't think we wanna do this.

Thx.
On Mon, Feb 12, 2024 at 01:40:21PM -0600, Naik, Avadhut wrote:
> The spinlock is mostly held to prevent other primitives
> from modifying chunks within the genpool.
>
> In #MC context however, we are not modifying the existing
> chunks, per se.

What if we decide to do

	mce[i].count++;

in #MC context?

That's modifying the existing chunks.

TBH, I'm not sure what that spinlock is for. See my reply to that same subthread.
On Mon, Feb 12, 2024 at 09:10:38PM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 07:49:43PM +0000, Luck, Tony wrote:
> > Yes. The question is whether a #MC that comes in the middle of list_add_rcu()
> > can safely do list_for_each_entry_rcu() on the same list.
> >
> > RCU is black magic ... maybe it can do this? Adding Paul.
>
> Yeah, the list traversal might be ok as this is what that list_add does
> - you can't encounter an inconsistent list - but we still take
> a spinlock on addition and the commit which added it:
>
> 7f184275aa30 ("lib, Make gen_pool memory allocator lockless")
>
> says
>
> The lockless operation only works if there is enough memory available.
> If new memory is added to the pool a lock has to be still taken. So
> any user relying on locklessness has to ensure that sufficient memory
> is preallocated.
>
> and this is exactly what we're doing - adding new memory.

Is the #MC adding new memory, or is the interrupted context adding new memory?

> So, until we're absolutely sure that it is ok to interrupt a context
> holding a spinlock with a #MC which is non-maskable, I don't think we
> wanna do this.

If it is the #MC adding new memory, agreed.

If the #MC is simply traversing the list, and the interrupted context was in the midst of adding a new element, this should be no worse than some other CPU traversing the list while this CPU is in the midst of adding a new element.

Or am I missing a turn in here somewhere?

Thanx, Paul
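To make the two contexts concrete, these are the fragments under discussion, trimmed from lib/genalloc.c for illustration:

	/* Process context, gen_pool_add_owner(): the lock serializes writers. */
	spin_lock(&pool->lock);
	list_add_rcu(&chunk->next_chunk, &pool->chunks);
	spin_unlock(&pool->lock);

	/* #MC (NMI-like) context, gen_pool_alloc() path: read-only traversal. */
	rcu_read_lock();
	list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk) {
		/* try to carve the allocation out of this chunk */
	}
	rcu_read_unlock();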
Hi,

On 2/12/2024 14:18, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 01:40:21PM -0600, Naik, Avadhut wrote:
>> The spinlock is mostly held to prevent other primitives
>> from modifying chunks within the genpool.
>>
>> In #MC context however, we are not modifying the existing
>> chunks, per se.
>
> What if we decide to do
>
>	mce[i].count++;
>
> in #MC context?
>
> That's modifying the existing chunks.
>
> TBH, I'm not sure what that spinlock is for. See my reply to that same
> subthread.
>

Yes, noticed your reply. Clears most of my confusion.
>> and this is exactly what we're doing - adding new memory.
>
> Is the #MC adding new memory, or is the interrupted context adding new
> memory?

The interrupted context is adding the memory.

>> So, until we're absolutely sure that it is ok to interrupt a context
>> holding a spinlock with a #MC which is non-maskable, I don't think we
>> wanna do this.
>
> If it is the #MC adding new memory, agreed.

Not what is happening.

> If the #MC is simply traversing the list, and the interrupted context
> was in the midst of adding a new element, this should be no worse than
> some other CPU traversing the list while this CPU is in the midst of
> adding a new element.
>
> Or am I missing a turn in here somewhere?

Not missing anything. I believe you've answered the question.

-Tony
On Mon, Feb 12, 2024 at 12:44:06PM -0800, Paul E. McKenney wrote:
> If it is the #MC adding new memory, agreed.
>
> If the #MC is simply traversing the list, and the interrupted context
> was in the midst of adding a new element, this should be no worse than
> some other CPU traversing the list while this CPU is in the midst of
> adding a new element.

Right, Tony answered which context is doing what.

What I'm still scratching my head over is, why grab a spinlock around

	list_add_rcu(&chunk->next_chunk, &pool->chunks);

?

That's the part that looks really weird.

And that's the interrupted context, yap.

Thx.
On Mon, Feb 12, 2024 at 07:41:03PM +0000, Luck, Tony wrote:
> > It needs a proper explanation why that's ok rather than an empirical
> > test only.
>
> start_kernel()
>   ... setup_arch()
>       .... acpi stuff parses MADT and sets bits in possible map
>
>   ... arch_cpu_finalize_init()
>       ... calls mce_gen_pool_init()

This made me question the "we don't have an allocator in mce_gen_pool_init()". Because if we got through all the ACPI stuff, we surely have an allocator.

Below patch doesn't explode at runtime.

-Tony

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..81de877f2a51 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,14 +16,12 @@
  * used to save error information organized in a lock-less list.
  *
  * This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, so a fixed size memory pool should be enough.
+ * Allocate on a sliding scale based on number of CPUs.
  */
-#define MCE_POOLSZ	(2 * PAGE_SIZE)
 
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
-static char gen_pool_buf[MCE_POOLSZ];
 
 /*
  * Compare the record "t" with each of the records on list "l" to see if
@@ -118,14 +116,23 @@ int mce_gen_pool_add(struct mce *mce)
 
 static int mce_gen_pool_create(void)
 {
+	int mce_numrecords, mce_poolsz;
 	struct gen_pool *tmpp;
 	int ret = -ENOMEM;
+	void *mce_pool;
 
 	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
 	if (!tmpp)
 		goto out;
 
-	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
+	mce_numrecords = max(80, num_possible_cpus() * 8);
+	mce_poolsz = mce_numrecords * ilog2(sizeof(struct mce_evt_llist));
+	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
+	if (!mce_pool) {
+		gen_pool_destroy(tmpp);
+		goto out;
+	}
+	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
 	if (ret) {
 		gen_pool_destroy(tmpp);
 		goto out;
On Mon, Feb 12, 2024 at 01:37:09PM -0800, Tony Luck wrote:
> This made me question the "we don't have an allocator in
> mce_gen_pool_init()". Because if we got through all the
> ACPI stuff, we surely have an allocator.
>
> Below patch doesn't explode at runtime.

Yap, I see it. And I can't dig out why it didn't use to work and we had to do it this way. The earliest thing I found is:

https://lore.kernel.org/all/1432150538-3120-2-git-send-email-gong.chen@linux.intel.com/T/#mf673ed669ee0d4c27b75bd48450c13c01cbb2cbf

I'll have to dig into my archives tomorrow, on a clear head...

Thx.
On Mon, Feb 12, 2024 at 11:08:33PM +0100, Borislav Petkov wrote:
> I'll have to dig into my archives tomorrow, on a clear head...
So I checked out 648ed94038c030245a06e4be59744fd5cdc18c40 which is
4.2-something.
And even back then, mcheck_cpu_init() gets called *after* mm_init()
which already initializes the allocators. So why did we allocate that
buffer statically?
On Mon, Feb 12, 2024 at 11:19:13PM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 11:08:33PM +0100, Borislav Petkov wrote:
> > I'll have to dig into my archives tomorrow, on a clear head...
>
> So I checked out 648ed94038c030245a06e4be59744fd5cdc18c40 which is
> 4.2-something.
>
> And even back then, mcheck_cpu_init() gets called *after* mm_init()
> which already initializes the allocators. So why did we allocate that
> buffer statically?

Found it in my archives. You should have it too:

	Date: Thu, 31 Jul 2014 02:51:25 -0400
	From: "Chen, Gong" <gong.chen@linux.intel.com>
	To: tony.luck@intel.com, bp@alien8.de
	Subject: Re: [RFC PATCH untest v2 1/4] x86, MCE: Provide a lock-less memory pool to save error record
	Message-ID: <20140731065125.GA5999@gchen.bj.intel.com>

and that's not on any ML, that's why I can't find it on lore...

There's this fragment from Chen:

--------
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index bb92f38..a1b6841 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -2023,6 +2023,9 @@ int __init mcheck_init(void)
>  {
>  	mcheck_intel_therm_init();
>
> +	if (!mce_genpool_init())
> +		mca_cfg.disabled = true;
> +

when setup_arch is called, memory subsystem hasn't been initialized, which means I can't use regular page allocation function. So I still need to put genpool init in mcheck_cpu_init.
--------

And that is still the case - mcheck_init() gets called in setup_arch() and thus before mm_init() which is called mm_core_init() now.

And on that same thread we agree that we should allocate it statically but then the call to mce_gen_pool_init() ended up in mcheck_cpu_init() which happens *after* mm_init().

What a big fscking facepalm. :-\
On Mon, Feb 12, 2024 at 10:27:41PM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 12:44:06PM -0800, Paul E. McKenney wrote:
> > If it is the #MC adding new memory, agreed.
> >
> > If the #MC is simply traversing the list, and the interrupted context
> > was in the midst of adding a new element, this should be no worse than
> > some other CPU traversing the list while this CPU is in the midst of
> > adding a new element.
>
> Right, Tony answered which context is doing what.
>
> What I'm still scratching my head over is, why grab a spinlock around
>
>	list_add_rcu(&chunk->next_chunk, &pool->chunks);
>
> ?
>
> That's the part that looks really weird.
>
> And that's the interrupted context, yap.

The usual reason is to exclude other CPUs also doing list_add_rcu() on the same list. Or is there other synchronization that is preventing concurrent updates?

Thanx, Paul
> The usual reason is to exclude other CPUs also doing list_add_rcu()
> on the same list. Or is there other synchronization that is preventing
> concurrent updates?

In this case the patch is trying to do a one-time bump of the pool size. So no races here.

-Tony
On Mon, Feb 12, 2024 at 02:46:57PM -0800, Paul E. McKenney wrote:
> The usual reason is to exclude other CPUs also doing list_add_rcu()
> on the same list.

Doh, it even says so in the comment above list_add_rcu().

And the traversal which is happening in NMI-like context is fine.

So phew, I think we should be fine here. Thanks!

And as it turns out, we're not going to need any of that after all as it looks like we can allocate the proper size from the very beginning... Lovely.
On Tue, Feb 13, 2024 at 12:10:09AM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 02:46:57PM -0800, Paul E. McKenney wrote:
> > The usual reason is to exclude other CPUs also doing list_add_rcu()
> > on the same list.
>
> Doh, it even says so in the comment above list_add_rcu().
>
> And the traversal which is happening in NMI-like context is fine.
>
> So phew, I think we should be fine here. Thanks!
>
> And as it turns out, we're not going to need any of that after all as
> it looks like we can allocate the proper size from the very beginning...

Sounds even better! ;-)

Thanx, Paul
Hi,

On 2/12/2024 02:58, Borislav Petkov wrote:
> On Sun, Feb 11, 2024 at 08:54:29PM -0600, Naik, Avadhut wrote:
>> Okay. Will make changes to allocate memory and set size of the pool
>> when it is created. Also, will remove the command line parameter and
>> resubmit.
>
> Before you do, go read that original thread again but this time take
> your time to grok it.
>
> And then try answering those questions:
>
> * Why are *you* fixing this? I know what the AWS reason is, what is
> yours?
>

I think this issue of the genpool getting full with MCE records can occur on AMD systems too since the pool doesn't scale up with the number of CPUs and memory in the system. The probability of the issue occurring only increases as CPU count and memory increase.

I feel that the genpool size should be proportional to, at least, the CPU count of the system.

> * Can you think of a slick deduplication scheme instead of blindly
> raising the buffer size?
>
> * What's wrong with not logging some early errors, can we live with that
> too? If it were firmware-first, it cannot simply extend its buffer size
> because it has limited space. So what does firmware do in such cases?
>

I think that we can live with not logging some early errors, as long as they are correctable.

Not very sure about what you mean by Firmware First. Do you mean handling of MCEs through HEST and GHES? Or something else?

> Think long and hard about the big picture, analyze the problem properly
> and from all angles before you go and do patches.
>
> Thx.
>
On 2/12/2024 03:32, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 09:58:01AM +0100, Borislav Petkov wrote:
>> * Can you think of a slick deduplication scheme instead of blindly
>> raising the buffer size?
>
> And here's the simplest scheme: you don't extend the buffer. On
> overflow, you say "Buffer full, here's the MCE" and you dump the error
> log into dmesg. Problem solved.
>

Wouldn't this require logging in the #MC context? We would know that the genpool is full in mce_gen_pool_add() which, I think, is executed in #MC context.

> A slicker deduplication scheme would be even better, tho. Maybe struct
> mce.count which gets incremented instead of adding the error record to
> the buffer again. And so on...
>
Hi,

On 2/12/2024 11:54, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 05:29:31PM +0000, Luck, Tony wrote:
>> Walking the structures already allocated from the genpool in the #MC
>> handler may be possible, but what is the criteria for "duplicates"?
>
>	for each i in pool:
>		memcmp(mce[i], new_mce, sizeof(struct mce));
>
> It'll probably need to mask out fields like ->time etc.
>

Also, some fields like cpuvendor, ppin, microcode will remain the same for all MCEs received on a system. Can we avoid comparing them?

We already seem to have a function mce_cmp(), introduced back in 2016, which accomplishes something similar for fatal errors. But it only checks the bank, status, addr and misc registers. Should we just modify this function to compare MCEs? It should work for fatal errors too.

For my own understanding: the general motto for #MC or interrupt contexts is *keep it short and sweet*. Though memcmp() is fairly optimized, we would still be running a *for* loop in #MC context. In case successive back-to-back MCEs are being received and the pool already has a fair number of records, wouldn't this comparison significantly extend our stay in #MC context?

Had discussed this with Yazen; IIUC, nested MCEs are not supported on x86. Please correct me if I am wrong in this.
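For reference, the mce_cmp() check mentioned above boils down to comparing four fields; a sketch of an equality-returning variant (the in-tree function in mce/core.c returns true when the records *differ*, inverted here for readability, and mce_records_match() is an invented name):

	static bool mce_records_match(struct mce *m1, struct mce *m2)
	{
		return m1->bank == m2->bank &&
		       m1->status == m2->status &&
		       m1->addr == m2->addr &&
		       m1->misc == m2->misc;
	}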
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3a1fa1f81d9d..62e7da4d9fda 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3135,6 +3135,8 @@
 	mce=option	[X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
 
+	mce-genpool-extend=	[X86-64] Number of pages to add to MCE Records pool.
+
 	md=		[HW] RAID subsystems devices and level
 			See Documentation/admin-guide/md.rst.
 
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index aed01612d342..d6e04fa5ee07 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -22,6 +22,7 @@
 #define MCE_POOLSZ	(2 * PAGE_SIZE)
 #define CPU_GEN_MEMSZ	256
 
+static unsigned int mce_genpool_extend;
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
 static char gen_pool_buf[MCE_POOLSZ];
@@ -123,10 +124,14 @@ int mce_gen_pool_extend(void)
 	int ret = -ENOMEM;
 	u32 num_threads;
 
-	num_threads = num_present_cpus();
-	len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
-	addr = (unsigned long)kzalloc(len, GFP_KERNEL);
+	if (mce_genpool_extend) {
+		len = mce_genpool_extend * PAGE_SIZE;
+	} else {
+		num_threads = num_present_cpus();
+		len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
+	}
 
+	addr = (unsigned long)kzalloc(len, GFP_KERNEL);
 	if (!addr)
 		goto out;
@@ -159,6 +164,17 @@ static int mce_gen_pool_create(void)
 	return ret;
 }
 
+static int __init parse_mce_genpool_extend(char *str)
+{
+	if (*str == '=')
+		str++;
+
+	get_option(&str, &mce_genpool_extend);
+	return 1;
+}
+
+__setup("mce-genpool-extend", parse_mce_genpool_extend);
+
 int mce_gen_pool_init(void)
 {
 	/* Just init mce_gen_pool once. */
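For a sense of the resulting behavior (editor's summary of the diff above): booting with, e.g., mce-genpool-extend=4 would extend the pool by four pages (16 KiB with 4 KiB pages), while leaving the parameter off falls back to the CPU-count-based sizing from patch 1/2, i.e. PAGE_ALIGN(num_present_cpus() * 256) bytes.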