Message ID | Zd--PJp-NbXGrb39@agluck-desk3 |
---|---|
State | New |
Series | x86/mce: Dynamically size space for machine check records |
Commit Message
Luck, Tony
Feb. 28, 2024, 11:14 p.m. UTC
Systems with a large number of CPUs may generate a large
number of machine check records when things go seriously
wrong. But Linux has a fixed buffer that can only capture
a few dozen errors.
Allocate space based on the number of CPUs (with a minimum
value based on the historical fixed buffer that could store
80 records).
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Discussion earlier concluded with the realization that it is
safe to dynamically allocate the mce_evt_pool at boot time.
So here's a patch to do that. Scaling algorithm here is a
simple linear "4 records per possible CPU" with a minimum
of 80 to match the legacy behavior. I'm open to other
suggestions.
Note that I threw in a "+1" to the return from ilog2() when
calling gen_pool_create(). From reading code, and running
some tests, it appears that the min_alloc_order argument
needs to be large enough to allocate one of the mce_evt_llist
structures.
Some other gen_pool users in Linux may also need this "+1".
arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
Comments
On 2/28/2024 3:14 PM, Tony Luck wrote:
>  static int mce_gen_pool_create(void)
>  {
> +	int mce_numrecords, mce_poolsz;
>  	struct gen_pool *tmpp;
>  	int ret = -ENOMEM;
> +	void *mce_pool;
> +	int order;
>
> -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> +	tmpp = gen_pool_create(order, -1);
>  	if (!tmpp)
>  		goto out;
>
> -	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> +	mce_numrecords = max(80, num_possible_cpus() * 4);
> +	mce_poolsz = mce_numrecords * (1 << order);
> +	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
> +	if (!mce_pool) {
> +		gen_pool_destroy(tmpp);
> +		goto out;
> +	}
> +	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
>  	if (ret) {
>  		gen_pool_destroy(tmpp);

Is this missing a kfree(mce_pool) here?

>  		goto out;
>> +	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
>>  	if (ret) {
>>  		gen_pool_destroy(tmpp);
>
> Is this missing a kfree(mce_pool) here?
>
>>  		goto out;

Yes. Will add. Thanks.

-Tony
A few other nits.

On 2/28/2024 3:14 PM, Tony Luck wrote:
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..a1f0a8f29cf5 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -16,14 +16,13 @@
>   * used to save error information organized in a lock-less list.
>   *
>   * This memory pool is only to be used to save MCE records in MCE context.
> - * MCE events are rare, so a fixed size memory pool should be enough. Use
> - * 2 pages to save MCE events for now (~80 MCE records at most).
> + * MCE events are rare, so a fixed size memory pool should be enough.
> + * Allocate on a sliding scale based on number of CPUs.
>   */
> -#define MCE_POOLSZ	(2 * PAGE_SIZE)
> +#define MCE_MIN_ENTRIES	80
>
>  static struct gen_pool *mce_evt_pool;
>  static LLIST_HEAD(mce_event_llist);
> -static char gen_pool_buf[MCE_POOLSZ];
>
>  /*
>   * Compare the record "t" with each of the records on list "l" to see if
> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>
>  static int mce_gen_pool_create(void)
>  {
> +	int mce_numrecords, mce_poolsz;

Should order be also declared in this line? That way we can have all the
uninitialized 'int's together.

>  	struct gen_pool *tmpp;
>  	int ret = -ENOMEM;
> +	void *mce_pool;
> +	int order;
>
> -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;

I didn't exactly understand why a +1 is needed here. Do you have a
pointer to somewhere to help understand this?

Also, I think, a comment on top might be useful since this isn't obvious.

> +	tmpp = gen_pool_create(order, -1);
>  	if (!tmpp)
>  		goto out;
>
> -	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> +	mce_numrecords = max(80, num_possible_cpus() * 4);

How about using MCE_MIN_ENTRIES here?

> +	mce_poolsz = mce_numrecords * (1 << order);
> +	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
> +	if (!mce_pool) {
> +		gen_pool_destroy(tmpp);
> +		goto out;
> +	}
> +	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
>  	if (ret) {
>  		gen_pool_destroy(tmpp);
>  		goto out;
Hi,

On 2/28/2024 17:14, Tony Luck wrote:
> Systems with a large number of CPUs may generate a large
> number of machine check records when things go seriously
> wrong. But Linux has a fixed buffer that can only capture
> a few dozen errors.
>
> Allocate space based on the number of CPUs (with a minimum
> value based on the historical fixed buffer that could store
> 80 records).
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>
> Discussion earlier concluded with the realization that it is
> safe to dynamically allocate the mce_evt_pool at boot time.
> So here's a patch to do that. Scaling algorithm here is a
> simple linear "4 records per possible CPU" with a minimum
> of 80 to match the legacy behavior. I'm open to other
> suggestions.
>
> Note that I threw in a "+1" to the return from ilog2() when
> calling gen_pool_create(). From reading code, and running
> some tests, it appears that the min_alloc_order argument
> needs to be large enough to allocate one of the mce_evt_llist
> structures.
>
> Some other gen_pool users in Linux may also need this "+1".
>

Somewhat confused here. Weren't we also exploring ways to avoid
duplicate records from being added to the genpool? Has something
changed in that regard?

> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------
> 1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..a1f0a8f29cf5 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -16,14 +16,13 @@
>   * used to save error information organized in a lock-less list.
>   *
>   * This memory pool is only to be used to save MCE records in MCE context.
> - * MCE events are rare, so a fixed size memory pool should be enough. Use
> - * 2 pages to save MCE events for now (~80 MCE records at most).
> + * MCE events are rare, so a fixed size memory pool should be enough.
> + * Allocate on a sliding scale based on number of CPUs.
>   */
> -#define MCE_POOLSZ	(2 * PAGE_SIZE)
> +#define MCE_MIN_ENTRIES	80
>
>  static struct gen_pool *mce_evt_pool;
>  static LLIST_HEAD(mce_event_llist);
> -static char gen_pool_buf[MCE_POOLSZ];
>
>  /*
>   * Compare the record "t" with each of the records on list "l" to see if
> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>
>  static int mce_gen_pool_create(void)
>  {
> +	int mce_numrecords, mce_poolsz;
>  	struct gen_pool *tmpp;
>  	int ret = -ENOMEM;
> +	void *mce_pool;
> +	int order;
>
> -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> +	tmpp = gen_pool_create(order, -1);
>  	if (!tmpp)
>  		goto out;
>
> -	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> +	mce_numrecords = max(80, num_possible_cpus() * 4);
> +	mce_poolsz = mce_numrecords * (1 << order);
> +	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);

To err on the side of caution, wouldn't kzalloc() be a safer choice here?
On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> Somewhat confused here. Weren't we also exploring ways to avoid
> duplicate records from being added to the genpool? Has something
> changed in that regard?

You can always send patches proposing how *you* think this duplicate
elimination should look like and we can talk. :)

I don't think anyone would mind it if done properly but first you'd need
a real-life use case. As in, do we log sooo many duplicates such that
we'd want to dedup?

Hmmm.
On 2/28/2024 8:56 PM, Sohil Mehta wrote:
> A few other nits.
>
> On 2/28/2024 3:14 PM, Tony Luck wrote:
>> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
>> index fbe8b61c3413..a1f0a8f29cf5 100644
>> --- a/arch/x86/kernel/cpu/mce/genpool.c
>> +++ b/arch/x86/kernel/cpu/mce/genpool.c
>> @@ -16,14 +16,13 @@
>>   * used to save error information organized in a lock-less list.
>>   *
>>   * This memory pool is only to be used to save MCE records in MCE context.
>> - * MCE events are rare, so a fixed size memory pool should be enough. Use
>> - * 2 pages to save MCE events for now (~80 MCE records at most).
>> + * MCE events are rare, so a fixed size memory pool should be enough.
>> + * Allocate on a sliding scale based on number of CPUs.
>>   */
>> -#define MCE_POOLSZ	(2 * PAGE_SIZE)
>> +#define MCE_MIN_ENTRIES	80
>>
>>  static struct gen_pool *mce_evt_pool;
>>  static LLIST_HEAD(mce_event_llist);
>> -static char gen_pool_buf[MCE_POOLSZ];
>>
>>  /*
>>   * Compare the record "t" with each of the records on list "l" to see if
>> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>>
>>  static int mce_gen_pool_create(void)
>>  {
>> +	int mce_numrecords, mce_poolsz;
>
> Should order be also declared in this line? That way we can have all the
> uninitialized 'int's together.
>
>>  	struct gen_pool *tmpp;
>>  	int ret = -ENOMEM;
>> +	void *mce_pool;
>> +	int order;
>>
>> -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
>> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
>
> I didn't exactly understand why a +1 is needed here. Do you have a
> pointer to somewhere to help understand this?
>
> Also, I think, a comment on top might be useful since this isn't obvious.
>

Would order_base_2() work here? It automatically rounds up to the next
power.

Thanks,
Yazen
On Wed, Feb 28, 2024 at 05:56:26PM -0800, Sohil Mehta wrote:
> A few other nits.
>
> On 2/28/2024 3:14 PM, Tony Luck wrote:
> > diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> > index fbe8b61c3413..a1f0a8f29cf5 100644
> > --- a/arch/x86/kernel/cpu/mce/genpool.c
> > +++ b/arch/x86/kernel/cpu/mce/genpool.c
> > @@ -16,14 +16,13 @@
> >   * used to save error information organized in a lock-less list.
> >   *
> >   * This memory pool is only to be used to save MCE records in MCE context.
> > - * MCE events are rare, so a fixed size memory pool should be enough. Use
> > - * 2 pages to save MCE events for now (~80 MCE records at most).
> > + * MCE events are rare, so a fixed size memory pool should be enough.
> > + * Allocate on a sliding scale based on number of CPUs.
> >   */
> > -#define MCE_POOLSZ	(2 * PAGE_SIZE)
> > +#define MCE_MIN_ENTRIES	80
> >
> >  static struct gen_pool *mce_evt_pool;
> >  static LLIST_HEAD(mce_event_llist);
> > -static char gen_pool_buf[MCE_POOLSZ];
> >
> >  /*
> >   * Compare the record "t" with each of the records on list "l" to see if
> > @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
> >
> >  static int mce_gen_pool_create(void)
> >  {
> > +	int mce_numrecords, mce_poolsz;
>
> Should order be also declared in this line? That way we can have all the
> uninitialized 'int's together.

Sure. Even with the addition of "order" the line is still short enough.

> >  	struct gen_pool *tmpp;
> >  	int ret = -ENOMEM;
> > +	void *mce_pool;
> > +	int order;
> >
> > -	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> > +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
>
> I didn't exactly understand why a +1 is needed here. Do you have a
> pointer to somewhere to help understand this?
>
> Also, I think, a comment on top might be useful since this isn't obvious.

gen_pool works in power-of-two blocks. The user must pick a specific
size with the gen_pool_create() call. Internally the gen_pool code uses
a bitmap to track which blocks in the pool are allocated/free.

Looking at this specific case, sizeof(struct mce_evt_llist) is 136. So
the original version of this code picks order 7 to allocate in 128 byte
units. But this means that every allocation of a mce_evt_llist will take
two 128-byte blocks.

Net result is that the comment at the top of arch/x86/kernel/cpu/mce/genpool.c
that two pages are enough for ~80 records was wrong when written. At
that point struct mce_evt_llist was below 128, so order was 6, and each
allocation took two blocks. So two pages = 8192 bytes divided by (2 * 64)
results in 64 possible allocations.

But over time Intel and AMD added to the structure. So the current math
comes out at just 32 allocations before the pool is out of space.

Yazen provided the right answer for this. Change to use order_base_2().

> > +	tmpp = gen_pool_create(order, -1);
> >  	if (!tmpp)
> >  		goto out;
> >
> > -	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> > +	mce_numrecords = max(80, num_possible_cpus() * 4);
>
> How about using MCE_MIN_ENTRIES here?

Oops. I meant to do that when I added the #define! I've also added a
"#define MCE_PER_CPU 4" instead of a raw "4" in that expression.

-Tony
On Thu, Feb 29, 2024 at 10:49:18AM -0500, Yazen Ghannam wrote:
> On 2/28/2024 8:56 PM, Sohil Mehta wrote:
> >> +	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> >
> > I didn't exactly understand why a +1 is needed here. Do you have a
> > pointer to somewhere to help understand this?
> >
> > Also, I think, a comment on top might be useful since this isn't obvious.
>
> Would order_base_2() work here? It automatically rounds up to the next power.

Yes! That's exactly what is needed here. Thanks!

-Tony
On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> Hi,
>
> On 2/28/2024 17:14, Tony Luck wrote:
> > Systems with a large number of CPUs may generate a large
> > number of machine check records when things go seriously
> > wrong. But Linux has a fixed buffer that can only capture
> > a few dozen errors.
> >
> > Allocate space based on the number of CPUs (with a minimum
> > value based on the historical fixed buffer that could store
> > 80 records).
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> >
> > Discussion earlier concluded with the realization that it is
> > safe to dynamically allocate the mce_evt_pool at boot time.
> > So here's a patch to do that. Scaling algorithm here is a
> > simple linear "4 records per possible CPU" with a minimum
> > of 80 to match the legacy behavior. I'm open to other
> > suggestions.
> >
> > Note that I threw in a "+1" to the return from ilog2() when
> > calling gen_pool_create(). From reading code, and running
> > some tests, it appears that the min_alloc_order argument
> > needs to be large enough to allocate one of the mce_evt_llist
> > structures.
> >
> > Some other gen_pool users in Linux may also need this "+1".
> >
>
> Somewhat confused here. Weren't we also exploring ways to avoid
> duplicate records from being added to the genpool? Has something
> changed in that regard?

I'm going to cover this in the reply to Boris.

> > +	mce_numrecords = max(80, num_possible_cpus() * 4);
> > +	mce_poolsz = mce_numrecords * (1 << order);
> > +	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
>
> To err on the side of caution, wouldn't kzalloc() be a safer choice here?

Seems like too much caution. When mce_gen_pool_add() allocates an entry
from the pool it does:

	memcpy(&node->mce, mce, sizeof(*mce));
	llist_add(&node->llnode, &mce_event_llist);

Between those two lines, every field in the struct mce_evt_llist is
written (including any holes in the struct mce part of the structure).

-Tony
On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote:
> On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> > Somewhat confused here. Weren't we also exploring ways to avoid
> > duplicate records from being added to the genpool? Has something
> > changed in that regard?
>
> You can always send patches proposing how *you* think this duplicate
> elimination should look like and we can talk. :)
>
> I don't think anyone would mind it if done properly but first you'd need
> a real-life use case. As in, do we log sooo many duplicates such that
> we'd want to dedup?

There are definitely cases where dedup will not help. If a row fails in
a DIMM there will be a flood of correctable errors with different
addresses (depending on the number of channels in the interleave scheme
for a system this may be dozens or hundreds of distinct addresses).

Same for other failures in structures like column and rank.

As to "real-life" use cases: a search on Lore for "MCE records pool
full!" only finds threads about modifications to this code. So the
general population of Linux developers isn't seeing this.

But a search in my internal e-mail box kicks up a dozen or so distinct
hits from internal validation teams in just the past year. But those
folks are super-dedicated to finding corner cases. Just this morning I
got a triumphant e-mail from someone who reproduced an issue "after 1.6
million error injections".

I'd bet that large cloud providers with systems numbered in the hundreds
of thousands see the MCE pool exhausted from time to time.

Open question is "How many error records do you need to diagnose the
cause of various floods of errors?" I'm going to say that the current 32
(see earlier e-mail today to Sohil
https://lore.kernel.org/all/ZeC8_jzdFnkpPVPf@agluck-desk3/ ) isn't
enough for big systems. It may be hard to distinguish between the
various bulk fail modes with just that many error logs.

-Tony
On 2/29/2024 11:47, Tony Luck wrote:
> On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote:
>> On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
>>> Somewhat confused here. Weren't we also exploring ways to avoid
>>> duplicate records from being added to the genpool? Has something
>>> changed in that regard?
>>
>> You can always send patches proposing how *you* think this duplicate
>> elimination should look like and we can talk. :)
>>
>> I don't think anyone would mind it if done properly but first you'd need
>> a real-life use case. As in, do we log sooo many duplicates such that
>> we'd want to dedup?
>
> There are definitely cases where dedup will not help. If a row fails in
> a DIMM there will be a flood of correctable errors with different
> addresses (depending on the number of channels in the interleave scheme
> for a system this may be dozens or hundreds of distinct addresses).
>
> Same for other failures in structures like column and rank.
>

Wouldn't having dedup actually increase the time we spend in #MC
context? Comparing the new MCE record against each existing record in
the genpool.

AFAIK, MCEs cannot be nested. Correct me if I am wrong here.

In a flood situation, like the one described above, that is exactly what
may happen: an MCE coming in while the dedup mechanism is underway (in
#MC context).
> Wouldn't having dedup actually increase the time we spend in #MC
> context? Comparing the new MCE record against each existing record in
> the genpool.

Yes, dedup would take extra time (increasing linearly with the number of
pending errors that were not filtered out by the dedup process).

> AFAIK, MCEs cannot be nested. Correct me if I am wrong here.

Can't be nested on the same CPU. But multiple CPUs may take a local
machine check simultaneously. Local machine check is opt-in on Intel; I
believe it is the default on AMD. Errors can also be signaled with CMCI.

> In a flood situation, like the one described above, that is exactly
> what may happen: an MCE coming in while the dedup mechanism is
> underway (in #MC context).

In a flood of errors it would be complicated to synchronize dedup
filtering on multiple CPUs. The trade-off between trying to get that
code right, and just allocating a few extra Kbytes of memory, would seem
to favor allocating more memory.

-- 
Thanks,
Avadhut Naik
On 2/29/2024 9:21 AM, Tony Luck wrote:
> Looking at this specific case, sizeof(struct mce_evt_llist) is 136. So
> the original version of this code picks order 7 to allocate in 128 byte
> units. But this means that every allocation of a mce_evt_llist will take
> two 128-byte blocks.
>
> Net result is that the comment at the top of arch/x86/kernel/cpu/mce/genpool.c
> that two pages are enough for ~80 records was wrong when written. At
> that point struct mce_evt_llist was below 128, so order was 6, and each
> allocation took two blocks. So two pages = 8192 bytes divided by (2 * 64)
> results in 64 possible allocations.
>

Thanks for the explanation. The part that got me is that I somehow
expected ilog2() to round up and not round down.

> But over time Intel and AMD added to the structure. So the current math
> comes out at just 32 allocations before the pool is out of space.
>
> Yazen provided the right answer for this. Change to use order_base_2()
>

Yes, I agree. order_base_2() is better than doing ilog2(struct size) + 1.
In the rare scenario of the size exactly being a power of 2 we don't
need to add the +1.
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..a1f0a8f29cf5 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,14 +16,13 @@
  * used to save error information organized in a lock-less list.
  *
  * This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, so a fixed size memory pool should be enough.
+ * Allocate on a sliding scale based on number of CPUs.
  */
-#define MCE_POOLSZ	(2 * PAGE_SIZE)
+#define MCE_MIN_ENTRIES	80
 
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
-static char gen_pool_buf[MCE_POOLSZ];
 
 /*
  * Compare the record "t" with each of the records on list "l" to see if
@@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
 
 static int mce_gen_pool_create(void)
 {
+	int mce_numrecords, mce_poolsz;
 	struct gen_pool *tmpp;
 	int ret = -ENOMEM;
+	void *mce_pool;
+	int order;
 
-	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
+	order = ilog2(sizeof(struct mce_evt_llist)) + 1;
+	tmpp = gen_pool_create(order, -1);
 	if (!tmpp)
 		goto out;
 
-	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
+	mce_numrecords = max(80, num_possible_cpus() * 4);
+	mce_poolsz = mce_numrecords * (1 << order);
+	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
+	if (!mce_pool) {
+		gen_pool_destroy(tmpp);
+		goto out;
+	}
+	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
 	if (ret) {
 		gen_pool_destroy(tmpp);
 		goto out;