[1/2] x86/MCE: Extend size of the MCE Records pool

Message ID 20240207225632.159276-2-avadhut.naik@amd.com
State New
Series Extend size of the MCE Records pool

Commit Message

Avadhut Naik Feb. 7, 2024, 10:56 p.m. UTC
Currently, 2 pages are allocated for the MCE Records pool during system
boot. Records of MCEs (struct mce) occurring on the system are added to
the pool through mce_gen_pool_add() in machine check context. These records
are then decoded later, in process context, through notifier chains.

However, on systems with a high CPU count, the 2 pages allocated for the
pool might not be sufficient in some instances. Successive MCEs received
back-to-back, before they are decoded through mce_gen_pool_process(), can
exhaust the pool, and consequently some MCE records will be missed. The
issue is compounded by the fact that the amount of memory in a system
typically increases with the CPU count, thereby increasing the probability
of MCEs being received.

The limit of 2 pages for the MCE Records pool was set more than 8 years
ago and has not been revised since. The CPU count and the amount of memory
in a system, however, have increased enormously since then. Additionally,
the size of an MCE record (struct mce) has itself grown from 88 bytes to
124 bytes.

Extend the size of the MCE Records pool to better serve modern systems.
The increase in size depends on the CPU count of the system. To avoid
allocation issues during early boot, the extension is performed from a
late_initcall(). Since the current size of struct mce is 124 bytes, each
logical CPU of the system will have space for at least 2 MCE records
available in the pool.
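For example, with 256 bytes reserved per logical CPU, a system with 512
logical CPUs gets (512 * 256) / 4096 = 32 pages for the pool, up from the
current 2.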

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
---
 arch/x86/kernel/cpu/mce/core.c     |  3 +++
 arch/x86/kernel/cpu/mce/genpool.c  | 22 ++++++++++++++++++++++
 arch/x86/kernel/cpu/mce/internal.h |  1 +
 3 files changed, 26 insertions(+)
  

Comments

Luck, Tony Feb. 8, 2024, 12:02 a.m. UTC | #1
> +#define CPU_GEN_MEMSZ	256
 
What is this define?

Why isn't this "sizeof(struct mce)"?

Or 2* that if you are trying to reserve enough space for two records per CPU.

-Tony
  
Naik, Avadhut Feb. 8, 2024, 5:41 p.m. UTC | #2
Hi

On 2/7/2024 18:02, Luck, Tony wrote:
>> +#define CPU_GEN_MEMSZ	256
>  
> What is this define?
> 
> Why isn't this "sizeof(struct mce)"?
> 
> Or 2* that if you are trying to reserve enough space for two records per CPU.
> 
That's the memory in bytes reserved for each logical CPU in the
extended MCE Records pool. At the current size of struct mce, that
equates to around 2 records.

Will change it to (2 * sizeof(struct mce)) though. Feels more
accurate. Thanks for the suggestion!
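
Something like the below, as a sketch only (keeping the existing macro
name; struct mce should already be complete in genpool.c via the mce
headers, since it is embedded in struct mce_evt_llist):

#define CPU_GEN_MEMSZ	(2 * sizeof(struct mce))	/* ~2 MCE records per logical CPU */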

Do you have any additional concerns/comments on this patchset?

> -Tony
  
Naik, Avadhut Feb. 8, 2024, 5:47 p.m. UTC | #3
On 2/8/2024 11:41, Naik, Avadhut wrote:
> Hi
> 
> On 2/7/2024 18:02, Luck, Tony wrote:
>>> +#define CPU_GEN_MEMSZ	256
>>  
>> What is this define?
>>
>> Why isn't this "sizeof(struct mce)"?
>>
>> Or 2* that if you are trying to reserve enough space for two records per CPU.
>>
> That's the memory in bytes reserved for each logical CPU in the
> extended MCE Records pool. By current size of struct mce that
> equates to around 2 records.
This memory will be reserved for each logical CPU only when the
"mce-genpool-extend" command line parameter, introduced through the
second patch, is not set.

> 
> Will change it to (2 * sizeof(struct mce)) though. Feels more
> accurate. Thanks for the suggestion!
> 
> Do you have any additional concerns/comments on this patchset?
> 
>> -Tony
>
  
Luck, Tony Feb. 8, 2024, 6:39 p.m. UTC | #4
> Will change it to (2 * sizeof(struct mce)) though. Feels more
> accurate. Thanks for the suggestion!

Thanks.

> Do you have any additional concerns/comments on this patchset?

Overall this is an excellent addition. Reserved space to log errors does need to scale
up with the CPU count.

I think part 1 (unconditional increase based on CPU count) is a "must have" enhancement.
With the change to CPU_GEN_MEMSZ #define:

Reviewed-by: Tony Luck <tony.luck@intel.com>


I'm less enthusiastic about part 2 adding a command line option to override the code in
part 1 with a bigger (or smaller?) amount. Can you describe some situation where a user
would need to use this?

-Tony
  
Sohil Mehta Feb. 8, 2024, 9:09 p.m. UTC | #5
On 2/7/2024 2:56 PM, Avadhut Naik wrote:

> Extend the size of MCE Records pool to better serve modern systems. The
> increase in size depends on the CPU count of the system. Currently, since
> size of struct mce is 124 bytes, each logical CPU of the system will have
> space for at least 2 MCE records available in the pool. To get around the
> allocation woes during early boot time, the same is undertaken using
> late_initcall().
> 

I guess making this proportional to the number of CPUs is probably fine
assuming CPUs and memory capacity *would* generally increase in sync.

But is there some logic to having 2 MCE records per logical CPU, or is
it just a heuristic approach? In practice, the pool is shared amongst
all MCE sources and can be filled by anyone, right?

> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c     |  3 +++
>  arch/x86/kernel/cpu/mce/genpool.c  | 22 ++++++++++++++++++++++
>  arch/x86/kernel/cpu/mce/internal.h |  1 +
>  3 files changed, 26 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index b5cc557cfc37..5d6d7994d549 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void)
>  	if (mca_cfg.recovery)
>  		enable_copy_mc_fragile();
>  
> +	if (mce_gen_pool_extend())
> +		pr_info("Couldn't extend MCE records pool!\n");
> +

Why do this unconditionally? For the vast majority of low core-count,
low-memory systems, the default 2 pages would be good enough.

Should there be a threshold beyond which the extension becomes active?
Let's say, for example, a check for num_present_cpus() > 32 (roughly
based on the 8 KB pool and the 2 * 124-byte estimate per logical CPU).

Whatever you choose, a comment above the code would be helpful
describing when the extension is expected to be useful.

>  	mcheck_debugfs_init();
>  
>  	/*
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..aed01612d342 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -20,6 +20,7 @@
>   * 2 pages to save MCE events for now (~80 MCE records at most).
>   */
>  #define MCE_POOLSZ	(2 * PAGE_SIZE)
> +#define CPU_GEN_MEMSZ	256
>  

The comment above MCE_POOLSZ probably needs a complete re-write. Right
now, it reads as follows:

* This memory pool is only to be used to save MCE records in MCE context.
* MCE events are rare, so a fixed size memory pool should be enough. Use
* 2 pages to save MCE events for now (~80 MCE records at most).

Apart from the numbers being incorrect since sizeof(struct mce) has
increased, this patch is based on the assumption that the current MCE
memory pool is no longer enough in certain cases.

>  static struct gen_pool *mce_evt_pool;
>  static LLIST_HEAD(mce_event_llist);
> @@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce)
>  	return 0;
>  }
>  
> +int mce_gen_pool_extend(void)
> +{
> +	unsigned long addr, len;

s/len/size/

> +	int ret = -ENOMEM;
> +	u32 num_threads;
> +
> +	num_threads = num_present_cpus();
> +	len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);

Nit: Can the use of the num_threads variable be avoided?
How about:

	size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ);



> +	addr = (unsigned long)kzalloc(len, GFP_KERNEL);

Also, shouldn't the new allocation be incremental to the 2 pages already
present?

Let's say, for example, that you have a 40-CPU system and the calculated
size in this case comes out to 40 * 2 * 124 bytes = 9920 bytes, i.e. 3 pages.
You only need to allocate 1 additional page to add to mce_evt_pool
instead of the 3 pages that the current code does.

Sohil

> +
> +	if (!addr)
> +		goto out;
> +
> +	ret = gen_pool_add(mce_evt_pool, addr, len, -1);
> +	if (ret)
> +		kfree((void *)addr);
> +
> +out:
> +	return ret;
> +}
> +
>  static int mce_gen_pool_create(void)
>  {
>  	struct gen_pool *tmpp;
  
Naik, Avadhut Feb. 9, 2024, 7:47 p.m. UTC | #6
Hi,

On 2/8/2024 12:39, Luck, Tony wrote:
>> Will change it to (2 * sizeof(struct mce)) though. Feels more
>> accurate. Thanks for the suggestion!
> 
> Thanks.
> 
>> Do you have any additional concerns/comments on this patchset?
> 
> Overall this is an excellent addition. Reserved space to log errors does need to scale
> up with the CPU count.
> 
> I think part 1 (unconditional increase based on CPU count) is a "must have" enhancement.
> With the change to CPU_GEN_MEMSZ #define:
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> 
> 
> I'm less enthusiastic about part 2 adding a command line option to override the code in
> part 1 with a bigger (or smaller?) amount. Can you describe some situation where a user
> would need to use this?
> 
I added the command-line option to ensure that we have covered all bases and are not enforcing
this memory footprint on all users.

A system with 512 logical CPUs, by the currently proposed logic, will have 32 pages allocated
for the pool ((512 * 256) / 4096). Some users may feel that this is not needed on their systems
and that they can do with just, maybe, 16 pages. The command line option gives them the flexibility
to do so without having to change kernel code, rebuild and deploy.

Conversely, some users, wanting to err on the side of caution, might feel that the above 32 pages
are not enough for the pool and may want to allocate more, maybe 48 pages. The command line
option, again, provides them with the flexibility to do so.

Sounds reasonable?
  
Naik, Avadhut Feb. 9, 2024, 7:52 p.m. UTC | #7
Hi,

On 2/8/2024 15:09, Sohil Mehta wrote:
> On 2/7/2024 2:56 PM, Avadhut Naik wrote:
> 
>> Extend the size of MCE Records pool to better serve modern systems. The
>> increase in size depends on the CPU count of the system. Currently, since
>> size of struct mce is 124 bytes, each logical CPU of the system will have
>> space for at least 2 MCE records available in the pool. To get around the
>> allocation woes during early boot time, the same is undertaken using
>> late_initcall().
>>
> 
> I guess making this proportional to the number of CPUs is probably fine
> assuming CPUs and memory capacity *would* generally increase in sync.
> 
> But, is there some logic to having 2 MCE records per logical cpu or it
> is just a heuristic approach? In practice, the pool is shared amongst
> all MCE sources and can be filled by anyone, right?
> 
Yes, the pool is shared among all MCE sources, but the logic behind 256 is
that the genpool was set to 2 pages, i.e. 8192 bytes, back in 2015.
Around that time, AFAIK, the max number of logical CPUs on a system was 32.
So, in the maximum case, each CPU would have around 256 bytes (8192 / 32) in
the pool, which equates to approximately 2 MCE records since sizeof(struct mce)
back then was 88 bytes.
>> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
>> ---
>>  arch/x86/kernel/cpu/mce/core.c     |  3 +++
>>  arch/x86/kernel/cpu/mce/genpool.c  | 22 ++++++++++++++++++++++
>>  arch/x86/kernel/cpu/mce/internal.h |  1 +
>>  3 files changed, 26 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index b5cc557cfc37..5d6d7994d549 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void)
>>  	if (mca_cfg.recovery)
>>  		enable_copy_mc_fragile();
>>  
>> +	if (mce_gen_pool_extend())
>> +		pr_info("Couldn't extend MCE records pool!\n");
>> +
> 
> Why do this unconditionally? For a vast majority of low core-count, low
> memory systems the default 2 pages would be good enough.
> 
> Should there be a threshold beyond which the extension becomes active?
> Let's say, for example, a check for num_present_cpus() > 32 (Roughly
> based on 8Kb memory and 124b*2 estimate per logical CPU).
> 
> Whatever you choose, a comment above the code would be helpful
> describing when the extension is expected to be useful.
> 
I put it in unconditionally because, IMO, the increase in memory even for
low-core systems didn't seem substantial: just an additional page for
systems with fewer than 16 CPUs.

But I do get your point. Will add a check in mcheck_late_init() for CPUs
present. Something like below:

@@ -2901,7 +2901,7 @@ static int __init mcheck_late_init(void)
    if (mca_cfg.recovery)
        enable_copy_mc_fragile();

-   if (mce_gen_pool_extend())
+   if ((num_present_cpus() > 32) && mce_gen_pool_extend())
        pr_info("Couldn't extend MCE records pool!\n");

Does this look good? The genpool extension will then be undertaken only for
systems with more than 32 CPUs. I will explain this in a comment.
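
As a sketch, the comment could read something like the below (wording
illustrative only, not the final respin):

	/*
	 * With roughly 256 bytes reserved per logical CPU, the default
	 * 2 pages already cover systems with up to 32 CPUs. Extend the
	 * pool only beyond that.
	 */
	if ((num_present_cpus() > 32) && mce_gen_pool_extend())
		pr_info("Couldn't extend MCE records pool!\n");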

>>  	mcheck_debugfs_init();
>>  
>>  	/*
>> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
>> index fbe8b61c3413..aed01612d342 100644
>> --- a/arch/x86/kernel/cpu/mce/genpool.c
>> +++ b/arch/x86/kernel/cpu/mce/genpool.c
>> @@ -20,6 +20,7 @@
>>   * 2 pages to save MCE events for now (~80 MCE records at most).
>>   */
>>  #define MCE_POOLSZ	(2 * PAGE_SIZE)
>> +#define CPU_GEN_MEMSZ	256
>>  
> 
> The comment above MCE_POOLSZ probably needs a complete re-write. Right
> now, it reads as follows:
> 
> * This memory pool is only to be used to save MCE records in MCE context.
> * MCE events are rare, so a fixed size memory pool should be enough. Use
> * 2 pages to save MCE events for now (~80 MCE records at most).
> 
> Apart from the numbers being incorrect since sizeof(struct mce) has
> increased, this patch is based on the assumption that the current MCE
> memory pool is no longer enough in certain cases.
> 
Yes, will change the comment to something like below:

 * This memory pool is only to be used to save MCE records in MCE context.
 * Though MCE events are rare, their frequency typically depends on the
 * system's memory and CPU count.
 * Allocate 2 pages for the MCE Records pool during early boot, with the
 * option to extend the pool, as needed, through the command line for
 * systems with a CPU count of more than 32.
 * By default, each logical CPU can have around 2 MCE records in the pool
 * at the same time. 

Sounds good?

>>  static struct gen_pool *mce_evt_pool;
>>  static LLIST_HEAD(mce_event_llist);
>> @@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce)
>>  	return 0;
>>  }
>>  
>> +int mce_gen_pool_extend(void)
>> +{
>> +	unsigned long addr, len;
> 
> s/len/size/
> 
Noted.
>> +	int ret = -ENOMEM;
>> +	u32 num_threads;
>> +
>> +	num_threads = num_present_cpus();
>> +	len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
> 
> Nit: Can the use of the num_threads variable be avoided?
> How about:
> 
> 	size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ);
> 
Will do.
> 
> 
>> +	addr = (unsigned long)kzalloc(len, GFP_KERNEL);
> 
> Also, shouldn't the new allocation be incremental to the 2 pages already
> present?
> 
> Let's say, for example, that you have a 40-CPU system and the calculated
> size in this case comes out to 40 * 2 * 124 bytes = 9920 bytes, i.e. 3 pages.
> You only need to allocate 1 additional page to add to mce_evt_pool
> instead of the 3 pages that the current code does.
> 
Will make it incremental when the genpool extension is being undertaken
through the default means. Something like below:

@@ -129,6 +134,7 @@ int mce_gen_pool_extend(void)
    } else {
        num_threads = num_present_cpus();
        len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
+       len -= MCE_POOLSZ;

Does this sound good?
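
For reference, a rough sketch of mce_gen_pool_extend() with the above
feedback folded in (illustrative only, not the respun patch; it assumes
CPU_GEN_MEMSZ is redefined as 2 * sizeof(struct mce) and that the
num_present_cpus() > 32 check stays with the caller in mcheck_late_init()):

int mce_gen_pool_extend(void)
{
	unsigned long addr, size;
	int ret = -ENOMEM;

	/* Reserve ~2 records per present CPU, beyond the default 2 pages. */
	size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ);
	if (size <= MCE_POOLSZ)
		return 0;
	size -= MCE_POOLSZ;

	addr = (unsigned long)kzalloc(size, GFP_KERNEL);
	if (!addr)
		return ret;

	ret = gen_pool_add(mce_evt_pool, addr, size, -1);
	if (ret)
		kfree((void *)addr);

	return ret;
}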
  

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index b5cc557cfc37..5d6d7994d549 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void)
 	if (mca_cfg.recovery)
 		enable_copy_mc_fragile();
 
+	if (mce_gen_pool_extend())
+		pr_info("Couldn't extend MCE records pool!\n");
+
 	mcheck_debugfs_init();
 
 	/*
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..aed01612d342 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -20,6 +20,7 @@ 
  * 2 pages to save MCE events for now (~80 MCE records at most).
  */
 #define MCE_POOLSZ	(2 * PAGE_SIZE)
+#define CPU_GEN_MEMSZ	256
 
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
@@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce)
 	return 0;
 }
 
+int mce_gen_pool_extend(void)
+{
+	unsigned long addr, len;
+	int ret = -ENOMEM;
+	u32 num_threads;
+
+	num_threads = num_present_cpus();
+	len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
+	addr = (unsigned long)kzalloc(len, GFP_KERNEL);
+
+	if (!addr)
+		goto out;
+
+	ret = gen_pool_add(mce_evt_pool, addr, len, -1);
+	if (ret)
+		kfree((void *)addr);
+
+out:
+	return ret;
+}
+
 static int mce_gen_pool_create(void)
 {
 	struct gen_pool *tmpp;
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 01f8f03969e6..81e35ec58ebc 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -33,6 +33,7 @@ void mce_gen_pool_process(struct work_struct *__unused);
 bool mce_gen_pool_empty(void);
 int mce_gen_pool_add(struct mce *mce);
 int mce_gen_pool_init(void);
+int mce_gen_pool_extend(void);
 struct llist_node *mce_gen_pool_prepare_records(void);
 
 int mce_severity(struct mce *a, struct pt_regs *regs, char **msg, bool is_excp);