[2/2] x86/MCE: Add command line option to extend MCE Records pool

Message ID 20240207225632.159276-3-avadhut.naik@amd.com
State New
Series Extend size of the MCE Records pool

Commit Message

Avadhut Naik Feb. 7, 2024, 10:56 p.m. UTC
  Extension of MCE Records pool, based on system's CPU count, was undertaken
through the previous patch (x86/MCE: Extend size of the MCE Records pool).

Add a new command line parameter "mce-genpool-extend" to set the size of
MCE Records pool to a predetermined number of pages instead of system's
CPU count.
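
For example (illustrative, assuming 4 KB pages), booting with:

	mce-genpool-extend=4

extends the pool by four pages (16 KB) instead of by the CPU-count-based
amount.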

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
---
 .../admin-guide/kernel-parameters.txt         |  2 ++
 arch/x86/kernel/cpu/mce/genpool.c             | 22 ++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)
  

Comments

Sohil Mehta Feb. 9, 2024, 1:36 a.m. UTC | #1
On 2/7/2024 2:56 PM, Avadhut Naik wrote:
> Extension of MCE Records pool, based on system's CPU count, was undertaken
> through the previous patch (x86/MCE: Extend size of the MCE Records pool).
> 

This statement is unnecessary.

> Add a new command line parameter "mce-genpool-extend" to set the size of
> MCE Records pool to a predetermined number of pages instead of system's
> CPU count.
> 

Like Tony, I am unsure of when this would be useful.

Also, why does it need to override the CPU count based extension
mechanism? Would just adding more pages not work for them?

If there really is a good reason to do this, how about changing
mce-genpool-extend to mce-genpool-add-pages and keeping the description
the same?

mce-genpool-add-pages=	[X86-64] Number of pages to add to MCE Records pool.

Then you can simply add this many additional pages to the new
CPU count based mechanism.
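
A minimal sketch of that combination (CPU_GEN_MEMSZ, num_present_cpus() and
kzalloc() as used in the patch at the bottom of this page; mce_genpool_add_pages
is the proposed option and purely hypothetical):

	len  = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ) +
	       mce_genpool_add_pages * PAGE_SIZE;
	addr = (unsigned long)kzalloc(len, GFP_KERNEL);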

Sohil

> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  2 ++
>  arch/x86/kernel/cpu/mce/genpool.c             | 22 ++++++++++++++++---
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
  
Naik, Avadhut Feb. 9, 2024, 8:02 p.m. UTC | #2
Hi,

On 2/8/2024 19:36, Sohil Mehta wrote:
> On 2/7/2024 2:56 PM, Avadhut Naik wrote:
>> Extension of MCE Records pool, based on system's CPU count, was undertaken
>> through the previous patch (x86/MCE: Extend size of the MCE Records pool).
>>
> 
> This statement is unnecessary.
>
Noted.
 
>> Add a new command line parameter "mce-genpool-extend" to set the size of
>> MCE Records pool to a predetermined number of pages instead of system's
>> CPU count.
>>
> 
> Like Tony, I am unsure of when this would be useful.
> 
> Also, why does it need to override the CPU count based extension
> mechanism? Would just adding more pages not work for them?
> 
> If there really is a good reason to do this, how about changing
> mce-genpool-extend to mce-genpool-add-pages and keeping the description
> the same?
> 
> mce-genpool-add-pages=	[X86-64] Number of pages to add to MCE Records pool.
> 
> Then you can simply add this many additional pages to the new
> CPU count based mechanism.
> 
Is it safe to assume that users will always want to increase the size of the
pool and not decrease it?

IMO, the command-line option provides flexibility for users to choose the size of
the MCE Records pool in case they don't agree with the CPU count logic. I just added it
to ensure that we are not enforcing this increased memory footprint across the board.

Would you agree?
  
Borislav Petkov Feb. 9, 2024, 8:09 p.m. UTC | #3
On Fri, Feb 09, 2024 at 02:02:49PM -0600, Naik, Avadhut wrote:
> Is it safe to assume that users will always want to increase the size
> of the pool and not decrease it?

Why don't you make the gen pool size a function of the number of CPUs on
the system and have it all work automagically?

Burdening the user with yet another cmdline switch is a bad idea. We
have way too many as it is.

This stuff should work out-of-the-box, without user intervention if
possible. And it is possible in this case.

Thx.
  
Sohil Mehta Feb. 9, 2024, 8:16 p.m. UTC | #4
On 2/9/2024 12:02 PM, Naik, Avadhut wrote:

> Is it safe to assume that users will always want to increase the size of the
> pool and not decrease it?
> 
> IMO, the command-line option provides flexibility for users to choose the size of
> the MCE Records pool in case they don't agree with the CPU count logic. I just added it
> to ensure that we are not enforcing this increased memory footprint across the board.
> 
> Would you agree?
> 

Not really. Providing this level of configuration seems excessive and
unnecessary.

To me, it seems that we are over-compensating with the calculations in
the previous patch and then providing a mechanism to correct it here and
putting this burden on the user.

How about being more conservative with the allocations in the previous
patch so that we don't need to introduce this additional mechanism right
now? Later, if there is really a need for some specific usage, the patch
can be re-submitted then with the supporting data.

Sohil
  
Luck, Tony Feb. 9, 2024, 8:28 p.m. UTC | #5
> How about being more conservative with the allocations in the previous
> patch so that we don't need to introduce this additional mechanism right
> now? Later, if there is really a need for some specific usage, the patch
> can be re-submitted then with the supporting data.

There used to be a rule-of-thumb when configuring systems to have at least
one GByte of memory per CPU. Anyone following that rule shouldn't be
worried about sub-kilobyte allocations per CPU.

-Tony
  
Naik, Avadhut Feb. 9, 2024, 8:35 p.m. UTC | #6
Hi,

On 2/9/2024 14:09, Borislav Petkov wrote:
> On Fri, Feb 09, 2024 at 02:02:49PM -0600, Naik, Avadhut wrote:
>> Is it safe to assume that users will always want to increase the size
>> of the pool and not decrease it?
> 
> Why don't you make the gen pool size a function of the number of CPUs on
> the system and have it all work automagically?
>
IIUC, this is exactly what the first patch in this series is trying to accomplish.
Please correct me if I understood wrong.
 
> Burdening the user with yet another cmdline switch is a bad idea. We
> have way too many as it is.
> 
> This stuff should work out-of-the-box, without user intervention if
> possible. And it is possible in this case.
> 
Okay. Will drop the command line argument.
  
Borislav Petkov Feb. 9, 2024, 8:51 p.m. UTC | #7
On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote:
> IIUC, this is exactly what the first patch in this series is trying to
> accomplish.  Please correct me if I understood wrong.

Yes, you did.

I don't mean to extend it - I mean to allocate it from the very
beginning to

	min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);

There's a sane minimum and one page per logical CPU should be fine on
pretty much every configuration...

Thx.
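
A minimal sketch of that sizing (illustrative only; max() per the correction
in the follow-up message):

	unsigned long len = max(4 * PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);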
  
Sohil Mehta Feb. 9, 2024, 9:02 p.m. UTC | #8
On 2/9/2024 12:28 PM, Luck, Tony wrote:
>> How about being more conservative with the allocations in the previous
>> patch so that we don't need to introduce this additional mechanism right
>> now? Later, if there is really a need for some specific usage, the patch
>> can be re-submitted then with the supporting data.
> 
> There used to be a rule-of-thumb when configuring systems to have at least
> one GByte of memory per CPU. Anyone following that rule shouldn't be
> worried about sub-kilobyte allocations per CPU.
> 

I meant, to avoid the need for this second patch we can always start
lower and increase it later.

256 bytes per CPU seems fine to me, as done in the previous patch. But
if that seems too high, as described by Avadhut below, then maybe we can
start with 200 bytes or any other number. It's just a heuristic, IIUC.

https://lore.kernel.org/lkml/8d2d0dac-b188-4826-a43a-bb5fc0528f0d@amd.com/

Sohil
  
Borislav Petkov Feb. 10, 2024, 7:52 a.m. UTC | #9
On February 9, 2024 9:51:11 PM GMT+01:00, Borislav Petkov <bp@alien8.de> wrote:
>On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote:
>> IIUC, this is exactly what the first patch in this series is trying to
>> accomplish.  Please correct me if I understood wrong.
>
>Yes, you did.
>
>I don't mean to extend it - I mean to allocate it from the very
>beginning to
>
>	min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);

max() ofc.

>There's a sane minimum and one page per logical CPU should be fine on
>pretty much every configuration...
>
>Thx.
>
  
Naik, Avadhut Feb. 10, 2024, 9:15 p.m. UTC | #10
Hi,

On 2/10/2024 01:52, Borislav Petkov wrote:
> On February 9, 2024 9:51:11 PM GMT+01:00, Borislav Petkov <bp@alien8.de> wrote:
>> On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote:
>>> IIUC, this is exactly what the first patch in this series is trying to
>>> accomplish.  Please correct me if I understood wrong.
>>
>> Yes, you did.
>>
>> I don't mean to extend it - I mean to allocate it from the very
>> beginning to
>>
>> 	min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
> 

IIUC, you wouldn't want to extend the pool through late_initcall().
Instead, you would want for memory to be allocated (on the heap) and
size of the pool to be set at the very beginning i.e. when the pool
is created (~2 seconds, according to dmesg timestamps).

Please correct me if I have understood wrong.

I was actually doing something similar initially (setting size of the
pool right after its creation) but then went with extending the pool
since I wasn't sure if heap allocations should be undertaken that
early during bootup.

> max() ofc.
>
Thanks! This does clear a one part of my confusion.
  
Borislav Petkov Feb. 11, 2024, 11:14 a.m. UTC | #11
On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote:
> IIUC, you wouldn't want to extend the pool through late_initcall().
> Instead, you would want for memory to be allocated (on the heap) and
> size of the pool to be set at the very beginning i.e. when the pool
> is created (~2 seconds, according to dmesg timestamps).
> 
> Please correct me if I have understood wrong.

Nah, you got it right. I went, looked and realized that we have to do
this early dance because we have no allocator yet. And we can't move
this gen_pool allocation to later, when we *do* have an allocator
because MCA is up and logging already.

But your extending approach doesn't fly in all cases either:

gen_pool_add->gen_pool_add_virt->gen_pool_add_owner

it grabs the pool->lock spinlock and adds to &pool->chunks while, at the
exact same time, gen_pool_alloc(), in *NMI* context iterates over that
same &pool->chunks in the case we're logging an MCE at exact that same
time when you're extending the buffer.

And Tony already said that in the thread you're quoting:

https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/

So no, that doesn't work either.

Thx.
  
Naik, Avadhut Feb. 12, 2024, 2:54 a.m. UTC | #12
Hi,

On 2/11/2024 05:14, Borislav Petkov wrote:
> On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote:
>> IIUC, you wouldn't want to extend the pool through late_initcall().
>> Instead, you would want for memory to be allocated (on the heap) and
>> size of the pool to be set at the very beginning i.e. when the pool
>> is created (~2 seconds, according to dmesg timestamps).
>>
>> Please correct me if I have understood wrong.
> 
> Nah, you got it right. I went, looked and realized that we have to do
> this early dance because we have no allocator yet. And we can't move
> this gen_pool allocation to later, when we *do* have an allocator
> because MCA is up and logging already.
> 
Okay. Will make changes to allocate memory and set size of the pool
when it is created. Also, will remove the command line parameter and
resubmit.

> But your extending approach doesn't fly in all cases either:
> 
> gen_pool_add->gen_pool_add_virt->gen_pool_add_owner
> 
> it grabs the pool->lock spinlock and adds to &pool->chunks while, at the
> exact same time, gen_pool_alloc(), in *NMI* context iterates over that
> same &pool->chunks in the case we're logging an MCE at exact that same
> time when you're extending the buffer.
> 
> And Tony already said that in the thread you're quoting:
> 
> https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/
> 
> So no, that doesn't work either.
> 
> Thx.

Thanks for this explanation!
>
  
Borislav Petkov Feb. 12, 2024, 8:58 a.m. UTC | #13
On Sun, Feb 11, 2024 at 08:54:29PM -0600, Naik, Avadhut wrote:
> Okay. Will make changes to allocate memory and set size of the pool
> when it is created. Also, will remove the command line parameter and
> resubmit.

Before you do, go read that original thread again but this time take
your time to grok it.

And then try answering those questions:

* Why are *you* fixing this? I know what the AWS reason is, what is
yours?

* Can you think of a slick deduplication scheme instead of blindly
raising the buffer size?

* What's wrong with not logging some early errors, can we live with that
too? If it were firmware-first, it cannot simply extend its buffer size
because it has limited space. So what does firmware do in such cases?

Think long and hard about the big picture, analyze the problem properly
and from all angles before you go and do patches.

Thx.
  
Borislav Petkov Feb. 12, 2024, 9:32 a.m. UTC | #14
On Mon, Feb 12, 2024 at 09:58:01AM +0100, Borislav Petkov wrote:
> * Can you think of a slick deduplication scheme instead of blindly
> raising the buffer size?

And here's the simplest scheme: you don't extend the buffer. On
overflow, you say "Buffer full, here's the MCE" and you dump the error
log into dmesg. Problem solved.

A slicker deduplication scheme would be even better, tho. Maybe struct
mce.count which gets incremented instead of adding the error record to
the buffer again. And so on...
  
Luck, Tony Feb. 12, 2024, 5:29 p.m. UTC | #15
> And here's the simplest scheme: you don't extend the buffer. On
> overflow, you say "Buffer full, here's the MCE" and you dump the error
> log into dmesg. Problem solved.
>
> A slicker deduplication scheme would be even better, tho. Maybe struct
> mce.count which gets incremented instead of adding the error record to
> the buffer again. And so on...

Walking the structures already allocated from the genpool in the #MC
handler may be possible, but what is the criteria for "duplicates"?
Do we avoid entering duplicates into the pool altogether? Or when the pool
is full overwrite a duplicate?

How about compile time allocation of extra space. Algorithm below for
illustrative purposes only. May need some more thought about how
to scale up.

-Tony

[Diff pasted into Outlook, chances that it will automatically apply = 0%]

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..0fc2925c0839 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,10 +16,15 @@
  * used to save error information organized in a lock-less list.
  *
  * This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, so a fixed size memory pool should be enough.
+ * Low CPU count systems allocate 2 pages (enough for ~64 "struct mce"
+ * records). Large systems scale up the allocation based on CPU count.
  */
+#if CONFIG_NR_CPUS < 128
 #define MCE_POOLSZ     (2 * PAGE_SIZE)
+#else
+#define MCE_POOLSZ     (CONFIG_NR_CPUS / 64 * PAGE_SIZE)
+#endif

 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
[agluck@agluck-desk3 mywork]$ vi arch/x86/kernel/cpu/mce/genpool.c
[agluck@agluck-desk3 mywork]$ git diff
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..47bf677578ca 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,10 +16,15 @@
  * used to save error information organized in a lock-less list.
  *
  * This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, but scale up with CPU count.  Low CPU count
+ * systems allocate 2 pages (enough for ~64 "struct mce" records). Large
+ * systems scale up the allocation based on CPU count.
  */
+#if CONFIG_NR_CPUS < 128
 #define MCE_POOLSZ     (2 * PAGE_SIZE)
+#else
+#define MCE_POOLSZ     (CONFIG_NR_CPUS / 64 * PAGE_SIZE)
+#endif

 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
  
Borislav Petkov Feb. 12, 2024, 5:54 p.m. UTC | #16
On Mon, Feb 12, 2024 at 05:29:31PM +0000, Luck, Tony wrote:
> Walking the structures already allocated from the genpool in the #MC
> handler may be possible, but what is the criteria for "duplicates"?

for each i in pool:
	memcmp(mce[i], new_mce, sizeof(struct mce));

It'll probably need to mask out fields like ->time etc.

> Do we avoid entering duplicates into the pool altogether?

We could do

	mce[i].count++;

in the same loop.

Dunno yet if we even need to state how many duplicates are there...

> Or when the pool is full overwrite a duplicate?

Nah, not overwrite the duplicate but not add the new one. Cheaper.
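
A rough sketch of that scheme (illustrative only: the helper name and the
count member are hypothetical, the field compare stands in for the masked
memcmp(), and peeking at mce_event_llist.first is simplified):

/* Sketch: bump a (hypothetical) count on an equivalent record instead of
 * queueing the duplicate again. */
static bool mce_bump_duplicate(struct mce *m)
{
	struct mce_evt_llist *node;

	llist_for_each_entry(node, mce_event_llist.first, llnode) {
		struct mce *old = &node->mce;

		/* compare everything except volatile fields like ->time */
		if (old->bank   == m->bank   &&
		    old->status == m->status &&
		    old->addr   == m->addr   &&
		    old->misc   == m->misc) {
			old->count++;	/* no such field in struct mce today */
			return true;
		}
	}

	return false;
}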

> How about compile time allocation of extra space. Algorithm below for
> illustrative purposes only. May need some more thought about how
> to scale up.

Yeah, it is too static IMO. Especially if NR_CPUS is a build-time thing
- needs to be based on the actual number of CPUs on the machine.

BUT, we don't have an allocator yet.

So we end up allocating it there on the heap.

Unless we can do something with memblock...

But then this still needs code audit whether num_possible_cpus() is
final at that point. Because if it is, that would be the optimal thing
to do a memblock_alloc() there...
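
A minimal sketch of that memblock variant (assuming num_possible_cpus() is
indeed final at that point, which is the audit mentioned above; mce_pool and
len are illustrative locals):

	len      = num_possible_cpus() * PAGE_SIZE;
	mce_pool = memblock_alloc(len, PAGE_SIZE);
	if (!mce_pool)
		return -ENOMEM;
	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, len, -1);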

> [Diff pasted into Outlook, chances that it will automatically apply = 0%]

How you even do patches with outschmook is mindboggling :-)

At least use Thunderbird. That's what I do for the company mail but then
again I don't even try to do patches over company mail...

Thx.
  
Luck, Tony Feb. 12, 2024, 6:45 p.m. UTC | #17
> Yeah, it is too static IMO. Especially if NR_CPUS is a build-time thing
> - needs to be based on the actual number of CPUs on the machine.
>
> BUT, we don't have an allocator yet.
>
> So we end up allocating it there on the heap.
>
> Unless we can do something with memblock...
>
> But then this still needs code audit whether num_possible_cpus() is
> final at that point. Because if it is, that would be the optimal thing
> to do a memblock_alloc() there...

I threw a printk() into mce_gen_pool_init() and got:

[    2.948285] mce: mce_gen_pool_init: num_possible_cpus = 144

So it is currently true that number of CPUs has been computed at this point.

So using memblock_alloc() based on number of CPUs may work

> > [Diff pasted into Outlook, chances that it will automatically apply = 0%]
>
> How you even do patches with outschmook is mindboggling :-)
>
> At least use Thunderbird. That's what I do for the company mail but then
> again I don't even try to do patches over company mail...

I use "git send-email" to send out patches, and usually "b4" to get them
to my desktop. I do have "mutt" on there, but IT have made it complex
for me to use it to simply read & reply. It requires separate mutt config
files to fetch mail and a different config to send mail because of the
firewall rules on the lab where my desktop lives.

-Tony
  
Yazen Ghannam Feb. 12, 2024, 6:47 p.m. UTC | #18
On 2/11/2024 6:14 AM, Borislav Petkov wrote:
> On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote:
>> IIUC, you wouldn't want to extend the pool through late_initcall().
>> Instead, you would want for memory to be allocated (on the heap) and
>> size of the pool to be set at the very beginning i.e. when the pool
>> is created (~2 seconds, according to dmesg timestamps).
>>
>> Please correct me if I have understood wrong.
> 
> Nah, you got it right. I went, looked and realized that we have to do
> this early dance because we have no allocator yet. And we can't move
> this gen_pool allocation to later, when we *do* have an allocator
> because MCA is up and logging already.
> 
> But your extending approach doesn't fly in all cases either:
> 
> gen_pool_add->gen_pool_add_virt->gen_pool_add_owner
> 
> it grabs the pool->lock spinlock and adds to &pool->chunks while, at the
> exact same time, gen_pool_alloc(), in *NMI* context iterates over that
> same &pool->chunks in the case we're logging an MCE at exact that same
> time when you're extending the buffer.
> 
> And Tony already said that in the thread you're quoting:
> 
> https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/
> 
> So no, that doesn't work either.
> 

I'm confused why it won't work.

X86 has ARCH_HAVE_NMI_SAFE_CMPXCHG. I expect atomics/caches will still
work in interrupt or #MC context. If not, then we'd have a fatal error
that causes a hardware reset or a kernel panic before we get to logging,
I think.

Or is the issue when running on the same CPU? In this case, either
&pool->chunks was updated before taking the #MC, so the extra memory
is there and can be used. Or it wasn't updated, so the extra memory is
not available during the #MC which is the same behavior as now.

I need to look more at the genpool code, but I thought I'd ask too.

Thanks,
Yazen
  
Luck, Tony Feb. 12, 2024, 6:58 p.m. UTC | #19
> I need to look more at the genpool code, but I thought I'd ask too.

Yazen,

gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool.

This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of
the list_add_rcu()

        spin_lock(&pool->lock);
        list_add_rcu(&chunk->next_chunk, &pool->chunks);
        spin_unlock(&pool->lock);

-Tony
  
Borislav Petkov Feb. 12, 2024, 7:14 p.m. UTC | #20
On Mon, Feb 12, 2024 at 06:45:34PM +0000, Luck, Tony wrote:
> So it is currently true that number of CPUs has been computed at this point.

It needs a proper explanation why that's ok rather than an empirical
test only.

> I use "git send-email" to send out patches, and usually "b4" to get them
> to my desktop. I do have "mutt" on there, but IT have made it complex
> for me to use it to simply read & reply.

IT departments all around the world make sure of that. :)

> It requires separate mutt config files to fetch mail and a different
> config to send mail because of the firewall rules on the lab where my
> desktop lives.

Yeah, that's why I'm working with !company mail account. For my own
sanity.
  
Naik, Avadhut Feb. 12, 2024, 7:40 p.m. UTC | #21
Hi,

On 2/12/2024 12:58, Luck, Tony wrote:
>> I need to look more at the genpool code, but I thought I'd ask too.
> 
> Yazen,
> 
> gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool.
> 
> This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of
> the list_add_rcu()
> 
>         spin_lock(&pool->lock);
>         list_add_rcu(&chunk->next_chunk, &pool->chunks);
>         spin_unlock(&pool->lock);
> 

Even I am somewhat confused by this.

The spinlock is mostly held to prevent other primitives
from modifying chunks within the genpool.

In #MC context however, we are not modifying the existing
chunks, per se.

While in the MC context, records will be added to the genpool
through gen_pool_alloc() which eventually drops down into
gen_pool_alloc_algo_owner().

gen_pool_alloc_algo_owner() iterates over the existing chunks
within the genpool through list_for_each_entry_rcu(), within
an RCU read-side critical section (rcu_read_lock()).

Now, the below description of list_for_each_entry_rcu():

 * list_for_each_entry_rcu  -   iterate over rcu list of given type
 * @pos:    the type * to use as a loop cursor.
 * @head:   the head for your list.
 * @member: the name of the list_head within the struct.
 * @cond:   optional lockdep expression if called from non-RCU protection.
 *
 * This list-traversal primitive may safely run concurrently with
 * the _rcu list-mutation primitives such as list_add_rcu()
 * as long as the traversal is guarded by rcu_read_lock().


Makes me wonder if the genpool can be extended and traversed
concurrently.

OFC, not sure if gen_pool_alloc_algo_owner() being in #MC context
makes a difference here or if I am missing something.
  
Luck, Tony Feb. 12, 2024, 7:41 p.m. UTC | #22
> It needs a proper explanation why that's ok rather than an empirical
> test only.

start_kernel()
   ... setup_arch()
       .... acpi stuff parses MADT and sets bits in possible map

   ... arch_cpu_finalize_init()
       ... calls mce_gen_pool_init()

-Tony
  
Yazen Ghannam Feb. 12, 2024, 7:43 p.m. UTC | #23
On 2/12/2024 1:58 PM, Luck, Tony wrote:
>> I need to look more at the genpool code, but I thought I'd ask too.
> 
> Yazen,
> 
> gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool.
> 
> This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of
> the list_add_rcu()
> 
>          spin_lock(&pool->lock);
>          list_add_rcu(&chunk->next_chunk, &pool->chunks);
>          spin_unlock(&pool->lock);
> 

Thanks Tony.

So the concern is not about traversal, but rather that the #MC can break the
list_add_rcu(). Is this correct?


Thanks,
Yazen
  
Luck, Tony Feb. 12, 2024, 7:49 p.m. UTC | #24
I said:

>          spin_lock(&pool->lock);
>          list_add_rcu(&chunk->next_chunk, &pool->chunks);
>          spin_unlock(&pool->lock);

Avadhut said:

> gen_pool_alloc_algo_owner() iterates over the existing chunks
> within the genpool through list_for_each_entry_rcu(), within
> an RCU read-side critical section (rcu_read_lock()).
 

> So the concern is not about traversal, but rather that the #MC can break the
> list_add_rcu(). Is this correct?

Yazen,

Yes. The question is whether a #MC that comes in the middle of list_add_rcu()
can safely do list_for_each_entry_rcu() on the same list.

RCU is black magic ... maybe it can do this? Adding Paul.

-Tony
  
Borislav Petkov Feb. 12, 2024, 8:10 p.m. UTC | #25
On Mon, Feb 12, 2024 at 07:49:43PM +0000, Luck, Tony wrote:
> Yes. The question is whether a #MC that comes in the middle of list_add_rcu()
> can safely do list_for_each_entry_rcu() on the same list.
> 
> RCU is black magic ... maybe it can do this? Adding Paul.

Yeah, the list traversal might be ok as this is what that list_add does
- you can't encounter an inconsistent list - but we still take
 a spinlock on addition and the commit which added it:

7f184275aa30 ("lib, Make gen_pool memory allocator lockless")

says

    The lockless operation only works if there is enough memory available.
    If new memory is added to the pool a lock has to be still taken.  So
    any user relying on locklessness has to ensure that sufficient memory
    is preallocated.

and this is exactly what we're doing - adding new memory.

So, until we're absolutely sure that it is ok to interrupt a context
holding a spinlock with a #MC which is non-maskable, I don't think we
wanna do this.

Thx.
  
Borislav Petkov Feb. 12, 2024, 8:18 p.m. UTC | #26
On Mon, Feb 12, 2024 at 01:40:21PM -0600, Naik, Avadhut wrote:
> The spinlock is mostly held to prevent other primitives
> from modifying chunks within the genpool.
> 
> In #MC context however, we are not modifying the existing
> chunks, per se.

What if we decide to do

	mce[i].count++;

in #MC context?

That's modifying the existing chunks.

TBH, I'm not sure what that spinlock is for. See my reply to that same
subthread.
  
Paul E. McKenney Feb. 12, 2024, 8:44 p.m. UTC | #27
On Mon, Feb 12, 2024 at 09:10:38PM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 07:49:43PM +0000, Luck, Tony wrote:
> > Yes. The question is whether a #MC that comes in the middle of list_add_rcu()
> > can safely do list_for_each_entry_rcu() on the same list.
> > 
> > RCU is black magic ... maybe it can do this? Adding Paul.
> 
> Yeah, the list traversal might be ok as this is what that list_add does
> - you can't encounter an inconsistent list - but we still take
>  a spinlock on addition and the commit which added it:
> 
> 7f184275aa30 ("lib, Make gen_pool memory allocator lockless")
> 
> says
> 
>     The lockless operation only works if there is enough memory available.
>     If new memory is added to the pool a lock has to be still taken.  So
>     any user relying on locklessness has to ensure that sufficient memory
>     is preallocated.
> 
> and this is exactly what we're doing - adding new memory.

Is the #MC adding new memory, or is the interrupted context adding new
memory?

> So, until we're absolutely sure that it is ok to interrupt a context
> holding a spinlock with a #MC which is non-maskable, I don't think we
> wanna do this.

If it is the #MC adding new memory, agreed.

If the #MC is simply traversing the list, and the interrupted context
was in the midst of adding a new element, this should be no worse than
some other CPU traversing the list while this CPU is in the midst of
adding a new element.

Or am I missing a turn in here somewhere?

							Thanx, Paul
  
Naik, Avadhut Feb. 12, 2024, 8:51 p.m. UTC | #28
Hi,

On 2/12/2024 14:18, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 01:40:21PM -0600, Naik, Avadhut wrote:
>> The spinlock is mostly held to prevent other primitives
>> from modifying chunks within the genpool.
>>
>> In #MC context however, we are not modifying the existing
>> chunks, per se.
> 
> What if we decide to do
> 
> 	mce[i].count++;
> 
> in #MC context?
> 
> That's modifying the existing chunks.
> 
> TBH, I'm not sure what that spinlock is for. See my reply to that same
> subthread.
> 
Yes, noticed your reply. Clears most of my confusion.
  
Luck, Tony Feb. 12, 2024, 9:18 p.m. UTC | #29
>> and this is exactly what we're doing - adding new memory.
>
> Is the #MC adding new memory, or is the interrupted context adding new
> memory?

The interrupted context is adding the memory.

>> So, until we're absolutely sure that it is ok to interrupt a context
>> holding a spinlock with a #MC which is non-maskable, I don't think we
>> wanna do this.
>
> If it is the #MC adding new memory, agreed.

Not what is happening.

> If the #MC is simply traversing the list, and the interrupted context
> was in the midst of adding a new element, this should be no worse than
> some other CPU traversing the list while this CPU is in the midst of
> adding a new element.
>
> Or am I missing a turn in here somewhere?

Not missing anything. I believe you've answered the question.

-Tony
  
Borislav Petkov Feb. 12, 2024, 9:27 p.m. UTC | #30
On Mon, Feb 12, 2024 at 12:44:06PM -0800, Paul E. McKenney wrote:
> If it is the #MC adding new memory, agreed.
> 
> If the #MC is simply traversing the list, and the interrupted context
> was in the midst of adding a new element, this should be no worse than
> some other CPU traversing the list while this CPU is in the midst of
> adding a new element.

Right, Tony answered which context is doing what.

What I'm still scratching my head over is, why grab a spinlock around

	list_add_rcu(&chunk->next_chunk, &pool->chunks);

?

That's the part that looks really weird.

And that's the interrupted context, yap.

Thx.
  
Luck, Tony Feb. 12, 2024, 9:37 p.m. UTC | #31
On Mon, Feb 12, 2024 at 07:41:03PM +0000, Luck, Tony wrote:
> > It needs a proper explanation why that's ok rather than an empirical
> > test only.
> 
> start_kernel()
>    ... setup_arch()
>        .... acpi stuff parses MADT and sets bits in possible map
> 
>    ... arch_cpu_finalize_init()
>        ... calls mce_gen_pool_init()

This made me question the "we don't have an allocator in
mce_gen_pool_init()". Because if we got through all the
ACPI stuff, we surely have an allocator.

Below patch doesn't explode at runtime.

-Tony

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..81de877f2a51 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,14 +16,12 @@
  * used to save error information organized in a lock-less list.
  *
  * This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, so a fixed size memory pool should be enough.
+ * Allocate on a sliding scale based on number of CPUs.
  */
-#define MCE_POOLSZ	(2 * PAGE_SIZE)
 
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
-static char gen_pool_buf[MCE_POOLSZ];
 
 /*
  * Compare the record "t" with each of the records on list "l" to see if
@@ -118,14 +116,23 @@ int mce_gen_pool_add(struct mce *mce)
 
 static int mce_gen_pool_create(void)
 {
+	int mce_numrecords, mce_poolsz;
 	struct gen_pool *tmpp;
 	int ret = -ENOMEM;
+	void *mce_pool;
 
 	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
 	if (!tmpp)
 		goto out;
 
-	ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
+	mce_numrecords = max(80, num_possible_cpus() * 8);
+	mce_poolsz = mce_numrecords * ilog2(sizeof(struct mce_evt_llist));
+	mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
+	if (!mce_pool) {
+		gen_pool_destroy(tmpp);
+		goto out;
+	}
+	ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
 	if (ret) {
 		gen_pool_destroy(tmpp);
 		goto out;
  
Borislav Petkov Feb. 12, 2024, 10:08 p.m. UTC | #32
On Mon, Feb 12, 2024 at 01:37:09PM -0800, Tony Luck wrote:
> This made me question the "we don't have an allocator in
> mce_gen_pool_init()". Because if we got through all the
> ACPI stuff, we surely have an allocator.
> 
> Below patch doesn't explode at runtime.

Yap, I see it.

And I can't dig out why it didn't use to work and we had to do it this
way. The earliest thing I found is:

https://lore.kernel.org/all/1432150538-3120-2-git-send-email-gong.chen@linux.intel.com/T/#mf673ed669ee0d4c27b75bd48450c13c01cbb2cbf

I'll have to dig into my archives tomorrow, on a clear head...

Thx.
  
Borislav Petkov Feb. 12, 2024, 10:19 p.m. UTC | #33
On Mon, Feb 12, 2024 at 11:08:33PM +0100, Borislav Petkov wrote:
> I'll have to dig into my archives tomorrow, on a clear head...

So I checked out 648ed94038c030245a06e4be59744fd5cdc18c40 which is
4.2-something.

And even back then, mcheck_cpu_init() gets called *after* mm_init()
which already initializes the allocators. So why did we allocate that
buffer statically?
  
Borislav Petkov Feb. 12, 2024, 10:42 p.m. UTC | #34
On Mon, Feb 12, 2024 at 11:19:13PM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 11:08:33PM +0100, Borislav Petkov wrote:
> > I'll have to dig into my archives tomorrow, on a clear head...
> 
> So I checked out 648ed94038c030245a06e4be59744fd5cdc18c40 which is
> 4.2-something.
> 
> And even back then, mcheck_cpu_init() gets called *after* mm_init()
> which already initializes the allocators. So why did we allocate that
> buffer statically?

Found it in my archives. You should have it too:

Date: Thu, 31 Jul 2014 02:51:25 -0400
From: "Chen, Gong" <gong.chen@linux.intel.com>
To: tony.luck@intel.com, bp@alien8.de
Subject: Re: [RFC PATCH untest v2 1/4] x86, MCE: Provide a lock-less memory pool to save error record
Message-ID: <20140731065125.GA5999@gchen.bj.intel.com>

and that's not on any ML that's why I can't find it on lore...

There's this fragment from Chen:

--------
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index bb92f38..a1b6841 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -2023,6 +2023,9 @@ int __init mcheck_init(void)
>  {
>       mcheck_intel_therm_init();
>  
> +     if (!mce_genpool_init())
> +             mca_cfg.disabled = true;
> +
when setup_arch is called, memory subsystem hasn't been initialized,
which means I can't use regular page allocation function. So I still
need to put genpool init in mcheck_cpu_init.
--------

And that is still the case - mcheck_init() gets called in setup_arch()
and thus before mm_init(), which is now called mm_core_init().

And on that same thread we agreed that we should allocate it statically,
but then the call to mce_gen_pool_init() ended up in mcheck_cpu_init()
which happens *after* mm_init().

What a big fscking facepalm. :-\
  
Paul E. McKenney Feb. 12, 2024, 10:46 p.m. UTC | #35
On Mon, Feb 12, 2024 at 10:27:41PM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 12:44:06PM -0800, Paul E. McKenney wrote:
> > If it is the #MC adding new memory, agreed.
> > 
> > If the #MC is simply traversing the list, and the interrupted context
> > was in the midst of adding a new element, this should be no worse than
> > some other CPU traversing the list while this CPU is in the midst of
> > adding a new element.
> 
> Right, Tony answered which context is doing what.
> 
> What I'm still scratching my head over is, why grab a spinlock around
> 
> 	list_add_rcu(&chunk->next_chunk, &pool->chunks);
> 
> ?
> 
> That's the part that looks really weird.
> 
> And that's the interrupted context, yap.

The usual reason is to exclude other CPUs also doing list_add_rcu()
on the same list.  Or is there other synchronization that is preventing
concurrent updates?

							Thanx, Paul
  
Luck, Tony Feb. 12, 2024, 10:53 p.m. UTC | #36
> The usual reason is to exclude other CPUs also doing list_add_rcu()
> on the same list.  Or is there other synchronization that is preventing
> concurrent updates?

In this case the patch is trying to do a one-time bump of the pool size.
So no races here.

-Tony
  
Borislav Petkov Feb. 12, 2024, 11:10 p.m. UTC | #37
On Mon, Feb 12, 2024 at 02:46:57PM -0800, Paul E. McKenney wrote:
> The usual reason is to exclude other CPUs also doing list_add_rcu()
> on the same list. 

Doh, it even says so in the comment above list_add_rcu().

And the traversal which is happening in NMI-like context is fine.

So phew, I think we should be fine here. Thanks!

And as it turns out, we're not going to need any of that after all as
it looks like we can allocate the proper size from the very beginning...

Lovely.
  
Paul E. McKenney Feb. 13, 2024, 1:07 a.m. UTC | #38
On Tue, Feb 13, 2024 at 12:10:09AM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 02:46:57PM -0800, Paul E. McKenney wrote:
> > The usual reason is to exclude other CPUs also doing list_add_rcu()
> > on the same list. 
> 
> Doh, it even says so in the comment above list_add_rcu().
> 
> And the traversal which is happening in NMI-like context is fine.
> 
> So phew, I think we should be fine here. Thanks!
> 
> And as it turns out, we're not going to need any of that after all as
> it looks like we can allocate the proper size from the very beginning...

Sounds even better!  ;-)

							Thanx, Paul
  
Naik, Avadhut Feb. 15, 2024, 8:14 p.m. UTC | #39
Hi,

On 2/12/2024 02:58, Borislav Petkov wrote:
> On Sun, Feb 11, 2024 at 08:54:29PM -0600, Naik, Avadhut wrote:
>> Okay. Will make changes to allocate memory and set size of the pool
>> when it is created. Also, will remove the command line parameter and
>> resubmit.
> 
> Before you do, go read that original thread again but this time take
> your time to grok it.
> 
> And then try answering those questions:
> 
> * Why are *you* fixing this? I know what the AWS reason is, what is
> yours?
> 
I think this issue of the genpool getting full with MCE records can occur
on AMD systems too, since the pool doesn't scale up with the number of
CPUs and the amount of memory in the system. The probability of hitting it
only increases as CPU count and memory increase. I feel that the genpool
size should be proportional to, at least, the CPU count of the system.

> * Can you think of a slick deduplication scheme instead of blindly
> raising the buffer size?
> 
> * What's wrong with not logging some early errors, can we live with that
> too? If it were firmware-first, it cannot simply extend its buffer size
> because it has limited space. So what does firmware do in such cases?
>
I think we can live with not logging some early errors, as long as they
are correctable.
I am not very sure what you mean by firmware-first. Do you mean handling
of MCEs through HEST and GHES? Or something else?

> Think long and hard about the big picture, analyze the problem properly
> and from all angles before you go and do patches.
> 
> Thx.
>
  
Naik, Avadhut Feb. 15, 2024, 8:15 p.m. UTC | #40
On 2/12/2024 03:32, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 09:58:01AM +0100, Borislav Petkov wrote:
>> * Can you think of a slick deduplication scheme instead of blindly
>> raising the buffer size?
> 
> And here's the simplest scheme: you don't extend the buffer. On
> overflow, you say "Buffer full, here's the MCE" and you dump the error
> log into dmesg. Problem solved.
>
Wouldn't this require logging in the #MC context?

We would know that the genpool is full in mce_gen_pool_add(), which, I
think, is executed in #MC context.
 
> A slicker deduplication scheme would be even better, tho. Maybe struct
> mce.count which gets incremented instead of adding the error record to
> the buffer again. And so on...
>
  
Naik, Avadhut Feb. 15, 2024, 8:18 p.m. UTC | #41
Hi,

On 2/12/2024 11:54, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 05:29:31PM +0000, Luck, Tony wrote:
>> Walking the structures already allocated from the genpool in the #MC
>> handler may be possible, but what is the criteria for "duplicates"?
> 
> for each i in pool:
> 	memcmp(mce[i], new_mce, sizeof(struct mce));
> 
> It'll probably need to mask out fields like ->time etc.
> 
Also, some fields like cpuvendor, ppin and microcode will remain the same
for all MCEs received on a system. Can we avoid comparing them?

We already seem to have a function mce_cmp(), introduced back in
2016, which accomplishes something similar for fatal errors. But
it only checks for bank, status, addr and misc registers. Should
we just modify this function to compare MCEs? It should work for
fatal errors too.

For my own understanding:
The general motto for #MC or interrupt contexts is *keep it short
and sweet*. Though memcmp() is fairly optimized, we would still be
running a *for* loop in #MC context. If successive back-to-back
MCEs are being received and the pool already has a fair number of
records, wouldn't this comparison significantly extend our stay
in #MC context?
I had discussed this with Yazen; IIUC, nested MCEs are not supported
on x86. Please correct me if I am wrong about this.
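
For rough scale (assuming the ~80-records-in-two-pages figure from the
genpool.c comment quoted earlier, and a struct mce on the order of 120-odd
bytes): a worst-case scan is about 80 memcmp() calls over roughly 10 KB of
data per logged MCE.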
  

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3a1fa1f81d9d..62e7da4d9fda 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3135,6 +3135,8 @@ 
 
 	mce=option	[X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
 
+	mce-genpool-extend=	[X86-64] Number of pages to add to MCE Records pool.
+
 	md=		[HW] RAID subsystems devices and level
 			See Documentation/admin-guide/md.rst.
 
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index aed01612d342..d6e04fa5ee07 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -22,6 +22,7 @@ 
 #define MCE_POOLSZ	(2 * PAGE_SIZE)
 #define CPU_GEN_MEMSZ	256
 
+static unsigned int mce_genpool_extend;
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);
 static char gen_pool_buf[MCE_POOLSZ];
@@ -123,10 +124,14 @@  int mce_gen_pool_extend(void)
 	int ret = -ENOMEM;
 	u32 num_threads;
 
-	num_threads = num_present_cpus();
-	len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
-	addr = (unsigned long)kzalloc(len, GFP_KERNEL);
+	if (mce_genpool_extend) {
+		len = mce_genpool_extend * PAGE_SIZE;
+	} else {
+		num_threads = num_present_cpus();
+		len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
+	}
 
+	addr = (unsigned long)kzalloc(len, GFP_KERNEL);
 	if (!addr)
 		goto out;
 
@@ -159,6 +164,17 @@  static int mce_gen_pool_create(void)
 	return ret;
 }
 
+static int __init parse_mce_genpool_extend(char *str)
+{
+	if (*str == '=')
+		str++;
+
+	get_option(&str, &mce_genpool_extend);
+	return 1;
+}
+
+__setup("mce-genpool-extend", parse_mce_genpool_extend);
+
 int mce_gen_pool_init(void)
 {
 	/* Just init mce_gen_pool once. */