[v2,03/13] x86/microcode/intel: Fix a hang if early loading microcode fails

Message ID 20221103175901.164783-4-ashok.raj@intel.com
State New
Series Make microcode late loading more robust

Commit Message

Ashok Raj Nov. 3, 2022, 5:58 p.m. UTC
  When early loading of microcode fails for any reason other than the wrong
family-model-stepping, Linux can get into an infinite loop retrying the
same failed load.

A single retry is needed to handle any mixed stepping case.

Assume we have a microcode that fails to load for some reason.
load_ucode_ap() seems to retry if the loading fails. But it searches for
a new rev and ends up finding the same copy. Hence it appears to repeat
the same load-and-retry loop forever.

load_ucode_intel_ap()
{
..
reget:
        if (!*iup) {
                patch = __load_ucode_intel(&uci);
		^^^^^ Finds the same patch every time.

                if (!patch)
                        return;

                *iup = patch;
        }

        uci.mc = *iup;

        if (apply_microcode_early(&uci, true)) {
	^^^^^^^^^^^^ apply fails
              /* Mixed-silicon system? Try to refetch the proper patch: */
              *iup = NULL;

              goto reget;
	      ^^^^^ Rinse, repeat.
        }

}
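
The loop above can be modelled outside the kernel. The following is a minimal user-space sketch, not the kernel code: struct fake_patch, find_patch(), apply_patch() and load_ap_bounded() are hypothetical stand-ins for the patch image, __load_ucode_intel(), apply_microcode_early() and load_ucode_intel_ap(). It assumes the search always returns the same candidate, which is the situation described above, and shows how bounding the refetch keeps a patch that never loads from spinning forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a microcode patch; not the kernel's struct. */
struct fake_patch {
	unsigned int rev;
	bool loadable;		/* false models a patch the CPU rejects */
};

/* Models the patch search: it always hands back the same candidate. */
static const struct fake_patch *find_patch(const struct fake_patch *only)
{
	return only;
}

/* Models the early apply step: 0 on success, nonzero on failure. */
static int apply_patch(const struct fake_patch *p)
{
	return p->loadable ? 0 : -1;
}

/*
 * Models the AP load path with a bounded refetch.
 * Returns the number of apply attempts that were made.
 */
static unsigned int load_ap_bounded(const struct fake_patch *candidate,
				    const struct fake_patch **cache,
				    unsigned int max_retries)
{
	unsigned int attempts = 0, retries = 0;

	assert(cache != NULL);

	for (;;) {
		if (!*cache) {
			*cache = find_patch(candidate);
			if (!*cache)
				return attempts;	/* nothing to load */
		}

		attempts++;
		if (apply_patch(*cache) == 0)
			return attempts;		/* loaded fine */

		/* Drop the cached patch so the next pass searches again. */
		*cache = NULL;

		/* Without this bound, a bad patch loops forever. */
		if (retries++ >= max_retries)
			return attempts;
	}
}
```

With max_retries set to 1, a patch that always fails is attempted twice and then abandoned, while a loadable patch is applied on the first attempt.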

Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading")
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
---
 arch/x86/kernel/cpu/microcode/intel.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
  

Comments

Borislav Petkov Nov. 9, 2022, 11:25 a.m. UTC | #1
On Thu, Nov 03, 2022 at 05:58:51PM +0000, Ashok Raj wrote:
> When early loading of microcode fails for any reason other than the wrong
> family-model-stepping, Linux can get into an infinite loop retrying the
> same failed load.
> 
> A single retry is needed to handle any mixed stepping case.
> 
> Assume we have a microcode that fails to load for some reason.
> load_ucode_ap() seems to retry if the loading fails. But it searches for

Seems to retry because we were supporting mixed revisions. Which we do
not now.

And if you say "seems" then this sounds like the problem hasn't been
analyzed properly. If this can happen with the current code, then this
needs to be fixed in stable. So, how do you trigger exactly?

I'd like to reproduce it myself.

As to this patch: it should simply be removing the retrying instead of
doing silly crap like

	bool retried = false;

...

In light of how a lot has changed since last time, yes, please redo the
patchset on top of tip:x86/microcode, keeping in mind now that we don't
support mixed revisions anymore.

Just like dhansen said, you can split it in fixes and new features so
that it is not too many patches at once - your call.

Thx.
  
Ashok Raj Nov. 9, 2022, 4:07 p.m. UTC | #2
On Wed, Nov 09, 2022 at 12:25:02PM +0100, Borislav Petkov wrote:
> On Thu, Nov 03, 2022 at 05:58:51PM +0000, Ashok Raj wrote:
> > When early loading of microcode fails for any reason other than the wrong
> > family-model-stepping, Linux can get into an infinite loop retrying the
> > same failed load.
> > 
> > A single retry is needed to handle any mixed stepping case.
> > 
> > Assume we have a microcode that fails to load for some reason.
> > load_ucode_ap() seems to retry if the loading fails. But it searches for
> 
> Seems to retry because we were supporting mixed revisions. Which we do
> not now.

The retry wasn't the problem, but hitting the same failed microcode over
and over is the problem. It is called out in the commit log.

As part of dropping mixed stepping, we can drop this retry.

Maybe the right way is to remember whether the BSP failed; if it did, there
is no point in trying to apply on the APs.

reload_early_microcode->reload_ucode_intel()
                               ->apply_microcode_intel() 

We aren't checking whether the early load failed on the BSP. We should save
that result and skip loading on all APs.
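
A rough user-space sketch of that idea, under stated assumptions: bsp_early_load_failed, apply_on_cpu(), load_bsp() and load_ap() are hypothetical names used only for illustration, and the real apply step is replaced by a boolean that says whether the patch would load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: records whether the early load failed on the BSP. */
static bool bsp_early_load_failed;

/* Models the apply step: 0 on success, nonzero on failure. */
static int apply_on_cpu(bool patch_loadable)
{
	return patch_loadable ? 0 : -1;
}

/* BSP path: try the load once and remember the outcome. */
static int load_bsp(bool patch_loadable)
{
	int ret = apply_on_cpu(patch_loadable);

	bsp_early_load_failed = (ret != 0);
	return ret;
}

/*
 * AP path: skip the load entirely when the BSP already failed.
 * Returns true if an apply was attempted; *ret holds the result.
 */
static bool load_ap(bool patch_loadable, int *ret)
{
	assert(ret != NULL);

	if (bsp_early_load_failed) {
		*ret = -1;	/* reported as failed, nothing attempted */
		return false;
	}

	*ret = apply_on_cpu(patch_loadable);
	return true;
}
```

The point of the design is that the failure is decided once, on the BSP, and every AP consults the saved result instead of retrying the same patch.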

> 
> And if you say "seems" then this sounds like the problem hasn't been
> analyzed properly. If this can happen with the current code, then this
> needs to be fixed in stable. So, how do you trigger exactly?
> 
> I'd like to reproduce it myself.

Certainly, take the fms+pf of the platform you are testing. 

- Take a microcode file from the distribution for a different fms, one that
  doesn't belong to the platform you are testing.
- Fake the external header data, changing it to the fms+pf that you want the
  microcode match to succeed on.
- Recompute all checksums and use that file instead of the original file.
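
For the checksum step, a minimal sketch, assuming the usual convention that all 32-bit words of the image (header plus data) must sum to zero. sum_dwords() and fix_checksum() are hypothetical helper names, the position of the checksum word is passed in as a parameter rather than taken from any real header layout, and any additional checksums a real file may carry are not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum all 32-bit words of the image, wrapping modulo 2^32. */
static uint32_t sum_dwords(const uint32_t *words, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += words[i];

	return sum;
}

/*
 * Rewrite the checksum word at cksum_idx so that the whole image
 * sums to zero again after other header fields were edited.
 */
static void fix_checksum(uint32_t *words, size_t n, size_t cksum_idx)
{
	assert(words != NULL);
	assert(cksum_idx < n);

	words[cksum_idx] = 0;
	words[cksum_idx] = (uint32_t)(0u - sum_dwords(words, n));
}
```

After fix_checksum() runs, sum_dwords() over the same buffer returns 0 regardless of what the other words contain.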

I accidentally ran into it since I had a copy of debug uCode that requires
additional steps before loading.

I have a tool that I can change to give you some production microcode that
will fail on your platform. Just provide me with the fms+pf values, and I
can provide one for your test.

Let me know if you need one for testing.

> 
> As to this patch: it should simply be removing the retrying instead of
> doing silly crap like
> 
> 	bool retried = false;
> 
> ...
> 
> In light of how a lot has changed since last time, yes, please redo the
> patchset on top of tip:x86/microcode, keeping in mind now that we don't
> support mixed revisions anymore.
> 
> Just like dhansen said, you can split it in fixes and new features so
> that it is not too many patches at once - your call.


That makes sense, I'll send the bug fix patches separately.

Cheers,
Ashok
  
Borislav Petkov Nov. 9, 2022, 11:34 p.m. UTC | #3
On Wed, Nov 09, 2022 at 08:07:32AM -0800, Ashok Raj wrote:
> - Take a microcode file from the distribution for a different fms, one that
>   doesn't belong to the platform you are testing.
> - Fake the external header data, changing it to the fms+pf that you want the
>   microcode match to succeed on.
> - Recompute all checksums and use that file instead of the original file.

This sounds like this cannot happen with officially released microcode -
only with something "hacked". If so, I'm not interested in such "fixes".
  

Patch

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 733b5eac0444..8ef04447fcf0 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -606,6 +606,7 @@  void __init load_ucode_intel_bsp(void)
 {
 	struct microcode_intel *patch;
 	struct ucode_cpu_info uci;
+	int rev, ret;
 
 	patch = __load_ucode_intel(&uci);
 	if (!patch)
@@ -613,13 +614,18 @@  void __init load_ucode_intel_bsp(void)
 
 	uci.mc = patch;
 
-	apply_microcode_early(&uci, true);
+	ret = apply_microcode_early(&uci, true);
+	if (ret) {
+		rev = patch->hdr.rev;
+		pr_err("Revision 0x%x failed during early loading\n", rev);
+	}
 }
 
 void load_ucode_intel_ap(void)
 {
 	struct microcode_intel *patch, **iup;
 	struct ucode_cpu_info uci;
+	bool retried = false;
 
 	if (IS_ENABLED(CONFIG_X86_32))
 		iup = (struct microcode_intel **) __pa_nodebug(&intel_ucode_patch);
@@ -638,9 +644,13 @@  void load_ucode_intel_ap(void)
 	uci.mc = *iup;
 
 	if (apply_microcode_early(&uci, true)) {
+		if (retried)
+			return;
+
 		/* Mixed-silicon system? Try to refetch the proper patch: */
 		*iup = NULL;
 
+		retried = true;
 		goto reget;
 	}
 }