[v4,6/6] x86/microcode/intel: Print when early microcode loading fails

Message ID 20230109153555.4986-7-ashok.raj@intel.com
State New
Headers
Series Some fixes and cleanups for microcode |

Commit Message

Ashok Raj Jan. 9, 2023, 3:35 p.m. UTC
  Currently when early microcode loading fails there is no way for the
user to know that the update failed.

Store the failed status and pass it to print_ucode_info() so that early
loading failures are captured in dmesg.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: x86 <x86@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
Changes since V1:

Thomas: Fix commit log as suggested.
---
 arch/x86/kernel/cpu/microcode/intel.c | 28 +++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)
  

Comments

Borislav Petkov Jan. 15, 2023, 7:05 p.m. UTC | #1
On Mon, Jan 09, 2023 at 07:35:55AM -0800, Ashok Raj wrote:
> Currently when early microcode loading fails there is no way for the
> user to know that the update failed.

Of course there is - there's no early update message in dmesg.
  
Ashok Raj Jan. 17, 2023, 4:12 p.m. UTC | #2
On Sun, Jan 15, 2023 at 08:05:14PM +0100, Borislav Petkov wrote:
> On Mon, Jan 09, 2023 at 07:35:55AM -0800, Ashok Raj wrote:
> > Currently when early microcode loading fails there is no way for the
> > user to know that the update failed.
> 
> Of course there is - there's no early update message in dmesg.

Sorry Boris, didn't comprehend. 

Without user making some additional steps to check the revision in the
default location and the current rev reported to find the update failed.

Maybe that's what you meant.
  
Dave Hansen Jan. 17, 2023, 4:29 p.m. UTC | #3
On 1/17/23 08:12, Ashok Raj wrote:
> On Sun, Jan 15, 2023 at 08:05:14PM +0100, Borislav Petkov wrote:
>> On Mon, Jan 09, 2023 at 07:35:55AM -0800, Ashok Raj wrote:
>>> Currently when early microcode loading fails there is no way for the
>>> user to know that the update failed.
>> Of course there is - there's no early update message in dmesg.
> Sorry Boris, didn't comprehend. 
> 
> Without user making some additional steps to check the revision in the
> default location and the current rev reported to find the update failed.
> 
> Maybe that's what you meant.

I think a better changelog might help here.  The original was a bit too
absolute.  There is, as Boris pointed out, a way to tell if a failure
occurred.  But, that method is a bit unfriendly to our users.

--

When early microcode loading succeeds, the kernel prints a message via
print_ucode_info() that starts with 'early update:'.  If loading fails,
apply_microcode_early() returns before that message is printed.  This
means that users must know to go looking for that message.  They can
infer a early microcode loading failure if they do not see the message.

That's not great for users.  Instead of expecting them to infer this, be
more explicit about it.  Instead of bailing out and returning from
apply_microcode_early(), stash the error code off and plumb it down to
print_ucode_info().

This ensures that a message of some kind is printed on all early loads:
successes *and* failures.  This should make it easier for our hapless
users to figure out when a failure occurred.
  
Dave Hansen Jan. 17, 2023, 4:35 p.m. UTC | #4
On 1/9/23 07:35, Ashok Raj wrote:
> -static void print_ucode(int old_rev, int new_rev, int date)
> +static void print_ucode(bool failed, int old_rev, int new_rev, int date)
...
>  	if (rev != mc->hdr.rev)
> -		return -1;
> +		retval = -1;
>  
>  	uci->cpu_sig.rev = rev;
>  
>  	if (early)
> -		print_ucode(old_rev, uci->cpu_sig.rev, mc->hdr.date);
> +		print_ucode(retval, old_rev, mc->hdr.rev, mc->hdr.date);
>  	else
> -		print_ucode_info(old_rev, uci->cpu_sig.rev, mc->hdr.date);
> +		print_ucode_info(retval, old_rev, uci->cpu_sig.rev, mc->hdr.date);
>  
> -	return 0;
> +	return retval;
>  }

I'm generally not a _huge_ fan of having an 'int' implicitly cast to a
bool.  The:

	print_ucode_info(retval, ...

Line could be right or wrong based on what the retval is logically.
This, on the other hand:

	bool failed = false;
	...
	if (rev != mc->hdr.rev) {
		retval = -1;
		failed = true;
	}
	...
	print_ucode_info(failed, old_rev, uci->cpu_sig.rev, ...

*Clearly* and unambiguously matches up with:

	static void print_ucode(bool failed, int old_rev, ...
  
Ashok Raj Jan. 17, 2023, 5:59 p.m. UTC | #5
On Tue, Jan 17, 2023 at 08:35:33AM -0800, Dave Hansen wrote:
> On 1/9/23 07:35, Ashok Raj wrote:
> > -static void print_ucode(int old_rev, int new_rev, int date)
> > +static void print_ucode(bool failed, int old_rev, int new_rev, int date)
> ...
> >  	if (rev != mc->hdr.rev)
> > -		return -1;
> > +		retval = -1;
> >  
> >  	uci->cpu_sig.rev = rev;
> >  
> >  	if (early)
> > -		print_ucode(old_rev, uci->cpu_sig.rev, mc->hdr.date);
> > +		print_ucode(retval, old_rev, mc->hdr.rev, mc->hdr.date);
> >  	else
> > -		print_ucode_info(old_rev, uci->cpu_sig.rev, mc->hdr.date);
> > +		print_ucode_info(retval, old_rev, uci->cpu_sig.rev, mc->hdr.date);
> >  
> > -	return 0;
> > +	return retval;
> >  }
> 
> I'm generally not a _huge_ fan of having an 'int' implicitly cast to a
> bool.  The:
> 
> 	print_ucode_info(retval, ...
> 
> Line could be right or wrong based on what the retval is logically.
> This, on the other hand:
> 
> 	bool failed = false;
> 	...
> 	if (rev != mc->hdr.rev) {
> 		retval = -1;
> 		failed = true;
> 	}
> 	...
> 	print_ucode_info(failed, old_rev, uci->cpu_sig.rev, ...
> 
> *Clearly* and unambiguously matches up with:
> 
> 	static void print_ucode(bool failed, int old_rev, ...

Yes, it makes good sense.. I'll fix up next update including the commit log
that you called out.
  
Borislav Petkov Jan. 17, 2023, 6:21 p.m. UTC | #6
On Tue, Jan 17, 2023 at 08:29:28AM -0800, Dave Hansen wrote:
> This ensures that a message of some kind is printed on all early loads:
> successes *and* failures.  This should make it easier for our hapless
> users to figure out when a failure occurred.

I'm still not convinced. When something doesn't happen in the kernel, we don't
always say "It didn't happen". We don't say anything.

So I don't like all those talkative drivers for no good reason. If there wasn't
an update message, then no update happened. That's it.

And the current microcode revision is in /proc/cpuinfo.

If you wanna know why the update didn't happen, then you start adding debug
printks and trying things but then you're clearly not a user so you know what
you're doing.

And the log buffer can get overwritten sooner or later depending on its size so
any message can disappear.

So what's the point of this pointless exercise in verbosity?
  
Dave Hansen Jan. 17, 2023, 6:32 p.m. UTC | #7
On 1/17/23 10:21, Borislav Petkov wrote:
> On Tue, Jan 17, 2023 at 08:29:28AM -0800, Dave Hansen wrote:
>> This ensures that a message of some kind is printed on all early loads:
>> successes *and* failures.  This should make it easier for our hapless
>> users to figure out when a failure occurred.
> I'm still not convinced. When something doesn't happen in the kernel, we don't
> always say "It didn't happen". We don't say anything.

Well, we have an awful lot of pr_warn()'s in the kernel that talk about
something that was tried and failed.

> So I don't like all those talkative drivers for no good reason. If there wasn't
> an update message, then no update happened. That's it.

I actually kinda like the inverse.

The common (boring) case where an update was needed and was successful.
It's the one we don't need to tell users about at all.  It barely
deserves a message.  Users expect that if there's an early update
available, it'll get attempted, it will be successful and the kernel
won't say much.

The time we need to spam dmesg is when something upends user
expectations and they might need to go do something.  An early loading
failure is exactly the place where they want to know.  They want to know
if they're running with known CPU bugs that would have been fixed by the
early update, or if they somehow have a botched early loading image.

So, if I had to pick either:

 * Print on failure
or
 * Print on success

I'd pick failure.  But, considering that we're already printing on
success, I'm OK doing both.
  
Borislav Petkov Jan. 17, 2023, 6:40 p.m. UTC | #8
On Tue, Jan 17, 2023 at 10:32:50AM -0800, Dave Hansen wrote:
> Well, we have an awful lot of pr_warn()'s in the kernel that talk about
> something that was tried and failed.

Well, is microcode loading failure worth to warn about?

What if there's no microcode for that CPU?

Now imagine in that case you issue a warning and then you get all those
bugzillas opened:

"Heeey, is my CPU broken, it says it cannot load microcode"

I sure don't want to be on the receiving end of this.

I don't think you wanna be there either.

> I actually kinda like the inverse.
> 
> The common (boring) case where an update was needed and was successful.
> It's the one we don't need to tell users about at all.  It barely
> deserves a message.  Users expect that if there's an early update
> available, it'll get attempted, it will be successful and the kernel
> won't say much.

No argument there.

> The time we need to spam dmesg is when something upends user
> expectations and they might need to go do something.  An early loading
> failure is exactly the place where they want to know.  They want to know
> if they're running with known CPU bugs that would have been fixed by the
> early update

No, we already warn about that in the CPU mitigations code.

> or if they somehow have a botched early loading image.

If you can pick apart from this here:

        /* write microcode via MSR 0x79 */
        native_wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);

what type of failure it is, then sure, let's warn.

But we should not warn just for every revision mismatch in there...
  
Ashok Raj Jan. 17, 2023, 7:10 p.m. UTC | #9
On Tue, Jan 17, 2023 at 10:32:50AM -0800, Dave Hansen wrote:
> On 1/17/23 10:21, Borislav Petkov wrote:
> > On Tue, Jan 17, 2023 at 08:29:28AM -0800, Dave Hansen wrote:
> >> This ensures that a message of some kind is printed on all early loads:
> >> successes *and* failures.  This should make it easier for our hapless
> >> users to figure out when a failure occurred.
> > I'm still not convinced. When something doesn't happen in the kernel, we don't
> > always say "It didn't happen". We don't say anything.
> 
> Well, we have an awful lot of pr_warn()'s in the kernel that talk about
> something that was tried and failed.
> 
> > So I don't like all those talkative drivers for no good reason. If there wasn't
> > an update message, then no update happened. That's it.
> 
> I actually kinda like the inverse.
> 
> The common (boring) case where an update was needed and was successful.
> It's the one we don't need to tell users about at all.  It barely

I think its useful to know what microcode was loaded at FIT from BIOS and
what its updated to from initrd. We have been telling about successful
updates, no good reason to nuke that. 

We added successful and failures on late-loads, so its better to be
consistent for both early and late.


> deserves a message.  Users expect that if there's an early update
> available, it'll get attempted, it will be successful and the kernel
> won't say much.
> 
> The time we need to spam dmesg is when something upends user
> expectations and they might need to go do something.  An early loading
> failure is exactly the place where they want to know.  They want to know
> if they're running with known CPU bugs that would have been fixed by the
> early update, or if they somehow have a botched early loading image.
> 
> So, if I had to pick either:
> 
>  * Print on failure
> or
>  * Print on success
> 
> I'd pick failure.  But, considering that we're already printing on
> success, I'm OK doing both.

That sounds good, lets do on both.
  
Ashok Raj Jan. 17, 2023, 8:40 p.m. UTC | #10
On Tue, Jan 17, 2023 at 07:40:31PM +0100, Borislav Petkov wrote:
> On Tue, Jan 17, 2023 at 10:32:50AM -0800, Dave Hansen wrote:
> > Well, we have an awful lot of pr_warn()'s in the kernel that talk about
> > something that was tried and failed.
> 
> Well, is microcode loading failure worth to warn about?

Is it not? 
> 
> What if there's no microcode for that CPU?

If there is no microcode, we don't print anything. So what's loaded in the
CPU is the latest version. When we have something we can always tell if its
successful or not.

Its not a microcode file in initrd, but a matching microcode to load. If
none is found, nothing to worry about.

We just agreed to show both failed and success for late-load. Doing this is
consistent with that isn't it?

https://lore.kernel.org/lkml/Y7iYLbEJSYnVn+dW@zn.tnic/

Ingo's:
https://lore.kernel.org/lkml/Y7k9DNz%2F%2FvqBAvZK@gmail.com/

Should we treat early loading differently?

Cheers,
Ashok
  
Luck, Tony Jan. 17, 2023, 8:58 p.m. UTC | #11
> If there is no microcode, we don't print anything. So what's loaded in the
> CPU is the latest version. When we have something we can always tell if its
> successful or not.
>
> Its not a microcode file in initrd, but a matching microcode to load. If
> none is found, nothing to worry about.
>
> We just agreed to show both failed and success for late-load. Doing this is
> consistent with that isn't it?
>
> https://lore.kernel.org/lkml/Y7iYLbEJSYnVn+dW@zn.tnic/
>
> Ingo's:
> https://lore.kernel.org/lkml/Y7k9DNz%2F%2FvqBAvZK@gmail.com/
>
> Should we treat early loading differently?

Getting a better set of messages from early microcode update would
be a "nice-to-have" feature. But if there is no agreement on what those
messages should look like, perhaps just skip this part for now.

Then Ashok can move on to the real issue of allowing LATE_LOAD for a
microcode that supports the new "minrev" header field.

-Tony
  
Dave Hansen Jan. 17, 2023, 9 p.m. UTC | #12
On 1/17/23 12:40, Ashok Raj wrote:
> We just agreed to show both failed and success for late-load. Doing this is
> consistent with that isn't it?

Well, I kinda proposed that and you agreed.  I don't think Boris ever
agreed, though.
  
Borislav Petkov Jan. 17, 2023, 9:06 p.m. UTC | #13
On Tue, Jan 17, 2023 at 12:40:30PM -0800, Ashok Raj wrote:
> We just agreed to show both failed and success for late-load.

Did you read the part you snipped about getting false/misleading bug reports? I
doubt it, as usual.

If you all Intel folks want to deal with people asking why is there a warning
message about microcode not loading successfully, I don't mind forwarding you
all those messages. I sure ain't going to deal with them.
  
Ashok Raj Jan. 17, 2023, 9:34 p.m. UTC | #14
On Tue, Jan 17, 2023 at 10:06:29PM +0100, Borislav Petkov wrote:
> On Tue, Jan 17, 2023 at 12:40:30PM -0800, Ashok Raj wrote:
> > We just agreed to show both failed and success for late-load.
> 
> Did you read the part you snipped about getting false/misleading bug reports? I
> doubt it, as usual.
> 
> If you all Intel folks want to deal with people asking why is there a warning
> message about microcode not loading successfully, I don't mind forwarding you
> all those messages. I sure ain't going to deal with them.

TBH, there is nothing to hide from a microcode loading failure. If user
sees that a loading failed, it could be maybe due to corrupted files and
its best to know and update to proper file vs running without the latest
microcode file always.
  
Ashok Raj Jan. 19, 2023, 5:59 p.m. UTC | #15
Hi Boris,

On Tue, Jan 17, 2023 at 12:58:12PM -0800, Tony Luck wrote:
> > If there is no microcode, we don't print anything. So what's loaded in the
> > CPU is the latest version. When we have something we can always tell if its
> > successful or not.
> >
> > Its not a microcode file in initrd, but a matching microcode to load. If
> > none is found, nothing to worry about.
> >
> > We just agreed to show both failed and success for late-load. Doing this is
> > consistent with that isn't it?
> >
> > https://lore.kernel.org/lkml/Y7iYLbEJSYnVn+dW@zn.tnic/
> >
> > Ingo's:
> > https://lore.kernel.org/lkml/Y7k9DNz%2F%2FvqBAvZK@gmail.com/
> >
> > Should we treat early loading differently?
> 
> Getting a better set of messages from early microcode update would
> be a "nice-to-have" feature. But if there is no agreement on what those
> messages should look like, perhaps just skip this part for now.
> 
> Then Ashok can move on to the real issue of allowing LATE_LOAD for a
> microcode that supports the new "minrev" header field.
> 
If this is the one blocking this series, as Tony proposed, its not worth
loosing sleep over this.

Would you recommend resubmitting a new set with the fixes for the interim
patch compile failures (for 32bit) and drop this last patch?

If you have any other comments would be great to hear now before I repost a
new series to capture everything that's need to be addressed.

As always you can change commit logs to your satisfaction, I try, but its
still not as perfect as how you like to see them.

Cheers,
Ashok
  
Borislav Petkov Jan. 20, 2023, 12:03 p.m. UTC | #16
On Thu, Jan 19, 2023 at 09:59:33AM -0800, Ashok Raj wrote:
> Would you recommend resubmitting a new set with the fixes for the interim
> patch compile failures (for 32bit) and drop this last patch?

Sure.

> If you have any other comments would be great to hear now before I repost a
> new series to capture everything that's need to be addressed.

No, no more comments.

Thx.
  
Ashok Raj Jan. 20, 2023, 4:52 p.m. UTC | #17
On Fri, Jan 20, 2023 at 01:03:57PM +0100, Borislav Petkov wrote:
> On Thu, Jan 19, 2023 at 09:59:33AM -0800, Ashok Raj wrote:
> > Would you recommend resubmitting a new set with the fixes for the interim
> > patch compile failures (for 32bit) and drop this last patch?
> 
> Sure.
> 
> > If you have any other comments would be great to hear now before I repost a
> > new series to capture everything that's need to be addressed.
> 
> No, no more comments.
> 
Great, light at the end of the tunnel!

I have posted v5. 

https://lore.kernel.org/lkml/20230120161923.118882-1-ashok.raj@intel.com/
  

Patch

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index f24300830ed7..0cdff9ed2a4e 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -310,11 +310,11 @@  static bool load_builtin_intel_microcode(struct cpio_data *cp)
 /*
  * Print ucode update info.
  */
-static void print_ucode_info(int old_rev, int new_rev, unsigned int date)
+static void print_ucode_info(bool failed, int old_rev, int new_rev, unsigned int date)
 {
-	pr_info_once("early update: 0x%x -> 0x%x, date = %04x-%02x-%02x\n",
+	pr_info_once("early update: 0x%x -> 0x%x, date = %04x-%02x-%02x %s\n",
 		     old_rev, new_rev, date & 0xffff, date >> 24,
-		     (date >> 16) & 0xff);
+		     (date >> 16) & 0xff, failed ? "FAILED" : "");
 }
 
 #ifdef CONFIG_X86_32
@@ -322,6 +322,7 @@  static void print_ucode_info(int old_rev, int new_rev, unsigned int date)
 static int delay_ucode_info;
 static int current_mc_date;
 static int early_old_rev;
+static bool early_failed;
 
 /*
  * Print early updated ucode info after printk works. This is delayed info dump.
@@ -332,7 +333,7 @@  void show_ucode_info_early(void)
 
 	if (delay_ucode_info) {
 		intel_cpu_collect_info(&uci);
-		print_ucode_info(early_old_rev, uci.cpu_sig.rev, current_mc_date);
+		print_ucode_info(early_failed, early_old_rev, uci.cpu_sig.rev, current_mc_date);
 		delay_ucode_info = 0;
 	}
 }
@@ -341,26 +342,28 @@  void show_ucode_info_early(void)
  * At this point, we can not call printk() yet. Delay printing microcode info in
  * show_ucode_info_early() until printk() works.
  */
-static void print_ucode(int old_rev, int new_rev, int date)
+static void print_ucode(bool failed, int old_rev, int new_rev, int date)
 {
-	struct microcode_intel *mc;
 	int *delay_ucode_info_p;
 	int *current_mc_date_p;
 	int *early_old_rev_p;
+	bool *early_failed_p;
 
 	delay_ucode_info_p = (int *)__pa_nodebug(&delay_ucode_info);
 	current_mc_date_p = (int *)__pa_nodebug(&current_mc_date);
 	early_old_rev_p = (int *)__pa_nodebug(&early_old_rev);
+	early_failed_p = (bool *)__pa_nodebug(&early_failed);
 
 	*delay_ucode_info_p = 1;
 	*current_mc_date_p = date;
 	*early_old_rev_p = old_rev;
+	*early_failed_p = failed;
 }
 #else
 
-static inline void print_ucode(int old_rev, int new_rev, int date)
+static inline void print_ucode(bool failed, int old_rev, int new_rev, int date)
 {
-	print_ucode_info(old_rev, new_rev, date);
+	print_ucode_info(failed, old_rev, new_rev, date);
 }
 #endif
 
@@ -368,6 +371,7 @@  static int apply_microcode_early(struct ucode_cpu_info *uci, bool early)
 {
 	struct microcode_intel *mc;
 	u32 rev, old_rev;
+	int retval = 0;
 
 	mc = uci->mc;
 	if (!mc)
@@ -396,16 +400,16 @@  static int apply_microcode_early(struct ucode_cpu_info *uci, bool early)
 	old_rev = rev;
 	rev = intel_get_microcode_revision();
 	if (rev != mc->hdr.rev)
-		return -1;
+		retval = -1;
 
 	uci->cpu_sig.rev = rev;
 
 	if (early)
-		print_ucode(old_rev, uci->cpu_sig.rev, mc->hdr.date);
+		print_ucode(retval, old_rev, mc->hdr.rev, mc->hdr.date);
 	else
-		print_ucode_info(old_rev, uci->cpu_sig.rev, mc->hdr.date);
+		print_ucode_info(retval, old_rev, uci->cpu_sig.rev, mc->hdr.date);
 
-	return 0;
+	return retval;
 }
 
 int __init save_microcode_in_initrd_intel(void)