[v2,1/6] x86/bugs: Add asm helpers for executing VERW

Message ID: 20231024-delay-verw-v2-1-f1881340c807@linux.intel.com
State: New
Series: Delay VERW

Commit Message

Pawan Gupta Oct. 24, 2023, 8:08 a.m. UTC
  MDS mitigation requires clearing the CPU buffers before returning to
user. This needs to be done late in the exit-to-user path. The current
location of VERW leaves open the possibility of kernel data ending up in
CPU buffers for memory accesses done after VERW, such as:

  1. Kernel data accessed by an NMI between VERW and return-to-user can
     remain in CPU buffers (since an NMI returning to the kernel does not
     execute VERW to clear the CPU buffers).
  2. Alyssa reported that after VERW is executed,
     CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system
     call. Memory accesses during stack scrubbing can move kernel stack
     contents into CPU buffers.
  3. When caller-saved registers are restored after a return from the
     function executing VERW, the kernel stack accesses can remain in
     CPU buffers (since they occur after VERW).

To fix this, VERW needs to be moved very late in the exit-to-user path.

In preparation for moving VERW to entry/exit asm code, create macros
that can be used in asm. Also make them depend on a new feature flag
X86_FEATURE_CLEAR_CPU_BUF.

Reported-by: Alyssa Milburn <alyssa.milburn@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h   |  2 +-
 arch/x86/include/asm/nospec-branch.h | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)
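
For reference, later patches in this series invoke the new macro at the very
end of the exit-to-user paths, after the last kernel memory access. A minimal
sketch of such a call site (illustrative only; the real placement is done by
the follow-up entry/exit patches and the surrounding code is elided):

	/* sketch: the macro sits right before the return-to-user instruction */
		...
		swapgs
		CLEAR_CPU_BUFFERS	/* VERW, or a NOP/skip when X86_FEATURE_CLEAR_CPU_BUF is not set */
		sysretq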
  

Comments

Peter Zijlstra Oct. 24, 2023, 10:36 a.m. UTC | #1
On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:

> +.macro CLEAR_CPU_BUFFERS
> +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> +		/* nopl __KERNEL_DS(%rax) */
> +		.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> +.Lverw_arg_\@:	.word __KERNEL_DS;
> +.Ldo_verw_\@:	verw _ASM_RIP(.Lverw_arg_\@);
> +.Lskip_verw_\@:
> +.endm

Why can't this be:

	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF

And have that mds_verw_sel thing be out-of-line ? That gives much better
code for the case where we don't need this.
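
Roughly, the two shapes at a call site are (illustrative expansion; exact NOP
bytes and label numbering are up to the assembler/alternatives machinery):

	/* v2 macro: jump, embedded selector word and VERW inline at every site */
		jmp	.Lskip_verw_1		/* patched to "jmp .Ldo_verw_1" when enabled */
		.byte	0x0f, 0x1f, 0x80, 0x00, 0x00	/* first 5 bytes of a 7-byte nopl... */
	.Lverw_arg_1:	.word	__KERNEL_DS		/* ...whose last 2 bytes double as the selector */
	.Ldo_verw_1:	verw	.Lverw_arg_1(%rip)
	.Lskip_verw_1:

	/* suggested form: only VERW (or NOPs) inline, selector shared out-of-line */
		ALTERNATIVE "", "verw mds_verw_sel(%rip)", X86_FEATURE_CLEAR_CPU_BUF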
  
Pawan Gupta Oct. 24, 2023, 4:35 p.m. UTC | #2
On Tue, Oct 24, 2023 at 12:36:01PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:
> 
> > +.macro CLEAR_CPU_BUFFERS
> > +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> > +		/* nopl __KERNEL_DS(%rax) */
> > +		.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> > +.Lverw_arg_\@:	.word __KERNEL_DS;
> > +.Ldo_verw_\@:	verw _ASM_RIP(.Lverw_arg_\@);
> > +.Lskip_verw_\@:
> > +.endm
> 
> Why can't this be:
> 
> 	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF
> 
> And have that mds_verw_sel thing be out-of-line ?

I haven't done it this way because it's a tad fragile as it depends on
modules being within 4GB of the kernel.

> That gives much better code for the case where we don't need this.

If this is the preferred way let me test this and roll a new revision.
  
Peter Zijlstra Oct. 24, 2023, 4:36 p.m. UTC | #3
On Tue, Oct 24, 2023 at 09:35:15AM -0700, Pawan Gupta wrote:
> On Tue, Oct 24, 2023 at 12:36:01PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:
> > 
> > > +.macro CLEAR_CPU_BUFFERS
> > > +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> > > +		/* nopl __KERNEL_DS(%rax) */
> > > +		.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> > > +.Lverw_arg_\@:	.word __KERNEL_DS;
> > > +.Ldo_verw_\@:	verw _ASM_RIP(.Lverw_arg_\@);
> > > +.Lskip_verw_\@:
> > > +.endm
> > 
> > Why can't this be:
> > 
> > 	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF
> > 
> > And have that mds_verw_sel thing be out-of-line ?
> 
> I haven't done this way because its a tad bit fragile as it depends on
> modules being within 4GB of kernel.

We 100% rely on that *everywhere*, nothing fragile about it.
  
Pawan Gupta Oct. 24, 2023, 4:45 p.m. UTC | #4
On Tue, Oct 24, 2023 at 06:36:21PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 24, 2023 at 09:35:15AM -0700, Pawan Gupta wrote:
> > On Tue, Oct 24, 2023 at 12:36:01PM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:
> > > 
> > > > +.macro CLEAR_CPU_BUFFERS
> > > > +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> > > > +		/* nopl __KERNEL_DS(%rax) */
> > > > +		.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> > > > +.Lverw_arg_\@:	.word __KERNEL_DS;
> > > > +.Ldo_verw_\@:	verw _ASM_RIP(.Lverw_arg_\@);
> > > > +.Lskip_verw_\@:
> > > > +.endm
> > > 
> > > Why can't this be:
> > > 
> > > 	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF
> > > 
> > > And have that mds_verw_sel thing be out-of-line ?
> > 
> > I haven't done this way because its a tad bit fragile as it depends on
> > modules being within 4GB of kernel.
> 
> We 100% rely on that *everywhere*, nothing fragile about it.

I didn't know that, doing it this way then. Thanks.
  
Peter Zijlstra Oct. 24, 2023, 5:02 p.m. UTC | #5
On Tue, Oct 24, 2023 at 09:45:20AM -0700, Pawan Gupta wrote:

> > > modules being within 4GB of kernel.

FWIW, it's 2G: it's an s32 displacement, so the highest address can
jump 2G down, while the lowest address can jump 2G up, leaving a 2G
directly addressable range.

And yeah, we ensure kernel text and modules are inside that 2G range.
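
In other words, the arithmetic is:

	rel32/disp32 is a sign-extended 32-bit value, so from any one
	instruction the reachable offsets are

		[-2^31, +2^31 - 1] bytes  ~=  [-2 GiB, +2 GiB)

	and for every pair of locations in kernel/module text to reach each
	other, they all have to sit inside a single 2 GiB window (the top of
	the address space, [-2 GiB, 0)).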
  
Pawan Gupta Oct. 24, 2023, 5:30 p.m. UTC | #6
On Tue, Oct 24, 2023 at 07:02:48PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 24, 2023 at 09:45:20AM -0700, Pawan Gupta wrote:
> 
> > > > modules being within 4GB of kernel.
> 
> FWIW, it's 2G, it's a s32 displacement, the highest most address can
> jump 2g down, while the lowest most address can jump 2g up. Leaving a 2G
> directly addressable range.
> 
> And yeah, we ensure kernel text and modules are inside that 2G range.

Ah, okay. Thanks for the info.
  
H. Peter Anvin Oct. 24, 2023, 6:27 p.m. UTC | #7
On October 24, 2023 10:02:48 AM PDT, Peter Zijlstra <peterz@infradead.org> wrote:
>On Tue, Oct 24, 2023 at 09:45:20AM -0700, Pawan Gupta wrote:
>
>> > > modules being within 4GB of kernel.
>
>FWIW, it's 2G, it's a s32 displacement, the highest most address can
>jump 2g down, while the lowest most address can jump 2g up. Leaving a 2G
>directly addressable range.
>
>And yeah, we ensure kernel text and modules are inside that 2G range.

To be specific, we don't require that it is located at any particular *physical* address, but all modules, including the root module, are remapped into the [-2GiB,0) range. If we didn't do that, modules would have to be compiled with the PIC memory model rather than the kernel memory model, which is what they currently use. This would add substantial overhead due to the need for a GOT (the PLT is optional if all symbols are resolved at load time).

The kernel is different from user space objects since it is always fully loaded into physical memory and is never paged or shared. Therefore, inline relocations, which break sharing and create dirty pages in user space, have zero execution cost in the kernel; the only overhead to modules other than load time (including the runtime linking) is that modules can't realistically be mapped using large page entries.
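
The codegen difference between the two memory models shows up in how a global
is addressed; a rough sketch (ext_var is a made-up symbol and the exact output
depends on the compiler):

	/* kernel memory model (-mcmodel=kernel): direct reference, no GOT */
		movl	ext_var(%rip), %eax

	/* PIC memory model (-fpic): the address is loaded from the GOT first */
		movq	ext_var@GOTPCREL(%rip), %rdx
		movl	(%rdx), %eax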
  
Luck, Tony Oct. 24, 2023, 6:49 p.m. UTC | #8
> the only overhead to modules other than load time (including the runtime linking) is that modules can't realistically be mapped using large page entries.

If there were some significant win for using large pages, couldn't the
kernel pre-allocate some 2MB pages in the [-2GiB,0) range?  Boot parameter
for how many (perhaps two for separate code/data pages). First few loaded
modules get to use that space until it is all gone.

It would all be quite messy if those modules were later unloaded/reloaded
... so there would have to be some compelling benchmarks to justify
the complexity.

That's probably why Peter said "can't realistically".

-Tony
  
H. Peter Anvin Oct. 24, 2023, 7:14 p.m. UTC | #9
On October 24, 2023 11:49:07 AM PDT, "Luck, Tony" <tony.luck@intel.com> wrote:
>> the only overhead to modules other than load time (including the runtime linking) is that modules can't realistically be mapped using large page entries.
>
>If there were some significant win for using large pages, couldn't the
>kernel pre-allocate some 2MB pages in the [-2GiB,0) range?  Boot parameter
>for how many (perhaps two for separate code/data pages). First few loaded
>modules get to use that space until it is all gone.
>
>It would all be quite messy if those modules were later unloaded/reloaded
>... so there would have to be some compelling benchmarks to justify
>the complexity.
>
>That's probably why Peter said "can't realistically".
>
>-Tony
>

Sure it could, but it would mean the kernel is sitting on an average of 6 MB of unusable memory. It would also mean that unloaded modules would create holes in that memory which would have to be managed.
  
Luck, Tony Oct. 24, 2023, 7:40 p.m. UTC | #10
> Sure it could, but it would mean the kernel is sitting on an average of 6 MB of unusable memory. It would also mean that unloaded modules would create holes in that memory which would have to be managed.

On my Fedora38 desktop:

$ lsmod | awk '{ bytes += $2 } END {print bytes/(1024*1024)}'
21.0859

Lots more than 6MB memory already essentially pinned by loaded modules.

$ head -3 /proc/meminfo
MemTotal:       65507344 kB
MemFree:        56762336 kB
MemAvailable:   63358552 kB

Pinning 20 or so Mbytes isn't going to make a dent in that free memory.

Managing the holes for unloading/reloading modules adds some complexity ... but shouldn't be awful.

If this code managed memory at a finer granularity than a page, it would save some memory.

$ lsmod | wc -l
123

All those modules rounding text/data up to 4K boundaries is wasting a bunch of it.

-Tony
  
H. Peter Anvin Oct. 24, 2023, 8:30 p.m. UTC | #11
On October 24, 2023 12:40:02 PM PDT, "Luck, Tony" <tony.luck@intel.com> wrote:
>> Sure it could, but it would mean the kernel is sitting on an average of 6 MB of unusable memory. It would also mean that unloaded modules would create holes in that memory which would have to be managed.
>
>On my Fedora38 desktop:
>
>$ lsmod | awk '{ bytes += $2 } END {print bytes/(1024*1024)}'
>21.0859
>
>Lots more than 6MB memory already essentially pinned by loaded modules.
>
>$ head -3 /proc/meminfo
>MemTotal:       65507344 kB
>MemFree:        56762336 kB
>MemAvailable:   63358552 kB
>
>Pinning 20 or so Mbytes isn't going to make a dent in that free memory.
>
>Managing the holes for unloading/reloading modules adds some complexity ... but shouldn't be awful.
>
>If this code managed at finer granularity than "page", it would save some memory.
>
>$ lsmod | wc -l
>123
>
>All those modules rounding text/data up to 4K boundaries is wasting a bunch of it.
>
>-Tony
>
>
>

Sure, but is it worth the effort?
  
Peter Zijlstra Oct. 24, 2023, 10:28 p.m. UTC | #12
On Tue, Oct 24, 2023 at 11:27:36AM -0700, H. Peter Anvin wrote:

> the only overhead to modules other than load time (including the
> runtime linking) is that modules can't realistically be mapped using
> large page entries.

Song is working on that. Currently the module allocator is split between
all the (5 IIRC) different page permissions required; the next step is using
large page pools for those.
  
Peter Zijlstra Oct. 24, 2023, 10:30 p.m. UTC | #13
On Tue, Oct 24, 2023 at 06:49:07PM +0000, Luck, Tony wrote:
> > the only overhead to modules other than load time (including the runtime linking) is that modules can't realistically be mapped using large page entries.
> 
> If there were some significant win for using large pages, couldn't the

There is. The 4k TLBs really hurt. Thomas and I ran into that when
doing the retbleed call depth crud. Similarly, Song ran into it with BPF;
they really want the eBPF JIT to write to large pages.
  
Pawan Gupta Oct. 25, 2023, 4 a.m. UTC | #14
On Tue, Oct 24, 2023 at 12:36:01PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 24, 2023 at 01:08:21AM -0700, Pawan Gupta wrote:
> 
> > +.macro CLEAR_CPU_BUFFERS
> > +	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
> > +		/* nopl __KERNEL_DS(%rax) */
> > +		.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
> > +.Lverw_arg_\@:	.word __KERNEL_DS;
> > +.Ldo_verw_\@:	verw _ASM_RIP(.Lverw_arg_\@);
> > +.Lskip_verw_\@:
> > +.endm
> 
> Why can't this be:
> 
> 	ALTERNATIVE "", "verw _ASM_RIP(mds_verw_sel)", X86_FEATURE_CLEAR_CPU_BUF
> 
> And have that mds_verw_sel thing be out-of-line ? That gives much better
> code for the case where we don't need this.

Overall the code generated with this approach is much better. But in my
testing I am seeing an issue with runtime patching in 32-bit mode when
mitigations are off: instead of NOPs I am seeing a random-looking
instruction. I don't see any issue with 64-bit mode.

config1: mitigations=on, 32-bit mode, post-boot

entry_SYSENTER_32:
   ...
   0xc1a3748e <+222>:   pop    %eax
   0xc1a3748f <+223>:   verw   0xc1a38240
   0xc1a37496 <+230>:   sti
   0xc1a37497 <+231>:   sysexit

---------------------------------------------

config2: mitigations=off, 32-bit mode, post-boot

entry_SYSENTER_32:
   ...
   0xc1a3748e <+222>:   pop    %eax
   0xc1a3748f <+223>:   lea    0x0(%esi,%eiz,1),%esi   <---- Doesn't look right
   0xc1a37496 <+230>:   sti
   0xc1a37497 <+231>:   sysexit

---------------------------------------------

config3: 32-bit mode, pre-boot objdump

entry_SYSENTER_32:
   ...
   c8e:       58                      pop    %eax
   c8f:       90                      nop
   c90:       90                      nop
   c91:       90                      nop
   c92:       90                      nop
   c93:       90                      nop
   c94:       90                      nop
   c95:       90                      nop
   c96:       fb                      sti
   c97:       0f 35                   sysexit

These tests were done with below patch:

-----8<-----
From: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Date: Mon, 23 Oct 2023 15:04:56 -0700
Subject: [PATCH] x86/bugs: Add asm helpers for executing VERW

MDS mitigation requires clearing the CPU buffers before returning to
user. This needs to be done late in the exit-to-user path. The current
location of VERW leaves open the possibility of kernel data ending up in
CPU buffers for memory accesses done after VERW, such as:

  1. Kernel data accessed by an NMI between VERW and return-to-user can
     remain in CPU buffers (since an NMI returning to the kernel does not
     execute VERW to clear the CPU buffers).
  2. Alyssa reported that after VERW is executed,
     CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system
     call. Memory accesses during stack scrubbing can move kernel stack
     contents into CPU buffers.
  3. When caller-saved registers are restored after a return from the
     function executing VERW, the kernel stack accesses can remain in
     CPU buffers (since they occur after VERW).

To fix this, VERW needs to be moved very late in the exit-to-user path.

In preparation for moving VERW to entry/exit asm code, create macros
that can be used in asm. Also make them depend on a new feature flag
X86_FEATURE_CLEAR_CPU_BUF.

Reported-by: Alyssa Milburn <alyssa.milburn@intel.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h   |  2 +-
 arch/x86/include/asm/nospec-branch.h | 24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 58cb9495e40f..f21fc0f12737 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -308,10 +308,10 @@
 #define X86_FEATURE_SMBA		(11*32+21) /* "" Slow Memory Bandwidth Allocation */
 #define X86_FEATURE_BMEC		(11*32+22) /* "" Bandwidth Monitoring Event Configuration */
 #define X86_FEATURE_USER_SHSTK		(11*32+23) /* Shadow stack support for user mode applications */
-
 #define X86_FEATURE_SRSO		(11*32+24) /* "" AMD BTB untrain RETs */
 #define X86_FEATURE_SRSO_ALIAS		(11*32+25) /* "" AMD BTB untrain RETs through aliasing */
 #define X86_FEATURE_IBPB_ON_VMEXIT	(11*32+26) /* "" Issue an IBPB only on VMEXIT */
+#define X86_FEATURE_CLEAR_CPU_BUF	(11*32+27) /* "" Clear CPU buffers */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c55cc243592e..ed8218e2d9a7 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -13,6 +13,7 @@
 #include <asm/unwind_hints.h>
 #include <asm/percpu.h>
 #include <asm/current.h>
+#include <asm/segment.h>
 
 /*
  * Call depth tracking for Intel SKL CPUs to address the RSB underflow
@@ -329,6 +330,29 @@
 #endif
 .endm
 
+/*
+ * Macros to execute the VERW instruction to mitigate transient data sampling
+ * attacks such as MDS. On affected systems a microcode update overloaded the
+ * VERW instruction to also clear the CPU buffers. VERW clobbers EFLAGS.ZF.
+ *
+ * Note: Only the memory operand variant of VERW clears the CPU buffers.
+ */
+.pushsection .rodata
+.align 64
+mds_verw_sel:
+	.word __KERNEL_DS
+ 	.byte 0xcc
+.align 64
+.popsection
+
+.macro EXEC_VERW
+	verw _ASM_RIP(mds_verw_sel)
+.endm
+
+.macro CLEAR_CPU_BUFFERS
+	ALTERNATIVE "", __stringify(EXEC_VERW), X86_FEATURE_CLEAR_CPU_BUF
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_RETPOLINE_SAFE					\
  
Peter Zijlstra Oct. 25, 2023, 6:56 a.m. UTC | #15
On Tue, Oct 24, 2023 at 09:00:29PM -0700, Pawan Gupta wrote:

> config1: mitigations=on, 32-bit mode, post-boot
> 
> entry_SYSENTER_32:
>    ...
>    0xc1a3748e <+222>:   pop    %eax
>    0xc1a3748f <+223>:   verw   0xc1a38240
>    0xc1a37496 <+230>:   sti
>    0xc1a37497 <+231>:   sysexit
> 
> ---------------------------------------------
> 
> config2: mitigations=off, 32-bit mode, post-boot
> 
> entry_SYSENTER_32:
>    ...
>    0xc1a3748e <+222>:   pop    %eax
>    0xc1a3748f <+223>:   lea    0x0(%esi,%eiz,1),%esi   <---- Doesn't look right
>    0xc1a37496 <+230>:   sti
>    0xc1a37497 <+231>:   sysexit
> 
> ---------------------------------------------
> 
> config3: 32-bit mode, pre-boot objdump
> 
> entry_SYSENTER_32:
>    ...
>    c8e:       58                      pop    %eax
>    c8f:       90                      nop
>    c90:       90                      nop
>    c91:       90                      nop
>    c92:       90                      nop
>    c93:       90                      nop
>    c94:       90                      nop
>    c95:       90                      nop
>    c96:       fb                      sti
>    c97:       0f 35                   sysexit
> 

If you look at arch/x86/include/asm/nops.h, you'll find (for 32bit):

 * 7: leal 0x0(%esi,%eiz,1),%esi

Which reads as:

	load-effective-address of %esi[0] into %esi

which is, of course, just %esi.

You can also get this from GAS using:

	.nops 7

which results in:

   0:   8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi

It is basically abusing bizarro x86 instruction encoding rules to create
a 7 byte nop without using NOPL. If you really want to know, volume 2
of the SDM has a ton of stuff on instruction encoding, also the interweb
:-)
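
Byte by byte, that 7-byte form decodes as:

	8d		/* opcode: lea */
	b4		/* ModRM: mod=10 (disp32), reg=%esi (destination), rm=100 (SIB byte follows) */
	26		/* SIB: scale=1, index=100 (none, printed as %eiz), base=%esi */
	00 00 00 00	/* disp32 = 0 */

	/* lea 0x0(%esi,%eiz,1),%esi: writes %esi + 0 back into %esi, i.e. changes nothing */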
  
Peter Zijlstra Oct. 25, 2023, 6:58 a.m. UTC | #16
On Tue, Oct 24, 2023 at 09:00:29PM -0700, Pawan Gupta wrote:

> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index c55cc243592e..ed8218e2d9a7 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -13,6 +13,7 @@
>  #include <asm/unwind_hints.h>
>  #include <asm/percpu.h>
>  #include <asm/current.h>
> +#include <asm/segment.h>
>  
>  /*
>   * Call depth tracking for Intel SKL CPUs to address the RSB underflow
> @@ -329,6 +330,29 @@
>  #endif
>  .endm
>  
> +/*
> + * Macros to execute VERW instruction that mitigate transient data sampling
> + * attacks such as MDS. On affected systems a microcode update overloaded VERW
> + * instruction to also clear the CPU buffers. VERW clobbers CFLAGS.ZF.
> + *
> + * Note: Only the memory operand variant of VERW clears the CPU buffers.
> + */
> +.pushsection .rodata
> +.align 64
> +mds_verw_sel:
> +	.word __KERNEL_DS
> + 	.byte 0xcc
> +.align 64
> +.popsection

This should not be in a header file, you'll get an instance of this per
translation unit, not what you want.

> +
> +.macro EXEC_VERW
> +	verw _ASM_RIP(mds_verw_sel)
> +.endm
> +
> +.macro CLEAR_CPU_BUFFERS
> +	ALTERNATIVE "", __stringify(EXEC_VERW), X86_FEATURE_CLEAR_CPU_BUF
> +.endm
> +
>  #else /* __ASSEMBLY__ */
  
Pawan Gupta Oct. 25, 2023, 3:06 p.m. UTC | #17
On Wed, Oct 25, 2023 at 08:56:10AM +0200, Peter Zijlstra wrote:
> > config3: 32-bit mode, pre-boot objdump
> > 
> > entry_SYSENTER_32:
> >    ...
> >    c8e:       58                      pop    %eax
> >    c8f:       90                      nop
> >    c90:       90                      nop
> >    c91:       90                      nop
> >    c92:       90                      nop
> >    c93:       90                      nop
> >    c94:       90                      nop
> >    c95:       90                      nop
> >    c96:       fb                      sti
> >    c97:       0f 35                   sysexit
> > 
> 
> If you look at arch/x86/include/asm/nops.h, you'll find (for 32bit):
> 
>  * 7: leal 0x0(%esi,%eiz,1),%esi
> 
> Which reads as:
> 
> 	load-effective-address of %esi[0] into %esi

Wow, never imagined that this would be one of the magician's tricks. I
will go read up on why it is better than NOPL.
  
Pawan Gupta Oct. 25, 2023, 3:10 p.m. UTC | #18
On Wed, Oct 25, 2023 at 08:58:18AM +0200, Peter Zijlstra wrote:
> > +.pushsection .rodata
> > +.align 64
> > +mds_verw_sel:
> > +	.word __KERNEL_DS
> > + 	.byte 0xcc
> > +.align 64
> > +.popsection
> 
> This should not be in a header file, you'll get an instance of this per
> translation unit, not what you want.

Agh, sorry I missed it.
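
One way to do that for v3 is to keep only the macro in the header and define
the selector exactly once in a .S file; a rough sketch (the file placement,
linkage annotations and export are assumptions, not the final patch):

	/* e.g. somewhere under arch/x86/entry/, assembled exactly once */
	#include <linux/linkage.h>
	#include <asm/segment.h>

		.pushsection .rodata
		.align 64
	SYM_DATA_START(mds_verw_sel)
		.word	__KERNEL_DS
		.byte	0xcc
		.align 64
	SYM_DATA_END(mds_verw_sel)
		.popsection

	/* plus an EXPORT_SYMBOL_GPL(mds_verw_sel) so module text can reference it */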
  

Patch

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 58cb9495e40f..f21fc0f12737 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -308,10 +308,10 @@ 
 #define X86_FEATURE_SMBA		(11*32+21) /* "" Slow Memory Bandwidth Allocation */
 #define X86_FEATURE_BMEC		(11*32+22) /* "" Bandwidth Monitoring Event Configuration */
 #define X86_FEATURE_USER_SHSTK		(11*32+23) /* Shadow stack support for user mode applications */
-
 #define X86_FEATURE_SRSO		(11*32+24) /* "" AMD BTB untrain RETs */
 #define X86_FEATURE_SRSO_ALIAS		(11*32+25) /* "" AMD BTB untrain RETs through aliasing */
 #define X86_FEATURE_IBPB_ON_VMEXIT	(11*32+26) /* "" Issue an IBPB only on VMEXIT */
+#define X86_FEATURE_CLEAR_CPU_BUF	(11*32+27) /* "" Clear CPU buffers */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c55cc243592e..c269ee74682c 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -329,6 +329,25 @@ 
 #endif
 .endm
 
+/*
+ * Macro to execute the VERW instruction to mitigate transient data sampling
+ * attacks such as MDS. On affected systems a microcode update overloaded the
+ * VERW instruction to also clear the CPU buffers. VERW clobbers EFLAGS.ZF.
+ *
+ * Note: Only the memory operand variant of VERW clears the CPU buffers. To
+ * handle the case when VERW is executed after user registers are restored,
+ * use RIP-relative addressing to point the memory operand at the part of the
+ * NOPL instruction that contains __KERNEL_DS.
+ */
+.macro CLEAR_CPU_BUFFERS
+	ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@", X86_FEATURE_CLEAR_CPU_BUF
+		/* nopl __KERNEL_DS(%rax) */
+		.byte 0x0f, 0x1f, 0x80, 0x00, 0x00;
+.Lverw_arg_\@:	.word __KERNEL_DS;
+.Ldo_verw_\@:	verw _ASM_RIP(.Lverw_arg_\@);
+.Lskip_verw_\@:
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_RETPOLINE_SAFE					\