[v10,03/38] x86/msr: Add the WRMSRNS instruction support

Message ID 20230914044805.301390-4-xin3.li@intel.com
State New
Series: x86: enable FRED for x86-64

Commit Message

Li, Xin3 Sept. 14, 2023, 4:47 a.m. UTC
  Add an always inline API __wrmsrns() to embed the WRMSRNS instruction
into the code.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/msr.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
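
For reference, the only call site in this series (quoted in full in the
discussion below) uses the helper to switch the FRED RSP0 stack on
context switch:

	/* WRMSRNS is a baseline feature for FRED. */
	wrmsrns(MSR_IA32_FRED_RSP0,
		(unsigned long)task_stack_page(task) + THREAD_SIZE);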
  

Comments

Andrew Cooper Sept. 14, 2023, 1:01 p.m. UTC | #1
On 14/09/2023 7:02 am, Juergen Gross wrote:
> On 14.09.23 06:47, Xin Li wrote:
>> Add an always inline API __wrmsrns() to embed the WRMSRNS instruction
>> into the code.
>>
>> Tested-by: Shan Kang <shan.kang@intel.com>
>> Signed-off-by: Xin Li <xin3.li@intel.com>
>
> In order to avoid having to add paravirt support for WRMSRNS I think
> xen_init_capabilities() should gain:
>
> +    setup_clear_cpu_cap(X86_FEATURE_WRMSRNS);

Xen PV guests will never ever see WRMSRNS.  Operating in CPL3, they have
no possible way of adjusting an MSR which isn't serialising, because
even the hypercall forms are serialising.

Xen only exposes the bit for HVM guests.

~Andrew
  
Andrew Cooper Sept. 14, 2023, 2:05 p.m. UTC | #2
On 14/09/2023 5:47 am, Xin Li wrote:
> Add an always inline API __wrmsrns() to embed the WRMSRNS instruction
> into the code.
>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>  arch/x86/include/asm/msr.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..c284ff9ebe67 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -97,6 +97,19 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
>  		     : : "c" (msr), "a"(low), "d" (high) : "memory");
>  }
>  
> +/*
> + * WRMSRNS behaves exactly like WRMSR with the only difference being
> + * that it is not a serializing instruction by default.
> + */
> +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
> +{
> +	/* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
> +	asm volatile("1: .byte 0x0f,0x01,0xc6\n"
> +		     "2:\n"
> +		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> +		     : : "c" (msr), "a"(low), "d" (high));
> +}
> +
>  #define native_rdmsr(msr, val1, val2)			\
>  do {							\
>  	u64 __val = __rdmsr((msr));			\
> @@ -297,6 +310,11 @@ do {							\
>  
>  #endif	/* !CONFIG_PARAVIRT_XXL */
>  
> +static __always_inline void wrmsrns(u32 msr, u64 val)
> +{
> +	__wrmsrns(msr, val, val >> 32);
> +}

This API works in terms of this series where every WRMSRNS is hidden
behind a FRED check, but it's an awkward interface to use anywhere else
in the kernel.

I fully understand that you expect all FRED capable systems to have
WRMSRNS, but it is not a hard requirement and you will end up with
simpler (and therefore better) logic by deleting the dependency.

As a "normal" user of the WRMSR APIs, the programmer only cares about:

1) wrmsr() -> needs to be serialising
2) wrmsr_ns() -> safe to be non-serialising


In Xen, I added something of the form:

/*
 * Non-serialising WRMSR, when available.  Falls back to a serialising
 * WRMSR.
 */
static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
{
    /*
     * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant CS
     * prefix to avoid a trailing NOP.
     */
    alternative_input(".byte 0x2e; wrmsr",
                      ".byte 0x0f,0x01,0xc6", X86_FEATURE_WRMSRNS,
                      "c" (msr), "a" (lo), "d" (hi));
}

and despite what Juergen said, I'm going to recommend that you do wire
this through the paravirt infrastructure, for the benefit of regular
users having a nice API, not because XenPV is expecting to do something
wildly different here.


I'd actually go as far as suggesting that you break patches 1-3 into
different series and e.g. update the regular context switch path to use
the WRMSRNS-falling-back-to-WRMSR helpers, just to get started.

~Andrew
  
Thomas Gleixner Sept. 14, 2023, 11 p.m. UTC | #3
Andrew!

On Thu, Sep 14 2023 at 15:05, andrew wrote:
> On 14/09/2023 5:47 am, Xin Li wrote:
>> +static __always_inline void wrmsrns(u32 msr, u64 val)
>> +{
>> +	__wrmsrns(msr, val, val >> 32);
>> +}
>
> This API works in terms of this series where every WRMSRNS is hidden
> behind a FRED check, but it's an awkward interface to use anywhere else
> in the kernel.

Agreed.

> I fully understand that you expect all FRED capable systems to have
> WRMSRNS, but it is not a hard requirement and you will end up with
> simpler (and therefore better) logic by deleting the dependency.

According to the CPU folks FRED systems are guaranteed to have WRMSRNS -
I asked for that :). It's just not yet documented.

But that aside, I agree that we should opt for the safe side with a
fallback like the one you have in XEN even for the places which are
strictly FRED dependent.

> As a "normal" user of the WRMSR APIs, the programmer only cares about:
>
> 1) wrmsr() -> needs to be serialising
> 2) wrmsr_ns() -> safe to be non-serialising

Correct.

> In Xen, I added something of the form:
>
> /*
>  * Non-serialising WRMSR, when available.  Falls back to a serialising
>  * WRMSR.
>  */
> static inline void wrmsr_ns(uint32_t msr, uint32_t lo, uint32_t hi)
> {
>     /*
>      * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant CS
>      * prefix to avoid a trailing NOP.
>      */
>     alternative_input(".byte 0x2e; wrmsr",
>                       ".byte 0x0f,0x01,0xc6", X86_FEATURE_WRMSRNS,
>                       "c" (msr), "a" (lo), "d" (hi));
> }
>
> and despite what Juergen said, I'm going to recommend that you do wire
> this through the paravirt infrastructure, for the benefit of regular
> users having a nice API, not because XenPV is expecting to do something
> wildly different here.

I fundamentally hate adding this to the PV infrastructure. We don't want
more PV ops, quite the contrary.

For the initial use case at hand, there is an explicit FRED dependency
and the code in question really wants to use WRMSRNS directly and not
through a PV function call.

I agree with your reasoning for the more generic use case where we can
gain performance independent of FRED by using WRMSRNS for cases where
the write has no serialization requirements.

But this made me look into PV ops some more. For actual performance
relevant code the current PV ops mechanics are a horrorshow when the op
defaults to the native instruction.

Let's look at wrmsrl():

wrmsrl(msr, val)
 wrmsr(msr, (u32)val, (u32)(val >> 32))
  paravirt_write_msr(msr, low, high)
    PVOP_VCALL3(cpu.write_msr, msr, low, high)

Which results in

	mov	$msr, %edi
	mov	$val, %rdx
	mov	%edx, %esi
	shr	$0x20, %rdx
	call	native_write_msr

and native_write_msr() does at minimum:

	mov    %edi,%ecx
	mov    %esi,%eax
	wrmsr
        ret

In the worst case 'ret' is going through the return thunk, not to
mention function prologues and whatever.

This becomes even more silly for trivial instructions like STI/CLI or in
the worst case paravirt_nop().

The call only makes sense when the native default is an actual
function, but for the trivial cases it's a blatant engineering
trainwreck.

I wouldn't care at all if CONFIG_PARAVIRT_XXL were the esoteric use
case, but AFAICT it's enabled by default on all major distros.

So no. I'm fundamentally disagreeing with your recommendation. The way
forward is:

  1) Provide the native variant for wrmsrns(), i.e. rename the proposed
     wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
     safety net as you pointed out (see the sketch after this list).

     That function can be used in code which is guaranteed to be not
     affected by the PV_XXL madness.

  2) Come up with a sensible solution for the PV_XXL horrorshow

  3) Implement a sane general variant of wrmsr_ns() which handles
     both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL

  4) Convert other code which benefits from the non-serializing variant
     to wrmsr_ns()
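
A minimal sketch of what #1 could look like, assuming the
X86_FEATURE_WRMSRNS cpufeature bit is wired up and reusing the CS
prefix padding trick from your Xen helper (not part of the posted
patch):

	/*
	 * Sketch only: WRMSRNS when the CPU has it, plain WRMSR
	 * otherwise.  WRMSR is 2 bytes and WRMSRNS is 3, so the
	 * redundant CS prefix keeps both alternatives the same length.
	 */
	static __always_inline void native_wrmsr_ns(u32 msr, u32 low, u32 high)
	{
		asm volatile("1:\n"
			     ALTERNATIVE(".byte 0x2e; wrmsr",
					 ".byte 0x0f,0x01,0xc6", /* wrmsrns */
					 X86_FEATURE_WRMSRNS)
			     "2:\n"
			     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
			     : : "c" (msr), "a" (low), "d" (high) : "memory");
	}

Because both alternatives are patched inline and no CALL is ever
emitted, the extable entry stays unambiguous here.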

Thanks,

        tglx
  
H. Peter Anvin Sept. 14, 2023, 11:34 p.m. UTC | #4
On September 14, 2023 4:00:39 PM PDT, Thomas Gleixner <tglx@linutronix.de> wrote:
>[...]
>So no. I'm fundamentally disagreeing with your recommendation. The way
>forward is:
>
>  1) Provide the native variant for wrmsrns(), i.e. rename the proposed
>     wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
>     safety net as you pointed out.
>
>     That function can be used in code which is guaranteed to be not
>     affected by the PV_XXL madness.
>
>  2) Come up with a sensible solution for the PV_XXL horrorshow
>
>  3) Implement a sane general variant of wrmsr_ns() which handles
>     both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL
>
>  4) Convert other code which benefits from the non-serializing variant
>     to wrmsr_ns()

With regards to (2), the IMO only sensible solution is to have the ABI be the one of the native instruction, and to have the PVXXL users take the full brunt of the overhead. That means that on a native or HVM machine, the proper code gets patched in inline, and the PVXXL code becomes a call to a stub that does the parameter marshalling before walking off back into C ABI land. The pvop further has to bear the full cost of providing the full native semantics, unless otherwise agreed with the native maintainers and the modified semantics are explicitly documented (notably, having explicit stubs for certain specific MSRs is entirely reasonable).

In case this sounds familiar, it is the pvops we were promised over 15 years ago, and yet somehow never really got there. It is *also* similar, in an inside-out way, to the ABI marshalling used for legacy BIOS functions in the 16-bit startup code.

In-place "fat" paravirtualization in the Linux kernel has been a horror show, with the notable exception of UML, which is almost universally ignored yet manages to stay out of the way and keep working.

This is a classic case of the Tragedy of the Commons, much like burning coal for power.
  
Andrew Cooper Sept. 14, 2023, 11:46 p.m. UTC | #5
On 15/09/2023 12:00 am, Thomas Gleixner wrote:
>> and despite what Juergen said, I'm going to recommend that you do wire
>> this through the paravirt infrastructure, for the benefit of regular
>> users having a nice API, not because XenPV is expecting to do something
>> wildly different here.
> I fundamentally hate adding this to the PV infrastructure. We don't want
> more PV ops, quite the contrary.

What I meant was "there should be the two top-level APIs, and under the
covers they DTRT".  Part of doing the right thing is to wire up paravirt
for configs where that is specified.

Anything else is going to force people to write logic of the form:

    if (WRMSRNS && !XENPV)
        wrmsr_ns(...)
    else
        wrmsr(...)

which is going to be worse overall.  And there really is one example of
this antipattern already in the series.


> For the initial use case at hand, there is an explicit FRED dependency
> and the code in question really wants to use WRMSRNS directly and not
> through a PV function call.
>
> I agree with your reasoning for the more generic use case where we can
> gain performance independent of FRED by using WRMSRNS for cases where
> the write has no serialization requirements.
>
> But this made me look into PV ops some more. For actual performance
> relevant code the current PV ops mechanics are a horrorshow when the op
> defaults to the native instruction.
>
> Let's look at wrmsrl():
>
> wrmsrl(msr, val)
>  wrmsr(msr, (u32)val, (u32)(val >> 32))
>   paravirt_write_msr(msr, low, high)
>     PVOP_VCALL3(cpu.write_msr, msr, low, high)
>
> Which results in
>
> 	mov	$msr, %edi
> 	mov	$val, %rdx
> 	mov	%edx, %esi
> 	shr	$0x20, %rdx
> 	call	native_write_msr
>
> and native_write_msr() does at minimum:
>
> 	mov    %edi,%ecx
> 	mov    %esi,%eax
> 	wrmsr
>         ret

Yeah, this is daft.  But it can also be fixed irrespective of WRMSRNS.

WRMSR has one complexity that most other PV-ops don't, and that's the
exception table reference for the instruction itself.

In a theoretical future, this ought to look like:

    mov    $msr, %ecx
    mov    $lo, %eax
    mov    $hi, %edx
    1: {call paravirt_blah(%rip) | cs...cs wrmsr | cs...cs wrmsrns }
    _ASM_EXTABLE(1b, ...)

In paravirt builds, the CALL needs to be the emitted form, because it
needs to function in very early boot.

But once the paravirt-ness has been chosen and alternatives run, the
as-native paths are fully inline.

The alternative which processes this site wants, in the case it does
not replace the CALL, to clobber the EXTABLE reference: CALL
instructions can #GP, and you don't want to end up thinking you're
handling a WRMSR #GP when in fact it was a non-canonical function
pointer.

> In the worst case 'ret' is going through the return thunk, not to
> mention function prologues and whatever.
>
> This becomes even more silly for trivial instructions like STI/CLI or in
> the worst case paravirt_nop().

STI/CLI are already magic.  Are they not inlined?

> The call only makes sense when the native default is an actual
> function, but for the trivial cases it's a blatant engineering
> trainwreck.
>
> I wouldn't care at all if CONFIG_PARAVIRT_XXL were the esoteric use
> case, but AFAICT it's enabled by default on all major distros.
>
> So no. I'm fundamentally disagreeing with your recommendation. The way
> forward is:
>
>   1) Provide the native variant for wrmsrns(), i.e. rename the proposed
>      wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
>      safety net as you pointed out.
>
>      That function can be used in code which is guaranteed to be not
>      affected by the PV_XXL madness.
>
>   2) Come up with a sensible solution for the PV_XXL horrorshow
>
>   3) Implement a sane general variant of wrmsr_ns() which handles
>      both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL
>
>   4) Convert other code which benefits from the non-serializing variant
>      to wrmsr_ns()

Well - point 1 is mostly work that needs reverting as part of completing
point 3, and point 2 clearly needs doing irrespective of anything else.

Thanks,

~Andrew
  
Thomas Gleixner Sept. 15, 2023, 12:12 a.m. UTC | #6
On Fri, Sep 15 2023 at 00:46, andrew wrote:
> On 15/09/2023 12:00 am, Thomas Gleixner wrote:
>> So no. I'm fundamentally disagreeing with your recommendation. The way
>> forward is:
>>
>>   1) Provide the native variant for wrmsrns(), i.e. rename the proposed
>>      wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
>>      safety net as you pointed out.
>>
>>      That function can be used in code which is guaranteed to be not
>>      affected by the PV_XXL madness.
>>
>>   2) Come up with a sensible solution for the PV_XXL horrorshow
>>
>>   3) Implement a sane general variant of wrmsr_ns() which handles
>>      both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL
>>
>>   4) Convert other code which benefits from the non-serializing variant
>>      to wrmsr_ns()
>
> Well - point 1 is mostly work that needs reverting as part of completing
> point 3, and point 2 clearly needs doing irrespective of anything else.

No. #1 has a value on its own independent of the general variant in #3.

>>      That function can be used in code which is guaranteed to be not
>>      affected by the PV_XXL madness.

That makes a lot of sense because it's guaranteed to generate better
code than whatever we come up with for the PV_XXL nonsense.

Thanks,

        tglx
  
Andrew Cooper Sept. 15, 2023, 12:33 a.m. UTC | #7
On 15/09/2023 1:12 am, Thomas Gleixner wrote:
> On Fri, Sep 15 2023 at 00:46, andrew wrote:
>> On 15/09/2023 12:00 am, Thomas Gleixner wrote:
>>> So no. I'm fundamentally disagreeing with your recommendation. The way
>>> forward is:
>>>
>>>   1) Provide the native variant for wrmsrns(), i.e. rename the proposed
>>>      wrmsrns() to native_wrmsr_ns() and have the X86_FEATURE_WRMSRNS
>>>      safety net as you pointed out.
>>>
>>>      That function can be used in code which is guaranteed to be not
>>>      affected by the PV_XXL madness.
>>>
>>>   2) Come up with a sensible solution for the PV_XXL horrorshow
>>>
>>>   3) Implement a sane general variant of wrmsr_ns() which handles
>>>      both X86_FEATURE_WRMSRNS and X86_MISFEATURE_PV_XXL
>>>
>>>   4) Convert other code which benefits from the non-serializing variant
>>>      to wrmsr_ns()
>> Well - point 1 is mostly work that needs reverting as part of completing
>> point 3, and point 2 clearly needs doing irrespective of anything else.
> No. #1 has a value on its own independent of the general variant in #3.
>
>>>      That function can be used in code which is guaranteed to be not
>>>      affected by the PV_XXL madness.
> That makes a lot of sense because it's guaranteed to generate better
> code than whatever we come up with for the PV_XXL nonsense.

It's an assumption about what "definitely won't" be paravirt in the future.

XenPV stack handling is almost-FRED-like and has been for the better
part of two decades.

You frequently complain that there's too much black magic holding XenPV
together.  A paravirt-FRED will reduce the differences vs native
substantially.

~Andrew
  
H. Peter Anvin Sept. 15, 2023, 12:38 a.m. UTC | #8
On 9/14/23 17:33, andrew.cooper3@citrix.com wrote:
> 
> It's an assumption about what "definitely won't" be paravirt in the future.
> 
> XenPV stack handling is almost-FRED-like and has been for the better
> part of two decades.
> 
> You frequently complain that there's too much black magic holding XenPV
> together.  A paravirt-FRED will reduce the differences vs native
> substantially.
> 

Call it "paravirtualized exception handling." In that sense, the 
refactoring of the exception handling to benefit FRED is definitely 
useful for reducing paravirtualization. The FRED-specific code is 
largely trivial, and presumably what you would do is to replace the FRED 
wrapper with a Xen wrapper and call the common handler routines.

	-hpa
  
Thomas Gleixner Sept. 15, 2023, 12:42 a.m. UTC | #9
On Fri, Sep 15 2023 at 00:46, andrew wrote:
> On 15/09/2023 12:00 am, Thomas Gleixner wrote:
> What I meant was "there should be the two top-level APIs, and under the
> covers they DTRT".  Part of doing the right thing is to wire up paravirt
> for configs where that is specified.
>
> Anything else is going to force people to write logic of the form:
>
>     if (WRMSRNS && !XENPV)
>         wrmsr_ns(...)
>     else
>         wrmsr(...)
>
> which is going to be worse overall.

I already agreed with that for generic code which might be affected by
PV. But code which explicitly depends on something which never can
be affected by PV _and_ is in a performance-sensitive code path really
wants to be able to use the native variant explicitly.

> And there really is one example of this antipattern already in the
> series.

No. There is no antipattern in this series. The only place which uses
wrmsrns() is:

@@ -70,9 +70,13 @@ static inline void update_task_stack(str
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
 #else
-	/* Xen PV enters the kernel on the thread stack. */
-	if (cpu_feature_enabled(X86_FEATURE_XENPV))
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		/* WRMSRNS is a baseline feature for FRED. */
+		wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE);
+	} else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
+		/* Xen PV enters the kernel on the thread stack. */
 		load_sp0(task_top_of_stack(task));
+	}
 #endif
 }

The XENPV condition exists already today and is required independent of
FRED, no?

I deliberately distinguished #1 and #3 on my proposed todo list exactly
because the above use case really wants #1 without the extra bells and
whistles of a generic PV patchable wrmsr_ns() variant. Why?

  No matter how clever the enhanced PV implementation might be, it is
  guaranteed to generate worse code than the straightforward native
  inline assembly. Simply because it has to prevent the compiler from
  being overly clever on optimizations, as it obviously mandates wider
  register restrictions, while the pure native variant (independent of
  the availability of X86_FEATURE_WRMSRNS) only mandates the requirements
  of WRMSR[NS], but not the extra register indirection of the call ABI.

I'm not debating that any other code pattern like you pointed out in
some generic code would be horrible, but I'm not buying your strawman
related to this particular usage site.

Thanks,

        tglx
  
H. Peter Anvin Sept. 15, 2023, 1:01 a.m. UTC | #10
> WRMSR has one complexity that most other PV-ops don't, and that's the
> exception table reference for the instruction itself.
> 
> In a theoretical future, this ought to look like:
> 
>     mov    $msr, %ecx
>     mov    $lo, %eax
>     mov    $hi, %edx
>     1: {call paravirt_blah(%rip) | cs...cs wrmsr | cs...cs wrmsrns }
>     _ASM_EXTABLE(1b, ...)
> 
> In paravirt builds, the CALL needs to be the emitted form, because it
> needs to function in very early boot.
> 
> But once the paravirt-ness has been chosen and alternatives run, the
> as-native paths are fully inline.
> 
> The alternative which processes this site wants, in the case it does
> not replace the CALL, to clobber the EXTABLE reference: CALL
> instructions can #GP, and you don't want to end up thinking you're
> handling a WRMSR #GP when in fact it was a non-canonical function
> pointer.


On 9/14/23 17:36, andrew.cooper3@citrix.com wrote:
> On 15/09/2023 1:07 am, H. Peter Anvin wrote:
>> Is *that* your concern?! Just put a NOP before WRMSR – you need
>> padding NOP bytes anyway – and the extable entry is no longer at the
>> same address. Problem solved.
>>
>> Either that, or use a direct call, which can't #GP in the address
>> range used by the kernel.
>
> For non-paravirt builds, I really hope the inlining DoesTheRightThing.
> If it doesn't, let's fix it to do so.
>
> For paravirt builds, the emitted form must be the indirect call so it
> can function in boot prior to alternatives running [1].

No, it doesn't. You always have the option of a direct call to an
indirect JMP. This is in fact exactly what userspace does in the form of
the PLT.
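
Sketched out, with hypothetical names (the thunk is the moral
equivalent of a PLT entry):

	/* Patch site: a direct CALL, which can't #GP on a bad pointer. */
	1:	call	pv_wrmsr_thunk
	_ASM_EXTABLE(1b, ...)

	/* Out of line: bounce through the pv_ops function pointer. */
	pv_wrmsr_thunk:
		jmp	*pv_wrmsr_ptr(%rip)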

> So you still need some way of putting the EXTABLE reference at the
> emitted site, not in the .altinstr_replacement section where the
> WRMSR{,NS} instruction lives.  This needs to be at build time because
> the EXTABLE references aren't shuffled at runtime.
>
> How else do you propose getting an extable reference to midway through
> an instruction on the "wrong" part of an alternative?
Well, obviously there has to be a magic inline at the patch site. It 
ends up looking something like this:

	asm volatile("1:"
		     ALTERNATIVE_2("call pv_wrmsr(%%rip)",
			"nop; wrmsr", X86_FEATURE_NATIVE_MSR,
			"nop; wrmsrns", X86_FEATURE_WRMSRNS)
		     "2:"
		     _ASM_EXTABLE_TYPE(1b+1, 2b, EX_TYPE_WRMSR)
		     : : "c" (msr), "a" (low), "d" (high) : "memory");


[one can argue whether WRMSRNS specifically should require "memory" or not.]

The whole bit with alternatives and pvops being separate is a major 
maintainability problem, and honestly it never made any sense in the 
first place. Never have two mechanisms to do one job; it makes it harder 
to grok their interactions.

As an alternative to the NOP, the EX_TYPE_*MSR* handlers could simply 
look for a CALL opcode at the origin.
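
That check could be as small as (a hypothetical sketch; the helper name
is made up, but get_kernel_nofault() and CALL_INSN_OPCODE exist today):

	/*
	 * If the fixup site still holds the unpatched CALL (opcode
	 * 0xe8), the fault came from the call, not from WRMSR{,NS},
	 * and must not be fixed up as an MSR access error.
	 */
	static bool ex_site_is_call(unsigned long ip)
	{
		u8 opcode;

		if (get_kernel_nofault(opcode, (u8 *)ip))
			return false;

		return opcode == CALL_INSN_OPCODE;
	}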

	-hpa
  
Andrew Cooper Sept. 15, 2023, 1:16 a.m. UTC | #11
On 15/09/2023 2:01 am, H. Peter Anvin wrote:
> The whole bit with alternatives and pvops being separate is a major
> maintainability problem, and honestly it never made any sense in the
> first place. Never have two mechanisms to do one job; it makes it
> harder to grok their interactions.

This bit is easy.

Juergen has already done the work to delete one of these two patching
mechanisms and replace it with the other.

https://lore.kernel.org/lkml/a32e211f-4add-4fb2-9e5a-480ae9b9bbf2@suse.com/

Unfortunately, it's only collecting pings and tumbleweeds.

~Andrew
  
Andrew Cooper Sept. 15, 2023, 1:46 a.m. UTC | #12
On 15/09/2023 1:38 am, H. Peter Anvin wrote:
> On 9/14/23 17:33, andrew.cooper3@citrix.com wrote:
>>
>> It's an assumption about what "definitely won't" be paravirt in the
>> future.
>>
>> XenPV stack handling is almost-FRED-like and has been for the better
>> part of two decades.
>>
>> You frequently complain that there's too much black magic holding XenPV
>> together.  A paravirt-FRED will reduce the differences vs native
>> substantially.
>>
>
> Call it "paravirtualized exception handling." In that sense, the
> refactoring of the exception handling to benefit FRED is definitely
> useful for reducing paravirtualization. The FRED-specific code is
> largely trivial, and presumably what you would do is to replace the
> FRED wrapper with a Xen wrapper and call the common handler routines.

Why do only half the job?

There's no need for any Xen wrappers at all when XenPV can use the
native FRED paths, as long as ERETU, ERETS and the relevant MSRs can be
paravirt (sure - with an interface that sucks less than right now) so
they're not taking the #GP/emulate in Xen path.

And this can work on all hardware with a slightly-future version of Xen
and Linux, because it's just a minor adjustment to how Xen writes the
exception frame on the guest's stack as part of event delivery.

~Andrew
  
H. Peter Anvin Sept. 15, 2023, 2:06 a.m. UTC | #13
On 9/14/23 18:46, andrew.cooper3@citrix.com wrote:
> On 15/09/2023 1:38 am, H. Peter Anvin wrote:
>> On 9/14/23 17:33, andrew.cooper3@citrix.com wrote:
>>>
>>> It's an assumption about what "definitely won't" be paravirt in the
>>> future.
>>>
>>> XenPV stack handling is almost-FRED-like and has been for the better
>>> part of two decades.
>>>
>>> You frequently complain that there's too much black magic holding XenPV
>>> together.  A paravirt-FRED will reduce the differences vs native
>>> substantially.
>>>
>>
>> Call it "paravirtualized exception handling." In that sense, the
>> refactoring of the exception handling to benefit FRED is definitely
>> useful for reducing paravirtualization. The FRED-specific code is
>> largely trivial, and presumably what you would do is to replace the
>> FRED wrapper with a Xen wrapper and call the common handler routines.
> 
> Why do only half the job?
> 
> There's no need for any Xen wrappers at all when XenPV can use the
> native FRED paths, as long as ERETU, ERETS and the relevant MSRs can be
> paravirt (sure - with an interface that sucks less than right now) so
> they're not taking the #GP/emulate in Xen path.
> 
> And this can work on all hardware with a slightly-future version of Xen
> and Linux, because it's just a minor adjustment to how Xen writes the
> exception frame on the guest's stack as part of event delivery.
> 

It's not about doing "half the job", it's about using the proper
abstraction mechanism. By all means, if you can join the common code
flow earlier, do so, but paravirtualizing the entry/exit stubs, which
are the *only* place ERETU and ERETS show up, is just crazy.

Similarly, nearly all the MSRs are just configuration setup. The only
one with any kind of performance relevance is the stack setup (RSP0).

	-hpa
  
Juergen Gross Sept. 15, 2023, 5:32 a.m. UTC | #14
On 15.09.23 03:16, andrew.cooper3@citrix.com wrote:
> On 15/09/2023 2:01 am, H. Peter Anvin wrote:
>> The whole bit with alternatives and pvops being separate is a major
>> maintainability problem, and honestly it never made any sense in the
>> first place. Never have two mechanisms to do one job; it makes it
>> harder to grok their interactions.
> 
> This bit is easy.
> 
> Juergen has already done the work to delete one of these two patching
> mechanisms and replace it with the other.
> 
> https://lore.kernel.org/lkml/a32e211f-4add-4fb2-9e5a-480ae9b9bbf2@suse.com/
> 
> Unfortunately, it's only collecting pings and tumbleweeds.

Indeed.

Unfortunately there is probably some objtool support needed for that, which I'm
not sure how to implement.


Juergen
  
Nikolay Borisov Sept. 20, 2023, 7:58 a.m. UTC | #15
On 14.09.23 at 7:47, Xin Li wrote:
> Add an always inline API __wrmsrns() to embed the WRMSRNS instruction
> into the code.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>   arch/x86/include/asm/msr.h | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..c284ff9ebe67 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -97,6 +97,19 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
>   		     : : "c" (msr), "a"(low), "d" (high) : "memory");
>   }
>   
> +/*
> + * WRMSRNS behaves exactly like WRMSR with the only difference being
> + * that it is not a serializing instruction by default.
> + */
> +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)

Shouldn't this be named wrmsrns_safe since it has exception handling,
similar to the current wrmsrl_safe?

> +{
> +	/* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
> +	asm volatile("1: .byte 0x0f,0x01,0xc6\n"
> +		     "2:\n"
> +		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> +		     : : "c" (msr), "a"(low), "d" (high));
> +}
> +


<snip>
  
Li, Xin3 Sept. 20, 2023, 8:18 a.m. UTC | #16
> > +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
> 
> Shouldn't this be named wrmsrns_safe since it has exception handling, similar to
> the current wrmsrl_safe?
> 

Both safe and unsafe versions have exception handling, while the safe
version returns an integer to its caller to indicate whether an
exception happened or not.

Exception handling is a must for WRMSR/RDMSR and related instructions.

Thanks!
    Xin
  
Peter Zijlstra Sept. 20, 2023, 3 p.m. UTC | #17
On Fri, Sep 15, 2023 at 02:16:50AM +0100, andrew.cooper3@citrix.com wrote:

> Juergen has already done the work to delete one of these two patching
> mechanisms and replace it with the other.
> 
> https://lore.kernel.org/lkml/a32e211f-4add-4fb2-9e5a-480ae9b9bbf2@suse.com/
> 
> Unfortunately, it's only collecting pings and tumbleweeds.

Fixed that...
  
Juergen Gross Sept. 20, 2023, 3:04 p.m. UTC | #18
On 20.09.23 17:00, Peter Zijlstra wrote:
> On Fri, Sep 15, 2023 at 02:16:50AM +0100, andrew.cooper3@citrix.com wrote:
> 
>> Juergen has already done the work to delete one of these two patching
>> mechanisms and replace it with the other.
>>
>> https://lore.kernel.org/lkml/a32e211f-4add-4fb2-9e5a-480ae9b9bbf2@suse.com/
>>
>> Unfortunately, it's only collecting pings and tumbleweeds.
> 
> Fixed that...

Thanks. :-)


Juergen
  
Li, Xin3 Sept. 22, 2023, 8:16 a.m. UTC | #19
> > > +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
> >
> > Shouldn't this be named wrmsrns_safe since it has exception handling,
> > similar to the current wrmsrl_safe?
> >
> 
> Both safe and unsafe versions have exception handling, while the safe
> version returns an integer to its caller to indicate whether an
> exception happened or not.

I notice there are several call sites using the safe version w/o
checking the return value; would the unsafe version be a better
choice in such cases?
  
Thomas Gleixner Sept. 22, 2023, 3 p.m. UTC | #20
On Fri, Sep 22 2023 at 08:16, Xin3 Li wrote:
>> > > +static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
>> >
>> > Shouldn't this be named wrmsrns_safe since it has exception handling, similar
>> to
>> > the current wrmsrl_safe.
>> >
>> 
>> Both safe and unsafe versions have exception handling, while the safe
>> version returns an integer to its caller to indicate an exception did
>> happen or not.
>
> I notice there are several call sites using the safe version w/o
> checking the return value; would the unsafe version be a better
> choice in such cases?

Depends. The safe version does not emit a warning on fail. So if the
callsite truly does not care about the error it's fine.
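
In caller terms (a usage sketch; the MSR is just illustrative):

	/* Unchecked: a #GP is fixed up, but warns "unchecked MSR access". */
	wrmsrl(MSR_TEST_CTRL, val);

	/* Checked: the caller owns the failure and no warning is emitted. */
	if (wrmsrl_safe(MSR_TEST_CTRL, val))
		pr_debug("MSR_TEST_CTRL not writable\n");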
  
Li, Xin3 Sept. 22, 2023, 11:21 p.m. UTC | #21
> > I notice there are several call sites using the safe version w/o
> > checking the return value, should the unsafe version be a better
> > choice in such cases?
> 
> Depends. The safe version does not emit a warning on fail. So if the
> callsite truly does not care about the error it's fine.

Right. So the _safe suffix also means suppressing a warning about an
error that the caller doesn't care about.

Thanks!
    Xin
  

Patch

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..c284ff9ebe67 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -97,6 +97,19 @@  static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
 		     : : "c" (msr), "a"(low), "d" (high) : "memory");
 }
 
+/*
+ * WRMSRNS behaves exactly like WRMSR with the only difference being
+ * that it is not a serializing instruction by default.
+ */
+static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
+{
+	/* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
+	asm volatile("1: .byte 0x0f,0x01,0xc6\n"
+		     "2:\n"
+		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
+		     : : "c" (msr), "a"(low), "d" (high));
+}
+
 #define native_rdmsr(msr, val1, val2)			\
 do {							\
 	u64 __val = __rdmsr((msr));			\
@@ -297,6 +310,11 @@  do {							\
 
 #endif	/* !CONFIG_PARAVIRT_XXL */
 
+static __always_inline void wrmsrns(u32 msr, u64 val)
+{
+	__wrmsrns(msr, val, val >> 32);
+}
+
 /*
  * 64-bit version of wrmsr_safe():
  */