[Question,kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?)

Message ID 20230127035616.508966-1-aik@amd.com
State New
Headers
Series [Question,kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?) |

Commit Message

Alexey Kardashevskiy Jan. 27, 2023, 3:56 a.m. UTC
  AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
"perf" executes. "./perf record sleep 20" is an example.

Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
and actual reading of DB7 (which in turn causes #VC) every function is
inlined and no stack frame is created (?). Replacing __always_inline with
noinline in  local_db_save() or native_get_debugreg() fixes the problem.

The crash does not happen with CONFIG_PARAVIRT_XXL as in this case
paravirt_get_debugreg() is used and there is an indirect call via
PVOP_CALL1(). It has not been noticed as the most configs have CONFIG_XEN
enabled which enables CONFIG_PARAVIRT_XXL.

This happens with the recent tip/master, here is my test kernel
and the config:
https://github.com/aik/linux/commits/debug_dr7

Found this while testing DebugSwap (which also fixes the crash as
it eliminates DB7 interception == #VC):
https://lore.kernel.org/all/20230120031047.628097-1-aik@amd.com

Define local_db_save_exc_nmi() to demostrate that the problem better.

Why is this crash happening and how to fix that? I am still reading
the assembly but was hoping for a shortcut here :) Thanks,

aik-Standard-PC-i440FX-PIIX-1996 login: [   15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
[   15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[   15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
[   15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[   15.775323] RIP: 0010:error_entry+0x17/0x140
[   15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
[   15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
[   15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
[   15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
[   15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
[   15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
[   15.775339] FS:  0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
[   15.775340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
[   15.775342] Call Trace:
[   15.775352]  <NMI>
[   15.775355]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775357]  ? exc_page_fault+0x11/0x120
[   15.775360]  ? asm_exc_page_fault+0x22/0x30
[   15.775364]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775365]  ? exc_page_fault+0x11/0x120
[   15.775367]  ? asm_exc_page_fault+0x22/0x30
[   15.775368]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775369]  ? exc_page_fault+0x11/0x120
[   15.775371]  ? asm_exc_page_fault+0x22/0x30
[   15.775372]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775373]  ? exc_page_fault+0x11/0x120
[   15.775374]  ? asm_exc_page_fault+0x22/0x30
[   15.775375]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775376]  ? exc_page_fault+0x11/0x120
[   15.775378]  ? asm_exc_page_fault+0x22/0x30
[   15.775379]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775380]  ? exc_page_fault+0x11/0x120
[   15.775381]  ? asm_exc_page_fault+0x22/0x30
[   15.775382]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775383]  ? exc_page_fault+0x11/0x120
[   15.775384]  ? asm_exc_page_fault+0x22/0x30
[   15.775385]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775386]  ? exc_page_fault+0x11/0x120
[   15.775388]  ? asm_exc_page_fault+0x22/0x30
[   15.775389]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775390]  ? exc_page_fault+0x11/0x120
[   15.775391]  ? asm_exc_page_fault+0x22/0x30
[   15.775392]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775393]  ? exc_page_fault+0x11/0x120
[   15.775395]  ? asm_exc_page_fault+0x22/0x30
[   15.775396]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775397]  ? exc_page_fault+0x11/0x120
[   15.775398]  ? asm_exc_page_fault+0x22/0x30
[   15.775399]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775400]  ? exc_page_fault+0x11/0x120
[   15.775401]  ? asm_exc_page_fault+0x22/0x30
[   15.775403]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775404]  ? exc_page_fault+0x11/0x120
[   15.775405]  ? asm_exc_page_fault+0x22/0x30
[   15.775406]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775407]  ? exc_page_fault+0x11/0x120
[   15.775408]  ? asm_exc_page_fault+0x22/0x30
[   15.775409]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775410]  ? exc_page_fault+0x11/0x120
[   15.775412]  ? asm_exc_page_fault+0x22/0x30
[   15.775413]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775414]  ? exc_page_fault+0x11/0x120
[   15.775415]  ? asm_exc_page_fault+0x22/0x30
[   15.775416]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775420]  ? exc_page_fault+0x11/0x120
[   15.775421]  ? asm_exc_page_fault+0x22/0x30
[   15.775422]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775423]  ? exc_page_fault+0x11/0x120
[   15.775425]  ? asm_exc_page_fault+0x22/0x30
[   15.775426]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775427]  ? exc_page_fault+0x11/0x120
[   15.775431]  ? asm_exc_page_fault+0x22/0x30
[   15.775432]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775433]  ? exc_page_fault+0x11/0x120
[   15.775435]  ? asm_exc_page_fault+0x22/0x30
[   15.775436]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775437]  ? exc_page_fault+0x11/0x120
[   15.775438]  ? asm_exc_page_fault+0x22/0x30
[   15.775439]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775440]  ? exc_page_fault+0x11/0x120
[   15.775441]  ? asm_exc_page_fault+0x22/0x30
[   15.775442]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775443]  ? exc_page_fault+0x11/0x120
[   15.775445]  ? asm_exc_page_fault+0x22/0x30
[   15.775446]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775447]  ? exc_page_fault+0x11/0x120
[   15.775448]  ? asm_exc_page_fault+0x22/0x30
[   15.775449]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775450]  ? exc_page_fault+0x11/0x120
[   15.775454]  ? asm_exc_page_fault+0x22/0x30
[   15.775455]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775456]  ? exc_page_fault+0x11/0x120
[   15.775458]  ? asm_exc_page_fault+0x22/0x30
[   15.775459]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775460]  ? exc_page_fault+0x11/0x120
[   15.775461]  ? asm_exc_page_fault+0x22/0x30
[   15.775462]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775463]  ? exc_page_fault+0x11/0x120
[   15.775465]  ? asm_exc_page_fault+0x22/0x30
[   15.775466]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775467]  ? exc_page_fault+0x11/0x120
[   15.775468]  ? asm_exc_page_fault+0x22/0x30
[   15.775469]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775470]  ? exc_page_fault+0x11/0x120
[   15.775471]  ? asm_exc_page_fault+0x22/0x30
[   15.775472]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775473]  ? exc_page_fault+0x11/0x120
[   15.775475]  ? asm_exc_page_fault+0x22/0x30
[   15.775476]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775477]  ? exc_page_fault+0x11/0x120
[   15.775478]  ? asm_exc_page_fault+0x22/0x30
[   15.775482]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775483]  ? exc_page_fault+0x11/0x120
[   15.775485]  ? asm_exc_page_fault+0x22/0x30
[   15.775486]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775487]  ? exc_page_fault+0x11/0x120
[   15.775488]  ? asm_exc_page_fault+0x22/0x30
[   15.775490]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775491]  ? exc_page_fault+0x11/0x120
[   15.775492]  ? asm_exc_page_fault+0x22/0x30
[   15.775493]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775494]  ? exc_page_fault+0x11/0x120
[   15.775495]  ? asm_exc_page_fault+0x22/0x30
[   15.775496]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775497]  ? exc_page_fault+0x11/0x120
[   15.775499]  ? asm_exc_page_fault+0x22/0x30
[   15.775500]  ? check_preemption_disabled+0x8/0xe0
[   15.775502]  ? __sev_es_ist_enter+0x13/0x100
[   15.775503]  ? exc_nmi+0x10e/0x150
[   15.775505]  ? end_repeat_nmi+0x16/0x67
[   15.775506]  ? asm_exc_double_fault+0x30/0x30
[   15.775507]  ? asm_exc_double_fault+0x30/0x30
[   15.775508]  ? asm_exc_double_fault+0x30/0x30
[   15.775509]  </NMI>
[   15.775509]  <#VC>
[   15.775510]  ? __show_regs.cold+0x18e/0x23d
[   15.775511]  </#VC>
[   15.775511]  <#DF>
[   15.775512]  ? __die_body.cold+0x1a/0x1f
[   15.775513]  ? die+0x26/0x40
[   15.775517]  ? handle_stack_overflow+0x44/0x60
[   15.775518]  ? exc_double_fault+0x14b/0x180
[   15.775519]  ? asm_exc_double_fault+0x1f/0x30
[   15.775520]  ? restore_regs_and_return_to_kernel+0x22/0x22
[   15.775521]  ? asm_exc_page_fault+0x9/0x30
[   15.775522]  ? error_entry+0x17/0x140
[   15.775523]  </#DF>
[   15.775523] WARNING: stack recursion on stack type 6
[   15.775524] Modules linked in: msr efivarfs
[   15.837935] ---[ end trace 0000000000000000 ]---

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
 arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
 arch/x86/kernel/nmi.c           |  2 +-
 2 files changed, 30 insertions(+), 1 deletion(-)
  

Comments

Peter Zijlstra Jan. 27, 2023, 9:08 a.m. UTC | #1
On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote:
> AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
> "perf" executes. "./perf record sleep 20" is an example.
>
> Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
> defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
> and actual reading of DB7 (which in turn causes #VC) every function is
> inlined

Very much on purpose.

> and no stack frame is created (?).

Silly compilers ;-) I think you can force it to by using inline asm with
a rsp dependency or somesuch.

> Replacing __always_inline with
> noinline in  local_db_save() or native_get_debugreg() fixes the problem.

It will create others.

> Why is this crash happening and how to fix that? I am still reading
> the assembly but was hoping for a shortcut here :) Thanks,

Welcome to the wonderful shit show that is x86 exceptions :/

I thought sev_es_*() is supposed to fix this. Joerg, any clue?

> aik-Standard-PC-i440FX-PIIX-1996 login: [   15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
> [   15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> [   15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
> [   15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
> [   15.775323] RIP: 0010:error_entry+0x17/0x140
> [   15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
> [   15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
> [   15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
> [   15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
> [   15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
> [   15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [   15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
> [   15.775339] FS:  0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
> [   15.775340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
> [   15.775342] Call Trace:
> [   15.775352]  <NMI>

<snip>

> [   15.775495]  ? asm_exc_page_fault+0x22/0x30
> [   15.775496]  ? restore_regs_and_return_to_kernel+0x22/0x22
> [   15.775497]  ? exc_page_fault+0x11/0x120
> [   15.775499]  ? asm_exc_page_fault+0x22/0x30
> [   15.775500]  ? check_preemption_disabled+0x8/0xe0
> [   15.775502]  ? __sev_es_ist_enter+0x13/0x100
> [   15.775503]  ? exc_nmi+0x10e/0x150
> [   15.775505]  ? end_repeat_nmi+0x16/0x67
> [   15.775506]  ? asm_exc_double_fault+0x30/0x30
> [   15.775507]  ? asm_exc_double_fault+0x30/0x30
> [   15.775508]  ? asm_exc_double_fault+0x30/0x30
> [   15.775509]  </NMI>
> [   15.775509]  <#VC>
> [   15.775510]  ? __show_regs.cold+0x18e/0x23d
> [   15.775511]  </#VC>
> [   15.775511]  <#DF>
> [   15.775512]  ? __die_body.cold+0x1a/0x1f
> [   15.775513]  ? die+0x26/0x40
> [   15.775517]  ? handle_stack_overflow+0x44/0x60
> [   15.775518]  ? exc_double_fault+0x14b/0x180
> [   15.775519]  ? asm_exc_double_fault+0x1f/0x30
> [   15.775520]  ? restore_regs_and_return_to_kernel+0x22/0x22
> [   15.775521]  ? asm_exc_page_fault+0x9/0x30
> [   15.775522]  ? error_entry+0x17/0x140
> [   15.775523]  </#DF>
> [   15.775523] WARNING: stack recursion on stack type 6
> [   15.775524] Modules linked in: msr efivarfs
> [   15.837935] ---[ end trace 0000000000000000 ]---
> 
> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> ---
>  arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
>  arch/x86/kernel/nmi.c           |  2 +-
>  2 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
> index b049d950612f..396e2ea114dc 100644
> --- a/arch/x86/include/asm/debugreg.h
> +++ b/arch/x86/include/asm/debugreg.h
> @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7)
>  		set_debugreg(dr7, 7);
>  }
>  
> +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */
> +static noinline unsigned long native_get_debugreg7(void)
> +{
> +	unsigned long dr7;
> +	asm("mov %%db7, %0" :"=r" (dr7));
> +	return dr7;
> +}
> +
> +static __always_inline unsigned long local_db_save_exc_nmi(void)
> +{
> +	unsigned long dr7;
> +
> +	if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
> +		return 0;
> +
> +	dr7 = native_get_debugreg7();
> +	dr7 &= ~0x400; /* architecturally set bit */
> +	if (dr7)
> +		set_debugreg(0, 7);
> +	/*
> +	 * Ensure the compiler doesn't lower the above statements into
> +	 * the critical section; disabling breakpoints late would not
> +	 * be good.
> +	 */
> +	barrier();
> +
> +	return dr7;
> +}

This is broken, and building with DEBUG_ENTRY=y would've told you.

> +
>  #ifdef CONFIG_CPU_SUP_AMD
>  extern void set_dr_addr_mask(unsigned long mask, int dr);
>  #else
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index cec0bfa3bc04..400b5b6b74f6 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
>  	 */
>  	sev_es_ist_enter(regs);
>  
> -	this_cpu_write(nmi_dr7, local_db_save());
> +	this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
>  
>  	irq_state = irqentry_nmi_enter(regs);

So what I don't get is why sev_es_ist_enter() doesn't cause us to make a
stack frame, that has an actual call in it (although admittedly
conditional).
  
Joerg Roedel Jan. 27, 2023, 10:37 a.m. UTC | #2
On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote:
> Welcome to the wonderful shit show that is x86 exceptions :/
> 
> I thought sev_es_*() is supposed to fix this. Joerg, any clue?

Hmm, no, not yet, the stack-trace doesn't make much sense to me. The
sev_es_* function calls in the NMI path are for re-enabling NMI and
adjusting the #VC IST stack to allow nested VCs.

Alexey, can you try to get a more stable backtrace? For example by
building the kernel with frame pointers?

Regards,

	Joerg
  
Alexey Kardashevskiy Jan. 27, 2023, 11:56 a.m. UTC | #3
On 27/1/23 21:37, Joerg Roedel wrote:
> On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote:
>> Welcome to the wonderful shit show that is x86 exceptions :/
>>
>> I thought sev_es_*() is supposed to fix this. Joerg, any clue?
> 
> Hmm, no, not yet, the stack-trace doesn't make much sense to me. The
> sev_es_* function calls in the NMI path are for re-enabling NMI and
> adjusting the #VC IST stack to allow nested VCs.
> 
> Alexey, can you try to get a more stable backtrace? For example by
> building the kernel with frame pointers?

Do you mean these guys (below)?

aik@aik-Standard-PC-i440FX-PIIX-1996:~$ grep FRAME_POINTER 
/boot/config-$(uname -r)
CONFIG_SCHED_OMIT_FRAME_POINTER=y 

CONFIG_FRAME_POINTER=y 

CONFIG_UNWINDER_FRAME_POINTER=y


Here is the complete output of that VM (200k so not in the email):

https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc

Note that the original long backtrace appears more than once, I just did 
not copy all of that in the first email.
  
Alexey Kardashevskiy Jan. 27, 2023, 12:13 p.m. UTC | #4
On 27/1/23 20:08, Peter Zijlstra wrote:
> On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote:
>> AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when
>> "perf" executes. "./perf record sleep 20" is an example.
>>
>> Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not
>> defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi)
>> and actual reading of DB7 (which in turn causes #VC) every function is
>> inlined
> 
> Very much on purpose.
> 
>> and no stack frame is created (?).
> 
> Silly compilers ;-) I think you can force it to by using inline asm with
> a rsp dependency or somesuch.
> 
>> Replacing __always_inline with
>> noinline in  local_db_save() or native_get_debugreg() fixes the problem.
> 
> It will create others.
> 
>> Why is this crash happening and how to fix that? I am still reading
>> the assembly but was hoping for a shortcut here :) Thanks,
> 
> Welcome to the wonderful shit show that is x86 exceptions :/
> 
> I thought sev_es_*() is supposed to fix this. Joerg, any clue?
> 
>> aik-Standard-PC-i440FX-PIIX-1996 login: [   15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2)
>> [   15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
>> [   15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73
>> [   15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
>> [   15.775323] RIP: 0010:error_entry+0x17/0x140
>> [   15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9
>> [   15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097
>> [   15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed
>> [   15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8
>> [   15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000
>> [   15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> [   15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000
>> [   15.775339] FS:  0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000
>> [   15.775340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0
>> [   15.775342] Call Trace:
>> [   15.775352]  <NMI>
> 
> <snip>
> 
>> [   15.775495]  ? asm_exc_page_fault+0x22/0x30
>> [   15.775496]  ? restore_regs_and_return_to_kernel+0x22/0x22
>> [   15.775497]  ? exc_page_fault+0x11/0x120
>> [   15.775499]  ? asm_exc_page_fault+0x22/0x30
>> [   15.775500]  ? check_preemption_disabled+0x8/0xe0
>> [   15.775502]  ? __sev_es_ist_enter+0x13/0x100
>> [   15.775503]  ? exc_nmi+0x10e/0x150
>> [   15.775505]  ? end_repeat_nmi+0x16/0x67
>> [   15.775506]  ? asm_exc_double_fault+0x30/0x30
>> [   15.775507]  ? asm_exc_double_fault+0x30/0x30
>> [   15.775508]  ? asm_exc_double_fault+0x30/0x30
>> [   15.775509]  </NMI>
>> [   15.775509]  <#VC>
>> [   15.775510]  ? __show_regs.cold+0x18e/0x23d
>> [   15.775511]  </#VC>
>> [   15.775511]  <#DF>
>> [   15.775512]  ? __die_body.cold+0x1a/0x1f
>> [   15.775513]  ? die+0x26/0x40
>> [   15.775517]  ? handle_stack_overflow+0x44/0x60
>> [   15.775518]  ? exc_double_fault+0x14b/0x180
>> [   15.775519]  ? asm_exc_double_fault+0x1f/0x30
>> [   15.775520]  ? restore_regs_and_return_to_kernel+0x22/0x22
>> [   15.775521]  ? asm_exc_page_fault+0x9/0x30
>> [   15.775522]  ? error_entry+0x17/0x140
>> [   15.775523]  </#DF>
>> [   15.775523] WARNING: stack recursion on stack type 6
>> [   15.775524] Modules linked in: msr efivarfs
>> [   15.837935] ---[ end trace 0000000000000000 ]---
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
>> ---
>>   arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++
>>   arch/x86/kernel/nmi.c           |  2 +-
>>   2 files changed, 30 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
>> index b049d950612f..396e2ea114dc 100644
>> --- a/arch/x86/include/asm/debugreg.h
>> +++ b/arch/x86/include/asm/debugreg.h
>> @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7)
>>   		set_debugreg(dr7, 7);
>>   }
>>   
>> +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */
>> +static noinline unsigned long native_get_debugreg7(void)
>> +{
>> +	unsigned long dr7;
>> +	asm("mov %%db7, %0" :"=r" (dr7));
>> +	return dr7;
>> +}
>> +
>> +static __always_inline unsigned long local_db_save_exc_nmi(void)
>> +{
>> +	unsigned long dr7;
>> +
>> +	if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
>> +		return 0;
>> +
>> +	dr7 = native_get_debugreg7();
>> +	dr7 &= ~0x400; /* architecturally set bit */
>> +	if (dr7)
>> +		set_debugreg(0, 7);
>> +	/*
>> +	 * Ensure the compiler doesn't lower the above statements into
>> +	 * the critical section; disabling breakpoints late would not
>> +	 * be good.
>> +	 */
>> +	barrier();
>> +
>> +	return dr7;
>> +}
> 
> This is broken, and building with DEBUG_ENTRY=y would've told you.


Huh, good to know. Is this it telling me so?

vmlinux.o: warning: objtool: exc_nmi+0x73: call to 
native_get_debugreg7() leaves .noinstr.text section



>> +
>>   #ifdef CONFIG_CPU_SUP_AMD
>>   extern void set_dr_addr_mask(unsigned long mask, int dr);
>>   #else
>> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
>> index cec0bfa3bc04..400b5b6b74f6 100644
>> --- a/arch/x86/kernel/nmi.c
>> +++ b/arch/x86/kernel/nmi.c
>> @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
>>   	 */
>>   	sev_es_ist_enter(regs);
>>   
>> -	this_cpu_write(nmi_dr7, local_db_save());
>> +	this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
>>   
>>   	irq_state = irqentry_nmi_enter(regs);
> 
> So what I don't get is why sev_es_ist_enter() doesn't cause us to make a
> stack frame, that has an actual call in it (although admittedly
> conditional).

Is not the frame gone when sev_es_ist_enter() (which does not get 
inlined as per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>:
") returned?
  
Peter Zijlstra Jan. 27, 2023, 12:41 p.m. UTC | #5
On Fri, Jan 27, 2023 at 11:13:38PM +1100, Alexey Kardashevskiy wrote:

> > This is broken, and building with DEBUG_ENTRY=y would've told you.
> 
> 
> Huh, good to know. Is this it telling me so?
> 
> vmlinux.o: warning: objtool: exc_nmi+0x73: call to native_get_debugreg7()
> leaves .noinstr.text section
> 

Yep. The ramification of all that is that by calling non-noinstr code
(double negative, iow, regular instrumented code) is that you can end up
in the tracers/*SAN/breakpoints etc.. code -- something we're very much
not ready for at this point.

> > > +
> > >   #ifdef CONFIG_CPU_SUP_AMD
> > >   extern void set_dr_addr_mask(unsigned long mask, int dr);
> > >   #else
> > > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> > > index cec0bfa3bc04..400b5b6b74f6 100644
> > > --- a/arch/x86/kernel/nmi.c
> > > +++ b/arch/x86/kernel/nmi.c
> > > @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
> > >   	 */
> > >   	sev_es_ist_enter(regs);
> > > -	this_cpu_write(nmi_dr7, local_db_save());
> > > +	this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
> > >   	irq_state = irqentry_nmi_enter(regs);
> > 
> > So what I don't get is why sev_es_ist_enter() doesn't cause us to make a
> > stack frame, that has an actual call in it (although admittedly
> > conditional).
> 
> Is not the frame gone when sev_es_ist_enter() (which does not get inlined as
> per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>:
> ") returned?

Well, returning would consume the callframe, but the stack setup of the
caller should remain. Let me go stare at some asm.
  
Joerg Roedel Jan. 27, 2023, 12:59 p.m. UTC | #6
On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
> Here is the complete output of that VM (200k so not in the email):
> 
> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc

Thanks. So looking at the code in the traces:

Code starting with the faulting instruction
===========================================
   0:	65 48 8b 04 25 c0 db 	mov    %gs:0x2dbc0,%rax
   7:	02 00
   9:	48 8b 80 a8 08 00 00 	mov    0x8a8(%rax),%rax
  10:	0f 0d 48 70          	prefetchw 0x70(%rax)
  14:	e8                   	.byte 0xe8
  15:	82                   	.byte 0x82

I think the fault in the page-fault handler happens here:

DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
        unsigned long address = read_cr2();
        irqentry_state_t state;

        prefetchw(&current->mm->mmap_lock); <--- Here

To be precise, it faults while dereferencing current. That means that
GS_BASE is likely broken, need to find out why...

This at least explains why it page-faults in a loop until the stack
overflows and the guard page is hit.

Regards,

	Joerg
  
Joerg Roedel Jan. 27, 2023, 5:25 p.m. UTC | #7
On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc

Okay, I reproduced the problem here and the root cause turned out to be
that the compiler moved the DR7 read instruction before the 5-byte NOP
which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
guaranteed to cause #VC exception stack recursion if the NMI was
triggered on the #VC stack, and that leads to all kinds of undefined
behavior.

Regards,

	Joerg
  
Alexey Kardashevskiy Jan. 28, 2023, 11:24 a.m. UTC | #8
On 28/1/23 04:25, Joerg Roedel wrote:
> On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
>> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc
> 
> Okay, I reproduced the problem here and the root cause turned out to be
> that the compiler moved the DR7 read instruction before the 5-byte NOP
> which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
> guaranteed to cause #VC exception stack recursion if the NMI was
> triggered on the #VC stack, and that leads to all kinds of undefined
> behavior.

Cool!

(out of curiosity) where do you see these NOPs? "objdump -D vmlinux" 
does not show any, is this after lifepatching?

Meanwhile, this seems to be doing the right thing:


diff --git a/arch/x86/include/asm/debugreg.h 
b/arch/x86/include/asm/debugreg.h
index b049d950612f..687b15297057 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -39,7 +39,7 @@ static __always_inline unsigned long 
native_get_debugreg(int regno)
                 asm("mov %%db6, %0" :"=r" (val));
                 break;
         case 7:
-               asm("mov %%db7, %0" :"=r" (val));
+               asm volatile ("mov %%db7, %0" :"=r" (val));
  
Joerg Roedel Jan. 28, 2023, 1:52 p.m. UTC | #9
On Sat, Jan 28, 2023 at 10:24:56PM +1100, Alexey Kardashevskiy wrote:
> (out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does
> not show any, is this after lifepatching?

Here is the disassembly of exc_nmi of a kernel built from tip/master
with CONFIG_PARAVIRT=n:

<exc_nmi>:
       41 54                   push   %r12
       55                      push   %rbp
       48 89 fd                mov    %rdi,%rbp
       53                      push   %rbx
       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
       65 8b 05 69 66 41 7e    mov    %gs:0x7e416669(%rip),%eax        # 3254c <pcpu_hot+0xc>
       48 98                   cltq
       48 0f a3 05 33 00 2b    bt     %rax,0x12b0033(%rip)        # ffffffff82ecbf20 <__cpu_online_mask>
       01 
       0f 83 c9 00 00 00       jae    ffffffff81c1bfbc <exc_nmi+0xec>
       65 8b 05 f6 41 40 7e    mov    %gs:0x7e4041f6(%rip),%eax        # 200f0 <nmi_state>
       85 c0                   test   %eax,%eax
       0f 85 f8 00 00 00       jne    ffffffff81c1bffa <exc_nmi+0x12a>
       65 c7 05 e3 41 40 7e    movl   $0x1,%gs:0x7e4041e3(%rip)        # 200f0 <nmi_state>
       01 00 00 00 
       0f 20 d0                mov    %cr2,%rax
       65 48 89 05 d0 41 40    mov    %rax,%gs:0x7e4041d0(%rip)        # 200e8 <nmi_cr2>
       7e 
       41 0f 21 fc             mov    %db7,%r12			<-- here is the DR7 read
       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)		<-- here are the NOPS that become a
       								    call to sev_es_ist_enter() in
								    SEV-ES guests

The DR7 read will cause a #VC exception, switching to the #VC IST stack.
If the NMI was raised while already on the #VC IST stack, this DR7 read
will overwrite the previous stack frame and cause stack recursion, with
all funny side effects.


> diff --git a/arch/x86/include/asm/debugreg.h
> b/arch/x86/include/asm/debugreg.h
> index b049d950612f..687b15297057 100644
> --- a/arch/x86/include/asm/debugreg.h
> +++ b/arch/x86/include/asm/debugreg.h
> @@ -39,7 +39,7 @@ static __always_inline unsigned long
> native_get_debugreg(int regno)
>                 asm("mov %%db6, %0" :"=r" (val));
>                 break;
>         case 7:
> -               asm("mov %%db7, %0" :"=r" (val));
> +               asm volatile ("mov %%db7, %0" :"=r" (val));

Yeah, something like this will be the fix. I am still thinking about
the right place to put the volatile to make it explicit to the situation
we are encountering here (which is SEV-ES specific).

Best would be an explicit barrier in C code between sev_es_ist_enter()
and the DR7 read, but all barriers I tried to far only seem to affect
memory instructions and had no influence on the DR7 read (which is
obviously not considered as a memory read by the compiler).

The best place to put the barrier is in the sev_es_ist_enter() inline
function, right after the static_call to __sev_es_ist_enter().

Regards,

	Joerg
  
Joerg Roedel Jan. 30, 2023, 9:17 a.m. UTC | #10
On Sat, Jan 28, 2023 at 02:52:23PM +0100, Joerg Roedel wrote:
> Yeah, something like this will be the fix. I am still thinking about
> the right place to put the volatile to make it explicit to the situation
> we are encountering here (which is SEV-ES specific).
> 
> Best would be an explicit barrier in C code between sev_es_ist_enter()
> and the DR7 read, but all barriers I tried to far only seem to affect
> memory instructions and had no influence on the DR7 read (which is
> obviously not considered as a memory read by the compiler).
> 
> The best place to put the barrier is in the sev_es_ist_enter() inline
> function, right after the static_call to __sev_es_ist_enter().

Okay, after some investigation I was not able to find a compiler barrier
which affects DR7 read ordering. This leaves us with the only solution
of directly forbidding DR7 register access re-ordering by adding a
volatile to the asm, like you did before.

I will send a fix later today.

Regards,

	Joerg
  
H. Peter Anvin Jan. 30, 2023, 5:30 p.m. UTC | #11
On January 28, 2023 3:24:56 AM PST, Alexey Kardashevskiy <aik@amd.com> wrote:
>
>
>On 28/1/23 04:25, Joerg Roedel wrote:
>> On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
>>> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc
>> 
>> Okay, I reproduced the problem here and the root cause turned out to be
>> that the compiler moved the DR7 read instruction before the 5-byte NOP
>> which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
>> guaranteed to cause #VC exception stack recursion if the NMI was
>> triggered on the #VC stack, and that leads to all kinds of undefined
>> behavior.
>
>Cool!
>
>(out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does not show any, is this after lifepatching?
>
>Meanwhile, this seems to be doing the right thing:
>
>
>diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
>index b049d950612f..687b15297057 100644
>--- a/arch/x86/include/asm/debugreg.h
>+++ b/arch/x86/include/asm/debugreg.h
>@@ -39,7 +39,7 @@ static __always_inline unsigned long native_get_debugreg(int regno)
>                asm("mov %%db6, %0" :"=r" (val));
>                break;
>        case 7:
>-               asm("mov %%db7, %0" :"=r" (val));
>+               asm volatile ("mov %%db7, %0" :"=r" (val));
>
>
>

It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is not... %dr6 is the status register!

I believe they should all be volatile (the compiler semantics is that volatile operations are always executed exactly once, in strict program order with respect to any other volatile operations); the real question is if there should also be memory clobbers on %dr6 reads and any %dr write.
  
Borislav Petkov Jan. 30, 2023, 6:04 p.m. UTC | #12
On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote:
> It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is
> not... %dr6 is the status register!

Yeah, as a precaution I think we should make all those volatile. Just in
case.
 
> I believe they should all be volatile (the compiler semantics is that
> volatile operations are always executed exactly once, in strict
> program order with respect to any other volatile operations); the real
> question is if there should also be memory clobbers on %dr6 reads and
> any %dr write.

Yes, I think so too. From gcc docs:


"6.47.2.1 Volatile
.................

...

Note that the compiler can move even 'volatile asm' instructions
relative to other code, including across jump instructions."

We already have __FORCE_ORDER for exactly things like that.
  
Joerg Roedel Jan. 31, 2023, 8:57 a.m. UTC | #13
On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote:
> It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is
> not... %dr6 is the status register!

The reason is that on SEV-ES only accesses to DR7 will cause #VC
exceptions, DR0-DR6 are not intercepted.

Regards,

	Joerg
  
Sean Christopherson Jan. 31, 2023, 3:53 p.m. UTC | #14
On Tue, Jan 31, 2023, Joerg Roedel wrote:
> On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote:
> > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is
> > not... %dr6 is the status register!
> 
> The reason is that on SEV-ES only accesses to DR7 will cause #VC
> exceptions, DR0-DR6 are not intercepted.

I don't think that is technically true.  A _well-behaved_ hypervisor will not
intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES
architecture enforces that behavior.
  
Joerg Roedel Jan. 31, 2023, 4 p.m. UTC | #15
On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote:
> I don't think that is technically true.  A _well-behaved_ hypervisor will not
> intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES
> architecture enforces that behavior.

Not from the hardware architecture side, but the GHCB spec does not
list NAE events for DR0-DR6 accesses, so a guest is not required to
handle them in the VC handler.

Linux under SEV-ES will crash if the HV intercepts debug registers,
except DR7.

Regards,

	Joerg
  
Sean Christopherson Jan. 31, 2023, 4:47 p.m. UTC | #16
On Tue, Jan 31, 2023, Joerg Roedel wrote:
> On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote:
> > I don't think that is technically true.  A _well-behaved_ hypervisor will not
> > intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES
> > architecture enforces that behavior.
> 
> Not from the hardware architecture side, but the GHCB spec does not
> list NAE events for DR0-DR6 accesses, so a guest is not required to
> handle them in the VC handler.
> 
> Linux under SEV-ES will crash if the HV intercepts debug registers,
> except DR7.

Right, I'm just objecting to the wording of "DR0-DR6 are not intercepted".  E.g.
from a security perspective, the kernel shouldn't rely on DR0-DR6 to execute
cleanly.
  

Patch

diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index b049d950612f..396e2ea114dc 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -125,6 +125,35 @@  static __always_inline void local_db_restore(unsigned long dr7)
 		set_debugreg(dr7, 7);
 }
 
+/* noinline here inline __always_inline'd native_get_debugreg(int regno) */
+static noinline unsigned long native_get_debugreg7(void)
+{
+	unsigned long dr7;
+	asm("mov %%db7, %0" :"=r" (dr7));
+	return dr7;
+}
+
+static __always_inline unsigned long local_db_save_exc_nmi(void)
+{
+	unsigned long dr7;
+
+	if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active())
+		return 0;
+
+	dr7 = native_get_debugreg7();
+	dr7 &= ~0x400; /* architecturally set bit */
+	if (dr7)
+		set_debugreg(0, 7);
+	/*
+	 * Ensure the compiler doesn't lower the above statements into
+	 * the critical section; disabling breakpoints late would not
+	 * be good.
+	 */
+	barrier();
+
+	return dr7;
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 extern void set_dr_addr_mask(unsigned long mask, int dr);
 #else
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index cec0bfa3bc04..400b5b6b74f6 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -503,7 +503,7 @@  DEFINE_IDTENTRY_RAW(exc_nmi)
 	 */
 	sev_es_ist_enter(regs);
 
-	this_cpu_write(nmi_dr7, local_db_save());
+	this_cpu_write(nmi_dr7, local_db_save_exc_nmi());
 
 	irq_state = irqentry_nmi_enter(regs);