[PATCHv2] x86/trampoline: Bypass compat mode in trampoline_start64() if not needed

Message ID 20240107122847.27740-1-kirill.shutemov@linux.intel.com

Commit Message

Kirill A. Shutemov Jan. 7, 2024, 12:28 p.m. UTC
  The trampoline_start64() vector is used when a secondary CPU starts in
64-bit mode. The current implementation directly enters compatibility
mode. It is necessary to disable paging and re-enable it in the correct
paging mode: either 4- or 5-level, depending on the configuration.

The X86S[1] ISA does not support compatibility mode in ring 0, and
paging cannot be disabled.

The trampoline_start64() function is reworked to only enter compatibility
mode if it is necessary to change the paging mode. If the CPU is already
in the desired paging mode, it will proceed in long mode.

This change will allow a secondary CPU to boot on an X86S machine as
long as the CPU is already in the correct paging mode.

In the future, there will be a mechanism to switch between paging modes
without disabling paging.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>

---
 v2:
  - Fix build with GCC;
---
 arch/x86/realmode/rm/trampoline_64.S | 31 +++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)
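For readers skimming the diff below, the reworked entry logic amounts to the following C sketch (an illustrative model only, with names mirroring the assembly symbols; the real code is 16/64-bit assembly in trampoline_64.S):

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR4_LA57 (1u << 12) /* CR4.LA57: 5-level paging enable */

enum path { STAY_IN_LONG_MODE, SWITCH_VIA_COMPAT_MODE };

/* Model of the reworked trampoline_start64() decision: drop to
 * compatibility mode (to disable and re-enable paging) only when the
 * paging depth the kernel wants differs from the one the CPU is in. */
static enum path trampoline_start64_path(uint64_t cr4, uint64_t tr_cr4)
{
	if ((cr4 ^ tr_cr4) & X86_CR4_LA57)
		return SWITCH_VIA_COMPAT_MODE; /* 4- vs 5-level mismatch */
	return STAY_IN_LONG_MODE; /* paging mode already correct */
}
```

On an X86S machine, where compatibility mode is unavailable in ring 0, only the `STAY_IN_LONG_MODE` path can succeed, which is why the patch requires the CPU to already be in the correct paging mode.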
  

Comments

Kai Huang Jan. 8, 2024, 1:10 p.m. UTC | #1
On Sun, 2024-01-07 at 15:28 +0300, Kirill A. Shutemov wrote:
> The trampoline_start64() vector is used when a secondary CPU starts in
> 64-bit mode. The current implementation directly enters compatibility
> mode. It is necessary to disable paging and re-enable it in the correct
> paging mode: either 4- or 5-level, depending on the configuration.
> 
> The X86S[1] ISA does not support compatibility mode in ring 0, and
> paging cannot be disabled.
> 
> The trampoline_start64() function is reworked to only enter compatibility
> mode if it is necessary to change the paging mode. If the CPU is already
> in the desired paging mode, it will proceed in long mode.
> 
> This change will allow a secondary CPU to boot on an X86S machine as
> long as the CPU is already in the correct paging mode.
> 
> In the future, there will be a mechanism to switch between paging modes
> without disabling paging.
> 
> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> 
> ---
>  v2:
>   - Fix build with GCC;
> ---
>  arch/x86/realmode/rm/trampoline_64.S | 31 +++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index c9f76fae902e..c07354542188 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -37,13 +37,15 @@
>  	.text
>  	.code16
>  
> -.macro LOCK_AND_LOAD_REALMODE_ESP lock_pa=0
> +.macro LOCK_AND_LOAD_REALMODE_ESP lock_pa=0 lock_rip=0
>  	/*
>  	 * Make sure only one CPU fiddles with the realmode stack
>  	 */
>  .Llock_rm\@:
>  	.if \lock_pa
>          lock btsl       $0, pa_tr_lock
> +	.elseif \lock_rip
> +        lock btsl       $0, tr_lock(%rip)
>  	.else
>          lock btsl       $0, tr_lock
>  	.endif
> @@ -220,6 +222,33 @@ SYM_CODE_START(trampoline_start64)
>  	lidt	tr_idt(%rip)
>  	lgdt	tr_gdt64(%rip)
>  
> +	/* Check if paging mode has to be changed */
> +	movq	%cr4, %rax
> +	xorq	tr_cr4(%rip), %rax
> +	andq	$X86_CR4_LA57, %rax
> +	jnz	.L_switch_paging

This seems to depend on the BIOS always using 4-level paging.  Can we make such
an assumption?

> +
> +	/* Paging mode is correct proceed in 64-bit mode */
> +
> +	LOCK_AND_LOAD_REALMODE_ESP lock_rip=1
> +
> +	movw	$__KERNEL_DS, %dx
> +	movl	%edx, %ss
> +	addl	$pa_real_mode_base, %esp
> +	movl	%edx, %ds
> +	movl	%edx, %es
> +	movl	%edx, %fs
> +	movl	%edx, %gs
> +
> +	movl	$pa_trampoline_pgd, %eax
> +	movq	%rax, %cr3
> +
> +	jmpq	*tr_start(%rip)

IIUC you won't be using __KERNEL_CS in this case?  Not sure whether this matters
though, because the spec says that in 64-bit mode the hardware treats the CS, DS,
ES, and SS segment bases as zero.

> +.L_switch_paging:
> +	/*
> +	 * To switch between 4- and 5-level paging modes, it is necessary
> +	 * to disable paging. This must be done in the compatibility mode.
> +	 */
>  	ljmpl	*tr_compat(%rip)
>  SYM_CODE_END(trampoline_start64)
>
  
Kirill A. Shutemov Jan. 8, 2024, 1:33 p.m. UTC | #2
On Mon, Jan 08, 2024 at 01:10:31PM +0000, Huang, Kai wrote:
> On Sun, 2024-01-07 at 15:28 +0300, Kirill A. Shutemov wrote:
> > The trampoline_start64() vector is used when a secondary CPU starts in
> > 64-bit mode. The current implementation directly enters compatibility
> > mode. It is necessary to disable paging and re-enable it in the correct
> > paging mode: either 4- or 5-level, depending on the configuration.
> > 
> > The X86S[1] ISA does not support compatibility mode in ring 0, and
> > paging cannot be disabled.
> > 
> > The trampoline_start64() function is reworked to only enter compatibility
> > mode if it is necessary to change the paging mode. If the CPU is already
> > in the desired paging mode, it will proceed in long mode.
> > 
> > This change will allow a secondary CPU to boot on an X86S machine as
> > long as the CPU is already in the correct paging mode.
> > 
> > In the future, there will be a mechanism to switch between paging modes
> > without disabling paging.
> > 
> > [1] https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Reviewed-by: Andi Kleen <ak@linux.intel.com>
> > Cc: Sean Christopherson <seanjc@google.com>
> > 
> > ---
> >  v2:
> >   - Fix build with GCC;
> > ---
> >  arch/x86/realmode/rm/trampoline_64.S | 31 +++++++++++++++++++++++++++-
> >  1 file changed, 30 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> > index c9f76fae902e..c07354542188 100644
> > --- a/arch/x86/realmode/rm/trampoline_64.S
> > +++ b/arch/x86/realmode/rm/trampoline_64.S
> > @@ -37,13 +37,15 @@
> >  	.text
> >  	.code16
> >  
> > -.macro LOCK_AND_LOAD_REALMODE_ESP lock_pa=0
> > +.macro LOCK_AND_LOAD_REALMODE_ESP lock_pa=0 lock_rip=0
> >  	/*
> >  	 * Make sure only one CPU fiddles with the realmode stack
> >  	 */
> >  .Llock_rm\@:
> >  	.if \lock_pa
> >          lock btsl       $0, pa_tr_lock
> > +	.elseif \lock_rip
> > +        lock btsl       $0, tr_lock(%rip)
> >  	.else
> >          lock btsl       $0, tr_lock
> >  	.endif
> > @@ -220,6 +222,33 @@ SYM_CODE_START(trampoline_start64)
> >  	lidt	tr_idt(%rip)
> >  	lgdt	tr_gdt64(%rip)
> >  
> > +	/* Check if paging mode has to be changed */
> > +	movq	%cr4, %rax
> > +	xorq	tr_cr4(%rip), %rax
> > +	andq	$X86_CR4_LA57, %rax
> > +	jnz	.L_switch_paging
> 
> This seems to depend on the BIOS always using 4-level paging.  Can we make such
> an assumption?

What makes you think this?

The check is basically

    if ((tr_cr4 ^ CR4) & X86_CR4_LA57)
	goto .L_switch_paging;

It means that if LA57 differs between tr_cr4 and CR4, we need to change the
paging mode.
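The XOR-then-mask idiom used here is a standard way to test whether two values disagree in a particular bit: x ^ y has a bit set exactly where x and y differ. A minimal C illustration (X86_CR4_LA57 is CR4 bit 12; this is a demonstration of the idiom, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR4_LA57 (1u << 12) /* CR4 bit 12 */

/* Nonzero iff the LA57 bit differs between the two values; all other
 * differing bits are discarded by the mask. */
static int la57_differs(uint64_t x, uint64_t y)
{
	return ((x ^ y) & X86_CR4_LA57) != 0;
}
```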

> > +
> > +	/* Paging mode is correct proceed in 64-bit mode */
> > +
> > +	LOCK_AND_LOAD_REALMODE_ESP lock_rip=1
> > +
> > +	movw	$__KERNEL_DS, %dx
> > +	movl	%edx, %ss
> > +	addl	$pa_real_mode_base, %esp
> > +	movl	%edx, %ds
> > +	movl	%edx, %es
> > +	movl	%edx, %fs
> > +	movl	%edx, %gs
> > +
> > +	movl	$pa_trampoline_pgd, %eax
> > +	movq	%rax, %cr3
> > +
> > +	jmpq	*tr_start(%rip)
> 
> IIUC you won't be using __KERNEL_CS in this case?  Not sure whether this matters
> though, because the spec says that in 64-bit mode the hardware treats the CS, DS,
> ES, and SS segment bases as zero.
> 

secondary_startup_64() will set CS to __KERNEL_CS before jumping to C
code.

> > +.L_switch_paging:
> > +	/*
> > +	 * To switch between 4- and 5-level paging modes, it is necessary
> > +	 * to disable paging. This must be done in the compatibility mode.
> > +	 */
> >  	ljmpl	*tr_compat(%rip)
> >  SYM_CODE_END(trampoline_start64)
> >  
>
  
Sean Christopherson Jan. 8, 2024, 4:18 p.m. UTC | #3
On Sun, Jan 07, 2024, Kirill A. Shutemov wrote:
> @@ -220,6 +222,33 @@ SYM_CODE_START(trampoline_start64)
>  	lidt	tr_idt(%rip)
>  	lgdt	tr_gdt64(%rip)
>  
> +	/* Check if paging mode has to be changed */
> +	movq	%cr4, %rax
> +	xorq	tr_cr4(%rip), %rax

This is buggy: tr_cr4 is only 4 bytes.  And even if tr_cr4 were 8 bytes, the reason
why nothing showed up in testing is also why only 4 bytes need to be XOR'd: the
upper 32 bits of the result are never consumed.

> +	andq	$X86_CR4_LA57, %rax

Nit, this can be TEST instead of AND, e.g. I was looking to see if RAX was used
anywhere in the flow.  And in theory it's possible a CPU could support uop fusing
for TEST+Jcc but not AND+Jcc, cause shaving a cycle in this code is obviously
super important ;-)

And as above, testl, not testq.
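Sean's point can be seen in C terms: once the XOR result is masked with a constant that fits in the low 32 bits, truncating the operands to 32 bits cannot change the outcome, because the mask ANDs away the upper half. A sketch (the garbage upper half models what an 8-byte xorq would pull in past a 4-byte tr_cr4; not the actual kernel fix):

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR4_LA57 (1u << 12) /* bit 12, well inside the low 32 bits */

/* 64-bit variant: upper 32 bits of the XOR survive until the mask. */
static int check64(uint64_t cr4, uint64_t tr_cr4)
{
	return ((cr4 ^ tr_cr4) & X86_CR4_LA57) != 0;
}

/* 32-bit variant: operands truncated up front, as testl would see them. */
static int check32(uint64_t cr4, uint64_t tr_cr4)
{
	return (((uint32_t)cr4 ^ (uint32_t)tr_cr4) & X86_CR4_LA57) != 0;
}
```

Because the mask has no bits above bit 31, check64 and check32 agree for all inputs, including ones with garbage in the upper half.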

> +	jnz	.L_switch_paging
> +
> +	/* Paging mode is correct proceed in 64-bit mode */
> +
> +	LOCK_AND_LOAD_REALMODE_ESP lock_rip=1
> +
> +	movw	$__KERNEL_DS, %dx
> +	movl	%edx, %ss
> +	addl	$pa_real_mode_base, %esp
> +	movl	%edx, %ds
> +	movl	%edx, %es
> +	movl	%edx, %fs
> +	movl	%edx, %gs
> +
> +	movl	$pa_trampoline_pgd, %eax
> +	movq	%rax, %cr3
> +
> +	jmpq	*tr_start(%rip)
> +.L_switch_paging:
> +	/*
> +	 * To switch between 4- and 5-level paging modes, it is necessary
> +	 * to disable paging. This must be done in the compatibility mode.
> +	 */
>  	ljmpl	*tr_compat(%rip)
>  SYM_CODE_END(trampoline_start64)
>  
> -- 
> 2.41.0
>
  
Kirill A. Shutemov Jan. 8, 2024, 9:58 p.m. UTC | #4
On Mon, Jan 08, 2024 at 08:18:55AM -0800, Sean Christopherson wrote:
> On Sun, Jan 07, 2024, Kirill A. Shutemov wrote:
> > @@ -220,6 +222,33 @@ SYM_CODE_START(trampoline_start64)
> >  	lidt	tr_idt(%rip)
> >  	lgdt	tr_gdt64(%rip)
> >  
> > +	/* Check if paging mode has to be changed */
> > +	movq	%cr4, %rax
> > +	xorq	tr_cr4(%rip), %rax
> 
> This is buggy, tr_cr4 is only 4 bytes.  And even if tr_cr4 were 8 bytes, the reason
> why nothing showed up in testing is also why only 4 bytes need to be XOR'd: the
> upper 32 bits of the result are never consumed.

Oh. Good catch. Will fix.

tr_cr4 will need to be changed to 8 bytes soonish. FRED uses bit 32 of the
register.

> > +	andq	$X86_CR4_LA57, %rax
> 
> Nit, this can be TEST instead of AND, e.g. I was looking to see if RAX was used
> anywhere in the flow.  And in theory it's possible a CPU could support uop fusing
> for TEST+Jcc but not AND+Jcc, cause shaving a cycle in this code is obviously
> super important ;-)
> 
> And as above, testl, not testq.

Fair enough. Will use testl.
  
Andi Kleen Jan. 9, 2024, 2:26 a.m. UTC | #5
> This seems to depend on the BIOS always using 4-level paging.  Can we make such
> an assumption?

Yes, I believe it's fine. All BIOSes on 5-level-capable systems currently
use only 4-level paging when passing control to someone else.

(Although I cannot find the quote in the UEFI spec currently; I will check
on that.)

The UEFI runtime environment is defined as 4-level. Changing that would
break OS compatibility, at least for runtime services.


> 
> > +
> > +	/* Paging mode is correct proceed in 64-bit mode */
> > +
> > +	LOCK_AND_LOAD_REALMODE_ESP lock_rip=1
> > +
> > +	movw	$__KERNEL_DS, %dx
> > +	movl	%edx, %ss
> > +	addl	$pa_real_mode_base, %esp
> > +	movl	%edx, %ds
> > +	movl	%edx, %es
> > +	movl	%edx, %fs
> > +	movl	%edx, %gs
> > +
> > +	movl	$pa_trampoline_pgd, %eax
> > +	movq	%rax, %cr3
> > +
> > +	jmpq	*tr_start(%rip)
> 
> IIUC you won't be using __KERNEL_CS in this case?  Not sure whether this matters
> though, because the spec says that in 64-bit mode the hardware treats the CS, DS,
> ES, and SS segment bases as zero.

That's a good catch. It might be better to use __KERNEL_CS. Otherwise, if an
IRET happens later and it tries to reload CS, it might fault. That probably
doesn't happen before another reload anyway, but it's better to avoid it.

-Andi
  

Patch

diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index c9f76fae902e..c07354542188 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -37,13 +37,15 @@ 
 	.text
 	.code16
 
-.macro LOCK_AND_LOAD_REALMODE_ESP lock_pa=0
+.macro LOCK_AND_LOAD_REALMODE_ESP lock_pa=0 lock_rip=0
 	/*
 	 * Make sure only one CPU fiddles with the realmode stack
 	 */
 .Llock_rm\@:
 	.if \lock_pa
         lock btsl       $0, pa_tr_lock
+	.elseif \lock_rip
+        lock btsl       $0, tr_lock(%rip)
 	.else
         lock btsl       $0, tr_lock
 	.endif
@@ -220,6 +222,33 @@  SYM_CODE_START(trampoline_start64)
 	lidt	tr_idt(%rip)
 	lgdt	tr_gdt64(%rip)
 
+	/* Check if paging mode has to be changed */
+	movq	%cr4, %rax
+	xorq	tr_cr4(%rip), %rax
+	andq	$X86_CR4_LA57, %rax
+	jnz	.L_switch_paging
+
+	/* Paging mode is correct proceed in 64-bit mode */
+
+	LOCK_AND_LOAD_REALMODE_ESP lock_rip=1
+
+	movw	$__KERNEL_DS, %dx
+	movl	%edx, %ss
+	addl	$pa_real_mode_base, %esp
+	movl	%edx, %ds
+	movl	%edx, %es
+	movl	%edx, %fs
+	movl	%edx, %gs
+
+	movl	$pa_trampoline_pgd, %eax
+	movq	%rax, %cr3
+
+	jmpq	*tr_start(%rip)
+.L_switch_paging:
+	/*
+	 * To switch between 4- and 5-level paging modes, it is necessary
+	 * to disable paging. This must be done in the compatibility mode.
+	 */
 	ljmpl	*tr_compat(%rip)
 SYM_CODE_END(trampoline_start64)