[v7,2/9] x86/startup_64: Defer assignment of 5-level paging global variables

Message ID 20240227151907.387873-13-ardb+git@google.com
State New
Series x86: Confine early 1:1 mapped startup code

Commit Message

Ard Biesheuvel Feb. 27, 2024, 3:19 p.m. UTC
  From: Ard Biesheuvel <ardb@kernel.org>

Assigning the 5-level paging related global variables from the earliest
C code using explicit references that use the 1:1 translation of memory
is unnecessary, as the startup code itself does not rely on them to
create the initial page tables, and this is all it should be doing. So
defer these assignments to the primary C entry code that executes via
the ordinary kernel virtual mapping.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/include/asm/pgtable_64_types.h |  2 +-
 arch/x86/kernel/head64.c                | 44 +++++++-------------
 2 files changed, 15 insertions(+), 31 deletions(-)
  

Comments

Borislav Petkov Feb. 28, 2024, 8:55 p.m. UTC | #1
On Tue, Feb 27, 2024 at 04:19:10PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> Assigning the 5-level paging related global variables from the earliest
> C code using explicit references that use the 1:1 translation of memory
> is unnecessary, as the startup code itself does not rely on them to
> create the initial page tables, and this is all it should be doing. So
> defer these assignments to the primary C entry code that executes via
> the ordinary kernel virtual mapping.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/x86/include/asm/pgtable_64_types.h |  2 +-
>  arch/x86/kernel/head64.c                | 44 +++++++-------------
>  2 files changed, 15 insertions(+), 31 deletions(-)

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>

Those should probably be tested on a 5level machine, just in case.

Thx.
  
Ard Biesheuvel March 1, 2024, 10:01 a.m. UTC | #2
On Wed, 28 Feb 2024 at 21:56, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Feb 27, 2024 at 04:19:10PM +0100, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Assigning the 5-level paging related global variables from the earliest
> > C code using explicit references that use the 1:1 translation of memory
> > is unnecessary, as the startup code itself does not rely on them to
> > create the initial page tables, and this is all it should be doing. So
> > defer these assignments to the primary C entry code that executes via
> > the ordinary kernel virtual mapping.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/x86/include/asm/pgtable_64_types.h |  2 +-
> >  arch/x86/kernel/head64.c                | 44 +++++++-------------
> >  2 files changed, 15 insertions(+), 31 deletions(-)
>
> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
>
> Those should probably be tested on a 5level machine, just in case.
>

I have tested this myself on QEMU with -cpu qemu64,+la57 and -cpu host+kvm using
- EFI boot (OVMF)
- legacy BIOS boot (SeaBIOS)
- with and without no5lvl on the command line
- with and without CONFIG_X86_5LEVEL

The scenario that I have not managed to test is entering from EFI with
5 levels of paging enabled, and switching back to 4 levels (which
should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
existence actually supports that today, and I am pretty sure that this
code has never been tested under those conditions to begin with. (OVMF
patches are under review atm to allow 5-level paging to be enabled in
the firmware)

I currently don't have access to real hardware with LA57 support so
any additional coverage there is highly appreciated (same for the last
patch in this series)
  
Borislav Petkov March 1, 2024, 4:09 p.m. UTC | #3
On Fri, Mar 01, 2024 at 11:01:33AM +0100, Ard Biesheuvel wrote:
> The scenario that I have not managed to test is entering from EFI with
> 5 levels of paging enabled, and switching back to 4 levels (which
> should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
> existence actually supports that today, and I am pretty sure that this
> code has never been tested under those conditions to begin with. (OVMF
> patches are under review atm to allow 5-level paging to be enabled in
> the firmware)

Aha.

> I currently don't have access to real hardware with LA57 support so
> any additional coverage there is highly appreciated (same for the last
> patch in this series)

Right, I'm sure dhansen could dig up such a machine. We'll ask him
nicely to test when the set is ready.

Thx.
  
Ard Biesheuvel March 1, 2024, 5:09 p.m. UTC | #4
On Fri, 1 Mar 2024 at 17:09, Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Mar 01, 2024 at 11:01:33AM +0100, Ard Biesheuvel wrote:
> > The scenario that I have not managed to test is entering from EFI with
> > 5 levels of paging enabled, and switching back to 4 levels (which
> > should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
> > existence actually supports that today, and I am pretty sure that this
> > code has never been tested under those conditions to begin with. (OVMF
> > patches are under review atm to allow 5-level paging to be enabled in
> > the firmware)
>
> Aha.
>

I've built a debug OVMF image using the latest version of the series,
and put it at [0]

Run like this

qemu-system-x86_64 -M q35 \
  -cpu qemu64,+la57 -smp 4 \
  -bios OVMF-5level.fd \
  -kernel arch/x86/boot/bzImage \
  -append console=ttyS0\ earlyprintk=ttyS0 \
  -vga none -nographic -m 1g \
  -initrd <initrd.img>

and you will get loads of DEBUG output from the firmware first, and
then boot into Linux. (initrd can be omitted)

Right before entering, it will print

CpuDxe: 5-Level Paging = 1

which confirms that the firmware is running with 5 levels of paging.

I've confirmed that this boots happily with this series applied,
including when using 'no5lvl' on the command line, or when disabling
CONFIG_X86_5LEVEL [confirmed by inspecting
/sys/kernel/debug/page_tables/kernel].


[0] http://files.workofard.com/OVMF-5level.fd.gz
  
Borislav Petkov March 1, 2024, 5:33 p.m. UTC | #5
On Fri, Mar 01, 2024 at 06:09:53PM +0100, Ard Biesheuvel wrote:
> On Fri, 1 Mar 2024 at 17:09, Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Fri, Mar 01, 2024 at 11:01:33AM +0100, Ard Biesheuvel wrote:
> > > The scenario that I have not managed to test is entering from EFI with
> > > 5 levels of paging enabled, and switching back to 4 levels (which
> > > should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
> > > existence actually supports that today, and I am pretty sure that this
> > > code has never been tested under those conditions to begin with. (OVMF
> > > patches are under review atm to allow 5-level paging to be enabled in
> > > the firmware)
> >
> > Aha.
> >
> 
> I've built a debug OVMF image using the latest version of the series,
> and put it at [0]
> 
> Run like this
> 
> qemu-system-x86_64 -M q35 \
>   -cpu qemu64,+la57 -smp 4 \
>   -bios OVMF-5level.fd \
>   -kernel arch/x86/boot/bzImage \
>   -append console=ttyS0\ earlyprintk=ttyS0 \
>   -vga none -nographic -m 1g \
>   -initrd <initrd.img>
> 
> and you will get loads of DEBUG output from the firmware first, and
> then boot into Linux. (initrd can be omitted)
> 
> Right before entering, it will print
> 
> CpuDxe: 5-Level Paging = 1
> 
> which confirms that the firmware is running with 5 levels of paging.
> 
> I've confirmed that this boots happily with this series applied,
> including when using 'no5lvl' on the command line, or when disabling
> CONFIG_X86_5LEVEL [confirmed by inspecting
> /sys/kernel/debug/page_tables/kernel].
> 
> 
> [0] http://files.workofard.com/OVMF-5level.fd.gz

Nice, that might come in handy for other testing too.

Thx.
  
Tom Lendacky March 1, 2024, 7:13 p.m. UTC | #6
On 3/1/24 11:33, Borislav Petkov wrote:
> On Fri, Mar 01, 2024 at 06:09:53PM +0100, Ard Biesheuvel wrote:
>> On Fri, 1 Mar 2024 at 17:09, Borislav Petkov <bp@alien8.de> wrote:
>>>
>>> On Fri, Mar 01, 2024 at 11:01:33AM +0100, Ard Biesheuvel wrote:
>>>> The scenario that I have not managed to test is entering from EFI with
>>>> 5 levels of paging enabled, and switching back to 4 levels (which
>>>> should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
>>>> existence actually supports that today, and I am pretty sure that this
>>>> code has never been tested under those conditions to begin with. (OVMF
>>>> patches are under review atm to allow 5-level paging to be enabled in
>>>> the firmware)
>>>
>>> Aha.
>>>
>>
>> I've built a debug OVMF image using the latest version of the series,
>> and put it at [0]
>>
>> Run like this
>>
>> qemu-system-x86_64 -M q35 \
>>    -cpu qemu64,+la57 -smp 4 \
>>    -bios OVMF-5level.fd \
>>    -kernel arch/x86/boot/bzImage \
>>    -append console=ttyS0\ earlyprintk=ttyS0 \
>>    -vga none -nographic -m 1g \
>>    -initrd <initrd.img>
>>
>> and you will get loads of DEBUG output from the firmware first, and
>> then boot into Linux. (initrd can be omitted)
>>
>> Right before entering, it will print
>>
>> CpuDxe: 5-Level Paging = 1
>>
>> which confirms that the firmware is running with 5 levels of paging.
>>
>> I've confirmed that this boots happily with this series applied,
>> including when using 'no5lvl' on the command line, or when disabling
>> CONFIG_X86_5LEVEL [confirmed by inspecting
>> /sys/kernel/debug/page_tables/kernel].
>>
>>
>> [0] http://files.workofard.com/OVMF-5level.fd.gz
> 
> Nice, that might come in handy for other testing too.

Be aware that additional work will need to be done in OVMF to support 
5-level paging for SEV VMs.

The initial SEV implementation happened when there wasn't a page table library,
and so SEV support had to roll its own page table modifications. A page
table library has since been created and 5-level support was added, but
the SEV code hasn't been converted over to use the new library yet.

Thanks,
Tom

> 
> Thx.
>
  
Borislav Petkov March 3, 2024, 7:26 p.m. UTC | #7
On Fri, Mar 01, 2024 at 06:33:23PM +0100, Borislav Petkov wrote:
> > I've built a debug OVMF image using the latest version of the series,
> > and put it at [0]
> > 
> > Run like this
> > 
> > qemu-system-x86_64 -M q35 \
> >   -cpu qemu64,+la57 -smp 4 \
> >   -bios OVMF-5level.fd \
> >   -kernel arch/x86/boot/bzImage \
> >   -append console=ttyS0\ earlyprintk=ttyS0 \
> >   -vga none -nographic -m 1g \
> >   -initrd <initrd.img>
> > 
> > and you will get loads of DEBUG output from the firmware first, and
> > then boot into Linux. (initrd can be omitted)
> > 
> > Right before entering, it will print
> > 
> > CpuDxe: 5-Level Paging = 1
> > 
> > which confirms that the firmware is running with 5 levels of paging.
> > 
> > I've confirmed that this boots happily with this series applied,
> > including when using 'no5lvl' on the command line, or when disabling
> > CONFIG_X86_5LEVEL [confirmed by inspecting
> > /sys/kernel/debug/page_tables/kernel].
> > 
> > 
> > [0] http://files.workofard.com/OVMF-5level.fd.gz
> 
> Nice, that might come in handy for other testing too.

Btw, on a semi-related note, do you have an idea whether a normal guest
kernel using OVMF instead of seabios would be even able to boot a kernel
supplied with -kernel like above but without an -initrd?

I have everything builtin and the same kernel boots fine in a guest with
a
[    0.000000] SMBIOS 3.0.0 present.
[    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014

but if I try to boot the respective guest installed with the OVMF BIOS
from the debian package:

[    0.000000] efi: EFI v2.7 by Debian distribution of EDK II
[    0.000000] efi: SMBIOS=0x7f788000 SMBIOS 3.0=0x7f786000 ACPI=0x7f97e000 ACPI 2.0=0x7f97e014 MEMATTR=0x7ddfe018

it fails looking up the /dev/root device major/minor deep in the bowels
of the vfs:

[    2.565651] do_new_mount:
[    2.566380] vfs_get_tree: fc->root: 0000000000000000
[    2.567298] kern_path: filename: ffff88800d666000 of name: /dev/root
[    2.568418] kern_path: ret: 0
[    2.569009] lookup_bdev: kern_path(/dev/root, , path: ffff88800e537380), error: 0
[    2.571645] lookup_bdev: inode->i_rdev: 0x0
[    2.572417] get_tree_bdev: lookup_bdev(/dev/root, dev: 0x0), error: 0
						     ^^^^^^^^^

That dev_t should be 0x800002 - the major and minor of /dev/sda2 but it
looks like something else is missing in this case...

Thx.
  
Ard Biesheuvel March 3, 2024, 9:56 p.m. UTC | #8
On Sun, 3 Mar 2024 at 20:27, Borislav Petkov <bp@alien8.de> wrote:
>
..
>
> Btw, on a semi-related note, do you have an idea whether a normal guest
> kernel using OVMF instead of seabios would be even able to boot a kernel
> supplied with -kernel like above but without an -initrd?
>

How are you passing the root device to the kernel? Via root= on the
command line?

> I have everything builtin and the same kernel boots fine in a guest with
> a
> [    0.000000] SMBIOS 3.0.0 present.
> [    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>

OK, so this is SeaBIOS

> but if I try to boot the respective guest installed with the OVMF BIOS
> from the debian package:
>
> [    0.000000] efi: EFI v2.7 by Debian distribution of EDK II
> [    0.000000] efi: SMBIOS=0x7f788000 SMBIOS 3.0=0x7f786000 ACPI=0x7f97e000 ACPI 2.0=0x7f97e014 MEMATTR=0x7ddfe018
>

and this is OVMF.

I have tried both of these, with i440fx as well as q35, and they all
work happily with my Debian guest image passed via -hda to QEMU, and
with root=/dev/sda2 on the kernel command line.


> it fails looking up the /dev/root device major/minor deep in the bowels
> of the vfs:
>
> [    2.565651] do_new_mount:
> [    2.566380] vfs_get_tree: fc->root: 0000000000000000
> [    2.567298] kern_path: filename: ffff88800d666000 of name: /dev/root
> [    2.568418] kern_path: ret: 0
> [    2.569009] lookup_bdev: kern_path(/dev/root, , path: ffff88800e537380), error: 0
> [    2.571645] lookup_bdev: inode->i_rdev: 0x0
> [    2.572417] get_tree_bdev: lookup_bdev(/dev/root, dev: 0x0), error: 0
>                                                      ^^^^^^^^^
>
> That dev_t should be 0x800002 - the major and minor of /dev/sda2 but it
> looks like something else is missing in this case...
>

How did you get this output? Are these debug printk()s you added yourself?
  
Borislav Petkov March 3, 2024, 10:10 p.m. UTC | #9
On Sun, Mar 03, 2024 at 10:56:49PM +0100, Ard Biesheuvel wrote:
> How are you passing the root device to the kernel? Via root= on the
> command line?

Yeah:

qemu
..
-kernel arch/x86/boot/bzImage
-append "root=/dev/sda2 resume=/dev/sda3 ...

> and this is OVMF.

Yap.

> I have tried both of these, with i440fx as well as q35, and they all
> work happily with my Debian guest image passed via -hda to QEMU, and
> with root=/dev/sda2 on the kernel command line.

Interesting. I'm not passing any machine type. Maybe I should, even
though I've never done it before.

/me goes and tries machine type.

Well, I'll be damned!

-machine type=pc-i440fx-2.8 - no workie BUT

-machine type=pc-q35-2.8

booted.

Now on to figure out what's different with q35 and why it is magical and
it finds the root device just fine:

[    2.732908] mount_root_generic: i: 2, fs_name: ext4
[    2.734275] do_mount_root: name: /dev/root
[    2.735093] kern_path: filename: ffff88800d4de000 of name: /root
[    2.736954] kern_path: ret: 0
[    2.737727] init_mount: kern_path(/root), ret: 0
[    2.738964] path_mount: will do_new_mount
[    2.739784] do_new_mount: 1, fc source: (null)
[    2.740961] do_new_mount: 2, err: 0
[    2.741722] do_new_mount: 3, err: 0
[    2.742448] do_new_mount: 4, err: 0
[    2.743164] vfs_get_tree: fc->root: 0000000000000000
[    2.744095] kern_path: filename: ffff88800d4de000 of name: /dev/root
[    2.745352] kern_path: ret: 0
[    2.745994] lookup_bdev: kern_path(/dev/root, , path: ffff88800cf163c0), error: 0
[    2.747288] lookup_bdev: inode->i_rdev: 0x800002
[    2.748163] get_tree_bdev: lookup_bdev(/dev/root, dev: 0x800002), error: 0
							  ^^^^^^^^^

> How did you get this output? Are these debug printk()s you added yourself?

Yeah, the good old "sprinkle printks" debugging method. Figured I should
look at the VFS code out of interest. :-)

Thanks a lot for the suggestions, especially about q35!
  

Patch

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 38b54b992f32..9053dfe9fa03 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -21,9 +21,9 @@  typedef unsigned long	pgprotval_t;
 typedef struct { pteval_t pte; } pte_t;
 typedef struct { pmdval_t pmd; } pmd_t;
 
-#ifdef CONFIG_X86_5LEVEL
 extern unsigned int __pgtable_l5_enabled;
 
+#ifdef CONFIG_X86_5LEVEL
 #ifdef USE_EARLY_PGTABLE_L5
 /*
  * cpu_feature_enabled() is not available in early boot code.
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 72351c3121a6..deaaea3280d9 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -23,6 +23,7 @@ 
 #include <linux/pgtable.h>
 
 #include <asm/asm.h>
+#include <asm/page_64.h>
 #include <asm/processor.h>
 #include <asm/proto.h>
 #include <asm/smp.h>
@@ -77,24 +78,11 @@  static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
 	[GDT_ENTRY_KERNEL_DS]           = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
 };
 
-#ifdef CONFIG_X86_5LEVEL
-static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
-{
-	return ptr - (void *)_text + (void *)physaddr;
-}
-
-static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
+static inline bool check_la57_support(void)
 {
-	return fixup_pointer(ptr, physaddr);
-}
-
-static unsigned int __head *fixup_int(void *ptr, unsigned long physaddr)
-{
-	return fixup_pointer(ptr, physaddr);
-}
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return false;
 
-static bool __head check_la57_support(unsigned long physaddr)
-{
 	/*
 	 * 5-level paging is detected and enabled at kernel decompression
 	 * stage. Only check if it has been enabled there.
@@ -102,21 +90,8 @@  static bool __head check_la57_support(unsigned long physaddr)
 	if (!(native_read_cr4() & X86_CR4_LA57))
 		return false;
 
-	*fixup_int(&__pgtable_l5_enabled, physaddr) = 1;
-	*fixup_int(&pgdir_shift, physaddr) = 48;
-	*fixup_int(&ptrs_per_p4d, physaddr) = 512;
-	*fixup_long(&page_offset_base, physaddr) = __PAGE_OFFSET_BASE_L5;
-	*fixup_long(&vmalloc_base, physaddr) = __VMALLOC_BASE_L5;
-	*fixup_long(&vmemmap_base, physaddr) = __VMEMMAP_BASE_L5;
-
 	return true;
 }
-#else
-static bool __head check_la57_support(unsigned long physaddr)
-{
-	return false;
-}
-#endif
 
 static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
 {
@@ -180,7 +155,7 @@  unsigned long __head __startup_64(unsigned long physaddr,
 	bool la57;
 	int i;
 
-	la57 = check_la57_support(physaddr);
+	la57 = check_la57_support();
 
 	/* Is the address too large? */
 	if (physaddr >> MAX_PHYSMEM_BITS)
@@ -465,6 +440,15 @@  asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
 				(__START_KERNEL & PGDIR_MASK)));
 	BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
 
+	if (check_la57_support()) {
+		__pgtable_l5_enabled	= 1;
+		pgdir_shift		= 48;
+		ptrs_per_p4d		= 512;
+		page_offset_base	= __PAGE_OFFSET_BASE_L5;
+		vmalloc_base		= __VMALLOC_BASE_L5;
+		vmemmap_base		= __VMEMMAP_BASE_L5;
+	}
+
 	cr4_init_shadow();
 
 	/* Kill off the identity-map trampoline */