[v7,9/9] x86/startup_64: Drop global variables keeping track of LA57 state

Message ID 20240227151907.387873-20-ardb+git@google.com
State New
Series x86: Confine early 1:1 mapped startup code

Commit Message

Ard Biesheuvel Feb. 27, 2024, 3:19 p.m. UTC
  From: Ard Biesheuvel <ardb@kernel.org>

On x86_64, the core kernel is entered in long mode, which implies that
paging is enabled. This means that the CR4.LA57 control bit is
guaranteed to be in sync with the number of paging levels used by the
kernel, and there is no need to store this in a variable.

There is also no need to use variables for storing the calculations of
pgdir_shift and ptrs_per_p4d, as they are easily determined on the fly.
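
Concretely, the hunks below replace the variables with expressions of
the form:

	#define PGDIR_SHIFT	(pgtable_l5_enabled() ? 48 : 39)
	#define PTRS_PER_P4D	(pgtable_l5_enabled() ? 512 : 1)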

This removes the need for two different sources of truth for determining
whether 5-level paging is in use: CR4.LA57 always reflects the actual
state, and never changes from the point of view of the 64-bit core
kernel. The only potential concern is the cost of CR4 accesses, which
can be mitigated using alternatives patching based on feature detection.
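
As a rough sketch of the idea (the actual change implements this with
ALTERNATIVE_TERNARY, as shown in the patch below), the check boils
down to:

	static inline bool pgtable_l5_enabled(void)
	{
		if (!IS_ENABLED(CONFIG_X86_5LEVEL))
			return false;

		/* CR4.LA57 mirrors the paging mode while in long mode */
		return native_read_cr4() & X86_CR4_LA57;
	}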

Note that even the decompressor does not manipulate any page tables
before updating CR4.LA57, so it can also avoid the associated global
variables entirely. However, as it does not implement alternatives
patching, the associated ELF sections need to be discarded.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/misc.h         |  4 --
 arch/x86/boot/compressed/pgtable_64.c   | 12 ------
 arch/x86/boot/compressed/vmlinux.lds.S  |  1 +
 arch/x86/include/asm/pgtable_64_types.h | 43 ++++++++++----------
 arch/x86/kernel/cpu/common.c            |  2 -
 arch/x86/kernel/head64.c                | 33 +--------------
 arch/x86/mm/kasan_init_64.c             |  3 --
 arch/x86/mm/mem_encrypt_identity.c      |  9 ----
 8 files changed, 25 insertions(+), 82 deletions(-)
  

Comments

Borislav Petkov March 1, 2024, 7:20 p.m. UTC | #1
On Tue, Feb 27, 2024 at 04:19:17PM +0100, Ard Biesheuvel wrote:
> +	asm(ALTERNATIVE_TERNARY(
> +		"movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
> +		%P[feat], "stc", "clc")
> +		: [reg] "=&r" (r), CC_OUT(c) (ret)
> +		: [feat] "i"  (X86_FEATURE_LA57),
> +		  [la57] "i"  (X86_CR4_LA57_BIT)
> +		: "cc");
> +
> +	return ret;

Yeah, I said this is creative but an alternative here looks like
overkill.

Can we use a RIP_REL_REF(global var) instead pls?

Thx.
  
Ard Biesheuvel March 1, 2024, 11:55 p.m. UTC | #2
On Fri, 1 Mar 2024 at 20:20, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Feb 27, 2024 at 04:19:17PM +0100, Ard Biesheuvel wrote:
> > +     asm(ALTERNATIVE_TERNARY(
> > +             "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
> > +             %P[feat], "stc", "clc")
> > +             : [reg] "=&r" (r), CC_OUT(c) (ret)
> > +             : [feat] "i"  (X86_FEATURE_LA57),
> > +               [la57] "i"  (X86_CR4_LA57_BIT)
> > +             : "cc");
> > +
> > +     return ret;
>
> Yeah, I said this is creative but an alternative here looks like
> overkill.
>
> Can we use a RIP_REL_REF(global var) instead pls?
>

I don't see the point of that, tbh. Patch #2 already ripped out all
the fixup_pointer() occurrences. This patch gets rid of the need to
#define USE_EARLY_PGTABLE_L5 in each translation unit that contains
code that might execute before alternatives patching has occurred.
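
For context: each affected translation unit currently carries

	/* cpu_feature_enabled() cannot be used this early */
	#define USE_EARLY_PGTABLE_L5

ahead of its #includes, which makes pgtable_64_types.h select the
variable-backed early variant instead of cpu_feature_enabled():

	#ifdef USE_EARLY_PGTABLE_L5
	static inline bool pgtable_l5_enabled(void)
	{
		return __pgtable_l5_enabled;
	}
	#else
	#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
	#endif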

Today, pgtable_l5_enabled() is used in many places, most of which
resolve to cpu_feature_enabled(), and I don't think you are suggesting
to replace all of those with a variable load, right? So that means
we'll have to stick with early and late variants of
pgtable_l5_enabled() like we have today, and we should just drop this
patch instead - I put it at the end of the series for a reason.
  
Borislav Petkov March 2, 2024, 3:17 p.m. UTC | #3
On Sat, Mar 02, 2024 at 12:55:14AM +0100, Ard Biesheuvel wrote:
> Today, pgtable_l5_enabled() is used in many places, most of which
> resolve to cpu_feature_enabled(), and I don't think you are suggesting
> to replace all of those with a variable load, right?

pgtable_l5_enabled() is the odd one out, a special case which should
have been cpu_feature_enabled(), as the latter is the default interface
we use to query which features the CPU supports. But 5-level paging got
done long ago and we hadn't settled on cpu_feature_enabled() back then.

So should we replace the former with the latter?

Yap, eventually.

> So that means we'll have to stick with early and late variants of
> pgtable_l5_enabled() like we have today,

I don't mind at all if we had an

	early_pgtable_l5_enabled()

which does the RIP_REL_REF() thing and is used only in 1:1 mapping
code, while the late users gradually get replaced with
cpu_feature_enabled() until pgtable_l5_enabled() is completely gone.
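
Something like this, sketched here purely as an illustration:

	static inline bool early_pgtable_l5_enabled(void)
	{
		/* RIP-relative access is safe from the 1:1 mapping */
		return RIP_REL_REF(__pgtable_l5_enabled);
	}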

And then we can even see whether we can open-code the handful of places.

> and we should just drop this patch instead - I put it at the end of
> the series for a reason.

Yeah, we can leave that one aside for now but use it to start a cleanup
series.

If you're bored and feel like doing it, be my guest, but for the next
cycle. Or I'll put it on my ever-growing TODO and get to it eventually.

For now, lemme test your set without this one and see whether we can
queue them even now.

Thx.
  
Ard Biesheuvel March 2, 2024, 3:32 p.m. UTC | #4
On Sat, 2 Mar 2024 at 16:17, Borislav Petkov <bp@alien8.de> wrote:
>
> On Sat, Mar 02, 2024 at 12:55:14AM +0100, Ard Biesheuvel wrote:
> > Today, pgtable_l5_enabled() is used in many places, most of which
> > resolve to cpu_feature_enabled(), and I don't think you are suggesting
> > to replace all of those with a variable load, right?
>
> pgtable_l5_enabled() is the odd one out, a special case which should
> have been cpu_feature_enabled(), as the latter is the default interface
> we use to query which features the CPU supports. But 5-level paging got
> done long ago and we hadn't settled on cpu_feature_enabled() back then.
>
> So should we replace the former with the latter?
>
> Yap, eventually.
>

That is not the issue here. The issue is that cpu_feature_enabled()
will produce the wrong value if you call it too early.

The 'late' version of pgtable_l5_enabled() already uses
cpu_feature_enabled(), and I don't think that needs to change. Or are
you saying that pgtable_l5_enabled() should not exist at all, and all
ordinary users should use cpu_feature_enabled() directly? I suspect
that would cause problems where pgtable_l5_enabled() is used in static
inlines in header files, and it is left up to the .c file to set the
#define to convert all #include'd occurrences of pgtable_l5_enabled()
into the 'early' variant.
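
For instance, a header helper like the following (hypothetical name,
for illustration only) binds to whichever variant the including .c
file selected via the #define, which a separate
early_pgtable_l5_enabled() could not replicate:

	static inline bool p4d_folded_example(void)
	{
		/* early or late variant, decided per translation unit */
		return !pgtable_l5_enabled();
	}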

> > So that means we'll have to stick with early and late variants of
> > pgtable_l5_enabled() like we have today,
>
> I don't mind at all if we had an
>
>         early_pgtable_l5_enabled()
>

That wouldn't work with the static inline users of pgtable_l5_enabled().

> which does the RIP_REL_REF() thing and is used only in 1:1 mapping
> code, while the late users gradually get replaced with
> cpu_feature_enabled() until pgtable_l5_enabled() is completely gone.
>

All the references to pgtable_l5_enabled() are already gone from the
code that executes from a 1:1 mapping. That is why this patch is
optional: it is just cleanup that goes on top of the earlier patch
that gets rid of potentially problematic uses of fixup_pointer().

The issue being solved here is that we duplicate the value of CR4.LA57
into a global variable, in a way that requires us to reason about
whether routines in a certain compilation unit might be called before
cpu_feature_enabled() may be used. Unlike most CPU features, which are
truly optional and can be assumed missing until we figure out whether
they are there or not, LA57 cannot be treated that way.

But I am not aware of any issues around this, although the early
accessors are probably used more widely than necessary at this point.

So an alternative approach might be to not use cpu_feature_enabled()
at all, and always rely on the global variables. But that might tickle
a hot path in the wrong way and cause a noticeable slowdown on some
benchmark.

> And then we can even see whether we can open-code the handful of places.
>
> > and we should just drop this patch instead - I put it at the end of
> > the series for a reason.
>
> Yeah, we can leave that one aside for now but use it to start a cleanup
> series.
>
> If you're bored and feel like doing it, be my guest, but for the next
> cycle. Or I'll put it on my ever-growing TODO and get to it eventually.
>

I don't mind revisiting this next cycle.

> For now, lemme test your set without this one and see whether we can
> queue them even now.
>

Cheers.
  
Borislav Petkov March 2, 2024, 6:22 p.m. UTC | #5
On Sat, Mar 02, 2024 at 04:32:19PM +0100, Ard Biesheuvel wrote:
> That is not the issue here. The issue is that cpu_feature_enabled()
> will produce the wrong value if you call it too early.

Looks like I didn't express myself too clearly: the early variant,
early_pgtable_l5_enabled(), should use a simple variable as it does
now, not cpu_feature_enabled().

And regardless, cpu_feature_enabled() should work even before
alternatives have run because we do dynamic testing in that case... and
so on and so on.
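
Glossing over the real implementation, the dynamic fallback amounts to
testing the boot CPU's capability bitmap directly, roughly:

	/* illustrative sketch only, not the actual arch/x86 code */
	static __always_inline bool feature_enabled(u16 bit)
	{
		return test_bit(bit,
				(unsigned long *)boot_cpu_data.x86_capability);
	}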

BUT(!), let's put a pin in this and let me first have an in-depth look
after the merge window so that I can poke at the details and produce
a concrete diff which we can talk about.

Thx.
  

Patch

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index b353a7be380c..e4ab7b4d8698 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -16,9 +16,6 @@ 
 
 #define __NO_FORTIFY
 
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
 /*
  * Boot stub deals with identity mappings, physical and virtual addresses are
  * the same, so override these defines.
@@ -181,7 +178,6 @@  static inline int count_immovable_mem_regions(void) { return 0; }
 #endif
 
 /* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
 extern void kernel_add_identity_map(unsigned long start, unsigned long end);
 
 /* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 51f957b24ba7..ae72f53f5e77 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -9,13 +9,6 @@ 
 #define BIOS_START_MIN		0x20000U	/* 128K, less than this is insane */
 #define BIOS_START_MAX		0x9f000U	/* 640K, absolute maximum */
 
-#ifdef CONFIG_X86_5LEVEL
-/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
-unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") pgdir_shift = 39;
-unsigned int __section(".data") ptrs_per_p4d = 1;
-#endif
-
 /* Buffer to preserve trampoline memory */
 static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
 
@@ -125,11 +118,6 @@  asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
 			native_cpuid_eax(0) >= 7 &&
 			(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
 		l5_required = true;
-
-		/* Initialize variables for 5-level paging */
-		__pgtable_l5_enabled = 1;
-		pgdir_shift = 48;
-		ptrs_per_p4d = 512;
 	}
 
 	/*
diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
index 083ec6d7722a..06358bb067fe 100644
--- a/arch/x86/boot/compressed/vmlinux.lds.S
+++ b/arch/x86/boot/compressed/vmlinux.lds.S
@@ -81,6 +81,7 @@  SECTIONS
 		*(.dynamic) *(.dynsym) *(.dynstr) *(.dynbss)
 		*(.hash) *(.gnu.hash)
 		*(.note.*)
+		*(.altinstructions .altinstr_replacement)
 	}
 
 	.got.plt (INFO) : {
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 9053dfe9fa03..2fac8ba9564a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -6,7 +6,10 @@ 
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+#include <asm/alternative.h>
+#include <asm/cpufeatures.h>
 #include <asm/kaslr.h>
+#include <asm/processor-flags.h>
 
 /*
  * These are used to make use of C type-checking..
@@ -21,28 +24,24 @@  typedef unsigned long	pgprotval_t;
 typedef struct { pteval_t pte; } pte_t;
 typedef struct { pmdval_t pmd; } pmd_t;
 
-extern unsigned int __pgtable_l5_enabled;
-
-#ifdef CONFIG_X86_5LEVEL
-#ifdef USE_EARLY_PGTABLE_L5
-/*
- * cpu_feature_enabled() is not available in early boot code.
- * Use variable instead.
- */
-static inline bool pgtable_l5_enabled(void)
+static __always_inline __pure bool pgtable_l5_enabled(void)
 {
-	return __pgtable_l5_enabled;
-}
-#else
-#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
-#endif /* USE_EARLY_PGTABLE_L5 */
+	unsigned long r;
+	bool ret;
 
-#else
-#define pgtable_l5_enabled() 0
-#endif /* CONFIG_X86_5LEVEL */
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return false;
 
-extern unsigned int pgdir_shift;
-extern unsigned int ptrs_per_p4d;
+	asm(ALTERNATIVE_TERNARY(
+		"movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
+		%P[feat], "stc", "clc")
+		: [reg] "=&r" (r), CC_OUT(c) (ret)
+		: [feat] "i"  (X86_FEATURE_LA57),
+		  [la57] "i"  (X86_CR4_LA57_BIT)
+		: "cc");
+
+	return ret;
+}
 
 #endif	/* !__ASSEMBLY__ */
 
@@ -53,7 +52,7 @@  extern unsigned int ptrs_per_p4d;
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
  */
-#define PGDIR_SHIFT	pgdir_shift
+#define PGDIR_SHIFT	(pgtable_l5_enabled() ? 48 : 39)
 #define PTRS_PER_PGD	512
 
 /*
@@ -61,7 +60,7 @@  extern unsigned int ptrs_per_p4d;
  */
 #define P4D_SHIFT		39
 #define MAX_PTRS_PER_P4D	512
-#define PTRS_PER_P4D		ptrs_per_p4d
+#define PTRS_PER_P4D		(pgtable_l5_enabled() ? 512 : 1)
 #define P4D_SIZE		(_AC(1, UL) << P4D_SHIFT)
 #define P4D_MASK		(~(P4D_SIZE - 1))
 
@@ -76,6 +75,8 @@  extern unsigned int ptrs_per_p4d;
 #define PTRS_PER_PGD		512
 #define MAX_PTRS_PER_P4D	1
 
+#define MAX_POSSIBLE_PHYSMEM_BITS	46
+
 #endif /* CONFIG_X86_5LEVEL */
 
 /*
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9e35e276c55a..d88e4be88868 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1,6 +1,4 @@ 
 // SPDX-License-Identifier: GPL-2.0-only
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
 
 #include <linux/memblock.h>
 #include <linux/linkage.h>
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index deaaea3280d9..789ed2c53527 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -7,9 +7,6 @@ 
 
 #define DISABLE_BRANCH_PROFILING
 
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
 #include <linux/init.h>
 #include <linux/linkage.h>
 #include <linux/types.h>
@@ -52,14 +49,6 @@  extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
-#ifdef CONFIG_X86_5LEVEL
-unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int pgdir_shift __ro_after_init = 39;
-EXPORT_SYMBOL(pgdir_shift);
-unsigned int ptrs_per_p4d __ro_after_init = 1;
-EXPORT_SYMBOL(ptrs_per_p4d);
-#endif
-
 #ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
 unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
 EXPORT_SYMBOL(page_offset_base);
@@ -78,21 +67,6 @@  static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
 	[GDT_ENTRY_KERNEL_DS]           = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
 };
 
-static inline bool check_la57_support(void)
-{
-	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
-		return false;
-
-	/*
-	 * 5-level paging is detected and enabled at kernel decompression
-	 * stage. Only check if it has been enabled there.
-	 */
-	if (!(native_read_cr4() & X86_CR4_LA57))
-		return false;
-
-	return true;
-}
-
 static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
 {
 	unsigned long vaddr, vaddr_end;
@@ -155,7 +129,7 @@  unsigned long __head __startup_64(unsigned long physaddr,
 	bool la57;
 	int i;
 
-	la57 = check_la57_support();
+	la57 = pgtable_l5_enabled();
 
 	/* Is the address too large? */
 	if (physaddr >> MAX_PHYSMEM_BITS)
@@ -440,10 +414,7 @@  asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
 				(__START_KERNEL & PGDIR_MASK)));
 	BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
 
-	if (check_la57_support()) {
-		__pgtable_l5_enabled	= 1;
-		pgdir_shift		= 48;
-		ptrs_per_p4d		= 512;
+	if (pgtable_l5_enabled()) {
 		page_offset_base	= __PAGE_OFFSET_BASE_L5;
 		vmalloc_base		= __VMALLOC_BASE_L5;
 		vmemmap_base		= __VMEMMAP_BASE_L5;
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0302491d799d..85ae1ef840cc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -2,9 +2,6 @@ 
 #define DISABLE_BRANCH_PROFILING
 #define pr_fmt(fmt) "kasan: " fmt
 
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
 #include <linux/memblock.h>
 #include <linux/kasan.h>
 #include <linux/kdebug.h>
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 64b5005d49e5..a857945af177 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -27,15 +27,6 @@ 
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 
-/*
- * This code runs before CPU feature bits are set. By default, the
- * pgtable_l5_enabled() function uses bit X86_FEATURE_LA57 to determine if
- * 5-level paging is active, so that won't work here. USE_EARLY_PGTABLE_L5
- * is provided to handle this situation and, instead, use a variable that
- * has been set by the early boot code.
- */
-#define USE_EARLY_PGTABLE_L5
-
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>