[RFC,4/5] x86/head64: Replace pointer fixups with PIE codegen
Commit Message
From: Ard Biesheuvel <ardb@kernel.org>
Some of the C code in head64.c may be called from a different virtual
address than the one it was linked at. Currently, we deal with this by
using ordinary, position-dependent codegen and fixing up all symbol
references on the fly. This is fragile and tricky to maintain. It is
also unnecessary: we can use position-independent codegen (with hidden
visibility) to ensure that all compiler-generated symbol references are
RIP-relative, removing the need for fixups entirely.
It does mean that explicit references to kernel virtual addresses need
to be generated by hand, so generate those using a movabs instruction
in inline asm in the handful of places where we actually need this.
While at it, move these routines to .init.text where they belong.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/Makefile | 11 ++
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/include/asm/init.h | 2 -
arch/x86/include/asm/setup.h | 2 +-
arch/x86/kernel/Makefile | 4 +
arch/x86/kernel/head64.c | 117 +++++++-------------
6 files changed, 60 insertions(+), 78 deletions(-)
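As background (an illustrative sketch, not part of the patch): under
position-independent codegen it is hidden visibility that lets the
compiler bypass the GOT and emit direct RIP-relative references, which
is what the patch's -include of hidden.h arranges for every global.
Assuming GCC or clang on x86-64 at -O2 -fpic:

    /* illustrative only -- not from the patch */
    extern int default_sym;                      /* default visibility */
    extern int hidden_sym __attribute__((visibility("hidden")));

    int load_default(void) { return default_sym; }
    /*  movq  default_sym@GOTPCREL(%rip), %rax   -- indirect, via the GOT */
    /*  movl  (%rax), %eax                                                */

    int load_hidden(void) { return hidden_sym; }
    /*  movl  hidden_sym(%rip), %eax             -- direct RIP-relative   */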
Comments
On Mon, Jan 22, 2024 at 4:14 AM Ard Biesheuvel <ardb+git@google.com> wrote:
>
> From: Ard Biesheuvel <ardb@kernel.org>
>
> [...]
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 1a068de12a56..bed0850d91b0 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -168,6 +168,17 @@ else
> KBUILD_CFLAGS += -mcmodel=kernel
> KBUILD_RUSTFLAGS += -Cno-redzone=y
> KBUILD_RUSTFLAGS += -Ccode-model=kernel
> +
> + PIE_CFLAGS := -fpie -mcmodel=small \
> + -include $(srctree)/include/linux/hidden.h
> +
> + ifeq ($(CONFIG_STACKPROTECTOR),y)
> + ifeq ($(CONFIG_SMP),y)
> + PIE_CFLAGS += -mstack-protector-guard-reg=gs
> + endif
This compiler flag requires GCC 8.1 or later. When I posted a patch
series[1] to convert the stack protector to a normal percpu variable
instead of the fixed offset, there was pushback over requiring GCC 8.1
to keep stack protector support. I added code to objtool to convert
code from older compilers, but there hasn't been any feedback since.
Similar conversion code would be needed in objtool for this unless the
decision is made to require GCC 8.1 for stack protector support going
forward.
Brian Gerst
[1] https://lore.kernel.org/lkml/20231115173708.108316-1-brgerst@gmail.com/
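For context on why the series needs this flag at all (my reading,
stated as an assumption): with -mcmodel=kernel, GCC already places the
stack canary at %gs:40, but PIE_CFLAGS switches to -mcmodel=small,
whose default is the userspace %fs-based TLS slot, so the guard
register has to be forced back to %gs. A sketch of the difference in
generated code:

    /* illustrative only: any function that gets a canary, e.g. when
     * built with -fstack-protector-all */
    void foo(void)
    {
            char buf[64];
            /* default small-model guard load:
             *     movq  %fs:40, %rax      # userspace TLS slot
             * with -mstack-protector-guard-reg=gs (GCC 8.1+, clang 12+):
             *     movq  %gs:40, %rax      # kernel percpu stack canary
             */
            buf[0] = 0;
    }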
On Mon, Jan 22, 2024 at 02:34:46PM -0500, Brian Gerst wrote:
> On Mon, Jan 22, 2024 at 4:14 AM Ard Biesheuvel <ardb+git@google.com> wrote:
> > [...]
> > + PIE_CFLAGS += -mstack-protector-guard-reg=gs
> > + endif
>
> This compiler flag requires GCC 8.1 or later. When I posted a patch
> series[1] to convert the stack protector to a normal percpu variable
> instead of the fixed offset, there was pushback over requiring GCC 8.1
> to keep stack protector support. I added code to objtool to convert
> code from older compilers, but there hasn't been any feedback since.
> Similar conversion code would be needed in objtool for this unless the
> decision is made to require GCC 8.1 for stack protector support going
> forward.
>
> Brian Gerst
>
> [1] https://lore.kernel.org/lkml/20231115173708.108316-1-brgerst@gmail.com/
I was going to comment on this as well, as that flag was only supported
in clang 12.0.0 and newer. It should not be too big of a deal for us
though, as I was already planning on bumping the minimum supported
version of clang for building the kernel to 13.0.1 (but there may be
breakage reports if this series lands before that):
https://lore.kernel.org/20240110165339.GA3105@dev-arch.thelio-3990X/
Cheers,
Nathan
On Mon, 22 Jan 2024 at 23:44, Nathan Chancellor <nathan@kernel.org> wrote:
>
> On Mon, Jan 22, 2024 at 02:34:46PM -0500, Brian Gerst wrote:
> > [...]
>
> I was going to comment on this as well, as that flag was only supported
> in clang 12.0.0 and newer. It should not be too big of a deal for us
> though, as I was already planning on bumping the minimum supported
> version of clang for building the kernel to 13.0.1 (but there may be
> breakage reports if this series lands before that):
> https://lore.kernel.org/20240110165339.GA3105@dev-arch.thelio-3990X/
Thanks for pointing this out.
Given that building the entire kernel with -fPIC is neither necessary
nor sufficient, I am going to abandon this approach.
If we apply -fPIC to only a handful of compilation units containing
code that runs from the 1:1 mapping, it is not unreasonable to simply
disable the stack protector altogether for those pieces too. This also
works around the issue with older GCC versions.
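A sketch of that narrower arrangement (hypothetical, following the
kbuild conventions the patch already uses; the helper name below is
made up for illustration):

    /*
     * Hypothetical kbuild fragment, e.g. in arch/x86/kernel/Makefile:
     *
     *     CFLAGS_head64.o += $(PIE_CFLAGS) -fno-stack-protector
     *
     * Individual functions can also opt out (GCC 11+/clang) via the
     * kernel's __no_stack_protector wrapper:
     */
    #include <linux/compiler_types.h>

    static void __no_stack_protector early_identity_mapped_helper(void)
    {
            /* no canary load or check is emitted for this function */
    }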
@@ -168,6 +168,17 @@ else
KBUILD_CFLAGS += -mcmodel=kernel
KBUILD_RUSTFLAGS += -Cno-redzone=y
KBUILD_RUSTFLAGS += -Ccode-model=kernel
+
+ PIE_CFLAGS := -fpie -mcmodel=small \
+ -include $(srctree)/include/linux/hidden.h
+
+ ifeq ($(CONFIG_STACKPROTECTOR),y)
+ ifeq ($(CONFIG_SMP),y)
+ PIE_CFLAGS += -mstack-protector-guard-reg=gs
+ endif
+ endif
+
+ export PIE_CFLAGS
endif
#
@@ -84,7 +84,7 @@ LDFLAGS_vmlinux += -T
hostprogs := mkpiggy
HOST_EXTRACFLAGS += -I$(srctree)/tools/include
-sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
+sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABbCDGRSTtVW] \(_text\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
quiet_cmd_voffset = VOFFSET $@
cmd_voffset = $(NM) $< | sed -n $(sed-voffset) > $@
@@ -2,8 +2,6 @@
#ifndef _ASM_X86_INIT_H
#define _ASM_X86_INIT_H
-#define __head __section(".head.text")
-
struct x86_mapping_info {
void *(*alloc_pgt_page)(void *); /* allocate buf for page table */
void *context; /* context for alloc_pgt_page */
@@ -48,7 +48,7 @@ extern unsigned long saved_video_mode;
extern void reserve_standard_io_resources(void);
extern void i386_reserve_resources(void);
extern unsigned long __startup_64(unsigned long physaddr, struct boot_params *bp);
-extern void startup_64_setup_env(unsigned long physbase);
+extern void startup_64_setup_env(void);
extern void early_setup_idt(void);
extern void __init do_early_exception(struct pt_regs *regs, int trapnr);
@@ -21,6 +21,10 @@ CFLAGS_REMOVE_sev.o = -pg
CFLAGS_REMOVE_rethook.o = -pg
endif
+# head64.c contains C code that may execute from a different virtual address
+# than it was linked at, so we always build it using PIE codegen
+CFLAGS_head64.o += $(PIE_CFLAGS)
+
KASAN_SANITIZE_head$(BITS).o := n
KASAN_SANITIZE_dumpstack.o := n
KASAN_SANITIZE_dumpstack_$(BITS).o := n
@@ -85,23 +85,8 @@ static struct desc_ptr startup_gdt_descr __initdata = {
.address = 0,
};
-static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
-{
- return ptr - (void *)_text + (void *)physaddr;
-}
-
-static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
-{
- return fixup_pointer(ptr, physaddr);
-}
-
#ifdef CONFIG_X86_5LEVEL
-static unsigned int __head *fixup_int(void *ptr, unsigned long physaddr)
-{
- return fixup_pointer(ptr, physaddr);
-}
-
-static bool __head check_la57_support(unsigned long physaddr)
+static bool __init check_la57_support(void)
{
/*
* 5-level paging is detected and enabled at kernel decompression
@@ -110,23 +95,28 @@ static bool __head check_la57_support(unsigned long physaddr)
if (!(native_read_cr4() & X86_CR4_LA57))
return false;
- *fixup_int(&__pgtable_l5_enabled, physaddr) = 1;
- *fixup_int(&pgdir_shift, physaddr) = 48;
- *fixup_int(&ptrs_per_p4d, physaddr) = 512;
- *fixup_long(&page_offset_base, physaddr) = __PAGE_OFFSET_BASE_L5;
- *fixup_long(&vmalloc_base, physaddr) = __VMALLOC_BASE_L5;
- *fixup_long(&vmemmap_base, physaddr) = __VMEMMAP_BASE_L5;
+ __pgtable_l5_enabled = 1;
+ pgdir_shift = 48;
+ ptrs_per_p4d = 512;
+ page_offset_base = __PAGE_OFFSET_BASE_L5;
+ vmalloc_base = __VMALLOC_BASE_L5;
+ vmemmap_base = __VMEMMAP_BASE_L5;
return true;
}
#else
-static bool __head check_la57_support(unsigned long physaddr)
+static bool __init check_la57_support(void)
{
return false;
}
#endif
-static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
+#define __va_symbol(sym) ({ \
+ unsigned long __v; \
+ asm("movabsq $" __stringify(sym) ", %0":"=r"(__v)); \
+ __v; })
+
+static unsigned long __init sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
{
unsigned long vaddr, vaddr_end;
int i;
@@ -141,8 +131,8 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
* attribute.
*/
if (sme_get_me_mask()) {
- vaddr = (unsigned long)__start_bss_decrypted;
- vaddr_end = (unsigned long)__end_bss_decrypted;
+ vaddr = __va_symbol(__start_bss_decrypted);
+ vaddr_end = __va_symbol(__end_bss_decrypted);
for (; vaddr < vaddr_end; vaddr += PMD_SIZE) {
/*
@@ -169,13 +159,7 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
return sme_get_me_mask();
}
-/* Code in __startup_64() can be relocated during execution, but the compiler
- * doesn't have to generate PC-relative relocations when accessing globals from
- * that function. Clang actually does not generate them, which leads to
- * boot-time crashes. To work around this problem, every global pointer must
- * be adjusted using fixup_pointer().
- */
-unsigned long __head __startup_64(unsigned long physaddr,
+unsigned long __init __startup_64(unsigned long physaddr,
struct boot_params *bp)
{
unsigned long load_delta, *p;
@@ -184,12 +168,10 @@ unsigned long __head __startup_64(unsigned long physaddr,
p4dval_t *p4d;
pudval_t *pud;
pmdval_t *pmd, pmd_entry;
- pteval_t *mask_ptr;
bool la57;
int i;
- unsigned int *next_pgt_ptr;
- la57 = check_la57_support(physaddr);
+ la57 = check_la57_support();
/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
@@ -199,7 +181,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
* Compute the delta between the address I am compiled to run at
* and the address I am actually running at.
*/
- load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+ load_delta = physaddr - (__va_symbol(_text) - __START_KERNEL_map);
/* Is the address not 2M aligned? */
if (load_delta & ~PMD_MASK)
@@ -210,26 +192,22 @@ unsigned long __head __startup_64(unsigned long physaddr,
/* Fixup the physical addresses in the page table */
- pgd = fixup_pointer(early_top_pgt, physaddr);
+ pgd = (pgdval_t *)early_top_pgt;
p = pgd + pgd_index(__START_KERNEL_map);
if (la57)
*p = (unsigned long)level4_kernel_pgt;
else
*p = (unsigned long)level3_kernel_pgt;
- *p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;
+ *p += _PAGE_TABLE_NOENC + sme_get_me_mask();
- if (la57) {
- p4d = fixup_pointer(level4_kernel_pgt, physaddr);
- p4d[511] += load_delta;
- }
+ if (la57)
+ level4_kernel_pgt[511].p4d += load_delta;
- pud = fixup_pointer(level3_kernel_pgt, physaddr);
- pud[510] += load_delta;
- pud[511] += load_delta;
+ level3_kernel_pgt[510].pud += load_delta;
+ level3_kernel_pgt[511].pud += load_delta;
- pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
for (i = FIXMAP_PMD_TOP; i > FIXMAP_PMD_TOP - FIXMAP_PMD_NUM; i--)
- pmd[i] += load_delta;
+ level2_fixmap_pgt[i].pmd += load_delta;
/*
* Set up the identity mapping for the switchover. These
@@ -238,15 +216,13 @@ unsigned long __head __startup_64(unsigned long physaddr,
* it avoids problems around wraparound.
*/
- next_pgt_ptr = fixup_pointer(&next_early_pgt, physaddr);
- pud = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
- pmd = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
+ pud = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
+ pmd = (pmdval_t *)early_dynamic_pgts[next_early_pgt++];
pgtable_flags = _KERNPG_TABLE_NOENC + sme_get_me_mask();
if (la57) {
- p4d = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++],
- physaddr);
+ p4d = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
i = (physaddr >> PGDIR_SHIFT) % PTRS_PER_PGD;
pgd[i + 0] = (pgdval_t)p4d + pgtable_flags;
@@ -267,8 +243,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
/* Filter out unsupported __PAGE_KERNEL_* bits: */
- mask_ptr = fixup_pointer(&__supported_pte_mask, physaddr);
- pmd_entry &= *mask_ptr;
+ pmd_entry &= __supported_pte_mask;
pmd_entry += sme_get_me_mask();
pmd_entry += physaddr;
@@ -294,14 +269,14 @@ unsigned long __head __startup_64(unsigned long physaddr,
* error, causing the BIOS to halt the system.
*/
- pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+ pmd = (pmdval_t *)level2_kernel_pgt;
/* invalidate pages before the kernel image */
- for (i = 0; i < pmd_index((unsigned long)_text); i++)
+ for (i = 0; i < pmd_index(__va_symbol(_text)); i++)
pmd[i] &= ~_PAGE_PRESENT;
/* fixup pages that are part of the kernel image */
- for (; i <= pmd_index((unsigned long)_end); i++)
+ for (; i <= pmd_index(__va_symbol(_end)); i++)
if (pmd[i] & _PAGE_PRESENT)
pmd[i] += load_delta;
@@ -313,7 +288,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
* Fixup phys_base - remove the memory encryption mask to obtain
* the true physical address.
*/
- *fixup_long(&phys_base, physaddr) += load_delta - sme_get_me_mask();
+ phys_base += load_delta - sme_get_me_mask();
return sme_postprocess_startup(bp, pmd);
}
@@ -587,22 +562,16 @@ static void set_bringup_idt_handler(gate_desc *idt, int n, void *handler)
}
/* This runs while still in the direct mapping */
-static void __head startup_64_load_idt(unsigned long physbase)
+static void startup_64_load_idt(void)
{
- struct desc_ptr *desc = fixup_pointer(&bringup_idt_descr, physbase);
- gate_desc *idt = fixup_pointer(bringup_idt_table, physbase);
-
-
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
- void *handler;
+ gate_desc *idt = bringup_idt_table;
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
/* VMM Communication Exception */
- handler = fixup_pointer(vc_no_ghcb, physbase);
- set_bringup_idt_handler(idt, X86_TRAP_VC, handler);
- }
+ set_bringup_idt_handler(idt, X86_TRAP_VC, vc_no_ghcb);
- desc->address = (unsigned long)idt;
- native_load_idt(desc);
+ bringup_idt_descr.address = (unsigned long)idt;
+ native_load_idt(&bringup_idt_descr);
}
/* This is used when running on kernel addresses */
@@ -621,10 +590,10 @@ void early_setup_idt(void)
/*
* Setup boot CPU state needed before kernel switches to virtual addresses.
*/
-void __head startup_64_setup_env(unsigned long physbase)
+void __init startup_64_setup_env(void)
{
/* Load GDT */
- startup_gdt_descr.address = (unsigned long)fixup_pointer(startup_gdt, physbase);
+ startup_gdt_descr.address = (unsigned long)startup_gdt;
native_load_gdt(&startup_gdt_descr);
/* New GDT is live - reload data segment registers */
@@ -632,5 +601,5 @@ void __head startup_64_setup_env(unsigned long physbase)
"movl %%eax, %%ss\n"
"movl %%eax, %%es\n" : : "a"(__KERNEL_DS) : "memory");
- startup_64_load_idt(physbase);
+ startup_64_load_idt();
}
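To see the __va_symbol() technique in isolation, here is a minimal
user-space analogue (my sketch, not from the series). Compile the
object as PIE but link it non-PIE, so the absolute relocation resolves
at link time the way it does for vmlinux:

    /* demo.c -- build: gcc -O2 -fpie -c demo.c && gcc -no-pie -o demo demo.o */
    #include <stdio.h>

    #define __stringify_1(x)        #x
    #define __stringify(x)          __stringify_1(x)

    /* Force a 64-bit absolute immediate (R_X86_64_64): this yields the
     * symbol's link-time address even though every compiler-generated
     * reference in the object is RIP-relative. */
    #define va_symbol(sym) ({                                       \
            unsigned long __v;                                      \
            asm("movabsq $" __stringify(sym) ", %0" : "=r"(__v));   \
            __v; })

    int marker;     /* stand-in for _text, __start_bss_decrypted, ... */

    int main(void)
    {
            /* In early boot these two would differ by the load delta;
             * in this fixed-address userspace link they match. */
            printf("movabs (link-time) : %#lx\n", va_symbol(marker));
            printf("&marker (run-time) : %p\n", (void *)&marker);
            return 0;
    }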