[RFC,00/43] x86/pie: Make kernel image's virtual address flexible

Message ID cover.1682673542.git.houwenlong.hwl@antgroup.com
Series: x86/pie: Make kernel image's virtual address flexible

Hou Wenlong April 28, 2023, 9:50 a.m. UTC
  Purpose:

These patches make the changes necessary to build the kernel as a Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space, and this patchset also provides an
example that allows the kernel image to be relocated in the top 512G of the
address space.

The ultimate purpose of a PIE kernel is to increase the security of the
kernel and the flexibility of the kernel image's virtual address, which
could even be placed in the low half of the address space. The more
locations the kernel can fit in, the harder an attacker has to guess.

The patchset is based on Thomas Garnier's X86 PIE patchset v6[1] and
v11[2]. However, some design changes have been made and some bugs have been
fixed by testing with different configurations and compilers.

  Important changes:
  - For the fixmap area, move the vsyscall page out of the fixmap area and
    unify __FIXADDR_TOP for x86. The fixmap area can then be relocated
    together with the kernel image.

  - For the compile-time base address of the kernel image, keep it in the
    top 2G of the address space. Introduce a new variable to store the
    run-time base address and adapt the VA/PA transformations at runtime.

  - For the percpu section, keep it zero-based for SMP. Because the
    compile-time base address of the kernel image still resides in the top
    2G of the address space, RIP-relative references can still be used
    while the percpu section is zero-based. However, when relocating percpu
    variable references, percpu variables should be treated as normal
    variables and absolute references relocated accordingly. In addition,
    the relocation offset should be subtracted from the GS base to ensure
    correct operation.

  - For x86/boot/head64.c, don't build it with mcmodel=large. Instead, use
    data relocations to acquire global symbols' values and turn
    fixup_pointer() into a no-op when running in the identity mapping (see
    the sketch after this list). This is because not all global symbol
    references in the code go through fixup_pointer(), e.g. variables in
    macros related to 5-level paging, which GCC can optimize into relative
    references. If it were built with mcmodel=large, there would be more
    fixup_pointer() calls, resulting in uglier code. Actually, if it were
    built as PIE even when CONFIG_X86_PIE is disabled, all fixup_pointer()
    calls could be dropped. However, the stack protector would be broken if
    per-cpu stack canaries are not supported.
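
  A minimal sketch of the fixup_pointer() change described above (mainline's
  helper plus the intended no-op behavior; the CONFIG_X86_PIE condition and
  the exact shape are assumptions for illustration, not the actual patch):

#include <asm/sections.h>	/* for _text */

/*
 * Mainline's fixup_pointer() rebases a link-time kernel virtual address
 * onto the identity mapping.  When this file is built as PIE, address-of
 * references are RIP-relative and already yield the identity-mapped
 * address at this stage, so the helper can hand the pointer back
 * unchanged; absolute (link-time) addresses, where needed, are obtained
 * via data relocations instead.
 */
static inline void *fixup_pointer(void *ptr, unsigned long physaddr)
{
#ifdef CONFIG_X86_PIE			/* illustrative condition only */
	return ptr;			/* RIP-relative value is already correct */
#else
	return ptr - (void *)_text + (void *)physaddr;
#endif
}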

  Limitations:
  - Since I am not familiar with Xen, it has been disabled for now as it
    has not been adapted for PIE. The problem is that wrong pointers (with
    low address values) get assigned to x86_init_ops when running in the
    identity mapping. This issue could be resolved by building the page
    tables early and jumping to the high kernel address as soon as
    possible.

  - It is not allowed to reference global variables in an alternative
    section since RIP-relative addressing is not fixed up in
    apply_alternatives(). Fortunately, all disallowed relocations in the
    alternative section can be captured by objtool (a sketch of the problem
    follows this list). I believe that this issue could also be fixed by
    using objtool.

  - For module loading, only allow modules without a GOT to be loaded, for
    simplicity. Only weak global variable references use the GOT.
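
  A sketch of the alternatives limitation above (the referenced symbol is
  made up and the feature flag is only chosen for illustration; this is not
  code from the series):

#include <asm/alternative.h>
#include <asm/cpufeatures.h>

extern unsigned long some_global;	/* hypothetical symbol */

static __always_inline unsigned long load_some_global(void)
{
	unsigned long v;

	/*
	 * The replacement instruction encodes a RIP-relative displacement
	 * computed against its location in .altinstr_replacement.
	 * apply_alternatives() copies the bytes to the original site
	 * without re-adjusting that displacement, so after patching the
	 * load would read from the wrong address; objtool is used to
	 * reject such references.
	 */
	asm(ALTERNATIVE("movq $0, %0",
			"movq some_global(%%rip), %0",
			X86_FEATURE_ALWAYS)
	    : "=r" (v));
	return v;
}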

  Tests:
    I have only tested booting with GCC 5.1.0 (minimum version), GCC 12.2.0
    and Clang 15.0.7. I have also run the following tests for both the
    default configuration and the Ubuntu configuration.

Performance/Size impact (GCC 12.2.0):

Size of vmlinux (Default configuration):
 File size:
 - PIE disabled: +0.012%
 - PIE enabled: -2.219%
 instructions:
 - PIE disabled: same
 - PIE enabled: +1.383%
 .text section:
 - PIE disabled: same
 - PIE enabled: +0.589%

Size of vmlinux (Ubuntu configuration):
 File size:
 - PIE disabled: same
 - PIE enabled: +2.391%
 instructions:
 - PIE disabled: +0.013%
 - PIE enabled: +1.566%
 .text section:
 - PIE disabled: same
 - PIE enabled: +0.055%

The .text section size increase is due to the additional instructions
required for PIE code. Two reasons have been mentioned in previous mailing
list discussions. First, switch folding is disabled under PIE [3]. Second,
PIE needs two instructions where a single instruction with a sign-extended
absolute address suffices, such as when accessing an array element: with
mcmodel=kernel only one instruction is required, while PIE has to use lea
to get the base of the array first.
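
A hedged illustration of that difference (the symbol name is made up and
the exact instruction selection depends on the compiler):

extern unsigned long table[];	/* hypothetical global array */

unsigned long get_entry(unsigned long i)
{
	return table[i];
	/*
	 * mcmodel=kernel: one instruction, sign-extended absolute address:
	 *	movq	table(,%rdi,8), %rax
	 *
	 * PIE: lea to materialize the base first, then the indexed load:
	 *	leaq	table(%rip), %rax
	 *	movq	(%rax,%rdi,8), %rax
	 */
}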

Hackbench (50% and 1600% on thread/process for pipe/sockets):
 - PIE disabled: no significant change (avg -/+ 0.5% on default config).
 - PIE enabled: -2% to +2% on average (default config).

Kernbench (average of 10 Half and Optimal runs):
 Elapsed Time:
 - PIE disabled: no significant change (avg -0.2% on ubuntu config)
 - PIE enabled: average -0.2% to +0.2%
 System Time:
 - PIE disabled: no significant change (avg -0.5% on ubuntu config)
 - PIE enabled: average -0.5% to +0.5%

[1] https://lore.kernel.org/all/20190131192533.34130-1-thgarnie@chromium.org
[2] https://lore.kernel.org/all/20200228000105.165012-1-thgarnie@chromium.org
[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

Brian Gerst (1):
  x86-64: Use per-cpu stack canary if supported by compiler

Hou Wenlong (29):
  x86/irq: Adapt assembly for PIE support
  x86,rethook: Adapt assembly for PIE support
  x86/paravirt: Use relative reference for original instruction
  x86/Kconfig: Introduce new Kconfig for PIE kernel building
  x86/PVH: Use fixed_percpu_data to set up GS base
  x86/pie: Enable stack protector only if per-cpu stack canary is
    supported
  x86/percpu: Use PC-relative addressing for percpu variable references
  x86/tools: Explicitly include autoconf.h for hostprogs
  x86/percpu: Adapt percpu references relocation for PIE support
  x86/ftrace: Adapt assembly for PIE support
  x86/pie: Force hidden visibility for all symbol references
  x86/boot/compressed: Adapt sed command to generate voffset.h when PIE
    is enabled
  x86/pie: Add .data.rel.* sections into link script
  KVM: x86: Adapt assembly for PIE support
  x86/PVH: Adapt PVH booting for PIE support
  x86/bpf: Adapt BPF_CALL JIT codegen for PIE support
  x86/modules: Adapt module loading for PIE support
  x86/boot/64: Use data relocation to get absolute address when PIE is
    enabled
  objtool: Add validation for x86 PIE support
  objtool: Adapt indirect call of __fentry__() for PIE support
  x86/pie: Build the kernel as PIE
  x86/vsyscall: Don't use set_fixmap() to map vsyscall page
  x86/xen: Pin up to VSYSCALL_ADDR when vsyscall page is out of fixmap
    area
  x86/fixmap: Move vsyscall page out of fixmap area
  x86/fixmap: Unify FIXADDR_TOP
  x86/boot: Fill kernel image puds dynamically
  x86/mm: Sort address_markers array when X86 PIE is enabled
  x86/pie: Allow kernel image to be relocated in top 512G
  x86/boot: Extend relocate range for PIE kernel image

Thomas Garnier (13):
  x86/crypto: Adapt assembly for PIE support
  x86: Add macro to get symbol address for PIE support
  x86: relocate_kernel - Adapt assembly for PIE support
  x86/entry/64: Adapt assembly for PIE support
  x86: pm-trace: Adapt assembly for PIE support
  x86/CPU: Adapt assembly for PIE support
  x86/acpi: Adapt assembly for PIE support
  x86/boot/64: Adapt assembly for PIE support
  x86/power/64: Adapt assembly for PIE support
  x86/alternatives: Adapt assembly for PIE support
  x86/ftrace: Adapt ftrace nop patching for PIE support
  x86/mm: Make the x86 GOT read-only
  x86/relocs: Handle PIE relocations

 Documentation/x86/x86_64/mm.rst              |   4 +
 arch/x86/Kconfig                             |  36 +++++-
 arch/x86/Makefile                            |  33 +++--
 arch/x86/boot/compressed/Makefile            |   2 +-
 arch/x86/boot/compressed/kaslr.c             |  55 +++++++++
 arch/x86/boot/compressed/misc.c              |   4 +-
 arch/x86/boot/compressed/misc.h              |   9 ++
 arch/x86/crypto/aegis128-aesni-asm.S         |   6 +-
 arch/x86/crypto/aesni-intel_asm.S            |   2 +-
 arch/x86/crypto/aesni-intel_avx-x86_64.S     |   3 +-
 arch/x86/crypto/aria-aesni-avx-asm_64.S      |  30 ++---
 arch/x86/crypto/camellia-aesni-avx-asm_64.S  |  30 ++---
 arch/x86/crypto/camellia-aesni-avx2-asm_64.S |  30 ++---
 arch/x86/crypto/camellia-x86_64-asm_64.S     |   8 +-
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S    |  50 ++++----
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S    |  44 ++++---
 arch/x86/crypto/crc32c-pcl-intel-asm_64.S    |   3 +-
 arch/x86/crypto/des3_ede-asm_64.S            |  96 ++++++++++-----
 arch/x86/crypto/ghash-clmulni-intel_asm.S    |   4 +-
 arch/x86/crypto/sha256-avx2-asm.S            |  18 ++-
 arch/x86/entry/calling.h                     |  17 ++-
 arch/x86/entry/entry_64.S                    |  22 +++-
 arch/x86/entry/vdso/Makefile                 |   2 +-
 arch/x86/entry/vsyscall/vsyscall_64.c        |   7 +-
 arch/x86/include/asm/alternative.h           |   6 +-
 arch/x86/include/asm/asm.h                   |   1 +
 arch/x86/include/asm/fixmap.h                |  28 +----
 arch/x86/include/asm/irq_stack.h             |   2 +-
 arch/x86/include/asm/kmsan.h                 |   6 +-
 arch/x86/include/asm/nospec-branch.h         |  10 +-
 arch/x86/include/asm/page_64.h               |   8 +-
 arch/x86/include/asm/page_64_types.h         |   8 ++
 arch/x86/include/asm/paravirt.h              |  17 ++-
 arch/x86/include/asm/paravirt_types.h        |  12 +-
 arch/x86/include/asm/percpu.h                |  29 ++++-
 arch/x86/include/asm/pgtable_64_types.h      |  10 +-
 arch/x86/include/asm/pm-trace.h              |   2 +-
 arch/x86/include/asm/processor.h             |  17 ++-
 arch/x86/include/asm/sections.h              |   5 +
 arch/x86/include/asm/stackprotector.h        |  16 ++-
 arch/x86/include/asm/sync_core.h             |   6 +-
 arch/x86/include/asm/vsyscall.h              |  13 ++
 arch/x86/kernel/acpi/wakeup_64.S             |  31 ++---
 arch/x86/kernel/alternative.c                |   8 +-
 arch/x86/kernel/asm-offsets_64.c             |   2 +-
 arch/x86/kernel/callthunks.c                 |   2 +-
 arch/x86/kernel/cpu/common.c                 |  15 ++-
 arch/x86/kernel/ftrace.c                     |  46 ++++++-
 arch/x86/kernel/ftrace_64.S                  |   9 +-
 arch/x86/kernel/head64.c                     |  77 +++++++++---
 arch/x86/kernel/head_64.S                    |  68 ++++++++---
 arch/x86/kernel/kvm.c                        |  21 +++-
 arch/x86/kernel/module.c                     |  27 +++++
 arch/x86/kernel/paravirt.c                   |   4 +
 arch/x86/kernel/relocate_kernel_64.S         |   2 +-
 arch/x86/kernel/rethook.c                    |   8 ++
 arch/x86/kernel/setup.c                      |   6 +
 arch/x86/kernel/vmlinux.lds.S                |  10 +-
 arch/x86/kvm/svm/vmenter.S                   |  10 +-
 arch/x86/kvm/vmx/vmenter.S                   |   2 +-
 arch/x86/lib/cmpxchg16b_emu.S                |   8 +-
 arch/x86/mm/dump_pagetables.c                |  36 +++++-
 arch/x86/mm/fault.c                          |   1 -
 arch/x86/mm/init_64.c                        |  10 +-
 arch/x86/mm/ioremap.c                        |   5 +-
 arch/x86/mm/kasan_init_64.c                  |   4 +-
 arch/x86/mm/pat/set_memory.c                 |   2 +-
 arch/x86/mm/pgtable.c                        |  13 ++
 arch/x86/mm/pgtable_32.c                     |   3 -
 arch/x86/mm/physaddr.c                       |  14 +--
 arch/x86/net/bpf_jit_comp.c                  |  17 ++-
 arch/x86/platform/efi/efi_thunk_64.S         |   4 +
 arch/x86/platform/pvh/head.S                 |  29 ++++-
 arch/x86/power/hibernate_asm_64.S            |   4 +-
 arch/x86/tools/Makefile                      |   4 +-
 arch/x86/tools/relocs.c                      | 113 ++++++++++++++++-
 arch/x86/xen/mmu_pv.c                        |  32 +++--
 arch/x86/xen/xen-asm.S                       |  10 +-
 arch/x86/xen/xen-head.S                      |  14 ++-
 include/asm-generic/vmlinux.lds.h            |  12 ++
 scripts/Makefile.lib                         |   1 +
 scripts/recordmcount.c                       |  81 ++++++++-----
 tools/objtool/arch/x86/decode.c              |  10 +-
 tools/objtool/builtin-check.c                |   4 +-
 tools/objtool/check.c                        | 121 +++++++++++++++++++
 tools/objtool/include/objtool/builtin.h      |   1 +
 86 files changed, 1202 insertions(+), 410 deletions(-)


The patchset is based on tip/master.
base-commit: 01cbd032298654fe4c85e153dd9a224e5bc10194
--
2.31.1
  

Comments

Peter Zijlstra April 28, 2023, 3:22 p.m. UTC | #1
For some reason I didn't get 0/n but did get all of the others. Please
keep your Cc list consistent.

On Fri, Apr 28, 2023 at 05:50:40PM +0800, Hou Wenlong wrote:

>   - It is not allowed to reference global variables in an alternative
>     section since RIP-relative addressing is not fixed in
>     apply_alternatives(). Fortunately, all disallowed relocations in the
>     alternative section can be captured by objtool. I believe that this
>     issue can also be fixed by using objtool.

https://lkml.kernel.org/r/Y9py2a5Xw0xbB8ou@hirez.programming.kicks-ass.net
  
Ard Biesheuvel April 30, 2023, 2:23 p.m. UTC | #2
On Fri, 28 Apr 2023 at 11:55, Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> From: Thomas Garnier <thgarnie@chromium.org>
>
> The GOT is changed during early boot when relocations are applied. Make
> it read-only directly. This table exists only for PIE binaries. Since weak
> symbol references always go through the GOT, there are 8 entries in the
> GOT, but only the entry for __fentry__() is in use.  The other GOT
> references have been optimized away by the linker.
>
> [Hou Wenlong: Change commit message and skip GOT size check]
>
> Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>  arch/x86/kernel/vmlinux.lds.S     |  2 ++
>  include/asm-generic/vmlinux.lds.h | 12 ++++++++++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index f02dcde9f8a8..fa4c6582663f 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -462,6 +462,7 @@ SECTIONS
>  #endif
>                "Unexpected GOT/PLT entries detected!")
>
> +#ifndef CONFIG_X86_PIE
>         /*
>          * Sections that should stay zero sized, which is safer to
>          * explicitly check instead of blindly discarding.
> @@ -470,6 +471,7 @@ SECTIONS
>                 *(.got) *(.igot.*)
>         }
>         ASSERT(SIZEOF(.got) == 0, "Unexpected GOT entries detected!")
> +#endif
>
>         .plt : {
>                 *(.plt) *(.plt.*) *(.iplt)
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index d1f57e4868ed..438ed8b39896 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -441,6 +441,17 @@
>         __end_ro_after_init = .;
>  #endif
>
> +#ifdef CONFIG_X86_PIE
> +#define RO_GOT_X86

Please don't put X86 specific stuff in generic code.

> +       .got        : AT(ADDR(.got) - LOAD_OFFSET) {                    \
> +               __start_got = .;                                        \
> +               *(.got) *(.igot.*);                                     \
> +               __end_got = .;                                          \
> +       }
> +#else
> +#define RO_GOT_X86
> +#endif
> +

I don't think it makes sense for this definition to be conditional.
You can include it conditionally from the x86 code, but even that
seems unnecessary, given that it will be empty otherwise.

>  /*
>   * .kcfi_traps contains a list KCFI trap locations.
>   */
> @@ -486,6 +497,7 @@
>                 BOUNDED_SECTION_PRE_LABEL(.pci_fixup_suspend_late, _pci_fixups_suspend_late, __start, __end) \
>         }                                                               \
>                                                                         \
> +       RO_GOT_X86                                                      \
>         FW_LOADER_BUILT_IN_DATA                                         \
>         TRACEDATA                                                       \
>                                                                         \
> --
> 2.31.1
>
  
Nick Desaulniers May 1, 2023, 5:27 p.m. UTC | #3
On Fri, Apr 28, 2023 at 2:52 AM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> From: Brian Gerst <brgerst@gmail.com>
>
> If the compiler supports it, use a standard per-cpu variable for the
> stack protector instead of the old fixed location.  Keep the fixed
> location code for compatibility with older compilers.
>
> [Hou Wenlong: Disable it on Clang, adapt new code change and adapt
> missing GS set up path in pvh_start_xen()]
>
> Signed-off-by: Brian Gerst <brgerst@gmail.com>
> Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Thomas Garnier <thgarnie@chromium.org>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>  arch/x86/Kconfig                      | 12 ++++++++++++
>  arch/x86/Makefile                     | 21 ++++++++++++++-------
>  arch/x86/entry/entry_64.S             |  6 +++++-
>  arch/x86/include/asm/processor.h      | 17 ++++++++++++-----
>  arch/x86/include/asm/stackprotector.h | 16 +++++++---------
>  arch/x86/kernel/asm-offsets_64.c      |  2 +-
>  arch/x86/kernel/cpu/common.c          | 15 +++++++--------
>  arch/x86/kernel/head_64.S             | 16 ++++++++++------
>  arch/x86/kernel/vmlinux.lds.S         |  4 +++-
>  arch/x86/platform/pvh/head.S          |  8 ++++++++
>  arch/x86/xen/xen-head.S               | 14 +++++++++-----
>  11 files changed, 88 insertions(+), 43 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 68e5da464b96..55cce8cdf9bd 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -410,6 +410,18 @@ config CC_HAS_SANE_STACKPROTECTOR
>           the compiler produces broken code or if it does not let us control
>           the segment on 32-bit kernels.
>
> +config CC_HAS_CUSTOMIZED_STACKPROTECTOR
> +       bool
> +       # Although clang supports -mstack-protector-guard-reg option, it
> +       # would generate GOT reference for __stack_chk_guard even with
> +       # -fno-PIE flag.
> +       default y if (!CC_IS_CLANG && $(cc-option,-mstack-protector-guard-reg=gs))

Hi Hou,
I've filed this bug against LLVM and will work with LLVM folks at
Intel to resolve:
https://github.com/llvm/llvm-project/issues/62481
Can you please review that report and let me know here or there if I
missed anything? Would you also mind including a link to that in the
comments in the next version of this patch?

Less relevant issues I filed looking at some related codegen:
https://github.com/llvm/llvm-project/issues/62482
https://github.com/llvm/llvm-project/issues/62480

And we should probably look into:
https://github.com/llvm/llvm-project/issues/22476
  
Juergen Gross May 4, 2023, 10:31 a.m. UTC | #4
On 28.04.23 11:50, Hou Wenlong wrote:
> From: Brian Gerst <brgerst@gmail.com>
> 
> If the compiler supports it, use a standard per-cpu variable for the
> stack protector instead of the old fixed location.  Keep the fixed
> location code for compatibility with older compilers.
> 
> [Hou Wenlong: Disable it on Clang, adapt new code change and adapt
> missing GS set up path in pvh_start_xen()]
> 
> Signed-off-by: Brian Gerst <brgerst@gmail.com>
> Co-developed-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
> Cc: Thomas Garnier <thgarnie@chromium.org>
> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> Cc: Kees Cook <keescook@chromium.org>
> ---
>   arch/x86/Kconfig                      | 12 ++++++++++++
>   arch/x86/Makefile                     | 21 ++++++++++++++-------
>   arch/x86/entry/entry_64.S             |  6 +++++-
>   arch/x86/include/asm/processor.h      | 17 ++++++++++++-----
>   arch/x86/include/asm/stackprotector.h | 16 +++++++---------
>   arch/x86/kernel/asm-offsets_64.c      |  2 +-
>   arch/x86/kernel/cpu/common.c          | 15 +++++++--------
>   arch/x86/kernel/head_64.S             | 16 ++++++++++------
>   arch/x86/kernel/vmlinux.lds.S         |  4 +++-
>   arch/x86/platform/pvh/head.S          |  8 ++++++++
>   arch/x86/xen/xen-head.S               | 14 +++++++++-----
>   11 files changed, 88 insertions(+), 43 deletions(-)
> 

...

> diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
> index 643d02900fbb..09eaf59e8066 100644
> --- a/arch/x86/xen/xen-head.S
> +++ b/arch/x86/xen/xen-head.S
> @@ -51,15 +51,19 @@ SYM_CODE_START(startup_xen)
>   
>   	leaq	(__end_init_task - PTREGS_SIZE)(%rip), %rsp
>   
> -	/* Set up %gs.
> -	 *
> -	 * The base of %gs always points to fixed_percpu_data.  If the
> -	 * stack protector canary is enabled, it is located at %gs:40.
> +	/*
> +	 * Set up GS base.
>   	 * Note that, on SMP, the boot cpu uses init data section until
>   	 * the per cpu areas are set up.
>   	 */
>   	movl	$MSR_GS_BASE,%ecx
> -	movq	$INIT_PER_CPU_VAR(fixed_percpu_data),%rax
> +#if defined(CONFIG_STACKPROTECTOR_FIXED)
> +	leaq	INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
> +#elif defined(CONFIG_SMP)
> +	movabs	$__per_cpu_load, %rdx

Shouldn't the above 2 targets be %rax?

> +#else
> +	xorl	%eax, %eax
> +#endif
>   	cdq
>   	wrmsr
>   


Juergen
  
Nick Desaulniers May 5, 2023, 6:02 p.m. UTC | #5
On Thu, May 4, 2023 at 11:14 PM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
>
> On Tue, May 02, 2023 at 01:27:53AM +0800, Nick Desaulniers wrote:
> > On Fri, Apr 28, 2023 at 2:52 AM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> > >
> > > +config CC_HAS_CUSTOMIZED_STACKPROTECTOR
> > > +       bool
> > > +       # Although clang supports -mstack-protector-guard-reg option, it
> > > +       # would generate GOT reference for __stack_chk_guard even with
> > > +       # -fno-PIE flag.
> > > +       default y if (!CC_IS_CLANG && $(cc-option,-mstack-protector-guard-reg=gs))
> >
> > Hi Hou,
> > I've filed this bug against LLVM and will work with LLVM folks at
> > Intel to resolve:
> > https://github.com/llvm/llvm-project/issues/62481
> > Can you please review that report and let me know here or there if I
> > missed anything? Would you also mind including a link to that in the
> > comments in the next version of this patch?
> >
> Hi Nick,
>
> Thanks for your help, I'll include the link in the next version.
> Actually, I had posted an issue on GitHub too when I tested the patch on
> LLVM, but got no replies. :(

Ah, sorry about that.  The issue tracker is pretty high volume and
stuff gets missed.  With many users comes many bug reports.  We could
be better about triage though.  If it's specific to the Linux kernel,
https://github.com/ClangBuiltLinux/linux/issues is a better issue
tracker to use; we can move bug reports upstream to
https://github.com/llvm/llvm-project/issues/ when necessary. It's
linked off of clangbuiltlinux.github.io if you lose it.

> https://github.com/llvm/llvm-project/issues/60116
>
> There is another problem I hit with this patch; some unexpected code
> is generated:
>
> do_one_initcall: (init/main.o)
> ......
> movq    __stack_chk_guard(%rip), %rax
> movq    %rax,0x2b0(%rsp)
>
> The compiler generates the wrong instruction: no GOT reference and no gs
> register. I only see it in the init/main.c file. I have tried moving the
> function into another file and compiling it with the same cflags, and it
> generates the right instruction for the function there.

The wrong instruction or the wrong operand?  This is loading the
canary into the stack slot in the fn prolog.  Perhaps the expected
cflag is not getting set (or being removed) from init/main.c? You
should be able to do:

$ make LLVM=1 init/main.o V=1

to see how clang was invoked to see if the expected flag was there, or not.

>
> The LLVM toolchain was built by myself:
> clang version 15.0.7 (https://github.com/llvm/llvm-project.git
> 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)

Perhaps worth rebuilding with top of tree, which is clang 17.

>
> > Less relevant issues I filed looking at some related codegen:
> > https://github.com/llvm/llvm-project/issues/62482
> > https://github.com/llvm/llvm-project/issues/62480
> >
> > And we should probably look into:
> > https://github.com/llvm/llvm-project/issues/22476
> >
> >
>
> Apart from the per-cpu stack canary patch, there is another issue I posted
> on GitHub: https://github.com/llvm/llvm-project/issues/60096

Thanks, I'll bring that up with Intel, too.

>
> The related patch is:
> https://lore.kernel.org/lkml/175116f75c38c15d8d73a03301eab805fea13a0a.1682673543.git.houwenlong.hwl@antgroup.com/
>
> I couldn't find any related documentation about that; I hope you can help
> me with it too.
>
> One more problem that I didn't post is:
> https://lore.kernel.org/lkml/8d6bbaf66b90cf1a8fd2c5da98f5e094b9ffcb27.1682673543.git.houwenlong.hwl@antgroup.com/

Mind filing another bug for this in llvm's issue tracker? We can
discuss there if LLD needs to be doing something different.

Thanks for uncovering these and helping us get them fixed up!
  
Fangrui Song May 5, 2023, 7:06 p.m. UTC | #6
On Fri, May 5, 2023 at 11:02 AM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Thu, May 4, 2023 at 11:14 PM Hou Wenlong <houwenlong.hwl@antgroup.com> wrote:
> [...]
> > > > +config CC_HAS_CUSTOMIZED_STACKPROTECTOR
> > > > +       bool
> > > > +       # Although clang supports -mstack-protector-guard-reg option, it
> > > > +       # would generate GOT reference for __stack_chk_guard even with
> > > > +       # -fno-PIE flag.
> > > > +       default y if (!CC_IS_CLANG && $(cc-option,-mstack-protector-guard-reg=gs))
> [...]

In my opinion, Clang's behavior is working as intended. The Linux
kernel should support R_X86_64_REX_GOTPCRELX, and the solution is
straightforward: treat R_X86_64_REX_GOTPCRELX the same way as
R_X86_64_PC32 (-shared -Bsymbolic), assuming that every symbol is
defined, which means that every symbol is non-preemptible.

Clang's `-fno-pic` option chooses `R_X86_64_REX_GOTPCRELX` which is
correct, although it differs from GCC's `-fno-pic` option.

The compiler doesn't know whether `__stack_chk_guard` will be provided
by the main executable (`libc.a`) or a shared object (`libc.so`,
available on some ports of glibc but not x86, on musl this is
available for all ports).
(Also see `__stack_chk_guard` on
https://maskray.me/blog/2022-12-18-control-flow-integrity)

If an `R_X86_64_32` relocation were used and `__stack_chk_guard` were
defined by a shared object, we would need an ELF hack called [copy
relocation](https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected).

The instruction movq __stack_chk_guard@GOTPCREL(%rip), %rbx produces
an R_X86_64_REX_GOTPCRELX relocation.
If `__stack_chk_guard` is non-preemptible, linkers can [optimize the
access to be direct](https://maskray.me/blog/2021-08-29-all-about-global-offset-table#got-optimization).

Although we could technically use the
`-fno-direct-access-external-data` option to switch between
`R_X86_64_REX_GOTPCRELX` and `R_X86_64_32`, I think there is no
justification to complicate the compiler.
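
A hedged sketch of the relaxation being described (the function is made up
and the exact codegen depends on the compiler and linker versions):

extern unsigned long __stack_chk_guard;

unsigned long read_guard(void)
{
	return __stack_chk_guard;
	/*
	 * Under -fpie (and, per the above, for the stack-protector guard
	 * even under clang -fno-pic), the access goes through the GOT and
	 * carries an R_X86_64_REX_GOTPCRELX relocation:
	 *	movq	__stack_chk_guard@GOTPCREL(%rip), %rax
	 *	movq	(%rax), %rax
	 * A linker that knows the symbol is defined in the image (i.e.
	 * non-preemptible) can relax the GOT load into a direct
	 * PC-relative form, which is the treatment suggested above:
	 *	leaq	__stack_chk_guard(%rip), %rax
	 *	movq	(%rax), %rax
	 */
}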
  
Hou Wenlong May 6, 2023, 7:19 a.m. UTC | #7
On Fri, Apr 28, 2023 at 11:22:06PM +0800, Peter Zijlstra wrote:
> 
> For some reason I didn't get 0/n but did get all of the others. Please
> keep your Cc list consistent.
>
Sorry, I'll pay attention next time.
 
> On Fri, Apr 28, 2023 at 05:50:40PM +0800, Hou Wenlong wrote:
> 
> >   - It is not allowed to reference global variables in an alternative
> >     section since RIP-relative addressing is not fixed in
> >     apply_alternatives(). Fortunately, all disallowed relocations in the
> >     alternative section can be captured by objtool. I believe that this
> >     issue can also be fixed by using objtool.
> 
> https://lkml.kernel.org/r/Y9py2a5Xw0xbB8ou@hirez.programming.kicks-ass.net

Thank you for your patch. However, it's more complicated for the call depth
tracking case. Although the per-cpu variable in the alternative section is
relocated, the content of "skl_call_thunk_template" is copied to another
place, so the offset is still incorrect. I'm not sure whether this case is
common or not. Since the destination is known, I could do the relocation
there as well, but it would make the code more complicated.

Thanks!