arm64: jump_label: use constraint "S" instead of "i"

Message ID 20240131065322.1126831-1-maskray@google.com
State New
Headers
Series arm64: jump_label: use constraint "S" instead of "i" |

Commit Message

Fangrui Song Jan. 31, 2024, 6:53 a.m. UTC
  The constraint "i" seems to be copied from x86 (and with a redundant
modifier "c"). It works with -fno-PIE but not with -fPIE/-fPIC in GCC's
aarch64 port.

The constraint "S", which denotes a symbol reference (e.g. function,
global variable) or label reference, is more appropriate, and has been
available in GCC since 2012 and in Clang since 7.0.

Signed-off-by: Fangrui Song <maskray@google.com>
Link: https://maskray.me/blog/2024-01-30-raw-symbol-names-in-inline-assembly
---
 arch/arm64/include/asm/jump_label.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
  

Comments

Ard Biesheuvel Jan. 31, 2024, 7:16 a.m. UTC | #1
Hello Fangrui,

On Wed, 31 Jan 2024 at 07:53, Fangrui Song <maskray@google.com> wrote:
>
> The constraint "i" seems to be copied from x86 (and with a redundant
> modifier "c"). It works with -fno-PIE but not with -fPIE/-fPIC in GCC's
> aarch64 port.
>
> The constraint "S", which denotes a symbol reference (e.g. function,
> global variable) or label reference, is more appropriate, and has been
> available in GCC since 2012 and in Clang since 7.0.
>
> Signed-off-by: Fangrui Song <maskray@google.com>
> Link: https://maskray.me/blog/2024-01-30-raw-symbol-names-in-inline-assembly
> ---
>  arch/arm64/include/asm/jump_label.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> index 48ddc0f45d22..31862b3bb33d 100644
> --- a/arch/arm64/include/asm/jump_label.h
> +++ b/arch/arm64/include/asm/jump_label.h
> @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
>                  "      .pushsection    __jump_table, \"aw\"    \n\t"
>                  "      .align          3                       \n\t"
>                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> -                "      .quad           %c0 - .                 \n\t"
> +                "      .quad           %0 - .                  \n\t"
>                  "      .popsection                             \n\t"
> -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> +                :  :  "S"(&((char *)key)[branch]) :  : l_yes);

'key' is not used as a raw symbol name. We should make this

"    .quad   %0 + %1 - ."

and

::  "S"(key), "i"(branch) :: l_yes);

if we want to really clean this up.



>
>         return false;
>  l_yes:
> @@ -40,9 +40,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key * const ke
>                  "      .pushsection    __jump_table, \"aw\"    \n\t"
>                  "      .align          3                       \n\t"
>                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> -                "      .quad           %c0 - .                 \n\t"
> +                "      .quad           %0 - .                  \n\t"
>                  "      .popsection                             \n\t"
> -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> +                :  :  "S"(&((char *)key)[branch]) :  : l_yes);
>
>         return false;
>  l_yes:
> --
> 2.43.0.429.g432eaa2c6b-goog
>
  
Dave Martin Jan. 31, 2024, 1:01 p.m. UTC | #2
On Wed, Jan 31, 2024 at 08:16:04AM +0100, Ard Biesheuvel wrote:
> Hello Fangrui,
> 
> On Wed, 31 Jan 2024 at 07:53, Fangrui Song <maskray@google.com> wrote:
> >
> > The constraint "i" seems to be copied from x86 (and with a redundant
> > modifier "c"). It works with -fno-PIE but not with -fPIE/-fPIC in GCC's
> > aarch64 port.

(I'm not sure of the exact history, but the "c" may be inherited from
arm, where an output modifier was needed to suppress the "#" that
prefixes immediates in the traditional asm syntax.  This does not
actually seem to be required for AArch64: rather while a # is allowed
and still considered good style in handwritten asm code, the syntax
doesn't require it, and the compiler doesn't emit it for "i" arguments,
AFAICT.)

> > The constraint "S", which denotes a symbol reference (e.g. function,
> > global variable) or label reference, is more appropriate, and has been
> > available in GCC since 2012 and in Clang since 7.0.
> >
> > Signed-off-by: Fangrui Song <maskray@google.com>
> > Link: https://maskray.me/blog/2024-01-30-raw-symbol-names-in-inline-assembly
> > ---
> >  arch/arm64/include/asm/jump_label.h | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> > index 48ddc0f45d22..31862b3bb33d 100644
> > --- a/arch/arm64/include/asm/jump_label.h
> > +++ b/arch/arm64/include/asm/jump_label.h
> > @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
> >                  "      .pushsection    __jump_table, \"aw\"    \n\t"
> >                  "      .align          3                       \n\t"
> >                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> > -                "      .quad           %c0 - .                 \n\t"
> > +                "      .quad           %0 - .                  \n\t"
> >                  "      .popsection                             \n\t"
> > -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > +                :  :  "S"(&((char *)key)[branch]) :  : l_yes);
> 
> 'key' is not used as a raw symbol name. We should make this
> 
> "    .quad   %0 + %1 - ."
> 
> and
> 
> ::  "S"(key), "i"(branch) :: l_yes);
>
> if we want to really clean this up.

This hides more logic in the asm so it's arguably more cryptic
(although the code is fairly cryptic to begin with -- I don't really
see why the argument wasn't written as the equivalent
(char *)key + branch...)

Anyway, I don't think the "i" versys "S" distinction makes any
difference without -fpic or equivalent, so it is not really relevant
for the kernel (except that "S" breaks compatibility with older
compilers...)


I think the main advantage of "S" is that it stops you accidentally
emitting undesirable relocations from asm code that is not written for
the -fpic case.

But just changing "i" to "S" is not sufficient to port asms to -fpic:
the asms still need to be reviewed.


So unless the asm has been reviewed for position-independence, it may
anyway be better to stick with "i" so that the compiler actually chokes
if someone tries to build the code with -fpic.

Since we are not trying to run arbitraily many running kernels in a
common address space (and not likely to do that), I'm not sure that we
would ever build the kernel with -fpic except for a few special-case
bits like the EFI stub and vDSO... unless I've missed something?

If there's another reason why "S" is advantageous though, I'm happy to
be corrected.

[...]

Cheers
---Dave
  
Fangrui Song Feb. 1, 2024, 4:55 a.m. UTC | #3
On 2024-01-31, Dave Martin wrote:
>On Wed, Jan 31, 2024 at 08:16:04AM +0100, Ard Biesheuvel wrote:
>> Hello Fangrui,
>>
>> On Wed, 31 Jan 2024 at 07:53, Fangrui Song <maskray@google.com> wrote:
>> >
>> > The constraint "i" seems to be copied from x86 (and with a redundant
>> > modifier "c"). It works with -fno-PIE but not with -fPIE/-fPIC in GCC's
>> > aarch64 port.
>
>(I'm not sure of the exact history, but the "c" may be inherited from
>arm, where an output modifier was needed to suppress the "#" that
>prefixes immediates in the traditional asm syntax.  This does not
>actually seem to be required for AArch64: rather while a # is allowed
>and still considered good style in handwritten asm code, the syntax
>doesn't require it, and the compiler doesn't emit it for "i" arguments,
>AFAICT.)

The aarch64 one could be inherited from
arch/arm/include/asm/jump_label.h (2012), which could in turn be
inherited from x86 (2010).
Both the constraint "i" and the modifier "c" are generic..
For -fno-pic this combination can be used for every arch.

>> > The constraint "S", which denotes a symbol reference (e.g. function,
>> > global variable) or label reference, is more appropriate, and has been
>> > available in GCC since 2012 and in Clang since 7.0.
>> >
>> > Signed-off-by: Fangrui Song <maskray@google.com>
>> > Link: https://maskray.me/blog/2024-01-30-raw-symbol-names-in-inline-assembly
>> > ---
>> >  arch/arm64/include/asm/jump_label.h | 8 ++++----
>> >  1 file changed, 4 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
>> > index 48ddc0f45d22..31862b3bb33d 100644
>> > --- a/arch/arm64/include/asm/jump_label.h
>> > +++ b/arch/arm64/include/asm/jump_label.h
>> > @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
>> >                  "      .pushsection    __jump_table, \"aw\"    \n\t"
>> >                  "      .align          3                       \n\t"
>> >                  "      .long           1b - ., %l[l_yes] - .   \n\t"
>> > -                "      .quad           %c0 - .                 \n\t"
>> > +                "      .quad           %0 - .                  \n\t"
>> >                  "      .popsection                             \n\t"
>> > -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
>> > +                :  :  "S"(&((char *)key)[branch]) :  : l_yes);
>>
>> 'key' is not used as a raw symbol name. We should make this
>>
>> "    .quad   %0 + %1 - ."
>>
>> and
>>
>> ::  "S"(key), "i"(branch) :: l_yes);
>>
>> if we want to really clean this up.
>
>This hides more logic in the asm so it's arguably more cryptic
>(although the code is fairly cryptic to begin with -- I don't really
>see why the argument wasn't written as the equivalent
>(char *)key + branch...)

I agree that using "S" and "i" would introduce complexity.
Using just "S" as this patch does should be clear.

All of "i" "s" "S" support a symbol or label reference and a constant offset (can be zero),
(in object file, a symbol and an addend; in GCC's term, the sum of a SYMBOL_REF and a CONST_INT).

>Anyway, I don't think the "i" versys "S" distinction makes any
>difference without -fpic or equivalent, so it is not really relevant
>for the kernel (except that "S" breaks compatibility with older
>compilers...)
>
>
>I think the main advantage of "S" is that it stops you accidentally
>emitting undesirable relocations from asm code that is not written for
>the -fpic case.
>
>But just changing "i" to "S" is not sufficient to port asms to -fpic:
>the asms still need to be reviewed.
>
>
>So unless the asm has been reviewed for position-independence, it may
>anyway be better to stick with "i" so that the compiler actually chokes
>if someone tries to build the code with -fpic.

The asm is position-independent.
This `.long sym - .` is a common metadata section pattern to support PIC:)

Regarding the constraints, I've updated
https://maskray.me/blog/2024-01-30-raw-symbol-names-in-inline-assembly
to include more details.

>Since we are not trying to run arbitraily many running kernels in a
>common address space (and not likely to do that), I'm not sure that we
>would ever build the kernel with -fpic except for a few special-case
>bits like the EFI stub and vDSO... unless I've missed something?
>
>If there's another reason why "S" is advantageous though, I'm happy to
>be corrected.

I remember that Ard has an RFC
https://lore.kernel.org/linux-arm-kernel/20220427171241.2426592-1-ardb@kernel.org/
"[RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel"
and see some recent PIE codegen patches.

> Building the KASLR kernel without -fpie but linking it with -pie works
> in practice, but it is not something that is explicitly supported by the
> toolchains - it happens to work because the default 'small' code model
> used by both GCC and Clang relies mostly on ADRP+ADD/LDR to generate
> symbol references.

I agree that current -fno-PIE with -shared -Bsymbolic linking is a hack
that works as a conincidence, not guaranteed by the toolchain.
This jump_label improvement (with no object file difference) fixes an
obstacle.
  
Ard Biesheuvel Feb. 1, 2024, 8:08 a.m. UTC | #4
On Thu, 1 Feb 2024 at 05:55, Fangrui Song <maskray@google.com> wrote:
>
> On 2024-01-31, Dave Martin wrote:
> >On Wed, Jan 31, 2024 at 08:16:04AM +0100, Ard Biesheuvel wrote:
> >> Hello Fangrui,
> >>
> >> On Wed, 31 Jan 2024 at 07:53, Fangrui Song <maskray@google.com> wrote:
> >> >
> >> > The constraint "i" seems to be copied from x86 (and with a redundant
> >> > modifier "c"). It works with -fno-PIE but not with -fPIE/-fPIC in GCC's
> >> > aarch64 port.
> >
> >(I'm not sure of the exact history, but the "c" may be inherited from
> >arm, where an output modifier was needed to suppress the "#" that
> >prefixes immediates in the traditional asm syntax.  This does not
> >actually seem to be required for AArch64: rather while a # is allowed
> >and still considered good style in handwritten asm code, the syntax
> >doesn't require it, and the compiler doesn't emit it for "i" arguments,
> >AFAICT.)
>
> The aarch64 one could be inherited from
> arch/arm/include/asm/jump_label.h (2012), which could in turn be
> inherited from x86 (2010).
> Both the constraint "i" and the modifier "c" are generic..
> For -fno-pic this combination can be used for every arch.
>
> >> > The constraint "S", which denotes a symbol reference (e.g. function,
> >> > global variable) or label reference, is more appropriate, and has been
> >> > available in GCC since 2012 and in Clang since 7.0.
> >> >
> >> > Signed-off-by: Fangrui Song <maskray@google.com>
> >> > Link: https://maskray.me/blog/2024-01-30-raw-symbol-names-in-inline-assembly
> >> > ---
> >> >  arch/arm64/include/asm/jump_label.h | 8 ++++----
> >> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >> >
> >> > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> >> > index 48ddc0f45d22..31862b3bb33d 100644
> >> > --- a/arch/arm64/include/asm/jump_label.h
> >> > +++ b/arch/arm64/include/asm/jump_label.h
> >> > @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
> >> >                  "      .pushsection    __jump_table, \"aw\"    \n\t"
> >> >                  "      .align          3                       \n\t"
> >> >                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> >> > -                "      .quad           %c0 - .                 \n\t"
> >> > +                "      .quad           %0 - .                  \n\t"
> >> >                  "      .popsection                             \n\t"
> >> > -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> >> > +                :  :  "S"(&((char *)key)[branch]) :  : l_yes);
> >>
> >> 'key' is not used as a raw symbol name. We should make this
> >>
> >> "    .quad   %0 + %1 - ."
> >>
> >> and
> >>
> >> ::  "S"(key), "i"(branch) :: l_yes);
> >>
> >> if we want to really clean this up.
> >
> >This hides more logic in the asm so it's arguably more cryptic
> >(although the code is fairly cryptic to begin with -- I don't really
> >see why the argument wasn't written as the equivalent
> >(char *)key + branch...)
>
> I agree that using "S" and "i" would introduce complexity.
> Using just "S" as this patch does should be clear.
>
> All of "i" "s" "S" support a symbol or label reference and a constant offset (can be zero),
> (in object file, a symbol and an addend; in GCC's term, the sum of a SYMBOL_REF and a CONST_INT).
>

Taken the address of a struct, cast it to char[] and then index it
using a boolean is rather disgusting, no?

> >Anyway, I don't think the "i" versys "S" distinction makes any
> >difference without -fpic or equivalent, so it is not really relevant
> >for the kernel (except that "S" breaks compatibility with older
> >compilers...)
> >
> >
> >I think the main advantage of "S" is that it stops you accidentally
> >emitting undesirable relocations from asm code that is not written for
> >the -fpic case.
> >
> >But just changing "i" to "S" is not sufficient to port asms to -fpic:
> >the asms still need to be reviewed.
> >
> >
> >So unless the asm has been reviewed for position-independence, it may
> >anyway be better to stick with "i" so that the compiler actually chokes
> >if someone tries to build the code with -fpic.
>

The virtual address of the kernel is randomized by KASLR, which relies
on PIE linking, and this puts constraints on the permitted types of
relocations.

IOW, we basically already build the kernel as PIC code, but without
relying on -fPIC, because that triggers some behaviors that only make
sense for shared objects in user space.

> >Since we are not trying to run arbitraily many running kernels in a
> >common address space (and not likely to do that), I'm not sure that we
> >would ever build the kernel with -fpic except for a few special-case
> >bits like the EFI stub and vDSO... unless I've missed something?
> >

Yes, KASLR. The number of kernels is not the point, the point is that
the virtual load address of the kernel is usually decided at boot, and
so the code needs to be generated to accommodate that.

> >If there's another reason why "S" is advantageous though, I'm happy to
> >be corrected.
>
> I remember that Ard has an RFC
> https://lore.kernel.org/linux-arm-kernel/20220427171241.2426592-1-ardb@kernel.org/
> "[RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel"
> and see some recent PIE codegen patches.
>
> > Building the KASLR kernel without -fpie but linking it with -pie works
> > in practice, but it is not something that is explicitly supported by the
> > toolchains - it happens to work because the default 'small' code model
> > used by both GCC and Clang relies mostly on ADRP+ADD/LDR to generate
> > symbol references.
>
> I agree that current -fno-PIE with -shared -Bsymbolic linking is a hack
> that works as a conincidence, not guaranteed by the toolchain.
> This jump_label improvement (with no object file difference) fixes an
> obstacle.

If we can get the guaranteed behavior of #pragma GCC visibility
push(hidden) from a command line option, we should build the core
kernel with -fpie instead. (Modules are partially linked objects, so
they can be built non-PIC as before)
  
Fangrui Song Feb. 1, 2024, 9:11 a.m. UTC | #5
On 2024-02-01, Ard Biesheuvel wrote:
>On Thu, 1 Feb 2024 at 05:55, Fangrui Song <maskray@google.com> wrote:
>>
>> On 2024-01-31, Dave Martin wrote:
>> >On Wed, Jan 31, 2024 at 08:16:04AM +0100, Ard Biesheuvel wrote:
>> >> Hello Fangrui,
>> >>
>> >> On Wed, 31 Jan 2024 at 07:53, Fangrui Song <maskray@google.com> wrote:
>> >> >
>> >> > The constraint "i" seems to be copied from x86 (and with a redundant
>> >> > modifier "c"). It works with -fno-PIE but not with -fPIE/-fPIC in GCC's
>> >> > aarch64 port.
>> >
>> >(I'm not sure of the exact history, but the "c" may be inherited from
>> >arm, where an output modifier was needed to suppress the "#" that
>> >prefixes immediates in the traditional asm syntax.  This does not
>> >actually seem to be required for AArch64: rather while a # is allowed
>> >and still considered good style in handwritten asm code, the syntax
>> >doesn't require it, and the compiler doesn't emit it for "i" arguments,
>> >AFAICT.)
>>
>> The aarch64 one could be inherited from
>> arch/arm/include/asm/jump_label.h (2012), which could in turn be
>> inherited from x86 (2010).
>> Both the constraint "i" and the modifier "c" are generic..
>> For -fno-pic this combination can be used for every arch.
>>
>> >> > The constraint "S", which denotes a symbol reference (e.g. function,
>> >> > global variable) or label reference, is more appropriate, and has been
>> >> > available in GCC since 2012 and in Clang since 7.0.
>> >> >
>> >> > Signed-off-by: Fangrui Song <maskray@google.com>
>> >> > Link: https://maskray.me/blog/2024-01-30-raw-symbol-names-in-inline-assembly
>> >> > ---
>> >> >  arch/arm64/include/asm/jump_label.h | 8 ++++----
>> >> >  1 file changed, 4 insertions(+), 4 deletions(-)
>> >> >
>> >> > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
>> >> > index 48ddc0f45d22..31862b3bb33d 100644
>> >> > --- a/arch/arm64/include/asm/jump_label.h
>> >> > +++ b/arch/arm64/include/asm/jump_label.h
>> >> > @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
>> >> >                  "      .pushsection    __jump_table, \"aw\"    \n\t"
>> >> >                  "      .align          3                       \n\t"
>> >> >                  "      .long           1b - ., %l[l_yes] - .   \n\t"
>> >> > -                "      .quad           %c0 - .                 \n\t"
>> >> > +                "      .quad           %0 - .                  \n\t"
>> >> >                  "      .popsection                             \n\t"
>> >> > -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
>> >> > +                :  :  "S"(&((char *)key)[branch]) :  : l_yes);
>> >>
>> >> 'key' is not used as a raw symbol name. We should make this
>> >>
>> >> "    .quad   %0 + %1 - ."
>> >>
>> >> and
>> >>
>> >> ::  "S"(key), "i"(branch) :: l_yes);
>> >>
>> >> if we want to really clean this up.
>> >
>> >This hides more logic in the asm so it's arguably more cryptic
>> >(although the code is fairly cryptic to begin with -- I don't really
>> >see why the argument wasn't written as the equivalent
>> >(char *)key + branch...)
>>
>> I agree that using "S" and "i" would introduce complexity.
>> Using just "S" as this patch does should be clear.
>>
>> All of "i" "s" "S" support a symbol or label reference and a constant offset (can be zero),
>> (in object file, a symbol and an addend; in GCC's term, the sum of a SYMBOL_REF and a CONST_INT).
>>
>
>Taken the address of a struct, cast it to char[] and then index it
>using a boolean is rather disgusting, no?

I agree with you.

Hmm. Clang's constraint "S" implementation doesn't support a constant
offset, so
`static_key_false(&nf_hooks_needed[pf][hook])` in include/linux/netfilter.h:nf_hook
would not compile with Clang <= 18.

I have a patch https://github.com/llvm/llvm-project/pull/80255 , but
even if it is accepted and cherry-picked into the 18.x release branch,
if we still support older Clang, we cannot use "S" unconditionally.


So we probably need the following to prepare for -fPIE support in the
future:

diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
index 48ddc0f45d22..b8af2f8b0c99 100644
--- a/arch/arm64/include/asm/jump_label.h
+++ b/arch/arm64/include/asm/jump_label.h
@@ -15,6 +15,16 @@
  
  #define JUMP_LABEL_NOP_SIZE		AARCH64_INSN_SIZE
  
+/*
+ * Prefer "S" to support PIC. However, use "i" for Clang 18 and earlier as "S"
+ * on a symbol with a constant offset is not supported.
+ */
+#if defined(CONFIG_CC_IS_CLANG) && __clang_major__ <= 18
+#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
+#else
+#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
+#endif
+
  static __always_inline bool arch_static_branch(struct static_key * const key,
  					       const bool branch)
  {
@@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
  		 "	.pushsection	__jump_table, \"aw\"	\n\t"
  		 "	.align		3			\n\t"
  		 "	.long		1b - ., %l[l_yes] - .	\n\t"
-		 "	.quad		%c0 - .			\n\t"
+		 "	.quad		%0 + %1 - .		\n\t"
  		 "	.popsection				\n\t"
-		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
+		 :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);
  
  	return false;
  l_yes:
@@ -40,9 +50,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key * const ke
  		 "	.pushsection	__jump_table, \"aw\"	\n\t"
  		 "	.align		3			\n\t"
  		 "	.long		1b - ., %l[l_yes] - .	\n\t"
-		 "	.quad		%c0 - .			\n\t"
+		 "	.quad		%0 + %1 - .		\n\t"
  		 "	.popsection				\n\t"
-		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
+		 :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);
  
  	return false;
  l_yes:


>> >Anyway, I don't think the "i" versys "S" distinction makes any
>> >difference without -fpic or equivalent, so it is not really relevant
>> >for the kernel (except that "S" breaks compatibility with older
>> >compilers...)
>> >
>> >
>> >I think the main advantage of "S" is that it stops you accidentally
>> >emitting undesirable relocations from asm code that is not written for
>> >the -fpic case.
>> >
>> >But just changing "i" to "S" is not sufficient to port asms to -fpic:
>> >the asms still need to be reviewed.
>> >
>> >
>> >So unless the asm has been reviewed for position-independence, it may
>> >anyway be better to stick with "i" so that the compiler actually chokes
>> >if someone tries to build the code with -fpic.
>>
>
>The virtual address of the kernel is randomized by KASLR, which relies
>on PIE linking, and this puts constraints on the permitted types of
>relocations.
>
>IOW, we basically already build the kernel as PIC code, but without
>relying on -fPIC, because that triggers some behaviors that only make
>sense for shared objects in user space.
>
>> >Since we are not trying to run arbitraily many running kernels in a
>> >common address space (and not likely to do that), I'm not sure that we
>> >would ever build the kernel with -fpic except for a few special-case
>> >bits like the EFI stub and vDSO... unless I've missed something?
>> >
>
>Yes, KASLR. The number of kernels is not the point, the point is that
>the virtual load address of the kernel is usually decided at boot, and
>so the code needs to be generated to accommodate that.
>
>> >If there's another reason why "S" is advantageous though, I'm happy to
>> >be corrected.
>>
>> I remember that Ard has an RFC
>> https://lore.kernel.org/linux-arm-kernel/20220427171241.2426592-1-ardb@kernel.org/
>> "[RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel"
>> and see some recent PIE codegen patches.
>>
>> > Building the KASLR kernel without -fpie but linking it with -pie works
>> > in practice, but it is not something that is explicitly supported by the
>> > toolchains - it happens to work because the default 'small' code model
>> > used by both GCC and Clang relies mostly on ADRP+ADD/LDR to generate
>> > symbol references.
>>
>> I agree that current -fno-PIE with -shared -Bsymbolic linking is a hack
>> that works as a conincidence, not guaranteed by the toolchain.
>> This jump_label improvement (with no object file difference) fixes an
>> obstacle.
>
>If we can get the guaranteed behavior of #pragma GCC visibility
>push(hidden) from a command line option, we should build the core
>kernel with -fpie instead. (Modules are partially linked objects, so
>they can be built non-PIC as before)
>

I believe we don't have such a GCC option, but the effect can be
simulated by -include hidden.h ...
  
Dave Martin Feb. 1, 2024, 11:49 a.m. UTC | #6
On Thu, Feb 01, 2024 at 01:11:20AM -0800, Fangrui Song wrote:
> On 2024-02-01, Ard Biesheuvel wrote:
> > On Thu, 1 Feb 2024 at 05:55, Fangrui Song <maskray@google.com> wrote:
> > > 
> > > On 2024-01-31, Dave Martin wrote:
> > > >On Wed, Jan 31, 2024 at 08:16:04AM +0100, Ard Biesheuvel wrote:
> > > >> Hello Fangrui,
> > > >>
> > > >> On Wed, 31 Jan 2024 at 07:53, Fangrui Song <maskray@google.com> wrote:
> > > >> >
> > > >> > The constraint "i" seems to be copied from x86 (and with a redundant
> > > >> > modifier "c"). It works with -fno-PIE but not with -fPIE/-fPIC in GCC's
> > > >> > aarch64 port.
> > > >
> > > >(I'm not sure of the exact history, but the "c" may be inherited from
> > > >arm, where an output modifier was needed to suppress the "#" that
> > > >prefixes immediates in the traditional asm syntax.  This does not
> > > >actually seem to be required for AArch64: rather while a # is allowed
> > > >and still considered good style in handwritten asm code, the syntax
> > > >doesn't require it, and the compiler doesn't emit it for "i" arguments,
> > > >AFAICT.)
> > > 
> > > The aarch64 one could be inherited from
> > > arch/arm/include/asm/jump_label.h (2012), which could in turn be
> > > inherited from x86 (2010).
> > > Both the constraint "i" and the modifier "c" are generic..
> > > For -fno-pic this combination can be used for every arch.

Fair enough.  It wasn't clear to me that "c" was generic; the output
modifiers have always been a bit of a black art...

> > > 
> > > >> > The constraint "S", which denotes a symbol reference (e.g. function,
> > > >> > global variable) or label reference, is more appropriate, and has been
> > > >> > available in GCC since 2012 and in Clang since 7.0.
> > > >> >
> > > >> > Signed-off-by: Fangrui Song <maskray@google.com>
> > > >> > Link: https://maskray.me/blog/2024-01-30-raw-symbol-names-in-inline-assembly
> > > >> > ---
> > > >> >  arch/arm64/include/asm/jump_label.h | 8 ++++----
> > > >> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > > >> >
> > > >> > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> > > >> > index 48ddc0f45d22..31862b3bb33d 100644
> > > >> > --- a/arch/arm64/include/asm/jump_label.h
> > > >> > +++ b/arch/arm64/include/asm/jump_label.h
> > > >> > @@ -23,9 +23,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
> > > >> >                  "      .pushsection    __jump_table, \"aw\"    \n\t"
> > > >> >                  "      .align          3                       \n\t"
> > > >> >                  "      .long           1b - ., %l[l_yes] - .   \n\t"
> > > >> > -                "      .quad           %c0 - .                 \n\t"
> > > >> > +                "      .quad           %0 - .                  \n\t"
> > > >> >                  "      .popsection                             \n\t"
> > > >> > -                :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > > >> > +                :  :  "S"(&((char *)key)[branch]) :  : l_yes);
> > > >>
> > > >> 'key' is not used as a raw symbol name. We should make this
> > > >>
> > > >> "    .quad   %0 + %1 - ."
> > > >>
> > > >> and
> > > >>
> > > >> ::  "S"(key), "i"(branch) :: l_yes);
> > > >>
> > > >> if we want to really clean this up.
> > > >
> > > >This hides more logic in the asm so it's arguably more cryptic
> > > >(although the code is fairly cryptic to begin with -- I don't really
> > > >see why the argument wasn't written as the equivalent
> > > >(char *)key + branch...)
> > > 
> > > I agree that using "S" and "i" would introduce complexity.
> > > Using just "S" as this patch does should be clear.
> > > 
> > > All of "i" "s" "S" support a symbol or label reference and a constant offset (can be zero),
> > > (in object file, a symbol and an addend; in GCC's term, the sum of a SYMBOL_REF and a CONST_INT).
> > > 
> > 
> > Taken the address of a struct, cast it to char[] and then index it
> > using a boolean is rather disgusting, no?
> 
> I agree with you.
> 
> Hmm. Clang's constraint "S" implementation doesn't support a constant
> offset, so
> `static_key_false(&nf_hooks_needed[pf][hook])` in include/linux/netfilter.h:nf_hook
> would not compile with Clang <= 18.
> 
> I have a patch https://github.com/llvm/llvm-project/pull/80255 , but
> even if it is accepted and cherry-picked into the 18.x release branch,
> if we still support older Clang, we cannot use "S" unconditionally.
> 
> 
> So we probably need the following to prepare for -fPIE support in the
> future:
> 
> diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> index 48ddc0f45d22..b8af2f8b0c99 100644
> --- a/arch/arm64/include/asm/jump_label.h
> +++ b/arch/arm64/include/asm/jump_label.h
> @@ -15,6 +15,16 @@
>  #define JUMP_LABEL_NOP_SIZE		AARCH64_INSN_SIZE
> +/*
> + * Prefer "S" to support PIC. However, use "i" for Clang 18 and earlier as "S"
> + * on a symbol with a constant offset is not supported.
> + */
> +#if defined(CONFIG_CC_IS_CLANG) && __clang_major__ <= 18

Is a GCC version check needed?  Or is the minimum GCC version specified
for building the kernel new enough?

> +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
> +#else
> +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
> +#endif
> +
>  static __always_inline bool arch_static_branch(struct static_key * const key,
>  					       const bool branch)
>  {
> @@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
>  		 "	.pushsection	__jump_table, \"aw\"	\n\t"
>  		 "	.align		3			\n\t"
>  		 "	.long		1b - ., %l[l_yes] - .	\n\t"
> -		 "	.quad		%c0 - .			\n\t"
> +		 "	.quad		%0 + %1 - .		\n\t"
>  		 "	.popsection				\n\t"
> -		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> +		 :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);

While this looks reasonable, I'm still not clear now why the asm must do
the addition.

Can we roll the branch argument into the macro?

"S"(symbol + constant) seems to do the right thing for AArch64, at least
in GCC.

>  	return false;
>  l_yes:
> @@ -40,9 +50,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key * const ke
>  		 "	.pushsection	__jump_table, \"aw\"	\n\t"
>  		 "	.align		3			\n\t"
>  		 "	.long		1b - ., %l[l_yes] - .	\n\t"
> -		 "	.quad		%c0 - .			\n\t"
> +		 "	.quad		%0 + %1 - .		\n\t"
>  		 "	.popsection				\n\t"
> -		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> +		 :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);
>  	return false;
>  l_yes:
> 
> 
> > > >Anyway, I don't think the "i" versys "S" distinction makes any
> > > >difference without -fpic or equivalent, so it is not really relevant
> > > >for the kernel (except that "S" breaks compatibility with older
> > > >compilers...)
> > > >
> > > >
> > > >I think the main advantage of "S" is that it stops you accidentally
> > > >emitting undesirable relocations from asm code that is not written for
> > > >the -fpic case.
> > > >
> > > >But just changing "i" to "S" is not sufficient to port asms to -fpic:
> > > >the asms still need to be reviewed.
> > > >
> > > >
> > > >So unless the asm has been reviewed for position-independence, it may
> > > >anyway be better to stick with "i" so that the compiler actually chokes
> > > >if someone tries to build the code with -fpic.
> > > 
> > 
> > The virtual address of the kernel is randomized by KASLR, which relies
> > on PIE linking, and this puts constraints on the permitted types of
> > relocations.
> > 
> > IOW, we basically already build the kernel as PIC code, but without
> > relying on -fPIC, because that triggers some behaviors that only make
> > sense for shared objects in user space.
> > 
> > > >Since we are not trying to run arbitraily many running kernels in a
> > > >common address space (and not likely to do that), I'm not sure that we
> > > >would ever build the kernel with -fpic except for a few special-case
> > > >bits like the EFI stub and vDSO... unless I've missed something?
> > > >
> > 
> > Yes, KASLR. The number of kernels is not the point, the point is that
> > the virtual load address of the kernel is usually decided at boot, and
> > so the code needs to be generated to accommodate that.
> > 
> > > >If there's another reason why "S" is advantageous though, I'm happy to
> > > >be corrected.
> > > 
> > > I remember that Ard has an RFC
> > > https://lore.kernel.org/linux-arm-kernel/20220427171241.2426592-1-ardb@kernel.org/
> > > "[RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel"
> > > and see some recent PIE codegen patches.
> > > 
> > > > Building the KASLR kernel without -fpie but linking it with -pie works
> > > > in practice, but it is not something that is explicitly supported by the
> > > > toolchains - it happens to work because the default 'small' code model
> > > > used by both GCC and Clang relies mostly on ADRP+ADD/LDR to generate
> > > > symbol references.

Does this mean that doing boot-time relocation on a fully statically
linked kernel doesn't bring much benefit?  It would tend to be more
painful to do the relocation work, and mean that the kernel text has to
be temporarily writeable, but static linking should have the best
runtime performance.

Maybe it doesn't make as much difference as I thought (certainly ADRP
based addressing should make a fair amount of the PIC/PIE overhead go
away).

> > > 
> > > I agree that current -fno-PIE with -shared -Bsymbolic linking is a hack
> > > that works as a conincidence, not guaranteed by the toolchain.
> > > This jump_label improvement (with no object file difference) fixes an
> > > obstacle.
> > 
> > If we can get the guaranteed behavior of #pragma GCC visibility
> > push(hidden) from a command line option, we should build the core
> > kernel with -fpie instead. (Modules are partially linked objects, so
> > they can be built non-PIC as before)
> > 
> I believe we don't have such a GCC option, but the effect can be
> simulated by -include hidden.h ...

[...]

Cheers
---Dave
  
Ard Biesheuvel Feb. 1, 2024, 4:07 p.m. UTC | #7
On Thu, 1 Feb 2024 at 12:50, Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Thu, Feb 01, 2024 at 01:11:20AM -0800, Fangrui Song wrote:
..
> >
> > Hmm. Clang's constraint "S" implementation doesn't support a constant
> > offset, so
> > `static_key_false(&nf_hooks_needed[pf][hook])` in include/linux/netfilter.h:nf_hook
> > would not compile with Clang <= 18.
> >
> > I have a patch https://github.com/llvm/llvm-project/pull/80255 , but
> > even if it is accepted and cherry-picked into the 18.x release branch,
> > if we still support older Clang, we cannot use "S" unconditionally.
> >
> >
> > So we probably need the following to prepare for -fPIE support in the
> > future:
> >
> > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> > index 48ddc0f45d22..b8af2f8b0c99 100644
> > --- a/arch/arm64/include/asm/jump_label.h
> > +++ b/arch/arm64/include/asm/jump_label.h
> > @@ -15,6 +15,16 @@
> >  #define JUMP_LABEL_NOP_SIZE          AARCH64_INSN_SIZE
> > +/*
> > + * Prefer "S" to support PIC. However, use "i" for Clang 18 and earlier as "S"
> > + * on a symbol with a constant offset is not supported.
> > + */
> > +#if defined(CONFIG_CC_IS_CLANG) && __clang_major__ <= 18
>
> Is a GCC version check needed?  Or is the minimum GCC version specified
> for building the kernel new enough?
>
> > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
> > +#else
> > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
> > +#endif
> > +

Can we use "Si" instead?

> >  static __always_inline bool arch_static_branch(struct static_key * const key,
> >                                              const bool branch)
> >  {
> > @@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
> >                "      .pushsection    __jump_table, \"aw\"    \n\t"
> >                "      .align          3                       \n\t"
> >                "      .long           1b - ., %l[l_yes] - .   \n\t"
> > -              "      .quad           %c0 - .                 \n\t"
> > +              "      .quad           %0 + %1 - .             \n\t"
> >                "      .popsection                             \n\t"
> > -              :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > +              :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);
>
> While this looks reasonable, I'm still not clear now why the asm must do
> the addition.
>
> Can we roll the branch argument into the macro?
>
> "S"(symbol + constant) seems to do the right thing for AArch64, at least
> in GCC.
>

'symbol' is a struct static_key pointer, so adding '1' to it will not
produce the needed result.

..
> > > > >If there's another reason why "S" is advantageous though, I'm happy to
> > > > >be corrected.
> > > >
> > > > I remember that Ard has an RFC
> > > > https://lore.kernel.org/linux-arm-kernel/20220427171241.2426592-1-ardb@kernel.org/
> > > > "[RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel"
> > > > and see some recent PIE codegen patches.
> > > >
> > > > > Building the KASLR kernel without -fpie but linking it with -pie works
> > > > > in practice, but it is not something that is explicitly supported by the
> > > > > toolchains - it happens to work because the default 'small' code model
> > > > > used by both GCC and Clang relies mostly on ADRP+ADD/LDR to generate
> > > > > symbol references.
>
> Does this mean that doing boot-time relocation on a fully statically
> linked kernel doesn't bring much benefit?

Not sure I follow you here. The boot-time relocation is necessary in
any case, to fix up statically initialized pointer variables all over
the kernel.

> It would tend to be more
> painful to do the relocation work, and mean that the kernel text has to
> be temporarily writeable, but static linking should have the best
> runtime performance.
>

Not sure what you mean by 'static linking' here.

The only thing PIE linking does in the case of the kernel is
a) complain if static relocations are used that cannot be fixed up at
runtime (e.g., movk/movz sequences)
b) emit dynamic relocations that the early startup code can use to fix
up all statically initialized pointers

From a codegen point-of-view, there is no difference because we don't
use -fpic code. The problem with -fpic codegen is that it makes some
assumptions that only hold for shared ELF objects, e.g., avoiding
dynamic relocations in .text, using GOT entries even for variables
defined in the same compilation so that they can be preempted, keeping
all runtime relocatable quantities together so the CoW footprint of
the shared library is minimized.

None of this matters for the kernel, and as it turns out, the non-PIC
small C model produces code that the PIE linker is happy with.

TL;DR our code (including inline and out-of-line asm) is already PIC,
and this property is relied upon by KASLR.

> Maybe it doesn't make as much difference as I thought (certainly ADRP
> based addressing should make a fair amount of the PIC/PIE overhead go
> away).
>

The PIC/PIE overhead is in the indirections via the GOT. However,
recent linkers will actually perform some relaxations too: if a symbol
is defined in the same executable and is not preemptible, a GOT load

ADRP
LDR

can be transparently converted to

ADRP
ADD

referring to the symbol directly. This is also where hidden.h comes
in: making all symbol declarations and definitions 'hidden' will allow
the compiler to assume that a GOT entry is never needed.

So building with -fPIC is currently not need in practice, and creates
some complications, which is why we have been avoiding it. But PIE
linking non-PIC objects is not guaranteed to remain supported going
forward, so it would be better to have a plan B, i.e., being able to
turn on -fpic without massive regressions due to GOT overhead, or
codegen problems with our asm hacks.
  
Dave Martin Feb. 1, 2024, 5:29 p.m. UTC | #8
On Thu, Feb 01, 2024 at 05:07:59PM +0100, Ard Biesheuvel wrote:
> On Thu, 1 Feb 2024 at 12:50, Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > On Thu, Feb 01, 2024 at 01:11:20AM -0800, Fangrui Song wrote:
> ...
> > >
> > > Hmm. Clang's constraint "S" implementation doesn't support a constant
> > > offset, so
> > > `static_key_false(&nf_hooks_needed[pf][hook])` in include/linux/netfilter.h:nf_hook
> > > would not compile with Clang <= 18.
> > >
> > > I have a patch https://github.com/llvm/llvm-project/pull/80255 , but
> > > even if it is accepted and cherry-picked into the 18.x release branch,
> > > if we still support older Clang, we cannot use "S" unconditionally.
> > >
> > >
> > > So we probably need the following to prepare for -fPIE support in the
> > > future:
> > >
> > > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> > > index 48ddc0f45d22..b8af2f8b0c99 100644
> > > --- a/arch/arm64/include/asm/jump_label.h
> > > +++ b/arch/arm64/include/asm/jump_label.h
> > > @@ -15,6 +15,16 @@
> > >  #define JUMP_LABEL_NOP_SIZE          AARCH64_INSN_SIZE
> > > +/*
> > > + * Prefer "S" to support PIC. However, use "i" for Clang 18 and earlier as "S"
> > > + * on a symbol with a constant offset is not supported.
> > > + */
> > > +#if defined(CONFIG_CC_IS_CLANG) && __clang_major__ <= 18
> >
> > Is a GCC version check needed?  Or is the minimum GCC version specified
> > for building the kernel new enough?
> >
> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
> > > +#else
> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
> > > +#endif
> > > +
> 
> Can we use "Si" instead?

I thought the point was to avoid "S" on compilers that would choke on
it?  If so, those compilers would surely choke on "Si" too, no?


> > >  static __always_inline bool arch_static_branch(struct static_key * const key,
> > >                                              const bool branch)
> > >  {
> > > @@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
> > >                "      .pushsection    __jump_table, \"aw\"    \n\t"
> > >                "      .align          3                       \n\t"
> > >                "      .long           1b - ., %l[l_yes] - .   \n\t"
> > > -              "      .quad           %c0 - .                 \n\t"
> > > +              "      .quad           %0 + %1 - .             \n\t"
> > >                "      .popsection                             \n\t"
> > > -              :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > > +              :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);
> >
> > While this looks reasonable, I'm still not clear now why the asm must do
> > the addition.
> >
> > Can we roll the branch argument into the macro?
> >
> > "S"(symbol + constant) seems to do the right thing for AArch64, at least
> > in GCC.
> >
> 
> 'symbol' is a struct static_key pointer, so adding '1' to it will not
> produce the needed result.

I mean loosely things that resolve to something of the form
"symbol + constant" in the compiler output.

"S"(&((char *)key)[branch]) does indeed seem to do the right thing,
at least with GCC.

I probably didn't help by bikeshedding the way that expression was
written, apologies for that.  It's orthogonal to what this patch is
doing.

> 
> ...
> > > > > >If there's another reason why "S" is advantageous though, I'm happy to
> > > > > >be corrected.
> > > > >
> > > > > I remember that Ard has an RFC
> > > > > https://lore.kernel.org/linux-arm-kernel/20220427171241.2426592-1-ardb@kernel.org/
> > > > > "[RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel"
> > > > > and see some recent PIE codegen patches.
> > > > >
> > > > > > Building the KASLR kernel without -fpie but linking it with -pie works
> > > > > > in practice, but it is not something that is explicitly supported by the
> > > > > > toolchains - it happens to work because the default 'small' code model
> > > > > > used by both GCC and Clang relies mostly on ADRP+ADD/LDR to generate
> > > > > > symbol references.
> >
> > Does this mean that doing boot-time relocation on a fully statically
> > linked kernel doesn't bring much benefit?
> 
> Not sure I follow you here. The boot-time relocation is necessary in
> any case, to fix up statically initialized pointer variables all over
> the kernel.
> 
> > It would tend to be more
> > painful to do the relocation work, and mean that the kernel text has to
> > be temporarily writeable, but static linking should have the best
> > runtime performance.
> >
> 
> Not sure what you mean by 'static linking' here.
> 
> The only thing PIE linking does in the case of the kernel is
> a) complain if static relocations are used that cannot be fixed up at
> runtime (e.g., movk/movz sequences)
> b) emit dynamic relocations that the early startup code can use to fix
> up all statically initialized pointers
> 
> From a codegen point-of-view, there is no difference because we don't
> use -fpic code. The problem with -fpic codegen is that it makes some
> assumptions that only hold for shared ELF objects, e.g., avoiding
> dynamic relocations in .text, using GOT entries even for variables
> defined in the same compilation so that they can be preempted, keeping
> all runtime relocatable quantities together so the CoW footprint of
> the shared library is minimized.
> 
> None of this matters for the kernel, and as it turns out, the non-PIC
> small C model produces code that the PIE linker is happy with.
> 
> TL;DR our code (including inline and out-of-line asm) is already PIC,
> and this property is relied upon by KASLR.
> 
> > Maybe it doesn't make as much difference as I thought (certainly ADRP
> > based addressing should make a fair amount of the PIC/PIE overhead go
> > away).
> >
> 
> The PIC/PIE overhead is in the indirections via the GOT. However,
> recent linkers will actually perform some relaxations too: if a symbol
> is defined in the same executable and is not preemptible, a GOT load
> 
> ADRP
> LDR
> 
> can be transparently converted to
> 
> ADRP
> ADD
> 
> referring to the symbol directly. This is also where hidden.h comes
> in: making all symbol declarations and definitions 'hidden' will allow
> the compiler to assume that a GOT entry is never needed.

Is there a reason why we don't just build the whole kernel with
-fvisibility=hidden today?

> So building with -fPIC is currently not need in practice, and creates
> some complications, which is why we have been avoiding it. But PIE
> linking non-PIC objects is not guaranteed to remain supported going
> forward, so it would be better to have a plan B, i.e., being able to
> turn on -fpic without massive regressions due to GOT overhead, or
> codegen problems with our asm hacks.

Summarising all of this is it right that:

1) ld -pie is how we get the reloc info into the kernel for KASLR
today.

2) We use gcc -fno-pic today, but this might break with ld -pie in the
future, although it works for now.

3) gcc -fno-pic and gcc -fPIC (or -fPIE) generate almost the same code,
assuming we tweak symbol visibility and use a memory model that
ADR+ADD/LDR can span.  So, moving to -fPIE is likely to be do-able.


My point is that an alternative option would be to move to ld -no-pie.
We would need another way to get relocs into the kernel, such as an
intermediate step with ld --emit-relocs.  I have definitely seen this
done before somewhere, but it would be extra work and not necessarily
worth it, based on what you say about code gen.

This may all have been discussed to death already -- I didn't mean to
hijack this patch, so I'll stop digging here, but if I've misunderstood,
please shout!

Cheers
---Dave
  
Fangrui Song Feb. 1, 2024, 8:56 p.m. UTC | #9
On 2024-02-01, Dave Martin wrote:
>On Thu, Feb 01, 2024 at 05:07:59PM +0100, Ard Biesheuvel wrote:
>> On Thu, 1 Feb 2024 at 12:50, Dave Martin <Dave.Martin@arm.com> wrote:
>> >
>> > On Thu, Feb 01, 2024 at 01:11:20AM -0800, Fangrui Song wrote:
>> ...
>> > >
>> > > Hmm. Clang's constraint "S" implementation doesn't support a constant
>> > > offset, so
>> > > `static_key_false(&nf_hooks_needed[pf][hook])` in include/linux/netfilter.h:nf_hook
>> > > would not compile with Clang <= 18.
>> > >
>> > > I have a patch https://github.com/llvm/llvm-project/pull/80255 , but
>> > > even if it is accepted and cherry-picked into the 18.x release branch,
>> > > if we still support older Clang, we cannot use "S" unconditionally.
>> > >
>> > >
>> > > So we probably need the following to prepare for -fPIE support in the
>> > > future:
>> > >
>> > > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
>> > > index 48ddc0f45d22..b8af2f8b0c99 100644
>> > > --- a/arch/arm64/include/asm/jump_label.h
>> > > +++ b/arch/arm64/include/asm/jump_label.h
>> > > @@ -15,6 +15,16 @@
>> > >  #define JUMP_LABEL_NOP_SIZE          AARCH64_INSN_SIZE
>> > > +/*
>> > > + * Prefer "S" to support PIC. However, use "i" for Clang 18 and earlier as "S"
>> > > + * on a symbol with a constant offset is not supported.
>> > > + */
>> > > +#if defined(CONFIG_CC_IS_CLANG) && __clang_major__ <= 18
>> >
>> > Is a GCC version check needed?  Or is the minimum GCC version specified
>> > for building the kernel new enough?
>> >
>> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
>> > > +#else
>> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
>> > > +#endif
>> > > +
>>
>> Can we use "Si" instead?
>
>I thought the point was to avoid "S" on compilers that would choke on
>it?  If so, those compilers would surely choke on "Si" too, no?

"Si" is an invalid constraint. Unfortunately, GCC recognizes "S" as a
valid constraint and ignores trailing characters (asm_operand_ok). That
is, GCC would accept "Siiiii" as well...

The GCC support for "S" is great. The initial aarch64 port in 2012 supports "S".

>> > >  static __always_inline bool arch_static_branch(struct static_key * const key,
>> > >                                              const bool branch)
>> > >  {
>> > > @@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
>> > >                "      .pushsection    __jump_table, \"aw\"    \n\t"
>> > >                "      .align          3                       \n\t"
>> > >                "      .long           1b - ., %l[l_yes] - .   \n\t"
>> > > -              "      .quad           %c0 - .                 \n\t"
>> > > +              "      .quad           %0 + %1 - .             \n\t"
>> > >                "      .popsection                             \n\t"
>> > > -              :  :  "i"(&((char *)key)[branch]) :  : l_yes);
>> > > +              :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);
>> >
>> > While this looks reasonable, I'm still not clear now why the asm must do
>> > the addition.
>> >
>> > Can we roll the branch argument into the macro?
>> >
>> > "S"(symbol + constant) seems to do the right thing for AArch64, at least
>> > in GCC.
>> >
>>
>> 'symbol' is a struct static_key pointer, so adding '1' to it will not
>> produce the needed result.
>
>I mean loosely things that resolve to something of the form
>"symbol + constant" in the compiler output.
>
>"S"(&((char *)key)[branch]) does indeed seem to do the right thing,
>at least with GCC.
>
>I probably didn't help by bikeshedding the way that expression was
>written, apologies for that.  It's orthogonal to what this patch is
>doing.

Yes, "S"(&((char *)key)[branch])  would do the right thing.
I have compared assembly output. It's a matter of "s" vs "s + 0" and "s+1" vs "s + 1".

>>
>> ...
>> > > > > >If there's another reason why "S" is advantageous though, I'm happy to
>> > > > > >be corrected.
>> > > > >
>> > > > > I remember that Ard has an RFC
>> > > > > https://lore.kernel.org/linux-arm-kernel/20220427171241.2426592-1-ardb@kernel.org/
>> > > > > "[RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel"
>> > > > > and see some recent PIE codegen patches.
>> > > > >
>> > > > > > Building the KASLR kernel without -fpie but linking it with -pie works
>> > > > > > in practice, but it is not something that is explicitly supported by the
>> > > > > > toolchains - it happens to work because the default 'small' code model
>> > > > > > used by both GCC and Clang relies mostly on ADRP+ADD/LDR to generate
>> > > > > > symbol references.
>> >
>> > Does this mean that doing boot-time relocation on a fully statically
>> > linked kernel doesn't bring much benefit?
>>
>> Not sure I follow you here. The boot-time relocation is necessary in
>> any case, to fix up statically initialized pointer variables all over
>> the kernel.
>>
>> > It would tend to be more
>> > painful to do the relocation work, and mean that the kernel text has to
>> > be temporarily writeable, but static linking should have the best
>> > runtime performance.
>> >
>>
>> Not sure what you mean by 'static linking' here.
>>
>> The only thing PIE linking does in the case of the kernel is
>> a) complain if static relocations are used that cannot be fixed up at
>> runtime (e.g., movk/movz sequences)
>> b) emit dynamic relocations that the early startup code can use to fix
>> up all statically initialized pointers
>>
>> From a codegen point-of-view, there is no difference because we don't
>> use -fpic code. The problem with -fpic codegen is that it makes some
>> assumptions that only hold for shared ELF objects, e.g., avoiding
>> dynamic relocations in .text, using GOT entries even for variables
>> defined in the same compilation so that they can be preempted, keeping
>> all runtime relocatable quantities together so the CoW footprint of
>> the shared library is minimized.
>>
>> None of this matters for the kernel, and as it turns out, the non-PIC
>> small C model produces code that the PIE linker is happy with.
>>
>> TL;DR our code (including inline and out-of-line asm) is already PIC,
>> and this property is relied upon by KASLR.
>>
>> > Maybe it doesn't make as much difference as I thought (certainly ADRP
>> > based addressing should make a fair amount of the PIC/PIE overhead go
>> > away).
>> >
>>
>> The PIC/PIE overhead is in the indirections via the GOT. However,
>> recent linkers will actually perform some relaxations too: if a symbol
>> is defined in the same executable and is not preemptible, a GOT load
>>
>> ADRP
>> LDR
>>
>> can be transparently converted to
>>
>> ADRP
>> ADD
>>
>> referring to the symbol directly. This is also where hidden.h comes
>> in: making all symbol declarations and definitions 'hidden' will allow
>> the compiler to assume that a GOT entry is never needed.
>
>Is there a reason why we don't just build the whole kernel with
>-fvisibility=hidden today?

This topic, loosely related to this patch, is about switching to PIC
compiles. I am not familiar with the Linux kernel, so I'll mostly leave
the discussion to you and Ard :)

That said, I have done a lot of Clang work in visibility/symbol
preemptibility and linkers, so I am happy to add what I know.

-fvisibility=hidden only applies to definitions, not non-definition
declarations.

I've mentioned this at
https://lore.kernel.org/all/20220429070318.iwj3j5lpfkw4t7g2@google.com/

     `#pragma GCC visibility push(hidden)` is very similar to -fvisibility=hidden -fdirect-access-external-data with Clang.
     ...
     The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
     -fdirect-access-external-data can replace `#pragma GCC visibility
     push(hidden)`.

>> So building with -fPIC is currently not need in practice, and creates
>> some complications, which is why we have been avoiding it. But PIE
>> linking non-PIC objects is not guaranteed to remain supported going
>> forward, so it would be better to have a plan B, i.e., being able to
>> turn on -fpic without massive regressions due to GOT overhead, or
>> codegen problems with our asm hacks.
>
>Summarising all of this is it right that:
>
>1) ld -pie is how we get the reloc info into the kernel for KASLR
>today.
>
>2) We use gcc -fno-pic today, but this might break with ld -pie in the
>future, although it works for now.
>
>3) gcc -fno-pic and gcc -fPIC (or -fPIE) generate almost the same code,
>assuming we tweak symbol visibility and use a memory model that
>ADR+ADD/LDR can span.  So, moving to -fPIE is likely to be do-able.
>
>
>My point is that an alternative option would be to move to ld -no-pie.
>We would need another way to get relocs into the kernel, such as an
>intermediate step with ld --emit-relocs.  I have definitely seen this
>done before somewhere, but it would be extra work and not necessarily
>worth it, based on what you say about code gen.
>
>This may all have been discussed to death already -- I didn't mean to
>hijack this patch, so I'll stop digging here, but if I've misunderstood,
>please shout!
>
>Cheers
>---Dave
  
Ard Biesheuvel Feb. 1, 2024, 9:23 p.m. UTC | #10
On Thu, 1 Feb 2024 at 21:56, Fangrui Song <maskray@google.com> wrote:
>
> On 2024-02-01, Dave Martin wrote:
> >On Thu, Feb 01, 2024 at 05:07:59PM +0100, Ard Biesheuvel wrote:
> >> On Thu, 1 Feb 2024 at 12:50, Dave Martin <Dave.Martin@arm.com> wrote:
> >> >
> >> > On Thu, Feb 01, 2024 at 01:11:20AM -0800, Fangrui Song wrote:
> >> ...
> >> > >
> >> > > Hmm. Clang's constraint "S" implementation doesn't support a constant
> >> > > offset, so
> >> > > `static_key_false(&nf_hooks_needed[pf][hook])` in include/linux/netfilter.h:nf_hook
> >> > > would not compile with Clang <= 18.
> >> > >
> >> > > I have a patch https://github.com/llvm/llvm-project/pull/80255 , but
> >> > > even if it is accepted and cherry-picked into the 18.x release branch,
> >> > > if we still support older Clang, we cannot use "S" unconditionally.
> >> > >
> >> > >
> >> > > So we probably need the following to prepare for -fPIE support in the
> >> > > future:
> >> > >
> >> > > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> >> > > index 48ddc0f45d22..b8af2f8b0c99 100644
> >> > > --- a/arch/arm64/include/asm/jump_label.h
> >> > > +++ b/arch/arm64/include/asm/jump_label.h
> >> > > @@ -15,6 +15,16 @@
> >> > >  #define JUMP_LABEL_NOP_SIZE          AARCH64_INSN_SIZE
> >> > > +/*
> >> > > + * Prefer "S" to support PIC. However, use "i" for Clang 18 and earlier as "S"
> >> > > + * on a symbol with a constant offset is not supported.
> >> > > + */
> >> > > +#if defined(CONFIG_CC_IS_CLANG) && __clang_major__ <= 18
> >> >
> >> > Is a GCC version check needed?  Or is the minimum GCC version specified
> >> > for building the kernel new enough?
> >> >
> >> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
> >> > > +#else
> >> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
> >> > > +#endif
> >> > > +
> >>
> >> Can we use "Si" instead?
> >
> >I thought the point was to avoid "S" on compilers that would choke on
> >it?  If so, those compilers would surely choke on "Si" too, no?
>
> "Si" is an invalid constraint. Unfortunately, GCC recognizes "S" as a
> valid constraint and ignores trailing characters (asm_operand_ok). That
> is, GCC would accept "Siiiii" as well...
>
> The GCC support for "S" is great. The initial aarch64 port in 2012 supports "S".
>

So it is not possible to combine the S and i constraint, and let the
compiler figure out whether it meets either? We rely on this elsewhere
by combining r and Z into rZ. x86 uses "rm" for inline asm parameters
that could be either register or memory operands.


> >> > >  static __always_inline bool arch_static_branch(struct static_key * const key,
> >> > >                                              const bool branch)
> >> > >  {
> >> > > @@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
> >> > >                "      .pushsection    __jump_table, \"aw\"    \n\t"
> >> > >                "      .align          3                       \n\t"
> >> > >                "      .long           1b - ., %l[l_yes] - .   \n\t"
> >> > > -              "      .quad           %c0 - .                 \n\t"
> >> > > +              "      .quad           %0 + %1 - .             \n\t"
> >> > >                "      .popsection                             \n\t"
> >> > > -              :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> >> > > +              :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);
> >> >
> >> > While this looks reasonable, I'm still not clear now why the asm must do
> >> > the addition.
> >> >
> >> > Can we roll the branch argument into the macro?
> >> >
> >> > "S"(symbol + constant) seems to do the right thing for AArch64, at least
> >> > in GCC.
> >> >
> >>
> >> 'symbol' is a struct static_key pointer, so adding '1' to it will not
> >> produce the needed result.
> >
> >I mean loosely things that resolve to something of the form
> >"symbol + constant" in the compiler output.
> >
> >"S"(&((char *)key)[branch]) does indeed seem to do the right thing,
> >at least with GCC.
> >
> >I probably didn't help by bikeshedding the way that expression was
> >written, apologies for that.  It's orthogonal to what this patch is
> >doing.
>
> Yes, "S"(&((char *)key)[branch])  would do the right thing.
> I have compared assembly output. It's a matter of "s" vs "s + 0" and "s+1" vs "s + 1".
>

But if S does not work with constant offsets with older Clang, what is
the harm in splitting it into "S" for the symbol and "i" for the
branch?
..
> >>
> >> The only thing PIE linking does in the case of the kernel is
> >> a) complain if static relocations are used that cannot be fixed up at
> >> runtime (e.g., movk/movz sequences)
> >> b) emit dynamic relocations that the early startup code can use to fix
> >> up all statically initialized pointers
> >>
> >> From a codegen point-of-view, there is no difference because we don't
> >> use -fpic code. The problem with -fpic codegen is that it makes some
> >> assumptions that only hold for shared ELF objects, e.g., avoiding
> >> dynamic relocations in .text, using GOT entries even for variables
> >> defined in the same compilation so that they can be preempted, keeping
> >> all runtime relocatable quantities together so the CoW footprint of
> >> the shared library is minimized.
> >>
> >> None of this matters for the kernel, and as it turns out, the non-PIC
> >> small C model produces code that the PIE linker is happy with.
> >>
> >> TL;DR our code (including inline and out-of-line asm) is already PIC,
> >> and this property is relied upon by KASLR.
> >>
> >> > Maybe it doesn't make as much difference as I thought (certainly ADRP
> >> > based addressing should make a fair amount of the PIC/PIE overhead go
> >> > away).
> >> >
> >>
> >> The PIC/PIE overhead is in the indirections via the GOT. However,
> >> recent linkers will actually perform some relaxations too: if a symbol
> >> is defined in the same executable and is not preemptible, a GOT load
> >>
> >> ADRP
> >> LDR
> >>
> >> can be transparently converted to
> >>
> >> ADRP
> >> ADD
> >>
> >> referring to the symbol directly. This is also where hidden.h comes
> >> in: making all symbol declarations and definitions 'hidden' will allow
> >> the compiler to assume that a GOT entry is never needed.
> >
> >Is there a reason why we don't just build the whole kernel with
> >-fvisibility=hidden today?
>
> This topic, loosely related to this patch, is about switching to PIC
> compiles. I am not familiar with the Linux kernel, so I'll mostly leave
> the discussion to you and Ard :)
>
> That said, I have done a lot of Clang work in visibility/symbol
> preemptibility and linkers, so I am happy to add what I know.
>
> -fvisibility=hidden only applies to definitions, not non-definition
> declarations.
>
> I've mentioned this at
> https://lore.kernel.org/all/20220429070318.iwj3j5lpfkw4t7g2@google.com/
>
>      `#pragma GCC visibility push(hidden)` is very similar to -fvisibility=hidden -fdirect-access-external-data with Clang.
>      ...
>      The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
>      -fdirect-access-external-data can replace `#pragma GCC visibility
>      push(hidden)`.
>
> >> So building with -fPIC is currently not need in practice, and creates
> >> some complications, which is why we have been avoiding it. But PIE
> >> linking non-PIC objects is not guaranteed to remain supported going
> >> forward, so it would be better to have a plan B, i.e., being able to
> >> turn on -fpic without massive regressions due to GOT overhead, or
> >> codegen problems with our asm hacks.
> >
> >Summarising all of this is it right that:
> >
> >1) ld -pie is how we get the reloc info into the kernel for KASLR
> >today.
> >
> >2) We use gcc -fno-pic today, but this might break with ld -pie in the
> >future, although it works for now.
> >
> >3) gcc -fno-pic and gcc -fPIC (or -fPIE) generate almost the same code,
> >assuming we tweak symbol visibility and use a memory model that
> >ADR+ADD/LDR can span.  So, moving to -fPIE is likely to be do-able.
> >
> >
> >My point is that an alternative option would be to move to ld -no-pie.

Why? What would that achieve?

> >We would need another way to get relocs into the kernel, such as an
> >intermediate step with ld --emit-relocs.  I have definitely seen this
> >done before somewhere, but it would be extra work and not necessarily
> >worth it, based on what you say about code gen.

Relying on --emit-relocs is a terrible hack. It contains mostly
relocations that we don't need at runtime, so we need to postprocess
them. And linkers may not update these relocations in some cases after
relaxation or other transformations (and sometimes, it is not even
possible to do so, if the post-transformation situation cannot be
described by a relocation).

And with Fangrui's RELR optimization, the PIE relocation data is
extremely dense.

There is nothing to fix here, really.

> >
> >This may all have been discussed to death already -- I didn't mean to
> >hijack this patch, so I'll stop digging here, but if I've misunderstood,
> >please shout!

No worries. Ramana and I spent some time 5+ years ago to document how
PIC codegen for the kernel (or any freestanding executable, really)
deviates from the typical PIC codegen for shared libraries and PIE
executables, with the intention of adding a GCC compiler switch for
it, but that never really materialized, mostly because things just
work fine today.
  
Fangrui Song Feb. 1, 2024, 10:12 p.m. UTC | #11
On Thu, Feb 1, 2024 at 1:23 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 1 Feb 2024 at 21:56, Fangrui Song <maskray@google.com> wrote:
> >
> > On 2024-02-01, Dave Martin wrote:
> > >On Thu, Feb 01, 2024 at 05:07:59PM +0100, Ard Biesheuvel wrote:
> > >> On Thu, 1 Feb 2024 at 12:50, Dave Martin <Dave.Martin@arm.com> wrote:
> > >> >
> > >> > On Thu, Feb 01, 2024 at 01:11:20AM -0800, Fangrui Song wrote:
> > >> ...
> > >> > >
> > >> > > Hmm. Clang's constraint "S" implementation doesn't support a constant
> > >> > > offset, so
> > >> > > `static_key_false(&nf_hooks_needed[pf][hook])` in include/linux/netfilter.h:nf_hook
> > >> > > would not compile with Clang <= 18.
> > >> > >
> > >> > > I have a patch https://github.com/llvm/llvm-project/pull/80255 , but
> > >> > > even if it is accepted and cherry-picked into the 18.x release branch,
> > >> > > if we still support older Clang, we cannot use "S" unconditionally.
> > >> > >
> > >> > >
> > >> > > So we probably need the following to prepare for -fPIE support in the
> > >> > > future:
> > >> > >
> > >> > > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> > >> > > index 48ddc0f45d22..b8af2f8b0c99 100644
> > >> > > --- a/arch/arm64/include/asm/jump_label.h
> > >> > > +++ b/arch/arm64/include/asm/jump_label.h
> > >> > > @@ -15,6 +15,16 @@
> > >> > >  #define JUMP_LABEL_NOP_SIZE          AARCH64_INSN_SIZE
> > >> > > +/*
> > >> > > + * Prefer "S" to support PIC. However, use "i" for Clang 18 and earlier as "S"
> > >> > > + * on a symbol with a constant offset is not supported.
> > >> > > + */
> > >> > > +#if defined(CONFIG_CC_IS_CLANG) && __clang_major__ <= 18
> > >> >
> > >> > Is a GCC version check needed?  Or is the minimum GCC version specified
> > >> > for building the kernel new enough?
> > >> >
> > >> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
> > >> > > +#else
> > >> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
> > >> > > +#endif
> > >> > > +
> > >>
> > >> Can we use "Si" instead?
> > >
> > >I thought the point was to avoid "S" on compilers that would choke on
> > >it?  If so, those compilers would surely choke on "Si" too, no?
> >
> > "Si" is an invalid constraint. Unfortunately, GCC recognizes "S" as a
> > valid constraint and ignores trailing characters (asm_operand_ok). That
> > is, GCC would accept "Siiiii" as well...
> >
> > The GCC support for "S" is great. The initial aarch64 port in 2012 supports "S".
> >
>
> So it is not possible to combine the S and i constraint, and let the
> compiler figure out whether it meets either? We rely on this elsewhere
> by combining r and Z into rZ. x86 uses "rm" for inline asm parameters
> that could be either register or memory operands.

I did not realize that your "Si" suggestion meant this feature
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

    After the prefix, there must be one or more additional constraints
(see Constraints for asm Operands) that describe where the value
resides. Common constraints include ‘r’ for register and ‘m’ for
memory. When you list more than one possible location (for example,
"=rm"), the compiler chooses the most efficient one based on the
current context.

After some debugging of both GCC and Clang, I think you are right and
"Si" will work. Thanks for the suggestion!

GCC accepts "Si". Sorry for my incorrectly stating that GCC ignores
the trailing constraint. GCC does ignore an unknown constraint, as
long as one accepted constraint.

Clang accepts "Si" as well. In the absence of
https://github.com/llvm/llvm-project/pull/80255 (constant offset
support for 'S'), in TargetLowering::ComputeConstraintToUse, 'S' is
skipped and 'i' will be picked. This is unrelated to the "rm" weakness
in Clang (https://github.com/llvm/llvm-project/issues/20571).

I'll post a PATCH v2 removing JUMP_LABEL_STATIC_KEY_CONSTRAINT and
using "Si" unconditionally.

>
> > >> > >  static __always_inline bool arch_static_branch(struct static_key * const key,
> > >> > >                                              const bool branch)
> > >> > >  {
> > >> > > @@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
> > >> > >                "      .pushsection    __jump_table, \"aw\"    \n\t"
> > >> > >                "      .align          3                       \n\t"
> > >> > >                "      .long           1b - ., %l[l_yes] - .   \n\t"
> > >> > > -              "      .quad           %c0 - .                 \n\t"
> > >> > > +              "      .quad           %0 + %1 - .             \n\t"
> > >> > >                "      .popsection                             \n\t"
> > >> > > -              :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > >> > > +              :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);
> > >> >
> > >> > While this looks reasonable, I'm still not clear now why the asm must do
> > >> > the addition.
> > >> >
> > >> > Can we roll the branch argument into the macro?
> > >> >
> > >> > "S"(symbol + constant) seems to do the right thing for AArch64, at least
> > >> > in GCC.
> > >> >
> > >>
> > >> 'symbol' is a struct static_key pointer, so adding '1' to it will not
> > >> produce the needed result.
> > >
> > >I mean loosely things that resolve to something of the form
> > >"symbol + constant" in the compiler output.
> > >
> > >"S"(&((char *)key)[branch]) does indeed seem to do the right thing,
> > >at least with GCC.
> > >
> > >I probably didn't help by bikeshedding the way that expression was
> > >written, apologies for that.  It's orthogonal to what this patch is
> > >doing.
> >
> > Yes, "S"(&((char *)key)[branch])  would do the right thing.
> > I have compared assembly output. It's a matter of "s" vs "s + 0" and "s+1" vs "s + 1".
> >
>
> But if S does not work with constant offsets with older Clang, what is
> the harm in splitting it into "S" for the symbol and "i" for the
> branch?
> ...
> > >>
> > >> The only thing PIE linking does in the case of the kernel is
> > >> a) complain if static relocations are used that cannot be fixed up at
> > >> runtime (e.g., movk/movz sequences)
> > >> b) emit dynamic relocations that the early startup code can use to fix
> > >> up all statically initialized pointers
> > >>
> > >> From a codegen point-of-view, there is no difference because we don't
> > >> use -fpic code. The problem with -fpic codegen is that it makes some
> > >> assumptions that only hold for shared ELF objects, e.g., avoiding
> > >> dynamic relocations in .text, using GOT entries even for variables
> > >> defined in the same compilation so that they can be preempted, keeping
> > >> all runtime relocatable quantities together so the CoW footprint of
> > >> the shared library is minimized.
> > >>
> > >> None of this matters for the kernel, and as it turns out, the non-PIC
> > >> small C model produces code that the PIE linker is happy with.
> > >>
> > >> TL;DR our code (including inline and out-of-line asm) is already PIC,
> > >> and this property is relied upon by KASLR.
> > >>
> > >> > Maybe it doesn't make as much difference as I thought (certainly ADRP
> > >> > based addressing should make a fair amount of the PIC/PIE overhead go
> > >> > away).
> > >> >
> > >>
> > >> The PIC/PIE overhead is in the indirections via the GOT. However,
> > >> recent linkers will actually perform some relaxations too: if a symbol
> > >> is defined in the same executable and is not preemptible, a GOT load
> > >>
> > >> ADRP
> > >> LDR
> > >>
> > >> can be transparently converted to
> > >>
> > >> ADRP
> > >> ADD
> > >>
> > >> referring to the symbol directly. This is also where hidden.h comes
> > >> in: making all symbol declarations and definitions 'hidden' will allow
> > >> the compiler to assume that a GOT entry is never needed.
> > >
> > >Is there a reason why we don't just build the whole kernel with
> > >-fvisibility=hidden today?
> >
> > This topic, loosely related to this patch, is about switching to PIC
> > compiles. I am not familiar with the Linux kernel, so I'll mostly leave
> > the discussion to you and Ard :)
> >
> > That said, I have done a lot of Clang work in visibility/symbol
> > preemptibility and linkers, so I am happy to add what I know.
> >
> > -fvisibility=hidden only applies to definitions, not non-definition
> > declarations.
> >
> > I've mentioned this at
> > https://lore.kernel.org/all/20220429070318.iwj3j5lpfkw4t7g2@google.com/
> >
> >      `#pragma GCC visibility push(hidden)` is very similar to -fvisibility=hidden -fdirect-access-external-data with Clang.
> >      ...
> >      The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
> >      -fdirect-access-external-data can replace `#pragma GCC visibility
> >      push(hidden)`.
> >
> > >> So building with -fPIC is currently not need in practice, and creates
> > >> some complications, which is why we have been avoiding it. But PIE
> > >> linking non-PIC objects is not guaranteed to remain supported going
> > >> forward, so it would be better to have a plan B, i.e., being able to
> > >> turn on -fpic without massive regressions due to GOT overhead, or
> > >> codegen problems with our asm hacks.
> > >
> > >Summarising all of this is it right that:
> > >
> > >1) ld -pie is how we get the reloc info into the kernel for KASLR
> > >today.
> > >
> > >2) We use gcc -fno-pic today, but this might break with ld -pie in the
> > >future, although it works for now.
> > >
> > >3) gcc -fno-pic and gcc -fPIC (or -fPIE) generate almost the same code,
> > >assuming we tweak symbol visibility and use a memory model that
> > >ADR+ADD/LDR can span.  So, moving to -fPIE is likely to be do-able.
> > >
> > >
> > >My point is that an alternative option would be to move to ld -no-pie.
>
> Why? What would that achieve?
>
> > >We would need another way to get relocs into the kernel, such as an
> > >intermediate step with ld --emit-relocs.  I have definitely seen this
> > >done before somewhere, but it would be extra work and not necessarily
> > >worth it, based on what you say about code gen.
>
> Relying on --emit-relocs is a terrible hack. It contains mostly
> relocations that we don't need at runtime, so we need to postprocess
> them. And linkers may not update these relocations in some cases after
> relaxation or other transformations (and sometimes, it is not even
> possible to do so, if the post-transformation situation cannot be
> described by a relocation).

Agreed. I've filed
https://sourceware.org/bugzilla/show_bug.cgi?id=30844
("ld riscv: --emit-relocs does not retain the original relocation
type") for a RISC-V version.

--emit-relocs is under-specified and the linker behaviors are less tested.

> And with Fangrui's RELR optimization, the PIE relocation data is
> extremely dense.
>
> There is nothing to fix here, really.
>
> > >
> > >This may all have been discussed to death already -- I didn't mean to
> > >hijack this patch, so I'll stop digging here, but if I've misunderstood,
> > >please shout!
>
> No worries. Ramana and I spent some time 5+ years ago to document how
> PIC codegen for the kernel (or any freestanding executable, really)
> deviates from the typical PIC codegen for shared libraries and PIE
> executables, with the intention of adding a GCC compiler switch for
> it, but that never really materialized, mostly because things just
> work fine today.
>
  
Dave Martin Feb. 2, 2024, 3:42 p.m. UTC | #12
On Thu, Feb 01, 2024 at 02:12:06PM -0800, Fangrui Song wrote:
> On Thu, Feb 1, 2024 at 1:23 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Thu, 1 Feb 2024 at 21:56, Fangrui Song <maskray@google.com> wrote:
> > >
> > > On 2024-02-01, Dave Martin wrote:
> > > >On Thu, Feb 01, 2024 at 05:07:59PM +0100, Ard Biesheuvel wrote:
> > > >> On Thu, 1 Feb 2024 at 12:50, Dave Martin <Dave.Martin@arm.com> wrote:
> > > >> >
> > > >> > On Thu, Feb 01, 2024 at 01:11:20AM -0800, Fangrui Song wrote:

[...]

> > > >> > Is a GCC version check needed?  Or is the minimum GCC version specified
> > > >> > for building the kernel new enough?
> > > >> >
> > > >> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
> > > >> > > +#else
> > > >> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
> > > >> > > +#endif
> > > >> > > +
> > > >>
> > > >> Can we use "Si" instead?
> > > >
> > > >I thought the point was to avoid "S" on compilers that would choke on
> > > >it?  If so, those compilers would surely choke on "Si" too, no?
> > >
> > > "Si" is an invalid constraint. Unfortunately, GCC recognizes "S" as a
> > > valid constraint and ignores trailing characters (asm_operand_ok). That
> > > is, GCC would accept "Siiiii" as well...
> > >
> > > The GCC support for "S" is great. The initial aarch64 port in 2012 supports "S".

Ah, no problem.


> > So it is not possible to combine the S and i constraint, and let the
> > compiler figure out whether it meets either? We rely on this elsewhere
> > by combining r and Z into rZ. x86 uses "rm" for inline asm parameters
> > that could be either register or memory operands.
> 
> I did not realize that your "Si" suggestion meant this feature
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
> 
>     After the prefix, there must be one or more additional constraints
> (see Constraints for asm Operands) that describe where the value
> resides. Common constraints include ‘r’ for register and ‘m’ for
> memory. When you list more than one possible location (for example,
> "=rm"), the compiler chooses the most efficient one based on the
> current context.
> 
> After some debugging of both GCC and Clang, I think you are right and
> "Si" will work. Thanks for the suggestion!
> 
> GCC accepts "Si". Sorry for my incorrectly stating that GCC ignores
> the trailing constraint. GCC does ignore an unknown constraint, as
> long as one accepted constraint.
> 
> Clang accepts "Si" as well. In the absence of
> https://github.com/llvm/llvm-project/pull/80255 (constant offset
> support for 'S'), in TargetLowering::ComputeConstraintToUse, 'S' is
> skipped and 'i' will be picked. This is unrelated to the "rm" weakness
> in Clang (https://github.com/llvm/llvm-project/issues/20571).
> 
> I'll post a PATCH v2 removing JUMP_LABEL_STATIC_KEY_CONSTRAINT and
> using "Si" unconditionally.

If older Clang just ignores the "S" in "Si" as an unsatisfiable
alternative and falls back on "i", that seems ideal.

> > > >> > >  static __always_inline bool arch_static_branch(struct static_key * const key,
> > > >> > >                                              const bool branch)
> > > >> > >  {
> > > >> > > @@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
> > > >> > >                "      .pushsection    __jump_table, \"aw\"    \n\t"
> > > >> > >                "      .align          3                       \n\t"
> > > >> > >                "      .long           1b - ., %l[l_yes] - .   \n\t"
> > > >> > > -              "      .quad           %c0 - .                 \n\t"
> > > >> > > +              "      .quad           %0 + %1 - .             \n\t"
> > > >> > >                "      .popsection                             \n\t"
> > > >> > > -              :  :  "i"(&((char *)key)[branch]) :  : l_yes);
> > > >> > > +              :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);

[...]

> > > >"S"(&((char *)key)[branch]) does indeed seem to do the right thing,
> > > >at least with GCC.
> > > >
> > > >I probably didn't help by bikeshedding the way that expression was
> > > >written, apologies for that.  It's orthogonal to what this patch is
> > > >doing.
> > >
> > > Yes, "S"(&((char *)key)[branch])  would do the right thing.
> > > I have compared assembly output. It's a matter of "s" vs "s + 0" and "s+1" vs "s + 1".

[...]

> > > >Is there a reason why we don't just build the whole kernel with
> > > >-fvisibility=hidden today?
> > >
> > > This topic, loosely related to this patch, is about switching to PIC
> > > compiles. I am not familiar with the Linux kernel, so I'll mostly leave
> > > the discussion to you and Ard :)
> > >
> > > That said, I have done a lot of Clang work in visibility/symbol
> > > preemptibility and linkers, so I am happy to add what I know.
> > >
> > > -fvisibility=hidden only applies to definitions, not non-definition
> > > declarations.
> > >
> > > I've mentioned this at
> > > https://lore.kernel.org/all/20220429070318.iwj3j5lpfkw4t7g2@google.com/
> > >
> > >      `#pragma GCC visibility push(hidden)` is very similar to -fvisibility=hidden -fdirect-access-external-data with Clang.
> > >      ...
> > >      The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
> > >      -fdirect-access-external-data can replace `#pragma GCC visibility
> > >      push(hidden)`.
> > >
> > > >> So building with -fPIC is currently not need in practice, and creates
> > > >> some complications, which is why we have been avoiding it. But PIE
> > > >> linking non-PIC objects is not guaranteed to remain supported going
> > > >> forward, so it would be better to have a plan B, i.e., being able to
> > > >> turn on -fpic without massive regressions due to GOT overhead, or
> > > >> codegen problems with our asm hacks.
> > > >
> > > >Summarising all of this is it right that:
> > > >
> > > >1) ld -pie is how we get the reloc info into the kernel for KASLR
> > > >today.
> > > >
> > > >2) We use gcc -fno-pic today, but this might break with ld -pie in the
> > > >future, although it works for now.
> > > >
> > > >3) gcc -fno-pic and gcc -fPIC (or -fPIE) generate almost the same code,
> > > >assuming we tweak symbol visibility and use a memory model that
> > > >ADR+ADD/LDR can span.  So, moving to -fPIE is likely to be do-able.
> > > >
> > > >
> > > >My point is that an alternative option would be to move to ld -no-pie.
> >
> > Why? What would that achieve?
> >
> > > >We would need another way to get relocs into the kernel, such as an
> > > >intermediate step with ld --emit-relocs.  I have definitely seen this
> > > >done before somewhere, but it would be extra work and not necessarily
> > > >worth it, based on what you say about code gen.
> >
> > Relying on --emit-relocs is a terrible hack. It contains mostly
> > relocations that we don't need at runtime, so we need to postprocess
> > them. And linkers may not update these relocations in some cases after
> > relaxation or other transformations (and sometimes, it is not even
> > possible to do so, if the post-transformation situation cannot be
> > described by a relocation).
>
> Agreed. I've filed
> https://sourceware.org/bugzilla/show_bug.cgi?id=30844
> ("ld riscv: --emit-relocs does not retain the original relocation
> type") for a RISC-V version.
> 
> --emit-relocs is under-specified and the linker behaviors are less tested.

(Aside: This sounds like an interesting hole in the toolchain world.

I guess if you want post-link-time relocatability, it sounds like all
the focus from the compiler folks is on PIC code, and non-PIC linker
output is not considered relocatable at all (baremetal software being an
inconvenient truth).

"Relocatable" and "PIC" are not really the same requirement, but the
importance of the difference is arch-dependent, and I can see how we got
here.  End aside.)

> > And with Fangrui's RELR optimization, the PIE relocation data is
> > extremely dense.
> >
> > There is nothing to fix here, really.
> >
> > > >
> > > >This may all have been discussed to death already -- I didn't mean to
> > > >hijack this patch, so I'll stop digging here, but if I've misunderstood,
> > > >please shout!
> >
> > No worries. Ramana and I spent some time 5+ years ago to document how
> > PIC codegen for the kernel (or any freestanding executable, really)
> > deviates from the typical PIC codegen for shared libraries and PIE
> > executables, with the intention of adding a GCC compiler switch for
> > it, but that never really materialized, mostly because things just
> > work fine today.
> >

Right.  I was concerned that we would pay a significant runtime
performance penalty when migrating to -fPIC, in which case it would be
worth considering alernatives.  But from the discussion it sounds like
this concern is misguided (or at least, you have found robust
workarounds which mean that the real penalty is negligible).

There's no underlying reason why a toolchain couldn't emit full static
relocation information in a robust way, but since this info is hard to
consume and would be used rarely and only for very specific use cases I
can see why this support is rot-prone or even not considered a goal
at all by many people...

Cheers
---Dave
  
Fangrui Song Feb. 2, 2024, 10:28 p.m. UTC | #13
On 2024-02-02, Dave Martin wrote:
>On Thu, Feb 01, 2024 at 02:12:06PM -0800, Fangrui Song wrote:
>> On Thu, Feb 1, 2024 at 1:23 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>> >
>> > On Thu, 1 Feb 2024 at 21:56, Fangrui Song <maskray@google.com> wrote:
>> > >
>> > > On 2024-02-01, Dave Martin wrote:
>> > > >On Thu, Feb 01, 2024 at 05:07:59PM +0100, Ard Biesheuvel wrote:
>> > > >> On Thu, 1 Feb 2024 at 12:50, Dave Martin <Dave.Martin@arm.com> wrote:
>> > > >> >
>> > > >> > On Thu, Feb 01, 2024 at 01:11:20AM -0800, Fangrui Song wrote:
>
>[...]
>
>> > > >> > Is a GCC version check needed?  Or is the minimum GCC version specified
>> > > >> > for building the kernel new enough?
>> > > >> >
>> > > >> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
>> > > >> > > +#else
>> > > >> > > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
>> > > >> > > +#endif
>> > > >> > > +
>> > > >>
>> > > >> Can we use "Si" instead?
>> > > >
>> > > >I thought the point was to avoid "S" on compilers that would choke on
>> > > >it?  If so, those compilers would surely choke on "Si" too, no?
>> > >
>> > > "Si" is an invalid constraint. Unfortunately, GCC recognizes "S" as a
>> > > valid constraint and ignores trailing characters (asm_operand_ok). That
>> > > is, GCC would accept "Siiiii" as well...
>> > >
>> > > The GCC support for "S" is great. The initial aarch64 port in 2012 supports "S".
>
>Ah, no problem.
>
>
>> > So it is not possible to combine the S and i constraint, and let the
>> > compiler figure out whether it meets either? We rely on this elsewhere
>> > by combining r and Z into rZ. x86 uses "rm" for inline asm parameters
>> > that could be either register or memory operands.
>>
>> I did not realize that your "Si" suggestion meant this feature
>> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
>>
>>     After the prefix, there must be one or more additional constraints
>> (see Constraints for asm Operands) that describe where the value
>> resides. Common constraints include ‘r’ for register and ‘m’ for
>> memory. When you list more than one possible location (for example,
>> "=rm"), the compiler chooses the most efficient one based on the
>> current context.
>>
>> After some debugging of both GCC and Clang, I think you are right and
>> "Si" will work. Thanks for the suggestion!
>>
>> GCC accepts "Si". Sorry for my incorrectly stating that GCC ignores
>> the trailing constraint. GCC does ignore an unknown constraint, as
>> long as one accepted constraint.
>>
>> Clang accepts "Si" as well. In the absence of
>> https://github.com/llvm/llvm-project/pull/80255 (constant offset
>> support for 'S'), in TargetLowering::ComputeConstraintToUse, 'S' is
>> skipped and 'i' will be picked. This is unrelated to the "rm" weakness
>> in Clang (https://github.com/llvm/llvm-project/issues/20571).
>>
>> I'll post a PATCH v2 removing JUMP_LABEL_STATIC_KEY_CONSTRAINT and
>> using "Si" unconditionally.
>
>If older Clang just ignores the "S" in "Si" as an unsatisfiable
>alternative and falls back on "i", that seems ideal.
>
>> > > >> > >  static __always_inline bool arch_static_branch(struct static_key * const key,
>> > > >> > >                                              const bool branch)
>> > > >> > >  {
>> > > >> > > @@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
>> > > >> > >                "      .pushsection    __jump_table, \"aw\"    \n\t"
>> > > >> > >                "      .align          3                       \n\t"
>> > > >> > >                "      .long           1b - ., %l[l_yes] - .   \n\t"
>> > > >> > > -              "      .quad           %c0 - .                 \n\t"
>> > > >> > > +              "      .quad           %0 + %1 - .             \n\t"
>> > > >> > >                "      .popsection                             \n\t"
>> > > >> > > -              :  :  "i"(&((char *)key)[branch]) :  : l_yes);
>> > > >> > > +              :  :  JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) :  : l_yes);
>
>[...]
>
>> > > >"S"(&((char *)key)[branch]) does indeed seem to do the right thing,
>> > > >at least with GCC.
>> > > >
>> > > >I probably didn't help by bikeshedding the way that expression was
>> > > >written, apologies for that.  It's orthogonal to what this patch is
>> > > >doing.
>> > >
>> > > Yes, "S"(&((char *)key)[branch])  would do the right thing.
>> > > I have compared assembly output. It's a matter of "s" vs "s + 0" and "s+1" vs "s + 1".
>
>[...]
>
>> > > >Is there a reason why we don't just build the whole kernel with
>> > > >-fvisibility=hidden today?
>> > >
>> > > This topic, loosely related to this patch, is about switching to PIC
>> > > compiles. I am not familiar with the Linux kernel, so I'll mostly leave
>> > > the discussion to you and Ard :)
>> > >
>> > > That said, I have done a lot of Clang work in visibility/symbol
>> > > preemptibility and linkers, so I am happy to add what I know.
>> > >
>> > > -fvisibility=hidden only applies to definitions, not non-definition
>> > > declarations.
>> > >
>> > > I've mentioned this at
>> > > https://lore.kernel.org/all/20220429070318.iwj3j5lpfkw4t7g2@google.com/
>> > >
>> > >      `#pragma GCC visibility push(hidden)` is very similar to -fvisibility=hidden -fdirect-access-external-data with Clang.
>> > >      ...
>> > >      The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
>> > >      -fdirect-access-external-data can replace `#pragma GCC visibility
>> > >      push(hidden)`.
>> > >
>> > > >> So building with -fPIC is currently not need in practice, and creates
>> > > >> some complications, which is why we have been avoiding it. But PIE
>> > > >> linking non-PIC objects is not guaranteed to remain supported going
>> > > >> forward, so it would be better to have a plan B, i.e., being able to
>> > > >> turn on -fpic without massive regressions due to GOT overhead, or
>> > > >> codegen problems with our asm hacks.
>> > > >
>> > > >Summarising all of this is it right that:
>> > > >
>> > > >1) ld -pie is how we get the reloc info into the kernel for KASLR
>> > > >today.
>> > > >
>> > > >2) We use gcc -fno-pic today, but this might break with ld -pie in the
>> > > >future, although it works for now.
>> > > >
>> > > >3) gcc -fno-pic and gcc -fPIC (or -fPIE) generate almost the same code,
>> > > >assuming we tweak symbol visibility and use a memory model that
>> > > >ADR+ADD/LDR can span.  So, moving to -fPIE is likely to be do-able.
>> > > >
>> > > >
>> > > >My point is that an alternative option would be to move to ld -no-pie.
>> >
>> > Why? What would that achieve?
>> >
>> > > >We would need another way to get relocs into the kernel, such as an
>> > > >intermediate step with ld --emit-relocs.  I have definitely seen this
>> > > >done before somewhere, but it would be extra work and not necessarily
>> > > >worth it, based on what you say about code gen.
>> >
>> > Relying on --emit-relocs is a terrible hack. It contains mostly
>> > relocations that we don't need at runtime, so we need to postprocess
>> > them. And linkers may not update these relocations in some cases after
>> > relaxation or other transformations (and sometimes, it is not even
>> > possible to do so, if the post-transformation situation cannot be
>> > described by a relocation).
>>
>> Agreed. I've filed
>> https://sourceware.org/bugzilla/show_bug.cgi?id=30844
>> ("ld riscv: --emit-relocs does not retain the original relocation
>> type") for a RISC-V version.
>>
>> --emit-relocs is under-specified and the linker behaviors are less tested.
>
>(Aside: This sounds like an interesting hole in the toolchain world.
>
>I guess if you want post-link-time relocatability, it sounds like all
>the focus from the compiler folks is on PIC code, and non-PIC linker
>output is not considered relocatable at all (baremetal software being an
>inconvenient truth).
>
>"Relocatable" and "PIC" are not really the same requirement, but the
>importance of the difference is arch-dependent, and I can see how we got
>here.  End aside.)

My background is about toolchains and I know very little about the
kernel.  When I saw CONFIG_RELOCATABLE the first time, I was (and am)
quite confused by term (I was thinking of relocatable linking).  I
believe CONFIG_RELOCATABLE in the kernel context does mean
position-independence.

>> > And with Fangrui's RELR optimization, the PIE relocation data is
>> > extremely dense.
>> >
>> > There is nothing to fix here, really.
>> >
>> > > >
>> > > >This may all have been discussed to death already -- I didn't mean to
>> > > >hijack this patch, so I'll stop digging here, but if I've misunderstood,
>> > > >please shout!
>> >
>> > No worries. Ramana and I spent some time 5+ years ago to document how
>> > PIC codegen for the kernel (or any freestanding executable, really)
>> > deviates from the typical PIC codegen for shared libraries and PIE
>> > executables, with the intention of adding a GCC compiler switch for
>> > it, but that never really materialized, mostly because things just
>> > work fine today.
>> >
>
>Right.  I was concerned that we would pay a significant runtime
>performance penalty when migrating to -fPIC, in which case it would be
>worth considering alernatives.  But from the discussion it sounds like
>this concern is misguided (or at least, you have found robust
>workarounds which mean that the real penalty is negligible).
>
>There's no underlying reason why a toolchain couldn't emit full static
>relocation information in a robust way, but since this info is hard to
>consume and would be used rarely and only for very specific use cases I
>can see why this support is rot-prone or even not considered a goal
>at all by many people...
>
>Cheers
>---Dave

I believe the PIC concept is built around dynamic shared library support
in 1980+, and therefore PIC and dynamic shared library are reall
tangled. Here are some notes from my archaeology
https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table#appendix

     In 1984 USENIX UniForum Conference Proceedings, Transparent
     Implementation of Shared Libraries described a library stub and link
     table scheme which is similar to .plt plus .got.plt used today.
   
     System V release 3 switched to the COFF format. In 1986 Summer USENIX
     Technical Conference & Exhibition Proceedings, Shared Libraries on
     UNIX System V from AT&T described a shared library design. Its shared
     library must have a fixed virtual address, which is called "static
     shared library" in Linkers and Loaders's term.
   
     In 1988, SunOS 4.0 was released with an extended a.out binary format
     with dynamic shared library support. Unlike previous static shared
     library schemes, the a.out shared libraries are position independent
     and can be loaded at different addresses. The GOT and PLT schemes are
     what we see today. In 1992, SunOS 5.0 (Solaris 2.0) switched to ELF.

Concepts like PLT and GOT have been ingrained in the design.  Therefore,
-fPIC compiles lead to PLT-generating and GOT-generating code by
default. Kernels, for security hardening, just need the "relocatable"
capability, not the dynamic library part, would find such code
inefficient.

The 1990+ introduction of ELF visibilities allows eliminating GOT/PLT,
if you use the hidden visibility. It is just not the toolchain defaults
and kernel folks would find themselves have to jump through hoops.

In addition, visibility-related options, if not used very carefully, can
easily shoot yourself in the foot. So there could be some tension
whether toolchain folks want to add extra options.
  

Patch

diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
index 48ddc0f45d22..31862b3bb33d 100644
--- a/arch/arm64/include/asm/jump_label.h
+++ b/arch/arm64/include/asm/jump_label.h
@@ -23,9 +23,9 @@  static __always_inline bool arch_static_branch(struct static_key * const key,
 		 "	.pushsection	__jump_table, \"aw\"	\n\t"
 		 "	.align		3			\n\t"
 		 "	.long		1b - ., %l[l_yes] - .	\n\t"
-		 "	.quad		%c0 - .			\n\t"
+		 "	.quad		%0 - .			\n\t"
 		 "	.popsection				\n\t"
-		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
+		 :  :  "S"(&((char *)key)[branch]) :  : l_yes);
 
 	return false;
 l_yes:
@@ -40,9 +40,9 @@  static __always_inline bool arch_static_branch_jump(struct static_key * const ke
 		 "	.pushsection	__jump_table, \"aw\"	\n\t"
 		 "	.align		3			\n\t"
 		 "	.long		1b - ., %l[l_yes] - .	\n\t"
-		 "	.quad		%c0 - .			\n\t"
+		 "	.quad		%0 - .			\n\t"
 		 "	.popsection				\n\t"
-		 :  :  "i"(&((char *)key)[branch]) :  : l_yes);
+		 :  :  "S"(&((char *)key)[branch]) :  : l_yes);
 
 	return false;
 l_yes: