[-next,V6,1/7] riscv: ftrace: Fixup panic by disabling preemption

Message ID 20230107133549.4192639-2-guoren@kernel.org
State New
Headers
Series riscv: Optimize function trace |

Commit Message

Guo Ren Jan. 7, 2023, 1:35 p.m. UTC
  From: Andy Chiu <andy.chiu@sifive.com>

In RISCV, we must use an AUIPC + JALR pair to encode an immediate,
forming a jump that jumps to an address over 4K. This may cause errors
if we want to enable kernel preemption and remove dependency from
patching code with stop_machine(). For example, if a task was switched
out on auipc. And, if we changed the ftrace function before it was
switched back, then it would jump to an address that has updated 11:0
bits mixing with previous XLEN:12 part.

p: patched area performed by dynamic ftrace
ftrace_prologue:
p|      REG_S   ra, -SZREG(sp)
p|      auipc   ra, 0x? ------------> preempted
					...
				change ftrace function
					...
p|      jalr    -?(ra) <------------- switched back
p|      REG_L   ra, -SZREG(sp)
func:
	xxx
	ret

Fixes: afc76b8b8011 ("riscv: Using PATCHABLE_FUNCTION_ENTRY instead of MCOUNT")
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
---
 arch/riscv/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Mark Rutland Jan. 9, 2023, 5:19 p.m. UTC | #1
On Sat, Jan 07, 2023 at 08:35:43AM -0500, guoren@kernel.org wrote:
> From: Andy Chiu <andy.chiu@sifive.com>
> 
> In RISCV, we must use an AUIPC + JALR pair to encode an immediate,
> forming a jump that jumps to an address over 4K. This may cause errors
> if we want to enable kernel preemption and remove dependency from
> patching code with stop_machine(). For example, if a task was switched
> out on auipc. And, if we changed the ftrace function before it was
> switched back, then it would jump to an address that has updated 11:0
> bits mixing with previous XLEN:12 part.
> 
> p: patched area performed by dynamic ftrace
> ftrace_prologue:
> p|      REG_S   ra, -SZREG(sp)
> p|      auipc   ra, 0x? ------------> preempted
> 					...
> 				change ftrace function
> 					...
> p|      jalr    -?(ra) <------------- switched back
> p|      REG_L   ra, -SZREG(sp)
> func:
> 	xxx
> 	ret

What happens on SMP but not !PREEMPTION; can't a CPU be in the middle of this
while you're patching the sequence?

Do you have any guarantee as to the atomicity and ordering of instruction
fetches?

Thanks,
Mark.

> 
> Fixes: afc76b8b8011 ("riscv: Using PATCHABLE_FUNCTION_ENTRY instead of MCOUNT")
> Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> ---
>  arch/riscv/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index e2b656043abf..ee0d39b26794 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -138,7 +138,7 @@ config RISCV
>  	select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>  	select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
>  	select HAVE_FUNCTION_GRAPH_TRACER
> -	select HAVE_FUNCTION_TRACER if !XIP_KERNEL
> +	select HAVE_FUNCTION_TRACER if !XIP_KERNEL && !PREEMPTION
>  
>  config ARCH_MMAP_RND_BITS_MIN
>  	default 18 if 64BIT
> -- 
> 2.36.1
>
  
Guo Ren Jan. 11, 2023, 1:22 p.m. UTC | #2
On Tue, Jan 10, 2023 at 1:20 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Sat, Jan 07, 2023 at 08:35:43AM -0500, guoren@kernel.org wrote:
> > From: Andy Chiu <andy.chiu@sifive.com>
> >
> > In RISCV, we must use an AUIPC + JALR pair to encode an immediate,
> > forming a jump that jumps to an address over 4K. This may cause errors
> > if we want to enable kernel preemption and remove dependency from
> > patching code with stop_machine(). For example, if a task was switched
> > out on auipc. And, if we changed the ftrace function before it was
> > switched back, then it would jump to an address that has updated 11:0
> > bits mixing with previous XLEN:12 part.
> >
> > p: patched area performed by dynamic ftrace
> > ftrace_prologue:
> > p|      REG_S   ra, -SZREG(sp)
> > p|      auipc   ra, 0x? ------------> preempted
> >                                       ...
> >                               change ftrace function
> >                                       ...
> > p|      jalr    -?(ra) <------------- switched back
> > p|      REG_L   ra, -SZREG(sp)
> > func:
> >       xxx
> >       ret
>
> What happens on SMP but not !PREEMPTION; can't a CPU be in the middle of this
> while you're patching the sequence?
Yes, when PREEMPTION, a timer interrupt between auipc & jalr may cause
context_switch. And riscv uses stop_machine for patch_text. Then, we
may modify auipc part, but only execute the jalr part when return.

>
> Do you have any guarantee as to the atomicity and ordering of instruction
> fetches?
Not yet. If the region is short, we could use nop + jalr pair instead.
Only one jalr instruction makes the entry atomicity.

There are already several proposed solutions:
1. Make stop_machine guarantee all CPU out of preemption point.
2. Expand -fpatchable-function-entry from 4 to 24, and make detour
codes atomicity.
3. We want to propose a solution to make auipc by hardware mask_irq.
For more details, see:
https://www.youtube.com/watch?v=4JkkkXuEvCw


>
> Thanks,
> Mark.
>
> >
> > Fixes: afc76b8b8011 ("riscv: Using PATCHABLE_FUNCTION_ENTRY instead of MCOUNT")
> > Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
> > Signed-off-by: Guo Ren <guoren@kernel.org>
> > ---
> >  arch/riscv/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index e2b656043abf..ee0d39b26794 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -138,7 +138,7 @@ config RISCV
> >       select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
> >       select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
> >       select HAVE_FUNCTION_GRAPH_TRACER
> > -     select HAVE_FUNCTION_TRACER if !XIP_KERNEL
> > +     select HAVE_FUNCTION_TRACER if !XIP_KERNEL && !PREEMPTION
> >
> >  config ARCH_MMAP_RND_BITS_MIN
> >       default 18 if 64BIT
> > --
> > 2.36.1
> >



--
Best Regards
 Guo Ren
  
Mark Rutland Jan. 12, 2023, 12:05 p.m. UTC | #3
On Wed, Jan 11, 2023 at 09:22:09PM +0800, Guo Ren wrote:
> On Tue, Jan 10, 2023 at 1:20 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Sat, Jan 07, 2023 at 08:35:43AM -0500, guoren@kernel.org wrote:
> > > From: Andy Chiu <andy.chiu@sifive.com>
> > >
> > > In RISCV, we must use an AUIPC + JALR pair to encode an immediate,
> > > forming a jump that jumps to an address over 4K. This may cause errors
> > > if we want to enable kernel preemption and remove dependency from
> > > patching code with stop_machine(). For example, if a task was switched
> > > out on auipc. And, if we changed the ftrace function before it was
> > > switched back, then it would jump to an address that has updated 11:0
> > > bits mixing with previous XLEN:12 part.
> > >
> > > p: patched area performed by dynamic ftrace
> > > ftrace_prologue:
> > > p|      REG_S   ra, -SZREG(sp)
> > > p|      auipc   ra, 0x? ------------> preempted
> > >                                       ...
> > >                               change ftrace function
> > >                                       ...
> > > p|      jalr    -?(ra) <------------- switched back
> > > p|      REG_L   ra, -SZREG(sp)
> > > func:
> > >       xxx
> > >       ret
> >
> > What happens on SMP but not !PREEMPTION; can't a CPU be in the middle of this
> > while you're patching the sequence?
> Yes, when PREEMPTION, a timer interrupt between auipc & jalr may cause
> context_switch. And riscv uses stop_machine for patch_text. Then, we
> may modify auipc part, but only execute the jalr part when return.

Please re-read my question; "!PREEMPTION" means "NOT PREEMPTION".

Ignore preeemption entirely and assume two CPUs X and Y are running code
concurrently. Assume CPU X is in the ftrace prologue, and CPU Y is patching
that prologue while CPU X is executing it.

Is that prevented somehow? If not, what happens in that case?

At the very least you can have exactly the same case as on a preemptible kernel
(and in a VM, the hypervisor can preempt the guest ata arbitrary times),
becuase CPU X could end up executing a mixture of the old and new instructions.

More generally, if you don't have strong rules about concurrent modification
and execution of instructions, it may not be safe to modify and instruction as
it is being executed (e.g. if the CPU's instruction fetches aren't atomic).

> > Do you have any guarantee as to the atomicity and ordering of instruction
> > fetches?
> Not yet. If the region is short, we could use nop + jalr pair instead.

Ok, so as above I do not understand how this is safe. Maybe I am missing
something, but if you don't have a guarantee as to ordering I don't see how you
can safely patch this even if you have atomicity of each instruction update.

Note that if you don't have atomicity of instruction fetches you *cannot*
safely concurrently modify and execute instructions.

> Only one jalr instruction makes the entry atomicity.

I'll have to take your word for that.

As above, I don't think this sequence is safe regardless.

> There are already several proposed solutions:
> 1. Make stop_machine guarantee all CPU out of preemption point.
> 2. Expand -fpatchable-function-entry from 4 to 24, and make detour
> codes atomicity.
> 3. We want to propose a solution to make auipc by hardware mask_irq.
> For more details, see:
> https://www.youtube.com/watch?v=4JkkkXuEvCw

Ignoring things which require HW changes, you could consider doing something
like what I'm doing for arm64 with DYNAMIC_FTRACE_WITH_CALL_OPS:

  https://lore.kernel.org/lkml/20230109135828.879136-1-mark.rutland@arm.com/

... which would replace the address generation with a load, which can be
atomic, and would give you a number of other benefits (e.g. avoiding branch
range limitations, performance benefits as in the cover letter).

Thanks,
Mark.
  
Guo Ren Jan. 28, 2023, 10 a.m. UTC | #4
On Thu, Jan 12, 2023 at 8:05 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Wed, Jan 11, 2023 at 09:22:09PM +0800, Guo Ren wrote:
> > On Tue, Jan 10, 2023 at 1:20 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > On Sat, Jan 07, 2023 at 08:35:43AM -0500, guoren@kernel.org wrote:
> > > > From: Andy Chiu <andy.chiu@sifive.com>
> > > >
> > > > In RISCV, we must use an AUIPC + JALR pair to encode an immediate,
> > > > forming a jump that jumps to an address over 4K. This may cause errors
> > > > if we want to enable kernel preemption and remove dependency from
> > > > patching code with stop_machine(). For example, if a task was switched
> > > > out on auipc. And, if we changed the ftrace function before it was
> > > > switched back, then it would jump to an address that has updated 11:0
> > > > bits mixing with previous XLEN:12 part.
> > > >
> > > > p: patched area performed by dynamic ftrace
> > > > ftrace_prologue:
> > > > p|      REG_S   ra, -SZREG(sp)
> > > > p|      auipc   ra, 0x? ------------> preempted
> > > >                                       ...
> > > >                               change ftrace function
> > > >                                       ...
> > > > p|      jalr    -?(ra) <------------- switched back
> > > > p|      REG_L   ra, -SZREG(sp)
> > > > func:
> > > >       xxx
> > > >       ret
> > >
> > > What happens on SMP but not !PREEMPTION; can't a CPU be in the middle of this
> > > while you're patching the sequence?
> > Yes, when PREEMPTION, a timer interrupt between auipc & jalr may cause
> > context_switch. And riscv uses stop_machine for patch_text. Then, we
> > may modify auipc part, but only execute the jalr part when return.
>
> Please re-read my question; "!PREEMPTION" means "NOT PREEMPTION".
>
> Ignore preeemption entirely and assume two CPUs X and Y are running code
> concurrently. Assume CPU X is in the ftrace prologue, and CPU Y is patching
> that prologue while CPU X is executing it.
>
> Is that prevented somehow? If not, what happens in that case?
>
> At the very least you can have exactly the same case as on a preemptible kernel
> (and in a VM, the hypervisor can preempt the guest ata arbitrary times),
> becuase CPU X could end up executing a mixture of the old and new instructions.
>
> More generally, if you don't have strong rules about concurrent modification
> and execution of instructions, it may not be safe to modify and instruction as
> it is being executed (e.g. if the CPU's instruction fetches aren't atomic).
>
> > > Do you have any guarantee as to the atomicity and ordering of instruction
> > > fetches?
> > Not yet. If the region is short, we could use nop + jalr pair instead.
>
> Ok, so as above I do not understand how this is safe. Maybe I am missing
> something, but if you don't have a guarantee as to ordering I don't see how you
> can safely patch this even if you have atomicity of each instruction update.
>
> Note that if you don't have atomicity of instruction fetches you *cannot*
> safely concurrently modify and execute instructions.
>
> > Only one jalr instruction makes the entry atomicity.
>
> I'll have to take your word for that.
>
> As above, I don't think this sequence is safe regardless.
>
> > There are already several proposed solutions:
> > 1. Make stop_machine guarantee all CPU out of preemption point.
> > 2. Expand -fpatchable-function-entry from 4 to 24, and make detour
> > codes atomicity.
> > 3. We want to propose a solution to make auipc by hardware mask_irq.
> > For more details, see:
> > https://www.youtube.com/watch?v=4JkkkXuEvCw
>
> Ignoring things which require HW changes, you could consider doing something
> like what I'm doing for arm64 with DYNAMIC_FTRACE_WITH_CALL_OPS:
>
>   https://lore.kernel.org/lkml/20230109135828.879136-1-mark.rutland@arm.com/
The idea of DYNAMIC_FTRACE_WITH_CALL_OPS (Using data load/store +
indirect jump instead of auipc+jalr) is similar to Andy's solution
(See youtube link, last page of ppt). But the key problem is you also
expand the size of the prologue of the function. 64BIT is already
expensive, and we can't afford more of it. I would change to seek a
new atomic auipc+jalr ISA extension to solve this problem.

DYNAMIC_FTRACE_WITH_CALL_OPS  would speed up ftrace_(regs)_caller
(Mostly for kernel debug), but it won't help
DYNAMIC_FTRACE_WITH_DIRECT_CALLS. So I do not so care about the
ftrace_(regs)_caller performance gain.

>
> ... which would replace the address generation with a load, which can be
> atomic, and would give you a number of other benefits (e.g. avoiding branch
> range limitations, performance benefits as in the cover letter).
>
> Thanks,
> Mark.
  
Guo Ren Jan. 29, 2023, 5:36 a.m. UTC | #5
On Sat, Jan 28, 2023 at 6:00 PM Guo Ren <guoren@kernel.org> wrote:
>
> On Thu, Jan 12, 2023 at 8:05 PM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Wed, Jan 11, 2023 at 09:22:09PM +0800, Guo Ren wrote:
> > > On Tue, Jan 10, 2023 at 1:20 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > On Sat, Jan 07, 2023 at 08:35:43AM -0500, guoren@kernel.org wrote:
> > > > > From: Andy Chiu <andy.chiu@sifive.com>
> > > > >
> > > > > In RISCV, we must use an AUIPC + JALR pair to encode an immediate,
> > > > > forming a jump that jumps to an address over 4K. This may cause errors
> > > > > if we want to enable kernel preemption and remove dependency from
> > > > > patching code with stop_machine(). For example, if a task was switched
> > > > > out on auipc. And, if we changed the ftrace function before it was
> > > > > switched back, then it would jump to an address that has updated 11:0
> > > > > bits mixing with previous XLEN:12 part.
> > > > >
> > > > > p: patched area performed by dynamic ftrace
> > > > > ftrace_prologue:
> > > > > p|      REG_S   ra, -SZREG(sp)
> > > > > p|      auipc   ra, 0x? ------------> preempted
> > > > >                                       ...
> > > > >                               change ftrace function
> > > > >                                       ...
> > > > > p|      jalr    -?(ra) <------------- switched back
> > > > > p|      REG_L   ra, -SZREG(sp)
> > > > > func:
> > > > >       xxx
> > > > >       ret
> > > >
> > > > What happens on SMP but not !PREEMPTION; can't a CPU be in the middle of this
> > > > while you're patching the sequence?
> > > Yes, when PREEMPTION, a timer interrupt between auipc & jalr may cause
> > > context_switch. And riscv uses stop_machine for patch_text. Then, we
> > > may modify auipc part, but only execute the jalr part when return.
> >
> > Please re-read my question; "!PREEMPTION" means "NOT PREEMPTION".
> >
> > Ignore preeemption entirely and assume two CPUs X and Y are running code
> > concurrently. Assume CPU X is in the ftrace prologue, and CPU Y is patching
> > that prologue while CPU X is executing it.
> >
> > Is that prevented somehow? If not, what happens in that case?
> >
> > At the very least you can have exactly the same case as on a preemptible kernel
> > (and in a VM, the hypervisor can preempt the guest ata arbitrary times),
> > becuase CPU X could end up executing a mixture of the old and new instructions.
> >
> > More generally, if you don't have strong rules about concurrent modification
> > and execution of instructions, it may not be safe to modify and instruction as
> > it is being executed (e.g. if the CPU's instruction fetches aren't atomic).
> >
> > > > Do you have any guarantee as to the atomicity and ordering of instruction
> > > > fetches?
> > > Not yet. If the region is short, we could use nop + jalr pair instead.
> >
> > Ok, so as above I do not understand how this is safe. Maybe I am missing
> > something, but if you don't have a guarantee as to ordering I don't see how you
> > can safely patch this even if you have atomicity of each instruction update.
> >
> > Note that if you don't have atomicity of instruction fetches you *cannot*
> > safely concurrently modify and execute instructions.
> >
> > > Only one jalr instruction makes the entry atomicity.
> >
> > I'll have to take your word for that.
> >
> > As above, I don't think this sequence is safe regardless.
> >
> > > There are already several proposed solutions:
> > > 1. Make stop_machine guarantee all CPU out of preemption point.
> > > 2. Expand -fpatchable-function-entry from 4 to 24, and make detour
> > > codes atomicity.
> > > 3. We want to propose a solution to make auipc by hardware mask_irq.
> > > For more details, see:
> > > https://www.youtube.com/watch?v=4JkkkXuEvCw
> >
> > Ignoring things which require HW changes, you could consider doing something
> > like what I'm doing for arm64 with DYNAMIC_FTRACE_WITH_CALL_OPS:
> >
> >   https://lore.kernel.org/lkml/20230109135828.879136-1-mark.rutland@arm.com/
> The idea of DYNAMIC_FTRACE_WITH_CALL_OPS (Using data load/store +
> indirect jump instead of auipc+jalr) is similar to Andy's solution
> (See youtube link, last page of ppt). But the key problem is you also
> expand the size of the prologue of the function. 64BIT is already
> expensive, and we can't afford more of it. I would change to seek a
> new atomic auipc+jalr ISA extension to solve this problem.
The atomicity here means:
 - auipc + jalr won't be interrupted
 - auipc + jalr should be aligned by 64bit, then one sd instruction
could update them in atomic.


>
> DYNAMIC_FTRACE_WITH_CALL_OPS  would speed up ftrace_(regs)_caller
> (Mostly for kernel debug), but it won't help
> DYNAMIC_FTRACE_WITH_DIRECT_CALLS. So I do not so care about the
> ftrace_(regs)_caller performance gain.
>
> >
> > ... which would replace the address generation with a load, which can be
> > atomic, and would give you a number of other benefits (e.g. avoiding branch
> > range limitations, performance benefits as in the cover letter).
> >
> > Thanks,
> > Mark.
>
>
>
> --
> Best Regards
>  Guo Ren
  
Mark Rutland Jan. 30, 2023, 11:17 a.m. UTC | #6
On Sat, Jan 28, 2023 at 06:00:20PM +0800, Guo Ren wrote:
> On Thu, Jan 12, 2023 at 8:05 PM Mark Rutland <mark.rutland@arm.com> wrote:

> > Ignoring things which require HW changes, you could consider doing something
> > like what I'm doing for arm64 with DYNAMIC_FTRACE_WITH_CALL_OPS:
> >
> >   https://lore.kernel.org/lkml/20230109135828.879136-1-mark.rutland@arm.com/
> The idea of DYNAMIC_FTRACE_WITH_CALL_OPS (Using data load/store +
> indirect jump instead of auipc+jalr) is similar to Andy's solution
> (See youtube link, last page of ppt).

Sure; I was present in that room and I spoke with Andy at the time.

The solutions are similar, but the important detail with
DYNAMIC_FTRACE_WITH_CALL_OPS is that the load and indirect branch is moved into
a common trampoline so that each call-site can be smaller. The ops pointer is
placed *before* the function entry point and doesn't need to be skipped with a
direct branch (which Andy's approach could also do if he aligned functions
similarly).

> But the key problem is you also expand the size of the prologue of the
> function. 64BIT is already expensive, and we can't afford more of it. I would
> change to seek a new atomic auipc+jalr ISA extension to solve this problem.

Sure, and that's nice for *new* hardware, but I'm talking about a solution
which works on *current* hardware.

> DYNAMIC_FTRACE_WITH_CALL_OPS  would speed up ftrace_(regs)_caller (Mostly for
> kernel debug), but it won't help DYNAMIC_FTRACE_WITH_DIRECT_CALLS. So I do
> not so care about the ftrace_(regs)_caller performance gain.

Actually, the plan is that it *will* help DYNAMIC_FTRACE_WITH_DIRECT_CALLS; we
just didn't make all the necessary changes in one go.

Florent Revest is looking at implementing that by placing the direct call
pointer into the ops, so the common trampoline can load that directly.

He has an older draft available at:

  https://github.com/FlorentRevest/linux/commits/indirect-direct-calls-3

... and since then, having spoken to Steven, we came up with a plan to make all
direct calls require an ops (which is the case for DIRECT_CALLS_MULTI), and
place a trampoline pointer in the ops.

That way, the common trampoline can do something like (in arm64 asm):

| 	LDR	<tmp>, [<ops>, #OPS_TRAMP_PTR]
| 	CBNZ	<tmp>, __call_tramp_directly
| 
| 	// ... regular regs trampoline logic here 
| 
| __call_tramp_directly:
| 
| 	// ... shuffle registers here
| 	
| 	BR	<tmp>

... and I believe the same should work for riscv.

Thanks,
Mark.
  
Guo Ren Feb. 7, 2023, 2:31 a.m. UTC | #7
On Mon, Jan 30, 2023 at 7:17 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Sat, Jan 28, 2023 at 06:00:20PM +0800, Guo Ren wrote:
> > On Thu, Jan 12, 2023 at 8:05 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> > > Ignoring things which require HW changes, you could consider doing something
> > > like what I'm doing for arm64 with DYNAMIC_FTRACE_WITH_CALL_OPS:
> > >
> > >   https://lore.kernel.org/lkml/20230109135828.879136-1-mark.rutland@arm.com/
> > The idea of DYNAMIC_FTRACE_WITH_CALL_OPS (Using data load/store +
> > indirect jump instead of auipc+jalr) is similar to Andy's solution
> > (See youtube link, last page of ppt).
>
> Sure; I was present in that room and I spoke with Andy at the time.
>
> The solutions are similar, but the important detail with
> DYNAMIC_FTRACE_WITH_CALL_OPS is that the load and indirect branch is moved into
> a common trampoline so that each call-site can be smaller. The ops pointer is
> placed *before* the function entry point and doesn't need to be skipped with a
> direct branch (which Andy's approach could also do if he aligned functions
> similarly).
>
> > But the key problem is you also expand the size of the prologue of the
> > function. 64BIT is already expensive, and we can't afford more of it. I would
> > change to seek a new atomic auipc+jalr ISA extension to solve this problem.
>
> Sure, and that's nice for *new* hardware, but I'm talking about a solution
> which works on *current* hardware.
>
> > DYNAMIC_FTRACE_WITH_CALL_OPS  would speed up ftrace_(regs)_caller (Mostly for
> > kernel debug), but it won't help DYNAMIC_FTRACE_WITH_DIRECT_CALLS. So I do
> > not so care about the ftrace_(regs)_caller performance gain.
>
> Actually, the plan is that it *will* help DYNAMIC_FTRACE_WITH_DIRECT_CALLS; we
> just didn't make all the necessary changes in one go.
>
> Florent Revest is looking at implementing that by placing the direct call
> pointer into the ops, so the common trampoline can load that directly.
>
> He has an older draft available at:
>
>   https://github.com/FlorentRevest/linux/commits/indirect-direct-calls-3
Thx for sharing :)

>
> ... and since then, having spoken to Steven, we came up with a plan to make all
> direct calls require an ops (which is the case for DIRECT_CALLS_MULTI), and
> place a trampoline pointer in the ops.
>
> That way, the common trampoline can do something like (in arm64 asm):
>
> |       LDR     <tmp>, [<ops>, #OPS_TRAMP_PTR]
> |       CBNZ    <tmp>, __call_tramp_directly
> |
> |       // ... regular regs trampoline logic here
> |
> | __call_tramp_directly:
> |
> |       // ... shuffle registers here
> |
> |       BR      <tmp>
>
> ... and I believe the same should work for riscv.
I agree; I would try next.

>
> Thanks,
> Mark.
  

Patch

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e2b656043abf..ee0d39b26794 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -138,7 +138,7 @@  config RISCV
 	select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
 	select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
 	select HAVE_FUNCTION_GRAPH_TRACER
-	select HAVE_FUNCTION_TRACER if !XIP_KERNEL
+	select HAVE_FUNCTION_TRACER if !XIP_KERNEL && !PREEMPTION
 
 config ARCH_MMAP_RND_BITS_MIN
 	default 18 if 64BIT