[RESEND,2/6] x86/traps: add a system interrupt table for system interrupt dispatch

Message ID 20221110061545.1531-3-xin3.li@intel.com
State New
Headers
Series x86/traps,VMX: implement software based NMI/IRQ dispatch for VMX NMI/IRQ reinjection |

Commit Message

Li, Xin3 Nov. 10, 2022, 6:15 a.m. UTC
  Upon receiving an external interrupt, KVM VMX reinjects it through
calling the interrupt handler in its IDT descriptor on the current
kernel stack, which essentially uses the IDT as an interrupt dispatch
table.

However the IDT is one of the lowest level critical data structures
between a x86 CPU and the Linux kernel, we should avoid using it
*directly* whenever possible, espeically in a software defined manner.

On x86, external interrupts are divided into the following groups
  1) system interrupts
  2) external device interrupts
With the IDT, system interrupts are dispatched through the IDT
directly, while external device interrupts are all routed to the
external interrupt dispatch function common_interrupt(), which
dispatches external device interrupts through a per-CPU external
interrupt dispatch table vector_irq.

To eliminate dispatching external interrupts through the IDT, add
a system interrupt handler table for dispatching a system interrupt
to its corresponding handler directly. Thus a software based dispatch
function will be:

  void external_interrupt(struct pt_regs *regs, u8 vector)
  {
    if (is_system_interrupt(vector))
      system_interrupt_handler_table[vector_to_sysvec(vector)](regs);
    else /* external device interrupt */
      common_interrupt(regs, vector);
  }

What's more, with the Intel FRED (Flexible Return and Event Delivery)
architecture, IDT, the hardware based event dispatch table, is gone,
and the Linux kernel needs to dispatch events to their handlers with
vector to handler mappings, the dispatch function external_interrupt()
is also needed.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/traps.h |  8 ++++++
 arch/x86/kernel/traps.c      | 55 ++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)
  

Comments

Peter Zijlstra Nov. 10, 2022, 8:56 a.m. UTC | #1
On Wed, Nov 09, 2022 at 10:15:41PM -0800, Xin Li wrote:
> Upon receiving an external interrupt, KVM VMX reinjects it through
> calling the interrupt handler in its IDT descriptor on the current
> kernel stack, which essentially uses the IDT as an interrupt dispatch
> table.
> 
> However the IDT is one of the lowest level critical data structures
> between a x86 CPU and the Linux kernel, we should avoid using it
> *directly* whenever possible, espeically in a software defined manner.
> 
> On x86, external interrupts are divided into the following groups
>   1) system interrupts
>   2) external device interrupts
> With the IDT, system interrupts are dispatched through the IDT
> directly, while external device interrupts are all routed to the
> external interrupt dispatch function common_interrupt(), which
> dispatches external device interrupts through a per-CPU external
> interrupt dispatch table vector_irq.
> 
> To eliminate dispatching external interrupts through the IDT, add
> a system interrupt handler table for dispatching a system interrupt
> to its corresponding handler directly. Thus a software based dispatch
> function will be:
> 
>   void external_interrupt(struct pt_regs *regs, u8 vector)
>   {
>     if (is_system_interrupt(vector))
>       system_interrupt_handler_table[vector_to_sysvec(vector)](regs);
>     else /* external device interrupt */
>       common_interrupt(regs, vector);
>   }
> 
> What's more, with the Intel FRED (Flexible Return and Event Delivery)
> architecture, IDT, the hardware based event dispatch table, is gone,
> and the Linux kernel needs to dispatch events to their handlers with
> vector to handler mappings, the dispatch function external_interrupt()
> is also needed.
> 
> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>

This is not a valid SOB, it would suggest hpa is the author, but he's
not in in From.

> ---
>  arch/x86/include/asm/traps.h |  8 ++++++
>  arch/x86/kernel/traps.c      | 55 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 63 insertions(+)
> 
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index 47ecfff2c83d..3dc63d753bda 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -47,4 +47,12 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
>  				      struct stack_info *info);
>  #endif
>  
> +/*
> + * How system interrupt handlers are called.
> + */
> +#define DECLARE_SYSTEM_INTERRUPT_HANDLER(f)			\
> +	void f (struct pt_regs *regs __maybe_unused,		\
> +		unsigned long vector __maybe_unused)
> +typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));
> +
>  #endif /* _ASM_X86_TRAPS_H */
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 178015a820f0..95dd917ef9ad 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -1444,6 +1444,61 @@ DEFINE_IDTENTRY_SW(iret_error)
>  }
>  #endif
>  
> +#define SYSV(x,y) [(x) - FIRST_SYSTEM_VECTOR] = (system_interrupt_handler)y
> +
> +#pragma GCC diagnostic push
> +#pragma GCC diagnostic ignored "-Wcast-function-type"

How does this not break CFI ?
  
Li, Xin3 Nov. 10, 2022, 7:55 p.m. UTC | #2
> > Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> > Signed-off-by: Xin Li <xin3.li@intel.com>
> 
> This is not a valid SOB, it would suggest hpa is the author, but he's not in in
> From.

HPA wrote the initial dispatch code for FRED, and I worked with him to
refactor it for KVM/VMX NMI/IRQ dispatch.  So use SOB from both.  No?

> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index
> > 178015a820f0..95dd917ef9ad 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -1444,6 +1444,61 @@ DEFINE_IDTENTRY_SW(iret_error)  }  #endif
> >
> > +#define SYSV(x,y) [(x) - FIRST_SYSTEM_VECTOR] =
> > +(system_interrupt_handler)y
> > +
> > +#pragma GCC diagnostic push
> > +#pragma GCC diagnostic ignored "-Wcast-function-type"
> 
> How does this not break CFI ?

I wasn't aware of it, will check.
  
Li, Xin3 Nov. 10, 2022, 8:36 p.m. UTC | #3
> > > +#pragma GCC diagnostic push
> > > +#pragma GCC diagnostic ignored "-Wcast-function-type"
> >
> > How does this not break CFI ?
> 
> I wasn't aware of it, will check.

CFI needs $(cc-option,-fsanitize=kcfi), which, reported on LWN on Jun, 2002,
had not yet landed in the LLVM mainline (I'm using GCC).  So looks we are
replying on people keeping an eye on it to make sure it's not broken?

Even we are unable to test it now, we should find a solution.

Thanks!
Xin
  
Nathan Chancellor Nov. 10, 2022, 9:12 p.m. UTC | #4
On Thu, Nov 10, 2022 at 08:36:30PM +0000, Li, Xin3 wrote:
> > > > +#pragma GCC diagnostic push
> > > > +#pragma GCC diagnostic ignored "-Wcast-function-type"
> > >
> > > How does this not break CFI ?
> > 
> > I wasn't aware of it, will check.
> 
> CFI needs $(cc-option,-fsanitize=kcfi), which, reported on LWN on Jun, 2002,
> had not yet landed in the LLVM mainline (I'm using GCC).  So looks we are
> replying on people keeping an eye on it to make sure it's not broken?

Well, the entire point of the warning that you are disabling here is to
catch potential CFI failures at compile time, rather than run time :)

Clang also has -Wcast-function-type-strict, which Gustavo and Kees are
working on getting enabled, so that even more CFI failures can be caught
at compile time.

https://github.com/ClangBuiltLinux/linux/issues/1724
https://lore.kernel.org/all/?q=-Wcast-function-type-strict

Cheers,
Nathan
  
Li, Xin3 Nov. 10, 2022, 11 p.m. UTC | #5
> > > > > +#pragma GCC diagnostic push
> > > > > +#pragma GCC diagnostic ignored "-Wcast-function-type"
> > > >
> > > > How does this not break CFI ?
> > >
> > > I wasn't aware of it, will check.
> >
> > CFI needs $(cc-option,-fsanitize=kcfi), which, reported on LWN on Jun,
> > 2002, had not yet landed in the LLVM mainline (I'm using GCC).  So
> > looks we are replying on people keeping an eye on it to make sure it's not
> broken?
> 
> Well, the entire point of the warning that you are disabling here is to catch
> potential CFI failures at compile time, rather than run time :)

Oh, of course I didn't intend to be opposed to CFI.

> 
> Clang also has -Wcast-function-type-strict, which Gustavo and Kees are working
> on getting enabled, so that even more CFI failures can be caught at compile
> time.
> 
> https://github.com/ClangBuiltLinux/linux/issues/1724
> https://lore.kernel.org/all/?q=-Wcast-function-type-strict
> 
> Cheers,
> Nathan
  
Nathan Chancellor Nov. 11, 2022, 12:08 a.m. UTC | #6
On Thu, Nov 10, 2022 at 11:00:51PM +0000, Li, Xin3 wrote:
> > > > > > +#pragma GCC diagnostic push
> > > > > > +#pragma GCC diagnostic ignored "-Wcast-function-type"
> > > > >
> > > > > How does this not break CFI ?
> > > >
> > > > I wasn't aware of it, will check.
> > >
> > > CFI needs $(cc-option,-fsanitize=kcfi), which, reported on LWN on Jun,
> > > 2002, had not yet landed in the LLVM mainline (I'm using GCC).  So
> > > looks we are replying on people keeping an eye on it to make sure it's not
> > broken?
> > 
> > Well, the entire point of the warning that you are disabling here is to catch
> > potential CFI failures at compile time, rather than run time :)
> 
> Oh, of course I didn't intend to be opposed to CFI.

Oh, I apologize if I gave the impression that you were, I did not mean
to put words in your mouth! I was just giving additional context as to
why we have that warning enable and how we use it to help catch
potential run time failures at compile time, which does not require
running CFI to keep CFI happy.

Cheers,
Nathan

> > Clang also has -Wcast-function-type-strict, which Gustavo and Kees are working
> > on getting enabled, so that even more CFI failures can be caught at compile
> > time.
> > 
> > https://github.com/ClangBuiltLinux/linux/issues/1724
> > https://lore.kernel.org/all/?q=-Wcast-function-type-strict
> > 
> > Cheers,
> > Nathan
  
Tian, Kevin Nov. 11, 2022, 1:12 a.m. UTC | #7
> From: Li, Xin3 <xin3.li@intel.com>
> Sent: Friday, November 11, 2022 3:55 AM
> 
> > > Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> > > Signed-off-by: Xin Li <xin3.li@intel.com>
> >
> > This is not a valid SOB, it would suggest hpa is the author, but he's not in in
> > From.
> 
> HPA wrote the initial dispatch code for FRED, and I worked with him to
> refactor it for KVM/VMX NMI/IRQ dispatch.  So use SOB from both.  No?
> 

If main work was done by HPA while your work is trivial:

  From: H. Peter Anvin (Intel) <hpa@zytor.com>

  commit msg

  Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
  Signed-off-by: Xin Li <xin3.li@intel.com>

if your contribution is non-trivial:

  From: H. Peter Anvin (Intel) <hpa@zytor.com>

  commit msg

  Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
  Co-developed-by: Xin Li <xin3.li@intel.com>
  Signed-off-by: Xin Li <xin3.li@intel.com>

It's documented in Documentation/process/submitting-patches.rst
  
Li, Xin3 Nov. 11, 2022, 3:03 a.m. UTC | #8
> > > Well, the entire point of the warning that you are disabling here is
> > > to catch potential CFI failures at compile time, rather than run
> > > time :)
> >
> > Oh, of course I didn't intend to be opposed to CFI.
> 
> Oh, I apologize if I gave the impression that you were, I did not mean to put
> words in your mouth! I was just giving additional context as to why we have
> that warning enable and how we use it to help catch potential run time failures
> at compile time, which does not require running CFI to keep CFI happy.

No, I don't feel impression at all :), maybe just a little bit upset that
I didn't realize it at the beginning.

I actually appreciate you provided additional context into this thread.
Xin
  
Li, Xin3 Nov. 11, 2022, 3:54 a.m. UTC | #9
> > > > Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> > > > Signed-off-by: Xin Li <xin3.li@intel.com>
> > >
> > > This is not a valid SOB, it would suggest hpa is the author, but
> > > he's not in in From.
> >
> > HPA wrote the initial dispatch code for FRED, and I worked with him to
> > refactor it for KVM/VMX NMI/IRQ dispatch.  So use SOB from both.  No?
> >
> 
> If main work was done by HPA while your work is trivial:
> 
>   From: H. Peter Anvin (Intel) <hpa@zytor.com>
> 
>   commit msg
> 
>   Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
>   Signed-off-by: Xin Li <xin3.li@intel.com>
> 
> if your contribution is non-trivial:
> 
>   From: H. Peter Anvin (Intel) <hpa@zytor.com>
> 
>   commit msg
> 
>   Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
>   Co-developed-by: Xin Li <xin3.li@intel.com>
>   Signed-off-by: Xin Li <xin3.li@intel.com>
> 
> It's documented in Documentation/process/submitting-patches.rst

Thanks, Kevin.

>
  
Peter Zijlstra Nov. 11, 2022, 8:55 a.m. UTC | #10
On Thu, Nov 10, 2022 at 07:55:22PM +0000, Li, Xin3 wrote:
> > > Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> > > Signed-off-by: Xin Li <xin3.li@intel.com>
> > 
> > This is not a valid SOB, it would suggest hpa is the author, but he's not in in
> > From.
> 
> HPA wrote the initial dispatch code for FRED, and I worked with him to
> refactor it for KVM/VMX NMI/IRQ dispatch.  So use SOB from both.  No?

Yes, but not like this. Please review the documentation we have on this.
  
Peter Zijlstra Nov. 11, 2022, 8:58 a.m. UTC | #11
On Thu, Nov 10, 2022 at 08:36:30PM +0000, Li, Xin3 wrote:
> > > > +#pragma GCC diagnostic push
> > > > +#pragma GCC diagnostic ignored "-Wcast-function-type"
> > >
> > > How does this not break CFI ?
> > 
> > I wasn't aware of it, will check.
> 
> CFI needs $(cc-option,-fsanitize=kcfi), which, reported on LWN on Jun, 2002,
> had not yet landed in the LLVM mainline (I'm using GCC).  So looks we are
> replying on people keeping an eye on it to make sure it's not broken?

We're relying on that warning you disabled.

> Even we are unable to test it now, we should find a solution.

You can test this just fine. Build llvm.git and compile a kernel with
CFI_CLANG=y. This is not hard, my ADL runs such a kernel as we speak.
  
H. Peter Anvin Nov. 11, 2022, 10:07 p.m. UTC | #12
On November 10, 2022 11:55:22 AM PST, "Li, Xin3" <xin3.li@intel.com> wrote:
>> > Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
>> > Signed-off-by: Xin Li <xin3.li@intel.com>
>> 
>> This is not a valid SOB, it would suggest hpa is the author, but he's not in in
>> From.
>
>HPA wrote the initial dispatch code for FRED, and I worked with him to
>refactor it for KVM/VMX NMI/IRQ dispatch.  So use SOB from both.  No?
>
>> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index
>> > 178015a820f0..95dd917ef9ad 100644
>> > --- a/arch/x86/kernel/traps.c
>> > +++ b/arch/x86/kernel/traps.c
>> > @@ -1444,6 +1444,61 @@ DEFINE_IDTENTRY_SW(iret_error)  }  #endif
>> >
>> > +#define SYSV(x,y) [(x) - FIRST_SYSTEM_VECTOR] =
>> > +(system_interrupt_handler)y
>> > +
>> > +#pragma GCC diagnostic push
>> > +#pragma GCC diagnostic ignored "-Wcast-function-type"
>> 
>> How does this not break CFI ?
>
>I wasn't aware of it, will check.
>

It doesn't break CFI because the arguments passed is always a strict superset of the ones expected and they are free enough that they are always passed in registers.
  
Peter Zijlstra Nov. 12, 2022, 9:47 a.m. UTC | #13
On Fri, Nov 11, 2022 at 02:07:05PM -0800, H. Peter Anvin wrote:
> On November 10, 2022 11:55:22 AM PST, "Li, Xin3" <xin3.li@intel.com> wrote:
> >> > Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
> >> > Signed-off-by: Xin Li <xin3.li@intel.com>
> >> 
> >> This is not a valid SOB, it would suggest hpa is the author, but he's not in in
> >> From.
> >
> >HPA wrote the initial dispatch code for FRED, and I worked with him to
> >refactor it for KVM/VMX NMI/IRQ dispatch.  So use SOB from both.  No?
> >
> >> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index
> >> > 178015a820f0..95dd917ef9ad 100644
> >> > --- a/arch/x86/kernel/traps.c
> >> > +++ b/arch/x86/kernel/traps.c
> >> > @@ -1444,6 +1444,61 @@ DEFINE_IDTENTRY_SW(iret_error)  }  #endif
> >> >
> >> > +#define SYSV(x,y) [(x) - FIRST_SYSTEM_VECTOR] =
> >> > +(system_interrupt_handler)y
> >> > +
> >> > +#pragma GCC diagnostic push
> >> > +#pragma GCC diagnostic ignored "-Wcast-function-type"
> >> 
> >> How does this not break CFI ?
> >
> >I wasn't aware of it, will check.
> >
> 
> It doesn't break CFI because the arguments passed is always a strict
> superset of the ones expected and they are free enough that they are
> always passed in registers.

It does break CFI because the signature hash doesn't match and you'll
trigger an explicit UD2.
  

Patch

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..3dc63d753bda 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,12 @@  void __noreturn handle_stack_overflow(struct pt_regs *regs,
 				      struct stack_info *info);
 #endif
 
+/*
+ * How system interrupt handlers are called.
+ */
+#define DECLARE_SYSTEM_INTERRUPT_HANDLER(f)			\
+	void f (struct pt_regs *regs __maybe_unused,		\
+		unsigned long vector __maybe_unused)
+typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 178015a820f0..95dd917ef9ad 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1444,6 +1444,61 @@  DEFINE_IDTENTRY_SW(iret_error)
 }
 #endif
 
+#define SYSV(x,y) [(x) - FIRST_SYSTEM_VECTOR] = (system_interrupt_handler)y
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wcast-function-type"
+
+/*
+ * The initializer spurious_interrupt() has two arguments of types struct
+ * pt_regs * and unsigned long, and the system interrupt handlers with
+ * prefix sysvec_ are all defined with either DEFINE_IDTENTRY_SYSVEC or
+ * DEFINE_IDTENTRY_SYSVEC_SIMPLE, both with only one argument of type
+ * struct pt_regs *. Because all handlers only declare and require a subset
+ * of the arguments provided by the full system_interrupt_handler prototype,
+ * the function type cast is safe here.
+ */
+const system_interrupt_handler system_interrupt_handler_table[NR_SYSTEM_VECTORS] = {
+	[0 ... NR_SYSTEM_VECTORS-1]		= spurious_interrupt,
+#ifdef CONFIG_SMP
+	SYSV(RESCHEDULE_VECTOR,			sysvec_reschedule_ipi),
+	SYSV(CALL_FUNCTION_VECTOR,		sysvec_call_function),
+	SYSV(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single),
+	SYSV(REBOOT_VECTOR,			sysvec_reboot),
+#endif
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+	SYSV(THERMAL_APIC_VECTOR,		sysvec_thermal),
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+	SYSV(THRESHOLD_APIC_VECTOR,		sysvec_threshold),
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+	SYSV(DEFERRED_ERROR_VECTOR,		sysvec_deferred_error),
+#endif
+
+#ifdef CONFIG_X86_LOCAL_APIC
+	SYSV(LOCAL_TIMER_VECTOR,		sysvec_apic_timer_interrupt),
+	SYSV(X86_PLATFORM_IPI_VECTOR,		sysvec_x86_platform_ipi),
+# ifdef CONFIG_HAVE_KVM
+	SYSV(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi),
+	SYSV(POSTED_INTR_WAKEUP_VECTOR,		sysvec_kvm_posted_intr_wakeup_ipi),
+	SYSV(POSTED_INTR_NESTED_VECTOR,		sysvec_kvm_posted_intr_nested_ipi),
+# endif
+# ifdef CONFIG_IRQ_WORK
+	SYSV(IRQ_WORK_VECTOR,			sysvec_irq_work),
+# endif
+	SYSV(SPURIOUS_APIC_VECTOR,		sysvec_spurious_apic_interrupt),
+	SYSV(ERROR_APIC_VECTOR,			sysvec_error_interrupt),
+#endif
+};
+
+#pragma GCC diagnostic pop
+
+#undef SYSV
+
 void __init trap_init(void)
 {
 	/* Init cpu_entry_area before IST entries are set up */