[1/2] x86/hyperv: Expose an helper to map PCI interrupts

Message ID 168079870998.14175.16015623662679754647.stgit@skinsburskii.localdomain
State New
Series Fix MSI interrupts for nested Hyper-V root partition

Commit Message

Stanislav Kinsburskii April 6, 2023, 4:33 p.m. UTC
  From: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>

This patch moves a part of the currently internal logic into the new
hv_map_msi_interrupt function and makes it a globally available helper,
which will be used to map PCI interrupts in the case of a root partition.

Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Wei Liu <wei.liu@kernel.org>
CC: Dexuan Cui <decui@microsoft.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: x86@kernel.org
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: linux-hyperv@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/hyperv/irqdomain.c     |   40 +++++++++++++++++++++++++++------------
 arch/x86/include/asm/mshyperv.h |    2 ++
 2 files changed, 30 insertions(+), 12 deletions(-)
  

Comments

Stanislav Kinsburskii April 12, 2023, 4:19 p.m. UTC | #1
On Thu, Apr 13, 2023 at 03:51:09PM +0200, Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 09:33, Stanislav Kinsburskii wrote:
> > This patch moves
> 
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html#submittingpatches
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
> 

Thanks. I'll elaborate on the rationale in the next revision.

> > a part of currently internal logic into the new
> > hv_map_msi_interrupt function and makes it globally available helper,
> > which will be used to map PCI interrupts in case of root partition.
> 
> > -static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
> > -				struct hv_interrupt_entry *entry)
> > +/**
> > + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
> 
> So if you need to put "" on Map then maybe your function is
> misnomed. Either it maps or it does not, right?
> 

Thanks, I'll remove the quotation marks in the next update.

> > + * @data:      Describes the IRQ
> > + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
> > + *
> > + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
> > + */
> > +int hv_map_msi_interrupt(struct irq_data *data,
> > +			 struct hv_interrupt_entry *out_entry)
> >  {
> > -	union hv_device_id device_id = hv_build_pci_dev_id(dev);
> > +	struct msi_desc *msidesc;
> > +	struct pci_dev *dev;
> > +	union hv_device_id device_id;
> > +	struct hv_interrupt_entry dummy, *entry;
> > +	struct irq_cfg *cfg = irqd_cfg(data);
> > +	const cpumask_t *affinity;
> > +	int cpu, vector;
> > +
> > +	msidesc = irq_data_get_msi_desc(data);
> > +	dev = msi_desc_to_pci_dev(msidesc);
> > +	device_id = hv_build_pci_dev_id(dev);
> > +	affinity = irq_data_get_effective_affinity_mask(data);
> > +	cpu = cpumask_first_and(affinity, cpu_online_mask);
> 
> The effective affinity mask of MSI interrupts consists only of online
> CPUs, to be accurate: it has exactly one online CPU set.
> 
> But even if it would have only offline CPUs then the result would be:
> 
>     cpu = nr_cpu_ids
> 
> which is definitely invalid. While a disabled vector targeted to an
> offline CPU is not necessarily invalid.
> 

Thank you for diving into this logic.

Although this patch only tosses the code and doens't make any functional
changes, I guess if the fix for the the cpu is found has is required, it
has to be in a separated patch.

Would you mind to elaborate more of the problem(s)?
Do you mean that the result of cpumask_first_and has to be checked for not
being >= nr_cpus_ids?
Or do you mean there is no need to check the affinity against
cpu_online_mask at all ans we can simply take any first bit from the
effective affinity mask?

Also, could ou elaborate more on the disabled vector target to an
offline CPU? Is there any use case for such scenario (in this case we
might want to support it)?

I guess the goal of this code is to make sure that hypervisor on't be
configured to deliver MSI to an online CPU.

Thanks,
Stanislav

> Thanks,
> 
>         tglx
  
Stanislav Kinsburskii April 12, 2023, 4:36 p.m. UTC | #2
On Wed, Apr 12, 2023 at 09:19:51AM -0700, Stanislav Kinsburskii wrote:
> On Thu, Apr 13, 2023 at 03:51:09PM +0200, Thomas Gleixner wrote:
> > On Thu, Apr 06 2023 at 09:33, Stanislav Kinsburskii wrote:
> > > This patch moves
> > 
> > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#submittingpatches
> > https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
> > 
> 
> Thanks. I'll elaborate on the reationale in the next revision.
> 
> > > a part of currently internal logic into the new
> > > hv_map_msi_interrupt function and makes it globally available helper,
> > > which will be used to map PCI interrupts in case of root partition.
> > 
> > > -static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
> > > -				struct hv_interrupt_entry *entry)
> > > +/**
> > > + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
> > 
> > So if you need to put "" on Map then maybe your function is
> > misnomed. Either it maps or it does not, right?
> > 
> 
> Thanks, I'll remove the quotation marks in the next update.
> 
> > > + * @data:      Describes the IRQ
> > > + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
> > > + *
> > > + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
> > > + */
> > > +int hv_map_msi_interrupt(struct irq_data *data,
> > > +			 struct hv_interrupt_entry *out_entry)
> > >  {
> > > -	union hv_device_id device_id = hv_build_pci_dev_id(dev);
> > > +	struct msi_desc *msidesc;
> > > +	struct pci_dev *dev;
> > > +	union hv_device_id device_id;
> > > +	struct hv_interrupt_entry dummy, *entry;
> > > +	struct irq_cfg *cfg = irqd_cfg(data);
> > > +	const cpumask_t *affinity;
> > > +	int cpu, vector;
> > > +
> > > +	msidesc = irq_data_get_msi_desc(data);
> > > +	dev = msi_desc_to_pci_dev(msidesc);
> > > +	device_id = hv_build_pci_dev_id(dev);
> > > +	affinity = irq_data_get_effective_affinity_mask(data);
> > > +	cpu = cpumask_first_and(affinity, cpu_online_mask);
> > 
> > The effective affinity mask of MSI interrupts consists only of online
> > CPUs, to be accurate: it has exactly one online CPU set.
> > 
> > But even if it would have only offline CPUs then the result would be:
> > 
> >     cpu = nr_cpu_ids
> > 
> > which is definitely invalid. While a disabled vector targeted to an
> > offline CPU is not necessarily invalid.
> > 
> 
> Thank you for diving into this logic.
> 
> Although this patch only tosses the code and doens't make any functional
> changes, I guess if the fix for the the cpu is found has is required, it
> has to be in a separated patch.
> 
> Would you mind to elaborate more of the problem(s)?
> Do you mean that the result of cpumask_first_and has to be checked for not
> being >= nr_cpus_ids?
> Or do you mean there is no need to check the affinity against
> cpu_online_mask at all ans we can simply take any first bit from the
> effective affinity mask?
> 
> Also, could ou elaborate more on the disabled vector target to an
> offline CPU? Is there any use case for such scenario (in this case we
> might want to support it)?
> 
> I guess the goal of this code is to make sure that hypervisor on't be
> configured to deliver MSI to an online CPU.
> 

I'm sorry, there were a lot of typos.
Let me try again:

Thank you for diving into this logic.

Although this patch only moves code around and doesn't make any functional
changes, I guess if a fix for the used cpu id is required, it has to
be in a separate patch.

Would you mind elaborating on the problem(s)?
Do you mean that the result of cpumask_first_and has to be checked for not
being >= nr_cpu_ids?
Or do you mean that there is no need to check the irq affinity against
cpu_online_mask at all and we can simply take any first bit from the
effective affinity mask?

Also, could you elaborate more on the disabled vector targeting an
offline CPU? Is there any use case for such a scenario (in this case we
might want to support it)?

I guess the goal of this code is to make sure that the hypervisor won't be
configured to deliver an MSI to an offline CPU.

Thanks,
Stanislav

> Thanks,
> Stanislav
> 
> > Thanks,
> > 
> >         tglx
  
Stanislav Kinsburskii April 12, 2023, 8:31 p.m. UTC | #3
On Fri, Apr 14, 2023 at 09:28:52AM +0200, Thomas Gleixner wrote:
> Stanislav!
> 
> On Wed, Apr 12 2023 at 09:36, Stanislav Kinsburskii wrote:
> > On Wed, Apr 12, 2023 at 09:19:51AM -0700, Stanislav Kinsburskii wrote:
> >> > > +	affinity = irq_data_get_effective_affinity_mask(data);
> >> > > +	cpu = cpumask_first_and(affinity, cpu_online_mask);
> >> > 
> >> > The effective affinity mask of MSI interrupts consists only of online
> >> > CPUs, to be accurate: it has exactly one online CPU set.
> >> > 
> >> > But even if it would have only offline CPUs then the result would be:
> >> > 
> >> >     cpu = nr_cpu_ids
> >> > 
> >> > which is definitely invalid. While a disabled vector targeted to an
> >> > offline CPU is not necessarily invalid.
> >
> > Although this patch only tosses the code and doens't make any functional
> > changes, I guess if the fix for the used cpu id is required, it has to
> > be in a separated patch.
> 
> Correct, but if the interrupt _is_ masked at the MSI level then the
> hypervisor must not deliver an interrupt at all.
> 
> The point is that it is valid to target a masked MSI entry to an offline
> CPU under the assumption that the hardware/emulation respects the
> masking. Whether that's a good idea or not is a different question.
> 
> The kernel as of today does not do that. It targets unused but
> configured MSI[-x] entries towards MANAGED_IRQ_SHUTDOWN_VECTOR on CPU0
> for various reasons, one of them being paranoia.
> 
> But in principle there is nothing wrong with that and it should either
> succeed or being rejected at the software level and not expose a
> completely invalid CPU number to the hypercall in the first place.
> 
> So if you want to be defensive, then keep the _and(), but then check the
> result for being valid and emit something useful like a pr_warn_once()
> instead of blindly handing the invalid result to the hypercall and then
> have that reject it with some undecipherable error code.
> 
> Actually it would not necessarily reach the hypercall because before
> that it dereferences cpumask_of(nr_cpu_ids) here:
> 
> 	nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set),	cpumask_of(cpu));
> 
> and explode with a kernel pagefault. If not it will read some random
> adjacent data and try to create a vp_set from it. Neither of that is
> anywhere close to correct.
> 

Thank you Thomas.
I sent a patch to address the problems you highlighted:

"x86/hyperv: Fix IRQ effective cpu discovery for the interrupts unmasking"

I'll update this series after that patch is merged.

Thanks,
Stanislav

> Thanks,
> 
>         tglx
  
Thomas Gleixner April 13, 2023, 1:51 p.m. UTC | #4
On Thu, Apr 06 2023 at 09:33, Stanislav Kinsburskii wrote:
> This patch moves

https://www.kernel.org/doc/html/latest/process/submitting-patches.html#submittingpatches
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog

> a part of currently internal logic into the new
> hv_map_msi_interrupt function and makes it globally available helper,
> which will be used to map PCI interrupts in case of root partition.

> -static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
> -				struct hv_interrupt_entry *entry)
> +/**
> + * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.

So if you need to put "" on Map then maybe your function is
misnamed. Either it maps or it does not, right?

> + * @data:      Describes the IRQ
> + * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
> + *
> + * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
> + */
> +int hv_map_msi_interrupt(struct irq_data *data,
> +			 struct hv_interrupt_entry *out_entry)
>  {
> -	union hv_device_id device_id = hv_build_pci_dev_id(dev);
> +	struct msi_desc *msidesc;
> +	struct pci_dev *dev;
> +	union hv_device_id device_id;
> +	struct hv_interrupt_entry dummy, *entry;
> +	struct irq_cfg *cfg = irqd_cfg(data);
> +	const cpumask_t *affinity;
> +	int cpu, vector;
> +
> +	msidesc = irq_data_get_msi_desc(data);
> +	dev = msi_desc_to_pci_dev(msidesc);
> +	device_id = hv_build_pci_dev_id(dev);
> +	affinity = irq_data_get_effective_affinity_mask(data);
> +	cpu = cpumask_first_and(affinity, cpu_online_mask);

The effective affinity mask of MSI interrupts consists only of online
CPUs, to be accurate: it has exactly one online CPU set.

But even if it would have only offline CPUs then the result would be:

    cpu = nr_cpu_ids

which is definitely invalid. While a disabled vector targeted to an
offline CPU is not necessarily invalid.

Thanks,

        tglx
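
For readers following the thread, the corner case described above can be sketched as a small userspace model. It assumes a CPU mask fits in one 64-bit word; mask_first_and() and NR_CPU_IDS are hypothetical stand-ins for cpumask_first_and() and nr_cpu_ids, not the kernel implementation.

```c
/*
 * Userspace model only. NR_CPU_IDS and mask_first_and() are hypothetical
 * stand-ins for the kernel's nr_cpu_ids and cpumask_first_and().
 */
#define NR_CPU_IDS 8u

/*
 * Return the index of the first bit set in both masks, or NR_CPU_IDS
 * when the two masks have no bit in common.
 */
static unsigned int mask_first_and(unsigned long long a, unsigned long long b)
{
	unsigned long long both = a & b;
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPU_IDS; cpu++) {
		if (both & (1ULL << cpu))
			return cpu;
	}

	/* Empty intersection: the result is not a valid CPU number. */
	return NR_CPU_IDS;
}
```

With CPUs 0-3 online (mask 0x0f), an effective affinity of 0x04 yields CPU 2, while an affinity of 0x20 (only the offline CPU 5) yields NR_CPU_IDS, which must not be used as a CPU number.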
  
Thomas Gleixner April 14, 2023, 7:28 a.m. UTC | #5
Stanislav!

On Wed, Apr 12 2023 at 09:36, Stanislav Kinsburskii wrote:
> On Wed, Apr 12, 2023 at 09:19:51AM -0700, Stanislav Kinsburskii wrote:
>> > > +	affinity = irq_data_get_effective_affinity_mask(data);
>> > > +	cpu = cpumask_first_and(affinity, cpu_online_mask);
>> > 
>> > The effective affinity mask of MSI interrupts consists only of online
>> > CPUs, to be accurate: it has exactly one online CPU set.
>> > 
>> > But even if it would have only offline CPUs then the result would be:
>> > 
>> >     cpu = nr_cpu_ids
>> > 
>> > which is definitely invalid. While a disabled vector targeted to an
>> > offline CPU is not necessarily invalid.
>
> Although this patch only tosses the code and doens't make any functional
> changes, I guess if the fix for the used cpu id is required, it has to
> be in a separated patch.

Sure.

> Would you mind to elaborate more of the problem(s)?
> Do you mean that the result of cpumask_first_and has to be checked for not
> being >= nr_cpus_ids?
> Or do you mean that there is no need to check the irq affinity against
> cpu_online_mask at all and we can simply take any first bit from the
> effective affinity mask?

As of today the effective mask of MSI interrupts contains only online
CPUs. I don't see a reason for that to change.

> Also, could you elaborate more on the disabled vector targeting an
> offline CPU? Is there any use case for such scenario (in this case we
> might want to support it)?

I'm not aware of one today. That was more a theoretical reasoning.

> I guess the goal of this code is to make sure that hypervisor won't be
> configured to deliver an MSI to an offline CPU.

Correct, but if the interrupt _is_ masked at the MSI level then the
hypervisor must not deliver an interrupt at all.

The point is that it is valid to target a masked MSI entry to an offline
CPU under the assumption that the hardware/emulation respects the
masking. Whether that's a good idea or not is a different question.

The kernel as of today does not do that. It targets unused but
configured MSI[-x] entries towards MANAGED_IRQ_SHUTDOWN_VECTOR on CPU0
for various reasons, one of them being paranoia.

But in principle there is nothing wrong with that and it should either
succeed or be rejected at the software level and not expose a
completely invalid CPU number to the hypercall in the first place.

So if you want to be defensive, then keep the _and(), but then check the
result for being valid and emit something useful like a pr_warn_once()
instead of blindly handing the invalid result to the hypercall and then
have that reject it with some undecipherable error code.

Actually it would not necessarily reach the hypercall because before
that it dereferences cpumask_of(nr_cpu_ids) here:

	nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set),	cpumask_of(cpu));

and explode with a kernel pagefault. If not it will read some random
adjacent data and try to create a vp_set from it. Neither of those is
anywhere close to correct.

Thanks,

        tglx
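
The defensive variant suggested above (keep the _and(), validate the result, warn once) might look roughly like the following userspace model. All names here are hypothetical: pick_target_cpu() and first_common_cpu() are illustrative helpers, NR_CPU_IDS stands in for nr_cpu_ids, and the one-time fprintf() stands in for pr_warn_once(). It is a sketch of the idea, not the fix that was actually submitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPU_IDS 8u	/* hypothetical stand-in for nr_cpu_ids */

/* Counts emitted warnings so that callers can observe the warn-once rule. */
static unsigned int warn_count;

/* Model of cpumask_first_and(): first common bit, or NR_CPU_IDS if none. */
static unsigned int first_common_cpu(unsigned long long a, unsigned long long b)
{
	unsigned long long both = a & b;
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPU_IDS; cpu++) {
		if (both & (1ULL << cpu))
			return cpu;
	}
	return NR_CPU_IDS;
}

/*
 * Hypothetical helper: choose the target CPU for the mapping. On success
 * store it in *cpu and return 0. If no CPU is both in the effective
 * affinity mask and online, warn once and return -EINVAL without touching
 * *cpu, so an invalid CPU number never reaches the later steps.
 */
static int pick_target_cpu(unsigned long long affinity,
			   unsigned long long online, unsigned int *cpu)
{
	static bool warned;
	unsigned int first = first_common_cpu(affinity, online);

	if (first >= NR_CPU_IDS) {
		if (!warned) {
			warned = true;
			warn_count++;
			fprintf(stderr, "no online CPU in effective affinity mask\n");
		}
		return -EINVAL;
	}

	*cpu = first;
	return 0;
}
```

Rejecting the empty intersection in software gives the caller a readable error instead of passing an out-of-range index on to the mask lookup and the hypercall.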
  

Patch

diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index 42c70d28ef27..fd9c487726e3 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -169,13 +169,35 @@  static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
 	return dev_id;
 }
 
-static int hv_map_msi_interrupt(struct pci_dev *dev, int cpu, int vector,
-				struct hv_interrupt_entry *entry)
+/**
+ * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
+ * @data:      Describes the IRQ
+ * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
+ *
+ * Map the IRQ in the hypervisor by issuing a MAP_DEVICE_INTERRUPT hypercall.
+ */
+int hv_map_msi_interrupt(struct irq_data *data,
+			 struct hv_interrupt_entry *out_entry)
 {
-	union hv_device_id device_id = hv_build_pci_dev_id(dev);
+	struct msi_desc *msidesc;
+	struct pci_dev *dev;
+	union hv_device_id device_id;
+	struct hv_interrupt_entry dummy, *entry;
+	struct irq_cfg *cfg = irqd_cfg(data);
+	const cpumask_t *affinity;
+	int cpu, vector;
+
+	msidesc = irq_data_get_msi_desc(data);
+	dev = msi_desc_to_pci_dev(msidesc);
+	device_id = hv_build_pci_dev_id(dev);
+	affinity = irq_data_get_effective_affinity_mask(data);
+	cpu = cpumask_first_and(affinity, cpu_online_mask);
+	entry = out_entry ? out_entry : &dummy;
+	vector = cfg->vector;
 
 	return hv_map_interrupt(device_id, false, cpu, vector, entry);
 }
+EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
 
 static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
 {
@@ -190,10 +212,8 @@  static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 {
 	struct msi_desc *msidesc;
 	struct pci_dev *dev;
-	struct hv_interrupt_entry out_entry, *stored_entry;
+	struct hv_interrupt_entry *stored_entry;
 	struct irq_cfg *cfg = irqd_cfg(data);
-	const cpumask_t *affinity;
-	int cpu;
 	u64 status;
 
 	msidesc = irq_data_get_msi_desc(data);
@@ -204,9 +224,6 @@  static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		return;
 	}
 
-	affinity = irq_data_get_effective_affinity_mask(data);
-	cpu = cpumask_first_and(affinity, cpu_online_mask);
-
 	if (data->chip_data) {
 		/*
 		 * This interrupt is already mapped. Let's unmap first.
@@ -235,15 +252,14 @@  static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		return;
 	}
 
-	status = hv_map_msi_interrupt(dev, cpu, cfg->vector, &out_entry);
+	status = hv_map_msi_interrupt(data, stored_entry);
 	if (status != HV_STATUS_SUCCESS) {
 		kfree(stored_entry);
 		return;
 	}
 
-	*stored_entry = out_entry;
 	data->chip_data = stored_entry;
-	entry_to_msi_msg(&out_entry, msg);
+	entry_to_msi_msg(data->chip_data, msg);
 
 	return;
 }
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 4c4c0ec3b62e..aa0e83acacbd 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -203,6 +203,8 @@  static inline void hv_apic_init(void) {}
 
 struct irq_domain *hv_create_pci_msi_domain(void);
 
+int hv_map_msi_interrupt(struct irq_data *data,
+			 struct hv_interrupt_entry *out_entry);
 int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
 		struct hv_interrupt_entry *entry);
 int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);