[v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI

Message ID 20221104222953.11356-1-decui@microsoft.com
State New
Headers
Series [v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI |

Commit Message

Dexuan Cui Nov. 4, 2022, 10:29 p.m. UTC
  Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")

It turns out that the third patch (b4b77778ecc5) causes a performance
regression because all the interrupts now happen on 1 physical CPU (or two
pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
devices, it may suffer from soft lockups if the workload is heavy, e.g.,
see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/

Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
hv_irq_unmask() -> hv_arch_irq_unmask() ->
hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
consequently the hypervisor only uses 1 target pCPU for all the interrupts.

Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
is determinted the second time, the vCPU in the effective affinity mask is
used (i.e., it isn't always vCPU0), so the hypervisor chooses different
pCPU for each interrupt.

The hypercall will be fixed in future to update the pCPU as well, but
that will take quite a while, so let's restore the old behavior in
hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
manner for each PCI device, so the interrupts of different devices can
happen on different pCPUs, though the interrupts of each device happen on
some single pCPU.

The hypercall fix may not be backported to all old versions of Hyper-V, so
we want to have this guest side change for ever (or at least till we're sure
the old affected versions of Hyper-V are no longer supported).

Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>

---

v1 is here:
  https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/

Changes in v2:
  round-robin the vCPU for multi-MSI.
  The commit message is re-worked.
  Added Jeff and Carl's Co-developed-by and Signed-off-by.

Changes in v3:
  Michael Kelley kindly helped to make a great comment, and I added the
  comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!

  Rebased to Hyper-V tree's "hyperv-fixes" branch:
      https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes

  Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
  through the Hyper-V tree because it's rebased to another hv_pci patch (which
  only exists in the Hyper-V tree for now):
      e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()") 

  BTW, Michael has some other hv_pci patches, which would also need go through
      the Hyper-V tree:
      https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/


 drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++-----
 1 file changed, 75 insertions(+), 15 deletions(-)
  

Comments

Michael Kelley (LINUX) Nov. 4, 2022, 11:38 p.m. UTC | #1
From: Dexuan Cui <decui@microsoft.com> Sent: Friday, November 4, 2022 3:30 PM
> 
> Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> 
> It turns out that the third patch (b4b77778ecc5) causes a performance
> regression because all the interrupts now happen on 1 physical CPU (or two
> pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> 
> Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> hv_irq_unmask() -> hv_arch_irq_unmask() ->
> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> 
> Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> is determinted the second time, the vCPU in the effective affinity mask is
> used (i.e., it isn't always vCPU0), so the hypervisor chooses different
> pCPU for each interrupt.
> 
> The hypercall will be fixed in future to update the pCPU as well, but
> that will take quite a while, so let's restore the old behavior in
> hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> manner for each PCI device, so the interrupts of different devices can
> happen on different pCPUs, though the interrupts of each device happen on
> some single pCPU.
> 
> The hypercall fix may not be backported to all old versions of Hyper-V, so
> we want to have this guest side change for ever (or at least till we're sure
> the old affected versions of Hyper-V are no longer supported).
> 
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> ---
> 
> v1 is here:
> 
> https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> 
> Changes in v2:
>   round-robin the vCPU for multi-MSI.
>   The commit message is re-worked.
>   Added Jeff and Carl's Co-developed-by and Signed-off-by.
> 
> Changes in v3:
>   Michael Kelley kindly helped to make a great comment, and I added the
>   comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
> 
>   Rebased to Hyper-V tree's "hyperv-fixes" branch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
> 
>   Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
>   through the Hyper-V tree because it's rebased to another hv_pci patch (which
>   only exists in the Hyper-V tree for now):
>       e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()")
> 
>   BTW, Michael has some other hv_pci patches, which would also need go through
>       the Hyper-V tree:
> 
> https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/
> 
> 
>  drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++-----
>  1 file changed, 75 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index ba64284eaf9f..fa5a1ba35a82 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct
> pci_response *resp,
>  }
> 
>  static u32 hv_compose_msi_req_v1(
> -	struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
> +	struct pci_create_interrupt *int_pkt,
>  	u32 slot, u8 vector, u16 vector_count)
>  {
>  	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1(
>  	return sizeof(*int_pkt);
>  }
> 
> +/*
> + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and
> + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be
> + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V
> + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
> + * not irrelevant because Hyper-V chooses the physical CPU to handle the
> + * interrupts based on the vCPU specified in message sent to the vPCI VSP in
> + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> + * but assigning too many vPCI device interrupts to the same pCPU can cause a
> + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-
> V
> + * to spread out the pCPUs that it selects.
> + *
> + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> + * to always return the same dummy vCPU, because a second call to
> + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
> + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> + * the same pCPU, even though the vCPUs will be spread out by later calls
> + * to hv_irq_unmask(), but that is the best we can do now.
> + *
> + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
> + * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an
> + * enhancement is planned for a future version. With that enhancement, the
> + * dummy vCPU selection won't matter, and interrupts for the same multi-MSI
> + * device will be spread across multiple pCPUs.
> + */
> +
>  /*
>   * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten
>   * by subsequent retarget in hv_irq_unmask().
> @@ -1640,18 +1669,39 @@ static int hv_compose_msi_req_get_cpu(const struct
> cpumask *affinity)
>  	return cpumask_first_and(affinity, cpu_online_mask);
>  }
> 
> -static u32 hv_compose_msi_req_v2(
> -	struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity,
> -	u32 slot, u8 vector, u16 vector_count)
> +/*
> + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0.
> + */
> +static int hv_compose_multi_msi_req_get_cpu(void)
>  {
> +	static DEFINE_SPINLOCK(multi_msi_cpu_lock);
> +
> +	/* -1 means starting with CPU 0 */
> +	static int cpu_next = -1;
> +
> +	unsigned long flags;
>  	int cpu;
> 
> +	spin_lock_irqsave(&multi_msi_cpu_lock, flags);
> +
> +	cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
> +				     false);
> +	cpu = cpu_next;
> +
> +	spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);
> +
> +	return cpu;
> +}
> +
> +static u32 hv_compose_msi_req_v2(
> +	struct pci_create_interrupt2 *int_pkt, int cpu,
> +	u32 slot, u8 vector, u16 vector_count)
> +{
>  	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2;
>  	int_pkt->wslot.slot = slot;
>  	int_pkt->int_desc.vector = vector;
>  	int_pkt->int_desc.vector_count = vector_count;
>  	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>  	int_pkt->int_desc.processor_array[0] =
>  		hv_cpu_number_to_vp_number(cpu);
>  	int_pkt->int_desc.processor_count = 1;
> @@ -1660,18 +1710,15 @@ static u32 hv_compose_msi_req_v2(
>  }
> 
>  static u32 hv_compose_msi_req_v3(
> -	struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity,
> +	struct pci_create_interrupt3 *int_pkt, int cpu,
>  	u32 slot, u32 vector, u16 vector_count)
>  {
> -	int cpu;
> -
>  	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3;
>  	int_pkt->wslot.slot = slot;
>  	int_pkt->int_desc.vector = vector;
>  	int_pkt->int_desc.reserved = 0;
>  	int_pkt->int_desc.vector_count = vector_count;
>  	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>  	int_pkt->int_desc.processor_array[0] =
>  		hv_cpu_number_to_vp_number(cpu);
>  	int_pkt->int_desc.processor_count = 1;
> @@ -1715,12 +1762,18 @@ static void hv_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
>  			struct pci_create_interrupt3 v3;
>  		} int_pkts;
>  	} __packed ctxt;
> +	bool multi_msi;
>  	u64 trans_id;
>  	u32 size;
>  	int ret;
> +	int cpu;
> +
> +	msi_desc  = irq_data_get_msi_desc(data);
> +	multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
> +		    msi_desc->nvec_used > 1;
> 
>  	/* Reuse the previous allocation */
> -	if (data->chip_data) {
> +	if (data->chip_data && multi_msi) {
>  		int_desc = data->chip_data;
>  		msg->address_hi = int_desc->address >> 32;
>  		msg->address_lo = int_desc->address & 0xffffffff;
> @@ -1728,7 +1781,6 @@ static void hv_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
>  		return;
>  	}
> 
> -	msi_desc  = irq_data_get_msi_desc(data);
>  	pdev = msi_desc_to_pci_dev(msi_desc);
>  	dest = irq_data_get_effective_affinity_mask(data);
>  	pbus = pdev->bus;
> @@ -1738,11 +1790,18 @@ static void hv_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
>  	if (!hpdev)
>  		goto return_null_message;
> 
> +	/* Free any previous message that might have already been composed. */
> +	if (data->chip_data && !multi_msi) {
> +		int_desc = data->chip_data;
> +		data->chip_data = NULL;
> +		hv_int_desc_free(hpdev, int_desc);
> +	}
> +
>  	int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
>  	if (!int_desc)
>  		goto drop_reference;
> 
> -	if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
> +	if (multi_msi) {
>  		/*
>  		 * If this is not the first MSI of Multi MSI, we already have
>  		 * a mapping.  Can exit early.
> @@ -1767,9 +1826,11 @@ static void hv_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
>  		 */
>  		vector = 32;
>  		vector_count = msi_desc->nvec_used;
> +		cpu = hv_compose_multi_msi_req_get_cpu();
>  	} else {
>  		vector = hv_msi_get_int_vector(data);
>  		vector_count = 1;
> +		cpu = hv_compose_msi_req_get_cpu(dest);
>  	}
> 
>  	/*
> @@ -1785,7 +1846,6 @@ static void hv_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
>  	switch (hbus->protocol_version) {
>  	case PCI_PROTOCOL_VERSION_1_1:
>  		size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
> -					dest,
>  					hpdev->desc.win_slot.slot,
>  					(u8)vector,
>  					vector_count);
> @@ -1794,7 +1854,7 @@ static void hv_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
>  	case PCI_PROTOCOL_VERSION_1_2:
>  	case PCI_PROTOCOL_VERSION_1_3:
>  		size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
> -					dest,
> +					cpu,
>  					hpdev->desc.win_slot.slot,
>  					(u8)vector,
>  					vector_count);
> @@ -1802,7 +1862,7 @@ static void hv_compose_msi_msg(struct irq_data *data,
> struct msi_msg *msg)
> 
>  	case PCI_PROTOCOL_VERSION_1_4:
>  		size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3,
> -					dest,
> +					cpu,
>  					hpdev->desc.win_slot.slot,
>  					vector,
>  					vector_count);
> --
> 2.25.1

Reviewed-by: Michael Kelley <mikelley@microsoft.com>
  
Dexuan Cui Nov. 10, 2022, 8:49 p.m. UTC | #2
> Sent: Friday, November 4, 2022 4:38 PM
> > ...
> >   Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
> >   through the Hyper-V tree because it's rebased to another hv_pci patch
> >  (which only exists in the Hyper-V tree for now):
> >       e70af8d040d2 ("PCI: hv: Fix the definition of vector in
> >  hv_compose_msi_msg()")
> >
> >   BTW, Michael has some other hv_pci patches, which would also need go
> >  through the Hyper-V tree:
> >
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>

Hi Bjorn, Lorenzo, if you have no objection to the patch, I suggest Wei merge
it through the Hyper-V tree early next week.
  
Bjorn Helgaas Nov. 10, 2022, 9:43 p.m. UTC | #3
On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote:
> Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> 
> It turns out that the third patch (b4b77778ecc5) causes a performance
> regression because all the interrupts now happen on 1 physical CPU (or two
> pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> 
> Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> hv_irq_unmask() -> hv_arch_irq_unmask() ->
> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> 
> Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> is determinted the second time, the vCPU in the effective affinity mask is

s/determinted/determined/

> used (i.e., it isn't always vCPU0), so the hypervisor chooses different
> pCPU for each interrupt.
> 
> The hypercall will be fixed in future to update the pCPU as well, but
> that will take quite a while, so let's restore the old behavior in
> hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> manner for each PCI device, so the interrupts of different devices can
> happen on different pCPUs, though the interrupts of each device happen on
> some single pCPU.
> 
> The hypercall fix may not be backported to all old versions of Hyper-V, so
> we want to have this guest side change for ever (or at least till we're sure

s/for ever/forever/

> the old affected versions of Hyper-V are no longer supported).
> 
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> ---
> 
> v1 is here:
>   https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> 
> Changes in v2:
>   round-robin the vCPU for multi-MSI.
>   The commit message is re-worked.
>   Added Jeff and Carl's Co-developed-by and Signed-off-by.
> 
> Changes in v3:
>   Michael Kelley kindly helped to make a great comment, and I added the
>   comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
> 
>   Rebased to Hyper-V tree's "hyperv-fixes" branch:
>       https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
> 
>   Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
>   through the Hyper-V tree because it's rebased to another hv_pci patch (which
>   only exists in the Hyper-V tree for now):
>       e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()") 

Fine with me, but it's Lorenzo's area so I don't want to preemptively
ack it.
  
Dexuan Cui Nov. 11, 2022, 7:17 a.m. UTC | #4
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Thursday, November 10, 2022 1:44 PM
> > ...
> > Note: before b4b77778ecc5, the pCPU is determined twice, and when the
> >  pCPU is determinted the second time, the vCPU in the effective affinity
> > mask is
> 
> s/determinted/determined/

Thanks, Bjorn! I suppose Wei can help fix this :-)

> > The hypercall fix may not be backported to all old versions of Hyper-V, so
> > we want to have this guest side change for ever (or at least till we're sure
> 
> s/for ever/forever/
ditto.

> > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
> > through the Hyper-V tree because it's rebased to another hv_pci patch
> > (which only exists in the Hyper-V tree for now):
> >       e70af8d040d2 ("PCI: hv: Fix the definition of vector in
> > hv_compose_msi_msg()")
> 
> Fine with me, but it's Lorenzo's area so I don't want to preemptively
> ack it.

Thanks. Hopefully Lorenzo can take a look soon.
  
Lorenzo Pieralisi Nov. 11, 2022, 9:56 a.m. UTC | #5
On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote:
> Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> 
> It turns out that the third patch (b4b77778ecc5) causes a performance
> regression because all the interrupts now happen on 1 physical CPU (or two
> pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> 
> Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> hv_irq_unmask() -> hv_arch_irq_unmask() ->
> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> 
> Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> is determinted the second time, the vCPU in the effective affinity mask is
> used (i.e., it isn't always vCPU0), so the hypervisor chooses different
> pCPU for each interrupt.
> 
> The hypercall will be fixed in future to update the pCPU as well, but
> that will take quite a while, so let's restore the old behavior in
> hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> manner for each PCI device, so the interrupts of different devices can
> happen on different pCPUs, though the interrupts of each device happen on
> some single pCPU.
> 
> The hypercall fix may not be backported to all old versions of Hyper-V, so
> we want to have this guest side change for ever (or at least till we're sure
> the old affected versions of Hyper-V are no longer supported).
> 
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> ---
> 
> v1 is here:
>   https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> 
> Changes in v2:
>   round-robin the vCPU for multi-MSI.
>   The commit message is re-worked.
>   Added Jeff and Carl's Co-developed-by and Signed-off-by.
> 
> Changes in v3:
>   Michael Kelley kindly helped to make a great comment, and I added the
>   comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
> 
>   Rebased to Hyper-V tree's "hyperv-fixes" branch:
>       https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
> 
>   Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
>   through the Hyper-V tree because it's rebased to another hv_pci patch (which
>   only exists in the Hyper-V tree for now):
>       e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()") 
> 
>   BTW, Michael has some other hv_pci patches, which would also need go through
>       the Hyper-V tree:
>       https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/
> 
> 
>  drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++-----
>  1 file changed, 75 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index ba64284eaf9f..fa5a1ba35a82 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
>  }
>  
>  static u32 hv_compose_msi_req_v1(
> -	struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
> +	struct pci_create_interrupt *int_pkt,
>  	u32 slot, u8 vector, u16 vector_count)
>  {
>  	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1(
>  	return sizeof(*int_pkt);
>  }
>  
> +/*
> + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and
> + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be
> + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V
> + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
> + * not irrelevant because Hyper-V chooses the physical CPU to handle the
> + * interrupts based on the vCPU specified in message sent to the vPCI VSP in
> + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> + * but assigning too many vPCI device interrupts to the same pCPU can cause a
> + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
> + * to spread out the pCPUs that it selects.
> + *
> + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> + * to always return the same dummy vCPU, because a second call to
> + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the

Why ? Can't you fix _that_ ? Why can't the initial call to
hv_compose_msi_msg() determine the _real_ target vCPU ?

> + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> + * the same pCPU, even though the vCPUs will be spread out by later calls
> + * to hv_irq_unmask(), but that is the best we can do now.
> + *
> + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*

"current" Hyper-V means nothing, remove it or provide versioning
information. Imagine yourself reading this comment some time
in the future.

I can't claim to understand how this MSI vCPU to pCPU mapping is made to
work in current code but I can't ack this patch sorry, if you feel like
it is good to merge it it is your and Hyper-V maintainers call, feel
free to go ahead - I can review PCI hyper-V changes that affect PCI
and IRQs core APIs, I don't know Hyper-V internals.

Lorenzo

> + * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an
> + * enhancement is planned for a future version. With that enhancement, the
> + * dummy vCPU selection won't matter, and interrupts for the same multi-MSI
> + * device will be spread across multiple pCPUs.
> + */
> +
>  /*
>   * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten
>   * by subsequent retarget in hv_irq_unmask().
> @@ -1640,18 +1669,39 @@ static int hv_compose_msi_req_get_cpu(const struct cpumask *affinity)
>  	return cpumask_first_and(affinity, cpu_online_mask);
>  }
>  
> -static u32 hv_compose_msi_req_v2(
> -	struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity,
> -	u32 slot, u8 vector, u16 vector_count)
> +/*
> + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0.
> + */
> +static int hv_compose_multi_msi_req_get_cpu(void)
>  {
> +	static DEFINE_SPINLOCK(multi_msi_cpu_lock);
> +
> +	/* -1 means starting with CPU 0 */
> +	static int cpu_next = -1;
> +
> +	unsigned long flags;
>  	int cpu;
>  
> +	spin_lock_irqsave(&multi_msi_cpu_lock, flags);
> +
> +	cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
> +				     false);
> +	cpu = cpu_next;
> +
> +	spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);
> +
> +	return cpu;
> +}
> +
> +static u32 hv_compose_msi_req_v2(
> +	struct pci_create_interrupt2 *int_pkt, int cpu,
> +	u32 slot, u8 vector, u16 vector_count)
> +{
>  	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2;
>  	int_pkt->wslot.slot = slot;
>  	int_pkt->int_desc.vector = vector;
>  	int_pkt->int_desc.vector_count = vector_count;
>  	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>  	int_pkt->int_desc.processor_array[0] =
>  		hv_cpu_number_to_vp_number(cpu);
>  	int_pkt->int_desc.processor_count = 1;
> @@ -1660,18 +1710,15 @@ static u32 hv_compose_msi_req_v2(
>  }
>  
>  static u32 hv_compose_msi_req_v3(
> -	struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity,
> +	struct pci_create_interrupt3 *int_pkt, int cpu,
>  	u32 slot, u32 vector, u16 vector_count)
>  {
> -	int cpu;
> -
>  	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3;
>  	int_pkt->wslot.slot = slot;
>  	int_pkt->int_desc.vector = vector;
>  	int_pkt->int_desc.reserved = 0;
>  	int_pkt->int_desc.vector_count = vector_count;
>  	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>  	int_pkt->int_desc.processor_array[0] =
>  		hv_cpu_number_to_vp_number(cpu);
>  	int_pkt->int_desc.processor_count = 1;
> @@ -1715,12 +1762,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  			struct pci_create_interrupt3 v3;
>  		} int_pkts;
>  	} __packed ctxt;
> +	bool multi_msi;
>  	u64 trans_id;
>  	u32 size;
>  	int ret;
> +	int cpu;
> +
> +	msi_desc  = irq_data_get_msi_desc(data);
> +	multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
> +		    msi_desc->nvec_used > 1;
>  
>  	/* Reuse the previous allocation */
> -	if (data->chip_data) {
> +	if (data->chip_data && multi_msi) {
>  		int_desc = data->chip_data;
>  		msg->address_hi = int_desc->address >> 32;
>  		msg->address_lo = int_desc->address & 0xffffffff;
> @@ -1728,7 +1781,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  		return;
>  	}
>  
> -	msi_desc  = irq_data_get_msi_desc(data);
>  	pdev = msi_desc_to_pci_dev(msi_desc);
>  	dest = irq_data_get_effective_affinity_mask(data);
>  	pbus = pdev->bus;
> @@ -1738,11 +1790,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  	if (!hpdev)
>  		goto return_null_message;
>  
> +	/* Free any previous message that might have already been composed. */
> +	if (data->chip_data && !multi_msi) {
> +		int_desc = data->chip_data;
> +		data->chip_data = NULL;
> +		hv_int_desc_free(hpdev, int_desc);
> +	}
> +
>  	int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
>  	if (!int_desc)
>  		goto drop_reference;
>  
> -	if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
> +	if (multi_msi) {
>  		/*
>  		 * If this is not the first MSI of Multi MSI, we already have
>  		 * a mapping.  Can exit early.
> @@ -1767,9 +1826,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  		 */
>  		vector = 32;
>  		vector_count = msi_desc->nvec_used;
> +		cpu = hv_compose_multi_msi_req_get_cpu();
>  	} else {
>  		vector = hv_msi_get_int_vector(data);
>  		vector_count = 1;
> +		cpu = hv_compose_msi_req_get_cpu(dest);
>  	}
>  
>  	/*
> @@ -1785,7 +1846,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  	switch (hbus->protocol_version) {
>  	case PCI_PROTOCOL_VERSION_1_1:
>  		size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
> -					dest,
>  					hpdev->desc.win_slot.slot,
>  					(u8)vector,
>  					vector_count);
> @@ -1794,7 +1854,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  	case PCI_PROTOCOL_VERSION_1_2:
>  	case PCI_PROTOCOL_VERSION_1_3:
>  		size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
> -					dest,
> +					cpu,
>  					hpdev->desc.win_slot.slot,
>  					(u8)vector,
>  					vector_count);
> @@ -1802,7 +1862,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  
>  	case PCI_PROTOCOL_VERSION_1_4:
>  		size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3,
> -					dest,
> +					cpu,
>  					hpdev->desc.win_slot.slot,
>  					vector,
>  					vector_count);
> -- 
> 2.25.1
>
  
Wei Liu Nov. 11, 2022, 11:40 p.m. UTC | #6
On Fri, Nov 11, 2022 at 10:56:28AM +0100, Lorenzo Pieralisi wrote:
> On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote:
> > Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> > 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> > 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> > b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> > a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> > 
> > It turns out that the third patch (b4b77778ecc5) causes a performance
> > regression because all the interrupts now happen on 1 physical CPU (or two
> > pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> > devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> > see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> > 
> > Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> > hv_irq_unmask() -> hv_arch_irq_unmask() ->
> > hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> > virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> > determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> > consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> > 
> > Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> > is determinted the second time, the vCPU in the effective affinity mask is
> > used (i.e., it isn't always vCPU0), so the hypervisor chooses different
> > pCPU for each interrupt.
> > 
> > The hypercall will be fixed in future to update the pCPU as well, but
> > that will take quite a while, so let's restore the old behavior in
> > hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> > single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> > manner for each PCI device, so the interrupts of different devices can
> > happen on different pCPUs, though the interrupts of each device happen on
> > some single pCPU.
> > 
> > The hypercall fix may not be backported to all old versions of Hyper-V, so
> > we want to have this guest side change for ever (or at least till we're sure
> > the old affected versions of Hyper-V are no longer supported).
> > 
> > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> > Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> > Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> > Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> > Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > 
> > ---
> > 
> > v1 is here:
> >   https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> > 
> > Changes in v2:
> >   round-robin the vCPU for multi-MSI.
> >   The commit message is re-worked.
> >   Added Jeff and Carl's Co-developed-by and Signed-off-by.
> > 
> > Changes in v3:
> >   Michael Kelley kindly helped to make a great comment, and I added the
> >   comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
> > 
> >   Rebased to Hyper-V tree's "hyperv-fixes" branch:
> >       https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
> > 
> >   Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go
> >   through the Hyper-V tree because it's rebased to another hv_pci patch (which
> >   only exists in the Hyper-V tree for now):
> >       e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()") 
> > 
> >   BTW, Michael has some other hv_pci patches, which would also need go through
> >       the Hyper-V tree:
> >       https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/
> > 
> > 
> >  drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++-----
> >  1 file changed, 75 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index ba64284eaf9f..fa5a1ba35a82 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
> >  }
> >  
> >  static u32 hv_compose_msi_req_v1(
> > -	struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
> > +	struct pci_create_interrupt *int_pkt,
> >  	u32 slot, u8 vector, u16 vector_count)
> >  {
> >  	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> > @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1(
> >  	return sizeof(*int_pkt);
> >  }
> >  
> > +/*
> > + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and
> > + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be
> > + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V
> > + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
> > + * not irrelevant because Hyper-V chooses the physical CPU to handle the
> > + * interrupts based on the vCPU specified in message sent to the vPCI VSP in
> > + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> > + * but assigning too many vPCI device interrupts to the same pCPU can cause a
> > + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
> > + * to spread out the pCPUs that it selects.
> > + *
> > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> > + * to always return the same dummy vCPU, because a second call to
> > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
> 
> Why ? Can't you fix _that_ ? Why can't the initial call to
> hv_compose_msi_msg() determine the _real_ target vCPU ?

Dexuan, any comment on this?

> 
> > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> > + * the same pCPU, even though the vCPUs will be spread out by later calls
> > + * to hv_irq_unmask(), but that is the best we can do now.
> > + *
> > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
> 
> "current" Hyper-V means nothing, remove it or provide versioning
> information. Imagine yourself reading this comment some time
> in the future.
> 

And this?

Wei.
  
Dexuan Cui Nov. 12, 2022, 12:58 a.m. UTC | #7
> From: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Sent: Friday, November 11, 2022 1:56 AM
> > ...
> > + * For the single-MSI and MSI-X cases, it's OK for
> > hv_compose_msi_req_get_cpu()
> > + * to always return the same dummy vCPU, because a second call to
> > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to
> > choose a
> > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP,
> > so the
> 
> Why ? Can't you fix _that_ ? Why can't the initial call to
> hv_compose_msi_msg() determine the _real_ target vCPU ?

Unluckily I can't fix this because of the way it works in x86's irq management
code. This is out of the control of the pci-hyperv driver.

hv_compose_msi_msg() uses irq_data_get_effective_affinity_mask() to get
the "effective"mask.

On x86, when the irq is initialized, irq_data_update_effective_affinity() is
called from apic_update_irq_cfg() -- please refer to the below debug code.

When the initial call to hv_compose_msi_msg() is invoked, it's from
pci_alloc_irq_vectors(), and the x86 irq code always passes CPU0 to pci-hyperv.
Please see the below "cpumask_first(cpu_online_mask)" in
vector_assign_managed_shutdown().

When hv_compose_msi_msg() is invoked the second time, it's from
request_irq(), and the x86 irq code passes the real effectivey CPU to pci-hyperv.

I added the below debug code and pasted the trimmed output below.

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -179,6 +179,7 @@ static void vector_assign_managed_shutdown(struct irq_data *irqd)
        unsigned int cpu = cpumask_first(cpu_online_mask);

        apic_update_irq_cfg(irqd, MANAGED_IRQ_SHUTDOWN_VECTOR, cpu);
+       WARN(irqd->irq >= 24, "cdx: vector_assign_managed_shutdown: irq=%d, cpu=%d\n", irqd->irq, cpu);
 }

 static int reserve_managed_vector(struct irq_data *irqd)
@@ -251,6 +252,7 @@ assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest)
                return vector;
        apic_update_vector(irqd, vector, cpu);
        apic_update_irq_cfg(irqd, vector, cpu);
+       WARN(irqd->irq >= 24, "cdx: assign_vector_locked: irq=%d, cpu=%d\n", irqd->irq, cpu);

        return 0;
 }
@@ -328,6 +330,7 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
                return vector;
        apic_update_vector(irqd, vector, cpu);
        apic_update_irq_cfg(irqd, vector, cpu);
+       WARN(irqd->irq >= 24, "cdx: assign_managed_vector: irq=%d, cpu=%d\n", irqd->irq, cpu);
        return 0;
 }

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1803,6 +1803,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
        u32 size;
        int ret;

+       WARN(1, "cdx: hv_compose_msi_msg: irq=%d\n", data->irq);
        /* Reuse the previous allocation */
        if (data->chip_data) {
                int_desc = data->chip_data;

  1 cdx: vector_assign_managed_shutdown: irq=24, cpu=0
  2 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60
  3 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60
  4  reserve_irq_vector_locked+0x41/0xa0
  5  x86_vector_alloc_irqs+0x298/0x460
  6  irq_domain_alloc_irqs_hierarchy+0x1b/0x50
  7  irq_domain_alloc_irqs_parent+0x17/0x30
  8  msi_domain_alloc+0x83/0x150
  9  irq_domain_alloc_irqs_hierarchy+0x1b/0x50
 10  __irq_domain_alloc_irqs+0xdf/0x320
 11  __msi_domain_alloc_irqs+0x103/0x3e0
 12  msi_domain_alloc_irqs_descs_locked+0x3e/0x90
 13  pci_msi_setup_msi_irqs+0x2d/0x40
 14  __pci_enable_msix_range+0x2fd/0x420
 15  pci_alloc_irq_vectors_affinity+0xb0/0x110
 16  nvme_reset_work+0x1cf/0x1170
 17  process_one_work+0x21f/0x3f0
 18  worker_thread+0x4a/0x3c0
 19  kthread+0xff/0x130
 20  ret_from_fork+0x22/0x30
 21
 22 cdx: vector_assign_managed_shutdown: irq=24, cpu=0
 23 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60
 24 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60
 25  x86_vector_activate+0x149/0x1e0
 26  __irq_domain_activate_irq+0x58/0x90
 27  __irq_domain_activate_irq+0x38/0x90
 28  irq_domain_activate_irq+0x2d/0x50
 29  __msi_domain_alloc_irqs+0x1bb/0x3e0
 30  msi_domain_alloc_irqs_descs_locked+0x3e/0x90
 31  pci_msi_setup_msi_irqs+0x2d/0x40
 32  __pci_enable_msix_range+0x2fd/0x420
 33  pci_alloc_irq_vectors_affinity+0xb0/0x110
 34  nvme_reset_work+0x1cf/0x1170
 35  process_one_work+0x21f/0x3f0
 36  worker_thread+0x4a/0x3c0
 37  kthread+0xff/0x130
 38  ret_from_fork+0x22/0x30
 39
 40
 41 cdx: hv_compose_msi_msg: irq=24
 42 WARNING: CPU: 3 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
 43 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
 44  irq_chip_compose_msi_msg+0x32/0x50
 45  msi_domain_activate+0x45/0xa0
 46  __irq_domain_activate_irq+0x58/0x90
 47  irq_domain_activate_irq+0x2d/0x50
 48  __msi_domain_alloc_irqs+0x1bb/0x3e0
 49  msi_domain_alloc_irqs_descs_locked+0x3e/0x90
 50  pci_msi_setup_msi_irqs+0x2d/0x40
 51  __pci_enable_msix_range+0x2fd/0x420
 52  pci_alloc_irq_vectors_affinity+0xb0/0x110
 53  nvme_reset_work+0x1cf/0x1170
 54  process_one_work+0x21f/0x3f0
 55  worker_thread+0x4a/0x3c0
 56  kthread+0xff/0x130
 57  ret_from_fork+0x22/0x30
 58
 59
 60
 61 cdx: assign_vector_locked: irq=24, cpu=3
 62 WARNING: CPU: 0 PID: 87 at arch/x86/kernel/apic/vector.c:255 assign_vector_locked+0x160/0x170
 63 RIP: 0010:assign_vector_locked+0x160/0x170
 64  assign_irq_vector_any_locked+0x6a/0x150
 65  x86_vector_activate+0x93/0x1e0
 66  __irq_domain_activate_irq+0x58/0x90
 67  __irq_domain_activate_irq+0x38/0x90
 68  irq_domain_activate_irq+0x2d/0x50
 69  irq_activate+0x29/0x30
 70  __setup_irq+0x2e5/0x780
 71  request_threaded_irq+0x112/0x180
 72  pci_request_irq+0xa3/0xf0
 73  queue_request_irq+0x74/0x80
 74  nvme_reset_work+0x408/0x1170
 75  process_one_work+0x21f/0x3f0
 76  worker_thread+0x4a/0x3c0
 77  kthread+0xff/0x130
 78  ret_from_fork+0x22/0x30
 79
 80 cdx: hv_compose_msi_msg: irq=24
 81 WARNING: CPU: 0 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
 82 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv]
 83  irq_chip_compose_msi_msg+0x32/0x50
 84  msi_domain_activate+0x45/0xa0
 85  __irq_domain_activate_irq+0x58/0x90
 86  irq_domain_activate_irq+0x2d/0x50
 87  irq_activate+0x29/0x30
 88  __setup_irq+0x2e5/0x780
 89  request_threaded_irq+0x112/0x180
 90  pci_request_irq+0xa3/0xf0
 91  queue_request_irq+0x74/0x80
 92  nvme_reset_work+0x408/0x1170
 93  process_one_work+0x21f/0x3f0
 94  worker_thread+0x4a/0x3c0
 95  kthread+0xff/0x130
 96  ret_from_fork+0x22/0x30

> > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed
> > so that
> > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up
> > using
> > + * the same pCPU, even though the vCPUs will be spread out by later calls
> > + * to hv_irq_unmask(), but that is the best we can do now.
> > + *
> > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does
> *not*
> 
> "current" Hyper-V means nothing, remove it or provide versioning
> information. Imagine yourself reading this comment some time
> in the future.

Good point. @Wei, can you please help fix this? I think we can change
"With current Hyper-V"
To
"With Hyper-V in Nov 2022".

BTW, it's hard to provide the exact versioning info, because technically
there are many versions of Hyper-V...

> I can't claim to understand how this MSI vCPU to pCPU mapping is made to
> work in current code but I can't ack this patch sorry, if you feel like
> it is good to merge it it is your and Hyper-V maintainers call, feel
> free to go ahead - I can review PCI hyper-V changes that affect PCI
> and IRQs core APIs, I don't know Hyper-V internals.
> 
> Lorenzo

I understand. Thanks! 

I discussed the issue with Hyper-V team. I believe I understood it and
this patch is the right thing to have.

@Wei, Bjorn spotted two typos in the commit message, and Lorenzo
suggested a change to the above "current". Can you please help
fix these and merge the patch? Or, let me know if it'd be easier if
I should send v4.

Thanks,
Dexuan
  
Wei Liu Nov. 12, 2022, 12:43 p.m. UTC | #8
On Sat, Nov 12, 2022 at 12:58:33AM +0000, Dexuan Cui wrote:
[...]
> 
> I discussed the issue with Hyper-V team. I believe I understood it and
> this patch is the right thing to have.
> 
> @Wei, Bjorn spotted two typos in the commit message, and Lorenzo
> suggested a change to the above "current". Can you please help
> fix these and merge the patch? Or, let me know if it'd be easier if
> I should send v4.

All fixed and patch applied to hyperv-fixes.

Thanks,
Wei.

> 
> Thanks,
> Dexuan
>
  

Patch

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index ba64284eaf9f..fa5a1ba35a82 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1613,7 +1613,7 @@  static void hv_pci_compose_compl(void *context, struct pci_response *resp,
 }
 
 static u32 hv_compose_msi_req_v1(
-	struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
+	struct pci_create_interrupt *int_pkt,
 	u32 slot, u8 vector, u16 vector_count)
 {
 	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
@@ -1631,6 +1631,35 @@  static u32 hv_compose_msi_req_v1(
 	return sizeof(*int_pkt);
 }
 
+/*
+ * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and
+ * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be
+ * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V
+ * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
+ * not irrelevant because Hyper-V chooses the physical CPU to handle the
+ * interrupts based on the vCPU specified in message sent to the vPCI VSP in
+ * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
+ * but assigning too many vPCI device interrupts to the same pCPU can cause a
+ * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
+ * to spread out the pCPUs that it selects.
+ *
+ * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
+ * to always return the same dummy vCPU, because a second call to
+ * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
+ * new pCPU for the interrupt. But for the multi-MSI case, the second call to
+ * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
+ * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
+ * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
+ * the same pCPU, even though the vCPUs will be spread out by later calls
+ * to hv_irq_unmask(), but that is the best we can do now.
+ *
+ * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
+ * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an
+ * enhancement is planned for a future version. With that enhancement, the
+ * dummy vCPU selection won't matter, and interrupts for the same multi-MSI
+ * device will be spread across multiple pCPUs.
+ */
+
 /*
  * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten
  * by subsequent retarget in hv_irq_unmask().
@@ -1640,18 +1669,39 @@  static int hv_compose_msi_req_get_cpu(const struct cpumask *affinity)
 	return cpumask_first_and(affinity, cpu_online_mask);
 }
 
-static u32 hv_compose_msi_req_v2(
-	struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity,
-	u32 slot, u8 vector, u16 vector_count)
+/*
+ * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0.
+ */
+static int hv_compose_multi_msi_req_get_cpu(void)
 {
+	static DEFINE_SPINLOCK(multi_msi_cpu_lock);
+
+	/* -1 means starting with CPU 0 */
+	static int cpu_next = -1;
+
+	unsigned long flags;
 	int cpu;
 
+	spin_lock_irqsave(&multi_msi_cpu_lock, flags);
+
+	cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
+				     false);
+	cpu = cpu_next;
+
+	spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);
+
+	return cpu;
+}
+
+static u32 hv_compose_msi_req_v2(
+	struct pci_create_interrupt2 *int_pkt, int cpu,
+	u32 slot, u8 vector, u16 vector_count)
+{
 	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2;
 	int_pkt->wslot.slot = slot;
 	int_pkt->int_desc.vector = vector;
 	int_pkt->int_desc.vector_count = vector_count;
 	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
-	cpu = hv_compose_msi_req_get_cpu(affinity);
 	int_pkt->int_desc.processor_array[0] =
 		hv_cpu_number_to_vp_number(cpu);
 	int_pkt->int_desc.processor_count = 1;
@@ -1660,18 +1710,15 @@  static u32 hv_compose_msi_req_v2(
 }
 
 static u32 hv_compose_msi_req_v3(
-	struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity,
+	struct pci_create_interrupt3 *int_pkt, int cpu,
 	u32 slot, u32 vector, u16 vector_count)
 {
-	int cpu;
-
 	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3;
 	int_pkt->wslot.slot = slot;
 	int_pkt->int_desc.vector = vector;
 	int_pkt->int_desc.reserved = 0;
 	int_pkt->int_desc.vector_count = vector_count;
 	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
-	cpu = hv_compose_msi_req_get_cpu(affinity);
 	int_pkt->int_desc.processor_array[0] =
 		hv_cpu_number_to_vp_number(cpu);
 	int_pkt->int_desc.processor_count = 1;
@@ -1715,12 +1762,18 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 			struct pci_create_interrupt3 v3;
 		} int_pkts;
 	} __packed ctxt;
+	bool multi_msi;
 	u64 trans_id;
 	u32 size;
 	int ret;
+	int cpu;
+
+	msi_desc  = irq_data_get_msi_desc(data);
+	multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
+		    msi_desc->nvec_used > 1;
 
 	/* Reuse the previous allocation */
-	if (data->chip_data) {
+	if (data->chip_data && multi_msi) {
 		int_desc = data->chip_data;
 		msg->address_hi = int_desc->address >> 32;
 		msg->address_lo = int_desc->address & 0xffffffff;
@@ -1728,7 +1781,6 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		return;
 	}
 
-	msi_desc  = irq_data_get_msi_desc(data);
 	pdev = msi_desc_to_pci_dev(msi_desc);
 	dest = irq_data_get_effective_affinity_mask(data);
 	pbus = pdev->bus;
@@ -1738,11 +1790,18 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 	if (!hpdev)
 		goto return_null_message;
 
+	/* Free any previous message that might have already been composed. */
+	if (data->chip_data && !multi_msi) {
+		int_desc = data->chip_data;
+		data->chip_data = NULL;
+		hv_int_desc_free(hpdev, int_desc);
+	}
+
 	int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
 	if (!int_desc)
 		goto drop_reference;
 
-	if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
+	if (multi_msi) {
 		/*
 		 * If this is not the first MSI of Multi MSI, we already have
 		 * a mapping.  Can exit early.
@@ -1767,9 +1826,11 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		 */
 		vector = 32;
 		vector_count = msi_desc->nvec_used;
+		cpu = hv_compose_multi_msi_req_get_cpu();
 	} else {
 		vector = hv_msi_get_int_vector(data);
 		vector_count = 1;
+		cpu = hv_compose_msi_req_get_cpu(dest);
 	}
 
 	/*
@@ -1785,7 +1846,6 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 	switch (hbus->protocol_version) {
 	case PCI_PROTOCOL_VERSION_1_1:
 		size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
-					dest,
 					hpdev->desc.win_slot.slot,
 					(u8)vector,
 					vector_count);
@@ -1794,7 +1854,7 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 	case PCI_PROTOCOL_VERSION_1_2:
 	case PCI_PROTOCOL_VERSION_1_3:
 		size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
-					dest,
+					cpu,
 					hpdev->desc.win_slot.slot,
 					(u8)vector,
 					vector_count);
@@ -1802,7 +1862,7 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 
 	case PCI_PROTOCOL_VERSION_1_4:
 		size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3,
-					dest,
+					cpu,
 					hpdev->desc.win_slot.slot,
 					vector,
 					vector_count);