[1/6] PCI: hv: fix a race condition bug in hv_pci_query_relations()

Message ID 20230328045122.25850-2-decui@microsoft.com
State New
Headers
Series pci-hyper: fix race condition bugs for fast device hotplug |

Commit Message

Dexuan Cui March 28, 2023, 4:51 a.m. UTC
  Fix the longstanding race between hv_pci_query_relations() and
survey_child_resources() by flushing the workqueue before we exit from
hv_pci_query_relations().

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>

---
 drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

With the below debug code:

@@ -2103,6 +2103,8 @@ static void survey_child_resources(struct hv_pcibus_device *hbus)
 	}

 	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
+	ssleep(15);
+	printk("%s: completing %px\n", __func__, event);
 	complete(event);
 }

@@ -3305,8 +3307,12 @@ static int hv_pci_query_relations(struct hv_device *hdev)

 	ret = vmbus_sendpacket(hdev->channel, &message, sizeof(message),
 			       0, VM_PKT_DATA_INBAND, 0);
-	if (!ret)
+	if (!ret) {
+		ssleep(10); // unassign the PCI device on the host during the 10s
 		ret = wait_for_response(hdev, &comp);
+		printk("%s: comp=%px is becoming invalid! ret=%d\n",
+			__func__, &comp, ret);
+	}

 	return ret;
 }
@@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,

 retry:
 	ret = hv_pci_query_relations(hdev);
+	printk("hv_pci_query_relations() exited\n");
+
 	if (ret)
 		goto free_irq_domain;

I'm able to repro the below hang issue:

[   74.544744] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus probing: Using version 0x10004
[   76.886944] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot 1 removed
[   84.788266] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is gone.
[   84.792586] hv_pci_query_relations: comp=ffffa7504012fb58 is becoming invalid! ret=-19
[   84.797505] hv_pci_query_relations() exited
[   89.652268] survey_child_resources: completing ffffa7504012fb58
[  150.392242] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  150.398447] rcu:     15-...0: (2 ticks this GP) idle=867c/1/0x4000000000000000 softirq=947/947 fqs=5234
[  150.405851] rcu:     (detected by 14, t=15004 jiffies, g=2553, q=4833 ncpus=16)
[  150.410870] Sending NMI from CPU 14 to CPUs 15:
[  150.414836] NMI backtrace for cpu 15
[  150.414840] CPU: 15 PID: 10 Comm: kworker/u32:0 Tainted: G        W   E      6.3.0-rc3-decui-dirty #34
...
[  150.414849] Workqueue: hv_pci_468b pci_devices_present_work [pci_hyperv]
[  150.414866] RIP: 0010:__pv_queued_spin_lock_slowpath+0x10f/0x3c0
...
[  150.414905] Call Trace:
[  150.414907]  <TASK>
[  150.414911]  _raw_spin_lock_irqsave+0x40/0x50
[  150.414917]  complete+0x1d/0x60
[  150.414924]  pci_devices_present_work+0x5dd/0x680 [pci_hyperv]
[  150.414946]  process_one_work+0x21f/0x430
[  150.414952]  worker_thread+0x4a/0x3c0

With this patch, the hang issue goes away:

[  186.143612] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is gone.
[  186.148034] hv_pci_query_relations: comp=ffffa7cfd0aa3b50 is becoming invalid! ret=-19
[  191.263611] survey_child_resources: completing ffffa7cfd0aa3b50
[  191.267732] hv_pci_query_relations() exited
  

Comments

Saurabh Singh Sengar March 28, 2023, 5:29 a.m. UTC | #1
> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Tuesday, March 28, 2023 10:21 AM
> To: bhelgaas@google.com; davem@davemloft.net; Dexuan Cui
> <decui@microsoft.com>; edumazet@google.com; Haiyang Zhang
> <haiyangz@microsoft.com>; Jake Oshins <jakeo@microsoft.com>;
> kuba@kernel.org; kw@linux.com; KY Srinivasan <kys@microsoft.com>;
> leon@kernel.org; linux-pci@vger.kernel.org; lpieralisi@kernel.org; Michael
> Kelley (LINUX) <mikelley@microsoft.com>; pabeni@redhat.com;
> robh@kernel.org; saeedm@nvidia.com; wei.liu@kernel.org; Long Li
> <longli@microsoft.com>; boqun.feng@gmail.com
> Cc: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; netdev@vger.kernel.org
> Subject: [EXTERNAL] [PATCH 1/6] PCI: hv: fix a race condition bug in
> hv_pci_query_relations()
> 
> Fix the longstanding race between hv_pci_query_relations() and
> survey_child_resources() by flushing the workqueue before we exit from
> hv_pci_query_relations().
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft
> Hyper-V VMs")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> ---
>  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> With the below debug code:
> 
> @@ -2103,6 +2103,8 @@ static void survey_child_resources(struct
> hv_pcibus_device *hbus)
>  	}
> 
>  	spin_unlock_irqrestore(&hbus->device_list_lock, flags);
> +	ssleep(15);
> +	printk("%s: completing %px\n", __func__, event);
>  	complete(event);
>  }
> 
> @@ -3305,8 +3307,12 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
> 
>  	ret = vmbus_sendpacket(hdev->channel, &message, sizeof(message),
>  			       0, VM_PKT_DATA_INBAND, 0);
> -	if (!ret)
> +	if (!ret) {
> +		ssleep(10); // unassign the PCI device on the host during the
> 10s
>  		ret = wait_for_response(hdev, &comp);
> +		printk("%s: comp=%px is becoming invalid! ret=%d\n",
> +			__func__, &comp, ret);
> +	}
> 
>  	return ret;
>  }
> @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> 
>  retry:
>  	ret = hv_pci_query_relations(hdev);
> +	printk("hv_pci_query_relations() exited\n");

Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).

> +
>  	if (ret)
>  		goto free_irq_domain;
> 
> I'm able to repro the below hang issue:
> 
> [   74.544744] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: PCI VMBus
> probing: Using version 0x10004
> [   76.886944] hv_netvsc 818fe754-b912-4445-af51-1f584812e3c9 eth0: VF slot
> 1 removed
> [   84.788266] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is
> gone.
> [   84.792586] hv_pci_query_relations: comp=ffffa7504012fb58 is becoming
> invalid! ret=-19
> [   84.797505] hv_pci_query_relations() exited
> [   89.652268] survey_child_resources: completing ffffa7504012fb58
> [  150.392242] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  150.398447] rcu:     15-...0: (2 ticks this GP)
> idle=867c/1/0x4000000000000000 softirq=947/947 fqs=5234
> [  150.405851] rcu:     (detected by 14, t=15004 jiffies, g=2553, q=4833
> ncpus=16)
> [  150.410870] Sending NMI from CPU 14 to CPUs 15:
> [  150.414836] NMI backtrace for cpu 15
> [  150.414840] CPU: 15 PID: 10 Comm: kworker/u32:0 Tainted: G        W   E
> 6.3.0-rc3-decui-dirty #34
> ...
> [  150.414849] Workqueue: hv_pci_468b pci_devices_present_work
> [pci_hyperv] [  150.414866] RIP:
> 0010:__pv_queued_spin_lock_slowpath+0x10f/0x3c0
> ...
> [  150.414905] Call Trace:
> [  150.414907]  <TASK>
> [  150.414911]  _raw_spin_lock_irqsave+0x40/0x50 [  150.414917]
> complete+0x1d/0x60 [  150.414924]  pci_devices_present_work+0x5dd/0x680
> [pci_hyperv] [  150.414946]  process_one_work+0x21f/0x430 [  150.414952]
> worker_thread+0x4a/0x3c0
> 
> With this patch, the hang issue goes away:
> 
> [  186.143612] hv_pci b92a0085-468b-407a-a88a-d33fac8edc75: The device is
> gone.
> [  186.148034] hv_pci_query_relations: comp=ffffa7cfd0aa3b50 is becoming
> invalid! ret=-19 [  191.263611] survey_child_resources: completing
> ffffa7cfd0aa3b50 [  191.267732] hv_pci_query_relations() exited
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-
> hyperv.c
> index f33370b75628..b82c7cde19e6 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
>  	if (!ret)
>  		ret = wait_for_response(hdev, &comp);
> 
> +	/*
> +	 * In the case of fast device addition/removal, it's possible that
> +	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> we
> +	 * already got a PCI_BUS_RELATIONS* message from the host and the
> +	 * channel callback already scheduled a work to hbus->wq, which can
> be
> +	 * running survey_child_resources() -> complete(&hbus-
> >survey_event),
> +	 * even after hv_pci_query_relations() exits and the stack variable
> +	 * 'comp' is no longer valid. This can cause a strange hang issue
> +	 * or sometimes a page fault. Flush hbus->wq before we exit from
> +	 * hv_pci_query_relations() to avoid the issues.
> +	 */
> +	flush_workqueue(hbus->wq);
> +
>  	return ret;
>  }
> 
> --
> 2.25.1
  
Dexuan Cui March 28, 2023, 5:38 a.m. UTC | #2
> From: Saurabh Singh Sengar <ssengar@microsoft.com>
> Sent: Monday, March 27, 2023 10:29 PM
> > ...
> > ---

Please note this special line "---". 
Anything after the special line and before the line "diff --git" is discarded
automaticaly by 'git' and 'patch'. 

> >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> >
> >  retry:
> >  	ret = hv_pci_query_relations(hdev);
> > +	printk("hv_pci_query_relations() exited\n");
> 
> Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).

This is not part of the real patch :-)
I just thought the debug code can help understand the issues
resolved by the patches.
I'll remove the debug code to avoid confusion if I need to post v2.
  
Long Li March 28, 2023, 4:48 p.m. UTC | #3
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-
> hyperv.c
> index f33370b75628..b82c7cde19e6 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device
> *hdev)
>  	if (!ret)
>  		ret = wait_for_response(hdev, &comp);
> 
> +	/*
> +	 * In the case of fast device addition/removal, it's possible that
> +	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> we
> +	 * already got a PCI_BUS_RELATIONS* message from the host and the
> +	 * channel callback already scheduled a work to hbus->wq, which can
> be
> +	 * running survey_child_resources() -> complete(&hbus-
> >survey_event),
> +	 * even after hv_pci_query_relations() exits and the stack variable
> +	 * 'comp' is no longer valid. This can cause a strange hang issue
> +	 * or sometimes a page fault. Flush hbus->wq before we exit from
> +	 * hv_pci_query_relations() to avoid the issues.
> +	 */
> +	flush_workqueue(hbus->wq);

Is it possible for PCI_BUS_RELATIONS to be scheduled arrive after calling flush_workqueue(hbus->wq)?

> +
>  	return ret;
>  }
> 
> --
> 2.25.1
  
Bjorn Helgaas March 28, 2023, 5:24 p.m. UTC | #4
On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > Sent: Monday, March 27, 2023 10:29 PM
> > > ...
> > > ---
> 
> Please note this special line "---". 
> Anything after the special line and before the line "diff --git" is discarded
> automaticaly by 'git' and 'patch'. 
> 
> > >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > >  1 file changed, 13 insertions(+)
> > >
> > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > >
> > >  retry:
> > >  	ret = hv_pci_query_relations(hdev);
> > > +	printk("hv_pci_query_relations() exited\n");
> > 
> > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> 
> This is not part of the real patch :-)
> I just thought the debug code can help understand the issues
> resolved by the patches.
> I'll remove the debug code to avoid confusion if I need to post v2.

I guess that means you *will* post a v2, right?  Or do you expect
somebody else to remove the debug code?  If you do keep any debug or
other logging, use pci_info() (or dev_info()) whenever possible.

Also capitalize the subject line to match the others in the series.

Bjorn
  
Dexuan Cui March 29, 2023, 8:21 a.m. UTC | #5
> From: Long Li <longli@microsoft.com>
> Sent: Tuesday, March 28, 2023 9:49 AM
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct
> hv_device
> > *hdev)
> >  	if (!ret)
> >  		ret = wait_for_response(hdev, &comp);
> >
> > +	/*
> > +	 * In the case of fast device addition/removal, it's possible that
> > +	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but
> > we
> > +	 * already got a PCI_BUS_RELATIONS* message from the host and the
> > +	 * channel callback already scheduled a work to hbus->wq, which can
> > be
> > +	 * running survey_child_resources() -> complete(&hbus-
> > >survey_event),
> > +	 * even after hv_pci_query_relations() exits and the stack variable
> > +	 * 'comp' is no longer valid. This can cause a strange hang issue
> > +	 * or sometimes a page fault. Flush hbus->wq before we exit from
> > +	 * hv_pci_query_relations() to avoid the issues.
> > +	 */
> > +	flush_workqueue(hbus->wq);
> 
> Is it possible for PCI_BUS_RELATIONS to be scheduled arrive after calling
> flush_workqueue(hbus->wq)?

It's possible, but that doesn't matter:

hv_pci_query_relations() is called only once, and it sets hbus->survey_event
to point to the stack variable 'comp'. The first survey_child_resources()
calls complete() for the 'comp' and sets hbus->survey_event to NULL.

When the second survey_child_resources() is called, hbus->survey_event
is NULL, so survey_child_resources() returns immediately.

According to my test, after hv_pci_enter_d0() posts PCI_BUS_D0ENTRY,
the guest receives a second PCI_BUS_RELATIONS2 message, which is
the same as the first PCI_BUS_RELATIONS2 message, which is basically
a no-op in pci_devices_present_work(), especially with the
newly-introduced per-hbus state_lock mutex.
  
Dexuan Cui March 29, 2023, 8:37 a.m. UTC | #6
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Tuesday, March 28, 2023 10:25 AM
> To: Dexuan Cui <decui@microsoft.com>
> ...
> On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > > Sent: Monday, March 27, 2023 10:29 PM
> > > > ...
> > > > ---
> >
> > Please note this special line "---".
> > Anything after the special line and before the line "diff --git" is discarded
> > automaticaly by 'git' and 'patch'.
> >
> > > >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > > >  1 file changed, 13 insertions(+)
> > > >
> > > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > > >
> > > >  retry:
> > > >  	ret = hv_pci_query_relations(hdev);
> > > > +	printk("hv_pci_query_relations() exited\n");
> > >
> > > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> >
> > This is not part of the real patch :-)
> > I just thought the debug code can help understand the issues
> > resolved by the patches.
> > I'll remove the debug code to avoid confusion if I need to post v2.
> 
> I guess that means you *will* post a v2, right?  

I guess I didn't make myself clear, sorry. The "debug code" is not
part of the real patch body -- if we run the "patch" program or "git am"
to apply the patches, the "debug code" is automatically dropped because
it's between the special "---" line and the real start of the patch body (i.e.
the "diff --git" line). 

So far, IMO I don't have to post v2 because the patch body and the patch
description (except for the part that's automatically removed by 'patch'
and 'git') don't need any change.

> Or do you expect
> somebody else to remove the debug code?  If you do keep any debug or
> other logging, use pci_info() (or dev_info()) whenever possible.

As I explained above, 'patch' and 'git' automatically remove the part that
don't have to be in the git history.

> 
> Also capitalize the subject line to match the others in the series.
> 
> Bjorn

Thanks for catching this! If this is the only thing to be fixed, I hope the
PCI folks can help fix this when accepting the patch. If you think I should
post v2, please let me know.
  
Bjorn Helgaas March 29, 2023, 4:06 p.m. UTC | #7
On Wed, Mar 29, 2023 at 08:37:20AM +0000, Dexuan Cui wrote:
> > From: Bjorn Helgaas <helgaas@kernel.org>
> > Sent: Tuesday, March 28, 2023 10:25 AM
> > To: Dexuan Cui <decui@microsoft.com>
> > ...
> > On Tue, Mar 28, 2023 at 05:38:59AM +0000, Dexuan Cui wrote:
> > > > From: Saurabh Singh Sengar <ssengar@microsoft.com>
> > > > Sent: Monday, March 27, 2023 10:29 PM
> > > > > ...
> > > > > ---
> > >
> > > Please note this special line "---".
> > > Anything after the special line and before the line "diff --git" is discarded
> > > automaticaly by 'git' and 'patch'.
> > >
> > > > >  drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> > > > >  1 file changed, 13 insertions(+)
> > > > >
> > > > > @@ -3635,6 +3641,8 @@ static int hv_pci_probe(struct hv_device *hdev,
> > > > >
> > > > >  retry:
> > > > >  	ret = hv_pci_query_relations(hdev);
> > > > > +	printk("hv_pci_query_relations() exited\n");
> > > >
> > > > Can we use pr_* or the appropriate KERN_<LEVEL> in all the printk(s).
> > >
> > > This is not part of the real patch :-)
> > > I just thought the debug code can help understand the issues
> > > resolved by the patches.
> > > I'll remove the debug code to avoid confusion if I need to post v2.
> > 
> > I guess that means you *will* post a v2, right?  
> 
> I guess I didn't make myself clear, sorry. The "debug code" is not
> part of the real patch body -- if we run the "patch" program or "git am"
> to apply the patches, the "debug code" is automatically dropped because
> it's between the special "---" line and the real start of the patch body (i.e.
> the "diff --git" line). 
> 
> So far, IMO I don't have to post v2 because the patch body and the patch
> description (except for the part that's automatically removed by 'patch'
> and 'git') don't need any change.

Ah, sorry, I didn't notice that, even though you explicitly pointed it
out earlier.  No need to repost just to capitalize the subject,
Lorenzo can do it or not as he chooses.

Bjorn
  

Patch

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index f33370b75628..b82c7cde19e6 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3308,6 +3308,19 @@  static int hv_pci_query_relations(struct hv_device *hdev)
 	if (!ret)
 		ret = wait_for_response(hdev, &comp);
 
+	/*
+	 * In the case of fast device addition/removal, it's possible that
+	 * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
+	 * already got a PCI_BUS_RELATIONS* message from the host and the
+	 * channel callback already scheduled a work to hbus->wq, which can be
+	 * running survey_child_resources() -> complete(&hbus->survey_event),
+	 * even after hv_pci_query_relations() exits and the stack variable
+	 * 'comp' is no longer valid. This can cause a strange hang issue
+	 * or sometimes a page fault. Flush hbus->wq before we exit from
+	 * hv_pci_query_relations() to avoid the issues.
+	 */
+	flush_workqueue(hbus->wq);
+
 	return ret;
 }