[1/2] PCI: Clear LBMS on resume to avoid Target Speed quirk

Message ID 20240129112710.2852-2-ilpo.jarvinen@linux.intel.com
State New
Series PCI: Fix disconnect related issues

Commit Message

Ilpo Järvinen Jan. 29, 2024, 11:27 a.m. UTC
  While a device is runtime suspended along with its PCIe hierarchy, the
device could get disconnected. Because of the suspend, the device
disconnection cannot be detected until portdrv/hotplug have resumed. On
runtime resume, pcie_wait_for_link_delay() is called:

  pci_pm_runtime_resume()
    pci_pm_bridge_power_up_actions()
      pci_bridge_wait_for_secondary_bus()
        pcie_wait_for_link_delay()

Because the device is already disconnected, this results in cascading
failures:

  1. pcie_wait_for_link_status() returns -ETIMEDOUT.

  2. After the commit a89c82249c37 ("PCI: Work around PCIe link
     training failures"), pcie_failed_link_retrain() spuriously detects
     this failure as a Link Retraining failure and attempts the Target
     Speed trick, which also fails.

  3. pci_bridge_wait_for_secondary_bus() then calls pci_dev_wait() which
     cannot succeed (but waits ~1 minute, delaying the resume).

The Target Speed trick (in step 2) is only used if the LBMS bit (PCIe r6.1
sec 7.5.3.8) is set. For links that were operational before suspend, it is
quite possible that LBMS was set at the bridge and remains set. Thus, after
resume, LBMS does not indicate that the link needs the Target Speed quirk.
Clear LBMS on resume for bridges to avoid the issue.

Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
 drivers/pci/pci-driver.c | 6 ++++++
 1 file changed, 6 insertions(+)
  

Comments

Bjorn Helgaas Jan. 29, 2024, 6:43 p.m. UTC | #1
On Mon, Jan 29, 2024 at 01:27:09PM +0200, Ilpo Järvinen wrote:
> While a device is runtime suspended along with its PCIe hierarchy, the
> device could get disconnected. Because of the suspend, the device
> disconnection cannot be detected until portdrv/hotplug have resumed. On
> runtime resume, pcie_wait_for_link_delay() is called:
> 
>   pci_pm_runtime_resume()
>     pci_pm_bridge_power_up_actions()
>       pci_bridge_wait_for_secondary_bus()
>         pcie_wait_for_link_delay()
> 
> Because the device is already disconnected, this results in cascading
> failures:
> 
>   1. pcie_wait_for_link_status() returns -ETIMEDOUT.
> 
>   2. After the commit a89c82249c37 ("PCI: Work around PCIe link
>      training failures"),

I think this also depends on the merge resolution in 1abb47390350
("Merge branch 'pci/enumeration'").  Just looking at a89c82249c37 in
isolation suggests that pcie_wait_for_link_status() returning
-ETIMEDOUT would not cause pcie_wait_for_link_delay() to call
pcie_failed_link_retrain().

>      pcie_failed_link_retrain() spuriously detects
>      this failure as a Link Retraining failure and attempts the Target
>      Speed trick, which also fails.

Based on the comment below, I guess "Target Speed trick" probably
refers to the "retrain at 2.5GT/s, then remove the speed restriction
and retrain again" part of pcie_failed_link_retrain() (which I guess
is basically the entire point of the function)?
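
Just to make sure I follow, I imagine it goes roughly like the below --
this is only my paraphrase of the idea, not the literal quirk code, and
"target_speed" stands for whatever speed the quirk restores afterwards:

  /* Clamp the Target Link Speed to 2.5GT/s and retrain the link. */
  pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL2,
				     PCI_EXP_LNKCTL2_TLS,
				     PCI_EXP_LNKCTL2_TLS_2_5GT);
  pcie_retrain_link(dev, false);

  /* Then lift the speed restriction and retrain once more. */
  pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL2,
				     PCI_EXP_LNKCTL2_TLS, target_speed);
  pcie_retrain_link(dev, false);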

>   3. pci_bridge_wait_for_secondary_bus() then calls pci_dev_wait() which
>      cannot succeed (but waits ~1 minute, delaying the resume).
> 
> The Target Speed trick (in step 2) is only used if the LBMS bit (PCIe r6.1
> sec 7.5.3.8) is set. For links that were operational before suspend, it is
> quite possible that LBMS was set at the bridge and remains set. Thus, after
> resume, LBMS does not indicate that the link needs the Target Speed quirk.
> Clear LBMS on resume for bridges to avoid the issue.
> 
> Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
> Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> ---
>  drivers/pci/pci-driver.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 51ec9e7e784f..05a114962df3 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -574,6 +574,12 @@ static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
>  {
>  	int ret;
>  
> +	/*
> +	 * Clear LBMS on resume to avoid spuriously triggering Target Speed
> +	 * quirk in pcie_failed_link_retrain().
> +	 */
> +	pcie_capability_write_word(pci_dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
> +
>  	ret = pci_bridge_wait_for_secondary_bus(pci_dev, "resume");
>  	if (ret) {
>  		/*
> -- 
> 2.39.2
>
  
Ilpo Järvinen Jan. 30, 2024, 11:53 a.m. UTC | #2
On Mon, 29 Jan 2024, Bjorn Helgaas wrote:

> On Mon, Jan 29, 2024 at 01:27:09PM +0200, Ilpo Järvinen wrote:
> > While a device is runtime suspended along with its PCIe hierarchy, the
> > device could get disconnected. Because of the suspend, the device
> > disconnection cannot be detected until portdrv/hotplug have resumed. On
> > runtime resume, pcie_wait_for_link_delay() is called:
> > 
> >   pci_pm_runtime_resume()
> >     pci_pm_bridge_power_up_actions()
> >       pci_bridge_wait_for_secondary_bus()
> >         pcie_wait_for_link_delay()
> > 
> > Because the device is already disconnected, this results in cascading
> > failures:
> > 
> >   1. pcie_wait_for_link_status() returns -ETIMEDOUT.
> > 
> >   2. After the commit a89c82249c37 ("PCI: Work around PCIe link
> >      training failures"),
> 
> I think this also depends on the merge resolution in 1abb47390350
> ("Merge branch 'pci/enumeration'").  Just looking at a89c82249c37 in
> isolation suggests that pcie_wait_for_link_status() returning
> -ETIMEDOUT would not cause pcie_wait_for_link_delay() to call
> pcie_failed_link_retrain().

I was aware of the merge, but I seem to have somehow misanalyzed the
return values earlier since I can no longer reach my earlier conclusion,
and I now agree with your analysis that 1abb47390350 broke it.

That would imply there is a logic error in 1abb47390350 in addition to
the LBMS-logic problem in a89c82249c37 that my patch is fixing... However,
I cannot pinpoint a single error because there seems to be more than one
in this code.

First of all, this comment on pcie_failed_link_retrain() is not true:
 * Return TRUE if the link has been successfully retrained, otherwise FALSE.
If LBMS is not set, the Target Speed quirk is not applied but the function
still returns true. I think that should be changed to return false early
when LBMS is not present.
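
Something along these lines, perhaps (an untested sketch just to
illustrate the early return I mean; "dev" and "lnksta" are locals of
pcie_failed_link_retrain() here):

	u16 lnksta;

	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
	if (!(lnksta & PCI_EXP_LNKSTA_LBMS))
		return false;	/* Nothing retrained, don't claim success. */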

But if I make that change, then pcie_wait_for_link_delay() will do
msleep() + return true, and pci_bridge_wait_for_secondary_bus() will call
the long (~60 s) pci_dev_wait().

I'll try to come up with another patch to clean up all that return logic
so that it actually starts to make some sense.

> >      pcie_failed_link_retrain() spuriously detects
> >      this failure as a Link Retraining failure and attempts the Target
> >      Speed trick, which also fails.
> 
> Based on the comment below, I guess "Target Speed trick" probably
> refers to the "retrain at 2.5GT/s, then remove the speed restriction
> and retrain again" part of pcie_failed_link_retrain() (which I guess
> is basically the entire point of the function)?

Yes. I'll change the wording slightly to make it more obvious and put
"(Target Speed quirk)" in parentheses so I can use the term below.

> >   3. pci_bridge_wait_for_secondary_bus() then calls pci_dev_wait() which
> >      cannot succeed (but waits ~1 minute, delaying the resume).
> > 
> > The Target Speed trick (in step 2) is only used if the LBMS bit (PCIe r6.1
> > sec 7.5.3.8) is set. For links that were operational before suspend, it is
> > quite possible that LBMS was set at the bridge and remains set. Thus, after
> > resume, LBMS does not indicate that the link needs the Target Speed quirk.
> > Clear LBMS on resume for bridges to avoid the issue.
  
Ilpo Järvinen Feb. 7, 2024, 12:33 p.m. UTC | #3
On Fri, 2 Feb 2024, Ilpo Järvinen wrote:
> On Thu, 1 Feb 2024, Maciej W. Rozycki wrote:
> > On Thu, 1 Feb 2024, Ilpo Järvinen wrote:
> >
> > > > >  What kind of scenario does the LBMS bit get set in that you have
> > > > > trouble with?  You write that in your case the downstream device has
> > > > > been disconnected while the bus hierarchy was suspended and it is not
> > > > > present anymore at resume, is that correct?
> > > > >
> > > > >  But in that case no link negotiation could have been possible and 
> > > > > consequently the LBMS bit mustn't have been set by hardware (according to 
> > > > > my understanding of PCIe), because (for compliant, non-broken devices 
> > > > > anyway) it is only specified to be set for ports that can communicate with 
> > > > > the other link end (the spec explicitly says there mustn't have been a 
> > > > > transition through the DL_Down status for the port).
> > > > >
> > > > >  Am I missing something?
> > > > 
> > > > Yes, when resuming, the device is already gone but the bridge still
> > > > has LBMS set. My understanding is that it was set because it was there
> > > > from pre-suspend time, but I've not really taken a deep look into it
> > > > because the problem and fix seemed obvious.
> > 
> >  I've always been confused with the suspend/resume terminology: I'd have 
> > assumed this would have gone through a power cycle, in which case the LBMS 
> > bit would have necessarily been cleared in the transition, because its 
> > required state at power-up/reset is 0, so the value of 1 observed would be 
> > a result of what has happened solely through the resume stage.  Otherwise 
> > it may make sense to clear the bit in the course of the suspend stage, 
> > though it wouldn't be race-free I'm afraid.
> 
> I also thought of suspend as one possibility but yes, it's racy. Mika also
> suggested clearing LBMS after each successful retrain, but that wouldn't
> cover all possible ways to get LBMS set, as devices can set it
> independently of the OS. Keeping it cleared constantly is pretty much what
> will happen with the bandwidth controller anyway.
> 
> > > > I read that "without the Port transitioning through DL_Down status"
> > > > differently than you; I only interpret it as relating to the two
> > > > bullets following it. ...So if the Retrain Link bit is set, and the
> > > > link then goes down, the bullet no longer applies and LBMS should not
> > > > be set because there was a transition through DL_Down. But I could
> > > > well be wrong...
> > 
> > What I refer to is that if you suspend your system, remove the device 
> > that originally caused the quirk to trigger and then resume your system 
> > with the device absent,
> 
> A small correction here: the quirk didn't trigger initially for the
> device, it does that only after resume. And even then the quirk is called
> only because the link doesn't come up.
> 
> In the longer term, I'd actually want to have hotplug resumed earlier so
> the disconnect could be detected before/while all this link-up related
> waiting is done. But that's very complicated to realize in practice because
> hotplug lurks behind portdrv, so resuming it earlier isn't going to be just
> a matter of moving a few lines around.
> 
> > then LBMS couldn't have been set in the course of 
> > resume, because the port couldn't have come out of the DL_Down status in 
> > the absence of the downstream device.  Do you interpret it differently?
> 
> That's a good question and I don't have an answer to this yet; that is,
> I don't fully understand what happens to those devices during the runtime
> suspend/resume cycle and what exact mechanism preserves the LBMS bit.
> I'll look more into it.
> 
> But I agree that if the device goes cold enough and the downstream device
> is disconnected, the port should no longer have a way to reassert LBMS.

It seems that the root cause here is that the bridge ports do not enter
D3cold but remain in D3hot because of an ACPI power resource that is
shared (with Thunderbolt in this case, but that's just one example; there
could be other similar sharing outside of what PCI controls).

There is an attempt to power off the entire hierarchy into D3cold on
suspend, but the ports won't reach that state. Because the port remains in
D3hot, the config space is preserved, and the LBMS bit along with it.
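
Just to illustrate the effect (a hypothetical debug aid, not part of any
patch): checking the bridge's power state in the resume path shows it
never reports D3cold, so config space (and LBMS with it) survives:

	if (pci_dev->current_state != PCI_D3cold)
		pci_info(pci_dev, "resumed from %s, LNKSTA (and LBMS) preserved\n",
			 pci_power_name(pci_dev->current_state));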

So it seems that we cannot assume that a device enters D3cold just
because it was suspended.


On the positive side, it was also verified in testing that the logic fix I
sent earlier solves the most visible problem, which is the delay on resume.
It doesn't address the false activation of the Target Speed quirk, because
the LBMS bit is still set, but the symptoms are no longer the same thanks
to the corrected logic.

So to solve the main issue, the delay, there's no need to clear LBMS;
the patch you're preparing/testing for pcie_failed_link_retrain(),
together with the logic fix in its caller, is enough to address the
first issue.

I still think those two should be put into the same commit, because the
true <-> false changes, if made separately, lead to additional incoherent
states. But if Bjorn wants them separate, at least they should be placed
back-to-back so the brokenness is confined to a single commit in the
history.

In addition, my 2nd patch addresses another issue which is unrelated to
this one despite the similar symptom of extra delay on resume. I'll send
v2 of that 2nd patch separately with the inverted return value.

> > > Because if it is constantly picking another speed, it would mean you
> > > get LBMS set over and over again, no? If that happens 34-35 times per
> > > second, it should already be set again by the time we get into that
> > > quirk, because there was some wait before it gets called.
> > 
> >  I'll see if I can experiment with the hardware over the next couple of 
> > days and come back with my findings.
> 
> Okay thanks.

One point I'd be very interested to know is whether the link actually
comes up successfully (even if briefly), because this has large
implications for whether the quirk can actually be invoked from the
bandwidth controller code.
  

Patch

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 51ec9e7e784f..05a114962df3 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -574,6 +574,12 @@ static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
 {
 	int ret;
 
+	/*
+	 * Clear LBMS on resume to avoid spuriously triggering Target Speed
+	 * quirk in pcie_failed_link_retrain().
+	 */
+	pcie_capability_write_word(pci_dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
+
 	ret = pci_bridge_wait_for_secondary_bus(pci_dev, "resume");
 	if (ret) {
 		/*