[net-next] net: fec: Always call fec_restart() in resume path

Message ID 20240212105010.2258421-1-john.ernberg@actia.se
State New
Series [net-next] net: fec: Always call fec_restart() in resume path

Commit Message

John Ernberg Feb. 12, 2024, 10:50 a.m. UTC
When trying to resume from suspend, the following can be observed:

    fec 5b040000.ethernet eth0: MDIO read timeout
    Microchip LAN87xx T1 5b040000.ethernet-1:04: PM: dpm_run_callback(): mdio_bus_phy_resume+0x0/0xc8 returns -110
    Microchip LAN87xx T1 5b040000.ethernet-1:04: PM: failed to resume: error -110

This is because the MAC is left powered down after resuming from suspend.

The MAC is brought up in both probe and open, so leaving it off when
resuming from suspend is an imbalance.
This imbalance, combined with a LAN8700R that is permanently powered,
results in unusable networking if the board happens to suspend before
the link is brought up, and the only way out of it is a full power
cycle.
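
For illustration only, the imbalance can be reduced to a toy userspace
model. This is a sketch, not driver code: every name below (toy_fec,
toy_restart(), toy_resume(), ...) is hypothetical, the "MAC powered"
state is collapsed into a single flag, and the assumption that suspend
always powers the MAC down is a simplification of the behaviour
described above.

```c
/* Toy userspace model of the imbalance; NOT the real fec_main.c code. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_fec {
	bool mac_enabled;   /* stands in for "MAC is powered and running" */
	bool netif_running; /* true once the interface has been opened */
};

/* Hypothetical stand-in for fec_restart(): brings the MAC up. */
void toy_restart(struct toy_fec *f) { f->mac_enabled = true; }

/* Probe and open both bring the MAC up, as the commit message states. */
void toy_probe(struct toy_fec *f) { f->netif_running = false; toy_restart(f); }
void toy_open(struct toy_fec *f) { toy_restart(f); f->netif_running = true; }

/* Assumption: suspend powers the MAC down regardless of interface state. */
void toy_suspend(struct toy_fec *f) { f->mac_enabled = false; }

/* patched == true models the added "else" branch in the resume path. */
void toy_resume(struct toy_fec *f, bool patched)
{
	if (f->netif_running)
		toy_restart(f);
	else if (patched)
		toy_restart(f);
}

/* MDIO access goes through the MAC, so it times out when the MAC is off. */
int toy_mdio_read(const struct toy_fec *f)
{
	return f->mac_enabled ? 0 : -ETIMEDOUT;
}

/* Scenario from above: the board suspends before the link is brought up. */
int toy_resume_before_open(bool patched)
{
	struct toy_fec f = { 0 };

	toy_probe(&f);
	toy_suspend(&f);
	toy_resume(&f, patched);
	return toy_mdio_read(&f);
}

/* Usage: the unpatched model times out, the patched one does not. */
void toy_self_check(void)
{
	assert(toy_resume_before_open(false) == -ETIMEDOUT);
	assert(toy_resume_before_open(true) == 0);
}
```

On Linux ETIMEDOUT is 110, which matches the -110 in the log above.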

NOTE: With this change the PHY ends up taking different resume paths when
the link has never been up compared to when the link has been up. Currently
the resume process is identical and just happens at different times, so
this *should* not have any unforeseen consequences.

Signed-off-by: John Ernberg <john.ernberg@actia.se>
---

Tested on 6.1 kernel and forward ported. I discovered this when we
upgraded from 5.10 to 6.1, but the resume path in the FEC driver has had
this imbalance since at least 2009.

This is also why I target the -next tree, I can't identify a proper commit
to blame with a Fixes. Let me know if this should be the net tree anyway.

 drivers/net/ethernet/freescale/fec_main.c | 2 ++
 1 file changed, 2 insertions(+)
  

Comments

Jakub Kicinski Feb. 14, 2024, 2:44 a.m. UTC | #1
On Mon, 12 Feb 2024 10:50:30 +0000 John Ernberg wrote:
> Tested on 6.1 kernel and forward ported. I discovered this when we
> upgraded from 5.10 to 6.1, but the resume path in the FEC driver has had
> this imbalance since at least 2009.
> 
> This is also why I target the -next tree, I can't identify a proper commit
> to blame with a Fixes. Let me know if this should be the net tree anyway.

I thought you bisected it to one or two specific changes?
I'd put those down as Fixes tags and target net.
  
John Ernberg Feb. 14, 2024, 8:27 a.m. UTC | #2
On 2/14/24 03:44, Jakub Kicinski wrote:
> On Mon, 12 Feb 2024 10:50:30 +0000 John Ernberg wrote:
>> Tested on 6.1 kernel and forward ported. I discovered this when we
>> upgraded from 5.10 to 6.1, but the resume path in the FEC driver has had
>> this imbalance since at least 2009.
>>
>> This is also why I target the -next tree, I can't identify a proper commit
>> to blame with a Fixes. Let me know if this should be the net tree anyway.
> 
> I thought you bisected it to one or two specific changes?
> I'd put those down as Fixes tags and target net.

Hi Jakub,

You are correct, we thought so too at [1], but bisection is really hard 
because we need a whole bunch of patches on top to even boot the system 
(imx8qxp specific stuff in the NXP vendor tree that's difficult to 
rebase), we left it a bit open ended.

Over the course of the weekend I lost all confidence in my bisection 
after being confident for 4-5 days, because the more I thought about it 
the less it made sense for that commit to be the culprit.

I should probably have both followed up on that mail with that, and been 
clearer here. I apologize for failing that.

Best regards // John Ernberg

[1]: https://lore.kernel.org/netdev/1f45bdbe-eab1-4e59-8f24-add177590d27@actia.se/
  
Jakub Kicinski Feb. 14, 2024, 2:52 p.m. UTC | #3
On Wed, 14 Feb 2024 08:27:02 +0000 John Ernberg wrote:
> You are correct, we thought so too at [1], but bisection is really hard 
> because we need a whole bunch of patches on top to even boot the system 
> (imx8qxp specific stuff in the NXP vendor tree that's difficult to 
> rebase), we left it a bit open ended.
> 
> Over the course of the weekend I lost all confidence in my bisection 
> after being confident for 4-5 days, because the more I thought about it 
> the less it made sense for that commit to be the culprit.
> 
> I should probably have both followed up on that mail with that, and been 
> clearer here. I apologize for failing that.

Is it perhaps possible that upstream 5.10 also didn't work?
I'm not saying the change itself is incorrect, indeed there 
is fec_restart() on probe and open paths, as you say.
Did you try reverting as many of the changes that happened
in the meantime as possible (instead of bisection)?

The other question is whether we need to enable any of the
clocks or runtime resume before calling fec_restart()?
  
John Ernberg Feb. 14, 2024, 3:49 p.m. UTC | #4
On 2/14/24 15:52, Jakub Kicinski wrote:
> On Wed, 14 Feb 2024 08:27:02 +0000 John Ernberg wrote:
>> You are correct, we thought so too at [1], but bisection is really hard
>> because we need a whole bunch of patches on top to even boot the system
>> (imx8qxp specific stuff in the NXP vendor tree that's difficult to
>> rebase), we left it a bit open ended.
>>
>> Over the course of the weekend I lost all confidence in my bisection
>> after being confident for 4-5 days, because the more I thought about it
>> the less it made sense for that commit to be the culprit.
>>
>> I should probably have both followed up on that mail with that, and been
>> clearer here. I apologize for failing that.
> 
> Is it perhaps possible that upstream 5.10 also didn't work?
> I'm not saying the change itself is incorrect, indeed there
> is fec_restart() on probe and open paths, as you say.
> Did you try reverting as many of the changes that happened
> in the meantime as possible (instead of bisection)?
> 

That's a really good point. I'll make some time for this in the coming weeks.
Please mark it as changes requested in the meantime, as I expect to 
make changes to the patch when I have a result.

> The other question is whether we need to enable any of the
> clocks or runtime resume before calling fec_restart()?

On our board it works fine without it, but I don't know enough about this 
SoC or other NXP SoCs to know if it's necessary in other situations.

The clocks are re-enabled in the open call which appears to be enough to 
get traffic going again when the link is brought up.

Perhaps NXP can fill us in?

Thanks! // John Ernberg
  

Patch

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 42bdc01a304e..e6804c068d6b 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -4706,6 +4706,8 @@  static int __maybe_unused fec_resume(struct device *dev)
 		napi_enable(&fep->napi);
 		phy_init_hw(ndev->phydev);
 		phy_start(ndev->phydev);
+	} else {
+		fec_restart(ndev);
 	}
 	rtnl_unlock();