can: mcp251xfd: Always stop on BUS_OFF and call netif_stop_queue

Message ID 20230711152647.28673-1-mike.looijmans@topic.nl
State New
Headers
Series can: mcp251xfd: Always stop on BUS_OFF and call netif_stop_queue |

Commit Message

Mike Looijmans July 11, 2023, 3:26 p.m. UTC
  When there's an error attempting to store the BER counter, don't abort
but continue shutting down the chip as required.

After disabling communications, also stop the TX queue with a call to
netif_stop_queue. When the interface restarts, mcp251xfd_set_mode will
call netif_wake_queue and resume.

This fixes a hangup in either send() or poll() from userspace. To
reproduce: I ran "cansequence can0 -p" to flood the system with packets.
While running that, I shorted the CAN signals, causing a bus error.
Usually communications would resume after the CAN bus restarted, but
sometimes the system got stuck and refused to send any more packets.
The sending process would be in either poll() or send(), waiting for
the queue to resume. To "unstuck" the process from send() it was
sufficient to send any packet on the can interface from another
process. To get it out of the poll() hang, only bringing the can
interface down (and up) would work.

After adding the netif_stop_queue call, I was unable to reproduce the
problem.

Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl>

---

 drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
  

Comments

Marc Kleine-Budde July 11, 2023, 5:05 p.m. UTC | #1
On 11.07.2023 17:26:47, Mike Looijmans wrote:
> When there's an error attempting to store the BER counter, don't abort
> but continue shutting down the chip as required.

If you refer to an error by __mcp251xfd_get_berr_counter(), it's not a
store error, but a failure of regmap_read(priv->map_reg,
MCP251XFD_REG_TREC, &trec). By default the SPI transfers are CRC enabled
and no/wrong data from the chip will be detected and return an error
here (after 3 retires). In out of memory conditions or other kernel
errors might be possible here, too.

Have you seen a problem here? But as we shut down the chip here anyways,
we can ignore the error here.

> After disabling communications, also stop the TX queue with a call to
> netif_stop_queue.

can_bus_off() will call netif_carrier_off(), isn't this sufficient? Have
you enabled automatic restart in case of bus off? I think the netdev
watchdog will kick you, if the interface has a stooped queue for too
long (IIRC 5 seconds).

> When the interface restarts, mcp251xfd_set_mode will
> call netif_wake_queue and resume.
> 
> This fixes a hangup in either send() or poll() from userspace. To
> reproduce: I ran "cansequence can0 -p" to flood the system with packets.
> While running that, I shorted the CAN signals, causing a bus error.
> Usually communications would resume after the CAN bus restarted, but
> sometimes the system got stuck and refused to send any more packets.
> The sending process would be in either poll() or send(), waiting for
> the queue to resume. To "unstuck" the process from send() it was
> sufficient to send any packet on the can interface from another
> process. To get it out of the poll() hang, only bringing the can
> interface down (and up) would work.
> 
> After adding the netif_stop_queue call, I was unable to reproduce the
> problem.

The newly added netif_stop_queue() will cause the netif_wake_queue() in
the mcp251xfd_set_mode() to actually wake the queue. If you observe a
problem, I think it's a general problem, so all drivers would be
effected.

> Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl>

Marc
  
Mike Looijmans July 12, 2023, 5:43 a.m. UTC | #2
Met vriendelijke groet / kind regards,

Mike Looijmans
System Expert


TOPIC Embedded Products B.V.
Materiaalweg 4, 5681 RJ Best
The Netherlands

T: +31 (0) 499 33 69 69
E: mike.looijmans@topicproducts.com
W: www.topic.nl

Please consider the environment before printing this e-mail
On 11-07-2023 19:05, Marc Kleine-Budde wrote:
> On 11.07.2023 17:26:47, Mike Looijmans wrote:
>> When there's an error attempting to store the BER counter, don't abort
>> but continue shutting down the chip as required.
> 
> If you refer to an error by __mcp251xfd_get_berr_counter(), it's not a
> store error, but a failure of regmap_read(priv->map_reg,
> MCP251XFD_REG_TREC, &trec). By default the SPI transfers are CRC enabled
> and no/wrong data from the chip will be detected and return an error
> here (after 3 retires). In out of memory conditions or other kernel
> errors might be possible here, too.
> 
> Have you seen a problem here? But as we shut down the chip here anyways,
> we can ignore the error here.

Excactly my point, the important task here is to report the bus off state.

I haven't seen the error trigger, so it wasn't the cause of our problems.

In retrospect, maybe I should have split this in two patches?


>> After disabling communications, also stop the TX queue with a call to
>> netif_stop_queue.
> 
> can_bus_off() will call netif_carrier_off(), isn't this sufficient? Have
> you enabled automatic restart in case of bus off? I think the netdev
> watchdog will kick you, if the interface has a stooped queue for too
> long (IIRC 5 seconds).


Apparently not. The problem also occurred in the field, where the CAN 
communication got stuck for something like a full day.

Automatic restart was enabled, with a 100ms timeout. Setting a shorter timeout 
like 50 ms seemed to make the problem more likely to occur

What I also noticed, but didn't accurately measure, is that the time it took 
to recover used to be somewhat random, it could take up to about a second for 
packets to start going out again, as observed by watching LEDs. After adding 
netif_stop_queue the recovery was almost instantaneous.


> 
>> When the interface restarts, mcp251xfd_set_mode will
>> call netif_wake_queue and resume.
>>
>> This fixes a hangup in either send() or poll() from userspace. To
>> reproduce: I ran "cansequence can0 -p" to flood the system with packets.
>> While running that, I shorted the CAN signals, causing a bus error.
>> Usually communications would resume after the CAN bus restarted, but
>> sometimes the system got stuck and refused to send any more packets.
>> The sending process would be in either poll() or send(), waiting for
>> the queue to resume. To "unstuck" the process from send() it was
>> sufficient to send any packet on the can interface from another
>> process. To get it out of the poll() hang, only bringing the can
>> interface down (and up) would work.
>>
>> After adding the netif_stop_queue call, I was unable to reproduce the
>> problem.
> 
> The newly added netif_stop_queue() will cause the netif_wake_queue() in
> the mcp251xfd_set_mode() to actually wake the queue. If you observe a
> problem, I think it's a general problem, so all drivers would be
> effected.

After two days digging, my conclusion was that "something" wasn't getting a 
proper "kick" when the CAN interface came back online.


One could argue the netif_stop_queue should be put into can_bus_off. But the 
netif_wake_queue call is in the driver, so that would break symmetry.


> 
>> Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl>
> 
> Marc
>
  

Patch

diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
index 68df6d4641b5..854e0644764c 100644
--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
@@ -1084,10 +1084,11 @@  static int mcp251xfd_handle_cerrif(struct mcp251xfd_priv *priv)
 		 */
 		err = __mcp251xfd_get_berr_counter(priv->ndev, &priv->bec);
 		if (err)
-			return err;
+			netdev_err(priv->ndev, "Failed to store BER counter\n");
 
 		mcp251xfd_chip_stop(priv, CAN_STATE_BUS_OFF);
 		can_bus_off(priv->ndev);
+		netif_stop_queue(priv->ndev);
 	}
 
 	if (!skb)