[2/3] r8152: Retry register reads/writes

Message ID 20230926142724.2.I65ea4ac938a55877dc99fdf5b3883ad92d8abce2@changeid
State New
Series r8152: Avoid writing garbage to the adapter's registers

Commit Message

Doug Anderson Sept. 26, 2023, 9:27 p.m. UTC
Even though the functions to read/write registers can fail, most of
the places in the r8152 driver that read/write register values don't
check error codes. The lack of error code checking is problematic in
at least two ways.

The first problem is that the r8152 driver often uses code patterns
similar to this:
  x = read_register();
  x = x | SOME_BIT;
  write_register(x);

...with the above pattern, if the read_register() fails and returns
garbage then we'll end up trying to write modified garbage back to the
Realtek adapter. If the write_register() succeeds that's bad. Note
that as of commit f53a7ad18959 ("r8152: Set memory to all 0xFFs on
failed reg reads") the "garbage" returned by read_register() will at
least be consistent garbage, but it is still garbage.
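
The hazard described above can be reproduced outside the kernel. The following is a minimal userspace sketch, not driver code: `struct fake_adapter`, `read_register()`, `write_register()`, `set_bit_unchecked()` and `set_bit_checked()` are all hypothetical names invented for illustration, and the single 32-bit register is an assumption. The fake read mimics the 0xFF fill introduced by commit f53a7ad18959.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SOME_BIT 0x00000004u

/* Hypothetical stand-in for the adapter: one 32-bit register plus a
 * switch that makes the next read fail. Not part of the r8152 driver.
 */
struct fake_adapter {
	uint32_t reg;
	int fail_next_read;
};

/* On failure, fill the output with 0xff bytes and return a negative
 * errno, mimicking the "consistent garbage" behaviour.
 */
static int read_register(struct fake_adapter *a, uint32_t *val)
{
	assert(a && val);

	if (a->fail_next_read) {
		a->fail_next_read = 0;
		*val = 0xffffffffu;
		return -ETIMEDOUT;
	}

	*val = a->reg;
	return 0;
}

/* Writes always succeed in this sketch. */
static int write_register(struct fake_adapter *a, uint32_t val)
{
	a->reg = val;
	return 0;
}

/* The pattern from the commit message: the read's error code is ignored,
 * so a failed read leads to garbage being written back.
 */
static void set_bit_unchecked(struct fake_adapter *a)
{
	uint32_t x;

	read_register(a, &x);
	x |= SOME_BIT;
	write_register(a, x);
}

/* Same operation, but the write is skipped when the read fails. */
static int set_bit_checked(struct fake_adapter *a)
{
	uint32_t x;
	int ret;

	ret = read_register(a, &x);
	if (ret < 0)
		return ret;

	return write_register(a, x | SOME_BIT);
}
```

With a register holding 0x10 and one failed read, the unchecked variant leaves 0xffffffff in the register, while the checked variant returns the error and leaves 0x10 untouched.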

It turns out that this problem is very serious. Writing garbage to
some of the hardware registers on the Ethernet adapter can put the
adapter in such a bad state that it needs to be power cycled (fully
unplugged and plugged in again) before it can enumerate again.

The second problem is that the r8152 driver generally has functions
that are long sequences of register writes. Assuming everything will
be OK if a random register write fails in the middle isn't a great
assumption.

One might wonder if the above two problems are real. You could ask if
we would really have a successful write after a failed read. It turns
out that the answer appears to be "yes, this can happen". In fact,
we've seen at least two distinct failure modes where this happens.

On a sc7180-trogdor Chromebook if you drop into kdb for a while and
then resume, you can see:
1. We get a "Tx timeout"
2. The "Tx timeout" queues up a USB reset.
3. In rtl8152_pre_reset() we try to reinit the hardware.
4. The first several (2-9) register accesses fail with a timeout, then
   things recover.

The above test case was actually fixed by the patch ("r8152: Increase
USB control msg timeout to 5000ms as per spec") but at least shows
that we really can see successful calls after failed ones.

On a different (AMD-based) Chromebook, we found that during reboot
tests we'd also sometimes get a transitory failure. In this case we
saw -EPIPE being returned sometimes. Retrying one time worked fine.

To keep things robust, let's try register access up to 3 times. If we
get three 5-second timeouts in a row this could block things for 15
seconds but that hasn't been seen in practice. If we see that
happening and there is a better way to solve it then we can add a
special case for that later.
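
The retry policy and its worst-case cost can be illustrated with a small userspace sketch. It is only a model under stated assumptions: `xfer_fn`, `struct retry_result`, `access_with_retries()`, `fail_n_times()`, `worst_case_block_ms()` and `CTRL_TIMEOUT_MS` are hypothetical names, the transfer is a mock callback rather than a real USB control message, and the 5000 ms per-attempt timeout is taken from the figures quoted above.

```c
#include <assert.h>
#include <errno.h>

#define REGISTER_ACCESS_TRIES	3
#define CTRL_TIMEOUT_MS		5000	/* assumed per-attempt timeout */

/* Hypothetical transfer callback: returns >= 0 on success or a negative
 * errno on failure, like a control message would.
 */
typedef int (*xfer_fn)(void *ctx);

struct retry_result {
	int ret;	/* result of the last attempt */
	int tries;	/* number of attempts actually made */
};

/* Try the transfer up to REGISTER_ACCESS_TRIES times, stopping at the
 * first success.
 */
static struct retry_result access_with_retries(xfer_fn xfer, void *ctx)
{
	struct retry_result r = { .ret = -EIO, .tries = 0 };
	int i;

	assert(xfer);

	for (i = 0; i < REGISTER_ACCESS_TRIES; i++) {
		r.ret = xfer(ctx);
		r.tries = i + 1;
		if (r.ret >= 0)
			break;
	}

	return r;
}

/* Mock transfer: fails with -EPIPE while the counter in ctx is positive,
 * then succeeds.
 */
static int fail_n_times(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EPIPE;
	}

	return 0;
}

/* Upper bound on blocking time if every attempt runs into the timeout. */
static int worst_case_block_ms(void)
{
	return REGISTER_ACCESS_TRIES * CTRL_TIMEOUT_MS;
}
```

A single transient -EPIPE is absorbed on the second attempt, matching the AMD case described above. If every attempt times out, the bound works out to 3 * 5000 ms = 15000 ms.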

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Originally when looking at this problem I thought that the obvious
solution was to "just" add better error handling to the driver. This
_sounds_ appealing, but it's a massive change and touches a
significant portion of the lines in this driver. It's also not always
obvious what the driver should be doing to handle errors.

 drivers/net/usb/r8152.c | 67 +++++++++++++++++++++++++++++++++++------
 1 file changed, 57 insertions(+), 10 deletions(-)
  

Comments

Alan Stern Sept. 27, 2023, 1:43 p.m. UTC | #1
On Tue, Sep 26, 2023 at 02:27:27PM -0700, Douglas Anderson wrote:
> +
> +static
> +int r8152_control_msg(struct usb_device *udev, unsigned int pipe, __u8 request,
> +		      __u8 requesttype, __u16 value, __u16 index, void *data,
> +		      __u16 size, const char *msg_tag)
> +{
> +	int i;
> +	int ret;
> +
> +	for (i = 0; i < REGISTER_ACCESS_TRIES; i++) {
> +		ret = usb_control_msg(udev, pipe, request, requesttype,
> +				      value, index, data, size,
> +				      USB_CTRL_GET_TIMEOUT);
> +
> +		/* No need to retry or spam errors if the USB device got
> +		 * unplugged; just return immediately.
> +		 */
> +		if (udev->state == USB_STATE_NOTATTACHED)
> +			return ret;

Rather than testing udev->state, it would be better to check whether
ret == -ENODEV.  udev->state is meant primarily for use by the USB core
and it's subject to races.

Alan Stern
  
Doug Anderson Sept. 27, 2023, 3:28 p.m. UTC | #2
Hi,

On Wed, Sep 27, 2023 at 6:43 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Tue, Sep 26, 2023 at 02:27:27PM -0700, Douglas Anderson wrote:
> > +
> > +static
> > +int r8152_control_msg(struct usb_device *udev, unsigned int pipe, __u8 request,
> > +                   __u8 requesttype, __u16 value, __u16 index, void *data,
> > +                   __u16 size, const char *msg_tag)
> > +{
> > +     int i;
> > +     int ret;
> > +
> > +     for (i = 0; i < REGISTER_ACCESS_TRIES; i++) {
> > +             ret = usb_control_msg(udev, pipe, request, requesttype,
> > +                                   value, index, data, size,
> > +                                   USB_CTRL_GET_TIMEOUT);
> > +
> > +             /* No need to retry or spam errors if the USB device got
> > +              * unplugged; just return immediately.
> > +              */
> > +             if (udev->state == USB_STATE_NOTATTACHED)
> > +                     return ret;
>
> Rather than testing udev->state, it would be better to check whether
> ret == -ENODEV.  udev->state is meant primarily for use by the USB core
> and it's subject to races.

Thanks for looking my patch over!

Happy to change this to -ENODEV. In my early drafts of this patch I
looked at -ENODEV but I noticed that other places in the driver were
checking `udev->state == USB_STATE_NOTATTACHED` so I changed it. In
reality I think for this code path it doesn't matter a whole lot. The
only thing it's doing is avoiding a few extra retries and avoiding a
log message. :-)
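
For reference, a rough userspace sketch of how the loop might behave with the suggested `ret == -ENODEV` check. This is not the next revision of the patch; `transfer_fn`, `control_msg_with_retries()`, `always_enodev()` and `always_epipe()` are hypothetical names, and the transfer is a mock callback.

```c
#include <assert.h>
#include <errno.h>

#define REGISTER_ACCESS_TRIES	3

/* Hypothetical transfer callback returning >= 0 or a negative errno. */
typedef int (*transfer_fn)(void *ctx);

/* Retry loop that bails out on -ENODEV instead of inspecting the device
 * state. *tries reports how many attempts were made.
 */
static int control_msg_with_retries(transfer_fn xfer, void *ctx, int *tries)
{
	int ret = -EIO;
	int i;

	assert(xfer && tries);
	*tries = 0;

	for (i = 0; i < REGISTER_ACCESS_TRIES; i++) {
		ret = xfer(ctx);
		*tries = i + 1;

		/* Device is gone: retrying cannot help, so return at once. */
		if (ret == -ENODEV)
			return ret;

		if (ret >= 0)
			break;
	}

	return ret;
}

/* Mock transfers for exercising both paths. */
static int always_enodev(void *ctx)
{
	(void)ctx;
	return -ENODEV;
}

static int always_epipe(void *ctx)
{
	(void)ctx;
	return -EPIPE;
}
```

In this model an unplugged device costs one attempt, while any other persistent error still uses all three.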

I'll wait a few more days to see if there is any other feedback on
this series and then send a new version with that addressed. If
someone needs me to send a new version sooner then please yell.

-Doug
  

Patch

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 482957beae66..976d6caf2f04 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -1200,6 +1200,52 @@  static unsigned int agg_buf_sz = 16384;
 
 #define RTL_LIMITED_TSO_SIZE	(size_to_mtu(agg_buf_sz) - sizeof(struct tx_desc))
 
+/* If we get a failure and the USB device is still attached when trying to read
+ * or write registers then we'll retry a few times. Failures accessing registers
+ * shouldn't be common and this adds robustness. Much code in the driver doesn't
+ * check for errors. Notably, many parts of the driver do a read/modify/write
+ * of a register value without confirming that the read succeeded. Writing back
+ * modified garbage like this can fully wedge the adapter, requiring a power
+ * cycle.
+ */
+#define REGISTER_ACCESS_TRIES	3
+
+static
+int r8152_control_msg(struct usb_device *udev, unsigned int pipe, __u8 request,
+		      __u8 requesttype, __u16 value, __u16 index, void *data,
+		      __u16 size, const char *msg_tag)
+{
+	int i;
+	int ret;
+
+	for (i = 0; i < REGISTER_ACCESS_TRIES; i++) {
+		ret = usb_control_msg(udev, pipe, request, requesttype,
+				      value, index, data, size,
+				      USB_CTRL_GET_TIMEOUT);
+
+		/* No need to retry or spam errors if the USB device got
+		 * unplugged; just return immediately.
+		 */
+		if (udev->state == USB_STATE_NOTATTACHED)
+			return ret;
+
+		if (ret >= 0)
+			break;
+	}
+
+	if (ret < 0) {
+		dev_err(&udev->dev,
+			"Failed to %s %d bytes at %#06x/%#06x (%d)\n",
+			msg_tag, size, value, index, ret);
+	} else if (i != 0) {
+		dev_warn(&udev->dev,
+			 "Needed %d tries to %s %d bytes at %#06x/%#06x\n",
+			 i + 1, msg_tag, size, value, index);
+	}
+
+	return ret;
+}
+
 static
 int get_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)
 {
@@ -1210,9 +1256,10 @@  int get_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)
 	if (!tmp)
 		return -ENOMEM;
 
-	ret = usb_control_msg(tp->udev, tp->pipe_ctrl_in,
-			      RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
-			      value, index, tmp, size, USB_CTRL_GET_TIMEOUT);
+	ret = r8152_control_msg(tp->udev, tp->pipe_ctrl_in,
+				RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
+				value, index, tmp, size, "read");
+
 	if (ret < 0)
 		memset(data, 0xff, size);
 	else
@@ -1233,9 +1280,9 @@  int set_registers(struct r8152 *tp, u16 value, u16 index, u16 size, void *data)
 	if (!tmp)
 		return -ENOMEM;
 
-	ret = usb_control_msg(tp->udev, tp->pipe_ctrl_out,
-			      RTL8152_REQ_SET_REGS, RTL8152_REQT_WRITE,
-			      value, index, tmp, size, USB_CTRL_SET_TIMEOUT);
+	ret = r8152_control_msg(tp->udev, tp->pipe_ctrl_out,
+				RTL8152_REQ_SET_REGS, RTL8152_REQT_WRITE,
+				value, index, tmp, size, "write");
 
 	kfree(tmp);
 
@@ -9492,10 +9539,10 @@  static u8 __rtl_get_hw_ver(struct usb_device *udev)
 	if (!tmp)
 		return 0;
 
-	ret = usb_control_msg(udev, usb_rcvctrlpipe(udev, 0),
-			      RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
-			      PLA_TCR0, MCU_TYPE_PLA, tmp, sizeof(*tmp),
-			      USB_CTRL_GET_TIMEOUT);
+	ret = r8152_control_msg(udev, usb_rcvctrlpipe(udev, 0),
+				RTL8152_REQ_GET_REGS, RTL8152_REQT_READ,
+				PLA_TCR0, MCU_TYPE_PLA, tmp, sizeof(*tmp),
+				"read");
 	if (ret > 0)
 		ocp_data = (__le32_to_cpu(*tmp) >> 16) & VERSION_MASK;