[v2] usb: gadget: udc: core: Offload usb_udc_vbus_handler processing

Message ID 20230519043041.1593578-1-badhri@google.com
State New

Commit Message

Badhri Jagan Sridharan May 19, 2023, 4:30 a.m. UTC
  chipidea udc calls usb_udc_vbus_handler from udc_start gadget
ops causing a deadlock. Avoid this by offloading usb_udc_vbus_handler
processing.

============================================
WARNING: possible recursive locking detected
640-rc1-000-devel-00005-gcda3c69ebc14 #1 Not tainted
-------------------------------------------

CPU0
----
lock(&udc->connect_lock);
lock(&udc->connect_lock);

 DEADLOCK

stack backtrace:
  CPU: 1 PID: 566 Comm: echo Not tainted 640-rc1-000-devel-00005-gcda3c69ebc14 #1
  Hardware name: Freescale iMX7 Dual (Device Tree)
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x70/0xb0
  dump_stack_lvl from __lock_acquire+0x924/0x22c4
  __lock_acquire from lock_acquire+0x100/0x370
  lock_acquire from __mutex_lock+0xa8/0xfb4
  __mutex_lock from mutex_lock_nested+0x1c/0x24
  mutex_lock_nested from usb_udc_vbus_handler+0x1c/0x60
  usb_udc_vbus_handler from ci_udc_start+0x74/0x9c
  ci_udc_start from gadget_bind_driver+0x130/0x230
  gadget_bind_driver from really_probe+0xd8/0x3fc
  really_probe from __driver_probe_device+0x94/0x1f0
  __driver_probe_device from driver_probe_device+0x2c/0xc4
  driver_probe_device from __driver_attach+0x114/0x1cc
  __driver_attach from bus_for_each_dev+0x7c/0xcc
  bus_for_each_dev from bus_add_driver+0xd4/0x200
  bus_add_driver from driver_register+0x7c/0x114
  driver_register from usb_gadget_register_driver_owner+0x40/0xe0
  usb_gadget_register_driver_owner from gadget_dev_desc_UDC_store+0xd4/0x110
  gadget_dev_desc_UDC_store from configfs_write_iter+0xac/0x118
  configfs_write_iter from vfs_write+0x1b4/0x40c
  vfs_write from ksys_write+0x70/0xf8
  ksys_write from ret_fast_syscall+0x0/0x1c

Fixes: 0db213ea8eed ("usb: gadget: udc: core: Invoke usb_gadget_connect only when started")
Cc: stable@vger.kernel.org
Reported-by: Stephan Gerhold <stephan@gerhold.net>
Closes: https://lore.kernel.org/all/ZF4bMptC3Lf2Hnee@gerhold.net/
Reported-by: Francesco Dolcini <francesco.dolcini@toradex.com>
Closes: https://lore.kernel.org/all/ZF4BvgsOyoKxdPFF@francesco-nb.int.toradex.com/
Reported-by: Alistair <alistair@alistair23.me>
Closes: https://lore.kernel.org/lkml/0cf8c588b701d7cf25ffe1a9217b81716e6a5c51.camel@alistair23.me/
Signed-off-by: Badhri Jagan Sridharan <badhri@google.com>
---
Changes since v1:
- Address Alan Stern's comment on usb_udc_vbus_handler invocation from
  atomic context:
  * vbus_events_lock is now a spinlock and allocations in
    usb_udc_vbus_handler are atomic now.
---
 drivers/usb/gadget/udc/core.c | 63 +++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 6 deletions(-)


base-commit: a4422ff221429c600c3dc5d0394fb3738b89d040
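
The recursion in the lockdep report can be modeled in plain userspace C.
This is only a toy sketch, not the kernel code: every toy_* name is
hypothetical, and the assumption (taken from the call trace) is that the
bind path holds connect_lock while calling udc_start, which then calls
back into the VBUS handler.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a non-recursive lock such as udc->connect_lock. */
struct toy_lock {
    bool held;
};

/* Returns 0 on success, -EDEADLK if the lock is already held. */
int toy_lock_acquire(struct toy_lock *l)
{
    if (l->held)
        return -EDEADLK;    /* a real mutex would block forever here */
    l->held = true;
    return 0;
}

void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

struct toy_udc {
    struct toy_lock connect_lock;
    bool vbus;
};

/* Models usb_udc_vbus_handler(): takes connect_lock, records VBUS state. */
int toy_vbus_handler(struct toy_udc *udc, bool status)
{
    int ret = toy_lock_acquire(&udc->connect_lock);

    if (ret)
        return ret;
    udc->vbus = status;
    toy_lock_release(&udc->connect_lock);
    return 0;
}

/* Models a udc_start callback that reports VBUS (ci_udc_start in the trace). */
int toy_udc_start(struct toy_udc *udc)
{
    return toy_vbus_handler(udc, true);
}

/* Models the bind path: udc_start is called with connect_lock held (assumed). */
int toy_bind_driver(struct toy_udc *udc)
{
    int ret = toy_lock_acquire(&udc->connect_lock);

    if (ret)
        return ret;
    ret = toy_udc_start(udc);    /* tries to take connect_lock again */
    toy_lock_release(&udc->connect_lock);
    return ret;
}
```

In this model the same VBUS handler succeeds when called on its own and
fails only when reached through the bind path, which mirrors the two
lock(&udc->connect_lock) lines in the report.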
  

Comments

Alan Stern May 19, 2023, 2:49 p.m. UTC | #1
On Fri, May 19, 2023 at 04:30:41AM +0000, Badhri Jagan Sridharan wrote:
> chipidea udc calls usb_udc_vbus_handler from udc_start gadget
> ops causing a deadlock. Avoid this by offloading usb_udc_vbus_handler
> processing.

Look, this is way overkill.

usb_udc_vbus_handler() has only two jobs to do: set udc->vbus and call 
usb_udc_connect_control().  Furthermore, it gets called from only two 
drivers: chipidea and max3420.

Why not have the callers set udc->vbus themselves and then call 
usb_gadget_{dis}connect() directly?  Then we could eliminate 
usb_udc_vbus_handler() entirely.  And the unnecessary calls -- the ones 
causing deadlocks -- from within udc_start() and udc_stop() handlers can 
be removed with no further consequence.

This approach simplifies and removes code.  Whereas your approach 
complicates and adds code for no good reason.

Alan Stern
  
Alan Stern May 19, 2023, 3:07 p.m. UTC | #2
On Fri, May 19, 2023 at 10:49:49AM -0400, Alan Stern wrote:
> On Fri, May 19, 2023 at 04:30:41AM +0000, Badhri Jagan Sridharan wrote:
> > chipidea udc calls usb_udc_vbus_handler from udc_start gadget
> > ops causing a deadlock. Avoid this by offloading usb_udc_vbus_handler
> > processing.
> 
> Look, this is way overkill.
> 
> usb_udc_vbus_handler() has only two jobs to do: set udc->vbus and call 
> usb_udc_connect_control().  Furthermore, it gets called from only two 
> drivers: chipidea and max3420.
> 
> Why not have the callers set udc->vbus themselves and then call 
> usb_gadget_{dis}connect() directly?  Then we could eliminate 
> usb_udc_vbus_handler() entirely.  And the unnecessary calls -- the ones 
> causing deadlocks -- from within udc_start() and udc_stop() handlers can 
> be removed with no further consequence.
> 
> This approach simplifies and removes code.  Whereas your approach 
> complicates and adds code for no good reason.

I changed my mind.

After looking more closely, I found the comment in gadget.h about 
->disconnect() callbacks happening in interrupt context.  This means we 
cannot use a mutex to protect the associated state, and therefore the 
connect_lock _must_ be a spinlock, not a mutex.

This also probably means that udc_start and udc_stop callbacks should 
not be invoked with the lock held.  In fact, you might want to avoid 
using the lock at all with gadget_bind_driver() and 
gadget_unbind_driver() -- use it only in the functions that these 
routines call.

So it appears the whole connect_lock thing needs to be redesigned with 
these ideas in mind.  However, it's still true that the UDC drivers 
shouldn't try to set the connection state from within their udc_start 
and udc_stop callbacks, because the core takes care of this 
automatically.

Alan Stern
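
The mutex-versus-spinlock argument above can be illustrated with a small
userspace model. It is a sketch under stated assumptions, not kernel
code: toy_ctx and all toy_* functions are hypothetical, atomic_depth
loosely plays the role of the kernel's atomic-context tracking, and
-EPERM stands in for the bug report the kernel would print.

```c
#include <assert.h>
#include <errno.h>

/* Toy execution context: atomic_depth > 0 means "must not sleep". */
struct toy_ctx {
    int atomic_depth;
};

/* Entering an interrupt handler makes the context atomic. */
void toy_irq_enter(struct toy_ctx *c)
{
    c->atomic_depth++;
}

void toy_irq_exit(struct toy_ctx *c)
{
    assert(c->atomic_depth > 0);
    c->atomic_depth--;
}

/* A spinlock never sleeps, so it is usable at any depth. */
void toy_spin_lock(struct toy_ctx *c)
{
    c->atomic_depth++;
}

void toy_spin_unlock(struct toy_ctx *c)
{
    assert(c->atomic_depth > 0);
    c->atomic_depth--;
}

/* A mutex may sleep, so it is only legal when the context is not atomic. */
int toy_mutex_lock(struct toy_ctx *c)
{
    if (c->atomic_depth > 0)
        return -EPERM;    /* the kernel would report a bug here */
    return 0;
}

/* Models a ->disconnect() path that runs from an interrupt handler. */
int toy_disconnect_from_irq(struct toy_ctx *c, int lock_is_mutex)
{
    int ret = 0;

    toy_irq_enter(c);
    if (lock_is_mutex) {
        ret = toy_mutex_lock(c);
    } else {
        toy_spin_lock(c);
        /* connection state would be updated here */
        toy_spin_unlock(c);
    }
    toy_irq_exit(c);
    return ret;
}
```

With lock_is_mutex set, the model rejects the call; with a spinlock it
completes and leaves the context balanced.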
  
Badhri Jagan Sridharan May 19, 2023, 3:44 p.m. UTC | #3
On Fri, May 19, 2023 at 8:07 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Fri, May 19, 2023 at 10:49:49AM -0400, Alan Stern wrote:
> > On Fri, May 19, 2023 at 04:30:41AM +0000, Badhri Jagan Sridharan wrote:
> > > chipidea udc calls usb_udc_vbus_handler from udc_start gadget
> > > ops causing a deadlock. Avoid this by offloading usb_udc_vbus_handler
> > > processing.
> >
> > Look, this is way overkill.
> >
> > usb_udc_vbus_handler() has only two jobs to do: set udc->vbus and call
> > usb_udc_connect_control().  Furthermore, it gets called from only two
> > drivers: chipidea and max3420.
> >
> > Why not have the callers set udc->vbus themselves and then call
> > usb_gadget_{dis}connect() directly?  Then we could eliminate
> > usb_udc_vbus_handler() entirely.  And the unnecessary calls -- the ones
> > causing deadlocks -- from within udc_start() and udc_stop() handlers can
> > be removed with no further consequence.
> >
> > This approach simplifies and removes code.  Whereas your approach
> > complicates and adds code for no good reason.
>
> I changed my mind.
>
> After looking more closely, I found the comment in gadget.h about
> ->disconnect() callbacks happening in interrupt context.  This means we
> cannot use a mutex to protect the associated state, and therefore the
> connect_lock _must_ be a spinlock, not a mutex.

Quick observation so that I don't misunderstand.
I already see gadget->udc->driver->disconnect(gadget) being called with
udc_lock being held.

               mutex_lock(&udc_lock);
               if (gadget->udc->driver)
                       gadget->udc->driver->disconnect(gadget);
               mutex_unlock(&udc_lock);

The below patch seems to have introduced it:
1016fc0c096c USB: gadget: Fix obscure lockdep violation for udc_mutex

Are you referring to some other ->disconnect() callback ? If so, can you point
me to which one ?

>
> This also probably means that udc_start and udc_stop callbacks should
> not be invoked with the lock held.  In fact, you might want to avoid
> using the lock at all with gadget_bind_driver() and
> gadget_unbind_driver() -- use it only in the functions that these
> routines call.
>
> So it appears the whole connect_lock thing needs to be redesigned with
> these ideas in mind.  However, it's still true that the UDC drivers
> shouldn't try to set the connection state from within their udc_start
> and udc_stop callbacks, because the core takes care of this
> automatically.
>
> Alan Stern

Thanks for your inputs !
Badhri
  
Alan Stern May 19, 2023, 5:27 p.m. UTC | #4
On Fri, May 19, 2023 at 08:44:57AM -0700, Badhri Jagan Sridharan wrote:
> On Fri, May 19, 2023 at 8:07 AM Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > On Fri, May 19, 2023 at 10:49:49AM -0400, Alan Stern wrote:
> > > On Fri, May 19, 2023 at 04:30:41AM +0000, Badhri Jagan Sridharan wrote:
> > > > chipidea udc calls usb_udc_vbus_handler from udc_start gadget
> > > > ops causing a deadlock. Avoid this by offloading usb_udc_vbus_handler
> > > > processing.
> > >
> > > Look, this is way overkill.
> > >
> > > usb_udc_vbus_handler() has only two jobs to do: set udc->vbus and call
> > > usb_udc_connect_control().  Furthermore, it gets called from only two
> > > drivers: chipidea and max3420.
> > >
> > > Why not have the callers set udc->vbus themselves and then call
> > > usb_gadget_{dis}connect() directly?  Then we could eliminate
> > > usb_udc_vbus_handler() entirely.  And the unnecessary calls -- the ones
> > > causing deadlocks -- from within udc_start() and udc_stop() handlers can
> > > be removed with no further consequence.
> > >
> > > This approach simplifies and removes code.  Whereas your approach
> > > complicates and adds code for no good reason.
> >
> > I changed my mind.
> >
> > After looking more closely, I found the comment in gadget.h about
> > ->disconnect() callbacks happening in interrupt context.  This means we
> > cannot use a mutex to protect the associated state, and therefore the
> > connect_lock _must_ be a spinlock, not a mutex.
> 
> Quick observation so that I don't misunderstand.
> I already see gadget->udc->driver->disconnect(gadget) being called with
> udc_lock being held.
> 
>                mutex_lock(&udc_lock);
>                if (gadget->udc->driver)
>                        gadget->udc->driver->disconnect(gadget);
>                mutex_unlock(&udc_lock);
> 
> The below patch seems to have introduced it:
> 1016fc0c096c USB: gadget: Fix obscure lockdep violation for udc_mutex

Hmmm...  You're right about this.  A big problem with the USB gadget 
framework is that it does not clearly state which routines have to run 
in process context and which have to run in interrupt/atomic context.  
People therefore don't think about it and frequently get it wrong.

So now the problem is that the UDC or transceiver driver may detect 
(typically in an interrupt handler) that VBUS power has appeared or 
disappeared, and it wants to tell the core to adjust the D+/D- pullup 
signals appropriately.  The core notifies the UDC driver about this, and 
then in the case of a disconnection, it has to notify the gadget driver.  
But notifying the gadget driver requires process context for the 
udc_lock mutex, the ultimate reason being that disconnect notifications 
can race with gadget driver binding and unbinding.

If we could prevent those races in some other way then we wouldn't need 
to hold udc_lock in usb_gadget_disconnect().  This seems like a sensible 
thing to do in any case; the UDC core should never allow a connection to 
occur before a gadget driver is bound or after it is unbound.

The first approach that occurs to me is to add a boolean allow_connect 
flag to struct usb_udc, together with a global spinlock to synchronize 
access to it.  Then usb_gadget_disconnect() could check the flag before 
calling driver->disconnect(), gadget_bind_driver() could set the flag 
before calling usb_udc_connect_control(), and gadget_unbind_driver() 
could clear the flag before calling usb_gadget_disconnect().

(Another possible approach would be to change gadget->deactivated into a 
counter.  It would still need to be synchronized by a spinlock, 
however.)

This will simplify matters considerably.  udc_lock can remain a mutex 
and the deadlock problem should go away.

Do you want to try adding allow_connect as described here or would you 
prefer that I do it?

(And in any case, we should prevent the udc_start and udc_stop callbacks 
in the chipidea and max3420 drivers from trying to update the connection 
status.)

Alan Stern
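
A minimal userspace sketch of the allow_connect idea described above
follows. It is not the eventual patch: the toy_* types and functions are
hypothetical stand-ins, lock_depth merely models the proposed global
spinlock, and calling the callback while that lock is held is a
simplification chosen for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_gadget;

struct toy_driver {
    void (*disconnect)(struct toy_gadget *g);
};

struct toy_gadget {
    struct toy_driver *driver;
    bool allow_connect;      /* flag from the proposal */
    bool connected;
    int disconnect_calls;    /* bookkeeping for the example only */
    int lock_depth;          /* stand-in for the global spinlock */
};

void toy_lock(struct toy_gadget *g)
{
    assert(g->lock_depth == 0);
    g->lock_depth++;
}

void toy_unlock(struct toy_gadget *g)
{
    assert(g->lock_depth == 1);
    g->lock_depth--;
}

/* Example gadget-driver callback: just counts notifications. */
void toy_count_disconnect(struct toy_gadget *g)
{
    g->disconnect_calls++;
}

/* Models usb_udc_connect_control(): connect only when allowed. */
void toy_connect_control(struct toy_gadget *g)
{
    toy_lock(g);
    g->connected = g->allow_connect;
    toy_unlock(g);
}

/* Models usb_gadget_disconnect(): check the flag before notifying. */
void toy_gadget_disconnect(struct toy_gadget *g)
{
    toy_lock(g);
    g->connected = false;
    if (g->allow_connect && g->driver)
        g->driver->disconnect(g);
    toy_unlock(g);
}

/* Models gadget_bind_driver(): set the flag, then connect. */
void toy_bind_driver(struct toy_gadget *g, struct toy_driver *drv)
{
    g->driver = drv;
    toy_lock(g);
    g->allow_connect = true;
    toy_unlock(g);
    toy_connect_control(g);
}

/* Models gadget_unbind_driver(): clear the flag, then disconnect. */
void toy_unbind_driver(struct toy_gadget *g)
{
    toy_lock(g);
    g->allow_connect = false;
    toy_unlock(g);
    toy_gadget_disconnect(g);
    g->driver = NULL;
}
```

In this model a disconnect that arrives before bind or during unbind
never reaches the driver callback, which is the race the flag is meant
to close without holding udc_lock in the disconnect path.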
  
Badhri Jagan Sridharan May 22, 2023, 7:48 a.m. UTC | #5
Hi Alan,

Thanks for taking the time out to share more details !
+1 on your comment: " A big problem with the USB gadget
framework is that it does not clearly state which routines have to run
in process context and which have to run in interrupt/atomic context."


I started to work on allow_connect and other suggestions that you had made.
In one of the previous comments you had mentioned that the
connect_lock should be a spinlock and not a mutex.
Right now there are four conditions that seem to be deciding whether
pullup needs to be enabled or disabled through gadget->ops->pullup().
1. Gadget not deactivated through usb_gadget_deactivate()
2. Gadget has to be started through usb_gadget_udc_start().
soft_connect_store() can start/stop gadget.
3. usb_gadget has been connected through usb_gadget_connect(). This is
assuming we are getting rid of usb_udc_vbus_handler.
4. allow_connect is true
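
The four conditions listed above can be written as a single predicate.
This is only an illustrative sketch: struct toy_pullup_state and
toy_pullup_wanted() are hypothetical names, and the field names simply
mirror the list rather than any actual struct layout in the core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One boolean per condition in the list above. */
struct toy_pullup_state {
    bool deactivated;      /* 1. set through usb_gadget_deactivate() */
    bool started;          /* 2. set through usb_gadget_udc_start() */
    bool connected;        /* 3. set through usb_gadget_connect() */
    bool allow_connect;    /* 4. the proposed flag */
};

/* True when gadget->ops->pullup() should be asked to enable the pullup. */
bool toy_pullup_wanted(const struct toy_pullup_state *s)
{
    assert(s != NULL);
    return !s->deactivated && s->started && s->connected &&
           s->allow_connect;
}
```

Any single condition failing is enough to keep the pullup off, so all
four pieces of state need to be read consistently under whatever lock
ends up protecting them.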

I have so far identified two constraints here:
a. gadget->ops->pullup() can sleep in some implementations.
For instance:
BUG: scheduling while atomic: init/1/0x00000002
..
[   26.990631][    T1] Call trace:
[   26.993759][    T1]  dump_backtrace+0x104/0x128
[   26.998281][    T1]  show_stack+0x20/0x30
[   27.002279][    T1]  dump_stack_lvl+0x6c/0x9c
[   27.006627][    T1]  __schedule_bug+0x84/0xb4
[   27.010973][    T1]  __schedule+0x6f0/0xaec
[   27.015147][    T1]  schedule+0xc8/0x134
[   27.019059][    T1]  schedule_timeout+0x98/0x134
[   27.023666][    T1]  msleep+0x34/0x4c
[   27.027317][    T1]  dwc3_core_soft_reset+0xf0/0x354
[   27.032273][    T1]  dwc3_gadget_pullup+0xec/0x1d8
[   27.037055][    T1]  usb_gadget_pullup_update_locked+0xa0/0x1e0
[   27.042967][    T1]  udc_bind_to_driver+0x1e4/0x30c
[   27.047835][    T1]  usb_gadget_probe_driver+0xd0/0x178
[   27.053051][    T1]  gadget_dev_desc_UDC_store+0xf0/0x13c
[   27.058442][    T1]  configfs_write_iter+0x100/0x178
[   27.063399][    T1]  vfs_write+0x278/0x3c4
[   27.067483][    T1]  ksys_write+0x80/0xf4

b. gadget->ops->udc_start can also sleep in some implementations.
For example:
[   28.024255][    T1] BUG: scheduling while atomic: init/1/0x00000002
....
[   28.324996][    T1] Call trace:
[   28.328126][    T1]  dump_backtrace+0x104/0x128
[   28.332647][    T1]  show_stack+0x20/0x30
[   28.336645][    T1]  dump_stack_lvl+0x6c/0x9c
[   28.340993][    T1]  __schedule_bug+0x84/0xb4
[   28.345340][    T1]  __schedule+0x6f0/0xaec
[   28.349513][    T1]  schedule+0xc8/0x134
[   28.353425][    T1]  schedule_timeout+0x4c/0x134
[   28.358033][    T1]  wait_for_common+0xac/0x13c
[   28.362554][    T1]  wait_for_completion_killable+0x20/0x3c
[   28.368118][    T1]  __kthread_create_on_node+0xe4/0x1ec
[   28.373422][    T1]  kthread_create_on_node+0x54/0x80
[   28.378464][    T1]  setup_irq_thread+0x50/0x108
[   28.383072][    T1]  __setup_irq+0x90/0x87c
[   28.387245][    T1]  request_threaded_irq+0x144/0x180
[   28.392287][    T1]  dwc3_gadget_start+0x50/0xac
[   28.396866][    T1]  udc_bind_to_driver+0x14c/0x31c
[   28.401763][    T1]  usb_gadget_probe_driver+0xd0/0x178
[   28.406980][    T1]  gadget_dev_desc_UDC_store+0xf0/0x13c
[   28.412370][    T1]  configfs_write_iter+0x100/0x178
[   28.417325][    T1]  vfs_write+0x278/0x3c4
[   28.421411][    T1]  ksys_write+0x80/0xf4

static int dwc3_gadget_start(struct usb_gadget *g,
                struct usb_gadget_driver *driver)
{
        struct dwc3             *dwc = gadget_to_dwc(g);
...
        irq = dwc->irq_gadget;
        ret = request_threaded_irq(irq, dwc3_interrupt, dwc3_thread_interrupt,
                        IRQF_SHARED, "dwc3", dwc->ev_buf);

Given that "1016fc0c096c USB: gadget: Fix obscure lockdep violation
for udc_mutex" has been there for a while and no one has reported
issues so far, perhaps the ->disconnect() callback is no longer being
invoked in atomic context and it is the documentation that needs to
be updated?

Thanks,
Badhri

  
Badhri Jagan Sridharan May 22, 2023, 9:05 a.m. UTC | #6
On Mon, May 22, 2023 at 12:48 AM Badhri Jagan Sridharan
<badhri@google.com> wrote:
>
> Hi Alan,
>
> Thanks for taking the time out to share more details !
> +1 on your comment: " A big problem with the USB gadget
> framework is that it does not clearly state which routines have to run
> in process context and which have to run in interrupt/atomic context."
>
>
> I started to work on allow_connect and other suggestions that you had made.
> In one of the previous comments you had mentioned that the
> connect_lock should be a spinlock and not a mutex.
> Right now there are four conditions that seem to be deciding whether
> pullup needs to be enabled or disabled through gadget->ops->pullup().
> 1. Gadget not deactivated through usb_gadget_deactivate()
> 2. Gadget has to be started through usb_gadget_udc_start().
> soft_connect_store() can start/stop gadget.
> 3. usb_gadget has been connected through usb_gadget_connect(). This is
> assuming we are getting rid of usb_udc_vbus_handler.
> 4. allow_connect is true
>
> I have so far identified two constraints here:
> a. gadget->ops->pullup() can sleep in some implementations.
> For instance:
> BUG: scheduling while atomic: init/1/0x00000002
> ..
> [   26.990631][    T1] Call trace:
> [   26.993759][    T1]  dump_backtrace+0x104/0x128
> [   26.998281][    T1]  show_stack+0x20/0x30
> [   27.002279][    T1]  dump_stack_lvl+0x6c/0x9c
> [   27.006627][    T1]  __schedule_bug+0x84/0xb4
> [   27.010973][    T1]  __schedule+0x6f0/0xaec
> [   27.015147][    T1]  schedule+0xc8/0x134
> [   27.019059][    T1]  schedule_timeout+0x98/0x134
> [   27.023666][    T1]  msleep+0x34/0x4c

Adding more context to make my concern clearer.
I am aware that alternatives such as mdelay() can be used to work around
this specific instance. However, my concern is more about whether the
gadget->ops->pullup() callbacks of other implementations were designed
to be atomic. I only have dwc3-based hardware, so I can't test other udc
implementations.

Thanks,
Badhri

  
Alan Stern May 22, 2023, 3:55 p.m. UTC | #7
On Mon, May 22, 2023 at 12:48:39AM -0700, Badhri Jagan Sridharan wrote:
> Hi Alan,
> 
> Thanks for taking the time out to share more details !
> +1 on your comment: " A big problem with the USB gadget
> framework is that it does not clearly state which routines have to run
> in process context and which have to run in interrupt/atomic context."
> 
> 
> I started to work on allow_connect and other suggestions that you had made.
> In one of the previous comments you had mentioned that the
> connect_lock should be a spinlock and not a mutex.

Yeah, I changed my mind about that.

> Right now there are four conditions that seem to be deciding whether
> pullup needs to be enabled or disabled through gadget->ops->pullup().
> 1. Gadget not deactivated through usb_gadget_deactivate()
> 2. Gadget has to be started through usb_gadget_udc_start().
> soft_connect_store() can start/stop gadget.
> 3. usb_gadget has been connected through usb_gadget_connect(). This is
> assuming we are getting rid of usb_udc_vbus_handler.
> 4. allow_connect is true
> 
> I have so far identified two constraints here:
> a. gadget->ops->pullup() can sleep in some implementations.
> For instance:
> BUG: scheduling while atomic: init/1/0x00000002
> ..
> [   26.990631][    T1] Call trace:
> [   26.993759][    T1]  dump_backtrace+0x104/0x128
> [   26.998281][    T1]  show_stack+0x20/0x30
> [   27.002279][    T1]  dump_stack_lvl+0x6c/0x9c
> [   27.006627][    T1]  __schedule_bug+0x84/0xb4
> [   27.010973][    T1]  __schedule+0x6f0/0xaec
> [   27.015147][    T1]  schedule+0xc8/0x134
> [   27.019059][    T1]  schedule_timeout+0x98/0x134
> [   27.023666][    T1]  msleep+0x34/0x4c
> [   27.027317][    T1]  dwc3_core_soft_reset+0xf0/0x354
> [   27.032273][    T1]  dwc3_gadget_pullup+0xec/0x1d8
> [   27.037055][    T1]  usb_gadget_pullup_update_locked+0xa0/0x1e0
> [   27.042967][    T1]  udc_bind_to_driver+0x1e4/0x30c
> [   27.047835][    T1]  usb_gadget_probe_driver+0xd0/0x178
> [   27.053051][    T1]  gadget_dev_desc_UDC_store+0xf0/0x13c
> [   27.058442][    T1]  configfs_write_iter+0x100/0x178
> [   27.063399][    T1]  vfs_write+0x278/0x3c4
> [   27.067483][    T1]  ksys_write+0x80/0xf4

What kernel was this trace made with?  I don't see udc_bind_to_driver 
appearing anywhere in 6.4-rc3.

> b. gadget->ops->udc_start can also sleep in some implementations.
> For example:
> [   28.024255][    T1] BUG: scheduling while atomic: init/1/0x00000002
> ....
> [   28.324996][    T1] Call trace:
> [   28.328126][    T1]  dump_backtrace+0x104/0x128
> [   28.332647][    T1]  show_stack+0x20/0x30
> [   28.336645][    T1]  dump_stack_lvl+0x6c/0x9c
> [   28.340993][    T1]  __schedule_bug+0x84/0xb4
> [   28.345340][    T1]  __schedule+0x6f0/0xaec
> [   28.349513][    T1]  schedule+0xc8/0x134
> [   28.353425][    T1]  schedule_timeout+0x4c/0x134
> [   28.358033][    T1]  wait_for_common+0xac/0x13c
> [   28.362554][    T1]  wait_for_completion_killable+0x20/0x3c
> [   28.368118][    T1]  __kthread_create_on_node+0xe4/0x1ec
> [   28.373422][    T1]  kthread_create_on_node+0x54/0x80
> [   28.378464][    T1]  setup_irq_thread+0x50/0x108
> [   28.383072][    T1]  __setup_irq+0x90/0x87c
> [   28.387245][    T1]  request_threaded_irq+0x144/0x180
> [   28.392287][    T1]  dwc3_gadget_start+0x50/0xac
> [   28.396866][    T1]  udc_bind_to_driver+0x14c/0x31c
> [   28.401763][    T1]  usb_gadget_probe_driver+0xd0/0x178
> [   28.406980][    T1]  gadget_dev_desc_UDC_store+0xf0/0x13c
> [   28.412370][    T1]  configfs_write_iter+0x100/0x178
> [   28.417325][    T1]  vfs_write+0x278/0x3c4
> [   28.421411][    T1]  ksys_write+0x80/0xf4
> 
> static int dwc3_gadget_start(struct usb_gadget *g,
>                 struct usb_gadget_driver *driver)
> {
>         struct dwc3             *dwc = gadget_to_dwc(g);
> ...
>         irq = dwc->irq_gadget;
>         ret = request_threaded_irq(irq, dwc3_interrupt, dwc3_thread_interrupt,
>                         IRQF_SHARED, "dwc3", dwc->ev_buf);
> 
> Given that "1016fc0c096c USB: gadget: Fix obscure lockdep violation
> for udc_mutex" has been there for a while and no one has reported
> issues so far, perhaps ->disconnect() callback is no longer being
> invoked in atomic context and the documentation is what that needs to
> be updated ?

That's part of what I'm trying to figure out.  However, some UDC drivers 
call ->disconnect() directly when they detect loss of VBUS power, 
instead of going through the core.  So disconnect handlers will have to 
remain capable of running in interrupt context until those UDC drivers 
are changed.

Getting back to your first point, it looks like we need to assume any 
routine that needs to communicate with the UDC hardware (such as the 
->pullup callback used in usb_gadget_{dis}connect()) must always be 
called in process context.  This means that usb_udc_connect_control() 
always has to run in process context, since it will do either a connect 
or a disconnect.

On the other hand, some routines -- in particular, 
usb_udc_vbus_handler() -- may be called by a UDC driver's interrupt 
handler and therefore may run in interrupt context.  (This fact should 
be noted in that routine's kerneldoc, by the way.)

So here's the problem: usb_udc_vbus_handler() running in interrupt 
context calls usb_udc_connect_control(), which has to run in process 
context.  And this is not just a simple issue caused by the 
->disconnect() callback or use of mutexes; it's more fundamental.

I'm led to conclude that you were right to offload part of 
usb_udc_vbus_handler()'s job to a workqueue.  It's an awkward thing to 
do, because you have to make sure to cancel the work item at times when 
it's no longer needed.  But there doesn't seem to be any other choice.

Here's two related problems for you to think about:

    1.	Once gadget_unbind_driver() has called usb_gadget_disconnect(),
	we don't want a VBUS change to cause usb_udc_vbus_handler()'s 
	work routine to turn the pullup back on.  How can we prevent 
	this?

    2.	More generally, suppose usb_udc_vbus_handler() gets called at 
	exactly the same time that some other pathway (either 
	gadget_bind_driver() or soft_connect_store()) tries to do a
	connect or disconnect.  What should happen then?
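
A minimal userspace sketch of one way to approach both problems, assuming
a single lock serializes every path that can change the pullup and that a
flag records that unbinding has begun. The struct model_udc type and the
model_*() helpers are hypothetical stand-ins, not the kernel's struct
usb_udc or any function in udc/core.c; pthread_mutex_t plays the role of
connect_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a UDC; not the kernel's struct usb_udc. */
struct model_udc {
	pthread_mutex_t connect_lock;	/* stands in for udc->connect_lock */
	bool vbus;			/* last VBUS state that was reported */
	bool unbinding;			/* set once the unbind path has started */
	bool pullup_on;			/* simulated state of the pullup */
};

static void model_udc_init(struct model_udc *udc)
{
	assert(udc != NULL);
	pthread_mutex_init(&udc->connect_lock, NULL);
	udc->vbus = false;
	udc->unbinding = false;
	udc->pullup_on = false;
}

/* Body of the work routine: process context, so sleeping locks are fine. */
static void model_vbus_work(struct model_udc *udc, bool vbus_on)
{
	assert(udc != NULL);
	pthread_mutex_lock(&udc->connect_lock);
	udc->vbus = vbus_on;
	/* Never turn the pullup back on once unbinding has begun. */
	udc->pullup_on = udc->vbus && !udc->unbinding;
	pthread_mutex_unlock(&udc->connect_lock);
}

/* Unbind path: set the flag and drop the pullup under the same lock. */
static void model_unbind(struct model_udc *udc)
{
	assert(udc != NULL);
	pthread_mutex_lock(&udc->connect_lock);
	udc->unbinding = true;
	udc->pullup_on = false;
	pthread_mutex_unlock(&udc->connect_lock);
}
```

Because both paths take the same lock, a VBUS report that arrives during
unbind either runs entirely before model_unbind() (and is then undone by
it) or entirely after (and sees the flag). The sketch does not cover
cancelling a pending work item, which still has to happen separately.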

Alan Stern
  
Badhri Jagan Sridharan May 27, 2023, 2:42 a.m. UTC | #8
Thanks again Alan !

On Mon, May 22, 2023 at 8:55 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Mon, May 22, 2023 at 12:48:39AM -0700, Badhri Jagan Sridharan wrote:
> > Hi Alan,
> >
> > Thanks for taking the time out to share more details !
> > +1 on your comment: " A big problem with the USB gadget
> > framework is that it does not clearly state which routines have to run
> > in process context and which have to run in interrupt/atomic context."
> >
> >
> > I started to work on allow_connect and other suggestions that you had made.
> > In one of the previous comments you had mentioned that the
> > connect_lock should be a spinlock and not a mutex.
>
> Yeah, I changed my mind about that.
>
> > Right now there are four conditions that seem to be deciding whether
> > pullup needs to be enabled or disabled through gadget->ops->pullup().
> > 1. Gadget not deactivated through usb_gadget_deactivate()
> > 2. Gadget has to be started through usb_gadget_udc_start().
> > soft_connect_store() can start/stop gadget.
> > 3. usb_gadget has been connected through usb_gadget_connect(). This is
> > assuming we are getting rid of usb_udc_vbus_handler.
> > 4. allow_connect is true
> >
> > I have so far identified two constraints here:
> > a. gadget->ops->pullup() can sleep in some implementations.
> > For instance:
> > BUG: scheduling while atomic: init/1/0x00000002
> > ..
> > [   26.990631][    T1] Call trace:
> > [   26.993759][    T1]  dump_backtrace+0x104/0x128
> > [   26.998281][    T1]  show_stack+0x20/0x30
> > [   27.002279][    T1]  dump_stack_lvl+0x6c/0x9c
> > [   27.006627][    T1]  __schedule_bug+0x84/0xb4
> > [   27.010973][    T1]  __schedule+0x6f0/0xaec
> > [   27.015147][    T1]  schedule+0xc8/0x134
> > [   27.019059][    T1]  schedule_timeout+0x98/0x134
> > [   27.023666][    T1]  msleep+0x34/0x4c
> > [   27.027317][    T1]  dwc3_core_soft_reset+0xf0/0x354
> > [   27.032273][    T1]  dwc3_gadget_pullup+0xec/0x1d8
> > [   27.037055][    T1]  usb_gadget_pullup_update_locked+0xa0/0x1e0
> > [   27.042967][    T1]  udc_bind_to_driver+0x1e4/0x30c
> > [   27.047835][    T1]  usb_gadget_probe_driver+0xd0/0x178
> > [   27.053051][    T1]  gadget_dev_desc_UDC_store+0xf0/0x13c
> > [   27.058442][    T1]  configfs_write_iter+0x100/0x178
> > [   27.063399][    T1]  vfs_write+0x278/0x3c4
> > [   27.067483][    T1]  ksys_write+0x80/0xf4
>
> What kernel was this trace made with?  I don't see udc_bind_to_driver
> appearing anywhere in 6.4-rc3.


Sorry, I was switching between devices running different kernel
versions, with the latest one running 6.1, and posted a trace from an
older one by mistake.
>
>
> > b. gadget->ops->udc_start can also sleep in some implementations.
> > For example:
> > [   28.024255][    T1] BUG: scheduling while atomic: init/1/0x00000002
> > ....
> > [   28.324996][    T1] Call trace:
> > [   28.328126][    T1]  dump_backtrace+0x104/0x128
> > [   28.332647][    T1]  show_stack+0x20/0x30
> > [   28.336645][    T1]  dump_stack_lvl+0x6c/0x9c
> > [   28.340993][    T1]  __schedule_bug+0x84/0xb4
> > [   28.345340][    T1]  __schedule+0x6f0/0xaec
> > [   28.349513][    T1]  schedule+0xc8/0x134
> > [   28.353425][    T1]  schedule_timeout+0x4c/0x134
> > [   28.358033][    T1]  wait_for_common+0xac/0x13c
> > [   28.362554][    T1]  wait_for_completion_killable+0x20/0x3c
> > [   28.368118][    T1]  __kthread_create_on_node+0xe4/0x1ec
> > [   28.373422][    T1]  kthread_create_on_node+0x54/0x80
> > [   28.378464][    T1]  setup_irq_thread+0x50/0x108
> > [   28.383072][    T1]  __setup_irq+0x90/0x87c
> > [   28.387245][    T1]  request_threaded_irq+0x144/0x180
> > [   28.392287][    T1]  dwc3_gadget_start+0x50/0xac
> > [   28.396866][    T1]  udc_bind_to_driver+0x14c/0x31c
> > [   28.401763][    T1]  usb_gadget_probe_driver+0xd0/0x178
> > [   28.406980][    T1]  gadget_dev_desc_UDC_store+0xf0/0x13c
> > [   28.412370][    T1]  configfs_write_iter+0x100/0x178
> > [   28.417325][    T1]  vfs_write+0x278/0x3c4
> > [   28.421411][    T1]  ksys_write+0x80/0xf4
> >
> > static int dwc3_gadget_start(struct usb_gadget *g,
> >                 struct usb_gadget_driver *driver)
> > {
> >         struct dwc3             *dwc = gadget_to_dwc(g);
> > ...
> >         irq = dwc->irq_gadget;
> >         ret = request_threaded_irq(irq, dwc3_interrupt, dwc3_thread_interrupt,
> >                         IRQF_SHARED, "dwc3", dwc->ev_buf);
> >
> > Given that "1016fc0c096c USB: gadget: Fix obscure lockdep violation
> > for udc_mutex" has been there for a while and no one has reported
> > issues so far, perhaps ->disconnect() callback is no longer being
> > invoked in atomic context and the documentation is what that needs to
> > be updated ?
>
> That's part of what I'm trying to figure out.  However, some UDC drivers
> call ->disconnect() directly when they detect loss of VBUS power,
> instead of going through the core.  So disconnect handlers will have to
> remain capable of running in interrupt context until those UDC drivers
> are changed.
>
> Getting back to your first point, it looks like we need to assume any
> routine that needs to communicate with the UDC hardware (such as the
> ->pullup callback used in usb_gadget_{dis}connect()) must always be
> called in process context.  This means that usb_udc_connect_control()
> always has to run in process context, since it will do either a connect
> or a disconnect.
>
> On the other hand, some routines -- in particular,
> usb_udc_vbus_handler() -- may be called by a UDC driver's interrupt
> handler and therefore may run in interrupt context.  (This fact should
> be noted in that routine's kerneldoc, by the way.)
>
> So here's the problem: usb_udc_vbus_handler() running in interrupt
> context calls usb_udc_connect_control(), which has to run in process
> context.  And this is not just a simple issue caused by the
> ->disconnect() callback or use of mutexes; it's more fundamental.
>
> I'm led to conclude that you were right to offload part of
> usb_udc_vbus_handler()'s job to a workqueue.  It's an awkward thing to
> do, because you have to make sure to cancel the work item at times when
> it's no longer needed.  But there doesn't seem to be any other choice.
>
> Here's two related problems for you to think about:
>
>     1.  Once gadget_unbind_driver() has called usb_gadget_disconnect(),
>         we don't want a VBUS change to cause usb_udc_vbus_handler()'s
>         work routine to turn the pullup back on.  How can we prevent
>         this?
>
>     2.  More generally, suppose usb_udc_vbus_handler() gets called at
>         exactly the same time that some other pathway (either
>         gadget_bind_driver() or soft_connect_store()) tries to do a
>         connect or disconnect.  What should happen then?


I believe I can solve the above races by protecting the flags set by
each of them with connect_lock and not pulling up unless all of them
are true.
The caller will hold connect_lock, update the respective flag, and
invoke usb_gadget_pullup_update_locked() (shown below).

Code stub:
/*
 * Internal version of usb_gadget_connect(); must be called with
 * connect_lock held.
 */
static int usb_gadget_pullup_update_locked(struct usb_gadget *gadget)
        __must_hold(&gadget->udc->connect_lock)
{
        int ret = 0;
        /* Pull up only when every gating condition holds. */
        bool connect = !gadget->deactivated && gadget->udc->started &&
                       gadget->udc->vbus && gadget->udc->allow_connect;

        if (!gadget->ops->pullup) {
                ret = -EOPNOTSUPP;
                goto out;
        }

        if (connect != gadget->connected) {
                ret = gadget->ops->pullup(gadget, connect);
                if (!ret)
                        gadget->connected = connect;
                if (!connect) {
                        mutex_lock(&udc_lock);
                        if (gadget->udc->driver)
                                gadget->udc->driver->disconnect(gadget);
                        mutex_unlock(&udc_lock);
                }
        }

out:
        trace_usb_gadget_connect(gadget, ret);

        return ret;
}

However, while auditing the code again, I noticed another potential
race as well:
Looks like usb_del_gadget() can potentially race against
usb_udc_vbus_handler() and call device_unregister.
This implies usb_udc can be freed while usb_udc_vbus_handler() or the
work item is executing.

void usb_del_gadget(struct usb_gadget *gadget)
{
        struct usb_udc *udc = gadget->udc;

..
...
        device_unregister(&udc->dev);
}
EXPORT_SYMBOL_GPL(usb_del_gadget);

Does this look like a valid concern to you or am I misunderstanding this ?
If so, I am afraid that the only way to solve this is by synchronizing
usb_udc_vbus_handler() against usb_del_gadget() through a mutex as
device_unregister() can sleep.
So offloading usb_udc_vbus_handler() cannot work either.

usb_udc_vbus_handler() seems to be invoked from the following drivers:

1. drivers/usb/chipidea/udc.c:
usb_udc_vbus_handler() is called from a non-atomic context.

2. drivers/usb/gadget/udc/max3420_udc.c
usb_udc_vbus_handler() is called from the interrupt handler.
However, all the events are processed from the max3420_thread kthread.
So I am thinking of making them threaded irq handlers instead.

3. drivers/usb/gadget/udc/renesas_usbf.c
This one looks more invasive. However, I am still attempting to move
them to threaded irq handlers.

As always, I'm looking forward to your feedback !

Thanks,
Badhri

>
> Alan Stern
  
Alan Stern May 27, 2023, 4:36 p.m. UTC | #9
On Fri, May 26, 2023 at 07:42:39PM -0700, Badhri Jagan Sridharan wrote:
> Thanks again Alan !
> 
> On Mon, May 22, 2023 at 8:55 AM Alan Stern <stern@rowland.harvard.edu> wrote:
> > Getting back to your first point, it looks like we need to assume any
> > routine that needs to communicate with the UDC hardware (such as the
> > ->pullup callback used in usb_gadget_{dis}connect()) must always be
> > called in process context.  This means that usb_udc_connect_control()
> > always has to run in process context, since it will do either a connect
> > or a disconnect.
> >
> > On the other hand, some routines -- in particular,
> > usb_udc_vbus_handler() -- may be called by a UDC driver's interrupt
> > handler and therefore may run in interrupt context.  (This fact should
> > be noted in that routine's kerneldoc, by the way.)
> >
> > So here's the problem: usb_udc_vbus_handler() running in interrupt
> > context calls usb_udc_connect_control(), which has to run in process
> > context.  And this is not just a simple issue caused by the
> > ->disconnect() callback or use of mutexes; it's more fundamental.
> >
> > I'm led to conclude that you were right to offload part of
> > usb_udc_vbus_handler()'s job to a workqueue.  It's an awkward thing to
> > do, because you have to make sure to cancel the work item at times when
> > it's no longer needed.  But there doesn't seem to be any other choice.
> >
> > Here's two related problems for you to think about:
> >
> >     1.  Once gadget_unbind_driver() has called usb_gadget_disconnect(),
> >         we don't want a VBUS change to cause usb_udc_vbus_handler()'s
> >         work routine to turn the pullup back on.  How can we prevent
> >         this?
> >
> >     2.  More generally, suppose usb_udc_vbus_handler() gets called at
> >         exactly the same time that some other pathway (either
> >         gadget_bind_driver() or soft_connect_store()) tries to do a
> >         connect or disconnect.  What should happen then?
> 
> 
> I believe I can solve the above races by protecting the flags set by
> each of them with connect_lock and not pulling up unless all of them
> are true.
> 
> The caller will hold connect_lock, update the respective flag and
> invoke the below usb_gadget_pullup_update_locked function(shown
> below).

Are you certain this can be done without causing any deadlocks?

> Code stub:
> /* Internal version of usb_gadget_connect needs to be called with
> connect_lock held. */
> static int usb_gadget_pullup_update_locked(struct usb_gadget *gadget)
>         __must_hold(&gadget->udc->connect_lock)
> {
>         int ret = 0;
>         bool connect = !gadget->deactivated && gadget->udc->started &&
>                        gadget->udc->vbus && gadget->udc->allow_connect;

On further thought, I decided "allow_connect" is a dumb name.  Let's 
call it "unbinding" instead, since it gets set only when a gadget driver 
is about to be unbound (which is when we want to prevent new 
connections).

>         if (!gadget->ops->pullup) {
>                 ret = -EOPNOTSUPP;
>                 goto out;
>         }
> 
>         if (connect != gadget->connected) {

You need to be more careful here.  It's possible to have 
gadget->connected set at the same time as gadget->deactivated -- it 
means that when the gadget gets re-activated, it will immediately try to 
connect again.

In fact, this logic doesn't look right at all.  For example, suppose the 
gadget driver wants to disconnect.  This routine will compute connect = 
true and will see that gadget->connected is set, so it won't do 
anything!

I think it would be better just to merge the new material into 
usb_gadget_connect() and usb_gadget_disconnect().
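
The second objection can be reproduced with a small userspace model.
This is only a sketch: struct model_gadget and model_pullup_update() are
hypothetical names that copy the stub's decision logic, with a counter in
place of the real ->pullup() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the gadget/udc fields used by the quoted stub. */
struct model_gadget {
	bool deactivated;
	bool started;
	bool vbus;
	bool allow_connect;
	bool connected;		/* current simulated pullup state */
	int pullup_calls;	/* counts simulated ->pullup() invocations */
};

/*
 * Same decision as the stub: the target state is derived from the four
 * flags alone, so the caller has no way to say "I want to disconnect".
 */
static void model_pullup_update(struct model_gadget *g)
{
	bool connect;

	assert(g != NULL);
	connect = !g->deactivated && g->started && g->vbus && g->allow_connect;
	if (connect != g->connected) {
		g->pullup_calls++;	/* would be gadget->ops->pullup() */
		g->connected = connect;
	}
}
```

With the gadget activated, started, VBUS present, connections allowed and
connected already true, a call made on behalf of a driver that wants to
disconnect computes connect = true, finds it equal to connected, and
leaves the pullup untouched.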

>                 ret = gadget->ops->pullup(gadget, connect);
>                 if (!ret)
>                         gadget->connected = connect;
>                 if (!connect) {
>                         mutex_lock(&udc_lock);
>                         if (gadget->udc->driver)
>                                 gadget->udc->driver->disconnect(gadget);
>                         mutex_unlock(&udc_lock);
>                 }
>         }
> 
> out:
>         trace_usb_gadget_connect(gadget, ret);
> 
>         return ret;
> }
> 
> However, while auditing the code again, I noticed another potential
> race as well:
> Looks like usb_del_gadget() can potentially race against
> usb_udc_vbus_handler() and call device_unregister.
> This implies usb_udc can be freed while usb_udc_vbus_handler() or the
> work item is executing.
> 
> void usb_del_gadget(struct usb_gadget *gadget)
> {
>         struct usb_udc *udc = gadget->udc;
> 
> ..
> ...
>         device_unregister(&udc->dev);
> }
> EXPORT_SYMBOL_GPL(usb_del_gadget);
> 
> Does this look like a valid concern to you or am I misunderstanding this ?

You're missing an important point.  Before doing device_unregister(), 
this routine calls device_del(&gadget->dev).  That will unbind the 
gadget driver, which (among other things) will stop the UDC, preventing 
it from calling usb_udc_vbus_handler().  However, you're right that the 
work item will need to be cancelled at some point before the usb_udc is 
unregistered.

> If so, I am afraid that the only way to solve this is by synchronizing
> usb_udc_vbus_handler() against usb_del_gadget() through a mutex as
> device_unregister() can sleep.
> So offloading usb_udc_vbus_handler() cannot work either.
> 
> usb_udc_vbus_handler() seems to be invoked from the following drivers:
> 
> 1. drivers/usb/chipidea/udc.c:
> usb_udc_vbus_handler() is called from a non-atomic context.
> 
> 2. drivers/usb/gadget/udc/max3420_udc.c
> usb_udc_vbus_handler() is called from the interrupt handler.
> However, all the events are processed from max3420_thread kthread.
> So I am thinking of making them threaded irq handlers instead.
> 
> 3. drivers/usb/gadget/udc/renesas_usbf.c
> This one looks more invasive. However, still attempting to move them
> to threaded irq handlers.
> 
> As always, I'm looking forward to your feedback !

Moving those things to threaded IRQ handlers is a separate job.  Let's 
get this issue fixed first.

Alan Stern
  
Badhri Jagan Sridharan May 29, 2023, 11:32 p.m. UTC | #10
On Sat, May 27, 2023 at 9:36 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Fri, May 26, 2023 at 07:42:39PM -0700, Badhri Jagan Sridharan wrote:
> > Thanks again Alan !
> >
> > On Mon, May 22, 2023 at 8:55 AM Alan Stern <stern@rowland.harvard.edu> wrote:
> > > Getting back to your first point, it looks like we need to assume any
> > > routine that needs to communicate with the UDC hardware (such as the
> > > ->pullup callback used in usb_gadget_{dis}connect()) must always be
> > > called in process context.  This means that usb_udc_connect_control()
> > > always has to run in process context, since it will do either a connect
> > > or a disconnect.
> > >
> > > On the other hand, some routines -- in particular,
> > > usb_udc_vbus_handler() -- may be called by a UDC driver's interrupt
> > > handler and therefore may run in interrupt context.  (This fact should
> > > be noted in that routine's kerneldoc, by the way.)
> > >
> > > So here's the problem: usb_udc_vbus_handler() running in interrupt
> > > context calls usb_udc_connect_control(), which has to run in process
> > > context.  And this is not just a simple issue caused by the
> > > ->disconnect() callback or use of mutexes; it's more fundamental.
> > >
> > > I'm led to conclude that you were right to offload part of
> > > usb_udc_vbus_handler()'s job to a workqueue.  It's an awkward thing to
> > > do, because you have to make sure to cancel the work item at times when
> > > it's no longer needed.  But there doesn't seem to be any other choice.
> > >
> > > Here's two related problems for you to think about:
> > >
> > >     1.  Once gadget_unbind_driver() has called usb_gadget_disconnect(),
> > >         we don't want a VBUS change to cause usb_udc_vbus_handler()'s
> > >         work routine to turn the pullup back on.  How can we prevent
> > >         this?
> > >
> > >     2.  More generally, suppose usb_udc_vbus_handler() gets called at
> > >         exactly the same time that some other pathway (either
> > >         gadget_bind_driver() or soft_connect_store()) tries to do a
> > >         connect or disconnect.  What should happen then?
> >
> >
> > I believe I can solve the above races by protecting the flags set by
> > each of them with connect_lock and not pulling up unless all of them
> > are true.
> >
> > The caller will hold connect_lock, update the respective flag and
> > invoke the below usb_gadget_pullup_update_locked function(shown
> > below).
>
> Are you certain this can be done without causing any deadlocks?
>
> > Code stub:
> > /* Internal version of usb_gadget_connect needs to be called with
> > connect_lock held. */
> > static int usb_gadget_pullup_update_locked(struct usb_gadget *gadget)
> >         __must_hold(&gadget->udc->connect_lock)
> > {
> >         int ret = 0;
> >         bool connect = !gadget->deactivated && gadget->udc->started &&
> >                        gadget->udc->vbus && gadget->udc->allow_connect;
>
> On further thought, I decided "allow_connect" is a dumb name.  Let's
> call it "unbinding" instead, since it gets set only when a gadget driver
> is about to be unbound (which is when we want to prevent new
> connections).

Sure, fixing it in v3.

>
> >         if (!gadget->ops->pullup) {
> >                 ret = -EOPNOTSUPP;
> >                 goto out;
> >         }
> >
> >         if (connect != gadget->connected) {
>
> You need to be more careful here.  It's possible to have
> gadget->connected set at the same time as gadget->deactivated -- it
> means that when the gadget gets re-activated, it will immediately try to
> connect again.
>
> In fact, this logic doesn't look right at all.  For example, suppose the
> gadget driver wants to disconnect.  This routine will compute connect =
> true and will see that gadget->connected is set, so it won't do
> anything!
>
> I think it would be better just to merge the new material into
> usb_gadget_connect() and usb_gadget_disconnect().

I ended up merging them into usb_gadget_pullup_update_locked() so that
each of the individual helper functions can call
usb_gadget_pullup_update_locked() while holding the connect_lock. I
actually had usb_gadget_(dis)connect() set udc->vbus. It appears to me
that both usb_gadget_(dis)connect() and usb_udc_vbus_handler() are
meant to be called based on vbus presence and hence seem to be
redundant. I am wondering whether we could get rid of
usb_gadget_(dis)connect(), given that
drivers/power/supply/isp1704_charger.c is the only caller, and make it
call usb_udc_vbus_handler() instead ?

>
> >                 ret = gadget->ops->pullup(gadget, connect);
> >                 if (!ret)
> >                         gadget->connected = connect;
> >                 if (!connect) {
> >                         mutex_lock(&udc_lock);
> >                         if (gadget->udc->driver)
> >                                 gadget->udc->driver->disconnect(gadget);
> >                         mutex_unlock(&udc_lock);
> >                 }
> >         }
> >
> > out:
> >         trace_usb_gadget_connect(gadget, ret);
> >
> >         return ret;
> > }
> >
> > However, while auditing the code again, I noticed another potential
> > race as well:
> > Looks like usb_del_gadget() can potentially race against
> > usb_udc_vbus_handler() and call device_unregister.
> > This implies usb_udc can be freed while usb_udc_vbus_handler() or the
> > work item is executing.
> >
> > void usb_del_gadget(struct usb_gadget *gadget)
> > {
> >         struct usb_udc *udc = gadget->udc;
> >
> > ..
> > ...
> >         device_unregister(&udc->dev);
> > }
> > EXPORT_SYMBOL_GPL(usb_del_gadget);
> >
> > Does this look like a valid concern to you or am I misunderstanding this ?
>
> You're missing an important point.  Before doing device_unregister(),
> this routine calls device_del(&gadget->dev).  That will unbind the
> gadget driver, which (among other things) will stop the UDC, preventing
> it from calling usb_udc_vbus_handler().  However, you're right that the
> work item will need to be cancelled at some point before the usb_udc is
> unregistered.
>

Sure, I thought gadget_unbind_driver() might be a good place to cancel
the work item, so I am cancelling it there in v3.

> > If so, I am afraid that the only way to solve this is by synchronizing
> > usb_udc_vbus_handler() against usb_del_gadget() through a mutex as
> > device_unregister() can sleep.
> > So offloading usb_udc_vbus_handler() cannot work either.
> >
> > usb_udc_vbus_handler() seems to be invoked from the following drivers:
> >
> > 1. drivers/usb/chipidea/udc.c:
> > usb_udc_vbus_handler() is called from a non-atomic context.
> >
> > 2. drivers/usb/gadget/udc/max3420_udc.c
> > usb_udc_vbus_handler() is called from the interrupt handler.
> > However, all the events are processed from max3420_thread kthread.
> > So I am thinking of making them threaded irq handlers instead.
> >
> > 3. drivers/usb/gadget/udc/renesas_usbf.c
> > This one looks more invasive. However, still attempting to move them
> > to threaded irq handlers.
> >
> > As always, I'm looking forward to your feedback !
>
> Moving those things to threaded IRQ handlers is a separate job.  Let's
> get this issue fixed first.

Sounds good !

Thanks,
Badhri

>
> Alan Stern
  
Alan Stern May 30, 2023, 12:42 a.m. UTC | #11
On Mon, May 29, 2023 at 04:32:29PM -0700, Badhri Jagan Sridharan wrote:
> On Sat, May 27, 2023 at 9:36 AM Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > I think it would be better just to merge the new material into
> > usb_gadget_connect() and usb_gadget_disconnect().
> 
> I ended up merging them into usb_gadget_pullup_update_locked() so that
> each of the individual helper function can call
> usb_gadget_pullup_update_locked() while holding the connect_lock. I
> actually had usb_gadget_(dis)connect() set udc->vbus.

What?  No, that's not right.  They are two completely separate concepts.  
The host controls VBUS and the gadget controls the pullup.

>  It appears to me
> that both usb_gadget_(dis)connect() and usb_udc_vbus_handler() are
> meant to be called based on vbus presence and hence seem to be
> redundant.

They are not.  We need to support turning off the pullup while VBUS is 
on.
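
As a rough illustration of why the two must stay separate, here is a
simplified userspace model. struct model_link and the model_*() helpers
are hypothetical, and the "host sees the device only when both are
present" rule is a deliberate simplification, not a statement about any
particular controller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: two independent inputs owned by different sides. */
struct model_link {
	bool vbus;	/* controlled by the host */
	bool pullup;	/* controlled by the gadget */
};

/* What a VBUS notification would record; it never touches the pullup. */
static void model_set_vbus(struct model_link *l, bool on)
{
	assert(l != NULL);
	l->vbus = on;
}

/* What a connect/disconnect request would record; VBUS is left alone. */
static void model_set_pullup(struct model_link *l, bool on)
{
	assert(l != NULL);
	l->pullup = on;
}

/* Simplification: the host notices the device only when both are present. */
static bool model_host_sees_device(const struct model_link *l)
{
	assert(l != NULL);
	return l->vbus && l->pullup;
}
```

Dropping the pullup with model_set_pullup(&link, false) hides the device
from the host while link.vbus stays true, which is the case that folding
the pullup request into udc->vbus would make impossible to express.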

>  Wondering if we could get rid of usb_gadget_(dis)connect()
> given that drivers/power/supply/isp1704_charger.c is only call it and
> instead make it call usb_udc_vbus_handler() instead ?

In short, no.

Alan Stern
  

Patch

diff --git a/drivers/usb/gadget/udc/core.c b/drivers/usb/gadget/udc/core.c
index 69041cca5d24..ee612387b39c 100644
--- a/drivers/usb/gadget/udc/core.c
+++ b/drivers/usb/gadget/udc/core.c
@@ -41,6 +41,9 @@  static const struct bus_type gadget_bus_type;
  * functions. usb_gadget_connect_locked, usb_gadget_disconnect_locked,
  * usb_udc_connect_control_locked, usb_gadget_udc_start_locked, usb_gadget_udc_stop_locked are
  * called with this lock held.
+ * @vbus_events: list head for processing vbus updates on usb_udc_vbus_handler.
+ * @vbus_events_lock: protects vbus_events list
+ * @vbus_work: work item that invokes usb_udc_connect_control_locked.
  *
  * This represents the internal data structure which is used by the UDC-class
  * to hold information about udc driver and gadget together.
@@ -53,6 +56,19 @@  struct usb_udc {
 	bool				vbus;
 	bool				started;
 	struct mutex			connect_lock;
+	struct list_head		vbus_events;
+	spinlock_t			vbus_events_lock;
+	struct work_struct		vbus_work;
+};
+
+/**
+ * struct vbus_event - used to notify vbus updates posted through usb_udc_vbus_handler.
+ * @vbus_on: true when vbus is on, false otherwise.
+ * @node: list node for maintaining a list of pending updates to be processed.
+ */
+struct vbus_event {
+	bool vbus_on;
+	struct list_head node;
 };
 
 static struct class *udc_class;
@@ -1134,6 +1150,30 @@  static int usb_udc_connect_control_locked(struct usb_udc *udc) __must_hold(&udc-
 	return ret;
 }
 
+static void vbus_event_work(struct work_struct *work)
+{
+	struct vbus_event *event, *n;
+	struct usb_udc *udc = container_of(work, struct usb_udc, vbus_work);
+	unsigned long flags;
+
+	spin_lock_irqsave(&udc->vbus_events_lock, flags);
+	list_for_each_entry_safe(event, n, &udc->vbus_events, node) {
+		list_del(&event->node);
+		/* OK to drop the lock here as it suffices to synchronize udc->vbus_events node
+		 * retrieval and deletion against usb_udc_vbus_handler. usb_udc_vbus_handler does
+		 * list_add_tail so n would be the same even if the lock is dropped.
+		 */
+		spin_unlock_irqrestore(&udc->vbus_events_lock, flags);
+		mutex_lock(&udc->connect_lock);
+		udc->vbus = event->vbus_on;
+		usb_udc_connect_control_locked(udc);
+		kfree(event);
+		mutex_unlock(&udc->connect_lock);
+		spin_lock_irqsave(&udc->vbus_events_lock, flags);
+	}
+	spin_unlock_irqrestore(&udc->vbus_events_lock, flags);
+}
+
 /**
  * usb_udc_vbus_handler - updates the udc core vbus status, and try to
  * connect or disconnect gadget
@@ -1146,13 +1186,21 @@  static int usb_udc_connect_control_locked(struct usb_udc *udc) __must_hold(&udc-
 void usb_udc_vbus_handler(struct usb_gadget *gadget, bool status)
 {
 	struct usb_udc *udc = gadget->udc;
+	struct vbus_event *vbus_event;
+	unsigned long flags;
 
-	mutex_lock(&udc->connect_lock);
-	if (udc) {
-		udc->vbus = status;
-		usb_udc_connect_control_locked(udc);
-	}
-	mutex_unlock(&udc->connect_lock);
+	if (!udc)
+		return;
+
+	vbus_event = kzalloc(sizeof(*vbus_event), GFP_ATOMIC);
+	if (!vbus_event)
+		return;
+
+	spin_lock_irqsave(&udc->vbus_events_lock, flags);
+	vbus_event->vbus_on = status;
+	list_add_tail(&vbus_event->node, &udc->vbus_events);
+	spin_unlock_irqrestore(&udc->vbus_events_lock, flags);
+	schedule_work(&udc->vbus_work);
 }
 EXPORT_SYMBOL_GPL(usb_udc_vbus_handler);
 
@@ -1379,6 +1427,9 @@  int usb_add_gadget(struct usb_gadget *gadget)
 	udc->gadget = gadget;
 	gadget->udc = udc;
 	mutex_init(&udc->connect_lock);
+	INIT_LIST_HEAD(&udc->vbus_events);
+	spin_lock_init(&udc->vbus_events_lock);
+	INIT_WORK(&udc->vbus_work, vbus_event_work);
 
 	udc->started = false;