ACPI: PCI: Fix device reference counting in acpi_get_pci_dev()

Message ID 12097002.O9o76ZdvQC@kreacher
State New
Headers
Series ACPI: PCI: Fix device reference counting in acpi_get_pci_dev() |

Commit Message

Rafael J. Wysocki Oct. 18, 2022, 5:34 p.m. UTC
  From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed
to reference count the device returned by acpi_get_pci_dev() as
expected by its callers which in some cases may cause device objects
to be dropped prematurely.

Add the missing get_device() to acpi_get_pci_dev().

Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/acpi/pci_root.c |    1 +
 1 file changed, 1 insertion(+)
  

Comments

Ville Syrjälä Oct. 19, 2022, 8:54 a.m. UTC | #1
On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed
> to reference count the device returned by acpi_get_pci_dev() as
> expected by its callers which in some cases may cause device objects
> to be dropped prematurely.
> 
> Add the missing get_device() to acpi_get_pci_dev().
> 
> Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()")

FYI this (and the rtc-cmos regression discussed in
https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/)
took down the entire Intel gfx CI. I've applied both fixes
into our fixup branch and things are looking much healthier
now.

This one caused i915 selftests to eat a lot of POISON_FREE
in the CI. While bisecting it locally I didn't have
poisoning enabled so I got refcount_t undeflows instead.

https://intel-gfx-ci.01.org/tree/drm-tip/index.html has a lot
of colorful boxes to click if you're interested in any of the
logs. The fixes are included in the CI_DRM_12259 build. Earlier
builds were broken.

> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/acpi/pci_root.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> Index: linux-pm/drivers/acpi/pci_root.c
> ===================================================================
> --- linux-pm.orig/drivers/acpi/pci_root.c
> +++ linux-pm/drivers/acpi/pci_root.c
> @@ -323,6 +323,7 @@ struct pci_dev *acpi_get_pci_dev(acpi_ha
>  
>  	list_for_each_entry(pn, &adev->physical_node_list, node) {
>  		if (dev_is_pci(pn->dev)) {
> +			get_device(pn->dev);
>  			pci_dev = to_pci_dev(pn->dev);
>  			break;
>  		}
> 
> 
>
  
Rafael J. Wysocki Oct. 19, 2022, 11:35 a.m. UTC | #2
On Wed, Oct 19, 2022 at 11:02 AM Ville Syrjälä
<ville.syrjala@linux.intel.com> wrote:
>
> On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed
> > to reference count the device returned by acpi_get_pci_dev() as
> > expected by its callers which in some cases may cause device objects
> > to be dropped prematurely.
> >
> > Add the missing get_device() to acpi_get_pci_dev().
> >
> > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()")
>
> FYI this (and the rtc-cmos regression discussed in
> https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/)
> took down the entire Intel gfx CI.

Sorry for the disturbance.

> I've applied both fixes into our fixup branch and things are looking much
> healthier now.

Thanks for letting me know.

I've just added the $subject patch to my linux-next branch as an
urgent fix and the other one has been applied to the RTC tree.

> This one caused i915 selftests to eat a lot of POISON_FREE
> in the CI. While bisecting it locally I didn't have
> poisoning enabled so I got refcount_t undeflows instead.

Unfortunately, making no mistakes is generally hard to offer.

If catching things like this early is better, what about pulling my
bleeding-edge branch, where all of my changes are staged before going
into linux-next, into the CI?

> https://intel-gfx-ci.01.org/tree/drm-tip/index.html has a lot
> of colorful boxes to click if you're interested in any of the
> logs. The fixes are included in the CI_DRM_12259 build. Earlier
> builds were broken.

Thanks!
  
Ville Syrjälä Oct. 19, 2022, 12:22 p.m. UTC | #3
On Wed, Oct 19, 2022 at 01:35:26PM +0200, Rafael J. Wysocki wrote:
> On Wed, Oct 19, 2022 at 11:02 AM Ville Syrjälä
> <ville.syrjala@linux.intel.com> wrote:
> >
> > On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed
> > > to reference count the device returned by acpi_get_pci_dev() as
> > > expected by its callers which in some cases may cause device objects
> > > to be dropped prematurely.
> > >
> > > Add the missing get_device() to acpi_get_pci_dev().
> > >
> > > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()")
> >
> > FYI this (and the rtc-cmos regression discussed in
> > https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/)
> > took down the entire Intel gfx CI.
> 
> Sorry for the disturbance.
> 
> > I've applied both fixes into our fixup branch and things are looking much
> > healthier now.
> 
> Thanks for letting me know.
> 
> I've just added the $subject patch to my linux-next branch as an
> urgent fix and the other one has been applied to the RTC tree.
> 
> > This one caused i915 selftests to eat a lot of POISON_FREE
> > in the CI. While bisecting it locally I didn't have
> > poisoning enabled so I got refcount_t undeflows instead.
> 
> Unfortunately, making no mistakes is generally hard to offer.
> 
> If catching things like this early is better, what about pulling my
> bleeding-edge branch, where all of my changes are staged before going
> into linux-next, into the CI?

Pretty sure we don't have the resources to become the CI for
everyone. So testing random trees is not really possible. And 
the alternative of pulling random trees into drm-tip is probably
a not a popular idea either. We used to pull in the sound tree
since it's pretty closely tied to graphics, but I think we
stopped even that because it eneded up pulling the whole of
-rc1 in at random points in time when we were't expecting it.

Ideally each subsystem would have its own CI, or there should
be some kernel wide thing. But I suppose the progress towards
something like that is glacial.

That said, we do test linux-next to some degree. And looks like
at least one of these could have been caught a bit earlier through
that. Unfortunately no one is really keeping an eye on that so
things tend to slip through. Probably need to figure out something
to make better use of that.

> 
> > https://intel-gfx-ci.01.org/tree/drm-tip/index.html has a lot
> > of colorful boxes to click if you're interested in any of the
> > logs. The fixes are included in the CI_DRM_12259 build. Earlier
> > builds were broken.
> 
> Thanks!
  
Jani Nikula Oct. 19, 2022, 12:57 p.m. UTC | #4
On Wed, 19 Oct 2022, Ville Syrjälä <ville.syrjala@linux.intel.com> wrote:
> On Wed, Oct 19, 2022 at 01:35:26PM +0200, Rafael J. Wysocki wrote:
>> If catching things like this early is better, what about pulling my
>> bleeding-edge branch, where all of my changes are staged before going
>> into linux-next, into the CI?
>
> Pretty sure we don't have the resources to become the CI for
> everyone. So testing random trees is not really possible. And 
> the alternative of pulling random trees into drm-tip is probably
> a not a popular idea either. We used to pull in the sound tree
> since it's pretty closely tied to graphics, but I think we
> stopped even that because it eneded up pulling the whole of
> -rc1 in at random points in time when we were't expecting it.

Basically, we only pull branches to drm-tip that are managed using our
tools and our development model and under our control. It was too much
trouble dealing with conflicts, Linus' master being pulled in at random
points (like in the middle of the merge window), and stuff like that,
with the external trees.

> Ideally each subsystem would have its own CI, or there should
> be some kernel wide thing. But I suppose the progress towards
> something like that is glacial.
>
> That said, we do test linux-next to some degree. And looks like
> at least one of these could have been caught a bit earlier through
> that. Unfortunately no one is really keeping an eye on that so
> things tend to slip through. Probably need to figure out something
> to make better use of that.

Yeah, we need to pay more attention to linux-next test results, as well
as Linus' master during the merge window. It's not necessarily easy with
the volatility of linux-next, you could easily have very broken runs
followed by good ones, but the low hanging fruit is raising more flags
and being louder about it earlier when everything's busted for several
days in linux-next or Linus' master.


BR,
Jani.
  
Rafael J. Wysocki Oct. 19, 2022, 1 p.m. UTC | #5
On Wed, Oct 19, 2022 at 2:22 PM Ville Syrjälä
<ville.syrjala@linux.intel.com> wrote:
>
> On Wed, Oct 19, 2022 at 01:35:26PM +0200, Rafael J. Wysocki wrote:
> > On Wed, Oct 19, 2022 at 11:02 AM Ville Syrjälä
> > <ville.syrjala@linux.intel.com> wrote:
> > >
> > > On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > > > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed
> > > > to reference count the device returned by acpi_get_pci_dev() as
> > > > expected by its callers which in some cases may cause device objects
> > > > to be dropped prematurely.
> > > >
> > > > Add the missing get_device() to acpi_get_pci_dev().
> > > >
> > > > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()")
> > >
> > > FYI this (and the rtc-cmos regression discussed in
> > > https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/)
> > > took down the entire Intel gfx CI.
> >
> > Sorry for the disturbance.
> >
> > > I've applied both fixes into our fixup branch and things are looking much
> > > healthier now.
> >
> > Thanks for letting me know.
> >
> > I've just added the $subject patch to my linux-next branch as an
> > urgent fix and the other one has been applied to the RTC tree.
> >
> > > This one caused i915 selftests to eat a lot of POISON_FREE
> > > in the CI. While bisecting it locally I didn't have
> > > poisoning enabled so I got refcount_t undeflows instead.
> >
> > Unfortunately, making no mistakes is generally hard to offer.
> >
> > If catching things like this early is better, what about pulling my
> > bleeding-edge branch, where all of my changes are staged before going
> > into linux-next, into the CI?
>
> Pretty sure we don't have the resources to become the CI for
> everyone. So testing random trees is not really possible. And
> the alternative of pulling random trees into drm-tip is probably
> a not a popular idea either. We used to pull in the sound tree
> since it's pretty closely tied to graphics, but I think we
> stopped even that because it eneded up pulling the whole of
> -rc1 in at random points in time when we were't expecting it.

I see.

> Ideally each subsystem would have its own CI, or there should
> be some kernel wide thing. But I suppose the progress towards
> something like that is glacial.

Well, I definitely cannot afford a dedicated CI just for my tree and I
haven't got any useful information from KernlCI yet (even though hey
pull and test my linux-next branch on a regular basis).

KernelCI seems to be focusing on different set of hardware, so to speak.

> That said, we do test linux-next to some degree. And looks like
> at least one of these could have been caught a bit earlier through
> that. Unfortunately no one is really keeping an eye on that so
> things tend to slip through. Probably need to figure out something
> to make better use of that.

I think it could also be possible to contribute to KernelCI to get
more useful x86 coverage from it.
  
Bjorn Helgaas Oct. 19, 2022, 4:53 p.m. UTC | #6
On Wed, Oct 19, 2022 at 11:54:42AM +0300, Ville Syrjälä wrote:
> On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed
> > to reference count the device returned by acpi_get_pci_dev() as
> > expected by its callers which in some cases may cause device objects
> > to be dropped prematurely.
> > 
> > Add the missing get_device() to acpi_get_pci_dev().
> > 
> > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()")
> 
> FYI this (and the rtc-cmos regression discussed in
> https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/)
> took down the entire Intel gfx CI.

From 1000 miles away and zero background with the gfx CI, this sounds
like "our CI system, whose purpose is to find bugs, found one", which
is a good thing.

Bjorn
  
Ville Syrjälä Oct. 19, 2022, 5:40 p.m. UTC | #7
On Wed, Oct 19, 2022 at 11:53:26AM -0500, Bjorn Helgaas wrote:
> On Wed, Oct 19, 2022 at 11:54:42AM +0300, Ville Syrjälä wrote:
> > On Tue, Oct 18, 2022 at 07:34:03PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > 
> > > Commit 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()") failed
> > > to reference count the device returned by acpi_get_pci_dev() as
> > > expected by its callers which in some cases may cause device objects
> > > to be dropped prematurely.
> > > 
> > > Add the missing get_device() to acpi_get_pci_dev().
> > > 
> > > Fixes: 63f534b8bad9 ("ACPI: PCI: Rework acpi_get_pci_dev()")
> > 
> > FYI this (and the rtc-cmos regression discussed in
> > https://lore.kernel.org/linux-acpi/5887691.lOV4Wx5bFT@kreacher/)
> > took down the entire Intel gfx CI.
> 
> >From 1000 miles away and zero background with the gfx CI, this sounds
> like "our CI system, whose purpose is to find bugs, found one", which
> is a good thing.

Mostly. It's certainly better than it going entirely undetected.

Sadly we found it after rc1 because no one was really looking at
linux-next results. Something we need to improve.

But ideally it would have been found by some other CI system
whose primary job is to prevent bugs in those subsystems, rather
than the one whose primary job is to prevent bugs in gfx drivers.
Also ideally it wouldn't have been me bisecting this :P

The biggest downside of bugs reaching our CI via rc1/etc. is that
it pretty much stops everyone from getting premerge results for
their graphics driver patches since the CI keeps tripping over
the already existing bugs. But I guess you can call this one a
somewhat self inflicted wound and we should just try harder to
keep new code out of our tree until it's known to be healthy.
  

Patch

Index: linux-pm/drivers/acpi/pci_root.c
===================================================================
--- linux-pm.orig/drivers/acpi/pci_root.c
+++ linux-pm/drivers/acpi/pci_root.c
@@ -323,6 +323,7 @@  struct pci_dev *acpi_get_pci_dev(acpi_ha
 
 	list_for_each_entry(pn, &adev->physical_node_list, node) {
 		if (dev_is_pci(pn->dev)) {
+			get_device(pn->dev);
 			pci_dev = to_pci_dev(pn->dev);
 			break;
 		}