[v2,1/2] PCI/VPD: Add runtime power management to sysfs interface

Message ID 20230803171233.3810944-2-alex.williamson@redhat.com
State New
Headers
Series PCI: Protect VPD and PME accesses from power management |

Commit Message

Alex Williamson Aug. 3, 2023, 5:12 p.m. UTC
  Unlike default access to config space through sysfs, the vpd read and
write function don't actively manage the runtime power management state
of the device during access.  Since commit 7ab5e10eda02 ("vfio/pci: Move
the unused device into low power state with runtime PM"), the vfio-pci
driver will use runtime power management and release unused devices to
make use of low power states.  Attempting to access VPD information in
this low power state can result in incorrect information or kernel
crashes depending on the system behavior.

Wrap the vpd read/write bin attribute handlers in runtime PM and take
into account the potential quirk to select the correct device to wake.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)
  

Comments

Bjorn Helgaas Aug. 10, 2023, 3:59 p.m. UTC | #1
On Thu, Aug 03, 2023 at 11:12:32AM -0600, Alex Williamson wrote:
> Unlike default access to config space through sysfs, the vpd read and
> write function don't actively manage the runtime power management state
> of the device during access.  Since commit 7ab5e10eda02 ("vfio/pci: Move
> the unused device into low power state with runtime PM"), the vfio-pci
> driver will use runtime power management and release unused devices to
> make use of low power states.  Attempting to access VPD information in
> this low power state can result in incorrect information or kernel
> crashes depending on the system behavior.
> 
> Wrap the vpd read/write bin attribute handlers in runtime PM and take
> into account the potential quirk to select the correct device to wake.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c
> index a4fc4d0690fe..81217dd4789f 100644
> --- a/drivers/pci/vpd.c
> +++ b/drivers/pci/vpd.c
> @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj,
>  			size_t count)
>  {
>  	struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj));
> +	struct pci_dev *vpd_dev = dev;
> +	ssize_t ret;
> +
> +	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
> +		vpd_dev = pci_get_func0_dev(dev);
> +		if (!vpd_dev)
> +			return -ENODEV;
> +	}
> +
> +	pci_config_pm_runtime_get(vpd_dev);
> +	ret = pci_read_vpd(vpd_dev, off, count, buf);
> +	pci_config_pm_runtime_put(vpd_dev);
> +
> +	if (dev != vpd_dev)
> +		pci_dev_put(vpd_dev);

I first thought this would leak a reference if dev was func0 and had
PCI_DEV_FLAGS_VPD_REF_F0 set, because in that case vpd_dev would be
the same as dev.

But I think that case can't happen because quirk_f0_vpd_link() does
nothing for func0 devices, so PCI_DEV_FLAGS_VPD_REF_F0 should never be
set for func0.  But it seems like this might be easier to analyze as:

  if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0)
    pci_dev_put(vpd_dev);

Or am I missing something?

> -	return pci_read_vpd(dev, off, count, buf);
> +	return ret;
>  }
>  
>  static ssize_t vpd_write(struct file *filp, struct kobject *kobj,
> @@ -284,8 +299,23 @@ static ssize_t vpd_write(struct file *filp, struct kobject *kobj,
>  			 size_t count)
>  {
>  	struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj));
> +	struct pci_dev *vpd_dev = dev;
> +	ssize_t ret;
> +
> +	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
> +		vpd_dev = pci_get_func0_dev(dev);
> +		if (!vpd_dev)
> +			return -ENODEV;
> +	}
> +
> +	pci_config_pm_runtime_get(vpd_dev);
> +	ret = pci_write_vpd(vpd_dev, off, count, buf);
> +	pci_config_pm_runtime_put(vpd_dev);
> +
> +	if (dev != vpd_dev)
> +		pci_dev_put(vpd_dev);
>  
> -	return pci_write_vpd(dev, off, count, buf);
> +	return ret;
>  }
>  static BIN_ATTR(vpd, 0600, vpd_read, vpd_write, 0);
>  
> -- 
> 2.40.1
>
  
Alex Williamson Aug. 10, 2023, 4:26 p.m. UTC | #2
On Thu, 10 Aug 2023 10:59:26 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Thu, Aug 03, 2023 at 11:12:32AM -0600, Alex Williamson wrote:
> > Unlike default access to config space through sysfs, the vpd read and
> > write function don't actively manage the runtime power management state
> > of the device during access.  Since commit 7ab5e10eda02 ("vfio/pci: Move
> > the unused device into low power state with runtime PM"), the vfio-pci
> > driver will use runtime power management and release unused devices to
> > make use of low power states.  Attempting to access VPD information in
> > this low power state can result in incorrect information or kernel
> > crashes depending on the system behavior.
> > 
> > Wrap the vpd read/write bin attribute handlers in runtime PM and take
> > into account the potential quirk to select the correct device to wake.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++--
> >  1 file changed, 32 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c
> > index a4fc4d0690fe..81217dd4789f 100644
> > --- a/drivers/pci/vpd.c
> > +++ b/drivers/pci/vpd.c
> > @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj,
> >  			size_t count)
> >  {
> >  	struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj));
> > +	struct pci_dev *vpd_dev = dev;
> > +	ssize_t ret;
> > +
> > +	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
> > +		vpd_dev = pci_get_func0_dev(dev);
> > +		if (!vpd_dev)
> > +			return -ENODEV;
> > +	}
> > +
> > +	pci_config_pm_runtime_get(vpd_dev);
> > +	ret = pci_read_vpd(vpd_dev, off, count, buf);
> > +	pci_config_pm_runtime_put(vpd_dev);
> > +
> > +	if (dev != vpd_dev)
> > +		pci_dev_put(vpd_dev);  
> 
> I first thought this would leak a reference if dev was func0 and had
> PCI_DEV_FLAGS_VPD_REF_F0 set, because in that case vpd_dev would be
> the same as dev.
> 
> But I think that case can't happen because quirk_f0_vpd_link() does
> nothing for func0 devices, so PCI_DEV_FLAGS_VPD_REF_F0 should never be
> set for func0.  But it seems like this might be easier to analyze as:
> 
>   if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0)
>     pci_dev_put(vpd_dev);
> 
> Or am I missing something?

Nope, your analysis is correct, it doesn't make any sense to have a
flag on func0 redirecting VPD access to func0 so vpd_dev can only be
different on non-zero functions.  The alternative test is equally
valid so if you think it's more intuitive, let's use it.  Thanks,

Alex
  
Bjorn Helgaas Aug. 11, 2023, 7:25 p.m. UTC | #3
On Thu, Aug 03, 2023 at 11:12:32AM -0600, Alex Williamson wrote:
> Unlike default access to config space through sysfs, the vpd read and
> write function don't actively manage the runtime power management state
> of the device during access.  Since commit 7ab5e10eda02 ("vfio/pci: Move
> the unused device into low power state with runtime PM"), the vfio-pci
> driver will use runtime power management and release unused devices to
> make use of low power states.  Attempting to access VPD information in
> this low power state can result in incorrect information or kernel
> crashes depending on the system behavior.

I guess this specifically refers to D3cold, right?  VPD is accessed
via config space, which should work in all D0-D3hot states, but not in
D3cold.  I don't see anything in the spec about needing to be in D0 to
access VPD.

I assume there's no public problem report we could cite here?  I
suppose the behavior in D3cold is however the system handles a UR
error.

> Wrap the vpd read/write bin attribute handlers in runtime PM and take
> into account the potential quirk to select the correct device to wake.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c
> index a4fc4d0690fe..81217dd4789f 100644
> --- a/drivers/pci/vpd.c
> +++ b/drivers/pci/vpd.c
> @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj,
>  			size_t count)
>  {
>  	struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj));
> +	struct pci_dev *vpd_dev = dev;
> +	ssize_t ret;
> +
> +	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
> +		vpd_dev = pci_get_func0_dev(dev);
> +		if (!vpd_dev)
> +			return -ENODEV;
> +	}
> +
> +	pci_config_pm_runtime_get(vpd_dev);
> +	ret = pci_read_vpd(vpd_dev, off, count, buf);
> +	pci_config_pm_runtime_put(vpd_dev);
> +
> +	if (dev != vpd_dev)
> +		pci_dev_put(vpd_dev);
>  
> -	return pci_read_vpd(dev, off, count, buf);
> +	return ret;
>  }
>  
>  static ssize_t vpd_write(struct file *filp, struct kobject *kobj,
> @@ -284,8 +299,23 @@ static ssize_t vpd_write(struct file *filp, struct kobject *kobj,
>  			 size_t count)
>  {
>  	struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj));
> +	struct pci_dev *vpd_dev = dev;
> +	ssize_t ret;
> +
> +	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
> +		vpd_dev = pci_get_func0_dev(dev);
> +		if (!vpd_dev)
> +			return -ENODEV;
> +	}
> +
> +	pci_config_pm_runtime_get(vpd_dev);
> +	ret = pci_write_vpd(vpd_dev, off, count, buf);
> +	pci_config_pm_runtime_put(vpd_dev);
> +
> +	if (dev != vpd_dev)
> +		pci_dev_put(vpd_dev);
>  
> -	return pci_write_vpd(dev, off, count, buf);
> +	return ret;
>  }
>  static BIN_ATTR(vpd, 0600, vpd_read, vpd_write, 0);
>  
> -- 
> 2.40.1
>
  
Alex Williamson Aug. 11, 2023, 7:56 p.m. UTC | #4
On Fri, 11 Aug 2023 14:25:43 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Thu, Aug 03, 2023 at 11:12:32AM -0600, Alex Williamson wrote:
> > Unlike default access to config space through sysfs, the vpd read and
> > write function don't actively manage the runtime power management state
> > of the device during access.  Since commit 7ab5e10eda02 ("vfio/pci: Move
> > the unused device into low power state with runtime PM"), the vfio-pci
> > driver will use runtime power management and release unused devices to
> > make use of low power states.  Attempting to access VPD information in
> > this low power state can result in incorrect information or kernel
> > crashes depending on the system behavior.  
> 
> I guess this specifically refers to D3cold, right?  VPD is accessed
> via config space, which should work in all D0-D3hot states, but not in
> D3cold.  I don't see anything in the spec about needing to be in D0 to
> access VPD.
> 
> I assume there's no public problem report we could cite here?  I
> suppose the behavior in D3cold is however the system handles a UR
> error.

Yes, Eric tested that pcie_port_pm=off is a viable workaround resolving
both the VPD and PME accesses, so I think the issue is actually that
the root port is in D3cold.  This aligns with commit 7ab5e10eda02
above, since prior to that we were only manipulating the endpoint power
state.

The oops indicates an "Internal error: synchronous external abort",
with a stack trace ending in pci_generic_config_read, so I suspect this
is a UR.

Unfortunately the bz is not currently public :-\  Thanks,

Alex
 
> > Wrap the vpd read/write bin attribute handlers in runtime PM and take
> > into account the potential quirk to select the correct device to wake.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++--
> >  1 file changed, 32 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c
> > index a4fc4d0690fe..81217dd4789f 100644
> > --- a/drivers/pci/vpd.c
> > +++ b/drivers/pci/vpd.c
> > @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj,
> >  			size_t count)
> >  {
> >  	struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj));
> > +	struct pci_dev *vpd_dev = dev;
> > +	ssize_t ret;
> > +
> > +	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
> > +		vpd_dev = pci_get_func0_dev(dev);
> > +		if (!vpd_dev)
> > +			return -ENODEV;
> > +	}
> > +
> > +	pci_config_pm_runtime_get(vpd_dev);
> > +	ret = pci_read_vpd(vpd_dev, off, count, buf);
> > +	pci_config_pm_runtime_put(vpd_dev);
> > +
> > +	if (dev != vpd_dev)
> > +		pci_dev_put(vpd_dev);
> >  
> > -	return pci_read_vpd(dev, off, count, buf);
> > +	return ret;
> >  }
> >  
> >  static ssize_t vpd_write(struct file *filp, struct kobject *kobj,
> > @@ -284,8 +299,23 @@ static ssize_t vpd_write(struct file *filp, struct kobject *kobj,
> >  			 size_t count)
> >  {
> >  	struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj));
> > +	struct pci_dev *vpd_dev = dev;
> > +	ssize_t ret;
> > +
> > +	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
> > +		vpd_dev = pci_get_func0_dev(dev);
> > +		if (!vpd_dev)
> > +			return -ENODEV;
> > +	}
> > +
> > +	pci_config_pm_runtime_get(vpd_dev);
> > +	ret = pci_write_vpd(vpd_dev, off, count, buf);
> > +	pci_config_pm_runtime_put(vpd_dev);
> > +
> > +	if (dev != vpd_dev)
> > +		pci_dev_put(vpd_dev);
> >  
> > -	return pci_write_vpd(dev, off, count, buf);
> > +	return ret;
> >  }
> >  static BIN_ATTR(vpd, 0600, vpd_read, vpd_write, 0);
> >  
> > -- 
> > 2.40.1
> >   
>
  

Patch

diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c
index a4fc4d0690fe..81217dd4789f 100644
--- a/drivers/pci/vpd.c
+++ b/drivers/pci/vpd.c
@@ -275,8 +275,23 @@  static ssize_t vpd_read(struct file *filp, struct kobject *kobj,
 			size_t count)
 {
 	struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj));
+	struct pci_dev *vpd_dev = dev;
+	ssize_t ret;
+
+	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
+		vpd_dev = pci_get_func0_dev(dev);
+		if (!vpd_dev)
+			return -ENODEV;
+	}
+
+	pci_config_pm_runtime_get(vpd_dev);
+	ret = pci_read_vpd(vpd_dev, off, count, buf);
+	pci_config_pm_runtime_put(vpd_dev);
+
+	if (dev != vpd_dev)
+		pci_dev_put(vpd_dev);
 
-	return pci_read_vpd(dev, off, count, buf);
+	return ret;
 }
 
 static ssize_t vpd_write(struct file *filp, struct kobject *kobj,
@@ -284,8 +299,23 @@  static ssize_t vpd_write(struct file *filp, struct kobject *kobj,
 			 size_t count)
 {
 	struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj));
+	struct pci_dev *vpd_dev = dev;
+	ssize_t ret;
+
+	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
+		vpd_dev = pci_get_func0_dev(dev);
+		if (!vpd_dev)
+			return -ENODEV;
+	}
+
+	pci_config_pm_runtime_get(vpd_dev);
+	ret = pci_write_vpd(vpd_dev, off, count, buf);
+	pci_config_pm_runtime_put(vpd_dev);
+
+	if (dev != vpd_dev)
+		pci_dev_put(vpd_dev);
 
-	return pci_write_vpd(dev, off, count, buf);
+	return ret;
 }
 static BIN_ATTR(vpd, 0600, vpd_read, vpd_write, 0);