[v2] PCI: vmd: Enable PCI PM's L1 substates of remapped PCIe Root Port and NVMe

Message ID 20240202071110.8515-3-jhp@endlessos.org
State New
Series [v2] PCI: vmd: Enable PCI PM's L1 substates of remapped PCIe Root Port and NVMe

Commit Message

Jian-Hong Pan Feb. 2, 2024, 7:11 a.m. UTC
The remapped PCIe Root Port and NVMe device have the PCI PM L1 Substates
capability, but the substates are initially disabled:

Here is an example on ASUS B1400CEAE:

Capabilities: [900 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
                  PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
                   T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us

Power on all of the VMD-remapped PCI devices and quirk the max snoop LTR
before enabling PCI-PM L1 Substates, following Section 5.5.4 of PCIe Base
Spec Revision 6.0. The PCI PM L1 substates controls are then initialized
and enabled accordingly. Also, update the comments of
pci_enable_link_state() and pci_enable_link_state_locked() for kernel-doc.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
---
v2:
  - Power on the VMD remapped devices with pci_set_power_state_locked()
  - Prepare the PCIe LTR parameters before enabling L1 Substates
  - Add a note to the comments of both pci_enable_link_state() and
    pci_enable_link_state_locked() for kernel-doc
  - Split the original patch set into individual patches

 drivers/pci/controller/vmd.c | 15 ++++++++++-----
 drivers/pci/pcie/aspm.c      | 10 ++++++++++
 2 files changed, 20 insertions(+), 5 deletions(-)
  

Comments

Bjorn Helgaas Feb. 3, 2024, 12:05 a.m. UTC | #1
On Fri, Feb 02, 2024 at 03:11:12PM +0800, Jian-Hong Pan wrote:
> The remapped PCIe Root Port and NVMe have PCI PM L1 substates
> capability, but they are disabled originally:
> 
> Here is an example on ASUS B1400CEAE:
> 
> Capabilities: [900 v1] L1 PM Substates
>         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
>                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
>         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>                    T_CommonMode=0us LTR1.2_Threshold=0ns
>         L1SubCtl2: T_PwrOn=10us
> 
> Power on all of the VMD remapped PCI devices and quirk max snoop LTR
> before enable PCI-PM L1 PM Substates by following "Section 5.5.4 of PCIe
> Base Spec Revision 6.0". Then, PCI PM's L1 substates control are
> initialized & enabled accordingly.

> Also, update the comments of
> pci_enable_link_state() and pci_enable_link_state_locked() for
> kernel-doc, too.

The aspm.c changes should be in a separate patch.  Presumably the
aspm.c code change fixes a defect, and that defect can be described in
that patch.  That fix may be needed for non-VMD situations, and having
it in this vmd patch means it won't be as easy to find and backport.

Nit: rewrap commit log to fill 75 columns.

> @@ -751,11 +751,9 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
>  	if (!(features & VMD_FEAT_BIOS_PM_QUIRK))
>  		return 0;
>  
> -	pci_enable_link_state_locked(pdev, PCIE_LINK_STATE_ALL);
> -
>  	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_LTR);
>  	if (!pos)
> -		return 0;
> +		goto out_enable_link_state;
>  
>  	/*
>  	 * Skip if the max snoop LTR is non-zero, indicating BIOS has set it
> @@ -763,7 +761,7 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
>  	 */
>  	pci_read_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, &ltr_reg);
>  	if (!!(ltr_reg & (PCI_LTR_VALUE_MASK | PCI_LTR_SCALE_MASK)))
> -		return 0;
> +		goto out_enable_link_state;
>  
>  	/*
>  	 * Set the default values to the maximum required by the platform to
> @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
>  	pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, ltr_reg);
>  	pci_info(pdev, "VMD: Default LTR value set by driver\n");

You're not changing this part, and I don't understand exactly how LTR
works, but it makes me a little bit queasy to read "set the LTR value
to the maximum required to allow the deepest power management
savings" and then we set the max snoop values to a fixed constant.

I don't think the goal is to "allow the deepest power savings"; I
think it's to enable L1.2 *when the device has enough buffering to
absorb L1.2 entry/exit latencies*.

The spec (PCIe r6.0, sec 7.8.2.2) says "Software should set this to
the platform's maximum supported latency or less," so it seems like
that value must be platform-dependent, not fixed.

And I assume the "_DSM for Latency Tolerance Reporting" is part of the
way to get those platform-dependent values, but Linux doesn't actually
use that yet.

I assume that setting the max values incorrectly may lead to either
being too conservative, so we don't use L1.2 when we could, or too
aggressive, so we use L1.2 when we shouldn't, and the device loses
data because it doesn't have enough internal buffering to absorb the
entry/exit delays.
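
[As background for the encoding discussed here, a minimal decode sketch — an
editorial aside, not part of the patch. The Max Snoop Latency register packs
a 10-bit value and a 3-bit scale, and the latency is value * 32^scale ns;
mask names follow include/uapi/linux/pci_regs.h, and the 0x1003 test input
mirrors the fixed constant the vmd quirk writes (3145728 ns).]

#include <stdint.h>
#include <stdio.h>

/* LTR register layout per PCIe r6.0, sec 7.8.2.2. */
#define PCI_LTR_VALUE_MASK	0x000003ff	/* bits 9:0, latency value */
#define PCI_LTR_SCALE_MASK	0x00001c00	/* bits 12:10, latency scale */
#define PCI_LTR_SCALE_SHIFT	10

/* Latency = value * 32^scale ns; scale codes 6 and 7 are reserved. */
static uint64_t ltr_decode_ns(uint16_t reg)
{
	uint32_t value = reg & PCI_LTR_VALUE_MASK;
	uint32_t scale = (reg & PCI_LTR_SCALE_MASK) >> PCI_LTR_SCALE_SHIFT;

	return scale > 5 ? 0 : (uint64_t)value << (5 * scale);
}

int main(void)
{
	/* 0x1003: value 3, scale 4 -> 3 * 1048576 ns = 3145728 ns, the
	 * same latency regardless of what the platform actually needs.
	 */
	printf("%llu ns\n", (unsigned long long)ltr_decode_ns(0x1003));
	return 0;
}

This illustrates the concern above: a single fixed encoding cannot reflect
platform-dependent maximums.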

This paper has a lot of background and might help answer some of my
questions:
https://www.intel.co.za/content/dam/doc/white-paper/energy-efficient-platforms-white-paper.pdf

> +out_enable_link_state:
> +	/*
> +	 * Make PCI devices at D0 when enable PCI-PM L1 PM Substates from
> +	 * Section 5.5.4 of PCIe Base Spec Revision 6.0
> +	 */
> +	pci_set_power_state_locked(pdev, PCI_D0);
> +	pci_enable_link_state_locked(pdev, PCIE_LINK_STATE_ALL);

Hmmm.  PCIE_LINK_STATE_ALL includes ASPM L1.2.  We may do this even if
the device doesn't have an LTR Capability.  ASPM L1.2 cannot work
without LTR.

I only took a quick look but was not convinced that
pci_enable_link_state() does the right thing when we enable
PCIE_LINK_STATE_ALL (including ASPM L1.2) on a device that doesn't
have LTR.  I think it *should* decline to set PCI_L1SS_CTL1_ASPM_L1_2,
but I'm not sure it does.  Can you double check that path?  Maybe
that's another defect in aspm.c.
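
[A minimal sketch of the guard being requested here — a hypothetical helper,
not what aspm.c currently does. The assumption, per the reasoning above, is
that PCI_L1SS_CTL1_ASPM_L1_2 should only be set when both link partners
expose an LTR Extended Capability.]

#include <linux/pci.h>

/* Hypothetical helper: decline ASPM L1.2 unless both ends of the link
 * expose an LTR Extended Capability, since L1.2 cannot work without LTR.
 */
static bool link_has_ltr(struct pci_dev *parent, struct pci_dev *child)
{
	return pci_find_ext_capability(parent, PCI_EXT_CAP_ID_LTR) &&
	       pci_find_ext_capability(child, PCI_EXT_CAP_ID_LTR);
}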

> @@ -1164,6 +1164,8 @@ static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool locked)
>  		link->aspm_default |= ASPM_STATE_L1_1_PCIPM | ASPM_STATE_L1;
>  	if (state & PCIE_LINK_STATE_L1_2_PCIPM)
>  		link->aspm_default |= ASPM_STATE_L1_2_PCIPM | ASPM_STATE_L1;
> +	if (state & ASPM_STATE_L1_2_MASK)
> +		aspm_l1ss_init(link);
>  	pcie_config_aspm_link(link, policy_to_aspm_state(link));
>  
>  	link->clkpm_default = (state & PCIE_LINK_STATE_CLKPM) ? 1 : 0;
  
David E. Box Feb. 5, 2024, 7:37 p.m. UTC | #2
On Fri, 2024-02-02 at 18:05 -0600, Bjorn Helgaas wrote:
> On Fri, Feb 02, 2024 at 03:11:12PM +0800, Jian-Hong Pan wrote:
> > The remapped PCIe Root Port and NVMe have PCI PM L1 substates
> > capability, but they are disabled originally:
> > 
> > Here is an example on ASUS B1400CEAE:
> > 
> > Capabilities: [900 v1] L1 PM Substates
> >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > L1_PM_Substates+
> >                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> >                    T_CommonMode=0us LTR1.2_Threshold=0ns
> >         L1SubCtl2: T_PwrOn=10us
> > 
> > Power on all of the VMD remapped PCI devices and quirk max snoop LTR
> > before enable PCI-PM L1 PM Substates by following "Section 5.5.4 of PCIe
> > Base Spec Revision 6.0". Then, PCI PM's L1 substates control are
> > initialized & enabled accordingly.
> 
> > Also, update the comments of
> > pci_enable_link_state() and pci_enable_link_state_locked() for
> > kernel-doc, too.
> 
> The aspm.c changes should be in a separate patch.  Presumably the
> aspm.c code change fixes a defect, and that defect can be described in
> that patch.  That fix may be needed for non-VMD situations, and having
> it in this vmd patch means it won't be as easy to find and backport.
> 
> Nit: rewrap commit log to fill 75 columns.
> 
> > @@ -751,11 +751,9 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev,
> > void *userdata)
> >  	if (!(features & VMD_FEAT_BIOS_PM_QUIRK))
> >  		return 0;
> >  
> > -	pci_enable_link_state_locked(pdev, PCIE_LINK_STATE_ALL);
> > -
> >  	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_LTR);
> >  	if (!pos)
> > -		return 0;
> > +		goto out_enable_link_state;
> >  
> >  	/*
> >  	 * Skip if the max snoop LTR is non-zero, indicating BIOS has set
> > it
> > @@ -763,7 +761,7 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev,
> > void *userdata)
> >  	 */
> >  	pci_read_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, &ltr_reg);
> >  	if (!!(ltr_reg & (PCI_LTR_VALUE_MASK | PCI_LTR_SCALE_MASK)))
> > -		return 0;
> > +		goto out_enable_link_state;
> >  
> >  	/*
> >  	 * Set the default values to the maximum required by the platform
> > to
> > @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev,
> > void *userdata)
> >  	pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, ltr_reg);
> >  	pci_info(pdev, "VMD: Default LTR value set by driver\n");
> 
> You're not changing this part, and I don't understand exactly how LTR
> works, but it makes me a little bit queasy to read "set the LTR value
> to the maximum required to allow the deepest power management
> savings" and then we set the max snoop values to a fixed constant.
> 
> I don't think the goal is to "allow the deepest power savings"; I
> think it's to enable L1.2 *when the device has enough buffering to
> absorb L1.2 entry/exit latencies*.
> 
> The spec (PCIe r6.0, sec 7.8.2.2) says "Software should set this to
> the platform's maximum supported latency or less," so it seems like
> that value must be platform-dependent, not fixed.
> 
> And I assume the "_DSM for Latency Tolerance Reporting" is part of the
> way to get those platform-dependent values, but Linux doesn't actually
> use that yet.

This may indeed be the best way but we need to double check with our BIOS folks.
AFAIK BIOS writes the LTR values directly so there hasn't been a need to use
this _DSM. But under VMD the ports are hidden from BIOS which is why we added it
here. I've brought up the question internally to find out how Windows handles
the DSM and to get a recommendation from our firmware leads.

> 
> I assume that setting the max values incorrectly may lead to either
> being too conservative, so we don't use L1.2 when we could, or too
> aggressive, so we use L1.2 when we shouldn't, and the device loses
> data because it doesn't have enough internal buffering to absorb the
> entry/exit delays.
> 
> This paper has a lot of background and might help answer some of my
> questions:
> https://www.intel.co.za/content/dam/doc/white-paper/energy-efficient-platforms-white-paper.pdf
> 
> > +out_enable_link_state:
> > +	/*
> > +	 * Make PCI devices at D0 when enable PCI-PM L1 PM Substates from
> > +	 * Section 5.5.4 of PCIe Base Spec Revision 6.0
> > +	 */
> > +	pci_set_power_state_locked(pdev, PCI_D0);
> > +	pci_enable_link_state_locked(pdev, PCIE_LINK_STATE_ALL);
> 
> Hmmm.  PCIE_LINK_STATE_ALL includes ASPM L1.2.  We may do this even if
> the device doesn't have an LTR Capability.  ASPM L1.2 cannot work
> without LTR.
> 
> I only took a quick look but was not convinced that
> pci_enable_link_state() does the right thing when we enable
> PCIE_LINK_STATE_ALL (including ASPM L1.2) on a device that doesn't
> have LTR.  I think it *should* decline to set PCI_L1SS_CTL1_ASPM_L1_2,
> but I'm not sure it does.  Can you double check that path?  Maybe
> that's another defect in aspm.c.

It doesn't currently decline. The same scenario happens when the user selects
powersupersave. If a device advertises L1.2 with the capable registers set, it
should also have the LTR register present. But it doesn't hurt to check.

David

> 
> > @@ -1164,6 +1164,8 @@ static int __pci_enable_link_state(struct pci_dev
> > *pdev, int state, bool locked)
> >  		link->aspm_default |= ASPM_STATE_L1_1_PCIPM |
> > ASPM_STATE_L1;
> >  	if (state & PCIE_LINK_STATE_L1_2_PCIPM)
> >  		link->aspm_default |= ASPM_STATE_L1_2_PCIPM |
> > ASPM_STATE_L1;
> > +	if (state & ASPM_STATE_L1_2_MASK)
> > +		aspm_l1ss_init(link);
> >  	pcie_config_aspm_link(link, policy_to_aspm_state(link));
> >  
> >  	link->clkpm_default = (state & PCIE_LINK_STATE_CLKPM) ? 1 : 0;
  
Bjorn Helgaas Feb. 5, 2024, 10:42 p.m. UTC | #3
On Mon, Feb 05, 2024 at 11:37:16AM -0800, David E. Box wrote:
> On Fri, 2024-02-02 at 18:05 -0600, Bjorn Helgaas wrote:
> > On Fri, Feb 02, 2024 at 03:11:12PM +0800, Jian-Hong Pan wrote:
> ...

> > > @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev,
> > > void *userdata)
> > >  	pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, ltr_reg);
> > >  	pci_info(pdev, "VMD: Default LTR value set by driver\n");
> > 
> > You're not changing this part, and I don't understand exactly how LTR
> > works, but it makes me a little bit queasy to read "set the LTR value
> > to the maximum required to allow the deepest power management
> > savings" and then we set the max snoop values to a fixed constant.
> > 
> > I don't think the goal is to "allow the deepest power savings"; I
> > think it's to enable L1.2 *when the device has enough buffering to
> > absorb L1.2 entry/exit latencies*.
> > 
> > The spec (PCIe r6.0, sec 7.8.2.2) says "Software should set this to
> > the platform's maximum supported latency or less," so it seems like
> > that value must be platform-dependent, not fixed.
> > 
> > And I assume the "_DSM for Latency Tolerance Reporting" is part of the
> > way to get those platform-dependent values, but Linux doesn't actually
> > use that yet.
> 
> This may indeed be the best way but we need to double check with our
> BIOS folks.  AFAIK BIOS writes the LTR values directly so there
> hasn't been a need to use this _DSM. But under VMD the ports are
> hidden from BIOS which is why we added it here. I've brought up the
> question internally to find out how Windows handles the DSM and to
> get a recommendation from our firmware leads.

We want Linux to be able to program LTR itself, don't we?  We
shouldn't have to rely on firmware to do it.  If Linux can't do
it, hot-added devices aren't going to be able to use L1.2, right?

Bjorn
  
David E. Box Feb. 5, 2024, 11:05 p.m. UTC | #4
On Mon, 2024-02-05 at 16:42 -0600, Bjorn Helgaas wrote:
> On Mon, Feb 05, 2024 at 11:37:16AM -0800, David E. Box wrote:
> > On Fri, 2024-02-02 at 18:05 -0600, Bjorn Helgaas wrote:
> > > On Fri, Feb 02, 2024 at 03:11:12PM +0800, Jian-Hong Pan wrote:
> > ...
> 
> > > > @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev
> > > > *pdev,
> > > > void *userdata)
> > > >  	pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT,
> > > > ltr_reg);
> > > >  	pci_info(pdev, "VMD: Default LTR value set by driver\n");
> > > 
> > > You're not changing this part, and I don't understand exactly how LTR
> > > works, but it makes me a little bit queasy to read "set the LTR value
> > > to the maximum required to allow the deepest power management
> > > savings" and then we set the max snoop values to a fixed constant.
> > > 
> > > I don't think the goal is to "allow the deepest power savings"; I
> > > think it's to enable L1.2 *when the device has enough buffering to
> > > absorb L1.2 entry/exit latencies*.
> > > 
> > > The spec (PCIe r6.0, sec 7.8.2.2) says "Software should set this to
> > > the platform's maximum supported latency or less," so it seems like
> > > that value must be platform-dependent, not fixed.
> > > 
> > > And I assume the "_DSM for Latency Tolerance Reporting" is part of the
> > > way to get those platform-dependent values, but Linux doesn't actually
> > > use that yet.
> > 
> > This may indeed be the best way but we need to double check with our
> > BIOS folks.  AFAIK BIOS writes the LTR values directly so there
> > hasn't been a need to use this _DSM. But under VMD the ports are
> > hidden from BIOS which is why we added it here. I've brought up the
> > question internally to find out how Windows handles the DSM and to
> > get a recommendation from our firmware leads.
> 
> We want Linux to be able to program LTR itself, don't we?  We
> shouldn't have to rely on firmware to do it.  If Linux can't do
> it, hot-added devices aren't going to be able to use L1.2, right?

Agreed. We just want to make sure we are not conflicting with what BIOS may be
doing.

David
  
Ilpo Järvinen Feb. 6, 2024, 12:29 p.m. UTC | #5
On Fri, 2 Feb 2024, Jian-Hong Pan wrote:

> The remapped PCIe Root Port and NVMe have PCI PM L1 substates
> capability, but they are disabled originally:
> 
> Here is an example on ASUS B1400CEAE:
> 
> Capabilities: [900 v1] L1 PM Substates
>         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
>                   PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
>         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
>                    T_CommonMode=0us LTR1.2_Threshold=0ns
>         L1SubCtl2: T_PwrOn=10us
> 
> Power on all of the VMD remapped PCI devices and quirk max snoop LTR
> before enable PCI-PM L1 PM Substates by following "Section 5.5.4 of PCIe
> Base Spec Revision 6.0". Then, PCI PM's L1 substates control are
> initialized & enabled accordingly. Also, update the comments of
> pci_enable_link_state() and pci_enable_link_state_locked() for
> kernel-doc, too.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
> ---
> v2:
>   - Power on the VMD remapped devices with pci_set_power_state_locked()
>   - Prepare the PCIe LTR parameters before enable L1 Substates
>   - Add note into the comments of both pci_enable_link_state() and
>     pci_enable_link_state_locked() for kernel-doc.
>   - The original patch set can be split as individual patches.
> 
>  drivers/pci/controller/vmd.c | 15 ++++++++++-----
>  drivers/pci/pcie/aspm.c      | 10 ++++++++++
>  2 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index 87b7856f375a..66e47a0dbf1a 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -751,11 +751,9 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
>  	if (!(features & VMD_FEAT_BIOS_PM_QUIRK))
>  		return 0;
>  
> -	pci_enable_link_state_locked(pdev, PCIE_LINK_STATE_ALL);
> -
>  	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_LTR);
>  	if (!pos)
> -		return 0;
> +		goto out_enable_link_state;
>  
>  	/*
>  	 * Skip if the max snoop LTR is non-zero, indicating BIOS has set it
> @@ -763,7 +761,7 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
>  	 */
>  	pci_read_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, &ltr_reg);
>  	if (!!(ltr_reg & (PCI_LTR_VALUE_MASK | PCI_LTR_SCALE_MASK)))
> -		return 0;
> +		goto out_enable_link_state;
>  
>  	/*
>  	 * Set the default values to the maximum required by the platform to
> @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
>  	pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, ltr_reg);
>  	pci_info(pdev, "VMD: Default LTR value set by driver\n");
>  
> +out_enable_link_state:
> +	/*
> +	 * Make PCI devices at D0 when enable PCI-PM L1 PM Substates from
> +	 * Section 5.5.4 of PCIe Base Spec Revision 6.0

I don't understand what you are trying to say here. Are there typos or
grammar errors, or is something entirely missing from the comment?

> +	 */
> +	pci_set_power_state_locked(pdev, PCI_D0);
> +	pci_enable_link_state_locked(pdev, PCIE_LINK_STATE_ALL);
> +
>  	return 0;
>  }
>  
> @@ -926,7 +932,6 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>  				   dev_get_msi_domain(&vmd->dev->dev));
>  
>  	vmd_acpi_begin();
> -
>  	pci_scan_child_bus(vmd->bus);
>  	vmd_domain_reset(vmd);

Spurious newline change.
  
David E. Box Feb. 6, 2024, 9:25 p.m. UTC | #6
Adding Puranjay

On Mon, 2024-02-05 at 15:05 -0800, David E. Box wrote:
> On Mon, 2024-02-05 at 16:42 -0600, Bjorn Helgaas wrote:
> > On Mon, Feb 05, 2024 at 11:37:16AM -0800, David E. Box wrote:
> > > On Fri, 2024-02-02 at 18:05 -0600, Bjorn Helgaas wrote:
> > > > On Fri, Feb 02, 2024 at 03:11:12PM +0800, Jian-Hong Pan wrote:
> > > ...
> > 
> > > > > @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev
> > > > > *pdev,
> > > > > void *userdata)
> > > > >         pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT,
> > > > > ltr_reg);
> > > > >         pci_info(pdev, "VMD: Default LTR value set by driver\n");
> > > > 
> > > > You're not changing this part, and I don't understand exactly how LTR
> > > > works, but it makes me a little bit queasy to read "set the LTR value
> > > > to the maximum required to allow the deepest power management
> > > > savings" and then we set the max snoop values to a fixed constant.
> > > > 
> > > > I don't think the goal is to "allow the deepest power savings"; I
> > > > think it's to enable L1.2 *when the device has enough buffering to
> > > > absorb L1.2 entry/exit latencies*.
> > > > 
> > > > The spec (PCIe r6.0, sec 7.8.2.2) says "Software should set this to
> > > > the platform's maximum supported latency or less," so it seems like
> > > > that value must be platform-dependent, not fixed.
> > > > 
> > > > And I assume the "_DSM for Latency Tolerance Reporting" is part of the
> > > > way to get those platform-dependent values, but Linux doesn't actually
> > > > use that yet.
> > > 
> > > This may indeed be the best way but we need to double check with our
> > > BIOS folks.  AFAIK BIOS writes the LTR values directly so there
> > > hasn't been a need to use this _DSM. But under VMD the ports are
> > > hidden from BIOS which is why we added it here. I've brought up the
> > > question internally to find out how Windows handles the DSM and to
> > > get a recommendation from our firmware leads.
> > 
> > We want Linux to be able to program LTR itself, don't we?  We
> > shouldn't have to rely on firmware to do it.  If Linux can't do
> > it, hot-added devices aren't going to be able to use L1.2, right?
> 
> Agreed. We just want to make sure we are not conflicting with what BIOS may be
> doing.

So the feedback is to run the _DSM and just overwrite any BIOS values. Looking
up the _DSM I saw there was an attempt to upstream this 4 years ago [1]. I'm not
sure why the effort stalled but we can pick up this work again.

https://patchwork.kernel.org/project/linux-pci/patch/20201015080311.7811-1-puranjay12@gmail.com/
  
Bjorn Helgaas Feb. 6, 2024, 11:30 p.m. UTC | #7
On Tue, Feb 06, 2024 at 01:25:29PM -0800, David E. Box wrote:
> On Mon, 2024-02-05 at 15:05 -0800, David E. Box wrote:
> > On Mon, 2024-02-05 at 16:42 -0600, Bjorn Helgaas wrote:
> > > On Mon, Feb 05, 2024 at 11:37:16AM -0800, David E. Box wrote:
> > > > On Fri, 2024-02-02 at 18:05 -0600, Bjorn Helgaas wrote:
> > > > > On Fri, Feb 02, 2024 at 03:11:12PM +0800, Jian-Hong Pan wrote:
> > > > ...
> > > 
> > > > > > @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev
> > > > > > *pdev,
> > > > > > void *userdata)
> > > > > >         pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT,
> > > > > > ltr_reg);
> > > > > >         pci_info(pdev, "VMD: Default LTR value set by driver\n");
> > > > > 
> > > > > You're not changing this part, and I don't understand exactly how LTR
> > > > > works, but it makes me a little bit queasy to read "set the LTR value
> > > > > to the maximum required to allow the deepest power management
> > > > > savings" and then we set the max snoop values to a fixed constant.
> > > > > 
> > > > > I don't think the goal is to "allow the deepest power savings"; I
> > > > > think it's to enable L1.2 *when the device has enough buffering to
> > > > > absorb L1.2 entry/exit latencies*.
> > > > > 
> > > > > The spec (PCIe r6.0, sec 7.8.2.2) says "Software should set this to
> > > > > the platform's maximum supported latency or less," so it seems like
> > > > > that value must be platform-dependent, not fixed.
> > > > > 
> > > > > And I assume the "_DSM for Latency Tolerance Reporting" is part of the
> > > > > way to get those platform-dependent values, but Linux doesn't actually
> > > > > use that yet.
> > > > 
> > > > This may indeed be the best way but we need to double check with our
> > > > BIOS folks.  AFAIK BIOS writes the LTR values directly so there
> > > > hasn't been a need to use this _DSM. But under VMD the ports are
> > > > hidden from BIOS which is why we added it here. I've brought up the
> > > > question internally to find out how Windows handles the DSM and to
> > > > get a recommendation from our firmware leads.
> > > 
> > > We want Linux to be able to program LTR itself, don't we?  We
> > > shouldn't have to rely on firmware to do it.  If Linux can't do
> > > it, hot-added devices aren't going to be able to use L1.2,
> > > right?
> > 
> > Agreed. We just want to make sure we are not conflicting with what
> > BIOS may be doing.
> 
> So the feedback is to run the _DSM and just overwrite any BIOS
> values. Looking up the _DSM I saw there was an attempt to upstream
> this 4 years ago [1]. I'm not sure why the effort stalled but we can
> pick up this work again.
> 
> https://patchwork.kernel.org/project/linux-pci/patch/20201015080311.7811-1-puranjay12@gmail.com/

There was a PCI SIG discussion about this a few years ago that never
really seemed to get resolved:
https://members.pcisig.com/wg/PCIe-Protocol/mail/thread/35064

Unfortunately that discussion is not public, but the summary is:

  Q: How is the LTR_L1.2_THRESHOLD value determined?

     PCIe r5.0, sec 5.5.4, says the same value must be programmed into
     both Ports.

     A: As noted in sec 5.5.4, the value is determined primarily by
	the amount of time it will take to re-establish the common
	mode bias on the AC coupling caps, and it is assumed that the
	BIOS knows this.

  Q: How are the LTR Max Snoop values determined?

     PCI Firmware r3.3, sec 4.6.6, says the LTR _DSM reports the max
     values for each Downstream Port embedded in the platform, and the
     OS should calculate latencies along the path between each
     Downstream Port and any Upstream Port (Switch Upstream Port or
     Endpoint).

     Of course, Switches not embedded in the platform (e.g., external
     Thunderbolt hierarchies) will not have this _DSM, but I assume
     they should contribute to this sum?

     A: The fundamental problem is that there is no practical way for
	software to discover the AC coupling capacitor sizes and
	common mode bias circuit impedance.

	Software could compute conservative values, but they would
	likely be 10x worse than typical, so the L1.2 exit latency
	would be significantly longer than actually required to be.

	The interoperability issues here were understood when
	designing L1 Substates, but no viable solution was found.

So the main reason Puranjay's work got stalled is that I didn't feel
confident enough that we understood how to do this, especially for
external devices.

It would be great if somebody *did* feel confident about interpreting
and implementing all this.

Bjorn
  
David E. Box Feb. 7, 2024, 7:51 p.m. UTC | #8
On Tue, 2024-02-06 at 17:30 -0600, Bjorn Helgaas wrote:
> On Tue, Feb 06, 2024 at 01:25:29PM -0800, David E. Box wrote:
> > On Mon, 2024-02-05 at 15:05 -0800, David E. Box wrote:
> > > On Mon, 2024-02-05 at 16:42 -0600, Bjorn Helgaas wrote:
> > > > On Mon, Feb 05, 2024 at 11:37:16AM -0800, David E. Box wrote:
> > > > > On Fri, 2024-02-02 at 18:05 -0600, Bjorn Helgaas wrote:
> > > > > > On Fri, Feb 02, 2024 at 03:11:12PM +0800, Jian-Hong Pan wrote:
> > > > > ...
> > > > 
> > > > > > > @@ -775,6 +773,14 @@ static int vmd_pm_enable_quirk(struct pci_dev
> > > > > > > *pdev,
> > > > > > > void *userdata)
> > > > > > >         pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT,
> > > > > > > ltr_reg);
> > > > > > >         pci_info(pdev, "VMD: Default LTR value set by driver\n");
> > > > > > 
> > > > > > You're not changing this part, and I don't understand exactly how
> > > > > > LTR
> > > > > > works, but it makes me a little bit queasy to read "set the LTR
> > > > > > value
> > > > > > to the maximum required to allow the deepest power management
> > > > > > savings" and then we set the max snoop values to a fixed constant.
> > > > > > 
> > > > > > I don't think the goal is to "allow the deepest power savings"; I
> > > > > > think it's to enable L1.2 *when the device has enough buffering to
> > > > > > absorb L1.2 entry/exit latencies*.
> > > > > > 
> > > > > > The spec (PCIe r6.0, sec 7.8.2.2) says "Software should set this to
> > > > > > the platform's maximum supported latency or less," so it seems like
> > > > > > that value must be platform-dependent, not fixed.
> > > > > > 
> > > > > > And I assume the "_DSM for Latency Tolerance Reporting" is part of
> > > > > > the
> > > > > > way to get those platform-dependent values, but Linux doesn't
> > > > > > actually
> > > > > > use that yet.
> > > > > 
> > > > > This may indeed be the best way but we need to double check with our
> > > > > BIOS folks.  AFAIK BIOS writes the LTR values directly so there
> > > > > hasn't been a need to use this _DSM. But under VMD the ports are
> > > > > hidden from BIOS which is why we added it here. I've brought up the
> > > > > question internally to find out how Windows handles the DSM and to
> > > > > get a recommendation from our firmware leads.
> > > > 
> > > > We want Linux to be able to program LTR itself, don't we?  We
> > > > shouldn't have to rely on firmware to do it.  If Linux can't do
> > > > it, hot-added devices aren't going to be able to use L1.2,
> > > > right?
> > > 
> > > Agreed. We just want to make sure we are not conflicting with what
> > > BIOS may be doing.
> > 
> > So the feedback is to run the _DSM and just overwrite any BIOS
> > values. Looking up the _DSM I saw there was an attempt to upstream
> > this 4 years ago [1]. I'm not sure why the effort stalled but we can
> > pick up this work again.
> > 
> > https://patchwork.kernel.org/project/linux-pci/patch/20201015080311.7811-1-puranjay12@gmail.com/
> 
> There was a PCI SIG discussion about this a few years ago that never
> really seemed to get resolved:
> https://members.pcisig.com/wg/PCIe-Protocol/mail/thread/35064
> 
> Unfortunately that discussion is not public, but the summary is:
> 
>   Q: How is the LTR_L1.2_THRESHOLD value determined?
> 
>      PCIe r5.0, sec 5.5.4, says the same value must be programmed into
>      both Ports.
> 
>      A: As noted in sec 5.5.4, the value is determined primarily by
>         the amount of time it will take to re-establish the common
>         mode bias on the AC coupling caps, and it is assumed that the
>         BIOS knows this.
> 
>   Q: How are the LTR Max Snoop values determined?
> 
>      PCI Firmware r3.3, sec 4.6.6, says the LTR _DSM reports the max
>      values for each Downstream Port embedded in the platform, and the
>      OS should calculate latencies along the path between each
>      Downstream Port and any Upstream Port (Switch Upstream Port or
>      Endpoint).
> 
>      Of course, Switches not embedded in the platform (e.g., external
>      Thunderbolt hierarchies) will not have this _DSM, but I assume
>      they should contribute to this sum?
> 
>      A: The fundamental problem is that there is no practical way for
>         software to discover the AC coupling capacitor sizes and
>         common mode bias circuit impedance.
> 
>         Software could compute conservative values, but they would
>         likely be 10x worse than typical, so the L1.2 exit latency
>         would be significantly longer than actually required to be.
> 
>         The interoperability issues here were understood when
>         designing L1 Substates, but no viable solution was found.
> 
> So the main reason Puranjay's work got stalled is that I didn't feel
> confident enough that we understood how to do this, especially for
> external devices.
> 
> It would be great if somebody *did* feel confident about interpreting
> and implementing all this.

As it is, BIOS (at least Intel BIOS) already writes the maximum allowed LTR
value on Upstream Ports that have it set to 0. So we can't do any worse if we
write the BIOS-provided _DSM value for all Upstream Ports, including external
devices. It sounds like the worst-case scenario is that devices take longer
than needed to exit L1.2 (I'm still asking about this detail). But I think
this is better than not programming the LTR at all, which could prevent the
platform from power gating the very resources that LTR is meant to help
manage.

If that reasoning is okay with you, I'll submit patches to use the _DSM.

David
  

Patch

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 87b7856f375a..66e47a0dbf1a 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -751,11 +751,9 @@  static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
 	if (!(features & VMD_FEAT_BIOS_PM_QUIRK))
 		return 0;
 
-	pci_enable_link_state_locked(pdev, PCIE_LINK_STATE_ALL);
-
 	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_LTR);
 	if (!pos)
-		return 0;
+		goto out_enable_link_state;
 
 	/*
 	 * Skip if the max snoop LTR is non-zero, indicating BIOS has set it
@@ -763,7 +761,7 @@  static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
 	 */
 	pci_read_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, &ltr_reg);
 	if (!!(ltr_reg & (PCI_LTR_VALUE_MASK | PCI_LTR_SCALE_MASK)))
-		return 0;
+		goto out_enable_link_state;
 
 	/*
 	 * Set the default values to the maximum required by the platform to
@@ -775,6 +773,14 @@  static int vmd_pm_enable_quirk(struct pci_dev *pdev, void *userdata)
 	pci_write_config_dword(pdev, pos + PCI_LTR_MAX_SNOOP_LAT, ltr_reg);
 	pci_info(pdev, "VMD: Default LTR value set by driver\n");
 
+out_enable_link_state:
+	/*
+	 * Make PCI devices at D0 when enable PCI-PM L1 PM Substates from
+	 * Section 5.5.4 of PCIe Base Spec Revision 6.0
+	 */
+	pci_set_power_state_locked(pdev, PCI_D0);
+	pci_enable_link_state_locked(pdev, PCIE_LINK_STATE_ALL);
+
 	return 0;
 }
 
@@ -926,7 +932,6 @@  static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 				   dev_get_msi_domain(&vmd->dev->dev));
 
 	vmd_acpi_begin();
-
 	pci_scan_child_bus(vmd->bus);
 	vmd_domain_reset(vmd);
 
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index bc0bd86695ec..5f902c5552ca 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -1164,6 +1164,8 @@  static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool locked)
 		link->aspm_default |= ASPM_STATE_L1_1_PCIPM | ASPM_STATE_L1;
 	if (state & PCIE_LINK_STATE_L1_2_PCIPM)
 		link->aspm_default |= ASPM_STATE_L1_2_PCIPM | ASPM_STATE_L1;
+	if (state & ASPM_STATE_L1_2_MASK)
+		aspm_l1ss_init(link);
 	pcie_config_aspm_link(link, policy_to_aspm_state(link));
 
 	link->clkpm_default = (state & PCIE_LINK_STATE_CLKPM) ? 1 : 0;
@@ -1182,6 +1184,10 @@  static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool locked)
  * touch the LNKCTL register. Also note that this does not enable states
  * disabled by pci_disable_link_state(). Return 0 or a negative errno.
  *
+ * Note: The PCIe devices of the link must be in D0, if the PCI-PM L1 PM
+ * substates are going to be enabled. From Section 5.5.4 of PCIe Base Spec
+ * Revision 6.0.
+ *
  * @pdev: PCI device
  * @state: Mask of ASPM link states to enable
  */
@@ -1198,6 +1204,10 @@  EXPORT_SYMBOL(pci_enable_link_state);
  * can't touch the LNKCTL register. Also note that this does not enable states
  * disabled by pci_disable_link_state(). Return 0 or a negative errno.
  *
+ * Note: The PCIe devices of the link must be in D0, if the PCI-PM L1 PM
+ * substates are going to be enabled. From Section 5.5.4 of PCIe Base Spec
+ * Revision 6.0.
+ *
  * @pdev: PCI device
  * @state: Mask of ASPM link states to enable
  *