[00/10] arm64: dts: qcom: sc8280xp: enable GICv3 ITS for PCIe

Message ID 20240212165043.26961-1-johan+linaro@kernel.org

Message

Johan Hovold Feb. 12, 2024, 4:50 p.m. UTC
This series addresses a few problems with the sc8280xp PCIe
implementation.

The DWC PCIe controller can either use its internal MSI controller or an
external one such as the GICv3 ITS. Enabling the latter allows for
assigning affinity to individual interrupts, but results in a large
amount of Correctable Errors being logged on both the Lenovo ThinkPad
X13s and the sc8280xp-crd reference design.

It turns out that these errors are always generated, but for some yet to
be determined reason, the AER interrupts are never received when using
the internal MSI controller, which makes the link errors harder to
notice.

On the X13s, there is a large number of errors generated when bringing
up the link on boot. This is related to the fact that UEFI firmware has
already enabled the Wi-Fi PCIe link at Gen2 speed and restarting the
link at Gen3 generates a massive amount of errors until the Wi-Fi
firmware is restarted.

A recent commit enabling ASPM on certain Qualcomm platforms introduced
further errors when using the Wi-Fi on the X13s as well as when
accessing the NVMe on the CRD. The exact reason for this has not yet
been identified, but disabling ASPM L0s makes the errors go away. This
could suggest that either the current ASPM implementation is incomplete
or that L0s is not supported with these devices.

Note that the X13s and CRD use the same Wi-Fi controller, but the errors
are only generated on the X13s. The NVMe controller on my X13s does not
support L0s so there are no issues there, unlike on the CRD which uses a
different controller. The modem on the CRD does not generate any errors,
but both the NVMe and modem keep bouncing in and out of L0s/L1 also
when not used, which could indicate that there are bigger problems with
the ASPM implementation. I don't have a modem on my X13s so I have not
been able to test whether L0s causes any trouble there.

Enabling AER error reporting on sc8280xp could similarly also reveal
existing problems with the related sa8295p and sa8540p platforms as they
share the base dtsi.

The last four patches, marked as RFC, add support for disabling ASPM
L0s in the devicetree and disables it selectively for the X13s Wi-Fi
and CRD NVMe. If it turns out that the Qualcomm PCIe implementation is
incomplete, we may need to disable ASPM (L0s) completely in the driver
instead.
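
One way a driver could act on such a property is to hide L0s from the
ASPM Support field of the Link Capabilities register, so that the PCI
core never tries to enable it. The Python below only sketches that bit
manipulation and is not taken from the patches: the two masks follow the
PCIe Link Capabilities layout (ASPM Support in bits 11:10), while the
function names and the sample register value are hypothetical.

```python
# ASPM Support field of the PCIe Link Capabilities register (bits 11:10).
PCI_EXP_LNKCAP_ASPM_L0S = 0x00000400  # bit 10: L0s supported
PCI_EXP_LNKCAP_ASPM_L1 = 0x00000800   # bit 11: L1 supported


def mask_l0s(lnkcap):
    """Return the register value with the L0s support bit cleared."""
    return lnkcap & ~PCI_EXP_LNKCAP_ASPM_L0S & 0xFFFFFFFF


def aspm_support(lnkcap):
    """List the ASPM states advertised by a Link Capabilities value."""
    states = []
    if lnkcap & PCI_EXP_LNKCAP_ASPM_L0S:
        states.append("L0s")
    if lnkcap & PCI_EXP_LNKCAP_ASPM_L1:
        states.append("L1")
    return states


# Hypothetical register value advertising both L0s and L1.
lnkcap = 0x00000C43
print(aspm_support(lnkcap), "->", aspm_support(mask_l0s(lnkcap)))
```

Only the L0s bit is touched here, so L1 would remain available for the
affected link.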

Note that disabling ASPM L0s for the X13s Wi-Fi does not seem to have a
significant impact on the power consumption.

The DT bindings and PCI patch are expected to go through the PCI tree,
while Bjorn A takes the devicetree updates through the Qualcomm tree.

Johan


Johan Hovold (10):
  dt-bindings: PCI: qcom: Allow 'required-opps'
  dt-bindings: PCI: qcom: Do not require 'msi-map-mask'
  arm64: dts: qcom: sc8280xp: add missing PCIe minimum OPP
  arm64: dts: qcom: sc8280xp-crd: limit pcie4 link speed
  arm64: dts: qcom: sc8280xp-x13s: limit pcie4 link speed
  arm64: dts: qcom: sc8280xp: enable GICv3 ITS for PCIe
  dt-bindings: PCI: qcom: Allow 'aspm-no-l0s'
  PCI: qcom: Add support for disabling ASPM L0s in devicetree
  arm64: dts: qcom: sc8280xp-crd: disable ASPM L0s for NVMe
  arm64: dts: qcom: sc8280xp-x13s: disable ASPM L0s for Wi-Fi

 .../devicetree/bindings/pci/qcom,pcie.yaml    |  6 +++++-
 arch/arm64/boot/dts/qcom/sc8280xp-crd.dts     |  4 ++++
 .../qcom/sc8280xp-lenovo-thinkpad-x13s.dts    |  3 +++
 arch/arm64/boot/dts/qcom/sc8280xp.dtsi        | 17 +++++++++++++++-
 drivers/pci/controller/dwc/pcie-qcom.c        | 20 +++++++++++++++++++
 5 files changed, 48 insertions(+), 2 deletions(-)
  

Comments

Manivannan Sadhasivam Feb. 14, 2024, 6:35 a.m. UTC | #1
On Mon, Feb 12, 2024 at 05:50:33PM +0100, Johan Hovold wrote:
> This series addresses a few problems with the sc8280xp PCIe
> implementation.
> 
> The DWC PCIe controller can either use its internal MSI controller or an
> external one such as the GICv3 ITS. Enabling the latter allows for
> assigning affinity to individual interrupts, but results in a large
> amount of Correctable Errors being logged on both the Lenovo ThinkPad
> X13s and the sc8280xp-crd reference design.
> 
> It turns out that these errors are always generated,

How did you confirm this?

> but for some yet to
> be determined reason, the AER interrupts are never received when using
> the internal MSI controller, which makes the link errors harder to
> notice.
> 

If you manually inject the errors using "aer-inject", are you not seeing the AER
errors with internal MSI controller as well?

> On the X13s, there is a large number of errors generated when bringing
> up the link on boot. This is related to the fact that UEFI firmware has
> already enabled the Wi-Fi PCIe link at Gen2 speed and restarting the
> link at Gen3 generates a massive amount of errors until the Wi-Fi
> firmware is restarted.
> 
> A recent commit enabling ASPM on certain Qualcomm platforms introduced
> further errors when using the Wi-Fi on the X13s as well as when
> accessing the NVMe on the CRD. The exact reason for this has not yet
> been identified, but disabling ASPM L0s makes the errors go away. This
> could suggest that either the current ASPM implementation is incomplete
> or that L0s is not supported with these devices.
> 

What are those "further errors" you are seeing with ASPM enabled? Do those
errors appear with GIC ITS or with internal MSI controller as well?

> Note that the X13s and CRD use the same Wi-Fi controller, but the errors
> are only generated on the X13s. The NVMe controller on my X13s does not
> support L0s so there are no issues there, unlike on the CRD which uses a
> different controller. The modem on the CRD does not generate any errors,
> but both the NVMe and modem keep bouncing in and out of L0s/L1 also
> when not used, which could indicate that there are bigger problems with
> the ASPM implementation. I don't have a modem on my X13s so I have not
> been able to test whether L0s causes any trouble there.
> 
> Enabling AER error reporting on sc8280xp could similarly also reveal
> existing problems with the related sa8295p and sa8540p platforms as they
> share the base dtsi.
> 
> The last four patches, marked as RFC, add support for disabling ASPM
> L0s in the devicetree and disables it selectively for the X13s Wi-Fi
> and CRD NVMe. If it turns out that the Qualcomm PCIe implementation is
> incomplete, we may need to disable ASPM (L0s) completely in the driver
> instead.
> 

If the device is not supporting L0s, then it has to be disabled in the device,
not in the PCIe controller, no?

> Note that disabling ASPM L0s for the X13s Wi-Fi does not seem to have a
> significant impact on the power consumption.
> 
> The DT bindings and PCI patch are expected to go through the PCI tree,
> while Bjorn A takes the devicetree updates through the Qualcomm tree.
> 

Since I took a stab at enabling the GIC ITS previously, I noticed that the NVMe
performance got a slight dip. And that was one of the reasons (apart from AER
errors) that I never submitted the patch.

Could you share the NVMe benchmark (fio) with this series?

> Johan
> 
> 
> Johan Hovold (10):
>   dt-bindings: PCI: qcom: Allow 'required-opps'
>   dt-bindings: PCI: qcom: Do not require 'msi-map-mask'
>   arm64: dts: qcom: sc8280xp: add missing PCIe minimum OPP
>   arm64: dts: qcom: sc8280xp-crd: limit pcie4 link speed
>   arm64: dts: qcom: sc8280xp-x13s: limit pcie4 link speed
>   arm64: dts: qcom: sc8280xp: enable GICv3 ITS for PCIe

Is this patch based on the version I shared with you long back? If so, I'd
expect to have some credit. If you came up with your own version, then ignore
this comment.

- Mani

>   dt-bindings: PCI: qcom: Allow 'aspm-no-l0s'
>   PCI: qcom: Add support for disabling ASPM L0s in devicetree
>   arm64: dts: qcom: sc8280xp-crd: disable ASPM L0s for NVMe
>   arm64: dts: qcom: sc8280xp-x13s: disable ASPM L0s for Wi-Fi
> 
>  .../devicetree/bindings/pci/qcom,pcie.yaml    |  6 +++++-
>  arch/arm64/boot/dts/qcom/sc8280xp-crd.dts     |  4 ++++
>  .../qcom/sc8280xp-lenovo-thinkpad-x13s.dts    |  3 +++
>  arch/arm64/boot/dts/qcom/sc8280xp.dtsi        | 17 +++++++++++++++-
>  drivers/pci/controller/dwc/pcie-qcom.c        | 20 +++++++++++++++++++
>  5 files changed, 48 insertions(+), 2 deletions(-)
> 
> -- 
> 2.43.0
>
  
Johan Hovold Feb. 14, 2024, 11:09 a.m. UTC | #2
On Wed, Feb 14, 2024 at 12:05:54PM +0530, Manivannan Sadhasivam wrote:
> On Mon, Feb 12, 2024 at 05:50:33PM +0100, Johan Hovold wrote:
> > This series addresses a few problems with the sc8280xp PCIe
> > implementation.
> > 
> > The DWC PCIe controller can either use its internal MSI controller or an
> > external one such as the GICv3 ITS. Enabling the latter allows for
> > assigning affinity to individual interrupts, but results in a large
> > amount of Correctable Errors being logged on both the Lenovo ThinkPad
> > X13s and the sc8280xp-crd reference design.
> > 
> > It turns out that these errors are always generated,
> 
> How did you confirm this?

You can see the error flags being set in the controller and endpoint,
for example, using lspci -vv:

	CESta:  RxErr- BadTLP+ BadDLLP- Rollover- Timeout- AdvNonFatalErr-
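
As a rough sketch of how such a status line could be checked from a
script, the Python below splits the line into flags and reports which
ones are set. The function name is illustrative, and the only thing
assumed is the "Name+"/"Name-" text format shown above.

```python
import re


def parse_aer_status(line):
    """Parse a status line such as 'CESta: RxErr- BadTLP+ ...'.

    Returns a dict mapping each flag name to True ('+', set) or
    False ('-', clear).
    """
    _, _, flags = line.partition(":")
    return {name: sign == "+"
            for name, sign in re.findall(r"(\w+)([+-])", flags)}


line = "CESta:  RxErr- BadTLP+ BadDLLP- Rollover- Timeout- AdvNonFatalErr-"
status = parse_aer_status(line)
logged = [name for name, is_set in status.items() if is_set]
print(logged)
```

For the line above, BadTLP is the only flag reported as set.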

> > but for some yet to
> > be determined reason, the AER interrupts are never received when using
> > the internal MSI controller, which makes the link errors harder to
> > notice.
> 
> If you manually inject the errors using "aer-inject", are you not seeing the AER
> errors with internal MSI controller as well?

I haven't tried that, I'm just reporting that that piece of
functionality is currently broken and that that partly explains why the
ASPM problems went unnoticed.

> > On the X13s, there is a large number of errors generated when bringing
> > up the link on boot. This is related to the fact that UEFI firmware has
> > already enabled the Wi-Fi PCIe link at Gen2 speed and restarting the
> > link at Gen3 generates a massive amount of errors until the Wi-Fi
> > firmware is restarted.
> > 
> > A recent commit enabling ASPM on certain Qualcomm platforms introduced
> > further errors when using the Wi-Fi on the X13s as well as when
> > accessing the NVMe on the CRD. The exact reason for this has not yet
> > been identified, but disabling ASPM L0s makes the errors go away. This
> > could suggest that either the current ASPM implementation is incomplete
> > or that L0s is not supported with these devices.
> 
> What are those "further errors" you are seeing with ASPM enabled? Do those
> errors appear with GIC ITS or with internal MSI controller as well?

Further errors as in further correctable errors that are not related to
the errors seen when resetting the X13s Wi-Fi link at boot.

These show up, for example, when accessing the NVMe on the CRD or when
using the Wi-Fi on the X13s. These errors go away when L0s is disabled.

And yes, you see them with both the external and internal MSI controller
(in the latter case, by looking at the error flags mentioned above).
 
> > Note that the X13s and CRD use the same Wi-Fi controller, but the errors
> > are only generated on the X13s. The NVMe controller on my X13s does not
> > support L0s so there are no issues there, unlike on the CRD which uses a
> > different controller. The modem on the CRD does not generate any errors,
> > but both the NVMe and modem keep bouncing in and out of L0s/L1 also
> > when not used, which could indicate that there are bigger problems with
> > the ASPM implementation. I don't have a modem on my X13s so I have not
> > been able to test whether L0s causes any trouble there.
> > 
> > Enabling AER error reporting on sc8280xp could similarly also reveal
> > existing problems with the related sa8295p and sa8540p platforms as they
> > share the base dtsi.
> > 
> > The last four patches, marked as RFC, add support for disabling ASPM
> > L0s in the devicetree and disables it selectively for the X13s Wi-Fi
> > and CRD NVMe. If it turns out that the Qualcomm PCIe implementation is
> > incomplete, we may need to disable ASPM (L0s) completely in the driver
> > instead.
> 
> If the device is not supporting L0s, then it has to be disabled in the device,
> not in the PCIe controller, no?

Well, we don't know yet where the problem lies, just that enabling L0s
results in a large number of correctable errors.

Until yesterday I had not seen any such errors for the Wi-Fi on the CRD,
which uses essentially the same ath11k controller, so there was no clear
indication that this was necessarily a problem with the devices either.

> > Note that disabling ASPM L0s for the X13s Wi-Fi does not seem to have a
> > significant impact on the power consumption.
> > 
> > The DT bindings and PCI patch are expected to go through the PCI tree,
> > while Bjorn A takes the devicetree updates through the Qualcomm tree.
> 
> Since I took a stab at enabling the GIC ITS previously, I noticed that the NVMe
> performance got a slight dip. And that was one of the reasons (apart from AER
> errors) that I never submitted the patch.
> 
> Could you share the NVMe benchmark (fio) with this series?

Did you have any particular benchmark in mind?

I have run multiple fio benchmarks and while the results vary with the
parameters, the impact of switching to ITS (so that not all PCIe
interrupts are processed on CPU0) is generally favourable.

A raw sequential read shows no change in throughput on either the X13s
or the CRD even if for some reason this test performs really badly on
the X13s (i.e. regardless of which MSI controller is used):

	crd-rseq-read:	IOPS=11.1k, BW=2764MiB/s (2898MB/s)(81.0GiB/30003msec)
	X13s-rseq-read:	IOPS=508, BW=127MiB/s (134MB/s)(3841MiB/30169msec)

Another benchmark I've used against a mounted ext4 partition shows a 2x
improvement in throughput with ITS for sequential and random reads and
writes on the X13s:

	seq-read:	IOPS=88.4k, BW=345MiB/s (362MB/s)(10.0GiB/29657msec)
	rand-read:	IOPS=21.2k, BW=82.8MiB/s (86.8MB/s)(4967MiB/60001msec)
	seq-write:	IOPS=162k, BW=632MiB/s (662MB/s)(10.0GiB/16213msec)
	rand-write:	IOPS=142k, BW=555MiB/s (582MB/s)(10.0GiB/18439msec)
	
while the results are essentially unchanged with a larger block size and
queue depth (32/2m instead of 4/4k):

	seq-read:	IOPS=1095, BW=2191MiB/s (2298MB/s)(10.0GiB/4673msec)
	rand-read:	IOPS=1020, BW=2041MiB/s (2140MB/s)(10.0GiB/5017msec)
	seq-write:	IOPS=918, BW=1837MiB/s (1926MB/s)(10.0GiB/5574msec)
	rand-write:	IOPS=826, BW=1653MiB/s (1734MB/s)(10.0GiB/6194msec)
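
As a sanity check on figures like these, the average request size can
be derived from each summary line as bandwidth divided by IOPS. The
Python below is only a sketch of that arithmetic: the helper names are
made up, it assumes the "IOPS=..., BW=...MiB/s" format shown above, and
it assumes that the "k" suffix on IOPS means 1000.

```python
import re

# Assumption: fio's IOPS suffixes are decimal (k = 1000).
UNITS = {"": 1, "k": 1_000, "M": 1_000_000}


def parse_fio_summary(line):
    """Extract (IOPS, bandwidth in MiB/s) from a fio summary line."""
    m = re.search(r"IOPS=([\d.]+)([kM]?), BW=([\d.]+)MiB/s", line)
    if m is None:
        raise ValueError(f"not a fio summary line: {line!r}")
    iops = float(m.group(1)) * UNITS[m.group(2)]
    return iops, float(m.group(3))


def implied_block_kib(line):
    """Average request size in KiB implied by the reported BW and IOPS."""
    iops, bw_mib = parse_fio_summary(line)
    return bw_mib * 1024 / iops


samples = [
    "seq-read:\tIOPS=88.4k, BW=345MiB/s (362MB/s)(10.0GiB/29657msec)",
    "seq-read:\tIOPS=1095, BW=2191MiB/s (2298MB/s)(10.0GiB/4673msec)",
    "X13s-rseq-read:\tIOPS=508, BW=127MiB/s (134MB/s)(3841MiB/30169msec)",
]
for s in samples:
    print(s.split(":")[0], round(implied_block_kib(s)), "KiB")
```

The two ext4 runs work out to roughly 4 KiB and 2 MiB per request, in
line with the 4k and 2m block sizes mentioned above, while the raw
sequential reads come out at roughly 256 KiB per request.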

> > Johan Hovold (10):
> >   dt-bindings: PCI: qcom: Allow 'required-opps'
> >   dt-bindings: PCI: qcom: Do not require 'msi-map-mask'
> >   arm64: dts: qcom: sc8280xp: add missing PCIe minimum OPP
> >   arm64: dts: qcom: sc8280xp-crd: limit pcie4 link speed
> >   arm64: dts: qcom: sc8280xp-x13s: limit pcie4 link speed
> >   arm64: dts: qcom: sc8280xp: enable GICv3 ITS for PCIe
> 
> Is this patch based on the version I shared with you long back? If so, I'd
> expect to have some credit. If you came up with your own version, then ignore
> this comment.

No, this patch has been created and evaluated from scratch based on the
downstream direwolf dts, which has these five 'msi-map' properties.

I debated whether I should base it on your version instead, but in the
end it would have a new commit message and only these properties from
the downstream dtsi would remain (you also removed existing properties
IIRC). So while it's certainly inspired by your work, this has been done
from scratch, including the testing.

If you prefer I can make this clear in the commit message, but adding a
Co-developed-by didn't seem quite right either as I did this work
without your involvement. But perhaps that would be better?

Johan
  
Manivannan Sadhasivam Feb. 16, 2024, 2:54 p.m. UTC | #3
On Wed, Feb 14, 2024 at 12:09:15PM +0100, Johan Hovold wrote:
> On Wed, Feb 14, 2024 at 12:05:54PM +0530, Manivannan Sadhasivam wrote:
> > On Mon, Feb 12, 2024 at 05:50:33PM +0100, Johan Hovold wrote:
> > > This series addresses a few problems with the sc8280xp PCIe
> > > implementation.
> > > 
> > > The DWC PCIe controller can either use its internal MSI controller or an
> > > external one such as the GICv3 ITS. Enabling the latter allows for
> > > assigning affinity to individual interrupts, but results in a large
> > > amount of Correctable Errors being logged on both the Lenovo ThinkPad
> > > X13s and the sc8280xp-crd reference design.
> > > 
> > > It turns out that these errors are always generated,
> > 
> > How did you confirm this?
> 
> You can see the error flags being set in the controller and endpoint,
> for example, using lspci -vv:
> 
> 	CESta:  RxErr- BadTLP+ BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> 

Okay.

> > > but for some yet to
> > > be determined reason, the AER interrupts are never received when using
> > > the internal MSI controller, which makes the link errors harder to
> > > notice.
> > 
> > If you manually inject the errors using "aer-inject", are you not seeing the AER
> > errors with internal MSI controller as well?
> 
> I haven't tried that, I'm just reporting that that piece of
> functionality is currently broken and that that partly explains why the
> ASPM problems went unnoticed.
> 

I just gave it a shot and I could see the AER interrupts raised for correctable
errors with internal MSI controller.

Now I'm puzzled why this is not getting triggered by default. I'll check with
the hardware team if they have any clue.

> > > On the X13s, there is a large number of errors generated when bringing
> > > up the link on boot. This is related to the fact that UEFI firmware has
> > > already enabled the Wi-Fi PCIe link at Gen2 speed and restarting the
> > > link at Gen3 generates a massive amount of errors until the Wi-Fi
> > > firmware is restarted.
> > > 
> > > A recent commit enabling ASPM on certain Qualcomm platforms introduced
> > > further errors when using the Wi-Fi on the X13s as well as when
> > > accessing the NVMe on the CRD. The exact reason for this has not yet
> > > been identified, but disabling ASPM L0s makes the errors go away. This
> > > could suggest that either the current ASPM implementation is incomplete
> > > or that L0s is not supported with these devices.
> > 
> > What are those "further errors" you are seeing with ASPM enabled? Do those
> > errors appear with GIC ITS or with internal MSI controller as well?
> 
> Further errors as in further correctable errors that are not related to
> the errors seen when resetting the X13s Wi-Fi link at boot.
> 
> These show up, for example, when accessing the NVMe on the CRD or when
> using the Wi-Fi on the X13s. These errors go away when L0s is disabled.
> 
> And yes, you see them with both the external and internal MSI controller
> (in the latter case, by looking at the error flags mentioned above).
>  

Hmm.

> > > Note that the X13s and CRD use the same Wi-Fi controller, but the errors
> > > are only generated on the X13s. The NVMe controller on my X13s does not
> > > support L0s so there are no issues there, unlike on the CRD which uses a
> > > different controller. The modem on the CRD does not generate any errors,
> > > but both the NVMe and modem keep bouncing in and out of L0s/L1 also
> > > when not used, which could indicate that there are bigger problems with
> > > the ASPM implementation. I don't have a modem on my X13s so I have not
> > > been able to test whether L0s causes any trouble there.
> > > 
> > > Enabling AER error reporting on sc8280xp could similarly also reveal
> > > existing problems with the related sa8295p and sa8540p platforms as they
> > > share the base dtsi.
> > > 
> > > The last four patches, marked as RFC, add support for disabling ASPM
> > > L0s in the devicetree and disables it selectively for the X13s Wi-Fi
> > > and CRD NVMe. If it turns out that the Qualcomm PCIe implementation is
> > > incomplete, we may need to disable ASPM (L0s) completely in the driver
> > > instead.
> > 
> > If the device is not supporting L0s, then it has to be disabled in the device,
> > not in the PCIe controller, no?
> 
> Well, we don't know yet where the problem lies, just that enabling L0s
> results in a large number of correctable errors.
> 
> Until yesterday I had not seen any such errors for the Wi-Fi on the CRD,
> which uses essentially the same ath11k controller, so there was no clear
> indication that this was necessarily a problem with the devices either.
> 

I'll confirm the L0s compatibility with the hardware team.

> > > Note that disabling ASPM L0s for the X13s Wi-Fi does not seem to have a
> > > significant impact on the power consumption.
> > > 
> > > The DT bindings and PCI patch are expected to go through the PCI tree,
> > > while Bjorn A takes the devicetree updates through the Qualcomm tree.
> > 
> > Since I took a stab at enabling the GIC ITS previously, I noticed that the NVMe
> > performance got a slight dip. And that was one of the reasons (apart from AER
> > errors) that I never submitted the patch.
> > 
> > Could you share the NVMe benchmark (fio) with this series?
> 
> Did you have any particular benchmark in mind?
> 
> I have run multiple fio benchmarks and while the results vary with the
> parameters, the impact of switching to ITS (so that not all PCIe
> interrupts are processed on CPU0) is generally favourable.
> 
> A raw sequential read shows no change in throughput on either the X13s
> or the CRD even if for some reason this test performs really badly on
> the X13s (i.e. regardless of which MSI controller is used):
> 
> 	crd-rseq-read:	IOPS=11.1k, BW=2764MiB/s (2898MB/s)(81.0GiB/30003msec)
> 	X13s-rseq-read:	IOPS=508, BW=127MiB/s (134MB/s)(3841MiB/30169msec)
> 
> Another benchmark I've used against a mounted ext4 partition shows a 2x
> improvement in throughput with ITS for sequential and random reads and
> writes on the X13s:
> 
> 	seq-read:	IOPS=88.4k, BW=345MiB/s (362MB/s)(10.0GiB/29657msec)
> 	rand-read:	IOPS=21.2k, BW=82.8MiB/s (86.8MB/s)(4967MiB/60001msec)
> 	seq-write:	IOPS=162k, BW=632MiB/s (662MB/s)(10.0GiB/16213msec)
> 	rand-write:	IOPS=142k, BW=555MiB/s (582MB/s)(10.0GiB/18439msec)
> 	
> while the results are essentially unchanged with a larger block size and
> queue depth (32/2m instead of 4/4k):
> 
> 	seq-read:	IOPS=1095, BW=2191MiB/s (2298MB/s)(10.0GiB/4673msec)
> 	rand-read:	IOPS=1020, BW=2041MiB/s (2140MB/s)(10.0GiB/5017msec)
> 	seq-write:	IOPS=918, BW=1837MiB/s (1926MB/s)(10.0GiB/5574msec)
> 	rand-write:	IOPS=826, BW=1653MiB/s (1734MB/s)(10.0GiB/6194msec)
> 

Ok, this looks promising. Long back when I tried the benchmark (seq & rand r/w),
performance dropped slightly with GIC ITS. But looks like things have changed.

> > > Johan Hovold (10):
> > >   dt-bindings: PCI: qcom: Allow 'required-opps'
> > >   dt-bindings: PCI: qcom: Do not require 'msi-map-mask'
> > >   arm64: dts: qcom: sc8280xp: add missing PCIe minimum OPP
> > >   arm64: dts: qcom: sc8280xp-crd: limit pcie4 link speed
> > >   arm64: dts: qcom: sc8280xp-x13s: limit pcie4 link speed
> > >   arm64: dts: qcom: sc8280xp: enable GICv3 ITS for PCIe
> > 
> > Is this patch based on the version I shared with you long back? If so, I'd
> > expect to have some credit. If you came up with your own version, then ignore
> > this comment.
> 
> No, this patch has been created and evaluated from scratch based on the
> downstream direwolf dts, which has these five 'msi-map' properties.
> 
> I debated whether I should base it on your version instead, but in the
> end it would have a new commit message and only these properties from
> the downstream dtsi would remain (you also removed existing properties
> IIRC). So while it's certainly inspired by your work, this has been done
> from scratch, including the testing.
> 
> If you prefer I can make this clear in the commit message, but adding a
> Co-developed-by didn't seem quite right either as I did this work
> without your involvement. But perhaps that would be better?
> 

Nah. As I said, if you have created the patch without basing on my version,
then no credit is required. I just wanted to know since I shared the patch
earlier.

- Mani