[v2,1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

Message ID 20230721214740.256602-2-Smita.KoralahalliChannabasappa@amd.com
State New
Headers
Series PCI/AER, CXL: Fix appropriate _OSC check for CXL RAS Cap |

Commit Message

Smita Koralahalli July 21, 2023, 9:47 p.m. UTC
  According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
of AER should also own CXL Protocol Error Management as there is no
explicit control of CXL Protocol error. And the CXL RAS Cap registers
reported on Protocol errors should check for AER _OSC rather than CXL
Memory Error Reporting Control _OSC.

The CXL Memory Error Reporting Control _OSC specifically highlights
handling Memory Error Logging and Signaling Enhancements. These kinds of
errors are reported through a device's mailbox and can be managed
independently from CXL Protocol Errors.

This change fixes handling and reporting CXL Protocol Errors and RAS
registers natively with native AER and FW-First CXL Memory Error Reporting
Control.

[1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.

Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
---
v2:
	Added fixes tag.
	Included what the patch fixes in commit message.
---
 drivers/cxl/pci.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
  

Comments

Kuppuswamy Sathyanarayanan July 22, 2023, 12:18 a.m. UTC | #1
On 7/21/23 2:47 PM, Smita Koralahalli wrote:
> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
> of AER should also own CXL Protocol Error Management as there is no
> explicit control of CXL Protocol error. And the CXL RAS Cap registers
> reported on Protocol errors should check for AER _OSC rather than CXL
> Memory Error Reporting Control _OSC.
> 
> The CXL Memory Error Reporting Control _OSC specifically highlights
> handling Memory Error Logging and Signaling Enhancements. These kinds of
> errors are reported through a device's mailbox and can be managed
> independently from CXL Protocol Errors.
> 
> This change fixes handling and reporting CXL Protocol Errors and RAS
> registers natively with native AER and FW-First CXL Memory Error Reporting
> Control.
> 
> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
> 
> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> v2:
> 	Added fixes tag.
> 	Included what the patch fixes in commit message.
> ---
>  drivers/cxl/pci.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 1cb1494c28fe..2323169b6e5f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>  		return 0;
>  	}
>  
> -	/* BIOS has CXL error control */
> -	if (!host_bridge->native_cxl_error)
> -		return -ENXIO;
> +	/* BIOS has PCIe AER error control */
> +	if (!host_bridge->native_aer)
> +		return 0;

Why not directly use pcie_aer_is_native() here?

>  
>  	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>  	if (rc)
  
Robert Richter July 24, 2023, 12:02 p.m. UTC | #2
On 21.07.23 21:47:38, Smita Koralahalli wrote:
> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
> of AER should also own CXL Protocol Error Management as there is no
> explicit control of CXL Protocol error. And the CXL RAS Cap registers
> reported on Protocol errors should check for AER _OSC rather than CXL
> Memory Error Reporting Control _OSC.
> 
> The CXL Memory Error Reporting Control _OSC specifically highlights
> handling Memory Error Logging and Signaling Enhancements. These kinds of
> errors are reported through a device's mailbox and can be managed
> independently from CXL Protocol Errors.
> 
> This change fixes handling and reporting CXL Protocol Errors and RAS
> registers natively with native AER and FW-First CXL Memory Error Reporting
> Control.
> 
> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
> 
> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>

Reviewed-by: Robert Richter <rrichter@amd.com>

> ---
> v2:
> 	Added fixes tag.
> 	Included what the patch fixes in commit message.
> ---
>  drivers/cxl/pci.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 1cb1494c28fe..2323169b6e5f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>  		return 0;
>  	}
>  
> -	/* BIOS has CXL error control */
> -	if (!host_bridge->native_cxl_error)
> -		return -ENXIO;
> +	/* BIOS has PCIe AER error control */
> +	if (!host_bridge->native_aer)
> +		return 0;
>  
>  	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>  	if (rc)
> -- 
> 2.17.1
>
  
Smita Koralahalli July 24, 2023, 9:59 p.m. UTC | #3
>> -	/* BIOS has CXL error control */
>> -	if (!host_bridge->native_cxl_error)
>> -		return -ENXIO;
>> +	/* BIOS has PCIe AER error control */
>> +	if (!host_bridge->native_aer)
>> +		return 0;
> 
> Why not directly use pcie_aer_is_native() here?
Yeah, this was in my v1. But changed as per Robert's comments, to be 
applicable for automated backports..

https://lore.kernel.org/all/ZLkxiZv3lWfazwVH@rric.localdomain/

Please advice.

Thanks,
Smita
> 
>>   
>>   	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>>   	if (rc)
>
  
Dan Williams Aug. 8, 2023, 3:17 a.m. UTC | #4
Smita Koralahalli wrote:
> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
> of AER should also own CXL Protocol Error Management as there is no
> explicit control of CXL Protocol error. And the CXL RAS Cap registers
> reported on Protocol errors should check for AER _OSC rather than CXL
> Memory Error Reporting Control _OSC.
> 
> The CXL Memory Error Reporting Control _OSC specifically highlights
> handling Memory Error Logging and Signaling Enhancements. These kinds of
> errors are reported through a device's mailbox and can be managed
> independently from CXL Protocol Errors.
> 
> This change fixes handling and reporting CXL Protocol Errors and RAS
> registers natively with native AER and FW-First CXL Memory Error Reporting
> Control.

I feel like this could be said more succinctly and with an indication of
what the end user should expect to see. Something like:

"cxl_pci fails to unmask CXL protocol errors when CXL memory error
reporting is not granted native control. Given that CXL memory error
reporting uses the event interface and protocol errors use AER, unmask
protocol errors based only on the native AER setting. Without this
change end user deployments will fail to report protocol errors in the
case where native memory error handling is not granted to Linux."

> 
> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
> 
> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> ---
> v2:
> 	Added fixes tag.
> 	Included what the patch fixes in commit message.
> ---
>  drivers/cxl/pci.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 1cb1494c28fe..2323169b6e5f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>  		return 0;
>  	}
>  
> -	/* BIOS has CXL error control */
> -	if (!host_bridge->native_cxl_error)
> -		return -ENXIO;
> +	/* BIOS has PCIe AER error control */
> +	if (!host_bridge->native_aer)
> +		return 0;

The error code does not matter here and changing it makes the patch that
bit much more noisier than it needs to be. So just leave it as:

	return -ENXIO;

>  
>  	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>  	if (rc)
> -- 
> 2.17.1
>
  
Dan Williams Aug. 8, 2023, 3:22 a.m. UTC | #5
Smita Koralahalli wrote:
> 
> >> -	/* BIOS has CXL error control */
> >> -	if (!host_bridge->native_cxl_error)
> >> -		return -ENXIO;
> >> +	/* BIOS has PCIe AER error control */
> >> +	if (!host_bridge->native_aer)
> >> +		return 0;
> > 
> > Why not directly use pcie_aer_is_native() here?
> Yeah, this was in my v1. But changed as per Robert's comments, to be 
> applicable for automated backports..
> 
> https://lore.kernel.org/all/ZLkxiZv3lWfazwVH@rric.localdomain/
> 
> Please advice.

Keep it the way you have it. Minimizing the backport is the right call.
  
Smita Koralahalli Aug. 8, 2023, 7:37 p.m. UTC | #6
On 8/7/2023 8:17 PM, Dan Williams wrote:
> Smita Koralahalli wrote:
>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
>> of AER should also own CXL Protocol Error Management as there is no
>> explicit control of CXL Protocol error. And the CXL RAS Cap registers
>> reported on Protocol errors should check for AER _OSC rather than CXL
>> Memory Error Reporting Control _OSC.
>>
>> The CXL Memory Error Reporting Control _OSC specifically highlights
>> handling Memory Error Logging and Signaling Enhancements. These kinds of
>> errors are reported through a device's mailbox and can be managed
>> independently from CXL Protocol Errors.
>>
>> This change fixes handling and reporting CXL Protocol Errors and RAS
>> registers natively with native AER and FW-First CXL Memory Error Reporting
>> Control.
> 
> I feel like this could be said more succinctly and with an indication of
> what the end user should expect to see. Something like:
> 
> "cxl_pci fails to unmask CXL protocol errors when CXL memory error
> reporting is not granted native control. Given that CXL memory error
> reporting uses the event interface and protocol errors use AER, unmask
> protocol errors based only on the native AER setting. Without this
> change end user deployments will fail to report protocol errors in the
> case where native memory error handling is not granted to Linux."

Sure, will make the change for a more clearer description. Thanks!
> 
>>
>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>>
>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> ---
>> v2:
>> 	Added fixes tag.
>> 	Included what the patch fixes in commit message.
>> ---
>>   drivers/cxl/pci.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index 1cb1494c28fe..2323169b6e5f 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>>   		return 0;
>>   	}
>>   
>> -	/* BIOS has CXL error control */
>> -	if (!host_bridge->native_cxl_error)
>> -		return -ENXIO;
>> +	/* BIOS has PCIe AER error control */
>> +	if (!host_bridge->native_aer)
>> +		return 0;
> 
> The error code does not matter here and changing it makes the patch that
> bit much more noisier than it needs to be. So just leave it as:

Doing this will return an error from cxl_pci probe thereby failing the 
device node creation in FW-First AER/DPC. I cannot think of other places 
where we reference the device node in FW-First mode but I have a place 
where this could potentially be a roadblock.

I'm trying to add trace events support for FW-First Protocol Errors. 
https://lore.kernel.org/linux-cxl/D9381C12-A585-4089-873B-3707C17823D3@fb.com/T/#mcaf8a78c1295372ab811be7e1ccb6a8a4d99f3e9

And we already have an existing trace_cxl_aer_correctable_error() and 
similarly for uncorrectable error for native protocol error reporting. I 
was trying to reuse the same function for fw-first as well. This 
function references cxl memory device node which will be NULL in 
FW-First if this returns an error.

I don't mind having a separate trace event function for FW-First mode as 
it would simplify things especially when dealing with RCH DP.. But there 
may be other potential places where we might reference this device node 
in FW-First. Please advice.

Thanks,
Smita

> 
> 	return -ENXIO;
> 
>>   
>>   	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>>   	if (rc)
>> -- 
>> 2.17.1
>>
> 
> 
>
  

Patch

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 1cb1494c28fe..2323169b6e5f 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -541,9 +541,9 @@  static int cxl_pci_ras_unmask(struct pci_dev *pdev)
 		return 0;
 	}
 
-	/* BIOS has CXL error control */
-	if (!host_bridge->native_cxl_error)
-		return -ENXIO;
+	/* BIOS has PCIe AER error control */
+	if (!host_bridge->native_aer)
+		return 0;
 
 	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
 	if (rc)