[v3,1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

Message ID 20230803230129.127590-2-Smita.KoralahalliChannabasappa@amd.com
State New
Headers
Series PCI/AER, CXL: Fix appropriate _OSC check for CXL RAS Cap |

Commit Message

Smita Koralahalli Aug. 3, 2023, 11:01 p.m. UTC
  According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
of AER should also own CXL Protocol Error Management as there is no
explicit control of CXL Protocol error. And the CXL RAS Cap registers
reported on Protocol errors should check for AER _OSC rather than CXL
Memory Error Reporting Control _OSC.

The CXL Memory Error Reporting Control _OSC specifically highlights
handling Memory Error Logging and Signaling Enhancements. These kinds of
errors are reported through a device's mailbox and can be managed
independently from CXL Protocol Errors.

This change fixes handling and reporting CXL Protocol Errors and RAS
registers natively with native AER and FW-First CXL Memory Error Reporting
Control.

[1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.

Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Robert Richter <rrichter@amd.com>
---
v2:
	Added fixes tag.
	Included what the patch fixes in commit message.
v3:
	Added "Reviewed-by" tag.
---
 drivers/cxl/pci.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
  

Comments

Smita Koralahalli Aug. 16, 2023, 9:36 p.m. UTC | #1
On 8/16/2023 2:33 PM, Smita Koralahalli wrote:
> On 8/16/2023 11:06 AM, Dave Jiang wrote:
>>
>>
>> On 8/4/23 05:09, Jonathan Cameron wrote:
>>> On Thu, 3 Aug 2023 23:01:27 +0000
>>> Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote:
>>>
>>>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
>>>> of AER should also own CXL Protocol Error Management as there is no
>>>> explicit control of CXL Protocol error. And the CXL RAS Cap registers
>>>> reported on Protocol errors should check for AER _OSC rather than CXL
>>>> Memory Error Reporting Control _OSC.
>>>>
>>>> The CXL Memory Error Reporting Control _OSC specifically highlights
>>>> handling Memory Error Logging and Signaling Enhancements. These 
>>>> kinds of
>>>> errors are reported through a device's mailbox and can be managed
>>>> independently from CXL Protocol Errors.
>>>>
>>>> This change fixes handling and reporting CXL Protocol Errors and RAS
>>>> registers natively with native AER and FW-First CXL Memory Error 
>>>> Reporting
>>>> Control.
>>>>
>>>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>>>>
>>>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
>>>> Signed-off-by: Smita Koralahalli 
>>>> <Smita.KoralahalliChannabasappa@amd.com>
>>>> Reviewed-by: Robert Richter <rrichter@amd.com>
>>>
>>> I'd be tempted to add a comment on why this returns 0 rather than an
>>> error.  I think that makes sense but it isn't immediately obvious from
>>> the local context.
>>>
>>> Otherwise LGTM
>>>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>
>> Echo Jonathan's comment.
>>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> Yes, and Dan is probably against returning error code.

Against returning zero. My bad sorry!
> https://lore.kernel.org/all/64d1b3e78629f_5ea6e2944@dwillia2-xfh.jf.intel.com.notmuch/ 
> 
> 
> But I think returning zero is required as we don't want to interfere 
> with cxl device access when operating in native cxl memory error 
> reporting. Returning error code will basically fail cxl_pci_probe() and 
> thus fail to create a cxl device node.
> 
> I was thinking a single line comment as: "Return zero to not block the 
> communication with the cxl device when in native memory error reporting 
> mode".
> 
> Agree? Or anything more that needs to be added?
> 
> Thanks,
> Smita
>>
>>>
>>>
>>>> ---
>>>> v2:
>>>>     Added fixes tag.
>>>>     Included what the patch fixes in commit message.
>>>> v3:
>>>>     Added "Reviewed-by" tag.
>>>> ---
>>>>   drivers/cxl/pci.c | 6 +++---
>>>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>>> index 1cb1494c28fe..2323169b6e5f 100644
>>>> --- a/drivers/cxl/pci.c
>>>> +++ b/drivers/cxl/pci.c
>>>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>>>>           return 0;
>>>>       }
>>>> -    /* BIOS has CXL error control */
>>>> -    if (!host_bridge->native_cxl_error)
>>>> -        return -ENXIO;
>>>> +    /* BIOS has PCIe AER error control */
>>>> +    if (!host_bridge->native_aer)
>>>> +        return 0;
>>>>       rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>>>>       if (rc)
>>>
>
  
Dave Jiang Aug. 16, 2023, 9:39 p.m. UTC | #2
On 8/16/23 14:33, Smita Koralahalli wrote:
> On 8/16/2023 11:06 AM, Dave Jiang wrote:
>>
>>
>> On 8/4/23 05:09, Jonathan Cameron wrote:
>>> On Thu, 3 Aug 2023 23:01:27 +0000
>>> Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote:
>>>
>>>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
>>>> of AER should also own CXL Protocol Error Management as there is no
>>>> explicit control of CXL Protocol error. And the CXL RAS Cap registers
>>>> reported on Protocol errors should check for AER _OSC rather than CXL
>>>> Memory Error Reporting Control _OSC.
>>>>
>>>> The CXL Memory Error Reporting Control _OSC specifically highlights
>>>> handling Memory Error Logging and Signaling Enhancements. These 
>>>> kinds of
>>>> errors are reported through a device's mailbox and can be managed
>>>> independently from CXL Protocol Errors.
>>>>
>>>> This change fixes handling and reporting CXL Protocol Errors and RAS
>>>> registers natively with native AER and FW-First CXL Memory Error 
>>>> Reporting
>>>> Control.
>>>>
>>>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>>>>
>>>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
>>>> Signed-off-by: Smita Koralahalli 
>>>> <Smita.KoralahalliChannabasappa@amd.com>
>>>> Reviewed-by: Robert Richter <rrichter@amd.com>
>>>
>>> I'd be tempted to add a comment on why this returns 0 rather than an
>>> error.  I think that makes sense but it isn't immediately obvious from
>>> the local context.
>>>
>>> Otherwise LGTM
>>>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>
>> Echo Jonathan's comment.
>>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> Yes, and Dan is probably against returning error code.
> https://lore.kernel.org/all/64d1b3e78629f_5ea6e2944@dwillia2-xfh.jf.intel.com.notmuch/
> 
> But I think returning zero is required as we don't want to interfere 
> with cxl device access when operating in native cxl memory error 
> reporting. Returning error code will basically fail cxl_pci_probe() and 
> thus fail to create a cxl device node.
> 
> I was thinking a single line comment as: "Return zero to not block the 
> communication with the cxl device when in native memory error reporting 
> mode".

Looks reasonable to me. Unless Dan feels it needs to fail the probe.

> 
> Agree? Or anything more that needs to be added?
> 
> Thanks,
> Smita
>>
>>>
>>>
>>>> ---
>>>> v2:
>>>>     Added fixes tag.
>>>>     Included what the patch fixes in commit message.
>>>> v3:
>>>>     Added "Reviewed-by" tag.
>>>> ---
>>>>   drivers/cxl/pci.c | 6 +++---
>>>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>>> index 1cb1494c28fe..2323169b6e5f 100644
>>>> --- a/drivers/cxl/pci.c
>>>> +++ b/drivers/cxl/pci.c
>>>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>>>>           return 0;
>>>>       }
>>>> -    /* BIOS has CXL error control */
>>>> -    if (!host_bridge->native_cxl_error)
>>>> -        return -ENXIO;
>>>> +    /* BIOS has PCIe AER error control */
>>>> +    if (!host_bridge->native_aer)
>>>> +        return 0;
>>>>       rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>>>>       if (rc)
>>>
>
  

Patch

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 1cb1494c28fe..2323169b6e5f 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -541,9 +541,9 @@  static int cxl_pci_ras_unmask(struct pci_dev *pdev)
 		return 0;
 	}
 
-	/* BIOS has CXL error control */
-	if (!host_bridge->native_cxl_error)
-		return -ENXIO;
+	/* BIOS has PCIe AER error control */
+	if (!host_bridge->native_aer)
+		return 0;
 
 	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
 	if (rc)