drivers/iommu: Ensure that the queue base address is successfully written during SMMU initialization.

Message ID 20240218050224.33426-1-niliqiang.io@gmail.com
State New
Series drivers/iommu: Ensure that the queue base address is successfully written during SMMU initialization.

Commit Message

ni.liqiang Feb. 18, 2024, 5:02 a.m. UTC
  During system reboot testing, I encountered an issue:
after the OS started, the CMDQ base address register had not been
written successfully and remained at its default value of 0.

Timing analysis of the CMN interconnect showed that although
the write request for the CMDQ base is issued before the write request
for CMDQEN, the write response for the CMDQ base can arrive later
than the one for CMDQEN.

Upon reviewing the SMMU Architecture Specification,
I found the following explanation:
The registers must be initialized in this order:
1. Write SMMU_CMDQ_BASE to set the queue base and size.
2. Write initial values to SMMU_CMDQ_CONS and SMMU_CMDQ_PROD.
3. Enable the queue with an Update of the respective SMMU_CR0.CMDQEN to 1.

If there are no memory barriers, how can we ensure this order?
Therefore, I believe a memory barrier is needed before enabling the CMDQ
to ensure that the CMDQ base address has been successfully written first.
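
For the CMDQ, the intended sequence would then look roughly like the
fragment below (only a sketch to illustrate the ordering; the actual
change is in the patch at the end of this mail, and the EVENTQ/PRIQ
paths are analogous):

	/* Program the queue registers while SMMU_CR0.CMDQEN is still 0. */
	writeq_relaxed(smmu->cmdq.q.q_base, smmu->base + ARM_SMMU_CMDQ_BASE);
	writel_relaxed(smmu->cmdq.q.llq.prod, smmu->base + ARM_SMMU_CMDQ_PROD);
	writel_relaxed(smmu->cmdq.q.llq.cons, smmu->base + ARM_SMMU_CMDQ_CONS);

	/* Make sure the relaxed writes above have reached the SMMU
	 * before CMDQEN is set below.
	 */
	__iomb();

	enables = CR0_CMDQEN;
	ret = arm_smmu_write_reg_sync(smmu, enables, ARM_SMMU_CR0,
				      ARM_SMMU_CR0ACK);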

The base addresses of EVENTQ and PRIQ would also be subject
to the same situation.

Could you please review whether this modification is reasonable? Thank you.

Signed-off-by: ni.liqiang <niliqiang.io@gmail.com>
Reviewed-by: jin.qi <jin.qi@zte.com.cn>
Tested-by: ni.liqiang <niliqiang.io@gmail.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
  

Comments

Daniel Mentz Feb. 19, 2024, 5:44 a.m. UTC | #1
On Sat, Feb 17, 2024 at 9:02 PM ni.liqiang <niliqiang.io@gmail.com> wrote:
> If there are no memory barriers, how can we ensure this order?

 The SMMU registers are accessed using Device-nGnRE attributes. It is
my understanding that, for Device-nGnRE, the Arm architecture requires
that writes to the same peripheral arrive at the endpoint in program
order.
  
Will Deacon Feb. 19, 2024, 9:17 a.m. UTC | #2
On Sun, Feb 18, 2024 at 09:44:47PM -0800, Daniel Mentz wrote:
> On Sat, Feb 17, 2024 at 9:02 PM ni.liqiang <niliqiang.io@gmail.com> wrote:
> > If there are no memory barriers, how can we ensure this order?
> 
>  The SMMU registers are accessed using Device-nGnRE attributes. It is
> my understanding that, for Device-nGnRE, the Arm architecture requires
> that writes to the same peripheral arrive at the endpoint in program
> order.

Yup, that's correct. The "nR" part means "non-Reordering", so something
else is going on here.

Will
  
ni.liqiang Feb. 21, 2024, 3:26 p.m. UTC | #3
>>  The SMMU registers are accessed using Device-nGnRE attributes. It is
>> my understanding that, for Device-nGnRE, the Arm architecture requires
>> that writes to the same peripheral arrive at the endpoint in program
>> order.
>
> Yup, that's correct. The "nR" part means "non-Reordering", so something
> else is going on here.

Yes, the SMMU registers are accessed using Device-nGnRE attributes. 

One additional point to note: the failure to write the CMDQ base address register occurred on a multi-die, multi-socket server. This issue has not been observed on a single-die server. I apologize for omitting this information from my initial patch submission.

Over the past few days, I have referenced the kernel source code and ported the SMMU register initialization sequence into a stress test, attempting to reproduce the CMDQ base address register write failure (a rough sketch of the test loop follows the list below). The summarized results of my experiments are as follows:
1. When the test was bound to one CPU core using taskset, the initialization sequence was executed 300,000 times without encountering the CMDQ base address register write failure. However, when the CPU was not bound with taskset, the issue reproduced around 1,000 iterations into the test.
2. Without CPU binding, I inserted a memory barrier between the accesses to the CMDQ_BASE and CMDQEN registers, similar to the modification made in the patch. After executing the initialization sequence 300,000 times, the CMDQ base address register write failure did not occur.
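
For reference, the core of the stress loop looks roughly like the sketch
below. This is a simplified, hypothetical user-space illustration rather
than my exact test code: SMMU_BASE_PA and the queue base value are
platform specific, the register offsets are the architectural SMMUv3
offsets, and all error handling is omitted.

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/mman.h>

	#define SMMU_BASE_PA	0x0UL		/* platform specific, fill in */
	#define CR0		0x20		/* SMMU_CR0 */
	#define CR0ACK		0x24		/* SMMU_CR0ACK */
	#define CMDQ_BASE	0x90		/* SMMU_CMDQ_BASE */
	#define CMDQ_PROD	0x98		/* SMMU_CMDQ_PROD */
	#define CMDQ_CONS	0x9c		/* SMMU_CMDQ_CONS */
	#define CR0_CMDQEN	(1U << 3)

	int main(void)
	{
		int fd = open("/dev/mem", O_RDWR | O_SYNC);
		volatile uint8_t *smmu = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
					      MAP_SHARED, fd, SMMU_BASE_PA);
		uint64_t base = 0;	/* queue PA | log2size, as computed by the driver */

		for (int i = 0; i < 300000; i++) {
			/* Same order as arm_smmu_device_reset(): base, prod/cons, enable. */
			*(volatile uint64_t *)(smmu + CMDQ_BASE) = base;
			*(volatile uint32_t *)(smmu + CMDQ_PROD) = 0;
			*(volatile uint32_t *)(smmu + CMDQ_CONS) = 0;
			/* Experiment 2 adds a barrier here, e.g. asm volatile("dmb oshst" ::: "memory"); */
			*(volatile uint32_t *)(smmu + CR0) = CR0_CMDQEN;
			while (!(*(volatile uint32_t *)(smmu + CR0ACK) & CR0_CMDQEN))
				;
			if (*(volatile uint64_t *)(smmu + CMDQ_BASE) != base)
				printf("CMDQ_BASE write lost at iteration %d\n", i);
			*(volatile uint32_t *)(smmu + CR0) = 0;	/* disable for the next run */
		}
		return 0;
	}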

Based on these observations and joint analysis with our CMN colleagues, we speculate that if the executing CPU core changes during the SMMU register initialization sequence, and the CPUs involved are located on different dies, the four underlying CCG ports are used to perform the die-to-die accesses. With our current strategy, these four CCG ports cannot guarantee ordering, so the CMDQEN write can complete before the CMDQ base address write.

From the analysis above, it seems that making the die-to-die access strategy preserve the ordering of Device-nGnRE memory might be a better solution than adding a memory barrier?
  
Will Deacon Feb. 21, 2024, 4:08 p.m. UTC | #4
On Wed, Feb 21, 2024 at 11:26:29PM +0800, ni.liqiang wrote:
> >>  The SMMU registers are accessed using Device-nGnRE attributes. It is
> >> my understanding that, for Device-nGnRE, the Arm architecture requires
> >> that writes to the same peripheral arrive at the endpoint in program
> >> order.
> >
> > Yup, that's correct. The "nR" part means "non-Reordering", so something
> > else is going on here.
> 
> Yes, the SMMU registers are accessed using Device-nGnRE attributes. 
> 
> One additional point to note: the failure to write the CMDQ base
> address register occurred on a multi-die, multi-socket server. This
> issue has not been observed on a single-die server. I apologize for
> omitting this information from my initial patch submission.

Uh-oh, smells like a hardware issue ;p
I wonder if Device-nGnRnE behaves any differently?

> Over the past few days, I have referenced the kernel source code and
> ported the SMMU register initialization sequence into a stress test,
> attempting to reproduce the CMDQ base address register write failure.
> The summarized results of my experiments are as follows:
> 1. When the test was bound to one CPU core using taskset, the
> initialization sequence was executed 300,000 times without encountering
> the CMDQ base address register write failure. However, when the CPU was
> not bound with taskset, the issue reproduced around 1,000 iterations
> into the test.
> 2. Without CPU binding, I inserted a memory barrier between the accesses
> to the CMDQ_BASE and CMDQEN registers, similar to the modification made
> in the patch. After executing the initialization sequence 300,000 times,
> the CMDQ base address register write failure did not occur.
> 
> Based on these observations and joint analysis with our CMN colleagues,
> we speculate that if the executing CPU core changes during the SMMU
> register initialization sequence, and the CPUs involved are located on
> different dies, the four underlying CCG ports are used to perform the
> die-to-die accesses. With our current strategy, these four CCG ports
> cannot guarantee ordering, so the CMDQEN write can complete before the
> CMDQ base address write.

(Disclaimer: I don't know what a CCG port is)

Hmmm. The part that doesn't make sense to me here is that migrating between
CPUs implies context-switching, and we have a DSB on that path in
__switch_to(). So why would adding barriers to the driver help? Maybe it
just changes the timing?

> From the analysis above, it seems that making the die-to-die access
> strategy preserve the ordering of Device-nGnRE memory might be a better
> solution than adding a memory barrier?

I'm not sure what you're proposing, but I don't think Linux should be
changed to accommodate this.

Will
  
ni.liqiang Feb. 23, 2024, 4:20 p.m. UTC | #5
> (Disclaimer: I don't know what a CCG port is)

CCG (CXL Gateway) is part of the CMN-700 Coherent Mesh Network. It plays
a crucial role in facilitating cross-die accesses in a multi-chip system.

> Hmmm. The part that doesn't make sense to me here is that migrating between
> CPUs implies context-switching, and we have a DSB on that path in
> __switch_to(). So why would adding barriers to the driver help? Maybe it
> just changes the timing?

This is very likely. In our experiments, inserting a delay before enabling
CMDQEN also stops the CMDQ base register write failure from reproducing.

> I'm not sure what you're proposing, but I don't think Linux should be
> changed to accommodate this.

I am very grateful to both of you for your responses. We will continue to
check the current hardware configuration of the system and attempt to fix
this issue there.
Thanks again.
  

Patch

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0ffb1cf17e0b..ac854c46fdf3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3324,6 +3324,11 @@  static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	writel_relaxed(smmu->cmdq.q.llq.prod, smmu->base + ARM_SMMU_CMDQ_PROD);
 	writel_relaxed(smmu->cmdq.q.llq.cons, smmu->base + ARM_SMMU_CMDQ_CONS);
 
+	/* Ensure that SMMU_CMDQ_BASE is written completely
+	 * when SMMU_CR0.CMDQEN == 0.
+	 */
+	__iomb();
+
 	enables = CR0_CMDQEN;
 	ret = arm_smmu_write_reg_sync(smmu, enables, ARM_SMMU_CR0,
 				      ARM_SMMU_CR0ACK);
@@ -3350,6 +3355,11 @@  static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	writel_relaxed(smmu->evtq.q.llq.prod, smmu->page1 + ARM_SMMU_EVTQ_PROD);
 	writel_relaxed(smmu->evtq.q.llq.cons, smmu->page1 + ARM_SMMU_EVTQ_CONS);
 
+	/* Ensure that SMMU_EVENTQ_BASE is written completely
+	 * when SMMU_CR0.EVENTQEN == 0.
+	 */
+	__iomb();
+
 	enables |= CR0_EVTQEN;
 	ret = arm_smmu_write_reg_sync(smmu, enables, ARM_SMMU_CR0,
 				      ARM_SMMU_CR0ACK);
@@ -3367,6 +3377,11 @@  static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 		writel_relaxed(smmu->priq.q.llq.cons,
 			       smmu->page1 + ARM_SMMU_PRIQ_CONS);
 
+		/* Ensure that SMMU_PRIQ_BASE is written completely
+		 * when SMMU_CR0.PRIQEN == 0.
+		 */
+		__iomb();
+
 		enables |= CR0_PRIQEN;
 		ret = arm_smmu_write_reg_sync(smmu, enables, ARM_SMMU_CR0,
 					      ARM_SMMU_CR0ACK);