arm64: arch_timer: XGene-1 has 31 bit, not 32 bit, arch timer.

Message ID 20221021153424.GA25677@zipoli.concurrent-rt.com
State New

Commit Message

Joe Korty Oct. 21, 2022, 3:34 p.m. UTC
  arm64: XGene-1 has a 31 bit, not a 32 bit, arch timer.

Fixes: 012f188504528b8cb32f441ac3bd9ea2eba39c9e ("clocksource/drivers/arm_arch_timer:
  Work around broken CVAL implementations")

Testing:
  On an 8-cpu Mustang, the following sequence no longer locks up the system:

     echo 0 >/proc/sys/kernel/watchdog
     for i in {0..7}; do taskset -c $i echo hi there $i; done

Stable:
  To be applied to 5.16 and above, once accepted by mainline.

Signed-off-by: Joe Korty <joe.korty@concurrent-rt.com>
  

Comments

Greg KH Oct. 21, 2022, 3:44 p.m. UTC | #1
On Fri, Oct 21, 2022 at 11:34:24AM -0400, Joe Korty wrote:
> arm64: XGene-1 has a 31 bit, not a 32 bit, arch timer.
> 
> Fixes: 012f188504528b8cb32f441ac3bd9ea2eba39c9e ("clocksource/drivers/arm_arch_timer:
>   Work around broken CVAL implementations")
> 
> Testing:
>   On an 8-cpu Mustang, the following sequence no longer locks up the system:
> 
>      echo 0 >/proc/sys/kernel/watchdog
>      for i in {0..7}; do taskset -c $i echo hi there $i; done
> 
> Stable:
>   To be applied to 5.16 and above, once accepted by mainline.
> 
> Signed-off-by: Joe Korty <joe.korty@concurrent-rt.com>
> 
> Index: b/drivers/clocksource/arm_arch_timer.c
> ===================================================================
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -805,7 +805,7 @@ static u64 __arch_timer_check_delta(void
>  	const struct midr_range broken_cval_midrs[] = {
>  		/*
>  		 * XGene-1 implements CVAL in terms of TVAL, meaning
> -		 * that the maximum timer range is 32bit. Shame on them.
> +		 * that the maximum timer range is 31bit. Shame on them.
>  		 */
>  		MIDR_ALL_VERSIONS(MIDR_CPU_MODEL(ARM_CPU_IMP_APM,
>  						 APM_CPU_PART_POTENZA)),
> @@ -813,8 +813,8 @@ static u64 __arch_timer_check_delta(void
>  	};
>  
>  	if (is_midr_in_range_list(read_cpuid_id(), broken_cval_midrs)) {
> -		pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 32bits");
> -		return CLOCKSOURCE_MASK(32);
> +		pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 31bits");
> +		return CLOCKSOURCE_MASK(31);
>  	}
>  #endif
>  	return CLOCKSOURCE_MASK(arch_counter_get_width());

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>
  
Marc Zyngier Oct. 21, 2022, 6:08 p.m. UTC | #2
On Fri, 21 Oct 2022 16:34:24 +0100,
Joe Korty <joe.korty@concurrent-rt.com> wrote:
> 
> arm64: XGene-1 has a 31 bit, not a 32 bit, arch timer.
> 
> Fixes: 012f188504528b8cb32f441ac3bd9ea2eba39c9e ("clocksource/drivers/arm_arch_timer:
>   Work around broken CVAL implementations")

Sorry, but you'll have to provide a bit more of an analysis here. As
far as I can tell, you're just changing a parameter without properly
describing what breaks and how.

>
> Testing:
>   On an 8-cpu Mustang, the following sequence no longer locks up the system:
> 
>      echo 0 >/proc/sys/kernel/watchdog
>      for i in {0..7}; do taskset -c $i echo hi there $i; done
> 
> Stable:
>   To be applied to 5.16 and above, once accepted by mainline.
> 
> Signed-off-by: Joe Korty <joe.korty@concurrent-rt.com>
> 
> Index: b/drivers/clocksource/arm_arch_timer.c
> ===================================================================
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -805,7 +805,7 @@ static u64 __arch_timer_check_delta(void
>  	const struct midr_range broken_cval_midrs[] = {
>  		/*
>  		 * XGene-1 implements CVAL in terms of TVAL, meaning
> -		 * that the maximum timer range is 32bit. Shame on them.
> +		 * that the maximum timer range is 31bit. Shame on them.
>  		 */
>  		MIDR_ALL_VERSIONS(MIDR_CPU_MODEL(ARM_CPU_IMP_APM,
>  						 APM_CPU_PART_POTENZA)),
> @@ -813,8 +813,8 @@ static u64 __arch_timer_check_delta(void
>  	};
>  
>  	if (is_midr_in_range_list(read_cpuid_id(), broken_cval_midrs)) {
> -		pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 32bits");
> -		return CLOCKSOURCE_MASK(32);
> +		pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 31bits");
> +		return CLOCKSOURCE_MASK(31);
>  	}
>  #endif
>  	return CLOCKSOURCE_MASK(arch_counter_get_width());
> 

Also, this isn't much of a patch. Please see the documentation on how
to properly submit one.

Thanks,

	M.
  
Joe Korty Oct. 21, 2022, 7:47 p.m. UTC | #3
Hi Marc,

On Fri, Oct 21, 2022 at 07:08:50PM +0100, Marc Zyngier wrote:
> Sorry, but you'll have to provide a bit more of an analysis here. As
> far as I can tell, you're just changing a parameter without properly
> describing what breaks and how.

There isn't much to analyse.  For ages, 0x7fffffff (31 bits) was the
declared width of 'arch timer' for all arm architectures, and that worked.
Your patch series made the declared width vary according to which chipset
was in use, which is good, but that rewrite changed the above mask for
the XGene-1 from 0x7fffffff to 0xffffffff.  That change broke timers
for the XGene-1 since it seems that, in actuality, it has only a 31-bit
wide arch timer.  Thus declaring that the arch timer has 32 bits is wrong.
This mismatch between the actual and declared sizes would cause arithmetic
errors in the calculation of timer deltas, which more than accounts for
the hrtimer failures I am seeing when running 5.16+ on my Mustang XGene1.

Only one line need change, the rest are fluff:

-             return CLOCKSOURCE_MASK(32);
+             return CLOCKSOURCE_MASK(31);

> Also, this isn't much of a patch.

I don't know what this means.  The patch contains all that is needed for
the fix, no more.  I could add more comments as to _why_ it is 31 bits
not 32, but I don't know why.  I only know that the motherboard behaves
as if 31 bits is all that is available in the hardware.

> Please see the documentation on how to properly submit one.

AFAICS, the only submission mistake is that the 'Cc: stable@vger.kernel.org'
line is missing.

Regards,
Joe
  
Greg KH Oct. 22, 2022, 5:40 a.m. UTC | #4
On Fri, Oct 21, 2022 at 03:47:46PM -0400, Joe Korty wrote:
> Hi Marc,
> 
> On Fri, Oct 21, 2022 at 07:08:50PM +0100, Marc Zyngier wrote:
> > Sorry, but you'll have to provide a bit more of an analysis here. As
> > far as I can tell, you're just changing a parameter without properly
> > describing what breaks and how.
> 
> There isn't much to analyse.  For ages, 0x7fffffff (31 bits) was the
> declared width of 'arch timer' for all arm architectures, and that worked.
> Your patch series made the declared width vary according to which chipset
> was in use, which is good, but that rewrite changed the above mask for
> the XGene-1 from 0x7fffffff to 0xffffffff.  That change broke timers
> for the XGene-1 since it seems that, in actuality, it has only a 31-bit
> wide arch timer.  Thus declaring that the arch timer has 32 bits is wrong.
> This mismatch between the actual and declared sizes would cause arithmetic
> errors in the calculation of timer deltas, which more than accounts for
> the hrtimer failures I am seeing when running 5.16+ on my Mustang XGene1.
> 
> Only one line need change, the rest are fluff:
> 
> -             return CLOCKSOURCE_MASK(32);
> +             return CLOCKSOURCE_MASK(31);
> 
> > Also, this isn't much of a patch.
> 
> I don't know what this means.  The patch contains all that is needed for
> the fix, no more.  I could add more comments as to _why_ it is 31 bits
> not 32, but I don't know why.  I only know that the motherboard behaves
> as if 31 bits is all that is available in the hardware.
> 
> > Please see the documentation on how to properly submit one.
> 
> AFAICS, the only submission mistake is that the 'Cc: stable@vger.kernel.org'
> line is missing.

No, you need a much better changelog text and probably subject line, and
to properly cc: the correct maintainers and developers.  As my bot would
say:

- Kernel development is done in public, please always cc: a public
  mailing list with a patch submission.  Using the tool,
  scripts/get_maintainer.pl on the patch will tell you what mailing list
  to cc.

- You did not specify a description of why the patch is needed, or
  possibly, any description at all, in the email body.  Please read the
  section entitled "The canonical patch format" in the kernel file,
  Documentation/SubmittingPatches for what is needed in order to
  properly describe the change.

- You did not write a descriptive Subject: for the patch, allowing Greg,
  and everyone else, to know what this patch is all about.  Please read
  the section entitled "The canonical patch format" in the kernel file,
  Documentation/SubmittingPatches for what a proper Subject: line should
  look like.


Thanks,

greg k-h
  
Marc Zyngier Oct. 22, 2022, 9:58 a.m. UTC | #5
Hi Joe,

On Fri, 21 Oct 2022 20:47:46 +0100,
Joe Korty <joe.korty@concurrent-rt.com> wrote:
> 
> Hi Marc,
> 
> On Fri, Oct 21, 2022 at 07:08:50PM +0100, Marc Zyngier wrote:
> > Sorry, but you'll have to provide a bit more of an analysis here. As
> > far as I can tell, you're just changing a parameter without properly
> > describing what breaks and how.
> 
> There isn't much to analyse.

Actually, there is plenty to analyse. Starting with *why* 31 is the
correct value (it actually is, see below) other than "hey, I reverted
this and it's all good, just merge it".

> For ages, 0x7fffffff (31 bits) was the
> declared width of 'arch timer' for all arm architectures, and that worked.
> Your patch series made the declared width vary according to which chipset
> was in use, which is good, but that rewrite changed the above mask for
> the XGene-1 from 0x7fffffff to 0xffffffff.

This isn't quite what my changes did, but hey, let's not get derailed.

> That change broke timers
> for the XGene-1 since it seems that, in actuality, it has only a 31-bit
> wide arch timer.  Thus declaring that the arch timer has 32 bits is wrong.
> This mismatch between the actual and declared sizes would cause arithmetic
> errors in the calculation of timer deltas, which more than accounts for
> the hrtimer failures I am seeing when running 5.16+ on my Mustang XGene1.

This is the important point, and the reason why it breaks:

XGene implements CVAL (a 64bit comparator) in terms of TVAL (a
countdown register) instead of the other way around. TVAL being a
32bit register, the width of the counter should equally be 32.
However, TVAL is a *signed* value, and keeps counting down in the
negative range once the timer fires.

It means that any TVAL value with bit 31 set will fire immediately, as
it cannot be distinguished from an already expired timer. Reducing the
timer range back to a paltry 31 bits papers over the issue.

Another problem cannot be fixed though, which is that the timer
interrupt *must* be handled within the negative countdown period, or
the interrupt will be lost (TVAL will roll over to a positive value,
indicative of a new timer deadline).

> Only one line need change, the rest are fluff:
> 
> -             return CLOCKSOURCE_MASK(32);
> +             return CLOCKSOURCE_MASK(31);

Yes, and all you need is to send a proper patch, see below.

> 
> > Also, this isn't much of a patch.
> 
> I don't know what this means.  The patch contains all that is needed for
> the fix, no more.  I could add more comments as to _why_ it is 31 bits
> not 32, but I don't know why.  I only know that the motherboard behaves
> as if 31 bits is all that is available in the hardware.
> 
> > Please see the documentation on how to properly submit one.
> 
> AFAICS, the only submission mistake is that the 'Cc: stable@vger.kernel.org'
> line is missing.

What you have done here is to write an email with a diff appended to
it, which isn't a proper kernel patch. I expect a patch to be
formatted with "git format-patch" instead of "git diff"
(i.e. something that is an actual commit instead of a local diff),
with a proper commit message (feel free to nick some of the
description above), with a Cc: stable@ and a Fixes: tag at the right
spot, Cc'ing all the relevant maintainers.

All of this is eloquently explained in the kernel documentation
(Documentation/process/submitting-patches.rst), and I would definitely
encourage you to read the sections titled "Describe your changes" and
"The canonical patch format". You can also look at the previous
commits to the same file to get a sense of the formatting that people
use.

Thanks,

	M.
  

Patch

Index: b/drivers/clocksource/arm_arch_timer.c
===================================================================
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -805,7 +805,7 @@  static u64 __arch_timer_check_delta(void
 	const struct midr_range broken_cval_midrs[] = {
 		/*
 		 * XGene-1 implements CVAL in terms of TVAL, meaning
-		 * that the maximum timer range is 32bit. Shame on them.
+		 * that the maximum timer range is 31bit. Shame on them.
 		 */
 		MIDR_ALL_VERSIONS(MIDR_CPU_MODEL(ARM_CPU_IMP_APM,
 						 APM_CPU_PART_POTENZA)),
@@ -813,8 +813,8 @@  static u64 __arch_timer_check_delta(void
 	};
 
 	if (is_midr_in_range_list(read_cpuid_id(), broken_cval_midrs)) {
-		pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 32bits");
-		return CLOCKSOURCE_MASK(32);
+		pr_warn_once("Broken CNTx_CVAL_EL1, limiting width to 31bits");
+		return CLOCKSOURCE_MASK(31);
 	}
 #endif
 	return CLOCKSOURCE_MASK(arch_counter_get_width());