[3/3] cpufreq: CPPC: Eliminate the impact of cpc_read() latency error

Message ID 20231025093847.3740104-4-zengheng4@huawei.com
State New
Series Make the cpuinfo_cur_freq interface read correctly

Commit Message

Zeng Heng Oct. 25, 2023, 9:38 a.m. UTC
We have found significant differences in the latency of cpc_read() between
regular scenarios and scenarios with high memory access pressure. Ignoring
this latency error can cause the rate-reading interface to occasionally
return absurd values.

Below is a sample test that generates high memory access pressure with
stress-ng. My local testing platform has 160 CPUs, the CPC registers are
accessed via MMIO, and the cpuidle feature is disabled (the AMU is always
online):

~~~
./stress-ng --memrate 160 --timeout 180
~~~

The following data comes from ftrace traces of cppc_get_perf_ctrs():

              Regular scenarios               ||      High memory access pressure scenarios
104)               |  cppc_get_perf_ctrs() {  ||  133)               |  cppc_get_perf_ctrs() {
104)   0.800 us    |    cpc_read.isra.0();    ||  133)   4.580 us    |    cpc_read.isra.0();
104)   0.640 us    |    cpc_read.isra.0();    ||  133)   7.780 us    |    cpc_read.isra.0();
104)   0.450 us    |    cpc_read.isra.0();    ||  133)   2.550 us    |    cpc_read.isra.0();
104)   0.430 us    |    cpc_read.isra.0();    ||  133)   0.570 us    |    cpc_read.isra.0();
104)   4.610 us    |  }                       ||  133) ! 157.610 us  |  }
104)               |  cppc_get_perf_ctrs() {  ||  133)               |  cppc_get_perf_ctrs() {
104)   0.720 us    |    cpc_read.isra.0();    ||  133)   0.760 us    |    cpc_read.isra.0();
104)   0.720 us    |    cpc_read.isra.0();    ||  133)   4.480 us    |    cpc_read.isra.0();
104)   0.510 us    |    cpc_read.isra.0();    ||  133)   0.520 us    |    cpc_read.isra.0();
104)   0.500 us    |    cpc_read.isra.0();    ||  133) + 10.100 us   |    cpc_read.isra.0();
104)   3.460 us    |  }                       ||  133) ! 120.850 us  |  }
108)               |  cppc_get_perf_ctrs() {  ||   87)               |  cppc_get_perf_ctrs() {
108)   0.820 us    |    cpc_read.isra.0();    ||   87) ! 255.200 us  |    cpc_read.isra.0();
108)   0.850 us    |    cpc_read.isra.0();    ||   87)   2.910 us    |    cpc_read.isra.0();
108)   0.590 us    |    cpc_read.isra.0();    ||   87)   5.160 us    |    cpc_read.isra.0();
108)   0.610 us    |    cpc_read.isra.0();    ||   87)   4.340 us    |    cpc_read.isra.0();
108)   5.080 us    |  }                       ||   87) ! 315.790 us  |  }
108)               |  cppc_get_perf_ctrs() {  ||   87)               |  cppc_get_perf_ctrs() {
108)   0.630 us    |    cpc_read.isra.0();    ||   87)   0.800 us    |    cpc_read.isra.0();
108)   0.630 us    |    cpc_read.isra.0();    ||   87)   6.310 us    |    cpc_read.isra.0();
108)   0.420 us    |    cpc_read.isra.0();    ||   87)   1.190 us    |    cpc_read.isra.0();
108)   0.430 us    |    cpc_read.isra.0();    ||   87) + 11.620 us   |    cpc_read.isra.0();
108)   3.780 us    |  }                       ||   87) ! 207.010 us  |  }

My local testing platform runs at 3000000 kHz (3 GHz), but the
cpuinfo_cur_freq interface returns values that are not even close to the
actual frequency:

[root@localhost ~]# cd /sys/devices/system/cpu
[root@localhost cpu]# for i in {0..159}; do cat cpu$i/cpufreq/cpuinfo_cur_freq; done
5127812
2952127
3069001
3496183
922989768
2419194
3427042
2331869
3594611
8238499
...

The reason is that, under heavy memory access pressure, the execution time
of cpc_read() increases from sub-microsecond to several hundred
microseconds. Moving the cpc_read() calls into a critical section with IRQs
disabled has minimal impact on the result.

  cppc_get_perf_ctrs()[0]                    cppc_get_perf_ctrs()[1]
/                    \                      /                      \
cpc_read         cpc_read                  cpc_read            cpc_read
 ref[0]        delivered[0]                 ref[1]            delivered[1]
    |              |                           |                    |
    v              v                           v                    v
-----------------------------------------------------------------------> time
     <--delta[0]--> <------sample_period------> <-----delta[1]----->

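Since
  freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])
and
  delivered[1] - delivered[0] = freq * (delta[1] + sample_period),
  ref[1] - ref[0] = ref_freq * (delta[0] + sample_period),
the reported value equals the actual frequency scaled by
(delta[1] + sample_period) / (delta[0] + sample_period), so it is accurate
only when delta[0] and delta[1] are equal or negligible next to
sample_period.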
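
The relations above can be turned into a small numeric sketch. The helper
below is hypothetical (it is not part of cppc_cpufreq.c) and assumes all
times are given in nanoseconds; it only shows how unequal read latencies
skew the reported frequency for a given sampling period.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Frequency that the two-sample formula would report, given the actual
 * frequency and the read latencies from the diagram (all times in ns).
 * Hypothetical helper for illustration; not part of cppc_cpufreq.c.
 */
static uint64_t reported_freq_khz(uint64_t actual_khz, uint64_t delta0_ns,
				  uint64_t delta1_ns, uint64_t sample_period_ns)
{
	/* ref[1] - ref[0] spans delta[0] + sample_period */
	uint64_t ref_window = delta0_ns + sample_period_ns;
	/* delivered[1] - delivered[0] spans sample_period + delta[1] */
	uint64_t delivered_window = sample_period_ns + delta1_ns;

	assert(ref_window != 0);
	return actual_khz * delivered_window / ref_window;
}
```

With assumed inputs of 3000000 kHz, delta[0] = 500 ns and delta[1] = 255000
ns, a 2000 ns sampling period yields 308400000, while a 1000000 ns period
yields 3763118, which is much closer to the actual value.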

A sampling period of 2us is far from sufficient to hide the system memory
access latency. Consequently, we suggest that cppc_cpufreq_get_rate() only
be called in process context, and that it adopt a longer sampling period to
neutralize the impact of random latency.

Here we call cond_resched() instead of a sleep-like function, so that the
CPU never schedules to the idle task (where the AMU counters would stop)
and `taskset -c $i cat cpu$i/cpufreq/cpuinfo_cur_freq` still works when the
cpuidle feature is enabled.
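
As a rough userspace analogue of that choice (an assumption for
illustration, not kernel code), the sketch below waits for a period without
ever sleeping: the task stays runnable the whole time but yields the CPU to
other runnable tasks. The names now_ns() and yielding_wait_ns() are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;
	int ret = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(ret == 0);
	(void)ret;
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Userspace analogue (an assumption, not kernel code) of the patch's
 * cond_resched() loop: wait for period_ns without ever sleeping, so the
 * task stays runnable, while still letting other runnable tasks in.
 * Returns the elapsed time actually measured.
 */
static uint64_t yielding_wait_ns(uint64_t period_ns)
{
	uint64_t start = now_ns();
	uint64_t elapsed;

	while ((elapsed = now_ns() - start) < period_ns)
		sched_yield();

	return elapsed;
}
```

A sleep-based wait would instead block the task, which is the behaviour the
patch avoids on the sampled CPU.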

Reported-by: Yang Shi <yang@os.amperecomputing.com>
Link: https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/cpufreq/cppc_cpufreq.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)
  

Comments

Mark Rutland Oct. 25, 2023, 11:01 a.m. UTC | #1
On Wed, Oct 25, 2023 at 05:38:47PM +0800, Zeng Heng wrote:
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 321a9dc9484d..a7c5418bcda7 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -851,12 +851,26 @@ static int cppc_get_perf_ctrs_pair(void *val)

The previous patch added this function, and calls it with smp_call_on_cpu(),
where it'll run in IRQ context with IRQs disabled...

>  	struct fb_ctr_pair *fb_ctrs = val;
>  	int cpu = fb_ctrs->cpu;
>  	int ret;
> +	unsigned long timeout;
>  
>  	ret = cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t0);
>  	if (ret)
>  		return ret;
>  
> -	udelay(2); /* 2usec delay between sampling */
> +	if (likely(!irqs_disabled())) {
> +		/*
> +		 * Set 1ms as sampling interval, but never schedule
> +		 * to the idle task to prevent the AMU counters from
> +		 * stopping working.
> +		 */
> +		timeout = jiffies + msecs_to_jiffies(1);
> +		while (!time_after(jiffies, timeout))
> +			cond_resched();
> +
> +	} else {

... so we'll enter this branch of the if-else ...

> +		pr_warn_once("CPU%d: Get rate in atomic context", cpu);

... and pr_warn_once() for something that's apparently normal and outside of
the user's control?

That doesn't make much sense to me.

Mark.

> +		udelay(2); /* 2usec delay between sampling */
> +	}
>  
>  	return cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t1);
>  }
> -- 
> 2.25.1
>
  
Zeng Heng Oct. 26, 2023, 1:55 a.m. UTC | #2
On 2023/10/25 19:01, Mark Rutland wrote:
> On Wed, Oct 25, 2023 at 05:38:47PM +0800, Zeng Heng wrote:
>
> The previous patch added this function, and calls it with smp_call_on_cpu(),
> where it'll run in IRQ context with IRQs disabled...

smp_call_on_cpu() queues the work on a worker bound to the target CPU.

So this function is called in task context, and IRQs are certainly enabled.


Zeng Heng

  
Mark Rutland Oct. 26, 2023, 11:26 a.m. UTC | #3
On Thu, Oct 26, 2023 at 09:55:39AM +0800, Zeng Heng wrote:
> 
> On 2023/10/25 19:01, Mark Rutland wrote:
> > On Wed, Oct 25, 2023 at 05:38:47PM +0800, Zeng Heng wrote:
> > 
> > The previous patch added this function, and calls it with smp_call_on_cpu(),
> > where it'll run in IRQ context with IRQs disabled...
> 
> smp_call_on_cpu() puts the work to the bind-cpu worker.

Ah, sorry -- I had confused this with the smp_call_function*() family, which do
this in IRQ context.

> And this function will be called in task context, and IRQs is certainly enabled.

Understood; given that, please ignore my comments below.

Mark.

  

Patch

diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 321a9dc9484d..a7c5418bcda7 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -851,12 +851,26 @@  static int cppc_get_perf_ctrs_pair(void *val)
 	struct fb_ctr_pair *fb_ctrs = val;
 	int cpu = fb_ctrs->cpu;
 	int ret;
+	unsigned long timeout;
 
 	ret = cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t0);
 	if (ret)
 		return ret;
 
-	udelay(2); /* 2usec delay between sampling */
+	if (likely(!irqs_disabled())) {
+		/*
+		 * Set 1ms as sampling interval, but never schedule
+		 * to the idle task to prevent the AMU counters from
+		 * stopping working.
+		 */
+		timeout = jiffies + msecs_to_jiffies(1);
+		while (!time_after(jiffies, timeout))
+			cond_resched();
+
+	} else {
+		pr_warn_once("CPU%d: Get rate in atomic context", cpu);
+		udelay(2); /* 2usec delay between sampling */
+	}
 
 	return cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t1);
 }
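
The wait loop in the patch relies on time_after() staying correct when
jiffies wraps around. The sketch below reproduces that comparison in
userspace; time_after_ul() and sampling_done() are hypothetical names, and
the kernel macro additionally type-checks its arguments.

```c
#include <assert.h>
#include <limits.h>

/*
 * Userspace copy of the comparison behind the kernel's time_after(a, b):
 * true when a is later than b, even if the counter wrapped in between.
 */
static int time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Mirrors the patch's loop condition: the wait ends once "now" is
 * strictly later than start + ticks. The addition may wrap; the
 * comparison stays valid as long as ticks is below half the range.
 */
static int sampling_done(unsigned long now, unsigned long start,
			 unsigned long ticks)
{
	unsigned long timeout = start + ticks;

	assert(ticks <= LONG_MAX);
	return time_after_ul(now, timeout);
}
```

Because the comparison is strict, the loop keeps running while jiffies
equals the timeout and exits on the following tick.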