[v2,0/2] intel_powerclamp: New module parameter

Message ID 20230205025902.2899734-1-srinivas.pandruvada@linux.intel.com
Headers
Series intel_powerclamp: New module parameter |

Message

srinivas pandruvada Feb. 5, 2023, 2:59 a.m. UTC
  Split from the series for powerclamp user of powercap idle-inject.

v2
- Build warnings reported by Rui
- Moved the powerclamp documentation to admin guide folder
- Commit log updated as suggested by Rafael and other code suggestion

Srinivas Pandruvada (2):
  Documentation:admin-guide: Move intel_powerclamp documentation
  thermal/drivers/intel_powerclamp: Add two module parameters

 Documentation/admin-guide/index.rst           |   1 +
 .../thermal/intel_powerclamp.rst              |  22 +++
 Documentation/driver-api/thermal/index.rst    |   1 -
 MAINTAINERS                                   |   1 +
 drivers/thermal/intel/intel_powerclamp.c      | 177 +++++++++++++++---
 5 files changed, 180 insertions(+), 22 deletions(-)
 rename Documentation/{driver-api => admin-guide}/thermal/intel_powerclamp.rst (93%)
  

Comments

Zhang, Rui Feb. 5, 2023, 3:57 p.m. UTC | #1
Hi, Srinivas,

First of all, the previous build error is gone.

Second, I found something strange, which may be related with the
scheduler asym-packing, so CC Ricardo.

The test is done with pm linux-intel branch + this patch series on an
ADL system. cpu0~cpu7 are Pcore cpus, cpu8-cpu15 are Ecore cpus, and
intel_powerclamp is register as cooling_device21.

1. run stress -c 16
2. update /sys/module/intel_powerclamp/parameters/cpumask
   echo 90 > /sys/module/intel_powerclamp/parameters/max_idle
3. echo 90 > /sys/class/thermal/cooling_device21/cur_state
4. echo 0 > /sys/class/thermal/cooling_device21/cur_state
I use turbostat to monitor the CPU Busy% in all 4 steps.

If 'cpumask' does not include all the Ecore CPUs, all CPUs becomes 100%
busy after idle injection removed in step 4.

If 'cpumask' includes all the Ecore CPUs, i.e. cpumask = FFxy, in some
cases, the Ecore CPUs will drop to an Busy% much lower than 10%, and
then they don't come back to busy after idle injection removed in step
4, although we have 16 stress threads. And this also relates with how
long we stay in idle injection.

Say, when cpumask=fff3, the problem can be triggered occasionally if
there is a 10 second timeout between step 3 and step4, but it is much
easier to reproducible if I increase the timeout to 20 seconds.

It seems that Pcore can always pull tasks from Ecores, but Ecore can
not pull tasks from Pcore HT siblings.

thanks,
rui

On Sat, 2023-02-04 at 18:59 -0800, Srinivas Pandruvada wrote:
> Split from the series for powerclamp user of powercap idle-inject.
> 
> v2
> - Build warnings reported by Rui
> - Moved the powerclamp documentation to admin guide folder
> - Commit log updated as suggested by Rafael and other code suggestion
> 
> Srinivas Pandruvada (2):
>   Documentation:admin-guide: Move intel_powerclamp documentation
>   thermal/drivers/intel_powerclamp: Add two module parameters
> 
>  Documentation/admin-guide/index.rst           |   1 +
>  .../thermal/intel_powerclamp.rst              |  22 +++
>  Documentation/driver-api/thermal/index.rst    |   1 -
>  MAINTAINERS                                   |   1 +
>  drivers/thermal/intel/intel_powerclamp.c      | 177 +++++++++++++++-
> --
>  5 files changed, 180 insertions(+), 22 deletions(-)
>  rename Documentation/{driver-api => admin-
> guide}/thermal/intel_powerclamp.rst (93%)
>
  
srinivas pandruvada Feb. 6, 2023, 2:45 a.m. UTC | #2
Hi Rui,

On Sun, 2023-02-05 at 15:57 +0000, Zhang, Rui wrote:
> Hi, Srinivas,
> 
> First of all, the previous build error is gone.
> 
> Second, I found something strange, which may be related with the
> scheduler asym-packing, so CC Ricardo.
> 
I thought you disable ITMT before idle injection and reenebale after
removal.



> The test is done with pm linux-intel branch + this patch series on an
> ADL system.
Can you do test on bleeding edge branch of Linux-pm?

>  cpu0~cpu7 are Pcore cpus, cpu8-cpu15 are Ecore cpus, and
> intel_powerclamp is register as cooling_device21.
> 
> 1. run stress -c 16
> 2. update /sys/module/intel_powerclamp/parameters/cpumask
>    echo 90 > /sys/module/intel_powerclamp/parameters/max_idle
> 3. echo 90 > /sys/class/thermal/cooling_device21/cur_state
> 4. echo 0 > /sys/class/thermal/cooling_device21/cur_state
> I use turbostat to monitor the CPU Busy% in all 4 steps.
> 
> If 'cpumask' does not include all the Ecore CPUs, all CPUs becomes
> 100%
> busy after idle injection removed in step 4.
> 
that should be the case.

> If 'cpumask' includes all the Ecore CPUs, i.e. cpumask = FFxy, in
> some
> cases, the Ecore CPUs will drop to an Busy% much lower than 10%, and
> then they don't come back to busy after idle injection removed in
> step
 Do you see that idle injection is removed message in dmesg?
We can also check powercap idle-inejct, if some CPUs still not wake 
from play_idle.


> 4, although we have 16 stress threads. And this also relates with how
> long we stay in idle injection.
> 
> Say, when cpumask=fff3, the problem can be triggered occasionally if
> there is a 10 second timeout between step 3 and step4, but it is much
> easier to reproducible if I increase the timeout to 20 seconds.
> 
> It seems that Pcore can always pull tasks from Ecores, but Ecore can
> not pull tasks from Pcore HT siblings.
> 
That will be regular load balance threads should do.
Better to try upsteam kernel first.

Thanks,
Srinivas


> thanks,
> rui
> 
> On Sat, 2023-02-04 at 18:59 -0800, Srinivas Pandruvada wrote:
> > Split from the series for powerclamp user of powercap idle-inject.
> > 
> > v2
> > - Build warnings reported by Rui
> > - Moved the powerclamp documentation to admin guide folder
> > - Commit log updated as suggested by Rafael and other code
> > suggestion
> > 
> > Srinivas Pandruvada (2):
> >   Documentation:admin-guide: Move intel_powerclamp documentation
> >   thermal/drivers/intel_powerclamp: Add two module parameters
> > 
> >  Documentation/admin-guide/index.rst           |   1 +
> >  .../thermal/intel_powerclamp.rst              |  22 +++
> >  Documentation/driver-api/thermal/index.rst    |   1 -
> >  MAINTAINERS                                   |   1 +
> >  drivers/thermal/intel/intel_powerclamp.c      | 177
> > +++++++++++++++-
> > --
> >  5 files changed, 180 insertions(+), 22 deletions(-)
> >  rename Documentation/{driver-api => admin-
> > guide}/thermal/intel_powerclamp.rst (93%)
> >
  
Zhang, Rui Feb. 6, 2023, 8:05 a.m. UTC | #3
On Sun, 2023-02-05 at 18:45 -0800, srinivas pandruvada wrote:
> Hi Rui,
> 
> On Sun, 2023-02-05 at 15:57 +0000, Zhang, Rui wrote:
> > Hi, Srinivas,
> > 
> > First of all, the previous build error is gone.
> > 
> > Second, I found something strange, which may be related with the
> > scheduler asym-packing, so CC Ricardo.
> > 
> I thought you disable ITMT before idle injection and reenebale after
> removal.

No.

I can reproduce this by playing with raw intel_powerclamp sysfs knobs
and ITMT enabled.

> 
> 
> 
> > The test is done with pm linux-intel branch

sorry, I mean linux-next branch.

> >  + this patch series on an
> > ADL system.
> Can you do test on bleeding edge branch of Linux-pm?
> 
> >  cpu0~cpu7 are Pcore cpus, cpu8-cpu15 are Ecore cpus, and
> > intel_powerclamp is register as cooling_device21.
> > 
> > 1. run stress -c 16
> > 2. update /sys/module/intel_powerclamp/parameters/cpumask
> >    echo 90 > /sys/module/intel_powerclamp/parameters/max_idle
> > 3. echo 90 > /sys/class/thermal/cooling_device21/cur_state
> > 4. echo 0 > /sys/class/thermal/cooling_device21/cur_state
> > I use turbostat to monitor the CPU Busy% in all 4 steps.
> > 
> > If 'cpumask' does not include all the Ecore CPUs, all CPUs becomes
> > 100%
> > busy after idle injection removed in step 4.
> > 
> that should be the case.
> 
> > If 'cpumask' includes all the Ecore CPUs, i.e. cpumask = FFxy, in
> > some
> > cases, the Ecore CPUs will drop to an Busy% much lower than 10%,
> > and
> > then they don't come back to busy after idle injection removed in
> > step
>  Do you see that idle injection is removed message in dmesg?

yes.

> We can also check powercap idle-inejct, if some CPUs still not wake 
> from play_idle.

"ps" command shows the the idle_injection threads time is not
increasing any more.

> 
> 
> > 4, although we have 16 stress threads. And this also relates with
> > how
> > long we stay in idle injection.
> > 
> > Say, when cpumask=fff3, the problem can be triggered occasionally
> > if
> > there is a 10 second timeout between step 3 and step4, but it is
> > much
> > easier to reproducible if I increase the timeout to 20 seconds.
> > 
> > It seems that Pcore can always pull tasks from Ecores, but Ecore
> > can
> > not pull tasks from Pcore HT siblings.
> > 
> That will be regular load balance threads should do.
> Better to try upsteam kernel first.

I'm already running with linux-pm tree linux-next branch + this patch
series.

thanks,
rui

> 
> Thanks,
> Srinivas
> 
> 
> > thanks,
> > rui
> > 
> > On Sat, 2023-02-04 at 18:59 -0800, Srinivas Pandruvada wrote:
> > > Split from the series for powerclamp user of powercap idle-
> > > inject.
> > > 
> > > v2
> > > - Build warnings reported by Rui
> > > - Moved the powerclamp documentation to admin guide folder
> > > - Commit log updated as suggested by Rafael and other code
> > > suggestion
> > > 
> > > Srinivas Pandruvada (2):
> > >   Documentation:admin-guide: Move intel_powerclamp documentation
> > >   thermal/drivers/intel_powerclamp: Add two module parameters
> > > 
> > >  Documentation/admin-guide/index.rst           |   1 +
> > >  .../thermal/intel_powerclamp.rst              |  22 +++
> > >  Documentation/driver-api/thermal/index.rst    |   1 -
> > >  MAINTAINERS                                   |   1 +
> > >  drivers/thermal/intel/intel_powerclamp.c      | 177
> > > +++++++++++++++-
> > > --
> > >  5 files changed, 180 insertions(+), 22 deletions(-)
> > >  rename Documentation/{driver-api => admin-
> > > guide}/thermal/intel_powerclamp.rst (93%)
> > >
  
srinivas pandruvada Feb. 6, 2023, 10:02 a.m. UTC | #4
On Mon, 2023-02-06 at 08:05 +0000, Zhang, Rui wrote:
> On Sun, 2023-02-05 at 18:45 -0800, srinivas pandruvada wrote:
> > Hi Rui,
> > 
> > On Sun, 2023-02-05 at 15:57 +0000, Zhang, Rui wrote:
> > > Hi, Srinivas,
> > > 
> > > First of all, the previous build error is gone.
> > > 
> > > Second, I found something strange, which may be related with the
> > > scheduler asym-packing, so CC Ricardo.
> > > 
> > I thought you disable ITMT before idle injection and reenebale
> > after
> > removal.
> 
> No.
> 
> I can reproduce this by playing with raw intel_powerclamp sysfs knobs
> and ITMT enabled.
> 

This issue is happening even if ITMT disabled. If the module mask is
composed of P-cores it works or even on servers as expected.
Also if you offline all P-cores then select mask among E-cores, it is
working. Somehow P-core influences E-cores.

Since this patch is module mask related, that is functioning correctly.
We have to debug this interaction with P and E cores separately.

Thanks,
Srinivas


> > 
> > 
> > 
> > > The test is done with pm linux-intel branch
> 
> sorry, I mean linux-next branch.
> 
> > >  + this patch series on an
> > > ADL system.
> > Can you do test on bleeding edge branch of Linux-pm?
> > 
> > >  cpu0~cpu7 are Pcore cpus, cpu8-cpu15 are Ecore cpus, and
> > > intel_powerclamp is register as cooling_device21.
> > > 
> > > 1. run stress -c 16
> > > 2. update /sys/module/intel_powerclamp/parameters/cpumask
> > >    echo 90 > /sys/module/intel_powerclamp/parameters/max_idle
> > > 3. echo 90 > /sys/class/thermal/cooling_device21/cur_state
> > > 4. echo 0 > /sys/class/thermal/cooling_device21/cur_state
> > > I use turbostat to monitor the CPU Busy% in all 4 steps.
> > > 
> > > If 'cpumask' does not include all the Ecore CPUs, all CPUs
> > > becomes
> > > 100%
> > > busy after idle injection removed in step 4.
> > > 
> > that should be the case.
> > 
> > > If 'cpumask' includes all the Ecore CPUs, i.e. cpumask = FFxy, in
> > > some
> > > cases, the Ecore CPUs will drop to an Busy% much lower than 10%,
> > > and
> > > then they don't come back to busy after idle injection removed in
> > > step
> >  Do you see that idle injection is removed message in dmesg?
> 
> yes.
> 
> > We can also check powercap idle-inejct, if some CPUs still not wake
> > from play_idle.
> 
> "ps" command shows the the idle_injection threads time is not
> increasing any more.
> 
> > 
> > 
> > > 4, although we have 16 stress threads. And this also relates with
> > > how
> > > long we stay in idle injection.
> > > 
> > > Say, when cpumask=fff3, the problem can be triggered occasionally
> > > if
> > > there is a 10 second timeout between step 3 and step4, but it is
> > > much
> > > easier to reproducible if I increase the timeout to 20 seconds.
> > > 
> > > It seems that Pcore can always pull tasks from Ecores, but Ecore
> > > can
> > > not pull tasks from Pcore HT siblings.
> > > 
> > That will be regular load balance threads should do.
> > Better to try upsteam kernel first.
> 
> I'm already running with linux-pm tree linux-next branch + this patch
> series.
> 
> thanks,
> rui
> 
> > 
> > Thanks,
> > Srinivas
> > 
> > 
> > > thanks,
> > > rui
> > > 
> > > On Sat, 2023-02-04 at 18:59 -0800, Srinivas Pandruvada wrote:
> > > > Split from the series for powerclamp user of powercap idle-
> > > > inject.
> > > > 
> > > > v2
> > > > - Build warnings reported by Rui
> > > > - Moved the powerclamp documentation to admin guide folder
> > > > - Commit log updated as suggested by Rafael and other code
> > > > suggestion
> > > > 
> > > > Srinivas Pandruvada (2):
> > > >   Documentation:admin-guide: Move intel_powerclamp
> > > > documentation
> > > >   thermal/drivers/intel_powerclamp: Add two module parameters
> > > > 
> > > >  Documentation/admin-guide/index.rst           |   1 +
> > > >  .../thermal/intel_powerclamp.rst              |  22 +++
> > > >  Documentation/driver-api/thermal/index.rst    |   1 -
> > > >  MAINTAINERS                                   |   1 +
> > > >  drivers/thermal/intel/intel_powerclamp.c      | 177
> > > > +++++++++++++++-
> > > > --
> > > >  5 files changed, 180 insertions(+), 22 deletions(-)
> > > >  rename Documentation/{driver-api => admin-
> > > > guide}/thermal/intel_powerclamp.rst (93%)
> > > >
  
Ricardo Neri Feb. 7, 2023, 1:42 p.m. UTC | #5
On Mon, Feb 06, 2023 at 02:02:28AM -0800, srinivas pandruvada wrote:
> On Mon, 2023-02-06 at 08:05 +0000, Zhang, Rui wrote:
> > On Sun, 2023-02-05 at 18:45 -0800, srinivas pandruvada wrote:
> > > Hi Rui,
> > > 
> > > On Sun, 2023-02-05 at 15:57 +0000, Zhang, Rui wrote:
> > > > Hi, Srinivas,
> > > > 
> > > > First of all, the previous build error is gone.
> > > > 
> > > > Second, I found something strange, which may be related with the
> > > > scheduler asym-packing, so CC Ricardo.
> > > > 
> > > I thought you disable ITMT before idle injection and reenebale
> > > after
> > > removal.
> > 
> > No.
> > 
> > I can reproduce this by playing with raw intel_powerclamp sysfs knobs
> > and ITMT enabled.
> > 
> 
> This issue is happening even if ITMT disabled. If the module mask is
> composed of P-cores it works or even on servers as expected.
> Also if you offline all P-cores then select mask among E-cores, it is
> working. Somehow P-core influences E-cores.
> 
> Since this patch is module mask related, that is functioning correctly.
> We have to debug this interaction with P and E cores separately.

Currently, when doing asym_packing, ECores will only pull tasks from a
PCore only if both SMT siblings are busy. It will only pull from the
lower-priority sibling. These patches [1] let ECores pull from either
sibling, if both are busy.

I presume that by injecting idle, the scheduler thinks that the CPU is
idle (i.e., idle_cpu() returns true) and it will not do asym_packing from
lower-priority CPUs.

However, in your experiment you have 16 threads. If a Pcore is overloaded,
an ECore should be able to help.

[1]. https://lore.kernel.org/lkml/20230207045838.11243-1-ricardo.neri-calderon@linux.intel.com/
  
srinivas pandruvada Feb. 7, 2023, 1:51 p.m. UTC | #6
On Tue, 2023-02-07 at 05:42 -0800, Ricardo Neri wrote:
> On Mon, Feb 06, 2023 at 02:02:28AM -0800, srinivas pandruvada wrote:
> > On Mon, 2023-02-06 at 08:05 +0000, Zhang, Rui wrote:
> > > On Sun, 2023-02-05 at 18:45 -0800, srinivas pandruvada wrote:
> > > > Hi Rui,
> > > > 
> > > > On Sun, 2023-02-05 at 15:57 +0000, Zhang, Rui wrote:
> > > > > Hi, Srinivas,
> > > > > 
> > > > > First of all, the previous build error is gone.
> > > > > 
> > > > > Second, I found something strange, which may be related with
> > > > > the
> > > > > scheduler asym-packing, so CC Ricardo.
> > > > > 
> > > > I thought you disable ITMT before idle injection and reenebale
> > > > after
> > > > removal.
> > > 
> > > No.
> > > 
> > > I can reproduce this by playing with raw intel_powerclamp sysfs
> > > knobs
> > > and ITMT enabled.
> > > 
> > 
> > This issue is happening even if ITMT disabled. If the module mask
> > is
> > composed of P-cores it works or even on servers as expected.
> > Also if you offline all P-cores then select mask among E-cores, it
> > is
> > working. Somehow P-core influences E-cores.
> > 
> > Since this patch is module mask related, that is functioning
> > correctly.
> > We have to debug this interaction with P and E cores separately.
> 
> Currently, when doing asym_packing, ECores will only pull tasks from
> a
> PCore only if both SMT siblings are busy. It will only pull from the
> lower-priority sibling. These patches [1] let ECores pull from either
> sibling, if both are busy.
> 
> I presume that by injecting idle, the scheduler thinks that the CPU
> is
> idle (i.e., idle_cpu() returns true) and it will not do asym_packing
> from
> lower-priority CPUs.
> 
> However, in your experiment you have 16 threads. If a Pcore is
> overloaded,
> an ECore should be able to help.
This issue happens with or without ITMT and also without any idle
injection active.

Thanks,
Srinivas


> 
> [1].
> https://lore.kernel.org/lkml/20230207045838.11243-1-ricardo.neri-calderon@linux.intel.com/
>
  
Ricardo Neri Feb. 7, 2023, 7:08 p.m. UTC | #7
On Tue, Feb 07, 2023 at 05:51:59AM -0800, srinivas pandruvada wrote:
> On Tue, 2023-02-07 at 05:42 -0800, Ricardo Neri wrote:
> > On Mon, Feb 06, 2023 at 02:02:28AM -0800, srinivas pandruvada wrote:
> > > On Mon, 2023-02-06 at 08:05 +0000, Zhang, Rui wrote:
> > > > On Sun, 2023-02-05 at 18:45 -0800, srinivas pandruvada wrote:
> > > > > Hi Rui,
> > > > > 
> > > > > On Sun, 2023-02-05 at 15:57 +0000, Zhang, Rui wrote:
> > > > > > Hi, Srinivas,
> > > > > > 
> > > > > > First of all, the previous build error is gone.
> > > > > > 
> > > > > > Second, I found something strange, which may be related with
> > > > > > the
> > > > > > scheduler asym-packing, so CC Ricardo.
> > > > > > 
> > > > > I thought you disable ITMT before idle injection and reenebale
> > > > > after
> > > > > removal.
> > > > 
> > > > No.
> > > > 
> > > > I can reproduce this by playing with raw intel_powerclamp sysfs
> > > > knobs
> > > > and ITMT enabled.
> > > > 
> > > 
> > > This issue is happening even if ITMT disabled. If the module mask
> > > is
> > > composed of P-cores it works or even on servers as expected.
> > > Also if you offline all P-cores then select mask among E-cores, it
> > > is
> > > working. Somehow P-core influences E-cores.
> > > 
> > > Since this patch is module mask related, that is functioning
> > > correctly.
> > > We have to debug this interaction with P and E cores separately.
> > 
> > Currently, when doing asym_packing, ECores will only pull tasks from
> > a
> > PCore only if both SMT siblings are busy. It will only pull from the
> > lower-priority sibling. These patches [1] let ECores pull from either
> > sibling, if both are busy.
> > 
> > I presume that by injecting idle, the scheduler thinks that the CPU
> > is
> > idle (i.e., idle_cpu() returns true) and it will not do asym_packing
> > from
> > lower-priority CPUs.
> > 
> > However, in your experiment you have 16 threads. If a Pcore is
> > overloaded,
> > an ECore should be able to help.
> This issue happens with or without ITMT and also without any idle
> injection active.

I was not able to reproduce this issue on my ADL-S system with ITMT. The
described bug is exactly what and old patchset of mine was supposed to
fix [2]. Maybe the CPU priorities in the failing system are such that it
prevents asym_packing from kicking in.

I was able to reproduce the issue without ITMT.

I had found that the scheduler cannot handle load balance between SMT and
non-SMT cores correctly. My patchset [1] includes fixes for this case. I
applied it on top of Rafael's linux-next branch and it fixed the issue for
me in the non-ITMT case. Perhaps patches 5 and 6 are sufficient, but I
applied the whole series.

Thanks and BR,
Ricardo

[1]. https://lore.kernel.org/lkml/20230207045838.11243-1-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/all/20210911011819.12184-7-ricardo.neri-calderon@linux.intel.com/