[v1,0/3] Avoid scheduling cache draining to isolated cpus

Message ID 20221102020243.522358-1-leobras@redhat.com

Message

Leonardo Bras Soares Passos Nov. 2, 2022, 2:02 a.m. UTC
Patch #1 expands housekeeping_any_cpu() so we can find housekeeping cpus
closer (NUMA) to any desired CPU, instead of only the current CPU.

### Performance argument that motivated the change:
One could ask why this is needed: the current CPU is probably accessing the
current cacheline, so picking a CPU close to the current one should always
be the best choice, since the cache invalidation will take less time. OTOH,
there can be cases like this one, which uses perCPU variables, where up to
3 different CPUs touch the cacheline:

C1 - Isolated CPU: The perCPU data 'belongs' to this one
C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
C3 - Housekeeping CPU: This one will do the work

Most of the time the cacheline is touched, it should be by C1. Sometimes
C2 will schedule work to run on C3, since C1 is isolated.

If C1 and C2 are in different NUMA nodes, we could have C3 either in
C2's NUMA node (housekeeping_any_cpu()) or in C1's NUMA node
(housekeeping_any_cpu_from(C1)).

If C3 is in C2's NUMA node, there will be a faster invalidation when C3
tries to get cacheline exclusivity, and then a slower invalidation when
C1 does the same in order to work on its data.

If C3 is in C1's NUMA node, there will be a slower invalidation when C3
tries to get cacheline exclusivity, and then a faster invalidation when
C1 does the same.

The thing is: it should be better to wait less when doing kernel work
on an isolated CPU, even at the cost of some housekeeping CPU waiting
a few more cycles.
###
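
To make the C1/C2/C3 argument concrete, here is a minimal userspace sketch
(not the kernel implementation) of picking a housekeeping CPU on the same
NUMA node as a given reference CPU. The topology table cpu_node[], the
housekeeping[] flags and the name pick_housekeeping_near() are all
hypothetical and exist only for illustration; the series itself adds
housekeeping_any_cpu_from() for this purpose.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static const int cpu_node[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Hypothetical isolation setup: only CPUs 0 and 4 do housekeeping. */
static const bool housekeeping[NR_CPUS] = {
	true, false, false, false, true, false, false, false
};

/*
 * Return a housekeeping CPU on the same NUMA node as @ref, falling back
 * to the first housekeeping CPU anywhere, or -1 if there is none.
 */
static int pick_housekeeping_near(int ref)
{
	int cpu, fallback = -1;

	assert(ref >= 0 && ref < NR_CPUS);

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!housekeeping[cpu])
			continue;
		if (cpu_node[cpu] == cpu_node[ref])
			return cpu;	/* same node as the reference CPU */
		if (fallback < 0)
			fallback = cpu;	/* remember a remote candidate */
	}
	return fallback;
}
```

With C1 = CPU 6 (isolated, node 1) and C2 = CPU 1 (node 0), passing C2 as
the reference selects CPU 0, while passing C1 selects CPU 4, which is the
placement the argument above prefers.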

Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
local_lock to spinlock, so it can later be used to do remote percpu
cache draining in patch #3. Most performance concerns should be pointed
out in the commit log.
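
As a rough illustration of why a lock that any CPU can take is needed, the
sketch below models a per-CPU stock in userspace. It is an assumption-laden
model, not the mm/memcontrol.c code: a C11 atomic_flag stands in for
spinlock_t, and struct stock, stocks_init(), stock_lock(), stock_unlock()
and drain_stock_remote() are hypothetical names. The point is only that the
owner's fast path and a drain issued from another CPU serialize on the same
per-stock lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical stand-in for the per-CPU memcg stock. */
struct stock {
	atomic_flag lock;	/* stand-in for spinlock_t stock_lock */
	unsigned int nr_pages;	/* pre-charged pages cached for this CPU */
};

static struct stock stocks[NR_CPUS];

static void stocks_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		atomic_flag_clear(&stocks[cpu].lock);
		stocks[cpu].nr_pages = 0;
	}
}

static void stock_lock(struct stock *s)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;
}

static void stock_unlock(struct stock *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Fast path on the owning CPU: take pages from its stock if possible. */
static bool consume_stock(int cpu, unsigned int nr_pages)
{
	struct stock *s;
	bool ok = false;

	assert(cpu >= 0 && cpu < NR_CPUS);
	s = &stocks[cpu];

	stock_lock(s);
	if (s->nr_pages >= nr_pages) {
		s->nr_pages -= nr_pages;
		ok = true;
	}
	stock_unlock(s);
	return ok;
}

/* May be called from any CPU: empty @cpu's stock, return what was cached. */
static unsigned int drain_stock_remote(int cpu)
{
	struct stock *s;
	unsigned int drained;

	assert(cpu >= 0 && cpu < NR_CPUS);
	s = &stocks[cpu];

	stock_lock(s);
	drained = s->nr_pages;
	s->nr_pages = 0;
	stock_unlock(s);
	return drained;
}
```

In this model the cost on the owning CPU is one lock/unlock pair per fast
path call, plus whatever time it spins if a remote drain holds the lock at
that moment.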

Patch #3 implements the remote per-CPU cache drain, making use of both
patches #1 and #2. Performance-wise, in non-isolated scenarios, it should
introduce an extra function call and a single test to check if the CPU is
isolated.

On scenarios with isolation enabled on boot, it will also introduce an
extra test to check in the cpumask if the CPU is isolated. If it is,
there will also be an extra read of the cpumask to look for a
housekeeping CPU.
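
The decision described above can be sketched as follows. This is a
simplified userspace model under stated assumptions, not the patch itself:
cpu_isolated[], any_housekeeping_cpu() and drain_work_target() are
hypothetical names, and the housekeeping lookup just returns the first
non-isolated CPU instead of a NUMA-aware choice.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical setup: CPUs 0-1 are housekeeping, CPUs 2-3 are isolated. */
static const bool cpu_isolated[NR_CPUS] = { false, false, true, true };

/* Stand-in for the housekeeping lookup: first non-isolated CPU, or -1. */
static int any_housekeeping_cpu(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (!cpu_isolated[cpu])
			return cpu;
	return -1;
}

/*
 * Decide which CPU should run the drain of @cpu's stock: the CPU itself
 * when it is not isolated, a housekeeping CPU otherwise.
 */
static int drain_work_target(int cpu)
{
	if (!cpu_isolated[cpu])		/* extra test on every drain request */
		return cpu;
	return any_housekeeping_cpu();	/* extra lookup, isolated case only */
}
```

For a housekeeping CPU the only added cost in this model is the isolation
test; the lookup is paid only when the stock belongs to an isolated CPU.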

Please provide any feedback on that!
Thanks a lot for reading!

Leonardo Bras (3):
  sched/isolation: Add housekeeping_any_cpu_from()
  mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t
  mm/memcontrol: Add drain_remote_stock(), avoid drain_stock on isolated
    cpus

 include/linux/sched/isolation.h | 11 +++--
 kernel/sched/isolation.c        |  8 ++--
 mm/memcontrol.c                 | 83 ++++++++++++++++++++++-----------
 3 files changed, 69 insertions(+), 33 deletions(-)
  

Comments

Michal Hocko Nov. 2, 2022, 8:53 a.m. UTC | #1
On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> Patch #1 expands housekeepíng_any_cpu() so we can find housekeeping cpus
> closer (NUMA) to any desired CPU, instead of only the current CPU.
> 
> ### Performance argument that motivated the change:
> There could be an argument of why would that be needed, since the current
> CPU is probably acessing the current cacheline, and so having a CPU closer
> to the current one is always the best choice since the cache invalidation
> will take less time. OTOH, there could be cases like this which uses
> perCPU variables, and we can have up to 3 different CPUs touching the
> cacheline:
> 
> C1 - Isolated CPU: The perCPU data 'belongs' to this one
> C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> C3 - Housekeeping CPU: This one will do the work
> 
> Most of the times the cacheline is touched, it should be by C1. Some times
> a C2 will schedule work to run on C3, since C1 is isolated.
> 
> If C1 and C2 are in different NUMA nodes, we could have C3 either in
> C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node 
> (housekeeping_any_cpu_from(C1). 
> 
> If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> tries to get cacheline exclusivity, and then a slower invalidation when
> this happens in C1, when it's working in its data.
> 
> If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> tries to get cacheline exclusivity, and then a faster invalidation when
> this happens in C1.
> 
> The thing is: it should be better to wait less when doing kernel work
> on an isolated CPU, even at the cost of some housekeeping CPU waiting
> a few more cycles.
> ###
> 
> Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> local_lock to spinlocks, so it can be later used to do remote percpu
> cache draining on patch #3. Most performance concerns should be pointed
> in the commit log.
> 
> Patch #3 implements the remote per-CPU cache drain, making use of both 
> patches #2 and #3. Performance-wise, in non-isolated scenarios, it should
> introduce an extra function call and a single test to check if the CPU is
> isolated. 
> 
> On scenarios with isolation enabled on boot, it will also introduce an
> extra test to check in the cpumask if the CPU is isolated. If it is,
> there will also be an extra read of the cpumask to look for a
> housekeeping CPU.

This is a rather deep dive into the cache line usage, but the most
important thing is really missing. Why do we want this change? From the
context it seems that this is an actual fix for an isolcpus= setup where
remote (aka non-isolated) activity interferes with isolated cpus by
scheduling pcp charge cache draining on those cpus.

Is this understanding correct?

If yes, how big of a problem is that? If you want remote draining then
you need some sort of locking (currently we rely on a local lock). How
come this locking is not going to cause a different form of disturbance?
  
Leonardo Bras Soares Passos Nov. 3, 2022, 2:59 p.m. UTC | #2
On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > Patch #1 expands housekeepíng_any_cpu() so we can find housekeeping cpus
> > closer (NUMA) to any desired CPU, instead of only the current CPU.
> > 
> > ### Performance argument that motivated the change:
> > There could be an argument of why would that be needed, since the current
> > CPU is probably acessing the current cacheline, and so having a CPU closer
> > to the current one is always the best choice since the cache invalidation
> > will take less time. OTOH, there could be cases like this which uses
> > perCPU variables, and we can have up to 3 different CPUs touching the
> > cacheline:
> > 
> > C1 - Isolated CPU: The perCPU data 'belongs' to this one
> > C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> > C3 - Housekeeping CPU: This one will do the work
> > 
> > Most of the times the cacheline is touched, it should be by C1. Some times
> > a C2 will schedule work to run on C3, since C1 is isolated.
> > 
> > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node 
> > (housekeeping_any_cpu_from(C1). 
> > 
> > If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> > tries to get cacheline exclusivity, and then a slower invalidation when
> > this happens in C1, when it's working in its data.
> > 
> > If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> > tries to get cacheline exclusivity, and then a faster invalidation when
> > this happens in C1.
> > 
> > The thing is: it should be better to wait less when doing kernel work
> > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > a few more cycles.
> > ###
> > 
> > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > local_lock to spinlocks, so it can be later used to do remote percpu
> > cache draining on patch #3. Most performance concerns should be pointed
> > in the commit log.
> > 
> > Patch #3 implements the remote per-CPU cache drain, making use of both 
> > patches #2 and #3. Performance-wise, in non-isolated scenarios, it should
> > introduce an extra function call and a single test to check if the CPU is
> > isolated. 
> > 
> > On scenarios with isolation enabled on boot, it will also introduce an
> > extra test to check in the cpumask if the CPU is isolated. If it is,
> > there will also be an extra read of the cpumask to look for a
> > housekeeping CPU.
> 

Hello Michal, thanks for reviewing!

> This is a rather deep dive in the cache line usage but the most
> important thing is really missing. Why do we want this change? From the
> context it seems that this is an actual fix for isolcpu= setup when
> remote (aka non isolated activity) interferes with isolated cpus by
> scheduling pcp charge caches on those cpus.
> 
> Is this understanding correct?

That's correct! The idea is to avoid scheduling work to isolated CPUs.

> If yes, how big of a problem that is?

The use case I have been following requires both isolcpus= and PREEMPT_RT, since
the isolated CPUs will be running a real-time workload. In this scenario,
running any other work in place of the real-time workload may cause the system
to miss a deadline, which can be bad.

>  If you want a remote draining then
> you need some sort of locking (currently we rely on local lock). How
> come this locking is not going to cause a different form of disturbance?

If I did everything right, most of the extra work should be done either in non-
isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp charge caches
will be happening on a housekeeping CPU, and the locking cost should be paid
there as we want to avoid doing that in the isolated CPUs. 

I understand there will be a locking cost paid on the isolated CPUs when:
a) the isolated CPU is requesting the stock drain,
b) the isolated CPU does a syscall and ends up using the protected structure
for the first time after a remote drain.

Both (a) and (b) should happen during a syscall, and IIUC an RT workload
should not expect syscalls to have a predictable duration, so it should be
fine.

Thanks for helping me explain the case!
Best regards,
Leo
  
Michal Hocko Nov. 3, 2022, 3:31 p.m. UTC | #3
On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> > On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > > Patch #1 expands housekeepíng_any_cpu() so we can find housekeeping cpus
> > > closer (NUMA) to any desired CPU, instead of only the current CPU.
> > > 
> > > ### Performance argument that motivated the change:
> > > There could be an argument of why would that be needed, since the current
> > > CPU is probably acessing the current cacheline, and so having a CPU closer
> > > to the current one is always the best choice since the cache invalidation
> > > will take less time. OTOH, there could be cases like this which uses
> > > perCPU variables, and we can have up to 3 different CPUs touching the
> > > cacheline:
> > > 
> > > C1 - Isolated CPU: The perCPU data 'belongs' to this one
> > > C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> > > C3 - Housekeeping CPU: This one will do the work
> > > 
> > > Most of the times the cacheline is touched, it should be by C1. Some times
> > > a C2 will schedule work to run on C3, since C1 is isolated.
> > > 
> > > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > > C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node 
> > > (housekeeping_any_cpu_from(C1). 
> > > 
> > > If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> > > tries to get cacheline exclusivity, and then a slower invalidation when
> > > this happens in C1, when it's working in its data.
> > > 
> > > If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> > > tries to get cacheline exclusivity, and then a faster invalidation when
> > > this happens in C1.
> > > 
> > > The thing is: it should be better to wait less when doing kernel work
> > > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > > a few more cycles.
> > > ###
> > > 
> > > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > > local_lock to spinlocks, so it can be later used to do remote percpu
> > > cache draining on patch #3. Most performance concerns should be pointed
> > > in the commit log.
> > > 
> > > Patch #3 implements the remote per-CPU cache drain, making use of both 
> > > patches #2 and #3. Performance-wise, in non-isolated scenarios, it should
> > > introduce an extra function call and a single test to check if the CPU is
> > > isolated. 
> > > 
> > > On scenarios with isolation enabled on boot, it will also introduce an
> > > extra test to check in the cpumask if the CPU is isolated. If it is,
> > > there will also be an extra read of the cpumask to look for a
> > > housekeeping CPU.
> > 
> 
> Hello Michael, thanks for reviewing!
> 
> > This is a rather deep dive in the cache line usage but the most
> > important thing is really missing. Why do we want this change? From the
> > context it seems that this is an actual fix for isolcpu= setup when
> > remote (aka non isolated activity) interferes with isolated cpus by
> > scheduling pcp charge caches on those cpus.
> > 
> > Is this understanding correct?
> 
> That's correct! The idea is to avoid scheduling work to isolated CPUs.
> 
> > If yes, how big of a problem that is?
> 
> The use case I have been following requires both isolcpus= and PREEMPT_RT, since
> the isolated CPUs will be running a real-time workload. In this scenario,
> getting any work done instead of the real-time workload may cause the system to
> miss a deadline, which can be bad. 

OK, I see. But is memcg charging actually a RT friendly operation in the
first place? Please note that this path can trigger memory reclaim and
that is when any RT expectations are simply going down the drain.

> >  If you want a remote draining then
> > you need some sort of locking (currently we rely on local lock). How
> > come this locking is not going to cause a different form of disturbance?
> 
> If I did everything right, most of the extra work should be done either in non-
> isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp charge caches
> will be happening on a housekeeping CPU, and the locking cost should be paid
> there as we want to avoid doing that in the isolated CPUs. 
> 
> I understand there will be a locking cost being paid in the isolated CPUs when:
> a) The isolated CPU is requesting the stock drain,
> b) When the isolated CPUs do a syscall and end up using the protected structure
> the first time after a remote drain.

And anytime the charging path (consume_stock resp. refill_stock)
contends with the remote draining, which is out of the control of the RT
task. It is true that the RT kernel will turn that spin lock into a
sleeping RT lock, and that could help with potential priority inversions,
but I would expect it to still be quite costly.

> Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> should not expect the syscalls to be have a predictable time, so it should be
> fine.

Now I am not sure I understand. If you do not consider the charging path
to be RT sensitive, then why is this needed in the first place? What else
would be populating the pcp cache on the isolated cpu? IRQs?
  
Leonardo Bras Soares Passos Nov. 3, 2022, 4:53 p.m. UTC | #4
On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> > On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> > > On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > > > Patch #1 expands housekeepíng_any_cpu() so we can find housekeeping cpus
> > > > closer (NUMA) to any desired CPU, instead of only the current CPU.
> > > > 
> > > > ### Performance argument that motivated the change:
> > > > There could be an argument of why would that be needed, since the current
> > > > CPU is probably acessing the current cacheline, and so having a CPU closer
> > > > to the current one is always the best choice since the cache invalidation
> > > > will take less time. OTOH, there could be cases like this which uses
> > > > perCPU variables, and we can have up to 3 different CPUs touching the
> > > > cacheline:
> > > > 
> > > > C1 - Isolated CPU: The perCPU data 'belongs' to this one
> > > > C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> > > > C3 - Housekeeping CPU: This one will do the work
> > > > 
> > > > Most of the times the cacheline is touched, it should be by C1. Some times
> > > > a C2 will schedule work to run on C3, since C1 is isolated.
> > > > 
> > > > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > > > C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node 
> > > > (housekeeping_any_cpu_from(C1). 
> > > > 
> > > > If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> > > > tries to get cacheline exclusivity, and then a slower invalidation when
> > > > this happens in C1, when it's working in its data.
> > > > 
> > > > If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> > > > tries to get cacheline exclusivity, and then a faster invalidation when
> > > > this happens in C1.
> > > > 
> > > > The thing is: it should be better to wait less when doing kernel work
> > > > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > > > a few more cycles.
> > > > ###
> > > > 
> > > > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > > > local_lock to spinlocks, so it can be later used to do remote percpu
> > > > cache draining on patch #3. Most performance concerns should be pointed
> > > > in the commit log.
> > > > 
> > > > Patch #3 implements the remote per-CPU cache drain, making use of both 
> > > > patches #2 and #3. Performance-wise, in non-isolated scenarios, it should
> > > > introduce an extra function call and a single test to check if the CPU is
> > > > isolated. 
> > > > 
> > > > On scenarios with isolation enabled on boot, it will also introduce an
> > > > extra test to check in the cpumask if the CPU is isolated. If it is,
> > > > there will also be an extra read of the cpumask to look for a
> > > > housekeeping CPU.
> > > 
> > 
> > Hello Michael, thanks for reviewing!
> > 
> > > This is a rather deep dive in the cache line usage but the most
> > > important thing is really missing. Why do we want this change? From the
> > > context it seems that this is an actual fix for isolcpu= setup when
> > > remote (aka non isolated activity) interferes with isolated cpus by
> > > scheduling pcp charge caches on those cpus.
> > > 
> > > Is this understanding correct?
> > 
> > That's correct! The idea is to avoid scheduling work to isolated CPUs.
> > 
> > > If yes, how big of a problem that is?
> > 
> > The use case I have been following requires both isolcpus= and PREEMPT_RT, since
> > the isolated CPUs will be running a real-time workload. In this scenario,
> > getting any work done instead of the real-time workload may cause the system to
> > miss a deadline, which can be bad. 
> 
> OK, I see. But is memcg charging actually a RT friendly operation in the
> first place? Please note that this path can trigger memory reclaim and
> that is when any RT expectations are simply going down the drain.

I understand the time spent charging is unpredictable, as you said, since a
lot of slow stuff may or may not happen.

> 
> > >  If you want a remote draining then
> > > you need some sort of locking (currently we rely on local lock). How
> > > come this locking is not going to cause a different form of disturbance?
> > 
> > If I did everything right, most of the extra work should be done either in non-
> > isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp charge caches
> > will be happening on a housekeeping CPU, and the locking cost should be paid
> > there as we want to avoid doing that in the isolated CPUs. 

Sorry, I think this caused a misunderstanding: I meant "the pcp charge cache
drain will be happening on a housekeeping CPU, ..."

> > 
> > I understand there will be a locking cost being paid in the isolated CPUs when:
> > a) The isolated CPU is requesting the stock drain,
> > b) When the isolated CPUs do a syscall and end up using the protected structure
> > the first time after a remote drain.
> 
> And anytime the charging path (consume_stock resp. refill_stock)
> contends with the remote draining which is out of control of the RT
> task. It is true that the RT kernel will turn that spin lock into a
> sleeping RT lock and that could help with potential priority inversions
> but still quite costly thing I would expect.
> 
> > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > should not expect the syscalls to be have a predictable time, so it should be
> > fine.
> 
> Now I am not sure I understand. If you do not consider charging path to
> be RT sensitive then why is this needed in the first place? What else
> would be populating the pcp cache on the isolated cpu? IRQs?

I am mostly trying to deal with drain_all_stock() calling schedule_work_on() for
isolated CPUs. Since the scheduled drain_local_stock() will compete for CPU
time with the RT workload, it can preempt the RT workload, which is a
problem for meeting the deadlines.

One way I thought to solve that was introducing a remote drain, which would
require a different strategy for locking, since not all accesses to the pcp
caches would happen on a local CPU. 

Then I tried to weigh the costs of this, so the solution would introduce as
little overhead as possible in no-isolation scenarios. Also, for isolation
scenarios, I tried to put most of the overhead on the housekeeping CPUs, and
the rest on the syscalls, which are also expected to be unpredictable.

Not sure if I could answer your question, though. Please let me know in case I
missed anything.

Thanks for helping me make it more clear!
Best regards,
Leo
  
Michal Hocko Nov. 4, 2022, 8:41 a.m. UTC | #5
On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
[...]
> > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > a) The isolated CPU is requesting the stock drain,
> > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > the first time after a remote drain.
> > 
> > And anytime the charging path (consume_stock resp. refill_stock)
> > contends with the remote draining which is out of control of the RT
> > task. It is true that the RT kernel will turn that spin lock into a
> > sleeping RT lock and that could help with potential priority inversions
> > but still quite costly thing I would expect.
> > 
> > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > > should not expect the syscalls to be have a predictable time, so it should be
> > > fine.
> > 
> > Now I am not sure I understand. If you do not consider charging path to
> > be RT sensitive then why is this needed in the first place? What else
> > would be populating the pcp cache on the isolated cpu? IRQs?
> 
> I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> time with the RT workload, we can have preemption of the RT workload, which is a
> problem for meeting the deadlines.

Yes, this is understood. But it is not really clear to me why any
draining would be necessary for such an isolated CPU if no workload other
than the RT one (which presumably doesn't charge any memory?) is running
on that CPU. Is it the RT task during its initialization phase that leaves
that cache behind, or something else? Sorry for being so focused on this,
but I would like to understand whether this is avoidable by a different
startup scheme or whether it really needs to be addressed in some way.

> One way I thought to solve that was introducing a remote drain, which would
> require a different strategy for locking, since not all accesses to the pcp
> caches would happen on a local CPU. 

Yeah, I am not super happy about an additional spin lock TBH. One
potential way to go would be to completely avoid the pcp cache for isolated
CPUs. That would have some performance impact of course, but on the other
hand it would give more predictable behavior for those CPUs, which
sounds like a reasonable compromise to me. What do you think?
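
A minimal userspace model of this alternative, for illustration only: the
names consume_stock_model() and try_charge_model(), the cpu_isolated[]
table and the slow-path counter are hypothetical, and the refill that a
real slow path would do on non-isolated CPUs is left out. The fast path
simply refuses to serve isolated CPUs, so nothing is ever cached there and
there is nothing to drain.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical setup: CPUs 0-1 are housekeeping, CPUs 2-3 are isolated. */
static const bool cpu_isolated[NR_CPUS] = { false, false, true, true };

static unsigned int stock_pages[NR_CPUS];	/* per-CPU pre-charged pages */
static unsigned int slow_path_charges;		/* trips through the slow path */

/* Fast path: succeeds only on non-isolated CPUs with enough cached pages. */
static bool consume_stock_model(int cpu, unsigned int nr_pages)
{
	if (cpu_isolated[cpu])
		return false;	/* isolated CPUs never use a stock */
	if (stock_pages[cpu] < nr_pages)
		return false;
	stock_pages[cpu] -= nr_pages;
	return true;
}

/* Charge @nr_pages from @cpu; returns true if the fast path was taken. */
static bool try_charge_model(int cpu, unsigned int nr_pages)
{
	if (consume_stock_model(cpu, nr_pages))
		return true;
	slow_path_charges++;	/* stand-in for charging the shared counters */
	return false;
}
```

In this model every charge from an isolated CPU pays the slow path cost,
which is the performance impact mentioned above, in exchange for never
needing a drain on that CPU.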
  
Leonardo Bras Soares Passos Nov. 5, 2022, 1:45 a.m. UTC | #6
On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> [...]
> > > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > > a) The isolated CPU is requesting the stock drain,
> > > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > > the first time after a remote drain.
> > > 
> > > And anytime the charging path (consume_stock resp. refill_stock)
> > > contends with the remote draining which is out of control of the RT
> > > task. It is true that the RT kernel will turn that spin lock into a
> > > sleeping RT lock and that could help with potential priority inversions
> > > but still quite costly thing I would expect.
> > > 
> > > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > > > should not expect the syscalls to be have a predictable time, so it should be
> > > > fine.
> > > 
> > > Now I am not sure I understand. If you do not consider charging path to
> > > be RT sensitive then why is this needed in the first place? What else
> > > would be populating the pcp cache on the isolated cpu? IRQs?
> > 
> > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> > time with the RT workload, we can have preemption of the RT workload, which is a
> > problem for meeting the deadlines.
> 
> Yes, this is understood. But it is not really clear to me why would any
> draining be necessary for such an isolated CPU if no workload other than
> the RT (which pressumably doesn't charge any memory?) is running on that
> CPU? Is that the RT task during the initialization phase that leaves
> that cache behind or something else?

(I am new to this part of the code, so please correct me when I miss something.)

IIUC, if a process belongs to a control group with memory control, the 'charge'
will happen when a memory page starts getting used by it.

So, if we assume an RT load on an isolated CPU will not charge any memory, we
are assuming it will never be part of a memory-controlled cgroup.

I mean, can we just assume this?

If I got that right, would that not be considered a limitation? Like
"If you don't want your workload to be interrupted by perCPU cache draining,
don't put it in a cgroup with memory control".

> Sorry for being so focused on this
> but I would like to understand on whether this is avoidable by a
> different startup scheme or it really needs to be addressed in some way.

No worries, I am in fact happy you are giving it this much attention :)

I also understand this is a considerable change in the locking strategy, and
avoiding that is the first thing that should be tried.

> 
> > One way I thought to solve that was introducing a remote drain, which would
> > require a different strategy for locking, since not all accesses to the pcp
> > caches would happen on a local CPU. 
> 
> Yeah, I am not supper happy about additional spin lock TBH. One
> potential way to go would be to completely avoid pcp cache for isolated
> CPUs. That would have some performance impact of course but on the other
> hand it would give a more predictable behavior for those CPUs which
> sounds like a reasonable compromise to me. What do you think?

You mean not having a perCPU stock, then?
So consume_stock() for isolated CPUs would always return false, causing
try_charge_memcg() to always walk the slow path?

IIUC, both my proposal and yours would degrade performance only when we use
isolated CPUs + memcg. Is that correct?

If so, it looks like the impact would be even bigger without the perCPU stock,
compared to introducing a spinlock.

Unless we count the case where a remote CPU is draining an isolated CPU's
stock, and the isolated CPU faults a page and has to wait for the spinlock to
be released by the remote CPU. Well, this seems possible, but I would
have to analyze how often it would happen, and how much it would impact the
deadlines. I *guess* most of the RT workload's memory pages are pre-faulted
before it starts, so it can avoid the faulting latency, but I need to confirm
that.
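
For reference, pre-faulting in an RT application is commonly done along
these lines. This is a generic userspace sketch, not taken from any
specific workload discussed here: lock_all_memory() and prefault_buffer()
are hypothetical helper names, while mlockall() and sysconf() are the
standard calls. prefault_buffer() overwrites data, so it is meant for
freshly allocated buffers.

```c
#define _DEFAULT_SOURCE	/* expose mlockall() under strict -std= modes on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Ask the kernel to keep current and future mappings resident.
 * Returns 0 on success, -1 on failure (e.g. RLIMIT_MEMLOCK too low).
 */
int lock_all_memory(void)
{
	return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/*
 * Write to @buf at page-sized strides so that the page faults (and any
 * memcg charging they trigger) happen now, during initialization.
 * Returns the number of strides written.
 */
size_t prefault_buffer(void *buf, size_t len)
{
	volatile unsigned char *p = buf;
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t touched = 0;

	for (size_t off = 0; off < len; off += page) {
		p[off] = 0;
		touched++;
	}
	if (len)
		p[len - 1] = 0;	/* covers the last page if @buf is unaligned */
	return touched;
}
```

A workload would typically call lock_all_memory() once at startup, then
prefault_buffer() on each buffer it allocates, before entering its
time-critical loop.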

On the other hand, compared to how it works now, this should be a more
controllable way of introducing latency than a scheduled cache drain.

Your suggestion of no stocks/caches on isolated CPUs would be great for
predictability, but I am almost sure the cost in overall performance would not
be acceptable.

With the possibility of prefaulting pages, do you see any scenario that would
introduce some undesirable latency in the workload?

Thanks a lot for the discussion!
Leo
  
Michal Hocko Nov. 7, 2022, 8:10 a.m. UTC | #7
On Fri 04-11-22 22:45:58, Leonardo Brás wrote:
> On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> > On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> > > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> > [...]
> > > > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > > > a) The isolated CPU is requesting the stock drain,
> > > > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > > > the first time after a remote drain.
> > > > 
> > > > And anytime the charging path (consume_stock resp. refill_stock)
> > > > contends with the remote draining which is out of control of the RT
> > > > task. It is true that the RT kernel will turn that spin lock into a
> > > > sleeping RT lock and that could help with potential priority inversions
> > > > but still quite costly thing I would expect.
> > > > 
> > > > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > > > > should not expect the syscalls to be have a predictable time, so it should be
> > > > > fine.
> > > > 
> > > > Now I am not sure I understand. If you do not consider charging path to
> > > > be RT sensitive then why is this needed in the first place? What else
> > > > would be populating the pcp cache on the isolated cpu? IRQs?
> > > 
> > > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> > > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> > > time with the RT workload, we can have preemption of the RT workload, which is a
> > > problem for meeting the deadlines.
> > 
> > Yes, this is understood. But it is not really clear to me why would any
> > draining be necessary for such an isolated CPU if no workload other than
> > the RT (which pressumably doesn't charge any memory?) is running on that
> > CPU? Is that the RT task during the initialization phase that leaves
> > that cache behind or something else?
> 
> (I am new to this part of the code, so please correct me when I miss something.)
> 
> IIUC, if a process belongs to a control group with memory control, the 'charge'
> will happen when a memory page starts getting used by it.

Yes, very broadly speaking.

> So, if we assume an RT load on an isolated CPU will not charge any memory, we are
> assuming it will never be part of a memory-controlled cgroup.

If the memory cgroup controller is enabled then each user space process
is a part of some memcg. If there is no specific memcg assigned then it
will be a root cgroup and that is skipped during most charges except for
kmem.

> I mean, can we just assume this? 
> 
> If I got that right, would not that be considered a limitation? like
> "If you don't want your workload to be interrupted by perCPU cache draining,
> don't put it in a cgroup with memory control".

We definitely do not want userspace to make any assumptions on internal
implementation details like caches.

> > Sorry for being so focused on this
> > but I would like to understand whether this is avoidable by a
> > different startup scheme or it really needs to be addressed in some way.
> 
> No worries, I am in fact happy you are giving it this much attention :)
> 
> I also understand this is a considerable change in the locking strategy, and
> avoiding that is the first thing that should be tried.
> 
> > 
> > > One way I thought to solve that was introducing a remote drain, which would
> > > require a different strategy for locking, since not all accesses to the pcp
> > > caches would happen on a local CPU. 
> > 
> > Yeah, I am not super happy about an additional spin lock TBH. One
> > potential way to go would be to completely avoid pcp cache for isolated
> > CPUs. That would have some performance impact of course but on the other
> > hand it would give a more predictable behavior for those CPUs which
> > sounds like a reasonable compromise to me. What do you think?
> 
> You mean not having a perCPU stock, then? 
> So consume_stock() for isolated CPUs would always return false, causing
> try_charge_memcg() always walking the slow path?

Exactly.

> IIUC, both my proposal and yours would degrade performance only when we use
> isolated CPUs + memcg. Is that correct?

Yes, with a notable difference that with your spin lock option there is
still a chance that the remote draining could influence the isolated CPU
workload through that spinlock. If there is no pcp cache for that
cpu being used then there is no potential interaction at all.
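
To make the bypass idea concrete, here is a minimal userspace sketch in C. It
is a model only, not the mm/memcontrol.c code: stock_model, cpu_isolated,
consume_stock_model() and charge_model() are hypothetical names, and the batch
size of 64 pages is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4
#define CHARGE_BATCH 64U /* pages pre-charged per refill (assumed value) */

/* Hypothetical stand-in for the per-CPU memcg stock. */
struct stock_model {
    unsigned int nr_pages; /* pages cached on this CPU */
};

static struct stock_model stocks[NR_CPUS];
static bool cpu_isolated[NR_CPUS];      /* hypothetical isolation mask */
static unsigned long slow_path_charges; /* trips to the shared counters */

/* Fast path: serve the charge from the local stock unless the CPU is isolated. */
static bool consume_stock_model(int cpu, unsigned int nr_pages)
{
    if (cpu_isolated[cpu])
        return false; /* isolated CPUs never touch the cache */
    if (stocks[cpu].nr_pages < nr_pages)
        return false; /* not enough cached pages */
    stocks[cpu].nr_pages -= nr_pages;
    return true;
}

/* Charge nr_pages on behalf of a task running on 'cpu'. */
static void charge_model(int cpu, unsigned int nr_pages)
{
    if (consume_stock_model(cpu, nr_pages))
        return;

    slow_path_charges++;

    /* Only non-isolated CPUs over-charge a batch and keep the remainder. */
    if (!cpu_isolated[cpu] && nr_pages < CHARGE_BATCH)
        stocks[cpu].nr_pages += CHARGE_BATCH - nr_pages;
}
```

In this model an isolated CPU never holds a stock, so there is nothing for a
remote CPU to drain and no lock to share. The price is that every charge from
that CPU takes the slow path.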

> If so, it looks like the impact would be even bigger without perCPU stock,
> compared to introducing a spinlock.
> 
> Unless we are counting the case where a remote CPU is draining an isolated
> CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be
> released in the remote CPU. Well, this seems possible to happen, but I would
> have to analyze how often would it happen, and how much would it impact the
> deadlines. I *guess* most of the RT workload's memory pages are pre-faulted
> before it starts, so it can avoid the faulting latency, but I need to confirm
> that.

Yes, that is a general practice and the reason why I was asking how real
of a problem that is in practice. It is true that apart from user
space memory which can be under full control of the userspace there are
kernel allocations which can be done on behalf of the process and those
could be charged to memcg as well. So I can imagine the pcp cache could
be populated even if the process is not faulting anything in during RT
sensitive phase.
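
For reference, the prefaulting practice mentioned above usually looks roughly
like the following userspace sketch. It assumes a Linux/glibc environment;
prefault_buffer() and rt_memory_setup() are hypothetical helper names, while
mlockall() and sysconf() are the standard calls. It only covers user memory,
so it does nothing about kernel allocations charged on behalf of the task.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write to one byte in every page-sized step of buf so that the page
 * faults (and their memcg charges) happen during setup.
 * Returns the number of pages touched.
 */
static size_t prefault_buffer(volatile unsigned char *buf, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096;
    size_t touched = 0;

    for (size_t off = 0; off < len; off += page) {
        buf[off] = 0; /* the write fault maps the page */
        touched++;
    }
    return touched;
}

/*
 * Best-effort setup before the RT-sensitive phase: try to lock current
 * and future mappings, then prefault the working buffer. mlockall() can
 * fail without CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK; the result
 * is returned to the caller rather than treated as fatal.
 */
static int rt_memory_setup(unsigned char *buf, size_t len, size_t *touched)
{
    int locked = mlockall(MCL_CURRENT | MCL_FUTURE);

    *touched = prefault_buffer(buf, len);
    return locked; /* 0 on success, -1 if locking was not permitted */
}
```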

> On the other hand, compared to how it works now, this should be a more
> controllable way of introducing latency than a scheduled cache drain.
> 
> Your suggestion on no-stocks/caches in isolated CPUs would be great for
> predictability, but I am almost sure the cost in overall performance would not
> be fine.

It is hard to estimate the overhead without measuring that. Do you think
you can give it a try? If the performance is not really acceptable
(which I would be really surprised by) then we can think of a more complex
solution.

> With the possibility of prefaulting pages, do you see any scenario that would
> introduce some undesirable latency in the workload?

My primary concern would be spin lock contention which is hard to
predict with something like remote draining.
  
Leonardo Bras Soares Passos Nov. 8, 2022, 11:09 p.m. UTC | #8
On Mon, 2022-11-07 at 09:10 +0100, Michal Hocko wrote:
> On Fri 04-11-22 22:45:58, Leonardo Brás wrote:
> > On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> > > On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> > > > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> > > [...]
> > > > > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > > > > a) The isolated CPU is requesting the stock drain,
> > > > > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > > > > the first time after a remote drain.
> > > > > 
> > > > > And anytime the charging path (consume_stock resp. refill_stock)
> > > > > contends with the remote draining which is out of control of the RT
> > > > > task. It is true that the RT kernel will turn that spin lock into a
> > > > > sleeping RT lock and that could help with potential priority inversions
> > > > > but still quite costly thing I would expect.
> > > > > 
> > > > > > Both (a) and (b) should happen during a syscall, and IIUC an RT workload
> > > > > > should not expect syscalls to have a predictable time, so it should be
> > > > > > fine.
> > > > > 
> > > > > Now I am not sure I understand. If you do not consider charging path to
> > > > > be RT sensitive then why is this needed in the first place? What else
> > > > > would be populating the pcp cache on the isolated cpu? IRQs?
> > > > 
> > > > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> > > > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> > > > time with the RT workload, we can have preemption of the RT workload, which is a
> > > > problem for meeting the deadlines.
> > > 
> > > Yes, this is understood. But it is not really clear to me why would any
> > > draining be necessary for such an isolated CPU if no workload other than
> > > the RT (which presumably doesn't charge any memory?) is running on that
> > > CPU? Is that the RT task during the initialization phase that leaves
> > > that cache behind or something else?
> > 
> > (I am new to this part of the code, so please correct me when I miss something.)
> > 
> > IIUC, if a process belongs to a control group with memory control, the 'charge'
> > will happen when a memory page starts getting used by it.
> 
> Yes, very broadly speaking.
> 
> > So, if we assume an RT load on an isolated CPU will not charge any memory, we are
> > assuming it will never be part of a memory-controlled cgroup.
> 
> If the memory cgroup controller is enabled then each user space process
> is a part of some memcg. If there is no specific memcg assigned then it
> will be a root cgroup and that is skipped during most charges except for
> kmem.

Oh, it makes sense. 
Thanks for helping me understand that! 

> 
> > I mean, can we just assume this? 
> > 
> > If I got that right, would not that be considered a limitation? like
> > "If you don't want your workload to be interrupted by perCPU cache draining,
> > don't put it in a cgroup with memory control".
> 
> We definitely do not want userspace to make any assumptions on internal
> implementation details like caches.

Perfect, that was my expectation. 

> 
> > > Sorry for being so focused on this
> > > but I would like to understand whether this is avoidable by a
> > > different startup scheme or it really needs to be addressed in some way.
> > 
> > No worries, I am in fact happy you are giving it this much attention :)
> > 
> > I also understand this is a considerable change in the locking strategy, and
> > avoiding that is the first thing that should be tried.
> > 
> > > 
> > > > One way I thought to solve that was introducing a remote drain, which would
> > > > require a different strategy for locking, since not all accesses to the pcp
> > > > caches would happen on a local CPU. 
> > > 
> > > Yeah, I am not super happy about an additional spin lock TBH. One
> > > potential way to go would be to completely avoid pcp cache for isolated
> > > CPUs. That would have some performance impact of course but on the other
> > > hand it would give a more predictable behavior for those CPUs which
> > > sounds like a reasonable compromise to me. What do you think?
> > 
> > You mean not having a perCPU stock, then? 
> > So consume_stock() for isolated CPUs would always return false, causing
> > try_charge_memcg() always walking the slow path?
> 
> Exactly.
> 
> > IIUC, both my proposal and yours would degrade performance only when we use
> > isolated CPUs + memcg. Is that correct?
> 
> Yes, with a notable difference that with your spin lock option there is
> still a chance that the remote draining could influence the isolated CPU
> workload through that spinlock. If there is no pcp cache for that
> cpu being used then there is no potential interaction at all.

I see. 
But the slow path is slow for some reason, right?
Doesn't it also make use of locks? So in normal operation there could be a
potentially larger impact than a spinlock, even though there would be no
scheduled draining.

> 
> > If so, it looks like the impact would be even bigger without perCPU stock,
> > compared to introducing a spinlock.
> > 
> > Unless we are counting the case where a remote CPU is draining an isolated
> > CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be
> > released in the remote CPU. Well, this seems possible to happen, but I would
> > have to analyze how often would it happen, and how much would it impact the
> > deadlines. I *guess* most of the RT workload's memory pages are pre-faulted
> > before it starts, so it can avoid the faulting latency, but I need to confirm
> > that.
> 
> Yes, that is a general practice and the reason why I was asking how real
> of a problem that is in practice. 

I remember this being a common factor in deadlines being missed in the workload
analyzed. Need to redo the test to be sure.

> It is true that apart from user
> space memory which can be under full control of the userspace there are
> kernel allocations which can be done on behalf of the process and those
> could be charged to memcg as well. So I can imagine the pcp cache could
> be populated even if the process is not faulting anything in during RT
> sensitive phase.

Humm, I think I will apply the change and do comparative testing with
upstream. This should bring good comparison results.

> 
> > On the other hand, compared to how it works now, this should be a more
> > controllable way of introducing latency than a scheduled cache drain.
> > 
> > Your suggestion on no-stocks/caches in isolated CPUs would be great for
> > predictability, but I am almost sure the cost in overall performance would not
> > be fine.
> 
> It is hard to estimate the overhead without measuring that. Do you think
> you can give it a try? If the performance is not really acceptable
> (which I would be really surprised by) then we can think of a more complex
> solution.

Sure, I can try that.
Do you suggest any specific workload that happens to stress the percpu cache
usage, with the usual drains and so on? Maybe I will also try synthetic workloads.

> 
> > With the possibility of prefaulting pages, do you see any scenario that would
> > introduce some undesirable latency in the workload?
> 
> My primary concern would be spin lock contention which is hard to
> predict with something like remote draining.

It makes sense. I will do some testing and come back with results for that.

Thanks for reviewing!
Leo
  
Michal Hocko Nov. 9, 2022, 8:05 a.m. UTC | #9
On Tue 08-11-22 20:09:25, Leonardo Brás wrote:
[...]
> > Yes, with a notable difference that with your spin lock option there is
> > still a chance that the remote draining could influence the isolated CPU
> > workload through that spinlock. If there is no pcp cache for that
> > cpu being used then there is no potential interaction at all.
> 
> I see. 
> But the slow path is slow for some reason, right?
> Doesn't it also make use of locks? So in normal operation there could be a
> potentially larger impact than a spinlock, even though there would be no
> scheduled draining.

Well, for the regular (try_charge) path that is essentially page_counter_try_charge
which boils down to atomic_long_add_return of the memcg counter + all
parents up the hierarchy and high memory limit evaluation (essentially 2
atomic_reads for the memcg + all parents up the hierarchy). That is not
a whole lot - especially when the memcg hierarchy is not very deep.

Per cpu batch amortizes those per hierarchy updates as well as atomic
operations + cache lines bouncing on updates.
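
As a rough illustration of what that slow path costs, here is a userspace
model in C11 of a hierarchical charge: one atomic add per level, with a
rollback if some level would exceed its limit. This is a sketch under
simplifying assumptions, not the kernel's page_counter code; counter_model and
counter_try_charge_model() are hypothetical names, and the high-limit
evaluation is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a hierarchical page counter (not the kernel struct). */
struct counter_model {
    atomic_long usage;            /* pages currently charged */
    long max;                     /* hard limit in pages */
    struct counter_model *parent; /* NULL at the root */
};

/*
 * Charge nr_pages to 'c' and to every ancestor. If any level would go
 * over its limit, undo the charges applied so far and report failure.
 */
static bool counter_try_charge_model(struct counter_model *c, long nr_pages)
{
    struct counter_model *pos, *undo;

    for (pos = c; pos; pos = pos->parent) {
        /* atomic_fetch_add() returns the old value, so add nr_pages back */
        long new_usage = atomic_fetch_add(&pos->usage, nr_pages) + nr_pages;

        if (new_usage > pos->max) {
            atomic_fetch_sub(&pos->usage, nr_pages); /* undo this level */
            goto rollback;
        }
    }
    return true;

rollback:
    /* Undo the levels below the one that failed. */
    for (undo = c; undo != pos; undo = undo->parent)
        atomic_fetch_sub(&undo->usage, nr_pages);
    return false;
}
```

The cost grows with the depth of the hierarchy, which is why a shallow memcg
tree keeps the uncached path cheap, and why batching charges per CPU saves
both the atomic operations and the cache line traffic on the shared counters.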

On the other hand spinlock would do the unconditional atomic updates as
well and even much more on CONFIG_RT. A plus is that the update will be
mostly local so cache line bouncing shouldn't be terrible. Unless
somebody heavily triggers pcp cache draining but this shouldn't be all
that common (e.g. when a memcg hits its limit).

All that being said, I am still not convinced that the pcp cache bypass
for isolated CPUs would make a dramatic difference. Especially in the
context of workloads that tend to run on isolated CPUs and rarely enter
kernel.
 
> > It is true that apart from user
> > space memory which can be under full control of the userspace there are
> > kernel allocations which can be done on behalf of the process and those
> > could be charged to memcg as well. So I can imagine the pcp cache could
> > be populated even if the process is not faulting anything in during RT
> > sensitive phase.
> 
> Humm, I think I will apply the change and do comparative testing with
> upstream. This should bring good comparison results.

That would be certainly appreciated!
 
> > > On the other hand, compared to how it works now, this should be a more
> > > controllable way of introducing latency than a scheduled cache drain.
> > > 
> > > Your suggestion on no-stocks/caches in isolated CPUs would be great for
> > > predictability, but I am almost sure the cost in overall performance would not
> > > be fine.
> > 
> > It is hard to estimate the overhead without measuring that. Do you think
> > you can give it a try? If the performance is not really acceptable
> > (which I would be really surprised by) then we can think of a more complex
> > solution.
> 
> Sure, I can try that.
> Do you suggest any specific workload that happens to stress the percpu cache
> usage, with the usual drains and so on? Maybe I will also try synthetic workloads.

I really think you want to test it on an isolcpus-aware workload.
Artificial benchmarks are not all that useful in this context.
  
Leonardo Bras Soares Passos Jan. 25, 2023, 7:44 a.m. UTC | #10
On Wed, 2022-11-09 at 09:05 +0100, Michal Hocko wrote:
> On Tue 08-11-22 20:09:25, Leonardo Brás wrote:
> [...]
> > > Yes, with a notable difference that with your spin lock option there is
> > > still a chance that the remote draining could influence the isolated CPU
> > > workload through that spinlock. If there is no pcp cache for that
> > > cpu being used then there is no potential interaction at all.
> > 
> > I see. 
> > But the slow path is slow for some reason, right?
> > Doesn't it also make use of locks? So in normal operation there could be a
> > potentially larger impact than a spinlock, even though there would be no
> > scheduled draining.
> 
> Well, for the regular (try_charge) path that is essentially page_counter_try_charge
> which boils down to atomic_long_add_return of the memcg counter + all
> parents up the hierarchy and high memory limit evaluation (essentially 2
> atomic_reads for the memcg + all parents up the hierarchy). That is not
> a whole lot - especially when the memcg hierarchy is not very deep.
> 
> Per cpu batch amortizes those per hierarchy updates as well as atomic
> operations + cache lines bouncing on updates.
> 
> On the other hand spinlock would do the unconditional atomic updates as
> well and even much more on CONFIG_RT. A plus is that the update will be
> mostly local so cache line bouncing shouldn't be terrible. Unless
> somebody heavily triggers pcp cache draining but this shouldn't be all
> that common (e.g. when a memcg hits its limit).
> 
> All that being said, I am still not convinced that the pcp cache bypass
> for isolated CPUs would make a dramatic difference. Especially in the
> context of workloads that tend to run on isolated CPUs and rarely enter
> kernel.
>  
> > > It is true that apart from user
> > > space memory which can be under full control of the userspace there are
> > > kernel allocations which can be done on behalf of the process and those
> > > could be charged to memcg as well. So I can imagine the pcp cache could
> > > be populated even if the process is not faulting anything in during RT
> > > sensitive phase.
> > 
> > Humm, I think I will apply the change and do comparative testing with
> > upstream. This should bring good comparison results.
> 
> That would be certainly appreciated!
>
> > > > On the other hand, compared to how it works now, this should be a more
> > > > controllable way of introducing latency than a scheduled cache drain.
> > > > 
> > > > Your suggestion on no-stocks/caches in isolated CPUs would be great for
> > > > predictability, but I am almost sure the cost in overall performance would not
> > > > be fine.
> > > 
> > > It is hard to estimate the overhead without measuring that. Do you think
> > > you can give it a try? If the performance is not really acceptable
> > > (which I would be really surprised by) then we can think of a more complex
> > > solution.
> > 
> > Sure, I can try that.
> > Do you suggest any specific workload that happens to stress the percpu cache
> > usage, with the usual drains and so on? Maybe I will also try synthetic workloads.
> 
> I really think you want to test it on an isolcpus-aware workload.
> Artificial benchmarks are not all that useful in this context.

Hello Michal,
I just sent a v2 for this patchset with a lot of changes.
https://lore.kernel.org/lkml/20230125073502.743446-1-leobras@redhat.com/

I have tried to gather some data on the performance numbers as suggested, but I
got carried away and the cover letter ended up too big. I hope it's not too much
trouble.

Best regards,
Leo