[v8,0/7] Preparatory changes for Proxy Execution v8

Message ID 20240210002328.4126422-1-jstultz@google.com

Message

John Stultz Feb. 10, 2024, 12:23 a.m. UTC
  After sending out v7 of Proxy Execution, I got feedback that the
patch series was getting a bit unwieldy to review, and Qais
suggested I break out just the cleanups/preparatory components of
the patch series and submit them on their own in the hope we can
start to merge the less complex bits and discussion can focus on
the more complicated portions afterwards.

So for the v8 of this series, I’m only submitting those earlier
cleanup/preparatory changes here. However work on the full series
has continued, with some nice progress on the performance front.

If you are interested, the full v8 series can be found here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v8-6.8-rc3
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v8-6.8-rc3

One bit of feedback I got earlier from K Prateek Nayak[1] was
that while the Proxy Execution feature didn’t impact performance
in most benchmarks he tried, it did cause major regressions with
the `perf bench` test. I isolated this down to the patch series
disabling optimistic spinning, which has been a long-term
performance concern with the change.

I spent some time looking for a way to walk the chain (across
cpus) to see if the runnable owner was actually running, which
would allow us to spin and avoid migrations. However, I realized
that is actually a separate and more difficult problem (ie:
migration avoidance) from simple optimistic spinning (a task
trying to take a mutex whose owner is currently running). With
Proxy Execution, the main focus is how blocked tasks can
effectively boost *non-running* mutex owners. But when the
contended mutex owner is running, we don’t need to do anything
different from the existing optimistic spinning logic.
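
For those less familiar with the mechanism, here is a rough
user-space sketch of the core idea behind optimistic spinning.
This is only an illustration (not the kernel’s actual
mutex_optimistic_spin(), which additionally handles OSQ
queueing, need_resched() checks, and the actual acquisition):

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct task { _Atomic bool on_cpu; };
  struct mutex { _Atomic(struct task *) owner; };

  /*
   * Spin only while the lock is held *and* its owner is actively
   * running on a cpu; bail out (so the caller falls back to
   * blocking) as soon as the owner is preempted or sleeps.
   */
  static bool optimistic_spin(struct mutex *lock)
  {
          struct task *owner;

          while ((owner = atomic_load(&lock->owner)) != NULL) {
                  if (!atomic_load(&owner->on_cpu))
                          return false; /* owner not running: block */
                  /* cpu_relax()-style pause would go here */
          }
          return true; /* lock released while spinning; try to take it */
  }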

So I’ve dropped the patches disabling optimistic spinning, and
fixed up some further issues, as the change opened some new
races the later Proxy Execution patches weren’t expecting. The
results are looking really good: the perf bench test, which
previously regressed so badly, now shows very close results
(within 1.5%) with and without Proxy Execution. This alleviates
a major concern I had with the series.

Re-enabling optimistic spinning does open the question of
whether we should allow optimistic spinning’s lock-stealing
(instead of lock-handoff) when the runnable mutex owner is
running as a proxy for a more important task. Ideally we
probably want to hand the lock back up the chain which has
donated the time (to avoid migrating the chain to the stealer’s
cpu), so I have a patch I’m testing that special-cases that
situation. So far in my testing I’m not seeing any large
increase in latencies from the potential unfairness of
lock-stealing, but I’ll be doing more investigation on this.
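
For context, the rough difference between the two release
policies, as a toy sketch (not the kernel’s actual unlock
slowpath; the types and fields here are made up purely for
illustration):

  #include <stdatomic.h>
  #include <stddef.h>

  struct task { int prio; };
  struct mutex {
          _Atomic(struct task *) owner;
          struct task *top_waiter; /* head of wait list (simplified) */
  };

  /*
   * "Stealing" release: drop ownership and wake the top waiter. A
   * task spinning on the lock may grab it before the woken waiter
   * gets to run.
   */
  static void unlock_allow_steal(struct mutex *lock)
  {
          atomic_store(&lock->owner, NULL);
          /* wake lock->top_waiter here */
  }

  /*
   * "Handoff" release: assign the lock directly to the top waiter
   * so it cannot be stolen. For the proxy case this keeps the lock
   * (and the donated time) with the chain that boosted the
   * releasing task, at some throughput cost to spinners.
   */
  static void unlock_handoff(struct mutex *lock)
  {
          atomic_store(&lock->owner, lock->top_waiter);
          /* wake lock->top_waiter here */
  }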

New in v8:
---------
* Re-ordered patch set to only submit cleanup & preparatory
  changes, leaving more complicated changes for a future
  submission (suggested by Qais Yousef)

* Lots of spelling fixups and minor changes suggested by Randy
  Dunlap and Metin Kaya

* The cover letter was getting long-winded with the boilerplate
  sections included, so please see earlier versions of the patch
  series or this LWN article [2] for an overview of the
  functionality.

* Re-enabled mutex optimistic spinning to resolve a major
  performance regression in earlier versions of the series

* Fixups for races discovered with optimistic spinning 

* Included trace events patch from Metin to better help us
  analyze the costs of proxying.

* Extended sched_football to use rtmutexes when
  !CONFIG_SCHED_PROXY_EXEC

* Tweaked sched_football to bail if spawned players don’t check
  in within 30s

* Chased down a bug uncovered by sched_football that was
  preventing unconstrained RT tasks from being put on the
  pushable list, causing improper task starvation and
  sched_football failures.

Performance:
---------
With optimistic spinning re-enabled, the biggest concern I had
wrt performance is now much relieved. However, there is still the
potential extra overhead of doing rq migrations/return migrations
for the proxy case. The big hypothesis of this series is that the
overall benefit from unblocking important tasks will well
outweigh any extra cost, but if folks see anything of concern,
I’d love to hear about it.

Issues still to address:
---------
* Enabling optimistic spinning uncovered a few interesting races
  that were much harder to hit in v7. I’ve resolved most of
  these, but there is a case I’ve occasionally seen where, in
  try_to_wake_up(), the blocked_on_state transitions are too lax,
  causing a transition to BO_RUNNABLE when we have not had a
  chance to evaluate for return-migration (allowing tasks to
  potentially run on cpus they shouldn’t). Unfortunately, in
  tightening the state handling to be more careful, I’ve run into
  cases where we don’t transition to BO_RUNNABLE, and effectively
  lose the wakeup. I’ve started to suspect that the conceptual
  overlap between task->blocked_on_state and task->__state means
  I should consider condensing the BO_BLOCKED/BO_WAKING states
  into new task states (ie: TASK_MUTEX_BLOCKED,
  TASK_PROXY_MIGRATED), so we can handle the state transitions
  more correctly. (A toy sketch illustrating this follows at the
  end of this list.)

* Closer evaluation of pros/cons of doing lock handoff vs
  allowing lock-stealing when proxy tasks release locks.

* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants.

* Xuewen Yan earlier pointed out that we may see task
  mis-placement on EAS systems if we do return migration based
  strictly on cpu allowability. I tried an optimization to always
  try to return migrate to the wake_cpu (which was saved on
  proxy-migration), but this seemed to undo a chunk of the
  benefit I saw in moving return migration back to ttwu, at least
  with my prio-inversion-demo microbenchmark. Need to do some
  broader performance analysis with the variations there.

* It would be nice to find an optimization to avoid migrating
  blocked tasks if the runnable lock-owner at the end of the
  blocked_on chain is already running (though this is difficult
  due to the limitations from locking rules needed to traverse
  the blocked on chain across run queues).

* CFS load balancing. There was concern that blocked tasks may
  carry their load (PELT) forward to the lock owner's CPU, making
  that CPU look overloaded. This needs investigation.

* The sleeping owner handling (where we deactivate waiting tasks
  and enqueue them onto a list, then reactivate them when the
  owner wakes up) doesn’t feel great. This is in part because
  when we want to activate tasks, we’re already holding a
  task’s pi_lock and an rq lock, just not the locks for the task
  we’re activating, nor the rq we’re enqueuing it onto. So there
  has to be a bit of lock juggling to drop and acquire the right
  locks (in the right order). It feels like there’s got to be a
  better way.

* As discussed at OSPM[3], I’d like to split pick_next_task() up
  into two phases, selecting and setting the next task, as
  currently pick_next_task() assumes the returned task will be
  run, which results in various side effects in the sched class
  logic when it’s run. I tried to take a pass at this earlier,
  but it’s
  hairy and lower on the priority list for now.
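
As a rough illustration of the blocked_on_state issue in the
first item of the list above, here is a toy model of the
ordering problem (not the kernel code; the struct, field and
helper names here are made up):

  /* Toy model of the blocked_on state machine discussed above. */
  enum blocked_on_state {
          BO_RUNNABLE, /* not blocked on a mutex; free to run */
          BO_BLOCKED,  /* blocked on a mutex; may run as a proxy */
          BO_WAKING,   /* waking; return migration not yet decided */
  };

  struct toy_task {
          enum blocked_on_state blocked_on_state;
          int cpu;      /* cpu the task is currently queued on */
          int wake_cpu; /* cpu saved when it was proxy-migrated */
  };

  /*
   * The ordering that needs care in try_to_wake_up(): only move
   * to BO_RUNNABLE once we have decided where the task is allowed
   * to run. Doing it earlier can leave the task on a cpu it was
   * only borrowing as a proxy; being too conservative can skip
   * the transition entirely and lose the wakeup.
   */
  static void ttwu_blocked_on_transition(struct toy_task *p)
  {
          if (p->blocked_on_state != BO_WAKING)
                  return;

          if (p->cpu != p->wake_cpu)
                  p->cpu = p->wake_cpu;      /* return migration */

          p->blocked_on_state = BO_RUNNABLE; /* now safe to run */
  }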

Credit/Disclaimer:
---------
As mentioned previously, this Proxy Execution series has a long
history: First described in a paper[4] by Watkins, Straub, and
Niehaus, then prototyped in patches by Peter Zijlstra, and
extended with lots of work by Juri Lelli, Valentin Schneider,
and Connor O'Brien. (And thank you to Steven Rostedt for
providing additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

As always, feedback is appreciated.

Thanks so much!
-john

[1] https://lore.kernel.org/lkml/c6787831-f659-12cb-4954-fd13a05ed590@amd.com/
[2] https://lwn.net/Articles/934114/
[3] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[4] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@android.com

Connor O'Brien (2):
  sched: Add do_push_task helper
  sched: Consolidate pick_*_task to task_is_pushable helper

John Stultz (2):
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Split out __schedule() deactivate task logic into a helper

Juri Lelli (2):
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (1):
  sched: Split scheduler and execution contexts

 kernel/locking/mutex.c       |  60 +++++++----------
 kernel/locking/mutex.h       |  25 +++++++
 kernel/locking/rtmutex.c     |  26 +++++---
 kernel/locking/rwbase_rt.c   |   4 +-
 kernel/locking/rwsem.c       |   4 +-
 kernel/locking/spinlock_rt.c |   3 +-
 kernel/locking/ww_mutex.h    |  49 ++++++++------
 kernel/sched/core.c          | 122 +++++++++++++++++++++--------------
 kernel/sched/deadline.c      |  53 ++++++---------
 kernel/sched/fair.c          |  18 +++---
 kernel/sched/rt.c            |  59 +++++++----------
 kernel/sched/sched.h         |  44 ++++++++++++-
 12 files changed, 268 insertions(+), 199 deletions(-)
  

Comments

Metin Kaya Feb. 12, 2024, 3:50 p.m. UTC | #1
On 10/02/2024 12:23 am, John Stultz wrote:
> After sending out v7 of Proxy Execution, I got feedback that the
> patch series was getting a bit unwieldy to review, and Qais
> suggested I break out just the cleanups/preparatory components of
> the patch series and submit them on their own in the hope we can
> start to merge the less complex bits and discussion can focus on
> the more complicated portions afterwards.
> 
> So for the v8 of this series, I’m only submitting those earlier
> cleanup/preparatory changes here. However work on the full series
> has continued, with some nice progress on the performance front.

I guess we can also cherry-pick the 1st [1], 20th [2] and 22nd [3]
patches of v7 into this series?

[1] 
https://lore.kernel.org/lkml/20231220001856.3710363-9-jstultz@google.com/T/#m1e6ef390bf9044ca69818549bccec1ada221cd32
[2] 
https://lore.kernel.org/lkml/20231220001856.3710363-9-jstultz@google.com/T/#m638575b42057e1d2cc3a82d4bdea7308eb85b461
[3] 
https://lore.kernel.org/lkml/20231220001856.3710363-9-jstultz@google.com/T/#mafcb432d20c6b545998da56ba92fc51ba8a07b42

> 
> If you are interested, the full v8 series can be found here:
>    https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v8-6.8-rc3
>    https://github.com/johnstultz-work/linux-dev.git proxy-exec-v8-6.8-rc3

[...]
  
John Stultz Feb. 12, 2024, 6:17 p.m. UTC | #2
On Mon, Feb 12, 2024 at 7:50 AM Metin Kaya <metin.kaya@arm.com> wrote:
> On 10/02/2024 12:23 am, John Stultz wrote:
> > After sending out v7 of Proxy Execution, I got feedback that the
> > patch series was getting a bit unwieldy to review, and Qais
> > suggested I break out just the cleanups/preparatory components of
> > the patch series and submit them on their own in the hope we can
> > start to merge the less complex bits and discussion can focus on
> > the more complicated portions afterwards.
> >
> > So for the v8 of this series, I’m only submitting those earlier
> > cleanup/preparatory changes here. However work on the full series
> > has continued, with some nice progress on the performance front.
>
> I guess we can also cherry-pick the 1st [1], 20th [2] and 22nd [3]
> patches of v7 into this series?
>
> [1]
> https://lore.kernel.org/lkml/20231220001856.3710363-9-jstultz@google.com/T/#m1e6ef390bf9044ca69818549bccec1ada221cd32

That one already landed upstream w/ the deadline server changes:
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d69eca542ee17c618f9a55da52191d5e28b435f

> [2]
> https://lore.kernel.org/lkml/20231220001856.3710363-9-jstultz@google.com/T/#m638575b42057e1d2cc3a82d4bdea7308eb85b461

On this one, I wanted to get some more review of the chain-migration
logic before really pushing it.

> [3]
> https://lore.kernel.org/lkml/20231220001856.3710363-9-jstultz@google.com/T/#mafcb432d20c6b545998da56ba92fc51ba8a07b42

I do agree this is more of a cleanup, but I also want to refactor
it a touch to try to consolidate the logic, so it's not quite
ready.

So I'm just submitting the set here for now to try to get the ball
rolling, but I'll continue working on the larger series and will pull
things down as they make sense.

thanks
-john