[v2,00/12] Reviving the Proxy Execution Series v2

Message ID 20230320233720.3488453-1-jstultz@google.com
Headers
Series Reviving the Proxy Execution Series v2 |

Message

John Stultz March 20, 2023, 11:37 p.m. UTC
  Hey All,

I’ve recently picked up the proxy execution effort, so I wanted to send
this out to share the current state of the patch series.

So first of all, this Proxy Execution series has a long history.
Starting from initial patches from Peter Zijlstra, then extended with
lots of work by Juri Lelli, Valentin Schneider, and Connor O'Brien. 

So all the credit for this series really is due to the developers
above, while the mistakes are likely mine. :)  I wanted to particularly
appreciate Connor’s recent efforts, as I worked closely with him and
it’s only due to his knowledge, patience, and clear explanations of the
changes and issues that I was able to come to my current understanding
of the series so quickly.

Overview:
---------
I like to think of Proxy Execution as a generalized form of priority
inheritance. Classic priority inheritance works well for real-time tasks
where there is a straight forward priority order to how things are run.
But it breaks down when used between CFS tasks, as there are lots of
parameters involved outside of just the task’s nice value when selecting
the next task to run (via pick_next_task()). So ideally we want to imbue
the mutex holder with all the scheduler attributes of the waiting task.

Proxy Execution tries to do this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked on, and
  which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so that we
  instead run the mutex holder. 

The first of these is the most difficult to grasp (I do get the mental
friction here: blocked tasks on the *run*queue sounds like nonsense!
Personally I like to think of the runqueue in this model more like a
“task-selection queue”).

By leaving blocked tasks on the runqueue, we allow pick_next_task() to
choose the task that should run next (even if it’s blocked waiting on a
mutex). If we do select a blocked task, we look at the task’s blocked_on
mutex and from there look at the mutex’s owner task. And in the simple
case, the task which owns the mutex is what we then run, allowing it to
release the mutex.

This means that instead of just tracking “curr”, the scheduler needs to
track both the scheduler context (what was picked and all the state used
for scheduling decisions), and the execution context (what we’re
running)

In this way, the mutex owner is run “on behalf” of the blocked task
that was picked to run, essentially inheriting the scheduler context of
the blocked task.

As Connor outlined in a previous submission[1] of this patch series,
this raises a number of complicated situations:  The mutex owner might
itself be blocked on another mutex, or it could be sleeping, running on
a different CPU, in the process of migrating between CPUs, etc.

But the functionality provided is useful, as in Android we have a number
of cases where we are seeing priority inversion (not unbounded, but
longer than we’d like) between “foreground” and “background”
SCHED_NORMAL applications, so having a generalized solution would be
great.


Issues still to address:
------------------------
The last time this patch series was submitted, a number of issues were
identified that need to be addressed:
* cputime accounting is a little unintuitive, as time used by the proxy
  was being charged to the blocked task. Connor began work to address
  this but it was not complete, so it is not included in this version of
  the series.
* RT/DL load balancing. There is a scheduling invariant that we always
  need to run the top N highest priority RT tasks across the N cpus.
  However keeping blocked tasks on the runqueue greatly complicates the
  load balancing for this. Connor took an initial stab at this with
  “chain level balancing” included in this series. Feedback on this
  would be appreciated!
* CFS load balancing. Blocked tasks may carry forward load (PELT) to the
  lock owner's CPU, so CPU may look like it is overloaded.
* Terminology/BikeShedding: I think much of the terminology makes these
  changes harder to reason about.
  - As noted above, blocked-tasks on the run-queue breaks some mental
    models
  - The “proxy” relationship seems a bit inverted.  Peter’s “Split
    scheduler execution context” describes the proxy as the scheduler
    context, and curr being the execution context (the blocked task
    being the “scheduler proxy” for the lock-holder). Personally, I
    think of proxies as those who do-on-others-behalf, so to me it would
    make more sense to have rq_curr() be the selected task (scheduler
    context) and rq_proxy() be the run task (execution context), as its
    running on behalf of the selected but blocked task. Unless folks
    object, I’ll likely change this in a future submission.
  - Also, the rq_curr() and rq_proxy() distinction is easily confused. I
    think it’s sane to try to keep close to the existing curr based
    logic, but swapping rq_proxy in for some cases makes the change
    smaller, but the resulting code less clear. I worry folks focused on
    the !CONFIG_PROXY_EXEC case might easily mix up which to use when
    creating new logic.
* Resolving open questions in comments: I’ve left these in for now, but
  I hope to review and try to make some choices where there are open
  questions. If folks have specific feedback or suggestions here, it
  would be great!

Performance:
------------
With the current patch series the mutexes are handed off, instead of
using optimistic spinning. This is a potential concern where locks are
under high contention. However, so far in our performance analysis (on
both x86 and mobile devices) we’ve not seen any major regressions.  

Changes since Connor’s last submission:
---------------------------------------
* Connor took a swing at addressing the RT/DL load balancing issue 
* Connor fixed a number of issues found in testing
* I added rq_curr() and rq_proxy() accessors to abstract rq->curr and
  rq->proxy references (so it collapses in the !CONFIG_PROXY_EXECUTION
  case)
* I’ve gone through the series trying to split up the larger functions
  and better conditionalize the logic on CONFIG_PROXY_EXECUTION
* I’ve broken out some of the logic in larger patches into separate
  patches, as well as split out small prep changes to the logic so the
  patches are easier to review.
* I’ve also found and addressed a few edge cases in locking and mutex
  owner handling.
* I dropped the patch changing mutex::wait_lock to always save/restore irq
  flags (as Joel raised a concern that the patch wasn’t actually
  necessary). 
* I’ve folded a number of fixes down into the relevant patches. 

For the most part, my changes have been mechanical reworking of the
logic, avoiding any changes in behavior. Though after this, I’ll likely
start trying to rework some of the logic further.

So while there is still more to do, I wanted to send this series out so
folks could see work is continuing, and be able to get some additional
review on recent changes.  Review and feedback would be greatly
appreciated!

If folks find it easier to test/tinker with, this patch series can also
be found here:
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v2-6.3-rc

Thanks so much!
-john

[1] https://lore.kernel.org/lkml/20221003214501.2050087-1-connoro@google.com/

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com

Connor O'Brien (1):
  sched: Attempt to fix rt/dl load balancing via chain level balance

John Stultz (2):
  sched: Replace rq->curr access w/ rq_curr(rq)
  sched: Unnest ttwu_runnable in prep for proxy-execution

Juri Lelli (1):
  locking/mutex: Expose mutex_owner()

Peter Zijlstra (6):
  locking/ww_mutex: Remove wakeups from under mutex::wait_lock
  locking/mutex: Rework task_struct::blocked_on
  locking/mutex: Add task_struct::blocked_lock to serialize changes to
    the blocked_on state
  sched: Unify runtime accounting across classes
  sched: Split scheduler execution context
  sched: Add proxy execution

Valentin Schneider (2):
  locking/mutex: Add p->blocked_on wrappers
  sched/rt: Fix proxy/current (push,pull)ability

 include/linux/mutex.h        |   2 +
 include/linux/sched.h        |  24 +-
 include/linux/ww_mutex.h     |   3 +
 init/Kconfig                 |   7 +
 init/init_task.c             |   1 +
 kernel/Kconfig.locks         |   2 +-
 kernel/fork.c                |   6 +-
 kernel/locking/mutex-debug.c |   9 +-
 kernel/locking/mutex.c       | 103 ++++-
 kernel/locking/ww_mutex.h    |  10 +-
 kernel/sched/core.c          | 787 ++++++++++++++++++++++++++++++++---
 kernel/sched/core_sched.c    |   2 +-
 kernel/sched/cpudeadline.c   |  12 +-
 kernel/sched/cpudeadline.h   |   3 +-
 kernel/sched/cpupri.c        |  29 +-
 kernel/sched/cpupri.h        |   6 +-
 kernel/sched/deadline.c      | 213 ++++++----
 kernel/sched/debug.c         |   2 +-
 kernel/sched/fair.c          | 109 +++--
 kernel/sched/membarrier.c    |   8 +-
 kernel/sched/pelt.h          |   2 +-
 kernel/sched/rt.c            | 301 +++++++++-----
 kernel/sched/sched.h         | 286 ++++++++++++-
 kernel/sched/stop_task.c     |  13 +-
 24 files changed, 1599 insertions(+), 341 deletions(-)
  

Comments

John Stultz March 21, 2023, 2:51 a.m. UTC | #1
On Mon, Mar 20, 2023 at 4:37 PM John Stultz <jstultz@google.com> wrote:
> Changes since Connor’s last submission:
> ---------------------------------------
> * I dropped the patch changing mutex::wait_lock to always save/restore irq
>   flags (as Joel raised a concern that the patch wasn’t actually
>   necessary).

Well, despite a bit of testing prior, it is of course immediately
after sending it I managed to trip lockdep to get a warning on this
(though it tripped on the blocked_lock not the wait_lock), so I'll be
re-adding that patch (or a variant) back in in the next series.

[    1.351993]        CPU0                    CPU1
[    1.351993]        ----                    ----
[    1.351993]   lock(&p->blocked_lock);
[    1.351993]                                local_irq_disable();
[    1.351993]                                lock(&rq->__lock);
[    1.351993]                                lock(&p->blocked_lock);
[    1.351993]   <Interrupt>
[    1.351993]     lock(&rq->__lock);

thanks
-john