[v6,00/20] Proxy Execution: A generalized form of Priority Inheritance v6

Message ID 20231106193524.866104-1-jstultz@google.com
Headers
Series Proxy Execution: A generalized form of Priority Inheritance v6 |

Message

John Stultz Nov. 6, 2023, 7:34 p.m. UTC
  Stabilizing this Proxy Execution series has unfortunately
continued to be a challenging task. Since the v5 release, I’ve
been focused on getting the deactivated/sleeping owner enqueuing
functionality, which I left out of v5, stabilized. I’ve managed
to rework enough to avoid the crashes previously tripped with the
locktorture & ww_mutex selftests, so I feel it’s much improved,
but I do still see some issues (blocked waitqueues and hung task
watchdogs firing) after stressing the system for many hours in a
64-CPU qemu environment (though this issue seems to be introduced
earlier in the series with proxy-migration/return-migration).

I still haven’t had time to focus on testing and debugging the
chain migration patches. So I’ve left that patch out of this
submission for now, but will try to get it included in the next
revision.

This patch series is actually coarser than what I’ve been
developing with, as my working tree includes a number of small
“test” steps to help validate changed behavior, which are then
replaced by the real logic afterwards. Including those here would just cause
more work for reviewers, so I’ve folded them together, but if
you’re interested you can find the fine-grained tree here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6-fine-grained

As mentioned previously, this Proxy Execution series has a long
history: it was first described in a paper[1] by Watkins, Straub,
and Niehaus, then implemented in patches from Peter Zijlstra, and
extended with lots of work by Juri Lelli, Valentin Schneider, and
Connor O'Brien. (And thank you to Steven Rostedt for providing
additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Overview:
---------
Proxy Execution is a generalized form of priority inheritance.
Classic priority inheritance works well for real-time tasks where
there is a straightforward priority ordering of how things are run.
But it breaks down when used between CFS or DEADLINE tasks, as
there are lots of parameters involved outside of just the task’s
nice value when selecting the next task to run (via
pick_next_task()).  So ideally we want to imbue the mutex holder
with all the scheduler attributes of the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked
  on, and which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so
  that we instead run the mutex holder. 

By leaving blocked tasks on the runqueue, we allow
pick_next_task() to choose the task that should run next (even if
it’s blocked waiting on a mutex). If we do select a blocked task,
we look at the task’s blocked_on mutex and from there look at the
mutex’s owner task. And in the simple case, the task which owns
the mutex is what we then choose to run, allowing it to release
the mutex.

This means that instead of just tracking “curr”, the scheduler
needs to track both the scheduler context (what was picked and
all the state used for scheduling decisions), and the execution
context (what we’re actually running).

In this way, the mutex owner is run “on behalf” of the blocked
task that was picked to run, essentially inheriting the scheduler
context of the waiting blocked task.

As Connor outlined in a previous submission of this patch series,
this raises a number of complicated situations:  The mutex owner
might itself be blocked on another mutex, or it could be
sleeping, running on a different CPU, in the process of migrating
between CPUs, etc.

The complexity from this is imposing, but currently in Android we
have a large number of cases where we are seeing priority
inversion (not unbounded, but much longer than we’d like) between
“foreground” and “background” SCHED_NORMAL applications. As a
result, currently we cannot usefully limit background activity
without introducing inconsistent behavior. So Proxy Execution is
a very attractive approach to resolving this critical issue.


New in v6:
---------
* Big rework of blocked owner enqueuing, re-adding the
  functionality to the series.

* Added a boot option proxy_exec= and additional logic to allow
  the functionality to be toggled at boot time.

* Minor tweaks to wake_q logic suggested by Waiman Long

* Minor fixups Reported-by: kernel test robot <lkp@intel.com>

* Various cleanups, optimizations as well as fixes for bugs found in v5

Also, I know Peter didn’t like the blocked_on wrappers, so I
previously dropped them, but I found them (and the debug checks
within) crucial to working out issues in this patch series. I’m
fine to consider dropping them later if they are still
objectionable, but their utility was too high at this point to
get rid of them.


Issues still to address:
---------
* As already mentioned, the sleeping owner handling (where we
  deactivate waiting tasks and enqueue them onto a list, then
  reactivate them when the owner wakes up) has been very
  difficult to get right. This is in part because when we want
  to activate tasks, we’re already holding a task.pi_lock and a
  rq_lock, just not the locks for the task we’re activating, nor
  the rq we’re enqueuing it onto. So there has to be a bit of
  lock juggling to drop and acquire the right locks (in the right
  order).

* Similarly, as we’re often dealing with lists of tasks or chains
  of tasks and mutexes, iterating across these chains of objects
  can be done safely while holding the rq lock, but as these
  chains can cross runqueues our ability to traverse the chains
  safely is somewhat limited. 

* Additionally, in the sleeping owner enqueueing as well as in
  proxy return migration, we seem to be constantly working
  against the proper locking order (task.pi_lock -> rq_lock ->
  mutex.wait_lock -> task.blocked_lock), having to repeatedly
  “swim upstream” where we need a higher-order lock than what
  we’re already holding. Suggestions for different approaches to
  the serialization or ways to move the logic outside of where
  locks are already held would be appreciated.

* Still need to validate and re-add the chain migration patches
  to ensure they resolve the RT/DL load balancing invariants.

* “rq_selected()” naming. Peter doesn’t like it, but I’ve not
  thought of a better name. Open to suggestions.

* As discussed at OSPM[2], I’d like to split pick_next_task() up
  into two phases: selecting and setting the next task, as
  currently pick_next_task() assumes the returned task will be
  run, which results in various side effects in sched-class logic
  when it runs. I tried to take a pass at this earlier, but it’s
  hairy and lower on the priority list for now.

* CFS load balancing: blocked tasks may carry forward load (PELT)
  to the lock owner's CPU, so that CPU may appear overloaded.

* Optimization to avoid migrating blocked tasks (to preserve
  optimistic mutex spinning) if the runnable lock-owner at the
  end of the blocked_on chain is already running (though this is
  difficult due to the limitations from locking rules needed to
  traverse the blocked on chain across run queues).


Performance:
---------
This patch series switches mutexes to use handoff mode rather
than optimistic spinning. This is a potential concern where locks
are under high contention. However, earlier performance analysis
(on both x86 and mobile devices) did not see major regressions.
That said, Chenyu did report a regression[3], which I’ll need to
look further into. I also briefly re-tested with the v5 series
and saw some average latencies grow vs. v4, suggesting the changes
to return-migration (and extra locking) have some impact. With v6
the extra overhead is reduced but still not as nice as v4. I’ll
be digging more there, but my priority is still stability over
speed at this point (it’s easier to validate correctness of
optimizations if the baseline isn’t crashing).


If folks find it easier to test/tinker with, this patch series
can also be found here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6

A while back, Sage Sharp had a nice blog post about types of
reviews [4], and while any review and feedback would be greatly
appreciated, those focusing on conceptual design or correctness
issues would be *especially* valued.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[2] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
[4] https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/


Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com


John Stultz (8):
  locking/mutex: Removes wakeups from under mutex::wait_lock
  sched: Add CONFIG_PROXY_EXEC & boot argument to enable/disable
  locking/mutex: Split blocked_on logic into two states (blocked_on and
    blocked_on_waking)
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Split out __sched() deactivate task logic into a helper
  sched: Add a very simple proxy() function
  sched: Add proxy deactivate helper
  sched: Handle blocked-waiter migration (and return migration)

Juri Lelli (2):
  locking/mutex: make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (8):
  sched: Unify runtime accounting across classes
  locking/mutex: Rework task_struct::blocked_on
  locking/mutex: Add task_struct::blocked_lock to serialize changes to
    the blocked_on state
  locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC
  sched: Split scheduler execution context
  sched: Start blocked_on chain processing in proxy()
  sched: Add blocked_donor link to task for smarter mutex handoffs
  sched: Add deactivated (sleeping) owner handling to proxy()

Valentin Schneider (2):
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  sched: Fix proxy/current (push,pull)ability

 .../admin-guide/kernel-parameters.txt         |   4 +
 include/linux/sched.h                         |  49 +-
 init/Kconfig                                  |   7 +
 init/init_task.c                              |   1 +
 kernel/Kconfig.locks                          |   2 +-
 kernel/fork.c                                 |   8 +-
 kernel/locking/mutex-debug.c                  |   9 +-
 kernel/locking/mutex.c                        | 163 ++--
 kernel/locking/mutex.h                        |  25 +
 kernel/locking/ww_mutex.h                     |  72 +-
 kernel/sched/core.c                           | 723 ++++++++++++++++--
 kernel/sched/deadline.c                       |  50 +-
 kernel/sched/fair.c                           | 104 ++-
 kernel/sched/rt.c                             |  77 +-
 kernel/sched/sched.h                          |  61 +-
 kernel/sched/stop_task.c                      |  13 +-
 16 files changed, 1117 insertions(+), 251 deletions(-)
  

Comments

John Stultz Nov. 8, 2023, 10:13 p.m. UTC | #1
On Wed, Nov 8, 2023 at 3:40 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon,  6 Nov 2023 19:34:43 +0000 John Stultz <jstultz@google.com>
> > Overview:
> > —----------
> > Proxy Execution is a generalized form of priority inheritance.
> > Classic priority inheritance works well for real-time tasks where
> > there is a straight forward priority order to how things are run.
> > But it breaks down when used between CFS or DEADLINE tasks, as
> > there are lots of parameters involved outside of just the task’s
> > nice value when selecting the next task to run (via
> > pick_next_task()).  So ideally we want to imbue the mutex holder
> > with all the scheduler attributes of  the blocked waiting task.
>
>   [...]

Is there a reason why you trimmed the cc list?

> > The complexity from this is imposing, but currently in Android we
> > have a large number of cases where we are seeing priority
> > inversion (not unbounded, but much longer than we’d like) between
> > “foreground” and “background” SCHED_NORMAL applications. As a
> > result, currently we cannot usefully limit background activity
> > without introducing inconsistent behavior. So Proxy Execution is
> > a very attractive approach to resolving this critical issue.
>
> Given usual mutex use
>
>         mutex_lock;
>         less-than-a-tick level critical section;
>         (unusual case for example: sleep until wakeup;)
>         mutex_unlock;

So the case we see regularly is you have a low-priority task, which
maybe is cpuset-restricted onto a smaller, more energy-efficient CPU,
and CPU-share limited as well, so it only gets a small proportional
fraction of time on that little CPU.

Alternatively, you could also imagine it being a SCHED_IDLE task on a
busy system, where every once in a while the system is momentarily
idle allowing the task to briefly run.

Either way, it doesn't get to run very much, but when it does, it
calls into the kernel on a path that is serialized with a mutex. Once
it takes the mutex, even if it were to hold it for a short time (as
in your example above), if it gets preempted while holding the mutex, it
won't be selected to run for a while.  Then when an important task
calls into a similar kernel path, it will have to block and sleep
waiting for that mutex to be released. Unfortunately, because there
may be enough work going on in other tasks to keep the cpus busy, the
low priority task doesn't get scheduled so it cannot release the lock.
Given it is cpu share limited (or is SCHED_IDLE), depending on the
load it may not get scheduled for a relatively long time. We've
definitely seen traces where the outlier latencies are in the seconds.

> I think the effects of priority inversion could be safely ignored
> without sleep (because of page reclaim for instance) in the critical
> section.

I'm not sure I understand this assertion, could you clarify?

If it's helpful, I've got a simple (contrived) demo which can
reproduce a similar issue I've described above, using just file
renames as the mutex protected critical section.
  https://github.com/johnstultz-work/priority-inversion-demo

> Could you please elaborate a bit on the priority inversion and tip
> point to one or two cases of mutex behind the inversion?

In Android, we see it commonly with various binder mutexes or pstore
write_pmsg() pmsg_lock, as those are commonly taken by most apps.
But any common kernel path that takes a mutex can cause these large
latency outliers if we're trying to restrict background processes.

thanks
-john
  
Xuewen Yan Nov. 10, 2023, 9:07 a.m. UTC | #2
Hi John

On Tue, Nov 7, 2023 at 3:37 AM John Stultz <jstultz@google.com> wrote:
>
> Stabilizing this Proxy Execution series has unfortunately
> continued to be a challenging task. Since the v5 release, I’ve
> been focused on getting the deactivated/sleeping owner enqueuing
> functionality, which I left out of v5, stabilized. I’ve managed
> to rework enough to avoid the crashes previously tripped with the
> locktorture & ww_mutex selftests, so I feel it’s much improved,
> but I do still see some issues (blocked waitqueues and hung task
> watchdogs firing) after stressing the system for many hours in a
> 64 cpu qemu environment (though this issue seems to be introduced
> earlier in the series with proxy-migration/return-migration).
>
> I still haven’t had time to focus on testing and debugging the
> chain migration patches. So I’ve left that patch out of this
> submission for now, but will try to get it included in the next
> revision.
>
> This patch series is actually coarser than what I’ve been
> developing with, as there are a number of small “test” steps to
> help validate behavior I changed, which would then be replaced by
> the real logic afterwards. Including those here would just cause
> more work for reviewers, so I’ve folded them together, but if
> you’re interested you can find the fine-grained tree here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
>   https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6-fine-grained
>
> As mentioned previously, this Proxy Execution series has a long
> history: First described in a paper[1] by Watkins, Straub,
> Niehaus, then from patches from Peter Zijlstra, extended with
> lots of work by Juri Lelli, Valentin Schneider, and Connor
> O'Brien. (and thank you to Steven Rostedt for providing
> additional details here!)
>
> So again, many thanks to those above, as all the credit for this
> series really is due to them - while the mistakes are likely
> mine.
>
> Overview:
> —----------
> Proxy Execution is a generalized form of priority inheritance.
> Classic priority inheritance works well for real-time tasks where
> there is a straight forward priority order to how things are run.
> But it breaks down when used between CFS or DEADLINE tasks, as
> there are lots of parameters involved outside of just the task’s
> nice value when selecting the next task to run (via
> pick_next_task()).  So ideally we want to imbue the mutex holder
> with all the scheduler attributes of  the blocked waiting task.
>
> Proxy Execution does this via a few changes:
> * Keeping tasks that are blocked on a mutex *on* the runqueue
> * Keeping additional tracking of which mutex a task is blocked
>   on, and which task holds a specific mutex.
> * Special handling for when we select a blocked task to run, so
>   that we instead run the mutex holder.
>
> By leaving blocked tasks on the runqueue, we allow
> pick_next_task() to choose the task that should run next (even if
> it’s blocked waiting on a mutex). If we do select a blocked task,
> we look at the task’s blocked_on mutex and from there look at the
> mutex’s owner task. And in the simple case, the task which owns
> the mutex is what we then choose to run, allowing it to release
> the mutex.
>
> This means that instead of just tracking “curr”, the scheduler
> needs to track both the scheduler context (what was picked and
> all the state used for scheduling decisions), and the execution
> context (what we’re actually running).
>
> In this way, the mutex owner is run “on behalf” of the blocked
> task that was picked to run, essentially inheriting the scheduler
> context of the waiting blocked task.

How do we prevent malicious programs? If a program takes a hotspot
lock and never releases it, there will be more and more blocked
tasks, and thus more and more tasks on the runqueue; and because the
blocked tasks participate in scheduling, the owner may always be in
the running state. As a result, other tasks cannot be scheduled.
Maybe we can use the hung-task mechanism to also detect such
long-running tasks?
I will also think about this issue; if I come up with a patch, I
will send it here.

On the other hand, even setting aside the extreme situation above,
the performance impact of having more tasks on the runqueue may
require more testing.

BR

---
xuewen

> [..snip..]
  
K Prateek Nayak Dec. 13, 2023, 6:37 a.m. UTC | #3
Hello John,

I may have some data that might help you debug a potential performance
issue mentioned below.

On 11/7/2023 1:04 AM, John Stultz wrote:
> [..snip..]
> 
> Performance:
> —----------
> This patch series switches mutexes to use handoff mode rather
> than optimistic spinning. This is a potential concern where locks
> are under high contention. However, earlier performance analysis
> (on both x86 and mobile devices) did not see major regressions.
> That said, Chenyu did report a regression[3], which I’ll need to
> look further into.

I too see this as the most notable regression. Some of the other
benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
DeathStarBench) show little to no difference when running with Proxy
Execution; however, sched-messaging sees a 10x blowup in runtime.
(taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1)

While investigating, I plotted the runqueue lengths when running
sched-messaging pinned to one CCX (CPUs 0-7,128-135 on my 3rd Generation
EPYC system) using the following bpftrace script, which dumps them in
CSV format:

rqlen.bt
---
BEGIN
{
	$i = 0;
	printf("[%12s],", "Timestamp");
	while ($i < 8)
	{
		printf("CPU%3d,", $i);
		$i = $i + 1;
	}
	$i = 128;
	while ($i < 136)
	{
		printf("CPU%3d,", $i);
		$i = $i + 1;
	}
	printf("\n");
}

kprobe:scheduler_tick
{
	@runqlen[curtask->wake_cpu] = curtask->se.cfs_rq->rq->nr_running;
}

tracepoint:power:cpu_idle
{
	@runqlen[curtask->wake_cpu] = 0;
}

interval:hz:50
{
	$i = 0;
	printf("[%12lld],", elapsed);
	while ($i < 8)
	{
		printf("%3d,", @runqlen[$i]);
		$i = $i + 1;
	}

	$i=128;
	while ($i < 136)
	{
		printf("%3d,", @runqlen[$i]);
		$i = $i + 1;
	}
	printf("\n");
}

END
{
	clear(@runqlen);
}

--

I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
execution (rqlen50-pe-pinned.csv) below.

The trend I see with hackbench is that the chain migration leads
to a single runqueue being completely overloaded, followed by some
amount of idling across the entire CCX, and then a similar pileup
appearing on a different CPU. The trace for tip shows a lot more
CPUs being utilized.

Mathieu has been looking at hackbench and the effect of task
migration on its runtime, and it appears that lowering the number of
migrations improves hackbench performance significantly [1][2][3].

[1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
[2] https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
[3] https://lore.kernel.org/lkml/20231019160523.1582101-1-mathieu.desnoyers@efficios.com/

Since migration itself is not cheap, I believe the chain migration
at the current scale hampers performance, as sched-messaging
emulates a worst-case scenario for proxy execution.

I'll update the thread once I have more information. I'll continue
testing and take a closer look at the implementation.

> I also briefly re-tested with this v5 series
> and saw some average latencies grow vs v4, suggesting the changes
> to return-migration (and extra locking) have some impact. With v6
> the extra overhead is reduced but still not as nice as v4. I’ll
> be digging more there, but my priority is still stability over
> speed at this point (it’s easier to validate correctness of
> optimizations if the baseline isn’t crashing).
> 
> 
> If folks find it easier to test/tinker with, this patch series
> can also be found here:
>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
>   https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6

P.S. I was using the above tree.

> 
> Awhile back Sage Sharp had a nice blog post about types of
> reviews [4], and while any review and feedback would be greatly
> appreciated, those focusing on conceptual design or correctness
> issues would be *especially* valued.

I have skipped a few phases that Sage mentions in their blog, but I'll
try my best to follow the order from here on :)

> 
> Thanks so much!
> -john
> 
> [1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
> [2] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
> [3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
> [4] https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/
> 
> 
> [..snip..]
> 

--
Thanks and Regards,
Prateek
  
Metin Kaya Dec. 13, 2023, 4:20 p.m. UTC | #4
On Wed, 2023-12-13 at 12:07 +0530, K Prateek Nayak wrote:
> Hello John,
>
> I may have some data that might help you debug a potential
> performance
> issue mentioned below.
>
> On 11/7/2023 1:04 AM, John Stultz wrote:
> > [..snip..]
> >
> > Performance:
> > ------------
> > This patch series switches mutexes to use handoff mode rather
> > than optimistic spinning. This is a potential concern where locks
> > are under high contention. However, earlier performance analysis
> > (on both x86 and mobile devices) did not see major regressions.
> > That said, Chenyu did report a regression[3], which I’ll need to
> > look further into.
>
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy
> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g
> 1)

I observe a similar regression. The total time of `taskset -c 0-5,6-11 perf
bench sched messaging -p -t -l 100000 -g 1` increases from 13.964 secs
to 184.866 secs on my test machine. The other perf sched benchmarks look
OK.

>
> While investigating, I drew up the runqueue length when running
> sched-messaging pinned to 1CCX (CPUs 0-7,128-125 on my 3rd Generation
> EPYC system) using the following bpftrace script that dumps it csv
> format:
>

[snip]

  
John Stultz Dec. 13, 2023, 7:11 p.m. UTC | #5
On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John,
>
> I may have some data that might help you debug a potential performance
> issue mentioned below.

Hey Prateek,
  Thank you so much for taking the time to try this out and providing
such helpful analysis!
More below.

> On 11/7/2023 1:04 AM, John Stultz wrote:
> > [..snip..]
> >
> > Performance:
> > ------------
> > This patch series switches mutexes to use handoff mode rather
> > than optimistic spinning. This is a potential concern where locks
> > are under high contention. However, earlier performance analysis
> > (on both x86 and mobile devices) did not see major regressions.
> > That said, Chenyu did report a regression[3], which I’ll need to
> > look further into.
>
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy

This is great to hear! Thank you for providing this input!

> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)

Oof. I appreciate you sharing this!

> While investigating, I drew up the runqueue length when running
> sched-messaging pinned to 1CCX (CPUs 0-7,128-125 on my 3rd Generation
> EPYC system) using the following bpftrace script that dumps it csv
> format:

Just so I'm following you properly on the processor you're using, cpus
0-7 and 128-125 are in the same CCX?
(I thought there were only 8 cores per CCX?)

> rqlen.bt
> ---
<snip>
> --
>
> I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
> execution (rqlen50-pe-pinned.csv) below.
>
> The trend I see with hackbench is that the chain migration leads
> to a single runqueue being completely overloaded, followed by some
> amount of the idling on the entire CCX and a similar chain appearing
> on a different CPU. The trace for tip show a lot more CPUs being
> utilized.
>
> Mathieu has been looking at hackbench and the effect of task migration
> on the runtime and it appears that lowering the migrations improves
> the hackbench performance significantly [1][2][3]
>
> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
> [2] https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
> [3] https://lore.kernel.org/lkml/20231019160523.1582101-1-mathieu.desnoyers@efficios.com/
>
> Since migration itself is not cheap, I believe the chain migration at
> the current scale hampers the performance since sched-messaging
> emulates a worst-case scenario for proxy-execution.

Hrm.

> I'll update the thread once I have more information. I'll continue
> testing and take a closer look at the implementation.
>
> > I also briefly re-tested with this v5 series
> > and saw some average latencies grow vs v4, suggesting the changes
> > to return-migration (and extra locking) have some impact. With v6
> > the extra overhead is reduced but still not as nice as v4. I’ll
> > be digging more there, but my priority is still stability over
> > speed at this point (it’s easier to validate correctness of
> > optimizations if the baseline isn’t crashing).
> >
> >
> > If folks find it easier to test/tinker with, this patch series
> > can also be found here:
> >   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
> >   https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6
>
> P.S. I was using the above tree.

Ok, I've been working on getting v7 ready which includes two main things:
1) I've reworked the return migration back into the ttwu path to avoid
the lock juggling
2) Working to properly conditionalize and re-add Connor's
chain-migration feature (which when a migration happens pulls the full
blocked_donor list with it)

So I'll try to reproduce your results and see if these help any with
this particular case, and then I'll start to look closer at what can
be done.

Again, thanks so much, I've got so much gratitude for your testing and
analysis here. I really appreciate your feedback!
Do let me know if you find anything further!

thanks
-john
  
John Stultz Dec. 14, 2023, 1 a.m. UTC | #6
On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy
> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)
...
> The trend I see with hackbench is that the chain migration leads
> to a single runqueue being completely overloaded, followed by some
> amount of the idling on the entire CCX and a similar chain appearing
> on a different CPU. The trace for tip show a lot more CPUs being
> utilized.

So I reproduced a similar issue with the test (I only have 8 cores on
my bare-metal box, so I didn't bother using taskset):

perf bench sched messaging -p -t -l 100000 -g 1
v6.6:                   4.583s
proxy-exec-6.6:         28.842s
proxy-exec-WIP:         26.1033s

So the pre-v7 code does improve things, but not substantially.

Bisecting through the patch series to see how it regressed shows it is
not a single change:
mutex-handoff:          16.957s
blocked-waiter/return:  29.628s
ttwu return:            20.045s
blocked_donor:          25.021s

So it seems the majority of the regression comes from switching to
mutex handoff instead of optimistic spinning.
This would also account for your comment about more CPUs being
utilized on tip, since with optimistic spinning more of the blocked
tasks would be busy spinning trying to take the mutex.

Then adding the initial blocked-waiter/return migration change hurts
things further (and this was a known issue with v5/v6).
Then the pending patch to switch back to doing return migration in
ttwu recovers a chunk of that cost.
And then the blocked_donor handoff optimization (which passes the lock
back to the donor boosting us, rather than the next task in the mutex
waitlist) further impacts performance here.

The chain migration feature in proxy-exec-WIP doesn't seem to help or
hurt in this case.

I'll need to look closer at each of those steps to see if there's
anything too daft I'm doing.

The loss of optimistic spinning has been a long-term worry with the
patch. Unfortunately, as I mentioned in the Plumbers talk, the idea
from OSPM of avoiding migration and spinning when the runnable owner
of a blocked_on chain is on a cpu isn't easy to accomplish, as we are
limited in how far off the rq we can look. I feel like I need to come
up with some way to lock the entire blocked_on chain so it can be
traversed safely. As it is now, due to the locking order
(task->pi_lock, rq_lock, mutex->wait_lock, task->blocked_on) we
can't stably traverse task->mutex->task->mutex chains unless the
tasks are all on the same rq (and we're holding the rq_lock). So we
can only safely look at one mutex owner off of the rq (when we also
hold the mutex wait_lock).

I'm stewing on an idea for some sort of mutex holder (similar to a
runqueue) that would allow the chain of mutexes to be traversed
quickly and stably, but the many-to-one blocked_on relationship
complicates it.
Suggestions or other ideas here would be welcome, as I didn't get a
chance to really discuss it much at plumbers.

thanks
-john
  
John Stultz Dec. 14, 2023, 1:03 a.m. UTC | #7
On Wed, Dec 13, 2023 at 5:00 PM John Stultz <jstultz@google.com> wrote:
> I'm stewing on an idea for some sort of mutex holder (similar to a

On re-reading this, I realized holder is far too overloaded a term in
this context.

"mutex container" might be a better term for the idea.

thanks
-john
  
K Prateek Nayak Dec. 14, 2023, 5:15 a.m. UTC | #8
Hello John,

Thank you for taking a look at the report.

On 12/14/2023 12:41 AM, John Stultz wrote:
> On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello John,
>>
>> I may have some data that might help you debug a potential performance
>> issue mentioned below.
> 
> Hey Prateek,
>   Thank you so much for taking the time to try this out and providing
> such helpful analysis!
> More below.
> 
>> On 11/7/2023 1:04 AM, John Stultz wrote:
>>> [..snip..]
>>>
>>> Performance:
>>> ------------
>>> This patch series switches mutexes to use handoff mode rather
>>> than optimistic spinning. This is a potential concern where locks
>>> are under high contention. However, earlier performance analysis
>>> (on both x86 and mobile devices) did not see major regressions.
>>> That said, Chenyu did report a regression[3], which I’ll need to
>>> look further into.
>>
>> I too see this as the most notable regression. Some of the other
>> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
>> DeathStarBench) show little to no difference when running with Proxy
> 
> This is great to hear! Thank you for providing this input!
> 
>> Execution, however sched-messaging sees a 10x blowup in the runtime.
>> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)
> 
> Oof. I appreciate you sharing this!
> 
>> While investigating, I drew up the runqueue length when running
>> sched-messaging pinned to 1CCX (CPUs 0-7,128-125 on my 3rd Generation
>> EPYC system) using the following bpftrace script that dumps it csv
>> format:
> 
> Just so I'm following you properly on the processor you're using, cpus
> 0-7 and 128-125 are in the same CCX?
> (I thought there were only 8 cores per CCX?)

Sorry about that! It should be 0-7,128-135 (16 threads of 8 cores in
the same CCX). The pinning was added so that I could observe only a
subset of the total CPUs, since analyzing the behavior of 40 tasks on
256 CPUs was much harder than analyzing it on 16 CPUs :)

> 
>> rqlen.bt
>> ---
> <snip>
>> --
>>
>> I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
>> execution (rqlen50-pe-pinned.csv) below.
>>
>> The trend I see with hackbench is that the chain migration leads
>> to a single runqueue being completely overloaded, followed by some
>> amount of the idling on the entire CCX and a similar chain appearing
>> on a different CPU. The trace for tip show a lot more CPUs being
>> utilized.
>>
>> Mathieu has been looking at hackbench and the effect of task migration
>> on the runtime and it appears that lowering the migrations improves
>> the hackbench performance significantly [1][2][3]
>>
>> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
>> [2] https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
>> [3] https://lore.kernel.org/lkml/20231019160523.1582101-1-mathieu.desnoyers@efficios.com/
>>
>> Since migration itself is not cheap, I believe the chain migration at
>> the current scale hampers the performance since sched-messaging
>> emulates a worst-case scenario for proxy-execution.
> 
> Hrm.
> 
>> I'll update the thread once I have more information. I'll continue
>> testing and take a closer look at the implementation.
>>
>>> I also briefly re-tested with this v5 series
>>> and saw some average latencies grow vs v4, suggesting the changes
>>> to return-migration (and extra locking) have some impact. With v6
>>> the extra overhead is reduced but still not as nice as v4. I’ll
>>> be digging more there, but my priority is still stability over
>>> speed at this point (it’s easier to validate correctness of
>>> optimizations if the baseline isn’t crashing).
>>>
>>>
>>> If folks find it easier to test/tinker with, this patch series
>>> can also be found here:
>>>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
>>>   https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6
>>
>> P.S. I was using the above tree.
> 
> Ok, I've been working on getting v7 ready which includes two main things:
> 1) I've reworked the return migration back into the ttwu path to avoid
> the lock juggling
> 2) Working to properly conditionalize and re-add Connor's
> chain-migration feature (which when a migration happens pulls the full
> blocked_donor list with it)
> 
> So I'll try to reproduce your results and see if these help any with
> this particular case, and then I'll start to look closer at what can
> be done.
> 
> Again, thanks so much, I've got so much gratitude for your testing and
> analysis here. I really appreciate your feedback!
> Do let me know if you find anything further!

Sure thing! I'll keep you updated on any findings. Thank you for digging
further into this issue.

> 
> thanks
> -john

--
Thanks and Regards,
Prateek
  
Qais Yousef Dec. 17, 2023, 3:07 a.m. UTC | #9
Hi John

On 11/06/23 19:34, John Stultz wrote:
> Stabilizing this Proxy Execution series has unfortunately
> continued to be a challenging task. Since the v5 release, I’ve
> been focused on getting the deactivated/sleeping owner enqueuing
> functionality, which I left out of v5, stabilized. I’ve managed
> to rework enough to avoid the crashes previously tripped with the
> locktorture & ww_mutex selftests, so I feel it’s much improved,
> but I do still see some issues (blocked waitqueues and hung task
> watchdogs firing) after stressing the system for many hours in a
> 64 cpu qemu environment (though this issue seems to be introduced
> earlier in the series with proxy-migration/return-migration).
> 
> I still haven’t had time to focus on testing and debugging the
> chain migration patches. So I’ve left that patch out of this
> submission for now, but will try to get it included in the next
> revision.
> 
> This patch series is actually coarser than what I’ve been
> developing with, as there are a number of small “test” steps to
> help validate behavior I changed, which would then be replaced by
> the real logic afterwards. Including those here would just cause
> more work for reviewers, so I’ve folded them together, but if
> you’re interested you can find the fine-grained tree here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
>   https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6-fine-grained
> 
> As mentioned previously, this Proxy Execution series has a long
> history: First described in a paper[1] by Watkins, Straub,
> Niehaus, then from patches from Peter Zijlstra, extended with
> lots of work by Juri Lelli, Valentin Schneider, and Connor
> O'Brien. (and thank you to Steven Rostedt for providing
> additional details here!)

Thanks a lot for all your effort in trying to push this difficult patchset
forward!

I am trying to find more time to help with review and hopefully debugging too,
but as it stands, I think that to make progress we need to break this
patchset into smaller problems and get them merged in phases, so that at least
the review and the actual work would be broken down into smaller, more
manageable chunks.

From my very birds eye view it seems we have 3 elements:

	1. Extend locking infrastructure.
	2. Split task context into scheduling and execution.
	3. Actual proxy_execution implementation.

It seems to me (and as ever I could be wrong of course) the first 7 patches are
more or less stable? Could you send patch 1 individually and the next 6 patches
to get the ground work to extend locking reviewed and merged first?

After that we can focus on splitting the task context into scheduling and
execution (and maybe introduce the PROXY_EXEC config along with it), but
without actually implementing the inheritance, etc. parts? Just generally
teaching the scheduler that these are now 2 separate parts.

Are 1 and 2 dependent on each other or can be sent as two series in parallel
actually?

Hopefully this should reduce the work a lot from continuously rebasing these
patches and focus on the last part which is the meaty and most difficult bit
IIUC. Which I hope we can break down too; but I have no idea at the moment how
to do that.

Merging in parts will help by giving each part a chance to soak individually
in mainline while the rest is being discussed, which would make handling
potential fallouts easier too.

I don't know what the others think, but IMHO if there are stable parts of this
series, I think we should focus on trying to merge those elements first. I hope
you'll be the one to get this through the finish line, but if for whatever
reason yet another person needs to carry it over, they'd find something got
merged at least :-) I'm sure reviving these patches every time is no easy task!


Cheers

--
Qais Yousef
  
John Stultz Dec. 18, 2023, 11:38 p.m. UTC | #10
On Sat, Dec 16, 2023 at 7:07 PM Qais Yousef <qyousef@layalina.io> wrote:
> I am trying to find more time to help with review and hopefully debugging too,
> but as it stands, I think to make progress we need to think about breaking this
> patchset into smaller problems and get them merged into phases so at least the
> review and actual work done would be broken down into smaller more manageable
> chunks.
>
> From my very birds eye view it seems we have 3 elements:
>
>         1. Extend locking infrastructure.
>         2. Split task context into scheduling and execution.
>         3. Actual proxy_execution implementation.
>
> It seems to me (and as ever I could be wrong of course) the first 7 patches are
> more or less stable? Could you send patch 1 individually and the next 6 patches
> to get the ground work to extend locking reviewed and merged first?

So I'm working hard to get v7 out the door here, but your general
suggestion here sounds fair.

Part of why I've not pushed as hard to get the first initial patches
in, is that the earlier changes to rework things are mostly done in
service of the later proxy exec logic, so it may be a little hard to
justify the churn on their own.  Also, I've been hoping to get more
discussion & feedback around the big picture - but I suspect the size
& number of the changes makes this daunting.

That said, if Peter is up for it, I'd be happy if the initial chunks
of the series were to be considered to be pulled in.

> After that we can focus on splitting the task context into scheduling and
> execution (and maybe introduce the PROXY_EXEC config along with it) but without
> actually implementing the inheritance, etc parts? Just generally teaching the
> scheduler these now are 2 separate parts.

The majority of that is done in a single patch:
https://github.com/johnstultz-work/linux-dev/commit/9e3b364f3724ed840137d681876268b0ad67a469

There we start to have separate names, but at least until we are doing
the proxying, the rq->curr and rq_selected() will be the same task.

> Are 1 and 2 dependent on each other or can be sent as two series in parallel
> actually?

Technically, I think they can be done in parallel. Though I'm not sure
if managing and submitting multiple patch series is easier or harder
for folks to review.

> Hopefully this should reduce the work a lot from continuously rebasing these
> patches and focus on the last part which is the meaty and most difficult bit
> IIUC. Which I hope we can break down too; but I have no idea at the moment how
> to do that.

Rebasing hasn't been the major problem, but wrangling the large patch
set has. For v7 I got a lot of offlist feedback, and propagating those
changes through the full fine-grained stack makes even trivial changes
to early patches quite laborious.

As for breaking down the rest, I thought the v6 series had later
patches broken down fairly well:
1) Just local rq proxying (where the owner happens to be on the current rq)
2) Add proxy migration & return migration, so we can boost tasks on remote rqs
3) Sleeping owner enqueueing (so we get woken and added to the rq the
owner gets woken on)
...

And I have the fine-grained version that is even more split up so I
could test each small change at a time:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained

But I'm open to other suggestions.

> Merging in parts will help with giving each part a chance to soak individually
> in mainline while the rest is being discussed. Which would make handling
> potential fall overs easier too.

Again, I think this would be great.

I'll try to get v7 out the door so folks can once more consider the
big picture, but then maybe I'll send out the earlier changes more
frequently.

thanks
-john
  
Qais Yousef Dec. 28, 2023, 4:45 p.m. UTC | #11
On 12/18/23 15:38, John Stultz wrote:
> On Sat, Dec 16, 2023 at 7:07 PM Qais Yousef <qyousef@layalina.io> wrote:
> > I am trying to find more time to help with review and hopefully debugging too,
> > but as it stands, I think to make progress we need to think about breaking this
> > patchset into smaller problems and get them merged into phases so at least the
> > review and actual work done would be broken down into smaller more manageable
> > chunks.
> >
> > From my very birds eye view it seems we have 3 elements:
> >
> >         1. Extend locking infrastructure.
> >         2. Split task context into scheduling and execution.
> >         3. Actual proxy_execution implementation.
> >
> > It seems to me (and as ever I could be wrong of course) the first 7 patches are
> > more or less stable? Could you send patch 1 individually and the next 6 patches
> > to get the ground work to extend locking reviewed and merged first?
> 
> So I'm working hard to get v7 out the door here, but your general
> suggestion here sounds fair.
> 
> Part of why I've not pushed as hard to get the first initial patches
> in, is that the earlier changes to rework things are mostly done in
> service of the later proxy exec logic, so it may be a little hard to
> justify the churn on their own.  Also, I've been hoping to get more

If by churn you mean they'd be changing a lot later, then yes.

But if by churn you mean merging patches that serve a purpose that is yet to
follow, then I think that is fine. I think other complex patches were staged
in pieces to help with both the review and test process, and to speed things
up.

> discussion & feedback around the big picture - but I suspect the size
> & number of the changes makes this daunting.

I don't know how the maintainers manage in general tbh. But finding the time
for a smaller set of patches would be easier IMHO for everyone interested.

Assuming my understanding is correct that the first set of patches, up to
splitting the scheduler context, is more or less stable.

> That said, if Peter is up for it, I'd be happy if the initial chunks
> of the series were to be considered to be pulled in.
> 
> > After that we can focus on splitting the task context into scheduling and
> > execution (and maybe introduce the PROXY_EXEC config along with it) but without
> > actually implementing the inheritance, etc parts? Just generally teaching the
> > scheduler these now are 2 separate parts.
> 
> The majority of that is done in a single patch:
> https://github.com/johnstultz-work/linux-dev/commit/9e3b364f3724ed840137d681876268b0ad67a469

Yep!

> 
> There we start to have separate names, but at least until we are doing
> the proxying, the rq->curr and rq_selected() will be the same task.
> 
> > Are 1 and 2 dependent on each other or can be sent as two series in parallel
> > actually?
> 
> Technically, I think they can be done in parallel. Though I'm not sure
> if managing and submitting multiple patch series is easier or harder
> for folks to review.

I didn't mean you should do that :-) Meant they're actually completely
independent.

> 
> > Hopefully this should reduce the work a lot from continuously rebasing these
> > patches and focus on the last part which is the meaty and most difficult bit
> > IIUC. Which I hope we can break down too; but I have no idea at the moment how
> > to do that.
> 
> Rebasing hasn't been the major problem, but wrangling the large patch
> set has. For v7 I got a lot of offlist feedback, and propagating those
> changes through the full fine-grained stack makes even trivial changes
> to early patches quite laborious.
> 
> As for breaking down the rest, I thought the v6 series had later
> patches broken down fairly well:

Hmm okay. I thought they were dependent; and I was referring to potentially
independent parts that could be piecewise merged. But perhaps I'm overthinking
the splitting here :-)

> 1) Just local rq proxying (where the owner happens to be on the current rq)
> 2) Add proxy migration & return migration, so we can boost tasks on remote rqs
> 3) Sleeping owner enqueueing (so we get woken and added to the rq the
> owner gets woken on)
> ...
> 
> And I have the fine-grained version that is even more split up so I
> could test each small change at a time:
>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
> 
> But I'm open to other suggestions.
> 
> > Merging in parts will help with giving each part a chance to soak individually
> > in mainline while the rest is being discussed. Which would make handling
> > potential fall overs easier too.
> 
> Again, I think this would be great.
> 
> I'll try to get v7 out the door so folks can once more consider the
> big picture, but then maybe I'll send out the earlier changes more
> frequently.


Thanks!

--
Qais Yousef