[v7,08/23] sched: Split scheduler and execution contexts

Message ID 20231220001856.3710363-9-jstultz@google.com
State New
Series: Proxy Execution: A generalized form of Priority Inheritance v7

Commit Message

John Stultz Dec. 20, 2023, 12:18 a.m. UTC
  From: Peter Zijlstra <peterz@infradead.org>

Let's define the scheduling context as all the scheduler state
in task_struct for the task selected to run, and the execution
context as all state required to actually run the task.

Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the task selected to be scheduled, but use the
execution context of a different task to actually be run.

To this purpose, introduce the rq_selected() macro, which points
to the task_struct selected from the runqueue by the scheduler
and is used for scheduler state, and preserve rq->curr to
indicate the execution context of the task that will actually be
run.
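As a rough userspace sketch of the accessor pattern (hypothetical,
heavily simplified types, not the real kernel structures — the actual
accessors are in the kernel/sched/sched.h hunk of the patch), with the
proxy-exec config assumed enabled:

```c
#include <assert.h>
#include <stddef.h>

/* Assume the proxy-exec option is enabled for this sketch. */
#define CONFIG_SCHED_PROXY_EXEC 1

struct task_struct {
	int prio;	/* stand-in for the scheduler state in task_struct */
};

struct rq {
	struct task_struct *curr;	   /* execution context */
#ifdef CONFIG_SCHED_PROXY_EXEC
	struct task_struct *curr_selected; /* scheduling context */
#endif
};

#ifdef CONFIG_SCHED_PROXY_EXEC
#define rq_selected(rq)	((rq)->curr_selected)
static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
{
	rq->curr_selected = t;
}
#else
/* Without the config there is no second pointer to maintain:
 * the selected task is always the running task. */
#define rq_selected(rq)	((rq)->curr)
static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
{
	(void)rq;
	(void)t;
}
#endif
```

With proxy execution the two pointers can diverge: pick_next_task() may
select a blocked high-priority waiter (whose state drives accounting and
preemption checks via rq_selected()), while rq->curr points at the lock
owner actually running on its behalf.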

NOTE: Peter previously mentioned he didn't like the name
"rq_selected()", but I've not come up with a better alternative.
I'm very open to other name proposals.

Question for Peter: Dietmar suggested you'd prefer I drop the
conditionalization of the scheduler context pointer on the rq
(so rq_selected() would be open coded as rq->curr_selected or
whatever we agree on for a name), but I'd think in the
!CONFIG_SCHED_PROXY_EXEC case we'd want to avoid the wasted
pointer and its use (since curr_selected would always be == curr)?
If I'm wrong I'm fine switching this, but would appreciate
clarification.
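One recurring pattern in the diff is worth spelling out: preemption
decisions compare the waking task against the *selected* task's state,
but the resched flag is still delivered to rq->curr, since that is the
task physically occupying the CPU. A hypothetical, heavily simplified
sketch of that pattern (not the kernel's actual types or functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_struct {
	int prio;		/* lower value = higher RT priority */
	bool need_resched;
};

struct rq {
	struct task_struct *curr;	   /* execution context */
	struct task_struct *curr_selected; /* scheduling context */
};

#define rq_selected(rq)	((rq)->curr_selected)

/* Sketch of the post-patch wakeup-preemption shape: the priority
 * comparison uses the selected task (scheduling context), while the
 * resched request is posted to rq->curr (execution context), because
 * that is the task that must be interrupted on the CPU. */
static void wakeup_preempt_sketch(struct rq *rq, struct task_struct *p)
{
	if (p->prio < rq_selected(rq)->prio)
		rq->curr->need_resched = true;
}
```

So a waker that outranks the proxy (rq->curr) but not the selected task
does not trigger a reschedule — the selected task's priority is what
the runqueue is logically running at.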

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20181009092434.26221-5-juri.lelli@redhat.com
[add additional comments and update more sched_class code to use
 rq::proxy]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Rebased and resolved minor collisions, reworked to use
 accessors, tweaked update_curr_common to use rq_proxy fixing rt
 scheduling issues]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Reworked to use accessors
* Fixed update_curr_common to use proxy instead of curr
v3:
* Tweaked wrapper names
* Swapped proxy for selected for clarity
v4:
* Minor variable name tweaks for readability
* Use a macro instead of an inline function and drop
  other helper functions as suggested by Peter.
* Remove verbose comments/questions to avoid review
  distractions, as suggested by Dietmar
v5:
* Add CONFIG_PROXY_EXEC option to this patch so the
  new logic can be tested with this change
* Minor fix to grab rq_selected when holding the rq lock
v7:
* Minor spelling fix and unused argument fixes suggested by
  Metin Kaya
* Switch to curr_selected for consistency, and minor rewording
  of commit message for clarity
* Rename variables to 'selected' instead of 'curr' when we're
  using rq_selected()
* Reduce macros in CONFIG_SCHED_PROXY_EXEC ifdef sections,
  as suggested by Metin Kaya
---
 kernel/sched/core.c     | 46 ++++++++++++++++++++++++++---------------
 kernel/sched/deadline.c | 35 ++++++++++++++++---------------
 kernel/sched/fair.c     | 18 ++++++++--------
 kernel/sched/rt.c       | 40 +++++++++++++++++------------------
 kernel/sched/sched.h    | 35 +++++++++++++++++++++++++++++--
 5 files changed, 109 insertions(+), 65 deletions(-)
  

Comments

Metin Kaya Dec. 21, 2023, 10:43 a.m. UTC | #1
On 20/12/2023 12:18 am, John Stultz wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Let's define the scheduling context as all the scheduler state
> in task_struct for the task selected to run, and the execution
> context as all state required to actually run the task.
> 
> Currently both are intertwined in task_struct. We want to
> logically split these such that we can use the scheduling
> context of the task selected to be scheduled, but use the
> execution context of a different task to actually be run.

Should we update Documentation/kernel-hacking/hacking.rst (line #348: 
:c:macro:`current`) or another appropriate doc to announce separation of 
scheduling & execution contexts?

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e06558fb08aa..0ce34f5c0e0c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -822,7 +822,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
>   
>   	rq_lock(rq, &rf);
>   	update_rq_clock(rq);
> -	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
> +	rq_selected(rq)->sched_class->task_tick(rq, rq_selected(rq), 1);
>   	rq_unlock(rq, &rf);
>   
>   	return HRTIMER_NORESTART;
> @@ -2242,16 +2242,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
>   
>   void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
>   {
> -	if (p->sched_class == rq->curr->sched_class)
> -		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
> -	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
> +	struct task_struct *selected = rq_selected(rq);
> +
> +	if (p->sched_class == selected->sched_class)
> +		selected->sched_class->wakeup_preempt(rq, p, flags);
> +	else if (sched_class_above(p->sched_class, selected->sched_class))
>   		resched_curr(rq);
>   
>   	/*
>   	 * A queue event has occurred, and we're going to schedule.  In
>   	 * this case, we can save a useless back to back clock update.
>   	 */
> -	if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
> +	if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
>   		rq_clock_skip_update(rq);
>   }
>   
> @@ -2780,7 +2782,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
>   		lockdep_assert_held(&p->pi_lock);
>   
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   
>   	if (queued) {
>   		/*
> @@ -5600,7 +5602,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
>   	 * project cycles that may never be accounted to this
>   	 * thread, breaking clock_gettime().
>   	 */
> -	if (task_current(rq, p) && task_on_rq_queued(p)) {
> +	if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
>   		prefetch_curr_exec_start(p);
>   		update_rq_clock(rq);
>   		p->sched_class->update_curr(rq);
> @@ -5668,7 +5670,8 @@ void scheduler_tick(void)
>   {
>   	int cpu = smp_processor_id();
>   	struct rq *rq = cpu_rq(cpu);
> -	struct task_struct *curr = rq->curr;
> +	/* accounting goes to the selected task */
> +	struct task_struct *selected;
>   	struct rq_flags rf;
>   	unsigned long thermal_pressure;
>   	u64 resched_latency;
> @@ -5679,16 +5682,17 @@ void scheduler_tick(void)
>   	sched_clock_tick();
>   
>   	rq_lock(rq, &rf);
> +	selected = rq_selected(rq);
>   
>   	update_rq_clock(rq);
>   	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
>   	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
> -	curr->sched_class->task_tick(rq, curr, 0);
> +	selected->sched_class->task_tick(rq, selected, 0);
>   	if (sched_feat(LATENCY_WARN))
>   		resched_latency = cpu_resched_latency(rq);
>   	calc_global_load_tick(rq);
>   	sched_core_tick(rq);
> -	task_tick_mm_cid(rq, curr);
> +	task_tick_mm_cid(rq, selected);
>   
>   	rq_unlock(rq, &rf);
>   
> @@ -5697,8 +5701,8 @@ void scheduler_tick(void)
>   
>   	perf_event_task_tick();
>   
> -	if (curr->flags & PF_WQ_WORKER)
> -		wq_worker_tick(curr);
> +	if (selected->flags & PF_WQ_WORKER)
> +		wq_worker_tick(selected);
>   
>   #ifdef CONFIG_SMP
>   	rq->idle_balance = idle_cpu(cpu);
> @@ -5763,6 +5767,12 @@ static void sched_tick_remote(struct work_struct *work)
>   		struct task_struct *curr = rq->curr;
>   
>   		if (cpu_online(cpu)) {
> +			/*
> +			 * Since this is a remote tick for full dynticks mode,
> +			 * we are always sure that there is no proxy (only a
> +			 * single task is running).
> +			 */
> +			SCHED_WARN_ON(rq->curr != rq_selected(rq));
>   			update_rq_clock(rq);
>   
>   			if (!is_idle_task(curr)) {
> @@ -6685,6 +6695,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>   	}
>   
>   	next = pick_next_task(rq, prev, &rf);
> +	rq_set_selected(rq, next);
>   	clear_tsk_need_resched(prev);
>   	clear_preempt_need_resched();
>   #ifdef CONFIG_SCHED_DEBUG
> @@ -7185,7 +7196,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
>   
>   	prev_class = p->sched_class;
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   	if (queued)
>   		dequeue_task(rq, p, queue_flag);
>   	if (running)
> @@ -7275,7 +7286,7 @@ void set_user_nice(struct task_struct *p, long nice)
>   	}
>   
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   	if (queued)
>   		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
>   	if (running)
> @@ -7868,7 +7879,7 @@ static int __sched_setscheduler(struct task_struct *p,
>   	}
>   
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   	if (queued)
>   		dequeue_task(rq, p, queue_flags);
>   	if (running)
> @@ -9295,6 +9306,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
>   	rcu_read_unlock();
>   
>   	rq->idle = idle;
> +	rq_set_selected(rq, idle);
>   	rcu_assign_pointer(rq->curr, idle);
>   	idle->on_rq = TASK_ON_RQ_QUEUED;
>   #ifdef CONFIG_SMP
> @@ -9384,7 +9396,7 @@ void sched_setnuma(struct task_struct *p, int nid)
>   
>   	rq = task_rq_lock(p, &rf);
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   
>   	if (queued)
>   		dequeue_task(rq, p, DEQUEUE_SAVE);
> @@ -10489,7 +10501,7 @@ void sched_move_task(struct task_struct *tsk)
>   
>   	update_rq_clock(rq);
>   
> -	running = task_current(rq, tsk);
> +	running = task_current_selected(rq, tsk);
>   	queued = task_on_rq_queued(tsk);
>   
>   	if (queued)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 6140f1f51da1..9cf20f4ac5f9 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1150,7 +1150,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>   #endif
>   
>   	enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
> -	if (dl_task(rq->curr))
> +	if (dl_task(rq_selected(rq)))
>   		wakeup_preempt_dl(rq, p, 0);
>   	else
>   		resched_curr(rq);
> @@ -1273,7 +1273,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
>    */
>   static void update_curr_dl(struct rq *rq)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   	struct sched_dl_entity *dl_se = &curr->dl;
>   	s64 delta_exec, scaled_delta_exec;
>   	int cpu = cpu_of(rq);
> @@ -1784,7 +1784,7 @@ static int find_later_rq(struct task_struct *task);
>   static int
>   select_task_rq_dl(struct task_struct *p, int cpu, int flags)
>   {
> -	struct task_struct *curr;
> +	struct task_struct *curr, *selected;
>   	bool select_rq;
>   	struct rq *rq;
>   
> @@ -1795,6 +1795,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
>   
>   	rcu_read_lock();
>   	curr = READ_ONCE(rq->curr); /* unlocked access */
> +	selected = READ_ONCE(rq_selected(rq));
>   
>   	/*
>   	 * If we are dealing with a -deadline task, we must
> @@ -1805,9 +1806,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
>   	 * other hand, if it has a shorter deadline, we
>   	 * try to make it stay here, it might be important.
>   	 */
> -	select_rq = unlikely(dl_task(curr)) &&
> +	select_rq = unlikely(dl_task(selected)) &&
>   		    (curr->nr_cpus_allowed < 2 ||
> -		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
> +		     !dl_entity_preempt(&p->dl, &selected->dl)) &&
>   		    p->nr_cpus_allowed > 1;
>   
>   	/*
> @@ -1870,7 +1871,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
>   	 * let's hope p can move out.
>   	 */
>   	if (rq->curr->nr_cpus_allowed == 1 ||
> -	    !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
> +	    !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
>   		return;
>   
>   	/*
> @@ -1909,7 +1910,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>   static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
>   				  int flags)
>   {
> -	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
> +	if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
>   		resched_curr(rq);
>   		return;
>   	}
> @@ -1919,7 +1920,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
>   	 * In the unlikely case current and p have the same deadline
>   	 * let us try to decide what's the best thing to do...
>   	 */
> -	if ((p->dl.deadline == rq->curr->dl.deadline) &&
> +	if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
>   	    !test_tsk_need_resched(rq->curr))
>   		check_preempt_equal_dl(rq, p);
>   #endif /* CONFIG_SMP */
> @@ -1954,7 +1955,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
>   	if (hrtick_enabled_dl(rq))
>   		start_hrtick_dl(rq, p);
>   
> -	if (rq->curr->sched_class != &dl_sched_class)
> +	if (rq_selected(rq)->sched_class != &dl_sched_class)
>   		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
>   
>   	deadline_queue_push_tasks(rq);
> @@ -2268,8 +2269,8 @@ static int push_dl_task(struct rq *rq)
>   	 * can move away, it makes sense to just reschedule
>   	 * without going further in pushing next_task.
>   	 */
> -	if (dl_task(rq->curr) &&
> -	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
> +	if (dl_task(rq_selected(rq)) &&
> +	    dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
>   	    rq->curr->nr_cpus_allowed > 1) {
>   		resched_curr(rq);
>   		return 0;
> @@ -2394,7 +2395,7 @@ static void pull_dl_task(struct rq *this_rq)
>   			 * deadline than the current task of its runqueue.
>   			 */
>   			if (dl_time_before(p->dl.deadline,
> -					   src_rq->curr->dl.deadline))
> +					   rq_selected(src_rq)->dl.deadline))
>   				goto skip;
>   
>   			if (is_migration_disabled(p)) {
> @@ -2435,9 +2436,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
>   	if (!task_on_cpu(rq, p) &&
>   	    !test_tsk_need_resched(rq->curr) &&
>   	    p->nr_cpus_allowed > 1 &&
> -	    dl_task(rq->curr) &&
> +	    dl_task(rq_selected(rq)) &&
>   	    (rq->curr->nr_cpus_allowed < 2 ||
> -	     !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
> +	     !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
>   		push_dl_tasks(rq);
>   	}
>   }
> @@ -2612,12 +2613,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
>   		return;
>   	}
>   
> -	if (rq->curr != p) {
> +	if (rq_selected(rq) != p) {
>   #ifdef CONFIG_SMP
>   		if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
>   			deadline_queue_push_tasks(rq);
>   #endif
> -		if (dl_task(rq->curr))
> +		if (dl_task(rq_selected(rq)))
>   			wakeup_preempt_dl(rq, p, 0);
>   		else
>   			resched_curr(rq);
> @@ -2646,7 +2647,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
>   	if (!rq->dl.overloaded)
>   		deadline_queue_pull_task(rq);
>   
> -	if (task_current(rq, p)) {
> +	if (task_current_selected(rq, p)) {
>   		/*
>   		 * If we now have a earlier deadline task than p,
>   		 * then reschedule, provided p is still on this
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1251fd01a555..07216ea3ed53 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1157,7 +1157,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
>    */
>   s64 update_curr_common(struct rq *rq)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   	s64 delta_exec;
>   
>   	delta_exec = update_curr_se(rq, &curr->se);
> @@ -1203,7 +1203,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>   
>   static void update_curr_fair(struct rq *rq)
>   {
> -	update_curr(cfs_rq_of(&rq->curr->se));
> +	update_curr(cfs_rq_of(&rq_selected(rq)->se));
>   }
>   
>   static inline void
> @@ -6611,7 +6611,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>   		s64 delta = slice - ran;
>   
>   		if (delta < 0) {
> -			if (task_current(rq, p))
> +			if (task_current_selected(rq, p))
>   				resched_curr(rq);
>   			return;
>   		}
> @@ -6626,7 +6626,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>    */
>   static void hrtick_update(struct rq *rq)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   
>   	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
>   		return;
> @@ -8235,7 +8235,7 @@ static void set_next_buddy(struct sched_entity *se)
>    */
>   static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   	struct sched_entity *se = &curr->se, *pse = &p->se;
>   	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>   	int next_buddy_marked = 0;
> @@ -8268,7 +8268,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>   	 * prevents us from potentially nominating it as a false LAST_BUDDY
>   	 * below.
>   	 */
> -	if (test_tsk_need_resched(curr))
> +	if (test_tsk_need_resched(rq->curr))
>   		return;
>   
>   	/* Idle tasks are by definition preempted by non-idle tasks. */
> @@ -9252,7 +9252,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>   	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
>   	 * DL and IRQ signals have been updated before updating CFS.
>   	 */
> -	curr_class = rq->curr->sched_class;
> +	curr_class = rq_selected(rq)->sched_class;
>   
>   	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
>   
> @@ -12640,7 +12640,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
>   	 * our priority decreased, or if we are not currently running on
>   	 * this runqueue and our priority is higher than the current's
>   	 */
> -	if (task_current(rq, p)) {
> +	if (task_current_selected(rq, p)) {
>   		if (p->prio > oldprio)
>   			resched_curr(rq);
>   	} else
> @@ -12743,7 +12743,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
>   		 * kick off the schedule if running, otherwise just see
>   		 * if we can still preempt the current task.
>   		 */
> -		if (task_current(rq, p))
> +		if (task_current_selected(rq, p))
>   			resched_curr(rq);
>   		else
>   			wakeup_preempt(rq, p, 0);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 9cdea3ea47da..2682cec45aaa 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -530,7 +530,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
>   
>   static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
>   {
> -	struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
> +	struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
>   	struct rq *rq = rq_of_rt_rq(rt_rq);
>   	struct sched_rt_entity *rt_se;
>   
> @@ -1000,7 +1000,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
>    */
>   static void update_curr_rt(struct rq *rq)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   	struct sched_rt_entity *rt_se = &curr->rt;
>   	s64 delta_exec;
>   
> @@ -1545,7 +1545,7 @@ static int find_lowest_rq(struct task_struct *task);
>   static int
>   select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>   {
> -	struct task_struct *curr;
> +	struct task_struct *curr, *selected;
>   	struct rq *rq;
>   	bool test;
>   
> @@ -1557,6 +1557,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>   
>   	rcu_read_lock();
>   	curr = READ_ONCE(rq->curr); /* unlocked access */
> +	selected = READ_ONCE(rq_selected(rq));
>   
>   	/*
>   	 * If the current task on @p's runqueue is an RT task, then
> @@ -1585,8 +1586,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>   	 * systems like big.LITTLE.
>   	 */
>   	test = curr &&
> -	       unlikely(rt_task(curr)) &&
> -	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> +	       unlikely(rt_task(selected)) &&
> +	       (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);
>   
>   	if (test || !rt_task_fits_capacity(p, cpu)) {
>   		int target = find_lowest_rq(p);
> @@ -1616,12 +1617,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>   
>   static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
>   {
> -	/*
> -	 * Current can't be migrated, useless to reschedule,
> -	 * let's hope p can move out.
> -	 */
>   	if (rq->curr->nr_cpus_allowed == 1 ||
> -	    !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
> +	    !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
>   		return;
>   
>   	/*
> @@ -1664,7 +1661,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>    */
>   static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
>   {
> -	if (p->prio < rq->curr->prio) {
> +	struct task_struct *curr = rq_selected(rq);
> +
> +	if (p->prio < curr->prio) {
>   		resched_curr(rq);
>   		return;
>   	}
> @@ -1682,7 +1681,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
>   	 * to move current somewhere else, making room for our non-migratable
>   	 * task.
>   	 */
> -	if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
> +	if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
>   		check_preempt_equal_prio(rq, p);
>   #endif
>   }
> @@ -1707,7 +1706,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
>   	 * utilization. We only care of the case where we start to schedule a
>   	 * rt task
>   	 */
> -	if (rq->curr->sched_class != &rt_sched_class)
> +	if (rq_selected(rq)->sched_class != &rt_sched_class)
>   		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
>   
>   	rt_queue_push_tasks(rq);
> @@ -1988,6 +1987,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)
>   
>   	BUG_ON(rq->cpu != task_cpu(p));
>   	BUG_ON(task_current(rq, p));
> +	BUG_ON(task_current_selected(rq, p));
>   	BUG_ON(p->nr_cpus_allowed <= 1);
>   
>   	BUG_ON(!task_on_rq_queued(p));
> @@ -2020,7 +2020,7 @@ static int push_rt_task(struct rq *rq, bool pull)
>   	 * higher priority than current. If that's the case
>   	 * just reschedule current.
>   	 */
> -	if (unlikely(next_task->prio < rq->curr->prio)) {
> +	if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
>   		resched_curr(rq);
>   		return 0;
>   	}
> @@ -2375,7 +2375,7 @@ static void pull_rt_task(struct rq *this_rq)
>   			 * p if it is lower in priority than the
>   			 * current task on the run queue
>   			 */
> -			if (p->prio < src_rq->curr->prio)
> +			if (p->prio < rq_selected(src_rq)->prio)
>   				goto skip;
>   
>   			if (is_migration_disabled(p)) {
> @@ -2419,9 +2419,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
>   	bool need_to_push = !task_on_cpu(rq, p) &&
>   			    !test_tsk_need_resched(rq->curr) &&
>   			    p->nr_cpus_allowed > 1 &&
> -			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
> +			    (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
>   			    (rq->curr->nr_cpus_allowed < 2 ||
> -			     rq->curr->prio <= p->prio);
> +			     rq_selected(rq)->prio <= p->prio);
>   
>   	if (need_to_push)
>   		push_rt_tasks(rq);
> @@ -2505,7 +2505,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
>   		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
>   			rt_queue_push_tasks(rq);
>   #endif /* CONFIG_SMP */
> -		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
> +		if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
>   			resched_curr(rq);
>   	}
>   }
> @@ -2520,7 +2520,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
>   	if (!task_on_rq_queued(p))
>   		return;
>   
> -	if (task_current(rq, p)) {
> +	if (task_current_selected(rq, p)) {
>   #ifdef CONFIG_SMP
>   		/*
>   		 * If our priority decreases while running, we
> @@ -2546,7 +2546,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
>   		 * greater than the current running task
>   		 * then reschedule.
>   		 */
> -		if (p->prio < rq->curr->prio)
> +		if (p->prio < rq_selected(rq)->prio)
>   			resched_curr(rq);
>   	}
>   }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3e0e4fc8734b..6ea1dfbe502a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -994,7 +994,10 @@ struct rq {
>   	 */
>   	unsigned int		nr_uninterruptible;
>   
> -	struct task_struct __rcu	*curr;
> +	struct task_struct __rcu	*curr;       /* Execution context */
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +	struct task_struct __rcu	*curr_selected; /* Scheduling context (policy) */
> +#endif
>   	struct task_struct	*idle;
>   	struct task_struct	*stop;
>   	unsigned long		next_balance;
> @@ -1189,6 +1192,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>   #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>   #define raw_rq()		raw_cpu_ptr(&runqueues)
>   
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +#define rq_selected(rq)		((rq)->curr_selected)
> +static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
> +{
> +	rcu_assign_pointer(rq->curr_selected, t);
> +}
> +#else
> +#define rq_selected(rq)		((rq)->curr)
> +static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
> +{
> +	/* Do nothing */
> +}
> +#endif
> +
>   struct sched_group;
>   #ifdef CONFIG_SCHED_CORE
>   static inline struct cpumask *sched_group_span(struct sched_group *sg);
> @@ -2112,11 +2129,25 @@ static inline u64 global_rt_runtime(void)
>   	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
>   }
>   
> +/*
> + * Is p the current execution context?
> + */
>   static inline int task_current(struct rq *rq, struct task_struct *p)
>   {
>   	return rq->curr == p;
>   }
>   
> +/*
> + * Is p the current scheduling context?
> + *
> + * Note that it might be the current execution context at the same time if
> + * rq->curr == rq_selected() == p.
> + */
> +static inline int task_current_selected(struct rq *rq, struct task_struct *p)
> +{
> +	return rq_selected(rq) == p;
> +}
> +
>   static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
>   {
>   #ifdef CONFIG_SMP
> @@ -2280,7 +2311,7 @@ struct sched_class {
>   
>   static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
>   {
> -	WARN_ON_ONCE(rq->curr != prev);
> +	WARN_ON_ONCE(rq_selected(rq) != prev);
>   	prev->sched_class->put_prev_task(rq, prev);
>   }
>
  
John Stultz Dec. 21, 2023, 6:23 p.m. UTC | #2
On Thu, Dec 21, 2023 at 2:44 AM Metin Kaya <metin.kaya@arm.com> wrote:
>
> Should we update Documentation/kernel-hacking/hacking.rst (line #348:
> :c:macro:`current`) or another appropriate doc to announce separation of
> scheduling & execution contexts?

So I like this suggestion, but the hacking.rst file feels a little too
general to be getting into the subtleties of scheduler internals.
The split between the scheduling context and the execution context
really is just a scheduler detail, as everything else will still deal
only with the execution context, as before. So it's really only for
scheduler accounting that we utilize the "rq_selected" scheduling
context.

Maybe something under Documentation/scheduler/ would be more
appropriate? Though the documents there are all pretty focused on
particular sched classes, and not much on the core logic that is most
affected by this conceptual change. I guess adding a sched-core.txt
document might be useful for this sort of detail (though a bit
daunting to write from scratch).

thanks
-john
  
Valentin Schneider Jan. 3, 2024, 2:49 p.m. UTC | #3
On 19/12/23 16:18, John Stultz wrote:
> NOTE: Peter previously mentioned he didn't like the name
> "rq_selected()", but I've not come up with a better alternative.
> I'm very open to other name proposals.
>

I got used to the naming relatively quickly. It "should" be rq_pick()
(i.e. what did the last pick_next_task() return for that rq), but that
naming is unfortunately ambiguous (is it doing a pick itself?), so I think
"selected" works.
  
John Stultz Jan. 10, 2024, 10:24 p.m. UTC | #4
On Wed, Jan 3, 2024 at 6:49 AM Valentin Schneider <vschneid@redhat.com> wrote:
> On 19/12/23 16:18, John Stultz wrote:
> > NOTE: Peter previously mentioned he didn't like the name
> > "rq_selected()", but I've not come up with a better alternative.
> > I'm very open to other name proposals.
> >
>
> I got used to the naming relatively quickly. It "should" be rq_pick()
> (i.e. what did the last pick_next_task() return for that rq), but that
> naming is unfortunately ambiguous (is it doing a pick itself?), so I think
> "selected" works.

Thanks for that feedback! I guess rq_picked() might be an alternative
to your suggestion of rq_pick(), but selected still sounds better to
me.

thanks
-john
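[For readers following the thread: below is a minimal userspace sketch, not kernel code, of what the split buys us. `struct task_struct`, `struct rq`, and `mock_schedule()` are simplified stand-ins invented for illustration, and the kernel's `rcu_assign_pointer()` is replaced by a plain store. It models the CONFIG_SCHED_PROXY_EXEC case, where the task the scheduler picked (scheduling context) and the task the CPU actually runs (execution context) may differ.]

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; not kernel code. */
struct task_struct {
	int prio;	/* scheduling state (policy/priority) */
};

struct rq {
	struct task_struct *curr;          /* execution context */
	struct task_struct *curr_selected; /* scheduling context */
};

/* CONFIG_SCHED_PROXY_EXEC flavour of the accessors from the patch. */
#define rq_selected(rq)	((rq)->curr_selected)

static void rq_set_selected(struct rq *rq, struct task_struct *t)
{
	/* the kernel uses rcu_assign_pointer(); a plain store suffices here */
	rq->curr_selected = t;
}

/*
 * Hypothetical stand-in for the tail of __schedule(): the scheduler's
 * pick supplies the scheduling context, while a (possibly different)
 * task is installed as the execution context.
 */
static void mock_schedule(struct rq *rq, struct task_struct *picked,
			  struct task_struct *to_run)
{
	rq_set_selected(rq, picked);	/* accounting follows this task */
	rq->curr = to_run;		/* the CPU actually runs this one */
}
```

[In the real patch, `__schedule()` does the equivalent via `rq_set_selected(rq, next)` right after `pick_next_task()`, while `rq->curr` continues to track what actually executes.]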
  

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e06558fb08aa..0ce34f5c0e0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -822,7 +822,7 @@  static enum hrtimer_restart hrtick(struct hrtimer *timer)
 
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
-	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+	rq_selected(rq)->sched_class->task_tick(rq, rq_selected(rq), 1);
 	rq_unlock(rq, &rf);
 
 	return HRTIMER_NORESTART;
@@ -2242,16 +2242,18 @@  static inline void check_class_changed(struct rq *rq, struct task_struct *p,
 
 void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->sched_class == rq->curr->sched_class)
-		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
-	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
+	struct task_struct *selected = rq_selected(rq);
+
+	if (p->sched_class == selected->sched_class)
+		selected->sched_class->wakeup_preempt(rq, p, flags);
+	else if (sched_class_above(p->sched_class, selected->sched_class))
 		resched_curr(rq);
 
 	/*
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+	if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
 		rq_clock_skip_update(rq);
 }
 
@@ -2780,7 +2782,7 @@  __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
 		lockdep_assert_held(&p->pi_lock);
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 
 	if (queued) {
 		/*
@@ -5600,7 +5602,7 @@  unsigned long long task_sched_runtime(struct task_struct *p)
 	 * project cycles that may never be accounted to this
 	 * thread, breaking clock_gettime().
 	 */
-	if (task_current(rq, p) && task_on_rq_queued(p)) {
+	if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
 		prefetch_curr_exec_start(p);
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
@@ -5668,7 +5670,8 @@  void scheduler_tick(void)
 {
 	int cpu = smp_processor_id();
 	struct rq *rq = cpu_rq(cpu);
-	struct task_struct *curr = rq->curr;
+	/* accounting goes to the selected task */
+	struct task_struct *selected;
 	struct rq_flags rf;
 	unsigned long thermal_pressure;
 	u64 resched_latency;
@@ -5679,16 +5682,17 @@  void scheduler_tick(void)
 	sched_clock_tick();
 
 	rq_lock(rq, &rf);
+	selected = rq_selected(rq);
 
 	update_rq_clock(rq);
 	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
 	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
-	curr->sched_class->task_tick(rq, curr, 0);
+	selected->sched_class->task_tick(rq, selected, 0);
 	if (sched_feat(LATENCY_WARN))
 		resched_latency = cpu_resched_latency(rq);
 	calc_global_load_tick(rq);
 	sched_core_tick(rq);
-	task_tick_mm_cid(rq, curr);
+	task_tick_mm_cid(rq, selected);
 
 	rq_unlock(rq, &rf);
 
@@ -5697,8 +5701,8 @@  void scheduler_tick(void)
 
 	perf_event_task_tick();
 
-	if (curr->flags & PF_WQ_WORKER)
-		wq_worker_tick(curr);
+	if (selected->flags & PF_WQ_WORKER)
+		wq_worker_tick(selected);
 
 #ifdef CONFIG_SMP
 	rq->idle_balance = idle_cpu(cpu);
@@ -5763,6 +5767,12 @@  static void sched_tick_remote(struct work_struct *work)
 		struct task_struct *curr = rq->curr;
 
 		if (cpu_online(cpu)) {
+			/*
+			 * Since this is a remote tick for full dynticks mode,
+			 * we are always sure that there is no proxy (only a
+			 * single task is running).
+			 */
+			SCHED_WARN_ON(rq->curr != rq_selected(rq));
 			update_rq_clock(rq);
 
 			if (!is_idle_task(curr)) {
@@ -6685,6 +6695,7 @@  static void __sched notrace __schedule(unsigned int sched_mode)
 	}
 
 	next = pick_next_task(rq, prev, &rf);
+	rq_set_selected(rq, next);
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
@@ -7185,7 +7196,7 @@  void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 
 	prev_class = p->sched_class;
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flag);
 	if (running)
@@ -7275,7 +7286,7 @@  void set_user_nice(struct task_struct *p, long nice)
 	}
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	if (running)
@@ -7868,7 +7879,7 @@  static int __sched_setscheduler(struct task_struct *p,
 	}
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flags);
 	if (running)
@@ -9295,6 +9306,7 @@  void __init init_idle(struct task_struct *idle, int cpu)
 	rcu_read_unlock();
 
 	rq->idle = idle;
+	rq_set_selected(rq, idle);
 	rcu_assign_pointer(rq->curr, idle);
 	idle->on_rq = TASK_ON_RQ_QUEUED;
 #ifdef CONFIG_SMP
@@ -9384,7 +9396,7 @@  void sched_setnuma(struct task_struct *p, int nid)
 
 	rq = task_rq_lock(p, &rf);
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -10489,7 +10501,7 @@  void sched_move_task(struct task_struct *tsk)
 
 	update_rq_clock(rq);
 
-	running = task_current(rq, tsk);
+	running = task_current_selected(rq, tsk);
 	queued = task_on_rq_queued(tsk);
 
 	if (queued)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6140f1f51da1..9cf20f4ac5f9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1150,7 +1150,7 @@  static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 #endif
 
 	enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
-	if (dl_task(rq->curr))
+	if (dl_task(rq_selected(rq)))
 		wakeup_preempt_dl(rq, p, 0);
 	else
 		resched_curr(rq);
@@ -1273,7 +1273,7 @@  static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
  */
 static void update_curr_dl(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 	struct sched_dl_entity *dl_se = &curr->dl;
 	s64 delta_exec, scaled_delta_exec;
 	int cpu = cpu_of(rq);
@@ -1784,7 +1784,7 @@  static int find_later_rq(struct task_struct *task);
 static int
 select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 {
-	struct task_struct *curr;
+	struct task_struct *curr, *selected;
 	bool select_rq;
 	struct rq *rq;
 
@@ -1795,6 +1795,7 @@  select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 
 	rcu_read_lock();
 	curr = READ_ONCE(rq->curr); /* unlocked access */
+	selected = READ_ONCE(rq_selected(rq));
 
 	/*
 	 * If we are dealing with a -deadline task, we must
@@ -1805,9 +1806,9 @@  select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 	 * other hand, if it has a shorter deadline, we
 	 * try to make it stay here, it might be important.
 	 */
-	select_rq = unlikely(dl_task(curr)) &&
+	select_rq = unlikely(dl_task(selected)) &&
 		    (curr->nr_cpus_allowed < 2 ||
-		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
+		     !dl_entity_preempt(&p->dl, &selected->dl)) &&
 		    p->nr_cpus_allowed > 1;
 
 	/*
@@ -1870,7 +1871,7 @@  static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * let's hope p can move out.
 	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
+	    !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
 		return;
 
 	/*
@@ -1909,7 +1910,7 @@  static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
-	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
+	if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
 		resched_curr(rq);
 		return;
 	}
@@ -1919,7 +1920,7 @@  static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
 	 * In the unlikely case current and p have the same deadline
 	 * let us try to decide what's the best thing to do...
 	 */
-	if ((p->dl.deadline == rq->curr->dl.deadline) &&
+	if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
 	    !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_dl(rq, p);
 #endif /* CONFIG_SMP */
@@ -1954,7 +1955,7 @@  static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 	if (hrtick_enabled_dl(rq))
 		start_hrtick_dl(rq, p);
 
-	if (rq->curr->sched_class != &dl_sched_class)
+	if (rq_selected(rq)->sched_class != &dl_sched_class)
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
 	deadline_queue_push_tasks(rq);
@@ -2268,8 +2269,8 @@  static int push_dl_task(struct rq *rq)
 	 * can move away, it makes sense to just reschedule
 	 * without going further in pushing next_task.
 	 */
-	if (dl_task(rq->curr) &&
-	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+	if (dl_task(rq_selected(rq)) &&
+	    dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
 	    rq->curr->nr_cpus_allowed > 1) {
 		resched_curr(rq);
 		return 0;
@@ -2394,7 +2395,7 @@  static void pull_dl_task(struct rq *this_rq)
 			 * deadline than the current task of its runqueue.
 			 */
 			if (dl_time_before(p->dl.deadline,
-					   src_rq->curr->dl.deadline))
+					   rq_selected(src_rq)->dl.deadline))
 				goto skip;
 
 			if (is_migration_disabled(p)) {
@@ -2435,9 +2436,9 @@  static void task_woken_dl(struct rq *rq, struct task_struct *p)
 	if (!task_on_cpu(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    p->nr_cpus_allowed > 1 &&
-	    dl_task(rq->curr) &&
+	    dl_task(rq_selected(rq)) &&
 	    (rq->curr->nr_cpus_allowed < 2 ||
-	     !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
+	     !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
 		push_dl_tasks(rq);
 	}
 }
@@ -2612,12 +2613,12 @@  static void switched_to_dl(struct rq *rq, struct task_struct *p)
 		return;
 	}
 
-	if (rq->curr != p) {
+	if (rq_selected(rq) != p) {
 #ifdef CONFIG_SMP
 		if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
 			deadline_queue_push_tasks(rq);
 #endif
-		if (dl_task(rq->curr))
+		if (dl_task(rq_selected(rq)))
 			wakeup_preempt_dl(rq, p, 0);
 		else
 			resched_curr(rq);
@@ -2646,7 +2647,7 @@  static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 	if (!rq->dl.overloaded)
 		deadline_queue_pull_task(rq);
 
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 		/*
 		 * If we now have a earlier deadline task than p,
 		 * then reschedule, provided p is still on this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1251fd01a555..07216ea3ed53 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1157,7 +1157,7 @@  static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
  */
 s64 update_curr_common(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 	s64 delta_exec;
 
 	delta_exec = update_curr_se(rq, &curr->se);
@@ -1203,7 +1203,7 @@  static void update_curr(struct cfs_rq *cfs_rq)
 
 static void update_curr_fair(struct rq *rq)
 {
-	update_curr(cfs_rq_of(&rq->curr->se));
+	update_curr(cfs_rq_of(&rq_selected(rq)->se));
 }
 
 static inline void
@@ -6611,7 +6611,7 @@  static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 		s64 delta = slice - ran;
 
 		if (delta < 0) {
-			if (task_current(rq, p))
+			if (task_current_selected(rq, p))
 				resched_curr(rq);
 			return;
 		}
@@ -6626,7 +6626,7 @@  static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
  */
 static void hrtick_update(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 
 	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
 		return;
@@ -8235,7 +8235,7 @@  static void set_next_buddy(struct sched_entity *se)
  */
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int next_buddy_marked = 0;
@@ -8268,7 +8268,7 @@  static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	 * prevents us from potentially nominating it as a false LAST_BUDDY
 	 * below.
 	 */
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched(rq->curr))
 		return;
 
 	/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -9252,7 +9252,7 @@  static bool __update_blocked_others(struct rq *rq, bool *done)
 	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
 	 * DL and IRQ signals have been updated before updating CFS.
 	 */
-	curr_class = rq->curr->sched_class;
+	curr_class = rq_selected(rq)->sched_class;
 
 	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
 
@@ -12640,7 +12640,7 @@  prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 	 * our priority decreased, or if we are not currently running on
 	 * this runqueue and our priority is higher than the current's
 	 */
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 		if (p->prio > oldprio)
 			resched_curr(rq);
 	} else
@@ -12743,7 +12743,7 @@  static void switched_to_fair(struct rq *rq, struct task_struct *p)
 		 * kick off the schedule if running, otherwise just see
 		 * if we can still preempt the current task.
 		 */
-		if (task_current(rq, p))
+		if (task_current_selected(rq, p))
 			resched_curr(rq);
 		else
 			wakeup_preempt(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9cdea3ea47da..2682cec45aaa 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -530,7 +530,7 @@  static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 
 static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
 {
-	struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
+	struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
 	struct rq *rq = rq_of_rt_rq(rt_rq);
 	struct sched_rt_entity *rt_se;
 
@@ -1000,7 +1000,7 @@  static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
  */
 static void update_curr_rt(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 	struct sched_rt_entity *rt_se = &curr->rt;
 	s64 delta_exec;
 
@@ -1545,7 +1545,7 @@  static int find_lowest_rq(struct task_struct *task);
 static int
 select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 {
-	struct task_struct *curr;
+	struct task_struct *curr, *selected;
 	struct rq *rq;
 	bool test;
 
@@ -1557,6 +1557,7 @@  select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 
 	rcu_read_lock();
 	curr = READ_ONCE(rq->curr); /* unlocked access */
+	selected = READ_ONCE(rq_selected(rq));
 
 	/*
 	 * If the current task on @p's runqueue is an RT task, then
@@ -1585,8 +1586,8 @@  select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 	 * systems like big.LITTLE.
 	 */
 	test = curr &&
-	       unlikely(rt_task(curr)) &&
-	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+	       unlikely(rt_task(selected)) &&
+	       (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);
 
 	if (test || !rt_task_fits_capacity(p, cpu)) {
 		int target = find_lowest_rq(p);
@@ -1616,12 +1617,8 @@  select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 
 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 {
-	/*
-	 * Current can't be migrated, useless to reschedule,
-	 * let's hope p can move out.
-	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+	    !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
 		return;
 
 	/*
@@ -1664,7 +1661,9 @@  static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
  */
 static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->prio < rq->curr->prio) {
+	struct task_struct *curr = rq_selected(rq);
+
+	if (p->prio < curr->prio) {
 		resched_curr(rq);
 		return;
 	}
@@ -1682,7 +1681,7 @@  static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
 	 * to move current somewhere else, making room for our non-migratable
 	 * task.
 	 */
-	if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+	if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_prio(rq, p);
 #endif
 }
@@ -1707,7 +1706,7 @@  static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
 	 * utilization. We only care of the case where we start to schedule a
 	 * rt task
 	 */
-	if (rq->curr->sched_class != &rt_sched_class)
+	if (rq_selected(rq)->sched_class != &rt_sched_class)
 		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
 	rt_queue_push_tasks(rq);
@@ -1988,6 +1987,7 @@  static struct task_struct *pick_next_pushable_task(struct rq *rq)
 
 	BUG_ON(rq->cpu != task_cpu(p));
 	BUG_ON(task_current(rq, p));
+	BUG_ON(task_current_selected(rq, p));
 	BUG_ON(p->nr_cpus_allowed <= 1);
 
 	BUG_ON(!task_on_rq_queued(p));
@@ -2020,7 +2020,7 @@  static int push_rt_task(struct rq *rq, bool pull)
 	 * higher priority than current. If that's the case
 	 * just reschedule current.
 	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
+	if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
 		resched_curr(rq);
 		return 0;
 	}
@@ -2375,7 +2375,7 @@  static void pull_rt_task(struct rq *this_rq)
 			 * p if it is lower in priority than the
 			 * current task on the run queue
 			 */
-			if (p->prio < src_rq->curr->prio)
+			if (p->prio < rq_selected(src_rq)->prio)
 				goto skip;
 
 			if (is_migration_disabled(p)) {
@@ -2419,9 +2419,9 @@  static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	bool need_to_push = !task_on_cpu(rq, p) &&
 			    !test_tsk_need_resched(rq->curr) &&
 			    p->nr_cpus_allowed > 1 &&
-			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
+			    (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
 			    (rq->curr->nr_cpus_allowed < 2 ||
-			     rq->curr->prio <= p->prio);
+			     rq_selected(rq)->prio <= p->prio);
 
 	if (need_to_push)
 		push_rt_tasks(rq);
@@ -2505,7 +2505,7 @@  static void switched_to_rt(struct rq *rq, struct task_struct *p)
 		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
 			rt_queue_push_tasks(rq);
 #endif /* CONFIG_SMP */
-		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+		if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
 			resched_curr(rq);
 	}
 }
@@ -2520,7 +2520,7 @@  prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 	if (!task_on_rq_queued(p))
 		return;
 
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 #ifdef CONFIG_SMP
 		/*
 		 * If our priority decreases while running, we
@@ -2546,7 +2546,7 @@  prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 		 * greater than the current running task
 		 * then reschedule.
 		 */
-		if (p->prio < rq->curr->prio)
+		if (p->prio < rq_selected(rq)->prio)
 			resched_curr(rq);
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e0e4fc8734b..6ea1dfbe502a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -994,7 +994,10 @@  struct rq {
 	 */
 	unsigned int		nr_uninterruptible;
 
-	struct task_struct __rcu	*curr;
+	struct task_struct __rcu	*curr;       /* Execution context */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	struct task_struct __rcu	*curr_selected; /* Scheduling context (policy) */
+#endif
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
@@ -1189,6 +1192,20 @@  DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+#define rq_selected(rq)		((rq)->curr_selected)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+	rcu_assign_pointer(rq->curr_selected, t);
+}
+#else
+#define rq_selected(rq)		((rq)->curr)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+	/* Do nothing */
+}
+#endif
+
 struct sched_group;
 #ifdef CONFIG_SCHED_CORE
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2112,11 +2129,25 @@  static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+/*
+ * Is p the current execution context?
+ */
 static inline int task_current(struct rq *rq, struct task_struct *p)
 {
 	return rq->curr == p;
 }
 
+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq_selected() == p.
+ */
+static inline int task_current_selected(struct rq *rq, struct task_struct *p)
+{
+	return rq_selected(rq) == p;
+}
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
@@ -2280,7 +2311,7 @@  struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+	WARN_ON_ONCE(rq_selected(rq) != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }