From patchwork Mon Mar 20 23:37:09 2023
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 72526
Date: Mon, 20 Mar 2023 23:37:09 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-2-jstultz@google.com>
Subject: [PATCH v2 01/12] locking/ww_mutex: Remove wakeups from under mutex::wait_lock
From: John Stultz
To: LKML
From: Peter Zijlstra

In preparation to nest mutex::wait_lock under rq::lock we need to remove
wakeups from under it.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Connor O'Brien
Signed-off-by: John Stultz
---
v2:
* Move wake_q_init() as suggested by Waiman Long
---
 include/linux/ww_mutex.h  |  3 +++
 kernel/locking/mutex.c    |  8 ++++++++
 kernel/locking/ww_mutex.h | 10 ++++++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h
index bb763085479a..9335b2202017 100644
--- a/include/linux/ww_mutex.h
+++ b/include/linux/ww_mutex.h
@@ -19,6 +19,7 @@
 #include
 #include
+#include

 #if defined(CONFIG_DEBUG_MUTEXES) || \
     (defined(CONFIG_PREEMPT_RT) && defined(CONFIG_DEBUG_RT_MUTEXES))
@@ -58,6 +59,7 @@ struct ww_acquire_ctx {
	unsigned int acquired;
	unsigned short wounded;
	unsigned short is_wait_die;
+	struct wake_q_head wake_q;
 #ifdef DEBUG_WW_MUTEXES
	unsigned int done_acquire;
	struct ww_class *ww_class;
@@ -137,6 +139,7 @@ static inline void ww_acquire_init(struct ww_acquire_ctx *ctx,
	ctx->acquired = 0;
	ctx->wounded = false;
	ctx->is_wait_die = ww_class->is_wait_die;
+	wake_q_init(&ctx->wake_q);
 #ifdef DEBUG_WW_MUTEXES
	ctx->ww_class = ww_class;
	ctx->done_acquire = 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index d973fe6041bf..1582756914df 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -676,6 +676,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
		}

		raw_spin_unlock(&lock->wait_lock);
+		if (ww_ctx)
+			ww_ctx_wake(ww_ctx);
		schedule_preempt_disabled();

		first = __mutex_waiter_is_first(lock, &waiter);
@@ -725,6 +727,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
		ww_mutex_lock_acquired(ww, ww_ctx);

	raw_spin_unlock(&lock->wait_lock);
+	if (ww_ctx)
+		ww_ctx_wake(ww_ctx);
	preempt_enable();
	return 0;

@@ -736,6 +740,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
	raw_spin_unlock(&lock->wait_lock);
	debug_mutex_free_waiter(&waiter);
	mutex_release(&lock->dep_map, ip);
+	if (ww_ctx)
+		ww_ctx_wake(ww_ctx);
	preempt_enable();
	return ret;
 }
@@ -946,9 +952,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
	if (owner & MUTEX_FLAG_HANDOFF)
		__mutex_handoff(lock, next);

+	preempt_disable();
	raw_spin_unlock(&lock->wait_lock);

	wake_up_q(&wake_q);
+	preempt_enable();
 }

 #ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 56f139201f24..e49ea5336473 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -161,6 +161,12 @@ static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)

 #endif /* WW_RT */

+void ww_ctx_wake(struct ww_acquire_ctx *ww_ctx)
+{
+	wake_up_q(&ww_ctx->wake_q);
+	wake_q_init(&ww_ctx->wake_q);
+}
+
 /*
  * Wait-Die:
  *   The newer transactions are killed when:
@@ -284,7 +290,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 #ifndef WW_RT
		debug_mutex_wake_waiter(lock, waiter);
 #endif
-		wake_up_process(waiter->task);
+		wake_q_add(&ww_ctx->wake_q, waiter->task);
	}

	return true;
@@ -331,7 +337,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
	 * wakeup pending to re-read the wounded state.
	 */
	if (owner != current)
-		wake_up_process(owner);
+		wake_q_add(&ww_ctx->wake_q, owner);

	return true;
 }
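Not part of the patch: a minimal sketch of the wake_q deferral pattern this
change relies on. Wakeups are only queued while the raw spinlock is held and
are issued after the lock is dropped, so wake_up_process() never runs under
the lock. The example_waiter structure and function name are illustrative
only, not kernel API.

#include <linux/list.h>
#include <linux/sched/wake_q.h>
#include <linux/spinlock.h>

/* Illustrative waiter bookkeeping, not kernel API. */
struct example_waiter {
	struct list_head	entry;
	struct task_struct	*task;
};

static void example_release_all(raw_spinlock_t *lock, struct list_head *waiters)
{
	DEFINE_WAKE_Q(wake_q);
	struct example_waiter *w;

	raw_spin_lock(lock);
	/* Only *queue* the wakeups while the spinlock is held... */
	list_for_each_entry(w, waiters, entry)
		wake_q_add(&wake_q, w->task);
	raw_spin_unlock(lock);

	/* ...and issue them after the lock has been dropped. */
	wake_up_q(&wake_q);
}

This is the same shape the patch gives ww_ctx_wake(): wake_q_add() under
wait_lock, wake_up_q() after wait_lock is released.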
From patchwork Mon Mar 20 23:37:10 2023
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 72536
Date: Mon, 20 Mar 2023 23:37:10 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-3-jstultz@google.com>
Subject: [PATCH v2 02/12] locking/mutex: Rework task_struct::blocked_on
From: John Stultz
To: LKML
From: Peter Zijlstra

Track the blocked-on relation for mutexes; this allows following this
relation at schedule time.

   task
     | blocked-on
     v
   mutex
     | owner
     v
   task

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel)
[minor changes while rebasing]
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Connor O'Brien
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
---
 include/linux/sched.h        | 6 ++----
 kernel/fork.c                | 4 ++--
 kernel/locking/mutex-debug.c | 9 +++++----
 kernel/locking/mutex.c       | 7 +++++++
 4 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..9924d1926bc3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1139,10 +1139,8 @@ struct task_struct {
	struct rt_mutex_waiter *pi_blocked_on;
 #endif

-#ifdef CONFIG_DEBUG_MUTEXES
-	/* Mutex deadlock detection: */
-	struct mutex_waiter *blocked_on;
-#endif
+	struct task_struct *blocked_proxy;	/* task that is boosting us */
+	struct mutex *blocked_on;		/* lock we're blocked on */

 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
	int non_block_count;
diff --git a/kernel/fork.c b/kernel/fork.c
index d8cda4c6de6c..ad27fa09fe70 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2223,9 +2223,9 @@ static __latent_entropy struct task_struct *copy_process(
	lockdep_init_task(p);
 #endif

-#ifdef CONFIG_DEBUG_MUTEXES
+	p->blocked_proxy = NULL; /* nobody is boosting us yet */
	p->blocked_on = NULL; /* not blocked yet */
-#endif
+
 #ifdef CONFIG_BCACHE
	p->sequential_io	= 0;
	p->sequential_io_avg	= 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index bc8abb8549d2..7228909c3e62 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -52,17 +52,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 {
	lockdep_assert_held(&lock->wait_lock);

-	/* Mark the current thread as blocked on the lock: */
-	task->blocked_on = waiter;
+	/* Current thread can't be already blocked (since it's executing!) */
+	DEBUG_LOCKS_WARN_ON(task->blocked_on);
 }

 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
			 struct task_struct *task)
 {
+	struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+
	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
	DEBUG_LOCKS_WARN_ON(waiter->task != task);
-	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
-	task->blocked_on = NULL;
+	DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);

	INIT_LIST_HEAD(&waiter->list);
	waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 1582756914df..f5296aa82255 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -645,6 +645,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
		goto err_early_kill;
	}

+	current->blocked_on = lock;
	set_current_state(state);
	trace_contention_begin(lock, LCB_F_MUTEX);
	for (;;) {
@@ -682,6 +683,10 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
		first = __mutex_waiter_is_first(lock, &waiter);

+		/*
+		 * Gets reset by ttwu_runnable().
+		 */
+		current->blocked_on = lock;
		set_current_state(state);
		/*
		 * Here we order against unlock; we must either see it change
@@ -719,6 +724,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
	debug_mutex_free_waiter(&waiter);

 skip_wait:
+	current->blocked_on = NULL;
	/* got the lock - cleanup and rejoice! */
	lock_acquired(&lock->dep_map, ip);
	trace_contention_end(lock, 0);
@@ -733,6 +739,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
	return 0;

 err:
+	current->blocked_on = NULL;
	__set_current_state(TASK_RUNNING);
	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
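Not part of the patch: a rough sketch of how the blocked-on relation in the
diagram above could be followed at schedule time, which is what the series is
building toward. It uses the mutex_owner() helper that a later patch in this
series exposes; the function name is hypothetical, and locking, cycle
detection and race handling are omitted for brevity.

/*
 * Walk task -> blocked_on mutex -> owner -> ... until we reach a task
 * that is not itself blocked on a mutex. Sketch only: the real code
 * must hold the relevant locks and cope with owners going away.
 */
static struct task_struct *example_find_proxy(struct task_struct *p)
{
	while (p && p->blocked_on) {
		struct task_struct *owner = mutex_owner(p->blocked_on);

		if (!owner)
			break;	/* lock is in the middle of being released */
		p = owner;
	}
	return p;
}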
From patchwork Mon Mar 20 23:37:11 2023
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 72532
Date: Mon, 20 Mar 2023 23:37:11 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-4-jstultz@google.com>
Subject: [PATCH v2 03/12] locking/mutex: Add task_struct::blocked_lock to serialize changes to the blocked_on state
From: John Stultz
To: LKML

From: Peter Zijlstra

This patch was split out from the later "sched: Add proxy execution" patch.

Adds blocked_lock to the task_struct so we can safely keep track of which
tasks are blocked on us. This will be used for tracking blocked-task/mutex
chains with the proxy-execution patch in a similar fashion to how priority
inheritance is done with rt_mutexes.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel)
[rebased, added comments and changelog]
Signed-off-by: Juri Lelli
[Fixed rebase conflicts]
[squashed sched: Ensure blocked_on is always guarded by blocked_lock]
Signed-off-by: Valentin Schneider
[fix rebase conflicts, various fixes & tweaks commented inline]
[squashed sched: Use rq->curr vs rq->proxy checks]
Signed-off-by: Connor O'Brien
[jstultz: Split out from bigger patch]
Signed-off-by: John Stultz
---
v2:
* Split out into its own patch

TODO: Still need to clarify some of the locking changes here
---
 include/linux/sched.h  |  1 +
 init/init_task.c       |  1 +
 kernel/fork.c          |  1 +
 kernel/locking/mutex.c | 27 +++++++++++++++++++++++----
 4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9924d1926bc3..031615b5dc2a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1141,6 +1141,7 @@ struct task_struct {
	struct task_struct *blocked_proxy;	/* task that is boosting us */
	struct mutex *blocked_on;		/* lock we're blocked on */
+	raw_spinlock_t blocked_lock;

 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
	int non_block_count;
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe6b..189ce67e9704 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -130,6 +130,7 @@ struct task_struct init_task
	.journal_info	= NULL,
	INIT_CPU_TIMERS(init_task)
	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
	.timer_slack_ns = 50000, /* 50 usec default slack */
	.thread_pid	= &init_struct_pid,
	.thread_group	= LIST_HEAD_INIT(init_task.thread_group),
diff --git a/kernel/fork.c b/kernel/fork.c
index ad27fa09fe70..95410333332f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2121,6 +2121,7 @@ static __latent_entropy struct task_struct *copy_process(
	ftrace_graph_init_task(p);

	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);

	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index f5296aa82255..2f31ebb08b4a 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -615,6 +615,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
	}

	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock(&current->blocked_lock);
	/*
	 * After waiting to acquire the wait_lock, try again.
	 */
@@ -676,6 +677,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
			goto err;
		}

+		raw_spin_unlock(&current->blocked_lock);
		raw_spin_unlock(&lock->wait_lock);
		if (ww_ctx)
			ww_ctx_wake(ww_ctx);
@@ -683,6 +685,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
		first = __mutex_waiter_is_first(lock, &waiter);

+		raw_spin_lock(&lock->wait_lock);
+		raw_spin_lock(&current->blocked_lock);
		/*
		 * Gets reset by ttwu_runnable().
		 */
@@ -697,15 +701,28 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
			break;

		if (first) {
+			bool acquired;
+
+			/*
+			 * XXX connoro: mutex_optimistic_spin() can schedule, so
+			 * we need to release these locks before calling it.
+			 * This needs refactoring though b/c currently we take
+			 * the locks earlier than necessary when proxy exec is
+			 * disabled and release them unnecessarily when it's
+			 * enabled. At a minimum, need to verify that releasing
+			 * blocked_lock here doesn't create any races.
+			 */
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock(&lock->wait_lock);
			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+			raw_spin_lock(&lock->wait_lock);
+			raw_spin_lock(&current->blocked_lock);
+			if (acquired)
				break;
			trace_contention_begin(lock, LCB_F_MUTEX);
		}
-
-		raw_spin_lock(&lock->wait_lock);
	}
-	raw_spin_lock(&lock->wait_lock);
 acquired:
	__set_current_state(TASK_RUNNING);
@@ -732,6 +749,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
	if (ww_ctx)
		ww_mutex_lock_acquired(ww, ww_ctx);

+	raw_spin_unlock(&current->blocked_lock);
	raw_spin_unlock(&lock->wait_lock);
	if (ww_ctx)
		ww_ctx_wake(ww_ctx);
@@ -744,6 +762,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
	trace_contention_end(lock, ret);
+	raw_spin_unlock(&current->blocked_lock);
	raw_spin_unlock(&lock->wait_lock);
	debug_mutex_free_waiter(&waiter);
	mutex_release(&lock->dep_map, ip);
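Not part of the patch: the lock nesting this change establishes, written out
as a sketch. blocked_lock nests inside the mutex's wait_lock, so any update
of p->blocked_on takes both, in that order. The function name is illustrative
only.

/* Assumed ordering: lock->wait_lock first, then p->blocked_lock. */
static void example_mark_blocked(struct mutex *lock, struct task_struct *p)
{
	raw_spin_lock(&lock->wait_lock);
	raw_spin_lock(&p->blocked_lock);

	p->blocked_on = lock;		/* serialized by blocked_lock */

	raw_spin_unlock(&p->blocked_lock);
	raw_spin_unlock(&lock->wait_lock);
}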
From patchwork Mon Mar 20 23:37:12 2023
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 72533
Date: Mon, 20 Mar 2023 23:37:12 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-5-jstultz@google.com>
Subject: [PATCH v2 04/12] locking/mutex: Add p->blocked_on wrappers
From: John Stultz
To: LKML
From: Valentin Schneider

This lets us assert p->blocked_lock is held whenever we access
p->blocked_on.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: kernel-team@android.com
Signed-off-by: Valentin Schneider
[fix conflicts, call in more places]
Signed-off-by: Connor O'Brien
[jstultz: tweaked commit subject, added get_task_blocked_on() as well]
Signed-off-by: John Stultz
---
v2:
* Added get_task_blocked_on() accessor
---
 include/linux/sched.h        | 14 ++++++++++++++
 kernel/locking/mutex-debug.c |  4 ++--
 kernel/locking/mutex.c       |  8 ++++----
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 031615b5dc2a..a1606d0bd3fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2223,6 +2223,20 @@ static inline int rwlock_needbreak(rwlock_t *lock)
 #endif
 }

+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	p->blocked_on = m;
+}
+
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	return p->blocked_on;
+}
+
 static __always_inline bool need_resched(void)
 {
	return unlikely(tif_need_resched());
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 7228909c3e62..e3cd64ae6ea4 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -53,13 +53,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
	lockdep_assert_held(&lock->wait_lock);

	/* Current thread can't be already blocked (since it's executing!) */
-	DEBUG_LOCKS_WARN_ON(task->blocked_on);
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }

 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
			 struct task_struct *task)
 {
-	struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+	struct mutex *blocked_on = get_task_blocked_on(task); /*XXX jstultz: dropped READ_ONCE here, revisit.*/

	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
	DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2f31ebb08b4a..d322f7c1c8fa 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -646,7 +646,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
		goto err_early_kill;
	}

-	current->blocked_on = lock;
+	set_task_blocked_on(current, lock);
	set_current_state(state);
	trace_contention_begin(lock, LCB_F_MUTEX);
	for (;;) {
@@ -690,7 +690,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
		/*
		 * Gets reset by ttwu_runnable().
		 */
-		current->blocked_on = lock;
+		set_task_blocked_on(current, lock);
		set_current_state(state);
		/*
		 * Here we order against unlock; we must either see it change
@@ -741,7 +741,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
	debug_mutex_free_waiter(&waiter);

 skip_wait:
-	current->blocked_on = NULL;
+	set_task_blocked_on(current, NULL);
	/* got the lock - cleanup and rejoice! */
	lock_acquired(&lock->dep_map, ip);
	trace_contention_end(lock, 0);
@@ -757,7 +757,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
	return 0;

 err:
-	current->blocked_on = NULL;
+	set_task_blocked_on(current, NULL);
	__set_current_state(TASK_RUNNING);
	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
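Not part of the patch: what the wrappers buy us, sketched. Because both
accessors contain lockdep_assert_held(&p->blocked_lock), any caller that
touches blocked_on without the lock is flagged when lockdep is enabled. The
function name below is hypothetical.

static void example_update_blocked_on(struct task_struct *p, struct mutex *m)
{
	/* Wrong: would trip lockdep_assert_held() in set_task_blocked_on(). */
	/* set_task_blocked_on(p, m); */

	/* Right: take p->blocked_lock around the accessors. */
	raw_spin_lock(&p->blocked_lock);
	set_task_blocked_on(p, m);
	WARN_ON_ONCE(get_task_blocked_on(p) != m);
	raw_spin_unlock(&p->blocked_lock);
}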
From patchwork Mon Mar 20 23:37:13 2023
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 72530
Date: Mon, 20 Mar 2023 23:37:13 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-6-jstultz@google.com>
Subject: [PATCH v2 05/12] locking/mutex: Expose mutex_owner()
From: John Stultz
To: LKML

From: Juri Lelli

Implementing proxy execution requires that scheduler code be able to
identify the current owner of a mutex. Expose a new helper mutex_owner()
for this purpose.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: kernel-team@android.com
Signed-off-by: Juri Lelli
[Removed the EXPORT_SYMBOL]
Signed-off-by: Valentin Schneider
Signed-off-by: Connor O'Brien
[jstultz: Tweaked subject line]
Signed-off-by: John Stultz
---
 include/linux/mutex.h  | 2 ++
 kernel/locking/mutex.c | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 8f226d460f51..ebdc59cb0bf6 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -118,6 +118,8 @@ do {									\
 extern void __mutex_init(struct mutex *lock, const char *name,
			 struct lock_class_key *key);

+extern struct task_struct *mutex_owner(struct mutex *lock);
+
 /**
  * mutex_is_locked - is the mutex locked
  * @lock: the mutex to be queried
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index d322f7c1c8fa..ead4213232cc 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -81,6 +81,11 @@ static inline struct task_struct *__mutex_owner(struct mutex *lock)
	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
 }

+struct task_struct *mutex_owner(struct mutex *lock)
+{
+	return __mutex_owner(lock);
+}
+
 static inline struct task_struct *__owner_task(unsigned long owner)
 {
	return (struct task_struct *)(owner & ~MUTEX_FLAGS);
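Not part of the patch: an illustrative use of the new helper from
scheduler-side code, which otherwise has no way to peek at mutex internals.
The function name is hypothetical and locking around blocked_on is elided.

/* Sketch: pick the lock owner as the task to run on behalf of @p. */
static struct task_struct *example_proxy_of(struct task_struct *p)
{
	struct mutex *m = p->blocked_on;

	return m ? mutex_owner(m) : p;
}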
From patchwork Mon Mar 20 23:37:14 2023
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 72528
Date: Mon, 20 Mar 2023 23:37:14 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-7-jstultz@google.com>
Subject: [PATCH v2 06/12] sched: Unify runtime accounting across classes
From: John Stultz
To: LKML

From: Peter Zijlstra

All classes use sched_entity::exec_start to track runtime and have copies
of the exact same code around to compute runtime. Collapse all that.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: kernel-team@android.com
McKenney" Cc: kernel-team@android.com Signed-off-by: Peter Zijlstra (Intel) [fix conflicts, fold in update_current_exec_runtime] Signed-off-by: Connor O'Brien [jstultz: rebased, resovling minor conflicts] Signed-off-by: John Stultz --- include/linux/sched.h | 2 +- kernel/sched/deadline.c | 13 +++------- kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++---------- kernel/sched/rt.c | 13 +++------- kernel/sched/sched.h | 12 ++------- kernel/sched/stop_task.c | 13 +--------- 6 files changed, 52 insertions(+), 57 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index a1606d0bd3fe..fc75fcc696db 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -519,7 +519,7 @@ struct sched_statistics { u64 block_max; s64 sum_block_runtime; - u64 exec_max; + s64 exec_max; u64 slice_max; u64 nr_migrations_cold; diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 71b24371a6f7..5a7c4edd5b13 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1308,9 +1308,8 @@ static void update_curr_dl(struct rq *rq) { struct task_struct *curr = rq->curr; struct sched_dl_entity *dl_se = &curr->dl; - u64 delta_exec, scaled_delta_exec; + s64 delta_exec, scaled_delta_exec; int cpu = cpu_of(rq); - u64 now; if (!dl_task(curr) || !on_dl_rq(dl_se)) return; @@ -1323,21 +1322,15 @@ static void update_curr_dl(struct rq *rq) * natural solution, but the full ramifications of this * approach need further study. */ - now = rq_clock_task(rq); - delta_exec = now - curr->se.exec_start; - if (unlikely((s64)delta_exec <= 0)) { + delta_exec = update_curr_common(rq); + if (unlikely(delta_exec <= 0)) { if (unlikely(dl_se->dl_yielded)) goto throttle; return; } - schedstat_set(curr->stats.exec_max, - max(curr->stats.exec_max, delta_exec)); - trace_sched_stat_runtime(curr, delta_exec, 0); - update_current_exec_runtime(curr, now, delta_exec); - if (dl_entity_is_special(dl_se)) return; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7a1b1f855b96..03e61be5c94f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -891,23 +891,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq) } #endif /* CONFIG_SMP */ -/* - * Update the current task's runtime statistics. - */ -static void update_curr(struct cfs_rq *cfs_rq) +static s64 update_curr_se(struct rq *rq, struct sched_entity *curr) { - struct sched_entity *curr = cfs_rq->curr; - u64 now = rq_clock_task(rq_of(cfs_rq)); - u64 delta_exec; - - if (unlikely(!curr)) - return; + u64 now = rq_clock_task(rq); + s64 delta_exec; delta_exec = now - curr->exec_start; - if (unlikely((s64)delta_exec <= 0)) - return; + if (unlikely(delta_exec <= 0)) + return delta_exec; curr->exec_start = now; + curr->sum_exec_runtime += delta_exec; if (schedstat_enabled()) { struct sched_statistics *stats; @@ -917,9 +911,43 @@ static void update_curr(struct cfs_rq *cfs_rq) max(delta_exec, stats->exec_max)); } - curr->sum_exec_runtime += delta_exec; - schedstat_add(cfs_rq->exec_clock, delta_exec); + return delta_exec; +} + +/* + * Used by other classes to account runtime. + */ +s64 update_curr_common(struct rq *rq) +{ + struct task_struct *curr = rq->curr; + s64 delta_exec; + delta_exec = update_curr_se(rq, &curr->se); + if (unlikely(delta_exec <= 0)) + return delta_exec; + + account_group_exec_runtime(curr, delta_exec); + cgroup_account_cputime(curr, delta_exec); + + return delta_exec; +} + +/* + * Update the current task's runtime statistics. 
+ */ +static void update_curr(struct cfs_rq *cfs_rq) +{ + struct sched_entity *curr = cfs_rq->curr; + s64 delta_exec; + + if (unlikely(!curr)) + return; + + delta_exec = update_curr_se(rq_of(cfs_rq), curr); + if (unlikely(delta_exec <= 0)) + return; + + schedstat_add(cfs_rq->exec_clock, delta_exec); curr->vruntime += calc_delta_fair(delta_exec, curr); update_min_vruntime(cfs_rq); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 0a11f44adee5..18eb6ce60c5c 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1046,24 +1046,17 @@ static void update_curr_rt(struct rq *rq) { struct task_struct *curr = rq->curr; struct sched_rt_entity *rt_se = &curr->rt; - u64 delta_exec; - u64 now; + s64 delta_exec; if (curr->sched_class != &rt_sched_class) return; - now = rq_clock_task(rq); - delta_exec = now - curr->se.exec_start; - if (unlikely((s64)delta_exec <= 0)) + delta_exec = update_curr_common(rq); + if (unlikely(delta_exec < 0)) return; - schedstat_set(curr->stats.exec_max, - max(curr->stats.exec_max, delta_exec)); - trace_sched_stat_runtime(curr, delta_exec, 0); - update_current_exec_runtime(curr, now, delta_exec); - if (!rt_bandwidth_enabled()) return; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 3e8df6d31c1e..d18e3c3a3f40 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2166,6 +2166,8 @@ struct affinity_context { unsigned int flags; }; +extern s64 update_curr_common(struct rq *rq); + struct sched_class { #ifdef CONFIG_UCLAMP_TASK @@ -3238,16 +3240,6 @@ extern int sched_dynamic_mode(const char *str); extern void sched_dynamic_update(int mode); #endif -static inline void update_current_exec_runtime(struct task_struct *curr, - u64 now, u64 delta_exec) -{ - curr->se.sum_exec_runtime += delta_exec; - account_group_exec_runtime(curr, delta_exec); - - curr->se.exec_start = now; - cgroup_account_cputime(curr, delta_exec); -} - #ifdef CONFIG_SCHED_MM_CID static inline int __mm_cid_get(struct mm_struct *mm) { diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 85590599b4d6..7595494ceb6d 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq) static void put_prev_task_stop(struct rq *rq, struct task_struct *prev) { - struct task_struct *curr = rq->curr; - u64 now, delta_exec; - - now = rq_clock_task(rq); - delta_exec = now - curr->se.exec_start; - if (unlikely((s64)delta_exec < 0)) - delta_exec = 0; - - schedstat_set(curr->stats.exec_max, - max(curr->stats.exec_max, delta_exec)); - - update_current_exec_runtime(curr, now, delta_exec); + update_curr_common(rq); } /* From patchwork Mon Mar 20 23:37:15 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Stultz X-Patchwork-Id: 72529 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:604a:0:0:0:0:0 with SMTP id j10csp1492095wrt; Mon, 20 Mar 2023 16:39:11 -0700 (PDT) X-Google-Smtp-Source: AK7set9DQTsuQRNbnpoCST1YIc4gWuWwEwYFhzQiZ88zyAaXkzZUKFNw/w2xPY+uVIq9qJddytye X-Received: by 2002:a17:90a:8b92:b0:23a:340d:fa49 with SMTP id z18-20020a17090a8b9200b0023a340dfa49mr280104pjn.32.1679355551077; Mon, 20 Mar 2023 16:39:11 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1679355551; cv=none; d=google.com; s=arc-20160816; b=Pg2UM+kHPSHWIIIKsYwgBaXoJSp+B+4RY2Q00rGjUn8NjuLdccIZ1biYPf+cMakAOC gGkm242Vc47iNbJfM+NkxGSoz53+/qHx1dfXK0d26fYB84jQdezznS752d/mvu6H+6E3 a6CjlijnTAzGWtCqrUq+vALz1a+IWKV9ijRlB8gF91TemrAwPwEyS+Wq40KMl7ITcCXb 
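A short editorial aside before the next message in the series, on the runtime-unification patch above: every class used to open-code the "now minus exec_start" bookkeeping, and the patch collapses that into update_curr_se()/update_curr_common(), whose signed return value the class hooks simply consume. What follows is a minimal user-space sketch of that pattern, not kernel code; the struct layout, the clock variable, and the *_sketch names are simplified stand-ins invented here for illustration.

/*
 * Minimal sketch of the unified runtime-accounting pattern: one helper
 * computes the exec delta and rolls the accounting window forward, and
 * every "class" hook consumes the returned delta instead of recomputing it.
 */
#include <stdio.h>
#include <stdint.h>

struct sched_entity_sketch {
	int64_t exec_start;        /* when the entity last started running */
	int64_t sum_exec_runtime;  /* total runtime accumulated so far */
};

/* Stand-in for rq_clock_task(): a monotonically advancing task clock. */
static int64_t clock_task_ns;

/* Common helper: compute the delta, advance exec_start, accumulate. */
static int64_t update_curr_se_sketch(struct sched_entity_sketch *se)
{
	int64_t now = clock_task_ns;
	int64_t delta_exec = now - se->exec_start;

	if (delta_exec <= 0)
		return delta_exec;  /* clock slipped backwards or no progress */

	se->exec_start = now;
	se->sum_exec_runtime += delta_exec;
	return delta_exec;
}

/* A "class" hook now only consumes the returned delta (in the kernel
 * patch the non-fair classes do this via update_curr_common()). */
static void update_curr_rt_sketch(struct sched_entity_sketch *se)
{
	int64_t delta_exec = update_curr_se_sketch(se);

	if (delta_exec <= 0)
		return;
	printf("rt: charged %lld ns\n", (long long)delta_exec);
}

int main(void)
{
	struct sched_entity_sketch se = { 0, 0 };

	clock_task_ns = 1000;
	update_curr_rt_sketch(&se);  /* charges 1000 ns */
	clock_task_ns = 2500;
	update_curr_rt_sketch(&se);  /* charges another 1500 ns */
	printf("total: %lld ns\n", (long long)se.sum_exec_runtime);
	return 0;
}

Returning a signed delta is what lets every caller bail out cheaply on a non-positive value, which is also why the patch flips several previously u64 delta_exec variables, and stats.exec_max, to s64.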
Date: Mon, 20 Mar 2023 23:37:15 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-8-jstultz@google.com>
Subject: [PATCH v2 07/12] sched: Replace rq->curr access w/ rq_curr(rq)
From: John Stultz
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
    Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
    Daniel Bristot de Oliveira, Will Deacon, Waiman Long, Boqun Feng,
    "Paul E. McKenney", kernel-team@android.com

In preparation for the proxy-execution changes, add a bit of indirection
for reading and writing rq->curr.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: kernel-team@android.com
Signed-off-by: John Stultz
--- kernel/sched/core.c | 56 ++++++++++++++++++++++++------------------- kernel/sched/core_sched.c | 2 +- kernel/sched/deadline.c | 50 +++++++++++++++++----------------- kernel/sched/debug.c | 2 +- kernel/sched/fair.c | 25 ++++++++--------- kernel/sched/membarrier.c | 8 +++--- kernel/sched/pelt.h | 2 +- kernel/sched/rt.c | 44 +++++++++++++++--------------- kernel/sched/sched.h | 42 ++++++++++++++++++++++++----- 9 files changed, 132 insertions(+), 99 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 488655f2319f..faaad249f8f7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -257,7 +257,7 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) * and re-examine whether the core is still in forced idle state.
*/ if (!(flags & DEQUEUE_SAVE) && rq->nr_running == 1 && - rq->core->core_forceidle_count && rq->curr == rq->idle) + rq->core->core_forceidle_count && rq_curr(rq) == rq->idle) resched_curr(rq); } @@ -703,7 +703,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) rq->prev_irq_time += irq_delta; delta -= irq_delta; - psi_account_irqtime(rq->curr, irq_delta); + psi_account_irqtime(rq_curr(rq), irq_delta); #endif #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { @@ -773,7 +773,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer) rq_lock(rq, &rf); update_rq_clock(rq); - rq->curr->sched_class->task_tick(rq, rq->curr, 1); + rq_curr(rq)->sched_class->task_tick(rq, rq_curr(rq), 1); rq_unlock(rq, &rf); return HRTIMER_NORESTART; @@ -1020,7 +1020,7 @@ void wake_up_q(struct wake_q_head *head) */ void resched_curr(struct rq *rq) { - struct task_struct *curr = rq->curr; + struct task_struct *curr = rq_curr(rq); int cpu; lockdep_assert_rq_held(rq); @@ -2175,16 +2175,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p, void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) { - if (p->sched_class == rq->curr->sched_class) - rq->curr->sched_class->check_preempt_curr(rq, p, flags); - else if (sched_class_above(p->sched_class, rq->curr->sched_class)) + struct task_struct *curr = rq_curr(rq); + + if (p->sched_class == curr->sched_class) + curr->sched_class->check_preempt_curr(rq, p, flags); + else if (sched_class_above(p->sched_class, curr->sched_class)) resched_curr(rq); /* * A queue event has occurred, and we're going to schedule. In * this case, we can save a useless back to back clock update. */ - if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr)) + if (task_on_rq_queued(curr) && test_tsk_need_resched(curr)) rq_clock_skip_update(rq); } @@ -3859,11 +3861,11 @@ void wake_up_if_idle(int cpu) rcu_read_lock(); - if (!is_idle_task(rcu_dereference(rq->curr))) + if (!is_idle_task(rq_curr_unlocked(rq))) goto out; rq_lock_irqsave(rq, &rf); - if (is_idle_task(rq->curr)) + if (is_idle_task(rq_curr(rq))) resched_curr(rq); /* Else CPU is not idle, do nothing here: */ rq_unlock_irqrestore(rq, &rf); @@ -4388,7 +4390,7 @@ struct task_struct *cpu_curr_snapshot(int cpu) struct task_struct *t; smp_mb(); /* Pairing determined by caller's synchronization design. */ - t = rcu_dereference(cpu_curr(cpu)); + t = cpu_curr_unlocked(cpu); smp_mb(); /* Pairing determined by caller's synchronization design. */ return t; } @@ -5197,7 +5199,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) * kernel thread and not issued an IPI. It is therefore possible to * schedule between user->kernel->user threads without passing though * switch_mm(). Membarrier requires a barrier after storing to - * rq->curr, before returning to userspace, so provide them here: + * rq_curr(rq), before returning to userspace, so provide them here: * * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly * provided by mmdrop(), @@ -5280,7 +5282,7 @@ context_switch(struct rq *rq, struct task_struct *prev, membarrier_switch_mm(rq, prev->active_mm, next->mm); /* * sys_membarrier() requires an smp_mb() between setting - * rq->curr / membarrier_switch_mm() and returning to userspace. + * rq_curr(rq) / membarrier_switch_mm() and returning to userspace. 
* * The below provides this either through switch_mm(), or in * case 'prev->active_mm == next->mm' through @@ -5564,7 +5566,7 @@ void scheduler_tick(void) { int cpu = smp_processor_id(); struct rq *rq = cpu_rq(cpu); - struct task_struct *curr = rq->curr; + struct task_struct *curr = rq_curr(rq); struct rq_flags rf; unsigned long thermal_pressure; u64 resched_latency; @@ -5657,7 +5659,7 @@ static void sched_tick_remote(struct work_struct *work) goto out_requeue; rq_lock_irq(rq, &rf); - curr = rq->curr; + curr = rq_curr(rq); if (cpu_is_offline(cpu)) goto out_unlock; @@ -6201,7 +6203,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) /* Did we break L1TF mitigation requirements? */ WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick)); - if (rq_i->curr == rq_i->core_pick) { + if (rq_curr(rq_i) == rq_i->core_pick) { rq_i->core_pick = NULL; continue; } @@ -6232,7 +6234,7 @@ static bool try_steal_cookie(int this, int that) if (!cookie) goto unlock; - if (dst->curr != dst->idle) + if (rq_curr(dst) != dst->idle) goto unlock; p = sched_core_find(src, cookie); @@ -6240,7 +6242,7 @@ static bool try_steal_cookie(int this, int that) goto unlock; do { - if (p == src->core_pick || p == src->curr) + if (p == src->core_pick || p == rq_curr(src)) goto next; if (!is_cpu_allowed(p, this)) @@ -6511,7 +6513,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) cpu = smp_processor_id(); rq = cpu_rq(cpu); - prev = rq->curr; + prev = rq_curr(rq); schedule_debug(prev, !!sched_mode); @@ -6534,7 +6536,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) * if (signal_pending_state()) if (p->state & @state) * * Also, the membarrier system call requires a full memory barrier - * after coming from user-space, before storing to rq->curr. + * after coming from user-space, before storing to rq_curr(). */ rq_lock(rq, &rf); smp_mb__after_spinlock(); @@ -6593,14 +6595,14 @@ static void __sched notrace __schedule(unsigned int sched_mode) if (likely(prev != next)) { rq->nr_switches++; /* - * RCU users of rcu_dereference(rq->curr) may not see + * RCU users of rcu_dereference(rq_curr(rq)) may not see * changes to task_struct made by pick_next_task(). */ - RCU_INIT_POINTER(rq->curr, next); + rq_set_curr_rcu_init(rq, next); /* * The membarrier system call requires each architecture * to have a full memory barrier after updating - * rq->curr, before returning to user-space. + * rq_curr(rq), before returning to user-space. * * Here are the schemes providing that barrier on the * various architectures: @@ -7037,7 +7039,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) * real need to boost. 
*/ if (unlikely(p == rq->idle)) { - WARN_ON(p != rq->curr); + WARN_ON(p != rq_curr(rq)); WARN_ON(p->pi_blocked_on); goto out_unlock; } @@ -7253,7 +7255,7 @@ int idle_cpu(int cpu) { struct rq *rq = cpu_rq(cpu); - if (rq->curr != rq->idle) + if (rq_curr(rq) != rq->idle) return 0; if (rq->nr_running) @@ -9154,7 +9156,7 @@ void __init init_idle(struct task_struct *idle, int cpu) rcu_read_unlock(); rq->idle = idle; - rcu_assign_pointer(rq->curr, idle); + rq_set_curr(rq, idle); idle->on_rq = TASK_ON_RQ_QUEUED; #ifdef CONFIG_SMP idle->on_cpu = 1; @@ -9328,7 +9330,7 @@ static DEFINE_PER_CPU(struct cpu_stop_work, push_work); */ static void balance_push(struct rq *rq) { - struct task_struct *push_task = rq->curr; + struct task_struct *push_task = rq_curr(rq); lockdep_assert_rq_held(rq); diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c index a57fd8f27498..ece2157a265d 100644 --- a/kernel/sched/core_sched.c +++ b/kernel/sched/core_sched.c @@ -273,7 +273,7 @@ void __sched_core_account_forceidle(struct rq *rq) for_each_cpu(i, smt_mask) { rq_i = cpu_rq(i); - p = rq_i->core_pick ?: rq_i->curr; + p = rq_i->core_pick ?: rq_curr(rq_i); if (p == rq_i->idle) continue; diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 5a7c4edd5b13..4e3acc76708f 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1179,7 +1179,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) #endif enqueue_task_dl(rq, p, ENQUEUE_REPLENISH); - if (dl_task(rq->curr)) + if (dl_task(rq_curr(rq))) check_preempt_curr_dl(rq, p, 0); else resched_curr(rq); @@ -1306,7 +1306,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se) */ static void update_curr_dl(struct rq *rq) { - struct task_struct *curr = rq->curr; + struct task_struct *curr = rq_curr(rq); struct sched_dl_entity *dl_se = &curr->dl; s64 delta_exec, scaled_delta_exec; int cpu = cpu_of(rq); @@ -1792,7 +1792,7 @@ static void yield_task_dl(struct rq *rq) * it and the bandwidth timer will wake it up and will give it * new scheduling parameters (thanks to dl_yielded=1). */ - rq->curr->dl.dl_yielded = 1; + rq_curr(rq)->dl.dl_yielded = 1; update_rq_clock(rq); update_curr_dl(rq); @@ -1829,7 +1829,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags) rq = cpu_rq(cpu); rcu_read_lock(); - curr = READ_ONCE(rq->curr); /* unlocked access */ + curr = rq_curr_unlocked(rq); /* XXX jstultz: using rcu_dereference intead of READ_ONCE */ /* * If we are dealing with a -deadline task, we must @@ -1904,8 +1904,8 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p) * Current can't be migrated, useless to reschedule, * let's hope p can move out. */ - if (rq->curr->nr_cpus_allowed == 1 || - !cpudl_find(&rq->rd->cpudl, rq->curr, NULL)) + if (rq_curr(rq)->nr_cpus_allowed == 1 || + !cpudl_find(&rq->rd->cpudl, rq_curr(rq), NULL)) return; /* @@ -1944,7 +1944,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf) static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, int flags) { - if (dl_entity_preempt(&p->dl, &rq->curr->dl)) { + if (dl_entity_preempt(&p->dl, &rq_curr(rq)->dl)) { resched_curr(rq); return; } @@ -1954,8 +1954,8 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, * In the unlikely case current and p have the same deadline * let us try to decide what's the best thing to do... 
*/ - if ((p->dl.deadline == rq->curr->dl.deadline) && - !test_tsk_need_resched(rq->curr)) + if ((p->dl.deadline == rq_curr(rq)->dl.deadline) && + !test_tsk_need_resched(rq_curr(rq))) check_preempt_equal_dl(rq, p); #endif /* CONFIG_SMP */ } @@ -1989,7 +1989,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first) if (hrtick_enabled_dl(rq)) start_hrtick_dl(rq, p); - if (rq->curr->sched_class != &dl_sched_class) + if (rq_curr(rq)->sched_class != &dl_sched_class) update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0); deadline_queue_push_tasks(rq); @@ -2301,13 +2301,13 @@ static int push_dl_task(struct rq *rq) retry: /* - * If next_task preempts rq->curr, and rq->curr + * If next_task preempts rq_curr(rq), and rq_curr(rq) * can move away, it makes sense to just reschedule * without going further in pushing next_task. */ - if (dl_task(rq->curr) && - dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) && - rq->curr->nr_cpus_allowed > 1) { + if (dl_task(rq_curr(rq)) && + dl_time_before(next_task->dl.deadline, rq_curr(rq)->dl.deadline) && + rq_curr(rq)->nr_cpus_allowed > 1) { resched_curr(rq); return 0; } @@ -2315,7 +2315,7 @@ static int push_dl_task(struct rq *rq) if (is_migration_disabled(next_task)) return 0; - if (WARN_ON(next_task == rq->curr)) + if (WARN_ON(next_task == rq_curr(rq))) return 0; /* We might release rq lock */ @@ -2423,7 +2423,7 @@ static void pull_dl_task(struct rq *this_rq) */ if (p && dl_time_before(p->dl.deadline, dmin) && dl_task_is_earliest_deadline(p, this_rq)) { - WARN_ON(p == src_rq->curr); + WARN_ON(p == rq_curr(src_rq)); WARN_ON(!task_on_rq_queued(p)); /* @@ -2431,7 +2431,7 @@ static void pull_dl_task(struct rq *this_rq) * deadline than the current task of its runqueue. */ if (dl_time_before(p->dl.deadline, - src_rq->curr->dl.deadline)) + rq_curr(src_rq)->dl.deadline)) goto skip; if (is_migration_disabled(p)) { @@ -2468,11 +2468,11 @@ static void pull_dl_task(struct rq *this_rq) static void task_woken_dl(struct rq *rq, struct task_struct *p) { if (!task_on_cpu(rq, p) && - !test_tsk_need_resched(rq->curr) && + !test_tsk_need_resched(rq_curr(rq)) && p->nr_cpus_allowed > 1 && - dl_task(rq->curr) && - (rq->curr->nr_cpus_allowed < 2 || - !dl_entity_preempt(&p->dl, &rq->curr->dl))) { + dl_task(rq_curr(rq)) && + (rq_curr(rq)->nr_cpus_allowed < 2 || + !dl_entity_preempt(&p->dl, &rq_curr(rq)->dl))) { push_dl_tasks(rq); } } @@ -2635,12 +2635,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p) return; } - if (rq->curr != p) { + if (rq_curr(rq) != p) { #ifdef CONFIG_SMP if (p->nr_cpus_allowed > 1 && rq->dl.overloaded) deadline_queue_push_tasks(rq); #endif - if (dl_task(rq->curr)) + if (dl_task(rq_curr(rq))) check_preempt_curr_dl(rq, p, 0); else resched_curr(rq); @@ -2684,8 +2684,8 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p, * * Otherwise, if p was given an earlier deadline, reschedule. 
*/ - if (!dl_task(rq->curr) || - dl_time_before(p->dl.deadline, rq->curr->dl.deadline)) + if (!dl_task(rq_curr(rq)) || + dl_time_before(p->dl.deadline, rq_curr(rq)->dl.deadline)) resched_curr(rq); } #else diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 1637b65ba07a..55f57156502d 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -743,7 +743,7 @@ do { \ P(nr_switches); P(nr_uninterruptible); PN(next_balance); - SEQ_printf(m, " .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr))); + SEQ_printf(m, " .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq_curr(rq)))); PN(clock); PN(clock_task); #undef P diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 03e61be5c94f..8b35dfc0c442 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -919,7 +919,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr) */ s64 update_curr_common(struct rq *rq) { - struct task_struct *curr = rq->curr; + struct task_struct *curr = rq_curr(rq); s64 delta_exec; delta_exec = update_curr_se(rq, &curr->se); @@ -964,7 +964,7 @@ static void update_curr(struct cfs_rq *cfs_rq) static void update_curr_fair(struct rq *rq) { - update_curr(cfs_rq_of(&rq->curr->se)); + update_curr(cfs_rq_of(&rq_curr(rq)->se)); } static inline void @@ -1958,7 +1958,7 @@ static bool task_numa_compare(struct task_numa_env *env, return false; rcu_read_lock(); - cur = rcu_dereference(dst_rq->curr); + cur = rcu_dereference(rq_curr(dst_rq)); if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur))) cur = NULL; @@ -2747,7 +2747,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, } rcu_read_lock(); - tsk = READ_ONCE(cpu_rq(cpu)->curr); + tsk = READ_ONCE(cpu_curr(cpu)); if (!cpupid_match_pid(tsk, cpupid)) goto no_join; @@ -3969,7 +3969,7 @@ static inline void migrate_se_pelt_lag(struct sched_entity *se) rq = rq_of(cfs_rq); rcu_read_lock(); - is_idle = is_idle_task(rcu_dereference(rq->curr)); + is_idle = is_idle_task(rq_curr_unlocked(rq)); rcu_read_unlock(); /* @@ -5498,7 +5498,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) assert_list_leaf_cfs_rq(rq); /* Determine whether we need to wake up potentially idle CPU: */ - if (rq->curr == rq->idle && rq->cfs.nr_running) + if (rq_curr(rq) == rq->idle && rq->cfs.nr_running) resched_curr(rq); } @@ -6148,7 +6148,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p) */ static void hrtick_update(struct rq *rq) { - struct task_struct *curr = rq->curr; + struct task_struct *curr = rq_curr(rq); if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class) return; @@ -7788,7 +7788,7 @@ static void set_skip_buddy(struct sched_entity *se) */ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) { - struct task_struct *curr = rq->curr; + struct task_struct *curr = rq_curr(rq); struct sched_entity *se = &curr->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(curr); int scale = cfs_rq->nr_running >= sched_nr_latency; @@ -8086,7 +8086,7 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev) */ static void yield_task_fair(struct rq *rq) { - struct task_struct *curr = rq->curr; + struct task_struct *curr = rq_curr(rq); struct cfs_rq *cfs_rq = task_cfs_rq(curr); struct sched_entity *se = &curr->se; @@ -8821,7 +8821,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done) * update_load_avg() can call cpufreq_update_util(). Make sure that RT, * DL and IRQ signals have been updated before updating CFS. 
*/ - curr_class = rq->curr->sched_class; + curr_class = rq_curr(rq)->sched_class; thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); @@ -9640,8 +9640,9 @@ static unsigned int task_running_on_cpu(int cpu, struct task_struct *p) static int idle_cpu_without(int cpu, struct task_struct *p) { struct rq *rq = cpu_rq(cpu); + struct task_struct *curr = rq_curr(rq); - if (rq->curr != rq->idle && rq->curr != p) + if (curr != rq->idle && curr != p) return 0; /* @@ -10839,7 +10840,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, * if the curr task on busiest CPU can't be * moved to this_cpu: */ - if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) { + if (!cpumask_test_cpu(this_cpu, rq_curr(busiest)->cpus_ptr)) { raw_spin_rq_unlock_irqrestore(busiest, flags); goto out_one_pinned; } diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 2ad881d07752..e55e39f74ea4 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -283,7 +283,7 @@ static int membarrier_global_expedited(void) * Skip the CPU if it runs a kernel thread which is not using * a task mm. */ - p = rcu_dereference(cpu_rq(cpu)->curr); + p = cpu_curr_unlocked(cpu); if (!p->mm) continue; @@ -355,7 +355,7 @@ static int membarrier_private_expedited(int flags, int cpu_id) if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id)) goto out; rcu_read_lock(); - p = rcu_dereference(cpu_rq(cpu_id)->curr); + p = cpu_curr_unlocked(cpu_id); if (!p || p->mm != mm) { rcu_read_unlock(); goto out; @@ -368,7 +368,7 @@ static int membarrier_private_expedited(int flags, int cpu_id) for_each_online_cpu(cpu) { struct task_struct *p; - p = rcu_dereference(cpu_rq(cpu)->curr); + p = cpu_curr_unlocked(cpu); if (p && p->mm == mm) __cpumask_set_cpu(cpu, tmpmask); } @@ -466,7 +466,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm) struct rq *rq = cpu_rq(cpu); struct task_struct *p; - p = rcu_dereference(rq->curr); + p = rq_curr_unlocked(rq); if (p && p->mm == mm) __cpumask_set_cpu(cpu, tmpmask); } diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h index 3a0e0dc28721..bf3276f8df78 100644 --- a/kernel/sched/pelt.h +++ b/kernel/sched/pelt.h @@ -94,7 +94,7 @@ static inline void _update_idle_rq_clock_pelt(struct rq *rq) */ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta) { - if (unlikely(is_idle_task(rq->curr))) { + if (unlikely(is_idle_task(rq_curr(rq)))) { _update_idle_rq_clock_pelt(rq); return; } diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 18eb6ce60c5c..91c992230aba 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -574,7 +574,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags) static void sched_rt_rq_enqueue(struct rt_rq *rt_rq) { - struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr; + struct task_struct *curr = rq_curr(rq_of_rt_rq(rt_rq)); struct rq *rq = rq_of_rt_rq(rt_rq); struct sched_rt_entity *rt_se; @@ -958,7 +958,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun) * and this unthrottle will get accounted as * 'runtime'. 
*/ - if (rt_rq->rt_nr_running && rq->curr == rq->idle) + if (rt_rq->rt_nr_running && rq_curr(rq) == rq->idle) rq_clock_cancel_skipupdate(rq); } if (rt_rq->rt_time || rt_rq->rt_nr_running) @@ -1044,7 +1044,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq) */ static void update_curr_rt(struct rq *rq) { - struct task_struct *curr = rq->curr; + struct task_struct *curr = rq_curr(rq); struct sched_rt_entity *rt_se = &curr->rt; s64 delta_exec; @@ -1582,7 +1582,7 @@ static void requeue_task_rt(struct rq *rq, struct task_struct *p, int head) static void yield_task_rt(struct rq *rq) { - requeue_task_rt(rq, rq->curr, 0); + requeue_task_rt(rq, rq_curr(rq), 0); } #ifdef CONFIG_SMP @@ -1602,7 +1602,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags) rq = cpu_rq(cpu); rcu_read_lock(); - curr = READ_ONCE(rq->curr); /* unlocked access */ + curr = rq_curr_unlocked(rq); /* XXX jstultz: using rcu_dereference intead of READ_ONCE */ /* * If the current task on @p's runqueue is an RT task, then @@ -1666,8 +1666,8 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) * Current can't be migrated, useless to reschedule, * let's hope p can move out. */ - if (rq->curr->nr_cpus_allowed == 1 || - !cpupri_find(&rq->rd->cpupri, rq->curr, NULL)) + if (rq_curr(rq)->nr_cpus_allowed == 1 || + !cpupri_find(&rq->rd->cpupri, rq_curr(rq), NULL)) return; /* @@ -1710,7 +1710,7 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf) */ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags) { - if (p->prio < rq->curr->prio) { + if (p->prio < rq_curr(rq)->prio) { resched_curr(rq); return; } @@ -1728,7 +1728,7 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag * to move current somewhere else, making room for our non-migratable * task. */ - if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr)) + if (p->prio == rq_curr(rq)->prio && !test_tsk_need_resched(rq_curr(rq))) check_preempt_equal_prio(rq, p); #endif } @@ -1753,7 +1753,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f * utilization. We only care of the case where we start to schedule a * rt task */ - if (rq->curr->sched_class != &rt_sched_class) + if (rq_curr(rq)->sched_class != &rt_sched_class) update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0); rt_queue_push_tasks(rq); @@ -2062,7 +2062,7 @@ static int push_rt_task(struct rq *rq, bool pull) * higher priority than current. If that's the case * just reschedule current. */ - if (unlikely(next_task->prio < rq->curr->prio)) { + if (unlikely(next_task->prio < rq_curr(rq)->prio)) { resched_curr(rq); return 0; } @@ -2083,10 +2083,10 @@ static int push_rt_task(struct rq *rq, bool pull) * Note that the stoppers are masqueraded as SCHED_FIFO * (cf. sched_set_stop_task()), so we can't rely on rt_task(). */ - if (rq->curr->sched_class != &rt_sched_class) + if (rq_curr(rq)->sched_class != &rt_sched_class) return 0; - cpu = find_lowest_rq(rq->curr); + cpu = find_lowest_rq(rq_curr(rq)); if (cpu == -1 || cpu == rq->cpu) return 0; @@ -2107,7 +2107,7 @@ static int push_rt_task(struct rq *rq, bool pull) return 0; } - if (WARN_ON(next_task == rq->curr)) + if (WARN_ON(next_task == rq_curr(rq))) return 0; /* We might release rq lock */ @@ -2404,7 +2404,7 @@ static void pull_rt_task(struct rq *this_rq) * the to-be-scheduled task? 
*/ if (p && (p->prio < this_rq->rt.highest_prio.curr)) { - WARN_ON(p == src_rq->curr); + WARN_ON(p == rq_curr(src_rq)); WARN_ON(!task_on_rq_queued(p)); /* @@ -2415,7 +2415,7 @@ static void pull_rt_task(struct rq *this_rq) * p if it is lower in priority than the * current task on the run queue */ - if (p->prio < src_rq->curr->prio) + if (p->prio < rq_curr(src_rq)->prio) goto skip; if (is_migration_disabled(p)) { @@ -2455,11 +2455,11 @@ static void pull_rt_task(struct rq *this_rq) static void task_woken_rt(struct rq *rq, struct task_struct *p) { bool need_to_push = !task_on_cpu(rq, p) && - !test_tsk_need_resched(rq->curr) && + !test_tsk_need_resched(rq_curr(rq)) && p->nr_cpus_allowed > 1 && - (dl_task(rq->curr) || rt_task(rq->curr)) && - (rq->curr->nr_cpus_allowed < 2 || - rq->curr->prio <= p->prio); + (dl_task(rq_curr(rq)) || rt_task(rq_curr(rq))) && + (rq_curr(rq)->nr_cpus_allowed < 2 || + rq_curr(rq)->prio <= p->prio); if (need_to_push) push_rt_tasks(rq); @@ -2543,7 +2543,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p) if (p->nr_cpus_allowed > 1 && rq->rt.overloaded) rt_queue_push_tasks(rq); #endif /* CONFIG_SMP */ - if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq))) + if (p->prio < rq_curr(rq)->prio && cpu_online(cpu_of(rq))) resched_curr(rq); } } @@ -2584,7 +2584,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio) * greater than the current running task * then reschedule. */ - if (p->prio < rq->curr->prio) + if (p->prio < rq_curr(rq)->prio) resched_curr(rq); } } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d18e3c3a3f40..00f73ca4c618 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1008,7 +1008,7 @@ struct rq { */ unsigned int nr_uninterruptible; - struct task_struct __rcu *curr; + struct task_struct __rcu *curr_exec; struct task_struct *idle; struct task_struct *stop; unsigned long next_balance; @@ -1201,12 +1201,42 @@ static inline bool is_migration_disabled(struct task_struct *p) DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +static inline struct task_struct *rq_curr(struct rq *rq) +{ + return rq->curr_exec; +} + +/* XXX jstultz: Would rq_curr_rcu() be a better name? */ +static inline struct task_struct *rq_curr_unlocked(struct rq *rq) +{ + return rcu_dereference(rq->curr_exec); +} + +static inline void rq_set_curr(struct rq *rq, struct task_struct *task) +{ + rcu_assign_pointer(rq->curr_exec, task); +} + +/* + * XXX jstultz: seems like rcu_assign_pointer above would also + * work for this, but trying to match usage. 
+ */ +static inline void rq_set_curr_rcu_init(struct rq *rq, struct task_struct *task) +{ + RCU_INIT_POINTER(rq->curr_exec, task); +} + #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) #define task_rq(p) cpu_rq(task_cpu(p)) -#define cpu_curr(cpu) (cpu_rq(cpu)->curr) +#define cpu_curr(cpu) (rq_curr(cpu_rq(cpu))) #define raw_rq() raw_cpu_ptr(&runqueues) +static inline struct task_struct *cpu_curr_unlocked(int cpu) +{ + return rq_curr_unlocked(cpu_rq(cpu)); +} + struct sched_group; #ifdef CONFIG_SCHED_CORE static inline struct cpumask *sched_group_span(struct sched_group *sg); @@ -2070,7 +2100,7 @@ static inline u64 global_rt_runtime(void) static inline int task_current(struct rq *rq, struct task_struct *p) { - return rq->curr == p; + return rq_curr(rq) == p; } static inline int task_on_cpu(struct rq *rq, struct task_struct *p) @@ -2230,7 +2260,7 @@ struct sched_class { static inline void put_prev_task(struct rq *rq, struct task_struct *prev) { - WARN_ON_ONCE(rq->curr != prev); + WARN_ON_ONCE(rq_curr(rq) != prev); prev->sched_class->put_prev_task(rq, prev); } @@ -2311,7 +2341,7 @@ extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_conte static inline struct task_struct *get_push_task(struct rq *rq) { - struct task_struct *p = rq->curr; + struct task_struct *p = rq_curr(rq); lockdep_assert_rq_held(rq); @@ -3193,7 +3223,7 @@ static inline bool sched_energy_enabled(void) { return false; } * The scheduler provides memory barriers required by membarrier between: * - prior user-space memory accesses and store to rq->membarrier_state, * - store to rq->membarrier_state and following user-space memory accesses. - * In the same way it provides those guarantees around store to rq->curr. + * In the same way it provides those guarantees around store to rq_curr(rq). 
*/ static inline void membarrier_switch_mm(struct rq *rq, struct mm_struct *prev_mm, From patchwork Mon Mar 20 23:37:16 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Stultz X-Patchwork-Id: 72534 Return-Path: Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:604a:0:0:0:0:0 with SMTP id j10csp1494180wrt; Mon, 20 Mar 2023 16:44:53 -0700 (PDT) X-Google-Smtp-Source: AK7set++7aC3MnKSilP06j/Xqd9ORvAkm8kyiaZv2Uf3P3Tnc+T8lreuBbosKzttbaFu86w2HfMv X-Received: by 2002:a17:902:c949:b0:19c:cf89:b7ee with SMTP id i9-20020a170902c94900b0019ccf89b7eemr148067pla.69.1679355893604; Mon, 20 Mar 2023 16:44:53 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1679355893; cv=none; d=google.com; s=arc-20160816; b=dTmytMs0ZTrfczO0tDBx94Juz7Bl6AOwpyUIPexgJi1wWoJH/xfdA8iLEH0FG+ub5u r/wSrT7bGm4AAZD0g6OxOL1gVKaF/qhzfSYcsLli/RFKJKwyNDOvTML0qdLWX2H50IJx vXRrBNiLJh5MEHuBh6dUGzMZwTKWqK7KlDBVxhc5xHtx8ydFi79MiE5NMl1zrZ0h6xFT IKd7FKpQ8Et5jk4VKXQkGlsmPUjnhRHXZItdDNP0mGtgg94vHHutr9R7SjG8+tpWvGNQ dTL5h3eKIcZYvH4cFAAEhC0UxmZKTr1e7mufD0kjQNr2Mq5FTiJgo6pTH4dRIsKDj5PG CKFA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=t4SbJQTGrAynFT6BXJUh1wiDWDragfA0SgeXDDiqdjc=; b=JVHA7aKEKicKroGrFlPs750YmHKqvHLGCtqZTZe7S9YsCtqlUdP5OOVPWYMdcnihYN LyGqTVcH0scgwUBfVxhobu/LjEC2Rh/HTwHW7c4iuAwIVIRev+7ld9sSZwS1NKaX7om3 Pn0NH3bz0a0kSbM3IeJGWuysButYfslMsYmD8iw9QwxLX6oqh+eeecKCRPRifHatgNs4 JzO82OXusKozTy5e/MT90ewEkZZ0cQZQx4HF1kVGEC18VASPF4A5QKmyUYYV5PDMK7zc RHuIhn/euXEh04qY8hZ/O5UmS/iINfw1O5AEVb7szQfnbxhmPAuFpxz8/8n4iElYdEFx 4zdQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=M6QKpdQB; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. 
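A short editorial aside before the next message, on the rq_curr() accessor patch above: funnelling every read and write of the run queue's current task through small helpers means a later patch can change what "current" denotes in one place, and renaming the underlying field from curr to curr_exec turns any leftover direct rq->curr access into a build error. Below is a minimal user-space sketch of that accessor pattern with simplified stand-in types invented here; it deliberately omits the RCU handling (rcu_dereference()/rcu_assign_pointer()) that the real rq_curr_unlocked()/rq_set_curr() helpers carry.

/*
 * Minimal sketch of the accessor indirection: callers use rq_curr()/
 * rq_set_curr() instead of touching the field directly, so its meaning
 * (and name) can change without editing every call site.
 */
#include <stdio.h>

struct task_sketch {
	const char *comm;  /* task name, stand-in for task_struct::comm */
};

struct rq_sketch {
	/* Renamed field: any remaining direct ->curr user fails to build. */
	struct task_sketch *curr_exec;
};

static struct task_sketch *rq_curr(struct rq_sketch *rq)
{
	return rq->curr_exec;
}

static void rq_set_curr(struct rq_sketch *rq, struct task_sketch *task)
{
	rq->curr_exec = task;
}

int main(void)
{
	struct task_sketch idle = { "swapper/0" };
	struct rq_sketch rq = { 0 };

	rq_set_curr(&rq, &idle);
	printf("curr: %s\n", rq_curr(&rq)->comm);
	return 0;
}

In the series, this indirection is what lets the following patch add rq_proxy() alongside rq_curr(), so scheduler-state decisions can follow one task (the scheduling context) while the CPU actually runs another (the execution context).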
Date: Mon, 20 Mar 2023 23:37:16 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-9-jstultz@google.com>
Subject: [PATCH v2 08/12] sched: Split scheduler execution context
From: John Stultz
To: LKML
Cc: Peter Zijlstra, Joel Fernandes, Qais
Yousef , Ingo Molnar , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E . McKenney" , kernel-team@android.com, "Connor O'Brien" , John Stultz X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1760932285650649000?= X-GMAIL-MSGID: =?utf-8?q?1760932285650649000?= From: Peter Zijlstra Lets define the scheduling context as all the scheduler state in task_struct and the execution context as all state required to run the task. Currently both are intertwined in task_struct. We want to logically split these such that we can run the execution context of one task with the scheduling context of another. To this purpose introduce rq::proxy to point to the task_struct used for scheduler state and preserve rq::curr to denote the execution context. XXX connoro: some further work may be needed in RT/DL load balancing paths to properly handle split context; see comments in code for specifics Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . McKenney" Cc: kernel-team@android.com [added lot of comments/questions - identifiable by XXX] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Juri Lelli Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20181009092434.26221-5-juri.lelli@redhat.com [add additional comments and update more sched_class code to use rq::proxy] Signed-off-by: Connor O'Brien [jstultz: Rebased and resolved minor collisions, reworked to use accessors, tweaked update_curr_common to use rq_proxy fixing rt scheduling issues] Signed-off-by: John Stultz --- v2: * Reworked to use accessors * Fixed update_curr_common to use proxy instead of curr --- kernel/sched/core.c | 41 +++++++++++++++++------- kernel/sched/deadline.c | 36 +++++++++++---------- kernel/sched/fair.c | 22 +++++++------ kernel/sched/rt.c | 59 +++++++++++++++++++++++------------ kernel/sched/sched.h | 69 +++++++++++++++++++++++++++++++++++++++-- 5 files changed, 168 insertions(+), 59 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index faaad249f8f7..ab6b1917bc70 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -773,7 +773,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer) rq_lock(rq, &rf); update_rq_clock(rq); - rq_curr(rq)->sched_class->task_tick(rq, rq_curr(rq), 1); + rq_proxy(rq)->sched_class->task_tick(rq, rq_proxy(rq), 1); rq_unlock(rq, &rf); return HRTIMER_NORESTART; @@ -2175,7 +2175,7 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p, void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) { - struct task_struct *curr = rq_curr(rq); + struct task_struct *curr = rq_proxy(rq); if (p->sched_class == curr->sched_class) curr->sched_class->check_preempt_curr(rq, p, flags); @@ -2186,7 +2186,7 @@ void 
check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) * A queue event has occurred, and we're going to schedule. In * this case, we can save a useless back to back clock update. */ - if (task_on_rq_queued(curr) && test_tsk_need_resched(curr)) + if (task_on_rq_queued(curr) && test_tsk_need_resched(rq_curr(rq))) rq_clock_skip_update(rq); } @@ -2576,7 +2576,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx) lockdep_assert_held(&p->pi_lock); queued = task_on_rq_queued(p); - running = task_current(rq, p); + running = task_current_proxy(rq, p); if (queued) { /* @@ -5498,7 +5498,7 @@ unsigned long long task_sched_runtime(struct task_struct *p) * project cycles that may never be accounted to this * thread, breaking clock_gettime(). */ - if (task_current(rq, p) && task_on_rq_queued(p)) { + if (task_current_proxy(rq, p) && task_on_rq_queued(p)) { prefetch_curr_exec_start(p); update_rq_clock(rq); p->sched_class->update_curr(rq); @@ -5566,7 +5566,8 @@ void scheduler_tick(void) { int cpu = smp_processor_id(); struct rq *rq = cpu_rq(cpu); - struct task_struct *curr = rq_curr(rq); + /* accounting goes to the proxy task */ + struct task_struct *curr = rq_proxy(rq); struct rq_flags rf; unsigned long thermal_pressure; u64 resched_latency; @@ -5663,6 +5664,13 @@ static void sched_tick_remote(struct work_struct *work) if (cpu_is_offline(cpu)) goto out_unlock; + /* + * XXX don't we need to account to rq->proxy? + * Maybe, since this is a remote tick for full dynticks mode, we are + * always sure that there is no proxy (only a single task is running. + */ + SCHED_WARN_ON(rq_curr(rq) != rq_proxy(rq)); + update_rq_clock(rq); if (!is_idle_task(curr)) { @@ -6586,6 +6594,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) } next = pick_next_task(rq, prev, &rf); + rq_set_proxy(rq, next); clear_tsk_need_resched(prev); clear_preempt_need_resched(); #ifdef CONFIG_SCHED_DEBUG @@ -7052,7 +7061,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) prev_class = p->sched_class; queued = task_on_rq_queued(p); - running = task_current(rq, p); + /* + * XXX how does (proxy exec) mutexes and RT_mutexes work together?! + */ + running = task_current_proxy(rq, p); if (queued) dequeue_task(rq, p, queue_flag); if (running) @@ -7140,7 +7152,10 @@ void set_user_nice(struct task_struct *p, long nice) goto out_unlock; } queued = task_on_rq_queued(p); - running = task_current(rq, p); + /* + * XXX see concerns about do_set_cpus_allowed, rt_mutex_prio & Co. + */ + running = task_current_proxy(rq, p); if (queued) dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK); if (running) @@ -7704,7 +7719,10 @@ static int __sched_setscheduler(struct task_struct *p, } queued = task_on_rq_queued(p); - running = task_current(rq, p); + /* + * XXX and again, how is this safe w.r.t. proxy exec? 
+ */ + running = task_current_proxy(rq, p); if (queued) dequeue_task(rq, p, queue_flags); if (running) @@ -9156,6 +9174,7 @@ void __init init_idle(struct task_struct *idle, int cpu) rcu_read_unlock(); rq->idle = idle; + rq_set_proxy(rq, idle); rq_set_curr(rq, idle); idle->on_rq = TASK_ON_RQ_QUEUED; #ifdef CONFIG_SMP @@ -9258,7 +9277,7 @@ void sched_setnuma(struct task_struct *p, int nid) rq = task_rq_lock(p, &rf); queued = task_on_rq_queued(p); - running = task_current(rq, p); + running = task_current_proxy(rq, p); if (queued) dequeue_task(rq, p, DEQUEUE_SAVE); @@ -10370,7 +10389,7 @@ void sched_move_task(struct task_struct *tsk) rq = task_rq_lock(tsk, &rf); update_rq_clock(rq); - running = task_current(rq, tsk); + running = task_current_proxy(rq, tsk); queued = task_on_rq_queued(tsk); if (queued) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 4e3acc76708f..6ec40f90317c 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1179,7 +1179,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) #endif enqueue_task_dl(rq, p, ENQUEUE_REPLENISH); - if (dl_task(rq_curr(rq))) + if (dl_task(rq_proxy(rq))) check_preempt_curr_dl(rq, p, 0); else resched_curr(rq); @@ -1306,7 +1306,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se) */ static void update_curr_dl(struct rq *rq) { - struct task_struct *curr = rq_curr(rq); + struct task_struct *curr = rq_proxy(rq); struct sched_dl_entity *dl_se = &curr->dl; s64 delta_exec, scaled_delta_exec; int cpu = cpu_of(rq); @@ -1819,7 +1819,7 @@ static int find_later_rq(struct task_struct *task); static int select_task_rq_dl(struct task_struct *p, int cpu, int flags) { - struct task_struct *curr; + struct task_struct *curr, *proxy; bool select_rq; struct rq *rq; @@ -1830,6 +1830,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags) rcu_read_lock(); curr = rq_curr_unlocked(rq); /* XXX jstultz: using rcu_dereference intead of READ_ONCE */ + proxy = rq_proxy_unlocked(rq); /* * If we are dealing with a -deadline task, we must @@ -1840,9 +1841,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags) * other hand, if it has a shorter deadline, we * try to make it stay here, it might be important. */ - select_rq = unlikely(dl_task(curr)) && + select_rq = unlikely(dl_task(proxy)) && (curr->nr_cpus_allowed < 2 || - !dl_entity_preempt(&p->dl, &curr->dl)) && + !dl_entity_preempt(&p->dl, &proxy->dl)) && p->nr_cpus_allowed > 1; /* @@ -1905,7 +1906,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p) * let's hope p can move out. */ if (rq_curr(rq)->nr_cpus_allowed == 1 || - !cpudl_find(&rq->rd->cpudl, rq_curr(rq), NULL)) + !cpudl_find(&rq->rd->cpudl, rq_proxy(rq), NULL)) return; /* @@ -1944,7 +1945,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf) static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, int flags) { - if (dl_entity_preempt(&p->dl, &rq_curr(rq)->dl)) { + if (dl_entity_preempt(&p->dl, &rq_proxy(rq)->dl)) { resched_curr(rq); return; } @@ -1954,7 +1955,7 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, * In the unlikely case current and p have the same deadline * let us try to decide what's the best thing to do... 
*/ - if ((p->dl.deadline == rq_curr(rq)->dl.deadline) && + if ((p->dl.deadline == rq_proxy(rq)->dl.deadline) && !test_tsk_need_resched(rq_curr(rq))) check_preempt_equal_dl(rq, p); #endif /* CONFIG_SMP */ @@ -1989,7 +1990,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first) if (hrtick_enabled_dl(rq)) start_hrtick_dl(rq, p); - if (rq_curr(rq)->sched_class != &dl_sched_class) + if (rq_proxy(rq)->sched_class != &dl_sched_class) update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0); deadline_queue_push_tasks(rq); @@ -2305,8 +2306,8 @@ static int push_dl_task(struct rq *rq) * can move away, it makes sense to just reschedule * without going further in pushing next_task. */ - if (dl_task(rq_curr(rq)) && - dl_time_before(next_task->dl.deadline, rq_curr(rq)->dl.deadline) && + if (dl_task(rq_proxy(rq)) && + dl_time_before(next_task->dl.deadline, rq_proxy(rq)->dl.deadline) && rq_curr(rq)->nr_cpus_allowed > 1) { resched_curr(rq); return 0; @@ -2322,6 +2323,7 @@ static int push_dl_task(struct rq *rq) get_task_struct(next_task); /* Will lock the rq it'll find */ + /* XXX connoro: update find_lock_later_rq() for split context? */ later_rq = find_lock_later_rq(next_task, rq); if (!later_rq) { struct task_struct *task; @@ -2431,7 +2433,7 @@ static void pull_dl_task(struct rq *this_rq) * deadline than the current task of its runqueue. */ if (dl_time_before(p->dl.deadline, - rq_curr(src_rq)->dl.deadline)) + rq_proxy(src_rq)->dl.deadline)) goto skip; if (is_migration_disabled(p)) { @@ -2470,9 +2472,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p) if (!task_on_cpu(rq, p) && !test_tsk_need_resched(rq_curr(rq)) && p->nr_cpus_allowed > 1 && - dl_task(rq_curr(rq)) && + dl_task(rq_proxy(rq)) && (rq_curr(rq)->nr_cpus_allowed < 2 || - !dl_entity_preempt(&p->dl, &rq_curr(rq)->dl))) { + !dl_entity_preempt(&p->dl, &rq_proxy(rq)->dl))) { push_dl_tasks(rq); } } @@ -2635,12 +2637,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p) return; } - if (rq_curr(rq) != p) { + if (rq_proxy(rq) != p) { #ifdef CONFIG_SMP if (p->nr_cpus_allowed > 1 && rq->dl.overloaded) deadline_queue_push_tasks(rq); #endif - if (dl_task(rq_curr(rq))) + if (dl_task(rq_proxy(rq))) check_preempt_curr_dl(rq, p, 0); else resched_curr(rq); @@ -2669,7 +2671,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p, if (!rq->dl.overloaded) deadline_queue_pull_task(rq); - if (task_current(rq, p)) { + if (task_current_proxy(rq, p)) { /* * If we now have a earlier deadline task than p, * then reschedule, provided p is still on this diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8b35dfc0c442..1471519b95c5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -919,7 +919,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr) */ s64 update_curr_common(struct rq *rq) { - struct task_struct *curr = rq_curr(rq); + struct task_struct *curr = rq_proxy(rq); s64 delta_exec; delta_exec = update_curr_se(rq, &curr->se); @@ -964,7 +964,7 @@ static void update_curr(struct cfs_rq *cfs_rq) static void update_curr_fair(struct rq *rq) { - update_curr(cfs_rq_of(&rq_curr(rq)->se)); + update_curr(cfs_rq_of(&rq_proxy(rq)->se)); } static inline void @@ -6133,7 +6133,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p) s64 delta = slice - ran; if (delta < 0) { - if (task_current(rq, p)) + if (task_current_proxy(rq, p)) resched_curr(rq); return; } @@ -6148,7 +6148,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p) */ static void 
hrtick_update(struct rq *rq) { - struct task_struct *curr = rq_curr(rq); + struct task_struct *curr = rq_proxy(rq); if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class) return; @@ -7788,7 +7788,7 @@ static void set_skip_buddy(struct sched_entity *se) */ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) { - struct task_struct *curr = rq_curr(rq); + struct task_struct *curr = rq_proxy(rq); struct sched_entity *se = &curr->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(curr); int scale = cfs_rq->nr_running >= sched_nr_latency; @@ -7822,7 +7822,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ * prevents us from potentially nominating it as a false LAST_BUDDY * below. */ - if (test_tsk_need_resched(curr)) + if (test_tsk_need_resched(rq_curr(rq))) return; /* Idle tasks are by definition preempted by non-idle tasks. */ @@ -8821,7 +8821,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done) * update_load_avg() can call cpufreq_update_util(). Make sure that RT, * DL and IRQ signals have been updated before updating CFS. */ - curr_class = rq_curr(rq)->sched_class; + curr_class = rq_proxy(rq)->sched_class; thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); @@ -11984,6 +11984,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) entity_tick(cfs_rq, se, queued); } + /* + * XXX need to use execution context (rq->curr) for task_tick_numa and + * update_misfit_status? + */ if (static_branch_unlikely(&sched_numa_balancing)) task_tick_numa(rq, curr); @@ -12047,7 +12051,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio) * our priority decreased, or if we are not currently running on * this runqueue and our priority is higher than the current's */ - if (task_current(rq, p)) { + if (task_current_proxy(rq, p)) { if (p->prio > oldprio) resched_curr(rq); } else @@ -12192,7 +12196,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p) * kick off the schedule if running, otherwise just see * if we can still preempt the current task. 
*/ - if (task_current(rq, p)) + if (task_current_proxy(rq, p)) resched_curr(rq); else check_preempt_curr(rq, p, 0); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 91c992230aba..03e5d8fa67aa 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -574,7 +574,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags) static void sched_rt_rq_enqueue(struct rt_rq *rt_rq) { - struct task_struct *curr = rq_curr(rq_of_rt_rq(rt_rq)); + struct task_struct *curr = rq_proxy(rq_of_rt_rq(rt_rq)); struct rq *rq = rq_of_rt_rq(rt_rq); struct sched_rt_entity *rt_se; @@ -1044,7 +1044,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq) */ static void update_curr_rt(struct rq *rq) { - struct task_struct *curr = rq_curr(rq); + struct task_struct *curr = rq_proxy(rq); struct sched_rt_entity *rt_se = &curr->rt; s64 delta_exec; @@ -1591,7 +1591,7 @@ static int find_lowest_rq(struct task_struct *task); static int select_task_rq_rt(struct task_struct *p, int cpu, int flags) { - struct task_struct *curr; + struct task_struct *curr, *proxy; struct rq *rq; bool test; @@ -1603,6 +1603,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags) rcu_read_lock(); curr = rq_curr_unlocked(rq); /* XXX jstultz: using rcu_dereference intead of READ_ONCE */ + proxy = rq_proxy_unlocked(rq); /* * If the current task on @p's runqueue is an RT task, then @@ -1631,8 +1632,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags) * systems like big.LITTLE. */ test = curr && - unlikely(rt_task(curr)) && - (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio); + unlikely(rt_task(proxy)) && + (curr->nr_cpus_allowed < 2 || proxy->prio <= p->prio); if (test || !rt_task_fits_capacity(p, cpu)) { int target = find_lowest_rq(p); @@ -1662,12 +1663,12 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags) static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) { - /* - * Current can't be migrated, useless to reschedule, - * let's hope p can move out. + /* XXX connoro: need to revise cpupri_find() to reflect the split + * context since it should look at rq->proxy for priority but rq->curr + * for affinity. */ if (rq_curr(rq)->nr_cpus_allowed == 1 || - !cpupri_find(&rq->rd->cpupri, rq_curr(rq), NULL)) + !cpupri_find(&rq->rd->cpupri, rq_proxy(rq), NULL)) return; /* @@ -1710,7 +1711,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf) */ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags) { - if (p->prio < rq_curr(rq)->prio) { + struct task_struct *curr = rq_proxy(rq); + + if (p->prio < curr->prio) { resched_curr(rq); return; } @@ -1728,7 +1731,7 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag * to move current somewhere else, making room for our non-migratable * task. */ - if (p->prio == rq_curr(rq)->prio && !test_tsk_need_resched(rq_curr(rq))) + if (p->prio == curr->prio && !test_tsk_need_resched(rq_curr(rq))) check_preempt_equal_prio(rq, p); #endif } @@ -1753,7 +1756,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f * utilization. 
We only care of the case where we start to schedule a * rt task */ - if (rq_curr(rq)->sched_class != &rt_sched_class) + if (rq_proxy(rq)->sched_class != &rt_sched_class) update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0); rt_queue_push_tasks(rq); @@ -2029,7 +2032,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq) struct task_struct, pushable_tasks); BUG_ON(rq->cpu != task_cpu(p)); - BUG_ON(task_current(rq, p)); + BUG_ON(task_current(rq, p) || task_current_proxy(rq, p)); BUG_ON(p->nr_cpus_allowed <= 1); BUG_ON(!task_on_rq_queued(p)); @@ -2062,7 +2065,7 @@ static int push_rt_task(struct rq *rq, bool pull) * higher priority than current. If that's the case * just reschedule current. */ - if (unlikely(next_task->prio < rq_curr(rq)->prio)) { + if (unlikely(next_task->prio < rq_proxy(rq)->prio)) { resched_curr(rq); return 0; } @@ -2083,6 +2086,16 @@ static int push_rt_task(struct rq *rq, bool pull) * Note that the stoppers are masqueraded as SCHED_FIFO * (cf. sched_set_stop_task()), so we can't rely on rt_task(). */ + /* + * XXX connoro: seems like what we actually want here might be: + * 1) Enforce that rq->proxy must be RT + * 2) Revise find_lowest_rq() to handle split context, searching + * for an rq that can accommodate rq->proxy's prio and + * rq->curr's affinity + * 3) Send the whole chain to the new rq in push_cpu_stop()? + * If #3 is needed, might be best to make a separate patch with + * all the "chain-level load balancing" changes. + */ if (rq_curr(rq)->sched_class != &rt_sched_class) return 0; @@ -2114,6 +2127,12 @@ static int push_rt_task(struct rq *rq, bool pull) get_task_struct(next_task); /* find_lock_lowest_rq locks the rq if found */ + /* + * XXX connoro: find_lock_lowest_rq() likely also needs split context + * support. This also needs to include something like an exec_ctx=NULL + * case for when we push a blocked task whose lock owner is not on + * this rq. + */ lowest_rq = find_lock_lowest_rq(next_task, rq); if (!lowest_rq) { struct task_struct *task; @@ -2415,7 +2434,7 @@ static void pull_rt_task(struct rq *this_rq) * p if it is lower in priority than the * current task on the run queue */ - if (p->prio < rq_curr(src_rq)->prio) + if (p->prio < rq_proxy(src_rq)->prio) goto skip; if (is_migration_disabled(p)) { @@ -2457,9 +2476,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p) bool need_to_push = !task_on_cpu(rq, p) && !test_tsk_need_resched(rq_curr(rq)) && p->nr_cpus_allowed > 1 && - (dl_task(rq_curr(rq)) || rt_task(rq_curr(rq))) && + (dl_task(rq_proxy(rq)) || rt_task(rq_proxy(rq))) && (rq_curr(rq)->nr_cpus_allowed < 2 || - rq_curr(rq)->prio <= p->prio); + rq_proxy(rq)->prio <= p->prio); if (need_to_push) push_rt_tasks(rq); @@ -2543,7 +2562,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p) if (p->nr_cpus_allowed > 1 && rq->rt.overloaded) rt_queue_push_tasks(rq); #endif /* CONFIG_SMP */ - if (p->prio < rq_curr(rq)->prio && cpu_online(cpu_of(rq))) + if (p->prio < rq_proxy(rq)->prio && cpu_online(cpu_of(rq))) resched_curr(rq); } } @@ -2558,7 +2577,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio) if (!task_on_rq_queued(p)) return; - if (task_current(rq, p)) { + if (task_current_proxy(rq, p)) { #ifdef CONFIG_SMP /* * If our priority decreases while running, we @@ -2584,7 +2603,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio) * greater than the current running task * then reschedule. 
*/ - if (p->prio < rq_curr(rq)->prio) + if (p->prio < rq_proxy(rq)->prio) resched_curr(rq); } } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 00f73ca4c618..84e49c2530b0 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1008,7 +1008,17 @@ struct rq { */ unsigned int nr_uninterruptible; - struct task_struct __rcu *curr_exec; + /* + * XXX jstultz: Personally, I feel like these should be switched. + * Where rq_curr() returns the selected scheduler context, and + * but rq_proxy() returns the execution context we're going to + * run on behalf of the selected task. But for now keeping it + * the same. + */ + struct task_struct __rcu *curr_exec; /* Execution context */ +#ifdef CONFIG_PROXY_EXEC + struct task_struct __rcu *curr_sched; /* Scheduling context (policy) */ +#endif struct task_struct *idle; struct task_struct *stop; unsigned long next_balance; @@ -1226,6 +1236,38 @@ static inline void rq_set_curr_rcu_init(struct rq *rq, struct task_struct *task) RCU_INIT_POINTER(rq->curr_exec, task); } +#ifdef CONFIG_PROXY_EXEC +static inline struct task_struct *rq_proxy(struct rq *rq) +{ + return rq->curr_sched; +} + +static inline struct task_struct *rq_proxy_unlocked(struct rq *rq) +{ + return rcu_dereference(rq->curr_sched); +} + +static inline void rq_set_proxy(struct rq *rq, struct task_struct *t) +{ + rcu_assign_pointer(rq->curr_sched, t); +} +#else +static inline struct task_struct *rq_proxy(struct rq *rq) +{ + return rq->curr_exec; +} + +static inline struct task_struct *rq_proxy_unlocked(struct rq *rq) +{ + return rcu_dereference(rq->curr_exec); +} + +static inline void rq_set_proxy(struct rq *rq, struct task_struct *t) +{ + /* Do nothing */ +} +#endif + #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() this_cpu_ptr(&runqueues) #define task_rq(p) cpu_rq(task_cpu(p)) @@ -2098,11 +2140,25 @@ static inline u64 global_rt_runtime(void) return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC; } +/* + * Is p the current execution context? + */ static inline int task_current(struct rq *rq, struct task_struct *p) { return rq_curr(rq) == p; } +/* + * Is p the current scheduling context? + * + * Note that it might be the current execution context at the same time if + * rq->curr == rq->proxy == p. + */ +static inline int task_current_proxy(struct rq *rq, struct task_struct *p) +{ + return rq_proxy(rq) == p; +} + static inline int task_on_cpu(struct rq *rq, struct task_struct *p) { #ifdef CONFIG_SMP @@ -2260,7 +2316,7 @@ struct sched_class { static inline void put_prev_task(struct rq *rq, struct task_struct *prev) { - WARN_ON_ONCE(rq_curr(rq) != prev); + WARN_ON_ONCE(rq_proxy(rq) != prev); prev->sched_class->put_prev_task(rq, prev); } @@ -2341,6 +2397,15 @@ extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_conte static inline struct task_struct *get_push_task(struct rq *rq) { + /* + * XXX connoro: should this be rq->proxy? When rq->curr != rq->proxy, + * pushing rq->curr alone means it stops inheriting. Perhaps returning + * rq->proxy and pushing the entire chain would be correct? OTOH if we + * are guaranteed that rq->proxy is the highest prio task on the rq when + * get_push_task() is called, then proxy() will migrate the rest of the + * chain during the __schedule() call immediately after rq->curr is + * pushed. 
+ */ struct task_struct *p = rq_curr(rq); lockdep_assert_rq_held(rq);
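(Illustration only; not part of the posted series.) The sched.h hunk above splits the runqueue's notion of "who is running" into an execution context (rq->curr_exec, returned by rq_curr()) and a scheduling context (rq->curr_sched, returned by rq_proxy()). The standalone sketch below models that split in plain userspace C so the distinction is easier to see; only the accessor and field names follow the patch, the struct layout and sample tasks are invented.

/* Illustrative model only: mirrors the rq_curr()/rq_proxy() split above. */
#include <stdio.h>

#define CONFIG_PROXY_EXEC 1

struct task_struct { const char *comm; int prio; };

struct rq {
	struct task_struct *curr_exec;   /* execution context: whose code runs */
#ifdef CONFIG_PROXY_EXEC
	struct task_struct *curr_sched;  /* scheduling context: whose policy/prio is charged */
#endif
};

static struct task_struct *rq_curr(struct rq *rq)  { return rq->curr_exec; }

#ifdef CONFIG_PROXY_EXEC
static struct task_struct *rq_proxy(struct rq *rq) { return rq->curr_sched; }
#else
static struct task_struct *rq_proxy(struct rq *rq) { return rq->curr_exec; }
#endif

int main(void)
{
	struct task_struct owner  = { "mutex-owner", 120 };
	struct task_struct waiter = { "rt-waiter",    10 };
	/* Proxy execution: run the owner's code, account it to the waiter. */
	struct rq rq = { .curr_exec = &owner, .curr_sched = &waiter };

	printf("executing: %s, scheduled as: %s\n",
	       rq_curr(&rq)->comm, rq_proxy(&rq)->comm);
	return 0;
}

When CONFIG_PROXY_EXEC is not set, rq_proxy() simply falls back to the execution context, matching the #else stubs in the patch above.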
From patchwork Mon Mar 20 23:37:17 2023
Date: Mon, 20 Mar 2023 23:37:17 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-10-jstultz@google.com>
Subject: [PATCH v2 09/12] sched: Unnest ttwu_runnable in prep for proxy-execution
From: John Stultz
To: LKML
Cc: John Stultz , Joel Fernandes ,
Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E . McKenney" , kernel-team@android.com X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1760932087988144676?= X-GMAIL-MSGID: =?utf-8?q?1760932087988144676?= Slightly rework ttwu_runnable to minimize the nesting to help make the proxy-execution changes easier to read. Should be no logical change here. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . McKenney" Cc: kernel-team@android.com Signed-off-by: John Stultz --- kernel/sched/core.c | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ab6b1917bc70..82f5b29ae675 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3773,18 +3773,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) int ret = 0; rq = __task_rq_lock(p, &rf); - if (task_on_rq_queued(p)) { - if (!task_on_cpu(rq, p)) { - /* - * When on_rq && !on_cpu the task is preempted, see if - * it should preempt the task that is current now. - */ - update_rq_clock(rq); - check_preempt_curr(rq, p, wake_flags); - } - ttwu_do_wakeup(p); - ret = 1; + if (!task_on_rq_queued(p)) + goto out_unlock; + + if (!task_on_cpu(rq, p)) { + /* + * When on_rq && !on_cpu the task is preempted, see if + * it should preempt the task that is current now. 
+ */ update_rq_clock(rq); check_preempt_curr(rq, p, wake_flags); } + ttwu_do_wakeup(p); + ret = 1; +out_unlock: __task_rq_unlock(rq, &rf); return ret;
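(Illustration only; not part of the posted series.) The rework above replaces the nested if with an early goto out_unlock, so later patches can add further bail-out conditions to ttwu_runnable() without re-indenting the whole body. A compressed, standalone model of that shape, with invented names:

/* Standalone sketch of the "unnested" early-bail-out shape; names are invented. */
#include <stdbool.h>
#include <stdio.h>

struct task { bool queued; bool on_cpu; };

static int try_wake_queued(struct task *t)
{
	int ret = 0;

	/* lock(t) would go here in the real code */
	if (!t->queued)
		goto out_unlock;      /* early bail-out keeps the rest flat */

	if (!t->on_cpu)
		printf("preempted task: maybe preempt current\n");

	/* further conditions (e.g. proxy checks) slot in here without re-indenting */
	ret = 1;
out_unlock:
	/* unlock(t) */
	return ret;
}

int main(void)
{
	struct task t = { .queued = true, .on_cpu = false };
	return !try_wake_queued(&t);
}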
From patchwork Mon Mar 20 23:37:18 2023
Date: Mon, 20 Mar 2023 23:37:18 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-11-jstultz@google.com>
Subject: [PATCH v2 10/12] sched: Add proxy execution
From: John Stultz
To: LKML
Cc: Peter Zijlstra , Joel Fernandes , Qais Yousef , Ingo Molnar , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E . McKenney" , kernel-team@android.com, Valentin Schneider , "Connor O'Brien" , John Stultz

From: Peter Zijlstra

A task currently holding a mutex (executing a critical section) might benefit from using the scheduling contexts of other tasks blocked on the same mutex, if those tasks happen to have a higher priority than the current owner (e.g., to prevent priority inversions).

Proxy execution lets a task do exactly that: if a mutex owner has waiters, it can use the waiters' scheduling context to potentially continue running if preempted.

The basic mechanism is implemented by this patch, the core of which resides in the proxy() function. Potential proxies (i.e., tasks blocked on a mutex) are not dequeued, so if one of them is actually selected by schedule() as the next task to be put to run on a CPU, proxy() is used to walk the blocked_on relation and find which task (the mutex owner) might be able to use the proxy's scheduling context.

Here come the tricky bits. The owner task might be in all sorts of states when a proxy is found (blocked, executing on a different CPU, etc.). Details on how the different situations are handled are to be found in the proxy() code comments.
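(Illustration only; not part of the patch.) To make the blocked_on walk described above more concrete, the standalone sketch below models just the happy path: every mutex owner in the chain is on the same CPU and the last one is runnable, so the walk simply follows blocked_on through each mutex owner and records the proxied-by (blocked_proxy) back-pointers along the way. All of the locking, migration, and sleeping-owner cases that the real proxy() handles are deliberately left out, and everything except the blocked_on/blocked_proxy field names is invented.

/* Simplified, single-CPU model of the blocked_on walk; illustrative only. */
#include <stddef.h>
#include <stdio.h>

struct mutex;

struct task_struct {
	const char *comm;
	struct mutex *blocked_on;          /* lock we're blocked on, NULL if runnable */
	struct task_struct *blocked_proxy; /* task that is boosting us */
};

struct mutex {
	struct task_struct *owner;
};

/*
 * schedule() picked 'next' (highest priority, possibly blocked).  Walk the
 * blocked_on relation to find the execution context: the first owner in the
 * chain that is not itself blocked.  Record the proxied-by back-pointer so
 * the unlock path knows whom to hand the mutex to.
 */
static struct task_struct *proxy_walk(struct task_struct *next)
{
	struct task_struct *p = next;

	while (p->blocked_on) {
		struct task_struct *owner = p->blocked_on->owner;

		owner->blocked_proxy = p;  /* proxied-by relation */
		p = owner;
	}
	return p;                          /* execution context to actually run */
}

int main(void)
{
	struct task_struct c = { .comm = "C(owner)" };
	struct mutex mx = { .owner = &c };
	struct task_struct b = { .comm = "B", .blocked_on = &mx };
	struct mutex my = { .owner = &b };
	struct task_struct a = { .comm = "A(selected)", .blocked_on = &my };

	printf("run %s on behalf of %s\n", proxy_walk(&a)->comm, a.comm);
	return 0;
}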
Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E .
McKenney" Cc: kernel-team@android.com Signed-off-by: Peter Zijlstra (Intel) [rebased, added comments and changelog] Signed-off-by: Juri Lelli [Fixed rebase conflicts] [squashed sched: Ensure blocked_on is always guarded by blocked_lock] Signed-off-by: Valentin Schneider [fix rebase conflicts, various fixes & tweaks commented inline] [squashed sched: Use rq->curr vs rq->proxy checks] Signed-off-by: Connor O'Brien [jstultz: Rebased, split up, and folded in changes from Juri Lelli and Connor O'Brian, added additional locking on get_task_blocked_on(next) logic, pretty major rework to better conditionalize logic on CONFIG_PROXY_EXEC and split up the very large proxy() function - hopefully without changes to logic / behavior] Signed-off-by: John Stultz --- v2: * Numerous changes folded in * Split out some of the logic into separate patches * Break up the proxy() function so its a bit easier to read and is better conditionalized on CONFIG_PROXY_EXEC TODO: Finish conditionalization edge cases --- include/linux/sched.h | 1 + init/Kconfig | 7 + kernel/Kconfig.locks | 2 +- kernel/fork.c | 1 + kernel/locking/mutex.c | 58 +++- kernel/sched/core.c | 652 +++++++++++++++++++++++++++++++++++++++- kernel/sched/deadline.c | 2 +- kernel/sched/fair.c | 13 +- kernel/sched/rt.c | 3 +- kernel/sched/sched.h | 21 +- 10 files changed, 744 insertions(+), 16 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index fc75fcc696db..b88303ceacaf 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1141,6 +1141,7 @@ struct task_struct { struct task_struct *blocked_proxy; /* task that is boosting us */ struct mutex *blocked_on; /* lock we're blocked on */ + struct list_head blocked_entry; /* tasks blocked on us */ raw_spinlock_t blocked_lock; #ifdef CONFIG_DEBUG_ATOMIC_SLEEP diff --git a/init/Kconfig b/init/Kconfig index 1fb5f313d18f..38cdd2ccc538 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -935,6 +935,13 @@ config NUMA_BALANCING_DEFAULT_ENABLED If set, automatic NUMA balancing will be enabled if running on a NUMA machine. +config PROXY_EXEC + bool "Proxy Execution" + default n + help + This option enables proxy execution, a mechanism for mutex owning + tasks to inherit the scheduling context of higher priority waiters. 
+ menuconfig CGROUPS bool "Control Group support" select KERNFS diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index 4198f0273ecd..791c98f1d329 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -226,7 +226,7 @@ config ARCH_SUPPORTS_ATOMIC_RMW config MUTEX_SPIN_ON_OWNER def_bool y - depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW + depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW && !PROXY_EXEC config RWSEM_SPIN_ON_OWNER def_bool y diff --git a/kernel/fork.c b/kernel/fork.c index 95410333332f..a56c59c7a4cf 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2226,6 +2226,7 @@ static __latent_entropy struct task_struct *copy_process( p->blocked_proxy = NULL; /* nobody is boosting us yet */ p->blocked_on = NULL; /* not blocked yet */ + INIT_LIST_HEAD(&p->blocked_entry); #ifdef CONFIG_BCACHE p->sequential_io = 0; diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index ead4213232cc..370687cf79f8 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -939,10 +939,21 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne { struct task_struct *next = NULL; DEFINE_WAKE_Q(wake_q); - unsigned long owner; + /* + * XXX [juril] Proxy Exec forces always an HANDOFF (so that owner is + * never empty when there are waiters waiting?). Should we make this + * conditional on having proxy exec configured in? + */ + unsigned long owner = MUTEX_FLAG_HANDOFF; mutex_release(&lock->dep_map, ip); + /* + * XXX must always handoff the mutex to avoid !owner in proxy(). + * scheduler delay is minimal since we hand off to the task that + * is to be scheduled next. + */ +#ifndef CONFIG_PROXY_EXEC /* * Release the lock before (potentially) taking the spinlock such that * other contenders can get on with things ASAP. @@ -965,10 +976,48 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne return; } } +#endif raw_spin_lock(&lock->wait_lock); debug_mutex_unlock(lock); - if (!list_empty(&lock->wait_list)) { + +#ifdef CONFIG_PROXY_EXEC + raw_spin_lock(¤t->blocked_lock); + /* + * If we have a task boosting us, and that task was boosting us through + * this lock, hand the lock to that task, as that is the highest + * waiter, as selected by the scheduling function. + */ + next = current->blocked_proxy; + if (next) { + struct mutex *next_lock; + + /* + * jstultz: get_task_blocked_on(next) seemed to be missing locking + * so I've added it here (which required nesting the locks). + */ + raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING); + next_lock = get_task_blocked_on(next); + raw_spin_unlock(&next->blocked_lock); + if (next_lock != lock) { + next = NULL; + } else { + wake_q_add(&wake_q, next); + current->blocked_proxy = NULL; + } + } + + /* + * XXX if there was no higher prio proxy, ->blocked_task will not have + * been set. Therefore lower prio contending tasks are serviced in + * FIFO order. + */ +#endif + + /* + * Failing that, pick any on the wait list. + */ + if (!next && !list_empty(&lock->wait_list)) { /* get the first entry from the wait-list: */ struct mutex_waiter *waiter = list_first_entry(&lock->wait_list, @@ -983,7 +1032,10 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne if (owner & MUTEX_FLAG_HANDOFF) __mutex_handoff(lock, next); - preempt_disable(); + preempt_disable(); /* XXX connoro: why disable preemption here? 
*/ +#ifdef CONFIG_PROXY_EXEC + raw_spin_unlock(¤t->blocked_lock); +#endif raw_spin_unlock(&lock->wait_lock); wake_up_q(&wake_q); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 82f5b29ae675..d0f86670bdf8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -505,6 +505,8 @@ sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { } * * task_cpu(p): is changed by set_task_cpu(), the rules are: * + * XXX connoro: does it matter that ttwu_do_activate now calls __set_task_cpu + * on blocked tasks? * - Don't call set_task_cpu() on a blocked task: * * We don't care what CPU we're not running on, this simplifies hotplug, @@ -2774,8 +2776,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag struct set_affinity_pending my_pending = { }, *pending = NULL; bool stop_pending, complete = false; - /* Can the task run on the task's current CPU? If so, we're done */ - if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) { + /* + * Can the task run on the task's current CPU? If so, we're done + * + * We are also done if the task is currently acting as proxy (and + * potentially has been migrated outside its current or previous + * affinity mask) + */ + if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) || + (task_current_proxy(rq, p) && !task_current(rq, p))) { struct task_struct *push_task = NULL; if ((flags & SCA_MIGRATE_ENABLE) && @@ -3687,6 +3696,72 @@ static inline void ttwu_do_wakeup(struct task_struct *p) trace_sched_wakeup(p); } +#ifdef CONFIG_PROXY_EXEC +static void activate_task_and_blocked_ent(struct rq *rq, struct task_struct *p, int en_flags) +{ + /* + * XXX connoro: By calling activate_task with blocked_lock held, we order against + * the proxy() blocked_task case such that no more blocked tasks will + * be enqueued on p once we release p->blocked_lock. + */ + raw_spin_lock(&p->blocked_lock); + /* + * XXX connoro: do we need to check p->on_rq here like we do for pp below? + * or does holding p->pi_lock ensure nobody else activates p first? + */ + activate_task(rq, p, en_flags); + raw_spin_unlock(&p->blocked_lock); + + /* + * A whole bunch of 'proxy' tasks back this blocked task, wake + * them all up to give this task its 'fair' share. + */ + while (!list_empty(&p->blocked_entry)) { + struct task_struct *pp = + list_first_entry(&p->blocked_entry, + struct task_struct, + blocked_entry); + /* + * XXX connoro: proxy blocked_task case might be enqueuing more blocked tasks + * on pp. If those continue past when we delete pp from the list, we'll get an + * active with a non-empty blocked_entry list, which is no good. Locking + * pp->blocked_lock ensures either the blocked_task path gets the lock first and + * enqueues everything before we ever get the lock, or we get the lock first, the + * other path sees pp->on_rq != 0 and enqueues nothing. + */ + raw_spin_lock(&pp->blocked_lock); + BUG_ON(pp->blocked_entry.prev != &p->blocked_entry); + + list_del_init(&pp->blocked_entry); + if (READ_ONCE(pp->on_rq)) { + /* + * XXX connoro: We raced with a non mutex handoff activation of pp. That + * activation will also take care of activating all of the tasks after pp in + * the blocked_entry list, so we're done here. + */ + raw_spin_unlock(&pp->blocked_lock); + break; + } + /* XXX can't call set_task_cpu() because we are not holding + * neither pp->pi_lock nor task's rq lock. This should however + * be fine as this task can't be woken up as it is blocked on + * this mutex atm. 
+ * A problem however might be that __set_task_cpu() calls + * set_task_rq() which deals with groups and such... + */ + __set_task_cpu(pp, cpu_of(rq)); + activate_task(rq, pp, en_flags); + resched_curr(rq); + raw_spin_unlock(&pp->blocked_lock); + } +} +#else +static inline void activate_task_and_blocked_ent(struct rq *rq, struct task_struct *p, int en_flags) +{ + activate_task(rq, p, en_flags); +} +#endif + static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, struct rq_flags *rf) @@ -3708,7 +3783,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, atomic_dec(&task_rq(p)->nr_iowait); } - activate_task(rq, p, en_flags); + activate_task_and_blocked_ent(rq, p, en_flags); + check_preempt_curr(rq, p, wake_flags); ttwu_do_wakeup(p); @@ -3741,6 +3817,95 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, #endif } +#ifdef CONFIG_PROXY_EXEC +/* XXX jstultz: This needs a better name! */ +bool ttwu_proxy_skip_wakeup(struct rq *rq, struct task_struct *p) +{ + /* + * XXX connoro: wrap this case with #ifdef CONFIG_PROXY_EXEC? + */ + if (task_current(rq, p)) { + bool ret = true; + /* + * XXX connoro: p is currently running. 3 cases are possible: + * 1) p is blocked on a lock it owns, but we got the rq lock before + * it could schedule out. Kill blocked_on relation and call + * ttwu_do_wakeup + * 2) p is blocked on a lock it does not own. Leave blocked_on + * unchanged, don't call ttwu_do_wakeup, and return 0. + * 3) p is unblocked, but unless we hold onto blocked_lock while + * calling ttwu_do_wakeup, we could race with it becoming + * blocked and overwrite the correct p->__state with TASK_RUNNING. + */ + raw_spin_lock(&p->blocked_lock); + if (task_is_blocked(p) && mutex_owner(p->blocked_on) == p) + set_task_blocked_on(p, NULL); + if (!task_is_blocked(p)) + ret = false; + raw_spin_unlock(&p->blocked_lock); + return ret; + } + + /* + * Since we don't dequeue for blocked-on relations, we'll always + * trigger the on_rq_queued() clause for them. + */ + if (task_is_blocked(p)) { + raw_spin_lock(&p->blocked_lock); + + if (mutex_owner(p->blocked_on) != p) { + /* + * XXX connoro: p already woke, ran and blocked on + * another mutex. Since a successful wakeup already + * happened, we're done. + */ + raw_spin_unlock(&p->blocked_lock); + return true; + } + + set_task_blocked_on(p, NULL); + if (!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr)) { + /* + * proxy stuff moved us outside of the affinity mask + * 'sleep' now and fail the direct wakeup so that the + * normal wakeup path will fix things. + */ + deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); + if (task_current_proxy(rq, p)) { + /* + * XXX connoro: If p is the proxy, then remove lingering + * references to it from rq and sched_class structs after + * dequeueing. + * can we get here while rq is inside __schedule? + * do any assumptions break if so? + */ + put_prev_task(rq, p); + rq_set_proxy(rq, rq->idle); + } + resched_curr(rq); + raw_spin_unlock(&p->blocked_lock); + return true; + } + /* connoro: perhaps deq/enq here to get our task into the pushable task + * list again now that it's unblocked? Does that break if we're the proxy or + * does holding the rq lock make that OK? + */ + /* + * Must resched after killing a blocked_on relation. The currently + * executing context might not be the most elegible anymore. 
+ */ + resched_curr(rq); + raw_spin_unlock(&p->blocked_lock); + } + return false; +} +#else +static inline bool ttwu_proxy_skip_wakeup(struct rq *rq, struct task_struct *p) +{ + return false; +} +#endif + /* * Consider @p being inside a wait loop: * @@ -3773,9 +3938,15 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) int ret = 0; rq = __task_rq_lock(p, &rf); - if (!task_on_rq_queued(p)) + if (!task_on_rq_queued(p)) { + BUG_ON(task_is_running(p)); goto out_unlock; + } + /* + * ttwu_do_wakeup()-> + * check_preempt_curr() may use rq clock + */ if (!task_on_cpu(rq, p)) { /* * When on_rq && !on_cpu the task is preempted, see if @@ -3784,8 +3955,14 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) update_rq_clock(rq); check_preempt_curr(rq, p, wake_flags); } + + /* XXX jstultz: This needs a better name! */ + if (ttwu_proxy_skip_wakeup(rq, p)) + goto out_unlock; + ttwu_do_wakeup(p); ret = 1; + out_unlock: __task_rq_unlock(rq, &rf); @@ -4193,6 +4370,23 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags)) goto unlock; + if (task_is_blocked(p)) { + /* + * XXX connoro: we are in one of 2 cases: + * 1) p is blocked on a mutex it doesn't own but is still + * enqueued on a rq. We definitely don't want to keep going + * (and potentially activate it elsewhere without ever + * dequeueing) but maybe this is more properly handled by + * having ttwu_runnable() return 1 in this case? + * 2) p was removed from its rq and added to a blocked_entry + * list by proxy(). It should not be woken until the task at + * the head of the list gets a mutex handoff wakeup. + * Should try_to_wake_up() return 1 in either of these cases? + */ + success = 0; + goto unlock; + } + #ifdef CONFIG_SMP /* * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be @@ -5581,6 +5775,18 @@ void scheduler_tick(void) rq_lock(rq, &rf); +#ifdef CONFIG_PROXY_EXEC + /* + * XXX connoro: is this check needed? Why? + */ + if (task_cpu(curr) != cpu) { + BUG_ON(!test_preempt_need_resched() && + !tif_need_resched()); + rq_unlock(rq, &rf); + return; + } +#endif + update_rq_clock(rq); thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure); @@ -6473,6 +6679,397 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) # define SM_MASK_PREEMPT SM_PREEMPT #endif +#ifdef CONFIG_PROXY_EXEC + +static struct task_struct * +proxy_migrate_task(struct rq *rq, struct task_struct *next, + struct rq_flags *rf, struct task_struct *p, + int that_cpu, bool curr_in_chain) +{ + struct rq *that_rq; + LIST_HEAD(migrate_list); + + /* + * The blocked-on relation must not cross CPUs, if this happens + * migrate @p to the @owner's CPU. + * + * This is because we must respect the CPU affinity of execution + * contexts (@owner) but we can ignore affinity for scheduling + * contexts (@p). So we have to move scheduling contexts towards + * potential execution contexts. + * + * XXX [juril] what if @p is not the highest prio task once migrated + * to @owner's CPU? + * + * XXX [juril] also, after @p is migrated it is not migrated back once + * @owner releases the lock? Isn't this a potential problem w.r.t. + * @owner affinity settings? + * [juril] OK. It is migrated back into its affinity mask in + * ttwu_remote(), or by using wake_cpu via select_task_rq, guess we + * might want to add a comment about that here. 
:-) + * + * TODO: could optimize by finding the CPU of the final owner + * and migrating things there. Given: + * + * CPU0 CPU1 CPU2 + * + * a ----> b ----> c + * + * the current scheme would result in migrating 'a' to CPU1, + * then CPU1 would migrate b and a to CPU2. Only then would + * CPU2 run c. + */ + that_rq = cpu_rq(that_cpu); + + /* + * @owner can disappear, simply migrate to @that_cpu and leave that CPU + * to sort things out. + */ + + /* + * Since we're going to drop @rq, we have to put(@next) first, + * otherwise we have a reference that no longer belongs to us. Use + * @fake_task to fill the void and make the next pick_next_task() + * invocation happy. + * + * XXX double, triple think about this. + * XXX put doesn't work with ON_RQ_MIGRATE + * + * CPU0 CPU1 + * + * B mutex_lock(X) + * + * A mutex_lock(X) <- B + * A __schedule() + * A pick->A + * A proxy->B + * A migrate A to CPU1 + * B mutex_unlock(X) -> A + * B __schedule() + * B pick->A + * B switch_to (A) + * A ... does stuff + * A ... is still running here + * + * * BOOM * + */ + put_prev_task(rq, next); + if (curr_in_chain) { + rq_set_proxy(rq, rq->idle); + set_tsk_need_resched(rq->idle); + /* + * XXX [juril] don't we still need to migrate @next to + * @owner's CPU? + */ + return rq->idle; + } + rq_set_proxy(rq, rq->idle); + + for (; p; p = p->blocked_proxy) { + int wake_cpu = p->wake_cpu; + + WARN_ON(p == rq_curr(rq)); + + deactivate_task(rq, p, 0); + set_task_cpu(p, that_cpu); + /* + * We can abuse blocked_entry to migrate the thing, because @p is + * still on the rq. + */ + list_add(&p->blocked_entry, &migrate_list); + + /* + * Preserve p->wake_cpu, such that we can tell where it + * used to run later. + */ + p->wake_cpu = wake_cpu; + } + + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + raw_spin_rq_lock(that_rq); + + while (!list_empty(&migrate_list)) { + p = list_first_entry(&migrate_list, struct task_struct, blocked_entry); + list_del_init(&p->blocked_entry); + + enqueue_task(that_rq, p, 0); + check_preempt_curr(that_rq, p, 0); + p->on_rq = TASK_ON_RQ_QUEUED; + /* + * check_preempt_curr has already called + * resched_curr(that_rq) in case it is + * needed. + */ + } + + raw_spin_rq_unlock(that_rq); + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); + + return NULL; /* Retry task selection on _this_ CPU. */ +} + +static inline struct task_struct * +proxy_resched_idle(struct rq *rq, struct task_struct *next) +{ + put_prev_task(rq, next); + rq_set_proxy(rq, rq->idle); + set_tsk_need_resched(rq->idle); + return rq->idle; +} + +static void proxy_enqueue_on_owner(struct rq *rq, struct task_struct *p, + struct task_struct *owner, + struct task_struct *next) +{ + /* + * Walk back up the blocked_proxy relation and enqueue them all on @owner + * + * ttwu_activate() will pick them up and place them on whatever rq + * @owner will run next. + * XXX connoro: originally we would jump back into the main proxy() loop + * owner->on_rq !=0 path, but if we then end up taking the owned_task path + * then we can overwrite p->on_rq after ttwu_do_activate sets it to 1 which breaks + * the assumptions made in ttwu_do_activate. + * + * Perhaps revisit whether retry is now possible given the changes to the + * owned_task path since I wrote the prior comment... + */ + if (!owner->on_rq) { + /* jstultz: Err, do we need to hold a lock on p? 
(we gave it up for owner) */ + for (; p; p = p->blocked_proxy) { + if (p == owner) + continue; + BUG_ON(!p->on_rq); + deactivate_task(rq, p, DEQUEUE_SLEEP); + if (task_current_proxy(rq, p)) { + put_prev_task(rq, next); + rq_set_proxy(rq, rq->idle); + } + /* + * XXX connoro: need to verify this is necessary. The rationale is that + * ttwu_do_activate must not have a chance to activate p elsewhere before + * it's fully extricated from its old rq. + */ + smp_mb(); + list_add(&p->blocked_entry, &owner->blocked_entry); + } + } +} + +/* + * Find who @next (currently blocked on a mutex) can proxy for. + * + * Follow the blocked-on relation: + * + * ,-> task + * | | blocked-on + * | v + * proxied-by | mutex + * | | owner + * | v + * `-- task + * + * and set the proxied-by relation, this latter is used by the mutex code + * to find which (blocked) task to hand-off to. + * + * Lock order: + * + * p->pi_lock + * rq->lock + * mutex->wait_lock + * p->blocked_lock + * + * Returns the task that is going to be used as execution context (the one + * that is actually going to be put to run on cpu_of(rq)). + */ +static struct task_struct * +proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) +{ + struct task_struct *p = next; + struct task_struct *owner = NULL; + bool curr_in_chain = false; + int this_cpu, that_cpu; + struct mutex *mutex; + + this_cpu = cpu_of(rq); + + /* + * Follow blocked_on chain. + * + * TODO: deadlock detection + */ + for (p = next; p->blocked_on; p = owner) { + mutex = p->blocked_on; + /* Something changed in the chain, pick_again */ + if (!mutex) + return NULL; + + /* + * By taking mutex->wait_lock we hold off concurrent mutex_unlock() + * and ensure @owner sticks around. + */ + raw_spin_lock(&mutex->wait_lock); + raw_spin_lock(&p->blocked_lock); + + /* Check again that p is blocked with blocked_lock held */ + if (!task_is_blocked(p) || mutex != p->blocked_on) { + /* + * Something changed in the blocked_on chain and + * we don't know if only at this level. So, let's + * just bail out completely and let __schedule + * figure things out (pick_again loop). + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return NULL; + } + + if (task_current(rq, p)) + curr_in_chain = true; + + owner = mutex_owner(mutex); + if (task_cpu(owner) != this_cpu) { + that_cpu = task_cpu(owner); + /* + * @owner can disappear, simply migrate to @that_cpu and leave that CPU + * to sort things out. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + + return proxy_migrate_task(rq, next, rf, p, that_cpu, curr_in_chain); + } + + if (task_on_rq_migrating(owner)) { + /* + * XXX connoro: one of the chain of mutex owners is currently + * migrating to this CPU, but has not yet been enqueued because + * we are holding the rq lock. As a simple solution, just schedule + * rq->idle to give the migration a chance to complete. Much like + * the migrate_task case we should end up back in proxy(), this + * time hopefully with all relevant tasks already enqueued. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq, next); + } + + if (!owner->on_rq) { + /* + * XXX connoro: rq->curr must not be added to the blocked_entry list + * or else ttwu_do_activate could enqueue it elsewhere before it + * switches out here. The approach to avoiding this is the same as in + * the migrate_task case. 
+ */ + if (curr_in_chain) { + /* + * This is identical to the owned_task handling, probably should + * fold them together... + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq, next); + } + + /* + * If !@owner->on_rq, holding @rq->lock will not pin the task, + * so we cannot drop @mutex->wait_lock until we're sure its a blocked + * task on this rq. + * + * We use @owner->blocked_lock to serialize against ttwu_activate(). + * Either we see its new owner->on_rq or it will see our list_add(). + */ + if (owner != p) { + raw_spin_unlock(&p->blocked_lock); + raw_spin_lock(&owner->blocked_lock); + } + + proxy_enqueue_on_owner(rq, p, owner, next); + + if (task_current_proxy(rq, next)) { + put_prev_task(rq, next); + rq_set_proxy(rq, rq->idle); + } + raw_spin_unlock(&owner->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + + return NULL; /* retry task selection */ + } + + if (owner == p) { + /* + * Its possible we interleave with mutex_unlock like: + * + * lock(&rq->lock); + * proxy() + * mutex_unlock() + * lock(&wait_lock); + * next(owner) = current->blocked_proxy; + * unlock(&wait_lock); + * + * wake_up_q(); + * ... + * ttwu_runnable() + * __task_rq_lock() + * lock(&wait_lock); + * owner == p + * + * Which leaves us to finish the ttwu_runnable() and make it go. + * + * XXX is this happening in case of an HANDOFF to p? + * In any case, reading of the owner in __mutex_unlock_slowpath is + * done atomically outside wait_lock (only adding waiters to wake_q is + * done inside the critical section). + * Does this means we can get to proxy _w/o an owner_ if that was + * cleared before grabbing wait_lock? Do we account for this case? + * OK we actually do (see PROXY_EXEC ifdeffery in unlock function). + */ + + /* + * XXX connoro: prior versions would clear p->blocked_on here, but I think + * that can race with the handoff wakeup path. If a wakeup reaches the + * call to ttwu_runnable after this point and finds that p is enqueued + * and marked as unblocked, it will happily do a ttwu_do_wakeup() call + * with zero regard for whether the task's affinity actually allows + * running it on this CPU. + */ + + /* + * XXX connoro: previous versions would immediately run owner here if + * it's allowed to run on this CPU, but this creates potential races + * with the wakeup logic. Instead we can just take the blocked_task path + * when owner is already !on_rq, or else schedule rq->idle so that + * ttwu_runnable can get the rq lock and mark owner as running. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq, next); + } + + /* + * OK, now we're absolutely sure @owner is not blocked _and_ + * on this rq, therefore holding @rq->lock is sufficient to + * guarantee its existence, as per ttwu_remote(). + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + + owner->blocked_proxy = p; + } + + WARN_ON_ONCE(!owner->on_rq); + return owner; +} +#else /* PROXY_EXEC */ +static struct task_struct * +proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) +{ + return next; +} +#endif /* PROXY_EXEC */ + /* * __schedule() is the main scheduler function. 
* @@ -6520,6 +7117,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) struct rq_flags rf; struct rq *rq; int cpu; + bool preserve_need_resched = false; cpu = smp_processor_id(); rq = cpu_rq(cpu); @@ -6565,7 +7163,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) { if (signal_pending_state(prev_state, prev)) { WRITE_ONCE(prev->__state, TASK_RUNNING); - } else { + } else if (!task_is_blocked(prev)) { prev->sched_contributes_to_load = (prev_state & TASK_UNINTERRUPTIBLE) && !(prev_state & TASK_NOLOAD) && @@ -6591,13 +7189,49 @@ static void __sched notrace __schedule(unsigned int sched_mode) atomic_inc(&rq->nr_iowait); delayacct_blkio_start(); } + } else { + /* + * XXX + * Let's make this task, which is blocked on + * a mutex, (push/pull)able (RT/DL). + * Unfortunately we can only deal with that by + * means of a dequeue/enqueue cycle. :-/ + */ + dequeue_task(rq, prev, 0); + enqueue_task(rq, prev, 0); } switch_count = &prev->nvcsw; } - next = pick_next_task(rq, prev, &rf); +pick_again: + /* + * If picked task is actually blocked it means that it can act as a + * proxy for the task that is holding the mutex picked task is blocked + * on. Get a reference to the blocked (going to be proxy) task here. + * Note that if next isn't actually blocked we will have rq->proxy == + * rq->curr == next in the end, which is intended and means that proxy + * execution is currently "not in use". + */ + next = pick_next_task(rq, rq_proxy(rq), &rf); rq_set_proxy(rq, next); - clear_tsk_need_resched(prev); + next->blocked_proxy = NULL; + if (unlikely(task_is_blocked(next))) { + next = proxy(rq, next, &rf); + if (!next) + goto pick_again; + /* + * XXX connoro: when proxy() returns rq->idle it sets the + * TIF_NEED_RESCHED flag, but in the case where + * rq->idle == rq->prev, the flag would be cleared immediately, + * defeating the desired behavior. So, check explicitly for + * this case. + */ + if (next == rq->idle && prev == rq->idle) + preserve_need_resched = true; + } + + if (!preserve_need_resched) + clear_tsk_need_resched(prev); clear_preempt_need_resched(); #ifdef CONFIG_SCHED_DEBUG rq->last_seen_need_resched_ns = 0; @@ -6684,6 +7318,10 @@ static inline void sched_submit_work(struct task_struct *tsk) */ SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT); + /* XXX still necessary? tsk_is_pi_blocked check here was deleted... */ + if (task_is_blocked(tsk)) + return; + /* * If we are going to sleep and we have plugged IO queued, * make sure to submit it to avoid deadlocks. 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 6ec40f90317c..c9b6a23a99b3 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1740,7 +1740,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags) enqueue_dl_entity(&p->dl, flags); - if (!task_current(rq, p) && p->nr_cpus_allowed > 1) + if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p)) enqueue_pushable_dl_task(rq, p); } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1471519b95c5..aa8d772efadf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7929,7 +7929,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf goto idle; #ifdef CONFIG_FAIR_GROUP_SCHED - if (!prev || prev->sched_class != &fair_sched_class) + if (!prev || + prev->sched_class != &fair_sched_class || + rq_curr(rq) != rq_proxy(rq)) goto simple; /* @@ -8447,6 +8449,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) lockdep_assert_rq_held(env->src_rq); + if (task_is_blocked(p)) + return 0; + /* * We do not migrate tasks that are: * 1) throttled_lb_pair, or @@ -8497,7 +8502,11 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) /* Record that we found at least one task that could run on dst_cpu */ env->flags &= ~LBF_ALL_PINNED; - if (task_on_cpu(env->src_rq, p)) { + /* + * XXX mutex unlock path may have marked proxy as unblocked allowing us to + * reach this point, but we still shouldn't migrate it. + */ + if (task_on_cpu(env->src_rq, p) || task_current_proxy(env->src_rq, p)) { schedstat_inc(p->stats.nr_failed_migrations_running); return 0; } diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 03e5d8fa67aa..d1c5a022eae4 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1537,7 +1537,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags) enqueue_rt_entity(rt_se, flags); - if (!task_current(rq, p) && p->nr_cpus_allowed > 1) + if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && + !task_is_blocked(p)) enqueue_pushable_task(rq, p); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 84e49c2530b0..01f82ace084a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2159,6 +2159,19 @@ static inline int task_current_proxy(struct rq *rq, struct task_struct *p) return rq_proxy(rq) == p; } +#ifdef CONFIG_PROXY_EXEC +static inline bool task_is_blocked(struct task_struct *p) +{ + return !!p->blocked_on; +} +#else /* !PROXY_EXEC */ +static inline bool task_is_blocked(struct task_struct *p) +{ + return false; +} + +#endif /* PROXY_EXEC */ + static inline int task_on_cpu(struct rq *rq, struct task_struct *p) { #ifdef CONFIG_SMP @@ -2316,12 +2329,18 @@ struct sched_class { static inline void put_prev_task(struct rq *rq, struct task_struct *prev) { - WARN_ON_ONCE(rq_proxy(rq) != prev); + WARN_ON_ONCE(rq_curr(rq) != prev && prev != rq_proxy(rq)); + + /* XXX connoro: is this check necessary? 
*/ + if (prev == rq_proxy(rq) && task_cpu(prev) != cpu_of(rq)) + return; + prev->sched_class->put_prev_task(rq, prev); } static inline void set_next_task(struct rq *rq, struct task_struct *next) { + WARN_ON_ONCE(!task_current_proxy(rq, next)); next->sched_class->set_next_task(rq, next, false); }

From patchwork Mon Mar 20 23:37:19 2023
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 72537
Date: Mon, 20 Mar 2023 23:37:19 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-12-jstultz@google.com>
Subject: [PATCH v2 11/12] sched/rt: Fix proxy/current (push,pull)ability
From: John Stultz
To: LKML
From: Valentin Schneider

Proxy execution forms atomic pairs of tasks: a proxy (scheduling context) and an owner (execution context). The proxy, along with the rest of the blocked chain, follows the owner wrt CPU placement. They can be the same task, in which case push/pull doesn't need any modification. When they are different, however,

FIFO1 & FIFO42:

	     ,->  RT42
	     |     | blocked-on
	     |     v
	proxied-by |   mutex
	     |     | owner
	     |     v
	     `--  RT1

	   RT1
	   RT42

	  CPU0            CPU1
	   ^                ^
	   |                |
	overloaded      !overloaded
	rq prio = 42    rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it will "carry" RT42 along. Clearly, neither RT1 nor RT42 should be considered push/pullable here. Furthermore, tasks becoming blocked on a mutex don't need an explicit dequeue/enqueue cycle to be made (push/pull)able: they have to be running to block on a mutex, thus they will eventually hit put_prev_task(). XXX: pinned tasks becoming unblocked should be removed from the push/pull lists, but those don't get to see __schedule() straight away.

Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . McKenney" Cc: kernel-team@android.com Signed-off-by: Valentin Schneider Signed-off-by: Connor O'Brien Signed-off-by: John Stultz
--- kernel/sched/core.c | 36 ++++++++++++++++++++++++++---------- kernel/sched/rt.c | 22 +++++++++++++++++----- 2 files changed, 43 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d0f86670bdf8..11138277c7c8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7062,12 +7062,28 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) WARN_ON_ONCE(!owner->on_rq); return owner; } + +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next) +{ + /* + * pick_next_task() calls set_next_task() on the proxy at some + * point, which ensures it is not push/pullable. However, the + * proxy *and* the owner form an atomic pair wrt push/pull. + * + * Make sure owner is not pushable. Unfortunately we can only + * deal with that by means of a dequeue/enqueue cycle.
:-/ + */ + dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE); + enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE); +} #else /* PROXY_EXEC */ static struct task_struct * proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) { return next; } + +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next) { } #endif /* PROXY_EXEC */ /* @@ -7116,6 +7132,7 @@ static void __sched notrace __schedule(unsigned int sched_mode) unsigned long prev_state; struct rq_flags rf; struct rq *rq; + bool proxied; int cpu; bool preserve_need_resched = false; @@ -7189,20 +7206,11 @@ static void __sched notrace __schedule(unsigned int sched_mode) atomic_inc(&rq->nr_iowait); delayacct_blkio_start(); } - } else { - /* - * XXX - * Let's make this task, which is blocked on - * a mutex, (push/pull)able (RT/DL). - * Unfortunately we can only deal with that by - * means of a dequeue/enqueue cycle. :-/ - */ - dequeue_task(rq, prev, 0); - enqueue_task(rq, prev, 0); } switch_count = &prev->nvcsw; } + proxied = !!prev->blocked_proxy; pick_again: /* * If picked task is actually blocked it means that it can act as a @@ -7244,6 +7252,10 @@ static void __sched notrace __schedule(unsigned int sched_mode) * changes to task_struct made by pick_next_task(). */ rq_set_curr_rcu_init(rq, next); + + if (unlikely(!task_current_proxy(rq, next))) + proxy_tag_curr(rq, next); + /* * The membarrier system call requires each architecture * to have a full memory barrier after updating @@ -7268,6 +7280,10 @@ static void __sched notrace __schedule(unsigned int sched_mode) /* Also unlocks the rq: */ rq = context_switch(rq, prev, next, &rf); } else { + /* In case next was already curr but just got blocked_proxy */ + if (unlikely(!proxied && next->blocked_proxy)) + proxy_tag_curr(rq, next); + rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); rq_unpin_lock(rq, &rf); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index d1c5a022eae4..419270b0918e 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1537,9 +1537,21 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags) enqueue_rt_entity(rt_se, flags); - if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && - !task_is_blocked(p)) - enqueue_pushable_task(rq, p); + /* + * Current can't be pushed away. Proxy is tied to current, so don't + * push it either. + */ + if (task_current(rq, p) || task_current_proxy(rq, p)) + return; + + /* + * Pinned tasks can't be pushed. + * Affinity of blocked tasks doesn't matter. + */ + if (!task_is_blocked(p) && p->nr_cpus_allowed == 1) + return; + + enqueue_pushable_task(rq, p); } static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags) @@ -1832,9 +1844,9 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p) /* * The previous task needs to be made eligible for pushing - * if it is still active + * if it is still active. Affinity of blocked task doesn't matter. 
*/ - if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1) + if (on_rt_rq(&p->rt) && (p->nr_cpus_allowed > 1 || task_is_blocked(p))) enqueue_pushable_task(rq, p); }

From patchwork Mon Mar 20 23:37:20 2023
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 72547
Date: Mon, 20 Mar 2023 23:37:20 +0000
In-Reply-To: <20230320233720.3488453-1-jstultz@google.com>
References: <20230320233720.3488453-1-jstultz@google.com>
Message-ID: <20230320233720.3488453-13-jstultz@google.com>
Subject: [PATCH v2 12/12] sched: Attempt to fix rt/dl load balancing via chain level balance
From: John Stultz
To: LKML
O'Brien" , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E . McKenney" , kernel-team@android.com, John Stultz X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1760934073866299390?= X-GMAIL-MSGID: =?utf-8?q?1760934073866299390?= From: Connor O'Brien RT/DL balancing is supposed to guarantee that with N cpus available & CPU affinity permitting, the top N RT/DL tasks will get spread across the CPUs and all get to run. Proxy exec greatly complicates this as blocked tasks remain on the rq but cannot be usefully migrated away from their lock owning tasks. This has two major consequences: 1. In order to get the desired properties we need to migrate a blocked task, its would-be proxy, and everything in between, all together - i.e., we need to push/pull "blocked chains" rather than individual tasks. 2. Tasks that are part of rq->curr's "blocked tree" therefore should not be pushed or pulled. Options for enforcing this seem to include a) create some sort of complex data structure for tracking pushability, updating it whenever the blocked tree for rq->curr changes (e.g. on mutex handoffs, migrations, etc.) as well as on context switches. b) give up on O(1) pushability checks, and search through the pushable list every push/pull until we find a pushable "chain" c) Extend option "b" with some sort of caching to avoid repeated work. For the sake of simplicity & separating the "chain level balancing" concerns from complicated optimizations, this patch focuses on trying to implement option "b" correctly. This can then hopefully provide a baseline for "correct load balancing behavior" that optimizations can try to implement more efficiently. Note: The inability to atomically check "is task enqueued on a specific rq" creates 2 possible races when following a blocked chain: - If we check task_rq() first on a task that is dequeued from its rq, it can be woken and enqueued on another rq before the call to task_on_rq_queued() - If we call task_on_rq_queued() first on a task that is on another rq, it can be dequeued (since we don't hold its rq's lock) and then be set to the current rq before we check task_rq(). Maybe there's a more elegant solution that would work, but for now, just sandwich the task_rq() check between two task_on_rq_queued() checks, all separated by smp_rmb() calls. Since we hold rq's lock, task can't be enqueued or dequeued from rq, so neither race should be possible. extensive comments on various pitfalls, races, etc. included inline. TODO: Probably no good reason not to move the new helper implementations from sched.h into core.c Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . 
McKenney" Cc: kernel-team@android.com Signed-off-by: Connor O'Brien [jstultz: rebased & sorted minor conflicts, folded down numerous fixes from Connor, fixed number of checkpatch issues] Signed-off-by: John Stultz --- kernel/sched/core.c | 8 +- kernel/sched/cpudeadline.c | 12 +-- kernel/sched/cpudeadline.h | 3 +- kernel/sched/cpupri.c | 29 ++++-- kernel/sched/cpupri.h | 6 +- kernel/sched/deadline.c | 140 ++++++++++++++++--------- kernel/sched/fair.c | 5 + kernel/sched/rt.c | 202 +++++++++++++++++++++++++++---------- kernel/sched/sched.h | 150 ++++++++++++++++++++++++++- 9 files changed, 426 insertions(+), 129 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 11138277c7c8..4e7d24560f4c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2492,6 +2492,10 @@ static int migration_cpu_stop(void *data) int push_cpu_stop(void *arg) { + /* XXX connoro: how do we handle this case when the rq->curr we push away + * is part of a proxy chain!? + * we actually push the old rq->proxy and its blocker chain. + */ struct rq *lowest_rq = NULL, *rq = this_rq(); struct task_struct *p = arg; @@ -2516,9 +2520,7 @@ int push_cpu_stop(void *arg) // XXX validate p is still the highest prio task if (task_rq(p) == rq) { - deactivate_task(rq, p, 0); - set_task_cpu(p, lowest_rq->cpu); - activate_task(lowest_rq, p, 0); + push_task_chain(rq, lowest_rq, p); resched_curr(lowest_rq); } diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c index 57c92d751bcd..efd6d716a3f2 100644 --- a/kernel/sched/cpudeadline.c +++ b/kernel/sched/cpudeadline.c @@ -113,13 +113,13 @@ static inline int cpudl_maximum(struct cpudl *cp) * * Returns: int - CPUs were found */ -int cpudl_find(struct cpudl *cp, struct task_struct *p, +int cpudl_find(struct cpudl *cp, struct task_struct *sched_ctx, struct task_struct *exec_ctx, struct cpumask *later_mask) { - const struct sched_dl_entity *dl_se = &p->dl; + const struct sched_dl_entity *dl_se = &sched_ctx->dl; if (later_mask && - cpumask_and(later_mask, cp->free_cpus, &p->cpus_mask)) { + cpumask_and(later_mask, cp->free_cpus, &exec_ctx->cpus_mask)) { unsigned long cap, max_cap = 0; int cpu, max_cpu = -1; @@ -128,13 +128,13 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p, /* Ensure the capacity of the CPUs fits the task. 
*/ for_each_cpu(cpu, later_mask) { - if (!dl_task_fits_capacity(p, cpu)) { + if (!dl_task_fits_capacity(sched_ctx, cpu)) { cpumask_clear_cpu(cpu, later_mask); cap = capacity_orig_of(cpu); if (cap > max_cap || - (cpu == task_cpu(p) && cap == max_cap)) { + (cpu == task_cpu(exec_ctx) && cap == max_cap)) { max_cap = cap; max_cpu = cpu; } @@ -150,7 +150,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p, WARN_ON(best_cpu != -1 && !cpu_present(best_cpu)); - if (cpumask_test_cpu(best_cpu, &p->cpus_mask) && + if (cpumask_test_cpu(best_cpu, &exec_ctx->cpus_mask) && dl_time_before(dl_se->deadline, cp->elements[0].dl)) { if (later_mask) cpumask_set_cpu(best_cpu, later_mask); diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h index 0adeda93b5fb..6bb27f70e9d2 100644 --- a/kernel/sched/cpudeadline.h +++ b/kernel/sched/cpudeadline.h @@ -16,7 +16,8 @@ struct cpudl { }; #ifdef CONFIG_SMP -int cpudl_find(struct cpudl *cp, struct task_struct *p, struct cpumask *later_mask); +int cpudl_find(struct cpudl *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, struct cpumask *later_mask); void cpudl_set(struct cpudl *cp, int cpu, u64 dl); void cpudl_clear(struct cpudl *cp, int cpu); int cpudl_init(struct cpudl *cp); diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index a286e726eb4b..285242b76597 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -64,6 +64,7 @@ static int convert_prio(int prio) return cpupri; } +/* XXX connoro: the p passed in here should be exec ctx */ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask, int idx) { @@ -96,11 +97,15 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, if (skip) return 0; - if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids) + if ((p && cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids) || + (!p && cpumask_any(vec->mask) >= nr_cpu_ids)) return 0; if (lowest_mask) { - cpumask_and(lowest_mask, &p->cpus_mask, vec->mask); + if (p) + cpumask_and(lowest_mask, &p->cpus_mask, vec->mask); + else + cpumask_copy(lowest_mask, vec->mask); /* * We have to ensure that we have at least one bit @@ -117,10 +122,11 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, return 1; } -int cpupri_find(struct cpupri *cp, struct task_struct *p, +int cpupri_find(struct cpupri *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, struct cpumask *lowest_mask) { - return cpupri_find_fitness(cp, p, lowest_mask, NULL); + return cpupri_find_fitness(cp, sched_ctx, exec_ctx, lowest_mask, NULL); } /** @@ -140,18 +146,19 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p, * * Return: (int)bool - CPUs were found */ -int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, - struct cpumask *lowest_mask, - bool (*fitness_fn)(struct task_struct *p, int cpu)) +int cpupri_find_fitness(struct cpupri *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, + struct cpumask *lowest_mask, + bool (*fitness_fn)(struct task_struct *p, int cpu)) { - int task_pri = convert_prio(p->prio); + int task_pri = convert_prio(sched_ctx->prio); int idx, cpu; WARN_ON_ONCE(task_pri >= CPUPRI_NR_PRIORITIES); for (idx = 0; idx < task_pri; idx++) { - if (!__cpupri_find(cp, p, lowest_mask, idx)) + if (!__cpupri_find(cp, exec_ctx, lowest_mask, idx)) continue; if (!lowest_mask || !fitness_fn) @@ -159,7 +166,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, /* Ensure the capacity of the CPUs fit 
the task */ for_each_cpu(cpu, lowest_mask) { - if (!fitness_fn(p, cpu)) + if (!fitness_fn(sched_ctx, cpu)) cpumask_clear_cpu(cpu, lowest_mask); } @@ -191,7 +198,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, * really care. */ if (fitness_fn) - return cpupri_find(cp, p, lowest_mask); + return cpupri_find(cp, sched_ctx, exec_ctx, lowest_mask); return 0; } diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h index d6cba0020064..bde7243cec2e 100644 --- a/kernel/sched/cpupri.h +++ b/kernel/sched/cpupri.h @@ -18,9 +18,11 @@ struct cpupri { }; #ifdef CONFIG_SMP -int cpupri_find(struct cpupri *cp, struct task_struct *p, +int cpupri_find(struct cpupri *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, struct cpumask *lowest_mask); -int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, +int cpupri_find_fitness(struct cpupri *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, struct cpumask *lowest_mask, bool (*fitness_fn)(struct task_struct *p, int cpu)); void cpupri_set(struct cpupri *cp, int cpu, int pri); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index c9b6a23a99b3..83908d51f354 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1814,7 +1814,7 @@ static inline bool dl_task_is_earliest_deadline(struct task_struct *p, rq->dl.earliest_dl.curr)); } -static int find_later_rq(struct task_struct *task); +static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx); static int select_task_rq_dl(struct task_struct *p, int cpu, int flags) @@ -1854,7 +1854,11 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags) select_rq |= !dl_task_fits_capacity(p, cpu); if (select_rq) { - int target = find_later_rq(p); + /* + * XXX connoro: verify this but in wakeup path we should + * always have unblocked p, so exec_ctx == sched_ctx == p. + */ + int target = find_later_rq(p, p); if (target != -1 && dl_task_is_earliest_deadline(p, cpu_rq(target))) @@ -1901,12 +1905,18 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p) { + struct task_struct *exec_ctx; + /* * Current can't be migrated, useless to reschedule, * let's hope p can move out. */ if (rq_curr(rq)->nr_cpus_allowed == 1 || - !cpudl_find(&rq->rd->cpudl, rq_proxy(rq), NULL)) + !cpudl_find(&rq->rd->cpudl, rq_proxy(rq), rq_curr(rq), NULL)) + return; + + exec_ctx = find_exec_ctx(rq, p); + if (task_current(rq, exec_ctx)) return; /* @@ -1914,7 +1924,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p) * see if it is pushed or pulled somewhere else. 
*/ if (p->nr_cpus_allowed != 1 && - cpudl_find(&rq->rd->cpudl, p, NULL)) + cpudl_find(&rq->rd->cpudl, p, exec_ctx, NULL)) return; resched_curr(rq); @@ -2084,14 +2094,6 @@ static void task_fork_dl(struct task_struct *p) /* Only try algorithms three times */ #define DL_MAX_TRIES 3 -static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu) -{ - if (!task_on_cpu(rq, p) && - cpumask_test_cpu(cpu, &p->cpus_mask)) - return 1; - return 0; -} - /* * Return the earliest pushable rq's task, which is suitable to be executed * on the CPU, NULL otherwise: @@ -2110,7 +2112,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu if (next_node) { p = __node_2_pdl(next_node); - if (pick_dl_task(rq, p, cpu)) + if (pushable_chain(rq, p, cpu) == 1) return p; next_node = rb_next(next_node); @@ -2122,25 +2124,25 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl); -static int find_later_rq(struct task_struct *task) +static int find_later_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx) { struct sched_domain *sd; struct cpumask *later_mask = this_cpu_cpumask_var_ptr(local_cpu_mask_dl); int this_cpu = smp_processor_id(); - int cpu = task_cpu(task); + int cpu = task_cpu(sched_ctx); /* Make sure the mask is initialized first */ if (unlikely(!later_mask)) return -1; - if (task->nr_cpus_allowed == 1) + if (exec_ctx && exec_ctx->nr_cpus_allowed == 1) return -1; /* * We have to consider system topology and task affinity * first, then we can look for a suitable CPU. */ - if (!cpudl_find(&task_rq(task)->rd->cpudl, task, later_mask)) + if (!cpudl_find(&task_rq(exec_ctx)->rd->cpudl, sched_ctx, exec_ctx, later_mask)) return -1; /* @@ -2209,15 +2211,59 @@ static int find_later_rq(struct task_struct *task) return -1; } +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq) +{ + struct task_struct *p = NULL; + struct rb_node *next_node; + + if (!has_pushable_dl_tasks(rq)) + return NULL; + + next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root); + +next_node: + if (next_node) { + p = __node_2_pdl(next_node); + + /* + * cpu argument doesn't matter because we treat a -1 result + * (pushable but can't go to cpu0) the same as a 1 result + * (pushable to cpu0). All we care about here is general + * pushability. + */ + if (pushable_chain(rq, p, 0)) + return p; /* XXX connoro TODO this is definitely wrong in combo with the later checks...*/ + + next_node = rb_next(next_node); + goto next_node; + } + + if (!p) + return NULL; + + WARN_ON_ONCE(rq->cpu != task_cpu(p)); + WARN_ON_ONCE(task_current(rq, p)); + WARN_ON_ONCE(p->nr_cpus_allowed <= 1); + + WARN_ON_ONCE(!task_on_rq_queued(p)); + WARN_ON_ONCE(!dl_task(p)); + + return p; +} + /* Locks the rq it finds */ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq) { + struct task_struct *exec_ctx; struct rq *later_rq = NULL; + bool retry; int tries; int cpu; for (tries = 0; tries < DL_MAX_TRIES; tries++) { - cpu = find_later_rq(task); + retry = false; + exec_ctx = find_exec_ctx(rq, task); + cpu = find_later_rq(task, exec_ctx); if ((cpu == -1) || (cpu == rq->cpu)) break; @@ -2236,11 +2282,30 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq) /* Retry if something changed. 
*/ if (double_lock_balance(rq, later_rq)) { - if (unlikely(task_rq(task) != rq || - !cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) || - task_on_cpu(rq, task) || - !dl_task(task) || - !task_on_rq_queued(task))) { + bool fail = false; + + /* XXX connoro: this is a mess. Surely there's a better way to express it...*/ + if (!dl_task(task)) { + fail = true; + } else if (rq != this_rq()) { + struct task_struct *next_task = pick_next_pushable_dl_task(rq); + + if (next_task != task) { + fail = true; + } else { + exec_ctx = find_exec_ctx(rq, next_task); + retry = (exec_ctx && + !cpumask_test_cpu(later_rq->cpu, + &exec_ctx->cpus_mask)); + } + } else { + int pushable = pushable_chain(rq, task, later_rq->cpu); + + fail = !pushable; + retry = pushable == -1; + } + + if (unlikely(fail)) { double_unlock_balance(rq, later_rq); later_rq = NULL; break; @@ -2252,7 +2317,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq) * its earliest one has a later deadline than our * task, the rq is a good one. */ - if (dl_task_is_earliest_deadline(task, later_rq)) + if (!retry && dl_task_is_earliest_deadline(task, later_rq)) break; /* Otherwise we try again. */ @@ -2263,25 +2328,6 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq) return later_rq; } -static struct task_struct *pick_next_pushable_dl_task(struct rq *rq) -{ - struct task_struct *p; - - if (!has_pushable_dl_tasks(rq)) - return NULL; - - p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root)); - - WARN_ON_ONCE(rq->cpu != task_cpu(p)); - WARN_ON_ONCE(task_current(rq, p)); - WARN_ON_ONCE(p->nr_cpus_allowed <= 1); - - WARN_ON_ONCE(!task_on_rq_queued(p)); - WARN_ON_ONCE(!dl_task(p)); - - return p; -} - /* * See if the non running -deadline tasks on this rq * can be sent to some other CPU where they can preempt @@ -2351,9 +2397,7 @@ static int push_dl_task(struct rq *rq) goto retry; } - deactivate_task(rq, next_task, 0); - set_task_cpu(next_task, later_rq->cpu); - activate_task(later_rq, next_task, 0); + push_task_chain(rq, later_rq, next_task); ret = 1; resched_curr(later_rq); @@ -2439,9 +2483,7 @@ static void pull_dl_task(struct rq *this_rq) if (is_migration_disabled(p)) { push_task = get_push_task(src_rq); } else { - deactivate_task(src_rq, p, 0); - set_task_cpu(p, this_cpu); - activate_task(this_rq, p, 0); + push_task_chain(src_rq, this_rq, p); dmin = p->dl.deadline; resched = true; } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index aa8d772efadf..bb2d61cbb5b3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8449,6 +8449,11 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) lockdep_assert_rq_held(env->src_rq); + /* + * XXX connoro: Is this correct, or should we be doing chain + * balancing for CFS tasks too? Balancing chains that aren't + * part of the running task's blocked "tree" seems reasonable? 
+ */ if (task_is_blocked(p)) return 0; diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 419270b0918e..b28fc4ccc9d2 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1599,7 +1599,7 @@ static void yield_task_rt(struct rq *rq) } #ifdef CONFIG_SMP -static int find_lowest_rq(struct task_struct *task); +static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx); static int select_task_rq_rt(struct task_struct *p, int cpu, int flags) @@ -1649,7 +1649,10 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags) (curr->nr_cpus_allowed < 2 || proxy->prio <= p->prio); if (test || !rt_task_fits_capacity(p, cpu)) { - int target = find_lowest_rq(p); + /* XXX connoro: double check this, but if we're waking p then + * it is unblocked so exec_ctx == sched_ctx == p. + */ + int target = find_lowest_rq(p, p); /* * Bail out if we were forcing a migration to find a better @@ -1676,12 +1679,22 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags) static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) { + struct task_struct *exec_ctx = p; + /* + * Current can't be migrated, useless to reschedule, + * let's hope p can move out. + */ /* XXX connoro: need to revise cpupri_find() to reflect the split * context since it should look at rq->proxy for priority but rq->curr * for affinity. */ if (rq_curr(rq)->nr_cpus_allowed == 1 || - !cpupri_find(&rq->rd->cpupri, rq_proxy(rq), NULL)) + !cpupri_find(&rq->rd->cpupri, rq_proxy(rq), rq_curr(rq), NULL)) + return; + + /* No reason to preempt since rq->curr wouldn't change anyway */ + exec_ctx = find_exec_ctx(rq, p); + if (task_current(rq, exec_ctx)) return; /* @@ -1689,7 +1702,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) * see if it is pushed or pulled somewhere else. 
*/ if (p->nr_cpus_allowed != 1 && - cpupri_find(&rq->rd->cpupri, p, NULL)) + cpupri_find(&rq->rd->cpupri, p, exec_ctx, NULL)) return; /* @@ -1855,15 +1868,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p) /* Only try algorithms three times */ #define RT_MAX_TRIES 3 -static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu) -{ - if (!task_on_cpu(rq, p) && - cpumask_test_cpu(cpu, &p->cpus_mask)) - return 1; - - return 0; -} - /* * Return the highest pushable rq's task, which is suitable to be executed * on the CPU, NULL otherwise @@ -1877,7 +1881,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu) return NULL; plist_for_each_entry(p, head, pushable_tasks) { - if (pick_rt_task(rq, p, cpu)) + if (pushable_chain(rq, p, cpu) == 1) return p; } @@ -1886,19 +1890,19 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu) static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask); -static int find_lowest_rq(struct task_struct *task) +static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struct *exec_ctx) { struct sched_domain *sd; struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask); int this_cpu = smp_processor_id(); - int cpu = task_cpu(task); + int cpu = task_cpu(sched_ctx); int ret; /* Make sure the mask is initialized first */ if (unlikely(!lowest_mask)) return -1; - if (task->nr_cpus_allowed == 1) + if (exec_ctx && exec_ctx->nr_cpus_allowed == 1) return -1; /* No other targets possible */ /* @@ -1907,13 +1911,13 @@ static int find_lowest_rq(struct task_struct *task) */ if (sched_asym_cpucap_active()) { - ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri, - task, lowest_mask, + ret = cpupri_find_fitness(&task_rq(sched_ctx)->rd->cpupri, + sched_ctx, exec_ctx, lowest_mask, rt_task_fits_capacity); } else { - ret = cpupri_find(&task_rq(task)->rd->cpupri, - task, lowest_mask); + ret = cpupri_find(&task_rq(sched_ctx)->rd->cpupri, + sched_ctx, exec_ctx, lowest_mask); } if (!ret) @@ -1977,15 +1981,48 @@ static int find_lowest_rq(struct task_struct *task) return -1; } +static struct task_struct *pick_next_pushable_task(struct rq *rq) +{ + struct plist_head *head = &rq->rt.pushable_tasks; + struct task_struct *p, *push_task = NULL; + + if (!has_pushable_tasks(rq)) + return NULL; + + plist_for_each_entry(p, head, pushable_tasks) { + if (pushable_chain(rq, p, 0)) { + push_task = p; + break; + } + } + + if (!push_task) + return NULL; + + BUG_ON(rq->cpu != task_cpu(push_task)); + BUG_ON(task_current(rq, push_task) || task_current_proxy(rq, push_task)); + /*XXX connoro: this check is pointless for blocked push_task. */ + /* BUG_ON(push_task->nr_cpus_allowed <= 1); */ + + BUG_ON(!task_on_rq_queued(push_task)); + BUG_ON(!rt_task(push_task)); + + return p; +} + /* Will lock the rq it finds */ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq) { + struct task_struct *exec_ctx; struct rq *lowest_rq = NULL; + bool retry; int tries; int cpu; for (tries = 0; tries < RT_MAX_TRIES; tries++) { - cpu = find_lowest_rq(task); + retry = false; + exec_ctx = find_exec_ctx(rq, task); + cpu = find_lowest_rq(task, exec_ctx); if ((cpu == -1) || (cpu == rq->cpu)) break; @@ -2004,18 +2041,77 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq) /* if the prio of this runqueue changed, try again */ if (double_lock_balance(rq, lowest_rq)) { + bool fail = false; /* * We had to unlock the run queue. 
In * the mean time, task could have * migrated already or had its affinity changed. * Also make sure that it wasn't scheduled on its rq. + * + * XXX connoro: releasing the rq lock means we need to re-check pushability. + * Some scenarios: + * 1) If a migration from another CPU sent a task/chain to rq + * that made task newly unpushable by completing a chain + * from task to rq->curr, then we need to bail out and push something + * else. + * 2) If our chain led off this CPU or to a dequeued task, the last waiter + * on this CPU might have acquired the lock and woken (or even migrated + * & run, handed off the lock it held, etc...). This can invalidate the + * result of find_lowest_rq() if our chain previously ended in a blocked + * task whose affinity we could ignore, but now ends in an unblocked + * task that can't run on lowest_rq. + * 3) race described at https://lore.kernel.org/all/1523536384-26781-2-git-send-email-huawei.libin@huawei.com/ + * + * Notes on these: + * - Scenario #2 is properly handled by rerunning find_lowest_rq + * - Scenario #1 requires that we fail + * - Scenario #3 can AFAICT only occur when rq is not this_rq(). And the + * suggested fix is not universally correct now that push_cpu_stop() can + * call this function. */ - if (unlikely(task_rq(task) != rq || - !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) || - task_on_cpu(rq, task) || - !rt_task(task) || - !task_on_rq_queued(task))) { + if (!rt_task(task)) { + fail = true; + } else if (rq != this_rq()) { + /* + * If we are dealing with a remote rq, then all bets are off + * because task might have run & then been dequeued since we + * released the lock, at which point our normal checks can race + * with migration, as described in + * https://lore.kernel.org/all/1523536384-26781-2-git-send-email-huawei.libin@huawei.com/ + * Need to repick to ensure we avoid a race. + * But re-picking would be unnecessary & incorrect in the + * push_cpu_stop() path. + */ + struct task_struct *next_task = pick_next_pushable_task(rq); + + if (next_task != task) { + fail = true; + } else { + exec_ctx = find_exec_ctx(rq, next_task); + retry = (exec_ctx && + !cpumask_test_cpu(lowest_rq->cpu, + &exec_ctx->cpus_mask)); + } + } else { + /* + * Chain level balancing introduces new ways for our choice of + * task & rq to become invalid when we release the rq lock, e.g.: + * 1) Migration to rq from another CPU makes task newly unpushable + * by completing a "blocked chain" from task to rq->curr. + * Fail so a different task can be chosen for push. + * 2) In cases where task's blocked chain led to a dequeued task + * or one on another rq, the last waiter in the chain on this + * rq might have acquired the lock and woken, meaning we must + * pick a different rq if its affinity prevents running on + * lowest_rq. + */ + int pushable = pushable_chain(rq, task, lowest_rq->cpu); + + fail = !pushable; + retry = pushable == -1; + } + if (unlikely(fail)) { double_unlock_balance(rq, lowest_rq); lowest_rq = NULL; break; @@ -2023,7 +2119,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq) } /* If this rq is still suitable use it. 
*/ - if (lowest_rq->rt.highest_prio.curr > task->prio) + if (lowest_rq->rt.highest_prio.curr > task->prio && !retry) break; /* try again */ @@ -2034,26 +2130,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq) return lowest_rq; } -static struct task_struct *pick_next_pushable_task(struct rq *rq) -{ - struct task_struct *p; - - if (!has_pushable_tasks(rq)) - return NULL; - - p = plist_first_entry(&rq->rt.pushable_tasks, - struct task_struct, pushable_tasks); - - BUG_ON(rq->cpu != task_cpu(p)); - BUG_ON(task_current(rq, p) || task_current_proxy(rq, p)); - BUG_ON(p->nr_cpus_allowed <= 1); - - BUG_ON(!task_on_rq_queued(p)); - BUG_ON(!rt_task(p)); - - return p; -} - /* * If the current CPU has more than one RT task, see if the non * running task can migrate over to a CPU that is running a task @@ -2109,10 +2185,10 @@ static int push_rt_task(struct rq *rq, bool pull) * If #3 is needed, might be best to make a separate patch with * all the "chain-level load balancing" changes. */ - if (rq_curr(rq)->sched_class != &rt_sched_class) + if (rq_proxy(rq)->sched_class != &rt_sched_class) return 0; - cpu = find_lowest_rq(rq_curr(rq)); + cpu = find_lowest_rq(rq_proxy(rq), rq_curr(rq)); if (cpu == -1 || cpu == rq->cpu) return 0; @@ -2146,6 +2222,15 @@ static int push_rt_task(struct rq *rq, bool pull) * case for when we push a blocked task whose lock owner is not on * this rq. */ + /* XXX connoro: we might unlock the rq here. But it might be the case that + * the unpushable set can only *grow* and not shrink? Hmmm + * - load balancing should not pull anything from the active blocked tree + * - rq->curr can't have made progress or released mutexes + * - we can't have scheduled, right? Is preemption disabled here? + * - however, suppose proxy() pushed a task or chain here that linked our chain + * into the active tree. + */ + /* XXX connoro: we need to pass in */ lowest_rq = find_lock_lowest_rq(next_task, rq); if (!lowest_rq) { struct task_struct *task; @@ -2180,9 +2265,7 @@ static int push_rt_task(struct rq *rq, bool pull) goto retry; } - deactivate_task(rq, next_task, 0); - set_task_cpu(next_task, lowest_rq->cpu); - activate_task(lowest_rq, next_task, 0); + push_task_chain(rq, lowest_rq, next_task); resched_curr(lowest_rq); ret = 1; @@ -2453,9 +2536,8 @@ static void pull_rt_task(struct rq *this_rq) if (is_migration_disabled(p)) { push_task = get_push_task(src_rq); } else { - deactivate_task(src_rq, p, 0); - set_task_cpu(p, this_cpu); - activate_task(this_rq, p, 0); + /* XXX connoro: need to do chain migration here. */ + push_task_chain(src_rq, this_rq, p); resched = true; } /* @@ -2469,6 +2551,14 @@ static void pull_rt_task(struct rq *this_rq) double_unlock_balance(this_rq, src_rq); if (push_task) { + /* + * can push_cpu_stop get away with following blocked_proxy + * even though it's not following it from rq->curr? + * I can't figure out if that's correct. + * Ha! actually the trick is that get_push_task should return + * the proxy! + * So push_cpu_stop just follows blocked_on relations. + */ raw_spin_rq_unlock(this_rq); stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop, push_task, &src_rq->push_work); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 01f82ace084a..1c998ddc7cb8 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2425,7 +2425,7 @@ static inline struct task_struct *get_push_task(struct rq *rq) * chain during the __schedule() call immediately after rq->curr is * pushed. 
*/ - struct task_struct *p = rq_curr(rq); + struct task_struct *p = rq_proxy(rq); lockdep_assert_rq_held(rq); @@ -3412,4 +3412,152 @@ static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *n static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { } #endif +#ifdef CONFIG_SMP + +static inline bool task_queued_on_rq(struct rq *rq, struct task_struct *task) +{ + if (!task_on_rq_queued(task)) + return false; + smp_rmb(); + if (task_rq(task) != rq) + return false; + smp_rmb(); + if (!task_on_rq_queued(task)) + return false; + return true; +} + +static inline void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct *task) +{ + struct task_struct *owner; + + lockdep_assert_rq_held(rq); + lockdep_assert_rq_held(dst_rq); + + BUG_ON(!task_queued_on_rq(rq, task)); + BUG_ON(task_current_proxy(rq, task)); + + for (; task != NULL; task = owner) { + /* + * XXX connoro: note that if task is currently in the process of migrating to + * rq (but not yet enqueued since we hold the rq lock) then we stop only after + * pushing all the preceding tasks. This isn't ideal (the pushed chain will + * probably get sent back as soon as it's picked on dst_rq) but short of holding + * all of the rq locks while balancing, I don't see how we can avoid this, and + * some extra migrations are clearly better than trying to dequeue task from rq + * before it's ever enqueued here. + * + * XXX connoro: catastrophic race when task is dequeued on rq to start and then + * wakes on another rq in between the two checks. + * There's probably a better way than the below though... + */ + if (!task_queued_on_rq(rq, task) || task_current_proxy(rq, task)) + break; + + if (task_is_blocked(task)) { + owner = mutex_owner(task->blocked_on); + } else { + owner = NULL; + } + deactivate_task(rq, task, 0); + set_task_cpu(task, dst_rq->cpu); + activate_task(dst_rq, task, 0); + if (task == owner) + break; + } +} + +/* + * Returns the unblocked task at the end of the blocked chain starting with p + * if that chain is composed entirely of tasks enqueued on rq, or NULL otherwise. + */ +static inline struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p) +{ + struct task_struct *exec_ctx, *owner; + struct mutex *mutex; + + lockdep_assert_rq_held(rq); + + /* + * XXX connoro: I *think* we have to return rq->curr if it occurs anywhere in the chain + * to avoid races in certain scenarios where rq->curr has just blocked but can't + * switch out until we release its rq lock. + * Should the check be task_on_cpu() instead? Does it matter? I don't think this + * gets called while context switch is actually ongoing which IIUC is where this would + * make a difference... + * correction: it can get called from finish_task_switch apparently. Unless that's wrong; + * double check. + */ + for (exec_ctx = p; task_is_blocked(exec_ctx) && !task_on_cpu(rq, exec_ctx); + exec_ctx = owner) { + mutex = exec_ctx->blocked_on; + owner = mutex_owner(mutex); + if (owner == exec_ctx) + break; + + /* + * XXX connoro: can we race here if owner is migrating to rq? + * owner has to be dequeued from its old rq before set_task_cpu + * is called, and we hold this rq's lock so it can't be + * enqueued here yet...right? + * + * Also if owner is dequeued we can race with its wakeup on another + * CPU...at which point all hell will break loose potentially... 
+ */ + if (!task_queued_on_rq(rq, owner) || task_current_proxy(rq, owner)) { + exec_ctx = NULL; + break; + } + } + return exec_ctx; +} + + +/* + * Returns: + * 1 if chain is pushable and affinity does not prevent pushing to cpu + * 0 if chain is unpushable + * -1 if chain is pushable but affinity blocks running on cpu. + * XXX connoro: maybe there's a cleaner way to do this... + */ +static inline int pushable_chain(struct rq *rq, struct task_struct *p, int cpu) +{ + struct task_struct *exec_ctx; + + lockdep_assert_rq_held(rq); + + /* + * XXX connoro: 2 issues combine here: + * 1) we apparently have some stuff on the pushable list after it's + * dequeued from the rq + * 2) This check can race with migration/wakeup if p was already dequeued + * when we got the rq lock... + */ + if (task_rq(p) != rq || !task_on_rq_queued(p)) + return 0; + + exec_ctx = find_exec_ctx(rq, p); + /* + * Chain leads off the rq, we're free to push it anywhere. + * + * One wrinkle with relying on find_exec_ctx is that when the chain + * leads to a task currently migrating to rq, we see the chain as + * pushable & push everything prior to the migrating task. Even if + * we checked explicitly for this case, we could still race with a + * migration after the check. + * This shouldn't permanently produce a bad state though, as proxy() + * will send the chain back to rq and by that point the migration + * should be complete & a proper push can occur. + */ + if (!exec_ctx) + return 1; + + if (task_on_cpu(rq, exec_ctx) || exec_ctx->nr_cpus_allowed <= 1) + return 0; + + return cpumask_test_cpu(cpu, &exec_ctx->cpus_mask) ? 1 : -1; +} + +#endif + #endif /* _KERNEL_SCHED_SCHED_H */
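As a closing aid, here is a minimal user-space model of the pushable_chain()/find_exec_ctx() contract introduced in the sched.h hunk above: pushability of a blocked chain is decided by the state and affinity of the execution context at the end of the chain, not by the blocked task that sits on the pushable list. The sketch deliberately omits the task_queued_on_rq() sandwich, the rq-lock requirements, and the NULL "chain leads off this rq" case discussed in the comments; the structures and helpers are simplified stand-ins, not the kernel ones.

/*
 * Toy model of the pushable_chain()/find_exec_ctx() helpers added above.
 * It walks a blocked chain from a candidate task to the first unblocked
 * owner and classifies the chain as pushable (1), unpushable (0), or
 * pushable-but-not-to-this-CPU (-1), mirroring the tri-state contract.
 */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS 4

struct mutex;

struct task {
	const char   *name;
	bool          on_cpu;                 /* currently running        */
	bool          cpus_allowed[MAX_CPUS];
	int           nr_cpus_allowed;
	struct mutex *blocked_on;
};

struct mutex { struct task *owner; };

static bool task_is_blocked(const struct task *p) { return p->blocked_on != NULL; }

/* Follow owner links until we reach a task that is not itself blocked. */
static struct task *find_exec_ctx(struct task *p)
{
	while (task_is_blocked(p) && !p->on_cpu)
		p = p->blocked_on->owner;
	return p;
}

static int pushable_chain(struct task *p, int cpu)
{
	struct task *exec_ctx = find_exec_ctx(p);

	if (exec_ctx->on_cpu || exec_ctx->nr_cpus_allowed <= 1)
		return 0;                     /* chain is tied to this rq */
	return exec_ctx->cpus_allowed[cpu] ? 1 : -1;
}

int main(void)
{
	struct task owner  = { "owner",  false, { true, false, false, false }, 1, NULL };
	struct mutex m     = { &owner };
	struct task waiter = { "waiter", false, { true, true,  true,  true  }, 4, &m };

	/* Owner is affined to CPU0 only, so the whole chain is unpushable. */
	printf("pushable_chain(waiter, 1) = %d\n", pushable_chain(&waiter, 1));
	return 0;
}

With the owner pinned to CPU0, the example prints pushable_chain(waiter, 1) = 0: even though the waiter itself could run anywhere, the chain ends in a single-CPU execution context and therefore cannot be pushed, which is exactly the rule the real helpers enforce under the rq lock.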