From patchwork Tue Jun 13 05:20:02 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 107097
From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, rostedt@goodmis.org, dietmar.eggemann@arm.com,
 bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
 joshdon@google.com, roman.gushchin@linux.dev, tj@kernel.org,
 kernel-team@meta.com
Subject: [RFC PATCH 1/3] sched: Make migrate_task_to() take any task
Date: Tue, 13 Jun 2023 00:20:02 -0500
Message-Id: <20230613052004.2836135-2-void@manifault.com>
In-Reply-To: <20230613052004.2836135-1-void@manifault.com>
References: <20230613052004.2836135-1-void@manifault.com>

The migrate_task_to() function exposed from kernel/sched/core.c migrates
the current task, which is silently assumed to also be its first
argument, to the specified CPU.
The function uses stop_one_cpu() to migrate the task to the target CPU,
which won't work if @p is not the current task, as the stop_one_cpu()
callback isn't invoked on remote CPUs.

While this operation is useful for task_numa_migrate() in fair.c, it
would also be useful if __migrate_task() in core.c were given external
linkage, as it actually can be used to migrate any task to a CPU.

This patch therefore:

1. Renames the existing migrate_task_to() to be called
   numa_migrate_current_task_to().
2. Renames __migrate_task() to migrate_task_to(), gives it global
   linkage, and updates all callers accordingly.

A follow-on patch will call the new migrate_task_to() from fair.c when
migrating a task in a shared wakequeue to a remote CPU.

Signed-off-by: David Vernet
---
 kernel/sched/core.c  | 16 ++++++++--------
 kernel/sched/fair.c  |  2 +-
 kernel/sched/sched.h |  4 +++-
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ac38225e6d09..d911b0631e7b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2539,8 +2539,8 @@ struct set_affinity_pending {
  * So we race with normal scheduler movements, but that's OK, as long
  * as the task is no longer on this CPU.
  */
-static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
-				 struct task_struct *p, int dest_cpu)
+struct rq *migrate_task_to(struct rq *rq, struct rq_flags *rf,
+			   struct task_struct *p, int dest_cpu)
 {
 	/* Affinity changed (again). */
 	if (!is_cpu_allowed(p, dest_cpu))
@@ -2573,7 +2573,7 @@ static int migration_cpu_stop(void *data)
 	local_irq_save(rf.flags);
 	/*
 	 * We need to explicitly wake pending tasks before running
-	 * __migrate_task() such that we will not miss enforcing cpus_ptr
+	 * migrate_task_to() such that we will not miss enforcing cpus_ptr
 	 * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
 	 */
 	flush_smp_call_function_queue();
@@ -2605,12 +2605,12 @@ static int migration_cpu_stop(void *data)
 	}

 	if (task_on_rq_queued(p))
-		rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
+		rq = migrate_task_to(rq, &rf, p, arg->dest_cpu);
 	else
 		p->wake_cpu = arg->dest_cpu;

 	/*
-	 * XXX __migrate_task() can fail, at which point we might end
+	 * XXX migrate_task_to() can fail, at which point we might end
 	 * up running on a dodgy CPU, AFAICT this can only happen
 	 * during CPU hotplug, at which point we'll get pushed out
 	 * anyway, so it's probably not a big deal.
@@ -3259,7 +3259,7 @@ void force_compatible_cpus_allowed_ptr(struct task_struct *p)
 	alloc_cpumask_var(&new_mask, GFP_KERNEL);

 	/*
-	 * __migrate_task() can fail silently in the face of concurrent
+	 * migrate_task_to() can fail silently in the face of concurrent
 	 * offlining of the chosen destination CPU, so take the hotplug
 	 * lock to ensure that the migration succeeds.
 	 */
@@ -9359,7 +9359,7 @@ bool sched_smp_initialized __read_mostly;

 #ifdef CONFIG_NUMA_BALANCING
 /* Migrate current task p to target_cpu */
-int migrate_task_to(struct task_struct *p, int target_cpu)
+int numa_migrate_current_task_to(struct task_struct *p, int target_cpu)
 {
 	struct migration_arg arg = { p, target_cpu };
 	int curr_cpu = task_cpu(p);
@@ -9439,7 +9439,7 @@ static int __balance_push_cpu_stop(void *arg)
 	if (task_rq(p) == rq && task_on_rq_queued(p)) {
 		cpu = select_fallback_rq(rq->cpu, p);
-		rq = __migrate_task(rq, &rf, p, cpu);
+		rq = migrate_task_to(rq, &rf, p, cpu);
 	}

 	rq_unlock(rq, &rf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6189d1a45635..292c593fc84f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2278,7 +2278,7 @@ static int task_numa_migrate(struct task_struct *p)
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
-		ret = migrate_task_to(p, env.best_cpu);
+		ret = numa_migrate_current_task_to(p, env.best_cpu);
 		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu,
					       NULL, env.best_cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 556496c77dc2..5a86e9795731 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1718,7 +1718,7 @@ enum numa_faults_stats {
 	NUMA_CPUBUF
 };
 extern void sched_setnuma(struct task_struct *p, int node);
-extern int migrate_task_to(struct task_struct *p, int cpu);
+extern int numa_migrate_current_task_to(struct task_struct *p, int target_cpu);
 extern int migrate_swap(struct task_struct *p, struct task_struct *t,
 			int cpu, int scpu);
 extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p);
@@ -1731,6 +1731,8 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)

 #ifdef CONFIG_SMP

+extern struct rq *migrate_task_to(struct rq *rq, struct rq_flags *rf,
+				  struct task_struct *p, int dest_cpu);
 static inline void
 queue_balance_callback(struct rq *rq,
 		       struct balance_callback *head,

From patchwork Tue Jun 13 05:20:03 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 107092
From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, rostedt@goodmis.org, dietmar.eggemann@arm.com,
 bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
 joshdon@google.com, roman.gushchin@linux.dev, tj@kernel.org,
 kernel-team@meta.com
Subject: [RFC PATCH 2/3] sched/fair: Add SWQUEUE sched feature and skeleton calls
Date: Tue, 13 Jun 2023 00:20:03 -0500
Message-Id: <20230613052004.2836135-3-void@manifault.com>
In-Reply-To: <20230613052004.2836135-1-void@manifault.com>
References: <20230613052004.2836135-1-void@manifault.com>
For certain workloads in CFS, CPU utilization is of the utmost
importance. For example, at Meta, our main web workload benefits from a
1 - 1.5% improvement in RPS, and a 1 - 2% improvement in p99 latency,
when CPU utilization is pushed as high as possible. This is likely
something that would be useful for any workload with long slices, or
for which avoiding migration is unlikely to result in improved cache
locality.

We will soon be enabling more aggressive load balancing via a new
feature called swqueue, which places tasks into a FIFO queue on the
wakeup path, and then dequeues them when a core goes idle before
invoking newidle_balance(). We don't want to enable the feature by
default, so this patch defines and declares a new scheduler feature
called SWQUEUE which is disabled by default. In addition, we add calls
to empty / skeleton functions in the relevant fair codepaths where
swqueue will be utilized.

A set of future patches will implement these functions, and enable
swqueue for both single and multi socket / CCX architectures.
Originally-by: Roman Gushchin
Signed-off-by: David Vernet
---
 kernel/sched/fair.c     | 35 +++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 36 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 292c593fc84f..807986bd6ea6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -140,6 +140,17 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

 #ifdef CONFIG_SMP
+static void swqueue_enqueue(struct rq *rq, struct task_struct *p,
+			    int enq_flags)
+{}
+static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
+{
+	return 0;
+}
+
+static void swqueue_remove_task(struct task_struct *p)
+{}
+
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
  */
@@ -162,6 +173,17 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: ~5%)
  */
 #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
+#else
+static void swqueue_enqueue(struct rq *rq, struct task_struct *p,
+			    int enq_flags)
+{}
+static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
+{
+	return 0;
+}
+
+static void swqueue_remove_task(struct task_struct *p)
+{}
 #endif

 #ifdef CONFIG_CFS_BANDWIDTH
@@ -6368,6 +6390,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_new)
 		update_overutilized_status(rq);

+	if (sched_feat(SWQUEUE))
+		swqueue_enqueue(rq, p, flags);
+
 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);

@@ -6449,6 +6474,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
+
+	if (sched_feat(SWQUEUE))
+		swqueue_remove_task(p);
 }

 #ifdef CONFIG_SMP
@@ -8155,12 +8183,18 @@ done: __maybe_unused;

 	update_misfit_status(p, rq);

+	if (sched_feat(SWQUEUE))
+		swqueue_remove_task(p);
+
 	return p;

idle:
 	if (!rf)
 		return NULL;

+	if (sched_feat(SWQUEUE) && swqueue_pick_next_task(rq, rf))
+		return RETRY_TASK;
+
 	new_tasks = newidle_balance(rq, rf);

 	/*
@@ -12325,6 +12359,7 @@ static void attach_task_cfs_rq(struct task_struct *p)

 static void switched_from_fair(struct rq *rq, struct task_struct *p)
 {
+	swqueue_remove_task(p);
 	detach_task_cfs_rq(p);
 }

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..57b19bc70cd4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,4 @@ SCHED_FEAT(LATENCY_WARN, false)

 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+SCHED_FEAT(SWQUEUE, false)

From patchwork Tue Jun 13 05:20:04 2023
X-Patchwork-Submitter: David Vernet
X-Patchwork-Id: 107094
From: David Vernet
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, rostedt@goodmis.org, dietmar.eggemann@arm.com,
 bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
 joshdon@google.com, roman.gushchin@linux.dev, tj@kernel.org,
 kernel-team@meta.com
Subject: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS
Date: Tue, 13 Jun 2023 00:20:04 -0500
Message-Id: <20230613052004.2836135-4-void@manifault.com>
In-Reply-To: <20230613052004.2836135-1-void@manifault.com>
References: <20230613052004.2836135-1-void@manifault.com>
Overview
========

The scheduler must constantly strike a balance between work
conservation and avoiding costly migrations which harm performance due
to e.g. decreased cache locality. The matter is further complicated by
the topology of the system. Migrating a task between cores on the same
LLC may be more optimal than keeping a task local to the CPU, whereas
migrating a task between LLCs or NUMA nodes may tip the balance in the
other direction.

With that in mind, while CFS is by and large a work conserving
scheduler, there are certain instances where the scheduler will choose
to keep a task local to a CPU when it would have been more optimal to
migrate it to an idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound,
and to exhibit generally poor cache locality. To try and address this,
we set several debugfs (/sys/kernel/debug/sched) knobs on our HHVM
workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide
a substantial performance win; roughly a 20% improvement in throughput.
Worth noting, however, is that this improvement is _not_ at full
machine saturation.
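Concretely, the four tunings above amount to the following debugfs
writes (this sketch assumes a kernel with CONFIG_SCHED_DEBUG and
debugfs mounted at the usual location; exact knob names vary between
kernel versions):

```shell
# Treat migrations as free, encouraging maximal work conservation.
echo 0        > /sys/kernel/debug/sched/migration_cost_ns
# Lengthen slices to reduce context-switch overhead.
echo 20000000 > /sys/kernel/debug/sched/latency_ns
echo 10000000 > /sys/kernel/debug/sched/min_granularity_ns
echo 12000000 > /sys/kernel/debug/sched/wakeup_granularity_ns
```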
That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared wakequeue" (swqueue) feature proposed in this patch set. The
idea behind swqueue is simple: it enables the scheduler to be
aggressively work conserving by placing a waking task into a per-LLC
FIFO queue, which another core in the LLC can then pull from before it
goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and
p99 latencies, in HHVM. These performance improvements were in addition
to the wins from the debugfs knobs mentioned above.

Design
======

The design of swqueue is quite simple. An swqueue is simply a struct
list_head and a spinlock:

struct swqueue {
	struct list_head list;
	spinlock_t lock;
} ____cacheline_aligned;

We create a struct swqueue per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs.

When a task first wakes up, it enqueues itself in the swqueue of its
current LLC at the end of enqueue_task_fair(). Enqueues only happen if
the task was not manually migrated to the current core by
select_task_rq(), and is not pinned to a specific CPU.

A core will pull a task from its LLC's swqueue before calling
newidle_balance().

Difference from SIS_NODE
========================

In [0] Peter proposed a patch that addresses Tejun's observations that
when workqueues are targeted towards a specific LLC on his Zen2 machine
with small CCXs, there would be significant idle time due to
select_idle_sibling() not considering anything outside of the current
LLC.

This patch (SIS_NODE) is essentially the complement to the proposal
here. SIS_NODE causes waking tasks to look for idle cores in
neighboring LLCs on the same die, whereas swqueue causes cores about to
go idle to look for enqueued tasks.
That said, in their current forms, the two features operate at
different scopes: SIS_NODE searches for idle cores between LLCs, while
swqueue enqueues tasks within a single LLC. The SIS_NODE patch was
since removed in [1], but we elect to compare its performance to
swqueue given that, as described above, it's conceptually
complementary.

[0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
[1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/

I observed that while SIS_NODE works quite well for hosts with small
CCXs, it can result in degraded performance on machines either with a
large number of total cores in a CCD, or for which the cache miss
penalty of migrating between CCXs is high, even on the same die. For
example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed through
a single link to the IO die, and thus have similar cache miss latencies
as cores in remote CCDs.

Such subtleties could be taken into account with SIS_NODE, but
regardless, both features are conceptually complementary sides of the
same coin: SIS_NODE searches for idle cores for waking threads, whereas
swqueue searches for available work before a core goes idle.

Results
=======

Note that the motivation for the shared wakequeue feature was
originally arrived at using experiments in the sched_ext framework
that's currently being proposed upstream. The ~1 - 1.6% improvement in
HHVM throughput is similarly visible using work-conserving sched_ext
schedulers (even very simple ones like global FIFO). In both single and
multi socket / CCX hosts, this can measurably improve performance.

In addition to the performance gains observed on our internal web
workloads, we also observed an improvement in common workloads such as
kernel compile when running with shared wakequeue.
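For reference, the four configurations compared below can be toggled at
runtime through the scheduler features interface (this assumes a
CONFIG_SCHED_DEBUG kernel with debugfs mounted; the SIS_NODE feature
only exists with Peter's patch from [0] applied):

```shell
# Toggle the features under test; prefix with NO_ to disable.
echo SWQUEUE     > /sys/kernel/debug/sched/features
echo NO_SWQUEUE  > /sys/kernel/debug/sched/features
echo SIS_NODE    > /sys/kernel/debug/sched/features
echo NO_SIS_NODE > /sys/kernel/debug/sched/features
```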
Here are the results of running make -j$(nproc) built-in.a on several
different types of hosts configured with make allyesconfig on commit
a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink") on Linus' tree (boost was disabled on all of these
hosts when the experiments were performed):

Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
CPU max MHz: 5879.8818
CPU min MHz: 3000.0000

                          o____________o_______o
                          |    mean    |  CPU  |
                          o------------o-------o
NO_SWQUEUE + NO_SIS_NODE: |  590.52s   | 3103% |
NO_SWQUEUE + SIS_NODE:    |  590.80s   | 3102% |
SWQUEUE + NO_SIS_NODE:    |  589.65s   | 3116% |
SWQUEUE + SIS_NODE:       |  589.99s   | 3115% |
                          o------------o-------o

Takeaway: swqueue doesn't seem to provide a statistically significant
improvement for kernel compile on my 7950X. SIS_NODE similarly does not
have a noticeable effect on performance.

-------------------------------------------------------------------------------

Single-socket | 72-core | 6-CCX | AMD Milan Zen3
CPU max MHz: 3245.0190
CPU min MHz: 700.0000

                          o_____________o_______o
                          |    mean     |  CPU  |
                          o-------------o-------o
NO_SWQUEUE + NO_SIS_NODE: |  1608.69s   | 6488% |
NO_SWQUEUE + SIS_NODE:    |  1610.24s   | 6473% |
SWQUEUE + NO_SIS_NODE:    |  1605.80s   | 6504% |
SWQUEUE + SIS_NODE:       |  1606.96s   | 6488% |
                          o-------------o-------o

Takeaway: swqueue does provide a small statistically significant
improvement on Milan, but the compile times in general were quite long
relative to the 7950X Zen4 and the Bergamo Zen4c due to the lower clock
frequency. Milan also has larger CCXs than Bergamo, so it stands to
reason that select_idle_sibling() will have an easier time finding idle
cores inside the current CCX.

It also seems logical that SIS_NODE would hurt performance a bit here,
as all cores / CCXs are in the same NUMA node, so select_idle_sibling()
has to iterate over 72 cores, delaying task wakeup. That said, I'm not
sure that's a viable theory if total CPU% is lower with SIS_NODE.
-------------------------------------------------------------------------------

Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
CPU max MHz: 1200.0000
CPU min MHz: 1000.0000

                          o____________o________o
                          |    mean    |  CPU   |
                          o------------o--------o
NO_SWQUEUE + NO_SIS_NODE: |  322.44s   | 15534% |
NO_SWQUEUE + SIS_NODE:    |  324.39s   | 15508% |
SWQUEUE + NO_SIS_NODE:    |  321.54s   | 15603% |
SWQUEUE + SIS_NODE:       |  321.88s   | 15622% |
                          o------------o--------o

Takeaway: swqueue barely beats NO_SWQUEUE + NO_SIS_NODE, to the point that it's arguably not statistically significant. SIS_NODE results in a ~.9% performance degradation, likely for the same reason as on Milan: the host has a large number of LLCs within a single socket, so task wakeup latencies suffer due to select_idle_node() searching up to 11 CCXs.

Conclusion
==========

swqueue in this form seems to provide a small but noticeable win for front-end CPU-bound workloads spread over multiple CCXs. The reason seems fairly straightforward: swqueue encourages work conservation inside a CCX by having a CPU do an O(1) pull from a per-LLC queue of runnable tasks. As mentioned above, it is complementary to SIS_NODE, which searches for idle cores on the wakeup path.

While swqueue in this form encourages work conservation, it of course does not guarantee it, given that we don't implement any kind of work stealing between swqueues. In the future, we could potentially push CPU utilization even higher by enabling work stealing between swqueues, likely between CCXs on the same NUMA node.
Originally-by: Roman Gushchin
Signed-off-by: David Vernet
---
 include/linux/sched.h |   2 +
 kernel/sched/core.c   |   2 +
 kernel/sched/fair.c   | 175 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  |   2 +
 4 files changed, 176 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1292d38d66cc..1f4fd22f88a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -770,6 +770,8 @@ struct task_struct {
 	unsigned long		wakee_flip_decay_ts;
 	struct task_struct	*last_wakee;
 
+	struct list_head	swqueue_node;
+
 	/*
 	 * recent_used_cpu is initially set as the last CPU used by a task
 	 * that wakes affine another task. Waker/wakee relationships can
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d911b0631e7b..e04f0daf1f05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4533,6 +4533,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
+	INIT_LIST_HEAD(&p->swqueue_node);
 #endif
 	init_sched_mm_cid(p);
 }
@@ -9872,6 +9873,7 @@ void __init sched_init_smp(void)
 
 	init_sched_rt_class();
 	init_sched_dl_class();
+	init_sched_fair_class_late();
 
 	sched_smp_initialized = true;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 807986bd6ea6..29fe25794884 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -139,17 +139,151 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 }
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 
+/**
+ * struct swqueue - Per-LLC queue structure for enqueuing and pulling waking
+ * tasks.
+ *
+ * WHAT
+ * ====
+ *
+ * This structure enables the scheduler to be more aggressively work
+ * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
+ * pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores a pointer to its LLC's swqueue via struct cfs_rq. Waking
+ * tasks are enqueued in a swqueue at the end of enqueue_task_fair(), and are
+ * opportunistically pulled from the swqueue in pick_next_task_fair() prior to
+ * invoking newidle_balance(). Tasks enqueued in a swqueue may be scheduled
+ * prior to being pulled from the swqueue, in which case they're simply
+ * removed from the swqueue. A waking task is only enqueued to a swqueue when
+ * it was _not_ manually migrated to the current runqueue by
+ * select_task_rq_fair().
+ *
+ * There is currently no task-stealing between swqueues in different LLCs,
+ * which means that swqueue is not fully work conserving. This could be added
+ * at a later time, with tasks likely only being stolen across swqueues on the
+ * same NUMA node to avoid violating NUMA affinities.
+ *
+ * HOW
+ * ===
+ *
+ * An swqueue is comprised of a list, and a spinlock for synchronization.
+ * Given that the critical section for a swqueue is typically a fast list
+ * operation, and that the swqueue is localized to a single LLC, the spinlock
+ * does not seem to be contended, even on a heavily utilized host. struct
+ * swqueues are also cacheline aligned to prevent false sharing between CPUs
+ * manipulating swqueues in other LLCs.
+ *
+ * WHY
+ * ===
+ *
+ * As mentioned above, the main benefit of swqueue is that it enables more
+ * aggressive work conservation in the scheduler. This can benefit workloads
+ * that benefit more from CPU utilization than from L1/L2 cache locality.
+ *
+ * swqueues are segmented across LLCs both to avoid contention on the swqueue
+ * spinlock by minimizing the number of CPUs that could contend on it, and to
+ * strike a balance between work conservation and L3 cache locality.
+ */
+struct swqueue {
+	struct list_head list;
+	spinlock_t lock;
+} ____cacheline_aligned;
+
 #ifdef CONFIG_SMP
-static void swqueue_enqueue(struct rq *rq, struct task_struct *p,
-			    int enq_flags)
-{}
+static struct swqueue *rq_swqueue(struct rq *rq)
+{
+	return rq->cfs.swqueue;
+}
+
+static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
+{
+	unsigned long flags;
+
+	struct task_struct *p;
+
+	spin_lock_irqsave(&swqueue->lock, flags);
+	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
+				     swqueue_node);
+	if (p)
+		list_del_init(&p->swqueue_node);
+	spin_unlock_irqrestore(&swqueue->lock, flags);
+
+	return p;
+}
+
+static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
+{
+	unsigned long flags;
+	struct swqueue *swqueue;
+	bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
+	bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
+
+	/*
+	 * Only enqueue the task in the shared wakequeue if:
+	 *
+	 * - SWQUEUE is enabled
+	 * - The task is on the wakeup path
+	 * - The task wasn't purposefully migrated to the current rq by
+	 *   select_task_rq()
+	 * - The task isn't pinned to a specific CPU
+	 */
+	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
+		return;
+
+	swqueue = rq_swqueue(rq);
+	spin_lock_irqsave(&swqueue->lock, flags);
+	list_add_tail(&p->swqueue_node, &swqueue->list);
+	spin_unlock_irqrestore(&swqueue->lock, flags);
+}
+
 static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
 {
-	return 0;
+	struct swqueue *swqueue;
+	struct task_struct *p = NULL;
+	struct rq *src_rq;
+	struct rq_flags src_rf;
+	int ret;
+
+	swqueue = rq_swqueue(rq);
+	if (!list_empty(&swqueue->list))
+		p = swqueue_pull_task(swqueue);
+
+	if (!p)
+		return 0;
+
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+
+	src_rq = task_rq_lock(p, &src_rf);
+
+	if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
+		src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
+
+	if (src_rq->cpu != rq->cpu)
+		ret = 1;
+	else
+		ret = -1;
+
+	task_rq_unlock(src_rq, p, &src_rf);
+
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+
+	return ret;
 }
 
 static void swqueue_remove_task(struct task_struct *p)
-{}
+{
+	unsigned long flags;
+	struct swqueue *swqueue;
+
+	if (!list_empty(&p->swqueue_node)) {
+		swqueue = rq_swqueue(task_rq(p));
+		spin_lock_irqsave(&swqueue->lock, flags);
+		list_del_init(&p->swqueue_node);
+		spin_unlock_irqrestore(&swqueue->lock, flags);
+	}
+}
 
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
@@ -12839,3 +12973,34 @@ __init void init_sched_fair_class(void)
 #endif /* SMP */
 
 }
+
+__init void init_sched_fair_class_late(void)
+{
+#ifdef CONFIG_SMP
+	int i;
+	struct swqueue *swqueue;
+	struct rq *rq;
+	struct rq *llc_rq;
+
+	for_each_possible_cpu(i) {
+		if (per_cpu(sd_llc_id, i) == i) {
+			llc_rq = cpu_rq(i);
+
+			swqueue = kzalloc_node(sizeof(struct swqueue),
+					       GFP_KERNEL, cpu_to_node(i));
+			INIT_LIST_HEAD(&swqueue->list);
+			spin_lock_init(&swqueue->lock);
+			llc_rq->cfs.swqueue = swqueue;
+		}
+	}
+
+	for_each_possible_cpu(i) {
+		rq = cpu_rq(i);
+		llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
+
+		if (rq == llc_rq)
+			continue;
+		rq->cfs.swqueue = llc_rq->cfs.swqueue;
+	}
+#endif /* SMP */
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5a86e9795731..daee5c64af87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,7 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
+	struct swqueue *swqueue;
 	/*
 	 * CFS load tracking
 	 */
@@ -2380,6 +2381,7 @@ extern void update_max_interval(void);
 
 extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
+extern void init_sched_fair_class_late(void);
 
 extern void reweight_task(struct task_struct *p, int prio);