From patchwork Mon Mar 13 02:14:42 2023
X-Patchwork-Submitter: Mathieu Desnoyers
X-Patchwork-Id: 68614
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Ingo Molnar, Thomas Gleixner,
    Vincent Guittot, Steven Rostedt, Ben Segall, Julien Desfossez,
    Mathieu Desnoyers
Subject: [RFC PATCH] sched/fair: scale vruntime delta on migration
Date: Sun, 12 Mar 2023 22:14:42 -0400
Message-Id: <20230313021442.115425-1-mathieu.desnoyers@efficios.com>

On migration, use the runqueue spread of the source and destination
runqueues to scale the vruntime delta of the scheduling entity.

Without this change, a task migrated from a very busy runqueue (where
vruntime advances quickly) to a less busy runqueue (where vruntime
advances more slowly) can be enqueued far past the current end of the
destination runqueue. This inflates the destination runqueue's spread
and keeps the migrated task from being scheduled until min_vruntime
catches up with its position. The aim is to avoid long scheduling
latencies on migration from busy to less busy runqueues.
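As an illustration, with made-up numbers: a task dequeued 30 ms past
min_vruntime on a source runqueue with a 60 ms spread, then migrated to
a destination runqueue with a 10 ms spread, gets enqueued at

    30 ms * (10 ms / 60 ms) = 5 ms

past the destination min_vruntime, rather than at the unscaled 30 ms.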
This should help with asymmetric workloads showing spurious long
scheduling latencies on large SMP systems.

[ Based on v6.1.18. ]
[ Testing/feedback welcome. ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/linux/sched.h |  2 +
 kernel/sched/fair.c   | 86 +++++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h  |  1 +
 3 files changed, 74 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..93e0b2dbe671 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -558,6 +558,8 @@ struct sched_entity {
 
 	u64				nr_migrations;
 
+	u64				src_rq_spread;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
 	struct sched_entity		*parent;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c3d0d49c80e..0e8159237442 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -589,6 +589,25 @@ static inline bool entity_before(struct sched_entity *a,
 #define __node_2_se(node) \
 	rb_entry((node), struct sched_entity, run_node)
 
+struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
+
+	if (!last)
+		return NULL;
+
+	return __node_2_se(last);
+}
+
+static u64 calc_cfs_rq_spread(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *last_se = __pick_last_entity(cfs_rq);
+
+	if (!last_se)
+		return 0;
+	return last_se->vruntime - cfs_rq->min_vruntime;
+}
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
@@ -615,6 +634,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	/* ensure we never gain time by being placed backwards. */
 	u64_u32_store(cfs_rq->min_vruntime,
 		      max_vruntime(cfs_rq->min_vruntime, vruntime));
+	u64_u32_store(cfs_rq->vruntime_spread, calc_cfs_rq_spread(cfs_rq));
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -656,16 +676,6 @@ static struct sched_entity *__pick_next_entity(struct sched_entity *se)
 }
 
 #ifdef CONFIG_SCHED_DEBUG
-struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
-{
-	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
-
-	if (!last)
-		return NULL;
-
-	return __node_2_se(last);
-}
-
 /**************************************************************
  * Scheduling class statistics methods:
  */
@@ -4683,11 +4693,13 @@ static inline bool cfs_bandwidth_used(void);
  *	dequeue
  *	  update_curr()
  *	    update_min_vruntime()
+ *	  src_rq_spread = vruntime_spread
  *	  vruntime -= min_vruntime
  *
  *	enqueue
  *	  update_curr()
  *	    update_min_vruntime()
+ *	  vruntime = vruntime * vruntime_spread / src_rq_spread
  *	  vruntime += min_vruntime
  *
  * this way the vruntime transition between RQs is done when both
@@ -4696,17 +4708,58 @@ static inline bool cfs_bandwidth_used(void);
  * WAKEUP (remote)
  *
  *	->migrate_task_rq_fair() (p->state == TASK_WAKING)
+ *	  src_rq_spread = vruntime_spread
 *	  vruntime -= min_vruntime
 *
 *	enqueue
 *	  update_curr()
 *	    update_min_vruntime()
+ *	  vruntime = vruntime * vruntime_spread / src_rq_spread
 *	  vruntime += min_vruntime
 *
- * this way we don't have the most up-to-date min_vruntime on the originating
- * CPU and an up-to-date min_vruntime on the destination CPU.
+ * this way we don't have the most up-to-date min_vruntime nor vruntime_spread
+ * on the originating CPU and an up-to-date min_vruntime and vruntime_spread on
+ * the destination CPU.
+ *
+ * On migration, scale the vruntime delta from the source CPU vruntime spread to
+ * the destination CPU vruntime spread.
  */
+static void
+renorm_vruntime(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	u64 dst_rq_spread = calc_cfs_rq_spread(cfs_rq),
+	    src_rq_spread = se->src_rq_spread;
+
+	if (!dst_rq_spread) {
+		/*
+		 * If the destination rq has only a single or no task, place
+		 * this entity at min_vruntime.
+		 */
+		se->vruntime = 0;
+	} else if (!src_rq_spread) {
+		/*
+		 * If the source rq had only a single task, enqueue this
+		 * entity into the destination rq as we would do the initial
+		 * placement of an entity.
+		 */
+		if (sched_feat(START_DEBIT)) {
+			se->vruntime = sched_vslice(cfs_rq, se);
+		} else {
+			se->vruntime = 0;
+		}
+	} else {
+		/*
+		 * Scale the vruntime delta from the source rq runtime spread
+		 * to the destination rq runtime spread.
+		 */
+		if (dst_rq_spread != src_rq_spread) {
+			se->vruntime *= dst_rq_spread;
+			do_div(se->vruntime, src_rq_spread);
+		}
+	}
+	se->vruntime += cfs_rq->min_vruntime;
+}
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -4718,7 +4771,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * update_curr().
 	 */
 	if (renorm && curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		renorm_vruntime(cfs_rq, se);
 
 	update_curr(cfs_rq);
 
@@ -4729,7 +4782,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * fairness detriment of existing tasks.
 	 */
 	if (renorm && !curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		renorm_vruntime(cfs_rq, se);
 
 	/*
 	 * When enqueuing a sched_entity, we must:
@@ -4849,8 +4902,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * update_min_vruntime() again, which will discount @se's position and
 	 * can move min_vruntime forward still more.
 	 */
-	if (!(flags & DEQUEUE_SLEEP))
+	if (!(flags & DEQUEUE_SLEEP)) {
+		se->src_rq_spread = calc_cfs_rq_spread(cfs_rq);
 		se->vruntime -= cfs_rq->min_vruntime;
+	}
 
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
@@ -7427,6 +7482,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 	if (READ_ONCE(p->__state) == TASK_WAKING) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
+		se->src_rq_spread = u64_u32_load(cfs_rq->vruntime_spread);
 		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d6d488e8eb55..6c657d371090 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,6 +556,7 @@ struct cfs_rq {
 	u64			exec_clock;
 	u64			min_vruntime;
+	u64			vruntime_spread;
 #ifdef CONFIG_SCHED_CORE
 	unsigned int		forceidle_seq;
 	u64			min_vruntime_fi;
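For anyone who wants to experiment with the scaling policy outside the
kernel, below is a minimal standalone C sketch of the renormalization
arithmetic in renorm_vruntime() (an approximation, not kernel code: the
hypothetical "slice" parameter stands in for the sched_vslice() /
START_DEBIT initial placement, and plain 64-bit division replaces
do_div()):

    /*
     * Standalone sketch of the renormalization arithmetic.
     * All names are illustrative; this is not kernel API.
     */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t renorm_delta(uint64_t delta, uint64_t src_spread,
                                 uint64_t dst_spread, uint64_t slice)
    {
            /* Destination has a single task or none: place at min_vruntime. */
            if (!dst_spread)
                    return 0;
            /* Source had a single task: mimic initial placement. */
            if (!src_spread)
                    return slice;
            /* Scale the delta by the ratio of the runqueue spreads. */
            return delta * dst_spread / src_spread;
    }

    int main(void)
    {
            /* 30 ms past min_vruntime; 60 ms source, 10 ms destination spread. */
            uint64_t d = renorm_delta(30000000ULL, 60000000ULL,
                                      10000000ULL, 3000000ULL);
            printf("%llu\n", (unsigned long long)d); /* prints 5000000 (5 ms) */
            return 0;
    }

Built with any C compiler, this prints 5000000 (5 ms), matching the
worked example in the changelog above.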