From patchwork Fri Sep 29 18:33:50 2023
X-Patchwork-Submitter: Mathieu Desnoyers
X-Patchwork-Id: 146764
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Ingo Molnar,
    Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
    Swapnil Sapkal, Aaron Lu, Chen Yu, Tim Chen, K Prateek Nayak,
    Gautham R. Shenoy, x86@kernel.org
Subject: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU
Date: Fri, 29 Sep 2023 14:33:50 -0400
Message-Id: <20230929183350.239721-1-mathieu.desnoyers@efficios.com>

Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature.
It biases select_task_rq towards the previous CPU if that CPU was
almost idle (avg_load <= 0.1% of its maximum), which eliminates
frequent task migrations from almost idle CPUs to completely idle
CPUs. This is achieved by using the CPU load of the previously used
CPU as the "almost idle" criterion in wake_affine_idle() and
select_idle_sibling().

The following benchmarks were performed on a v6.5.5 kernel with
mitigations=off.

This speeds up the following hackbench workload on a 192-core AMD
EPYC 9654 96-Core Processor machine (across 2 sockets):

  hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 32s (34% speedup). A sketch of the measurement setup is
appended after the patch.

We can observe that the number of migrations is significantly reduced
(-94%) with this patch, which may explain the speedup:

  Baseline:   118M cpu-migrations (9.286 K/sec)
  With patch:   7M cpu-migrations (0.709 K/sec)

As a consequence, the stalled backend cycles are reduced:

  Baseline:   8.16% backend cycles idle
  With patch: 6.70% backend cycles idle

Interestingly, the context-switch rate increases with the patch, but
this does not appear to be an issue performance-wise:

  Baseline:   454M context-switches (35.677 K/sec)
  With patch: 654M context-switches (62.290 K/sec)

This was developed as part of the investigation into a weird
regression reported by AMD, where adding a raw spinlock in the
scheduler context switch accelerated hackbench. It turned out that
replacing this raw spinlock with a loop of 10000x cpu_relax() within
do_idle() had similar benefits. This patch achieves a comparable
effect without the busy-waiting, by allowing select_task_rq to favor
an almost idle previously used CPU, based on that CPU's load.

The 0.1% avg_load threshold for considering a CPU almost idle was
identified empirically using the hackbench workload (see the
arithmetic note below the "---" marker for how it maps onto
LOAD_AVG_MAX).

Feedback is welcome. I am especially interested to learn whether this
patch has positive or detrimental effects on the performance of other
workloads.

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Swapnil Sapkal
Cc: Aaron Lu
Cc: Chen Yu
Cc: Tim Chen
Cc: K Prateek Nayak
Cc: Gautham R. Shenoy
Cc: x86@kernel.org
---
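Two notes for reviewers, kept below the "---" marker so they stay out
of the eventual changelog.

First, where the 0.1% figure comes from: almost_idle_cpu() compares
the previous CPU's load (with the waking task's contribution removed)
against LOAD_AVG_MAX / 1000. Assuming LOAD_AVG_MAX is still 47742,
the saturation value of the PELT geometric series, the cutoff works
out to:

  LOAD_AVG_MAX / 1000 = 47742 / 1000 = 47   (integer division)

i.e. roughly 0.1% of the maximum possible load average.

Second, for A/B testing on other workloads, the feature can be
flipped at runtime through the standard sched_feat debugfs interface
(requires CONFIG_SCHED_DEBUG=y and a mounted debugfs):

  # disable the bias
  echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
  # re-enable it
  echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features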
 kernel/sched/fair.c     | 18 +++++++++++++-----
 kernel/sched/features.h |  6 ++++++
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d9c2482c5a3..65a7d923ea61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6599,6 +6599,14 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
+static bool
+almost_idle_cpu(int cpu, struct task_struct *p)
+{
+	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
+		return false;
+	return cpu_load_without(cpu_rq(cpu), p) <= LOAD_AVG_MAX / 1000;
+}
+
 /*
  * The purpose of wake_affine() is to quickly determine on which CPU we can run
  * soonest. For the purpose of speed we only consider the waking and previous
@@ -6612,7 +6620,7 @@ static int wake_wide(struct task_struct *p)
  * for the overloaded case.
  */
 static int
-wake_affine_idle(int this_cpu, int prev_cpu, int sync)
+wake_affine_idle(int this_cpu, int prev_cpu, int sync, struct task_struct *p)
 {
 	/*
 	 * If this_cpu is idle, it implies the wakeup is from interrupt
@@ -6632,7 +6640,7 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;
 
-	if (available_idle_cpu(prev_cpu))
+	if (available_idle_cpu(prev_cpu) || almost_idle_cpu(prev_cpu, p))
 		return prev_cpu;
 
 	return nr_cpumask_bits;
@@ -6687,7 +6695,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	int target = nr_cpumask_bits;
 
 	if (sched_feat(WA_IDLE))
-		target = wake_affine_idle(this_cpu, prev_cpu, sync);
+		target = wake_affine_idle(this_cpu, prev_cpu, sync, p);
 
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
@@ -7139,7 +7147,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
-	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+	if ((available_idle_cpu(target) || sched_idle_cpu(target) || (prev == target && almost_idle_cpu(target, p))) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
@@ -7147,7 +7155,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
 	if (prev != target && cpus_share_cache(prev, target) &&
-	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev) || almost_idle_cpu(prev, p)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..b06d06c2b728 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -37,6 +37,12 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
 
+/*
+ * Bias runqueue selection towards previous runqueue if it is almost
+ * idle.
+ */
+SCHED_FEAT(WAKEUP_BIAS_PREV_IDLE, true)
+
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
 SCHED_FEAT(DOUBLE_TICK, false)
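For anyone who wants to reproduce the counters quoted above: the
changelog only spells out the hackbench invocation, so the exact perf
command below is an assumption on my part, using the standard
software event names that match the reported metrics:

  # assumed measurement setup; only the hackbench command line is
  # taken verbatim from the changelog above
  perf stat -e cpu-migrations,context-switches,stalled-cycles-backend \
      hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

Toggling WAKEUP_BIAS_PREV_IDLE off (see the note after the "---"
marker) and re-running the same command should reproduce the baseline
numbers.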