From: David Vernet <void@manifault.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    bsegall@google.com, mgorman@suse.de, bristot@redhat.com,
    vschneid@redhat.com, youssefesmat@google.com, joelaf@google.com,
    roman.gushchin@linux.dev, yu.c.chen@intel.com, kprateek.nayak@amd.com,
    gautham.shenoy@amd.com, aboorvad@linux.vnet.ibm.com,
    wuyun.abel@bytedance.com, tj@kernel.org, kernel-team@meta.com
Subject: [PATCH v4 0/8] sched: Implement shared runqueue in fair.c
Date: Mon, 11 Dec 2023 18:31:33 -0600
Message-ID: <20231212003141.216236-1-void@manifault.com>

This is v4 of the shared runqueue patchset. This patch set is based off
of commit 418146e39891 ("freezer,sched: Clean saved_state when restoring
it during thaw") on the sched/core branch of tip.git.

In prior versions of this patch set, I was observing consistent and
statistically significant wins for several benchmarks when this feature
was enabled, such as kernel compile and hackbench. After rebasing onto
the latest sched/core on tip.git, I'm no longer observing these wins,
and in fact observe some performance loss with SHARED_RUNQ on hackbench.
I ended up bisecting this to when EEVDF was merged.

As I mentioned in [0], our plan for now is to take a step back and
re-evaluate how we want to proceed with this patch set. That said, I did
want to send this out in the interim in case it could be of interest to
anyone else who would like to continue to experiment with it.

[0]: https://lore.kernel.org/all/20231204193001.GA53255@maniforge/

v1 (RFC): https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
v2: https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/
v3: https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/

v3 -> v4 changes:
- Ensure the list is fully drained when toggling the feature on and off
  (noticed offline by Chris Mason, and independently by Aboorva Devarajan)
- Also check !is_migration_disabled() in shared_runq_pick_next_task() if
  we find a task in the shard, as is_cpu_allowed() doesn't check whether
  migration is disabled. Also do another is_cpu_allowed() check after
  re-acquiring the task rq lock, in case state has changed
- Statically initialize the shared_runq_node list node in the init task
- Only try to pull a task from the shared_runq if the rq's root domain
  has the overload bit set (K Prateek Nayak)
- Check this_rq->ttwu_pending after trying to pull a task from a shared
  runqueue shard, before going forward to load_balance() (K Prateek Nayak)
- Fix where we would try to skip over the lowest-level sched domain -- do
  the check in load_balance() instead of before it, as for_each_domain()
  iterates over all domains starting from the beginning (K Prateek Nayak)
- Add a selftest testcase which toggles the SHARED_RUNQ feature on and
  off in a loop (a rough equivalent is sketched at the end of this mail)

Worth noting is that there have been a number of suggestions for
improving this feature that were not included in this v4, such as:

- [1], where Chen Yu suggested not putting certain tasks on a shared_runq
  if e.g. p->avg_runtime <= sysctl_migration_cost. I elected not to
  include this, as it's a heuristic that could incorrectly prevent work
  conservation, which is the primary goal of the feature.
- [2], where K Prateek Nayak suggested adding a per-shard "overload" flag
  that can be set to avoid contending on the shard lock. This should be
  covered by checking the root domain overload flag.
- [3], where K Prateek Nayak suggested also checking
  rq->avg_idle < sd->max_newidle_lb_cost. This is a similar suggestion to
  Chen Yu's above, and I elected to leave it out here for the same
  reason: we want to encourage work conservation.
- [4], where Gautham Shenoy suggested iterating over all tasks in a shard
  until one is found that can be pulled, rather than bailing out after
  failing to migrate the HEAD task.

None of these ideas are unreasonable, and they may be worth applying if
they improve the feature for more general cases following further
testing. I left the patch set as is simply to keep the feature
"consistent" in encouraging work conservation, but that decision can be
revisited.

[1]: https://lore.kernel.org/all/ZO7e5YaS71cXVxQN@chenyu5-mobl2/
[2]: https://lore.kernel.org/all/20230831104508.7619-4-kprateek.nayak@amd.com/
[3]: https://lore.kernel.org/all/20230831104508.7619-3-kprateek.nayak@amd.com/
[4]: https://lore.kernel.org/lkml/ZJkqeXkPJMTl49GB@BLR-5CG11610CF.amd.com/

v2 -> v3 changes:
- Don't leave stale tasks in the lists when the SHARED_RUNQ feature is
  disabled (Abel Wu)
- Use a raw spinlock instead of spinlock_t (Peter)
- Fix the return value from shared_runq_pick_next_task() to match the
  semantics expected by newidle_balance() (Gautham, Abel)
- Fold the __enqueue_entity() / __dequeue_entity() patch into the
  previous patch (Peter)
- Skip <= LLC domains in newidle_balance() if SHARED_RUNQ is enabled
  (Peter)
- Properly support hotplug and recreating sched domains (Peter)
- Avoid unnecessary task_rq_unlock() + raw_spin_rq_lock() when
  src_rq == target_rq in shared_runq_pick_next_task() (Abel)
- Only issue list_del_init() in shared_runq_dequeue_task() if the task is
  still in the list after acquiring the lock (Aaron Lu)
- Slightly change shared_runq_shard_idx() to make it more likely to keep
  SMT siblings on the same bucket (Peter)

v1 -> v2 changes:
- Change the name from swqueue to shared_runq (Peter)
- Shard per-LLC shared runqueues to avoid contention on scheduler-heavy
  workloads (Peter)
- Pull tasks from the shared_runq in newidle_balance() rather than in
  pick_next_task_fair() (Peter and Vincent)
- Rename a few functions to reflect their actual purpose.
  For example, shared_runq_dequeue_task() instead of
  swqueue_remove_task() (Peter)
- Expose move_queued_task() from core.c rather than migrate_task_to()
  (Peter)
- Properly check is_cpu_allowed() when pulling a task from a shared_runq
  to ensure it can actually be migrated (Peter and Gautham)
- Dropped RFC tag

David Vernet (8):
  sched: Expose move_queued_task() from core.c
  sched: Move is_cpu_allowed() into sched.h
  sched: Tighten unpinned rq lock window in newidle_balance()
  sched: Check cpu_active() earlier in newidle_balance()
  sched: Enable sched_feat callbacks on enable/disable
  sched: Implement shared runqueue in CFS
  sched: Shard per-LLC shared runqueues
  sched: Add selftest for SHARED_RUNQ

 include/linux/sched.h                     |   2 +
 init/init_task.c                          |   1 +
 kernel/sched/core.c                       |  52 +--
 kernel/sched/debug.c                      |  18 +-
 kernel/sched/fair.c                       | 413 +++++++++++++++++++++-
 kernel/sched/features.h                   |   2 +
 kernel/sched/sched.h                      |  60 +++-
 kernel/sched/topology.c                   |   4 +-
 tools/testing/selftests/sched/Makefile    |   7 +-
 tools/testing/selftests/sched/config      |   2 +
 tools/testing/selftests/sched/test-swq.sh |  23 ++
 11 files changed, 521 insertions(+), 63 deletions(-)
 create mode 100755 tools/testing/selftests/sched/test-swq.sh
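
As a rough illustration of what the toggling stress test does, below is a
minimal, hypothetical C sketch of the same enable/disable loop. This is
not the test-swq.sh selftest from this series; it only assumes the
standard sched_feat debugfs interface, that debugfs is mounted at
/sys/kernel/debug, and that a kernel with this series applied exposes the
SHARED_RUNQ feature. The iteration count and delay are arbitrary, and it
must run as root.

/*
 * Illustrative sketch only: repeatedly toggle the SHARED_RUNQ scheduler
 * feature via the sched_feat debugfs file to stress the enable/disable
 * path. Not the selftest shipped in this series.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FEATURES_PATH "/sys/kernel/debug/sched/features"

/* Write a feature name ("SHARED_RUNQ" or "NO_SHARED_RUNQ") to debugfs. */
static int write_feat(const char *feat)
{
	FILE *f = fopen(FEATURES_PATH, "w");

	if (!f) {
		perror("fopen " FEATURES_PATH);
		return -1;
	}
	if (fputs(feat, f) == EOF) {
		perror("fputs");
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

int main(void)
{
	int i;

	/* Iteration count and delay are arbitrary for illustration. */
	for (i = 0; i < 100; i++) {
		if (write_feat("SHARED_RUNQ"))
			return EXIT_FAILURE;
		usleep(100 * 1000);
		if (write_feat("NO_SHARED_RUNQ"))
			return EXIT_FAILURE;
		usleep(100 * 1000);
	}
	return EXIT_SUCCESS;
}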