[v4,0/8] sched: Implement shared runqueue in fair.c

Message ID 20231212003141.216236-1-void@manifault.com

David Vernet Dec. 12, 2023, 12:31 a.m. UTC
This is v4 of the shared runqueue patchset. This patch set is based on
commit 418146e39891 ("freezer,sched: Clean saved_state when restoring
it during thaw") on the sched/core branch of tip.git.

In prior versions of this patch set, I was observing consistent and
statistically significant wins for several benchmarks when this feature
was enabled, such as kernel compile and hackbench. After rebasing onto
the latest sched/core on tip.git, I'm no longer observing these wins,
and in fact see some performance loss with SHARED_RUNQ on hackbench.
I bisected the regression to the point where EEVDF was merged.

As I mentioned in [0], our plan for now is to take a step back and
re-evaluate how we want to proceed with this patch set. That said, I did
want to send this out in the interim in case it could be of interest to
anyone else who would like to continue to experiment with it.

[0]: https://lore.kernel.org/all/20231204193001.GA53255@maniforge/

v1 (RFC): https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
v2: https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/
v3: https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/

v3 -> v4 changes:
- Ensure list is fully drained when toggling feature on and off (noticed
  offline by Chris Mason, and independently by Aboorva Devarajan)

- Also check !is_migration_disabled() in shared_runq_pick_next_task() if
  we find a task in the shard, as is_cpu_allowed() doesn't account for
  migration being disabled. Additionally, re-check is_cpu_allowed() after
  re-acquiring the task rq lock in case its state has changed (see the
  sketch after this list)

- Statically initialize the shared_runq_node list node in the init task.

- Only try to pull a task from the shared_runq if the rq's root domain has
  the overload bit set (K Prateek Nayak)

- Check this_rq->ttwu_pending after trying to pull a task from a shared
  runqueue shard before going forward to load_balance() (K Prateek Nayak)

- Fix the logic for skipping the lowest-level sched domain: do the check
  in load_balance() instead of before the loop, as for_each_domain()
  iterates over all domains starting from the lowest level (K Prateek
  Nayak)

- Add a selftest testcase which toggles the SHARED_RUNQ feature on and
  off in a loop
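
As a rough illustration of how these checks compose, below is a sketch
of the pull path. This is illustrative rather than the patch verbatim:
shared_runq_pop_task() stands in for the helper that pops the HEAD task
under the shard's raw spinlock, and the return convention follows what
newidle_balance() expects (> 0: task pulled, 0: nothing found, < 0: the
rq lock was dropped, so the caller should repick).

static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
{
        struct task_struct *p;
        struct rq *src_rq;
        struct rq_flags src_rf;
        int ret = -1;

        /* Hypothetical helper: pop HEAD under the shard's raw spinlock. */
        p = shared_runq_pop_task(rq);
        if (!p)
                return 0;

        rq_unpin_lock(rq, rf);
        raw_spin_rq_unlock(rq);

        src_rq = task_rq_lock(p, &src_rf);

        /*
         * is_cpu_allowed() doesn't account for migration being
         * disabled, so check is_migration_disabled() too. Both are
         * evaluated under the task rq lock, since the task's state may
         * have changed during the window in which no lock was held.
         */
        if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p) &&
            !is_migration_disabled(p) && is_cpu_allowed(p, cpu_of(rq))) {
                update_rq_clock(src_rq);
                src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
                ret = 1;
        }

        if (src_rq != rq) {
                task_rq_unlock(src_rq, p, &src_rf);
                raw_spin_rq_lock(rq);
        } else {
                rq_unpin_lock(rq, &src_rf);
                raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
        }
        rq_repin_lock(rq, rf);

        return ret;
}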

It's worth noting that several suggestions for improving this feature
were not included in this v4, such as:

- [1], where Chen Yu <yu.c.chen@intel.com> suggested not putting certain
  tasks on a shared_runq if e.g. p->avg_runtime <= sysctl_migration_cost.
  I elected to not include this, as it's a heuristic that could
  incorrectly prevent work conservation, which is the primary goal of
  the feature.

- [2], where K Prateek Nayak <kprateek.nayak@amd.com> suggested adding a
  per-shard "overload" flag that can be set to avoid contending on the
  shard lock. This should be covered by checking the root domain
  overload flag.

- [3], where K Prateek Nayak <kprateek.nayak@amd.com> suggested also
  checking rq->avg_idle < sd->max_newidle_lb_cost. This is similar to
  Chen Yu's suggestion above, and I left it out for the same reason: we
  want to encourage work conservation.

- [4], where Gautham Shenoy <gautham.shenoy@amd.com> suggested iterating
  over all tasks in a shard until one is found that can be pulled,
  rather than bailing out after failing to migrate the HEAD task (see
  the sketch after the links below).

None of these ideas are unreasonable, and they may be worth applying if
further testing shows they improve the feature in more general cases. I
left the patch set as is simply to keep the feature "consistent" in
encouraging work conservation, but that decision can be revisited.

[1]: https://lore.kernel.org/all/ZO7e5YaS71cXVxQN@chenyu5-mobl2/
[2]: https://lore.kernel.org/all/20230831104508.7619-4-kprateek.nayak@amd.com/
[3]: https://lore.kernel.org/all/20230831104508.7619-3-kprateek.nayak@amd.com/
[4]: https://lore.kernel.org/lkml/ZJkqeXkPJMTl49GB@BLR-5CG11610CF.amd.com/
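
To illustrate [4], a scan of the whole shard might look something like
the sketch below. The helper name shared_runq_pop_any() is
hypothetical; note that the checks taken under the shard lock are only
advisory, and would still need to be re-validated under the task rq
lock as described in the v3 -> v4 changes above.

static struct task_struct *
shared_runq_pop_any(struct shared_runq_shard *shard, int target_cpu)
{
        struct task_struct *p, *tmp, *ret = NULL;

        raw_spin_lock(&shard->lock);
        list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node) {
                /* Advisory; re-validated later under the task rq lock. */
                if (!is_migration_disabled(p) &&
                    is_cpu_allowed(p, target_cpu)) {
                        list_del_init(&p->shared_runq_node);
                        ret = p;
                        break;
                }
        }
        raw_spin_unlock(&shard->lock);

        return ret;
}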

v2 -> v3 changes:
- Don't leave stale tasks in the lists when the SHARED_RUNQ feature is
  disabled (Abel Wu)

- Use raw spin lock instead of spinlock_t (Peter)

- Fix return value from shared_runq_pick_next_task() to match the
  semantics expected by newidle_balance() (Gautham, Abel)

- Fold patch __enqueue_entity() / __dequeue_entity() into previous patch
  (Peter)

- Skip <= LLC domains in newidle_balance() if SHARED_RUNQ is enabled,
  since the shared runqueue already covers the LLC (Peter; see the
  sketch after this list)

- Properly support hotplug and recreating sched domains (Peter)

- Avoid unnecessary task_rq_unlock() + raw_spin_rq_lock() when src_rq ==
  target_rq in shared_runq_pick_next_task() (Abel)

- Only issue list_del_init() in shared_runq_dequeue_task() if the task
  is still in the list after acquiring the lock (Aaron Lu)

- Slightly change shared_runq_shard_idx() to make it more likely to keep
  SMT siblings in the same bucket (Peter)
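
For context on how the newidle_balance() pieces above fit together with
the v3 -> v4 gating (the root domain overload bit and ttwu_pending
checks), the flow is approximately as follows. This is a sketch of the
shape of the code, not the patch verbatim, and as noted earlier the
<= LLC check moved into load_balance() in v4.

        /* In newidle_balance(), before the load_balance() loop: */
        if (sched_feat(SHARED_RUNQ) && READ_ONCE(this_rq->rd->overload)) {
                pulled_task = shared_runq_pick_next_task(this_rq, rf);
                if (pulled_task)
                        return pulled_task;

                /* A pending wakeup will enqueue a task here soon; bail. */
                if (READ_ONCE(this_rq->ttwu_pending))
                        return 0;
        }

        /* In load_balance(): the shard already spans the LLC, so treat
         * domains at or below it as balanced. */
        if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
                return 0;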

v1 -> v2 changes:
- Change name from swqueue to shared_runq (Peter)

- Shard per-LLC shared runqueues to avoid contention on scheduler-heavy
  workloads (Peter; see the data-structure sketch after this list)

- Pull tasks from the shared_runq in newidle_balance() rather than in
  pick_next_task_fair() (Peter and Vincent)

- Rename a few functions to reflect their actual purpose. For example,
  shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)

- Expose move_queued_task() from core.c rather than migrate_task_to()
  (Peter)

- Properly check is_cpu_allowed() when pulling a task from a shared_runq
  to ensure it can actually be migrated (Peter and Gautham)

- Dropped RFC tag
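
Since both the sharding and the shard-index tweak are easiest to see in
code, here is a sketch of the data layout. Field names follow the
series where they appear in the changelogs above; the rest is
illustrative.

struct shared_runq_shard {
        struct list_head list;
        raw_spinlock_t lock;
} ____cacheline_aligned;

struct shared_runq {
        unsigned int num_shards;
        struct shared_runq_shard shards[];
} ____cacheline_aligned;

/*
 * Halving the CPU id before taking the modulus makes adjacent CPU ids
 * (typically SMT siblings) more likely to land in the same shard.
 */
static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
{
        return (cpu >> 1) % runq->num_shards;
}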


David Vernet (8):
  sched: Expose move_queued_task() from core.c
  sched: Move is_cpu_allowed() into sched.h
  sched: Tighten unpinned rq lock window in newidle_balance()
  sched: Check cpu_active() earlier in newidle_balance()
  sched: Enable sched_feat callbacks on enable/disable
  sched: Implement shared runqueue in CFS
  sched: Shard per-LLC shared runqueues
  sched: Add selftest for SHARED_RUNQ

 include/linux/sched.h                     |   2 +
 init/init_task.c                          |   1 +
 kernel/sched/core.c                       |  52 +--
 kernel/sched/debug.c                      |  18 +-
 kernel/sched/fair.c                       | 413 +++++++++++++++++++++-
 kernel/sched/features.h                   |   2 +
 kernel/sched/sched.h                      |  60 +++-
 kernel/sched/topology.c                   |   4 +-
 tools/testing/selftests/sched/Makefile    |   7 +-
 tools/testing/selftests/sched/config      |   2 +
 tools/testing/selftests/sched/test-swq.sh |  23 ++
 11 files changed, 521 insertions(+), 63 deletions(-)
 create mode 100755 tools/testing/selftests/sched/test-swq.sh