Message ID: 20231106193524.866104-1-jstultz@google.com
Headers:
Date: Mon, 6 Nov 2023 19:34:43 +0000
From: John Stultz <jstultz@google.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: John Stultz <jstultz@google.com>, Joel Fernandes <joelaf@google.com>, Qais Yousef <qyousef@google.com>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Valentin Schneider <vschneid@redhat.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Zimuzo Ezeozue <zezeozue@google.com>, Youssef Esmat <youssefesmat@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>, Boqun Feng <boqun.feng@gmail.com>, "Paul E . McKenney" <paulmck@kernel.org>, kernel-team@android.com
Subject: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority Inheritance v6
Message-ID: <20231106193524.866104-1-jstultz@google.com>
Series
Proxy Execution: A generalized form of Priority Inheritance v6
Message
John Stultz
Nov. 6, 2023, 7:34 p.m. UTC
Stabilizing this Proxy Execution series has unfortunately continued to be a challenging task. Since the v5 release, I’ve been focused on getting the deactivated/sleeping owner enqueuing functionality, which I left out of v5, stabilized. I’ve managed to rework enough to avoid the crashes previously tripped with the locktorture & ww_mutex selftests, so I feel it’s much improved, but I do still see some issues (blocked waitqueues and hung task watchdogs firing) after stressing the system for many hours in a 64 cpu qemu environment (though this issue seems to be introduced earlier in the series with proxy-migration/return-migration).

I still haven’t had time to focus on testing and debugging the chain migration patches, so I’ve left that patch out of this submission for now, but will try to get it included in the next revision.

This patch series is actually coarser than what I’ve been developing with, as there are a number of small “test” steps to help validate behavior I changed, which would then be replaced by the real logic afterwards. Including those here would just cause more work for reviewers, so I’ve folded them together; if you’re interested, you can find the fine-grained tree here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6-fine-grained

As mentioned previously, this Proxy Execution series has a long history: first described in a paper[1] by Watkins, Straub, and Niehaus, then patches from Peter Zijlstra, extended with lots of work by Juri Lelli, Valentin Schneider, and Connor O'Brien. (And thank you to Steven Rostedt for providing additional details here!)

So again, many thanks to those above, as all the credit for this series really is due to them, while the mistakes are likely mine.

Overview:
---------
Proxy Execution is a generalized form of priority inheritance.
Classic priority inheritance works well for real-time tasks, where there is a straightforward priority order to how things are run. But it breaks down when used between CFS or DEADLINE tasks, as there are lots of parameters involved outside of just the task’s nice value when selecting the next task to run (via pick_next_task()). So ideally we want to imbue the mutex holder with all the scheduler attributes of the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked on, and which task holds a specific mutex
* Special handling for when we select a blocked task to run, so that we instead run the mutex holder

By leaving blocked tasks on the runqueue, we allow pick_next_task() to choose the task that should run next (even if it’s blocked waiting on a mutex). If we do select a blocked task, we look at the task’s blocked_on mutex, and from there look at the mutex’s owner task. In the simple case, the task which owns the mutex is what we then choose to run, allowing it to release the mutex.

This means that instead of just tracking “curr”, the scheduler needs to track both the scheduler context (what was picked, and all the state used for scheduling decisions) and the execution context (what we’re actually running).

In this way, the mutex owner is run “on behalf” of the blocked task that was picked to run, essentially inheriting the scheduler context of the waiting blocked task.

As Connor outlined in a previous submission of this patch series, this raises a number of complicated situations: the mutex owner might itself be blocked on another mutex, or it could be sleeping, running on a different CPU, in the process of migrating between CPUs, etc.
The complexity from this is imposing, but currently in Android we have a large number of cases where we are seeing priority inversion (not unbounded, but much longer than we’d like) between “foreground” and “background” SCHED_NORMAL applications. As a result, currently we cannot usefully limit background activity without introducing inconsistent behavior. So Proxy Execution is a very attractive approach to resolving this critical issue.

New in v6:
---------
* Big rework of blocked owner enqueuing, re-adding the functionality to the series
* Added a boot option proxy_exec= and additional logic to allow the functionality to be toggled at boot time
* Minor tweaks to wake_q logic suggested by Waiman Long
* Minor fixups (Reported-by: kernel test robot <lkp@intel.com>)
* Various cleanups and optimizations, as well as fixes for bugs found in v5

Also, I know Peter didn’t like the blocked_on wrappers, so I previously dropped them, but I found them (and the debug checks within) crucial to working out issues in this patch series. I’m fine to consider dropping them later if they are still objectionable, but their utility was too high at this point to get rid of them.

Issues still to address:
---------
* As already mentioned, the sleeping owner handling (where we deactivate waiting tasks and enqueue them onto a list, then reactivate them when the owner wakes up) has been very difficult to get right. This is in part because when we want to activate tasks, we’re already holding a task.pi_lock and an rq_lock, just not the locks for the task we’re activating, nor the rq we’re enqueuing it onto. So there has to be a bit of lock juggling to drop and acquire the right locks (in the right order).
* Similarly, as we’re often dealing with lists of tasks or chains of tasks and mutexes, iterating across these chains of objects can be done safely while holding the rq lock, but as these chains can cross runqueues, our ability to traverse the chains safely is somewhat limited.
* Additionally, in the sleeping owner enqueueing, as well as in proxy return migration, we seem to be constantly working against the proper locking order (task.pi_lock -> rq lock -> mutex.wait_lock -> task.blocked_lock), having to repeatedly “swim upstream” where we need a higher-order lock than what we’re already holding. Suggestions for different approaches to the serialization, or ways to move the logic outside of where locks are already held, would be appreciated.
* Still need to validate and re-add the chain migration patches to ensure they resolve the RT/DL load balancing invariants.
* “rq_selected()” naming. Peter doesn’t like it, but I’ve not thought of a better name. Open to suggestions.
* As discussed at OSPM[2], I’d like to split pick_next_task() up into two phases, selecting and setting the next task, as currently pick_next_task() assumes the returned task will be run, which results in various side effects in sched class logic when it’s run. I tried to take a pass at this earlier, but it’s hairy and lower on the priority list for now.
* CFS load balancing. Blocked tasks may carry forward load (PELT) to the lock owner's CPU, so the CPU may look like it is overloaded.
* Optimization to avoid migrating blocked tasks (to preserve optimistic mutex spinning) if the runnable lock owner at the end of the blocked_on chain is already running (though this is difficult due to the limitations of the locking rules needed to traverse the blocked_on chain across runqueues).

Performance:
---------
This patch series switches mutexes to use handoff mode rather than optimistic spinning. This is a potential concern where locks are under high contention. However, earlier performance analysis (on both x86 and mobile devices) did not see major regressions. That said, Chenyu did report a regression[3], which I’ll need to look further into.
I also briefly re-tested with this v5 series and saw some average latencies grow vs v4, suggesting the changes to return-migration (and extra locking) have some impact. With v6 the extra overhead is reduced, but it is still not as nice as v4. I’ll be digging more there, but my priority is still stability over speed at this point (it’s easier to validate the correctness of optimizations if the baseline isn’t crashing).

If folks find it easier to test/tinker with, this patch series can also be found here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6

A while back, Sage Sharp had a nice blog post about types of reviews[4], and while any review and feedback would be greatly appreciated, reviews focusing on conceptual design or correctness issues would be *especially* valued.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[2] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
[4] https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com

John Stultz (8):
  locking/mutex: Removes wakeups from under mutex::wait_lock
  sched: Add CONFIG_PROXY_EXEC & boot argument to enable/disable
  locking/mutex: Split blocked_on logic into two states (blocked_on and blocked_on_waking)
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Split out __sched() deactivate task logic into a helper
  sched: Add a very simple proxy() function
  sched: Add proxy deactivate helper
  sched: Handle blocked-waiter migration (and return migration)

Juri Lelli (2):
  locking/mutex: make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (8):
  sched: Unify runtime accounting across classes
  locking/mutex: Rework task_struct::blocked_on
  locking/mutex: Add task_struct::blocked_lock to serialize changes to the blocked_on state
  locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC
  sched: Split scheduler execution context
  sched: Start blocked_on chain processing in proxy()
  sched: Add blocked_donor link to task for smarter mutex handoffs
  sched: Add deactivated (sleeping) owner handling to proxy()

Valentin Schneider (2):
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  sched: Fix proxy/current (push,pull)ability

 .../admin-guide/kernel-parameters.txt |   4 +
 include/linux/sched.h                 |  49 +-
 init/Kconfig                          |   7 +
 init/init_task.c                      |   1 +
 kernel/Kconfig.locks                  |   2 +-
 kernel/fork.c                         |   8 +-
 kernel/locking/mutex-debug.c          |   9 +-
 kernel/locking/mutex.c                | 163 ++--
 kernel/locking/mutex.h                |  25 +
 kernel/locking/ww_mutex.h             |  72 +-
 kernel/sched/core.c                   | 723 ++++++++++++++++--
 kernel/sched/deadline.c               |  50 +-
 kernel/sched/fair.c                   | 104 ++-
 kernel/sched/rt.c                     |  77 +-
 kernel/sched/sched.h                  |  61 +-
 kernel/sched/stop_task.c              |  13 +-
 16 files changed, 1117 insertions(+), 251 deletions(-)
Comments
On Wed, Nov 8, 2023 at 3:40 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 6 Nov 2023 19:34:43 +0000 John Stultz <jstultz@google.com>
> > Overview:
> > ---------
> > Proxy Execution is a generalized form of priority inheritance.
> > Classic priority inheritance works well for real-time tasks where
> > there is a straightforward priority order to how things are run.
> > But it breaks down when used between CFS or DEADLINE tasks, as
> > there are lots of parameters involved outside of just the task’s
> > nice value when selecting the next task to run (via
> > pick_next_task()). So ideally we want to imbue the mutex holder
> > with all the scheduler attributes of the blocked waiting task.
> > [...]

Is there a reason why you trimmed the cc list?

> > The complexity from this is imposing, but currently in Android we
> > have a large number of cases where we are seeing priority
> > inversion (not unbounded, but much longer than we’d like) between
> > “foreground” and “background” SCHED_NORMAL applications. As a
> > result, currently we cannot usefully limit background activity
> > without introducing inconsistent behavior. So Proxy Execution is
> > a very attractive approach to resolving this critical issue.
>
> Given usual mutex use
>
>	mutex_lock;
>	less-than-a-tick level critical section;
>	(unusual case for example: sleep until wakeup;)
>	mutex_unlock;

So the case we see regularly is: you have a low-priority task, which maybe is cpuset-restricted onto a smaller, more energy-efficient cpu, and cpu-share limited as well, so it only gets a small proportional fraction of time on that little cpu. Alternatively, you could also imagine it being a SCHED_IDLE task on a busy system, where every once in a while the system is momentarily idle, allowing the task to briefly run. Either way, it doesn’t get to run very much, but when it does, it calls into the kernel on a path that is serialized with a mutex.
Once it takes the mutex, even if it were to hold it for a short time (as in your example above), if it gets preempted while holding the mutex, it won’t be selected to run for a while. Then, when an important task calls into a similar kernel path, it will have to block and sleep waiting for that mutex to be released. Unfortunately, because there may be enough work going on in other tasks to keep the cpus busy, the low-priority task doesn’t get scheduled, so it cannot release the lock. Given it is cpu-share limited (or is SCHED_IDLE), depending on the load, it may not get scheduled for a relatively long time. We’ve definitely seen traces where the outlier latencies are in the seconds.

> I think the effects of priority inversion could be safely ignored
> without sleep (because of page reclaim for instance) in the critical
> section.

I'm not sure I understand this assertion; could you clarify?

If it's helpful, I've got a simple (contrived) demo which can reproduce a similar issue to the one I've described above, using just file renames as the mutex-protected critical section:
https://github.com/johnstultz-work/priority-inversion-demo

> Could you please elaborate a bit on the priority inversion and tip
> point to one or two cases of mutex behind the inversion?

In Android, we see it commonly with various binder mutexes, or the pstore write_pmsg() pmsg_lock, as those are commonly taken by most apps. But any common kernel path that takes a mutex can cause these large latency outliers if we're trying to restrict background processes.

thanks
-john
Hi John

On Tue, Nov 7, 2023 at 3:37 AM John Stultz <jstultz@google.com> wrote:
> [...]
> This means that instead of just tracking “curr”, the scheduler
> needs to track both the scheduler context (what was picked and
> all the state used for scheduling decisions), and the execution
> context (what we’re actually running).
>
> In this way, the mutex owner is run “on behalf” of the blocked
> task that was picked to run, essentially inheriting the scheduler
> context of the waiting blocked task.

How to prevent malicious programs?
If a program takes a hotspot lock and never releases it, then there will be more and more blocked tasks, and thus more and more tasks on the runqueue; and because the blocked tasks are involved in scheduling, this may cause the owner to always be in the running state. As a result, other tasks cannot be scheduled.

Maybe we can use the hung-task mechanism to also detect such long-running tasks? I will also think about this issue; if I come up with a patch, I will send it here.

On the other hand, regardless of extreme situations like the above, the impact on performance of more tasks in the runqueue may require more testing.

BR
---
xuewen
Hello John,

I may have some data that might help you debug a potential performance issue mentioned below.

On 11/7/2023 1:04 AM, John Stultz wrote:
> [..snip..]
>
> Performance:
> ------------
> This patch series switches mutexes to use handoff mode rather
> than optimistic spinning. This is a potential concern where locks
> are under high contention. However, earlier performance analysis
> (on both x86 and mobile devices) did not see major regressions.
> That said, Chenyu did report a regression[3], which I'll need to
> look further into.

I too see this as the most notable regression. Some of the other benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb, DeathStarBench) show little to no difference when running with Proxy Execution, however sched-messaging sees a 10x blowup in the runtime.
(taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)

While investigating, I drew up the runqueue length when running sched-messaging pinned to 1 CCX (CPUs 0-7,128-125 on my 3rd Generation EPYC system) using the following bpftrace script that dumps it in csv format:

rqlen.bt
---
BEGIN
{
	$i = 0;
	printf("[%12s],", "Timestamp");
	while ($i < 8) {
		printf("CPU%3d,", $i);
		$i = $i + 1;
	}
	$i = 128;
	while ($i < 136) {
		printf("CPU%3d,", $i);
		$i = $i + 1;
	}
	printf("\n");
}

kprobe:scheduler_tick
{
	@runqlen[curtask->wake_cpu] = curtask->se.cfs_rq->rq->nr_running;
}

tracepoint:power:cpu_idle
{
	@runqlen[curtask->wake_cpu] = 0;
}

interval:hz:50
{
	$i = 0;
	printf("[%12lld],", elapsed);
	while ($i < 8) {
		printf("%3d,", @runqlen[$i]);
		$i = $i + 1;
	}
	$i = 128;
	while ($i < 136) {
		printf("%3d,", @runqlen[$i]);
		$i = $i + 1;
	}
	printf("\n");
}

END
{
	clear(@runqlen);
}
--

I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy execution (rqlen50-pe-pinned.csv) below.
The trend I see with hackbench is that the chain migration leads to a single runqueue being completely overloaded, followed by some amount of idling on the entire CCX and a similar chain appearing on a different CPU. The trace for tip shows a lot more CPUs being utilized.

Mathieu has been looking at hackbench and the effect of task migration on the runtime, and it appears that lowering the migrations improves the hackbench performance significantly [1][2][3]

[1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
[2] https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
[3] https://lore.kernel.org/lkml/20231019160523.1582101-1-mathieu.desnoyers@efficios.com/

Since migration itself is not cheap, I believe the chain migration at the current scale hampers the performance, since sched-messaging emulates a worst-case scenario for proxy-execution.

I'll update the thread once I have more information. I'll continue testing and take a closer look at the implementation.

> I also briefly re-tested with this v5 series
> and saw some average latencies grow vs v4, suggesting the changes
> to return-migration (and extra locking) have some impact. With v6
> the extra overhead is reduced but still not as nice as v4. I'll
> be digging more there, but my priority is still stability over
> speed at this point (it's easier to validate correctness of
> optimizations if the baseline isn't crashing).
>
>
> If folks find it easier to test/tinker with, this patch series
> can also be found here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6

P.S. I was using the above tree.

>
> A while back Sage Sharp had a nice blog post about types of
> reviews [4], and while any review and feedback would be greatly
> appreciated, those focusing on conceptual design or correctness
> issues would be *especially* valued.
I have skipped a few phases that Sage mentions in their blog, but I'll try my best to follow the order from here on forward :)

>
> Thanks so much!
> -john
>
> [1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
> [2] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
> [3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
> [4] https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/
>
> [..snip..]

--
Thanks and Regards,
Prateek
On Wed, 2023-12-13 at 12:07 +0530, K Prateek Nayak wrote:
> Hello John,
>
> I may have some data that might help you debug a potential performance
> issue mentioned below.
>
> On 11/7/2023 1:04 AM, John Stultz wrote:
> > [..snip..]
> >
> > Performance:
> > ------------
> > This patch series switches mutexes to use handoff mode rather
> > than optimistic spinning. This is a potential concern where locks
> > are under high contention. However, earlier performance analysis
> > (on both x86 and mobile devices) did not see major regressions.
> > That said, Chenyu did report a regression[3], which I'll need to
> > look further into.
>
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy
> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)

I observe a similar regression. The total time of `taskset -c 0-5,6-11 perf bench sched messaging -p -t -l 100000 -g 1` increases from 13.964 secs to 184.866 secs on my test machine. The other perf sched benchmarks look OK.

> While investigating, I drew up the runqueue length when running
> sched-messaging pinned to 1 CCX (CPUs 0-7,128-125 on my 3rd Generation
> EPYC system) using the following bpftrace script that dumps it in csv
> format:
> [snip]
On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John,
>
> I may have some data that might help you debug a potential performance
> issue mentioned below.

Hey Prateek,
  Thank you so much for taking the time to try this out and providing such helpful analysis! More below.

> On 11/7/2023 1:04 AM, John Stultz wrote:
> > [..snip..]
> >
> > Performance:
> > ------------
> > This patch series switches mutexes to use handoff mode rather
> > than optimistic spinning. This is a potential concern where locks
> > are under high contention. However, earlier performance analysis
> > (on both x86 and mobile devices) did not see major regressions.
> > That said, Chenyu did report a regression[3], which I'll need to
> > look further into.
>
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy

This is great to hear! Thank you for providing this input!

> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)

Oof. I appreciate you sharing this!

> While investigating, I drew up the runqueue length when running
> sched-messaging pinned to 1 CCX (CPUs 0-7,128-125 on my 3rd Generation
> EPYC system) using the following bpftrace script that dumps it in csv
> format:

Just so I'm following you properly on the processor you're using, cpus 0-7 and 128-125 are in the same CCX? (I thought there were only 8 cores per CCX?)

> rqlen.bt
> ---
<snip>
> --
>
> I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
> execution (rqlen50-pe-pinned.csv) below.
>
> The trend I see with hackbench is that the chain migration leads
> to a single runqueue being completely overloaded, followed by some
> amount of idling on the entire CCX and a similar chain appearing
> on a different CPU.
> The trace for tip shows a lot more CPUs being utilized.
>
> Mathieu has been looking at hackbench and the effect of task migration
> on the runtime, and it appears that lowering the migrations improves
> the hackbench performance significantly [1][2][3]
>
> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
> [2] https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
> [3] https://lore.kernel.org/lkml/20231019160523.1582101-1-mathieu.desnoyers@efficios.com/
>
> Since migration itself is not cheap, I believe the chain migration at
> the current scale hampers the performance, since sched-messaging
> emulates a worst-case scenario for proxy-execution.

Hrm.

> I'll update the thread once I have more information. I'll continue
> testing and take a closer look at the implementation.
>
> > I also briefly re-tested with this v5 series
> > and saw some average latencies grow vs v4, suggesting the changes
> > to return-migration (and extra locking) have some impact. With v6
> > the extra overhead is reduced but still not as nice as v4. I'll
> > be digging more there, but my priority is still stability over
> > speed at this point (it's easier to validate correctness of
> > optimizations if the baseline isn't crashing).
> >
> >
> > If folks find it easier to test/tinker with, this patch series
> > can also be found here:
> > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
> > https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6
>
> P.S. I was using the above tree.
Ok, I've been working on getting v7 ready, which includes two main things:

1) I've reworked the return migration back into the ttwu path to avoid the lock juggling
2) Working to properly conditionalize and re-add Connor's chain-migration feature (which, when a migration happens, pulls the full blocked_donor list with it)

So I'll try to reproduce your results and see if these help any with this particular case, and then I'll start to look closer at what can be done.

Again, thanks so much, I've got so much gratitude for your testing and analysis here. I really appreciate your feedback! Do let me know if you find anything further!

thanks
-john
On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> I too see this as the most notable regression. Some of the other
> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
> DeathStarBench) show little to no difference when running with Proxy
> Execution, however sched-messaging sees a 10x blowup in the runtime.
> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)
...
> The trend I see with hackbench is that the chain migration leads
> to a single runqueue being completely overloaded, followed by some
> amount of idling on the entire CCX and a similar chain appearing
> on a different CPU. The trace for tip shows a lot more CPUs being
> utilized.

So I reproduced a similar issue with the test (I only have 8 cores on the bare metal box I have, so I didn't bother using taskset):

  perf bench sched messaging -p -t -l 100000 -g 1

  v6.6:           4.583s
  proxy-exec-6.6: 28.842s
  proxy-exec-WIP: 26.1033s

So the pre-v7 code does improve things, but not substantially.

Bisecting through the patch series to see how it regressed shows it is not a single change:

  mutex-handoff:         16.957s
  blocked-waiter/return: 29.628s
  ttwu return:           20.045s
  blocked_donor:         25.021s

So it seems like the majority of the regression comes from switching to mutex handoff instead of optimistic spinning. This would account for your "more CPUs being utilized" comment, as more of the blocked tasks would be spinning trying to take the mutex.

Then adding the initial blocked-waiter/return migration change hurts things further (and this was a known issue with v5/v6). Then the pending patch to switch back to doing return migration in ttwu recovers a chunk of that cost. And then the blocked_donor handoff optimization (which passes the lock back to the donor boosting us, rather than to the next task in the mutex waitlist) further impacts performance here. The chain-migration feature in proxy-exec-WIP doesn't seem to help or hurt in this case.
I'll need to look closer at each of those steps to see if there's anything too daft I'm doing.

The loss of optimistic spinning has been a long-term worry with the patch series. Unfortunately, as I mentioned in the Plumbers talk, the idea from OSPM of avoiding migration and spinning if the runnable owner of a blocked_on chain was on a cpu isn't easy to accomplish, as we are limited in how far off the rq we can look.

I feel like I need to come up with some way to lock the entire blocked_on chain so it can be traversed safely - as it is now, due to the locking order (task->pi_lock, rq_lock, mutex->wait_lock, task->blocked_on) we can't stably traverse task->mutex->task->mutex chains unless the tasks are all on the same rq (and we're holding the rq_lock). So we can only safely look at one mutex owner off of the rq (when we also hold that mutex's wait_lock).

I'm stewing on an idea for some sort of mutex holder (similar to a runqueue) that would allow the chain of mutexes to be traversed quickly and stably - but the many-to-one blocked_on relationship complicates it.

Suggestions or other ideas here would be welcome, as I didn't get a chance to really discuss it much at Plumbers.

thanks
-john
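[Editor's note: the locking-order limitation described above can be modeled in a few lines of user-space C. This is purely an illustrative sketch - the structs and the find_proxy_target() helper are invented for this example, not the series' actual code. The walker follows blocked_on hops while the tasks share a runqueue, and stops after stepping to the first owner on a different rq, since peeking at a second off-rq mutex would require holding locks the ordering forbids.]

```c
#include <assert.h>
#include <stddef.h>

struct task;

struct mutex {
	struct task *owner;	/* current lock holder */
};

struct task {
	struct mutex *blocked_on;	/* mutex this task waits on, or NULL */
	int cpu;			/* runqueue the task is on */
};

/* Walk the blocked_on chain from a donor task. In the kernel each hop
 * would require that mutex's wait_lock; since only one off-rq owner can
 * be examined safely, stop as soon as we step onto a different rq. */
static struct task *find_proxy_target(struct task *donor, int this_cpu)
{
	struct task *t = donor;

	while (t->blocked_on) {
		struct task *owner = t->blocked_on->owner;

		if (!owner)
			break;
		t = owner;
		if (t->cpu != this_cpu)
			break;	/* one owner off the rq is as far as we can look */
	}
	return t;
}
```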
On Wed, Dec 13, 2023 at 5:00 PM John Stultz <jstultz@google.com> wrote:
> I'm stewing on an idea for some sort of mutex holder (similar to a
On re-reading this, I realized holder is far too overloaded a term in
this context.
"mutex container" might be a better term for the idea.
thanks
-john
Hello John,

Thank you for taking a look at the report.

On 12/14/2023 12:41 AM, John Stultz wrote:
> On Tue, Dec 12, 2023 at 10:37 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello John,
>>
>> I may have some data that might help you debug a potential performance
>> issue mentioned below.
>
> Hey Prateek,
> Thank you so much for taking the time to try this out and providing
> such helpful analysis!
> More below.
>
>> On 11/7/2023 1:04 AM, John Stultz wrote:
>>> [..snip..]
>>>
>>> Performance:
>>> ------------
>>> This patch series switches mutexes to use handoff mode rather
>>> than optimistic spinning. This is a potential concern where locks
>>> are under high contention. However, earlier performance analysis
>>> (on both x86 and mobile devices) did not see major regressions.
>>> That said, Chenyu did report a regression[3], which I'll need to
>>> look further into.
>>
>> I too see this as the most notable regression. Some of the other
>> benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
>> DeathStarBench) show little to no difference when running with Proxy
>
> This is great to hear! Thank you for providing this input!
>
>> Execution, however sched-messaging sees a 10x blowup in the runtime.
>> (taskset -c 0-7,128-125 perf bench sched messaging -p -t -l 100000 -g 1)
>
> Oof. I appreciate you sharing this!
>
>> While investigating, I drew up the runqueue length when running
>> sched-messaging pinned to 1 CCX (CPUs 0-7,128-125 on my 3rd Generation
>> EPYC system) using the following bpftrace script that dumps it in csv
>> format:
>
> Just so I'm following you properly on the processor you're using, cpus
> 0-7 and 128-125 are in the same CCX?
> (I thought there were only 8 cores per CCX?)

Sorry about that!
It should be 0-7,128-135 (16 threads of 8 cores in the same CCX). The pinning was added so that I could observe only a subset of the total CPUs, since analyzing the behavior of 40 tasks on 256 CPUs was much harder than analyzing it on 16 CPUs :)

>
>> rqlen.bt
>> ---
> <snip>
>> --
>>
>> I've attached the csv for tip (rqlen50-tip-pinned.csv) and proxy
>> execution (rqlen50-pe-pinned.csv) below.
>>
>> The trend I see with hackbench is that the chain migration leads
>> to a single runqueue being completely overloaded, followed by some
>> amount of idling on the entire CCX and a similar chain appearing
>> on a different CPU. The trace for tip shows a lot more CPUs being
>> utilized.
>>
>> Mathieu has been looking at hackbench and the effect of task migration
>> on the runtime, and it appears that lowering the migrations improves
>> the hackbench performance significantly [1][2][3]
>>
>> [1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
>> [2] https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
>> [3] https://lore.kernel.org/lkml/20231019160523.1582101-1-mathieu.desnoyers@efficios.com/
>>
>> Since migration itself is not cheap, I believe the chain migration at
>> the current scale hampers the performance, since sched-messaging
>> emulates a worst-case scenario for proxy-execution.
>
> Hrm.
>
>> I'll update the thread once I have more information. I'll continue
>> testing and take a closer look at the implementation.
>>
>>> I also briefly re-tested with this v5 series
>>> and saw some average latencies grow vs v4, suggesting the changes
>>> to return-migration (and extra locking) have some impact. With v6
>>> the extra overhead is reduced but still not as nice as v4. I'll
>>> be digging more there, but my priority is still stability over
>>> speed at this point (it's easier to validate correctness of
>>> optimizations if the baseline isn't crashing).
>>>
>>>
>>> If folks find it easier to test/tinker with, this patch series
>>> can also be found here:
>>> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
>>> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6
>>
>> P.S. I was using the above tree.
>
> Ok, I've been working on getting v7 ready, which includes two main things:
> 1) I've reworked the return migration back into the ttwu path to avoid
> the lock juggling
> 2) Working to properly conditionalize and re-add Connor's
> chain-migration feature (which, when a migration happens, pulls the full
> blocked_donor list with it)
>
> So I'll try to reproduce your results and see if these help any with
> this particular case, and then I'll start to look closer at what can
> be done.
>
> Again, thanks so much, I've got so much gratitude for your testing and
> analysis here. I really appreciate your feedback!
> Do let me know if you find anything further!

Sure thing! I'll keep you updated on any findings. Thank you for digging further into this issue.

>
> thanks
> -john

--
Thanks and Regards,
Prateek
Hi John

On 11/06/23 19:34, John Stultz wrote:
> Stabilizing this Proxy Execution series has unfortunately
> continued to be a challenging task. Since the v5 release, I've
> been focused on getting the deactivated/sleeping owner enqueuing
> functionality, which I left out of v5, stabilized. I've managed
> to rework enough to avoid the crashes previously tripped with the
> locktorture & ww_mutex selftests, so I feel it's much improved,
> but I do still see some issues (blocked waitqueues and hung task
> watchdogs firing) after stressing the system for many hours in a
> 64 cpu qemu environment (though this issue seems to be introduced
> earlier in the series with proxy-migration/return-migration).
>
> I still haven't had time to focus on testing and debugging the
> chain migration patches. So I've left that patch out of this
> submission for now, but will try to get it included in the next
> revision.
>
> This patch series is actually coarser than what I've been
> developing with, as there are a number of small "test" steps to
> help validate behavior I changed, which would then be replaced by
> the real logic afterwards. Including those here would just cause
> more work for reviewers, so I've folded them together, but if
> you're interested you can find the fine-grained tree here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6-fine-grained
>
> As mentioned previously, this Proxy Execution series has a long
> history: first described in a paper[1] by Watkins, Straub, and
> Niehaus, then from patches from Peter Zijlstra, extended with
> lots of work by Juri Lelli, Valentin Schneider, and Connor
> O'Brien. (And thank you to Steven Rostedt for providing
> additional details here!)

Thanks a lot for all your effort in trying to push this difficult patchset forward!
I am trying to find more time to help with review and hopefully debugging too, but as it stands, I think to make progress we need to think about breaking this patchset into smaller problems and getting them merged in phases, so at least the review and the actual work would be broken down into smaller, more manageable chunks.

From my very birds-eye view it seems we have 3 elements:

	1. Extend locking infrastructure.
	2. Split task context into scheduling and execution.
	3. Actual proxy_execution implementation.

It seems to me (and as ever I could be wrong of course) the first 7 patches are more or less stable? Could you send patch 1 individually, and then the next 6 patches, to get the groundwork to extend locking reviewed and merged first?

After that we can focus on splitting the task context into scheduling and execution (and maybe introduce the PROXY_EXEC config along with it), but without actually implementing the inheritance, etc. parts? Just generally teaching the scheduler that these are now 2 separate parts.

Are 1 and 2 dependent on each other, or can they actually be sent as two series in parallel?

Hopefully this should reduce the work a lot from continuously rebasing these patches, and let us focus on the last part, which is the meaty and most difficult bit IIUC. Which I hope we can break down too; but I have no idea at the moment how to do that.

Merging in parts will help with giving each part a chance to soak individually in mainline while the rest is being discussed. Which would make handling potential fall-overs easier too.

I don't know what the others think, but IMHO if there are stable parts of this series, I think we should focus on trying to merge these elements first. I hope you'll be the one to get this through the finishing line, but if for whatever reason yet another person needs to carry it over, they'd find something got merged at least :-) I'm sure reviving these patches every time is no easy task!

Cheers

--
Qais Yousef
On Sat, Dec 16, 2023 at 7:07 PM Qais Yousef <qyousef@layalina.io> wrote:
> I am trying to find more time to help with review and hopefully debugging too,
> but as it stands, I think to make progress we need to think about breaking this
> patchset into smaller problems and get them merged in phases so at least the
> review and actual work done would be broken down into smaller more manageable
> chunks.
>
> From my very birds-eye view it seems we have 3 elements:
>
> 1. Extend locking infrastructure.
> 2. Split task context into scheduling and execution.
> 3. Actual proxy_execution implementation.
>
> It seems to me (and as ever I could be wrong of course) the first 7 patches are
> more or less stable? Could you send patch 1 individually and the next 6 patches
> to get the groundwork to extend locking reviewed and merged first?

So I'm working hard to get v7 out the door here, but your general suggestion here sounds fair.

Part of why I've not pushed as hard to get the first initial patches in is that the earlier changes to rework things are mostly done in service of the later proxy exec logic, so it may be a little hard to justify the churn on their own. Also, I've been hoping to get more discussion & feedback around the big picture - but I suspect the size & number of the changes makes this daunting.

That said, if Peter is up for it, I'd be happy if the initial chunks of the series were to be considered to be pulled in.

> After that we can focus on splitting the task context into scheduling and
> execution (and maybe introduce the PROXY_EXEC config along with it) but without
> actually implementing the inheritance, etc. parts? Just generally teaching the
> scheduler these are now 2 separate parts.
The majority of that is done in a single patch:
https://github.com/johnstultz-work/linux-dev/commit/9e3b364f3724ed840137d681876268b0ad67a469

There we start to have separate names, but at least until we are doing the proxying, rq->curr and rq_selected() will be the same task.

> Are 1 and 2 dependent on each other or can they be sent as two series in
> parallel actually?

Technically, I think they can be done in parallel. Though I'm not sure if managing and submitting multiple patch series is easier or harder for folks to review.

> Hopefully this should reduce the work a lot from continuously rebasing these
> patches and focus on the last part which is the meaty and most difficult bit
> IIUC. Which I hope we can break down too; but I have no idea at the moment how
> to do that.

Rebasing hasn't been the major problem, but wrangling the large patch set has. For v7 I got a lot of offlist feedback, and propagating those changes through the full fine-grained stack makes even trivial changes to early patches quite laborious.

As for breaking down the rest, I thought the v6 series had the later patches broken down fairly well:
1) Just local rq proxying (where the owner happens to be on the current rq)
2) Add proxy migration & return migration, so we can boost tasks on remote rqs
3) Sleeping owner enqueueing (so we get woken and added to the rq the owner gets woken on)
...

And I have the fine-grained version that is even more split up, so I could test each small change at a time:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained

But I'm open to other suggestions.

> Merging in parts will help with giving each part a chance to soak individually
> in mainline while the rest is being discussed. Which would make handling
> potential fall-overs easier too.

Again, I think this would be great.

I'll try to get v7 out the door so folks can once more consider the big picture, but then maybe I'll send out the earlier changes more frequently.

thanks
-john
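[Editor's note: the scheduler/execution context split mentioned above - rq->curr and rq_selected() being the same task until proxying kicks in - can be sketched as a user-space toy. The names echo the discussion, but the code below is an invented model (in particular proxy_set_next() is made up here), not the patch itself: the selected task keeps the scheduler context (its accounting, its timeslice), while the runnable owner at the end of its blocked_on chain inherits the execution context.]

```c
#include <assert.h>
#include <stddef.h>

struct task;

struct mutex {
	struct task *owner;
};

struct task {
	struct mutex *blocked_on;	/* mutex the task waits on, or NULL */
};

struct rq {
	struct task *curr;		/* execution context: task actually running */
	struct task *curr_selected;	/* scheduler context: task that was picked */
};

static struct task *rq_selected(struct rq *rq)
{
	return rq->curr_selected;
}

/* Commit a pick: the selected task keeps the scheduler context, but if
 * it is blocked, the runnable owner at the end of its blocked_on chain
 * inherits the execution context and runs on the selected task's time. */
static void proxy_set_next(struct rq *rq, struct task *selected)
{
	struct task *run = selected;

	rq->curr_selected = selected;
	while (run->blocked_on && run->blocked_on->owner)
		run = run->blocked_on->owner;
	rq->curr = run;
}
```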
On 12/18/23 15:38, John Stultz wrote:
> On Sat, Dec 16, 2023 at 7:07 PM Qais Yousef <qyousef@layalina.io> wrote:
> > I am trying to find more time to help with review and hopefully debugging too,
> > but as it stands, I think to make progress we need to think about breaking this
> > patchset into smaller problems and get them merged in phases so at least the
> > review and actual work done would be broken down into smaller more manageable
> > chunks.
> >
> > From my very birds-eye view it seems we have 3 elements:
> >
> > 1. Extend locking infrastructure.
> > 2. Split task context into scheduling and execution.
> > 3. Actual proxy_execution implementation.
> >
> > It seems to me (and as ever I could be wrong of course) the first 7 patches are
> > more or less stable? Could you send patch 1 individually and the next 6 patches
> > to get the groundwork to extend locking reviewed and merged first?
>
> So I'm working hard to get v7 out the door here, but your general
> suggestion here sounds fair.
>
> Part of why I've not pushed as hard to get the first initial patches
> in is that the earlier changes to rework things are mostly done in
> service of the later proxy exec logic, so it may be a little hard to
> justify the churn on their own. Also, I've been hoping to get more

If by churn you mean they'd be changing a lot later, then yes. But if by churn you mean merging patches to serve a purpose that is yet to follow up, then I think that is fine. I think other complex patches were staged in pieces to help with both the review and test process. And to speed things up.

> discussion & feedback around the big picture - but I suspect the size
> & number of the changes makes this daunting.

I don't know how the maintainers manage in general tbh. But finding the time for a smaller set of patches would be easier IMHO for everyone interested. Assuming my understanding is correct that the first set of patches, up to splitting the scheduler context, is more or less stable.
> That said, if Peter is up for it, I'd be happy if the initial chunks
> of the series were to be considered to be pulled in.
>
> > After that we can focus on splitting the task context into scheduling and
> > execution (and maybe introduce the PROXY_EXEC config along with it) but without
> > actually implementing the inheritance, etc. parts? Just generally teaching the
> > scheduler these are now 2 separate parts.
>
> The majority of that is done in a single patch:
> https://github.com/johnstultz-work/linux-dev/commit/9e3b364f3724ed840137d681876268b0ad67a469

Yep!

>
> There we start to have separate names, but at least until we are doing
> the proxying, the rq->curr and rq_selected() will be the same task.
>
> > Are 1 and 2 dependent on each other or can they be sent as two series in
> > parallel actually?
>
> Technically, I think they can be done in parallel. Though I'm not sure
> if managing and submitting multiple patch series is easier or harder
> for folks to review.

I didn't mean you should do that :-) Meant they're actually completely independent.

>
> > Hopefully this should reduce the work a lot from continuously rebasing these
> > patches and focus on the last part which is the meaty and most difficult bit
> > IIUC. Which I hope we can break down too; but I have no idea at the moment how
> > to do that.
>
> Rebasing hasn't been the major problem, but wrangling the large patch
> set has. For v7 I got a lot of offlist feedback, and propagating those
> changes through the full fine-grained stack makes even trivial changes
> to early patches quite laborious.
>
> As for breaking down the rest, I thought the v6 series had the later
> patches broken down fairly well:

Hmm okay. I thought they were dependent; I meant to point at potentially independent parts that can be piecewise merged.
But perhaps I'm overthinking the splitting here :-)

> 1) Just local rq proxying (where the owner happens to be on the current rq)
> 2) Add proxy migration & return migration, so we can boost tasks on remote rqs
> 3) Sleeping owner enqueueing (so we get woken and added to the rq the
> owner gets woken on)
> ...
>
> And I have the fine-grained version that is even more split up so I
> could test each small change at a time:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
>
> But I'm open to other suggestions.
>
> > Merging in parts will help with giving each part a chance to soak individually
> > in mainline while the rest is being discussed. Which would make handling
> > potential fall-overs easier too.
>
> Again, I think this would be great.
>
> I'll try to get v7 out the door so folks can once more consider the
> big picture, but then maybe I'll send out the earlier changes more
> frequently.

Thanks!

--
Qais Yousef