Message ID: xm261qego72d.fsf_-_@google.com
State: New
From: Benjamin Segall <bsegall@google.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: mingo@kernel.org, vincent.guittot@linaro.org, linux-kernel@vger.kernel.org, juri.lelli@redhat.com, dietmar.eggemann@arm.com, rostedt@goodmis.org, mgorman@suse.de, bristot@redhat.com, corbet@lwn.net, qyousef@layalina.io, chris.hyser@oracle.com, patrick.bellasi@matbug.net, pjt@google.com, pavel@ucw.cz, qperret@google.com, tim.c.chen@linux.intel.com, joshdon@google.com, timj@gnu.org, kprateek.nayak@amd.com, yu.c.chen@intel.com, youssefesmat@chromium.org, joel@joelfernandes.org, efault@gmx.de, tglx@linutronix.de
Subject: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
In-Reply-To: <20230531124603.931005524@infradead.org> (Peter Zijlstra's message of "Wed, 31 May 2023 13:58:44 +0200")
References: <20230531115839.089944915@infradead.org> <20230531124603.931005524@infradead.org>
Date: Fri, 29 Sep 2023 17:09:30 -0700
Series: sched/fair: fix pick_eevdf to always find the correct se
Commit Message
Benjamin Segall
Sept. 30, 2023, 12:09 a.m. UTC
The old pick_eevdf could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but it
turned out that that min_deadline wasn't actually eligible. In that case
we need to go back and search through any left branches we skipped
looking for the actual best _eligible_ min_deadline.
This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.
I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <bsegall@google.com>
---
kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++---------
1 file changed, 58 insertions(+), 14 deletions(-)
Comments
Hi,

On 30.09.2023 02:09, Benjamin Segall wrote:
> The old pick_eevdf could fail to find the actual earliest eligible
> deadline when it descended to the right looking for min_deadline, but it
> turned out that that min_deadline wasn't actually eligible. In that case
> we need to go back and search through any left branches we skipped
> looking for the actual best _eligible_ min_deadline.
>
> This is more expensive, but still O(log n), and at worst should only
> involve descending two branches of the rbtree.
>
> I've run this through a userspace stress test (thank you
> tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
> corner cases.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Ben Segall <bsegall@google.com>

This patch landed in today's linux-next as commit 561c58efd239 ("sched/fair: Fix pick_eevdf()"). Surprisingly it introduced a warning about a circular locking dependency. It can be easily observed during boot from time to time on a qemu/arm64 'virt' machine:

======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc4+ #7222 Not tainted
------------------------------------------------------
systemd-udevd/1187 is trying to acquire lock:
ffffbcc2be0c4de0 (console_owner){..-.}-{0:0}, at: console_flush_all+0x1b0/0x500

but task is already holding lock:
ffff5535ffdd2b18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xe0/0xc40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&rq->__lock){-.-.}-{2:2}:
 _raw_spin_lock_nested+0x44/0x5c
 raw_spin_rq_lock_nested+0x24/0x40
 task_fork_fair+0x3c/0xac
 sched_cgroup_fork+0xe8/0x14c
 copy_process+0x11c4/0x1a14
 kernel_clone+0x8c/0x400
 user_mode_thread+0x70/0x98
 rest_init+0x28/0x190
 arch_post_acpi_subsys_init+0x0/0x8
 start_kernel+0x594/0x684
 __primary_switched+0xbc/0xc4

-> #3 (&p->pi_lock){-.-.}-{2:2}:
 _raw_spin_lock_irqsave+0x60/0x88
 try_to_wake_up+0x58/0x468
 default_wake_function+0x14/0x20
 woken_wake_function+0x20/0x2c
 __wake_up_common+0x94/0x170
 __wake_up_common_lock+0x7c/0xcc
 __wake_up+0x18/0x24
 tty_wakeup+0x34/0x70
 tty_port_default_wakeup+0x20/0x38
 tty_port_tty_wakeup+0x18/0x24
 uart_write_wakeup+0x18/0x28
 pl011_tx_chars+0x140/0x2b4
 pl011_start_tx+0xe8/0x190
 serial_port_runtime_resume+0x90/0xc0
 __rpm_callback+0x48/0x1a8
 rpm_callback+0x6c/0x78
 rpm_resume+0x438/0x6d8
 pm_runtime_work+0x84/0xc8
 process_one_work+0x1ec/0x53c
 worker_thread+0x298/0x408
 kthread+0x124/0x128
 ret_from_fork+0x10/0x20

-> #2 (&tty->write_wait){....}-{2:2}:
 _raw_spin_lock_irqsave+0x60/0x88
 __wake_up_common_lock+0x5c/0xcc
 __wake_up+0x18/0x24
 tty_wakeup+0x34/0x70
 tty_port_default_wakeup+0x20/0x38
 tty_port_tty_wakeup+0x18/0x24
 uart_write_wakeup+0x18/0x28
 pl011_tx_chars+0x140/0x2b4
 pl011_start_tx+0xe8/0x190
 serial_port_runtime_resume+0x90/0xc0
 __rpm_callback+0x48/0x1a8
 rpm_callback+0x6c/0x78
 rpm_resume+0x438/0x6d8
 pm_runtime_work+0x84/0xc8
 process_one_work+0x1ec/0x53c
 worker_thread+0x298/0x408
 kthread+0x124/0x128
 ret_from_fork+0x10/0x20

-> #1 (&port_lock_key){..-.}-{2:2}:
 _raw_spin_lock+0x48/0x60
 pl011_console_write+0x13c/0x1b0
 console_flush_all+0x20c/0x500
 console_unlock+0x6c/0x130
 vprintk_emit+0x228/0x3a0
 vprintk_default+0x38/0x44
 vprintk+0xa4/0xc0
 _printk+0x5c/0x84
 register_console+0x1f4/0x420
 serial_core_register_port+0x5a4/0x5d8
 serial_ctrl_register_port+0x10/0x1c
 uart_add_one_port+0x10/0x1c
 pl011_register_port+0x70/0x12c
 pl011_probe+0x1bc/0x1fc
 amba_probe+0x110/0x1c8
 really_probe+0x148/0x2b4
 __driver_probe_device+0x78/0x12c
 driver_probe_device+0xd8/0x160
 __device_attach_driver+0xb8/0x138
 bus_for_each_drv+0x84/0xe0
 __device_attach+0xa8/0x1b0
 device_initial_probe+0x14/0x20
 bus_probe_device+0xb0/0xb4
 device_add+0x574/0x738
 amba_device_add+0x40/0xac
 of_platform_bus_create+0x2b4/0x378
 of_platform_populate+0x50/0xfc
 of_platform_default_populate_init+0xd0/0xf0
 do_one_initcall+0x74/0x2f0
 kernel_init_freeable+0x28c/0x4dc
 kernel_init+0x24/0x1dc
 ret_from_fork+0x10/0x20

-> #0 (console_owner){..-.}-{0:0}:
 __lock_acquire+0x1318/0x20c4
 lock_acquire+0x1e8/0x318
 console_flush_all+0x1f8/0x500
 console_unlock+0x6c/0x130
 vprintk_emit+0x228/0x3a0
 vprintk_default+0x38/0x44
 vprintk+0xa4/0xc0
 _printk+0x5c/0x84
 pick_next_task_fair+0x28c/0x498
 __schedule+0x164/0xc40
 do_task_dead+0x54/0x58
 do_exit+0x61c/0x9e8
 do_group_exit+0x34/0x90
 __wake_up_parent+0x0/0x30
 invoke_syscall+0x48/0x114
 el0_svc_common.constprop.0+0x40/0xe0
 do_el0_svc_compat+0x1c/0x38
 el0_svc_compat+0x48/0xb4
 el0t_32_sync_handler+0x90/0x140
 el0t_32_sync+0x194/0x198

other info that might help us debug this:

Chain exists of:
  console_owner --> &p->pi_lock --> &rq->__lock

Possible unsafe locking scenario:

      CPU0                    CPU1
      ----                    ----
 lock(&rq->__lock);
                              lock(&p->pi_lock);
                              lock(&rq->__lock);
 lock(console_owner);

*** DEADLOCK ***

3 locks held by systemd-udevd/1187:
 #0: ffff5535ffdd2b18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xe0/0xc40
 #1: ffffbcc2be0c4c30 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x11c/0x3a0
 #2: ffffbcc2be0c4c88 (console_srcu){....}-{0:0}, at: console_flush_all+0x7c/0x500

stack backtrace:
CPU: 1 PID: 1187 Comm: systemd-udevd Not tainted 6.6.0-rc4+ #7222
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x98/0xf0
 show_stack+0x18/0x24
 dump_stack_lvl+0x60/0xac
 dump_stack+0x18/0x24
 print_circular_bug+0x290/0x370
 check_noncircular+0x15c/0x170
 __lock_acquire+0x1318/0x20c4
 lock_acquire+0x1e8/0x318
 console_flush_all+0x1f8/0x500
 console_unlock+0x6c/0x130
 vprintk_emit+0x228/0x3a0
 vprintk_default+0x38/0x44
 vprintk+0xa4/0xc0
 _printk+0x5c/0x84
 pick_next_task_fair+0x28c/0x498
 __schedule+0x164/0xc40
 do_task_dead+0x54/0x58
 do_exit+0x61c/0x9e8
 do_group_exit+0x34/0x90
 __wake_up_parent+0x0/0x30
 invoke_syscall+0x48/0x114
 el0_svc_common.constprop.0+0x40/0xe0
 do_el0_svc_compat+0x1c/0x38
 el0_svc_compat+0x48/0xb4
 el0t_32_sync_handler+0x90/0x140
 el0t_32_sync+0x194/0x198

The problem is probably elsewhere, but this scheduler change only revealed it in a fully reproducible way. Reverting $subject on top of linux-next hides the problem deep enough that I was not able to reproduce it. Let me know if there is anything I can do to help fix this issue.

> ---
> kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 58 insertions(+), 14 deletions(-)
>
> ...

Best regards
On 9/30/23 8:09 AM, Benjamin Segall Wrote:
> The old pick_eevdf could fail to find the actual earliest eligible
> deadline when it descended to the right looking for min_deadline, but it
> turned out that that min_deadline wasn't actually eligible. In that case
> we need to go back and search through any left branches we skipped
> looking for the actual best _eligible_ min_deadline.
>
> This is more expensive, but still O(log n), and at worst should only
> involve descending two branches of the rbtree.
>
> I've run this through a userspace stress test (thank you
> tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
> corner cases.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Ben Segall <bsegall@google.com>
> ---
> kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 58 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0c31cda0712f..77e9440b8ab3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -864,18 +864,20 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>   *
>   *   se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
>   *
>   * Which allows an EDF like search on (sub)trees.
>   */
> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
>  {
>  	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
>  	struct sched_entity *curr = cfs_rq->curr;
>  	struct sched_entity *best = NULL;
> +	struct sched_entity *best_left = NULL;
>
>  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>  		curr = NULL;
> +	best = curr;
>
>  	/*
>  	 * Once selected, run a task until it either becomes non-eligible or
>  	 * until it gets a new slice. See the HACK in set_next_entity().
>  	 */
> @@ -892,45 +894,87 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>  			node = node->rb_left;
>  			continue;
>  		}
>
>  		/*
> -		 * If this entity has an earlier deadline than the previous
> -		 * best, take this one. If it also has the earliest deadline
> -		 * of its subtree, we're done.
> +		 * Now we heap search eligible trees for the best (min_)deadline
>  		 */
> -		if (!best || deadline_gt(deadline, best, se)) {
> +		if (!best || deadline_gt(deadline, best, se))
>  			best = se;
> -			if (best->deadline == best->min_deadline)
> -				break;
> -		}
>
>  		/*
> -		 * If the earlest deadline in this subtree is in the fully
> -		 * eligible left half of our space, go there.
> +		 * Every se in a left branch is eligible, keep track of the
> +		 * branch with the best min_deadline
>  		 */
> +		if (node->rb_left) {
> +			struct sched_entity *left = __node_2_se(node->rb_left);
> +
> +			if (!best_left || deadline_gt(min_deadline, best_left, left))
> +				best_left = left;
> +
> +			/*
> +			 * min_deadline is in the left branch. rb_left and all
> +			 * descendants are eligible, so immediately switch to the second
> +			 * loop.
> +			 */
> +			if (left->min_deadline == se->min_deadline)
> +				break;
> +		}
> +
> +		/* min_deadline is at this node, no need to look right */
> +		if (se->deadline == se->min_deadline)
> +			break;
> +
> +		/* else min_deadline is in the right branch. */
> +		node = node->rb_right;
> +	}
> +
> +	/*
> +	 * We ran into an eligible node which is itself the best.
> +	 * (Or nr_running == 0 and both are NULL)
> +	 */
> +	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
> +		return best;
> +
> +	/*
> +	 * Now best_left and all of its children are eligible, and we are just
> +	 * looking for deadline == min_deadline
> +	 */
> +	node = &best_left->run_node;
> +	while (node) {
> +		struct sched_entity *se = __node_2_se(node);
> +
> +		/* min_deadline is the current node */
> +		if (se->deadline == se->min_deadline)
> +			return se;

IMHO it would be better to tiebreak on vruntime by moving this hunk to ..

> +
> +		/* min_deadline is in the left branch */
>  		if (node->rb_left &&
>  		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>  			node = node->rb_left;
>  			continue;
>  		}

.. here, thoughts?

>
> +		/* else min_deadline is in the right branch */
>  		node = node->rb_right;
>  	}
> +	return NULL;

Why not 'best'? Since ..

> +}
>
> -	if (!best || (curr && deadline_gt(deadline, best, curr)))
> -		best = curr;
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +{
> +	struct sched_entity *se = __pick_eevdf(cfs_rq);
>
> -	if (unlikely(!best)) {
> +	if (!se) {
>  		struct sched_entity *left = __pick_first_entity(cfs_rq);

.. cfs_rq->curr isn't considered here.

>  		if (left) {
>  			pr_err("EEVDF scheduling fail, picking leftmost\n");
>  			return left;
>  		}
>  	}
>
> -	return best;
> +	return se;
>  }
>
>  #ifdef CONFIG_SCHED_DEBUG
>  struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>  {

The implementation of __pick_eevdf() now is quite complicated, which makes it really hard to maintain. I'm trying my best to refactor this part, hopefully it can help some. Below is only for illustration, I need to test more.

static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
{
	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
	struct sched_entity *curr = cfs_rq->curr;
	struct sched_entity *best = NULL;
	struct sched_entity *cand = NULL;
	bool all_eligible = false;

	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
		curr = NULL;

	/*
	 * Once selected, run a task until it either becomes non-eligible or
	 * until it gets a new slice. See the HACK in set_next_entity().
	 */
	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
		return curr;

	while (node) {
		struct sched_entity *se = __node_2_se(node);

		/*
		 * If this entity is not eligible, try the left subtree.
		 */
		if (!all_eligible && !entity_eligible(cfs_rq, se)) {
			node = node->rb_left;
			continue;
		}

		if (node->rb_left) {
			struct sched_entity *left = __node_2_se(node->rb_left);

			BUG_ON(left->min_deadline < se->min_deadline);

			/* Tiebreak on vruntime */
			if (left->min_deadline == se->min_deadline) {
				node = node->rb_left;
				all_eligible = true;
				continue;
			}

			if (!all_eligible) {
				/*
				 * We're going to search right subtree and the one
				 * with min_deadline can be non-eligible, so record
				 * the left subtree as a candidate.
				 */
				if (!cand || deadline_gt(min_deadline, cand, left))
					cand = left;
			}
		}

		/* min_deadline is at this node, no need to look right */
		if (se->deadline == se->min_deadline) {
			best = se;
			break;
		}

		node = node->rb_right;

		if (!node && cand) {
			node = &cand->run_node;
			all_eligible = true;
			cand = NULL;
		}
	}

	if (!best || (curr && deadline_gt(deadline, best, curr)))
		best = curr;

	return best;
}
On Wed, Oct 11, 2023 at 08:12:09PM +0800, Abel Wu wrote:
> The implementation of __pick_eevdf() now is quite complicated which
> makes it really hard to maintain. I'm trying my best to refactor this
> part, hopefully can do some help. Below is only for illustration, I
> need to test more.

Well, the form it has now is somewhat close to the one in the paper. I think Ben added one early break it doesn't have or something.

As the paper explains, you get two walks, one down the eligibility path, and then one down the heap. I think the current code structure represents that fairly well.

Very vague memories seem to suggest I thought to be clever some 10+ years ago when I wrote that other decent loop that Ben showed to be broken (and woe on me for not verifying it when I resurrected the patches, I rewrote pretty much everything else, except this lookup :/).
Abel Wu <wuyun.abel@bytedance.com> writes:

> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>> +	/*
>> +	 * Now best_left and all of its children are eligible, and we are just
>> +	 * looking for deadline == min_deadline
>> +	 */
>> +	node = &best_left->run_node;
>> +	while (node) {
>> +		struct sched_entity *se = __node_2_se(node);
>> +
>> +		/* min_deadline is the current node */
>> +		if (se->deadline == se->min_deadline)
>> +			return se;
>
> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>
>> +
>> +		/* min_deadline is in the left branch */
>>  		if (node->rb_left &&
>>  		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>  			node = node->rb_left;
>>  			continue;
>>  		}
>
> .. here, thoughts?

Yeah, that should work and be better on the tiebreak (and my test code agrees). There's an argument that the tiebreak will never really come up and it's better to avoid the potential one extra cache line from "__node_2_se(node->rb_left)->min_deadline" though.

>
>> +		/* else min_deadline is in the right branch */
>>  		node = node->rb_right;
>>  	}
>> +	return NULL;
>
> Why not 'best'? Since ..

The only time this can happen is if the tree is corrupt. We only reach this case if best_left is set at all (and best_left's min_deadline is better than "best"'s, which includes curr). In that case getting an error message is good, and in general I wasn't worrying about it much.

>
>> +}
>>
>> -	if (!best || (curr && deadline_gt(deadline, best, curr)))
>> -		best = curr;
>> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>> +{
>> +	struct sched_entity *se = __pick_eevdf(cfs_rq);
>>
>> -	if (unlikely(!best)) {
>> +	if (!se) {
>>  		struct sched_entity *left = __pick_first_entity(cfs_rq);
>
> .. cfs_rq->curr isn't considered here.

That said, we should probably consider curr here in the error-case fallback, if just as a "if (!left) left = cfs_rq->curr;"

>
>>  		if (left) {
>>  			pr_err("EEVDF scheduling fail, picking leftmost\n");
>>  			return left;
>>  		}
>>  	}
>>
>> -	return best;
>> +	return se;
>>  }
>>
>>  #ifdef CONFIG_SCHED_DEBUG
>>  struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>>  {
>
> The implementation of __pick_eevdf() now is quite complicated which
> makes it really hard to maintain. I'm trying my best to refactor this
> part, hopefully can do some help. Below is only for illustration, I
> need to test more.

A passing version of that single-loop code minus the cfs_rq->curr handling is here (just pasting the curr handling back on the start/end should do):

static struct sched_entity *pick_eevdf_abel(struct cfs_rq *cfs_rq)
{
	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
	struct sched_entity *best = NULL;
	struct sched_entity *cand = NULL;
	bool all_eligible = false;

	while (node || cand) {
		struct sched_entity *se = __node_2_se(node);

		if (!node) {
			BUG_ON(!cand);
			node = &cand->run_node;
			se = __node_2_se(node);
			all_eligible = true;
			cand = NULL;

			/*
			 * Our initial pass ran into an eligible node which is
			 * itself the best
			 */
			if (best && (s64)(se->min_deadline - best->deadline) > 0)
				break;
		}

		/*
		 * If this entity is not eligible, try the left subtree.
		 */
		if (!all_eligible && !entity_eligible(cfs_rq, se)) {
			node = node->rb_left;
			continue;
		}

		if (node->rb_left) {
			struct sched_entity *left = __node_2_se(node->rb_left);

			BUG_ON(left->min_deadline < se->min_deadline);

			/* Tiebreak on vruntime */
			if (left->min_deadline == se->min_deadline) {
				node = node->rb_left;
				all_eligible = true;
				continue;
			}

			if (!all_eligible) {
				/*
				 * We're going to search right subtree and the one
				 * with min_deadline can be non-eligible, so record
				 * the left subtree as a candidate.
				 */
				if (!cand || deadline_gt(min_deadline, cand, left))
					cand = left;
			}
		}

		if (!all_eligible && (!best || deadline_gt(deadline, best, se)))
			best = se;

		/* min_deadline is at this node, no need to look right */
		if (se->deadline == se->min_deadline) {
			if (all_eligible && (!best || deadline_gte(deadline, best, se)))
				best = se;
			if (!all_eligible && (!best || deadline_gt(deadline, best, se)))
				best = se;
			node = NULL;
			continue;
		}

		node = node->rb_right;
	}

	return best;
}

This does exactly as well on the tiebreak condition of vruntime as the improved two-loop version you suggested. If we don't care about the tiebreak we can replace the ugly if(all_eligible) if(!all_eligible) in the last section with a single version that just uses deadline_gt (or gte) throughout.

Personally with all the extra bits I added for correctness this winds up definitely uglier than the two-loop version, but it might be possible to clean it up further. (And a bunch of it is just personal style preference in the first place)

I've also attached my ugly userspace EEVDF tester as an attachment, hopefully I attached it in a correct mode to go through lkml.
On 10/11/23 9:14 PM, Peter Zijlstra Wrote:
> On Wed, Oct 11, 2023 at 08:12:09PM +0800, Abel Wu wrote:
>
> As the paper explains, you get two walks, one down the eligibility path,
> and then one down the heap. I think the current code structure
> represents that fairly well.

Yes, it does. I just wonder if the 2-step search is necessary, since they obey the same rule of heap search:

  1) node->min_deadline > node->left->min_deadline
     1.1) BUG
  2) node->min_deadline = node->left->min_deadline
     2.1) go left if tiebreak on vruntime
  3) node->min_deadline < node->left->min_deadline
     3.1) return @node if it has min deadline, or
     3.2) go right

which gives:

	while (node) {
		if ((left = node->left)) {
			/* 1.1 */
			BUG_ON(left->min < node->min);

			/* 2.1 */
			if (left->min == node->min) {
				node = left;
				continue;
			}
		}

		/* 3.1 */
		if (node->deadline == node->min)
			return node;

		/* 3.2 */
		node = node->right;
	}

The above returns the entity with the earliest deadline (and with the smallest vruntime if deadlines are equal). Then comes the eligibility:

  0) it helps pruning the tree since the right subtree of a non-eligible
     node can't contain any eligible node.
  3.2.1) record left as a fallback iff the eligibility check is active,
     and saving the best one is enough since none of them contain
     non-eligible nodes, IOW the one with min deadline in the left tree
     must be eligible.
  4) the eligibility check ends immediately once we go left from an
     eligible node, including the switch to the fallback which is
     essentially the 'left' of an eligible node.
  5) fall back to the candidate (if it exists) if we failed to find an
     eligible entity with the earliest deadline.

which makes:

	candidate = NULL;
	need_check = true;

	while (node) {
		/* 0 */
		if (need_check && !eligible(node)) {
			node = node->left;
			goto next;
		}

		if ((left = node->left)) {
			/* 1.1 */
			BUG_ON(left->min < node->min);

			/* 2.1 */
			if (left->min == node->min) {
				node = left;
				/* 4 */
				need_check = false;
				continue;
			}

			/* 3.2.1 */
			if (need_check)
				candidate = better(candidate, left);
		}

		/* 3.1 */
		if (node->deadline == node->min)
			return node;

		/* 3.2 */
		node = node->right;
next:
		/* 5 */
		if (!node && candidate) {
			node = candidate;
			/* 4 */
			need_check = false;
			candidate = NULL;
		}
	}

The search ends with a 'best' entity on the tree; comparing it with curr, which is outside the tree, makes the thing whole. But it's absolutely fine to me to honor the 2-step search given by the paper if you think that is already clear enough :)

Best,
	Abel
On 10/12/23 5:01 AM, Benjamin Segall Wrote:
> Abel Wu <wuyun.abel@bytedance.com> writes:
>
>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>> +	/*
>>> +	 * Now best_left and all of its children are eligible, and we are just
>>> +	 * looking for deadline == min_deadline
>>> +	 */
>>> +	node = &best_left->run_node;
>>> +	while (node) {
>>> +		struct sched_entity *se = __node_2_se(node);
>>> +
>>> +		/* min_deadline is the current node */
>>> +		if (se->deadline == se->min_deadline)
>>> +			return se;
>>
>> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>>
>>> +
>>> +		/* min_deadline is in the left branch */
>>>  		if (node->rb_left &&
>>>  		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>  			node = node->rb_left;
>>>  			continue;
>>>  		}
>>
>> .. here, thoughts?
>
> Yeah, that should work and be better on the tiebreak (and my test code
> agrees). There's an argument that the tiebreak will never really come up
> and it's better to avoid the potential one extra cache line from
> "__node_2_se(node->rb_left)->min_deadline" though.

I see. Then probably do the same thing in the first loop?

>
>>> +		/* else min_deadline is in the right branch */
>>>  		node = node->rb_right;
>>>  	}
>>> +	return NULL;
>>
>> Why not 'best'? Since ..
>
> The only time this can happen is if the tree is corrupt. We only reach
> this case if best_left is set at all (and best_left's min_deadline is
> better than "best"'s, which includes curr). In that case getting an
> error message is good, and in general I wasn't worrying about it much.

Right.

>
>>> +}
>>> -	if (!best || (curr && deadline_gt(deadline, best, curr)))
>>> -		best = curr;
>>> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>>> +{
>>> +	struct sched_entity *se = __pick_eevdf(cfs_rq);
>>> -	if (unlikely(!best)) {
>>> +	if (!se) {
>>>  		struct sched_entity *left = __pick_first_entity(cfs_rq);
>>
>> .. cfs_rq->curr isn't considered here.
>
> That said, we should probably consider curr here in the error-case
> fallback, if just as a "if (!left) left = cfs_rq->curr;"

I don't think so, as there must be some bugs in the scheduler; replacing 'pr_err' with 'BUG()' would be more appropriate.

>
> I've also attached my ugly userspace EEVDF tester as an attachment,
> hopefully I attached it in a correct mode to go through lkml.

Received. Thanks, Ben.
Abel Wu <wuyun.abel@bytedance.com> writes:

> On 10/12/23 5:01 AM, Benjamin Segall Wrote:
>> Abel Wu <wuyun.abel@bytedance.com> writes:
>>
>>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>>> +	/*
>>>> +	 * Now best_left and all of its children are eligible, and we are just
>>>> +	 * looking for deadline == min_deadline
>>>> +	 */
>>>> +	node = &best_left->run_node;
>>>> +	while (node) {
>>>> +		struct sched_entity *se = __node_2_se(node);
>>>> +
>>>> +		/* min_deadline is the current node */
>>>> +		if (se->deadline == se->min_deadline)
>>>> +			return se;
>>>
>>> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>>>
>>>> +
>>>> +		/* min_deadline is in the left branch */
>>>>  		if (node->rb_left &&
>>>>  		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>>  			node = node->rb_left;
>>>>  			continue;
>>>>  		}
>>>
>>> .. here, thoughts?
>> Yeah, that should work and be better on the tiebreak (and my test code
>> agrees). There's an argument that the tiebreak will never really come up
>> and it's better to avoid the potential one extra cache line from
>> "__node_2_se(node->rb_left)->min_deadline" though.
>
> I see. Then probably do the same thing in the first loop?
>

We effectively do that already sorta by accident almost always - computing best and best_left via deadline_gt rather than gte prioritizes earlier elements, which always have a better vruntime.

Then when we do the best_left->min_deadline vs best->deadline computation, we prioritize best_left, which is the one case it can be wrong; we'd need an additional "if (se->min_deadline == best->deadline && (s64)(se->vruntime - best->vruntime) > 0) return best;" check at the end of the second loop.

(Though again I don't know how much this sort of never-going-to-happen slight fairness improvement is worth compared to the extra bit of overhead)
On 10/13/23 1:51 AM, Benjamin Segall Wrote:
> Abel Wu <wuyun.abel@bytedance.com> writes:
>
>> On 10/12/23 5:01 AM, Benjamin Segall Wrote:
>>> Abel Wu <wuyun.abel@bytedance.com> writes:
>>>
>>>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>>>> +	/*
>>>>> +	 * Now best_left and all of its children are eligible, and we are just
>>>>> +	 * looking for deadline == min_deadline
>>>>> +	 */
>>>>> +	node = &best_left->run_node;
>>>>> +	while (node) {
>>>>> +		struct sched_entity *se = __node_2_se(node);
>>>>> +
>>>>> +		/* min_deadline is the current node */
>>>>> +		if (se->deadline == se->min_deadline)
>>>>> +			return se;
>>>>
>>>> IMHO it would be better to tiebreak on vruntime by moving this hunk to ..
>>>>
>>>>> +
>>>>> +		/* min_deadline is in the left branch */
>>>>>  		if (node->rb_left &&
>>>>>  		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>>>  			node = node->rb_left;
>>>>>  			continue;
>>>>>  		}
>>>>
>>>> .. here, thoughts?
>>>
>>> Yeah, that should work and be better on the tiebreak (and my test code
>>> agrees). There's an argument that the tiebreak will never really come up
>>> and it's better to avoid the potential one extra cache line from
>>> "__node_2_se(node->rb_left)->min_deadline" though.
>>
>> I see. Then probably do the same thing in the first loop?
>>
> We effectively do that already sorta by accident almost always -
> computing best and best_left via deadline_gt rather than gte prioritizes
> earlier elements, which always have a better vruntime.

Sorry for not being clear about the 'same thing'. What I meant was to
avoid touching left if the node itself has the min_deadline:

@@ -894,6 +894,9 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
 		if (!best || deadline_gt(deadline, best, se))
 			best = se;
 
+		if (se->deadline == se->min_deadline)
+			break;
+
 		/*
 		 * Every se in a left branch is eligible, keep track of the
 		 * branch with the best min_deadline
@@ -913,10 +916,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
 			break;
 		}
 
-		/* min_deadline is at this node, no need to look right */
-		if (se->deadline == se->min_deadline)
-			break;
-
 		/* else min_deadline is in the right branch. */
 		node = node->rb_right;
 	}

(But still thanks for the convincing explanation on fairness.)

Best,
	Abel

> Then when we do the best_left->min_deadline vs best->deadline
> computation, we prioritize best_left, which is the one case it can be
> wrong, we'd need an additional
> "if (se->min_deadline == best->deadline &&
>     (s64)(se->vruntime - best->vruntime) > 0) return best;" check at the
> end of the second loop.
>
> (Though again I don't know how much this sort of never-going-to-happen
> slight fairness improvement is worth compared to the extra bit of
> overhead)
Abel Wu <wuyun.abel@bytedance.com> writes:

> On 10/13/23 1:51 AM, Benjamin Segall Wrote:
>> Abel Wu <wuyun.abel@bytedance.com> writes:
>>
>>> On 10/12/23 5:01 AM, Benjamin Segall Wrote:
>>>> Abel Wu <wuyun.abel@bytedance.com> writes:
>>>>
>>>>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>>>>> +	/*
>>>>>> +	 * Now best_left and all of its children are eligible, and we are just
>>>>>> +	 * looking for deadline == min_deadline
>>>>>> +	 */
>>>>>> +	node = &best_left->run_node;
>>>>>> +	while (node) {
>>>>>> +		struct sched_entity *se = __node_2_se(node);
>>>>>> +
>>>>>> +		/* min_deadline is the current node */
>>>>>> +		if (se->deadline == se->min_deadline)
>>>>>> +			return se;
>>>>>
>>>>> IMHO it would be better to tiebreak on vruntime by moving this hunk to ..
>>>>>
>>>>>> +
>>>>>> +		/* min_deadline is in the left branch */
>>>>>>  		if (node->rb_left &&
>>>>>>  		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>>>>  			node = node->rb_left;
>>>>>>  			continue;
>>>>>>  		}
>>>>>
>>>>> .. here, thoughts?
>>>>
>>>> Yeah, that should work and be better on the tiebreak (and my test code
>>>> agrees). There's an argument that the tiebreak will never really come up
>>>> and it's better to avoid the potential one extra cache line from
>>>> "__node_2_se(node->rb_left)->min_deadline" though.
>>>
>>> I see. Then probably do the same thing in the first loop?
>>>
>> We effectively do that already sorta by accident almost always -
>> computing best and best_left via deadline_gt rather than gte prioritizes
>> earlier elements, which always have a better vruntime.
>
> Sorry for not clarifying clearly about the 'same thing'. What I meant
> was to avoid touching left if the node itself has the min_deadline:
>
> @@ -894,6 +894,9 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
>  		if (!best || deadline_gt(deadline, best, se))
>  			best = se;
> 
> +		if (se->deadline == se->min_deadline)
> +			break;
> +
>  		/*
>  		 * Every se in a left branch is eligible, keep track of the
>  		 * branch with the best min_deadline
> @@ -913,10 +916,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
>  			break;
>  		}
> 
> -		/* min_deadline is at this node, no need to look right */
> -		if (se->deadline == se->min_deadline)
> -			break;
> -
>  		/* else min_deadline is in the right branch. */
>  		node = node->rb_right;
>  	}
>
> (But still thanks for the convincing explanation on fairness.)
>

Ah, yes, in terms of optimizing performance rather than marginal
fairness, that would help.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c31cda0712f..77e9440b8ab3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -864,18 +864,20 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
  *
  *    se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
  *
  * Which allows an EDF like search on (sub)trees.
  */
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
 	struct sched_entity *curr = cfs_rq->curr;
 	struct sched_entity *best = NULL;
+	struct sched_entity *best_left = NULL;
 
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
+	best = curr;
 
 	/*
 	 * Once selected, run a task until it either becomes non-eligible or
 	 * until it gets a new slice. See the HACK in set_next_entity().
 	 */
@@ -892,45 +894,87 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 			node = node->rb_left;
 			continue;
 		}
 
 		/*
-		 * If this entity has an earlier deadline than the previous
-		 * best, take this one. If it also has the earliest deadline
-		 * of its subtree, we're done.
+		 * Now we heap search eligible trees for the best (min_)deadline
 		 */
-		if (!best || deadline_gt(deadline, best, se)) {
+		if (!best || deadline_gt(deadline, best, se))
 			best = se;
-			if (best->deadline == best->min_deadline)
-				break;
-		}
 
 		/*
-		 * If the earlest deadline in this subtree is in the fully
-		 * eligible left half of our space, go there.
+		 * Every se in a left branch is eligible, keep track of the
+		 * branch with the best min_deadline
 		 */
+		if (node->rb_left) {
+			struct sched_entity *left = __node_2_se(node->rb_left);
+
+			if (!best_left || deadline_gt(min_deadline, best_left, left))
+				best_left = left;
+
+			/*
+			 * min_deadline is in the left branch. rb_left and all
+			 * descendants are eligible, so immediately switch to
+			 * the second loop.
+			 */
+			if (left->min_deadline == se->min_deadline)
+				break;
+		}
+
+		/* min_deadline is at this node, no need to look right */
+		if (se->deadline == se->min_deadline)
+			break;
+
+		/* else min_deadline is in the right branch. */
+		node = node->rb_right;
+	}
+
+	/*
+	 * We ran into an eligible node which is itself the best.
+	 * (Or nr_running == 0 and both are NULL)
+	 */
+	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
+		return best;
+
+	/*
+	 * Now best_left and all of its children are eligible, and we are just
+	 * looking for deadline == min_deadline
+	 */
+	node = &best_left->run_node;
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/* min_deadline is the current node */
+		if (se->deadline == se->min_deadline)
+			return se;
+
+		/* min_deadline is in the left branch */
 		if (node->rb_left &&
 		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
 			node = node->rb_left;
 			continue;
 		}
+
+		/* else min_deadline is in the right branch */
 		node = node->rb_right;
 	}
+	return NULL;
+}
 
-	if (!best || (curr && deadline_gt(deadline, best, curr)))
-		best = curr;
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se = __pick_eevdf(cfs_rq);
 
-	if (unlikely(!best)) {
+	if (!se) {
 		struct sched_entity *left = __pick_first_entity(cfs_rq);
 
 		if (left) {
 			pr_err("EEVDF scheduling fail, picking leftmost\n");
 			return left;
 		}
 	}
 
-	return best;
+	return se;
 }
 
 #ifdef CONFIG_SCHED_DEBUG
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {