Message ID | 20221102102343.57845-1-zhouchengming@bytedance.com |
---|---|
State | New |
Headers |
From: Chengming Zhou <zhouchengming@bytedance.com>
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com
Cc: linux-kernel@vger.kernel.org, Chengming Zhou <zhouchengming@bytedance.com>
Subject: [PATCH] sched/core: Minor optimize ttwu_runnable()
Date: Wed, 2 Nov 2022 18:23:43 +0800
Message-Id: <20221102102343.57845-1-zhouchengming@bytedance.com>
List-ID: <linux-kernel.vger.kernel.org> |
Series |
sched/core: Minor optimize ttwu_runnable()
|
|
Commit Message
Chengming Zhou
Nov. 2, 2022, 10:23 a.m. UTC
ttwu_runnable() is used as a fast wakeup path when the wakee task
is between set_current_state() and schedule(), in which case all
we need to do is change p->state back to TASK_RUNNING. So there is
no need to call update_rq_clock() and check_preempt_curr() here.
Some performance numbers using mmtests/perfpipe on an Intel Xeon server:
linux-next patched
Min Time 8.67 ( 0.00%) 8.66 ( 0.13%)
1st-qrtle Time 8.83 ( 0.00%) 8.72 ( 1.19%)
2nd-qrtle Time 8.90 ( 0.00%) 8.76 ( 1.57%)
3rd-qrtle Time 8.98 ( 0.00%) 8.82 ( 1.82%)
Max-1 Time 8.67 ( 0.00%) 8.66 ( 0.13%)
Max-5 Time 8.67 ( 0.00%) 8.66 ( 0.13%)
Max-10 Time 8.79 ( 0.00%) 8.69 ( 1.09%)
Max-90 Time 9.01 ( 0.00%) 8.84 ( 1.94%)
Max-95 Time 9.02 ( 0.00%) 8.85 ( 1.86%)
Max-99 Time 9.02 ( 0.00%) 8.88 ( 1.56%)
Max Time 9.59 ( 0.00%) 8.89 ( 7.29%)
Amean Time 8.92 ( 0.00%) 8.77 * 1.65%*
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
kernel/sched/core.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
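For context, the wakee-side pattern that this fast path short-circuits is the classic wait loop quoted later in the thread. A minimal sketch (CONDITION stands in for whatever the task actually waits on):

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);

		if (CONDITION)
			break;

		schedule();
	}
	__set_current_state(TASK_RUNNING);

If try_to_wake_up() runs while the task is still enqueued (i.e. it has not yet reached deactivate_task() inside __schedule()), setting p->__state back to TASK_RUNNING is sufficient: the subsequent schedule() will then no longer dequeue the task.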
Comments
On 02/11/22 18:23, Chengming Zhou wrote: > ttwu_runnable() is used as a fast wakeup path when the wakee task > is between set_current_state() and schedule(), in which case all > we need to do is change p->state back to TASK_RUNNING. So we don't > need to update_rq_clock() and check_preempt_curr() in this case. > > Some performance numbers using mmtests/perfpipe on Intel Xeon server: > > linux-next patched > Min Time 8.67 ( 0.00%) 8.66 ( 0.13%) > 1st-qrtle Time 8.83 ( 0.00%) 8.72 ( 1.19%) > 2nd-qrtle Time 8.90 ( 0.00%) 8.76 ( 1.57%) > 3rd-qrtle Time 8.98 ( 0.00%) 8.82 ( 1.82%) > Max-1 Time 8.67 ( 0.00%) 8.66 ( 0.13%) > Max-5 Time 8.67 ( 0.00%) 8.66 ( 0.13%) > Max-10 Time 8.79 ( 0.00%) 8.69 ( 1.09%) > Max-90 Time 9.01 ( 0.00%) 8.84 ( 1.94%) > Max-95 Time 9.02 ( 0.00%) 8.85 ( 1.86%) > Max-99 Time 9.02 ( 0.00%) 8.88 ( 1.56%) > Max Time 9.59 ( 0.00%) 8.89 ( 7.29%) > Amean Time 8.92 ( 0.00%) 8.77 * 1.65%* > > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> > --- > kernel/sched/core.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 87c9cdf37a26..3785418de127 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) > > rq = __task_rq_lock(p, &rf); > if (task_on_rq_queued(p)) { > - /* check_preempt_curr() may use rq clock */ > - update_rq_clock(rq); > - ttwu_do_wakeup(rq, p, wake_flags, &rf); > + WRITE_ONCE(p->__state, TASK_RUNNING); > + trace_sched_wakeup(p); This also loses the class->task_woken() call, AFAICT we could have !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but then again if there is a need to push we should have done that at the IRQ preempt via set_next_task_{rt, dl}()... So I'm starting to think this is OK, but that needs elaborating in the changelog. > ret = 1; > } > __task_rq_unlock(rq, &rf); > -- > 2.37.2
On 2022/11/5 02:19, Valentin Schneider wrote: > On 02/11/22 18:23, Chengming Zhou wrote: >> ttwu_runnable() is used as a fast wakeup path when the wakee task >> is between set_current_state() and schedule(), in which case all >> we need to do is change p->state back to TASK_RUNNING. So we don't >> need to update_rq_clock() and check_preempt_curr() in this case. >> >> Some performance numbers using mmtests/perfpipe on Intel Xeon server: >> >> linux-next patched >> Min Time 8.67 ( 0.00%) 8.66 ( 0.13%) >> 1st-qrtle Time 8.83 ( 0.00%) 8.72 ( 1.19%) >> 2nd-qrtle Time 8.90 ( 0.00%) 8.76 ( 1.57%) >> 3rd-qrtle Time 8.98 ( 0.00%) 8.82 ( 1.82%) >> Max-1 Time 8.67 ( 0.00%) 8.66 ( 0.13%) >> Max-5 Time 8.67 ( 0.00%) 8.66 ( 0.13%) >> Max-10 Time 8.79 ( 0.00%) 8.69 ( 1.09%) >> Max-90 Time 9.01 ( 0.00%) 8.84 ( 1.94%) >> Max-95 Time 9.02 ( 0.00%) 8.85 ( 1.86%) >> Max-99 Time 9.02 ( 0.00%) 8.88 ( 1.56%) >> Max Time 9.59 ( 0.00%) 8.89 ( 7.29%) >> Amean Time 8.92 ( 0.00%) 8.77 * 1.65%* >> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> >> --- >> kernel/sched/core.c | 5 ++--- >> 1 file changed, 2 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index 87c9cdf37a26..3785418de127 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) >> >> rq = __task_rq_lock(p, &rf); >> if (task_on_rq_queued(p)) { >> - /* check_preempt_curr() may use rq clock */ >> - update_rq_clock(rq); >> - ttwu_do_wakeup(rq, p, wake_flags, &rf); >> + WRITE_ONCE(p->__state, TASK_RUNNING); >> + trace_sched_wakeup(p); > > This also loses the class->task_woken() call, AFAICT we could have > !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but > then again if there is a need to push we should have done that at the IRQ > preempt via set_next_task_{rt, dl}()... So I'm starting to think this is > OK, but that needs elaborating in the changelog. Sorry, I don't get why we could have !p->on_cpu here? I thought if we have task_on_rq_queued(p) here, it means p hasn't got to __schedule() to deactivate_task(), so p should still be on_cpu? Thanks. > > >> ret = 1; >> } >> __task_rq_unlock(rq, &rf); >> -- >> 2.37.2 >
On 05/11/22 09:34, Chengming Zhou wrote: > On 2022/11/5 02:19, Valentin Schneider wrote: >> On 02/11/22 18:23, Chengming Zhou wrote: >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>> index 87c9cdf37a26..3785418de127 100644 >>> --- a/kernel/sched/core.c >>> +++ b/kernel/sched/core.c >>> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) >>> >>> rq = __task_rq_lock(p, &rf); >>> if (task_on_rq_queued(p)) { >>> - /* check_preempt_curr() may use rq clock */ >>> - update_rq_clock(rq); >>> - ttwu_do_wakeup(rq, p, wake_flags, &rf); >>> + WRITE_ONCE(p->__state, TASK_RUNNING); >>> + trace_sched_wakeup(p); >> >> This also loses the class->task_woken() call, AFAICT we could have >> !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but >> then again if there is a need to push we should have done that at the IRQ >> preempt via set_next_task_{rt, dl}()... So I'm starting to think this is >> OK, but that needs elaborating in the changelog. > > Sorry, I don't get why we could have !p->on_cpu here? > > I thought if we have task_on_rq_queued(p) here, it means p hasn't got to > __schedule() to deactivate_task(), so p should still be on_cpu? > > Thanks. > So p is still on the rq for sure, but it may not be the currently running task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop: 0 for (;;) { 1 set_current_state(TASK_UNINTERRUPTIBLE); 2 3 if (CONDITION) 4 break; 5 6 schedule(); 7 } 8 __set_current_state(TASK_RUNNING); Now further consider p0 executes line 1, but then gets interrupted by an IRQ, at the end of which preempt_schedule_irq() happens. We enter __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task state, but we *can* end up switching the CPU's current task. ISTR that's why Peter renamed that function ttwu_*runnable*() and not ttwu_*running*(). >> >> >>> ret = 1; >>> } >>> __task_rq_unlock(rq, &rf); >>> -- >>> 2.37.2 >>
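To make the scenario described above concrete, an illustrative timeline (simplified; function names as used elsewhere in the thread):

	/*
	 * p0 (currently running)                another CPU
	 * ----------------------                -----------
	 * set_current_state(TASK_UNINTERRUPTIBLE);
	 *   <IRQ>
	 *     preempt_schedule_irq();
	 *       __schedule(SM_PREEMPT);
	 *         // preemption: p0 is NOT deactivated despite its non-zero
	 *         // state; it stays queued, but rq->curr can switch to
	 *         // another task and p0->on_cpu drops to 0
	 *   </IRQ>
	 *                                        try_to_wake_up(p0)
	 *                                          ttwu_runnable(p0, ...)
	 *                                            // task_on_rq_queued(p0) is true,
	 *                                            // yet p0 is not the current task
	 */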
On 2022/11/7 20:56, Valentin Schneider wrote: > On 05/11/22 09:34, Chengming Zhou wrote: >> On 2022/11/5 02:19, Valentin Schneider wrote: >>> On 02/11/22 18:23, Chengming Zhou wrote: >>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>>> index 87c9cdf37a26..3785418de127 100644 >>>> --- a/kernel/sched/core.c >>>> +++ b/kernel/sched/core.c >>>> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) >>>> >>>> rq = __task_rq_lock(p, &rf); >>>> if (task_on_rq_queued(p)) { >>>> - /* check_preempt_curr() may use rq clock */ >>>> - update_rq_clock(rq); >>>> - ttwu_do_wakeup(rq, p, wake_flags, &rf); >>>> + WRITE_ONCE(p->__state, TASK_RUNNING); >>>> + trace_sched_wakeup(p); >>> >>> This also loses the class->task_woken() call, AFAICT we could have >>> !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but >>> then again if there is a need to push we should have done that at the IRQ >>> preempt via set_next_task_{rt, dl}()... So I'm starting to think this is >>> OK, but that needs elaborating in the changelog. >> >> Sorry, I don't get why we could have !p->on_cpu here? >> >> I thought if we have task_on_rq_queued(p) here, it means p hasn't got to >> __schedule() to deactivate_task(), so p should still be on_cpu? >> >> Thanks. >> > > So p is still on the rq for sure, but it may not be the currently running > task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop: > > 0 for (;;) { > 1 set_current_state(TASK_UNINTERRUPTIBLE); > 2 > 3 if (CONDITION) > 4 break; > 5 > 6 schedule(); > 7 } > 8 __set_current_state(TASK_RUNNING); > > Now further consider p0 executes line 1, but then gets interrupted by an > IRQ, at the end of which preempt_schedule_irq() happens. We enter > __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task > state, but we *can* end up switching the CPU's current task. Thank you very much for this detailed explanation, I get it. Yes, this process can be seen on CONFIG_PREEMPT kernel. > > ISTR that's why Peter renamed that function ttwu_*runnable*() and not > ttwu_*running*(). So this task p didn't really sleep, it's preempted, later scheduled in, get to execute line 6 schedule(), but its state has been set to TASK_RUNNING, it won't deactivate_task(p). Looks like this patch is still reasonable? Should I put this detailed explanation in the changelog and send v2? Thanks! > >>> >>> >>>> ret = 1; >>>> } >>>> __task_rq_unlock(rq, &rf); >>>> -- >>>> 2.37.2 >>> >
On 07/11/22 21:19, Chengming Zhou wrote: > On 2022/11/7 20:56, Valentin Schneider wrote: >> >> So p is still on the rq for sure, but it may not be the currently running >> task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop: >> >> 0 for (;;) { >> 1 set_current_state(TASK_UNINTERRUPTIBLE); >> 2 >> 3 if (CONDITION) >> 4 break; >> 5 >> 6 schedule(); >> 7 } >> 8 __set_current_state(TASK_RUNNING); >> >> Now further consider p0 executes line 1, but then gets interrupted by an >> IRQ, at the end of which preempt_schedule_irq() happens. We enter >> __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task >> state, but we *can* end up switching the CPU's current task. > > Thank you very much for this detailed explanation, I get it. Yes, this > process can be seen on CONFIG_PREEMPT kernel. > >> >> ISTR that's why Peter renamed that function ttwu_*runnable*() and not >> ttwu_*running*(). > > So this task p didn't really sleep, it's preempted, later scheduled in, > get to execute line 6 schedule(), but its state has been set to TASK_RUNNING, > it won't deactivate_task(p). > Right! > Looks like this patch is still reasonable? Should I put this detailed > explanation in the changelog and send v2? > So that's the part for the p->sched_class->task_woken() callback, which only affects RT and DL (and only does something when !p->on_cpu). I *think* it's fine to remove it from ttwu_runnable() as any push/pull should have happened when other tasks were enqueued on the same CPU - with that said, it wouldn't hurt to double check this :-) As for the check_preempt_curr(), since per the above p can be preempted, you could have scenarios right now with CFS tasks where ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set. e.g. p0 does set_current_state(TASK_UNINTERRUPTIBLE) but then gets interrupted by the tick, a p1 gets selected to run instead because of check_preempt_tick(), and then runs long enough to have check_preempt_curr() decide to let p0 preempt p1. That does require specific timing (lower tick frequency should make this more likely) and probably task niceness distribution too, but isn't impossible. Maybe try reading p->on_cpu, and only do the quick task state update if it's still the current task, otherwise do the preemption checks? > Thanks! >
On 2022/11/7 23:54, Valentin Schneider wrote: > On 07/11/22 21:19, Chengming Zhou wrote: >> On 2022/11/7 20:56, Valentin Schneider wrote: >>> >>> So p is still on the rq for sure, but it may not be the currently running >>> task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop: >>> >>> 0 for (;;) { >>> 1 set_current_state(TASK_UNINTERRUPTIBLE); >>> 2 >>> 3 if (CONDITION) >>> 4 break; >>> 5 >>> 6 schedule(); >>> 7 } >>> 8 __set_current_state(TASK_RUNNING); >>> >>> Now further consider p0 executes line 1, but then gets interrupted by an >>> IRQ, at the end of which preempt_schedule_irq() happens. We enter >>> __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task >>> state, but we *can* end up switching the CPU's current task. >> >> Thank you very much for this detailed explanation, I get it. Yes, this >> process can be seen on CONFIG_PREEMPT kernel. >> >>> >>> ISTR that's why Peter renamed that function ttwu_*runnable*() and not >>> ttwu_*running*(). >> >> So this task p didn't really sleep, it's preempted, later scheduled in, >> get to execute line 6 schedule(), but its state has been set to TASK_RUNNING, >> it won't deactivate_task(p). >> > > Right! > >> Looks like this patch is still reasonable? Should I put this detailed >> explanation in the changelog and send v2? >> > > So that's the part for the p->sched_class->task_woken() callback, which > only affects RT and DL (and only does something when !p->on_cpu). I *think* > it's fine to remove it from ttwu_runnable() as any push/pull should have > happened when other tasks were enqueued on the same CPU - with that said, > it wouldn't hurt to double check this :-) Yes, ttwu_runnable() should be fine to remove p->sched_class->task_woken(). > > > As for the check_preempt_curr(), since per the above p can be preempted, > you could have scenarios right now with CFS tasks where > ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set. > > e.g. p0 does > > set_current_state(TASK_UNINTERRUPTIBLE) > > but then gets interrupted by the tick, a p1 gets selected to run instead > because of check_preempt_tick(), and then runs long enough to have > check_preempt_curr() decide to let p0 preempt p1. > > That does require specific timing (lower tick frequency should make this > more likely) and probably task niceness distribution too, but isn't > impossible. > > Maybe try reading p->on_cpu, and only do the quick task state update if > it's still the current task, otherwise do the preemption checks? I think it's a good suggestion, so we still have the preemption check when p0 is preempted by p1, letting p0 to be scheduled in more timely. I will update and send v2 later. > >> Thanks! >> >
On Mon, Nov 07, 2022 at 03:54:38PM +0000, Valentin Schneider wrote:
> So that's the part for the p->sched_class->task_woken() callback, which
> only affects RT and DL (and only does something when !p->on_cpu). I *think*
> it's fine to remove it from ttwu_runnable() as any push/pull should have
> happened when other tasks were enqueued on the same CPU - with that said,
> it wouldn't hurt to double check this :-)
>
> As for the check_preempt_curr(), since per the above p can be preempted,
> you could have scenarios right now with CFS tasks where
> ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.
>
> e.g. p0 does
>
>   set_current_state(TASK_UNINTERRUPTIBLE)
>
> but then gets interrupted by the tick, a p1 gets selected to run instead
> because of check_preempt_tick(), and then runs long enough to have
> check_preempt_curr() decide to let p0 preempt p1.
>
> That does require specific timing (lower tick frequency should make this
> more likely) and probably task niceness distribution too, but isn't
> impossible.
>
> Maybe try reading p->on_cpu, and only do the quick task state update if
> it's still the current task, otherwise do the preemption checks?

I'm confused...

So all relevant parties take rq->lock:

 - __schedule()
 - scheduler_tick()
 - ttwu_runnable()

So if ttwu_runnable() sees on_rq and switches state back to RUNNING then
neither check_preempt_curr() nor task_woken() make any sense.

Specifically:

 - you can't very well preempt yourself (which is what check_preempt_curr()
   is trying to determine -- if the woken task should preempt the running
   task, they're both the same in this case);

 - the task did not actually wake up, because it was still on the runqueue
   to begin with. This path prevents a sleep, rather than issues a wakeup.

So yes, I think the patch as proposed is ok.
On Wed, Nov 02, 2022 at 06:23:43PM +0800, Chengming Zhou wrote: > ttwu_runnable() is used as a fast wakeup path when the wakee task > is between set_current_state() and schedule(), in which case all > we need to do is change p->state back to TASK_RUNNING. So we don't > need to update_rq_clock() and check_preempt_curr() in this case. > > Some performance numbers using mmtests/perfpipe on Intel Xeon server: > > linux-next patched > Min Time 8.67 ( 0.00%) 8.66 ( 0.13%) > 1st-qrtle Time 8.83 ( 0.00%) 8.72 ( 1.19%) > 2nd-qrtle Time 8.90 ( 0.00%) 8.76 ( 1.57%) > 3rd-qrtle Time 8.98 ( 0.00%) 8.82 ( 1.82%) > Max-1 Time 8.67 ( 0.00%) 8.66 ( 0.13%) > Max-5 Time 8.67 ( 0.00%) 8.66 ( 0.13%) > Max-10 Time 8.79 ( 0.00%) 8.69 ( 1.09%) > Max-90 Time 9.01 ( 0.00%) 8.84 ( 1.94%) > Max-95 Time 9.02 ( 0.00%) 8.85 ( 1.86%) > Max-99 Time 9.02 ( 0.00%) 8.88 ( 1.56%) > Max Time 9.59 ( 0.00%) 8.89 ( 7.29%) > Amean Time 8.92 ( 0.00%) 8.77 * 1.65%* > > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> > --- > kernel/sched/core.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 87c9cdf37a26..3785418de127 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) > > rq = __task_rq_lock(p, &rf); > if (task_on_rq_queued(p)) { > - /* check_preempt_curr() may use rq clock */ > - update_rq_clock(rq); > - ttwu_do_wakeup(rq, p, wake_flags, &rf); > + WRITE_ONCE(p->__state, TASK_RUNNING); > + trace_sched_wakeup(p); > ret = 1; > } > __task_rq_unlock(rq, &rf); Yes, I think this is correct; however I would re-organize code a little. How's this? diff --git a/kernel/sched/core.c b/kernel/sched/core.c index cb2aa2b54c7a..43d9a1551a5d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3602,15 +3602,40 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags) __schedstat_inc(p->stats.nr_wakeups_sync); } /* - * Mark the task runnable and perform wakeup-preemption. + * Mark the task runnable... 
*/ -static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags, - struct rq_flags *rf) +static inline void ttwu_do_wakeup(struct task_struct *p) { - check_preempt_curr(rq, p, wake_flags); WRITE_ONCE(p->__state, TASK_RUNNING); trace_sched_wakeup(p); +} + +static void +ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, + struct rq_flags *rf) +{ + int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK; + + lockdep_assert_rq_held(rq); + + if (p->sched_contributes_to_load) + rq->nr_uninterruptible--; + +#ifdef CONFIG_SMP + if (wake_flags & WF_MIGRATED) + en_flags |= ENQUEUE_MIGRATED; + else +#endif + if (p->in_iowait) { + delayacct_blkio_end(p); + atomic_dec(&task_rq(p)->nr_iowait); + } + + activate_task(rq, p, en_flags); + check_preempt_curr(rq, p, wake_flags); + + ttwu_do_wakeup(p); #ifdef CONFIG_SMP if (p->sched_class->task_woken) { @@ -3640,31 +3666,6 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags, #endif } -static void -ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, - struct rq_flags *rf) -{ - int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK; - - lockdep_assert_rq_held(rq); - - if (p->sched_contributes_to_load) - rq->nr_uninterruptible--; - -#ifdef CONFIG_SMP - if (wake_flags & WF_MIGRATED) - en_flags |= ENQUEUE_MIGRATED; - else -#endif - if (p->in_iowait) { - delayacct_blkio_end(p); - atomic_dec(&task_rq(p)->nr_iowait); - } - - activate_task(rq, p, en_flags); - ttwu_do_wakeup(rq, p, wake_flags, rf); -} - /* * Consider @p being inside a wait loop: * @@ -3698,9 +3699,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) rq = __task_rq_lock(p, &rf); if (task_on_rq_queued(p)) { - /* check_preempt_curr() may use rq clock */ - update_rq_clock(rq); - ttwu_do_wakeup(rq, p, wake_flags, &rf); + ttwu_do_wakeup(p); ret = 1; } __task_rq_unlock(rq, &rf); @@ -4062,8 +4061,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) goto out; trace_sched_waking(p); - WRITE_ONCE(p->__state, TASK_RUNNING); - trace_sched_wakeup(p); + ttwu_do_wakeup(p); goto out; }
On Tue, Nov 08, 2022 at 10:11:49AM +0100, Peter Zijlstra wrote: > On Mon, Nov 07, 2022 at 03:54:38PM +0000, Valentin Schneider wrote: > > > So that's the part for the p->sched_class->task_woken() callback, which > > only affects RT and DL (and only does something when !p->on_cpu). I *think* > > it's fine to remove it from ttwu_runnable() as any push/pull should have > > happened when other tasks were enqueued on the same CPU - with that said, > > it wouldn't hurt to double check this :-) > > > > > > As for the check_preempt_curr(), since per the above p can be preempted, > > you could have scenarios right now with CFS tasks where > > ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set. > > > > e.g. p0 does > > > > set_current_state(TASK_UNINTERRUPTIBLE) > > > > but then gets interrupted by the tick, a p1 gets selected to run instead > > because of check_preempt_tick(), and then runs long enough to have > > check_preempt_curr() decide to let p0 preempt p1. > > > > That does require specific timing (lower tick frequency should make this > > more likely) and probably task niceness distribution too, but isn't > > impossible. > > > > Maybe try reading p->on_cpu, and only do the quick task state update if > > it's still the current task, otherwise do the preemption checks? > > I'm confused... I am and Valentin has a point. It could indeed be preempted and in that case check_preempt_curr() could indeed make it get back on. In that case his suggestion might make sense; something along the lines of so I suppose... (And I think we can still do the reorg I proposed elsewhere, but I've not yet tried.) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index cb2aa2b54c7a..6944d9473295 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3698,9 +3698,16 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) rq = __task_rq_lock(p, &rf); if (task_on_rq_queued(p)) { - /* check_preempt_curr() may use rq clock */ - update_rq_clock(rq); - ttwu_do_wakeup(rq, p, wake_flags, &rf); + if (!p->on_cpu) { + /* + * When on_rq && !on_cpu the task is preempted, see if + * it should preempt whatever is current there now. + */ + update_rq_clock(rq); + check_preempt_curr(rq, p, wake_flags); + } + WRITE_ONCE(p->__state, TASK_RUNNING); + trace_sched_wakeup(p); ret = 1; } __task_rq_unlock(rq, &rf);
On 2022/11/8 18:08, Peter Zijlstra wrote: > On Tue, Nov 08, 2022 at 10:11:49AM +0100, Peter Zijlstra wrote: >> On Mon, Nov 07, 2022 at 03:54:38PM +0000, Valentin Schneider wrote: >> >>> So that's the part for the p->sched_class->task_woken() callback, which >>> only affects RT and DL (and only does something when !p->on_cpu). I *think* >>> it's fine to remove it from ttwu_runnable() as any push/pull should have >>> happened when other tasks were enqueued on the same CPU - with that said, >>> it wouldn't hurt to double check this :-) >>> >>> >>> As for the check_preempt_curr(), since per the above p can be preempted, >>> you could have scenarios right now with CFS tasks where >>> ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set. >>> >>> e.g. p0 does >>> >>> set_current_state(TASK_UNINTERRUPTIBLE) >>> >>> but then gets interrupted by the tick, a p1 gets selected to run instead >>> because of check_preempt_tick(), and then runs long enough to have >>> check_preempt_curr() decide to let p0 preempt p1. >>> >>> That does require specific timing (lower tick frequency should make this >>> more likely) and probably task niceness distribution too, but isn't >>> impossible. >>> >>> Maybe try reading p->on_cpu, and only do the quick task state update if >>> it's still the current task, otherwise do the preemption checks? >> >> I'm confused... > > I am and Valentin has a point. It could indeed be preempted and in that > case check_preempt_curr() could indeed make it get back on. > > In that case his suggestion might make sense; something along the lines > of so I suppose... > > (And I think we can still do the reorg I proposed elsewhere, but I've not > yet tried.) Ok, thanks for these suggestions. I will try to do this, do some tests and send v2. Thanks. > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index cb2aa2b54c7a..6944d9473295 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3698,9 +3698,16 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) > > rq = __task_rq_lock(p, &rf); > if (task_on_rq_queued(p)) { > - /* check_preempt_curr() may use rq clock */ > - update_rq_clock(rq); > - ttwu_do_wakeup(rq, p, wake_flags, &rf); > + if (!p->on_cpu) { > + /* > + * When on_rq && !on_cpu the task is preempted, see if > + * it should preempt whatever is current there now. > + */ > + update_rq_clock(rq); > + check_preempt_curr(rq, p, wake_flags); > + } > + WRITE_ONCE(p->__state, TASK_RUNNING); > + trace_sched_wakeup(p); > ret = 1; > } > __task_rq_unlock(rq, &rf);
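Putting the two suggestions together (the ttwu_do_wakeup() reorganization plus the !p->on_cpu check), the reworked fast path could end up looking roughly like the sketch below. This is only an illustration of the direction discussed above, not the final v2:

	static int ttwu_runnable(struct task_struct *p, int wake_flags)
	{
		struct rq_flags rf;
		struct rq *rq;
		int ret = 0;

		rq = __task_rq_lock(p, &rf);
		if (task_on_rq_queued(p)) {
			if (!p->on_cpu) {
				/*
				 * When on_rq && !on_cpu the task is preempted, see if
				 * it should preempt whatever is current there now.
				 */
				update_rq_clock(rq);
				check_preempt_curr(rq, p, wake_flags);
			}
			/* ttwu_do_wakeup() now only sets TASK_RUNNING and traces the wakeup */
			ttwu_do_wakeup(p);
			ret = 1;
		}
		__task_rq_unlock(rq, &rf);

		return ret;
	}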
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 87c9cdf37a26..3785418de127 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
-		/* check_preempt_curr() may use rq clock */
-		update_rq_clock(rq);
-		ttwu_do_wakeup(rq, p, wake_flags, &rf);
+		WRITE_ONCE(p->__state, TASK_RUNNING);
+		trace_sched_wakeup(p);
 		ret = 1;
 	}
 	__task_rq_unlock(rq, &rf);