Message ID | 20221104023601.12844-1-dtcccc@linux.alibaba.com |
---|---|
State | New |
Headers |
From: Tianchen Ding <dtcccc@linux.alibaba.com>
To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org
Subject: [PATCH v2] sched: Clear ttwu_pending after enqueue_task
Date: Fri, 4 Nov 2022 10:36:01 +0800
Message-Id: <20221104023601.12844-1-dtcccc@linux.alibaba.com>
In-Reply-To: <20221101073630.2797-1-dtcccc@linux.alibaba.com>
References: <20221101073630.2797-1-dtcccc@linux.alibaba.com> |
Series |
[v2] sched: Clear ttwu_pending after enqueue_task
|
|
Commit Message
Tianchen Ding
Nov. 4, 2022, 2:36 a.m. UTC
We found a long tail latency in schbench when m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, so idle_cpu() will return true until the wakee task is
actually enqueued. This misleads the waker when it selects an idle cpu,
and it may wake multiple worker threads on the same wakee cpu. This
situation is amplified by commit f3dd3f674555 ("sched: Remove the
limitation of WF_ON_CPU on wakelist if wakee cpu is idle") because it
tends to use the wakelist.
Here is the result of "schbench -m 2 -t 16" on a VM with 32 vCPUs
(Intel(R) Xeon(R) Platinum 8369B).

Latency percentiles (usec):
                 base    base+revert_f3dd3f674555    base+this_patch
  50.0000th:        9                          13                  9
  75.0000th:       12                          19                 12
  90.0000th:       15                          22                 15
  95.0000th:       18                          24                 17
 *99.0000th:       27                          31                 24
  99.5000th:     3364                          33                 27
  99.9000th:    12560                          36                 30
We also tested on unixbench and hackbench, and saw no performance
change.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
v2:
Update the commit log with results from other benchmarks.
Add a comment in the code.
Clear ttwu_pending before rq_unlock. This updates ttwu_pending a bit
earlier than in v1, so it may reflect the real state of the rq more
promptly.
v1: https://lore.kernel.org/all/20221101073630.2797-1-dtcccc@linux.alibaba.com/
---
kernel/sched/core.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
Comments
On 2022-11-04 at 10:36:01 +0800, Tianchen Ding wrote:
> [full patch quoted; trimmed]

Reviewed-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu
On Fri, Nov 04, 2022 at 10:36:01AM +0800, Tianchen Ding wrote:
> [full patch quoted; trimmed]

I tested this on bare metal across a range of machines. The impact of
the patch is nowhere near as obvious as it is on a VM but even then,
schbench generally benefits (not by as much, and not always at all
percentiles).

The only workload that appeared to suffer was specjbb2015, but *only* on
NUMA machines; on UMA it was fine, and the benchmark can be a little
flaky for getting stable results anyway. In the few cases where it
showed a problem, the NUMA balancing behaviour was also different, so I
think it can be ignored. In most cases it was better than vanilla and
better than a revert, or at least made marginal differences that were
borderline noise.

However, avoiding stacking tasks due to false positives is also
important because even though that can help performance in some cases
(strictly sync wakeups), it's not necessarily a good idea.

So while it's not a universal win, it wins more than it loses, and it
appears to be more clearly a win on VMs, so on that basis

Acked-by: Mel Gorman <mgorman@suse.de>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 87c9cdf37a26..7a04b5565389 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3739,13 +3739,6 @@ void sched_ttwu_pending(void *arg)
 	if (!llist)
 		return;
 
-	/*
-	 * rq::ttwu_pending racy indication of out-standing wakeups.
-	 * Races such that false-negatives are possible, since they
-	 * are shorter lived that false-positives would be.
-	 */
-	WRITE_ONCE(rq->ttwu_pending, 0);
-
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
@@ -3759,6 +3752,17 @@ void sched_ttwu_pending(void *arg)
 		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
 	}
 
+	/*
+	 * Must be after enqueueing at least one task such that
+	 * idle_cpu() does not observe a false-negative -- if it does,
+	 * it is possible for select_idle_siblings() to stack a number
+	 * of tasks on this CPU during that window.
+	 *
+	 * It is ok to clear ttwu_pending when another task is pending.
+	 * We will receive an IPI after local irq is enabled and then enqueue it.
+	 * Since now nr_running > 0, idle_cpu() will always get the correct result.
+	 */
+	WRITE_ONCE(rq->ttwu_pending, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }