Message ID | 20221027081630.34081-1-zhouchuyi@bytedance.com |
---|---|
State | New |
Headers |
From: Chuyi Zhou <zhouchuyi@bytedance.com>
To: peterz@infradead.org, juri.lelli@redhat.com, mingo@redhat.com, vincent.guittot@linaro.org
Cc: linux-kernel@vger.kernel.org, joshdon@google.com, Chuyi Zhou <zhouchuyi@bytedance.com>
Subject: [PATCH] sched/fair: favor non-idle group in tick preemption
Date: Thu, 27 Oct 2022 16:16:30 +0800
Message-Id: <20221027081630.34081-1-zhouchuyi@bytedance.com> |
Series | sched/fair: favor non-idle group in tick preemption |
Commit Message
Chuyi Zhou
Oct. 27, 2022, 8:16 a.m. UTC
The non-idle se dominates the competition vs the idle se when they
belong to the same group. We ensure that idle groups do not preempt
non-idle groups in wakeup preemption (see check_preempt_wakeup()).
However, this can still happen in tick preemption, since
check_preempt_tick() does not check whether current/se is idle.
This patch adds that check.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
kernel/sched/fair.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
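For reference, the wakeup-path check that the commit message points to looks roughly like the following excerpt from check_preempt_wakeup() in kernel/sched/fair.c (a sketch; the surrounding context may differ between kernel versions). The patch below mirrors this asymmetry on the tick path.

/* se: currently running entity, pse: newly woken entity */
cse_is_idle = se_is_idle(se);
pse_is_idle = se_is_idle(pse);

/*
 * Preempt an idle group in favor of a non-idle group (and don't
 * preempt in the inverse case).
 */
if (cse_is_idle && !pse_is_idle)
	goto preempt;
if (cse_is_idle != pse_is_idle)
	return;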
Comments
Hi Chuyi, On Thu, Oct 27, 2022 at 1:16 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote: > > The non-idle se dominates competition vs the idle se when they > are belong to the same group. We ensure that idle groups would not > preempt non-idle group in wakeup preemption(see check_preempt_wakeup()). > However, this can happen in tick preemption, since check_preempt_tick() > dose not check current/se is idle or not. This patch adds this check. > > Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> > --- > kernel/sched/fair.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index e4a0b8bd941c..f3324b8753b3 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4750,6 +4750,7 @@ static void > check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) > { > unsigned long ideal_runtime, delta_exec; > + int cse_is_idle, pse_is_idle; > struct sched_entity *se; > s64 delta; > > @@ -4779,8 +4780,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) > if (delta < 0) > return; > > - if (delta > ideal_runtime) > + if (delta > ideal_runtime) { > + /* > + * Favor non-idle group even in tick preemption > + */ > + cse_is_idle = se_is_idle(curr); > + pse_is_idle = se_is_idle(se); > + if (unlikely(!cse_is_idle && pse_is_idle)) > + return; This would make it so that we never have tick based preemption of a non-idle entity by an idle entity. That's a recipe for starvation of the idle entity, if the non-idle entity is cpu bound. Beyond that though, I'm not quite sure the issue being solved here. The large difference in weight between the idle and non-idle entity means that the non-idle entity will not be preempted for quite a while due to its ideal_runtime being quite high. The idle entity will quickly be preempted on the next tick it takes due to the smaller value of sysctl_sched_idle_min_granularity. The wakeup check is useful for latency sensitive non-idle tasks. However, in steady state competition between idle and non-idle, we must allow some amount of round-robin. > + > resched_curr(rq_of(cfs_rq)); > + } > } > > static void > -- > 2.20.1 > Best, Josh
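For context on the slice sizes mentioned above: ideal_runtime comes from sched_slice(), which scales the scheduling period by the entity's weight (so a minimum-weight idle entity gets a tiny share) and then clamps the result to a minimum granularity that is much smaller for idle entities. A sketch of the tail of sched_slice(), assuming the mainline BASE_SLICE logic of that era:

/* slice has been computed as the entity's weight-proportional share */
if (sched_feat(BASE_SLICE)) {
	if (se_is_idle(init_se) && !sched_idle_cfs_rq(cfs_rq))
		min_gran = sysctl_sched_idle_min_granularity;
	else
		min_gran = sysctl_sched_min_granularity;

	slice = max_t(u64, slice, min_gran);
}

return slice;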
在 2022/10/28 07:34, Josh Don 写道: > Hi Chuyi, > > On Thu, Oct 27, 2022 at 1:16 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote: >> >> The non-idle se dominates competition vs the idle se when they >> are belong to the same group. We ensure that idle groups would not >> preempt non-idle group in wakeup preemption(see check_preempt_wakeup()). >> However, this can happen in tick preemption, since check_preempt_tick() >> dose not check current/se is idle or not. This patch adds this check. >> >> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> >> --- >> kernel/sched/fair.c | 12 +++++++++++- >> 1 file changed, 11 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index e4a0b8bd941c..f3324b8753b3 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -4750,6 +4750,7 @@ static void >> check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) >> { >> unsigned long ideal_runtime, delta_exec; >> + int cse_is_idle, pse_is_idle; >> struct sched_entity *se; >> s64 delta; >> >> @@ -4779,8 +4780,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) >> if (delta < 0) >> return; >> >> - if (delta > ideal_runtime) >> + if (delta > ideal_runtime) { >> + /* >> + * Favor non-idle group even in tick preemption >> + */ >> + cse_is_idle = se_is_idle(curr); >> + pse_is_idle = se_is_idle(se); >> + if (unlikely(!cse_is_idle && pse_is_idle)) >> + return; > Hi Josh, thanks for your reply, > This would make it so that we never have tick based preemption of a > non-idle entity by an idle entity. That's a recipe for starvation of > the idle entity, if the non-idle entity is cpu bound. > > Beyond that though, I'm not quite sure the issue being solved here. > The large difference in weight between the idle and non-idle entity > means that the non-idle entity will not be preempted for quite a while > due to its ideal_runtime being quite high. The idle entity will > quickly be preempted on the next tick it takes due to the smaller > value of sysctl_sched_idle_min_granularity. > Actually, I did some tests and traced this issue. the result shows that this can happen with small probability. I also do some benchmarks. The result seems it has no performance harm, and we can reduce 2%~3% context switch when idle group & non-idle group are present at the same time. In addition, for throughput concern, I think we better let non-idle entity consume it's ideal_runtime. However, as you said, it may cause starvation of the idle entity..... There is another question I would like to take this opportunity to consult you. In our production environment, in some cases, we want to adjust the weight/shares of the idle-cgroup which means we need to change the logic of sched_group_set_shares(), and it can increase the probability of the above issue. So, for what reasons did you prohibit users from changing weights of idle cgroup. Thanks again for your review. Best, Chuyi > The wakeup check is useful for latency sensitive non-idle tasks. > However, in steady state competition between idle and non-idle, we > must allow some amount of round-robin. > >> + >> resched_curr(rq_of(cfs_rq)); >> + } >> } >> >> static void >> -- >> 2.20.1 >> > > Best, > Josh
On Thu, Oct 27, 2022 at 8:57 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote: > > > > 在 2022/10/28 07:34, Josh Don 写道: > > Hi Chuyi, > > > > On Thu, Oct 27, 2022 at 1:16 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote: > >> > >> The non-idle se dominates competition vs the idle se when they > >> are belong to the same group. We ensure that idle groups would not > >> preempt non-idle group in wakeup preemption(see check_preempt_wakeup()). > >> However, this can happen in tick preemption, since check_preempt_tick() > >> dose not check current/se is idle or not. This patch adds this check. > >> > >> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> > >> --- > >> kernel/sched/fair.c | 12 +++++++++++- > >> 1 file changed, 11 insertions(+), 1 deletion(-) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index e4a0b8bd941c..f3324b8753b3 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -4750,6 +4750,7 @@ static void > >> check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) > >> { > >> unsigned long ideal_runtime, delta_exec; > >> + int cse_is_idle, pse_is_idle; > >> struct sched_entity *se; > >> s64 delta; > >> > >> @@ -4779,8 +4780,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) > >> if (delta < 0) > >> return; > >> > >> - if (delta > ideal_runtime) > >> + if (delta > ideal_runtime) { > >> + /* > >> + * Favor non-idle group even in tick preemption > >> + */ > >> + cse_is_idle = se_is_idle(curr); > >> + pse_is_idle = se_is_idle(se); > >> + if (unlikely(!cse_is_idle && pse_is_idle)) > >> + return; > > > Hi Josh, thanks for your reply, > > This would make it so that we never have tick based preemption of a > > non-idle entity by an idle entity. That's a recipe for starvation of > > the idle entity, if the non-idle entity is cpu bound. > > > > Beyond that though, I'm not quite sure the issue being solved here. > > The large difference in weight between the idle and non-idle entity > > means that the non-idle entity will not be preempted for quite a while > > due to its ideal_runtime being quite high. The idle entity will > > quickly be preempted on the next tick it takes due to the smaller > > value of sysctl_sched_idle_min_granularity. > > > Actually, I did some tests and traced this issue. the result shows that > this can happen with small probability. I also do some benchmarks. The > result seems it has no performance harm, and we can reduce 2%~3% > context switch when idle group & non-idle group are present at the same > time. In addition, for throughput concern, I think we better let > non-idle entity consume it's ideal_runtime. However, as you said, it may > cause starvation of the idle entity..... I don't doubt it improves the performance of the non-idle entity, but it is never advisable to indefinitely starve threads. That's a recipe for hard lockups, priority inversion, etc. Does your non-idle entity have a reasonable number of cpu.shares? You can push out the round-robin interval by tuning CFS parameters without completely starving the idle entity. > There is another question I would like to take this opportunity to > consult you. In our production environment, in some cases, we want to > adjust the weight/shares of the idle-cgroup which means we need to > change the logic of sched_group_set_shares(), and it can increase the > probability of the above issue. So, for what reasons did you prohibit > users from changing weights of idle cgroup. 
The reason for limiting the control of weight for idle cgroups is to match the semantics of the per-task SCHED_IDLE api, which gives SCHED_IDLE threads minimum weight. The idea behind SCHED_IDLE is that these entities are intended to soak "unused" cpu cycles, and should give minimal interference to any non-idle thread. However, we don't have strict priority between idle and non-idle, due to the potential for starvation issues. Perhaps you could clarify your use case a bit further. Why do you want to change the weight? Is it to adjust the competition between two idle groups, or something else? > > Thanks again for your review. > > Best, > Chuyi > > The wakeup check is useful for latency sensitive non-idle tasks. > > However, in steady state competition between idle and non-idle, we > > must allow some amount of round-robin. > > > >> + > >> resched_curr(rq_of(cfs_rq)); > >> + } > >> } > >> > >> static void > >> -- > >> 2.20.1 > >> > > > > Best, > > Josh
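For readers less familiar with the per-task API referenced here: a task opts into SCHED_IDLE through the normal scheduler syscalls. A minimal userspace sketch (illustrative only, not part of the patch):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* SCHED_IDLE ignores sched_priority; it must be 0. */
	struct sched_param param = { .sched_priority = 0 };

	/* pid 0 means the calling task */
	if (sched_setscheduler(0, SCHED_IDLE, &param) == -1) {
		perror("sched_setscheduler(SCHED_IDLE)");
		return 1;
	}

	/* From here on, this task mostly soaks otherwise-unused CPU cycles. */
	return 0;
}

The cgroup-level SCHED_IDLE setting discussed in this thread extends the same semantics to an entire group, which is why its weight is pinned to the minimum.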
在 2022/10/29 06:40, Josh Don 写道: > On Thu, Oct 27, 2022 at 8:57 PM Chuyi Zhou <zhouchuyi@bytedance.com> wrote: >> >> >> >> 在 2022/10/28 07:34, Josh Don 写道: >>> Hi Chuyi, >>> >>> On Thu, Oct 27, 2022 at 1:16 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote: >>>> >>>> The non-idle se dominates competition vs the idle se when they >>>> are belong to the same group. We ensure that idle groups would not >>>> preempt non-idle group in wakeup preemption(see check_preempt_wakeup()). >>>> However, this can happen in tick preemption, since check_preempt_tick() >>>> dose not check current/se is idle or not. This patch adds this check. >>>> >>>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> >>>> --- >>>> kernel/sched/fair.c | 12 +++++++++++- >>>> 1 file changed, 11 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>> index e4a0b8bd941c..f3324b8753b3 100644 >>>> --- a/kernel/sched/fair.c >>>> +++ b/kernel/sched/fair.c >>>> @@ -4750,6 +4750,7 @@ static void >>>> check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) >>>> { >>>> unsigned long ideal_runtime, delta_exec; >>>> + int cse_is_idle, pse_is_idle; >>>> struct sched_entity *se; >>>> s64 delta; >>>> >>>> @@ -4779,8 +4780,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) >>>> if (delta < 0) >>>> return; >>>> >>>> - if (delta > ideal_runtime) >>>> + if (delta > ideal_runtime) { >>>> + /* >>>> + * Favor non-idle group even in tick preemption >>>> + */ >>>> + cse_is_idle = se_is_idle(curr); >>>> + pse_is_idle = se_is_idle(se); >>>> + if (unlikely(!cse_is_idle && pse_is_idle)) >>>> + return; >>> >> Hi Josh, thanks for your reply, >>> This would make it so that we never have tick based preemption of a >>> non-idle entity by an idle entity. That's a recipe for starvation of >>> the idle entity, if the non-idle entity is cpu bound. >>> >>> Beyond that though, I'm not quite sure the issue being solved here. >>> The large difference in weight between the idle and non-idle entity >>> means that the non-idle entity will not be preempted for quite a while >>> due to its ideal_runtime being quite high. The idle entity will >>> quickly be preempted on the next tick it takes due to the smaller >>> value of sysctl_sched_idle_min_granularity. >>> >> Actually, I did some tests and traced this issue. the result shows that >> this can happen with small probability. I also do some benchmarks. The >> result seems it has no performance harm, and we can reduce 2%~3% >> context switch when idle group & non-idle group are present at the same >> time. In addition, for throughput concern, I think we better let >> non-idle entity consume it's ideal_runtime. However, as you said, it may >> cause starvation of the idle entity..... > > I don't doubt it improves the performance of the non-idle entity, but > it is never advisable to indefinitely starve threads. That's a recipe > for hard lockups, priority inversion, etc. > > Does your non-idle entity have a reasonable number of cpu.shares? You > can push out the round-robin interval by tuning CFS parameters without > completely starving the idle entity. > >> There is another question I would like to take this opportunity to >> consult you. In our production environment, in some cases, we want to >> adjust the weight/shares of the idle-cgroup which means we need to >> change the logic of sched_group_set_shares(), and it can increase the >> probability of the above issue. 
So, for what reasons did you prohibit >> users from changing weights of idle cgroup. > > The reason for limiting the control of weight for idle cgroups is to > match the semantics of the per-task SCHED_IDLE api, which gives > SCHED_IDLE threads minimum weight. The idea behind SCHED_IDLE is that > these entities are intended to soak "unused" cpu cycles, and should > give minimal interference to any non-idle thread. However, we don't > have strict priority between idle and non-idle, due to the potential > for starvation issues. > > Perhaps you could clarify your use case a bit further. Why do you want > to change the weight? Is it to adjust the competition between two idle > groups, or something else? > Suppose we have two cgroups (idle & non-idle) in /sys/fs/cgroup/cpu. The idle cgroup contains some offline services, such as big data processing; the non-idle cgroup contains some online services which have higher priority to users and are sensitive to latency. We set quota/period for the idle cgroup, which indicates its *cpu limit*. In general, we consider that the closer the idle cgroup's cpu usage is to that limit, the better. However, when the system is busy, the idle cgroup can only get a small amount of cpu with the minimum weight. To cope with the above situation, we changed the default weight. One more question: why do you think this patch can starve the idle entity? /* * Ensure that a task that missed wakeup preemption by a * narrow margin doesn't have to wait for a full slice. * This also mitigates buddy induced latencies under load. */ se = __pick_first_entity(cfs_rq); delta = curr->vruntime - se->vruntime; if (delta < 0) return; if (delta > ideal_runtime) resched_curr(rq_of(cfs_rq)); se can preempt curr only when curr->vruntime > se->vruntime && curr->vruntime - se->vruntime > ideal_runtime is true. I think the original purpose is that se doesn't have to wait for a full slice, reducing response time if se is latency sensitive. This patch just lets curr exhaust its ideal_runtime when se is idle and curr is non-idle. Normally se will be chosen by pick_next_entity(). Maybe I missed something? Thanks >> >> Thanks again for your review. >> >> Best, >> Chuyi >>> The wakeup check is useful for latency sensitive non-idle tasks. >>> However, in steady state competition between idle and non-idle, we >>> must allow some amount of round-robin. >>> >>>> + >>>> resched_curr(rq_of(cfs_rq)); >>>> + } >>>> } >>>> >>>> static void >>>> -- >>>> 2.20.1 >>>> >>> >>> Best, >>> Josh
On Mon, Oct 31, 2022 at 1:39 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote: > > >> 在 2022/10/28 07:34, Josh Don 写道: > > The reason for limiting the control of weight for idle cgroups is to > > match the semantics of the per-task SCHED_IDLE api, which gives > > SCHED_IDLE threads minimum weight. The idea behind SCHED_IDLE is that > > these entities are intended to soak "unused" cpu cycles, and should > > give minimal interference to any non-idle thread. However, we don't > > have strict priority between idle and non-idle, due to the potential > > for starvation issues. > > > > Perhaps you could clarify your use case a bit further. Why do you want > > to change the weight? Is it to adjust the competition between two idle > > groups, or something else? > > > Suppose we have two cgroups(idle & non-idle)in /sys/fs/cgroup/cpu. > Idle cgroup contains some offline service, such as beg data processing; > non-idle cgroup contains some online service which have > higher priority to users and are sensitive to latency. We set > quota/period for idle cgroup which indicates it's *cpu limit*. > In general, we consider that the idle cgroup's cpu usage > closer to the limit, the better. However, when the system is busy, > the idle cgroups can only get little cpu resources with minimum weight. > To cope with the above situation, we changed the default weight. I see. So you want the part of SCHED_IDLE that makes the entity highly preemptible (and avoids preemption of non idle entities), but want to adjust weight to reach a target cpu split? That seems a bit counterintuitive to me, since by giving the idle entities higher weight, you'll end up pushing out the round-robin latency for the non-idle entities. Worth noting that SCHED_IDLE is a bit of a CFS hack, but the intended semantics of it are that these threads soak only "remaining cycles". This comes with many implications beyond just weight. For example, a cpu running only SCHED_IDLE entities is considered as "idle" from the perspective of non-idle entities. If we give these idle entities meaningful weight, we start to break assumptions there, for example see sched_idle_cpu() and load balancing. I wonder if maybe dusting off SCHED_BATCH is a better answer here, for this type of use case (some amount of throughput "guarantee", but with preemption properties similar to SCHED_IDLE). Peter, thoughts? > One more question is, why you think this patch can strave idle entity? > > /* > * Ensure that a task that missed wakeup preemption by a > * narrow margin doesn't have to wait for a full slice. > * This also mitigates buddy induced latencies under load. > */ > se = __pick_first_entity(cfs_rq); > delta = curr->vruntime - se->vruntime; > > if (delta < 0) > return; > > if (delta > ideal_runtime) > resched_curr(rq_of(cfs_rq)); > > se can preempt curr only when > curr->vruntime > se->vruntime && > curr->vruntime - se->vruntime > ideal_runtime > is true. I think the original purpose is that se doesn't have to wait > for a full slice, reduce response time if se is latency sensitive. > This patch just let curr exhaust it's idleal_runtime when se is idle and > curr is non-idle. Normally se will be choosed by pick_next_entity(). > > Maybe I missed something ? > Thanks No that was my mistake, I accidentally thought this delta was being applied to the 'if (delta_exec > ideal_runtime) {' above in check_preempt_tick(). 
Some weirdness about this change though, is that if there is a non-idle current entity, and the two next entities on the cfs_rq are idle and non-idle respectively, we'll now take longer to preempt the on-cpu non-idle entity, because the non-idle entity on the cfs_rq is 'hidden' by the idle 'first' entity. Wakeup preemption is different because we're always directly comparing the current entity with the newly woken entity. Best, Josh
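The placement assumption mentioned above (a CPU running only SCHED_IDLE entities is treated as idle when placing non-idle tasks) is implemented by helpers along these lines; the sketch follows the mainline definitions in kernel/sched/fair.c:

/*
 * An rq counts as "idle" for task placement when everything queued on it
 * is a SCHED_IDLE entity; giving idle entities real weight weakens what
 * this signal promises to non-idle tasks.
 */
static int sched_idle_rq(struct rq *rq)
{
	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
			rq->nr_running);
}

static int sched_idle_cpu(int cpu)
{
	return sched_idle_rq(cpu_rq(cpu));
}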
在 2022/11/1 06:44, Josh Don 写道: > On Mon, Oct 31, 2022 at 1:39 AM Chuyi Zhou <zhouchuyi@bytedance.com> wrote: >> >>>> 在 2022/10/28 07:34, Josh Don 写道: >>> The reason for limiting the control of weight for idle cgroups is to >>> match the semantics of the per-task SCHED_IDLE api, which gives >>> SCHED_IDLE threads minimum weight. The idea behind SCHED_IDLE is that >>> these entities are intended to soak "unused" cpu cycles, and should >>> give minimal interference to any non-idle thread. However, we don't >>> have strict priority between idle and non-idle, due to the potential >>> for starvation issues. >>> >>> Perhaps you could clarify your use case a bit further. Why do you want >>> to change the weight? Is it to adjust the competition between two idle >>> groups, or something else? >>> >> Suppose we have two cgroups(idle & non-idle)in /sys/fs/cgroup/cpu. >> Idle cgroup contains some offline service, such as beg data processing; >> non-idle cgroup contains some online service which have >> higher priority to users and are sensitive to latency. We set >> quota/period for idle cgroup which indicates it's *cpu limit*. >> In general, we consider that the idle cgroup's cpu usage >> closer to the limit, the better. However, when the system is busy, >> the idle cgroups can only get little cpu resources with minimum weight. >> To cope with the above situation, we changed the default weight. > > I see. So you want the part of SCHED_IDLE that makes the entity highly > preemptible (and avoids preemption of non idle entities), but want to > adjust weight to reach a target cpu split? That seems a bit > counterintuitive to me, since by giving the idle entities higher > weight, you'll end up pushing out the round-robin latency for the > non-idle entities. > > Worth noting that SCHED_IDLE is a bit of a CFS hack, but the intended > semantics of it are that these threads soak only "remaining cycles". > This comes with many implications beyond just weight. For example, a > cpu running only SCHED_IDLE entities is considered as "idle" from the > perspective of non-idle entities. If we give these idle entities > meaningful weight, we start to break assumptions there, for example > see sched_idle_cpu() and load balancing. > > I wonder if maybe dusting off SCHED_BATCH is a better answer here, for > this type of use case (some amount of throughput "guarantee", but with > preemption properties similar to SCHED_IDLE). Peter, thoughts? > >> One more question is, why you think this patch can strave idle entity? >> >> /* >> * Ensure that a task that missed wakeup preemption by a >> * narrow margin doesn't have to wait for a full slice. >> * This also mitigates buddy induced latencies under load. >> */ >> se = __pick_first_entity(cfs_rq); >> delta = curr->vruntime - se->vruntime; >> >> if (delta < 0) >> return; >> >> if (delta > ideal_runtime) >> resched_curr(rq_of(cfs_rq)); >> >> se can preempt curr only when >> curr->vruntime > se->vruntime && >> curr->vruntime - se->vruntime > ideal_runtime >> is true. I think the original purpose is that se doesn't have to wait >> for a full slice, reduce response time if se is latency sensitive. >> This patch just let curr exhaust it's idleal_runtime when se is idle and >> curr is non-idle. Normally se will be choosed by pick_next_entity(). >> >> Maybe I missed something ? >> Thanks > > No that was my mistake, I accidentally thought this delta was being > applied to the 'if (delta_exec > ideal_runtime) {' above in > check_preempt_tick(). 
> > Some weirdness about this change though, is that if there is a > non-idle current entity, and the two next entities on the cfs_rq are > idle and non-idle respectively, we'll now take longer to preempt the > on-cpu non-idle entity, because the non-idle entity on the cfs_rq is > 'hidden' by the idle 'first' entity. Wakeup preemption is different > because we're always directly comparing the current entity with the > newly woken entity. > You are right, this can happen with high probability. This patch just compared the curr with the first entity in the tick, and it seems hard to consider all the other entity in cfs_rq. So, what specific negative effects this situation would cause? For example, the "hidden" non-idle entity's latency will be worse than before? Thanks for your patient reply and guidance! BenchMark ======================================================= All the benchmark are done in /sys/fs/cgroup/cpu/online, which is a a normal cpu cgroup. In /sys/fs/cgroup/cpu/ offline, 'perf bench sched messaging -g 1 -l 1000000000' is ran continuously. Besides, we set offline cgroup as idle. schbench baseline patched Lat 50.0th-qrtle-1 6 6 Lat 75.0th-qrtle-1 6 7 Lat 90.0th-qrtle-1 7 7 Lat 95.0th-qrtle-1 7 7 Lat 99.0th-qrtle-1 10 10 Lat 99.5th-qrtle-1 11 11 Lat 99.9th-qrtle-1 12 12 Lat 50.0th-qrtle-2 6 7 Lat 75.0th-qrtle-2 7 8 Lat 90.0th-qrtle-2 8 9 Lat 95.0th-qrtle-2 9 10 Lat 99.0th-qrtle-2 11 12 Lat 99.5th-qrtle-2 12 13 Lat 99.9th-qrtle-2 19 18 Lat 50.0th-qrtle-4 8 9 Lat 75.0th-qrtle-4 10 11 Lat 90.0th-qrtle-4 11 13 Lat 95.0th-qrtle-4 12 14 Lat 99.0th-qrtle-4 15 16 Lat 99.5th-qrtle-4 17 18 Lat 99.9th-qrtle-4 25 23 Lat 50.0th-qrtle-8 11 13 Lat 75.0th-qrtle-8 15 17 Lat 90.0th-qrtle-8 18 20 Lat 95.0th-qrtle-8 19 22 Lat 99.0th-qrtle-8 23 26 Lat 99.5th-qrtle-8 25 28 Lat 99.9th-qrtle-8 41 48 Lat 50.0th-qrtle-16 20 21 Lat 75.0th-qrtle-16 27 28 Lat 90.0th-qrtle-16 32 33 Lat 95.0th-qrtle-16 35 36 Lat 99.0th-qrtle-16 44 43 Lat 99.5th-qrtle-16 53 47 Lat 99.9th-qrtle-16 1310 247 Lat 50.0th-qrtle-23 28 28 Lat 75.0th-qrtle-23 39 39 Lat 90.0th-qrtle-23 46 46 Lat 95.0th-qrtle-23 50 50 Lat 99.0th-qrtle-23 62 58 Lat 99.5th-qrtle-23 449 67 Lat 99.9th-qrtle-23 5516 906 hackbench-process-pipes baseline patched Amean 1 0.6540 ( 0.00%) 0.6480 ( 0.92%) Amean 4 0.8023 ( 0.00%) 0.7860 ( 2.04%) Amean 7 1.3543 ( 0.00%) 1.3780 ( -1.75%) Amean 12 2.2653 ( 0.00%) 2.2853 ( -0.88%) Amean 21 4.7187 ( 0.00%) 5.7217 * -21.26%* Amean 30 7.3217 ( 0.00%) 7.5797 ( -3.52%) Amean 48 7.8410 ( 0.00%) 7.7687 ( 0.92%) Amean 79 10.6037 ( 0.00%) 10.9147 ( -2.93%) Amean 110 11.7623 ( 0.00%) 12.2150 * -3.85%* Amean 141 11.9980 ( 0.00%) 13.0153 ( -8.48%) Amean 172 13.6023 ( 0.00%) 13.3613 ( 1.77%) Amean 203 15.2607 ( 0.00%) 15.5010 ( -1.57%) Amean 234 17.1007 ( 0.00%) 17.2960 ( -1.14%) Amean 265 18.1547 ( 0.00%) 18.2500 ( -0.53%) hackbench-process-sockets baseline patched Amean 1 0.8340 ( 0.00%) 0.8290 ( 0.60%) Amean 4 1.7260 ( 0.00%) 1.7340 ( -0.46%) Amean 7 2.8583 ( 0.00%) 2.8520 ( 0.22%) Amean 12 4.6833 ( 0.00%) 4.7387 * -1.18%* Amean 21 7.9787 ( 0.00%) 7.9723 ( 0.08%) Amean 30 11.1853 ( 0.00%) 11.2573 * -0.64%* Amean 48 17.7430 ( 0.00%) 17.7410 ( 0.01%) Amean 79 29.4067 ( 0.00%) 29.4900 ( -0.28%) Amean 110 41.4427 ( 0.00%) 41.6630 * -0.53%* Amean 141 53.0730 ( 0.00%) 53.4000 * -0.62%* Amean 172 64.6120 ( 0.00%) 65.0123 * -0.62%* Amean 203 76.3320 ( 0.00%) 76.6423 ( -0.41%) Amean 234 87.9563 ( 0.00%) 88.4263 * -0.53%* Amean 265 99.5607 ( 0.00%) 100.3303 * -0.77%* Amean 296 111.0987 ( 0.00%) 112.1843 * -0.98%* hackbench-thread-pipes 
baseline patched Amean 1 0.7007 ( 0.00%) 0.7087 ( -1.14%) Amean 4 0.8553 ( 0.00%) 0.8197 ( 4.17%) Amean 7 1.4327 ( 0.00%) 1.4503 ( -1.23%) Amean 12 2.3497 ( 0.00%) 2.3653 ( -0.67%) Amean 21 5.7677 ( 0.00%) 5.7340 ( 0.58%) Amean 30 7.4670 ( 0.00%) 7.4693 ( -0.03%) Amean 48 8.2720 ( 0.00%) 8.2417 ( 0.37%) Amean 79 11.3247 ( 0.00%) 11.1930 ( 1.16%) Amean 110 13.2153 ( 0.00%) 13.1750 ( 0.31%) Amean 141 14.3250 ( 0.00%) 14.9840 ( -4.60%) Amean 172 16.7150 ( 0.00%) 15.8843 ( 4.97%) Amean 203 17.7397 ( 0.00%) 17.4787 ( 1.47%) Amean 234 19.4990 ( 0.00%) 21.0840 * -8.13%* Amean 265 21.6867 ( 0.00%) 22.1387 ( -2.08%) Amean 296 24.5460 ( 0.00%) 25.0093 ( -1.89%) hackbench-thread-sockets baseline patched Amean 1 0.8930 ( 0.00%) 0.8937 ( -0.07%) Amean 4 1.7803 ( 0.00%) 1.7853 ( -0.28%) Amean 7 2.9317 ( 0.00%) 2.9563 ( -0.84%) Amean 12 4.8313 ( 0.00%) 4.8227 ( 0.18%) Amean 21 8.1783 ( 0.00%) 8.1900 ( -0.14%) Amean 30 11.5653 ( 0.00%) 11.5180 ( 0.41%) Amean 48 18.3100 ( 0.00%) 18.2730 ( 0.20%) Amean 79 30.2047 ( 0.00%) 30.2960 ( -0.30%) Amean 110 42.3330 ( 0.00%) 42.3503 ( -0.04%) Amean 141 54.1853 ( 0.00%) 54.1483 ( 0.07%) Amean 172 66.0127 ( 0.00%) 65.8043 * 0.32%* Amean 203 77.9667 ( 0.00%) 77.6113 ( 0.46%) Amean 234 89.6077 ( 0.00%) 89.3893 * 0.24%* Amean 265 101.2477 ( 0.00%) 101.2773 ( -0.03%) Amean 296 113.1667 ( 0.00%) 113.2967 ( -0.11%) perfpipe baseline patched Min Time 8.86 ( 0.00%) 8.83 ( 0.39%) 1st-qrtle Time 8.92 ( 0.00%) 9.00 ( -0.94%) 2nd-qrtle Time 8.99 ( 0.00%) 9.06 ( -0.76%) 3rd-qrtle Time 9.06 ( 0.00%) 9.08 ( -0.17%) Max-1 Time 8.86 ( 0.00%) 8.83 ( 0.39%) Max-5 Time 8.86 ( 0.00%) 8.83 ( 0.39%) Max-10 Time 8.88 ( 0.00%) 8.94 ( -0.70%) Max-90 Time 9.10 ( 0.00%) 9.13 ( -0.40%) Max-95 Time 9.13 ( 0.00%) 9.20 ( -0.75%) Max-99 Time 9.15 ( 0.00%) 9.22 ( -0.84%) Max Time 9.16 ( 0.00%) 9.28 ( -1.23%) Amean Time 9.00 ( 0.00%) 9.06 * -0.65%* Stddev Time 0.09 ( 0.00%) 0.09 ( -3.71%) CoeffVar Time 0.95 ( 0.00%) 0.98 ( -3.04%) BAmean-99 Time 8.99 ( 0.00%) 9.05 ( -0.63%) BAmean-95 Time 8.99 ( 0.00%) 9.04 ( -0.62%) BAmean-90 Time 8.98 ( 0.00%) 9.04 ( -0.62%) BAmean-75 Time 8.96 ( 0.00%) 9.02 ( -0.67%) BAmean-50 Time 8.93 ( 0.00%) 9.00 ( -0.75%) BAmean-25 Time 8.89 ( 0.00%) 8.95 ( -0.68%) tbench4 Latency baseline patched Amean latency-1 0.73 ( 0.00%) 0.24 * 66.90%* Amean latency-2 0.93 ( 0.00%) 0.42 * 54.94%* Amean latency-4 0.99 ( 0.00%) 0.41 * 58.49%* Amean latency-8 1.03 ( 0.00%) 0.82 * 20.47%* Amean latency-16 1.36 ( 0.00%) 1.07 * 21.12%* Amean latency-32 2.93 ( 0.00%) 1.38 * 52.74%* Amean latency-64 4.21 ( 0.00%) 5.53 * -31.25%* Amean latency-128 9.76 ( 0.00%) 10.67 * -9.28%* Amean latency-256 23.68 ( 0.00%) 12.17 * 48.59%* Amean latency-384 61.68 ( 0.00%) 57.65 ( 6.54%) tbench4 Throughput (misleading but traditional) baseline patched Hmean 1 287.60 ( 0.00%) 287.83 ( 0.08%) Hmean 2 567.87 ( 0.00%) 582.14 * 2.51%* Hmean 4 1126.12 ( 0.00%) 1143.32 * 1.53%* Hmean 8 2138.48 ( 0.00%) 2212.82 * 3.48%* Hmean 16 3955.46 ( 0.00%) 4180.57 * 5.69%* Hmean 32 6838.94 ( 0.00%) 6879.27 * 0.59%* Hmean 64 9462.36 ( 0.00%) 9267.99 * -2.05%* Hmean 128 17156.42 ( 0.00%) 17901.01 * 4.34%* Hmean 256 15823.09 ( 0.00%) 14647.72 * -7.43%* Hmean 384 15366.71 ( 0.00%) 16592.95 * 7.98%* > Best, > Josh
Hi Josh, On 11/1/22 6:44 AM, Josh Don wrote: > ...nit... > Some weirdness about this change though, is that if there is a > non-idle current entity, and the two next entities on the cfs_rq are > idle and non-idle respectively, we'll now take longer to preempt the > on-cpu non-idle entity, because the non-idle entity on the cfs_rq is > 'hidden' by the idle 'first' entity. Wakeup preemption is different > because we're always directly comparing the current entity with the > newly woken entity. Indeed. The hidden non-idle entity might run longer at the cost of delayed preemption. This behavior is not compliant with the SCHED_NORMAL semantics. But we also can't tell from here whether the non-idle entity contains SCHED_NORMAL tasks. As long as the entities are not aware of the scheduling policies, this ambiguity exists. Best, Abel
> > Some weirdness about this change though, is that if there is a > > non-idle current entity, and the two next entities on the cfs_rq are > > idle and non-idle respectively, we'll now take longer to preempt the > > on-cpu non-idle entity, because the non-idle entity on the cfs_rq is > > 'hidden' by the idle 'first' entity. Wakeup preemption is different > > because we're always directly comparing the current entity with the > > newly woken entity. > > > You are right, this can happen with high probability. > This patch just compared the curr with the first entity in > the tick, and it seems hard to consider all the other entity > in cfs_rq. > > So, what specific negative effects this situation would cause? > For example, the "hidden" non-idle entity's latency will be worse > than before? As Abel points out in his email, it can push out the time it'll take to switch to the other non-idle entity. The change might boost some benchmarks numbers, but I don't think it is conclusive enough to say it is a generically beneficial improvement that should be integrated. By the way, I'm curious if you modified any of the sched_idle_cpu() and related load balancing around idle entities given that you've made it so that idle entities can have arbitrary weight (since, as I described in my prior email, this can otherwise cause issues there).
On 2022/11/2 07:39, Josh Don wrote: >>> Some weirdness about this change though, is that if there is a >>> non-idle current entity, and the two next entities on the cfs_rq are >>> idle and non-idle respectively, we'll now take longer to preempt the >>> on-cpu non-idle entity, because the non-idle entity on the cfs_rq is >>> 'hidden' by the idle 'first' entity. Wakeup preemption is different >>> because we're always directly comparing the current entity with the >>> newly woken entity. >>> >> You are right, this can happen with high probability. >> This patch just compared the curr with the first entity in >> the tick, and it seems hard to consider all the other entity >> in cfs_rq. >> >> So, what specific negative effects this situation would cause? >> For example, the "hidden" non-idle entity's latency will be worse >> than before? > > As Abel points out in his email, it can push out the time it'll take > to switch to the other non-idle entity. The change might boost some > benchmarks numbers, but I don't think it is conclusive enough to say > it is a generically beneficial improvement that should be integrated. > > By the way, I'm curious if you modified any of the sched_idle_cpu() > and related load balancing around idle entities given that you've made > it so that idle entities can have arbitrary weight (since, as I > described in my prior email, this can otherwise cause issues there). Hi Josh, Indeed, there are some subtle influences here. For example, find_idlest_group_cpu() may return i if sched_idle_cpu(i) is true. In normal cases it would work well; however, it becomes unpredictable if we change the default minimum weight. I have not found a suitable solution yet.... Thanks for your guidance.
On 2022/11/2 Josh Don wrote: >>> Some weirdness about this change though, is that if there is a >>> non-idle current entity, and the two next entities on the cfs_rq are >>> idle and non-idle respectively, we'll now take longer to preempt the >>> on-cpu non-idle entity, because the non-idle entity on the cfs_rq is >>> 'hidden' by the idle 'first' entity. Wakeup preemption is different >>> because we're always directly comparing the current entity with the >>> newly woken entity. >>> >> You are right, this can happen with high probability. >> This patch just compared the curr with the first entity in >> the tick, and it seems hard to consider all the other entity >> in cfs_rq. >> >> So, what specific negative effects this situation would cause? >> For example, the "hidden" non-idle entity's latency will be worse >> than before? > > As Abel points out in his email, it can push out the time it'll take > to switch to the other non-idle entity. The change might boost some > benchmarks numbers, but I don't think it is conclusive enough to say > it is a generically beneficial improvement that should be integrated. > > By the way, I'm curious if you modified any of the sched_idle_cpu() > and related load balancing around idle entities given that you've made > it so that idle entities can have arbitrary weight (since, as I > described in my prior email, this can otherwise cause issues there). If we want to make it easier for non-idle tasks to preempt idle tasks in tick, maybe we can consider lowering sysctl_sched_idle_min_granularity. Of course this may not ensure that non-idle tasks successfully preempt idle tasks every time, but it seems to be more beneficial for non-idle tasks. IMHO, even if it is allowed to increase the weight of non-idle, it seems that we can make it easier for non-idle tasks to preempt idle tasks by lowering sysctl_sched_idle_min_granularity. Thanks, Hao
On Thu, Nov 3, 2022 at 8:49 PM Hao Jia <jiahao.os@bytedance.com> wrote: > > > > On 2022/11/2 Josh Don wrote: > >>> Some weirdness about this change though, is that if there is a > >>> non-idle current entity, and the two next entities on the cfs_rq are > >>> idle and non-idle respectively, we'll now take longer to preempt the > >>> on-cpu non-idle entity, because the non-idle entity on the cfs_rq is > >>> 'hidden' by the idle 'first' entity. Wakeup preemption is different > >>> because we're always directly comparing the current entity with the > >>> newly woken entity. > >>> > >> You are right, this can happen with high probability. > >> This patch just compared the curr with the first entity in > >> the tick, and it seems hard to consider all the other entity > >> in cfs_rq. > >> > >> So, what specific negative effects this situation would cause? > >> For example, the "hidden" non-idle entity's latency will be worse > >> than before? > > > > As Abel points out in his email, it can push out the time it'll take > > to switch to the other non-idle entity. The change might boost some > > benchmarks numbers, but I don't think it is conclusive enough to say > > it is a generically beneficial improvement that should be integrated. > > > > By the way, I'm curious if you modified any of the sched_idle_cpu() > > and related load balancing around idle entities given that you've made > > it so that idle entities can have arbitrary weight (since, as I > > described in my prior email, this can otherwise cause issues there). > > If we want to make it easier for non-idle tasks to preempt idle tasks in > tick, maybe we can consider lowering sysctl_sched_idle_min_granularity. > Of course this may not ensure that non-idle tasks successfully preempt > idle tasks every time, but it seems to be more beneficial for non-idle > tasks. > > IMHO, even if it is allowed to increase the weight of non-idle, it seems > that we can make it easier for non-idle tasks to preempt idle tasks by > lowering sysctl_sched_idle_min_granularity. Yep, although the effectiveness is partially limited by whatever the HZ is set to for the scheduling tick. > > Thanks, > Hao
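Rough numbers behind the two limits discussed here, assuming the mainline defaults (worth re-checking against the running kernel's configuration):

/*
 * Illustrative arithmetic only (assumed defaults):
 *   sysctl_sched_idle_min_granularity = 750000 ns = 0.75 ms
 *   tick period                       = NSEC_PER_SEC / HZ, e.g. HZ=250 -> 4 ms
 *
 * check_preempt_tick() only runs once per tick, so lowering the idle
 * minimum granularity below the tick period cannot make tick preemption
 * of an idle entity any faster; only wakeup preemption acts between ticks.
 */
#define NSEC_PER_SEC 1000000000UL

static unsigned long tick_period_ns(unsigned int hz)
{
	return NSEC_PER_SEC / hz;	/* HZ=250 -> 4000000 ns */
}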
On 11/2/22 7:39 AM, Josh Don wrote: >>> Some weirdness about this change though, is that if there is a >>> non-idle current entity, and the two next entities on the cfs_rq are >>> idle and non-idle respectively, we'll now take longer to preempt the >>> on-cpu non-idle entity, because the non-idle entity on the cfs_rq is >>> 'hidden' by the idle 'first' entity. Wakeup preemption is different >>> because we're always directly comparing the current entity with the >>> newly woken entity. >>> >> You are right, this can happen with high probability. >> This patch just compared the curr with the first entity in >> the tick, and it seems hard to consider all the other entity >> in cfs_rq. >> >> So, what specific negative effects this situation would cause? >> For example, the "hidden" non-idle entity's latency will be worse >> than before? > > As Abel points out in his email, it can push out the time it'll take > to switch to the other non-idle entity. The change might boost some > benchmarks numbers, but I don't think it is conclusive enough to say > it is a generically beneficial improvement that should be integrated. Agree. > > By the way, I'm curious if you modified any of the sched_idle_cpu() > and related load balancing around idle entities given that you've made > it so that idle entities can have arbitrary weight (since, as I > described in my prior email, this can otherwise cause issues there). Being able to change idle entities' weight can bring nothing but convenience, because it can also be achieved by modifying all their siblings' weight. Which seems not a strong reason to get merged. And I'm also thinking that, although rare, a non-idle group can also have a weight close or even equal to 3. I guess some users who made this kind of setting might only want to benefit from the preemption at wakeup? Nevertheless this setting is supported now :) Best, Abel
On Thu, Nov 10, 2022 at 7:50 PM Abel Wu <wuyun.abel@bytedance.com> wrote: > > > By the way, I'm curious if you modified any of the sched_idle_cpu() > > and related load balancing around idle entities given that you've made > > it so that idle entities can have arbitrary weight (since, as I > > described in my prior email, this can otherwise cause issues there). > > Being able to change idle entities' weight can bring nothing but > convenience, because it can also be achieved by modifying all their > siblings' weight. Which seems not a strong reason to get merged. > > And I'm also thinking that, although rare, a non-idle group can also > have a weight close or even equal to 3. I guess some users who made > this kind of setting might only want to benefit from the preemption > at wakeup? Nevertheless this setting is supported now :) Strongly disagree with this; part of the semantics for idle relies on the minimum weight value. It is true that this behavior gets a little weirder if siblings also have close to min weight, but this is an artifact of the fact that SCHED_IDLE is built into CFS rather than being a separate scheduling class. The minimum weight in general is assumed by load balance, etc. for the purpose of placing non-idle entities. For consistency with the per-task SCHED_IDLE behavior, which effectively makes idle tasks have the max nice value, we also need to match the cgroup idle behavior. I think the use case you're describing would honestly be better served by extending SCHED_BATCH with the preemption properties and using that, rather than try to overload SCHED_IDLE here. There's a difference between non-interactive entities that are ok getting aggressively preempted, vs "idle" entities that should really only be soaking the remaining cycles on the machine. Best, Josh
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..f3324b8753b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4750,6 +4750,7 @@ static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
 	unsigned long ideal_runtime, delta_exec;
+	int cse_is_idle, pse_is_idle;
 	struct sched_entity *se;
 	s64 delta;
 
@@ -4779,8 +4780,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	if (delta < 0)
 		return;
 
-	if (delta > ideal_runtime)
+	if (delta > ideal_runtime) {
+		/*
+		 * Favor non-idle group even in tick preemption
+		 */
+		cse_is_idle = se_is_idle(curr);
+		pse_is_idle = se_is_idle(se);
+		if (unlikely(!cse_is_idle && pse_is_idle))
+			return;
+
 		resched_curr(rq_of(cfs_rq));
+	}
 }
 
 static void