Message ID | 20230222080314.2146-1-xuewen.yan@unisoc.com |
---|---|
State | New |
Headers | From: Xuewen Yan <xuewen.yan@unisoc.com> To: vincent.guittot@linaro.org, mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, dietmar.eggemann@arm.com Cc: rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, qyousef@layalina.io, linux-kernel@vger.kernel.org, ke.wang@unisoc.com, zhaoyang.huang@unisoc.com Subject: [RFC PATCH] sched/fair: update the vruntime to be max vruntime when yield Date: Wed, 22 Feb 2023 16:03:14 +0800 Message-ID: <20230222080314.2146-1-xuewen.yan@unisoc.com> |
Series | [RFC] sched/fair: update the vruntime to be max vruntime when yield |
Commit Message
Xuewen Yan
Feb. 22, 2023, 8:03 a.m. UTC
When a task calls sched_yield(), CFS sets it as the cfs_rq's skip buddy.
If no other task calls sched_yield(), that task keeps being skipped whenever
there are other tasks on the rq. As a result, the task's vruntime is not
updated for a long time, and the cfs_rq's min_vruntime barely advances.
When this happens, the yielding task may have waited a long time while the
other tasks kept running. Once another task calls sched_yield(), the cfs_rq's
skip buddy is overwritten and the first task can run again, but its vruntime
is now far smaller than everyone else's, so it keeps getting picked while the
other tasks, whose vruntime is much larger, cannot run for a long time.
To mitigate this, update the yielding task's vruntime to the cfs_rq's max
vruntime. This way, the cfs_rq's min_vruntime keeps advancing as tasks run.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
---
kernel/sched/fair.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
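For context, the pre-patch code path the commit message refers to looks roughly like the sketch below, paraphrased from kernel/sched/fair.c around v6.2 (not quoted from the thread; details may differ between kernel versions). sched_yield() on a CFS task reaches yield_task_fair(), which only marks the entity as the cfs_rq's skip buddy and leaves its vruntime untouched:

/* Paraphrased sketch of the unpatched behaviour in kernel/sched/fair.c:
 * yielding only sets the skip buddy; the entity's vruntime is not changed.
 */
static void set_skip_buddy(struct sched_entity *se)
{
	for_each_sched_entity(se)
		cfs_rq_of(se)->skip = se;
}

static void yield_task_fair(struct rq *rq)
{
	struct task_struct *curr = rq->curr;
	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
	struct sched_entity *se = &curr->se;

	/* Are we the only task in the tree? */
	if (unlikely(rq->nr_running == 1))
		return;

	clear_buddies(cfs_rq, se);

	if (curr->policy != SCHED_BATCH) {
		update_rq_clock(rq);
		/* Update run-time statistics of the 'current'. */
		update_curr(cfs_rq);
		/* Skip the next clock update to avoid double accounting. */
		rq_clock_skip_update(rq);
	}

	set_skip_buddy(se);
}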
Comments
Hi all

Looking forward to your comments.
Thanks!

On Wed, Feb 22, 2023 at 4:21 PM Xuewen Yan <xuewen.yan@unisoc.com> wrote:
>
> When task call the sched_yield, cfs would set the cfs's skip buddy.
> If there is no other task call the sched_yield syscall, the task would
> always be skiped when there are tasks in rq. As a result, the task's
> vruntime would not be updated for long time, and the cfs's min_vruntime
> is almost not updated.
> When this scenario happens, when the yield task had wait for a long time,
> and other tasks run a long time, once there is other task call the sched_yield,
> the cfs's skip_buddy is covered, at this time, the first task can run normally,
> but the task's vruntime is small, as a result, the task would always run,
> because other task's vruntime is big. This would lead to other tasks can not
> run for a long time.
> In order to mitigate this, update the yield_task's vruntime to be cfs's max vruntime.
> This way, the cfs's min vruntime can be updated as the process running.
>
> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> ---
>  kernel/sched/fair.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff4dbbae3b10..a9ff1921fc07 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -670,7 +670,6 @@ static struct sched_entity *__pick_next_entity(struct sched_entity *se)
>  	return __node_2_se(next);
>  }
>
> -#ifdef CONFIG_SCHED_DEBUG
>  struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>  {
>  	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
> @@ -681,6 +680,7 @@ struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>  	return __node_2_se(last);
>  }
>
> +#ifdef CONFIG_SCHED_DEBUG
>  /**************************************************************
>   * Scheduling class statistics methods:
>   */
> @@ -7751,8 +7751,18 @@ static void set_next_buddy(struct sched_entity *se)
>
>  static void set_skip_buddy(struct sched_entity *se)
>  {
> -	for_each_sched_entity(se)
> -		cfs_rq_of(se)->skip = se;
> +	for_each_sched_entity(se) {
> +		struct sched_entity *last;
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		last = __pick_last_entity(cfs_rq);
> +		if (last) {
> +			se->vruntime = last->vruntime;
> +			update_min_vruntime(cfs_rq);
> +		}
> +
> +		cfs_rq->skip = se;
> +	}
>  }
>
>  /*
> --
> 2.25.1
>
On Wed, Feb 22, 2023 at 04:03:14PM +0800, Xuewen Yan wrote:
> When task call the sched_yield, cfs would set the cfs's skip buddy.
> If there is no other task call the sched_yield syscall, the task would
> always be skiped when there are tasks in rq.

So you have two tasks A) which does sched_yield() and becomes ->skip,
and B) which is while(1). And you're saying that once A does it's thing,
B runs forever and starves A?

> As a result, the task's
> vruntime would not be updated for long time, and the cfs's min_vruntime
> is almost not updated.

But the condition in pick_next_entity() should ensure that we still pick
->skip when it becomes too old. Specifically, when it gets more than
wakeup_gran() behind.

> When this scenario happens, when the yield task had wait for a long time,
> and other tasks run a long time, once there is other task call the sched_yield,
> the cfs's skip_buddy is covered, at this time, the first task can run normally,
> but the task's vruntime is small, as a result, the task would always run,
> because other task's vruntime is big. This would lead to other tasks can not
> run for a long time.

I'm not seeing how this could happen, it should never get behind that
far.

Additionally, check_preempt_tick() will explicitly clear the buddies
when it finds the current task has consumed it's ideal slice.

I really cannot see how your scenario can happen.

> In order to mitigate this, update the yield_task's vruntime to be cfs's max vruntime.
> This way, the cfs's min vruntime can be updated as the process running.

This is a bad solution, SCHED_IDLE tasks have very low weight and can be
shot really far to the right, leading to other trouble.
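The check Peter refers to is the skip-buddy handling in pick_next_entity(); a trimmed, paraphrased sketch of the v6.2-era code (not quoted from the thread, and omitting the next/last buddy handling) looks roughly like:

/* Sketch of the skip-buddy check in pick_next_entity(), kernel/sched/fair.c */
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	struct sched_entity *left = __pick_first_entity(cfs_rq);
	struct sched_entity *se;

	if (!left || (curr && entity_before(curr, left)))
		left = curr;

	se = left;	/* ideally we run the leftmost entity */

	/*
	 * Avoid running the skip buddy, if running something else can
	 * be done without getting too unfair.
	 */
	if (cfs_rq->skip && cfs_rq->skip == se) {
		struct sched_entity *second;

		if (se == curr) {
			second = __pick_first_entity(cfs_rq);
		} else {
			second = __pick_next_entity(se);
			if (!second || (curr && entity_before(curr, second)))
				second = curr;
		}

		/*
		 * Only step over the skip buddy if the runner-up is within
		 * wakeup_gran() of it; otherwise the skipped entity is too
		 * far behind and gets picked anyway.
		 */
		if (second && wakeup_preempt_entity(second, left) < 1)
			se = second;
	}

	return se;
}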
On Mon, 27 Feb 2023 16:40:33 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Feb 22, 2023 at 04:03:14PM +0800, Xuewen Yan wrote:
> > When task call the sched_yield, cfs would set the cfs's skip buddy.
> > If there is no other task call the sched_yield syscall, the task would
> > always be skiped when there are tasks in rq.
>
> So you have two tasks A) which does sched_yield() and becomes ->skip,
> and B) which is while(1). And you're saying that once A does it's thing,
> B runs forever and starves A?

If Xuewen has an example program that demonstrates the issue (pinning to
a CPU the two tasks), that could be very useful.

> This is a bad solution, SCHED_IDLE tasks have very low weight and can be
> shot really far to the right, leading to other trouble.

Does SCHED_IDLE tasks have to run on a busy CPU? That is, if you have a
SCHED_OTHER task running in a while loop, a SCHED_IDLE task will still get
runtime on that CPU?

I always thought SCHED_IDLE tasks were just background tasks for running
when there was nothing else to run?

-- Steve
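No such reproducer appears in the thread; a minimal sketch of the two-task setup being discussed could look like the following (file layout, CPU number and reporting interval are arbitrary choices, not anything from the thread):

/* Hypothetical reproducer sketch: task A yields in a loop, task B is a
 * plain while(1), both pinned to the same CPU. If A starves, its CPU
 * time stops growing while B keeps running.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

static double now(clockid_t clk)
{
	struct timespec ts;

	clock_gettime(clk, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	pin_to_cpu(0);			/* the forked child inherits the affinity */

	if (fork() == 0) {
		/* Task A: sched_yield() in a tight loop, print how much CPU
		 * time it actually gets, once per second of wall time. */
		double last = now(CLOCK_MONOTONIC);

		for (;;) {
			sched_yield();
			if (now(CLOCK_MONOTONIC) - last >= 1.0) {
				printf("yielder cpu time: %.3f s\n",
				       now(CLOCK_PROCESS_CPUTIME_ID));
				last = now(CLOCK_MONOTONIC);
			}
		}
	}

	/* Task B: plain while(1) competing on the same CPU; both loops run
	 * until killed. */
	for (;;)
		;
}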
On 02/27/23 16:40, Peter Zijlstra wrote:
> On Wed, Feb 22, 2023 at 04:03:14PM +0800, Xuewen Yan wrote:
> > When task call the sched_yield, cfs would set the cfs's skip buddy.
> > If there is no other task call the sched_yield syscall, the task would
> > always be skiped when there are tasks in rq.
>
> So you have two tasks A) which does sched_yield() and becomes ->skip,
> and B) which is while(1). And you're saying that once A does it's thing,
> B runs forever and starves A?

I read it differently.

I understood that there are multiple tasks.

If Task A becomes ->skip; then it seems other tasks will continue to be picked
instead. Until another task B calls sched_yield() and become ->skip, then Task
A is picked but with wrong vruntime causing it to run for multiple ticks (my
interpretation of 'always run' below).

There are no while(1) task running IIUC.

> > As a result, the task's
> > vruntime would not be updated for long time, and the cfs's min_vruntime
> > is almost not updated.
>
> But the condition in pick_next_entity() should ensure that we still pick
> ->skip when it becomes too old. Specifically, when it gets more than
> wakeup_gran() behind.

I am not sure I can see it either. Maybe __pick_first_entity() doesn't return
the skipped one, or for some reason vdiff for second is almost always
< wakeup_gran()?

> > When this scenario happens, when the yield task had wait for a long time,
> > and other tasks run a long time, once there is other task call the sched_yield,
> > the cfs's skip_buddy is covered, at this time, the first task can run normally,
> > but the task's vruntime is small, as a result, the task would always run,
> > because other task's vruntime is big. This would lead to other tasks can not
> > run for a long time.

The error seems that when Task A finally runs - it consumes more than its fair
bit of sched_slice() as it looks it was starved.

I think the question is why it was starved? Can you shed some light Xuewen?

My attempt to help to clarify :) I have read this just like you.

FWIW I have seen a report of something similar, but I didn't managed to
reproduce and debug (mostly due to ENOBANDWIDTH); and not sure if the details
are similar to what Xuewen is seeing. But there was a task starving for
multiple ticks - RUNNABLE but never RUNNING in spite of other tasks getting
scheduled in instead multiple times. ie: there was a task RUNNING for most of
the time, and I could see it preempted by other tasks multiple time, but not by
the starving RUNNABLE task that is hung on the rq. It seems to be vruntime
related too but speculating here.

Cheers

--
Qais Yousef

> I'm not seeing how this could happen, it should never get behind that
> far.
>
> Additionally, check_preempt_tick() will explicitly clear the buddies
> when it finds the current task has consumed it's ideal slice.
>
> I really cannot see how your scenario can happen.
>
> > In order to mitigate this, update the yield_task's vruntime to be cfs's max vruntime.
> > This way, the cfs's min vruntime can be updated as the process running.
>
> This is a bad solution, SCHED_IDLE tasks have very low weight and can be
> shot really far to the right, leading to other trouble.
Hi

Thanks very much for comments!

On Tue, Feb 28, 2023 at 6:33 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> On 02/27/23 16:40, Peter Zijlstra wrote:
[...]
> I read it differently.
>
> I understood that there are multiple tasks.
>
> If Task A becomes ->skip; then it seems other tasks will continue to be picked
> instead. Until another task B calls sched_yield() and become ->skip, then Task
> A is picked but with wrong vruntime causing it to run for multiple ticks (my
> interpretation of 'always run' below).
>
> There are no while(1) task running IIUC.
>
[...]
> I am not sure I can see it either. Maybe __pick_first_entity() doesn't return
> the skipped one, or for some reason vdiff for second is almost always
> < wakeup_gran()?
>
[...]
> The error seems that when Task A finally runs - it consumes more than its fair
> bit of sched_slice() as it looks it was starved.
>
> I think the question is why it was starved? Can you shed some light Xuewen?
>
> My attempt to help to clarify :) I have read this just like you.

Thanks for Qais's clarify. And that's exactly what I want to say:)

> FWIW I have seen a report of something similar, but I didn't managed to
> reproduce and debug (mostly due to ENOBANDWIDTH); and not sure if the details
> are similar to what Xuewen is seeing. But there was a task starving for
> multiple ticks - RUNNABLE but never RUNNING in spite of other tasks getting
> scheduled in instead multiple times. ie: there was a task RUNNING for most of
> the time, and I could see it preempted by other tasks multiple time, but not by
> the starving RUNNABLE task that is hung on the rq. It seems to be vruntime
> related too but speculating here.

Yes, now we met the similar scenario when running a monkey test on the
android phone.
There are multiple tasks on cpu, but the runnable task could not be
got scheduled for a long time,
there is task running and we could see the task preempted by other
tasks multiple times.
Then we dump the tasks, and find the vruntime of each task varies
greatly, and the task which running call the sched_yield frequently.
So we suspect that sched_yield affects the task's vruntime, as
previously described, the yield task's vruntime is too small.

There are some tasks's vruntime as follow:

[status: curr] pid: 25501 prio: 116 vrun: 16426426403395799812
[status: skip] pid: 25496 prio: 116 vrun: 16426426403395800756 exec_start: 326203047009312 sum_ex: 29110005599
[status: pend] pid: 25497 prio: 116 vrun: 16426426403395800705 exec_start: 326203047002235 sum_ex: 29110508751
[status: pend] pid: 25496 prio: 116 vrun: 16426426403395800756 exec_start: 326203047009312 sum_ex: 29110005599
[status: pend] pid: 25498 prio: 116 vrun: 16426426403395803053 exec_start: 326203046944427 sum_ex: 28759519211
[status: pend] pid: 25321 prio: 130 vrun: 16668783152248554223 exec_start: 0 sum_ex: 16198728
[status: pend] pid: 25798 prio: 112 vrun: 17467381818375696015 exec_start: 0 sum_ex: 9574265
[status: pend] pid: 24650 prio: 120 vrun: 17811488667922679996 exec_start: 0 sum_ex: 4069384
[status: pend] pid: 26082 prio: 120 vrun: 17876565509001103803 exec_start: 0 sum_ex: 1184039
[status: pend] pid: 22282 prio: 120 vrun: 18010356387391134435 exec_start: 0 sum_ex: 53192
[status: pend] pid: 16714 prio: 120 vrun: 18136518279692783235 exec_start: 0 sum_ex: 53844952
[status: pend] pid: 26188 prio: 120 vrun: 18230794395956633597 exec_start: 0 sum_ex: 13248612
[status: pend] pid: 17645 prio: 120 vrun: 18348420256270370795 exec_start: 0 sum_ex: 4774925
[status: pend] pid: 24259 prio: 120 vrun: 359915144918430571 exec_start: 0 sum_ex: 20508197
[status: pend] pid: 25988 prio: 120 vrun: 558552749871164416 exec_start: 0 sum_ex: 2099153
[status: pend] pid: 21857 prio: 124 vrun: 596088822758688878 exec_start: 0 sum_ex: 246057024
[status: pend] pid: 26614 prio: 130 vrun: 688210016831095807 exec_start: 0 sum_ex: 968307
[status: pend] pid: 14229 prio: 120 vrun: 816756964596474655 exec_start: 0 sum_ex: 793001
[status: pend] pid: 23866 prio: 120 vrun: 1313723379399791578 exec_start: 0 sum_ex: 1507038
[status: pend] pid: 23389 prio: 120 vrun: 1351598627096913799 exec_start: 0 sum_ex: 1648576
[status: pend] pid: 25118 prio: 124 vrun: 2516103258334576715 exec_start: 0 sum_ex: 270423
[status: pend] pid: 26412 prio: 120 vrun: 2674093729417543719 exec_start: 0 sum_ex: 1851229
[status: pend] pid: 26271 prio: 112 vrun: 2728945479807426354 exec_start: 0 sum_ex: 3347695
[status: pend] pid: 24236 prio: 120 vrun: 2919301292085993527 exec_start: 0 sum_ex: 5425846
[status: pend] pid: 22077 prio: 120 vrun: 3262582494560783155 exec_start: 325875071065811 sum_ex: 177555259
[status: pend] pid: 18951 prio: 120 vrun: 3532786464053787829 exec_start: 0 sum_ex: 2634964
[status: pend] pid: 18957 prio: 120 vrun: 3532786464053920593 exec_start: 0 sum_ex: 95538
[status: pend] pid: 18914 prio: 131 vrun: 3532786465880282335 exec_start: 0 sum_ex: 6374535
[status: pend] pid: 17595 prio: 120 vrun: 4839728055620845452 exec_start: 0 sum_ex: 29559732
[status: pend] pid: 32520 prio: 120 vrun: 5701873672841711178 exec_start: 0 sum_ex: 21486313
[status: pend] pid: 24287 prio: 120 vrun: 5701873673743456663 exec_start: 0 sum_ex: 757778741
[status: pend] pid: 25544 prio: 120 vrun: 6050206507780284054 exec_start: 0 sum_ex: 13624309
[status: pend] pid: 26049 prio: 130 vrun: 6144859778903604771 exec_start: 0 sum_ex: 20931577
[status: pend] pid: 26848 prio: 130 vrun: 6144859796032488859 exec_start: 0 sum_ex: 2541963
[status: pend] pid: 21450 prio: 120 vrun: 6451880484497196814 exec_start: 0 sum_ex: 83490289
[status: pend] pid: 15765 prio: 120 vrun: 6479239764142283860 exec_start: 0 sum_ex: 1481737271
[status: pend] pid: 16366 prio: 120 vrun: 6479239764269019562 exec_start: 0 sum_ex: 952608921
[status: pend] pid: 16086 prio: 120 vrun: 6479239764301244958 exec_start: 0 sum_ex: 37393777
[status: pend] pid: 25970 prio: 120 vrun: 6830180148220001175 exec_start: 0 sum_ex: 2531884
[status: pend] pid: 25965 prio: 120 vrun: 6830180150700833203 exec_start: 0 sum_ex: 8031809
[status: pend] pid: 14098 prio: 120 vrun: 7018832854764682872 exec_start: 0 sum_ex: 32975920
[status: pend] pid: 26860 prio: 116 vrun: 7086059821707649029 exec_start: 0 sum_ex: 246173830

Thanks!
BR

[...]
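One way to read the dump (an added illustration, not part of the thread): vruntime values are u64 nanosecond counters that CFS orders by the sign of their wrapped difference, so once two values drift more than 2^63 ns apart the comparison inverts. Taking two values from the dump above:

/* Illustration only: compare two vruntime values from the dump the way
 * CFS does, via the sign of the wrapped u64 difference. The spread here
 * exceeds 2^63 ns, so the "curr" task compares as being *before* the
 * pending one even though its raw value is much larger.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t curr = 16426426403395799812ULL;	/* pid 25501, [status: curr] */
	uint64_t pend = 359915144918430571ULL;		/* pid 24259, [status: pend] */

	printf("(s64)(curr - pend) = %lld\n", (long long)(curr - pend));
	/* prints a large negative number (about -2.4e18 ns),
	 * i.e. curr orders "before" pend */
	return 0;
}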
On Tue, 28 Feb 2023 at 08:42, Xuewen Yan <xuewen.yan94@gmail.com> wrote:
>
> Hi
>
> Thanks very much for comments!
>
[...]
> Yes, now we met the similar scenario when running a monkey test on the
> android phone.
> There are multiple tasks on cpu, but the runnable task could not be
> got scheduled for a long time,
> there is task running and we could see the task preempted by other
> tasks multiple times.
> Then we dump the tasks, and find the vruntime of each task varies
> greatly, and the task which running call the sched_yield frequently.

If I'm not wrong you are using cgroups and as a result you can't
compare the vruntime of tasks that belongs to different group, you
must compare the vruntime of entities at the same level. We might have
to look the side because I can't see why the task would not be
schedule if other tasks in the same group move forward their vruntime

> So we suspect that sched_yield affects the task's vruntime, as
> previously described,the yield's task's vruntime is too small.
>
> There are some tasks's vruntime as follow:
>
[...]
>
> Thanks!
> BR
[...]
Hi Vincent

On Tue, Feb 28, 2023 at 3:53 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 28 Feb 2023 at 08:42, Xuewen Yan <xuewen.yan94@gmail.com> wrote:
[...]
> > Then we dump the tasks, and find the vruntime of each task varies
> > greatly, and the task which running call the sched_yield frequently.
>
> If I'm not wrong you are using cgroups and as a result you can't
> compare the vruntime of tasks that belongs to different group, you
> must compare the vruntime of entities at the same level. We might have
> to look the side because I can't see why the task would not be
> schedule if other tasks in the same group move forward their vruntime

All the tasks belong to the same cgroup.

Thanks!

[...]
On Tue, 28 Feb 2023 at 09:21, Xuewen Yan <xuewen.yan94@gmail.com> wrote:
>
> Hi Vincent
>
[...]
> > If I'm not wrong you are using cgroups and as a result you can't
> > compare the vruntime of tasks that belongs to different group, you
> > must compare the vruntime of entities at the same level. We might have
> > to look the side because I can't see why the task would not be
> > schedule if other tasks in the same group move forward their vruntime
>
> All the tasks belong to the same cgroup.

ok.
I have tried to reproduce your problem but can't see it so far. I'm
probably missing something.

With rt-app, I start:
- 3 tasks A, B, C which are always running
- 1 task D which always runs but yields every 1ms for 1000 times and
then stops yielding and always run

All tasks are pinned on the same cpu in the same cgroup.

I don't see anything wrong.
task A, B, C runs their slices
task D is preempted by others after 1ms for a couple of times when it
calls yield. Then the yield doesn't have effect and task D runs a few
consecutive ms although the yield. Then task D restart to be preempted
by others when it calls yield when its vruntime is close to others

Once task D stop calling yield, the 4 tasks runs normally

Vincent

> Thanks!
[...]
On 02/28/23 10:07, Vincent Guittot wrote:
> On Tue, 28 Feb 2023 at 09:21, Xuewen Yan <xuewen.yan94@gmail.com> wrote:
[...]
> > All the tasks belong to the same cgroup.

Could they move between cpusets though?

> ok.
> I have tried to reproduce your problem but can't see it so far. I'm
> probably missing something.
>
> With rt-app, I start:
> - 3 tasks A, B, C which are always running
> - 1 task D which always runs but yields every 1ms for 1000 times and
> then stops yielding and always run
>
> All tasks are pinned on the same cpu in the same cgroup.
>
> I don't see anything wrong.
> task A, B, C runs their slices
> task D is preempted by others after 1ms for a couple of times when it
> calls yield. Then the yield doesn't have effect and task D runs a few
> consecutive ms although the yield. Then task D restart to be preempted
> by others when it calls yield when its vruntime is close to others
>
> Once task D stop calling yield, the 4 tasks runs normally

Could vruntime be inflated if a task gets stuck on a little core for a while
(where it'll run slower) then compared to another task running on a bigger core
the vruntime will appear smaller for the latter?

Cheers

--
Qais Yousef
On Tue, 28 Feb 2023 at 14:31, Qais Yousef <qyousef@layalina.io> wrote: > > On 02/28/23 10:07, Vincent Guittot wrote: > > On Tue, 28 Feb 2023 at 09:21, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > Hi Vincent > > > > > > On Tue, Feb 28, 2023 at 3:53 PM Vincent Guittot > > > <vincent.guittot@linaro.org> wrote: > > > > > > > > On Tue, 28 Feb 2023 at 08:42, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > > > Hi > > > > > > > > > > Thanks very much for comments! > > > > > > > > > > On Tue, Feb 28, 2023 at 6:33 AM Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > > > > > On 02/27/23 16:40, Peter Zijlstra wrote: > > > > > > > On Wed, Feb 22, 2023 at 04:03:14PM +0800, Xuewen Yan wrote: > > > > > > > > When task call the sched_yield, cfs would set the cfs's skip buddy. > > > > > > > > If there is no other task call the sched_yield syscall, the task would > > > > > > > > always be skiped when there are tasks in rq. > > > > > > > > > > > > > > So you have two tasks A) which does sched_yield() and becomes ->skip, > > > > > > > and B) which is while(1). And you're saying that once A does it's thing, > > > > > > > B runs forever and starves A? > > > > > > > > > > > > I read it differently. > > > > > > > > > > > > I understood that there are multiple tasks. > > > > > > > > > > > > If Task A becomes ->skip; then it seems other tasks will continue to be picked > > > > > > instead. Until another task B calls sched_yield() and become ->skip, then Task > > > > > > A is picked but with wrong vruntime causing it to run for multiple ticks (my > > > > > > interpretation of 'always run' below). > > > > > > > > > > > > There are no while(1) task running IIUC. > > > > > > > > > > > > > > > > > > > > > As a result, the task's > > > > > > > > vruntime would not be updated for long time, and the cfs's min_vruntime > > > > > > > > is almost not updated. > > > > > > > > > > > > > > But the condition in pick_next_entity() should ensure that we still pick > > > > > > > ->skip when it becomes too old. Specifically, when it gets more than > > > > > > > wakeup_gran() behind. > > > > > > > > > > > > I am not sure I can see it either. Maybe __pick_first_entity() doesn't return > > > > > > the skipped one, or for some reason vdiff for second is almost always > > > > > > < wakeup_gran()? > > > > > > > > > > > > > > > > > > > > > When this scenario happens, when the yield task had wait for a long time, > > > > > > > > and other tasks run a long time, once there is other task call the sched_yield, > > > > > > > > the cfs's skip_buddy is covered, at this time, the first task can run normally, > > > > > > > > but the task's vruntime is small, as a result, the task would always run, > > > > > > > > because other task's vruntime is big. This would lead to other tasks can not > > > > > > > > run for a long time. > > > > > > > > > > > > The error seems that when Task A finally runs - it consumes more than its fair > > > > > > bit of sched_slice() as it looks it was starved. > > > > > > > > > > > > I think the question is why it was starved? Can you shed some light Xuewen? > > > > > > > > > > > > My attempt to help to clarify :) I have read this just like you. > > > > > > > > > > Thanks for Qais's clarify. And that's exactly what I want to say:) > > > > > > > > > > > > > > > > > FWIW I have seen a report of something similar, but I didn't managed to > > > > > > reproduce and debug (mostly due to ENOBANDWIDTH); and not sure if the details > > > > > > are similar to what Xuewen is seeing. 
But there was a task starving for > > > > > > multiple ticks - RUNNABLE but never RUNNING in spite of other tasks getting > > > > > > scheduled in instead multiple times. ie: there was a task RUNNING for most of > > > > > > the time, and I could see it preempted by other tasks multiple time, but not by > > > > > > the starving RUNNABLE task that is hung on the rq. It seems to be vruntime > > > > > > related too but speculating here. > > > > > > > > > > Yes, now we met the similar scenario when running a monkey test on the > > > > > android phone. > > > > > There are multiple tasks on cpu, but the runnable task could not be > > > > > got scheduled for a long time, > > > > > there is task running and we could see the task preempted by other > > > > > tasks multiple times. > > > > > Then we dump the tasks, and find the vruntime of each task varies > > > > > greatly, and the task which running call the sched_yield frequently. > > > > > > > > If I'm not wrong you are using cgroups and as a result you can't > > > > compare the vruntime of tasks that belongs to different group, you > > > > must compare the vruntime of entities at the same level. We might have > > > > to look the side because I can't see why the task would not be > > > > schedule if other tasks in the same group move forward their vruntime > > > > > > All the tasks belong to the same cgroup. > > Could they move between cpusets though? I have pinned them on same CPU to force the contention > > > > > ok. > > I have tried to reproduce your problem but can't see it so far. I'm > > probably missing something. > > > > With rt-app, I start: > > - 3 tasks A, B, C which are always running > > - 1 task D which always runs but yields every 1ms for 1000 times and > > then stops yielding and always run > > > > All tasks are pinned on the same cpu in the same cgroup. > > > > I don't see anything wrong. > > task A, B, C runs their slices > > task D is preempted by others after 1ms for a couple of times when it > > calls yield. Then the yield doesn't have effect and task D runs a few > > consecutive ms although the yield. Then task D restart to be preempted > > by others when it calls yield when its vruntime is close to others > > > > Once task D stop calling yield, the 4 tasks runs normally > > Could vruntime be inflated if a task gets stuck on a little core for a while > (where it'll run slower) then compared to another task running on a bigger core > the vruntime will appear smaller for the latter? vruntime is not scaled by cpu capacity and is "normalized" before the task migrates to another cpu so there is no reason to see an impact because on running on little or migrating > > > Cheers > > -- > Qais Yousef
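A rough standalone sketch of the "normalized" vruntime Vincent mentions above (an illustration of the idea only, not the actual dequeue/enqueue/migration code in kernel/sched/fair.c, and all names here are made up): before a task leaves a CPU its vruntime is turned into an offset from the source cfs_rq's min_vruntime, and it is re-based on the destination's min_vruntime on arrival, so absolute vruntime values are never compared across CPUs and running slower on a little core does not by itself inflate them.

#include <stdio.h>

/* Toy model only; these are not the kernel's structures or helpers. */
struct toy_cfs_rq { unsigned long long min_vruntime; };
struct toy_se     { unsigned long long vruntime; };

/* Leaving the old CPU: keep only the offset from that rq's min_vruntime. */
static void toy_denormalize(struct toy_se *se, struct toy_cfs_rq *src)
{
        se->vruntime -= src->min_vruntime;
}

/* Arriving on the new CPU: re-base the offset on the new rq's min_vruntime. */
static void toy_renormalize(struct toy_se *se, struct toy_cfs_rq *dst)
{
        se->vruntime += dst->min_vruntime;
}

int main(void)
{
        struct toy_cfs_rq little = { .min_vruntime = 1000000 };
        struct toy_cfs_rq big    = { .min_vruntime = 9000000 };
        struct toy_se se         = { .vruntime     = 1004000 };

        toy_denormalize(&se, &little);  /* 4000: only the lead over min_vruntime */
        toy_renormalize(&se, &big);     /* 9004000: same lead on the new rq */
        printf("%llu\n", se.vruntime);
        return 0;
}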
Hi Vincent I noticed the following patch: https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/ And I notice the V2 had merged to mainline: https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@amazon.de/T/#u The patch fixed the inversing of the vruntime comparison, and I see that in my case, there also are some vruntime is inverted. Do you think which patch will work for our scenario? I would be very grateful if you could give us some advice. I would try this patch in our tree. Thanks! BR On Tue, Feb 28, 2023 at 9:45 PM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Tue, 28 Feb 2023 at 14:31, Qais Yousef <qyousef@layalina.io> wrote: > > > > On 02/28/23 10:07, Vincent Guittot wrote: > > > On Tue, 28 Feb 2023 at 09:21, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > Hi Vincent > > > > > > > > On Tue, Feb 28, 2023 at 3:53 PM Vincent Guittot > > > > <vincent.guittot@linaro.org> wrote: > > > > > > > > > > On Tue, 28 Feb 2023 at 08:42, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > > > > > Hi > > > > > > > > > > > > Thanks very much for comments! > > > > > > > > > > > > On Tue, Feb 28, 2023 at 6:33 AM Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > > > > > > > On 02/27/23 16:40, Peter Zijlstra wrote: > > > > > > > > On Wed, Feb 22, 2023 at 04:03:14PM +0800, Xuewen Yan wrote: > > > > > > > > > When task call the sched_yield, cfs would set the cfs's skip buddy. > > > > > > > > > If there is no other task call the sched_yield syscall, the task would > > > > > > > > > always be skiped when there are tasks in rq. > > > > > > > > > > > > > > > > So you have two tasks A) which does sched_yield() and becomes ->skip, > > > > > > > > and B) which is while(1). And you're saying that once A does it's thing, > > > > > > > > B runs forever and starves A? > > > > > > > > > > > > > > I read it differently. > > > > > > > > > > > > > > I understood that there are multiple tasks. > > > > > > > > > > > > > > If Task A becomes ->skip; then it seems other tasks will continue to be picked > > > > > > > instead. Until another task B calls sched_yield() and become ->skip, then Task > > > > > > > A is picked but with wrong vruntime causing it to run for multiple ticks (my > > > > > > > interpretation of 'always run' below). > > > > > > > > > > > > > > There are no while(1) task running IIUC. > > > > > > > > > > > > > > > > > > > > > > > > As a result, the task's > > > > > > > > > vruntime would not be updated for long time, and the cfs's min_vruntime > > > > > > > > > is almost not updated. > > > > > > > > > > > > > > > > But the condition in pick_next_entity() should ensure that we still pick > > > > > > > > ->skip when it becomes too old. Specifically, when it gets more than > > > > > > > > wakeup_gran() behind. > > > > > > > > > > > > > > I am not sure I can see it either. Maybe __pick_first_entity() doesn't return > > > > > > > the skipped one, or for some reason vdiff for second is almost always > > > > > > > < wakeup_gran()? > > > > > > > > > > > > > > > > > > > > > > > > When this scenario happens, when the yield task had wait for a long time, > > > > > > > > > and other tasks run a long time, once there is other task call the sched_yield, > > > > > > > > > the cfs's skip_buddy is covered, at this time, the first task can run normally, > > > > > > > > > but the task's vruntime is small, as a result, the task would always run, > > > > > > > > > because other task's vruntime is big. 
This would lead to other tasks can not > > > > > > > > > run for a long time. > > > > > > > > > > > > > > The error seems that when Task A finally runs - it consumes more than its fair > > > > > > > bit of sched_slice() as it looks it was starved. > > > > > > > > > > > > > > I think the question is why it was starved? Can you shed some light Xuewen? > > > > > > > > > > > > > > My attempt to help to clarify :) I have read this just like you. > > > > > > > > > > > > Thanks for Qais's clarify. And that's exactly what I want to say:) > > > > > > > > > > > > > > > > > > > > FWIW I have seen a report of something similar, but I didn't managed to > > > > > > > reproduce and debug (mostly due to ENOBANDWIDTH); and not sure if the details > > > > > > > are similar to what Xuewen is seeing. But there was a task starving for > > > > > > > multiple ticks - RUNNABLE but never RUNNING in spite of other tasks getting > > > > > > > scheduled in instead multiple times. ie: there was a task RUNNING for most of > > > > > > > the time, and I could see it preempted by other tasks multiple time, but not by > > > > > > > the starving RUNNABLE task that is hung on the rq. It seems to be vruntime > > > > > > > related too but speculating here. > > > > > > > > > > > > Yes, now we met the similar scenario when running a monkey test on the > > > > > > android phone. > > > > > > There are multiple tasks on cpu, but the runnable task could not be > > > > > > got scheduled for a long time, > > > > > > there is task running and we could see the task preempted by other > > > > > > tasks multiple times. > > > > > > Then we dump the tasks, and find the vruntime of each task varies > > > > > > greatly, and the task which running call the sched_yield frequently. > > > > > > > > > > If I'm not wrong you are using cgroups and as a result you can't > > > > > compare the vruntime of tasks that belongs to different group, you > > > > > must compare the vruntime of entities at the same level. We might have > > > > > to look the side because I can't see why the task would not be > > > > > schedule if other tasks in the same group move forward their vruntime > > > > > > > > All the tasks belong to the same cgroup. > > > > Could they move between cpusets though? > > I have pinned them on same CPU to force the contention > > > > > > > > > ok. > > > I have tried to reproduce your problem but can't see it so far. I'm > > > probably missing something. > > > > > > With rt-app, I start: > > > - 3 tasks A, B, C which are always running > > > - 1 task D which always runs but yields every 1ms for 1000 times and > > > then stops yielding and always run > > > > > > All tasks are pinned on the same cpu in the same cgroup. > > > > > > I don't see anything wrong. > > > task A, B, C runs their slices > > > task D is preempted by others after 1ms for a couple of times when it > > > calls yield. Then the yield doesn't have effect and task D runs a few > > > consecutive ms although the yield. Then task D restart to be preempted > > > by others when it calls yield when its vruntime is close to others > > > > > > Once task D stop calling yield, the 4 tasks runs normally > > > > Could vruntime be inflated if a task gets stuck on a little core for a while > > (where it'll run slower) then compared to another task running on a bigger core > > the vruntime will appear smaller for the latter? 
> > vruntime is not scaled by cpu capacity and is "normalized" before the > task migrates to another cpu so there is no reason to see an impact > because on running on little or migrating > > > > > > > Cheers > > > > -- > > Qais Yousef
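For reference, the "inversing of the vruntime comparison" mentioned above can be reproduced outside the kernel with a few lines (a standalone illustration, not kernel code): comparisons in the style of entity_before() in kernel/sched/fair.c use a signed difference of unsigned 64-bit virtual times, which only orders values correctly while the two values stay within 2^63 ns of each other, so an entity whose vruntime has gone stale for long enough can suddenly sort as if it were ahead of everyone else.

#include <stdint.h>
#include <stdio.h>

static int vruntime_before(uint64_t a, uint64_t b)
{
        return (int64_t)(a - b) < 0;        /* wraparound-safe only for "small" gaps */
}

int main(void)
{
        uint64_t stale   = 100;             /* vruntime that never advanced */
        uint64_t running = 1ULL << 40;      /* vruntime that kept accumulating */

        /* Normal case: the stale entity correctly sorts first. */
        printf("%d\n", vruntime_before(stale, running));   /* prints 1 */

        /* Once the gap exceeds 2^63 the signed difference changes sign and
         * the ordering flips: the stale entity no longer looks "before". */
        uint64_t far_ahead = stale + (1ULL << 63) + 1;
        printf("%d\n", vruntime_before(stale, far_ahead)); /* prints 0 */
        return 0;
}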
On Wed, 1 Mar 2023 at 08:30, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > Hi Vincent > > I noticed the following patch: > https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/ > And I notice the V2 had merged to mainline: > https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@amazon.de/T/#u > > The patch fixed the inversing of the vruntime comparison, and I see > that in my case, there also are some vruntime is inverted. > Do you think which patch will work for our scenario? I would be very > grateful if you could give us some advice. > I would try this patch in our tree. By default use the one that is merged; The difference is mainly a matter of time range. Also be aware that the case of newly migrated task is not fully covered by both patches. This patch fixes a problem with long sleeping entity in the presence of low weight and always running entities. This doesn't seem to be aligned with the description of your use case Vincent > > Thanks! > BR > > On Tue, Feb 28, 2023 at 9:45 PM Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > > > On Tue, 28 Feb 2023 at 14:31, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > On 02/28/23 10:07, Vincent Guittot wrote: > > > > On Tue, 28 Feb 2023 at 09:21, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > > > Hi Vincent > > > > > > > > > > On Tue, Feb 28, 2023 at 3:53 PM Vincent Guittot > > > > > <vincent.guittot@linaro.org> wrote: > > > > > > > > > > > > On Tue, 28 Feb 2023 at 08:42, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > > > > > > > Hi > > > > > > > > > > > > > > Thanks very much for comments! > > > > > > > > > > > > > > On Tue, Feb 28, 2023 at 6:33 AM Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > > > > > > > > > On 02/27/23 16:40, Peter Zijlstra wrote: > > > > > > > > > On Wed, Feb 22, 2023 at 04:03:14PM +0800, Xuewen Yan wrote: > > > > > > > > > > When task call the sched_yield, cfs would set the cfs's skip buddy. > > > > > > > > > > If there is no other task call the sched_yield syscall, the task would > > > > > > > > > > always be skiped when there are tasks in rq. > > > > > > > > > > > > > > > > > > So you have two tasks A) which does sched_yield() and becomes ->skip, > > > > > > > > > and B) which is while(1). And you're saying that once A does it's thing, > > > > > > > > > B runs forever and starves A? > > > > > > > > > > > > > > > > I read it differently. > > > > > > > > > > > > > > > > I understood that there are multiple tasks. > > > > > > > > > > > > > > > > If Task A becomes ->skip; then it seems other tasks will continue to be picked > > > > > > > > instead. Until another task B calls sched_yield() and become ->skip, then Task > > > > > > > > A is picked but with wrong vruntime causing it to run for multiple ticks (my > > > > > > > > interpretation of 'always run' below). > > > > > > > > > > > > > > > > There are no while(1) task running IIUC. > > > > > > > > > > > > > > > > > > > > > > > > > > > As a result, the task's > > > > > > > > > > vruntime would not be updated for long time, and the cfs's min_vruntime > > > > > > > > > > is almost not updated. > > > > > > > > > > > > > > > > > > But the condition in pick_next_entity() should ensure that we still pick > > > > > > > > > ->skip when it becomes too old. Specifically, when it gets more than > > > > > > > > > wakeup_gran() behind. > > > > > > > > > > > > > > > > I am not sure I can see it either. 
Maybe __pick_first_entity() doesn't return > > > > > > > > the skipped one, or for some reason vdiff for second is almost always > > > > > > > > < wakeup_gran()? > > > > > > > > > > > > > > > > > > > > > > > > > > > When this scenario happens, when the yield task had wait for a long time, > > > > > > > > > > and other tasks run a long time, once there is other task call the sched_yield, > > > > > > > > > > the cfs's skip_buddy is covered, at this time, the first task can run normally, > > > > > > > > > > but the task's vruntime is small, as a result, the task would always run, > > > > > > > > > > because other task's vruntime is big. This would lead to other tasks can not > > > > > > > > > > run for a long time. > > > > > > > > > > > > > > > > The error seems that when Task A finally runs - it consumes more than its fair > > > > > > > > bit of sched_slice() as it looks it was starved. > > > > > > > > > > > > > > > > I think the question is why it was starved? Can you shed some light Xuewen? > > > > > > > > > > > > > > > > My attempt to help to clarify :) I have read this just like you. > > > > > > > > > > > > > > Thanks for Qais's clarify. And that's exactly what I want to say:) > > > > > > > > > > > > > > > > > > > > > > > FWIW I have seen a report of something similar, but I didn't managed to > > > > > > > > reproduce and debug (mostly due to ENOBANDWIDTH); and not sure if the details > > > > > > > > are similar to what Xuewen is seeing. But there was a task starving for > > > > > > > > multiple ticks - RUNNABLE but never RUNNING in spite of other tasks getting > > > > > > > > scheduled in instead multiple times. ie: there was a task RUNNING for most of > > > > > > > > the time, and I could see it preempted by other tasks multiple time, but not by > > > > > > > > the starving RUNNABLE task that is hung on the rq. It seems to be vruntime > > > > > > > > related too but speculating here. > > > > > > > > > > > > > > Yes, now we met the similar scenario when running a monkey test on the > > > > > > > android phone. > > > > > > > There are multiple tasks on cpu, but the runnable task could not be > > > > > > > got scheduled for a long time, > > > > > > > there is task running and we could see the task preempted by other > > > > > > > tasks multiple times. > > > > > > > Then we dump the tasks, and find the vruntime of each task varies > > > > > > > greatly, and the task which running call the sched_yield frequently. > > > > > > > > > > > > If I'm not wrong you are using cgroups and as a result you can't > > > > > > compare the vruntime of tasks that belongs to different group, you > > > > > > must compare the vruntime of entities at the same level. We might have > > > > > > to look the side because I can't see why the task would not be > > > > > > schedule if other tasks in the same group move forward their vruntime > > > > > > > > > > All the tasks belong to the same cgroup. > > > > > > Could they move between cpusets though? > > > > I have pinned them on same CPU to force the contention > > > > > > > > > > > > > ok. > > > > I have tried to reproduce your problem but can't see it so far. I'm > > > > probably missing something. > > > > > > > > With rt-app, I start: > > > > - 3 tasks A, B, C which are always running > > > > - 1 task D which always runs but yields every 1ms for 1000 times and > > > > then stops yielding and always run > > > > > > > > All tasks are pinned on the same cpu in the same cgroup. > > > > > > > > I don't see anything wrong. 
> > > > task A, B, C runs their slices > > > > task D is preempted by others after 1ms for a couple of times when it > > > > calls yield. Then the yield doesn't have effect and task D runs a few > > > > consecutive ms although the yield. Then task D restart to be preempted > > > > by others when it calls yield when its vruntime is close to others > > > > > > > > Once task D stop calling yield, the 4 tasks runs normally > > > > > > Could vruntime be inflated if a task gets stuck on a little core for a while > > > (where it'll run slower) then compared to another task running on a bigger core > > > the vruntime will appear smaller for the latter? > > > > vruntime is not scaled by cpu capacity and is "normalized" before the > > task migrates to another cpu so there is no reason to see an impact > > because on running on little or migrating > > > > > > > > > > > Cheers > > > > > > -- > > > Qais Yousef
On Wed, Mar 1, 2023 at 4:09 PM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Wed, 1 Mar 2023 at 08:30, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > Hi Vincent > > > > I noticed the following patch: > > https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/ > > And I notice the V2 had merged to mainline: > > https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@amazon.de/T/#u > > > > The patch fixed the inversing of the vruntime comparison, and I see > > that in my case, there also are some vruntime is inverted. > > Do you think which patch will work for our scenario? I would be very > > grateful if you could give us some advice. > > I would try this patch in our tree. > > By default use the one that is merged; The difference is mainly a > matter of time range. Also be aware that the case of newly migrated > task is not fully covered by both patches. Okay, Thank you very much! > > This patch fixes a problem with long sleeping entity in the presence > of low weight and always running entities. This doesn't seem to be > aligned with the description of your use case Thanks for the clarification! We would try it first to see whether it could resolve our problem. Thanks! BR --- xuewen > > Vincent > > > > Thanks! > > BR > > > > On Tue, Feb 28, 2023 at 9:45 PM Vincent Guittot > > <vincent.guittot@linaro.org> wrote: > > > > > > On Tue, 28 Feb 2023 at 14:31, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > On 02/28/23 10:07, Vincent Guittot wrote: > > > > > On Tue, 28 Feb 2023 at 09:21, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > > > > > Hi Vincent > > > > > > > > > > > > On Tue, Feb 28, 2023 at 3:53 PM Vincent Guittot > > > > > > <vincent.guittot@linaro.org> wrote: > > > > > > > > > > > > > > On Tue, 28 Feb 2023 at 08:42, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > > > > > > > > > Hi > > > > > > > > > > > > > > > > Thanks very much for comments! > > > > > > > > > > > > > > > > On Tue, Feb 28, 2023 at 6:33 AM Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > > > > > > > > > > > On 02/27/23 16:40, Peter Zijlstra wrote: > > > > > > > > > > On Wed, Feb 22, 2023 at 04:03:14PM +0800, Xuewen Yan wrote: > > > > > > > > > > > When task call the sched_yield, cfs would set the cfs's skip buddy. > > > > > > > > > > > If there is no other task call the sched_yield syscall, the task would > > > > > > > > > > > always be skiped when there are tasks in rq. > > > > > > > > > > > > > > > > > > > > So you have two tasks A) which does sched_yield() and becomes ->skip, > > > > > > > > > > and B) which is while(1). And you're saying that once A does it's thing, > > > > > > > > > > B runs forever and starves A? > > > > > > > > > > > > > > > > > > I read it differently. > > > > > > > > > > > > > > > > > > I understood that there are multiple tasks. > > > > > > > > > > > > > > > > > > If Task A becomes ->skip; then it seems other tasks will continue to be picked > > > > > > > > > instead. Until another task B calls sched_yield() and become ->skip, then Task > > > > > > > > > A is picked but with wrong vruntime causing it to run for multiple ticks (my > > > > > > > > > interpretation of 'always run' below). > > > > > > > > > > > > > > > > > > There are no while(1) task running IIUC. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > As a result, the task's > > > > > > > > > > > vruntime would not be updated for long time, and the cfs's min_vruntime > > > > > > > > > > > is almost not updated. 
> > > > > > > > > > > > > > > > > > > > But the condition in pick_next_entity() should ensure that we still pick > > > > > > > > > > ->skip when it becomes too old. Specifically, when it gets more than > > > > > > > > > > wakeup_gran() behind. > > > > > > > > > > > > > > > > > > I am not sure I can see it either. Maybe __pick_first_entity() doesn't return > > > > > > > > > the skipped one, or for some reason vdiff for second is almost always > > > > > > > > > < wakeup_gran()? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When this scenario happens, when the yield task had wait for a long time, > > > > > > > > > > > and other tasks run a long time, once there is other task call the sched_yield, > > > > > > > > > > > the cfs's skip_buddy is covered, at this time, the first task can run normally, > > > > > > > > > > > but the task's vruntime is small, as a result, the task would always run, > > > > > > > > > > > because other task's vruntime is big. This would lead to other tasks can not > > > > > > > > > > > run for a long time. > > > > > > > > > > > > > > > > > > The error seems that when Task A finally runs - it consumes more than its fair > > > > > > > > > bit of sched_slice() as it looks it was starved. > > > > > > > > > > > > > > > > > > I think the question is why it was starved? Can you shed some light Xuewen? > > > > > > > > > > > > > > > > > > My attempt to help to clarify :) I have read this just like you. > > > > > > > > > > > > > > > > Thanks for Qais's clarify. And that's exactly what I want to say:) > > > > > > > > > > > > > > > > > > > > > > > > > > FWIW I have seen a report of something similar, but I didn't managed to > > > > > > > > > reproduce and debug (mostly due to ENOBANDWIDTH); and not sure if the details > > > > > > > > > are similar to what Xuewen is seeing. But there was a task starving for > > > > > > > > > multiple ticks - RUNNABLE but never RUNNING in spite of other tasks getting > > > > > > > > > scheduled in instead multiple times. ie: there was a task RUNNING for most of > > > > > > > > > the time, and I could see it preempted by other tasks multiple time, but not by > > > > > > > > > the starving RUNNABLE task that is hung on the rq. It seems to be vruntime > > > > > > > > > related too but speculating here. > > > > > > > > > > > > > > > > Yes, now we met the similar scenario when running a monkey test on the > > > > > > > > android phone. > > > > > > > > There are multiple tasks on cpu, but the runnable task could not be > > > > > > > > got scheduled for a long time, > > > > > > > > there is task running and we could see the task preempted by other > > > > > > > > tasks multiple times. > > > > > > > > Then we dump the tasks, and find the vruntime of each task varies > > > > > > > > greatly, and the task which running call the sched_yield frequently. > > > > > > > > > > > > > > If I'm not wrong you are using cgroups and as a result you can't > > > > > > > compare the vruntime of tasks that belongs to different group, you > > > > > > > must compare the vruntime of entities at the same level. We might have > > > > > > > to look the side because I can't see why the task would not be > > > > > > > schedule if other tasks in the same group move forward their vruntime > > > > > > > > > > > > All the tasks belong to the same cgroup. > > > > > > > > Could they move between cpusets though? > > > > > > I have pinned them on same CPU to force the contention > > > > > > > > > > > > > > > > > ok. 
> > > > > I have tried to reproduce your problem but can't see it so far. I'm > > > > > probably missing something. > > > > > > > > > > With rt-app, I start: > > > > > - 3 tasks A, B, C which are always running > > > > > - 1 task D which always runs but yields every 1ms for 1000 times and > > > > > then stops yielding and always run > > > > > > > > > > All tasks are pinned on the same cpu in the same cgroup. > > > > > > > > > > I don't see anything wrong. > > > > > task A, B, C runs their slices > > > > > task D is preempted by others after 1ms for a couple of times when it > > > > > calls yield. Then the yield doesn't have effect and task D runs a few > > > > > consecutive ms although the yield. Then task D restart to be preempted > > > > > by others when it calls yield when its vruntime is close to others > > > > > > > > > > Once task D stop calling yield, the 4 tasks runs normally > > > > > > > > Could vruntime be inflated if a task gets stuck on a little core for a while > > > > (where it'll run slower) then compared to another task running on a bigger core > > > > the vruntime will appear smaller for the latter? > > > > > > vruntime is not scaled by cpu capacity and is "normalized" before the > > > task migrates to another cpu so there is no reason to see an impact > > > because on running on little or migrating > > > > > > > > > > > > > > > Cheers > > > > > > > > -- > > > > Qais Yousef
Hi Xuewen, On 01/03/2023 09:20, Xuewen Yan wrote: > On Wed, Mar 1, 2023 at 4:09 PM Vincent Guittot > <vincent.guittot@linaro.org> wrote: >> >> On Wed, 1 Mar 2023 at 08:30, Xuewen Yan <xuewen.yan94@gmail.com> wrote: >>> >>> Hi Vincent >>> >>> I noticed the following patch: >>> https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/ >>> And I notice the V2 had merged to mainline: >>> https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@amazon.de/T/#u >>> >>> The patch fixed the inversing of the vruntime comparison, and I see >>> that in my case, there also are some vruntime is inverted. >>> Do you think which patch will work for our scenario? I would be very >>> grateful if you could give us some advice. >>> I would try this patch in our tree. >> >> By default use the one that is merged; The difference is mainly a >> matter of time range. Also be aware that the case of newly migrated >> task is not fully covered by both patches. > > Okay, Thank you very much! > >> >> This patch fixes a problem with long sleeping entity in the presence >> of low weight and always running entities. This doesn't seem to be >> aligned with the description of your use case > > Thanks for the clarification! We would try it first to see whether it > could resolve our problem. Can you not run Vincent's rt-app example on your device and then report `cat /sys/kernel/debug/sched/debug` of the CPU? # rt-app /root/rt-app/cfs_yield.json # cat /sys/kernel/debug/sched/debug ... cpu#2 .nr_running : 4 ... .curr->pid : 2121 ... cfs_rq[2]:/autogroup-15 .exec_clock : 0.000000 .MIN_vruntime : 32428.281204 .min_vruntime : 32428.281204 .max_vruntime : 32434.997784 ... .nr_running : 4 .h_nr_running : 4 ... S task PID tree-key switches prio wait-time sum-exec sum-sleep ------------------------------------------------------------------------------------------------------------- S cpuhp/2 22 1304.405864 13 120 0.000000 0.270000 0.000000 0.000000 0 0 / S migration/2 23 0.000000 8 0 0.000000 7.460940 0.000000 0.000000 0 0 / S ksoftirqd/2 24 137721.092326 46 120 0.000000 1.821880 0.000000 0.000000 0 0 / I kworker/2:0H 26 2116.827393 4 100 0.000000 0.057220 0.000000 0.000000 0 0 / I kworker/2:1 45 204539.183593 322 120 0.000000 447.975440 0.000000 0.000000 0 0 / I kworker/2:3 80 1778.668364 33 120 0.000000 16.237320 0.000000 0.000000 0 0 / I kworker/2:1H 239 199388.093936 74 100 0.000000 1.892300 0.000000 0.000000 0 0 / R taskA-0 2120 32428.281204 582 120 0.000000 1109.911280 0.000000 0.000000 0 0 /autogroup-15 >R taskB-1 2121 32430.693304 265 120 0.000000 1103.527660 0.000000 0.000000 0 0 /autogroup-15 R taskB-2 2122 32432.137084 264 120 0.000000 1105.006760 0.000000 0.000000 0 0 /autogroup-15 R taskB-3 2123 32434.997784 282 120 0.000000 1115.965120 0.000000 0.000000 0 0 /autogroup-15 ... Not sure how Vincent's rt-app file looks like exactly but I crafted something quick here: { "tasks" : { "taskA" : { "cpus" : [2], "yield" : "taskA", "run" : 1000 }, "taskB" : { "instance" : 3, "cpus" : [2], "run" : 1000000 } }, "global" : { "calibration" : 156, "default_policy" : "SCHED_OTHER", "duration" : 20 } } [...]
On Wed, 1 Mar 2023 at 12:23, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > Hi Xuewen, > > On 01/03/2023 09:20, Xuewen Yan wrote: > > On Wed, Mar 1, 2023 at 4:09 PM Vincent Guittot > > <vincent.guittot@linaro.org> wrote: > >> > >> On Wed, 1 Mar 2023 at 08:30, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > >>> > >>> Hi Vincent > >>> > >>> I noticed the following patch: > >>> https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/ > >>> And I notice the V2 had merged to mainline: > >>> https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@amazon.de/T/#u > >>> > >>> The patch fixed the inversing of the vruntime comparison, and I see > >>> that in my case, there also are some vruntime is inverted. > >>> Do you think which patch will work for our scenario? I would be very > >>> grateful if you could give us some advice. > >>> I would try this patch in our tree. > >> > >> By default use the one that is merged; The difference is mainly a > >> matter of time range. Also be aware that the case of newly migrated > >> task is not fully covered by both patches. > > > > Okay, Thank you very much! > > > >> > >> This patch fixes a problem with long sleeping entity in the presence > >> of low weight and always running entities. This doesn't seem to be > >> aligned with the description of your use case > > > > Thanks for the clarification! We would try it first to see whether it > > could resolve our problem. > > Can you not run Vincent's rt-app example on your device and then report > `cat /sys/kernel/debug/sched/debug` of the CPU? > > # rt-app /root/rt-app/cfs_yield.json > > # cat /sys/kernel/debug/sched/debug > ... > cpu#2 > .nr_running : 4 > ... > .curr->pid : 2121 > ... > > cfs_rq[2]:/autogroup-15 > .exec_clock : 0.000000 > .MIN_vruntime : 32428.281204 > .min_vruntime : 32428.281204 > .max_vruntime : 32434.997784 > ... > .nr_running : 4 > .h_nr_running : 4 > > ... > > S task PID tree-key switches prio wait-time sum-exec sum-sleep > ------------------------------------------------------------------------------------------------------------- > S cpuhp/2 22 1304.405864 13 120 0.000000 0.270000 0.000000 0.000000 0 0 / > S migration/2 23 0.000000 8 0 0.000000 7.460940 0.000000 0.000000 0 0 / > S ksoftirqd/2 24 137721.092326 46 120 0.000000 1.821880 0.000000 0.000000 0 0 / > I kworker/2:0H 26 2116.827393 4 100 0.000000 0.057220 0.000000 0.000000 0 0 / > I kworker/2:1 45 204539.183593 322 120 0.000000 447.975440 0.000000 0.000000 0 0 / > I kworker/2:3 80 1778.668364 33 120 0.000000 16.237320 0.000000 0.000000 0 0 / > I kworker/2:1H 239 199388.093936 74 100 0.000000 1.892300 0.000000 0.000000 0 0 / > R taskA-0 2120 32428.281204 582 120 0.000000 1109.911280 0.000000 0.000000 0 0 /autogroup-15 > >R taskB-1 2121 32430.693304 265 120 0.000000 1103.527660 0.000000 0.000000 0 0 /autogroup-15 > R taskB-2 2122 32432.137084 264 120 0.000000 1105.006760 0.000000 0.000000 0 0 /autogroup-15 > R taskB-3 2123 32434.997784 282 120 0.000000 1115.965120 0.000000 0.000000 0 0 /autogroup-15 > > ... > > Not sure how Vincent's rt-app file looks like exactly but I crafted > something quick here: it was quite similar to yours below. I have just stopped to call yield after few seconds to see if the behavior changed > > { > "tasks" : { > "taskA" : { > "cpus" : [2], > "yield" : "taskA", > "run" : 1000 > }, > "taskB" : { > "instance" : 3, > "cpus" : [2], > "run" : 1000000 > } > }, > "global" : { > "calibration" : 156, > "default_policy" : "SCHED_OTHER", > "duration" : 20 > } > } > > [...]
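If it helps reproduction, Vincent's variant (yield for a bounded number of iterations, then run flat out) could presumably be written with rt-app phases along these lines; the keys used below ("phases", "loop", and a negative loop count meaning "forever") are assumptions about the rt-app JSON grammar and may need adjusting to the rt-app version in use, so treat this as a starting point rather than a verified config:

{
    "tasks" : {
        "taskA" : {
            "cpus" : [2],
            "phases" : {
                "yielding" : { "loop" : 1000, "run" : 1000, "yield" : "taskA" },
                "running"  : { "loop" : -1, "run" : 1000000 }
            }
        },
        "taskB" : {
            "instance" : 3,
            "cpus" : [2],
            "run" : 1000000
        }
    },
    "global" : {
        "calibration" : 156,
        "default_policy" : "SCHED_OTHER",
        "duration" : 20
    }
}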
Hi Xuewen On 03/01/23 16:20, Xuewen Yan wrote: > On Wed, Mar 1, 2023 at 4:09 PM Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > > > On Wed, 1 Mar 2023 at 08:30, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > Hi Vincent > > > > > > I noticed the following patch: > > > https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/ > > > And I notice the V2 had merged to mainline: > > > https://lore.kernel.org/all/20230130122216.3555094-1-rkagan@amazon.de/T/#u > > > > > > The patch fixed the inversing of the vruntime comparison, and I see > > > that in my case, there also are some vruntime is inverted. > > > Do you think which patch will work for our scenario? I would be very > > > grateful if you could give us some advice. > > > I would try this patch in our tree. > > > > By default use the one that is merged; The difference is mainly a > > matter of time range. Also be aware that the case of newly migrated > > task is not fully covered by both patches. > > Okay, Thank you very much! > > > > > This patch fixes a problem with long sleeping entity in the presence > > of low weight and always running entities. This doesn't seem to be > > aligned with the description of your use case > > Thanks for the clarification! We would try it first to see whether it > could resolve our problem. Did you get a chance to see if that patch help? It'd be good to backport it to LTS if it does. Thanks -- Qais Yousef
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff4dbbae3b10..a9ff1921fc07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -670,7 +670,6 @@ static struct sched_entity *__pick_next_entity(struct sched_entity *se)
 	return __node_2_se(next);
 }
 
-#ifdef CONFIG_SCHED_DEBUG
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
@@ -681,6 +680,7 @@ struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 	return __node_2_se(last);
 }
 
+#ifdef CONFIG_SCHED_DEBUG
 /**************************************************************
  * Scheduling class statistics methods:
  */
@@ -7751,8 +7751,18 @@ static void set_next_buddy(struct sched_entity *se)
 
 static void set_skip_buddy(struct sched_entity *se)
 {
-	for_each_sched_entity(se)
-		cfs_rq_of(se)->skip = se;
+	for_each_sched_entity(se) {
+		struct sched_entity *last;
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		last = __pick_last_entity(cfs_rq);
+		if (last) {
+			se->vruntime = last->vruntime;
+			update_min_vruntime(cfs_rq);
+		}
+
+		cfs_rq->skip = se;
+	}
 }
 
 /*